What we built
We built a voice-driven editing prototype where voice is integrated into the editor, not a separate chat surface.
With the Tiptap AI Toolkit, the assistant can read the structure of your document and apply changes as real editing operations, not separate text you have to integrate back into the editor. Edits stream in as suggestions in place, so you can accept or reject them and stay in control.
Why voice editing is hard
For many people, voice is a natural and effective way to work with a document, but making it reliable for precise edits is still hard.
Dictating text is easy. Editing is the hard part: restructuring, rewriting, moving blocks, formatting, and keeping everything consistent. That’s where most voice assistants fall back to the same workflow: talk to a chatbot, copy, paste, compare, repeat.
This pattern exists for a reason: it’s difficult to improve upon. Most voice tools live outside the editor, so they return text instead of changes. When someone says, “Make this shorter and turn the last part into bullets,” the system still has to decide what to change, where to apply it, and how to do it without breaking structure.
Where voice fits
Voice is becoming a core input method for content creation, not because typing is going away, but because writing doesn’t happen only at a keyboard.
Ideas show up in moments where you don’t have a clean writing setup: walking, commuting, driving, training, moving between meetings. Voice is also essential when your hands are unavailable or when it is simply the most practical tool.
The real reason we care is simple: voice can reduce the distance between having an idea and capturing it. The hard part is what comes next: turning that input into precise, reviewable edits.
Core idea: AI Toolkit enables document-aware voice
Voice becomes useful for editing when it can operate on the document rather than just generate text. Doing that reliably requires structure, precise targeting, and changes that keep the document valid.
The Tiptap AI Toolkit makes this easier by giving the assistant structured access to your document model, so it can apply controlled, reviewable editing operations instead of returning text you have to integrate back into the editor.
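As a rough mental model, "controlled, reviewable editing operations" can be sketched as a queue of proposed changes that only touch the document when the user explicitly accepts them. Everything below (`Block`, `EditOperation`, `SuggestionQueue`) is an illustrative sketch under assumed names, not the actual AI Toolkit API:

```typescript
// Illustrative sketch of reviewable, document-aware edits.
// These types are assumptions for this example, not the Tiptap AI Toolkit API.

type Block = { id: string; type: "paragraph" | "bulletList"; text: string };

type EditOperation =
  | { kind: "replaceText"; blockId: string; text: string }
  | { kind: "setType"; blockId: string; type: Block["type"] };

// Suggestions wrap operations so the user stays in control:
// nothing mutates the document until accept() is called.
class SuggestionQueue {
  private pending: EditOperation[] = [];

  propose(op: EditOperation): void {
    this.pending.push(op);
  }

  accept(doc: Block[]): Block[] {
    const next = doc.map((b) => ({ ...b })); // copy: the original stays intact
    for (const op of this.pending) {
      const block = next.find((b) => b.id === op.blockId);
      if (!block) continue; // skip operations whose target disappeared
      if (op.kind === "replaceText") block.text = op.text;
      else block.type = op.type;
    }
    this.pending = [];
    return next;
  }

  reject(): void {
    this.pending = []; // discarding suggestions never touches the document
  }
}
```

The point of the shape, not the specifics: the assistant proposes structured operations against block identities rather than returning a wall of text, so accepting, rejecting, and re-targeting all stay cheap.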
This experiment tests what happens when you combine that document-aware editing layer with voice input.
How it works in the prototype
The prototype supports two interaction patterns.
In Voice Assistant, you speak to the assistant and it proposes document-aware changes directly in the editor. Edits stream in as suggestions, so you can accept or reject them.
In Transcribe, you dictate text with a small set of spoken commands for structure and formatting.
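A "small set of spoken commands" can be sketched as a parser with a fixed command vocabulary, where anything unrecognized is treated as dictation. The specific phrases below are assumptions for illustration, not the prototype's actual command set:

```typescript
// Sketch of the Transcribe pattern: a fixed vocabulary of spoken commands,
// with everything else passed through as dictated text.
// The command phrases here are illustrative assumptions.

type TranscribeToken =
  | { kind: "text"; value: string }
  | { kind: "command"; name: "newParagraph" | "startBulletList" | "boldLast" };

const COMMANDS: Record<string, TranscribeToken> = {
  "new paragraph": { kind: "command", name: "newParagraph" },
  "start a bullet list": { kind: "command", name: "startBulletList" },
  "make that bold": { kind: "command", name: "boldLast" },
};

function parseUtterance(utterance: string): TranscribeToken {
  const normalized = utterance.trim().toLowerCase();
  // Exact-match lookup keeps the failure mode safe: an unknown phrase
  // becomes text in the document, never a surprise structural change.
  return COMMANDS[normalized] ?? { kind: "text", value: utterance.trim() };
}
```

A closed vocabulary like this trades expressiveness for predictability, which is why the prototype keeps Transcribe separate from the more ambiguous Voice Assistant flow.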
To make edits precise, the assistant needs a target. In the demo, you can select content by voice (for example, a paragraph or a section) and then apply an action to that selection.
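Targeting can be sketched as resolving a spoken reference to a concrete position in the document before any edit is applied. The reference grammar below ("last paragraph", "paragraph 2") and the `DocNode` type are illustrative assumptions, not how the demo actually resolves selections:

```typescript
// Sketch of voice targeting: turn a spoken reference into a concrete
// node index so an edit has a precise target.
// The reference grammar here is an assumption for illustration.

type DocNode = { type: "paragraph" | "heading"; text: string };

function resolveTarget(reference: string, doc: DocNode[]): number {
  const paragraphs = doc
    .map((node, index) => ({ node, index }))
    .filter((e) => e.node.type === "paragraph");
  if (paragraphs.length === 0) return -1;

  const ref = reference.toLowerCase();
  if (ref.includes("last paragraph")) {
    return paragraphs[paragraphs.length - 1].index;
  }

  const ordinal = ref.match(/paragraph (\d+)/);
  if (ordinal) {
    const n = parseInt(ordinal[1], 10) - 1; // spoken ordinals are 1-based
    return n < paragraphs.length ? paragraphs[n].index : -1;
  }
  return -1; // unresolved: better to ask the user than to guess a target
}
```

Returning a sentinel for unresolved references matters more than the grammar itself: a voice edit with no confident target should ask, not pick something.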
Understanding intent
This is the hardest part: deciding what a spoken phrase means in the moment.
In voice editing, people naturally mix content and instructions. They think out loud, revise mid-sentence, and reference parts of the document with “this” and “that.” A system that treats everything as dictation will miss commands. A system that treats everything as commands will interrupt writing.
Our prototype uses two interaction patterns to reduce ambiguity. It works, but it’s only a starting point. The real goal is a single, natural voice flow where the system can infer intent, ask when it’s unsure, and confirm high-impact operations without slowing you down.
The direction we care about is clear: intent detection that uses document context, confirms high-impact operations, and recovers gracefully when it gets it wrong.
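That direction can be sketched as an intent router that classifies each utterance as dictation, a safe command, or a high-impact command that needs confirmation, and asks rather than guesses when a reference is ambiguous. The keyword lists and rules below are illustrative assumptions, not the prototype's actual model:

```typescript
// Sketch of intent routing: dictation vs. command vs. clarifying question,
// with confirmation gating for high-impact operations.
// Keyword lists and rules are illustrative assumptions.

type Intent =
  | { kind: "dictate"; text: string }
  | { kind: "command"; action: string; confirm: boolean }
  | { kind: "ask"; question: string };

const SAFE_VERBS = ["bold", "italicize", "capitalize"];
const HIGH_IMPACT_VERBS = ["delete", "rewrite", "restructure"];

function routeIntent(utterance: string): Intent {
  const words = utterance.toLowerCase().split(/\s+/);
  const verb = words[0];

  if (HIGH_IMPACT_VERBS.includes(verb)) {
    // Destructive edits are proposed, then confirmed before applying.
    return { kind: "command", action: utterance, confirm: true };
  }
  if (SAFE_VERBS.includes(verb)) {
    return { kind: "command", action: utterance, confirm: false };
  }
  // Bare deictic references ("this", "that") with no recognized verb are
  // exactly the ambiguous case: ask instead of guessing.
  if (words.includes("this") || words.includes("that")) {
    return { kind: "ask", question: `Which part should "${utterance}" apply to?` };
  }
  return { kind: "dictate", text: utterance };
}
```

A real system would replace the keyword lists with a model that also consumes document context (current selection, recent edits), but the routing shape stays the same: classify, confirm when the blast radius is large, ask when unsure.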
Why we’re excited
Even with the rough edges, the prototype demonstrates something important: when voice is integrated into the editor and paired with a toolkit that understands documents, the experience starts to feel less like talking to a bot and more like editing with a collaborator.
For us, that’s the point. Voice is another layer for interacting with structured content, and structure is where Tiptap is strongest.
Closing thought
Typing will remain the default for many workflows. But voice has unique strengths: speed, immediacy, and accessibility in moments where keyboards don’t work, and for people who think better when talking.
This experiment asks a simple question:
What if voice wasn’t just a way to input text, but a way to operate a document?
We don’t have all the answers yet. But we think it’s worth exploring, because the future of writing will be multi-modal, and editors should meet writers where ideas actually happen.
If you’re building with Tiptap, the AI Toolkit handles the document-aware editing layer, so you can focus on the voice experience.
