# Markdown Glossary for Tiptap
Before we dive into the details, here are some key terms we'll be using throughout this guide:
## Token
A plain JavaScript object that represents a piece of the parsed Markdown. For example, a heading token might look like `{ type: "heading", depth: 2, text: "Hello" }`. Tokens are the "lego bricks" that describe the document's structure.
- A token is a structured representation of a piece of Markdown syntax produced by the MarkedJS parser.
- Each token has a `type` (like `heading`, `paragraph`, `list`, etc.).
- Each token may include additional properties relevant to that type (like `depth` for headings or `ordered` for lists).
- Tokens can also contain nested tokens in properties like `tokens` or `items`, representing the hierarchical structure of the Markdown content.
- A token is not directly usable by Tiptap; it needs to be transformed into Tiptap's JSON format.
- Tokens are created via a Tokenizer.
- We can create our own tokens by implementing a Custom Tokenizer.
Note: MarkedJS comes with built-in tokenizers for standard Markdown syntax, but you can extend or replace these by providing custom tokenizers to the MarkdownManager.
You can find the list of default tokens in the MarkedJS types.
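To make this more concrete, here is a minimal sketch (assuming the `marked` package is installed) that runs a small Markdown string through MarkedJS's lexer and logs the resulting tokens. The output shown in the comments is simplified:

```ts
import { marked } from 'marked'

// Tokenize a small Markdown snippet into MarkedJS tokens.
const tokens = marked.lexer('## Hello\n\nSome **bold** text.')

console.log(tokens)
// Roughly (simplified):
// [
//   { type: 'heading', depth: 2, text: 'Hello', tokens: [...] },
//   { type: 'paragraph', text: 'Some **bold** text.', tokens: [...] }
// ]
```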
## Tiptap JSON
- Tiptap JSON has nothing to do with Markdown itself; it is the JSON format used by Tiptap and ProseMirror to represent the document structure.
- Tiptap JSON consists of nodes and marks, each with a `type`, optional `attrs`, and optional `content` or `text`.
- Nodes represent block-level elements (like paragraphs, headings, lists), while marks represent inline formatting (like bold, italic, links).
- Tiptap JSON is hierarchical, with nodes containing other nodes or text, reflecting the document's structure.
- We can use tokens to create Tiptap JSON that the editor can understand, as sketched below.
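For comparison, the Tiptap JSON for the Markdown snippet tokenized above would look roughly like this (a simplified sketch; the exact node names and attributes depend on the extensions you have enabled, here the StarterKit defaults):

```ts
const tiptapJSON = {
  type: 'doc',
  content: [
    {
      type: 'heading',
      attrs: { level: 2 },
      content: [{ type: 'text', text: 'Hello' }],
    },
    {
      type: 'paragraph',
      content: [
        { type: 'text', text: 'Some ' },
        // Inline formatting lives in marks attached to text nodes.
        { type: 'text', text: 'bold', marks: [{ type: 'bold' }] },
        { type: 'text', text: ' text.' },
      ],
    },
  ],
}
```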
Now that we understand the difference between a Token and Tiptap JSON, let's dive into how to parse tokens and serialize Tiptap content.
## Tokenizer
The set of functions (or rules) that scan the raw Markdown text and decide how to turn chunks of it into tokens. For example, a tokenizer recognizes `## Heading` and produces a `heading` token. You can customize or override tokenizers to change how Markdown is interpreted.
You can find out how to create custom tokenizers in the Custom Tokenizers guide.
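As a rough illustration of the concept, here is a sketch of a custom tokenizer written with plain MarkedJS extension syntax (not Tiptap's own registration API) for a made-up `==highlight==` syntax. No HTML renderer is shown, since in this workflow tokens are converted to Tiptap JSON rather than rendered to HTML:

```ts
import { marked } from 'marked'

// Hypothetical tokenizer that turns ==text== into a custom "highlight" token.
marked.use({
  extensions: [
    {
      name: 'highlight',
      level: 'inline',
      // Tell the lexer where a potential match might start.
      start(src) {
        return src.match(/==/)?.index
      },
      // Try to consume the syntax and return a token, or undefined to pass.
      tokenizer(src) {
        const match = /^==([^=]+)==/.exec(src)
        if (match) {
          return {
            type: 'highlight',
            raw: match[0],
            text: match[1],
          }
        }
      },
    },
  ],
})
```

In a Tiptap project you would register such a tokenizer through your extension and the MarkdownManager rather than calling `marked.use` yourself; the Custom Tokenizers guide covers the actual API.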
## Lexer
The orchestrator that runs through the entire Markdown string, applies the tokenizers in sequence, and produces the full list of tokens. Think of it as the machine that repeatedly feeds text into the tokenizers until the whole input is tokenized.
You don't need to touch the lexer directly: Tiptap already creates a lexer instance as part of its MarkedJS instance and reuses it for the lifetime of your editor. This lexer instance automatically registers all tokenizers from your extensions.
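Purely for illustration (you would not normally do this in a Tiptap project), the same lexing step can be run standalone with MarkedJS:

```ts
import { Lexer } from 'marked'

// The lexer walks the whole input, applying tokenizers until everything is consumed.
const lexer = new Lexer()
const tokens = lexer.lex('# Title\n\n- item one\n- item two\n')

// tokens now contains a heading token followed by a list token with nested items.
console.log(tokens.map((token) => token.type)) // e.g. ['heading', 'list']
```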