Markdown Glossary for Tiptap

Beta

Before we dive into the details, here are some key terms we'll be using throughout this guide:

Token

A plain JavaScript object that represents a piece of the parsed Markdown. For example, a heading token might look like { type: "heading", depth: 2, text: "Hello" }. Tokens are the “lego bricks” that describe the document’s structure.

  • A token is a structured representation of a piece of Markdown syntax produced by the MarkedJS parser.
  • Each token has a type (like heading, paragraph, list, etc.).
  • Each token may include additional properties relevant to that type (like depth for headings, ordered for lists, etc.).
  • Tokens can also contain nested tokens in properties like tokens or items, representing the hierarchical structure of the Markdown content.
  • A token is not directly usable by Tiptap; it needs to be transformed into Tiptap's JSON format.
  • Tokens are created via a Tokenizer.
  • We can create our own tokens by implementing a Custom Tokenizer.
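To make this concrete, here is a sketch of what a heading token for `## Hello` might look like. The property names (`type`, `raw`, `depth`, `text`, nested `tokens`) follow the shapes documented in the MarkedJS types, but treat the exact fields as illustrative:

```typescript
// A minimal sketch of the token MarkedJS produces for "## Hello".
// Property names follow the MarkedJS Tokens types; fields may vary by version.
interface InlineToken {
  type: string;
  raw: string;
  text?: string;
}

interface HeadingToken {
  type: "heading";
  raw: string;            // the exact source text this token was parsed from
  depth: number;          // 2 for "##"
  text: string;
  tokens: InlineToken[];  // nested inline tokens for the heading's content
}

const headingToken: HeadingToken = {
  type: "heading",
  raw: "## Hello\n",
  depth: 2,
  text: "Hello",
  tokens: [{ type: "text", raw: "Hello", text: "Hello" }],
};
```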

Note: MarkedJS comes with built-in tokenizers for standard Markdown syntax, but you can extend or replace these by providing custom tokenizers to the MarkdownManager.

You can find the list of default tokens in the MarkedJS types.

Tiptap JSON

  • Tiptap JSON has nothing to do with Markdown; it is the JSON format used by Tiptap and ProseMirror to represent the document structure.
  • Tiptap JSON consists of nodes and marks, each with a type, optional attrs, and optional content or text.
  • Nodes represent block-level elements (like paragraphs, headings, lists), while marks represent inline formatting (like bold, italic, links).
  • Tiptap JSON is hierarchical, with nodes containing other nodes or text, reflecting the document's structure.
  • We use tokens to create the Tiptap JSON that the editor can understand.
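For comparison, here is a hand-written sketch of Tiptap JSON for a document containing a level-2 heading and a paragraph with one bold word. The attribute name `level` matches Tiptap's built-in Heading extension; the rest is the generic node/mark shape:

```typescript
// A sketch of Tiptap JSON: nodes carry `type`, optional `attrs`, and
// `content`; marks (like bold) annotate individual text nodes.
const tiptapDoc = {
  type: "doc",
  content: [
    {
      type: "heading",
      attrs: { level: 2 },
      content: [{ type: "text", text: "Hello" }],
    },
    {
      type: "paragraph",
      content: [
        { type: "text", text: "A " },
        { type: "text", text: "bold", marks: [{ type: "bold" }] },
        { type: "text", text: " word." },
      ],
    },
  ],
};
```

Notice the hierarchy: the `doc` node contains block nodes, which contain text nodes, mirroring the bullet points above.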

Now that we understand the difference between a Token and Tiptap JSON, let's dive into how to parse tokens and serialize Tiptap content.

Tokenizer

The set of functions (or rules) that scan the raw Markdown text and decide how to turn chunks of it into tokens. For example, it recognizes ## Heading and produces a heading token. You can customize or override tokenizers to change how Markdown is interpreted.
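As a sketch of the idea: a tokenizer is essentially a function that inspects the start of the remaining source and, if it matches, returns a token describing what it consumed. The `@mention` syntax below is hypothetical, but the `name`/`level`/`start`/`tokenizer` shape follows MarkedJS's extension API:

```typescript
// A sketch of a custom tokenizer for a hypothetical @mention syntax,
// shaped like a MarkedJS tokenizer extension.
const mentionTokenizer = {
  name: "mention",
  level: "inline" as const,
  // Hint for the lexer: the earliest index where a match might start.
  start(src: string) {
    return src.indexOf("@");
  },
  // Try to match at the start of `src`; return a token or undefined.
  tokenizer(src: string) {
    const match = /^@(\w+)/.exec(src);
    if (match) {
      return {
        type: "mention",
        raw: match[0],       // the exact text consumed from the source
        username: match[1],  // extra property specific to this token type
      };
    }
    return undefined;
  },
};

const token = mentionTokenizer.tokenizer("@alice said hi");
// token → { type: "mention", raw: "@alice", username: "alice" }
```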

You can find out how to create custom tokenizers in the Custom Tokenizers guide.

Lexer

The orchestrator that runs through the entire Markdown string, applies the tokenizers in sequence, and produces the full list of tokens. Think of it as the machine that repeatedly feeds text into the tokenizers until the whole input is tokenized.
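Conceptually, the lexer's loop can be sketched in a few lines. This is a toy illustration only, not the MarkedJS implementation; it assumes tokenizers that report what they consumed via a `raw` property:

```typescript
// A toy lexer: repeatedly ask each tokenizer to match the remaining input,
// consume what it matched (via `raw`), and collect the resulting tokens.
type Token = { type: string; raw: string };
type Tokenizer = (src: string) => Token | undefined;

function lex(src: string, tokenizers: Tokenizer[]): Token[] {
  const tokens: Token[] = [];
  while (src.length > 0) {
    let matched = false;
    for (const tokenize of tokenizers) {
      const token = tokenize(src);
      if (token) {
        tokens.push(token);
        src = src.slice(token.raw.length); // consume the matched text
        matched = true;
        break;
      }
    }
    // Fallback: consume one character as plain text so we always make progress.
    if (!matched) {
      tokens.push({ type: "text", raw: src[0] });
      src = src.slice(1);
    }
  }
  return tokens;
}

// Example: a tokenizer that recognizes "## Heading" lines.
const heading: Tokenizer = (src) => {
  const m = /^(#{1,6}) ([^\n]+)\n?/.exec(src);
  return m ? { type: "heading", raw: m[0] } : undefined;
};

const result = lex("## Hello\nworld", [heading]);
// result[0] → { type: "heading", raw: "## Hello\n" }, followed by text tokens
```

A real lexer also merges adjacent text tokens and distinguishes block-level from inline tokenization, but the consume-and-repeat loop is the core idea.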

You don't need to touch the lexer directly: Tiptap already creates a lexer instance as part of the MarkedJS instance and reuses it for the lifetime of your editor.

This lexer instance will automatically register all tokenizers from your extensions.