Custom Markdown Tokenizers
Custom tokenizers extend the Markdown parser to support non-standard or custom syntax. This guide explains how tokenizers work and how to create your own.
Tip: For standard patterns like Pandoc blocks or shortcodes, check the Utility Functions first—they provide ready-made tokenizers.
What are Tokenizers?
Tokenizers are functions that identify and parse custom Markdown syntax into tokens. They're registered with MarkedJS and run during the lexing phase, before Tiptap's parse handlers process the tokens.
Note: Want to learn more about Tokenizers? Check out the Glossary.
The Tokenization Flow
Markdown String
↓
Custom Tokenizers (identify custom syntax)
↓
Standard MarkedJS Lexer
↓
Markdown Tokens
↓
Extension Parse Handlers
↓
Tiptap JSON
When to Use Custom Tokenizers
Use custom tokenizers when you want to support:
- Custom inline syntax (e.g., `++inserted text++`, `==highlighted==`)
- Custom block syntax (e.g., `:::note`, `!!!warning`)
- Shortcodes (e.g., `[[embed:video-id]]`)
- Custom Markdown extensions
- Domain-specific notation
Tokenizer Structure
A tokenizer is an object with these properties:
type MarkdownTokenizer = {
  name: string // Token name (must be unique)
  level?: 'block' | 'inline' // Level: block or inline
  start?: (src: string) => number // Where the token starts
  tokenize: (src, tokens, lexer) => MarkdownToken | undefined
}
Properties Explained
name (required)
A unique identifier for your token type:
{
  name: 'highlight',
  // ...
}
This name will be used when registering parse handlers.
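For example, a tokenizer registered under the name 'highlight' pairs with the `parseMarkdown` handler on the same extension, which receives the tokens it produces (a minimal sketch; the full version appears later in this guide):

// In the same extension:
markdownTokenizer: {
  name: 'highlight',
  // ...
},

parseMarkdown: (token, helpers) => {
  // Receives every token whose type matches the name above
  // ...
},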
level (optional)
Whether this tokenizer operates at block or inline level:
{
  level: 'inline', // 'block' or 'inline'
  // ...
}
- `inline`: For inline elements like bold, italic, custom marks (default)
- `block`: For block elements like custom containers, admonitions
start (optional)
A function that returns the index where your token might start in the source string. This is an optimization to avoid unnecessary parsing attempts:
{
  start: (src) => {
    // Find where '==' appears in the source
    return src.indexOf('==')
  },
  // ...
}
This optimization helps MarkedJS skip irrelevant parts of the text. If omitted, MarkedJS will try your tokenizer at every position.
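`indexOf` is the common case, but any search that returns a candidate index (or -1 when there is none) works. For instance, a regex search can cover several possible openers at once (a sketch; the alternate delimiters here are made up for illustration):

{
  // String.prototype.search also returns -1 when nothing matches
  start: src => src.search(/:::|!!!/),
  // ...
}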
tokenize (required)
The main parsing function that identifies and tokenizes your syntax:
{
  tokenize: (src, tokens, lexer) => {
    // Try to match your syntax at the start of src
    const match = /^==(.+?)==/.exec(src)
    if (match) {
      return {
        type: 'highlight',
        raw: match[0], // Full matched string
        text: match[1], // Captured content
        tokens: lexer.inlineTokens(match[1]), // Parsed content
      }
    }
    // Return undefined if no match
    return undefined
  },
}
The function receives:
- `src`: Remaining source text to parse
- `tokens`: Previously parsed tokens (usually not needed)
- `lexer`: Helper functions for tokenizing child content
As described above, the flow of your Markdown content is:
Markdown => Tokenizer => Lexer => Token => markdown.parse() => Tiptap JSON
And from Tiptap JSON back to Markdown:
Tiptap JSON => markdown.render() => Markdown
Creating a Simple Inline Tokenizer
Let's create a tokenizer for highlight syntax (`==text==`).
import { Mark } from '@tiptap/core'

const Highlight = Mark.create({
  name: 'highlight',

  // ... other config (parseHTML, renderHTML, etc.)

  // Define the custom tokenizer.
  // Note: this turns Markdown strings into **tokens**
  markdownTokenizer: {
    // The token name - must be unique and is picked up by the parse function
    name: 'highlight',
    // The tokenizer level - 'inline' or 'block'
    level: 'inline',
    // Returns the index of your syntax in the src string, or -1 if not found.
    // This is an optimization that avoids running the tokenizer unnecessarily
    start: src => {
      return src.indexOf('==')
    },
    // The tokenize function extracts information from the src string and returns
    // a token object, or undefined if the syntax is not matched
    tokenize: (src, tokens, lexer) => {
      // Match ==text== at the start of src
      const match = /^==([^=]+)==/.exec(src)
      if (!match) {
        return undefined
      }
      return {
        type: 'highlight',
        raw: match[0], // '==text=='
        text: match[1], // 'text'
        tokens: lexer.inlineTokens(match[1]), // Parse inline content
      }
    },
  },

  // Parse the token to Tiptap JSON.
  // Note: this consumes **tokens** and transforms them into Tiptap JSON
  parseMarkdown: (token, helpers) => {
    return helpers.applyMark('highlight', helpers.parseInline(token.tokens || []))
  },

  // Render back to Markdown
  renderMarkdown: (node, helpers) => {
    const content = helpers.renderChildren(node)
    return `==${content}==`
  },
})
Using the Extension
import { Editor } from '@tiptap/core'
import StarterKit from '@tiptap/starter-kit'
import { Markdown } from '@tiptap/markdown'
import Highlight from './Highlight'

const editor = new Editor({
  extensions: [StarterKit, Markdown, Highlight],
})

// Parse Markdown with custom syntax
editor.commands.setContent('This is ==highlighted text==!', { contentType: 'markdown' })

// Get Markdown back
console.log(editor.getMarkdown())
// This is ==highlighted text==!
Creating a Block-Level Tokenizer
Let's create a tokenizer for admonition blocks:
:::note
This is a note
:::
import { Node } from '@tiptap/core'

const Admonition = Node.create({
  name: 'admonition',
  group: 'block',
  content: 'block+',

  addAttributes() {
    return {
      type: {
        default: 'note',
      },
    }
  },

  parseHTML() {
    return [
      {
        tag: 'div[data-admonition]',
        getAttrs: node => ({
          type: node.getAttribute('data-type'),
        }),
      },
    ]
  },

  renderHTML({ node, HTMLAttributes }) {
    return [
      'div',
      { 'data-admonition': '', 'data-type': node.attrs.type },
      0, // Content
    ]
  },

  markdownTokenizer: {
    name: 'admonition',
    level: 'block',
    start: src => {
      return src.indexOf(':::')
    },
    tokenize: (src, tokens, lexer) => {
      // Match :::type\ncontent\n:::
      const match = /^:::(\w+)\n([\s\S]*?)\n:::/.exec(src)
      if (!match) {
        return undefined
      }
      return {
        type: 'admonition',
        raw: match[0],
        admonitionType: match[1], // 'note', 'warning', etc.
        text: match[2], // Content
        tokens: lexer.blockTokens(match[2]), // Parse block content
      }
    },
  },

  parseMarkdown: (token, helpers) => {
    return {
      type: 'admonition',
      attrs: {
        type: token.admonitionType || 'note',
      },
      content: helpers.parseChildren(token.tokens || []),
    }
  },

  renderMarkdown: (node, helpers) => {
    const type = node.attrs?.type || 'note'
    const content = helpers.renderChildren(node.content || [])
    return `:::${type}\n${content}\n:::\n\n`
  },
})
Using Block-Level Tokenizers
const markdown = `
# Document

:::note
This is a note with **bold** text.
:::

:::warning
This is a warning!
:::
`

editor.commands.setContent(markdown, { contentType: 'markdown' })
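After parsing, each admonition appears as an `admonition` node in the document JSON, and `editor.getMarkdown()` round-trips it through the `renderMarkdown` handler (the JSON below is abridged):

console.log(editor.getJSON())
// { type: 'doc', content: [
//   { type: 'heading', ... },
//   { type: 'admonition', attrs: { type: 'note' }, content: [...] },
//   { type: 'admonition', attrs: { type: 'warning' }, content: [...] },
// ] }

console.log(editor.getMarkdown())
// Renders back to the :::note / :::warning syntax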
Tokenizers Without Nested Content
Not every token needs nested parsing. Let's create an atomic emoji tokenizer that stores only an attribute and no child content:
import { Node } from '@tiptap/core'

const Emoji = Node.create({
  name: 'emoji',
  group: 'inline',
  inline: true,

  addAttributes() {
    return {
      name: { default: null },
    }
  },

  parseHTML() {
    return [
      {
        tag: 'emoji',
        getAttrs: node => ({ name: node.getAttribute('data-name') }),
      },
    ]
  },

  renderHTML({ node }) {
    return ['emoji', { 'data-name': node.attrs.name }]
  },

  markdownTokenizer: {
    name: 'emoji',
    level: 'inline',
    start: src => {
      return src.indexOf(':')
    },
    tokenize: (src, tokens, lexer) => {
      // Match :emoji_name:
      const match = /^:([a-z0-9_+]+):/.exec(src)
      if (!match) {
        return undefined
      }
      return {
        type: 'emoji',
        raw: match[0],
        emojiName: match[1],
      }
    },
  },

  parseMarkdown: (token, helpers) => {
    return {
      type: 'emoji',
      attrs: {
        name: token.emojiName,
      },
    }
  },

  renderMarkdown: (node, helpers) => {
    return `:${node.attrs?.name || 'unknown'}:`
  },
})
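Usage mirrors the highlight example (assuming `Emoji` is registered alongside the Markdown extension):

editor.commands.setContent('Hello :wave:!', { contentType: 'markdown' })

console.log(editor.getMarkdown())
// Hello :wave:!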
Using the Lexer Helpers
The `lexer` parameter provides helper functions for parsing nested content:
lexer.inlineTokens(src)
Parse inline content (for inline-level tokenizers):
tokenize: (src, tokens, lexer) => {
  const match = /^\[\[([^\]]+)\]\]/.exec(src)
  if (match) {
    return {
      type: 'custom',
      raw: match[0],
      tokens: lexer.inlineTokens(match[1]), // Parse inline content
    }
  }
}
lexer.blockTokens(src)
Parse block-level content (for block-level tokenizers):
tokenize: (src, tokens, lexer) => {
  const match = /^:::\w+\n([\s\S]*?)\n:::/.exec(src)
  if (match) {
    return {
      type: 'container',
      raw: match[0],
      tokens: lexer.blockTokens(match[1]), // Parse block content
    }
  }
}
Regular Expression Best Practices
Use ^ to Match from Start
Always anchor your regex to the start of the string:
// ✅ Good - matches from start
/^==(.+?)==/
// ❌ Bad - can match anywhere
/==(.+?)==/
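The anchor matters because tokenize receives the source starting at the current parse position, and MarkedJS consumes `raw` from the front of `src`. An unanchored match found later in the string would not line up with what the lexer actually advances past:

const src = 'plain text ==mark=='

console.log(/^==(.+?)==/.exec(src)) // null - no token here, other tokenizers get a chance
console.log(/==(.+?)==/.exec(src)?.[0]) // '==mark==' - but raw would not start at src[0]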
Use Non-Greedy Matching
Use `+?` or `*?` instead of `+` or `*` for better control:
// ✅ Good - stops at first closing
/^==(.+?)==/
// ❌ Bad - matches too much
/^==(.+)==/
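The difference is easy to see on a string that contains two occurrences:

const src = '==one== and ==two=='

console.log(/^==(.+?)==/.exec(src)?.[1]) // 'one'
console.log(/^==(.+)==/.exec(src)?.[1]) // 'one== and ==two'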
Test Edge Cases
Test your regex with:
- Empty content: `====`
- Nested syntax: `==text **bold** text==`
- Multiple occurrences: `==one== ==two==`
- Unclosed syntax: `==text`
// Handle unclosed syntax
const match = /^==([^=]+)==/.exec(src)
if (!match) {
  return undefined // Not matched, let the standard parser handle it
}
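You can check all of these cases quickly outside the editor (plain JavaScript, no Tiptap required):

const regex = /^==([^=]+)==/
const cases = ['====', '==text **bold** text==', '==one== ==two==', '==text']

for (const src of cases) {
  console.log(JSON.stringify(src), '->', regex.exec(src)?.[0] ?? 'no match')
}
// "====" -> no match
// "==text **bold** text==" -> ==text **bold** text==
// "==one== ==two==" -> ==one==
// "==text" -> no match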
Debugging Tokenizers
Log the Token Output
tokenize: (src, tokens, lexer) => {
  const match = /^==(.+?)==/.exec(src)
  if (match) {
    const token = {
      type: 'highlight',
      raw: match[0],
      tokens: lexer.inlineTokens(match[1]),
    }
    console.log('Tokenized:', token)
    return token
  }
  console.log('No match for:', src.substring(0, 20))
  return undefined
}
Test in Isolation
Test your tokenizer independently:
const src = '==highlighted text== and ==more=='
const match = /^==(.+)==/.exec(src)
console.log('Match:', match)
// ['==highlighted text== and ==more==', 'highlighted text== and ==more']

// Adjust the regex
const betterMatch = /^==([^=]+)==/.exec(src)
console.log('Better match:', betterMatch)
// ['==highlighted text==', 'highlighted text']
Check Token Registry
Verify your tokenizer is registered:
console.log(editor.markdown.instance)
// Check the MarkedJS instance configuration
Common Pitfalls
1. Forgetting to Return undefined
Always return `undefined` when your syntax doesn't match:
// ✅ Good
tokenize: (src, tokens, lexer) => {
  const match = /^==(.+?)==/.exec(src)
  if (!match) {
    return undefined // Important!
  }
  return {
    /* token */
  }
}

// ❌ Bad - returns falsy value
tokenize: (src, tokens, lexer) => {
  const match = /^==(.+?)==/.exec(src)
  return match
    ? {
        /* token */
      }
    : null // Should be undefined
}
2. Not Including raw
Always include the full matched string in `raw`:
return {
  type: 'highlight',
  raw: match[0], // Full match including delimiters
  text: match[1], // Content only
}
3. Wrong Level
Make sure `level` matches your tokenizer's purpose:
// Inline element (within text)
{
  level: 'inline'
}

// Block element (standalone)
{
  level: 'block'
}
4. Consuming Too Much
Be careful not to consume content beyond your syntax:
// ✅ Good - stops at closing delimiter
/^==([^=]+)==/
// ❌ Bad - might consume multiple blocks
/^==([\s\S]+)==/
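On multi-block input the difference is stark:

const src = '==one==\n\nSecond paragraph ==two=='

console.log(/^==([\s\S]+)==/.exec(src)?.[0])
// '==one==\n\nSecond paragraph ==two==' - swallows both blocks
console.log(/^==([^=]+)==/.exec(src)?.[0])
// '==one=='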
Advanced: Stateful Tokenizers
For complex syntax, maintain state across tokenization:
// Module-level state shared across tokenize calls
let nestedLevel = 0

const tokenizer = {
  name: 'nested',
  level: 'block',
  tokenize: (src, tokens, lexer) => {
    if (src.startsWith('{{')) {
      nestedLevel++
      // Handle opening delimiter
    }
    if (src.startsWith('}}')) {
      nestedLevel--
      // Handle closing delimiter
    }
    // Process based on the current nesting level
  },
}
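Note that module-level state like `nestedLevel` persists between parses, so reset it before each document; otherwise a failed parse can leak state into the next one.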
See also
- Try Utility Functions for standard patterns before creating custom tokenizers