Custom Markdown Tokenizers

Beta

Custom tokenizers extend the Markdown parser to support non-standard or custom syntax. This guide explains how tokenizers work and how to create your own.

Tip: For standard patterns like Pandoc blocks or shortcodes, check the Utility Functions first—they provide ready-made tokenizers.

What are Tokenizers?

Tokenizers are functions that identify and parse custom Markdown syntax into tokens. They're registered with MarkedJS and run during the lexing phase, before Tiptap's parse handlers process the tokens.

Note: Want to learn more about Tokenizers? Check out the Glossary.

The Tokenization Flow

Markdown String
      ↓
Custom Tokenizers (identify custom syntax)
      ↓
Standard MarkedJS Lexer
      ↓
Markdown Tokens
      ↓
Extension Parse Handlers
      ↓
Tiptap JSON

When to Use Custom Tokenizers

Use custom tokenizers when you want to support:

  • Custom inline syntax (e.g., ++inserted text++, ==highlighted==)
  • Custom block syntax (e.g., :::note, !!!warning)
  • Shortcodes (e.g., [[embed:video-id]])
  • Custom Markdown extensions
  • Domain-specific notation

Tokenizer Structure

A tokenizer is an object with these properties:

type MarkdownTokenizer = {
  name: string // Token name (must be unique)
  level?: 'block' | 'inline' // Whether the tokenizer runs at block or inline level
  start?: (src: string) => number // Index where the token might start, or -1
  tokenize: (src, tokens, lexer) => MarkdownToken | undefined // Main parsing function
}

Properties Explained

name (required)

A unique identifier for your token type:

{
  name: 'highlight',
  // ...
}

This name will be used when registering parse handlers.
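For illustration, the parse handler you register later receives tokens whose type matches this name (a sketch; parseMarkdown is shown in full below):

parseMarkdown: (token, helpers) => {
  // token.type === 'highlight', the same name registered by the tokenizer
  // ...
}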

level (optional)

Whether this tokenizer operates at block or inline level:

{
  level: 'inline', // 'block' or 'inline'
  // ...
}
  • inline: For inline elements like bold, italic, custom marks (default)
  • block: For block elements like custom containers, admonitions

start (optional)

A function that returns the index where your token might start in the source string. This is an optimization to avoid unnecessary parsing attempts:

{
  start: (src) => {
    // Find where '==' appears in the source
    return src.indexOf('==')
  },
  // ...
}

This optimization helps MarkedJS skip irrelevant parts of the text. If omitted, MarkedJS will try your tokenizer at every position.
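Since start only needs to return an index (or -1), a regex search works as well when a plain indexOf is too coarse. A sketch that matches '==' only when it is followed by content:

{
  start: (src) => src.search(/==[^=]/),
  // ...
}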

tokenize (required)

The main parsing function that identifies and tokenizes your syntax:

{
  tokenize: (src, tokens, lexer) => {
    // Try to match your syntax at the start of src
    const match = /^==(.+?)==/.exec(src)

    if (match) {
      return {
        type: 'highlight',
        raw: match[0],        // Full matched string
        text: match[1],       // Captured content
        tokens: lexer.inlineTokens(match[1]), // Parsed content
      }
    }

    // Return undefined if no match
    return undefined
  },
}

The function receives:

  • src: Remaining source text to parse
  • tokens: Previously parsed tokens (usually not needed)
  • lexer: Helper functions for tokenizing child content

To summarize, your Markdown content flows through the pipeline like this:

Markdown => Tokenizer => Lexer => Token => markdown.parse() => Tiptap JSON

And from Tiptap JSON back to Markdown:

Tiptap JSON => markdown.render() => Markdown

Creating a Simple Inline Tokenizer

Let's create a tokenizer for highlight syntax (==text==).

import { Mark } from '@tiptap/core'

const Highlight = Mark.create({
  name: 'highlight',

  // ... other config (parseHTML, renderHTML, etc.)

  // Define the custom tokenizer.
  // Note: this turns Markdown strings into **tokens**.
  markdownTokenizer: {
    name: 'highlight', // Unique token name; the parse handler receives tokens of this type
    level: 'inline', // The tokenizer level: 'inline' or 'block'

    // This function should return the index of your syntax in the src string
    // or -1 if not found. This is an optimization to avoid running the tokenizer unnecessarily
    start: src => {
      return src.indexOf('==')
    },

    // The tokenize function extracts information from the src string and returns a token object
    // or undefined if the syntax is not matched
    tokenize: (src, tokens, lexer) => {
      // Match ==text== at the start of src
      const match = /^==([^=]+)==/.exec(src)

      if (!match) {
        return undefined
      }

      return {
        type: 'highlight',
        raw: match[0], // '==text=='
        text: match[1], // 'text'
        tokens: lexer.inlineTokens(match[1]), // Parse inline content
      }
    },
  },

  // Parse the token to Tiptap JSON.
  // Note: this consumes **tokens** and transforms them into Tiptap JSON.
  parseMarkdown: (token, helpers) => {
    return helpers.applyMark('highlight', helpers.parseInline(token.tokens || []))
  },

  // Render back to Markdown
  renderMarkdown: (node, helpers) => {
    const content = helpers.renderChildren(node)
    return `==${content}==`
  },
})

Using the Extension

import { Editor } from '@tiptap/core'
import StarterKit from '@tiptap/starter-kit'
import { Markdown } from '@tiptap/markdown'
import Highlight from './Highlight'

const editor = new Editor({
  extensions: [StarterKit, Markdown, Highlight],
})

// Parse Markdown with custom syntax
editor.commands.setContent('This is ==highlighted text==!', { contentType: 'markdown' })

// Get Markdown back
console.log(editor.getMarkdown())
// This is ==highlighted text==!
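You can also inspect the parsed structure; getJSON() should show the highlight applied as a mark (output abbreviated):

console.log(editor.getJSON())
// {
//   type: 'doc',
//   content: [{
//     type: 'paragraph',
//     content: [
//       { type: 'text', text: 'This is ' },
//       { type: 'text', text: 'highlighted text', marks: [{ type: 'highlight' }] },
//       { type: 'text', text: '!' },
//     ],
//   }],
// }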

Creating a Block-Level Tokenizer

Let's create a tokenizer for admonition blocks:

:::note
This is a note
:::

import { Node } from '@tiptap/core'

const Admonition = Node.create({
  name: 'admonition',
  group: 'block',
  content: 'block+',

  addAttributes() {
    return {
      type: {
        default: 'note',
      },
    }
  },

  parseHTML() {
    return [
      {
        tag: 'div[data-admonition]',
        getAttrs: node => ({
          type: node.getAttribute('data-type'),
        }),
      },
    ]
  },

  renderHTML({ node }) {
    return [
      'div',
      { 'data-admonition': '', 'data-type': node.attrs.type },
      0, // Content
    ]
  },

  markdownTokenizer: {
    name: 'admonition',
    level: 'block',

    start: src => {
      return src.indexOf(':::')
    },

    tokenize: (src, tokens, lexer) => {
      // Match :::type\ncontent\n:::
      const match = /^:::(\w+)\n([\s\S]*?)\n:::/.exec(src)

      if (!match) {
        return undefined
      }

      return {
        type: 'admonition',
        raw: match[0],
        admonitionType: match[1], // 'note', 'warning', etc.
        text: match[2], // Content
        tokens: lexer.blockTokens(match[2]), // Parse block content
      }
    },
  },

  parseMarkdown: (token, helpers) => {
    return {
      type: 'admonition',
      attrs: {
        type: token.admonitionType || 'note',
      },
      content: helpers.parseChildren(token.tokens || []),
    }
  },

  renderMarkdown: (node, helpers) => {
    const type = node.attrs?.type || 'note'
    const content = helpers.renderChildren(node.content || [])

    return `:::${type}\n${content}\n:::\n\n`
  },
})

Using Block-Level Tokenizers

const markdown = `
# Document

:::note
This is a note with **bold** text.
:::

:::warning
This is a warning!
:::
`

editor.commands.setContent(markdown, { contentType: 'markdown' })
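After parsing, each admonition should come back as a node with its type attribute set, roughly this shape (an abbreviated sketch of the first one):

{
  type: 'admonition',
  attrs: { type: 'note' },
  content: [
    {
      type: 'paragraph',
      content: [
        { type: 'text', text: 'This is a note with ' },
        { type: 'text', text: 'bold', marks: [{ type: 'bold' }] },
        { type: 'text', text: ' text.' },
      ],
    },
  ],
}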

Atomic Tokens Without Nested Content

Not every token carries nested content. Let's create a tokenizer for an atomic inline node, an emoji shortcode, that stores its match as an attribute rather than as child tokens:

const Emoji = Node.create({
  name: 'emoji',
  group: 'inline',
  inline: true,

  addAttributes() {
    return {
      name: { default: null },
    }
  },

  parseHTML() {
    return [
      {
        tag: 'emoji',
        getAttrs: node => ({ name: node.getAttribute('data-name') }),
      },
    ]
  },

  renderHTML({ node }) {
    return ['emoji', { 'data-name': node.attrs.name }]
  },

  markdownTokenizer: {
    name: 'emoji',
    level: 'inline',

    start: src => {
      return src.indexOf(':')
    },

    tokenize: (src, tokens, lexer) => {
      // Match :emoji_name:
      const match = /^:([a-z0-9_+]+):/.exec(src)

      if (!match) {
        return undefined
      }

      return {
        type: 'emoji',
        raw: match[0],
        emojiName: match[1],
      }
    },
  },

  parseMarkdown: (token, helpers) => {
    return {
      type: 'emoji',
      attrs: {
        name: token.emojiName,
      },
    }
  },

  renderMarkdown: (node, helpers) => {
    return `:${node.attrs?.name || 'unknown'}:`
  },
})
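Usage mirrors the highlight example; a shortcode round-trips through the editor unchanged (a quick sketch, assuming the same editor setup as above):

editor.commands.setContent('Hello :wave:!', { contentType: 'markdown' })

console.log(editor.getMarkdown())
// Hello :wave:!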

Using the Lexer Helpers

The lexer parameter provides helper functions to parse nested content:

lexer.inlineTokens(src)

Parse inline content (for inline-level tokenizers):

tokenize: (src, tokens, lexer) => {
  const match = /^\[\[([^\]]+)\]\]/.exec(src)

  if (match) {
    return {
      type: 'custom',
      raw: match[0],
      tokens: lexer.inlineTokens(match[1]), // Parse inline content
    }
  }

  return undefined
}

lexer.blockTokens(src)

Parse block-level content (for block-level tokenizers):

tokenize: (src, tokens, lexer) => {
  const match = /^:::\w+\n([\s\S]*?)\n:::/.exec(src)

  if (match) {
    return {
      type: 'container',
      raw: match[0],
      tokens: lexer.blockTokens(match[1]), // Parse block content
    }
  }

  return undefined
}

Regular Expression Best Practices

Use ^ to Match from Start

Always anchor your regex to the start of the string:

// ✅ Good - matches from start
/^==(.+?)==/

// ❌ Bad - can match anywhere
/==(.+?)==/

Use Non-Greedy Matching

Use +? or *? instead of + or * for better control:

// ✅ Good - stops at first closing
/^==(.+?)==/

// ❌ Bad - matches too much
/^==(.+)==/

Test Edge Cases

Test your regex with:

  • Empty content: ====
  • Nested syntax: ==text **bold** text==
  • Multiple occurrences: ==one== ==two==
  • Unclosed syntax: ==text

When the syntax doesn't match, return undefined so the standard parser can handle the text:

// Handle unclosed syntax
const match = /^==([^=]+)==/.exec(src)
if (!match) {
  return undefined // Not matched, let standard parser handle it
}
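A quick way to check all four cases at once (a small test sketch using the stricter pattern):

const pattern = /^==([^=]+)==/
const cases = ['====', '==text **bold** text==', '==one== ==two==', '==text']

for (const input of cases) {
  console.log(JSON.stringify(input), '->', pattern.exec(input)?.[1] ?? 'no match')
}
// "===="                   -> no match (empty content)
// "==text **bold** text==" -> text **bold** text
// "==one== ==two=="        -> one (stops at the first closing delimiter)
// "==text"                 -> no match (unclosed)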

Debugging Tokenizers

Log the Token Output

tokenize: (src, tokens, lexer) => {
  const match = /^==(.+?)==/.exec(src)

  if (match) {
    const token = {
      type: 'highlight',
      raw: match[0],
      tokens: lexer.inlineTokens(match[1]),
    }

    console.log('Tokenized:', token)
    return token
  }

  console.log('No match for:', src.substring(0, 20))
  return undefined
}

Test in Isolation

Test your tokenizer independently:

const src = '==highlighted text== and more'
const match = /^==(.+?)==/.exec(src)

console.log('Match:', match)
// ['==highlighted text==', 'highlighted text']

// The character-class variant matches the same string here, but is stricter:
// it rejects content that contains a stray '='
const strictMatch = /^==([^=]+)==/.exec('==a = b==')
console.log('Strict match:', strictMatch)
// null

Check Token Registry

Verify your tokenizer is registered:

console.log(editor.markdown.instance)
// Check the MarkedJS instance configuration

Common Pitfalls

1. Forgetting to Return undefined

Always return undefined when your syntax doesn't match:

// ✅ Good
tokenize: (src, tokens, lexer) => {
  const match = /^==(.+?)==/.exec(src)
  if (!match) {
    return undefined // Important!
  }
  return {
    /* token */
  }
}

// ❌ Bad - returns null instead of undefined
tokenize: (src, tokens, lexer) => {
  const match = /^==(.+?)==/.exec(src)
  return match
    ? {
        /* token */
      }
    : null // Should be undefined
}

2. Not Including raw

Always include the full matched string in raw. The lexer uses raw to advance past your token in the source, so an incomplete raw will desynchronize parsing:

return {
  type: 'highlight',
  raw: match[0], // Full match including delimiters
  text: match[1], // Content only
}

3. Wrong Level

Make sure level matches your tokenizer's purpose:

// Inline element (within text)
{
  level: 'inline'
}

// Block element (standalone)
{
  level: 'block'
}

4. Consuming Too Much

Be careful not to consume content beyond your syntax:

// ✅ Good - stops at closing delimiter
/^==([^=]+)==/

// ❌ Bad - might consume multiple blocks
/^==([\s\S]+)==/
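The difference is easy to demonstrate on two adjacent highlights (a quick check):

const src = '==one==\n\n==two=='

console.log(/^==([^=]+)==/.exec(src)?.[0])
// '==one=='

console.log(/^==([\s\S]+)==/.exec(src)?.[0])
// '==one==\n\n==two==' (swallowed both blocks)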

Advanced: Stateful Tokenizers

For complex syntax, you may need to track state while tokenizing, for example the current nesting depth for delimiters that can nest. Here is a sketch for a hypothetical {{ ... }} container that scans for its matching closing delimiter:

const tokenizer = {
  name: 'nested',
  level: 'block',

  start: src => src.indexOf('{{'),

  tokenize: (src, tokens, lexer) => {
    if (!src.startsWith('{{')) {
      return undefined
    }

    // Scan forward, tracking nesting depth, to find the matching '}}'
    let depth = 0
    for (let i = 0; i < src.length - 1; i++) {
      if (src.startsWith('{{', i)) {
        depth++
        i++ // skip the second '{'
      } else if (src.startsWith('}}', i)) {
        depth--
        i++ // skip the second '}'
        if (depth === 0) {
          const raw = src.slice(0, i + 1)
          const text = raw.slice(2, -2)

          return {
            type: 'nested',
            raw,
            text,
            tokens: lexer.blockTokens(text),
          }
        }
      }
    }

    // Unbalanced delimiters: let the standard parser handle them
    return undefined
  },
}

See also