tokenizer

Native tokenizer module.

This module provides the native tokenizer backend used by core.tokenizer when native tokenization is enabled.

tokenizer.pattern_stats

Per-pattern native tokenizer compilation and runtime counters.

close_code

(field) close_code: string?

Closer pattern code used by the native tokenizer.


close_fast_kind

(field) close_fast_kind: integer

Native fast-path kind used by the closer pattern.


code

(field) code: string?

Opener pattern code used by the native tokenizer.


fallback_match_calls

(field) fallback_match_calls: integer

Number of fallback matcher calls for this pattern.


fast_kind

(field) fast_kind: integer

Native fast-path kind used by the opener pattern.


pattern

(field) pattern: string?

Display pattern from the syntax definition.


skipped_by_starter

(field) skipped_by_starter: integer

Number of matches skipped by starter filtering.


unknown_starter

(field) unknown_starter: boolean

True when the opener pattern has unknown start bytes.
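
As an illustration of how these per-pattern fields might be read together, the following sketch ranks patterns by how often they hit the fallback matcher. The helper name `fallback_heavy_patterns` and the `min_calls` threshold are hypothetical and not part of the module; the function only assumes it is handed an array of `tokenizer.pattern_stats` tables.

```lua
-- Hypothetical helper (not part of the tokenizer module): list the patterns
-- whose fallback matcher was called at least `min_calls` times.
-- `pattern_stats` is assumed to be an array of tokenizer.pattern_stats tables.
local function fallback_heavy_patterns(pattern_stats, min_calls)
  min_calls = min_calls or 1
  local out = {}
  for _, ps in ipairs(pattern_stats) do
    if ps.fallback_match_calls >= min_calls then
      out[#out + 1] = {
        -- `pattern` is optional (string?), so use a placeholder when absent.
        pattern = ps.pattern or "<unnamed>",
        calls = ps.fallback_match_calls,
        unknown_starter = ps.unknown_starter,
      }
    end
  end
  -- Most frequently used fallback patterns first.
  table.sort(out, function(a, b) return a.calls > b.calls end)
  return out
end
```

Patterns near the top of the returned list are the likeliest candidates for closer inspection, particularly when `unknown_starter` is also true.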


tokenizer.resume

Resume information returned by tokenizer.tokenize() when tokenization does not finish within the current frame budget.

i

(field) i: integer

Next character position to continue tokenizing from.


res

(field) res: string[]

Accumulated tokens in the form { type, text, ... }.


state

(field) state: string

Tokenizer state that should be reused on resume.
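
The token list in `res` can be walked two entries at a time. The sketch below assumes that `{ type, text, ... }` means a flat array alternating token type and token text; that reading is an assumption based on the description above, and `each_token` is a hypothetical helper name.

```lua
-- Hypothetical iterator over a flat token list, assumed to be laid out as
-- { type1, text1, type2, text2, ... }.
local function each_token(tokens)
  local i = -1
  return function()
    i = i + 2
    local kind, text = tokens[i], tokens[i + 1]
    -- Returning nothing ends the generic `for` loop.
    if kind ~= nil then
      return kind, text
    end
  end
end

-- Usage: print every accumulated token of a resume table `r`.
-- for kind, text in each_token(r.res) do print(kind, text) end
```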


tokenizer.syntax_stats

Native tokenizer compilation and runtime counters for a syntax.

compiled_patterns

(field) compiled_patterns: integer

Number of patterns with a native fast path.


fallback_match_calls

(field) fallback_match_calls: integer

Number of fallback matcher calls for this syntax.


fallback_patterns

(field) fallback_patterns: integer

Number of patterns using the fallback matcher.


has_unknown_starters

(field) has_unknown_starters: boolean

True when any pattern has unknown start bytes.


normal_run_skips

(field) normal_run_skips: integer

Number of normal text runs skipped by starter filtering.


pattern_stats

(field) pattern_stats: tokenizer.pattern_stats[]

Per-pattern counters.


patterns

(field) patterns: integer

Number of patterns imported from the syntax.


skipped_by_starter

(field) skipped_by_starter: integer

Number of matches skipped by starter filtering.
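
A one-line summary of these counters can be useful when comparing syntaxes. The following is a minimal sketch that formats a `tokenizer.syntax_stats` table; the `summarize` name and the output wording are hypothetical, and no relationship between `compiled_patterns`, `fallback_patterns` and `patterns` is assumed beyond what the field descriptions state.

```lua
-- Hypothetical formatter for a tokenizer.syntax_stats table.
local function summarize(stats)
  local total = stats.patterns
  -- Guard against a syntax that imported no patterns at all.
  local pct = total > 0 and (100 * stats.compiled_patterns / total) or 0
  return string.format(
    "%d/%d patterns native (%.1f%%), %d fallback calls, unknown starters: %s",
    stats.compiled_patterns, total, pct,
    stats.fallback_match_calls, tostring(stats.has_unknown_starters))
end
```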


extract_subsyntaxes

function tokenizer.extract_subsyntaxes(base_syntax: core.syntax.syntax, state: string)
-> syntaxes: core.syntax.syntax[]

Return the list of syntaxes active for the given tokenizer state.

@param base_syntax — The base syntax of the document.

@param state — Tokenizer state previously returned by tokenize.

@return syntaxes — Array of syntaxes starting from the innermost one.
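
Since the returned array starts with the innermost syntax, the syntax in effect at a given state should be its first element. The sketch below takes the tokenizer module as the `tok` argument rather than assuming how it is loaded; `innermost_syntax` is a hypothetical helper, and falling back to `base_syntax` for an empty result is an assumption, not documented behavior.

```lua
-- Hypothetical helper: return the syntax in effect for a tokenizer state.
-- `tok` is the tokenizer module, however it is obtained in your environment.
local function innermost_syntax(tok, base_syntax, state)
  local syntaxes = tok.extract_subsyntaxes(base_syntax, state)
  -- Index 1 is the innermost syntax; the fallback is an assumption.
  return syntaxes[1] or base_syntax
end
```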


get_syntax_stats

function tokenizer.get_syntax_stats(syntax: core.syntax.syntax)
-> stats: tokenizer.syntax_stats

Return native tokenizer compilation and runtime counters for a syntax.

@param syntax — The syntax to inspect.

@return stats — Native compilation and runtime counters.


tokenize

function tokenizer.tokenize(incoming_syntax: core.syntax.syntax, text: string, state?: string, resume?: tokenizer.resume)
-> tokens: string[]
2. state: string
3. resume: (tokenizer.resume)?

Tokenize a single line of text using the given syntax and state.

Returns tokens in the form { type, text, ... }. If the tokenizer exceeds the current frame budget before finishing the line, it also returns a third value containing the resume data needed to continue tokenizing the same line later.

@param incoming_syntax — The syntax to tokenize against.

@param text — The line text to tokenize.

@param state — Current tokenizer state.

@param resume — Resume data from a previous incomplete call.

@return tokens — Tokens in the form { type, text, ... }.

@return state — Updated tokenizer state.

@return resume — Resume data when tokenization yields before finishing.
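
The resume protocol described above can be sketched as a loop that keeps calling `tokenize` until no resume value comes back. This is a minimal sketch under stated assumptions: the module is passed in as `tok`, the state handed back on resume is taken from `resume.state`, and the final call is assumed to return the complete token list for the line. `tokenize_line` and `between_calls` are hypothetical names.

```lua
-- Hypothetical wrapper: tokenize one line to completion.
-- `tok` is the tokenizer module; `between_calls` is an optional callback
-- invoked before each continuation (for example, to yield to the editor).
local function tokenize_line(tok, syntax, text, state, between_calls)
  local tokens, new_state, resume = tok.tokenize(syntax, text, state)
  while resume do
    if between_calls then between_calls(resume) end
    -- Assumption: reuse the state stored in the resume data, as its
    -- field description suggests.
    tokens, new_state, resume = tok.tokenize(syntax, text, resume.state, resume)
  end
  return tokens, new_state
end
```

Looping without a pause defeats the purpose of the frame budget, so a real caller would likely use `between_calls` to yield control between attempts. The returned state is the one to pass in when tokenizing the next line.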