Definition
Tokens represent the fundamental units into which raw text is divided for processing by language models and other NLP systems. Rather than processing raw characters or words directly, modern language models operate on discrete token sequences, where each token maps to a numerical identifier enabling efficient computation. The tokenization process bridges human language and machine computation, determining what granularity of linguistic information models see and process. This seemingly simple preprocessing step profoundly impacts model architecture choices, computational efficiency, vocabulary coverage, and downstream performance across diverse tasks and languages.
Historical Evolution
Tokenization has evolved dramatically alongside NLP methodology. Early NLP systems operated at the word level, treating space-separated sequences as atomic units. This word-level approach faced fundamental limitations: handling unknown words (out-of-vocabulary or OOV terms) proved problematic, as systems either discarded rare words or replaced them with a universal unknown token, destroying valuable semantic information. Additionally, morphologically rich languages with productive inflection and derivation produced massive vocabularies where rare word forms appeared only once or twice in training data.
The neural era brought character-level models that processed text as sequences of individual characters, avoiding vocabulary limitations entirely but creating excessively long sequences that amplified computational costs and made learning long-range dependencies difficult. Hybrid approaches emerged that used character information to augment word-level representations: fastText composed word vectors from character n-grams, and ELMo built word representations with character convolutions, handling novel words and spelling variations while keeping sequence lengths reasonable.
The transformative breakthrough came with Byte Pair Encoding (BPE), introduced by Sennrich and colleagues in 2016 for neural machine translation. BPE discovered an elegant middle path: automatically learned subword units balancing vocabulary coverage with sequence length. Rather than manually designing tokenization schemes, BPE iteratively merged the most frequent adjacent character pairs, discovering linguistically-motivated subunits like morphemes and common prefixes/suffixes without explicit linguistic guidance. This algorithmic approach proved immediately successful, scaling to billions of parameters while maintaining fixed-size vocabularies.
Subword Tokenization Methods
Modern language models predominantly employ subword tokenization algorithms learned from training corpora. Three approaches dominate:
Byte Pair Encoding (BPE) starts with a character-level vocabulary, then iteratively merges pairs of adjacent tokens that appear most frequently in training data. After N merge operations (typically 30,000-50,000 for large models), the resulting vocabulary contains individual characters plus discovered frequent subword units. BPE's primary mechanism—merging adjacent pairs—creates sequences corresponding closely to linguistic morphemes and common subword patterns. A word like "understanding" might tokenize as ["understand", "ing"], while rare words decompose into smaller units.
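The core merge loop is simple enough to sketch directly. The following toy implementation (a minimal sketch with an invented word-frequency dictionary; real toolkits such as subword-nmt or Hugging Face tokenizers add pre-tokenization, end-of-word markers, and heavy optimization) shows how frequent adjacent pairs become vocabulary entries:

```python
# Minimal sketch of BPE merge learning on a toy corpus.
# The word counts below are invented for illustration.
from collections import Counter

def learn_bpe(word_counts, num_merges):
    # Represent each word as a tuple of single-character symbols.
    vocab = {tuple(w): c for w, c in word_counts.items()}
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair, weighted by word frequency.
        pairs = Counter()
        for symbols, count in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += count
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent adjacent pair
        merges.append(best)
        new_vocab = {}
        for symbols, count in vocab.items():  # apply the merge everywhere
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            new_vocab[tuple(out)] = new_vocab.get(tuple(out), 0) + count
        vocab = new_vocab
    return merges  # learned merge rules, in the order they were discovered

toy_counts = {"understand": 5, "understanding": 4, "standing": 3, "under": 2}
print(learn_bpe(toy_counts, num_merges=10))
```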
WordPiece, developed for BERT, uses maximum likelihood instead of frequency to select merges. Rather than greedily merging the most frequent pair, WordPiece evaluates which merge increases the language model likelihood most substantially. This statistical foundation can better balance frequent and rare units. WordPiece represents words with a special ## prefix on continuation tokens (e.g., "understanding" becomes ["understand", "##ing"]), explicitly marking subword boundaries.
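One way this criterion is commonly summarized: merge the pair whose joint count is large relative to the counts of its parts. The snippet below is a hedged sketch of that scoring rule with made-up counts, not the production WordPiece trainer:

```python
# Sketch of the pair-scoring rule commonly used to describe WordPiece training.
# All counts are illustrative assumptions.
def wordpiece_score(pair_count, left_count, right_count):
    # A pair scores highly when it occurs far more often than independence
    # of its two parts would predict, approximating the likelihood gain.
    return pair_count / (left_count * right_count)

# A tightly bound pair (e.g., a stem plus its usual suffix) ...
print(wordpiece_score(pair_count=900, left_count=1_000, right_count=1_200))
# ... outscores a pair whose parts are individually frequent but loosely coupled.
print(wordpiece_score(pair_count=900, left_count=50_000, right_count=60_000))
```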
SentencePiece combines both BPE and unigram language modeling approaches, with a crucial innovation: it operates without requiring pre-tokenization into words. Rather than assuming whitespace separates words, SentencePiece treats spaces as ordinary symbols (internally replacing them with a visible meta character, ▁), enabling truly unsupervised segmentation directly on raw text. This space-agnostic approach proves especially valuable for languages without whitespace delimiters such as Chinese, Japanese, and Korean. Additionally, SentencePiece can be trained to optimize a unigram language model—starting with a large vocabulary and iteratively removing low-probability units—enabling different vocabularies for different tasks and languages without retraining.
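In practice this is exposed through the sentencepiece Python package. The sketch below assumes a raw text file at corpus.txt (a placeholder path) and illustrative hyperparameters; it trains a unigram model and encodes a sentence without any external pre-tokenization:

```python
# Sketch of training and using a SentencePiece unigram model.
# Paths, vocabulary size, and model type are illustrative choices.
import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input="corpus.txt",         # raw text file, no pre-tokenization required
    model_prefix="unigram_8k",  # writes unigram_8k.model / unigram_8k.vocab
    vocab_size=8000,
    model_type="unigram",       # "bpe" is also supported
)

sp = spm.SentencePieceProcessor(model_file="unigram_8k.model")
pieces = sp.encode("Tokenization bridges language and computation.", out_type=str)
print(pieces)  # spaces surface as the meta symbol ▁ attached to pieces
```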
Handling Out-of-Vocabulary Words
A critical advantage of subword tokenization: any word—even those never seen during training—can be represented by combining subword units. If a model encounters the novel word "unfathomable," it decomposes it using learned subword units like ["un", "fath", "om", "able"]. This open-vocabulary property enables models to handle misspellings, new words, and domain-specific terminology without degradation.
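This behavior is easy to observe with any pretrained subword tokenizer. The sketch below assumes the transformers package and the bert-base-uncased checkpoint; the exact pieces depend on that checkpoint's learned vocabulary, so the expected behavior described in the comments is indicative only:

```python
# Quick check of open-vocabulary behavior with a pretrained tokenizer.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
for word in ["unfathomable", "transmogrification", "xqzvtr"]:
    print(word, "->", tok.tokenize(word))
# Even a nonsense string like "xqzvtr" decomposes into known subword or
# character pieces rather than collapsing to a single unknown token.
```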
However, tokenization choices significantly impact how well models handle OOV words. Recent research reveals that subword tokenizers sometimes make suboptimal segmentation choices: BPE applies its learned merges greedily in frequency-rank order, and WordPiece-style inference picks the longest vocabulary match from the left, either of which can produce poor decompositions of rare or novel words. When a vocabulary word appears as a substring within a novel OOV word, the model's learned representations for that vocabulary word may not transfer appropriately to the OOV context. Specialized techniques like LOVE (Learning Out-of-Vocabulary Embeddings) enable models to generate robust embeddings for completely unseen words by learning to predict embeddings from spelling and contextual patterns.
Multilingual and Morphologically Rich Language Considerations
Tokenization design significantly affects model performance across languages. English morphology is relatively simple compared to languages like Turkish, Finnish, and Arabic with extensive inflection and derivation. For these languages:
Morphological analyzers using finite-state transducers can systematically decompose words into roots and affixes, enabling sharing of learned representations across word families. Rather than learning separate embeddings for "run," "runs," "running," and "ran," morphological tokenization shares representations across related forms.
Morpheme-enriched tokenization integrates morphological knowledge into learned tokenizers, improving both generalization and interpretability. Domain-specific approaches like biomedical tokenization can benefit from understanding that words composed according to linguistic rules should decompose consistently.
Multilingual vocabulary sharing presents tradeoffs: shared vocabularies enable knowledge transfer between languages but often disadvantage low-resource languages. Because high-resource languages like English dominate the training corpus, their frequent words tend to survive as single tokens, whereas text in low-resource languages is fragmented into many fine-grained subwords, creating asymmetric computational costs and potentially degrading performance.
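One rough way to quantify this asymmetry is tokenization fertility, the average number of tokens per word, under a shared multilingual vocabulary. The sketch below assumes the transformers package and the xlm-roberta-base checkpoint; the sample sentences are placeholders you would replace with comparable text in each language of interest:

```python
# Rough fertility measurement (tokens per whitespace-separated word) under a
# shared multilingual vocabulary. Sample sentences are placeholders.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("xlm-roberta-base")  # shared ~250k vocab

samples = {
    "en": "The committee approved the proposal yesterday.",
    "xx": "<comparable sentence in a lower-resource language>",  # placeholder
}
for lang, text in samples.items():
    n_tokens = len(tok.tokenize(text))
    n_words = len(text.split())
    print(lang, round(n_tokens / n_words, 2), "tokens per word")
# Higher fertility for a language means longer sequences and higher cost
# for the same underlying content.
```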
Token Embeddings and Semantic Representation
Once text is tokenized, each token receives a learned vector embedding—a continuous representation capturing semantic and syntactic properties. In transformer models, token embeddings serve as model inputs, combined with positional encodings indicating sequence position. Through training, embedding space captures remarkable structure: synonyms cluster nearby, and arithmetic operations on embeddings reveal analogical relationships.
Importantly, token embeddings are contextual in modern models: the same token receives different embeddings depending on surrounding context. This contextuality enables models to handle polysemy (multiple word senses) naturally—"bank" means something different in "river bank" versus "savings bank," and contextual embeddings reflect these semantic differences.
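A quick way to see contextuality is to extract the hidden state for the same surface token in different sentences and compare them. The sketch below assumes the transformers and torch packages and the bert-base-uncased checkpoint; the "typically lower/higher" comments describe the usual pattern, not guaranteed values:

```python
# Sketch: the same token "bank" receives different contextual embeddings
# depending on its sentence. Model choice is illustrative.
import torch
from transformers import AutoModel, AutoTokenizer

name = "bert-base-uncased"
tok = AutoTokenizer.from_pretrained(name)
model = AutoModel.from_pretrained(name)

def bank_vector(sentence):
    inputs = tok(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]  # (seq_len, hidden_dim)
    # Locate the position of the "bank" token in this sentence.
    idx = inputs["input_ids"][0].tolist().index(tok.convert_tokens_to_ids("bank"))
    return hidden[idx]

v_river = bank_vector("He sat on the river bank.")
v_money = bank_vector("She deposited cash at the bank.")
v_money2 = bank_vector("The bank raised its interest rates.")

cos = torch.nn.functional.cosine_similarity
print(cos(v_river, v_money, dim=0).item())   # typically lower: different senses
print(cos(v_money, v_money2, dim=0).item())  # typically higher: same sense
```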
Computational and Efficiency Implications
Token count directly determines computational cost: every self-attention layer attends over all pairs of positions, so computation grows quadratically with sequence length. Tokenization that reduces token count therefore speeds up models while, ideally, preserving semantic information. This efficiency-accuracy tradeoff shapes tokenization choices:
Larger vocabularies (100,000+ tokens) compress text into fewer tokens but require larger embedding matrices and consume more memory. Smaller vocabularies (10,000-30,000 tokens) result in longer sequences but reduce parameter overhead. Widely used models such as BERT (roughly 30,000 WordPiece tokens) and GPT-2 (roughly 50,000 byte-level BPE tokens) sit in the 30,000-50,000 range, though recent large models trend toward bigger vocabularies.
Byte-level tokenization with a fixed 256-entry byte vocabulary avoids the vocabulary learning step entirely but creates very long sequences. Models like ByT5 demonstrate that byte-level processing can approach subword-level performance despite the longer sequences, suggesting that architectural innovations can make finer granularities efficient. (GPT-2 and GPT-3, by contrast, use byte-level BPE, which learns subword merges on top of bytes rather than operating on raw bytes.)
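A back-of-envelope calculation makes the tradeoff concrete. All numbers below (vocabulary sizes, tokens-per-word fertilities, model width) are illustrative assumptions rather than measurements:

```python
# Illustrative comparison of two hypothetical tokenizers on the same text.
d_model = 1024                   # assumed embedding width
words = 1_000                    # length of the input text in words

tokenizers = [
    ("A", 32_000, 1.8),          # smaller vocab, more tokens per word
    ("B", 100_000, 1.2),         # larger vocab, fewer tokens per word
]
for name, vocab_size, fertility in tokenizers:
    seq_len = int(words * fertility)
    attn_cost = seq_len ** 2             # per-layer self-attention ~ n^2
    embed_params = vocab_size * d_model  # embedding-matrix parameters
    print(name, seq_len, f"{attn_cost:,}", f"{embed_params:,}")
# Tokenizer B cuts per-layer attention cost by roughly (1.8/1.2)^2 ≈ 2.25x
# relative to A, but its embedding matrix is roughly 3x larger.
```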
Vocabulary Size and Task Performance
Surprisingly, optimal vocabulary size varies by language and task. Empirical studies show:
English LLMs typically use 50,000-100,000 token vocabularies. Recent work on Less-is-Better tokenization suggests that a smaller, integrated vocabulary spanning subwords, words, and multiword expressions could improve efficiency while maintaining performance.
Low-resource languages benefit from smaller vocabularies (5,000-10,000 tokens) capturing language-specific patterns, while high-resource languages enable larger vocabularies capturing fine-grained distinctions.
Domain-specific tokenization can improve performance—medical domains benefit from biomedical tokenizers decomposing specialized terminology according to domain conventions rather than general corpus statistics.
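A simple way to check this on your own data is to compare segmentations side by side. In the sketch below, bert-base-uncased stands in for a general-purpose tokenizer, while ./biomedical-tokenizer is a hypothetical local path for any tokenizer trained on in-domain text:

```python
# Sketch comparing general-purpose and domain-adapted segmentation of a
# biomedical term. The domain checkpoint path is a hypothetical placeholder.
from transformers import AutoTokenizer

general = AutoTokenizer.from_pretrained("bert-base-uncased")
domain = AutoTokenizer.from_pretrained("./biomedical-tokenizer")  # placeholder

term = "acetylcholinesterase inhibitors"
print("general:", general.tokenize(term))  # typically many short fragments
print("domain: ", domain.tokenize(term))   # ideally fewer, morpheme-like pieces
```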
Modern Innovations and Challenges
Learned tokenization treats tokenization as a learnable component rather than fixed preprocessing, with end-to-end models learning optimal segmentations jointly with downstream tasks. Models like Charformer use gradient-based soft tokenization, learning position-wise scores for candidate subword blocks.
Factorized tokenization decomposes subwords into discrete latent factors, enabling more structured representations and potentially improved generalization.
Multi-granularity tokenization represents text at multiple scales simultaneously—characters, subwords, and words—enabling models to leverage linguistic structure at different levels without forcing commitment to single granularities.
No-tokenization approaches process raw bytes or characters directly with efficient architectures, questioning whether separate tokenization steps are necessary. However, language-agnostic approaches sometimes perform worse on morphologically rich languages, suggesting tokenization's importance persists.
Practical Implications
The tokenization choice affects:
Reproducibility: Different tokenizers produce different token sequences, and tokenization itself can become non-deterministic if it includes stochastic components such as BPE-dropout or other subword regularization.
Interpretability: Token boundaries make it easier to understand what language models "see," though subword tokenization obscures word-level phenomena.
Transfer Learning: Models pretrained with one tokenizer cannot simply be paired with another; switching tokenizers requires adapting the vocabulary and embedding matrix, which can degrade performance if the adaptation is suboptimal.
Multilingual Processing: Shared tokenizers create asymmetric treatment across languages; fine-tuning tokenizers per language improves performance but increases model size.