Skip to content

7.1.2 Tokenization and Tokenizer

Tokenizer Subword Splitting Flowchart

A neural network does not read strings directly. It receives tensors. A tokenizer is the contract between human text and model tensors:

raw texttokensinput_idsmodel

Most LLM issues that look mysterious become easier once you inspect this contract:

  • a word may become several tokens;
  • punctuation, casing, Chinese, code, and emojis may change token count a lot;
  • padding makes examples in one batch the same length;
  • truncation silently removes content if the sequence is too long;
  • chat templates add hidden structure tokens around system, user, and assistant messages.

Tokenizer Granularity Trade-off Diagram

There are three common choices:

MethodExampleStrengthWeakness
Character-levelr e f u n dalmost no unknown wordsvery long sequences
Word-levelrefund policyintuitive meaning unitsmany out-of-vocabulary words
Subword-leveltoken ##izationpractical balanceharder to read by eye

Modern LLMs usually use subword tokenization. BPE, WordPiece, and SentencePiece are different ways to learn reusable fragments from a corpus. The important idea is the same: frequent fragments get stable IDs, rare words can still be composed from smaller pieces.

Lab 1: Build a Tiny WordPiece-Style Tokenizer

Section titled “Lab 1: Build a Tiny WordPiece-Style Tokenizer”

Run this first. It is small enough to understand line by line, but it contains the same objects you see in real model APIs.

import re
VOCAB = {
"[PAD]": 0,
"[UNK]": 1,
"[CLS]": 2,
"[SEP]": 3,
"refund": 4,
"policy": 5,
"reset": 6,
"password": 7,
"transform": 8,
"##er": 9,
"##s": 10,
"token": 11,
"##ization": 12,
"please": 13,
"help": 14,
"need": 15,
"evidence": 16,
}
def words(text):
return re.findall(r"[A-Za-z]+", text.lower())
def split_wordpiece(word):
if word in VOCAB:
return [word]
pieces = []
start = 0
while start < len(word):
match = None
for end in range(len(word), start, -1):
piece = word[start:end] if start == 0 else "##" + word[start:end]
if piece in VOCAB:
match = piece
break
if match is None:
return ["[UNK]"]
pieces.append(match)
start = end
return pieces
def encode(text, max_length=10):
tokens = ["[CLS]"]
for word in words(text):
tokens.extend(split_wordpiece(word))
tokens.append("[SEP]")
original_len = len(tokens)
if len(tokens) > max_length:
tokens = tokens[:max_length]
tokens[-1] = "[SEP]"
input_ids = [VOCAB.get(token, VOCAB["[UNK]"]) for token in tokens]
attention_mask = [1] * len(input_ids)
while len(input_ids) < max_length:
tokens.append("[PAD]")
input_ids.append(VOCAB["[PAD]"])
attention_mask.append(0)
return {
"text": text,
"original_len": original_len,
"tokens": tokens,
"input_ids": input_ids,
"attention_mask": attention_mask,
}
for example in [
"Please help reset password",
"Transformers refund policy",
"Tokenization needs evidence",
]:
row = encode(example, max_length=10)
print("-" * 64)
print("text:", row["text"])
print("original_len:", row["original_len"])
print("tokens:", row["tokens"])
print("input_ids:", row["input_ids"])
print("attention_mask:", row["attention_mask"])

Expected output:

Terminal window
----------------------------------------------------------------
text: Please help reset password
original_len: 6
tokens: ['[CLS]', 'please', 'help', 'reset', 'password', '[SEP]', '[PAD]', '[PAD]', '[PAD]', '[PAD]']
input_ids: [2, 13, 14, 6, 7, 3, 0, 0, 0, 0]
attention_mask: [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]
----------------------------------------------------------------
text: Transformers refund policy
original_len: 7
tokens: ['[CLS]', 'transform', '##er', '##s', 'refund', 'policy', '[SEP]', '[PAD]', '[PAD]', '[PAD]']
input_ids: [2, 8, 9, 10, 4, 5, 3, 0, 0, 0]
attention_mask: [1, 1, 1, 1, 1, 1, 1, 0, 0, 0]
----------------------------------------------------------------
text: Tokenization needs evidence
original_len: 7
tokens: ['[CLS]', 'token', '##ization', 'need', '##s', 'evidence', '[SEP]', '[PAD]', '[PAD]', '[PAD]']
input_ids: [2, 11, 12, 15, 10, 16, 3, 0, 0, 0]
attention_mask: [1, 1, 1, 1, 1, 1, 1, 0, 0, 0]

Read the output like this:

  • [CLS] and [SEP] are structure tokens.
  • transformers becomes transform, ##er, ##s because the whole word is not in VOCAB.
  • input_ids are the integer values the model actually receives.
  • attention_mask=0 marks [PAD] positions so the model can ignore them.

Tokenizer to input_ids and attention_mask Diagram

Now force the same tokenizer into a small context window.

row = encode("Please help reset password refund policy evidence", max_length=6)
print("original_len:", row["original_len"])
print("tokens:", row["tokens"])
print("input_ids:", row["input_ids"])
print("attention_mask:", row["attention_mask"])

Expected output:

Terminal window
original_len: 9
tokens: ['[CLS]', 'please', 'help', 'reset', 'password', '[SEP]']
input_ids: [2, 13, 14, 6, 7, 3]
attention_mask: [1, 1, 1, 1, 1, 1]

The words refund policy evidence disappeared. In a real support assistant, that could remove the user’s actual intent. This is why tokenization is not a small preprocessing detail; it affects cost, retrieval size, prompt design, and failure modes.

Lab 3: Inspect a Real Hugging Face Tokenizer

Section titled “Lab 3: Inspect a Real Hugging Face Tokenizer”

Use this when you have internet access for the first model download.

Terminal window
python -m pip install "transformers>=4.0" torch
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
batch = tokenizer(
["Please help reset password", "Tokenization needs evidence"],
padding="max_length",
truncation=True,
max_length=10,
return_tensors="pt",
)
print(batch.keys())
print(batch["input_ids"].shape)
print(tokenizer.convert_ids_to_tokens(batch["input_ids"][1]))
print(batch["attention_mask"][1].tolist())

Expected shape-level output:

dict_keys(['input_ids', 'token_type_ids', 'attention_mask'])
torch.Size([2, 10])
['[CLS]', 'token', '##ization', 'needs', 'evidence', '[SEP]', '[PAD]', '[PAD]', '[PAD]', '[PAD]']
[1, 1, 1, 1, 1, 1, 0, 0, 0, 0]

Tokenizer run output result map

The exact split can differ across tokenizers. That is the point: always inspect the tokenizer that belongs to the model you actually use.

TermMeaning in practice
vocabtoken-to-ID dictionary learned during tokenizer training
OOVout-of-vocabulary; often handled by [UNK] or subword composition
BPEmerges frequent character pairs into reusable subwords
WordPiecesimilar subword idea, common in BERT-style tokenizers
SentencePiecetreats text as a raw stream; useful for multilingual and no-space languages
padding_sidewhether pads are added on the left or right; important for some decoder models
context lengthmaximum token budget for input plus generated output
chat templatetokenizer-level formatting that adds role and boundary tokens

When a prompt behaves strangely, inspect the tokenizer before blaming the model:

  • Print tokens and token IDs for the exact prompt.
  • Count tokens after the chat template, not just raw user text.
  • Check whether truncation removed instructions, retrieved evidence, or the latest user question.
  • Verify padding side and attention_mask when batching decoder models.
  • Compare Chinese, English, code, and emoji-heavy inputs; their token counts can differ sharply.

Keep this page’s proof of learning as a small evidence card:

Sample Text
one English/CJK/code-like example
Tokens
printed token list and token count
Truncation Case
what was cut and why it matters
Product Risk
cost, context limit, or lost instruction
Debug Action
inspect tokenization before blaming the model
  1. Remove transform from VOCAB. What happens to Transformers refund policy?
  2. Change max_length from 10 to 5. Which useful tokens disappear first?
  3. Add "##ing" and test resetting password. Can your tokenizer represent it?
  4. Run Lab 3 with another model tokenizer and compare token counts for Chinese, English, and code.
  5. For a RAG prompt, decide how many tokens you reserve for system instructions, retrieved evidence, user question, and answer space.
Reference implementation and walkthrough
  1. Removing transform should make Transformers harder to represent. Depending on the toy rules, it may fall back to [UNK] or a less useful split, which shows why vocabulary coverage matters.
  2. With max_length=5, later tokens are truncated first. In a real prompt, this can remove constraints, retrieved evidence, or the user question tail.
  3. "##ing" only helps if the tokenizer can also find a useful stem. Subword tokens solve many surface-form issues, but they do not magically understand words.
  4. Different tokenizers can produce very different token counts for the same text. Chinese, code, and emoji-heavy strings are especially worth checking before estimating cost or context length.
  5. A reasonable first budget could reserve about 10-15% for system instructions, 60-70% for retrieved evidence, 5-10% for the user question, and the rest for the answer. The exact split should follow product risk.

A tokenizer is not just a text splitter. It defines the model’s visible world:

text boundarytoken boundaryID sequenceattention maskcontext budget

If you can inspect that path, you can explain many practical LLM problems before touching model architecture.