7.1.1 NLP Crash Roadmap: Text to Tokens to Vectors

LLMs become much easier to understand once you see how raw text turns into something a model can process: text -> tokens -> IDs -> vectors -> model output.

Look at the Flow First

[Figure: NLP crash course chapter flowchart]

| Word | First meaning |
| --- | --- |
| token | a piece of text used by the model |
| tokenizer | tool that splits text and maps pieces to IDs |
| embedding | dense vector for a token or text |
| pretrained model | model already trained on broad text |
| Hugging Face | model/dataset/tool ecosystem |
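Real tokenizers usually split text into subwords rather than whole words. Here is a minimal sketch of loading one, assuming the Hugging Face transformers package is installed; "bert-base-uncased" is just one example checkpoint, not a recommendation.

from transformers import AutoTokenizer

# Load the tokenizer that shipped with a pretrained checkpoint.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

text = "RAG retrieves evidence before answering"
print(tokenizer.tokenize(text))  # text -> subword tokens
print(tokenizer.encode(text))    # text -> IDs (adds special tokens like [CLS]/[SEP])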

Run One Tiny Token Lab

text = "RAG retrieves evidence before answering"
tokens = text.lower().split()
vocab = {token: index for index, token in enumerate(sorted(set(tokens)))}
ids = [vocab[token] for token in tokens]

print("tokens:", tokens)
print("ids:", ids)
print("unique_tokens:", len(vocab))

Expected output:

tokens: ['rag', 'retrieves', 'evidence', 'before', 'answering']
ids: [3, 4, 2, 1, 0]
unique_tokens: 5

Real tokenizers are smarter, but this shows the main idea: text must become stable pieces and IDs before vectors and models can work.
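The next step in the flow, IDs -> vectors, is an embedding lookup. Below is a toy sketch using a random table (real models learn these vectors during training); the shapes match the tiny lab above, and NumPy is assumed to be installed.

import numpy as np

rng = np.random.default_rng(0)
vocab_size, dim = 5, 4                        # 5 unique tokens, 4-dimensional vectors
embedding_table = rng.normal(size=(vocab_size, dim))

ids = [3, 4, 2, 1, 0]                         # IDs from the tiny lab above
vectors = embedding_table[ids]                # row lookup: one dense vector per token
print(vectors.shape)                          # (5, 4)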

Learn in This Order

| Order | Read | What to practice |
| --- | --- | --- |
| 1 | 7.1.2 Tokenizer | text -> tokens -> IDs |
| 2 | 7.1.3 Embeddings | tokens/text -> vectors |
| 3 | 7.1.4 Pretrained Models | load and reuse model capability |
| 4 | 7.1.5 Hugging Face Quickstart | pipeline, model card, local run |
| 5 | 7.1.6 Tokenizer and Embedding Lab | inspect tokens and vectors |
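Steps 3 and 4 come together in a single call: a Hugging Face pipeline downloads a pretrained model and runs it locally. A minimal sketch, assuming transformers plus a backend such as PyTorch is installed; "sentiment-analysis" is just one example task.

from transformers import pipeline

# On first run this downloads a default pretrained model for the task.
classifier = pipeline("sentiment-analysis")
print(classifier("RAG retrieves evidence before answering"))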

Pass Check

You pass this roadmap when you can explain why raw text needs tokenization, why embeddings are vectors, and why pretrained models are reused instead of trained from scratch.