7.1.1 NLP Crash Roadmap: Text to Tokens to Vectors

Before LLMs feel understandable, first see how text becomes pieces a model can process: text -> tokens -> IDs -> vectors -> model output.

Look at the Flow First

NLP crash course chapter flowchart

Word	First meaning
token	a piece of text used by the model
tokenizer	tool that splits text and maps pieces to IDs
embedding	dense vector for a token or text
pretrained model	model already trained on broad text
Hugging Face	model/dataset/tool ecosystem

Run One Tiny Token Lab

text = "RAG retrieves evidence before answering"
tokens = text.lower().split()
vocab = {token: index for index, token in enumerate(sorted(set(tokens)))}
ids = [vocab[token] for token in tokens]

print("tokens:", tokens)
print("ids:", ids)
print("unique_tokens:", len(vocab))

Expected output:

tokens: ['rag', 'retrieves', 'evidence', 'before', 'answering']
ids: [3, 4, 2, 1, 0]
unique_tokens: 5

Real tokenizers are smarter, but this shows the main idea: text must become stable pieces and IDs before vectors and models can work.

Learn in This Order

Order	Read	What to practice
1	7.1.2 Tokenizer	text -> tokens -> IDs
2	7.1.3 Embeddings	tokens/text -> vectors
3	7.1.4 Pretrained Models	load and reuse model capability
4	7.1.5 Hugging Face Quickstart	pipeline, model card, local run
5	7.1.6 Tokenizer and Embedding Lab	inspect tokens and vectors

Evidence to Keep

Keep this page’s proof of learning as a small evidence card:

Text Path: raw text → tokens → ids → embeddings
Token Risk: long input can hit context or cost limits
Embedding Use: similarity can support retrieval but is not reasoning
Model Bridge: pretrained model = shared foundation plus task behavior
Next Action: run tokenizer and embedding labs before Prompt work

Pass Check

You pass this roadmap when you can explain why raw text needs tokenization, why embeddings are vectors, and why pretrained models are reused instead of trained from zero.

Check reasoning and explanation

A passing answer explains how tokens, context, attention, prompts, and generation behavior connect in one request-response path.
The evidence should include at least one reproducible prompt or structured-output test, plus notes on why the output passed or failed.
A good self-check separates prompt design, RAG, fine-tuning, and alignment: use the lightest method that fixes the observed problem.