7.1.1 NLP Crash Roadmap: Text to Tokens to Vectors

LLMs become much easier to understand once you see how raw text turns into something a model can process: text -> tokens -> IDs -> vectors -> model output.

Look at the Flow First

[Figure: NLP crash course chapter flowchart]

| Word | First meaning |
| --- | --- |
| token | a piece of text used by the model |
| tokenizer | tool that splits text and maps pieces to IDs |
| embedding | dense vector for a token or text |
| pretrained model | model already trained on broad text |
| Hugging Face | model/dataset/tool ecosystem |
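Real tokenizers usually split text into subwords rather than whole words. Here is a minimal sketch of loading one, assuming the Hugging Face transformers package is installed; "bert-base-uncased" is just one example checkpoint, not a recommendation.

from transformers import AutoTokenizer

# Load the tokenizer that shipped with a pretrained checkpoint.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

text = "RAG retrieves evidence before answering"
print(tokenizer.tokenize(text))  # text -> subword tokens
print(tokenizer.encode(text))    # text -> IDs (adds special tokens like [CLS]/[SEP])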

Run One Tiny Token Lab

text = "RAG retrieves evidence before answering"
tokens = text.lower().split()
vocab = {token: index for index, token in enumerate(sorted(set(tokens)))}
ids = [vocab[token] for token in tokens]

print("tokens:", tokens)
print("ids:", ids)
print("unique_tokens:", len(vocab))

Expected output:

tokens: ['rag', 'retrieves', 'evidence', 'before', 'answering']
ids: [3, 4, 2, 1, 0]
unique_tokens: 5

Real tokenizers are smarter, but this shows the main idea: text must become stable pieces and IDs before vectors and models can work.
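The next step in the flow, IDs -> vectors, is an embedding lookup. Below is a toy sketch using a random table (real models learn these vectors during training); the shapes match the tiny lab above, and NumPy is assumed to be installed.

import numpy as np

rng = np.random.default_rng(0)
vocab_size, dim = 5, 4                        # 5 unique tokens, 4-dimensional vectors
embedding_table = rng.normal(size=(vocab_size, dim))

ids = [3, 4, 2, 1, 0]                         # IDs from the tiny lab above
vectors = embedding_table[ids]                # row lookup: one dense vector per token
print(vectors.shape)                          # (5, 4)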

Learn in This Order

| Order | Read | What to practice |
| --- | --- | --- |
| 1 | 7.1.2 Tokenizer | text -> tokens -> IDs |
| 2 | 7.1.3 Embeddings | tokens/text -> vectors |
| 3 | 7.1.4 Pretrained Models | load and reuse model capability |
| 4 | 7.1.5 Hugging Face Quickstart | pipeline, model card, local run |
| 5 | 7.1.6 Tokenizer and Embedding Lab | inspect tokens and vectors |
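Steps 3 and 4 come together in a single call: a Hugging Face pipeline downloads a pretrained model and runs it locally. A minimal sketch, assuming transformers plus a backend such as PyTorch is installed; "sentiment-analysis" is just one example task.

from transformers import pipeline

# On first run this downloads a default pretrained model for the task.
classifier = pipeline("sentiment-analysis")
print(classifier("RAG retrieves evidence before answering"))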

Pass Check

You pass this roadmap when you can explain why raw text needs tokenization, why embeddings are vectors, and why pretrained models are reused instead of trained from scratch.