
7.1.1 NLP Crash Roadmap: Text to Tokens to Vectors

LLMs are easier to understand once you see how text becomes pieces a model can process: text -> tokens -> IDs -> vectors -> model output.

[Figure: NLP crash course chapter flowchart]

| Word | First meaning |
| --- | --- |
| token | a piece of text used by the model |
| tokenizer | tool that splits text and maps pieces to IDs |
| embedding | dense vector for a token or text |
| pretrained model | model already trained on broad text |
| Hugging Face | model/dataset/tool ecosystem |

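text = "RAG retrieves evidence before answering"

# Toy tokenizer: lowercase the text and split on whitespace.
tokens = text.lower().split()

# Assign each unique token an ID based on its alphabetical position.
vocab = {token: index for index, token in enumerate(sorted(set(tokens)))}

# Map every token in the text to its ID.
ids = [vocab[token] for token in tokens]

print("tokens:", tokens)
print("ids:", ids)
print("unique_tokens:", len(vocab))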

Expected output:

tokens: ['rag', 'retrieves', 'evidence', 'before', 'answering']
ids: [3, 4, 2, 1, 0]
unique_tokens: 5

Real tokenizers are smarter, but this shows the main idea: text must become stable pieces and IDs before vectors and models can work.
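The last two steps of the path, IDs -> vectors, can be sketched by extending the toy example. This is a minimal illustration, not how a real library does it: the `<unk>` fallback ID, the helper names (`build_vocab`, `encode`, `make_embedding_table`), the vector size of 4, and the random values are all assumptions made for this sketch. In a real model the embedding table is learned during training.

```python
import random

def build_vocab(tokens):
    # Reserve ID 0 for words never seen before ("<unk>" is an assumed name).
    vocab = {"<unk>": 0}
    for token in sorted(set(tokens)):
        vocab[token] = len(vocab)
    return vocab

def encode(text, vocab):
    # Same toy tokenizer as above: lowercase, then split on whitespace.
    return [vocab.get(token, vocab["<unk>"]) for token in text.lower().split()]

def make_embedding_table(vocab_size, dim, seed=0):
    # One row of `dim` numbers per token ID. Random here; learned in real models.
    rng = random.Random(seed)
    return [[rng.uniform(-1.0, 1.0) for _ in range(dim)] for _ in range(vocab_size)]

known_tokens = "RAG retrieves evidence before answering".lower().split()
vocab = build_vocab(known_tokens)
table = make_embedding_table(len(vocab), dim=4)

# "documents" is not in the vocabulary, so it falls back to the <unk> ID.
ids = encode("RAG retrieves documents", vocab)
vectors = [table[i] for i in ids]

print("ids:", ids)                                       # ids: [4, 5, 0]
print("shape:", (len(vectors), len(vectors[0])))         # shape: (3, 4)
```

The unknown word shows one reason real tokenizers split text into smaller subword pieces: a whitespace vocabulary loses every word it has not seen.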

| Order | Read | What to practice |
| --- | --- | --- |
| 1 | 7.1.2 Tokenizer | text -> tokens -> IDs |
| 2 | 7.1.3 Embeddings | tokens/text -> vectors |
| 3 | 7.1.4 Pretrained Models | load and reuse model capability |
| 4 | 7.1.5 Hugging Face Quickstart | pipeline, model card, local run |
| 5 | 7.1.6 Tokenizer and Embedding Lab | inspect tokens and vectors |

Keep this page’s proof of learning as a small evidence card:

Text Path: raw text -> tokens -> IDs -> embeddings
Token Risk: long input can hit context or cost limits
Embedding Use: similarity can support retrieval but is not reasoning
Model Bridge: pretrained model = shared foundation plus task behavior
Next Action: run tokenizer and embedding labs before Prompt work
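
The "Embedding Use" entry can be made concrete with a small similarity sketch. It assumes word-count vectors as a stand-in for learned embeddings, and the documents, query, and helper names (`count_vector`, `cosine`) are invented for illustration. Cosine similarity itself is the standard measure: the dot product of two vectors divided by the product of their lengths.

```python
import math
from collections import Counter

def count_vector(text, vocab):
    # Stand-in for an embedding: one word count per vocabulary entry.
    counts = Counter(text.lower().split())
    return [counts[token] for token in vocab]

def cosine(a, b):
    # Dot product divided by the product of the vector lengths.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

docs = [
    "rag retrieves evidence before answering",
    "embeddings map text to vectors",
    "tokenizers split text into tokens",
]
query = "how does rag find evidence"

# Shared vocabulary so every vector has the same dimensions.
vocab = sorted({token for text in docs + [query] for token in text.lower().split()})

query_vec = count_vector(query, vocab)
scores = [cosine(query_vec, count_vector(doc, vocab)) for doc in docs]
best = max(range(len(docs)), key=lambda i: scores[i])

print("best match:", docs[best])  # best match: rag retrieves evidence before answering
```

The top score only says the query and the document point in a similar direction. It picks which text to retrieve; it does not check whether that text answers the question.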

You pass this roadmap when you can explain why raw text needs tokenization, why embeddings are vectors, and why pretrained models are reused instead of trained from zero.

Check reasoning and explanation
  1. A passing answer explains how tokens, context, attention, prompts, and generation behavior connect in one request-response path.
  2. The evidence should include at least one reproducible prompt or structured-output test, plus notes on why the output passed or failed.
  3. A good self-check separates prompt design, RAG, fine-tuning, and alignment: use the lightest method that fixes the observed problem.