11.2.1 Representation Roadmap: Meaning as Vectors

Representation learning asks how text can become numbers that carry meaning, not just identity.

Figure: learning sequence for the NLP representation learning chapter.

Figure: embedding semantic space.

Figure: contextual embedding comparison.

The path moves from sparse word identity, to word vectors, to contextual vectors, to language models that learn broader language patterns.
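To make the first step of that path concrete, here is a minimal sketch of "sparse word identity" using one-hot vectors. The three-word vocabulary and the helper name `one_hot` are assumptions chosen for illustration, not part of any library.

```python
# Toy vocabulary; the words and their order are illustrative assumptions.
vocab = ["cat", "dog", "car"]

def one_hot(word):
    """Return a sparse identity vector: 1.0 at the word's index, 0.0 elsewhere."""
    return [1.0 if w == word else 0.0 for w in vocab]

def dot(a, b):
    # Dot product: multiply matching components and add them up.
    return sum(x * y for x, y in zip(a, b))

print("cat:", one_hot("cat"))
print("cat_cat:", dot(one_hot("cat"), one_hot("cat")))
print("cat_dog:", dot(one_hot("cat"), one_hot("dog")))
print("cat_car:", dot(one_hot("cat"), one_hot("car")))
```

Every pair of different words scores 0.0 here, so "cat" is no closer to "dog" than to "car". A one-hot vector records which word it is and nothing about what the word means, which is the gap that learned word vectors are meant to close.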

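# Toy 2-d word vectors; the numbers are hand-picked for illustration.
vectors = {
    "cat": [1.0, 0.8],
    "dog": [0.9, 0.7],
    "car": [0.1, 0.2],
}

def dot(a, b):
    # Dot product: multiply matching components and add them up.
    return sum(x * y for x, y in zip(a, b))

print("cat_dog:", round(dot(vectors["cat"], vectors["dog"]), 2))
print("cat_car:", round(dot(vectors["cat"], vectors["car"]), 2))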

Expected output:

cat_dog: 1.46
cat_car: 0.26

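This is a toy score, but it shows the core idea: words with close meanings should get vectors that score higher against each other than against unrelated words, so a model can compare meanings numerically.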
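A raw dot product also grows with vector length, so a short vector such as "car" scores low partly because it is short. One common way to compare direction only is cosine similarity. The sketch below reuses the same toy vectors; the function name `cosine` is an illustrative choice.

```python
import math

# Same toy vectors as in the dot-product example.
vectors = {
    "cat": [1.0, 0.8],
    "dog": [0.9, 0.7],
    "car": [0.1, 0.2],
}

def cosine(a, b):
    """Cosine similarity: dot product divided by the product of vector lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

print("cat_dog:", round(cosine(vectors["cat"], vectors["dog"]), 3))
print("cat_car:", round(cosine(vectors["cat"], vectors["car"]), 3))
```

With these numbers, "cat" and "dog" point in almost the same direction, while "cat" and "car" still come out around 0.91, because every toy vector sits in the same positive quadrant. Much of the gap in the dot-product scores came from vector length, which is worth knowing before you read too much into either score.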

| Step | Read | Practice Output |
| --- | --- | --- |
| 1 | Word embeddings | Explain semantic closeness as vector closeness |
| 2 | Contextual representations | Explain why the same word can mean different things |
| 3 | Language models | Connect representation learning to next-token or masked prediction |
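For step 2, the following is a deliberately simplified sketch of why a contextual vector can differ for the same word. It blends a word's static vector with the average of its neighbours. Real contextual models learn this mixing with attention layers; the vectors, the `mix` weight, and the `contextual` helper here are all hypothetical.

```python
# Hypothetical 2-d static vectors; read the dimensions loosely as (nature, finance).
static = {
    "bank": [0.5, 0.5],
    "river": [0.9, 0.1],
    "loan": [0.1, 0.9],
}

def contextual(tokens, index, mix=0.5):
    """Blend the target word's static vector with the mean of its neighbours."""
    target = static[tokens[index]]
    neighbours = [static[t] for i, t in enumerate(tokens) if i != index]
    # Column-wise mean over the neighbour vectors.
    mean = [sum(col) / len(neighbours) for col in zip(*neighbours)]
    return [(1 - mix) * t + mix * m for t, m in zip(target, mean)]

river_bank = contextual(["river", "bank"], 1)
bank_loan = contextual(["bank", "loan"], 0)

print("static bank:", static["bank"])
print("river bank :", [round(x, 2) for x in river_bank])
print("bank loan  :", [round(x, 2) for x in bank_loan])
```

The static vector for "bank" is the same in both phrases, but the blended vector leans toward "river" in one and toward "loan" in the other. That shift is the behaviour to explain in the step 2 practice output.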

Keep this page’s proof of learning as a small evidence card:

Representation: BoW, TF-IDF, static embedding, contextual embedding, or language-model score
Comparison: nearest text, similarity score, or next-token/log-prob style output
Interpretation: what the representation captures and what it misses
Failure Check: polysemy, domain mismatch, short text, tokenization, or semantic drift
Expected Output: small comparison table with at least one surprising result
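One possible way to fill in such a card is sketched below with a bag-of-words (BoW) representation and cosine similarity. The three example sentences and their labels are invented for illustration, and the tokenizer is a plain whitespace split, which is itself a simplifying assumption.

```python
import math
from collections import Counter

# Invented example texts; labels are only used for printing.
texts = {
    "money_bank": "deposit money at the bank",
    "river_bank": "the river bank flooded",
    "atm_cash": "withdraw cash from an atm",
}

def bow(text):
    """Bag of words: count each lowercase, whitespace-separated token."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two token-count dictionaries."""
    dot = sum(a[w] * b[w] for w in set(a) & set(b))
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

query = "money_bank"
print(f"{'compared with':<14}{'BoW cosine':>10}")
for label, text in texts.items():
    if label != query:
        score = cosine(bow(texts[query]), bow(text))
        print(f"{label:<14}{score:>10.2f}")
```

The surprising row is that the river sentence scores higher against the money sentence than the ATM sentence does, purely because of the shared tokens "the" and "bank". On the card, that would go under Failure Check as polysemy plus vocabulary mismatch: BoW captures token overlap and misses meaning.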

You pass this chapter when you can compare sparse features, word embeddings, and contextual embeddings, and explain why representation quality affects classification, retrieval, and retrieval-augmented generation (RAG).

Check reasoning and explanation
  1. A passing answer starts from the text unit and output type: token, span, sentence label, sequence, embedding, or generated text.
  2. The evidence should include a small dataset example, model or pipeline choice, metric, and at least one inspected error case.
  3. A good self-check distinguishes preprocessing issues from model issues, such as tokenization mistakes, label ambiguity, data imbalance, or hallucinated generation.