11.2.3 Contextual Embeddings

Learning goals
- Understand why fixed word vectors are not enough
- Understand the core idea behind contextualized representations
- Build an intuition for “the same word, different vector” through runnable examples
- Understand why this is a key turning point from traditional NLP to modern pretrained models
First, build a map
For beginners, the best way into this section is not to think “it is just a stronger word embedding,” but to first see this chain clearly:
```mermaid
flowchart LR
    A["Static word vectors"] --> B["The same word always has one representation"]
    B --> C["Polysemous words get stuck"]
    C --> D["Contextualized representations"]
    D --> E["The same word has different vectors in different sentences"]
```

So the questions this section really sets out to answer are:
- Why fixed word vectors eventually become insufficient
- Why “word meaning depends on context” changes the main direction of NLP
A more beginner-friendly overall analogy
You can think of static word vectors and contextualized representations as:
- Static word vectors are like a “fixed profile photo” in a dictionary
- Contextualized representations are like an actor’s “role state” in different scenes
The same actor is still the same person, but in different scenes their expression, actions, and role identity change. Likewise, the same word in different sentences should not always look like the same vector.
What is the fundamental limitation of fixed word vectors?
A word may have multiple meanings
Classic example:
bank
It can mean:
- a financial bank
- a river bank
If it always has only one fixed vector, then what should that vector be closer to?
So fixed word vectors struggle with polysemous words
Even if static embeddings are very good, they still treat:
- “open a bank account”
- “sit on the river bank”
as the same bank vector.
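A minimal sketch of that limitation, assuming a hypothetical two-dimensional embedding table with made-up values (`static_embeddings` and `static_lookup` are illustrative names, not a library API). The lookup never inspects the sentence, so both uses of bank come back identical:

```python
# Hypothetical 2-d static embedding table; the numbers are made up for illustration.
static_embeddings = {
    "open": [0.7, 0.2],
    "bank": [0.5, 0.5],
    "account": [0.9, 0.1],
    "sit": [0.3, 0.4],
    "river": [0.1, 0.9],
}


def static_lookup(sentence, word):
    """Return the stored vector for `word`; the sentence is deliberately ignored."""
    return static_embeddings[word]


finance_vec = static_lookup(["open", "a", "bank", "account"], "bank")
nature_vec = static_lookup(["sit", "on", "the", "river", "bank"], "bank")

print(finance_vec)                # [0.5, 0.5]
print(nature_vec)                 # [0.5, 0.5]
print(finance_vec == nature_vec)  # True
```

The `sentence` argument is kept in the signature only to make the point that a static table has no way to use it.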
An analogy
A fixed word vector is like giving every person one ID photo that never changes. A contextualized representation is more like a photo taken on the spot, reflecting the current scene:
- One version when working at a bank
- Another version when walking by a river
What exactly do contextualized representations do?
Core idea
It is not “one vector per word,” but:
- one vector for each word in the current sentence
In other words, the representation is determined not only by the word itself, but also by the surrounding context.
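The idea of “one vector per word in the current sentence” can be sketched as follows. This is a toy under stated assumptions: the base vectors are made up, and the blending rule (a weighted mix of a token's own vector and the mean of its neighbours, controlled by a hypothetical `alpha`) is chosen for simplicity. It is not how any real model computes its representations.

```python
# Made-up 2-d base vectors for illustration only.
base = {
    "bank": [0.5, 0.5],
    "money": [0.9, 0.1],
    "river": [0.1, 0.9],
}


def encode_sentence(tokens, alpha=0.4):
    """Return one vector per token: (1 - alpha) * own vector + alpha * mean of the other tokens."""
    vectors = [base[t] for t in tokens]
    encoded = []
    for i, own in enumerate(vectors):
        others = [v for j, v in enumerate(vectors) if j != i]
        if not others:
            # A one-word sentence has no context, so the base vector is returned unchanged.
            encoded.append(list(own))
            continue
        mean = [sum(dim) / len(others) for dim in zip(*others)]
        encoded.append([round((1 - alpha) * a + alpha * m, 3) for a, m in zip(own, mean)])
    return encoded


print(encode_sentence(["money", "bank"])[1])  # bank next to money: [0.66, 0.34]
print(encode_sentence(["river", "bank"])[1])  # bank next to river: [0.34, 0.66]
```

The function returns a list with one entry per position, so the unit being represented is the token occurrence in its sentence rather than the dictionary word.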
Why is this important?
Because the real difficulty in NLP has never been just “what is the word,” but:
- what does the word mean in this sentence
What does this change?
This moves representation learning from:
- static lookup
to:
- dynamic semantic encoding
This is also one of the key reasons modern pretrained models improve performance so much on many tasks.
Let’s first run an intuitive “same word, different vector” example
The example below does not implement a real model such as BERT; it only simulates the idea of “base word vector + contextual adjustment.”
```python
# Toy 2-d base vectors: "money" leans toward dimension 0, "river" toward dimension 1.
base_embeddings = {
    "bank": [0.5, 0.5],
    "money": [0.9, 0.1],
    "river": [0.1, 0.9],
}

# Hand-picked offsets that stand in for the effect of context.
context_shifts = {
    "finance": [0.3, -0.2],
    "nature": [-0.2, 0.3],
}


def contextualize(word, context_type):
    """Return the base vector of `word` shifted by the offset for `context_type`."""
    base = base_embeddings[word]
    shift = context_shifts[context_type]
    return [round(base[0] + shift[0], 3), round(base[1] + shift[1], 3)]


bank_in_finance = contextualize("bank", "finance")
bank_in_nature = contextualize("bank", "nature")

print("bank in finance:", bank_in_finance)
print("bank in nature :", bank_in_nature)
```

Expected output:

```
bank in finance: [0.8, 0.3]
bank in nature : [0.3, 0.8]
```

The same word starts from the same base vector, but the finance context pushes it toward a finance-like direction, while the nature context pushes it toward a river-like direction.
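One way to check the “pushed toward” claim numerically is cosine similarity. The sketch below reuses the toy vectors from the example above. `cosine` is a helper written here for illustration, and the scores only mean something inside this toy setup.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Toy vectors copied from the example above.
money = [0.9, 0.1]
river = [0.1, 0.9]
bank_static = [0.5, 0.5]
bank_in_finance = [0.8, 0.3]
bank_in_nature = [0.3, 0.8]

rows = [
    ("bank (static)", bank_static),
    ("bank in finance", bank_in_finance),
    ("bank in nature", bank_in_nature),
]
for name, vec in rows:
    print(f"{name:<16} money={cosine(vec, money):.3f} river={cosine(vec, river):.3f}")
```

With these values the static vector scores about 0.781 against both money and river, so it cannot choose a side. The finance version scores about 0.969 against money and 0.452 against river, and the nature version is the mirror image.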
Of course, this code is not a real contextual model
But it captures the most important intuition:
- The word itself has a base representation
- Context pushes that representation in different directions
Why does this intuition matter so much?
Because when you later learn BERT, GPT, and T5, you will keep seeing one fact:
- the final representation of a token depends on its context, not only on the token itself (the whole sentence for a bidirectional encoder such as BERT, the preceding tokens for a left-to-right model such as GPT)
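A rough sketch of how a token's representation can come to depend on the other tokens, assuming a stripped-down dot-product self-attention with no learned projections, no positional information, and made-up two-dimensional vectors. Real models learn query, key, and value matrices and stack many layers; this only illustrates the weighted-averaging step, and `toy_vectors` is a hypothetical table.

```python
import math

# Made-up 2-d vectors for illustration only.
toy_vectors = {
    "bank": [0.5, 0.5],
    "money": [0.9, 0.1],
    "account": [0.8, 0.2],
    "river": [0.1, 0.9],
    "water": [0.2, 0.8],
}


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(vectors):
    """Replace each vector by a weighted average of all vectors in the sentence.

    Weights are a softmax over scaled dot products between the token and every token.
    """
    dim = len(vectors[0])
    outputs = []
    for query in vectors:
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim) for key in vectors]
        weights = softmax(scores)
        outputs.append([sum(w * v[d] for w, v in zip(weights, vectors)) for d in range(dim)])
    return outputs


finance_sentence = ["money", "bank", "account"]
nature_sentence = ["river", "bank", "water"]

bank_fin = self_attention([toy_vectors[t] for t in finance_sentence])[1]
bank_nat = self_attention([toy_vectors[t] for t in nature_sentence])[1]

print([round(x, 3) for x in bank_fin])  # first component is the larger one
print([round(x, 3) for x in bank_nat])  # second component is the larger one
```

The input vector for bank is identical in both calls; only the neighbours differ, and that alone is enough to produce two different outputs.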
What should you remember first when learning this section?
The most important things to remember first are:
- Static embeddings are naturally weak on polysemous words
- Contextualized representations answer “what does this word mean in this sentence”
- This is a key step that makes modern pretrained models much stronger
Another minimal example showing how a context window affects representation
```python
# Each entry: (target word, surrounding context words, sense implied by that context).
sentences = [
    ("bank", ["open", "account", "money"], "finance"),
    ("bank", ["river", "water", "shore"], "nature"),
]


def explain_representation(word, context_words, sense):
    """Bundle a word with its context and the sense that context implies."""
    return {
        "word": word,
        "context": context_words,
        "sense": sense,
    }


for word, context_words, sense in sentences:
    print(explain_representation(word, context_words, sense))
```

Expected output:

```
{'word': 'bank', 'context': ['open', 'account', 'money'], 'sense': 'finance'}
{'word': 'bank', 'context': ['river', 'water', 'shore'], 'sense': 'nature'}
```

This output makes the hidden assumption visible: you cannot decide the meaning of bank from the word alone; you need the surrounding words.
This example is definitely not a neural network, but it helps beginners build a crucial awareness:
- A word’s representation must be viewed together with the surrounding context
Otherwise, it is hard to clearly say:
- which meaning the word has right now
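To make “the surrounding words decide the meaning” concrete, here is a small sketch that guesses the sense by counting overlap between the context and a set of cue words. The cue lists and the names `sense_cues` and `guess_sense` are assumptions made for this illustration, not part of the example above or of any library.

```python
# Hypothetical cue words for each sense of "bank".
sense_cues = {
    "finance": {"open", "account", "money", "loan"},
    "nature": {"river", "water", "shore", "fish"},
}


def guess_sense(context_words):
    """Pick the sense whose cue words overlap most with the context; None if nothing overlaps."""
    overlaps = {
        sense: len(cues & set(context_words))
        for sense, cues in sense_cues.items()
    }
    best = max(overlaps, key=overlaps.get)
    return best if overlaps[best] > 0 else None


print(guess_sense(["open", "account", "money"]))  # finance
print(guess_sense(["river", "water", "shore"]))   # nature
print(guess_sense([]))                            # None: no context, no decision
```

The last call shows the failure mode directly: with an empty context the function has nothing to go on, which is the situation a static vector is always in.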
What practical changes do contextualized representations bring?
Polysemy handling becomes more natural
The model can give the same word different representations in different sentences.
Sentence and paragraph understanding becomes stronger
Because a word's representation is no longer isolated, it already carries clues from its context.
Transfer learning works better
Many downstream tasks no longer need to learn complex representations from scratch, but can directly use contextualized hidden states.
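The pattern “freeze the encoder, train something small on top” can be sketched with a toy. Everything here is an assumption made for illustration: `encode_target` is a hand-written stand-in for a pretrained encoder, the vectors are made up, and the downstream model is a nearest-centroid classifier with one labelled example per class.

```python
# Toy stand-in for a frozen pretrained encoder; vectors are made up.
TOY_VECTORS = {
    "bank": [0.5, 0.5],
    "money": [0.9, 0.1],
    "loan": [0.8, 0.2],
    "river": [0.1, 0.9],
    "water": [0.2, 0.8],
}


def encode_target(tokens, target="bank"):
    """Pretend hidden state for `target`: its vector averaged with the mean of its neighbours."""
    own = TOY_VECTORS[target]
    neighbours = [TOY_VECTORS[t] for t in tokens if t != target and t in TOY_VECTORS]
    if not neighbours:
        return list(own)
    mean = [sum(dim) / len(neighbours) for dim in zip(*neighbours)]
    return [(a + m) / 2 for a, m in zip(own, mean)]


# Downstream "training": one labelled example per class, stored as a centroid.
labelled = [(["bank", "money"], "finance"), (["river", "bank"], "nature")]
centroids = {label: encode_target(tokens) for tokens, label in labelled}


def classify(tokens):
    """Assign the label of the nearest centroid (squared Euclidean distance)."""
    vec = encode_target(tokens)

    def sq_dist(label):
        return sum((a - b) ** 2 for a, b in zip(vec, centroids[label]))

    return min(centroids, key=sq_dist)


print(classify(["bank", "loan"]))   # finance
print(classify(["water", "bank"]))  # nature
```

The classifier never saw loan or water during its own training. It generalizes only because the encoder already places those contexts near the right centroid, which is the division of labour the paragraph above describes.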
Why does this step directly raise the ceiling for many tasks?
Because the real difficulty in many NLP tasks is not:
- what a word probably means
but:
- what role it actually plays in the current context
Once the representation layer starts making this distinction, many classification, extraction, and question-answering tasks tend to become more accurate and more stable.
If we put this into tasks, which scenarios should you think of first?
Contextualized representations are especially easy for beginners to appreciate in these scenarios:
- Word sense disambiguation (polysemy classification)
- Named entity recognition
- Question answering
- Machine translation
Because what is truly hard in these tasks is often not:
- what the word means in a dictionary
but:
- what role it plays in this sentence
What is the relationship between this and static word vectors?
Not a complete replacement, but an upgrade in capability
Static word vectors still have educational value and are useful in some lightweight tasks. But in mainstream modern NLP, contextualized representations usually perform better.
A simple summary
- Static embedding: fixed representation at the word level
- Contextualized representation: dynamic representation of a token in a sentence
The most common misconceptions
Misconception 1: Contextualized representations are just “bigger word vectors”
Not true. The key change is:
- the representation depends on context
Misconception 2: Different vectors for the same word are just a minor optimization
No. This step actually changes the performance ceiling of many tasks.
Misconception 3: Once you have contextualized representations, you no longer need higher-level modeling
Contextualized representations are powerful, but they still need to be used within specific tasks and specific models.
If you turn this into study notes or a project, what is most worth showing?
What is usually most worth showing is not:
- “BERT is stronger”
but:
- A comparison of the same word across different sentences
- The difference between static embeddings and contextualized representations
- Which tasks depend especially on this capability
- Why this step became a watershed moment in modern NLP
That way, others can immediately see that you understand why it becomes stronger, not just the model names.
Evidence to Keep
Keep this page’s proof of learning as a small evidence card:
- Representation: BoW, TF-IDF, static embedding, contextual embedding, or language-model score
- Comparison: nearest text, similarity score, or next-token/log-prob style output
- Interpretation: what the representation captures and what it misses
- Failure Check: polysemy, domain mismatch, short text, tokenization, or semantic drift
- Expected Output: small comparison table with at least one surprising result
Summary
The most important takeaway from this section is one distinction:
Fixed word vectors can only answer “what a word generally looks like,” while contextualized representations begin to answer “what this word means in this sentence.”
This step is the threshold that modern NLP crossed to enter the pretraining era.
What you should take away from this section
- Contextualized representations are not “bigger word vectors,” but “representations that change with the sentence”
- They are the key turning point from traditional NLP to modern pretrained models
- When you later study BERT, GPT, and T5, you should keep this line of thinking in mind
Exercises
- Add another word, apple, to the example and simulate how its representation changes in the “fruit” and “company” contexts.
- Explain in your own words: why do fixed word vectors struggle with polysemous words?
- Why do contextualized representations make many downstream tasks easier?
- Think about this: if the representation already depends on context, is the “word itself” still important? Why?
Reference implementation and walkthrough
- For apple, a fruit context should move it toward food words, while a company context should move it toward technology or product words.
- Fixed vectors struggle with polysemy because the word has one stored representation no matter which sentence it appears in.
- Contextualized representations help downstream tasks because the model can distinguish meaning from surrounding words before classification, extraction, or retrieval.
- The word itself still matters as the anchor token, but the final representation should combine token identity with context.
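One possible solution to the first exercise, following the same “base vector + context shift” pattern as the earlier bank example. The axis meanings (dimension 0 as food-like, dimension 1 as technology-like) and all numbers are assumptions chosen for illustration.

```python
# Hypothetical axes: dimension 0 = "food-like", dimension 1 = "technology-like".
base_embeddings = {"apple": [0.5, 0.5]}

# Hand-picked offsets standing in for the two contexts.
context_shifts = {
    "fruit": [0.3, -0.2],
    "company": [-0.2, 0.3],
}


def contextualize(word, context_type):
    """Return the base vector of `word` shifted by the offset for `context_type`."""
    base = base_embeddings[word]
    shift = context_shifts[context_type]
    return [round(b + s, 3) for b, s in zip(base, shift)]


apple_as_fruit = contextualize("apple", "fruit")
apple_as_company = contextualize("apple", "company")

print("apple as fruit  :", apple_as_fruit)    # [0.8, 0.3]
print("apple as company:", apple_as_company)  # [0.3, 0.8]
```

The base vector for apple stays the same in both calls, which matches the last answer above: token identity is the anchor, and context decides the direction of the shift.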