
11.6.3 BERT Series

BERT Masked Language Model

  • Understand why BERT became a milestone in NLP
  • Clearly explain the core difference between BERT and autoregressive models like GPT
  • Master key concepts such as [CLS], [SEP], [MASK], and bidirectional context
  • Read a minimal BERT input example
  • Understand common fine-tuning methods for BERT

Background: Which paper did BERT come from?

The most important historical milestone in this section is:

| Year | Paper | Key Authors | What it solved most importantly |
| --- | --- | --- | --- |
| 2018 | BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding | Devlin et al. | Made bidirectional Transformer pretraining + fine-tuning the main path for modern NLP understanding tasks |

For beginners, the most important thing to remember first is:

  • BERT is not “just another model name”
  • It represents a very important paradigm shift:

First do general pretraining on massive text, then fine-tune the same base model for different tasks.

That is also why the “pretrain first, then adapt” idea behind today’s large models can be seen so clearly in BERT, an early prototype of that approach.


First, the old problem: word meaning depends on context

Words do not always have a fixed meaning.

For example, the English word bank:

  • “river bank” means the riverbank
  • “bank account” means a financial bank

Another example shows the same effect:

  • “I ate an apple” — apple means the fruit
  • “Apple released a new device” — Apple means the company

If a model can only give each word a fixed vector, it will struggle.

One of BERT’s core contributions is:

Making a word’s representation truly depend on context.

In other words, the same word can get different representations in different sentences.

This is called a “contextual representation.”

You can think of BERT as:

  • a “careful reader” that looks both before and after when reading a sentence

Early static word vectors hand each word a single fixed business card. BERT works differently:

  • when the same word appears in different sentences, BERT re-understands the role it is playing now

That is why BERT is especially suitable for understanding tasks.
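The contrast between a static vector and a contextual representation can be sketched with a toy example. The 2-dimensional vectors below are invented for illustration, and the "contextual" vector is just an average over neighbouring words, which is an assumption made to keep the sketch small. Real BERT uses stacked self-attention layers instead.

```python
# Toy 2-dimensional "static" word vectors; the numbers are invented for illustration.
STATIC = {
    "river": [1.0, 0.0],
    "bank": [0.5, 0.5],
    "account": [0.0, 1.0],
}

def static_vector(tokens, index):
    """A static embedding ignores the sentence: same word, same vector."""
    return STATIC[tokens[index]]

def contextual_vector(tokens, index):
    """Toy contextual vector: average the word's vector with its neighbours'.

    This is a stand-in for illustration only, not how BERT computes it.
    """
    window = tokens[max(0, index - 1): index + 2]
    vectors = [STATIC[t] for t in window]
    return [sum(dim) / len(vectors) for dim in zip(*vectors)]

sent_a = ["river", "bank"]
sent_b = ["bank", "account"]

# Static: "bank" gets the same vector in both sentences.
print(static_vector(sent_a, 1), static_vector(sent_b, 0))
# Contextual: "bank" is pulled toward "river" in one case and "account" in the other.
print(contextual_vector(sent_a, 1), contextual_vector(sent_b, 0))
```

The point of the sketch is only that the second pair of vectors differs while the first pair does not.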


Why is BERT called a “bidirectional” model?

Consider the sentence:

“I took a walk next to the bank yesterday”

When understanding “bank,” people do not only look at the left context “I took a walk next to the,” but also the right context “yesterday.”

BERT’s important feature is:

The representation of the current token uses both left and right context at the same time.

Roughly speaking:

  • BERT: more focused on understanding, reads context bidirectionally
  • GPT: more focused on generation, looks only at left history

So:

  • For classification, extraction, and matching tasks, BERT is very strong
  • For continuation, dialogue, and generation, the GPT route is more natural
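One way to picture the difference in how the two model families look at context is as a visibility matrix, where row i marks which positions token i is allowed to see. The helper names below are made up for this sketch and are not part of any library.

```python
def bidirectional_mask(n):
    """Every position may attend to every position (BERT-style encoder)."""
    return [[1] * n for _ in range(n)]

def causal_mask(n):
    """Position i may attend only to positions 0..i (GPT-style decoder)."""
    return [[1 if j <= i else 0 for j in range(n)] for i in range(n)]

tokens = ["I", "love", "language", "processing"]
for name, mask in [("bidirectional", bidirectional_mask(len(tokens))),
                   ("causal", causal_mask(len(tokens)))]:
    print(name)
    for token, row in zip(tokens, mask):
        print(f"  {token:<10} {row}")
```

In the causal matrix the upper-right triangle is zero: a token never sees what comes after it, which is what makes left-to-right generation possible.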

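BERT input relies on three special tokens:

| Token | Role |
| --- | --- |
| [CLS] | Aggregation position for sentence-level tasks |
| [SEP] | Sentence separator |
| [MASK] | Position hidden during pretraining |

A minimal single-sentence input looks like this:
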
tokens = ["[CLS]", "I", "love", "natural", "language", "processing", "[SEP]"]
print(tokens)
print("sequence length:", len(tokens))

Expected output:

['[CLS]', 'I', 'love', 'natural', 'language', 'processing', '[SEP]']
sequence length: 7

The two boundary tokens are part of the actual sequence length. This matters because model cost and attention masks are calculated over the full token sequence, not only the visible words.

For sentence-pair tasks, such as question matching:

tokens = [
    "[CLS]", "How", "is", "the", "weather", "today", "[SEP]",
    "Will", "it", "rain", "in", "Beijing", "today", "[SEP]",
]
print(tokens)

Expected output:

['[CLS]', 'How', 'is', 'the', 'weather', 'today', '[SEP]', 'Will', 'it', 'rain', 'in', 'Beijing', 'today', '[SEP]']

Notice that sentence pairs use two [SEP] markers: one to close sentence A and one to close sentence B. In a real BERT input, segment IDs then help the model distinguish the two sides.
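The segment IDs mentioned above can be sketched in a few lines. The function name `build_pair_input` is hypothetical; the 0/1 layout follows the usual BERT convention, where [CLS], sentence A, and the first [SEP] belong to segment 0 and the rest to segment 1.

```python
def build_pair_input(sentence_a, sentence_b):
    """Join two tokenized sentences and return (tokens, segment_ids)."""
    tokens = ["[CLS]"] + sentence_a + ["[SEP]"] + sentence_b + ["[SEP]"]
    # Segment 0 covers [CLS], sentence A, and the first [SEP];
    # segment 1 covers sentence B and the final [SEP].
    segment_ids = [0] * (len(sentence_a) + 2) + [1] * (len(sentence_b) + 1)
    return tokens, segment_ids

tokens, segment_ids = build_pair_input(
    ["How", "is", "the", "weather", "today"],
    ["Will", "it", "rain", "in", "Beijing", "today"],
)
for token, segment in zip(tokens, segment_ids):
    print(f"{segment}  {token}")
```

Both lists always have the same length, so each token has exactly one segment ID.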

| Component | Most important thing to remember |
| --- | --- |
| [CLS] | Aggregation position for sentence-level tasks |
| [SEP] | Sentence boundary separator |
| [MASK] | Position to be recovered during pretraining |

This table is especially helpful for beginners because it turns BERT input from a “mysterious token string” into a few understandable parts.


The classic task: Masked Language Modeling

BERT’s most classic training objective is MLM, which means:

Hide some tokens in a sentence and let the model guess them back from context.

For example:

“I love [MASK] language processing”

The model must infer what [MASK] is from the surrounding context.

tokens = ["[CLS]", "I", "love", "[MASK]", "language", "processing", "[SEP]"]
mask_index = tokens.index("[MASK]")
candidates = ["natural", "machine", "deep"]
print("tokens =", tokens)
print("mask index =", mask_index)
print("candidate fill-ins =", candidates)

Expected output:

tokens = ['[CLS]', 'I', 'love', '[MASK]', 'language', 'processing', '[SEP]']
mask index = 3
candidate fill-ins = ['natural', 'machine', 'deep']

The index tells you exactly where the model must predict. In a real MLM model, the candidate with the highest score would be selected from the whole vocabulary, not from this tiny hand-written list.

This example is not actually training a model, but it already teaches you:

  • the [MASK] position is explicit
  • the model’s job is to recover the hidden information
  • predictions at the current position depend on bidirectional context
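To make "selected from the whole vocabulary" a little more concrete, here is a minimal sketch that ranks every entry of a tiny vocabulary by score and fills the [MASK] position with the best one. The scores and the six-word vocabulary are invented for illustration; a real BERT vocabulary has tens of thousands of entries and the scores come from the model.

```python
# Hypothetical raw scores for the [MASK] position over a tiny vocabulary.
scores = {
    "natural": 3.2, "machine": 1.1, "deep": 0.4,
    "the": -0.5, "banana": -2.0, "[SEP]": -4.0,
}

def top_k(scores, k):
    """Return the k highest-scoring (token, score) pairs, best first."""
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)[:k]

tokens = ["[CLS]", "I", "love", "[MASK]", "language", "processing", "[SEP]"]
mask_index = tokens.index("[MASK]")

# Pick the single best token and put it back at the masked position.
best_token, _ = top_k(scores, 1)[0]
filled = tokens[:mask_index] + [best_token] + tokens[mask_index + 1:]

print(top_k(scores, 3))
print(filled)
```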

MLM works well as a training objective because it forces the model to truly understand:

  • what is said on the left
  • what is said on the right
  • what should go in the hidden position

This makes BERT very good at understanding tasks.
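How a training example with hidden tokens might be produced can be sketched as follows. The 15% selection rate and the 80/10/10 split (replace with [MASK], replace with a random token, keep unchanged) follow the recipe described in the original BERT paper. The function name, the tiny vocabulary, and the use of None for "no prediction here" are assumptions made for this sketch.

```python
import random

SPECIAL = {"[CLS]", "[SEP]"}

def mask_tokens(tokens, vocab, mask_prob=0.15, seed=0):
    """Return (corrupted_tokens, labels) for one MLM training example.

    labels[i] holds the original token at positions the model must predict
    and None everywhere else.
    """
    rng = random.Random(seed)
    corrupted, labels = list(tokens), [None] * len(tokens)
    for i, token in enumerate(tokens):
        # Never select special tokens; select other tokens with mask_prob.
        if token in SPECIAL or rng.random() >= mask_prob:
            continue
        labels[i] = token
        roll = rng.random()
        if roll < 0.8:
            corrupted[i] = "[MASK]"           # 80%: hide the token
        elif roll < 0.9:
            corrupted[i] = rng.choice(vocab)  # 10%: swap in a random token
        # remaining 10%: leave the token unchanged
    return corrupted, labels

tokens = ["[CLS]", "I", "love", "natural", "language", "processing", "[SEP]"]
# mask_prob is raised well above 15% here so that a short sentence shows an effect.
corrupted, labels = mask_tokens(tokens, vocab=["cat", "dog", "runs"], mask_prob=0.5)
print(corrupted)
print(labels)
```

The loss is computed only at positions where the label is set, so the model is graded on recovering the selected tokens from their left and right context.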

The safest default learning order for BERT

A more stable order is usually:

  1. First understand what problem bidirectional context solves
  2. Then look at the most common tokens: [CLS] / [SEP] / [MASK]
  3. Then see what MLM asks the model to learn during training
  4. Finally look at how fine-tuning attaches a classification head

This is easier than jumping straight into paper details and large-model parameters.


Each token is first turned into a vector.

The model also needs to know token order, so a position embedding is added for each position.

For sentence-pair tasks, the model also needs to know “which tokens belong to sentence A” and “which tokens belong to sentence B”.

You can think of BERT input as the sum of three parts:

final input representation = token embedding + position embedding + segment embedding

This step matters because self-attention by itself has no notion of token order: without position information, the Transformer would treat the input as an unordered set of tokens.
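The three-part sum can be sketched with tiny hand-written tables. Every number below is invented, and the 2-dimensional size is chosen only so the arithmetic is easy to follow. Real BERT learns these tables during pretraining and also applies layer normalization and dropout after the sum.

```python
# Tiny hand-written embedding tables with 2 dimensions each (illustrative values).
TOKEN_EMB = {
    "[CLS]": [0.1, 0.0], "hello": [0.5, 0.2],
    "[SEP]": [0.0, 0.1], "world": [0.3, 0.7],
}
POSITION_EMB = [[0.00, 0.01], [0.01, 0.02], [0.02, 0.03], [0.03, 0.04], [0.04, 0.05]]
SEGMENT_EMB = [[0.0, 0.0], [1.0, 1.0]]

def embed(tokens, segment_ids):
    """Sum token, position, and segment embeddings for each position."""
    out = []
    for position, (token, segment) in enumerate(zip(tokens, segment_ids)):
        parts = (TOKEN_EMB[token], POSITION_EMB[position], SEGMENT_EMB[segment])
        # Add the three vectors dimension by dimension; round for readable output.
        out.append([round(sum(dim), 2) for dim in zip(*parts)])
    return out

tokens = ["[CLS]", "hello", "[SEP]", "world", "[SEP]"]
segment_ids = [0, 0, 0, 1, 1]
for token, vector in zip(tokens, embed(tokens, segment_ids)):
    print(f"{token:<6} {vector}")
```

The two [SEP] tokens share one token embedding but end up with different input vectors, because their positions and segments differ.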


The example below does not require downloading pretrained weights. You only need to install transformers and torch, and you can initialize a small random BERT locally. It is mainly here to help you understand input/output shapes.

import torch
from transformers import BertConfig, BertModel

config = BertConfig(
    vocab_size=100,
    hidden_size=32,
    num_hidden_layers=2,
    num_attention_heads=4,
    intermediate_size=64,
)
model = BertModel(config)

input_ids = torch.tensor([
    [1, 5, 8, 9, 2, 0, 0],  # a shorter sample, padded with 0s at the end
    [1, 7, 6, 3, 4, 2, 0],
])
attention_mask = torch.tensor([
    [1, 1, 1, 1, 1, 0, 0],
    [1, 1, 1, 1, 1, 1, 0],
])

outputs = model(input_ids=input_ids, attention_mask=attention_mask)
print("last_hidden_state shape:", outputs.last_hidden_state.shape)
print("pooler_output shape :", outputs.pooler_output.shape)

Expected output:

last_hidden_state shape: torch.Size([2, 7, 32])
pooler_output shape : torch.Size([2, 32])

The model is randomly initialized, so the numeric values are not meaningful predictions. The useful part is the shape: 2 samples, 7 positions per sample, and 32 hidden dimensions per token.

  • last_hidden_state

    • shape: [batch, seq_len, hidden_size]
    • each token has a contextual representation
  • pooler_output

    • shape: [batch, hidden_size]
    • the [CLS] hidden state passed through an extra dense layer with tanh activation; it can be understood as one kind of whole-sentence summary representation

This also explains why BERT is suitable for:

  • token-level tasks: use last_hidden_state
  • sentence-level tasks: use [CLS] or sentence-level representations
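In the offline example, input_ids and attention_mask were padded by hand. A minimal sketch of how the two could be built from sequences of different lengths is shown below. The helper name `pad_batch` is hypothetical, and the padding id of 0 is an assumption that matches the zeros used in that example.

```python
PAD_ID = 0  # assumed padding id

def pad_batch(sequences, pad_id=PAD_ID):
    """Pad id lists to equal length and build the matching attention mask."""
    max_len = max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]
    # 1 marks a real token, 0 marks padding the model should ignore.
    attention_mask = [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in sequences]
    return input_ids, attention_mask

input_ids, attention_mask = pad_batch([[1, 5, 8, 9, 2], [1, 7, 6, 3, 4, 2]])
print(input_ids)
print(attention_mask)
```

In practice a tokenizer usually does this for you; the sketch only shows what the two tensors mean.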

The most common way to fine-tune BERT for classification is:

  1. Input a sentence
  2. Pass it through BERT
  3. Take [CLS] or a sentence representation
  4. Attach a linear classification head

This is the classic fine-tuning approach.

import torch
from torch import nn
# Assume this is the [CLS] representation output by BERT
cls_embedding = torch.randn(4, 32) # batch=4, hidden=32
# Attach a classification head
classifier = nn.Linear(32, 2)
logits = classifier(cls_embedding)
print("logits shape:", logits.shape)

Expected output:

logits shape: torch.Size([4, 2])

This means the batch has 4 samples and the classifier returns 2 raw scores for each sample. In practice, these logits are passed to a cross-entropy loss during training, or to softmax when probabilities are needed at inference time.
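What softmax and cross-entropy do with a pair of logits can be worked through in plain Python. The logits below are hypothetical values for one sample with two classes. In a real training loop you would normally rely on the framework's built-in loss rather than hand-written functions like these.

```python
import math

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, label):
    """Negative log-probability assigned to the correct class."""
    return -math.log(softmax(logits)[label])

logits = [2.0, 0.0]  # hypothetical scores for classes 0 and 1

print("probabilities:", [round(p, 3) for p in softmax(logits)])
print("loss if label is 0:", round(cross_entropy(logits, 0), 3))
print("loss if label is 1:", round(cross_entropy(logits, 1), 3))
```

The loss is small when the higher logit belongs to the correct class and large when it does not, which is the signal fine-tuning uses to update both the classification head and BERT itself.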

This code is very simple, but it teaches you something very important:

BERT is often not the end of a task, but a powerful representation layer.

What is most worth showing when BERT is used in a project

What is usually most worth showing is not:

  • “I used BERT”

but rather:

  1. What the input text looks like
  2. How the [CLS] representation connects to the classification head
  3. What it does better than traditional representations or lighter models
  4. Which failure cases it still gets wrong

That way, other people can more easily see:

  • you understand BERT’s role in the task pipeline
  • you did more than just change the model name

BERT is a strong fit for tasks such as:

  • Text classification
  • Sentence-pair matching
  • Named entity recognition
  • Extractive question answering

BERT itself was not designed for free-form long-text generation. If the task focus is:

  • long conversation generation
  • continuation
  • large-scale text creation

then the GPT route is usually more natural.


Why is BERT no longer the only main character?

The reason is not that BERT is useless, but that the ecosystem kept moving forward

Later NLP and LLM development brought:

  • larger-scale pretraining
  • stronger generative models
  • more unified task interfaces

So today, many applications more often discuss GPT, T5, and Llama-style routes.

BERT is still worth learning because it helps you truly understand:

  • contextual representations
  • encoder-only models
  • the pretraining + fine-tuning paradigm
  • the difference between token-level and sentence-level tasks

These are all important foundations for learning large models later.


Record what you learned on this page as a small evidence card:

  • Model Choice: BERT, GPT, T5, Transformers pipeline, or other pretrained baseline
  • Tokenizer Output: ids, masks, decoded text, or batch shape
  • Task Result: classification, generation, extraction, or text-to-text output
  • Failure Check: wrong model family, token limit, domain mismatch, cost, or latency
  • Expected Output: model call result plus a short choice rationale

Mixing up BERT and GPT as if they were the same thing

They are both important, but their training objectives and strengths are not the same.

Thinking [CLS] is “naturally the best sentence vector”

It is useful in many tasks, but it is not a universal best choice.
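One commonly used alternative to the [CLS] vector is to average the hidden states of all real tokens, skipping padding. The sketch below compares the two on invented 2-dimensional hidden states. Which option works better depends on the task and on how the model was trained, so treat this as an illustration and not as a recommendation.

```python
def cls_pooling(hidden_states):
    """Use the vector at position 0, where [CLS] sits."""
    return hidden_states[0]

def mean_pooling(hidden_states, attention_mask):
    """Average the vectors of real tokens, skipping padded positions."""
    kept = [vec for vec, keep in zip(hidden_states, attention_mask) if keep]
    return [sum(dim) / len(kept) for dim in zip(*kept)]

# Invented hidden states for one sequence of 4 positions; the last one is padding.
hidden_states = [[1.0, 0.0], [0.0, 2.0], [2.0, 4.0], [9.0, 9.0]]
attention_mask = [1, 1, 1, 0]

print(cls_pooling(hidden_states))
print(mean_pooling(hidden_states, attention_mask))
```

The attention mask matters here: without it, the padded position would be averaged in and distort the sentence vector.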

Only knowing “use BERT for classification” without knowing what it actually learns

What you really need to master is:

  • why it is bidirectional
  • why MLM works
  • why it is better suited to understanding tasks

The most important thing in this section is not memorizing BERT’s full name, but grasping these three points:

  1. BERT is a representative model of bidirectional context modeling
  2. It learns to “understand tokens from context” through MLM
  3. It is very suitable for understanding tasks and the fine-tuning paradigm

Once you understand these three points, many differences will become naturally clear when you later learn GPT, T5, and LLMs.


  1. Write your own sentence containing [MASK], and list the candidate word(s) you think are most reasonable.
  2. Change hidden_size in the offline BERT example to 64, then see how the output shape changes.
  3. Think about this: why can a training objective like “I love [MASK] language processing” help the model learn bidirectional understanding?
  4. Explain in your own words: what is the core difference between BERT and GPT in the way they “look at context”?
Reference implementation and walkthrough
  1. A valid [MASK] example should have enough left and right context for multiple plausible candidates, such as “I love [MASK] language processing.”
  2. If hidden_size becomes 64, last_hidden_state has shape [2, 7, 64] and pooler_output has shape [2, 64]; only the last dimension changes.
  3. Masked language modeling helps bidirectional understanding because the model must use both left and right context to infer the missing token.
  4. BERT reads context bidirectionally for understanding tasks; GPT reads causally from left to right for next-token generation.