11.4.3 HMM, CRF, and the Historical Thread of Sequence Labeling

Figure: history map of HMM, CRF, and sequence labeling

Where is the real difficulty in sequence labeling?

Sequence labeling does not assign one label to the whole sentence. Instead, it assigns a label to each position.

For example, named entity recognition:

Jobs   founded   Apple
B-PER  O         B-ORG

The difficulty is that the label at each position is not completely independent.

For example:

I-PER should not open a sentence, because there is no preceding B-PER for it to continue
B-ORG may be followed by I-ORG
Chinese word segmentation, part-of-speech tagging, and NER all depend on context

So this historical line has always been trying to solve the same problem:

How can we use the current token and its surrounding context while still keeping the whole label sequence reasonable?

HMM: the classic starting point of early statistical sequence modeling

You can think of HMM as a model that “generates observed words from hidden states.”

In part-of-speech tagging:

Hidden states: part-of-speech tags, such as noun, verb, adjective
Observations: the actual words that appear

It asks two questions:

Question	Name in HMM
Which part of speech is more likely to follow another part of speech?	Transition probability
Is a certain word more likely to be generated by a certain part of speech?	Emission probability

The classic decoding method is Viterbi, a dynamic-programming algorithm: instead of choosing the highest-probability tag at each position separately, it finds the most likely tag path for the whole sentence.
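
To make the two probability tables and the path search concrete, here is a minimal Viterbi sketch for a two-tag HMM. All probabilities are toy values invented for illustration, not estimated from any corpus, and the function assumes every input word appears in the emission table.

```python
import math

# Toy parameters, invented for illustration only.
states = ["NOUN", "VERB"]
start_p = {"NOUN": 0.7, "VERB": 0.3}
trans_p = {  # transition probability: P(next tag | current tag)
    "NOUN": {"NOUN": 0.3, "VERB": 0.7},
    "VERB": {"NOUN": 0.8, "VERB": 0.2},
}
emit_p = {  # emission probability: P(word | tag)
    "NOUN": {"dogs": 0.5, "bark": 0.1, "cats": 0.4},
    "VERB": {"dogs": 0.1, "bark": 0.8, "cats": 0.1},
}


def viterbi(words):
    """Return the most likely tag path and its log-probability."""
    # best[t][s]: best log-probability of any path that ends in state s at position t
    best = [{s: math.log(start_p[s]) + math.log(emit_p[s][words[0]]) for s in states}]
    back = [{}]
    for t in range(1, len(words)):
        best.append({})
        back.append({})
        for s in states:
            # Pick the best previous state for reaching s at position t.
            prev, score = max(
                ((p, best[t - 1][p] + math.log(trans_p[p][s])) for p in states),
                key=lambda item: item[1],
            )
            best[t][s] = score + math.log(emit_p[s][words[t]])
            back[t][s] = prev
    # Backtrack from the best final state.
    last = max(states, key=lambda s: best[-1][s])
    path = [last]
    for t in range(len(words) - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return path, best[-1][last]


path, log_prob = viterbi(["dogs", "bark"])
print(path, math.exp(log_prob))
```

For the input "dogs bark", the best path is NOUN then VERB, with probability 0.7 × 0.5 × 0.7 × 0.8 = 0.196. Working in log space avoids numerical underflow on longer sentences.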

CRF: scoring the “entire label path” more directly

HMM is classic, but it makes strong generative assumptions: each word depends only on its own tag, and each tag depends only on the previous tag. CRF is discriminative and is more like answering directly:

Given this sentence, which entire sequence of labels is the most reasonable?

This is very important for NER because there are constraints between labels.

For example:

B-PER -> I-PER  reasonable
O     -> I-PER  usually unreasonable

The value of CRF is that it does not only ask whether “this token looks like an entity,” but also whether “the whole label chain is valid and smooth.”
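
The idea of scoring a whole path can be sketched as follows. In a linear-chain CRF, a path's score is the sum of per-token scores plus the transition scores between adjacent labels, and its probability is that score exponentiated and normalized over all possible paths. The numbers below are hypothetical and hand-picked. The brute-force normalizer is only practical for tiny examples; real implementations typically compute it with the forward algorithm.

```python
import itertools
import math

labels = ["B-PER", "I-PER", "O"]

# Hypothetical per-token scores for the tokens "Steve", "Jobs", "spoke".
emissions = [
    {"B-PER": 2.0, "I-PER": 0.5, "O": 0.1},
    {"B-PER": 0.3, "I-PER": 1.5, "O": 1.0},
    {"B-PER": 0.1, "I-PER": 0.2, "O": 2.0},
]

# Hypothetical transition scores: neutral by default,
# a bonus for B-PER -> I-PER and a heavy penalty for O -> I-PER.
transitions = {(a, b): 0.0 for a in labels for b in labels}
transitions[("B-PER", "I-PER")] = 1.0
transitions[("O", "I-PER")] = -5.0


def path_score(path):
    """Sum of token scores plus transition scores along the path."""
    score = sum(emissions[i][y] for i, y in enumerate(path))
    score += sum(transitions[(a, b)] for a, b in zip(path, path[1:]))
    return score


def path_probability(path):
    """exp(score) normalized over every possible label path (brute force)."""
    all_paths = itertools.product(labels, repeat=len(emissions))
    z = sum(math.exp(path_score(p)) for p in all_paths)
    return math.exp(path_score(path)) / z


good = ("B-PER", "I-PER", "O")
bad = ("O", "I-PER", "O")
print(path_score(good), path_score(bad))
print(path_probability(good), path_probability(bad))
```

The second token scores I-PER highest in both paths, but the O -> I-PER penalty pulls the second path's total far below the first. That is the sense in which the model judges the chain rather than each token.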

BiLSTM-CRF: contextual representations + label constraints

Later, when deep learning entered NLP, BiLSTM was responsible for reading context, and CRF was responsible for choosing the overall label path.

You can understand it as a division of labor:

Module	Responsibility
Embedding	Turn words into vectors
BiLSTM	Look at both left and right context at the same time
CRF	Choose the most reasonable label sequence

This is why many early neural NER systems used BiLSTM-CRF.

After BERT, is HMM/CRF still worth learning?

Yes. The reason is not that you must manually implement HMM in a project, but that:

HMM helps you understand “sequence states” and “path decoding”
CRF helps you understand “constraints between labels”
BiLSTM-CRF helps you understand “contextual representations + structured output”
BERT token classification helps you understand that “stronger representations can replace part of feature engineering”

In modern projects, BERT can often produce very strong token representations directly. But when the data is small, label rules are strict, or boundaries are easy to get wrong, the CRF idea is still valuable.
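
One way to carry the CRF idea over to a token classifier is to decode with a transition constraint instead of taking the top label per token. The sketch below is an illustration under stated assumptions: the `scores` list stands in for per-token label scores from some encoder (the values are invented), and `is_allowed`, `greedy_decode`, and `constrained_decode` are hypothetical helper names, not part of any library.

```python
labels = ["B-PER", "I-PER", "O"]
NEG_INF = float("-inf")

# Invented per-token scores for "met", "Jobs", "today".
scores = [
    {"B-PER": 0.1, "I-PER": 0.2, "O": 3.0},
    {"B-PER": 1.8, "I-PER": 2.0, "O": 0.5},
    {"B-PER": 0.1, "I-PER": 0.3, "O": 2.5},
]


def is_allowed(prev, cur):
    # With a single entity type, the only illegal BIO step is O -> I-PER.
    return not (prev == "O" and cur == "I-PER")


def greedy_decode(scores):
    """Pick the best label at each position independently."""
    return [max(labels, key=lambda y: row[y]) for row in scores]


def constrained_decode(scores):
    """Viterbi-style search that never takes an illegal transition."""
    # I-PER cannot open the sentence, so it starts at minus infinity.
    best = {y: (NEG_INF if y == "I-PER" else scores[0][y]) for y in labels}
    backpointers = []
    for row in scores[1:]:
        new_best, pointers = {}, {}
        for cur in labels:
            candidates = [(best[prev], prev) for prev in labels if is_allowed(prev, cur)]
            prev_score, prev_label = max(candidates)
            new_best[cur] = prev_score + row[cur]
            pointers[cur] = prev_label
        best = new_best
        backpointers.append(pointers)
    # Backtrack from the best final label.
    last = max(labels, key=lambda y: best[y])
    path = [last]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    return path[::-1]


print(greedy_decode(scores))       # ['O', 'I-PER', 'O']
print(constrained_decode(scores))  # ['O', 'B-PER', 'O']
```

The greedy path contains the illegal step O -> I-PER. The constrained search instead chooses B-PER for the second token, which keeps the entity and makes the path legal. A trained CRF layer learns its transition scores rather than hard-coding them, so this is a simplification.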

Mapping historical milestones to course chapters

Historical milestone	Problem it solved	Corresponding course chapter
HMM part-of-speech tagging	Model label sequences with hidden states and transition probabilities	4.5 This section, 4.2 Sequence labeling tasks
Viterbi decoding	Find the most likely label path for the whole sentence	4.5 This section, 4.3 BiLSTM + CRF
CRF	Model the entire label path directly given the input	4.3 BiLSTM + CRF
BiLSTM-CRF	Combine contextual representations with label constraints	4.3 BiLSTM + CRF, 4.4 NER practice
BERT token classification	Use pretrained contextual representations for token-level tasks	6.3 BERT, Chapter 7 LLM foundations

A minimal intuitive example

The following is not a complete HMM. It is only meant to help you understand the feeling of “transition constraints”:

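labels = ["B-PER", "I-PER", "O"]

# For each label, the labels that may legally come next.
# Under standard BIO, a new entity may start right after another one,
# so B-PER is also allowed after B-PER and I-PER.
allowed = {
    "B-PER": ["B-PER", "I-PER", "O"],
    "I-PER": ["B-PER", "I-PER", "O"],
    "O": ["B-PER", "O"],
}

path = ["O", "I-PER"]

# Check every adjacent pair of labels, not only the first one.
is_valid = all(nxt in allowed[cur] for cur, nxt in zip(path, path[1:]))

if not is_valid:
    print("This label path is unreasonable")
else:
    print("This label path is acceptable")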

Expected output:

This label path is unreasonable

I-PER should continue an existing person entity. Starting it directly after O breaks the label grammar, which is exactly the kind of constraint HMM/CRF-style thinking makes visible.

The point of this code is that sequence labeling does not judge each token independently; labels also have their own “grammar.”

The intuition you should have after finishing this section

The history of sequence labeling did not start with BERT. It roughly went through:

HMM / Viterbi -> CRF -> BiLSTM-CRF -> BERT token classification

Each generation was answering the same question:

How can we make each position’s label fit the context while also making the entire label sequence reasonable?

Evidence to Keep

Keep this page’s proof of learning as a small evidence card:

Schema: entity types, BIO tags, or sequence-label rules
Prediction: token-level labels and extracted spans
Metric: entity precision/recall/F1 and boundary cases
Failure Check: span boundary, nested entity, unknown word, or inconsistent annotation
Expected Output: gold-vs-predicted span table with at least one miss
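
The card's Prediction, Metric, and Expected Output rows can be produced with a short script like the one below. It is a minimal sketch: `extract_spans` and `span_prf` are hypothetical helper names, the example sentence and predictions are invented, and the metric is exact-match at the span level (boundaries and type must both agree). Stray I- tags with no matching B- are ignored, which is one of several possible conventions.

```python
def extract_spans(tags):
    """Turn BIO tags into a set of (start, end_exclusive, type) spans."""
    spans = set()
    start, kind = None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # sentinel closes a trailing span
        continues = tag.startswith("I-") and kind == tag[2:]
        if start is not None and not continues:
            spans.add((start, i, kind))
            start, kind = None, None
        if tag.startswith("B-"):
            start, kind = i, tag[2:]
    return spans


def span_prf(gold_tags, pred_tags):
    """Exact-match entity precision, recall, and F1."""
    gold, pred = extract_spans(gold_tags), extract_spans(pred_tags)
    tp = len(gold & pred)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if tp else 0.0
    return precision, recall, f1


tokens = ["Steve", "Jobs", "founded", "Apple"]
gold = ["B-PER", "I-PER", "O", "B-ORG"]
pred = ["B-PER", "O", "O", "B-ORG"]  # boundary error on the person span

gold_spans, pred_spans = extract_spans(gold), extract_spans(pred)
for span in sorted(gold_spans | pred_spans):
    if span in gold_spans and span in pred_spans:
        status = "match"
    elif span in gold_spans:
        status = "missed"
    else:
        status = "spurious"
    print(span, " ".join(tokens[span[0]:span[1]]), status)

print(span_prf(gold, pred))
```

Here the truncated person span counts as both a miss and a spurious prediction, so precision, recall, and F1 are all 0.5. This is why boundary errors cost more under span-level scoring than under token-level accuracy.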

Review notes and pass criteria

A passing review should explain why token labels are not independent decisions. The answer should mention both local evidence and label-transition constraints.
Test at least one invalid BIO path such as O -> I-PER, then explain whether the mistake comes from representation, decoding, or annotation rules.
Compare a BERT token-classification output with a rule-checked span table. If the model predicts a high-confidence but illegal path, the structured check should still catch it.
The page is complete when you can say when a CRF-style constraint is useful even if the main model is a modern pretrained encoder.
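
For the invalid-path check in the criteria above, a small rule checker is enough. The sketch below assumes the standard BIO rule that I-X must directly follow B-X or I-X of the same type. `illegal_transitions` is a hypothetical helper name, and the sentence start is treated like O.

```python
def illegal_transitions(tags):
    """Return (position, previous_tag, tag) for every step that breaks the BIO rule."""
    problems = []
    prev = "O"  # treat the sentence start like O
    for i, tag in enumerate(tags):
        if tag.startswith("I-"):
            kind = tag[2:]
            # I-X is only legal right after B-X or I-X of the same type.
            if prev not in ("B-" + kind, "I-" + kind):
                problems.append((i, prev, tag))
        prev = tag
    return problems


print(illegal_transitions(["O", "I-PER"]))                       # [(1, 'O', 'I-PER')]
print(illegal_transitions(["B-PER", "I-ORG"]))                   # [(1, 'B-PER', 'I-ORG')]
print(illegal_transitions(["B-PER", "I-PER", "O", "B-ORG"]))     # []
```

Running this over model predictions gives the rule-checked view to compare against. Each reported position is a place to ask whether the error comes from the representation, the decoding step, or the annotation rules.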