11.4.4 BiLSTM + CRF

Learning objectives
Section titled “Learning objectives”- Understand what BiLSTM does in sequence labeling
- Understand why CRF can model label transition constraints
- Know which predictions are unreasonable under the BIO labeling scheme
- Explain the difference between BiLSTM + CRF and ordinary token classification
First, look at the overall structure
Section titled “First, look at the overall structure”flowchart LR A[Token sequence] --> B[Embedding] B --> C[BiLSTM contextual representations] C --> D[Label scores at each position] D --> E[CRF label transition constraints] E --> F[Optimal label sequence]BiLSTM is responsible for understanding context, and CRF is responsible for selecting the most reasonable overall label path. Combined, the model does not just ask, “Does this token look like an entity?” It also asks, “Is this entire sequence of labels reasonable when connected together?”
Why ordinary token classification is not enough
Section titled “Why ordinary token classification is not enough”Suppose we use the BIO labeling scheme: B-PER means the beginning of a person name, I-PER means the inside of a person name, and O means non-entity. If the model predicts each token independently, it may output labels like this:
I love BeijingO I-LOC B-LOCHere, I-LOC appears at the beginning of an entity, which is usually unreasonable. An ordinary classifier has difficulty explicitly constraining this kind of label transition, while CRF can learn transition scores between labels.
BiLSTM is responsible for contextual representations
Section titled “BiLSTM is responsible for contextual representations”An LSTM reads text sequentially, while a BiLSTM reads it from left to right and from right to left at the same time. In this way, the representation of each token includes information from both the left and right context.
For example, “apple” may refer to a fruit in one sentence and a company in another. The role of BiLSTM is to let the current token see the surrounding words, thereby reducing ambiguity.
CRF is responsible for global decoding
Section titled “CRF is responsible for global decoding”CRF considers two kinds of scores at the same time: the emission score for how likely each position is to have a certain label, and the transition score between labels. During final prediction, it does not greedily choose labels one position at a time, but instead searches for the label path with the highest total score for the whole sequence.
flowchart TD A[Emission scores: which label each token looks like] --> C[Total sequence score] B[Transition scores: whether adjacent labels are reasonable] --> C C --> D[Viterbi decoding finds the optimal path]This is why CRF is especially suitable for tasks with structural constraints between labels, such as NER, part-of-speech tagging, and word segmentation.
A minimal intuition example
Section titled “A minimal intuition example”labels = ["B-PER", "I-PER", "O", "B-LOC", "I-LOC"]
# Simplified: manually define some invalid transitionsinvalid_transitions = { ("O", "I-PER"), ("O", "I-LOC"), ("B-PER", "I-LOC"), ("B-LOC", "I-PER"),}
path = ["O", "I-LOC", "B-LOC"]
for a, b in zip(path, path[1:]): if (a, b) in invalid_transitions: print("Invalid transition:", a, "->", b)Expected output:
Invalid transition: O -> I-LOCThe local token classifier might assign I-LOC, but the path is structurally suspicious because an inside-location tag should usually follow B-LOC or I-LOC.
In a real CRF, these are not hand-written rules. Instead, it learns from training data which label transitions are more reasonable. This example is only meant to help you build the intuition that labels are related to each other.
Relationship with BERT token classification
Section titled “Relationship with BERT token classification”Modern NER often uses BERT with a linear classification layer, and CRF can also be added after BERT. BERT’s contextual representation ability is usually stronger than BiLSTM’s, but CRF is still valuable for label constraints, especially in tasks with small amounts of data, strict label formats, and entity boundary errors.
Evidence to Keep
Section titled “Evidence to Keep”Keep this page’s proof of learning as a small evidence card:
- Schema
- entity types, BIO tags, or sequence-label rules
- Prediction
- token-level labels and extracted spans
- Metric
- entity precision/recall/F1 and boundary cases
- Failure Check
- span boundary, nested entity, unknown word, or inconsistent annotation
- Expected Output
- gold-vs-predicted span table with at least one miss
Common misconceptions
Section titled “Common misconceptions”The first misconception is that CRF is an outdated model. It is not necessarily the strongest solution, but the idea of label constraints is still important. The second misconception is looking only at token-level accuracy and ignoring entity-level F1. What NER ultimately cares about is whether the entity boundaries and types are both completely correct. The third misconception is ignoring BIO labeling consistency, which makes the training data itself contain invalid label sequences.
Exercises
Section titled “Exercises”- Write the BIO labels for a Chinese sentence and check whether there is any invalid
I-*at the beginning. - Compare the difference between “token-by-token classification” and “global sequence decoding.”
- Think about why entity-level F1 is more suitable than token accuracy for NER.
- If you use BERT for NER, do you still need CRF? List reasons for and against it.
Solution approach and explanation
- For a Chinese BIO example, segment the sentence first, then ensure no
I-*starts an entity and everyI-*follows a matchingB-*orI-*. - Token-by-token classification scores each position independently; global sequence decoding chooses the best valid label path for the whole sentence.
- Entity-level F1 is better than token accuracy because one wrong boundary can make the extracted entity unusable even if most tokens look correct.
- BERT may work without CRF, but CRF can still help enforce legal transitions and consistent entity boundaries; it also adds complexity and may be unnecessary on easy datasets.
Passing criteria
Section titled “Passing criteria”After learning this section, you should be able to explain what BiLSTM and CRF are each responsible for, identify invalid transitions in BIO labels, explain why sequence labeling needs to consider dependencies between labels, and transfer this idea to later structured information extraction tasks.