11.4.5 NER Practice

Figure: NER project entity evaluation loop

  • Learn how to define the boundary of a minimal NER project
  • Learn how to recover entities from token labels
  • Learn how to do entity-level error analysis
  • Build a project skeleton for information extraction through a runnable example

NER practice is easier to understand in the order of “labels -> entities -> evaluation -> iteration”:

```mermaid
flowchart LR
    A["Define entity types"] --> B["Design BIO labels"]
    B --> C["Model outputs label sequence"]
    C --> D["Recover entity span"]
    D --> E["Do entity-level evaluation and error analysis"]
```

So the real questions this section addresses are:

  • Why is an NER project not just “label prediction”?
  • Why do entity recovery and error analysis feel more like a real project?

Input:

  • A resume or candidate profile text

Output:

  • Name
  • School
  • Skill

Why is this more suitable for practice than “just extracting some entities”?

Because the boundaries are clear:

  • Not too many categories
  • Entity types are explicit
  • The results are easy to explain from a business perspective

The first key point is not the model, but the label scheme

For example:

  • Zhang San -> B-NAME
  • Tsinghua University -> B-SCHOOL I-SCHOOL ...
  • Python -> B-SKILL

If this step is vague, the model and evaluation will both become messy later.
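
One way to make the label scheme concrete is to write it down as data and check every tag sequence against it. The sketch below assumes word-level tokens, so a multi-word name or school is one B- tag followed by I- tags. The names `ENTITY_TYPES` and `find_bio_violations` are illustrative and not part of any library.

```python
# Hypothetical label scheme for the resume example.
ENTITY_TYPES = {"NAME", "SCHOOL", "SKILL"}


def find_bio_violations(tags):
    """Return (index, tag, reason) for every tag that breaks the BIO scheme."""
    violations = []
    previous_type = None  # entity type of the previous tag, None after "O"
    for index, tag in enumerate(tags):
        if tag == "O":
            previous_type = None
            continue
        prefix, _, entity_type = tag.partition("-")
        if prefix not in {"B", "I"} or entity_type not in ENTITY_TYPES:
            violations.append((index, tag, "unknown tag"))
            previous_type = None
            continue
        if prefix == "I" and previous_type != entity_type:
            violations.append((index, tag, "I- tag does not continue an entity of the same type"))
        previous_type = entity_type
    return violations


# Word-level tokens: "Tsinghua University" becomes B-SCHOOL I-SCHOOL.
tokens = ["Zhang", "San", "graduated", "from", "Tsinghua", "University"]
good_tags = ["B-NAME", "I-NAME", "O", "O", "B-SCHOOL", "I-SCHOOL"]
bad_tags = ["B-NAME", "I-NAME", "O", "O", "I-SCHOOL", "I-SKILL"]

print(find_bio_violations(good_tags))
print(find_bio_violations(bad_tags))
```

Running a check like this over the annotated data before training is a cheap way to catch scheme violations early. Whether an orphan I- tag should be rejected or repaired is a project decision that belongs in the labeling standard.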

You can think of NER as:

  • using a highlighter to mark important information in a piece of text

The hard part is not only “marking it,” but also:

  • where to start marking
  • where to stop
  • what category this span belongs to

Once you understand it this way, it becomes much more natural why NER often gets stuck on boundary issues.
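
The three questions above (where to start, where to stop, which category) map directly onto a span triple `(start, end, type)`. The following is a minimal sketch, assuming word-level tokens and an exclusive `end` index; `decode_spans` is a hypothetical helper name, and the prediction is made up to show a boundary problem.

```python
def decode_spans(tags):
    """Decode BIO tags into (start, end, type) spans; end is exclusive."""
    spans = []
    start = None
    current_type = None
    for index, tag in enumerate(tags):
        prefix, _, entity_type = tag.partition("-")
        if prefix == "I" and entity_type == current_type:
            continue  # the open span keeps growing
        # Any other tag closes the open span, if there is one.
        if current_type is not None:
            spans.append((start, index, current_type))
            start, current_type = None, None
        # B- opens a new span; a stray I- is treated as an opening tag too.
        if prefix in {"B", "I"}:
            start, current_type = index, entity_type
    if current_type is not None:
        spans.append((start, len(tags), current_type))
    return spans


tokens = ["Li", "Si", "is", "from", "Peking", "University"]
gold_tags = ["B-NAME", "I-NAME", "O", "O", "B-SCHOOL", "I-SCHOOL"]
pred_tags = ["B-NAME", "I-NAME", "O", "O", "B-SCHOOL", "O"]  # made-up prediction

for name, tags in [("gold", gold_tags), ("pred", pred_tags)]:
    for start, end, entity_type in decode_spans(tags):
        print(name, (start, end), entity_type, " ".join(tokens[start:end]))
```

Here the predicted school span stops one token early, so the type is right but the boundary is wrong. Keeping positions in the span makes that kind of mismatch visible.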


First build a runnable annotation and decoding loop

The example below does three things:

  1. Prepare a small sample
  2. Decode BIO labels into entities
  3. Do a simple prediction comparison and error analysis

```python
samples = [
    {
        "tokens": ["Zhang San", "graduated from", "Tsinghua University", ",", "familiar with", "Python", "and", "PyTorch"],
        "gold_tags": ["B-NAME", "O", "B-SCHOOL", "O", "O", "B-SKILL", "O", "B-SKILL"],
        "pred_tags": ["B-NAME", "O", "B-SCHOOL", "O", "O", "B-SKILL", "O", "B-SKILL"],
    },
    {
        "tokens": ["Li Si", "is from", "Peking University", ",", "knows", "Java"],
        "gold_tags": ["B-NAME", "O", "B-SCHOOL", "O", "O", "B-SKILL"],
        "pred_tags": ["B-NAME", "O", "O", "O", "O", "B-SKILL"],
    },
]


def decode_entities(tokens, tags):
    """Recover (text, type) entities from a BIO tag sequence."""
    entities = []
    current_tokens = []
    current_type = None

    for token, tag in zip(tokens, tags):
        if tag == "O":
            # An O tag closes any entity that is still open.
            if current_tokens:
                entities.append((" ".join(current_tokens), current_type))
            current_tokens = []
            current_type = None
            continue

        prefix, entity_type = tag.split("-", 1)
        if prefix == "B":
            # B- closes the open entity (if any) and starts a new one.
            if current_tokens:
                entities.append((" ".join(current_tokens), current_type))
            current_tokens = [token]
            current_type = entity_type
        elif prefix == "I" and current_type == entity_type:
            # I- of the same type extends the open entity.
            current_tokens.append(token)
        else:
            # An I- tag that does not continue the open entity
            # starts a new entity instead of being dropped.
            if current_tokens:
                entities.append((" ".join(current_tokens), current_type))
            current_tokens = [token]
            current_type = entity_type

    # Flush the entity that runs to the end of the sequence.
    if current_tokens:
        entities.append((" ".join(current_tokens), current_type))
    return entities


for sample in samples:
    gold_entities = decode_entities(sample["tokens"], sample["gold_tags"])
    pred_entities = decode_entities(sample["tokens"], sample["pred_tags"])
    print("tokens:", sample["tokens"])
    print("gold :", gold_entities)
    print("pred :", pred_entities)
    print("miss :", [x for x in gold_entities if x not in pred_entities])
    print()
```

Expected output:

```text
tokens: ['Zhang San', 'graduated from', 'Tsinghua University', ',', 'familiar with', 'Python', 'and', 'PyTorch']
gold : [('Zhang San', 'NAME'), ('Tsinghua University', 'SCHOOL'), ('Python', 'SKILL'), ('PyTorch', 'SKILL')]
pred : [('Zhang San', 'NAME'), ('Tsinghua University', 'SCHOOL'), ('Python', 'SKILL'), ('PyTorch', 'SKILL')]
miss : []

tokens: ['Li Si', 'is from', 'Peking University', ',', 'knows', 'Java']
gold : [('Li Si', 'NAME'), ('Peking University', 'SCHOOL'), ('Java', 'SKILL')]
pred : [('Li Si', 'NAME'), ('Java', 'SKILL')]
miss : [('Peking University', 'SCHOOL')]
```

Figure: NER gold / pred / miss result map

The second sample misses the school entity. That is exactly why NER projects should inspect recovered entities, not only token-level labels.

Why is this code the “minimal project loop”?

Because it already includes:

  • data representation
  • prediction results
  • entity recovery
  • error analysis

This is much closer to the shape of a real project than printing a string of labels.

Why compare by entity here instead of only by token?

Because what the business usually cares about is:

  • whether the entity was extracted
  • whether the type is correct

Not whether a single token was labeled correctly.

```python
sample = {
    "tokens": ["Li Si", "is from", "Peking University", ",", "knows", "Java"],
    "gold_tags": ["B-NAME", "O", "B-SCHOOL", "O", "O", "B-SKILL"],
    "pred_tags": ["B-NAME", "O", "O", "O", "O", "B-SKILL"],
}


def decode_entities(tokens, tags):
    """Recover (text, type) entities from a BIO tag sequence."""
    entities = []
    current_tokens = []
    current_type = None

    for token, tag in zip(tokens, tags):
        if tag == "O":
            if current_tokens:
                entities.append((" ".join(current_tokens), current_type))
            current_tokens = []
            current_type = None
            continue

        prefix, entity_type = tag.split("-", 1)
        if prefix == "B":
            if current_tokens:
                entities.append((" ".join(current_tokens), current_type))
            current_tokens = [token]
            current_type = entity_type
        elif prefix == "I" and current_type == entity_type:
            current_tokens.append(token)
        else:
            # An I- tag that does not continue the open entity starts a
            # new entity, so its token is not silently dropped.
            if current_tokens:
                entities.append((" ".join(current_tokens), current_type))
            current_tokens = [token]
            current_type = entity_type

    if current_tokens:
        entities.append((" ".join(current_tokens), current_type))
    return entities


gold_entities = decode_entities(sample["tokens"], sample["gold_tags"])
pred_entities = decode_entities(sample["tokens"], sample["pred_tags"])

# One log record per sample: original text, gold entities, predicted entities.
print(
    {
        "text": " ".join(sample["tokens"]).replace(" ,", ","),
        "gold_entities": gold_entities,
        "pred_entities": pred_entities,
    }
)
```

Expected output:

```text
{'text': 'Li Si is from Peking University, knows Java', 'gold_entities': [('Li Si', 'NAME'), ('Peking University', 'SCHOOL'), ('Java', 'SKILL')], 'pred_entities': [('Li Si', 'NAME'), ('Java', 'SKILL')]}
```

This kind of log is especially good for beginners because it turns an abstract labeling task into a more realistic project output:

  • What is the original text?
  • What are the correct entities?
  • What exactly did the system miss?

What metrics should an NER project look at first?

Entity-level precision, recall, and F1 are the most common and most meaningful set of metrics.

Token accuracy alone can mislead, because most positions in a sequence are usually labeled O. If you only look at token accuracy, it can easily seem “very high,” while the actual entity extraction performance is still poor.

```python
samples = [
    {
        "tokens": ["Zhang San", "graduated from", "Tsinghua University", ",", "familiar with", "Python", "and", "PyTorch"],
        "gold_tags": ["B-NAME", "O", "B-SCHOOL", "O", "O", "B-SKILL", "O", "B-SKILL"],
        "pred_tags": ["B-NAME", "O", "B-SCHOOL", "O", "O", "B-SKILL", "O", "B-SKILL"],
    },
    {
        "tokens": ["Li Si", "is from", "Peking University", ",", "knows", "Java"],
        "gold_tags": ["B-NAME", "O", "B-SCHOOL", "O", "O", "B-SKILL"],
        "pred_tags": ["B-NAME", "O", "O", "O", "O", "B-SKILL"],
    },
]


def decode_entities(tokens, tags):
    """Recover (text, type) entities from a BIO tag sequence."""
    entities = []
    current_tokens = []
    current_type = None

    for token, tag in zip(tokens, tags):
        if tag == "O":
            if current_tokens:
                entities.append((" ".join(current_tokens), current_type))
            current_tokens = []
            current_type = None
            continue

        prefix, entity_type = tag.split("-", 1)
        if prefix == "B":
            if current_tokens:
                entities.append((" ".join(current_tokens), current_type))
            current_tokens = [token]
            current_type = entity_type
        elif prefix == "I" and current_type == entity_type:
            current_tokens.append(token)
        else:
            # An I- tag that does not continue the open entity starts a
            # new entity, so its token is not silently dropped.
            if current_tokens:
                entities.append((" ".join(current_tokens), current_type))
            current_tokens = [token]
            current_type = entity_type

    if current_tokens:
        entities.append((" ".join(current_tokens), current_type))
    return entities


def entity_recall(gold_entities, pred_entities):
    """Fraction of gold entities that also appear in the prediction."""
    if not gold_entities:
        return 1.0  # nothing to find, so nothing was missed
    hit = sum(entity in pred_entities for entity in gold_entities)
    return hit / len(gold_entities)


for sample in samples:
    gold_entities = decode_entities(sample["tokens"], sample["gold_tags"])
    pred_entities = decode_entities(sample["tokens"], sample["pred_tags"])
    print(entity_recall(gold_entities, pred_entities))
```

Expected output:

```text
1.0
0.6666666666666666
```

The first sample recovers all entities. The second recovers 2 of 3 entities, so entity-level recall drops even though many O tokens were still correct.
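
Recall is only one side of the picture. As a rough sketch, entity-level precision, recall, and F1 can be computed over the whole corpus (micro-averaged) and put next to token accuracy for comparison. The function name `evaluate` and the choice of micro-averaging are assumptions made for this illustration. Entities are matched as `(text, type)` pairs, as in the examples above, and a `Counter` keeps repeated entities from being counted twice.

```python
from collections import Counter

samples = [
    {
        "tokens": ["Zhang San", "graduated from", "Tsinghua University", ",", "familiar with", "Python", "and", "PyTorch"],
        "gold_tags": ["B-NAME", "O", "B-SCHOOL", "O", "O", "B-SKILL", "O", "B-SKILL"],
        "pred_tags": ["B-NAME", "O", "B-SCHOOL", "O", "O", "B-SKILL", "O", "B-SKILL"],
    },
    {
        "tokens": ["Li Si", "is from", "Peking University", ",", "knows", "Java"],
        "gold_tags": ["B-NAME", "O", "B-SCHOOL", "O", "O", "B-SKILL"],
        "pred_tags": ["B-NAME", "O", "O", "O", "O", "B-SKILL"],
    },
]


def decode_entities(tokens, tags):
    """Compact BIO decoder returning (text, type) pairs."""
    entities, current_tokens, current_type = [], [], None
    for token, tag in zip(tokens, tags):
        prefix, _, entity_type = tag.partition("-")
        if prefix == "I" and entity_type == current_type:
            current_tokens.append(token)
            continue
        if current_tokens:
            entities.append((" ".join(current_tokens), current_type))
            current_tokens, current_type = [], None
        if prefix in {"B", "I"}:
            current_tokens, current_type = [token], entity_type
    if current_tokens:
        entities.append((" ".join(current_tokens), current_type))
    return entities


def evaluate(samples):
    """Micro-averaged entity precision/recall/F1 plus token accuracy."""
    hits = gold_total = pred_total = 0
    correct_tokens = token_total = 0
    for sample in samples:
        gold = Counter(decode_entities(sample["tokens"], sample["gold_tags"]))
        pred = Counter(decode_entities(sample["tokens"], sample["pred_tags"]))
        hits += sum((gold & pred).values())  # multiset intersection
        gold_total += sum(gold.values())
        pred_total += sum(pred.values())
        correct_tokens += sum(g == p for g, p in zip(sample["gold_tags"], sample["pred_tags"]))
        token_total += len(sample["gold_tags"])
    precision = hits / pred_total if pred_total else 0.0
    recall = hits / gold_total if gold_total else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "token_accuracy": correct_tokens / token_total if token_total else 0.0,
    }


metrics = evaluate(samples)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

On these two samples, 13 of 14 tokens are labeled correctly but only 6 of 7 gold entities are recovered, so token accuracy comes out higher than entity recall. On longer texts with more O tokens the gap usually widens.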

The safest default order when doing an NER project for the first time

A more stable order is usually:

  1. Narrow down the entity types first
  2. Write the labeling standard clearly first
  3. Do entity recovery and entity-level evaluation first
  4. Then switch to a stronger model

This order keeps the project more stable than rushing to BERT from the start.


The most common failure points in NER projects

  • Boundary error: for example, only half of a school name is extracted.
  • Type error: for example, a skill is recognized as a school.
  • Missing entity: for example, in sample 2, Peking University is missed.

Why is this so suitable for error analysis?

Because NER errors are usually very concrete, which makes them easy to inspect one by one and fix category by category.

A very useful error-bucketing method for beginners

When doing error analysis for the first time, the most valuable buckets are usually:

  1. Boundary error
  2. Type error
  3. Missing entity

These three are already enough to help you judge:

  • Is it a data annotation problem?
  • Is it a model representation problem?
  • Or are the post-processing rules not strong enough?
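
The three buckets can be assigned automatically once entities carry positions. The sketch below assumes spans of the form `(start, end, type)` with an exclusive `end`. The helper names `overlaps` and `bucket_errors` are hypothetical, and the gold and predicted spans are made up to trigger one error of each kind.

```python
def overlaps(a, b):
    """True if two (start, end, type) spans share at least one token."""
    return a[0] < b[1] and b[0] < a[1]


def bucket_errors(gold_spans, pred_spans):
    """Assign every gold span that was not matched exactly to one bucket."""
    buckets = {"boundary": [], "type": [], "missing": []}
    pred_set = set(pred_spans)
    for gold in gold_spans:
        if gold in pred_set:
            continue  # exact match: not an error
        start, end, _ = gold
        same_range = [p for p in pred_spans if p[0] == start and p[1] == end]
        overlapping = [p for p in pred_spans if overlaps(gold, p)]
        if same_range:
            # Same boundaries, different type.
            buckets["type"].append((gold, same_range[0]))
        elif overlapping:
            # Some overlap, but start or end differs.
            buckets["boundary"].append((gold, overlapping[0]))
        else:
            buckets["missing"].append((gold, None))
    return buckets


# Illustrative tokens (indices 0..10):
# Li Si studied at Peking University , knows Java and Rust
gold_spans = [(0, 2, "NAME"), (4, 6, "SCHOOL"), (8, 9, "SKILL"), (10, 11, "SKILL")]
pred_spans = [(0, 2, "NAME"), (4, 5, "SCHOOL"), (8, 9, "SCHOOL")]

for bucket, cases in bucket_errors(gold_spans, pred_spans).items():
    print(bucket, cases)
```

This sketch only walks the gold side. Predicted spans that overlap no gold span (false extractions) would need a separate pass over `pred_spans`.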

What should the next step be in a real project?

Especially:

  • long entities
  • rare entities
  • easily confused types

Upgrade from rules / classic models to stronger models

For example:

  • BiLSTM + CRF
  • BERT token classification

In many business projects, reasonable post-processing rules can significantly improve entity quality.
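
As an illustration of what such rules can look like, the sketch below cleans extracted `(text, type)` pairs by trimming trailing sentence punctuation, dropping empty results, and removing case-insensitive duplicates. These particular rules and the name `postprocess_entities` are assumptions for this example. Real rules should come from the error buckets observed in your own data.

```python
def postprocess_entities(entities):
    """Apply a few illustrative cleanup rules to (text, type) pairs."""
    cleaned = []
    seen = set()
    for text, entity_type in entities:
        # Trim whitespace and trailing sentence punctuation only, so that
        # names such as "C++" or ".NET" are left intact.
        text = text.strip().rstrip(",;:.!?").rstrip()
        if not text:
            continue  # nothing left after trimming
        key = (text.lower(), entity_type)
        if key in seen:
            continue  # the same entity was already extracted
        seen.add(key)
        cleaned.append((text, entity_type))
    return cleaned


# Made-up raw extractions with typical noise.
raw_entities = [
    ("Peking University ,", "SCHOOL"),
    ("Java", "SKILL"),
    ("java", "SKILL"),
    (",", "SKILL"),
    ("C++", "SKILL"),
]
print(postprocess_entities(raw_entities))
```

Rules like these are cheap to add, but each one should be checked against the entity-level metrics, since a rule that helps one entity type can hurt another.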

If you turn this into a project, what is most worth showing?

What is usually most worth showing is not:

  • a string of label prediction results

but:

  1. Original text
  2. Gold entities
  3. Predicted entities
  4. Missed and false extraction cases
  5. Which type of error you plan to fix first

This makes it much easier for others to feel that:

  • you built an information extraction project
  • not just trained a sequence labeling model
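
The five items above can be collected into one small report per sample. This is a minimal sketch: `build_report` and `format_report` are hypothetical helper names, and the prediction contains a made-up false extraction so that both error lists are non-empty.

```python
def build_report(text, gold_entities, pred_entities):
    """Collect original text, gold, predictions, misses, and false extractions."""
    return {
        "text": text,
        "gold": gold_entities,
        "pred": pred_entities,
        "missed": [e for e in gold_entities if e not in pred_entities],
        "false_extractions": [e for e in pred_entities if e not in gold_entities],
    }


def format_report(report):
    """Render the report as aligned plain-text lines."""
    width = max(len(key) for key in report)
    return "\n".join(f"{key.ljust(width)} : {value}" for key, value in report.items())


report = build_report(
    text="Li Si is from Peking University, knows Java",
    gold_entities=[("Li Si", "NAME"), ("Peking University", "SCHOOL"), ("Java", "SKILL")],
    # Made-up prediction: misses the school and wrongly extracts "knows".
    pred_entities=[("Li Si", "NAME"), ("knows", "SKILL"), ("Java", "SKILL")],
)
print(format_report(report))
```

The fifth item, which type of error to fix first, is a human decision. A report like this is the evidence it should be based on.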

Misconception 1: Only look at token-level metrics

NER should pay more attention to entity-level performance.

Misconception 2: Try to cover all entity types from the start

A more stable approach is usually:

  • first choose 2 to 4 core entity types and make them solid

Misconception 3: Do not define the label scheme clearly at the beginning

If the label boundaries are unclear, both the data and the evaluation will drift.
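
One simple symptom of drift is the same surface string annotated with different types. The sketch below scans annotated `(text, type)` pairs for that pattern. The function name `find_inconsistent_annotations` and the sample annotations are made up for illustration, and the check catches only type conflicts, not boundary conflicts.

```python
from collections import defaultdict


def find_inconsistent_annotations(annotated_entities):
    """Map each entity text (case-insensitive) to its types when more than one was used."""
    types_by_text = defaultdict(set)
    for text, entity_type in annotated_entities:
        types_by_text[text.lower()].add(entity_type)
    return {
        text: sorted(types)
        for text, types in types_by_text.items()
        if len(types) > 1
    }


# Made-up annotations from two annotators.
annotated_entities = [
    ("Python", "SKILL"),
    ("Tsinghua University", "SCHOOL"),
    ("Tsinghua University", "ORG"),
    ("python", "SKILL"),
]
print(find_inconsistent_annotations(annotated_entities))
```

Any string this check reports is a prompt to go back to the labeling standard and decide which type is correct before training further.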


Keep this page’s proof of learning as a small evidence card:

  • Schema: entity types, BIO tags, or sequence-label rules
  • Prediction: token-level labels and extracted spans
  • Metric: entity precision/recall/F1 and boundary cases
  • Failure Check: span boundary, nested entity, unknown word, or inconsistent annotation
  • Expected Output: gold-vs-predicted span table with at least one miss

The most important thing in this section is to build a practical habit:

When doing an NER project, first make the entity types, label scheme, entity recovery, and entity-level error analysis solid, and only then pursue more complex models.

In that way, what you leave behind is a truly explainable and improvable information extraction project, not just a half-finished script that can run training.


  1. Add an ORG or TITLE entity type to the example and expand the samples.
  2. Think about why NER projects are more suitable for entity-level metrics than token accuracy.
  3. If the system often extracts only half of a long school name, would you prioritize changing the data, changing the model, or adding post-processing? Why?
  4. How would you further expand this resume extraction project into a portfolio presentation?

Project reference and review notes

  1. When adding ORG or TITLE, define boundary rules first: what counts as the organization name, job title, or surrounding modifier.
  2. NER should use entity-level metrics because the user receives extracted entities, not isolated token labels.
  3. For half-extracted long school names, first inspect annotation consistency, then add examples or post-processing; change the model only after the target is clear.
  4. A strong portfolio presentation should show label schema, examples, entity-level metrics, error buckets, fixes, and a small before/after improvement log.