11.7.3 Project: Text Summarization System

[Figure: text summarization extraction-generation-evaluation diagram]

Learning objectives

Learn how to define the minimum end-to-end loop of a summarization project
Learn how to turn an extractive baseline into an explainable system
Learn how to design minimal evaluation and failure analysis
Learn how to package this topic as a complete NLP project

First, build a map

For beginners, the best way to understand a text summarization project is not to “chase a stronger model first,” but to first see the full project loop clearly:

flowchart LR
    A["Original text"] --> B["Sentence splitting"]
    B --> C["Scoring / selecting key information"]
    C --> D["Generating summary"]
    D --> E["Evaluation and failure analysis"]

So the questions this section sets out to answer are:

What does it mean to “keep the main thread”?
How do you evaluate and present a summarization project?

A better overall analogy for beginners

You can think of text summarization as:

Making a reading card for a long article

The real difficulty is not “making the text shorter,” but:

Not losing the main thread
Not keeping only side details
Making the final summary read smoothly

How should you narrow the project topic?

A good starter project could be:

Generate a 2-sentence summary for long course articles.

This type of task is good because:

The domain is clear
The text length is moderate
The summarization goal is easy to understand

When doing your first summarization project, how do you choose a safer topic?

A safer starting point usually has these three traits:

The original text has a clear structure
The main thread is concentrated
It is easy for readers to judge whether the key points are missing

So texts like:

course introductions
news briefs
meeting minutes

are often great practice topics.

A useful early judgment for beginners

For a first summarization project, the most valuable thing to pick is:

Texts where readers can easily tell which parts are the key points

Because the hardest question in summarization is:

What exactly counts as key information?

The minimum project loop for a portfolio-level summarization project

Select a text collection
Split into sentences
Score sentences
Select the top-k sentences
Do human evaluation
Summarize failure patterns

A project checklist that beginners can remember first

Step	What should you confirm first?
Sentence splitting	Whether sentence boundaries are stable
Scoring	What standard is used to decide “more important”
Summary generation	Whether the top-k sentences preserve the main thread
Evaluation	Whether you are only checking “does it read smoothly,” or also “does it miss key points”

This table is useful for beginners because it turns a summarization project back into a chain of steps that can be checked, rather than “just extract a few sentences and stop.”
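
The first row of the checklist, stable sentence boundaries, can be spot-checked with a few lines of code. The following is a minimal sketch, not part of the baseline: the splitter, the helper name `suspicious_sentences`, the `min_chars` threshold, and the sample text are all assumptions chosen for illustration.

```python
import re


def split_on_punctuation(text):
    # Assumed splitter: break after ., !, or ? when followed by whitespace.
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def suspicious_sentences(sentences, min_chars=15):
    """Flag fragments that are very short or that lack end punctuation."""
    flagged = []
    for idx, sent in enumerate(sentences):
        reasons = []
        if len(sent) < min_chars:
            reasons.append("too short")
        if not sent.endswith((".", "!", "?")):
            reasons.append("no end punctuation")
        if reasons:
            flagged.append((idx, sent, reasons))
    return flagged


# Hypothetical text: the abbreviation "Dr." triggers a wrong split.
sample = "Dr. Lee teaches the course. It runs for 3.5 hours per week"
sentences = split_on_punctuation(sample)
flagged = suspicious_sentences(sentences)
for item in flagged:
    print(item)
```

Flagged fragments are only hints. A short fragment such as "Dr." usually points to an abbreviation that the splitter mistook for a sentence end.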

Recommended order of progress

For beginners, a safer order is usually:

Build an extractive baseline first
Add minimal human evaluation
Do failure case analysis
Only then consider a comparison with generative summarization

This way, you can more easily see what the summarization system is actually improving.

Start with a more complete extractive summarization system

import re

article = """
The learning path for AI courses is usually divided into a foundation stage and an advanced stage.
The foundation stage includes Python programming, data analysis, and machine learning.
Only after learners master these topics can they move more steadily into deep learning and large model application development.
Many people want to jump straight into large models at the beginning, but they often get stuck quickly because their foundation is not solid enough.
If the learning goal is AI application engineering, understanding data processing, model training, and system deployment is all very important.
""".strip()


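def split_sentences(text):
    # Split after English end punctuation followed by whitespace,
    # and on newlines or Chinese end punctuation.
    parts = re.split(r"(?<=[.!?])\s+|[。！？\n]+", text)
    return [p.strip() for p in parts if p.strip()]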


def sentence_score(sentence, all_sentences):
    # Extremely simple character-frequency scoring: each non-whitespace character
    # adds its count in the full text, so longer sentences tend to score higher
    tokens = "".join(all_sentences)
    return sum(tokens.count(ch) for ch in sentence if ch.strip())


def summarize(text, top_k=2):
    sentences = split_sentences(text)
    scored = [
        (sentence_score(sent, sentences), idx, sent)
        for idx, sent in enumerate(sentences)
    ]
    # Keep the top_k highest-scoring sentences, then restore document order
    top = sorted(sorted(scored, reverse=True)[:top_k], key=lambda x: x[1])
    return " ".join(item[2] for item in top), scored


summary, scored = summarize(article, top_k=2)
print("summary:", summary)
print("top scored:", sorted(scored, reverse=True)[:2])

Expected output:

summary: Many people want to jump straight into large models at the beginning, but they often get stuck quickly because their foundation is not solid enough. If the learning goal is AI application engineering, understanding data processing, model training, and system deployment is all very important.
top scored: [(4136, 4, 'If the learning goal is AI application engineering, understanding data processing, model training, and system deployment is all very important.'), (3866, 3, 'Many people want to jump straight into large models at the beginning, but they often get stuck quickly because their foundation is not solid enough.')]

Do not treat the score as truth. Treat it as a debug signal: if the selected sentence is strange, inspect the scoring rule before changing models.
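
One way to act on that advice is to print the raw score next to the sentence length. The sketch below assumes the same character-frequency idea as the baseline; the sample sentences and the helper names `char_frequency_score` and `score_table` are made up for illustration.

```python
# Hypothetical sentences for illustration; not taken from a real dataset.
sentences = [
    "Learn Python first.",
    "Many learners want to jump straight into large models, but they often get stuck.",
    "Then study machine learning.",
]


def char_frequency_score(sentence, all_sentences):
    """Sum, over non-whitespace characters, each character's count in the full text."""
    text = "".join(all_sentences)
    return sum(text.count(ch) for ch in sentence if ch.strip())


def score_table(all_sentences):
    """Return one row per sentence: index, raw score, length, score per character."""
    rows = []
    for idx, sent in enumerate(all_sentences):
        raw = char_frequency_score(sent, all_sentences)
        n_chars = sum(1 for ch in sent if ch.strip())
        rows.append({
            "idx": idx,
            "raw": raw,
            "chars": n_chars,
            "per_char": round(raw / n_chars, 2),
        })
    return rows


rows = score_table(sentences)
for row in rows:
    print(row)
```

If the ranking by `raw` and the ranking by `per_char` disagree, sentence length is probably driving the selection more than content is.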

Why does this example feel more like a project?

Because it gives you not only the result, but also:

the sentence-splitting result
the scoring result

This lets you do:

explanation
debugging
failure analysis

Why is it especially worth showing intermediate scores in a summarization project?

Because whether a summary is good or bad is inherently subjective. The intermediate scoring process helps others understand:

how you made your selection

Here is another minimal example for “summary length control”

for k in [1, 2, 3]:
    summary_k, _ = summarize(article, top_k=k)
    print(f"top_k={k} -> {summary_k}")

Expected output:

top_k=1 -> If the learning goal is AI application engineering, understanding data processing, model training, and system deployment is all very important.
top_k=2 -> Many people want to jump straight into large models at the beginning, but they often get stuck quickly because their foundation is not solid enough. If the learning goal is AI application engineering, understanding data processing, model training, and system deployment is all very important.
top_k=3 -> Only after learners master these topics can they move more steadily into deep learning and large model application development. Many people want to jump straight into large models at the beginning, but they often get stuck quickly because their foundation is not solid enough. If the learning goal is AI application engineering, understanding data processing, model training, and system deployment is all very important.

This example is great for beginners because it helps you build one key intuition:

A summary is not better just because it has more sentences
Nor is it more advanced just because it is shorter

Rather, it is about:

Preserving the main thread as much as possible under length constraints
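
A length constraint can also be expressed as a character budget instead of a fixed sentence count. The following is a minimal sketch under that assumption: the function name `select_within_budget`, the `max_chars` parameter, and the scores are hypothetical, and the greedy strategy is only one possible choice.

```python
def select_within_budget(scored_sentences, max_chars):
    """Greedily keep the highest-scoring sentences that fit a character budget.

    scored_sentences: list of (score, index, sentence) tuples.
    """
    chosen = []
    used = 0
    for score, idx, sent in sorted(scored_sentences, reverse=True):
        extra = len(sent) + (1 if chosen else 0)  # +1 for the joining space
        if used + extra <= max_chars:
            chosen.append((idx, sent))
            used += extra
    # Restore document order so the summary reads naturally.
    return " ".join(sent for idx, sent in sorted(chosen))


# Hypothetical scores for illustration only.
scored_demo = [
    (30, 0, "Foundations come first."),
    (50, 1, "Large models require solid foundations in Python and machine learning."),
    (20, 2, "Deployment matters too."),
]

short_summary = select_within_budget(scored_demo, max_chars=50)
long_summary = select_within_budget(scored_demo, max_chars=100)
print(short_summary)
print(long_summary)
```

With a budget of 50 characters, the highest-scoring sentence is skipped because it does not fit on its own. Whether that trade-off is acceptable is exactly the kind of decision worth recording in the failure analysis.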

What should a minimal human evaluation table look like?

eval_cases = [
    {
        "text": article,
        # Each phrase must appear verbatim in the article, or no extractive summary can cover it
        "gold_focus": ["foundation stage", "deep learning and large model", "system deployment"],
    }
]

for case in eval_cases:
    pred_summary, _ = summarize(case["text"], top_k=2)
    covered = [item for item in case["gold_focus"] if item in pred_summary]
    print({
        "summary": pred_summary,
        "covered_focus": covered,
        "coverage_ratio": round(len(covered) / len(case["gold_focus"]), 4),
    })

Expected output:

{'summary': 'Many people want to jump straight into large models at the beginning, but they often get stuck quickly because their foundation is not solid enough. If the learning goal is AI application engineering, understanding data processing, model training, and system deployment is all very important.', 'covered_focus': ['system deployment'], 'coverage_ratio': 0.3333}

The low coverage ratio is not a failure of the lesson. It is the lesson: a naive frequency baseline can miss important learning-path facts even when the summary reads smoothly.

Why is this evaluation simple but useful?

Because it forces you to answer:

Did the summary keep the main thread or not?

That is more concrete than only asking whether it “reads smoothly.”
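
To make the question easier to answer during review, it can help to list the missed phrases as well as the covered ones. The sketch below is one possible variant, assuming simple case-insensitive substring matching; the helper name `focus_coverage` and the candidate summaries are hypothetical.

```python
def focus_coverage(summary, gold_focus):
    """Case-insensitive check of which gold phrases appear verbatim in the summary."""
    text = summary.lower()
    covered = [p for p in gold_focus if p.lower() in text]
    missed = [p for p in gold_focus if p.lower() not in text]
    ratio = len(covered) / len(gold_focus) if gold_focus else 0.0
    return {"covered": covered, "missed": missed, "ratio": round(ratio, 4)}


gold = ["foundation stage", "system deployment"]

# Hypothetical candidate summaries for illustration.
candidates = {
    "A": "The Foundation stage covers Python. System deployment comes later.",
    "B": "Many people rush into large models.",
}

results = {name: focus_coverage(text, gold) for name, text in candidates.items()}
for name, result in results.items():
    print(name, result)
```

Substring matching is strict: a paraphrase of a key point counts as missed, so treat the ratio as a prompt for manual review rather than a final score.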

The failure cases most worth showing in a summarization project

For example:

Repeated sentence selection
Missing key information
Unnatural sentence order

Why are these worth showing?

Because they reflect the typical limitations of extractive summarization.

A failure analysis framework that is easy for beginners to use directly

You can first categorize them into these three types:

Missing main-thread information
Repeated or redundant sentences
The individual sentences are fine, but the combination feels unnatural

This is easier to move forward with than just saying “the summary is not very good.”
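
Once each reviewed summary carries one of these labels, counting them shows which problem to tackle first. The following is a small sketch; the label strings, the `doc_id` values, and the helper name `failure_counts` are assumptions for illustration, not a fixed schema.

```python
from collections import Counter

# Hypothetical human-review notes: one label per reviewed summary.
reviews = [
    {"doc_id": "course-01", "label": "missing_main_thread"},
    {"doc_id": "course-02", "label": "ok"},
    {"doc_id": "news-01", "label": "redundant"},
    {"doc_id": "news-02", "label": "missing_main_thread"},
    {"doc_id": "minutes-01", "label": "unnatural_combination"},
]


def failure_counts(rows):
    """Count failure labels, ignoring summaries marked as ok."""
    return Counter(r["label"] for r in rows if r["label"] != "ok")


counts = failure_counts(reviews)
for label, n in counts.most_common():
    print(f"{label}: {n}")
```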

An error bucket table that beginners can copy directly

Error type	Next improvement
Missing main-thread information	Sentence scoring rules
Repeated sentences	Redundancy removal strategy
Unnatural combination	Sentence ordering or generative rewriting

This table is helpful for beginners because it helps turn “the summary is not very good” back into concrete problems that can be improved.
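
For the "Repeated sentences" row, one simple redundancy removal strategy is to skip a sentence when it shares too many words with a sentence already kept. The sketch below uses Jaccard word overlap; this technique, the `threshold` value, and the helper names are assumptions chosen for illustration, not the only way to do it.

```python
import re


def word_set(sentence):
    """Lowercased set of alphanumeric words in a sentence."""
    return set(re.findall(r"[a-z0-9]+", sentence.lower()))


def jaccard(a, b):
    """Word-level Jaccard similarity between two sentences."""
    wa, wb = word_set(a), word_set(b)
    if not wa or not wb:
        return 0.0
    return len(wa & wb) / len(wa | wb)


def drop_redundant(ranked_sentences, threshold=0.5):
    """Walk sentences from best to worst and skip near-duplicates of kept ones."""
    kept = []
    for sent in ranked_sentences:
        if all(jaccard(sent, prev) < threshold for prev in kept):
            kept.append(sent)
    return kept


# Hypothetical sentences, already ordered from highest to lowest score.
ranked = [
    "The foundation stage includes Python and machine learning.",
    "The foundation stage includes Python and data analysis.",
    "System deployment matters for application engineering.",
]

kept = drop_redundant(ranked)
print(kept)
```

Here the second sentence shares 6 of 10 distinct words with the first, so it is dropped. A lower threshold removes more sentences and risks discarding ones that only look similar.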

How can you push this project toward portfolio quality?

Add a generative summarization comparison

Include more text types

For example:

news
course introductions
meeting minutes

Make a one-page before / after display

For example:

original text
baseline summary
tuned summary
failure analysis

What you should ideally include when delivering the project

Original text / summary comparison
Intermediate sentence score table
A set of failed summary examples
A short explanation of what you define as “key information”

If you turn it into a portfolio piece, what should you emphasize most?

What is usually most worth emphasizing is not:

“I built a summarization model”

but rather:

How your baseline selects sentences
How you define “keeping the main thread”
How you present intermediate sentence scores
What the main error cases are

This makes it easier for others to see that:

You understand the evaluation criteria of a summarization project
Not just that you shortened the text

If you keep going, what is most worth adding next?

The most worthwhile additions, in order, are usually:

More stable sentence scoring features
Better human evaluation criteria
A comparison page for extractive and generative summarization

Then your project can grow from “it runs” into “it can compare, explain, and present.”
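
As one possible direction for more stable scoring features, the sketch below scores sentences by word frequency instead of character frequency, ignores a few stopwords, and averages over sentence length. The stopword list, the function names, and the sample sentences are assumptions for illustration only.

```python
import re
from collections import Counter

# A tiny hand-picked stopword list; an assumption, not a standard resource.
STOPWORDS = {"the", "a", "an", "and", "or", "is", "are", "to", "of", "in", "for", "it"}


def content_words(text):
    """Lowercased alphabetic words with stopwords removed."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS]


def word_frequency_scores(sentences):
    """Score each sentence by the average document frequency of its content words."""
    freq = Counter(w for s in sentences for w in content_words(s))
    scores = []
    for s in sentences:
        words = content_words(s)
        scores.append(sum(freq[w] for w in words) / len(words) if words else 0.0)
    return scores


# Hypothetical sentences for illustration.
demo_sentences = [
    "Python is the foundation for machine learning.",
    "Machine learning is the foundation for deep learning.",
    "It is raining.",
]

demo_scores = word_frequency_scores(demo_sentences)
for sent, score in zip(demo_sentences, demo_scores):
    print(round(score, 2), sent)
```

Averaging removes most of the length bias, but it can favor very short sentences made of one frequent word, so it is worth comparing both versions in the failure analysis.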

Evidence to Keep

Keep this page’s proof of learning as a small evidence card:

Task Output: label, entity fields, summary, answer, retrieval result, or semantic graph
Artifacts: raw text, processed text, predictions, metrics, and failure cases
Metric: accuracy/F1, precision/recall, retrieval hit rate, faithfulness, or schema validity
Failure Check: unclear labels, over-cleaning, boundary errors, hallucination, or unsupported answer
Expected Output: reproducible text pipeline folder with metrics and examples

Summary

The most important takeaway from this section is to build a portfolio-level judgment:

The key to a summarization project is not just extracting a few sentences, but whether you can explain sentence splitting, scoring, generation, evaluation, and failure analysis as one explainable loop.

As long as this loop is clear, a text summarization project will feel very much like a mature NLP deliverable.

Suggested version roadmap

Version	Goal	Delivery focus
Basic version	Run the minimum loop	Can input, process, and output, while keeping a set of examples
Standard version	Become a presentable project	Add configuration, logging, error handling, a README, and screenshots
Advanced version	Approach portfolio quality	Add evaluation, comparison experiments, failure sample analysis, and a next-step roadmap

Finish the basic version first, and do not chase a large, all-in-one solution from the beginning. Each time you improve a version, record in the README what capability was added, how it was validated, and what problems remain.

Exercises

Change top_k to 1 and 3, and observe how the summary changes.
Why is it especially worthwhile for a summarization project to show the “intermediate scoring results”?
Think about it: what type of failure is extractive summarization most likely to have?
If you were to put this project into a portfolio, which 4 parts would you prioritize showing?

Project reference and review notes

With top_k=1, the summary becomes shorter and may miss context; with top_k=3, it includes more evidence but may become redundant.
Intermediate scoring is worth showing because it explains why each sentence was selected and makes failure analysis possible.
Extractive summarization often fails by missing cross-sentence context, selecting redundant sentences, or omitting a necessary condition.
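A portfolio version should prioritize four parts: the original text and summary comparison, the intermediate sentence score table, a set of failed summary examples, and a short explanation of what you define as “key information.”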