10.4.3 Instance Segmentation

Learning objectives

Understand the difference between instance segmentation and semantic segmentation
Understand why “category” and “instance” are two different levels
Build intuition for instance masks through runnable examples
Understand why instance segmentation is closer to real-world visual scenes

First, build a map

For beginners, the best way to understand instance segmentation is not “it’s just one more segmentation task,” but to first see clearly:

flowchart LR
    A["Semantic segmentation"] --> B["Only separates categories"]
    C["Instance segmentation"] --> D["Also separates individual objects of the same category"]
    D --> E["Better for counting, tracking, and interaction scenarios"]

So what this section really wants to solve is:

Why “the category is correct” is still not enough
Why “separating same-category individuals” significantly increases the difficulty

What does instance segmentation add beyond semantic segmentation?

Semantic segmentation:

Only separates categories

Instance segmentation:

Category + individual distinction

In other words, two “person” objects in the image should not be merged into one whole.

Three things beginners should distinguish first

When learning instance segmentation for the first time, the most important things to remember are:

Semantic segmentation answers “what category is this”
Instance segmentation also answers “which individual is this”
So the latter is naturally closer to real multi-object scenarios

A comparison table that is easier for beginners

When many beginners first learn this topic, it is easy to mix up classification, detection, semantic segmentation, and instance segmentation. The safest way is to place them in the same table and compare them:

Task	What it outputs	Core question
Classification	One category for the whole image	What is this image overall
Detection	Category + box	Where is the object
Semantic segmentation	Category mask	Which pixels belong to which category
Instance segmentation	Category mask + individual distinction	How do we separate same-category objects one by one

This table is especially valuable because it helps you immediately see:

Instance segmentation is not just “more detailed semantic segmentation”
It actually handles pixel-level understanding and instance-level distinction at the same time

Start with a minimal instance mask example

instance_map = [
    [0, 1, 1, 0],
    [0, 1, 1, 2],
    [0, 0, 0, 2],
]


def pixels_of_instance(instance_map, target_id):
    pixels = []
    for r, row in enumerate(instance_map):
        for c, value in enumerate(row):
            if value == target_id:
                pixels.append((r, c))
    return pixels


print("instance 1:", pixels_of_instance(instance_map, 1))
print("instance 2:", pixels_of_instance(instance_map, 2))

Expected output:

instance 1: [(0, 1), (0, 2), (1, 1), (1, 2)]
instance 2: [(1, 3), (2, 3)]

The two instance IDs are not class labels; they are individual object IDs. Both could belong to the same class, but the system still needs to keep them separate.

What is the most important thing in this example?

It shows that instance segmentation does not just output a category ID, it also distinguishes:

the 1st instance
the 2nd instance

This is very important in counting, tracking, and interaction scenarios.

Why is instance segmentation especially suitable for security and autonomous driving?

Because in these scenarios, the system often does not only care about:

whether there is a person in the scene

It cares more about:

how many people there are
which targets are very close to each other
whether these individuals can be tracked afterward

In other words, instance segmentation is naturally closer to a visual representation for downstream decision-making.

Look at another minimal “count + area” example

One reason instance segmentation is so valuable in real systems is that it can not only tell you “how many targets there are,” but also make it easy to compute:

how large each target is
which target is closer to the boundary
which targets overlap with each other

The example below gives you a minimal way to feel this “instance-level statistics”:

instance_map = [
    [0, 1, 1, 0, 2],
    [0, 1, 1, 0, 2],
    [0, 0, 0, 0, 2],
]


def instance_area(instance_map, target_id):
    area = 0
    for row in instance_map:
        for value in row:
            if value == target_id:
                area += 1
    return area


for target_id in [1, 2]:
    print(target_id, "area =", instance_area(instance_map, target_id))

Expected output:

1 area = 4
2 area = 3

Once each object has its own mask, you can count objects and compute area. That is one reason instance segmentation is valuable in inspection, tracking, and planning systems.

The most important thing to grasp here is not the code itself, but that:

once individuals are separated
many statistics and decisions naturally become available afterward

Instance separation, counting, and area diagram

The most common pitfalls

Adjacent same-category instances easily stick together

This is a very common error in instance segmentation.

Small instances are harder

The smaller and denser the objects are, the harder they are to separate.

Evaluation is more complex than semantic segmentation

Because now you not only need to check mask quality, you also need to check whether the instances are correctly separated.

The safest default workflow for your first instance segmentation project

When you first bring instance segmentation into a project, it is better to move forward in this order:

First confirm that the task really needs same-category objects to be separated
First inspect a small number of samples manually to see whether instance boundaries are clear
First build a visualization baseline to see whether instances are being merged
Then check whether mask quality and instance separation both hold
Finally consider more complex models and finer-grained evaluation

This is safer than chasing a complex network from the start, because the biggest problems in instance segmentation are often not:

the model is not complex enough

but rather:

the annotation boundaries themselves are unclear
the task requirements are not defined clearly

The right expectation when learning this section for the first time

The most important thing in this section is not to master a complex instance segmentation network today, but to clearly understand:

Why semantic segmentation and instance segmentation are not the same thing
Why neighboring same-category objects become the real difficulty
Why, once this task is done well, it becomes especially valuable for counting, interaction, and tracking

Evidence to Keep

Keep this page’s proof of learning as a small evidence card:

Input Image: original image and target mask or class map
Prediction: predicted mask, overlay visualization, and boundary examples
Metric: IoU, Dice, per-class score, and boundary failure notes
Failure Check: annotation quality, thin boundary, small region, or class confusion
Expected Output: mask overlay plus segmentation metric summary

Summary

The most important idea in this section is to build one judgment:

Instance segmentation solves one more layer of the problem than semantic segmentation: how to distinguish among objects of the same category, so it is closer to real multi-object visual scenarios.

What you should take away from this section

Instance segmentation adds “individual separation” on top of semantic segmentation
The difficulty is often not the category, but the boundary between neighboring same-category objects
If downstream tasks need counting, tracking, or interaction, instance segmentation is often especially valuable

Exercises

Create a larger instance_map by yourself and mark 3 instances.
Why is instance segmentation harder than semantic segmentation?
If two adjacent objects are always merged into one instance, what would you suspect first?
Think about it: why is instance segmentation especially valuable in autonomous driving or security scenarios?

Solution approach and explanation

A correct instance_map usually uses 0 for background and a different integer id for each object instance. Three objects should have three distinct ids.
Instance segmentation is harder because the model must classify pixels and separate different objects of the same class.
If adjacent objects are always merged, first suspect boundary annotation, image resolution, post-processing, or whether the model has enough separation cues.
Instance segmentation is valuable when the system must count, track, or act on individual objects, such as pedestrians, vehicles, packages, or people in security scenes.