6.8.4 Project: Generative Models in Practice [Optional]

Learning Objectives

Explain why generative projects need different evaluation from classification.
Track quality and diversity together.
Build a small checkpoint review table.
Identify mode collapse and blurry-output failure modes.
Package generated samples as project evidence.

See the Evaluation Loop First

Generative model project evaluation loop:

train -> sample checkpoints -> review quality + diversity -> keep failures -> choose next step

For a practice project, choose a generation target that is:

visually inspectable;
small enough to train or simulate;
easy to compare across checkpoints.

Digits, icons, simple shapes, or tiny grayscale patterns are better first projects than open-ended photorealistic generation.

Lab: Checkpoint Review Dashboard

Create generative_review_dashboard.py:

# Hand-assigned review scores (0 to 1) for samples drawn at each checkpoint.
checkpoints = [
    {"epoch": 1, "quality": 0.20, "diversity": 0.80, "note": "mostly noise"},
    {"epoch": 10, "quality": 0.45, "diversity": 0.72, "note": "outlines appear"},
    {"epoch": 30, "quality": 0.68, "diversity": 0.60, "note": "usable and varied"},
    {"epoch": 60, "quality": 0.75, "diversity": 0.48, "note": "possible collapse"},
]

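# Review thresholds; adjust them to fit your project.
MIN_QUALITY = 0.60
MIN_DIVERSITY = 0.55

print("generation_review")
for row in checkpoints:
    # A checkpoint is a candidate only if it clears both thresholds.
    is_candidate = row["quality"] >= MIN_QUALITY and row["diversity"] >= MIN_DIVERSITY
    status = "candidate" if is_candidate else "review"
    print(
        f"epoch={row['epoch']:03d} "
        f"quality={row['quality']:.2f} "
        f"diversity={row['diversity']:.2f} "
        f"status={status}"
    )

# Selection rule: highest quality among checkpoints that keep enough diversity.
selected = max(
    [row for row in checkpoints if row["diversity"] >= MIN_DIVERSITY],
    key=lambda row: row["quality"],
)
print("selected_epoch:", selected["epoch"])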

Run it:

python generative_review_dashboard.py

Expected output:

generation_review
epoch=001 quality=0.20 diversity=0.80 status=review
epoch=010 quality=0.45 diversity=0.72 status=review
epoch=030 quality=0.68 diversity=0.60 status=candidate
epoch=060 quality=0.75 diversity=0.48 status=review
selected_epoch: 30

Figure: checkpoint review result map for generative models.

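Why not pick epoch 60? Its quality is higher (0.75 versus 0.68 at epoch 30), but its diversity (0.48) falls below the 0.55 threshold, which suggests the model is starting to repeat itself. A good generative project does not select a checkpoint on its best-looking samples alone.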
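
The selection rule from the lab can also be wrapped in a small function that fails safely when no checkpoint meets the diversity floor (the lab's `max(...)` call would raise `ValueError` on an empty list). This is a minimal sketch: the name `select_checkpoint` and the `min_diversity` parameter are illustrative choices, not part of the lab script.

```python
def select_checkpoint(checkpoints, min_diversity=0.55):
    """Return the highest-quality checkpoint that meets the diversity floor, or None."""
    eligible = [row for row in checkpoints if row["diversity"] >= min_diversity]
    if not eligible:
        return None  # nothing passed; relax the floor or keep training
    return max(eligible, key=lambda row: row["quality"])


checkpoints = [
    {"epoch": 30, "quality": 0.68, "diversity": 0.60},
    {"epoch": 60, "quality": 0.75, "diversity": 0.48},
]

best = select_checkpoint(checkpoints)
print(best["epoch"])  # 30: epoch 60 is excluded by the 0.55 diversity floor
print(select_checkpoint(checkpoints, min_diversity=0.9))  # None: nothing passes
```

Returning `None` instead of raising keeps the "no acceptable checkpoint" case visible in the review table rather than crashing the script.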

What to Save

Evidence	Why
samples by checkpoint	shows training progression
failure samples	reveals limits honestly
diversity notes	catches repeated outputs
quality notes	explains visual improvements
training logs	shows stability or collapse
final selection rule	makes the choice reproducible

Quality, Diversity, Stability

Dimension	Good sign	Warning sign
Quality	samples look like target data	noisy, blurry, broken structure
Diversity	samples vary meaningfully	repeated outputs or one dominant style
Stability	checkpoints improve gradually	sudden collapse or oscillation
Interpretability	failures are documented	only best samples are shown

The common trade-off:

best-looking single sample != best project checkpoint
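
The "repeated outputs" warning sign can be turned into a rough number. The sketch below is one possible proxy, assuming each sample is a small sequence of pixel values: it counts only exact duplicates. Real model outputs rarely repeat exactly, so a real project would more likely compare samples with a distance threshold. The function name `unique_fraction` and the toy samples are made up for illustration.

```python
def unique_fraction(samples):
    """Crude diversity proxy: share of samples that are not exact duplicates."""
    if not samples:
        return 0.0
    distinct = {tuple(sample) for sample in samples}
    return len(distinct) / len(samples)


# Toy 4-pixel "images"; the values are illustrative only.
varied = [(0, 1, 1, 0), (1, 0, 0, 1), (1, 1, 0, 0), (0, 0, 1, 1)]
collapsed = [(0, 1, 1, 0), (0, 1, 1, 0), (0, 1, 1, 0), (1, 0, 0, 1)]

print(unique_fraction(varied))     # 1.0
print(unique_fraction(collapsed))  # 0.5
```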

Project Upgrade Path

Version	What to add
basic	one model, fixed sampling seed, checkpoint samples
standard	quality/diversity table and failure samples
challenge	compare VAE, GAN, or diffusion-style outputs
portfolio	clear story: data, model, samples, failures, next step

Evidence to Keep

A generative project should leave this minimum evidence:

Checkpoint Samples: fixed-seed samples across epochs
Quality Note: what improved visually
Diversity Note: whether outputs repeat
Failure Sample: blurry, broken, collapsed, or unrealistic output
Selection Rule: why this checkpoint was kept
Next Action: data, objective, architecture, or sampling change
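
One way to keep this evidence honest is to store it as a record and check for gaps before writing the project up. The following is a sketch under assumptions: the field names mirror the list above, while `missing_evidence`, the file path, and the note text are placeholders, not a prescribed format.

```python
import json

REQUIRED_FIELDS = [
    "checkpoint_samples",
    "quality_note",
    "diversity_note",
    "failure_sample",
    "selection_rule",
    "next_action",
]


def missing_evidence(record):
    """Return the required fields that are absent or empty in the record."""
    return [field for field in REQUIRED_FIELDS if not record.get(field)]


record = {
    "checkpoint_samples": "samples/epoch_030.png",  # placeholder path
    "quality_note": "placeholder: what improved visually",
    "diversity_note": "placeholder: whether outputs repeat",
    "failure_sample": "",  # not collected yet
    "selection_rule": "max quality with diversity >= 0.55",
    "next_action": "",  # not decided yet
}

print(missing_evidence(record))  # ['failure_sample', 'next_action']
print(json.dumps(record, indent=2))
```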

Common Mistakes

Mistake	Fix
showing only best samples	show average and failure samples too
ignoring diversity	track repeated outputs or unique patterns
comparing checkpoints by memory	use the same fixed seed set
starting with an overly complex dataset	start with small visual targets
not explaining model choice	state why VAE, GAN, or another method fits the goal
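
The "same fixed seed set" fix can be sketched with the standard library alone. This assumes a latent-variable model (such as a VAE decoder or GAN generator) that maps a noise vector to a sample; the helper `fixed_latents` and its parameters are hypothetical, and a real framework would normally use its own seeded random generator instead.

```python
import random


def fixed_latents(seed, n_samples, latent_dim):
    """Draw the same list of Gaussian latent vectors every time for a given seed."""
    rng = random.Random(seed)  # private generator; global random state is untouched
    return [
        [rng.gauss(0.0, 1.0) for _ in range(latent_dim)]
        for _ in range(n_samples)
    ]


# Reuse the same inputs at every checkpoint so that differences between
# sample grids come from the model, not from the noise.
latents_epoch_10 = fixed_latents(seed=0, n_samples=4, latent_dim=8)
latents_epoch_30 = fixed_latents(seed=0, n_samples=4, latent_dim=8)
print(latents_epoch_10 == latents_epoch_30)  # True
```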

Exercises

1. Add an epoch 90 with quality 0.80 and diversity 0.30. Should it be selected?
2. Add a failure field to each checkpoint.
3. Write a 4-row table for your own generative project idea.
4. Explain mode collapse using the checkpoint table.
5. Draft a portfolio section titled “Why I selected this checkpoint.”

Project reference and review notes

1. Usually no, unless the project values quality far more than diversity. A diversity score of 0.30 is well below the lab's 0.55 threshold and is a warning sign for repeated or narrow outputs.
2. The failure field should name visible problems such as repetition, artifacts, prompt mismatch, unsafe output, or poor diversity.
3. A useful table has rows for idea, data/source, evaluation signal, and main risk. The table should help someone judge whether the project can be evaluated.
4. Mode collapse means the model produces a small set of similar outputs. In the checkpoint table, it shows up as acceptable quality with low diversity, as at epoch 60 (quality 0.75, diversity 0.48).
5. The portfolio section should justify the selected checkpoint with evidence: quality, diversity, failure notes, sample outputs, and why rejected checkpoints were weaker.

Key Takeaways

Generative projects need evaluation stories, not just galleries.
Quality and diversity must be read together.
Failure samples make the project more credible.
A clear checkpoint selection rule is part of the deliverable.