
10 Computer Vision (Elective Track)


This elective chapter answers a simple question: what does it mean for a model to see an image? Start with pixels, then move from coarse output to fine output: classify the whole image, locate objects, segment pixels, and finally connect vision to OCR, video, or multimodal systems.

If your main track is LLM apps and Agents, you can return later. If you care about OCR, industrial inspection, medical imaging, visual search, or multimodal products, study this chapter systematically.

Vision task granularity ladder

Ask three questions about the same image:

| Question | Task | Output |
| --- | --- | --- |
| What is this image mainly about? | Classification | one or more labels |
| Where is each object? | Detection | boxes, labels, confidence |
| Which pixels belong to which object or region? | Segmentation | masks or pixel classes |
| What text or visual meaning can be extracted? | OCR / visual understanding | text, tables, descriptions, answers |

Do the project after you understand the output type. The same image can become several different tasks.
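
To make the output types concrete, here is a minimal sketch of how the first three rows of the table could be represented as data. The class names, field names, the "cat" label, and the 4x4 example are assumptions made for illustration; real libraries use their own structures.

```python
from dataclasses import dataclass


@dataclass
class ClassificationOutput:
    labels: list[str]  # one or more labels for the whole image


@dataclass
class Detection:
    # (x_min, y_min, x_max, y_max) in pixels; the max edges are exclusive here.
    box: tuple[int, int, int, int]
    label: str
    confidence: float


@dataclass
class SegmentationOutput:
    mask: list[list[int]]  # one class id per pixel, same height and width as the image


# Hypothetical results for one 4x4 image that contains a single "cat".
classification = ClassificationOutput(labels=["cat"])
detections = [Detection(box=(1, 1, 3, 3), label="cat", confidence=0.91)]
segmentation = SegmentationOutput(mask=[
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [0, 1, 1, 0],
    [0, 0, 0, 0],
])

# Count the pixels that the mask assigns to class 1 ("cat").
cat_pixels = sum(value == 1 for row in segmentation.mask for value in row)
print("classification:", classification.labels)
print("detection:", detections[0].box, detections[0].label, detections[0].confidence)
print("segmentation cat pixels:", cat_pixels)
```

The same image yields one label, one box, and a per-pixel mask, which is why the project metric and the saved artifacts depend on the task you pick.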

| Step | Read | Do | Evidence to keep |
| --- | --- | --- | --- |
| 10.1 | Image basics and OpenCV | Inspect pixels, channels, resizing, grayscale, edges | input image, processed output |
| 10.2 | Classification | Run or train a small classifier | labels, accuracy/F1, failed images |
| 10.3 | Detection | Understand boxes, confidence, IoU, mAP, YOLO | prediction boxes and threshold notes |
| 10.4 | Segmentation | Understand masks and pixel-level labels | mask visualization and IoU/Dice notes |
| 10.5 | Advanced topics | Choose OCR, video, face, 3D, or medical direction only if needed | direction notes and scenario boundary |
| 10.6 | Stage project | Run 10.6.4 Hands-on: Build a Reproducible Vision Mini Pipeline | generated images, masks, boxes, metrics, failure report |
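
Step 10.3 mentions confidence and IoU. As a rough, dependency-free sketch, the snippet below filters predictions by a confidence threshold and computes the IoU of the surviving box against a ground-truth box. The boxes, the "part" label, and the 0.5 threshold are invented for illustration and are not values prescribed by this chapter.

```python
def box_iou(a, b):
    """Intersection over union of two boxes given as (x_min, y_min, x_max, y_max)."""
    inter_w = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0


# Hypothetical ground truth and model predictions for one image.
ground_truth = (10, 10, 50, 50)
predictions = [
    {"box": (12, 12, 48, 52), "label": "part", "confidence": 0.88},
    {"box": (60, 60, 90, 90), "label": "part", "confidence": 0.35},
]

CONF_THRESHOLD = 0.5  # assumed value; tune it per scenario
kept = [p for p in predictions if p["confidence"] >= CONF_THRESHOLD]

for p in kept:
    iou = box_iou(p["box"], ground_truth)
    print(p["label"], p["confidence"], "IoU:", round(iou, 3))
```

Raising the threshold removes weak boxes but can also drop correct detections of small objects, so it is worth recording the threshold next to the saved prediction boxes.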

First Runnable Loop: Inspect Pixels Without Dependencies


This zero-dependency lab creates a tiny color image, converts it to grayscale, and saves both as plain-text PPM and PGM files, which many image tools (for example GIMP) can open. It teaches the core idea: an image is structured numeric data.

Create ch10_pixel_lab.py and run it with Python 3.10 or later.

```python
from pathlib import Path

# Build an 8x8 RGB image: red grows left to right, green grows top to bottom.
width, height = 8, 8
pixels = [
    [(x * 32, y * 32, 128) for x in range(width)]
    for y in range(height)
]

# Convert each pixel to one grayscale value with a weighted sum of R, G, and B.
gray = [
    [round(0.299 * r + 0.587 * g + 0.114 * b) for r, g, b in row]
    for row in pixels
]

# Save both images in plain-text formats: P3 (color PPM) and P2 (gray PGM).
ppm_body = "\n".join(" ".join(f"{r} {g} {b}" for r, g, b in row) for row in pixels)
pgm_body = "\n".join(" ".join(str(value) for value in row) for row in gray)
Path("synthetic_rgb.ppm").write_text(f"P3\n{width} {height}\n255\n{ppm_body}\n")
Path("synthetic_gray.pgm").write_text(f"P2\n{width} {height}\n255\n{pgm_body}\n")

# Print a short summary of what was created.
print("size:", (width, height))
print("channels:", 3)
print("top_left_rgb:", pixels[0][0])
print("center_gray:", gray[height // 2][width // 2])
print("saved:", "synthetic_rgb.ppm", "synthetic_gray.pgm")
```

Expected output:

```text
size: (8, 8)
channels: 3
top_left_rgb: (0, 0, 128)
center_gray: 128
saved: synthetic_rgb.ppm synthetic_gray.pgm
```

Operation tip: change width, height, or the RGB formula. If the saved image changes, you are already doing image preprocessing. Later sections replace this tiny lab with OpenCV, Pillow, PyTorch, and detection or segmentation models.
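
As one possible variation on the lab, the sketch below rebuilds the same grayscale gradient and turns it into a black-and-white image with a fixed threshold. The cut-off value 128 and the file name synthetic_binary.pgm are assumptions chosen for this example.

```python
from pathlib import Path

width, height = 8, 8

# Same synthetic gradient as the pixel lab: R = x * 32, G = y * 32, B = 128.
gray = [
    [round(0.299 * (x * 32) + 0.587 * (y * 32) + 0.114 * 128) for x in range(width)]
    for y in range(height)
]

THRESHOLD = 128  # assumed cut-off; try other values and compare the saved files

# Pixels at or above the threshold become white (255), the rest black (0).
binary = [[255 if value >= THRESHOLD else 0 for value in row] for row in gray]

body = "\n".join(" ".join(str(value) for value in row) for row in binary)
Path("synthetic_binary.pgm").write_text(f"P2\n{width} {height}\n255\n{body}\n")

white = sum(value == 255 for row in binary for value in row)
print("white_pixels:", white, "of", width * height)
print("saved:", "synthetic_binary.pgm")
```

Thresholding is a preprocessing step in the same sense as the grayscale conversion: the numbers change, so the saved file changes.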

  • size and channels tell you the shape of the image data before any model sees it.
  • top_left_rgb is a real pixel value, not a description of the picture.
  • center_gray proves that preprocessing changed RGB data into a single grayscale number.
  • The saved files are evidence artifacts. If you cannot show the before/after files, the preprocessing step is hard to debug.

| Level | What you can prove |
| --- | --- |
| Minimum pass | You can run the pixel lab and explain image size, channels, RGB values, grayscale conversion, and saved output. |
| Project-ready | You can choose the right task output, keep original, processed, and prediction images, report the right metric, and save failure samples. |
| Deeper check | You can trace a wrong result to data, annotation, preprocessing, model, threshold, metric, or deployment constraint before changing architecture. |

Vision pipeline and failure review loop

When a vision model is wrong, inspect the input and labels before blaming the architecture.

| Symptom | Print or visualize first | Likely fix |
| --- | --- | --- |
| Classification is unstable | misclassified images and class counts | clean data, rebalance classes, adjust augmentation |
| Small objects are missed | image resolution, boxes, confidence threshold | improve labels, increase resolution, tune threshold |
| Segmentation boundary is rough | mask overlaid on the original image | improve annotation, use suitable IoU/Dice metrics |
| Demo images work but real images fail | lighting, angle, background, camera source | add real samples and scenario notes |
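
For the first row of the table, a small sketch of "print misclassified images and class counts" might look like this. The file names, labels, and predictions are made-up evaluation records, not results from a real model.

```python
from collections import Counter

# Hypothetical evaluation records: (file name, true label, predicted label).
records = [
    ("img_001.png", "scratch", "scratch"),
    ("img_002.png", "dent", "scratch"),
    ("img_003.png", "scratch", "scratch"),
    ("img_004.png", "dent", "dent"),
    ("img_005.png", "scratch", "scratch"),
    ("img_006.png", "ok", "scratch"),
]

# How many samples each class has: a quick check for class imbalance.
class_counts = Counter(true for _, true, _ in records)

# Keep every wrong prediction so the failed images can be opened and reviewed.
failures = [(name, true, pred) for name, true, pred in records if true != pred]
confusions = Counter((true, pred) for _, true, pred in failures)
accuracy = 1 - len(failures) / len(records)

print("class_counts:", dict(class_counts))
print("accuracy:", round(accuracy, 3))
for name, true, pred in failures:
    print("failed:", name, "true =", true, "pred =", pred)
print("most_common_confusion:", confusions.most_common(1))
```

In this toy data every failure is predicted as the largest class, which is the kind of pattern that points to rebalancing before any change of architecture.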

Keep this page’s proof of learning as a small evidence card:

Task Output: classification label, detection box, segmentation mask, OCR text, or video event
Artifacts: original image, processed image, prediction overlay, metrics file, and failure samples
Metric: accuracy/F1, mAP, IoU, Dice, latency, or scenario-specific review score
Failure Check: data quality, label error, preprocessing mismatch, threshold, or deployment constraint
Expected Output: a reproducible run folder with visual outputs and a short failure report
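
One way to keep the evidence card next to the artifacts is to write it as a JSON file inside the run folder. This is only a sketch: the folder name runs/ch10_demo, the file names, and the JSON keys are assumptions, not a format required by the chapter.

```python
import json
from pathlib import Path

# Hypothetical run folder; use one folder per experiment run.
run_dir = Path("runs") / "ch10_demo"
run_dir.mkdir(parents=True, exist_ok=True)

evidence_card = {
    "task_output": "segmentation mask",
    # Placeholder artifact names; list the files you actually saved.
    "artifacts": ["original.ppm", "processed.pgm", "overlay.ppm", "metrics.json", "failures/"],
    "metric": {"name": "IoU", "value": None},  # fill in after evaluation
    "failure_check": [
        "data quality",
        "label error",
        "preprocessing mismatch",
        "threshold",
        "deployment constraint",
    ],
    "expected_output": "a reproducible run folder with visual outputs and a short failure report",
}

card_path = run_dir / "evidence_card.json"
card_path.write_text(json.dumps(evidence_card, indent=2), encoding="utf-8")
print("saved:", card_path)
```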

Common mistakes to avoid:

  • Chasing model names before checking data quality.
  • Reporting accuracy without saving failed images.
  • Mixing classification, detection, and segmentation outputs.
  • Using augmentation that changes the meaning of labels.
  • Ignoring deployment constraints such as image size, latency, and device memory.
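
The augmentation point can be illustrated with a horizontal flip. In this sketch, a detection box has to be mirrored together with the image or the label no longer matches the pixels. The helper name flip_horizontal, the toy image, and the box convention (exclusive x_max) are assumptions for the example.

```python
def flip_horizontal(image, boxes):
    """Mirror an image (a list of pixel rows) and its boxes left to right.

    Boxes are (x_min, y_min, x_max, y_max) with x_max and y_max exclusive.
    """
    width = len(image[0])
    flipped_image = [list(reversed(row)) for row in image]
    flipped_boxes = [
        (width - x_max, y_min, width - x_min, y_max)
        for x_min, y_min, x_max, y_max in boxes
    ]
    return flipped_image, flipped_boxes


# Toy 6x4 grayscale image with one bright 2x2 object near the left edge.
image = [
    [0, 0, 0, 0, 0, 0],
    [0, 9, 9, 0, 0, 0],
    [0, 9, 9, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
]
boxes = [(1, 1, 3, 3)]

flipped_image, flipped_boxes = flip_horizontal(image, boxes)
print("original box:", boxes[0])
print("flipped box:", flipped_boxes[0])
```

A flip is harmless for many object classes but changes the meaning of labels such as text, left/right arrows, or "left lung", so whether it is a valid augmentation depends on the scenario.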

Before leaving this elective, you should be able to:

  • explain classification, detection, segmentation, OCR, and visual understanding by their outputs;
  • run the pixel lab and explain image size, channel, RGB value, and grayscale value;
  • keep input images, processed images, predictions, metrics, and failure samples;
  • choose suitable metrics such as accuracy/F1, mAP, IoU, or Dice;
  • run the reproducible vision mini pipeline and write a short failure analysis.
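
For the segmentation metrics named in the list above, the following sketch computes IoU and Dice for two small binary masks in plain Python. The masks are invented, and returning 1.0 when both masks are empty is a convention chosen here, not a universal rule.

```python
def mask_iou_and_dice(pred, truth):
    """Return (IoU, Dice) for two binary masks of the same shape."""
    pairs = [(p, t) for pred_row, truth_row in zip(pred, truth)
             for p, t in zip(pred_row, truth_row)]
    inter = sum(1 for p, t in pairs if p and t)
    pred_area = sum(1 for p, _ in pairs if p)
    truth_area = sum(1 for _, t in pairs if t)
    union = pred_area + truth_area - inter
    total = pred_area + truth_area
    iou = inter / union if union else 1.0      # both masks empty: treated as a match
    dice = 2 * inter / total if total else 1.0
    return iou, dice


# Hypothetical ground-truth and predicted masks (1 = object, 0 = background).
truth = [
    [0, 1, 1, 0],
    [0, 1, 1, 0],
    [0, 0, 0, 0],
]
pred = [
    [0, 1, 1, 1],
    [0, 1, 0, 0],
    [0, 0, 0, 0],
]

iou, dice = mask_iou_and_dice(pred, truth)
print("IoU:", round(iou, 3), "Dice:", round(dice, 3))
```

Dice is never lower than IoU for the same pair of masks, so the two numbers should not be compared against each other across reports.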

For a printable checklist, use 10.0 Learning Checklist. For the guided project, start with 10.6.4 Hands-on: Build a Reproducible Vision Mini Pipeline.

Check reasoning and explanation
  1. A passing answer maps the task to the right visual output: class label, bounding box, mask, OCR text, embedding, or video event.
  2. The evidence should include a rendered visual artifact and one metric or qualitative error note.
  3. A good self-check names one visual failure mode such as class confusion, missed objects, bad masks, lighting shift, domain shift, or weak annotation quality.