12 AIGC and Multimodal

Main visual for AIGC and Multimodal

Chapter 12 is the final expansion: AI is no longer only text. Images, PDFs, audio, video, screenshots, charts, and generated assets can all enter the same product workflow.

Do not chase every new demo. First learn how to turn non-text inputs into structured records, connect them to RAG or Agents, generate or edit assets, review risks, and export something usable.

See the Multimodal Workflow

Multimodal workflow loop

Use this workflow as the chapter map.

| Layer | What happens | Evidence to keep |
|---|---|---|
| Input | text, screenshot, image, PDF, audio, video | source file, owner, license, version |
| Parse / align | OCR, layout parsing, visual understanding, transcript | structured record, page/region/time reference |
| Understand / generate | answer, caption, image, voice, storyboard, video plan | prompt, model, output, candidate versions |
| Edit / review | human selection, factual check, copyright and portrait checks | review checklist, rejected versions, reason |
| Export / integrate | RAG index, Agent trace, creative package, demo | README, export file, limitations, next step |
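As a minimal sketch of the table above, the "evidence to keep" column can be modeled as a per-layer checklist. All field names and values here are illustrative, not a fixed schema:

```python
# Illustrative per-layer evidence checklist for one multimodal workflow run.
# Every field name and value below is a made-up example.
evidence = {
    "input": {"source_file": "brief.pdf", "owner": "course team", "license": "CC BY 4.0", "version": "v1"},
    "parse_align": {"structured_record": True, "reference": "page 3, upper table"},
    "understand_generate": {"prompt": "caption-v2", "model": "example-model", "candidates": 3},
    "edit_review": {"checklist_done": True, "rejected": 2, "reason": "off-brand colors"},
    "export_integrate": {"readme": True, "export_file": "package.zip", "limitations": "no audio yet"},
}

# A run is traceable only when every layer left some evidence behind.
layers = ["input", "parse_align", "understand_generate", "edit_review", "export_integrate"]
traceable = all(evidence.get(layer) for layer in layers)
print("traceable:", traceable)  # → traceable: True
```

If any layer's entry is missing or empty, `traceable` flips to `False`, which is exactly the signal to stop and backfill evidence before moving on.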

Learning Order And Task List

Make one small workflow traceable before trying video or full creative platforms.

| Step | Read | Do | Evidence to keep |
|---|---|---|---|
| 12.1 | Multimodal basics | turn one screenshot or image into a structured record | source, visible text, objects, uncertainty |
| 12.2 | Image generation | record prompts, references, negative requirements, selected output | prompt versions and review notes |
| 12.3 | Video, speech, digital humans | understand storyboard, voice, shot, subtitle, timing | storyboard and asset list |
| 12.4 | Ethics and compliance | check copyright, portrait rights, sensitive content, factual risk | safety review checklist |
| 12.5 | Stage project | run 12.5.3 Hands-on: Build a Reproducible Multimodal Creative Package | brief, prompts, assets, storyboard, review, export preview |

First Runnable Loop: Structure A Visual Input

This offline script simulates the first engineering step of a multimodal system: after a model or human reads an image, the result must become a structured, checkable record.

Create ch12_visual_record.py and run it with Python 3.10 or later.

```python
# ch12_visual_record.py — turn one visual input into a structured, checkable record
visual_record = {
    "source": "course-slide-01.png",
    "content_type": "course screenshot",
    "visible_text": ["RAGOps", "evaluation set", "Trace", "cost monitoring"],
    "objects": ["flowchart", "table"],
    "uncertainty": ["small text in the lower-right corner is unclear"],
    "next_step": "write into the multimodal RAG index for the course Q&A assistant to cite",
}

# A record is RAG-ready only if every required field is present
# and the model (or human) actually read some text.
required_fields = {"source", "content_type", "visible_text", "objects", "uncertainty", "next_step"}
missing = required_fields - visual_record.keys()
rag_ready = not missing and bool(visual_record["visible_text"])

print("source:", visual_record["source"])
print("visible_text_count:", len(visual_record["visible_text"]))
print("uncertainty_count:", len(visual_record["uncertainty"]))
print("rag_ready:", rag_ready)
```

Expected output:

```
source: course-slide-01.png
visible_text_count: 4
uncertainty_count: 1
rag_ready: True
```

Visual record RAG-ready result map

Operation tip: add page, region, or timestamp fields so each record points back to an exact location in its source. If the record can be cited later, it can enter multimodal RAG; if it cannot be checked or cited, it should stay in review.
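The tip above can be sketched as a small routing helper. The locator field names (`page`, `region`, `timestamp`) are illustrative assumptions, not a required schema:

```python
# Hypothetical extension of the visual record with citation locators.
# Field names (page, region, timestamp) are illustrative, not a fixed schema.
cited_record = {
    "source": "course-slide-01.png",
    "page": 1,
    "region": "upper-left flowchart",
    "timestamp": None,  # only set for audio/video sources
    "visible_text": ["RAGOps", "evaluation set"],
}

def is_citable(record: dict) -> bool:
    """A record can be cited if it names its source and at least one locator."""
    locators = ("page", "region", "timestamp")
    return bool(record.get("source")) and any(record.get(k) is not None for k in locators)

# Route citable records to the index; everything else stays in review.
route = "index" if is_citable(cited_record) else "review"
print("citable:", is_citable(cited_record))  # → citable: True
print("route:", route)                       # → route: index
```

A record without any locator still gets routed to `"review"` even when its source is known, which keeps uncheckable content out of the RAG index.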

Depth Ladder

| Level | What you can prove |
|---|---|
| Minimum pass | You can turn one screenshot, image, PDF, audio, or video note into a structured record with source and uncertainty. |
| Project-ready | You can preserve source references, prompt versions, candidate outputs, review decisions, and export files. |
| Deeper check | You can connect multimodal records to RAG or Agent while enforcing copyright, portrait, sensitive-content, factual, latency, and cost boundaries. |

Connect Multimodal To RAG, Agent, And Creative Work

Multimodal RAG Agent capstone map

Multimodal is not separate from the main track.

| Main-track skill | Multimodal extension |
|---|---|
| RAG | retrieve PDF pages, screenshots, charts, image captions, and text chunks with citations |
| Agent | observe screenshots or documents, choose tools, and leave traceable actions |
| Prompt | create image, voice, storyboard, and review prompts with version records |
| Engineering | track assets, licenses, reviews, export files, latency, and cost |
| Capstone | build a multimodal learning assistant or creative workspace |
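For the RAG row above, a retrieved multimodal chunk can carry its citation alongside the text, so the answer can point at a page or region instead of just a string. This is a sketch; the chunk schema is assumed, not prescribed:

```python
# Hypothetical multimodal RAG chunk: the retrieved text keeps a citation
# back to the screenshot region it came from.
chunk = {
    "text": "RAGOps requires an evaluation set and cost monitoring.",
    "modality": "screenshot",
    "citation": {"source": "course-slide-01.png", "region": "flowchart"},
}

def format_citation(c: dict) -> str:
    """Render a chunk as answer text followed by its source reference."""
    cite = c["citation"]
    return f'{c["text"]} [{cite["source"]}, {cite["region"]}]'

print(format_citation(chunk))
# → RAGOps requires an evaluation set and cost monitoring. [course-slide-01.png, flowchart]
```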

Common Failures

  • Treating AIGC as "one pretty output" instead of a workflow.
  • Losing source references after OCR, PDF parsing, or screenshot understanding.
  • Comparing generated results without prompt and version records.
  • Skipping human review for copyright, portrait rights, sensitive content, or factual risk.
  • Starting with video generation before the storyboard, assets, and review rules are clear.
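The prompt-versioning failure above is cheap to avoid: a minimal version log is enough to make comparisons between generated candidates meaningful. The record fields here are an illustrative assumption:

```python
# Illustrative prompt version log: every generated candidate points
# back to a prompt version, so "which prompt won" is always answerable.
prompt_versions = [
    {"id": "p1", "prompt": "poster, flat style", "negative": "no text artifacts", "selected": False},
    {"id": "p2", "prompt": "poster, flat style, blue palette", "negative": "no text artifacts", "selected": True},
]

selected = [v["id"] for v in prompt_versions if v["selected"]]
print("versions:", len(prompt_versions))  # → versions: 2
print("selected:", selected)              # → selected: ['p2']
```

Keeping the rejected versions (`selected: False`) in the log is deliberate: the review reasons behind them are evidence for the next iteration.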

Pass Check

Before finishing the course, you should be able to:

  • explain how text, images, PDFs, audio, and video enter one workflow;
  • run the visual record script and add source references such as page, region, or timestamp;
  • preserve prompts, assets, selected outputs, rejected outputs, and review reasons;
  • connect a multimodal record to RAG, Agent, or a creative package;
  • run the multimodal workshop and keep a README, review checklist, export preview, and failure cases.

For a printable checklist, use 12.0 Learning Checklist. For the guided final project, start with 12.5.3 Hands-on: Build a Reproducible Multimodal Creative Package.