
12 AIGC and Multimodal


Chapter 12 is the multimodal expansion: AI is no longer only text. Images, PDFs, audio, video, screenshots, charts, and generated assets can all enter the same product workflow.

Do not chase every new demo. First learn how to turn non-text inputs into structured records, connect them to RAG or Agents, generate or edit assets, review risks, and export something usable.

Multimodal workflow loop

Use this workflow as the chapter map.

| Layer | What happens | Evidence to keep |
| --- | --- | --- |
| Input | text, screenshot, image, PDF, audio, video | source file, owner, license, version |
| Parse / align | OCR, layout parsing, visual understanding, transcript | structured record, page/region/time reference |
| Understand / generate | answer, caption, image, voice, storyboard, video plan | prompt, model, output, candidate versions |
| Edit / review | human selection, factual check, copyright and portrait checks | review checklist, rejected versions, reason |
| Export / integrate | RAG index, Agent trace, creative package, demo | README, export file, limitations, next step |

Make one small workflow traceable before trying video or full creative platforms.
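One way to make a small workflow traceable is to log one step per layer and check which layers still lack evidence. The sketch below is only an illustration: the layer names follow the table above, while `TraceStep`, `missing_layers`, and the sample values are hypothetical names chosen for this example.

```python
from dataclasses import dataclass, field

# Layer names mirror the workflow table; everything else here is hypothetical.
LAYERS = ["input", "parse_align", "understand_generate", "edit_review", "export_integrate"]


@dataclass
class TraceStep:
    layer: str
    action: str
    evidence: dict = field(default_factory=dict)


def missing_layers(trace: list[TraceStep]) -> list[str]:
    """Return the layers that have no step with non-empty evidence."""
    covered = {step.layer for step in trace if step.evidence}
    return [layer for layer in LAYERS if layer not in covered]


# A partial trace: the asset has been parsed and captioned but not reviewed or exported.
trace = [
    TraceStep("input", "received screenshot", {"source": "course-slide-01.png", "license": "internal"}),
    TraceStep("parse_align", "OCR and layout parsing", {"record_id": "rec-001", "region": "full slide"}),
    TraceStep("understand_generate", "drafted caption", {"prompt_version": "v1", "candidates": 2}),
]

print("missing layers:", missing_layers(trace))
```

With this sample trace the script reports `edit_review` and `export_integrate` as missing, which is the signal that the workflow is not yet traceable end to end.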

  1. 12.1 Multimodal basics. Turn one screenshot or image into a structured record. Keep the source, visible text, objects, and uncertainty notes.

  2. 12.2 Image generation. Record prompts, references, negative requirements, and selected output. Keep prompt versions and review notes.

  3. 12.3 Video, speech, digital humans. Understand storyboard, voice, shot, subtitle, and timing. Keep the storyboard and asset list.

  4. 12.4 Ethics and compliance. Check copyright, portrait rights, sensitive content, and factual risk. Keep the safety review checklist.

  5. 12.5 Stage project. Run 12.5.3 Hands-on: Build a Reproducible Multimodal Creative Package. Keep the brief, prompts, assets, storyboard, review, and export preview.
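The "keep prompt versions and review notes" habit from 12.2 can be sketched as a small version log. This is a minimal illustration, not a prescribed format: `PromptVersion`, `selected_version`, the prompt texts, and the output file names are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class PromptVersion:
    version: str
    prompt: str
    negative: str  # requirements the output must NOT show
    references: list[str] = field(default_factory=list)
    output_file: str = ""
    selected: bool = False
    review_note: str = ""


def selected_version(history: list[PromptVersion]) -> Optional[PromptVersion]:
    """Return the single selected version, or None if nothing is selected yet."""
    chosen = [v for v in history if v.selected]
    if len(chosen) > 1:
        raise ValueError("more than one version is marked as selected")
    return chosen[0] if chosen else None


# Hypothetical history: v1 was rejected with a reason, v2 was accepted.
history = [
    PromptVersion("v1", "flat illustration of a RAG pipeline", "no text labels",
                  ["course-slide-01.png"], "rag-pipeline-v1.png",
                  review_note="rejected: arrows point the wrong way"),
    PromptVersion("v2", "flat illustration of a RAG pipeline, left-to-right arrows", "no text labels",
                  ["course-slide-01.png"], "rag-pipeline-v2.png",
                  selected=True, review_note="accepted after factual check"),
]

chosen = selected_version(history)
print("selected:", chosen.version, chosen.output_file)
print("rejected:", [v.version for v in history if not v.selected])
```

Keeping the rejected version and its reason in the same log is what makes two generated results comparable later.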

First Runnable Loop: Structure A Visual Input


This offline script simulates the first engineering step of a multimodal system: after a model or human reads an image, the result must become a structured, checkable record.

Create ch12_visual_record.py and run it with Python 3.10 or later.

# Structured record for one visual input, as a model or human reviewer might fill it in.
visual_record = {
    "source": "course-slide-01.png",
    "content_type": "course screenshot",
    "visible_text": ["RAGOps", "evaluation set", "Trace", "cost monitoring"],
    "objects": ["flowchart", "table"],
    "uncertainty": ["small text in the lower-right corner is unclear"],
    "next_step": "write into the multimodal RAG index for the course Q&A assistant to cite",
}

# Every record must carry these fields before it is considered for indexing.
required_fields = {"source", "content_type", "visible_text", "objects", "uncertainty", "next_step"}
missing = required_fields - visual_record.keys()

# RAG-ready: no required field is missing and at least one piece of visible text exists.
rag_ready = not missing and bool(visual_record["visible_text"])

print("source:", visual_record["source"])
print("visible_text_count:", len(visual_record["visible_text"]))
print("uncertainty_count:", len(visual_record["uncertainty"]))
print("rag_ready:", rag_ready)

Expected output:

source: course-slide-01.png
visible_text_count: 4
uncertainty_count: 1
rag_ready: True

Visual record RAG-ready result map

Operation tip: add page, region, or timestamp fields. If the record can be cited later, it can enter multimodal RAG. If it cannot be checked or cited, it should stay in review.

  • source proves where the visual record came from.
  • visible_text_count shows how much text was extracted or observed.
  • uncertainty_count is not a weakness; it is the part that should stay reviewable.
  • rag_ready=True means the record has enough structure to be cited later, not that the visual understanding is automatically correct.

| Level | What you can prove |
| --- | --- |
| Minimum pass | You can turn one screenshot, image, PDF, audio, or video note into a structured record with source and uncertainty. |
| Project-ready | You can preserve source references, prompt versions, candidate outputs, review decisions, and export files. |
| Deeper check | You can connect multimodal records to RAG or Agent while enforcing copyright, portrait, sensitive-content, factual, latency, and cost boundaries. |
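The operation tip above suggests adding page, region, or timestamp fields so a record can be cited later. A minimal sketch of that routing rule follows, assuming a record counts as citable when it names a source and at least one locator. The function name `is_citable`, the field names, and the sample files other than `course-slide-01.png` are hypothetical.

```python
LOCATOR_FIELDS = ("page", "region", "timestamp")


def is_citable(record: dict) -> bool:
    """A record is citable if it names a source and at least one locator."""
    has_locator = any(record.get(key) is not None for key in LOCATOR_FIELDS)
    return bool(record.get("source")) and has_locator


# Hypothetical records: two carry a locator, one does not.
records = [
    {"source": "course-slide-01.png", "region": "lower-right table"},
    {"source": "lecture-03.mp3", "timestamp": "00:03:12"},
    {"source": "handout.pdf"},
]

for record in records:
    status = "rag_index" if is_citable(record) else "review"
    print(record["source"], "->", status)
```

In this sketch the PDF record stays in review because nothing says which page the content came from, so a later answer could not point back to it.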

Connect Multimodal To RAG, Agent, And Creative Work


Multimodal RAG Agent capstone map

Multimodal is not separate from the main track.

| Main-track skill | Multimodal extension |
| --- | --- |
| RAG | retrieve PDF pages, screenshots, charts, image captions, and text chunks with citations |
| Agent | observe screenshots or documents, choose tools, and leave traceable actions |
| Prompt | create image, voice, storyboard, and review prompts with version records |
| Engineering | track assets, licenses, reviews, export files, latency, and cost |
| Capstone | build a multimodal learning assistant or creative workspace |
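The RAG row above can be illustrated with a toy retriever that returns each hit together with a citation. This sketch uses plain token overlap as a stand-in for real embedding or hybrid retrieval, and the record layout (`source`, `locator`, `text`), the function names, and the sample data are assumptions made for the example.

```python
import re


def tokenize(text: str) -> set[str]:
    """Lowercase the text and split it into alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(query: str, records: list[dict], top_k: int = 2) -> list[dict]:
    """Rank records by token overlap with the query and attach a citation."""
    query_tokens = tokenize(query)
    scored = []
    for record in records:
        overlap = len(query_tokens & tokenize(record["text"]))
        if overlap:
            scored.append((overlap, record))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [
        {
            "text": record["text"],
            "citation": f'{record["source"]}#{record["locator"]}',
            "score": overlap,
        }
        for overlap, record in scored[:top_k]
    ]


# Hypothetical multimodal records: a screenshot region, a PDF page, an audio timestamp.
records = [
    {"source": "course-slide-01.png", "locator": "region=flowchart",
     "text": "RAGOps evaluation set Trace cost monitoring"},
    {"source": "handout.pdf", "locator": "page=4",
     "text": "Chart of retrieval latency by index size"},
    {"source": "lecture-03.mp3", "locator": "t=00:03:12",
     "text": "The speaker explains cost monitoring for agents"},
]

hits = retrieve("cost monitoring trace", records)
for hit in hits:
    print(hit["score"], hit["citation"])
```

The point of the sketch is the return shape: every hit carries a source and a locator, so the answer that uses it can cite a screenshot region, a page, or a timestamp.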

Keep this page’s proof of learning as a small evidence card:

Brief: user goal, audience, assets, constraints, and export format
Artifacts: source files, prompts, generated candidates, selected output, and rejected versions
Review: factual check, copyright/portrait/sensitive-content check, and human decision
Integration: RAG record, Agent trace, creative package, storyboard, or export preview
Expected Output: reproducible asset package with README, review checklist, and failure notes
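The evidence card can also be kept as a small machine-checkable structure. The sketch below only checks that each of the five sections is present and non-empty. The key names, the helper `empty_sections`, and the sample values are hypothetical, not a required schema.

```python
# Section names follow the evidence card; the snake_case keys are an assumption.
SECTIONS = ("brief", "artifacts", "review", "integration", "expected_output")


def empty_sections(card: dict) -> list[str]:
    """Return the card sections that are missing or still empty."""
    return [name for name in SECTIONS if not card.get(name)]


# Hypothetical card for one asset: everything is filled in except the review.
card = {
    "brief": {"user_goal": "explain the RAGOps slide", "audience": "course learners",
              "export_format": "markdown"},
    "artifacts": {"source_files": ["course-slide-01.png"], "prompts": ["v1", "v2"],
                  "selected_output": "caption-v2.md", "rejected_versions": ["caption-v1.md"]},
    "review": {},  # no human decision recorded yet
    "integration": {"rag_record": "rec-001"},
    "expected_output": ["README.md", "review-checklist.md", "failure-notes.md"],
}

print("incomplete sections:", empty_sections(card))
```

Here the card is flagged because the review section is empty, which matches the rule that an asset without a recorded human decision is not ready to export.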

Common pitfalls:

  • Treating AIGC as “one pretty output” instead of a workflow.
  • Losing source references after OCR, PDF parsing, or screenshot understanding.
  • Comparing generated results without prompt and version records.
  • Skipping human review for copyright, portrait rights, sensitive content, or factual risk.
  • Starting with video generation before the storyboard, assets, and review rules are clear.

Before finishing the course, you should be able to:

  • explain how text, images, PDFs, audio, and video enter one workflow;
  • run the visual record script and add source references such as page, region, or timestamp;
  • preserve prompts, assets, selected outputs, rejected outputs, and review reasons;
  • connect a multimodal record to RAG, Agent, or a creative package;
  • run the multimodal workshop and keep a README, review checklist, export preview, and failure cases.

For a printable checklist, use 12.0 Learning Checklist. For the guided final project, start with 12.5.3 Hands-on: Build a Reproducible Multimodal Creative Package.

Check reasoning and explanation
  1. A passing answer names the modalities involved, the input-output contract, and how text, image, audio, or video evidence is aligned.
  2. The evidence should include a real media artifact or trace, plus a note on quality, safety, and failure cases.
  3. A good self-check explains whether the task needs generation, understanding, retrieval, tool orchestration, or human review rather than treating every multimodal problem as the same kind of demo.