8 LLM Application Development and RAG

Chapter 7 explained how an LLM produces text. Chapter 8 turns that model into a useful application: connect documents, retrieve evidence, answer with citations, log failures, and improve with an evaluation set.

Think of RAG as “read before answering.” The model should not guess from memory when the answer must come from your course notes, company documents, product manuals, or private knowledge base.

You have already learned how to control an LLM response with prompts, structured output, and evaluation habits. This chapter adds external knowledge: documents must be parsed, chunked, retrieved, cited, and tested before the answer is trusted.

This is the bridge from “the model can answer” to “the application can answer from the right evidence.” Chapter 9 will reuse this evidence habit, but the system will also choose tools, act, observe, and leave an execution trace.

RAG application loop

Use this loop as the chapter map.

| Layer | Job | What you should print or save |
| --- | --- | --- |
| Knowledge | Parse documents, clean text, split chunks, keep metadata | chunks.jsonl, source, section, page, version |
| Retrieval | Find the chunks most relevant to a question | query, top-k chunks, scores, source IDs |
| Generation | Ask the LLM to answer only from retrieved context | final prompt, answer, citations, no-answer reason |
| Application | Wrap the flow as CLI, API, chat UI, or internal tool | request, response, error handling, user feedback |
| Operations | Compare quality, cost, latency, and failures over time | eval set, logs, token cost, latency, failure cases |
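
The Knowledge layer can be sketched in a few lines. This is a minimal illustration, not the chapter's own implementation: the function names `chunk_text` and `to_jsonl`, the word-window strategy, and the default sizes are all assumptions. A `page` field could be added the same way when the parser provides page numbers.

```python
import json


def chunk_text(text: str, source: str, section: str, version: str,
               max_words: int = 40, overlap: int = 10) -> list[dict]:
    """Split text into overlapping word windows and attach metadata to each."""
    if not 0 <= overlap < max_words:
        raise ValueError("overlap must be >= 0 and smaller than max_words")
    words = text.split()
    step = max_words - overlap  # how far the window advances each time
    chunks = []
    for index, start in enumerate(range(0, len(words), step)):
        chunks.append({
            "id": f"{source}#{section}-{index}",  # hypothetical ID scheme
            "source": source,
            "section": section,
            "version": version,
            "text": " ".join(words[start:start + max_words]),
        })
        if start + max_words >= len(words):
            break  # this window already reached the end of the text
    return chunks


def to_jsonl(chunks: list[dict]) -> str:
    """Serialize chunks as one JSON object per line (a chunks.jsonl layout)."""
    return "\n".join(json.dumps(chunk, ensure_ascii=False) for chunk in chunks)


if __name__ == "__main__":
    sample = " ".join(f"w{i}" for i in range(100))
    chunks = chunk_text(sample, source="rag-basics.md", section="chunking", version="v1")
    print(len(chunks), "chunks")
    print(to_jsonl(chunks[:1]))
```

Keeping the metadata on every chunk, rather than in a separate lookup, means each retrieved chunk can be cited without a second query.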

Do the workshop after the basics. First make the retrieval chain visible; then wrap it as an application. Follow the core path first: 8.1 -> 8.3 -> 8.4 -> 8.5. Use 8.2 when you need local serving, unified APIs, or deployment choices.

| Step | Read | Do | Evidence to keep |
| --- | --- | --- | --- |
| 8.1 | RAG basics, document processing, retrieval, evaluation | Build a tiny document-to-answer loop | chunks, top-k output, cited answer |
| 8.3 | LLM app development | Wrap the RAG loop with API, tools, dialog, or document parsing | request/response sample and error path |
| 8.4 | Engineering practices | Add operations notes and context strategy | logs, config, deployment checklist, strategy card |
| 8.5 | Stage project | Run 8.5.6 Hands-on: Full Chapter 8 RAG App Workshop | workshop output, one added doc, one added eval case |
| 8.2 | Deployment and unified APIs | Understand cloud API, local model, and unified calling layer | one calling note or config comparison |

  • Required core: document parsing, chunk metadata, top-k retrieval, citations, no-answer handling, a fixed evaluation set, and request/response logs. These are the minimum skills for trustworthy knowledge-grounded LLM apps.
  • Optional extension: local model serving, unified APIs, LangChain/LlamaIndex, advanced RAG, and Docker deployment. Return here when the project needs scale, framework integration, or operations depth.
  • Depth challenge: keep the same evaluation questions, change one retrieval or chunking variable, and compare cited answers. This prevents “it feels better” RAG tuning.

| Pace | What to finish | Portfolio deliverable |
| --- | --- | --- |
| Fast pass | Tiny RAG, top-k printout, one cited answer, one no-answer case | rag_trace.md with query, chunks, answer, and failure note |
| Standard pass | Core path 8.1 -> 8.3 -> 8.4 -> 8.5 | README section with API request/response, eval case, and added document |
| Deep pass | Add one extension such as local serving, reranking, framework integration, or Docker | before/after eval table, latency/cost note, and deployment checklist |

A strong Chapter 8 output is not a chatbot screenshot. It is a rerunnable evidence path: document -> chunk -> retrieval -> answer -> citation -> evaluation.
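
One way to make that evidence path rerunnable is to save a small record per request. The sketch below is only an assumption about what such a record might hold. The `RagTrace` class and its field names are hypothetical, not a format defined by this chapter.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Optional


@dataclass
class RagTrace:
    """One request through the evidence path, kept for later comparison."""
    query: str
    retrieved: list[dict]  # each item: {"id": ..., "source": ..., "score": ...}
    answer: str
    citations: list[str] = field(default_factory=list)
    no_answer_reason: Optional[str] = None  # set only when the app declines

    def to_json_line(self) -> str:
        """Serialize the trace as a single JSON line for an append-only log."""
        return json.dumps(asdict(self), ensure_ascii=False)


if __name__ == "__main__":
    trace = RagTrace(
        query="Why does a RAG app need an evaluation set?",
        retrieved=[{"id": "ragops", "source": "study-guide.md#ragops", "score": 4}],
        answer="Use a fixed evaluation set. [study-guide.md#ragops]",
        citations=["study-guide.md#ragops"],
    )
    print(trace.to_json_line())
```

Appending one such line per request gives a log that can be diffed before and after a retrieval change.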

First Runnable Loop: Tiny RAG Without a Framework

Before LangChain, LlamaIndex, or a vector database, run the smallest possible chain. The goal is not a powerful retriever; the goal is to see every step.

Create ch08_tiny_rag.py and run it with Python 3.10 or later.

import re

# Three tiny "chunks", each with a source ID so the answer can cite it.
docs = [
    {
        "id": "ragops",
        "source": "study-guide.md#ragops",
        "text": "A RAG app needs an evaluation set with fixed questions, expected sources, ideal answers, and failure labels.",
    },
    {
        "id": "chunking",
        "source": "rag-basics.md#chunking",
        "text": "A RAG app splits documents into chunks and keeps source metadata so answers can cite evidence.",
    },
    {
        "id": "agentops",
        "source": "agent-guide.md#trace",
        "text": "Agent systems record tool calls, observations, permissions, and recovery steps.",
    },
]

question = "Why does a RAG app need an evaluation set?"

# Words too common to signal relevance.
STOPWORDS = {"a", "an", "the", "why", "does", "with", "and", "so", "can", "be"}


def tokenize(text: str) -> set[str]:
    """Lowercase the text and return its unique word tokens, minus stopwords."""
    return set(re.findall(r"[\w\u4e00-\u9fff\u3040-\u30ff]+", text.lower())) - STOPWORDS


query_tokens = tokenize(question)

# Score each chunk by how many distinct tokens it shares with the question,
# then sort from highest to lowest score.
ranked = sorted(
    ((len(query_tokens & tokenize(doc["text"])), doc) for doc in docs),
    key=lambda item: item[0],
    reverse=True,
)

print("question:", question)
print("top chunks:")
for score, doc in ranked[:2]:
    print(f"- {doc['id']} score={score} source={doc['source']}")

# The answer text is hardcoded here; only the citation comes from retrieval.
best = ranked[0][1]
answer = (
    "Use a fixed evaluation set so every RAG change can be compared "
    f"against the same questions and expected sources. [{best['source']}]"
)
print("answer:", answer)

Expected output:

question: Why does a RAG app need an evaluation set?
top chunks:
- ragops score=4 source=study-guide.md#ragops
- chunking score=2 source=rag-basics.md#chunking
answer: Use a fixed evaluation set so every RAG change can be compared against the same questions and expected sources. [study-guide.md#ragops]

Operation tip: add one new document, ask one new question, and print the top-k chunks before reading the final answer. If the evidence is wrong, the answer cannot be trusted.
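
The tiny script hardcodes its answer text. In a real app, the retrieved chunks would be assembled into a prompt for the model. The following is a minimal sketch of that assembly step under stated assumptions: `build_prompt` is a hypothetical helper, the instruction wording is illustrative, and no model is called, so the final prompt can be printed and inspected on its own.

```python
def build_prompt(question: str, chunks: list[dict]) -> str:
    """Assemble a grounded prompt: instructions, labeled sources, then the question."""
    # Prefix every chunk with its source ID so the model can cite it.
    context = "\n".join(f"[{chunk['source']}] {chunk['text']}" for chunk in chunks)
    return (
        "Answer the question using only the sources below.\n"
        "Cite the source ID in square brackets after each claim.\n"
        "If the sources do not contain the answer, reply exactly: "
        '"I do not know from the provided sources."\n\n'
        f"Sources:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


if __name__ == "__main__":
    top_chunks = [
        {
            "source": "study-guide.md#ragops",
            "text": "A RAG app needs an evaluation set with fixed questions, "
                    "expected sources, ideal answers, and failure labels.",
        },
    ]
    print(build_prompt("Why does a RAG app need an evaluation set?", top_chunks))
```

Printing this prompt alongside the top-k chunks makes it possible to tell a retrieval failure from a generation failure.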

| Level | What you can prove |
| --- | --- |
| Minimum pass | You can print chunks, top-k scores, answer, and citation for one question. |
| Project-ready | You can add metadata, handle empty retrieval with a no-answer response, and compare changes on a fixed eval set. |
| Deeper check | You can separate document, chunking, retrieval, reranking, generation, citation, latency, and cost failures. |
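
The no-answer behavior named in the Project-ready row might look like the sketch below. It assumes the `(score, doc)` pairs produced by the tiny script. The `answer_with_fallback` function, the `min_score` threshold, and the `generate` callable are hypothetical stand-ins, and the stub generator here does not call any model.

```python
from typing import Callable

NO_ANSWER = "I do not know from the provided sources."


def answer_with_fallback(
    question: str,
    ranked: list[tuple[int, dict]],
    generate: Callable[[str, list[dict]], str],
    min_score: int = 1,
    top_k: int = 2,
) -> dict:
    """Answer only when at least one top-k chunk reaches min_score."""
    evidence = [doc for score, doc in ranked[:top_k] if score >= min_score]
    if not evidence:
        # Refuse instead of letting the model guess from memory.
        return {
            "answer": NO_ANSWER,
            "citations": [],
            "no_answer_reason": f"no chunk scored at least {min_score}",
        }
    return {
        "answer": generate(question, evidence),
        "citations": [doc["source"] for doc in evidence],
        "no_answer_reason": None,
    }


if __name__ == "__main__":
    def stub_generate(question: str, evidence: list[dict]) -> str:
        # Placeholder for a real model call.
        return f"(answer grounded in {len(evidence)} chunk(s))"

    empty = [(0, {"source": "agent-guide.md#trace"})]
    print(answer_with_fallback("What is the refund policy?", empty, stub_generate))
```

Recording the reason alongside the refusal keeps the empty-retrieval case visible in logs and in the evaluation set.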

RAG debugging ladder

When the answer is bad, locate the failing layer before changing the model.

| Symptom | Print first | Likely fix |
| --- | --- | --- |
| The answer has no source | final prompt and retrieved chunks | keep source IDs in chunks and require citations |
| The source document has the answer but retrieval misses it | original text search and chunk text | adjust chunk size, add keywords, use hybrid search |
| Many chunks are recalled but the best one is not first | top-k scores and manual relevance labels | add reranking or rule-based filtering |
| The answer uses old information | document version and index build time | rebuild index and add regression tests |
| You cannot tell whether quality improved | before/after answers on the same questions | create a fixed evaluation set |

Common mistakes:

  • Treating “connected a vector database” as “RAG is done.” RAG quality also depends on document quality, chunking, ranking, prompt, citations, and evaluation.
  • Adding frameworks before understanding the chain. Frameworks are easier after you can print query, chunks, prompt, answer, and source.
  • Letting the model answer when retrieval is empty. A useful RAG app must say “I do not know from the provided sources.”
  • Forgetting metadata. Without source, page, section, and version, citations and debugging become weak.
  • Optimizing by feeling. Use the same evaluation questions every time you change chunking, retrieval, reranking, or prompt.
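
A fixed evaluation set can be as small as a list of questions with expected sources. The sketch below reuses the three chunks from the tiny script and compares two retrieval configurations on the same questions. It rests on assumptions: the second and third evaluation questions are invented for illustration, and `retrieve`, `hit_rate`, and the choice of the stopword list as the variable under test are hypothetical.

```python
import re

DOCS = [
    {"source": "study-guide.md#ragops",
     "text": "A RAG app needs an evaluation set with fixed questions, expected sources, ideal answers, and failure labels."},
    {"source": "rag-basics.md#chunking",
     "text": "A RAG app splits documents into chunks and keeps source metadata so answers can cite evidence."},
    {"source": "agent-guide.md#trace",
     "text": "Agent systems record tool calls, observations, permissions, and recovery steps."},
]

# The evaluation set stays fixed while one retrieval variable changes.
EVAL_SET = [
    {"question": "Why does a RAG app need an evaluation set?",
     "expected_source": "study-guide.md#ragops"},
    {"question": "How do answers cite evidence?",
     "expected_source": "rag-basics.md#chunking"},
    {"question": "What do agent systems record?",
     "expected_source": "agent-guide.md#trace"},
]

CONFIGS = {
    "with_stopwords": frozenset({"a", "an", "the", "why", "does", "with", "and", "so", "can", "be"}),
    "no_stopwords": frozenset(),
}


def tokenize(text: str, stopwords: frozenset) -> set[str]:
    """Return the unique lowercase word tokens in text, minus stopwords."""
    return set(re.findall(r"\w+", text.lower())) - stopwords


def retrieve(question: str, stopwords: frozenset, top_k: int = 1) -> list[str]:
    """Return the sources of the top_k chunks ranked by token overlap."""
    query = tokenize(question, stopwords)
    ranked = sorted(
        DOCS,
        key=lambda doc: len(query & tokenize(doc["text"], stopwords)),
        reverse=True,
    )
    return [doc["source"] for doc in ranked[:top_k]]


def hit_rate(stopwords: frozenset, top_k: int = 1) -> float:
    """Fraction of eval cases whose expected source appears in the top_k."""
    hits = sum(
        case["expected_source"] in retrieve(case["question"], stopwords, top_k)
        for case in EVAL_SET
    )
    return hits / len(EVAL_SET)


if __name__ == "__main__":
    for name, stopwords in CONFIGS.items():
        print(f"{name}: hit@1 = {hit_rate(stopwords):.2f}")
```

Because only one variable differs between the two runs, any change in the hit rate can be attributed to that variable rather than to a different set of questions.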

Keep this page’s proof of learning as a small evidence card:

  • Core route: 8.1 -> 8.3 -> 8.4 -> 8.5 first
  • RAG loop: ingest -> chunk -> embed -> retrieve -> generate -> cite -> evaluate
  • App loop: API call, state, tool/function, document parsing, output validation
  • Ops loop: async, API contract, logging, monitoring, deployment
  • Bridge: Chapter 9 turns reliable app actions into traceable Agent workflows

Before entering Chapter 9, you should be able to:

  • explain why RAG solves private, fresh, and citable knowledge problems;
  • run the tiny RAG script and inspect top-k chunks before the answer;
  • create chunks with source metadata and cite those sources in the answer;
  • separate document, chunking, retrieval, generation, citation, and deployment failures;
  • run the full Chapter 8 workshop, add one document, add one evaluation case, and record the result in a README.
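
The citation item in the list above can be checked mechanically. The sketch below assumes answers cite sources in square brackets, as the tiny RAG script does. The helper names `extract_citations` and `check_citations` are hypothetical, and a bracket-matching regex is only one possible convention.

```python
import re


def extract_citations(answer: str) -> list[str]:
    """Return every bracketed source ID found in the answer, in order."""
    return re.findall(r"\[([^\[\]]+)\]", answer)


def check_citations(answer: str, retrieved_sources: list[str]) -> dict:
    """Flag answers with no citation or with citations that were never retrieved."""
    cited = extract_citations(answer)
    allowed = set(retrieved_sources)
    return {
        "cited": cited,
        "missing_citation": not cited,
        "unsupported": [source for source in cited if source not in allowed],
    }


if __name__ == "__main__":
    answer = (
        "Use a fixed evaluation set so every RAG change can be compared "
        "against the same questions and expected sources. [study-guide.md#ragops]"
    )
    retrieved = ["study-guide.md#ragops", "rag-basics.md#chunking"]
    print(check_citations(answer, retrieved))
```

A check like this confirms only that a citation points at retrieved evidence, not that the cited chunk actually supports the claim, so manual review of a few cases is still needed.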

For a printable checklist, use 8.0 Learning Checklist. For the guided project, start with 8.5.6 Hands-on: Full Chapter 8 RAG App Workshop.

Check reasoning and explanation
  1. A passing answer traces the full path from query to chunks, retrieval scores, cited evidence, answer, and fallback behavior.
  2. The evidence should include retrieved passages, source metadata, a cited answer, and at least one empty-retrieval or wrong-retrieval case.
  3. A good self-check explains whether a failure came from chunking, retrieval, ranking, prompt assembly, missing sources, or unsupported generation.