Skip to content

8.5.3 Project: Integrated RAG + Fine-tuning System

  • Understand why “only doing RAG” or “only doing fine-tuning” is sometimes not enough
  • Learn how to split a domain Q&A system into a RAG layer and a fine-tuning layer
  • Design an explainable RAG + fine-tuning project plan
  • Run a minimal combined project scaffold

Before mixing RAG and fine-tuning, separate the training terms clearly:

TermBeginner meaningWhat it should solve
fine-tuningContinue training a base model on task examplesMake behavior, format, and domain style more stable
SFTSupervised Fine-Tuning, training with input-output examples written or curated by humansTeach the model what a good answer should look like
LoRALow-Rank Adaptation, a lightweight fine-tuning method that trains small adapter weightsReduce training cost while adapting model behavior
QLoRAQuantized LoRA, LoRA combined with lower-precision model loadingMake fine-tuning possible on smaller hardware
domain adaptationMaking the system fit a specific field or business contextUsually needs both domain knowledge and domain behavior
eval setA fixed set of test questions and expected checksPrevents you from judging improvement by one good-looking example

The practical rule is: do not use fine-tuning to memorize frequently changing documents. Use RAG for changing knowledge, and use fine-tuning or SFT examples for stable behavior.


The strengths and limitations of RAG alone

Section titled “The strengths and limitations of RAG alone”

The advantages of RAG:

  • Knowledge can be updated
  • Sources can be cited
  • No need to retrain the model

But it also has limitations:

  • The model may not understand your domain language
  • Even if it retrieves the right content, it may not answer in the required business format
  • For complex tasks, the model’s “answering habits” may not be stable enough

The strengths and limitations of fine-tuning alone

Section titled “The strengths and limitations of fine-tuning alone”

The advantages of fine-tuning:

  • It can make the model better understand specific task formats
  • Output style becomes more stable
  • Instruction following fits business needs better

But it also has limitations:

  • New knowledge is not updated as flexibly
  • It is hard to make the model memorize all detailed documents through fine-tuning alone
  • The cost is higher

You can remember this in one sentence:

RAG adds knowledge, fine-tuning adds behavior.

That is the core logic of a combined system.

RAG and fine-tuning responsibility split diagram


We define the goal as a domain Q&A assistant, for example:

  • For internal company policy documents
  • Answers must reliably cite sources
  • Output format must be standardized
  • Some questions need to be answered with fixed business wording

In other words, this system needs to:

  • Find the knowledge
  • And answer like a domain system should

flowchart LR
A["User question"] --> B["Retriever"]
B --> C["Relevant document chunks"]
C --> D["Fine-tuned answer model"]
D --> E["Standardized output"]
style A fill:#e3f2fd,stroke:#1565c0,color:#333
style B fill:#fff3e0,stroke:#e65100,color:#333
style C fill:#f3e5f5,stroke:#6a1b9a,color:#333
style D fill:#e8f5e9,stroke:#2e7d32,color:#333
style E fill:#ffebee,stroke:#c62828,color:#333

It is not that “there are many components,” but that the responsibilities are clear:

  • The retriever is responsible for finding information
  • The fine-tuned model is responsible for organizing the answer in a business-friendly way

This makes the system more explainable and easier to iterate on.


from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
kb = [
{"id": "doc1", "text": "Refund policy: Refunds are available within 7 days of purchase if learning progress is below 20%."},
{"id": "doc2", "text": "Certificate policy: A certificate is issued after completing the project and passing the test."},
{"id": "doc3", "text": "Customer support rule: When answering, first explain the policy basis, then give the conclusion."}
]
vectorizer = TfidfVectorizer(token_pattern=r"(?u)\\b\\w+\\b")
doc_vectors = vectorizer.fit_transform([item["text"] for item in kb])
def retrieve(query, top_k=2):
query_vec = vectorizer.transform([query])
scores = cosine_similarity(query_vec, doc_vectors)[0]
top_idx = scores.argsort()[::-1][:top_k]
return [kb[i] for i in top_idx]
print(retrieve("What are the refund conditions"))

Expected output:

Terminal window
[{'id': 'doc1', 'text': 'Refund policy: Refunds are available within 7 days of purchase if learning progress is below 20%.'}, {'id': 'doc3', 'text': 'Customer support rule: When answering, first explain the policy basis, then give the conclusion.'}]

This retriever is not complicated, but it is already the first half of the combined system.


In a real project, this step might come from:

  • Instruction tuning
  • LoRA / QLoRA
  • Supervised dataset training

To make the code runnable directly, here we first simulate a “trained business output style” with rules.

def domain_answer_style(question, retrieved_docs):
evidence = " ".join(doc["text"] for doc in retrieved_docs)
if "refund" in question:
return {
"answer": "According to the current refund policy, users may request a refund within 7 days of purchase if their learning progress is below 20%.",
"reasoning_style": "policy first, conclusion second",
"evidence": evidence
}
if "certificate" in question:
return {
"answer": "According to the certificate policy, a certificate can be obtained after completing the project and passing the test.",
"reasoning_style": "policy first, conclusion second",
"evidence": evidence
}
return {
"answer": "No sufficiently matching business rule was found at the moment.",
"reasoning_style": "cautious refusal",
"evidence": evidence
}

Because it helps you understand:

  • RAG solves “what does the system know?”
  • Fine-tuning solves “how should it answer?”

def rag_plus_finetune_system(question):
docs = retrieve(question, top_k=2)
result = domain_answer_style(question, docs)
return {
"question": question,
"retrieved_docs": docs,
**result
}
result = rag_plus_finetune_system("What are the refund conditions?")
print(result["question"])
print(result["answer"])
print("evidence:", result["evidence"])

Expected output:

Terminal window
What are the refund conditions?
According to the current refund policy, users may request a refund within 7 days of purchase if their learning progress is below 20%.
evidence: Refund policy: Refunds are available within 7 days of purchase if learning progress is below 20%. Customer support rule: When answering, first explain the policy basis, then give the conclusion.

It already shows:

A combined system is not about forcing two technologies together, but about letting each do the part it is best at.


What does fine-tuning usually optimize in a real project?

Section titled “What does fine-tuning usually optimize in a real project?”

It is not for “memorizing all documents”

Section titled “It is not for “memorizing all documents””

Many beginners mistakenly think:

After fine-tuning, the model should memorize the whole knowledge base

But a more common and realistic goal is:

  • Learn the style of domain terminology
  • Learn the output format
  • Learn business answer templates
  • Learn the fixed structure of certain tasks

You may want the model to learn:

  • “Cite the policy first, then give the conclusion”
  • “When uncertain, explicitly refuse to answer”
  • “All answers must output standard fields”

These kinds of capabilities are well suited to fine-tuning, or at least to strong supervised training.


  • Document chunking
  • Retrieval
  • Source citations
  • Knowledge updates
  • Response style
  • Output format
  • Task templates
  • Understanding business terminology

Once this responsibility split is clear, the project becomes much easier to maintain.


You cannot only look at whether the answer sounds smooth

Section titled “You cannot only look at whether the answer sounds smooth”

You should check at least two layers:

  • Retrieval layer: did it find the right document?
  • Answer layer: does the output meet business requirements?
eval_data = [
{"question": "What are the refund conditions", "gold_doc": "doc1", "must_contain": "7 days"},
{"question": "How to get a certificate", "gold_doc": "doc2", "must_contain": "passing the test"}
]
for item in eval_data:
result = rag_plus_finetune_system(item["question"])
hit = result["retrieved_docs"][0]["id"] == item["gold_doc"]
good_answer = item["must_contain"] in result["answer"]
print(item["question"], "retrieval_hit=", hit, "answer_ok=", good_answer)

Expected output:

Terminal window
What are the refund conditions retrieval_hit= True answer_ok= True
How to get a certificate retrieval_hit= True answer_ok= True

This is already much better than just saying “the demo looks good.”

When the combined system fails, first decide which layer owns the problem. This small table is the beginning of a real project postmortem.

diagnostics = [
{"symptom": "Correct document is not in top-2", "likely_layer": "RAG", "next_step": "Improve chunking, query rewrite, or retrieval"},
{"symptom": "Correct document is retrieved but answer format is unstable", "likely_layer": "fine-tuning / prompt", "next_step": "Add supervised examples or stricter schema"},
{"symptom": "Answer cites one source but uses facts from another", "likely_layer": "grounding", "next_step": "Add citation checks and sentence-level evidence"},
]
for row in diagnostics:
print(f"{row['likely_layer']}: {row['symptom']} -> {row['next_step']}")

Expected output:

Terminal window
RAG: Correct document is not in top-2 -> Improve chunking, query rewrite, or retrieval
fine-tuning / prompt: Correct document is retrieved but answer format is unstable -> Add supervised examples or stricter schema
grounding: Answer cites one source but uses facts from another -> Add citation checks and sentence-level evidence

RAG plus fine-tuning result map


Keep this page’s proof of learning as a small evidence card:

Project Goal
user task and business boundary
Baseline
simplest prompt/RAG/app version first
Evaluation
fixed cases, retrieval evidence, answer quality, and citation check
Failure Log
at least one failed case with likely cause
Deliverable
README, run command, screenshots/logs, next step

Using fine-tuning to solve knowledge update problems

Section titled “Using fine-tuning to solve knowledge update problems”

This is usually inefficient.

Using RAG to force stable output style problems

Section titled “Using RAG to force stable output style problems”

This is not always appropriate either.

Confusing the responsibilities of the two layers

Section titled “Confusing the responsibilities of the two layers”

If you cannot clearly explain “which layer is responsible for what,” the system will be hard to debug later.


The most important point in this section is not simply putting the two words RAG and fine-tuning together, but understanding:

The value of an integrated RAG + fine-tuning system is that knowledge acquisition and answer behavior are handled by the most suitable mechanisms respectively.

That is the real engineering thinking behind combined LLM systems.


If you want to include this project in your portfolio, do not just show “ask a question, get an answer.” A better approach is to deliver the RAG layer, answer layer, evaluation layer, and postmortem materials together.

DeliverableMinimum requirementPortfolio-level requirement
Knowledge base sampleAt least 3–5 document snippetsShow raw materials, chunking results, metadata fields, and sources
Retrieval logsCan print matched documentsSave query, top-k, score, source, and context length
Answer outputCan provide an answerAnswer includes conclusion, evidence, source, and a fallback for “not enough information”
Evaluation set2–5 test questions20–50 questions covering paraphrases, boundary cases, and confusing cases
Failure samplesSimple error notesSeparate retrieval failures, generation failures, citation failures, and format failures
READMECan explain how to run itIncludes architecture diagram, run commands, sample inputs/outputs, metrics, and next steps

The key point of this table is to upgrade the project from a “technical demo” to an “explainable project.” People looking at your project will not only check whether it answers correctly, but also whether you know why it answered correctly, why it answered incorrectly, and how to improve it.

You can organize the final project like this:

AreaFiles to include
RootREADME.md
data/raw_docs/, chunks.jsonl, eval_questions.csv
src/ingest.py, retrieve.py, answer.py, evaluate.py
logs/retrieval_logs.jsonl, failure_cases.md
reports/baseline_result.md, improvement_record.md

When you build it for the first time, you do not need to fill every file immediately. But at minimum, you should let others see three lines clearly: how the materials enter the system, how questions match documents, and how answers are evaluated.

A portfolio project README should not just say “this project uses RAG and fine-tuning.” It is more valuable to show the full loop.

README moduleQuestion it should answer
Project goalWhat domain problem does this system solve, and why are RAG or fine-tuning needed?
System architectureHow does the user question flow through retrieval, context, answer, and citation?
How to runHow to install dependencies, prepare data, run Q&A, and run evaluation
Sample outputInput question, matched documents, final answer, source citations
Evaluation resultsBaseline performance, improved performance, failure samples
Technical trade-offsWhy use RAG, why consider fine-tuning, and where is the boundary between them
Next stepsWhat to improve next: retrieval, answer style, cost, or deployment

A small but effective sample output can be written like this:

Terminal window
Question: What are the refund conditions?
Matched document: doc1 refund policy score=0.92
Answer: According to the refund policy, users may request a refund within 7 days of purchase if their learning progress is below 20%.
Source: doc1
Evaluation: retrieval_hit=true, answer_ok=true, citation_ok=true

In a RAG + fine-tuning project, the part that best shows engineering ability is often not the success cases, but the failure cases. It is recommended to record at least 3 types of failures:

  • Retrieval failure: the correct policy does not appear in top-k. Check chunking, keyword match, and embeddings; then adjust chunking, add hybrid retrieval, or rewrite the query.
  • Answer failure: the right material is retrieved, but the answer misses key conditions. Check prompt constraints and answer template stability; then strengthen the output format and add must_contain checks.
  • Citation failure: the answer conclusion does not match the cited passage. Check citation stitching and model improvisation; then require citation checks and sentence-level grounding.
  • Style failure: the facts are correct, but the answer does not fit the business style. Check fine-tuning data and examples; then add more format examples or supervised data.

Writing down failure samples clearly is more persuasive than only showing one successful screenshot.

VersionGoalDelivery focus
Basic versionRun the minimal closed loopCan input, process, and output, and keep one set of examples
Standard versionForm a presentable projectAdd configuration, logs, error handling, README, and screenshots
Challenge versionClose to portfolio qualityAdd evaluation, comparison experiments, failure analysis, and next-step roadmap

It is recommended to complete the basic version first. Do not try to make it too large at the beginning. With each version upgrade, write down in the README “what capability was added, how it was verified, and what problems remain.”

  1. Add two more documents to the knowledge base and observe whether the retrieval results change.
  2. Design your own “domain answer style rules” to simulate the behavior of the fine-tuning layer.
  3. Think about this: if the system always retrieves the right documents but the answer format is always messy, should you prioritize improving RAG or fine-tuning?
  4. Explain in your own words: why do we say “RAG adds knowledge, fine-tuning adds behavior”?
Project reference and review notes
  1. New documents should test whether retrieval changes only when the query intent matches, not randomly.
  2. Style rules might specify tone, section order, citation format, refusal boundaries, and domain terminology.
  3. If the documents are correct but the format is messy, prioritize the behavior layer: prompt/schema first, then fine-tuning if the pattern is stable enough.
  4. RAG supplies changing or domain-specific knowledge at query time; fine-tuning teaches stable response behavior, style, and format.