8.5.2 Project: Enterprise Knowledge Base Q&A
Learning Objectives
Section titled “Learning Objectives”- Learn how to organize enterprise documents into searchable knowledge units
- Learn how to design the two enterprise-grade constraints of permissions and citations
- Learn how to build a showcaseable project loop with a minimal retriever
- Learn how to present the project around error analysis and traceability
Beginner terminology bridge
Section titled “Beginner terminology bridge”Enterprise knowledge-base projects use several words that sound simple but have strict engineering meaning:
| Term | Beginner meaning | Why it matters in this project |
|---|---|---|
permission filtering | Remove documents the current user is not allowed to see before retrieval or answering | Prevents the system from leaking internal content |
citation | A source reference attached to the answer | Lets users verify where the answer came from |
metadata | Extra fields attached to a chunk, such as source file, department, visibility, page, or version | Makes filtering, debugging, and citation possible |
SOP | Standard Operating Procedure, a fixed internal workflow document | Many enterprise answers are not just facts, but process rules |
traceability | The ability to follow an answer back to the original document and processing path | This is what makes the project trustworthy instead of just fluent |
The key idea is: an enterprise knowledge base is not only a search problem. It is also a permission, evidence, and audit problem.
Why is enterprise knowledge base Q&A harder than ordinary FAQ?
Section titled “Why is enterprise knowledge base Q&A harder than ordinary FAQ?”Documents are longer
Section titled “Documents are longer”Enterprise knowledge is often not just a few Q&As, but rather:
- Policy documents
- Internal SOPs
- Training manuals
- Product documentation
Permissions are more complex
Section titled “Permissions are more complex”For the same question, there may be:
- An external version
- An internal version
Trust requirements are higher
Section titled “Trust requirements are higher”Users will often ask:
- Where did this rule come from?
- Which file are you citing?
So enterprise knowledge base Q&A is more like a combination of:
- A retrieval system
- A citation system
- A permission system

Define the project scope first
Section titled “Define the project scope first”A very suitable minimum scope for a portfolio project is:
Build a “refund / invoice / certificate / internal customer support SOP” knowledge base Q&A system for an internal help center on a course platform.
It should at least answer four types of questions:
- External policy questions
- Internal process questions
- Questions whose answers differ by permission
- Questions that require source citations
Why is this scope good?
Section titled “Why is this scope good?”- The document topics are focused
- The permission boundary is realistic
- It is easy to explain whether the result is good or bad
Design the knowledge units first, not the model first
Section titled “Design the knowledge units first, not the model first”The following example does three things:
- Splits documents into the smallest knowledge units
- Adds metadata to each chunk
- Distinguishes between public and internal visibility
kb = [ { "id": "doc_001", "section": "Refund Policy", "department": "support", "visibility": "public", "text": "A refund can be requested within 7 days of purchase if learning progress is below 20%.", "keywords": {"refund", "7 days", "progress", "20%"}, }, { "id": "doc_002", "section": "Certificate Guide", "department": "teaching", "visibility": "public", "text": "A completion certificate can be issued after finishing all required projects and passing the course final test.", "keywords": {"certificate", "final test", "project"}, }, { "id": "doc_003", "section": "Internal Customer Support SOP", "department": "internal", "visibility": "internal", "text": "When handling a refund request, customer support must first verify the order number, learning progress, and payment channel.", "keywords": {"refund", "customer support", "SOP", "verify", "verification", "process"}, },]
for item in kb: print(f"{item['id']} | {item['visibility']} | {item['section']}")Expected output:
doc_001 | public | Refund Policydoc_002 | public | Certificate Guidedoc_003 | internal | Internal Customer Support SOPWhy add so much metadata here?
Section titled “Why add so much metadata here?”Because enterprise knowledge base retrieval is not only about “does the content seem similar,” but also about deciding:
- Whether the current user is allowed to see it
- Which business domain it belongs to
- How the source should be shown in the answer
This is also the fundamental difference between an enterprise project and a normal Q&A demo.
Build an explainable retriever first
Section titled “Build an explainable retriever first”To make the example runnable in the current environment, we will not use an external embedding library yet, but instead use a pure Python keyword-overlap retriever to get the project skeleton in place first.
def retrieve(query, allowed_visibility, top_k=2): candidates = [] query_text = query.lower()
for item in kb: if item["visibility"] not in allowed_visibility: continue score = sum(keyword.lower() in query_text for keyword in item["keywords"]) candidates.append((score, item))
candidates.sort(key=lambda x: x[0], reverse=True) return [item for score, item in candidates[:top_k] if score > 0]
print("public user:")for hit in retrieve("What is the refund policy?", allowed_visibility={"public"}): print(hit["id"], hit["visibility"], hit["section"])
print("\ninternal support:")for hit in retrieve("What is the customer verification process?", allowed_visibility={"public", "internal"}): print(hit["id"], hit["visibility"], hit["section"])Expected output:
public user:doc_001 public Refund Policy
internal support:doc_003 internal Internal Customer Support SOPAlthough this retriever is simple, why is it very suitable for teaching?
Section titled “Although this retriever is simple, why is it very suitable for teaching?”Because it makes three things very clear:
- How query terms affect recall
- How permissions affect the candidate set
- Why the results are different
Why deliberately not use embeddings directly here?
Section titled “Why deliberately not use embeddings directly here?”Because this lesson first needs to explain clearly:
- Permissions
- Source citations
- Structured knowledge units
These enterprise-grade key points. Once the skeleton is clear, switching to a stronger retrieval method will be much more stable.
Make “answer + sources” together
Section titled “Make “answer + sources” together”def answer_with_sources(query, allowed_visibility): hits = retrieve(query, allowed_visibility=allowed_visibility, top_k=2)
if not hits: return { "answer": "No sufficiently relevant information was found within the current permission scope.", "sources": [], }
top = hits[0] return { "answer": top["text"], "sources": [ { "id": top["id"], "section": top["section"], "department": top["department"], "visibility": top["visibility"], } ], }
print(answer_with_sources("What is the refund policy?", {"public"}))print(answer_with_sources("What is the customer verification process?", {"public", "internal"}))Expected output:
{'answer': 'A refund can be requested within 7 days of purchase if learning progress is below 20%.', 'sources': [{'id': 'doc_001', 'section': 'Refund Policy', 'department': 'support', 'visibility': 'public'}]}{'answer': 'When handling a refund request, customer support must first verify the order number, learning progress, and payment channel.', 'sources': [{'id': 'doc_003', 'section': 'Internal Customer Support SOP', 'department': 'internal', 'visibility': 'internal'}]}Why is “returning sources” a highlight of a portfolio project?
Section titled “Why is “returning sources” a highlight of a portfolio project?”Because it makes the system do more than just “give you an answer,” and also answer:
- Where did this answer come from?
- Why should I trust it?
This significantly increases the credibility of the project.
Why do enterprise scenarios need sources more than ordinary Q&A?
Section titled “Why do enterprise scenarios need sources more than ordinary Q&A?”Because enterprise users often really use the answer to carry out a process. Without sources, trust is hard to build.
How should this project be evaluated?
Section titled “How should this project be evaluated?”It is not enough to only check “whether it answered”
Section titled “It is not enough to only check “whether it answered””An enterprise knowledge base project should be evaluated in at least three layers:
- Whether the retrieval is relevant
- Whether the permissions are correct
- Whether the citations are traceable
A minimal evaluation set
Section titled “A minimal evaluation set”eval_cases = [ { "query": "What is the refund policy?", "visibility": {"public"}, "expected_doc": "doc_001", }, { "query": "What is the customer verification process?", "visibility": {"public"}, "expected_doc": None, }, { "query": "What is the customer verification process?", "visibility": {"public", "internal"}, "expected_doc": "doc_003", },]
for case in eval_cases: result = answer_with_sources(case["query"], case["visibility"]) got = result["sources"][0]["id"] if result["sources"] else None print({ "query": case["query"], "expected_doc": case["expected_doc"], "got": got, "match": got == case["expected_doc"], })Expected output:
{'query': 'What is the refund policy?', 'expected_doc': 'doc_001', 'got': 'doc_001', 'match': True}{'query': 'What is the customer verification process?', 'expected_doc': None, 'got': None, 'match': True}{'query': 'What is the customer verification process?', 'expected_doc': 'doc_003', 'got': 'doc_003', 'match': True}
Why is this kind of evaluation valuable?
Section titled “Why is this kind of evaluation valuable?”Because it directly covers the two most important risks in an enterprise knowledge base:
- It should answer, but fails to answer correctly
- It should not be visible, but internal documents are exposed
How can you take this project one step closer to portfolio quality?
Section titled “How can you take this project one step closer to portfolio quality?”Upgrade rule-based retrieval to vector retrieval
Section titled “Upgrade rule-based retrieval to vector retrieval”Add document chunking and reranking
Section titled “Add document chunking and reranking”Build a user interface for source display
Section titled “Build a user interface for source display”The most recommended items to show are:
- User question
- Matched document
- Final answer
- Source citation
Show a few “permission-related failure examples”
Section titled “Show a few “permission-related failure examples””This will be very convincing.
Evidence to Keep
Section titled “Evidence to Keep”Keep this page’s proof of learning as a small evidence card:
- Project Goal
- user task and business boundary
- Baseline
- simplest prompt/RAG/app version first
- Evaluation
- fixed cases, retrieval evidence, answer quality, and citation check
- Failure Log
- at least one failed case with likely cause
- Deliverable
- README, run command, screenshots/logs, next step
The most common pitfalls
Section titled “The most common pitfalls”Only building “can answer,” not “can be traced”
Section titled “Only building “can answer,” not “can be traced””Only looking at semantic relevance, not permission boundaries
Section titled “Only looking at semantic relevance, not permission boundaries”Making document chunks too coarse
Section titled “Making document chunks too coarse”When chunks are too coarse, both answers and sources often become vague.
Summary
Section titled “Summary”The most important thing in this lesson is to build a portfolio-grade judgment:
What makes enterprise knowledge base Q&A feel like a real project is not that it connects to a retriever, but that it organizes knowledge units, permission boundaries, answer generation, and source traceability into a trustworthy closed loop.
Once this closed loop is clear, this project will look very much like a real enterprise system.
Suggested version roadmap
Section titled “Suggested version roadmap”| Version | Goal | Delivery focus |
|---|---|---|
| Basic version | Get the minimal loop running | Can accept input, process it, and output results, while keeping a set of examples |
| Standard version | Form a showcaseable project | Add configuration, logs, error handling, README, and screenshots |
| Advanced version | Close to portfolio quality | Add evaluation, comparison experiments, failure-case analysis, and a next-step roadmap |
It is recommended to finish the basic version first, and do not pursue an all-in-one solution from the beginning. Each time you upgrade a version, write into the README what new capability was added, how it was verified, and what problems still remain.
Exercises
Section titled “Exercises”- Add two more “public documents” and one “internal document” to
kbto make query competition more realistic. - Why is “correct permissions” sometimes more important than “a beautiful answer” in an enterprise knowledge base project?
- Think about it: if document chunks are cut too coarsely, how will that affect the answer and the citation?
- If you turn this project into a portfolio piece, which 4 blocks of information would be most worth showing on the homepage?
Project reference and review notes
- The added documents should include similar public/internal topics so you can test both ranking quality and permission filtering.
- Leaking internal information is a security failure even if the answer sounds elegant. Permission correctness is a hard requirement.
- Coarse chunks can mix unrelated facts, produce vague citations, and make permission/citation boundaries unclear.
- Strong homepage blocks are problem scope, architecture/retrieval flow, permission model, evaluation results, and failure analysis. Choose the four that best show your project judgment.