Skip to content

8.5.2 Project: Enterprise Knowledge Base Q&A

  • Learn how to organize enterprise documents into searchable knowledge units
  • Learn how to design the two enterprise-grade constraints of permissions and citations
  • Learn how to build a showcaseable project loop with a minimal retriever
  • Learn how to present the project around error analysis and traceability

Enterprise knowledge-base projects use several words that sound simple but have strict engineering meaning:

TermBeginner meaningWhy it matters in this project
permission filteringRemove documents the current user is not allowed to see before retrieval or answeringPrevents the system from leaking internal content
citationA source reference attached to the answerLets users verify where the answer came from
metadataExtra fields attached to a chunk, such as source file, department, visibility, page, or versionMakes filtering, debugging, and citation possible
SOPStandard Operating Procedure, a fixed internal workflow documentMany enterprise answers are not just facts, but process rules
traceabilityThe ability to follow an answer back to the original document and processing pathThis is what makes the project trustworthy instead of just fluent

The key idea is: an enterprise knowledge base is not only a search problem. It is also a permission, evidence, and audit problem.


Why is enterprise knowledge base Q&A harder than ordinary FAQ?

Section titled “Why is enterprise knowledge base Q&A harder than ordinary FAQ?”

Enterprise knowledge is often not just a few Q&As, but rather:

  • Policy documents
  • Internal SOPs
  • Training manuals
  • Product documentation

For the same question, there may be:

  • An external version
  • An internal version

Users will often ask:

  • Where did this rule come from?
  • Which file are you citing?

So enterprise knowledge base Q&A is more like a combination of:

  • A retrieval system
  • A citation system
  • A permission system

Enterprise knowledge base permission and citation loop diagram


A very suitable minimum scope for a portfolio project is:

Build a “refund / invoice / certificate / internal customer support SOP” knowledge base Q&A system for an internal help center on a course platform.

It should at least answer four types of questions:

  1. External policy questions
  2. Internal process questions
  3. Questions whose answers differ by permission
  4. Questions that require source citations
  • The document topics are focused
  • The permission boundary is realistic
  • It is easy to explain whether the result is good or bad

Design the knowledge units first, not the model first

Section titled “Design the knowledge units first, not the model first”

The following example does three things:

  1. Splits documents into the smallest knowledge units
  2. Adds metadata to each chunk
  3. Distinguishes between public and internal visibility
kb = [
{
"id": "doc_001",
"section": "Refund Policy",
"department": "support",
"visibility": "public",
"text": "A refund can be requested within 7 days of purchase if learning progress is below 20%.",
"keywords": {"refund", "7 days", "progress", "20%"},
},
{
"id": "doc_002",
"section": "Certificate Guide",
"department": "teaching",
"visibility": "public",
"text": "A completion certificate can be issued after finishing all required projects and passing the course final test.",
"keywords": {"certificate", "final test", "project"},
},
{
"id": "doc_003",
"section": "Internal Customer Support SOP",
"department": "internal",
"visibility": "internal",
"text": "When handling a refund request, customer support must first verify the order number, learning progress, and payment channel.",
"keywords": {"refund", "customer support", "SOP", "verify", "verification", "process"},
},
]
for item in kb:
print(f"{item['id']} | {item['visibility']} | {item['section']}")

Expected output:

Terminal window
doc_001 | public | Refund Policy
doc_002 | public | Certificate Guide
doc_003 | internal | Internal Customer Support SOP

Because enterprise knowledge base retrieval is not only about “does the content seem similar,” but also about deciding:

  • Whether the current user is allowed to see it
  • Which business domain it belongs to
  • How the source should be shown in the answer

This is also the fundamental difference between an enterprise project and a normal Q&A demo.


To make the example runnable in the current environment, we will not use an external embedding library yet, but instead use a pure Python keyword-overlap retriever to get the project skeleton in place first.

def retrieve(query, allowed_visibility, top_k=2):
candidates = []
query_text = query.lower()
for item in kb:
if item["visibility"] not in allowed_visibility:
continue
score = sum(keyword.lower() in query_text for keyword in item["keywords"])
candidates.append((score, item))
candidates.sort(key=lambda x: x[0], reverse=True)
return [item for score, item in candidates[:top_k] if score > 0]
print("public user:")
for hit in retrieve("What is the refund policy?", allowed_visibility={"public"}):
print(hit["id"], hit["visibility"], hit["section"])
print("\ninternal support:")
for hit in retrieve("What is the customer verification process?", allowed_visibility={"public", "internal"}):
print(hit["id"], hit["visibility"], hit["section"])

Expected output:

Terminal window
public user:
doc_001 public Refund Policy
internal support:
doc_003 internal Internal Customer Support SOP

Although this retriever is simple, why is it very suitable for teaching?

Section titled “Although this retriever is simple, why is it very suitable for teaching?”

Because it makes three things very clear:

  1. How query terms affect recall
  2. How permissions affect the candidate set
  3. Why the results are different

Why deliberately not use embeddings directly here?

Section titled “Why deliberately not use embeddings directly here?”

Because this lesson first needs to explain clearly:

  • Permissions
  • Source citations
  • Structured knowledge units

These enterprise-grade key points. Once the skeleton is clear, switching to a stronger retrieval method will be much more stable.


def answer_with_sources(query, allowed_visibility):
hits = retrieve(query, allowed_visibility=allowed_visibility, top_k=2)
if not hits:
return {
"answer": "No sufficiently relevant information was found within the current permission scope.",
"sources": [],
}
top = hits[0]
return {
"answer": top["text"],
"sources": [
{
"id": top["id"],
"section": top["section"],
"department": top["department"],
"visibility": top["visibility"],
}
],
}
print(answer_with_sources("What is the refund policy?", {"public"}))
print(answer_with_sources("What is the customer verification process?", {"public", "internal"}))

Expected output:

Terminal window
{'answer': 'A refund can be requested within 7 days of purchase if learning progress is below 20%.', 'sources': [{'id': 'doc_001', 'section': 'Refund Policy', 'department': 'support', 'visibility': 'public'}]}
{'answer': 'When handling a refund request, customer support must first verify the order number, learning progress, and payment channel.', 'sources': [{'id': 'doc_003', 'section': 'Internal Customer Support SOP', 'department': 'internal', 'visibility': 'internal'}]}

Why is “returning sources” a highlight of a portfolio project?

Section titled “Why is “returning sources” a highlight of a portfolio project?”

Because it makes the system do more than just “give you an answer,” and also answer:

  • Where did this answer come from?
  • Why should I trust it?

This significantly increases the credibility of the project.

Why do enterprise scenarios need sources more than ordinary Q&A?

Section titled “Why do enterprise scenarios need sources more than ordinary Q&A?”

Because enterprise users often really use the answer to carry out a process. Without sources, trust is hard to build.


It is not enough to only check “whether it answered”

Section titled “It is not enough to only check “whether it answered””

An enterprise knowledge base project should be evaluated in at least three layers:

  1. Whether the retrieval is relevant
  2. Whether the permissions are correct
  3. Whether the citations are traceable
eval_cases = [
{
"query": "What is the refund policy?",
"visibility": {"public"},
"expected_doc": "doc_001",
},
{
"query": "What is the customer verification process?",
"visibility": {"public"},
"expected_doc": None,
},
{
"query": "What is the customer verification process?",
"visibility": {"public", "internal"},
"expected_doc": "doc_003",
},
]
for case in eval_cases:
result = answer_with_sources(case["query"], case["visibility"])
got = result["sources"][0]["id"] if result["sources"] else None
print({
"query": case["query"],
"expected_doc": case["expected_doc"],
"got": got,
"match": got == case["expected_doc"],
})

Expected output:

Terminal window
{'query': 'What is the refund policy?', 'expected_doc': 'doc_001', 'got': 'doc_001', 'match': True}
{'query': 'What is the customer verification process?', 'expected_doc': None, 'got': None, 'match': True}
{'query': 'What is the customer verification process?', 'expected_doc': 'doc_003', 'got': 'doc_003', 'match': True}

Enterprise KB permission evaluation result map

Because it directly covers the two most important risks in an enterprise knowledge base:

  • It should answer, but fails to answer correctly
  • It should not be visible, but internal documents are exposed

How can you take this project one step closer to portfolio quality?

Section titled “How can you take this project one step closer to portfolio quality?”

Upgrade rule-based retrieval to vector retrieval

Section titled “Upgrade rule-based retrieval to vector retrieval”

The most recommended items to show are:

  • User question
  • Matched document
  • Final answer
  • Source citation
Section titled “Show a few “permission-related failure examples””

This will be very convincing.


Keep this page’s proof of learning as a small evidence card:

Project Goal
user task and business boundary
Baseline
simplest prompt/RAG/app version first
Evaluation
fixed cases, retrieval evidence, answer quality, and citation check
Failure Log
at least one failed case with likely cause
Deliverable
README, run command, screenshots/logs, next step

Only building “can answer,” not “can be traced”

Section titled “Only building “can answer,” not “can be traced””

Only looking at semantic relevance, not permission boundaries

Section titled “Only looking at semantic relevance, not permission boundaries”

When chunks are too coarse, both answers and sources often become vague.


The most important thing in this lesson is to build a portfolio-grade judgment:

What makes enterprise knowledge base Q&A feel like a real project is not that it connects to a retriever, but that it organizes knowledge units, permission boundaries, answer generation, and source traceability into a trustworthy closed loop.

Once this closed loop is clear, this project will look very much like a real enterprise system.


VersionGoalDelivery focus
Basic versionGet the minimal loop runningCan accept input, process it, and output results, while keeping a set of examples
Standard versionForm a showcaseable projectAdd configuration, logs, error handling, README, and screenshots
Advanced versionClose to portfolio qualityAdd evaluation, comparison experiments, failure-case analysis, and a next-step roadmap

It is recommended to finish the basic version first, and do not pursue an all-in-one solution from the beginning. Each time you upgrade a version, write into the README what new capability was added, how it was verified, and what problems still remain.

  1. Add two more “public documents” and one “internal document” to kb to make query competition more realistic.
  2. Why is “correct permissions” sometimes more important than “a beautiful answer” in an enterprise knowledge base project?
  3. Think about it: if document chunks are cut too coarsely, how will that affect the answer and the citation?
  4. If you turn this project into a portfolio piece, which 4 blocks of information would be most worth showing on the homepage?
Project reference and review notes
  1. The added documents should include similar public/internal topics so you can test both ranking quality and permission filtering.
  2. Leaking internal information is a security failure even if the answer sounds elegant. Permission correctness is a hard requirement.
  3. Coarse chunks can mix unrelated facts, produce vague citations, and make permission/citation boundaries unclear.
  4. Strong homepage blocks are problem scope, architecture/retrieval flow, permission model, evaluation results, and failure analysis. Choose the four that best show your project judgment.