8.4.3 API Design and Serviceization

  • Understand the most basic things an LLM service API should define
  • Learn how to design clear request and response structures
  • Understand key service concepts such as idempotency, error returns, trace_id, and version management
  • Read and understand a minimal API processing loop

API design becomes much easier once these words are no longer mysterious:

| Term | Beginner meaning | In this lesson |
| --- | --- | --- |
| API | Application Programming Interface, a stable way for one program to call another program | The service entry point that other code depends on |
| endpoint | A specific callable address, such as /api/v1/chat | The URL path where a capability is exposed |
| schema | A rule that defines which fields are allowed and required | It keeps request and response structures predictable |
| payload | The data body sent with a request | In this lesson, it is usually the user query and related metadata |
| trace_id | A unique ID for following one request through the system | It helps connect API logs, retrieval logs, model logs, and errors |
| idempotency | Repeating the same request does not create uncontrolled side effects | It matters when retries happen after timeouts or network failures |

Do not just memorize these as vocabulary. In a real system, these terms are the pieces that let the frontend, backend, logs, evaluation, and deployment work together.


Why API design is not “just wrapping some JSON”

bad_request = {
    "msg": "What is the refund policy?"
}
bad_response = {
    "text": "Refunds are available within 7 days"
}

What’s wrong here?

  • What is msg? User message? System message?
  • No trace_id
  • No error structure
  • No version information
  • No context field

At its core, an API design answers four questions:

  • What does the input look like?
  • What does the output look like?
  • How should errors be represented?
  • Can it stay stable when called 10 times or 100,000 times?

In other words, API design is not “just writing an entry point”; it is defining:

The contract between the system and the outside world.


A minimal request structure usually needs at least these fields

  • query
  • user_id (optional)
  • session_id (for multi-turn scenarios)
  • metadata (optional)

request = {
    "query": "What is the refund policy?",
    "user_id": 1,
    "session_id": "sess_001",
    "metadata": {
        "channel": "web"
    }
}
print(request)

Expected output:

{'query': 'What is the refund policy?', 'user_id': 1, 'session_id': 'sess_001', 'metadata': {'channel': 'web'}}

This structure already makes clear:

  • What the query is
  • Who sent it
  • Which session it belongs to
  • What extra context is included

This is much better than “passing only a string.”


Why must the response also be standardized?


Because real consumers are often not just people, but also:

  • Frontend applications
  • Other services
  • Logging systems
  • Evaluation systems

They all need to consume the result reliably.

response = {
    "trace_id": "trace_001",
    "answer": "A refund can be requested within 7 days after purchase, provided the learning progress is below 20%.",
    "sources": [
        {"id": "doc_001", "section": "Refund Policy"}
    ],
    "usage": {
        "prompt_tokens": 120,
        "completion_tokens": 35
    }
}
print(response)

Expected output:

{'trace_id': 'trace_001', 'answer': 'A refund can be requested within 7 days after purchase, provided the learning progress is below 20%.', 'sources': [{'id': 'doc_001', 'section': 'Refund Policy'}], 'usage': {'prompt_tokens': 120, 'completion_tokens': 35}}

Each field has a clear job:

  • trace_id: makes it easy to trace the request path
  • answer: the actual business output
  • sources: helps with citation and verification
  • usage: helps with cost analysis

Many systems only design successful responses


But in real engineering, failures like these are common:

  • Invalid parameters
  • Upstream timeouts
  • Insufficient permissions
  • Empty knowledge base

error_response = {
    "trace_id": "trace_002",
    "error": {
        "code": "INVALID_ARGUMENT",
        "message": "query cannot be empty"
    }
}
print(error_response)

Expected output:

{'trace_id': 'trace_002', 'error': {'code': 'INVALID_ARGUMENT', 'message': 'query cannot be empty'}}

This step matters because a structured error tells the caller:

  • What went wrong
  • What category the error belongs to
  • Whether it is worth retrying
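
The three points above can be made explicit in the error structure itself. The following is a minimal sketch, not a standard: the ErrorCode enum, the RETRYABLE set, the make_error helper, and the retryable field are all assumptions introduced for illustration, and which codes count as retryable depends on your own service.

```python
from enum import Enum


class ErrorCode(str, Enum):
    """Hypothetical error categories; extend them to match your own service."""
    INVALID_ARGUMENT = "INVALID_ARGUMENT"
    PERMISSION_DENIED = "PERMISSION_DENIED"
    NOT_FOUND = "NOT_FOUND"
    TIMEOUT = "TIMEOUT"


# Assumption: only transient failures are worth retrying.
RETRYABLE = {ErrorCode.TIMEOUT}


def make_error(trace_id, code, message):
    """Build an error response in the same shape as error_response above."""
    return {
        "trace_id": trace_id,
        "error": {
            "code": code.value,              # what category the error belongs to
            "message": message,              # what went wrong
            "retryable": code in RETRYABLE,  # whether it is worth retrying
        },
    }


print(make_error("trace_003", ErrorCode.TIMEOUT, "upstream model timed out"))
print(make_error("trace_004", ErrorCode.INVALID_ARGUMENT, "query cannot be empty"))
```

With a fixed set of codes, a client can branch on error.code instead of parsing the human-readable message.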

[Figure: API contract, error structure, and version management diagram]


A minimal runnable service handling function


Simulate an API handler with pure Python first

def handle_chat(request):
    # A fixed ID keeps the demo output reproducible; a real service generates one per request.
    trace_id = "trace_demo_001"

    # Validate first: reject a missing, non-string, or blank query.
    query = request.get("query")
    if not isinstance(query, str) or not query.strip():
        return {
            "trace_id": trace_id,
            "error": {
                "code": "INVALID_ARGUMENT",
                "message": "query cannot be empty"
            }
        }

    # Placeholder for the real model call.
    answer = f"System reply: {query}"
    return {
        "trace_id": trace_id,
        "answer": answer,
        "sources": [],
        "usage": {"prompt_tokens": 12, "completion_tokens": 8}
    }


print(handle_chat({"query": "What is the refund policy?"}))
print(handle_chat({"query": ""}))

Expected output:

{'trace_id': 'trace_demo_001', 'answer': 'System reply: What is the refund policy?', 'sources': [], 'usage': {'prompt_tokens': 12, 'completion_tokens': 8}}
{'trace_id': 'trace_demo_001', 'error': {'code': 'INVALID_ARGUMENT', 'message': 'query cannot be empty'}}

This small handler demonstrates three habits:

  1. Validate the request first
  2. Every request should have a trace_id
  3. Both success and failure need a unified structure

These habits are the foundation of service design.
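
The handler above hard-codes its trace_id so the demo output stays reproducible, but a real service needs a fresh ID for every request. A minimal sketch, assuming the standard-library uuid module is good enough for ID generation; new_trace_id and handle_chat_traced are hypothetical names, and the trace_ prefix is only a convention borrowed from the examples in this lesson.

```python
import uuid


def new_trace_id():
    """Return a fresh ID per request; the 'trace_' prefix is just a convention."""
    return f"trace_{uuid.uuid4().hex}"


def handle_chat_traced(request, trace_id=None):
    # Reuse an ID passed in by an upstream caller, otherwise create one.
    trace_id = trace_id or new_trace_id()

    query = request.get("query")
    if not isinstance(query, str) or not query.strip():
        return {
            "trace_id": trace_id,
            "error": {"code": "INVALID_ARGUMENT", "message": "query cannot be empty"},
        }
    return {"trace_id": trace_id, "answer": f"System reply: {query}", "sources": []}


first = handle_chat_traced({"query": "What is the refund policy?"})
second = handle_chat_traced({"query": "What is the refund policy?"})
print(first["trace_id"] != second["trace_id"])  # True: each request gets its own ID
```

Accepting an optional trace_id lets an upstream gateway pass its own ID through, so logs from several services can be joined on one value.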


Simply put, idempotency means:

Repeated calls with the same request should produce the same or a controlled result.

This is especially important in these scenarios:

  • Retries
  • Re-sending after a timeout
  • Network instability

It matters most for operations that change state:

  • Ticket creation
  • Payment initiation
  • Order changes

A pure question-answering API is usually closer to a “read-only operation,” so idempotency is easier to handle.


Why can’t version management be added later?


Once others integrate with your API, changing fields casually becomes hard


If today the response returns:

  • answer

and tomorrow it changes to:

  • response_text

the caller will break immediately.

api_info = {
    "version": "v1",
    "endpoint": "/api/v1/chat"
}
print(api_info)

Expected output:

{'version': 'v1', 'endpoint': '/api/v1/chat'}

Even for a small project, it is best to build version awareness early.
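
One way to rename a field without breaking existing callers is to publish the change under a new version and leave the old endpoint untouched. A minimal sketch, assuming a hypothetical /api/v2/chat endpoint and a plain dictionary standing in for a real router; chat_v1, chat_v2, ROUTES, and dispatch are illustrative names.

```python
def chat_v1(request):
    # v1 contract: the reply lives in "answer". Existing clients depend on this.
    return {"version": "v1", "answer": f"System reply: {request['query']}"}


def chat_v2(request):
    # Hypothetical v2 contract: the field is renamed, so it ships as a new version.
    return {"version": "v2", "response_text": f"System reply: {request['query']}"}


# A dictionary stands in for the web framework's routing table.
ROUTES = {
    "/api/v1/chat": chat_v1,
    "/api/v2/chat": chat_v2,
}


def dispatch(path, request):
    handler = ROUTES.get(path)
    if handler is None:
        return {"error": {"code": "NOT_FOUND", "message": f"unknown endpoint: {path}"}}
    return handler(request)


print(dispatch("/api/v1/chat", {"query": "hi"}))  # still returns "answer"
print(dispatch("/api/v2/chat", {"query": "hi"}))  # returns "response_text"
```

Old clients keep calling v1 until they migrate, which turns a breaking change into a scheduled one.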


A FastAPI example closer to a real service


If you want to see a style closer to a real backend, take a look at this minimal version.

from fastapi import FastAPI
from pydantic import BaseModel, Field


class ChatRequest(BaseModel):
    query: str = Field(min_length=1)
    session_id: str | None = None


app = FastAPI()


@app.post("/api/v1/chat")
def chat(payload: ChatRequest):
    return {
        "trace_id": "trace_demo_002",
        "answer": f"System reply: {payload.query}",
        "session_id": payload.session_id,
    }

Although this code is simple, it is already closer to a real service because ChatRequest is a request schema. FastAPI uses it to validate the payload before your business logic runs. In production, you would usually add authentication, structured errors, logging, and a real trace ID generator.
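
Of the production additions mentioned above, structured logging is the easiest to sketch without extra dependencies. The example below uses only the standard library and is independent of FastAPI; log_event, chat_with_logging, the chat_api logger name, and the event names are assumptions made for illustration.

```python
import json
import logging
import time
import uuid

logger = logging.getLogger("chat_api")


def log_event(trace_id, event, **fields):
    """Emit one JSON log line and return it, so it is easy to inspect."""
    line = json.dumps({"trace_id": trace_id, "event": event, **fields}, sort_keys=True)
    logger.info(line)
    return line


def chat_with_logging(query):
    trace_id = f"trace_{uuid.uuid4().hex}"
    started = time.perf_counter()

    log_event(trace_id, "request_received", query_length=len(query))
    answer = f"System reply: {query}"  # placeholder for the real model call
    latency_ms = round((time.perf_counter() - started) * 1000, 2)
    log_event(trace_id, "request_completed", latency_ms=latency_ms)

    return {"trace_id": trace_id, "answer": answer}


# Print the bare JSON lines so they can be parsed by a log collector.
logging.basicConfig(level=logging.INFO, format="%(message)s")
print(chat_with_logging("What is the refund policy?")["answer"])
```

Because every line carries the same trace_id, the received and completed events for one request can be joined later, even when many requests interleave.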


If your goal is a “knowledge-base-driven SOP document assistant,” what should the minimal API look like?


These systems usually need more than just a /chat endpoint. At minimum, they often have interfaces like these:

| Endpoint | What it is responsible for |
| --- | --- |
| /sop-drafts/generate | Generate a structured SOP draft from policy, case, and checklist evidence |
| /sop-drafts/preview | Preview structured SOP sections before export |
| /documents/ingest | Upload and parse PDF / Word / PPT |
| /retrieval/search | Debug retrieval results |

When building for the first time, a safer default is usually:

  1. Start with only one generate endpoint
  2. Return structured results or an export link first
  3. Then add debugging and batch interfaces

A minimal request structure can start like this:

generate_request = {
    "topic": "Refund escalation SOP",
    "audience": "frontline support",
    "doc_format": "word",
    "case_count": 2,
    "checklist_required": True,
}
print(generate_request)

Expected output:

{'topic': 'Refund escalation SOP', 'audience': 'frontline support', 'doc_format': 'word', 'case_count': 2, 'checklist_required': True}

This object is valuable because it turns the slots collected during multi-turn conversation into actual service API parameters.

Hands-on: Simulate an SOP Draft API Contract


Before building a real FastAPI endpoint, first write the request validation and response contract in pure Python. This makes the service boundary clear.

REQUIRED_FIELDS = ["topic", "audience", "doc_format", "case_count", "checklist_required"]


def validate_generate_request(payload):
    """Return (ok, error); error is None when the payload is valid."""
    # A field counts as missing if it is absent or explicitly None.
    missing = [field for field in REQUIRED_FIELDS if field not in payload or payload.get(field) is None]
    if missing:
        return False, {
            "code": "INVALID_ARGUMENT",
            "message": f"missing fields: {missing}"
        }
    # Only checked once all required fields are present.
    if payload["doc_format"] not in {"word", "ppt"}:
        return False, {
            "code": "INVALID_ARGUMENT",
            "message": "doc_format must be word or ppt"
        }
    return True, None


def handle_generate(payload):
    trace_id = "trace_sop_001"  # fixed for the demo
    ok, error = validate_generate_request(payload)
    if not ok:
        return {"trace_id": trace_id, "error": error}
    return {
        "trace_id": trace_id,
        "status": "accepted",
        "sop_draft": {
            "title": payload["topic"],
            "audience": payload["audience"],
            "format": payload["doc_format"],
            "sections": ["Policy Summary", "Handled Cases", "Frontline Checklist"],
        }
    }


generate_request = {
    "topic": "Refund escalation SOP",
    "audience": "frontline support",
    "doc_format": "word",
    "case_count": 2,
    "checklist_required": True,
}
print(handle_generate(generate_request))
# Missing fields are reported first, so the invalid doc_format is not reached here.
print(handle_generate({"topic": "Refund escalation SOP", "doc_format": "pdf"}))

Expected output:

{'trace_id': 'trace_sop_001', 'status': 'accepted', 'sop_draft': {'title': 'Refund escalation SOP', 'audience': 'frontline support', 'format': 'word', 'sections': ['Policy Summary', 'Handled Cases', 'Frontline Checklist']}}
{'trace_id': 'trace_sop_001', 'error': {'code': 'INVALID_ARGUMENT', 'message': "missing fields: ['audience', 'case_count', 'checklist_required']"}}

[Figure: SOP draft API contract result map]

This exercise is useful because it forces you to design success and failure together. A service is not ready just because it can return a happy-path answer.

Passing only a bare string in and out may feel convenient at first, but it becomes very painful later.

Leaving response fields unstandardized makes it increasingly difficult for the frontend and other services to integrate.

Omitting trace_id means that when something goes wrong, it becomes hard to trace the request path.

Binding the API too tightly to a single business logic from the start


This makes future expansion very difficult.


Summarize what you learned on this page as a small evidence card:

  • Service Contract: endpoint, input schema, output schema, error schema
  • Run Signal: latency, throughput, logs, health check, or container status
  • Observability: request id, trace id, structured log, or metric
  • Failure Check: timeout, retry storm, missing log, deployment mismatch
  • Ops Action: backoff, queue, alert, rollout, or rollback

The most important thing in this section is not getting the API to run, but understanding:

The core of API design is turning input, output, errors, and traceability into a stable system contract.

Once the contract is clear, the service can truly be relied on by others for the long term.


  1. Add support for a session_id field to handle_chat().
  2. Design a unified error code enum, such as INVALID_ARGUMENT, TIMEOUT, and NOT_FOUND.
  3. Think about it: if this were a “ticket creation” API, how would you consider idempotency?
  4. Explain in your own words: why is API design essentially about defining a system contract?
Reference implementation and walkthrough
  1. session_id should flow through request parsing, state lookup, logs, and response trace. Validate empty or malformed IDs.
  2. An error enum gives clients stable handling and separates user errors from service errors.
  3. An idempotency key prevents duplicate tickets when the client retries after a timeout.
  4. An API contract defines inputs, outputs, errors, permissions, timing expectations, and compatibility.
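
For exercise 1, one possible sketch is shown below. It assumes session IDs follow the sess_001 style used earlier in this lesson; SESSION_ID_PATTERN, handle_chat_with_session, and the invalid_argument helper are hypothetical, and a real service would also look up session state and write the ID to its logs.

```python
import re

# Assumption: session IDs look like "sess_001"; adjust the pattern to your own format.
SESSION_ID_PATTERN = re.compile(r"sess_[A-Za-z0-9]+")


def invalid_argument(trace_id, message):
    return {"trace_id": trace_id, "error": {"code": "INVALID_ARGUMENT", "message": message}}


def handle_chat_with_session(request):
    trace_id = "trace_demo_003"  # fixed for the demo

    query = request.get("query")
    if not isinstance(query, str) or not query.strip():
        return invalid_argument(trace_id, "query cannot be empty")

    # session_id is optional, but if present it must be well-formed.
    session_id = request.get("session_id")
    if session_id is not None:
        if not isinstance(session_id, str) or not SESSION_ID_PATTERN.fullmatch(session_id):
            return invalid_argument(trace_id, "session_id is malformed")

    return {
        "trace_id": trace_id,
        "session_id": session_id,  # echoed back so the caller can continue the session
        "answer": f"System reply: {query}",
    }


print(handle_chat_with_session({"query": "hi", "session_id": "sess_001"}))
print(handle_chat_with_session({"query": "hi", "session_id": ""}))
```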