
9.7.3 Communication Between Agents

  • Understand why communication is a key factor in whether a multi-Agent system succeeds or fails
  • Distinguish between three common communication patterns: message passing, shared state, and event bus
  • Read a minimal event bus example
  • Understand the engineering differences between synchronous and asynchronous communication

Why Does Communication Become the Core Problem in Multi-Agent Systems?


The Biggest Risk in Multi-Agent Systems Is Not “Not Doing the Work,” but “Not Staying Aligned”


Even if each Agent is strong on its own, the system can still fail because of poor communication design:

  • Repeated work
  • Lost messages
  • Inconsistent understanding of information
  • Continuing to discuss a task after it has already been completed

A multi-Agent system is a lot like a small team working together:

  • Division of labor is only the first step
  • What often determines efficiency is the communication mechanism: meetings, handoffs, synchronization, and feedback

That is why communication is not an “extra module”; it is a core structure.


Three of the Most Common Communication Patterns


Message passing: one Agent explicitly sends a message to another Agent.

Pros:

  • Simple
  • Clear
  • Easy to trace

Cons:

  • The coupling between Agents is relatively strong

Shared state: all Agents write to and read from one shared workspace.

Pros:

  • No need for explicit point-to-point messaging every time
  • Very suitable for multiple parties collaboratively observing the same task state

Cons:

  • Easier to get messy
  • Harder to control permissions and conflicts

Event bus: Agents do not need to know each other directly. Instead, they publish messages to a bus, and subscribers receive them.

Pros:

  • More decoupled
  • Better for complex systems

Cons:

  • More difficult to debug

Start with the Simplest Point-to-Point Message Passing

message = {
    "from": "planner",
    "to": "worker",
    "type": "task_assignment",
    "content": "Please summarize the key conditions of the refund policy"
}
print(message)

Expected output:

{'from': 'planner', 'to': 'worker', 'type': 'task_assignment', 'content': 'Please summarize the key conditions of the refund policy'}

Even this small structure is useful, because it makes the key elements of communication explicit:

  • Who sent it
  • Who it was sent to
  • Message type
  • Message content

This is much more robust than “just passing some natural language.”


Why Should Message Formats Be Standardized?

bad_message = "Help me do this task"
print(bad_message)

Expected output:

Help me do this task

The problem is:

  • You do not know who sent it
  • You do not know the task type
  • You do not know the context
  • You do not know what to do next

A structured version of the same kind of request looks like this:

good_message = {
    "from": "planner",
    "to": "researcher",
    "type": "search_request",
    "task_id": "task_001",
    "payload": {
        "query": "refund policy"
    }
}
print(good_message)

Expected output:

{'from': 'planner', 'to': 'researcher', 'type': 'search_request', 'task_id': 'task_001', 'payload': {'query': 'refund policy'}}

This is much closer to a message that can enter a system pipeline.
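
One way to enforce a standardized format is to check each message before it enters the pipeline. The sketch below is only an illustration: the helper name `validate_message` and the exact rules are assumptions, and the required fields are taken from the structured message above.

```python
# Hypothetical helper: the field list mirrors the structured message above.
REQUIRED_FIELDS = ("from", "to", "type", "task_id", "payload")


def validate_message(message):
    """Return a list of problems; an empty list means the message looks well-formed."""
    if not isinstance(message, dict):
        return ["message must be a dict"]
    problems = [f"missing field: {name}" for name in REQUIRED_FIELDS if name not in message]
    if "payload" in message and not isinstance(message["payload"], dict):
        problems.append("payload must be a dict")
    return problems


good_message = {
    "from": "planner",
    "to": "researcher",
    "type": "search_request",
    "task_id": "task_001",
    "payload": {"query": "refund policy"},
}

print(validate_message(good_message))            # []
print(validate_message("Help me do this task"))  # ['message must be a dict']
```

Rejecting malformed messages at the boundary keeps every downstream Agent free of defensive parsing code.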

Agent communication contract diagram


The following minimal event bus shows the publish/subscribe pattern:

from collections import defaultdict


class EventBus:
    def __init__(self):
        # Maps each event type to the list of handlers subscribed to it.
        self.handlers = defaultdict(list)

    def subscribe(self, event_type, handler):
        self.handlers[event_type].append(handler)

    def publish(self, event_type, payload):
        # Deliver the payload to every handler registered for this event type.
        for handler in self.handlers[event_type]:
            handler(payload)


def planner_handler(payload):
    print("[planner] received result:", payload)


def worker_handler(payload):
    print("[worker] received task:", payload)
    result = {
        "task_id": payload["task_id"],
        "summary": f"Finished retrieving information about {payload['query']}"
    }
    # The worker reports back through the same bus (the module-level `bus` below).
    bus.publish("task_done", result)


bus = EventBus()
bus.subscribe("task_assignment", worker_handler)
bus.subscribe("task_done", planner_handler)

bus.publish("task_assignment", {
    "task_id": "task_001",
    "query": "refund policy"
})

Expected output:

[worker] received task: {'task_id': 'task_001', 'query': 'refund policy'}
[planner] received result: {'task_id': 'task_001', 'summary': 'Finished retrieving information about refund policy'}

This example shows that:

  • Communication does not have to be point-to-point coupled
  • You can decouple components through event types
  • Completion messages and result messages can use the same underlying infrastructure

This is already very close to the communication backbone of a real system.


If multiple Agents are working around the same task, such as:

  • planner writing the plan
  • retriever collecting materials
  • writer generating a draft
  • reviewer writing review comments

Then much of the information can be placed in a shared workspace.

shared_state = {
    "goal": "Complete the refund policy summary",
    "plan": [],
    "evidence": [],
    "draft": None,
    "review": None
}

# planner
shared_state["plan"] = ["check policy", "organize key points", "output summary"]

# retriever
shared_state["evidence"].append("Refunds are available within 7 days after purchase if study progress is below 20%")

# writer
shared_state["draft"] = "Refund conditions include time limits and study progress limits."

print(shared_state)

Expected output:

{'goal': 'Complete the refund policy summary', 'plan': ['check policy', 'organize key points', 'output summary'], 'evidence': ['Refunds are available within 7 days after purchase if study progress is below 20%'], 'draft': 'Refund conditions include time limits and study progress limits.', 'review': None}

Pros:

  • Everyone can see the same blackboard
  • The state is more centralized

Cons:

  • You need to control who can write what
  • Conflicts are easy to create
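
One possible way to control who can write what is to give each field of the shared workspace a single owning role. This is a minimal sketch, not a prescribed design: the class name `SharedWorkspace`, its methods, and the one-owner-per-field rule are all assumptions made for illustration.

```python
class SharedWorkspace:
    """Hypothetical blackboard where each field has exactly one writer role."""

    def __init__(self, write_permissions):
        # write_permissions maps field name -> the only role allowed to write it.
        self._write_permissions = dict(write_permissions)
        self._state = {field: None for field in write_permissions}

    def write(self, role, field, value):
        # Unknown fields have no owner, so they are rejected as well.
        if self._write_permissions.get(field) != role:
            raise PermissionError(f"{role} may not write {field!r}")
        self._state[field] = value

    def read(self, field):
        # Every role may read every field.
        return self._state[field]


workspace = SharedWorkspace({
    "plan": "planner",
    "evidence": "retriever",
    "draft": "writer",
    "review": "reviewer",
})

workspace.write("planner", "plan", ["check policy", "organize key points", "output summary"])

try:
    workspace.write("writer", "plan", [])  # the writer does not own "plan"
    rejected = False
except PermissionError as error:
    rejected = True
    print("rejected:", error)  # rejected: writer may not write 'plan'
```

Single ownership removes write conflicts on a field entirely, at the cost of flexibility. Fields with several legitimate writers would need a different rule, such as append-only lists or versioned updates.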

How Should We Understand Synchronous and Asynchronous Communication?


Synchronous communication: after an Agent sends a request, it must wait for the other side to reply before it can continue.

Pros:

  • Simple
  • Easy to understand

Cons:

  • Can easily block progress

Asynchronous communication: after sending a message, the Agent continues with other work and handles the result later, once the other side finishes.

Pros:

  • More flexible
  • Better for complex systems and high concurrency

Cons:

  • More complex state management

A Very Practical Engineering Rule of Thumb


If your task chain is short and the process is clear, start with synchronous communication. If the task is long and waiting time is unstable, then consider asynchronous communication.
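
The difference between the two styles can be sketched with Python's `asyncio`. This is an illustration under assumptions: the `worker` coroutine and its fixed delay stand in for a real Agent call, and the synchronous style is modelled as waiting for each reply before sending the next request.

```python
import asyncio


async def worker(task_id, delay):
    # Stand-in for a real Agent: sleeps to simulate work, then replies.
    await asyncio.sleep(delay)
    return f"{task_id} done"


async def run_sequential():
    # Synchronous style: wait for each reply before sending the next request.
    results = []
    for task_id in ("task_001", "task_002"):
        results.append(await worker(task_id, 0.05))
    return results


async def run_concurrent():
    # Asynchronous style: send both requests, then collect the replies later.
    return await asyncio.gather(
        worker("task_001", 0.05),
        worker("task_002", 0.05),
    )


sequential_results = asyncio.run(run_sequential())
concurrent_results = asyncio.run(run_concurrent())
print(sequential_results)
print(concurrent_results)
```

Both versions return the same replies in the same order. The sequential version takes roughly the sum of the delays, while the concurrent one takes roughly the longest single delay, and in exchange the caller has to keep track of which requests are still in flight.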


The Most Common Failure Points in Agent-to-Agent Communication


One common failure is inconsistent field naming: if the same field is called task_id today, id tomorrow, and job_id the day after, the system quickly becomes messy.

Another very common issue in event systems is a message that nobody handles:

  • The event was published
  • But no subscriber is registered for it
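
One way to catch events that no subscriber receives is to have the bus record them instead of dropping them silently. The sketch below is an assumption-laden illustration: the class name `ReportingEventBus`, the `unhandled` list, and the integer return value of `publish` are all hypothetical.

```python
from collections import defaultdict


class ReportingEventBus:
    """Hypothetical bus that keeps a record of events nobody subscribed to."""

    def __init__(self):
        self.handlers = defaultdict(list)
        self.unhandled = []  # (event_type, payload) pairs with no subscriber

    def subscribe(self, event_type, handler):
        self.handlers[event_type].append(handler)

    def publish(self, event_type, payload):
        """Deliver the payload and return how many handlers received it."""
        handlers = self.handlers.get(event_type, [])
        if not handlers:
            self.unhandled.append((event_type, payload))
            return 0
        for handler in handlers:
            handler(payload)
        return len(handlers)


bus = ReportingEventBus()
received = []
bus.subscribe("task_assignment", received.append)

delivered = bus.publish("task_assignment", {"task_id": "task_001"})
dropped = bus.publish("task_done", {"task_id": "task_001"})  # nobody listens

print(delivered, dropped)  # 1 0
print(bus.unhandled)       # [('task_done', {'task_id': 'task_001'})]
```

In a larger system the `unhandled` list could feed a log or an alert, so a missing subscription shows up during testing rather than as a silently stalled task.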

Multiple Agents Interpret the Same Message Differently


For example:

  • One Agent thinks it is a “retrieval request”
  • Another Agent thinks it is a “summary request”

This will cause the system to drift off course.
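
One way to reduce this ambiguity is to route every message by its explicit type field and to reject types that have no registered meaning. This is a sketch only: the handler names, the `summary_request` type, and the `dispatch` function are assumptions for illustration.

```python
def handle_search(payload):
    return f"searching for {payload['query']}"


def handle_summary(payload):
    return f"summarizing {payload['text']}"


# Hypothetical registry: each message type has exactly one agreed meaning.
HANDLERS = {
    "search_request": handle_search,
    "summary_request": handle_summary,
}


def dispatch(message):
    """Route a message by its type; refuse anything the registry does not know."""
    message_type = message.get("type")
    handler = HANDLERS.get(message_type)
    if handler is None:
        raise ValueError(f"unknown message type: {message_type!r}")
    return handler(message["payload"])


print(dispatch({"type": "search_request", "payload": {"query": "refund policy"}}))
# searching for refund policy
```

Because the meaning lives in the registry rather than in each Agent's reading of free text, two Agents cannot disagree about what a `search_request` is.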

A stuck Agent is another failure point: if one Agent never replies, everything that depends on it may keep waiting forever.
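
A common guard against waiting forever is to put a timeout on every request. The sketch below uses `asyncio.wait_for` from the Python standard library. The `stuck_worker` coroutine and the choice to return `None` on timeout are assumptions for illustration.

```python
import asyncio


async def stuck_worker():
    # Stand-in for an Agent that takes far longer than the caller can afford.
    await asyncio.sleep(10)
    return "never arrives in time"


async def request_with_timeout(timeout_seconds):
    try:
        return await asyncio.wait_for(stuck_worker(), timeout=timeout_seconds)
    except asyncio.TimeoutError:
        # The caller regains control and can retry, fall back, or escalate.
        return None


reply = asyncio.run(request_with_timeout(0.05))
print(reply)  # None
```

`asyncio.wait_for` cancels the awaited coroutine when the timeout expires, so the stuck call does not keep running in the background.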


How Can Real Systems Make Communication More Reliable?


First, standardize the message format. At minimum, fix these fields:

  • from
  • to
  • type
  • task_id
  • payload

Each task should ideally have a unique ID to make it easier to:

  • Trace the full chain
  • Replay
  • Debug
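
A minimal sketch of ID-based tracing, assuming an in-memory log: every message carries its task ID, so the full chain for one task can be filtered out later. The names `new_task_id`, `record`, `trace`, and `message_log` are hypothetical, and `uuid.uuid4` is just one way to generate unique IDs.

```python
import uuid

message_log = []  # hypothetical in-memory log; a real system would persist this


def new_task_id():
    return f"task_{uuid.uuid4().hex}"


def record(sender, receiver, message_type, task_id):
    message_log.append({
        "from": sender,
        "to": receiver,
        "type": message_type,
        "task_id": task_id,
    })


def trace(task_id):
    """Return every logged message for one task, in the order it was recorded."""
    return [entry for entry in message_log if entry["task_id"] == task_id]


first_task = new_task_id()
second_task = new_task_id()

record("planner", "researcher", "search_request", first_task)
record("planner", "writer", "draft_request", second_task)
record("researcher", "planner", "search_result", first_task)

chain = trace(first_task)
print([entry["type"] for entry in chain])  # ['search_request', 'search_result']
```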

Also define timeout and failure policies, for example:

  • Automatic fallback after timeout
  • Escalate to a human on failure
  • Stop after multiple retries
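
The retry, fallback, and escalation policies above can be combined in one small wrapper. This is a sketch under assumptions: the function `call_with_policy`, its status strings, and the simulated failures are hypothetical, and a real system would usually add delays between retries.

```python
def call_with_policy(action, max_retries=3, fallback=None):
    """Try an action a bounded number of times, then fall back or escalate."""
    last_error = None
    for attempt in range(1, max_retries + 1):
        try:
            return {"status": "ok", "result": action(), "attempts": attempt}
        except Exception as error:  # broad on purpose for this sketch
            last_error = error
    if fallback is not None:
        return {"status": "fallback", "result": fallback(), "attempts": max_retries}
    # No fallback available: stop retrying and hand the problem to a human.
    return {"status": "escalate_to_human", "error": str(last_error), "attempts": max_retries}


calls = {"count": 0}


def flaky_search():
    # Simulated worker that fails twice and then succeeds.
    calls["count"] += 1
    if calls["count"] < 3:
        raise RuntimeError("worker unavailable")
    return "refund policy found"


def always_fails():
    raise RuntimeError("worker unavailable")


recovered = call_with_policy(flaky_search)
escalated = call_with_policy(always_fails, max_retries=2)
print(recovered)  # {'status': 'ok', 'result': 'refund policy found', 'attempts': 3}
print(escalated)  # {'status': 'escalate_to_human', 'error': 'worker unavailable', 'attempts': 2}
```

The important property is that every path terminates: the caller always gets a result, a fallback, or an explicit escalation, never an open-ended wait.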

Keep this page’s proof of learning as a small evidence card:

  • Roles: owner, worker, reviewer, or specialist responsibilities
  • Message Contract: artifact, request, response, and handoff state
  • Coordination: routing, task split, conflict resolution, and final owner
  • Failure Check: duplicated work, lost context, no accountable owner, or message loop
  • Eval Action: compare multi-agent result against single-agent baseline

The most important thing in this section is not memorizing the terms “message passing,” “event bus,” and “shared state,” but understanding this:

The key to multi-Agent communication is not just sending messages out, but making the message structure stable, responsibilities clear, and failures controllable.

Only when the communication layer is solid can a multi-Agent system avoid wasting model capability due to organizational chaos.


  1. Add a reviewer_handler to the event bus example and make it subscribe to task_done.
  2. Design your own unified message protocol. It should include at least type, task_id, and payload.
  3. Think about it: when would you prefer shared state over point-to-point messaging?
  4. Explain in your own words: why is communication design often just as important as task division in a multi-Agent system?

Reference implementation and walkthrough

  1. reviewer_handler should subscribe to task_done, read the payload, check whether the result satisfies the criteria, and publish a review event or attach review status to shared state.
  2. A useful protocol might include type, task_id, from, to, payload, evidence, status, and timestamp. The exact fields can vary, but message meaning should be stable.
  3. Prefer shared state when many agents need the same evolving artifact or when point-to-point messages would duplicate large context. Prefer direct messages for simple handoffs and narrow requests.
  4. Communication design matters because even good roles fail if they receive ambiguous inputs, lose evidence, duplicate work, or cannot tell whether a task is done.
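
For exercise 1, one possible sketch is shown below. It is not the only valid answer: the `review_done` event name, the `review_collector` handler, and the rule that a non-empty summary counts as approved are assumptions made for illustration.

```python
from collections import defaultdict


class EventBus:
    def __init__(self):
        self.handlers = defaultdict(list)

    def subscribe(self, event_type, handler):
        self.handlers[event_type].append(handler)

    def publish(self, event_type, payload):
        for handler in self.handlers[event_type]:
            handler(payload)


bus = EventBus()
reviews = []


def reviewer_handler(payload):
    # Hypothetical criterion: a result passes review if its summary is non-empty.
    approved = bool(payload.get("summary"))
    bus.publish("review_done", {"task_id": payload["task_id"], "approved": approved})


def review_collector(payload):
    reviews.append(payload)


bus.subscribe("task_done", reviewer_handler)
bus.subscribe("review_done", review_collector)

bus.publish("task_done", {"task_id": "task_001", "summary": "Finished retrieving information about refund policy"})
bus.publish("task_done", {"task_id": "task_002", "summary": ""})

print(reviews)
# [{'task_id': 'task_001', 'approved': True}, {'task_id': 'task_002', 'approved': False}]
```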