9.7.3 Communication Between Agents
Learning Objectives
- Understand why communication is a key factor in whether a multi-Agent system succeeds or fails
- Distinguish between three common communication patterns: message passing, shared state, and event bus
- Read a minimal event bus example
- Understand the engineering differences between synchronous and asynchronous communication
Why Does Communication Become the Core Problem in Multi-Agent Systems?
The Biggest Risk in Multi-Agent Systems Is Not “Not Doing the Work,” but “Not Staying Aligned”
Even if each Agent is strong on its own, the system can still fail because of poor communication design:
- Repeated work
- Lost messages
- Inconsistent understanding of information
- Continuing to discuss a task after it has already been completed
A Very Intuitive Analogy
A multi-Agent system is a lot like a small team working together:
- Division of labor is only the first step
- What often determines efficiency is the communication mechanism: meetings, handoffs, synchronization, and feedback
That is why communication is not an “extra module” — it is a core structure.
Three of the Most Common Communication Patterns
Direct Message Passing
One Agent explicitly sends a message to another Agent.
Pros:
- Simple
- Clear
- Easy to trace
Cons:
- Agents are tightly coupled: the sender has to know which Agent receives the message
Shared State / Blackboard
All Agents write to and read from one shared workspace.
Pros:
- No need for explicit point-to-point messaging every time
- Very suitable for multiple parties collaboratively observing the same task state
Cons:
- The shared state gets messy more easily as more Agents write to it
- Permissions and write conflicts are harder to control
Event Bus
Agents do not necessarily know each other directly; instead, they publish messages to a bus, and subscribers receive them.
Pros:
- More decoupled
- Better for complex systems
Cons:
- Harder to debug, because the path from publisher to subscriber is implicit
Start with the Simplest Point-to-Point Message Passing
A Minimal Example

```python
message = {
    "from": "planner",
    "to": "worker",
    "type": "task_assignment",
    "content": "Please summarize the key conditions of the refund policy",
}

print(message)
```

Expected output:

```
{'from': 'planner', 'to': 'worker', 'type': 'task_assignment', 'content': 'Please summarize the key conditions of the refund policy'}
```

Why Is This Already Important?
Because it makes the key elements of communication explicit:
- Who sent it
- Who it was sent to
- Message type
- Message content
This is much more robust than “just passing some natural language.”
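One way to keep those four elements from being forgotten is to give the message a fixed type. The sketch below is only an illustration: the `Message` dataclass and its field names `sender` and `recipient` are assumptions introduced here (they stand in for `from` and `to`, because `from` is a reserved word in Python).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Message:
    sender: str     # who sent it (the "from" key; `from` is a Python keyword)
    recipient: str  # who it was sent to (the "to" key)
    type: str       # message type
    content: str    # message content

    def to_dict(self) -> dict:
        # Convert back to the plain-dict shape used in the example above.
        return {
            "from": self.sender,
            "to": self.recipient,
            "type": self.type,
            "content": self.content,
        }


msg = Message(
    sender="planner",
    recipient="worker",
    type="task_assignment",
    content="Please summarize the key conditions of the refund policy",
)
print(msg.to_dict())
```

With this shape, leaving out any of the four fields raises a `TypeError` when the message is constructed, rather than surfacing later inside the receiving Agent.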
Why Should Message Formats Be Standardized?
A Bad Message Format

```python
bad_message = "Help me do this task"
print(bad_message)
```

Expected output:

```
Help me do this task
```

The problem is:
- You do not know who sent it
- You do not know the task type
- You do not know the context
- You do not know what to do next
A More Reliable Message Structure

```python
good_message = {
    "from": "planner",
    "to": "researcher",
    "type": "search_request",
    "task_id": "task_001",
    "payload": {"query": "refund policy"},
}

print(good_message)
```

Expected output:

```
{'from': 'planner', 'to': 'researcher', 'type': 'search_request', 'task_id': 'task_001', 'payload': {'query': 'refund policy'}}
```

This is much closer to a message that can enter a system pipeline.
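A structured message is also something a receiver can check before acting on it. The following is a minimal sketch of such a check. The helper name `validate_message` and the `REQUIRED_FIELDS` table are hypothetical and only mirror the fields used in this section.

```python
# Hypothetical contract: field name -> expected Python type.
REQUIRED_FIELDS = {
    "from": str,
    "to": str,
    "type": str,
    "task_id": str,
    "payload": dict,
}


def validate_message(message) -> list:
    """Return a list of problems; an empty list means the message looks usable."""
    if not isinstance(message, dict):
        return ["message is not a dict"]
    problems = []
    for field, expected_type in REQUIRED_FIELDS.items():
        if field not in message:
            problems.append(f"missing field: {field}")
        elif not isinstance(message[field], expected_type):
            problems.append(f"field {field} should be {expected_type.__name__}")
    return problems


structured = {
    "from": "planner",
    "to": "researcher",
    "type": "search_request",
    "task_id": "task_001",
    "payload": {"query": "refund policy"},
}
print(validate_message(structured))              # []
print(validate_message("Help me do this task"))  # ['message is not a dict']
```

Returning a list of problems instead of raising on the first one is a design choice for the sketch: it lets the receiver log everything that is wrong with a message in one pass.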

A Minimal Event Bus Example
Runnable Code

```python
from collections import defaultdict


class EventBus:
    def __init__(self):
        # Maps each event type to the list of handlers subscribed to it.
        self.handlers = defaultdict(list)

    def subscribe(self, event_type, handler):
        self.handlers[event_type].append(handler)

    def publish(self, event_type, payload):
        # Deliver the payload to every subscriber of this event type, in order.
        for handler in self.handlers[event_type]:
            handler(payload)


def planner_handler(payload):
    print("[planner] received result:", payload)


def worker_handler(payload):
    print("[worker] received task:", payload)
    result = {
        "task_id": payload["task_id"],
        "summary": f"Finished retrieving information about {payload['query']}",
    }
    # Uses the module-level `bus` created below.
    bus.publish("task_done", result)


bus = EventBus()
bus.subscribe("task_assignment", worker_handler)
bus.subscribe("task_done", planner_handler)

bus.publish("task_assignment", {"task_id": "task_001", "query": "refund policy"})
```

Expected output:

```
[worker] received task: {'task_id': 'task_001', 'query': 'refund policy'}
[planner] received result: {'task_id': 'task_001', 'summary': 'Finished retrieving information about refund policy'}
```

What Does This Code Actually Teach?
It teaches you:
- Communication does not have to be point-to-point coupled
- You can decouple components through event types
- Task assignments and completion results can travel over the same underlying infrastructure
This is already very close to the communication backbone of a real system.
Shared State: When Is It More Suitable?
A Very Typical Scenario
If multiple Agents are working around the same task, such as:
- `planner` writing the plan
- `retriever` collecting materials
- `writer` generating a draft
- `reviewer` writing review comments
Then much of the information can be placed in a shared workspace.
A Minimal Example

```python
shared_state = {
    "goal": "Complete the refund policy summary",
    "plan": [],
    "evidence": [],
    "draft": None,
    "review": None,
}

# planner
shared_state["plan"] = ["check policy", "organize key points", "output summary"]

# retriever
shared_state["evidence"].append(
    "Refunds are available within 7 days after purchase if study progress is below 20%"
)

# writer
shared_state["draft"] = "Refund conditions include time limits and study progress limits."

print(shared_state)
```

Expected output:

```
{'goal': 'Complete the refund policy summary', 'plan': ['check policy', 'organize key points', 'output summary'], 'evidence': ['Refunds are available within 7 days after purchase if study progress is below 20%'], 'draft': 'Refund conditions include time limits and study progress limits.', 'review': None}
```

Pros and Cons of This Approach
Pros:
- Everyone can see the same blackboard
- The state is more centralized
Cons:
- You need to control who can write what
- Conflicts are easy to create
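The "who can write what" concern can be sketched as a thin wrapper around the shared dictionary. This is an illustrative assumption, not a standard API: the `Blackboard` class and the role-to-keys permission table are hypothetical, and the sketch covers write permissions only, not concurrent-write conflicts.

```python
class Blackboard:
    def __init__(self, initial: dict, write_permissions: dict):
        self._state = dict(initial)
        # Maps a role name to the set of keys that role may write.
        self._write_permissions = write_permissions

    def read(self, key):
        # Every Agent can read every key.
        return self._state[key]

    def write(self, role: str, key: str, value) -> None:
        # Reject writes to keys the role does not own.
        if key not in self._write_permissions.get(role, set()):
            raise PermissionError(f"{role} may not write {key!r}")
        self._state[key] = value


board = Blackboard(
    initial={"plan": [], "draft": None, "review": None},
    write_permissions={
        "planner": {"plan"},
        "writer": {"draft"},
        "reviewer": {"review"},
    },
)

board.write("planner", "plan", ["check policy", "output summary"])
board.write("writer", "draft", "Refund conditions include time limits.")

try:
    board.write("writer", "plan", ["overwrite the plan"])
except PermissionError as exc:
    print("rejected:", exc)  # rejected: writer may not write 'plan'
```

Giving each key a single owning role is the simplest policy. It removes one class of conflict (two roles overwriting the same field) at the cost of flexibility.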
How Should We Understand Synchronous and Asynchronous Communication?
Synchronous Communication
After an Agent sends a request, it must wait for the other side to reply before it can continue.
Pros:
- Simple
- Easy to understand
Cons:
- Can easily block progress
Asynchronous Communication
After sending a message, the Agent continues doing other work first, and handles the result later when the other side finishes.
Pros:
- More flexible
- Better for complex systems and high concurrency
Cons:
- More complex state management
A Very Practical Engineering Rule of Thumb
If your task chain is short and the process is clear, start with synchronous communication. If the task is long and waiting time is unstable, then consider asynchronous communication.
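The difference can be made concrete with Python's `asyncio`. In this sketch, the `worker` coroutine is hypothetical and `asyncio.sleep` stands in for a slow model or tool call. Awaiting each request before sending the next approximates synchronous request/reply, while `asyncio.gather` sends all requests first and collects the replies later.

```python
import asyncio
import time


async def worker(task: str, delay: float) -> str:
    await asyncio.sleep(delay)  # stands in for a slow model or tool call
    return f"done: {task}"


async def run_one_at_a_time(tasks):
    # Wait for each reply before sending the next request.
    return [await worker(task, 0.1) for task in tasks]


async def run_concurrently(tasks):
    # Send every request, then collect the replies as they finish.
    return await asyncio.gather(*(worker(task, 0.1) for task in tasks))


def timed(coroutine_fn, tasks):
    start = time.perf_counter()
    results = asyncio.run(coroutine_fn(tasks))
    return list(results), time.perf_counter() - start


tasks = ["search", "summarize", "review"]
seq_results, seq_seconds = timed(run_one_at_a_time, tasks)
con_results, con_seconds = timed(run_concurrently, tasks)

print(seq_results == con_results)  # True: same replies, same order
print(f"one at a time: {seq_seconds:.2f}s, concurrent: {con_seconds:.2f}s")
```

The replies are identical in both cases and only the waiting time differs. That gain is what the extra state management of asynchronous communication buys.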
The Most Common Failure Points in Agent-to-Agent Communication
Inconsistent Message Formats
Today the field is called `task_id`, tomorrow `id`, and the day after `job_id`. With names drifting like this, the system quickly becomes messy.
A Message Was Sent, but Nobody Handles It
This is a very common issue in event systems:
- It was published
- But there are no subscribers
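One way to make this failure visible is to have the bus record events that nobody received. The sketch below is an assumption-laden variant of a simple event bus: the class name `CheckedEventBus` and its `unhandled` list are hypothetical additions for illustration.

```python
from collections import defaultdict


class CheckedEventBus:
    def __init__(self):
        self.handlers = defaultdict(list)
        self.unhandled = []  # (event_type, payload) pairs that nobody received

    def subscribe(self, event_type, handler):
        self.handlers[event_type].append(handler)

    def publish(self, event_type, payload) -> int:
        """Deliver the event and return how many handlers received it."""
        # .get() avoids creating an empty entry for unknown event types.
        handlers = self.handlers.get(event_type, [])
        if not handlers:
            self.unhandled.append((event_type, payload))
            return 0
        for handler in handlers:
            handler(payload)
        return len(handlers)


received = []
checked_bus = CheckedEventBus()
checked_bus.subscribe("task_assignment", received.append)

delivered = checked_bus.publish("task_assignment", {"task_id": "task_001"})
dropped = checked_bus.publish("task_done", {"task_id": "task_001"})

print(delivered, dropped)     # 1 0
print(checked_bus.unhandled)  # [('task_done', {'task_id': 'task_001'})]
```

A real system might log or alert on `unhandled` entries. Here they are only kept in memory so the problem can be inspected.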
Multiple Agents Interpret the Same Message Differently
For example:
- One Agent thinks it is a “retrieval request”
- Another Agent thinks it is a “summary request”
This will cause the system to drift off course.
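A closed set of message types is one way to reduce this kind of drift, since every Agent then has to pick from the same vocabulary. The sketch below assumes a hypothetical `MessageType` enum, and the two type strings are illustrative only.

```python
from enum import Enum


class MessageType(Enum):
    SEARCH_REQUEST = "search_request"
    SUMMARY_REQUEST = "summary_request"


def parse_type(raw: str) -> MessageType:
    # Enum lookup by value raises ValueError for strings outside the agreed set.
    return MessageType(raw)


print(parse_type("search_request"))  # MessageType.SEARCH_REQUEST

try:
    parse_type("retrieval")
except ValueError as exc:
    print("rejected:", exc)
```

An unknown type is rejected at the boundary instead of being interpreted differently by each receiver.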
No Timeouts or Retries
If one Agent gets stuck, the whole system may keep waiting forever.
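A minimal sketch of bounding the wait, using `asyncio.wait_for` from the standard library. The `stuck_agent` coroutine and the `ask_with_timeout` helper are hypothetical names for illustration, and the 10-second sleep simulates an Agent that does not answer in time.

```python
import asyncio


async def stuck_agent(request: str) -> str:
    await asyncio.sleep(10)  # simulates an Agent that does not answer in time
    return "late reply"


async def ask_with_timeout(request: str, timeout_s: float) -> dict:
    try:
        # wait_for cancels the inner call once the timeout expires.
        reply = await asyncio.wait_for(stuck_agent(request), timeout=timeout_s)
        return {"status": "ok", "reply": reply}
    except asyncio.TimeoutError:
        return {"status": "timeout", "reply": None}


result = asyncio.run(ask_with_timeout("summarize refund policy", timeout_s=0.05))
print(result)  # {'status': 'timeout', 'reply': None}
```

The caller gets an explicit `timeout` status it can act on, instead of blocking indefinitely.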
How Can Real Systems Make Communication More Reliable?
Unify the Message Protocol
At minimum, standardize:
- `from`
- `to`
- `type`
- `task_id`
- `payload`
Unify State Tracking
Each task should ideally have a unique ID to make it easier to:
- Trace the full chain
- Replay
- Debug
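As a sketch of how a task ID supports tracing and replay, the snippet below groups recorded events by ID. The helpers `new_task_id`, `record`, and `replay`, and the in-memory `trace_log`, are hypothetical. A real system would more likely write to a persistent log.

```python
import uuid
from collections import defaultdict

# task_id -> ordered list of (agent, event) pairs
trace_log = defaultdict(list)


def new_task_id() -> str:
    return f"task_{uuid.uuid4().hex}"


def record(task_id: str, agent: str, event: str) -> None:
    trace_log[task_id].append((agent, event))


def replay(task_id: str) -> list:
    # Return the full chain of events for one task, in the order recorded.
    return list(trace_log.get(task_id, []))


task_id = new_task_id()
record(task_id, "planner", "task_assignment")
record(task_id, "worker", "task_done")

for agent, event in replay(task_id):
    print(f"{task_id[:13]}... {agent}: {event}")
```

Because every event carries the same ID, the chain for one task can be reconstructed even when events from many tasks are interleaved.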
Unify Timeout and Failure Policies
For example:
- Automatic fallback after timeout
- Escalate to a human on failure
- Stop after multiple retries
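These policies can be combined in a single wrapper. The sketch below shows one possible ordering, chosen here as an assumption: retry a bounded number of times, then use a fallback if one exists, otherwise mark the task for human escalation. The function `run_with_policy` and the example agents are hypothetical.

```python
def run_with_policy(action, max_retries=3, fallback=None):
    """Try `action` up to `max_retries` times, then fall back or escalate."""
    last_error = None
    for attempt in range(1, max_retries + 1):
        try:
            return {"status": "ok", "result": action(), "attempts": attempt}
        except Exception as exc:  # broad on purpose, to keep the sketch short
            last_error = exc
    # Retries are exhausted: stop retrying instead of looping forever.
    if fallback is not None:
        return {"status": "fallback", "result": fallback(), "attempts": max_retries}
    return {"status": "escalate_to_human", "error": str(last_error), "attempts": max_retries}


calls = {"count": 0}


def flaky_agent():
    # Fails twice, then succeeds on the third call.
    calls["count"] += 1
    if calls["count"] < 3:
        raise RuntimeError("agent unavailable")
    return "summary ready"


def always_failing_agent():
    raise RuntimeError("agent unavailable")


recovered = run_with_policy(flaky_agent)
fell_back = run_with_policy(always_failing_agent, fallback=lambda: "cached summary")
escalated = run_with_policy(always_failing_agent)

print(recovered)  # {'status': 'ok', 'result': 'summary ready', 'attempts': 3}
print(fell_back)  # {'status': 'fallback', 'result': 'cached summary', 'attempts': 3}
print(escalated)  # {'status': 'escalate_to_human', 'error': 'agent unavailable', 'attempts': 3}
```

Returning a status dictionary rather than raising keeps the outcome explicit, so the calling Agent can route each case differently.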
Evidence to Keep
Keep this page’s proof of learning as a small evidence card:
- Roles: owner, worker, reviewer, or specialist responsibilities
- Message Contract: artifact, request, response, and handoff state
- Coordination: routing, task split, conflict resolution, and final owner
- Failure Check: duplicated work, lost context, no accountable owner, or message loop
- Eval Action: compare multi-agent result against single-agent baseline
Summary
The most important thing in this section is not memorizing the terms “message passing,” “event bus,” and “shared state,” but understanding this:
The key to multi-Agent communication is not just sending messages out, but making the message structure stable, responsibilities clear, and failures controllable.
Only when the communication layer is solid can a multi-Agent system avoid wasting model capability due to organizational chaos.
Exercises
- Add a `reviewer_handler` to the event bus example and make it subscribe to `task_done`.
- Design your own unified message protocol. It should include at least `type`, `task_id`, and `payload`.
- Think about it: when would you prefer shared state over point-to-point messaging?
- Explain in your own words: why is communication design often just as important as task division in a multi-Agent system?
Reference implementation and walkthrough
- `reviewer_handler` should subscribe to `task_done`, read the payload, check whether the result satisfies the criteria, and publish a review event or attach review status to shared state.
- A useful protocol might include `type`, `task_id`, `from`, `to`, `payload`, `evidence`, `status`, and `timestamp`. The exact fields can vary, but message meaning should be stable.
- Prefer shared state when many agents need the same evolving artifact or when point-to-point messages would duplicate large context. Prefer direct messages for simple handoffs and narrow requests.
- Communication design matters because even good roles fail if they receive ambiguous inputs, lose evidence, duplicate work, or cannot tell whether a task is done.