Event-Driven Agents: Building AI Systems That React in Real Time
Decouple your agents from your triggers. Here's the architecture.
Why request-response breaks down
Most agent systems start with the same architecture: user sends a request, agent processes it, agent returns a response. This works until it doesn't. A code review triggered by a GitHub push shouldn't block the developer's terminal. A security scan triggered by a dependency update shouldn't wait for the scan to complete before returning. A content moderation agent shouldn't require the publisher to poll for results.
Event-driven architecture decouples the trigger from the execution. Something happens (a push, a webhook, a message in a queue). An agent picks it up. The agent does its work asynchronously. Results are delivered when ready — via callback, webhook, or status update.
This decoupling buys you three things: agents that react to real-world events without polling, independent scaling (add more agents for a busy event type without touching others), and resilience (a crashed agent doesn't take down the whole system).
Event bus patterns
The event bus is the backbone. It receives events from producers and routes them to subscribed agents. Three common implementations:
Message queues (BullMQ, SQS, RabbitMQ). Events are published to named queues. Agents subscribe to queues and process events in order. Messages are persisted — if an agent crashes, the event stays in the queue for retry. This is the most common pattern for production agent systems.
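The retry-on-crash behavior comes from the ack/redeliver contract that all of these brokers expose. Here is a minimal in-memory sketch of that contract — the class and method names are illustrative, and real brokers persist messages to disk or a replicated log rather than memory:

```typescript
// At-least-once queue sketch: a message leaves the queue only when
// explicitly acknowledged. Until then it sits "in flight" and can be
// redelivered if the consumer crashes or nacks it.
type Message<T> = { id: string; body: T; deliveries: number };

class SimpleQueue<T> {
  private pending: Message<T>[] = [];
  private inFlight = new Map<string, Message<T>>();

  publish(id: string, body: T): void {
    this.pending.push({ id, body, deliveries: 0 });
  }

  // Hand the next message to a consumer; it stays in flight until acked.
  receive(): Message<T> | undefined {
    const msg = this.pending.shift();
    if (msg) {
      msg.deliveries++;
      this.inFlight.set(msg.id, msg);
    }
    return msg;
  }

  // Done: the message is removed for good.
  ack(id: string): void {
    this.inFlight.delete(id);
  }

  // Processing failed: put the message back for redelivery.
  nack(id: string): void {
    const msg = this.inFlight.get(id);
    if (msg) {
      this.inFlight.delete(id);
      this.pending.push(msg);
    }
  }
}
```

A crashed agent simply never calls `ack`, so after a visibility timeout (elided here) the broker redelivers — which is exactly why the idempotency requirement discussed later exists.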
Webhooks. External systems (GitHub, Stripe, Slack) push events to your HTTP endpoints. Your endpoint validates the event, enqueues it for processing, and returns 200 immediately. The agent processes the event asynchronously. This is how most CI/CD agents receive triggers.
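The "validate, enqueue, return 200" step hinges on signature validation. For GitHub-style webhooks, the signature arrives in the `X-Hub-Signature-256` header as `sha256=<hex HMAC>` computed over the raw request body; the verification function below follows that scheme (the surrounding HTTP handler and enqueue call are left to the caller):

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

// Verify a GitHub-style webhook signature: HMAC-SHA256 of the raw body
// keyed with the shared webhook secret, hex-encoded, prefixed "sha256=".
function verifySignature(rawBody: string, header: string, secret: string): boolean {
  const expected =
    "sha256=" + createHmac("sha256", secret).update(rawBody).digest("hex");
  const a = Buffer.from(expected);
  const b = Buffer.from(header);
  // timingSafeEqual throws on unequal lengths, so check length first;
  // constant-time comparison prevents timing attacks on the signature.
  return a.length === b.length && timingSafeEqual(a, b);
}
```

Validate against the raw body bytes, not a re-serialized parse of them — re-serialization can reorder keys and break the HMAC.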
Database change streams (Supabase Realtime, MongoDB Change Streams, PostgreSQL LISTEN/NOTIFY). Events are generated by database changes. An agent subscribes to changes on a specific table or collection. When a row is inserted or updated, the agent reacts. Useful for internal event sourcing where the database is the source of truth.
Real-world event-driven agents
Codex's queue-based architecture. OpenAI's Codex uses a fully event-driven design internally. Tasks are submitted to a queue with a channel capacity of 512 — meaning up to 512 tasks can be buffered before backpressure kicks in. Each task is typed using an Op enum that defines the operation: code generation, test execution, file modification, and so on.
When a task enters the queue, it's picked up by the next available agent. The agent processes it, writes results back to a state store, and the task lifecycle advances to the next status. If an agent crashes mid-task, the message remains in the queue and another agent picks it up. Because delivery is at-least-once, no task is ever silently dropped.
Codex's Guardian adversarial verifier runs as a separate event consumer. Every completed task emits a verification event. The Guardian subscribes, validates the result against safety criteria (risk score 0-100), and either approves or rejects. Rejected tasks are re-queued or escalated. The key insight: verification is decoupled from execution. The coding agent doesn't know or care about the Guardian. They communicate only through events.
AutoGen's asynchronous 3-layer architecture. Microsoft's AutoGen framework is built on an event-driven core. Its three layers — Core (message passing), AgentChat (high-level agents), and Extensions (tools and integrations) — communicate through asynchronous events. Agents don't call each other directly. They publish messages to topics. Other agents subscribe to those topics and react.
This design is what gives AutoGen its 54,700 GitHub stars' worth of flexibility. You can add a new agent by subscribing it to existing topics. You can replace an agent by unsubscribing the old one and subscribing the new one. The event bus is the integration layer.
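The topic-subscription style can be sketched in a few lines. This is a generic event bus, not AutoGen's actual API, but it shows the property the paragraph above describes — publishers never reference subscribers directly, so agents can be added or swapped without touching anything else:

```typescript
// Minimal topic-based event bus: agents subscribe to topics and react;
// publishing a message never names its consumers.
type Handler = (message: unknown) => void;

class EventBus {
  private topics = new Map<string, Set<Handler>>();

  subscribe(topic: string, handler: Handler): () => void {
    if (!this.topics.has(topic)) this.topics.set(topic, new Set());
    this.topics.get(topic)!.add(handler);
    // Return an unsubscribe function so agents can be swapped out later
    return () => {
      this.topics.get(topic)?.delete(handler);
    };
  }

  publish(topic: string, message: unknown): void {
    for (const handler of this.topics.get(topic) ?? []) handler(message);
  }
}
```

Replacing an agent is then exactly the move described above: call the old agent's unsubscribe function, subscribe the new one to the same topic.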
Agent as subscriber
Each agent in an event-driven system is a subscriber with three responsibilities:
Event filtering. Not every event is relevant to every agent. A security scanner doesn't need to react to documentation changes. Agents should filter events by type, source, and metadata before processing.
Idempotent processing. Events can be delivered more than once (at-least-once delivery is the default for most message queues). Your agent must handle duplicate events gracefully. The simplest approach: check if you've already processed this event ID before starting work.
Bounded execution. An event-driven agent that takes 30 minutes to process a single event is a queue bottleneck. Set timeouts. If the task is too large, break it into sub-events and re-queue them.
interface AgentEvent {
id: string;
type: "push" | "pr_opened" | "dependency_update" | "schedule";
source: string;
payload: Record<string, unknown>;
timestamp: number;
metadata: { retryCount: number; correlationId: string };
}
class EventDrivenAgent {
private processedIds = new Set<string>();
constructor(
private name: string,
private eventTypes: string[],
private handler: (event: AgentEvent) => Promise<AgentResult>,
private options: { timeoutMs: number; maxRetries: number }
) {}
async onEvent(event: AgentEvent): Promise<void> {
// Filter: ignore event types this agent doesn't handle
if (!this.eventTypes.includes(event.type)) return;
// Idempotency: skip events we've already processed
// (in production, back this with a persistent store, not memory)
if (this.processedIds.has(event.id)) return;
// Bounded execution: race the handler against a timeout promise.
// Throwing inside a setTimeout callback would escape this try/catch,
// so we reject a promise instead.
let timer: ReturnType<typeof setTimeout>;
const timeout = new Promise<never>((_, reject) => {
timer = setTimeout(() => reject(new Error("Agent timeout")), this.options.timeoutMs);
});
try {
const result = await Promise.race([this.handler(event), timeout]);
this.processedIds.add(event.id);
await this.emitResult(event.metadata.correlationId, result);
} catch (error) {
if (event.metadata.retryCount < this.options.maxRetries) {
await this.requeueWithRetry(event);
} else {
await this.emitFailure(event, error);
}
} finally {
clearTimeout(timer!);
}
}
private async emitResult(correlationId: string, result: AgentResult) {
// Publish result as a new event for downstream consumers
}
private async requeueWithRetry(event: AgentEvent) {
// Re-publish with incremented retryCount
}
private async emitFailure(event: AgentEvent, error: unknown) {
// Publish failure event for monitoring/alerting
}
}

State management: event sourcing and sagas
Event-driven agents are stateless by design. But workflows have state. The question is where that state lives.
Event sourcing. Instead of storing current state, store the sequence of events that produced it. To reconstruct state, replay the events. This gives you a complete audit trail and the ability to rewind and replay. For an AI agent, this means you can replay a failed workflow from any point by replaying events from that point forward.
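A sketch of replay-based state reconstruction — the event shapes here are illustrative, but the mechanism is the general one: state is a fold over the log, and replaying a prefix of the log rewinds to any earlier point:

```typescript
// Event sourcing sketch: state is never stored directly; it is derived
// by folding (reducing) over the event log.
type WorkflowEvent =
  | { type: "task.started"; task: string }
  | { type: "task.completed"; task: string }
  | { type: "task.failed"; task: string };

interface WorkflowState {
  running: string[];
  done: string[];
  failed: string[];
}

function replay(events: WorkflowEvent[]): WorkflowState {
  return events.reduce<WorkflowState>(
    (state, event) => {
      switch (event.type) {
        case "task.started":
          return { ...state, running: [...state.running, event.task] };
        case "task.completed":
          return {
            ...state,
            running: state.running.filter((t) => t !== event.task),
            done: [...state.done, event.task],
          };
        case "task.failed":
          return {
            ...state,
            running: state.running.filter((t) => t !== event.task),
            failed: [...state.failed, event.task],
          };
      }
    },
    { running: [], done: [], failed: [] }
  );
}
```

`replay(log.slice(0, n))` is the rewind operation: the state as it was after the first n events, with no extra bookkeeping required.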
Saga pattern. For multi-step workflows that span multiple agents, use sagas. A saga is a sequence of events where each step either completes or triggers a compensating action. If step 3 fails, step 2's result is rolled back (or a compensation event is emitted).
// Saga: Code Review Pipeline
const reviewSaga = {
steps: [
{ event: "pr.parsed", agent: "parser", compensate: null },
{ event: "pr.analyzed", agent: "reviewer", compensate: "pr.analysis_reverted" },
{ event: "pr.commented", agent: "commenter", compensate: "pr.comment_deleted" },
],
};

Temporal makes sagas practical for production. It handles durability, retries, and compensation out of the box. Without Temporal (or a similar workflow engine), implementing sagas from scratch is a significant engineering investment.
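To make the compensation mechanics concrete, here is an in-memory sketch of what a saga runner does with a step table like the one above. Temporal does this durably across process restarts; this version only shows the control flow, and the step/field names are illustrative:

```typescript
// Saga runner sketch: execute steps in order; on failure, run the
// compensations of already-completed steps in reverse order.
interface SagaStep {
  name: string;
  run: () => Promise<void>;
  compensate?: () => Promise<void>; // null/undefined = nothing to undo
}

async function runSaga(
  steps: SagaStep[]
): Promise<{ ok: boolean; compensated: string[] }> {
  const completed: SagaStep[] = [];
  for (const step of steps) {
    try {
      await step.run();
      completed.push(step);
    } catch {
      const compensated: string[] = [];
      // Undo in reverse order, skipping steps with no compensation
      for (const done of completed.reverse()) {
        if (done.compensate) {
          await done.compensate();
          compensated.push(done.name);
        }
      }
      return { ok: false, compensated };
    }
  }
  return { ok: true, compensated: [] };
}
```

Note that the failing step itself is never compensated — it never completed — which matches the saga definition above: each step either completes or triggers compensation of its predecessors.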
Example: CI/CD agent on GitHub push
Here's a complete event flow for an agent that reacts to GitHub push events:
GitHub push webhook
|
v
[API endpoint] -- validate signature -- enqueue
|
v
[Message Queue]
|
|--> [Test Runner Agent]
| |
| v
| Run tests --> emit "tests.completed"
|
|--> [Security Scanner Agent]
| |
| v
| Scan dependencies --> emit "scan.completed"
|
+--> [Lint Agent]
|
v
Check style --> emit "lint.completed"
| (all three complete)
v
[Aggregator Agent]
|
v
Post GitHub status check

The three scanning agents run in parallel, each subscribed to the "push" event. The aggregator agent is subscribed to the completion events from all three. It waits for all three to arrive (using a correlation ID tied to the commit SHA) and then posts the combined result.
If the security scanner crashes, its event stays in the queue. Another instance picks it up. The aggregator doesn't care about the delay — it's just waiting for three events with matching correlation IDs.
When event-driven is overkill
Not every agent system needs an event bus. If your workflow is a simple request-response (user asks a question, agent answers), event-driven architecture adds complexity without benefit.
Use event-driven orchestration when:
- Agents are triggered by external systems (webhooks, schedules, database changes)
- Multiple agents need to react to the same event independently
- Workflows take minutes or hours, not seconds
- Resilience matters more than simplicity — you can't afford to drop events
Use simpler patterns when:
- The workflow is synchronous and fast
- There's one trigger and one agent
- You're prototyping and don't need production resilience yet
For synchronous workflows, [tool-calling loops](/guides/tool-calling-loops-ai-agents) are simpler and more debuggable. For workflows that need human checkpoints, add [human-in-the-loop](/guides/human-in-the-loop-ai-agents) gates — the human becomes another event emitter in the system.
Event-driven architecture is the most powerful orchestration pattern. It's also the most complex. Earn that complexity by starting with simpler patterns and graduating to events when your requirements demand it.