The Hard Part of AI Agents Isn't the Model

11 minute read · 1,800 words · Published March 2026

The first time an AI agent fails in production, it is usually not because the model hallucinated.

It is because something around the model broke.

The API timed out. The workflow retried the wrong step. The agent ran twice and scheduled the same meeting two times.

Suddenly the problem is not prompting.

It is distributed systems.

The first production agent we built failed in a surprisingly boring way: a retry loop executed the same action twice and created duplicate calendar entries for hundreds of users. The model worked perfectly. The infrastructure around it did not.

Most discussions about AI agents focus on prompts. Better prompts. Better models. Better reasoning. But once an agent is responsible for executing real work, the problems look very different.

Orchestration. State management. Failure recovery. Observability. Retry logic. The same problems backend engineers have been solving for decades.

The Model Call Is the Easy Part

An agent calls a model.

A response comes back a few hundred milliseconds later.

Architecturally, this is just a function call.

Except the function is slow, expensive, and occasionally wrong.

Prompt engineering is the easiest part of building an agent. Everything else is infrastructure.

# This is the part everyone focuses on
response = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=1024,
    messages=messages,
    tools=tools,
)

# This is the part that actually matters
# - What if this call times out?
# - What if the response is malformed?
# - What if we need to retry with different context?
# - What if this is step 4 of 12 and step 3 failed?
# - What if we need to resume this workflow tomorrow?
# - What if we need to understand why this agent
#   made a bad decision last Tuesday?

Every external call can fail.

Model APIs. Vector databases. Tool APIs. Internal services.

Eventually one of them will. The question is whether the system handles it gracefully or creates duplicate calendar entries for hundreds of users.
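What "gracefully" means in practice is a few lines of defensive code around every external call. A minimal sketch, assuming a generic transient error type and an invented flaky call for demonstration; a real system would also cap total elapsed time and distinguish error classes:

```python
import random
import time

class TransientError(Exception):
    """Stand-in for a timeout or rate-limit error from any external call."""

def call_with_backoff(fn, max_attempts=3, base_delay=0.01):
    """Retry a flaky call with exponential backoff and jitter."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except TransientError:
            if attempt == max_attempts - 1:
                raise  # budget exhausted: surface the failure, do not loop forever
            # Jittered exponential backoff avoids synchronized retry storms
            time.sleep(base_delay * (2 ** attempt) * random.uniform(0.5, 1.5))

# Demonstration: a call that fails twice, then succeeds on the third attempt
attempts = {"n": 0}

def flaky_model_call():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise TransientError("model API timed out")
    return "ok"

print(call_with_backoff(flaky_model_call))  # -> ok
```

The jitter matters: without it, every agent that hit the same rate limit retries at the same instant and hits it again.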

What Reliable AI Agents Actually Need

Orchestration

An agent that performs a multi-step task (researching a topic, analyzing data, drafting a report, requesting approval) is a workflow. It has steps. Steps have dependencies. Some run in parallel. Some must run sequentially.

If you have ever debugged a distributed workflow at 2 AM, this architecture will look familiar.

[Diagram: an AI agent workflow. The model call is just another task in the pipeline.]

[Diagram: a transaction processing workflow. The structure is identical.]

If you have ever looked at a payment processing workflow, the structure is identical. Authorization. Fraud checks. Settlement. Reconciliation. Each step persists state and can resume safely after failure.

Agent workflows look exactly the same. A workflow engine already provides everything the agent needs: step management, dependency resolution, parallel execution, and durable state that survives process restarts.

Rebuilding this from scratch for every agent is unnecessary work.

Agent workflows are not request-response interactions. They run for minutes. Sometimes hours. And occasionally they fail halfway through. Production systems have a way of finding those failure modes quickly.
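To make the shape concrete, here is a minimal dependency resolver in Python. The step names and graph are hypothetical stand-ins for what a real workflow engine manages for you, along with durability and parallelism:

```python
# Hypothetical step graph: each step lists the steps it depends on.
WORKFLOW = {
    "research": [],
    "analyze": ["research"],
    "draft": ["analyze"],
    "request_approval": ["draft"],
}

def execution_order(graph):
    """Resolve dependencies into a valid run order (depth-first topological sort)."""
    order, done = [], set()

    def visit(step):
        for dep in graph[step]:
            if dep not in done:
                visit(dep)
        if step not in done:
            done.add(step)
            order.append(step)

    for step in graph:
        visit(step)
    return order

print(execution_order(WORKFLOW))
# -> ['research', 'analyze', 'draft', 'request_approval']
```

Steps whose dependencies are all satisfied can run in parallel; a sequential order like this is just the degenerate case.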

State Management

AI agents are stateful. A research agent accumulates context across multiple interactions. A coding agent tracks file changes, test results, and previous attempts. A customer support agent carries conversation history and customer state.

This state must survive failures.

Suppose an agent schedules meetings, updates project tasks, and sends notifications. If it crashes after scheduling the meeting but before updating the task, it must resume from the right point. Not re-schedule the meeting.

Your calendar will not appreciate that.

interface AgentState {
  workflow_id: string;
  current_step: string;
  context: Record<string, unknown>;
  tool_results: ToolResult[];
  conversation_history: Message[];
  retry_count: number;
  created_at: string;
  updated_at: string;
}

// State must be persisted durably, not held in memory
async function advanceAgent(state: AgentState): Promise<AgentState> {
  const step = getStep(state.current_step);

  const result = await step.execute(state.context);

  const nextState = {
    ...state,
    current_step: step.next(result),
    context: { ...state.context, [step.name]: result },
    updated_at: new Date().toISOString(),
  };

  // Persist before proceeding - crash recovery depends on this
  await stateStore.save(nextState);

  return nextState;
}

If this looks familiar, it should. It is the same durable execution pattern used by systems like Temporal and Inngest. The agent's state machine is persisted after every transition, making the workflow resumable from any point.
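The resume path is worth spelling out. A minimal Python sketch using the meeting example, with invented step names; a real system would persist to a database rather than take a save callback:

```python
# Completed steps are skipped on restart because their results
# are already recorded in the persisted context.
STEPS = ["schedule_meeting", "update_task", "send_notification"]

def run_workflow(state, execute, save):
    """Run remaining steps, persisting after each so a crash is resumable."""
    start = STEPS.index(state["current_step"])
    for step in STEPS[start:]:
        if step in state["context"]:
            continue  # done before the crash - do not redo the side effect
        state["context"][step] = execute(step)
        state["current_step"] = step
        save(state)  # durable write before moving on
    return state

# Simulated restart after a crash: the meeting was already scheduled
persisted = {"current_step": "update_task",
             "context": {"schedule_meeting": "meeting-123"}}
executed = []
state = run_workflow(
    persisted,
    lambda s: executed.append(s) or f"{s}-done",
    lambda s: None,  # stand-in for a durable state store
)
print(executed)  # -> ['update_task', 'send_notification']
```

The meeting is never re-scheduled: the persisted context, not the process's memory, decides what still needs to run.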

Retry Logic

Model calls fail.

APIs time out. Rate limits trigger. Context windows overflow. Tool calls return errors.

Sooner or later one of these will happen in the middle of a workflow.

Each failure type needs a different response.

This is not simple exponential backoff.

It looks much closer to saga compensation.

class AgentRetryPolicy:
    def should_retry(self, error, attempt, step):
        if isinstance(error, RateLimitError):
            # Back off and retry - this is transient
            return RetryDecision(
                retry=True,
                delay=error.retry_after or (2 ** attempt),
            )

        if isinstance(error, ContextOverflowError):
            # Retry with summarized context - not transient,
            # but recoverable with a different strategy
            return RetryDecision(
                retry=True,
                delay=0,
                modify_context=self.summarize_context,
            )

        if isinstance(error, ToolExecutionError):
            # Retry the tool call, not the model call
            if attempt < step.max_tool_retries:
                return RetryDecision(retry=True, delay=1)
            # Tool is broken - ask model to use alternative
            return RetryDecision(
                retry=True,
                delay=0,
                modify_context=self.disable_tool(error.tool),
            )

        if isinstance(error, MalformedResponseError):
            # Model returned unparseable output - retry with
            # explicit format instructions
            if attempt < 3:
                return RetryDecision(
                    retry=True,
                    delay=0,
                    modify_context=self.add_format_reminder,
                )

        return RetryDecision(retry=False)

Agent retry logic is fundamentally different from service retry logic. The recovery strategy often involves modifying the input, not just repeating the same call. Context overflow recovery means summarizing. Tool failures mean disabling a tool and asking the model to use an alternative. Malformed responses need format reinforcement. This is compensation, not backoff.
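A sketch of the driving loop that consumes such decisions. The error type and policy here are simplified stand-ins, but the key move is the same: the retry modifies the input before calling again, rather than repeating the identical request:

```python
import time
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class RetryDecision:
    retry: bool
    delay: float = 0
    modify_context: Optional[Callable] = None

def run_step(call, context, policy, max_attempts=5):
    """Drive one step: on failure, the policy decides how (and with what input) to retry."""
    for attempt in range(max_attempts):
        try:
            return call(context)
        except Exception as err:
            decision = policy(err, attempt)
            if not decision.retry:
                raise
            time.sleep(decision.delay)
            if decision.modify_context:
                context = decision.modify_context(context)  # compensation: change the input
    raise RuntimeError("retry budget exhausted")

# Hypothetical policy: malformed output gets a format reminder appended
def policy(err, attempt):
    if isinstance(err, ValueError) and attempt < 3:
        return RetryDecision(
            retry=True,
            modify_context=lambda c: c + "\nRespond with valid JSON only.",
        )
    return RetryDecision(retry=False)

def fake_model(ctx):
    if "valid JSON" not in ctx:
        raise ValueError("malformed response")
    return {"ok": True}

print(run_step(fake_model, "summarize the report", policy))  # -> {'ok': True}
```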

Observability

When an agent makes a bad decision in production, you need to know why. Not eventually. Immediately.

This requires tracing every step of the workflow: what context was provided, what the model returned, what tools were called, and how the agent decided to proceed.

A real trace might look something like this:

[Diagram: an agent trace spanning model calls, tool executions, and state transitions. Total: 5.75s, cost: $0.042, retries: 1.]

This trace shows something traditional logs cannot: the model returned malformed JSON on the first attempt, the retry policy added format instructions, and the second attempt succeeded. Without this level of detail, debugging agent failures is guesswork.

This is distributed tracing applied to AI workflows. The same correlation ID that links all events in a payment processing workflow serves the same purpose here, connecting every model call, tool execution, and state transition into a single debuggable trace.
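One way to carry that correlation ID is Python's contextvars, so every nested call records against the same trace without threading an ID through each signature. A minimal sketch; the span recorder is an invented stand-in for a real tracing client:

```python
import contextvars
import time
import uuid

# One correlation ID per workflow run, visible to every nested call
correlation_id = contextvars.ContextVar("correlation_id")

def traced(name, events):
    """Record a span (name, correlation ID, latency) for each call of the wrapped function."""
    def wrap(fn):
        def inner(*args, **kwargs):
            start = time.perf_counter()
            try:
                return fn(*args, **kwargs)
            finally:
                events.append({
                    "span": name,
                    "correlation_id": correlation_id.get(),
                    "latency_ms": round((time.perf_counter() - start) * 1000, 2),
                })
        return inner
    return wrap

events = []

@traced("model.call", events)
def call_model(prompt):
    return f"response to {prompt!r}"

@traced("tool.search", events)
def search_tool(query):
    return ["result"]

correlation_id.set(str(uuid.uuid4()))
call_model("plan the task")
search_tool("background docs")
print([e["span"] for e in events])  # -> ['model.call', 'tool.search']
```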

Events Are the Foundation

The most scalable agent architectures are event-driven. The agent emits events for each significant action. Downstream systems consume them for logging, monitoring, billing, and audit.

class AgentEventEmitter:
    def emit_step_started(self, workflow_id, step, context):
        self.stream.produce("agent.step.started", {
            "workflow_id": workflow_id,
            "step": step.name,
            "context_size": len(json.dumps(context)),
            "timestamp": datetime.now(timezone.utc).isoformat(),
        })

    def emit_model_call(self, workflow_id, step, request, response):
        self.stream.produce("agent.model.called", {
            "workflow_id": workflow_id,
            "step": step.name,
            "model": request.model,
            "input_tokens": response.usage.input_tokens,
            "output_tokens": response.usage.output_tokens,
            "latency_ms": response.latency_ms,
            "timestamp": datetime.now(timezone.utc).isoformat(),
        })

    def emit_tool_executed(self, workflow_id, step, tool, result):
        self.stream.produce("agent.tool.executed", {
            "workflow_id": workflow_id,
            "step": step.name,
            "tool": tool.name,
            "success": result.success,
            "latency_ms": result.latency_ms,
            "timestamp": datetime.now(timezone.utc).isoformat(),
        })

None of this is new. It is the same event-driven pattern we use in payment processing and fulfillment workflows, applied to a new domain. The event stream becomes the audit trail, the debugging tool, and the billing data source.


If you have ever run distributed systems in production, none of this will feel new. AI agents just make those problems visible again.

A reliable AI agent looks less like a chatbot and more like a workflow engine coordinating distributed work. Every requirement maps to an established pattern:

  • Orchestration: workflow engines (Temporal, Inngest, Step Functions)
  • State: durable execution, event sourcing
  • Retries: backoff, circuit breakers, retry budgets, compensation
  • Observability: distributed tracing, structured logging, metrics pipelines
  • Failure recovery: saga pattern, compensation logic
  • Idempotency: idempotency keys, deduplication
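The idempotency-key pattern is the one that would have prevented the duplicate calendar entries from the opening story. A minimal in-memory sketch; a production version would back this with a durable store shared across processes:

```python
# Naive in-memory idempotency store for illustration only.
class IdempotentExecutor:
    def __init__(self):
        self.results = {}

    def execute(self, key, action):
        """Run the action once per key; replay the stored result on retries."""
        if key in self.results:
            return self.results[key]  # duplicate call: no side effect, same result
        result = action()
        self.results[key] = result
        return result

calendar = []

def schedule_meeting():
    calendar.append("standup")
    return "event-42"

ex = IdempotentExecutor()
# A retry delivers the same key twice; the meeting is scheduled once.
ex.execute("wf-1:schedule_meeting", schedule_meeting)
ex.execute("wf-1:schedule_meeting", schedule_meeting)
print(len(calendar))  # -> 1
```

The key is derived from the workflow ID and step name, so a resumed workflow replays results instead of repeating side effects.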

None of these problems are new. Distributed systems and workflow engines solved them years ago. AI agents just reintroduce them under a different name.

If you are building agents for production:

  • Use a workflow engine for orchestration. Do not build state machines from scratch.
  • Persist agent state durably. In-memory state dies with the process.
  • Design retry policies per failure type. Not every error deserves the same response.
  • Emit events for every significant action. You will need the audit trail.
  • Trace model calls like you trace service calls: with correlation IDs, latency, and cost.

Once agents start executing real work instead of generating text, they stop being an AI problem.

They become an infrastructure problem.
