AI Agents Need Infrastructure, Not Prompts

11 minute read | Published March 2026

The AI discourse is dominated by prompts. Better system messages. Chain-of-thought tricks. Temperature tuning. Prompt templates shared on Twitter like trading card collections.

None of this matters for production systems.

The hard problems in AI agents are not linguistic. They are infrastructural. Orchestration. State management. Failure recovery. Observability. Retry logic. Exactly the same problems that backend engineers have been solving in distributed systems for decades.

Prompt engineering is the least interesting part of building AI systems that actually work. The interesting part is everything that happens around the model call.

The Model Call Is the Easy Part

An AI agent makes a model call. The model returns a response. This takes a few hundred milliseconds to a few seconds. It is, architecturally, a function call with a slow and occasionally unreliable external dependency.

The rest of the system - deciding when to call the model, what context to provide, how to handle failures, how to manage state across multi-step workflows, how to observe what happened when things go wrong - is the actual engineering challenge.

# This is the part everyone focuses on
response = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=4096,
    messages=messages,
    tools=tools,
)

# This is the part that actually matters
# - What if this call times out?
# - What if the response is malformed?
# - What if we need to retry with different context?
# - What if this is step 4 of 12 and step 3 failed?
# - What if we need to resume this workflow tomorrow?
# - What if we need to understand why this agent
#   made a bad decision last Tuesday?

Every one of those questions has been answered before in workflow engines, transaction processing systems, and event-driven architectures. The AI industry is rediscovering distributed systems problems and solving them from scratch because most AI engineers have never built a payment system or a fulfillment pipeline.

What AI Agents Actually Need

Orchestration

An AI agent that performs a multi-step task - researching a topic, analyzing data, drafting a report, requesting approval - is a workflow. It has steps. Steps have dependencies. Some steps can run in parallel. Some must run sequentially. Some steps depend on the output of previous steps.

This is exactly what workflow engines solve.

An AI agent workflow is structurally identical to a transaction processing workflow. The model call is just another task.

The workflow engine provides exactly what the agent needs: step management, dependency resolution, parallel execution, and durable state that survives process restarts. Building this from scratch for every AI agent is wasted effort.
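The correspondence is easy to make concrete. Here is a minimal sketch of a dependency-resolving runner; Step and run_workflow are illustrative names, not any particular engine's API, and in a real engine each ready batch could be dispatched in parallel:

```python
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class Step:
    name: str
    run: Callable          # the work: a model call, a tool call, a query
    depends_on: list = field(default_factory=list)

def run_workflow(steps):
    """Execute steps in dependency order, feeding each step its inputs."""
    results, remaining = {}, list(steps)
    while remaining:
        # A step is ready once every dependency has produced a result
        ready = [s for s in remaining if all(d in results for d in s.depends_on)]
        if not ready:
            raise RuntimeError("dependency cycle or missing step")
        for step in ready:
            results[step.name] = step.run({d: results[d] for d in step.depends_on})
            remaining.remove(step)
    return results

# The research -> analyze -> draft agent from above, as a workflow
steps = [
    Step("research", lambda ctx: "3 documents"),
    Step("analyze", lambda ctx: f"analysis of {ctx['research']}", ["research"]),
    Step("draft", lambda ctx: f"report: {ctx['analyze']}", ["analyze"]),
]
final = run_workflow(steps)
```

The model call in "analyze" is just another step function; the runner neither knows nor cares that one step talks to an LLM and another talks to a database.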

State Management

AI agents are stateful. A research agent accumulates context across multiple interactions. A coding agent tracks file changes, test results, and previous attempts. A customer support agent carries conversation history and customer state.

This state must survive failures. If the agent process crashes mid-workflow, the state must be recoverable. If the agent needs to pause and resume hours later (waiting for human approval, for example), the state must persist.

interface AgentState {
  workflow_id: string;
  current_step: string;
  context: Record<string, unknown>;
  tool_results: ToolResult[];
  conversation_history: Message[];
  retry_count: number;
  created_at: string;
  updated_at: string;
}

// State must be persisted durably, not held in memory
async function advanceAgent(state: AgentState): Promise<AgentState> {
  const step = getStep(state.current_step);

  const result = await step.execute(state.context);

  const nextState = {
    ...state,
    current_step: step.next(result),
    context: { ...state.context, [step.name]: result },
    updated_at: new Date().toISOString(),
  };

  // Persist before proceeding - crash recovery depends on this
  await stateStore.save(nextState);

  return nextState;
}

This is durable execution - the same pattern that Temporal, Inngest, and similar systems implement for distributed workflows. The agent's state machine is persisted after every transition, making the workflow resumable from any point.
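The resume path is the payoff. A sketch of the same idea in Python, where an in-memory StateStore stands in for a real database and the field names mirror the AgentState interface above:

```python
import json

class StateStore:
    """Stand-in for a durable store (Postgres, Redis, DynamoDB, ...)."""
    def __init__(self):
        self._rows = {}

    def save(self, state):
        # Serialize on write: only what survives a round trip is durable
        self._rows[state["workflow_id"]] = json.dumps(state)

    def load(self, workflow_id):
        row = self._rows.get(workflow_id)
        return json.loads(row) if row else None

store = StateStore()
store.save({"workflow_id": "wf-1", "current_step": "analyze",
            "context": {"gather_context": "3 documents"}, "retry_count": 0})

# Process crashes here. On restart, resume from the last persisted transition
state = store.load("wf-1")
```

Because the state was persisted after the "gather_context" step, the restarted process resumes at "analyze" instead of starting over.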

Retry Logic

Model calls fail. APIs time out. Rate limits trigger. Context windows overflow. Tool calls return errors. Every external interaction in an agent workflow can fail, and each failure type requires a different response.

from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class RetryDecision:
    retry: bool
    delay: float = 0
    modify_context: Optional[Callable] = None

# The error classes below (RateLimitError, etc.) are the client
# library's or the agent's own domain exceptions.
class AgentRetryPolicy:
    def should_retry(self, error, attempt, step):
        if isinstance(error, RateLimitError):
            # Back off and retry - this is transient
            return RetryDecision(
                retry=True,
                delay=error.retry_after or (2 ** attempt),
            )

        if isinstance(error, ContextOverflowError):
            # Retry with summarized context - not transient,
            # but recoverable with a different strategy
            return RetryDecision(
                retry=True,
                delay=0,
                modify_context=self.summarize_context,
            )

        if isinstance(error, ToolExecutionError):
            # Retry the tool call, not the model call
            if attempt < step.max_tool_retries:
                return RetryDecision(retry=True, delay=1)
            # Tool is broken - ask model to use alternative
            return RetryDecision(
                retry=True,
                delay=0,
                modify_context=self.disable_tool(error.tool),
            )

        if isinstance(error, MalformedResponseError):
            # Model returned unparseable output - retry with
            # explicit format instructions
            if attempt < 3:
                return RetryDecision(
                    retry=True,
                    delay=0,
                    modify_context=self.add_format_reminder,
                )

        return RetryDecision(retry=False)

AI agent retry logic is more complex than typical service retry logic because the recovery strategy often involves modifying the input, not just repeating the same call. This is closer to saga compensation than exponential backoff.

Context overflow recovery requires summarizing previous context. Tool failures might require disabling a tool and asking the model to use an alternative. Malformed responses need format reinforcement. Each of these is a retry with modified input - a pattern uncommon in traditional service retries but natural in workflow engines that support step-level compensation.
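Tying such a policy to a call site looks roughly like this. The RetryDecision fields match the policy above; ReminderPolicy and flaky_call are hypothetical stand-ins for a real policy and a real model call:

```python
import time
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class RetryDecision:
    retry: bool
    delay: float = 0
    modify_context: Optional[Callable] = None

def run_with_policy(call, context, policy, step=None, max_attempts=5):
    """Invoke `call`, consulting the policy and rewriting context on failure."""
    for attempt in range(max_attempts):
        try:
            return call(context)
        except Exception as error:
            decision = policy.should_retry(error, attempt, step)
            if not decision.retry:
                raise
            if decision.delay:
                time.sleep(decision.delay)
            if decision.modify_context:
                # The retry changes the *input*, not just the timing
                context = decision.modify_context(context)
    raise RuntimeError("retry budget exhausted")

# Demo: a call that fails until a format reminder is added to its context
class ReminderPolicy:
    def should_retry(self, error, attempt, step):
        return RetryDecision(retry=True,
                             modify_context=lambda ctx: ctx + " [respond in JSON]")

def flaky_call(ctx):
    if "[respond in JSON]" not in ctx:
        raise ValueError("malformed response")
    return "ok"

result = run_with_policy(flaky_call, "summarize the report", ReminderPolicy())
```

The driver is ordinary retry machinery; the only AI-specific part is that the policy may hand back a context transformer instead of just a delay.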

Observability

When an AI agent makes a bad decision in production, you need to understand why. This requires tracing every step of the workflow: what context was provided, what the model returned, what tools were called, what the results were, and how the agent decided to proceed.

  Agent Workflow Trace
─────────────────────────────────────────────

[step_1] gather_context          200ms
  ├─ tool: search_database        45ms
  ├─ tool: fetch_document        120ms
  └─ result: 3 documents found

[step_2] analyze                 2.4s
  ├─ model_call                  2.1s
  │   ├─ input_tokens: 4,200
  │   ├─ output_tokens: 890
  │   ├─ model: claude-sonnet
  │   └─ cost: $0.018
  ├─ tool: run_query              280ms
  └─ result: analysis complete

[step_3] generate_report         3.1s
  ├─ model_call (attempt 1)      1.8s  FAILED
  │   └─ error: malformed JSON
  ├─ model_call (attempt 2)      1.3s  OK
  │   ├─ input_tokens: 5,100
  │   ├─ output_tokens: 2,400
  │   └─ context_modified: format_reminder added
  └─ result: report generated

[step_4] send_report              50ms
  └─ result: delivered

Total: 5.75s | Cost: $0.042 | Retries: 1
─────────────────────────────────────────────
Agent observability requires tracing across model calls, tool executions, and state transitions - not just request/response logs.

This trace shows something traditional request logs cannot: the model returned malformed JSON on the first attempt, the retry policy added format instructions, and the second attempt succeeded. Without this level of observability, debugging agent failures is guesswork.

The observability infrastructure needs to capture:

  • Input and output for every model call, including token counts and cost
  • Tool call parameters and results
  • State transitions and context modifications
  • Retry decisions and their reasoning
  • End-to-end latency and cost per workflow

This is distributed tracing applied to AI workflows. The correlation ID that links all events in a transaction processing workflow serves the same purpose here - connecting every model call, tool execution, and state transition into a single debuggable trace.
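A minimal version of that correlation can be sketched without any tracing library. Tracer and span here are illustrative names, not an OpenTelemetry API:

```python
import time
import uuid
from contextlib import contextmanager

class Tracer:
    """Collects timed spans keyed by a single workflow correlation ID."""
    def __init__(self, workflow_id=None):
        self.workflow_id = workflow_id or str(uuid.uuid4())
        self.spans = []

    @contextmanager
    def span(self, name, **attrs):
        start = time.monotonic()
        record = {"workflow_id": self.workflow_id, "name": name, **attrs}
        try:
            yield record          # callers attach token counts, cost, etc.
        finally:
            record["duration_ms"] = (time.monotonic() - start) * 1000
            self.spans.append(record)

tracer = Tracer()
with tracer.span("step_2.model_call", model="claude-sonnet") as s:
    s["input_tokens"] = 4200   # attached after the call returns
with tracer.span("step_2.run_query"):
    pass

# Every span carries the same workflow_id, so the whole run is one trace
assert all(s["workflow_id"] == tracer.workflow_id for s in tracer.spans)
```

In production this would export to a real tracing backend, but the core mechanic is just this: one correlation ID threaded through every span.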

Event-Driven Architecture Is the Foundation

The most scalable AI agent architectures are event-driven. The agent emits events for each significant action. Downstream systems consume these events for logging, monitoring, billing, and audit.

import json
from datetime import datetime, timezone

class AgentEventEmitter:
    def __init__(self, stream):
        self.stream = stream

    def emit_step_started(self, workflow_id, step, context):
        self.stream.produce("agent.step.started", {
            "workflow_id": workflow_id,
            "step": step.name,
            "context_size": len(json.dumps(context)),
            "timestamp": datetime.now(timezone.utc).isoformat(),
        })

    def emit_model_call(self, workflow_id, step, request, response):
        self.stream.produce("agent.model.called", {
            "workflow_id": workflow_id,
            "step": step.name,
            "model": request.model,
            "input_tokens": response.usage.input_tokens,
            "output_tokens": response.usage.output_tokens,
            "latency_ms": response.latency_ms,
            "timestamp": datetime.now(timezone.utc).isoformat(),
        })

    def emit_tool_executed(self, workflow_id, step, tool, result):
        self.stream.produce("agent.tool.executed", {
            "workflow_id": workflow_id,
            "step": step.name,
            "tool": tool.name,
            "success": result.success,
            "latency_ms": result.latency_ms,
            "timestamp": datetime.now(timezone.utc).isoformat(),
        })

These events feed into the same infrastructure that powers transaction monitoring: real-time dashboards showing agent throughput, error rates, cost per workflow, and latency distributions. The event stream becomes the audit trail, the debugging tool, and the billing data source.
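A downstream consumer can fold those events into per-workflow cost with nothing more than a dictionary. The pricing constants here are illustrative, not real rates:

```python
from collections import defaultdict

# Illustrative per-token prices in dollars; real pricing varies by model
PRICE = {"input": 3e-06, "output": 15e-06}

def cost_per_workflow(events):
    """Aggregate agent.model.called events into dollars per workflow_id."""
    totals = defaultdict(float)
    for event in events:
        if event["type"] != "agent.model.called":
            continue
        totals[event["workflow_id"]] += (
            event["input_tokens"] * PRICE["input"]
            + event["output_tokens"] * PRICE["output"]
        )
    return dict(totals)

events = [
    {"type": "agent.model.called", "workflow_id": "wf-1",
     "input_tokens": 4200, "output_tokens": 890},
    {"type": "agent.tool.executed", "workflow_id": "wf-1"},
    {"type": "agent.model.called", "workflow_id": "wf-1",
     "input_tokens": 5100, "output_tokens": 2400},
]
costs = cost_per_workflow(events)
```

The same fold, pointed at a real event stream instead of a list, is the billing dashboard.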

This is not novel architecture. It is the same event-driven pattern used in payment processing, fulfillment workflows, and data pipelines - applied to a new domain.

What the AI Industry Is Rediscovering

Every infrastructure problem in AI agents has been solved before:

  • Orchestration → Temporal, Inngest, Airflow, Step Functions
  • State management → Durable execution, event sourcing
  • Retry logic → Exponential backoff, circuit breakers, retry budgets
  • Observability → Distributed tracing, structured logging, metrics pipelines
  • Failure recovery → Saga pattern, compensation logic
  • Idempotency → Idempotency keys, exactly-once semantics
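The idempotency item deserves emphasis for agents that perform side effects (sending a report, charging a card): a retried step must not repeat the effect. A minimal guard, with a dict standing in for a durable key store:

```python
seen = {}  # stands in for a durable store keyed by idempotency key

def idempotent(key, effect):
    """Run `effect` at most once per key; replays return the cached result."""
    if key in seen:
        return seen[key]
    result = effect()
    seen[key] = result
    return result

calls = []

def send():
    calls.append("sent")
    return "delivered"

# A retry after a crash reuses the same key, so the report goes out once
first = idempotent("wf-1:send_report", send)
second = idempotent("wf-1:send_report", send)
```

Keying on workflow ID plus step name is the standard trick: the retry is indistinguishable from the original except that the effect has already been recorded.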

The AI ecosystem is rebuilding these primitives from scratch in agent frameworks because most AI engineers come from ML backgrounds, not systems engineering backgrounds. They are solving orchestration problems for the first time and discovering through painful experience why workflow engines exist.

The engineers who will build reliable AI infrastructure are not prompt engineers. They are the same engineers who built payment processing systems, workflow engines, and event-driven architectures. The domain is new. The infrastructure problems are not.

What This Means for Building AI Systems

If you are building AI agents for production:

  • Use a workflow engine for orchestration - do not build state machines from scratch
  • Persist agent state durably - in-memory state dies with the process
  • Design retry policies per failure type - not every error deserves the same response
  • Emit events for every significant action - you will need the audit trail
  • Trace model calls like you trace service calls - with correlation IDs, latency, and cost
  • Budget for model costs the way you budget for infrastructure - per workflow, not per call

The prompt is the least interesting part of the system. The infrastructure around it determines whether the agent works once in a demo or works reliably at scale.