CK/SYSTEMS
6 min read

Lessons from Building an AI Agent Backend

What I learned building the Alpaca Chat agent engine: tool orchestration, sub-agent delegation, and testing strategies.

AI Agents · Python

I learned the hard way that getting an AI agent to “work in a demo” is very different from getting it to survive production. On Alpaca Chat, we had to move from a pleasant proof of concept to a system that could run for hours, recover from failures, and explain itself when things broke. The lesson was clear: reliability in agent systems is not an afterthought, it is architecture.

The problem with naive agent loops

Most tutorials start with a seductive pattern: input → LLM → tool → repeat. It is fast to prototype, but it collapses under concurrency pressure. In production, tool order matters, state must remain consistent, and every failure leaves traces that someone has to debug later.
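The loop itself is only a few lines, which is exactly why it is so seductive. Here is a minimal sketch of the pattern (the `llm` and `tools` arguments are illustrative placeholders, not our codebase):

```python
# The naive input -> LLM -> tool -> repeat loop: no ordering guarantees,
# no validation, and tool failures just get appended to context.
def naive_agent_loop(user_input: str, llm, tools: dict, max_turns: int = 10) -> str:
    context = [user_input]
    for _ in range(max_turns):
        action = llm(context)              # model decides what happens next
        if action["type"] == "answer":     # terminal response, loop ends
            return action["text"]
        tool = tools[action["tool"]]       # no contract check, no permission check
        context.append(tool(action["args"]))  # whatever comes back goes into context
    return "max turns exceeded"
```

Nothing here stops a half-finished chain from being abandoned mid-flight, which is precisely the failure mode we hit.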

On our early build, agents could finish a request while still leaving work uncommitted in the middle of a chain. We needed a control model that respected causality before it respected cleverness. That requirement drove every major design choice that followed.

Sequential execution as a reliability boundary

The first architectural pivot was a StepRunner that enforces deterministic sequencing. Instead of letting multiple tools fire in parallel, each step now has one parent and one clear dependency chain:

class StepRunner:
    async def run(self, steps: list[Step]) -> list[StepResult]:
        """Execute steps strictly in order; never start a step after a failure."""
        results = []
        for step in steps:
            result = await self.execute_step(step)  # defined elsewhere in the runner
            results.append(result)
            if result.status == "failed":
                break  # Stop on failure, don't continue
        return results

The early failure stop is not glamorous, but it is the right trade-off. If step B consumes the partial output of step A and step A failed, step B must not pretend everything is healthy. We deliberately choose correctness over raw throughput, because retries and partial recovery are expensive once business state is out of sync.

Delegation without recursion debt

Delegation became the next hard problem. AI systems can quickly spiral into accidental recursion: an agent delegates to itself, each child spawns a child, and context gets fragmented. We introduced a strict guard around delegation decisions:

from dataclasses import dataclass, field

@dataclass
class DelegationGuard:
    max_depth: int = 3
    visited: set[str] = field(default_factory=set)

    def can_delegate(self, agent_id: str, current_depth: int) -> bool:
        if current_depth >= self.max_depth:
            return False
        if agent_id in self.visited:
            return False  # Cycle detection
        return True

The guard does two things at once. It caps depth before token and runtime costs explode, and it tracks visited agents so cycles terminate quickly. That one small component removed an entire class of silent failures that were almost impossible to reproduce under load.
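Wiring the guard into the delegation path is the caller's job: check first, record second, so a repeated agent is caught on re-entry. A hypothetical sketch of that call site (the guard class is repeated from above so the example runs standalone; `delegate` is an illustrative helper, not our actual API):

```python
from dataclasses import dataclass, field

@dataclass
class DelegationGuard:  # same guard as above, repeated for a runnable example
    max_depth: int = 3
    visited: set[str] = field(default_factory=set)

    def can_delegate(self, agent_id: str, current_depth: int) -> bool:
        if current_depth >= self.max_depth:
            return False
        return agent_id not in self.visited  # cycle detection

def delegate(guard: DelegationGuard, agent_id: str, depth: int) -> bool:
    """Consult the guard, then record the agent before spawning the sub-agent."""
    if not guard.can_delegate(agent_id, depth):
        return False
    guard.visited.add(agent_id)  # caller owns the bookkeeping, not the guard
    return True
```

Keeping the bookkeeping at the call site means the guard stays a pure predicate, which makes it trivial to unit test.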

Tool governance and trust boundaries

I then pushed the same discipline into tool registration. Every tool now has typed input/output contracts, capability metadata, retry policy, and a permission level. That sounds bureaucratic, but it gives us a safe negotiation layer before execution starts.

The impact is practical: invalid tool calls are rejected before they touch state, capability filtering prevents overprivileged dispatch, and failure handling behaves the same way across every path. In other words, the system stops being “LLM says run it” and starts being “LLM may run it only within explicit boundaries.”
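To make that concrete, here is a rough sketch of what such a registry can look like. The shape is illustrative, not our exact schema: `required_fields` stands in for a full typed input contract, and the permission levels are examples.

```python
from dataclasses import dataclass
from typing import Any, Callable

@dataclass
class ToolSpec:
    name: str
    handler: Callable[[dict], Any]
    required_fields: set[str]   # stand-in for a typed input contract
    permission: str = "read"    # example levels: "read" < "write" < "admin"
    max_retries: int = 1

PERMISSION_RANK = {"read": 0, "write": 1, "admin": 2}

class ToolRegistry:
    def __init__(self) -> None:
        self._tools: dict[str, ToolSpec] = {}

    def register(self, spec: ToolSpec) -> None:
        self._tools[spec.name] = spec

    def dispatch(self, name: str, args: dict, caller_permission: str) -> Any:
        spec = self._tools.get(name)
        if spec is None:
            raise KeyError(f"unknown tool: {name}")
        # Reject before touching state: contract first, capability second.
        missing = spec.required_fields - args.keys()
        if missing:
            raise ValueError(f"missing fields: {sorted(missing)}")
        if PERMISSION_RANK[caller_permission] < PERMISSION_RANK[spec.permission]:
            raise PermissionError(f"{name} requires {spec.permission}")
        return spec.handler(args)
```

The key property is that every rejection happens before the handler runs, so an invalid or overprivileged call can never leave half-committed state behind.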

Testing for behavior, not happy paths

Testing changed from checking that code runs to checking that behavior holds when conditions diverge. We run unit tests on individual tools, integration tests across tool chains, and end-to-end conversations with replayed traces. The shift was subtle in process and huge in outcome.

Replays became especially important because LLM outputs are stochastic. The same prompt does not guarantee the same intermediate path, so we needed deterministic guardrails to detect when a change altered outcomes.
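A replay harness can be as simple as re-executing the recorded tool calls and asserting both the sequence and the outputs still match. A minimal sketch, assuming traces are stored as a list of tool/args/expected records (the exact trace format here is hypothetical):

```python
# Replay a recorded trace against the current tools; any divergence in
# output fails loudly, so a prompt or code change that alters the path is caught.
def replay_trace(trace: list[dict], tools: dict) -> list[str]:
    """Re-execute recorded tool calls; return the tool names actually invoked."""
    invoked = []
    for step in trace:
        tool = tools[step["tool"]]
        result = tool(step["args"])
        invoked.append(step["tool"])
        assert result == step["expected"], f"{step['tool']} diverged"
    return invoked
```

Because the tools themselves are deterministic, the replay pins down everything except the model, which is exactly the part we want to observe.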

What I kept after launch

At scale, the Alpaca Chat backend now handles thousands of requests daily with clear boundaries between delegation, execution, and recovery. It is slower than the cleverest possible design, and that is intentional. A reliable system does not maximize speed per request; it maximizes confidence per request.

Newsletter

Short notes on building AI agents in production.

One email when something worth sharing ships. No fluff, no daily cadence, no recycled growth-thread noise.

Primary use: consulting updates, governed AI workflow lessons, and major project writeups.
