How we build production AI agents on Cloudflare

A look at the stack we reach for when an agent has to be reliable, observable, and cheap to run, not just a demo.

By Tealfig

A demo agent and a production agent are different animals. The demo works once, in good conditions, with someone watching. The production one runs unattended against real systems, recovers when a provider hiccups, and does not quietly run up a bill. Most of the work is in that gap. Here is the stack we use to close it.

The agent is a stateful object, not a stateless function

The first decision is where an agent’s memory lives. A plain serverless function forgets everything between calls, which pushes you into bolting on a database and a cache just to remember a conversation. We build agents on Cloudflare’s Durable Objects instead. Each agent is a single, individually addressable instance with its own storage built in. The same user always routes back to the same agent, and its state, history, and task queue live right next to the code that uses them. No separate session store to run.
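The routing idea can be sketched as a small in-memory model. This is a conceptual stand-in only, not the Durable Objects API: `AgentInstance`, `AgentDirectory`, and the user names are hypothetical, and real storage would persist across restarts where this `Map` does not.

```typescript
// Conceptual model only: an in-memory stand-in for "one addressable
// instance per name". All names here are hypothetical, and nothing
// below is the Durable Objects API.
interface Message {
  role: "user" | "assistant";
  text: string;
}

class AgentInstance {
  readonly name: string;
  // State lives right beside the code that uses it.
  readonly history: Message[] = [];
  readonly tasks: string[] = [];

  constructor(name: string) {
    this.name = name;
  }

  // Record an incoming message and return the history length so far.
  receive(text: string): number {
    this.history.push({ role: "user", text });
    return this.history.length;
  }
}

class AgentDirectory {
  private readonly instances = new Map<string, AgentInstance>();

  // The same name always resolves to the same instance.
  get(name: string): AgentInstance {
    let agent = this.instances.get(name);
    if (!agent) {
      agent = new AgentInstance(name);
      this.instances.set(name, agent);
    }
    return agent;
  }
}

const directory = new AgentDirectory();
directory.get("user-42").receive("hello");
directory.get("user-42").receive("are you still there?");
directory.get("user-7").receive("hi");
```

In a Worker, the equivalent lookup is typically `idFromName` on a Durable Object namespace binding followed by `get`, with the platform rather than a `Map` guaranteeing that a name resolves to one instance.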

Cloudflare’s Agents SDK sits on top of this. An agent is a class with persistent state, scheduling, and real-time connections out of the box, so we spend our time on behavior instead of plumbing.

Every model call goes through a gateway

The single most useful habit we have is never calling a model provider directly. Every request goes through Cloudflare’s AI Gateway, which sits between the agent and whichever model it is using. That one move gives us, with no change to the agent code:

  • Observability. Every request and response is logged, with latency and token counts. When an agent misbehaves, we can see exactly what it sent and got back.
  • Retries and fallback. If a provider times out or errors, the gateway retries, and can fall back to a different model or provider rather than failing the task.
  • Caching. Identical requests can be served from cache, cutting both cost and latency.
  • Spend limits and guardrails. Budgets that cap spend per agent or per user, and content moderation on prompts and responses.

This is the difference between “the agent broke and we have no idea why” and a system you can actually operate.
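In practice, "never call the provider directly" mostly means pointing the client at a different base URL. The sketch below builds that URL, assuming the gateway endpoint follows the pattern `https://gateway.ai.cloudflare.com/v1/{account_id}/{gateway_id}/{provider}`. The helper `gatewayUrl`, the account ID, and the gateway name are placeholders for illustration, so check the pattern against the current AI Gateway documentation before relying on it.

```typescript
// Assumption: gateway URLs follow
//   https://gateway.ai.cloudflare.com/v1/{account_id}/{gateway_id}/{provider}/...
// Verify against the current AI Gateway docs. gatewayUrl is a hypothetical helper.
const GATEWAY_HOST = "https://gateway.ai.cloudflare.com/v1";

interface GatewayConfig {
  accountId: string;
  gatewayId: string;
}

// Build the gateway URL that replaces a provider's own base URL.
function gatewayUrl(config: GatewayConfig, provider: string, path: string): string {
  const cleanPath = path.replace(/^\/+/, ""); // drop leading slashes
  return [
    GATEWAY_HOST,
    encodeURIComponent(config.accountId),
    encodeURIComponent(config.gatewayId),
    provider,
    cleanPath,
  ].join("/");
}

// Placeholder identifiers, not real account or gateway values.
const config: GatewayConfig = { accountId: "ACCOUNT_ID", gatewayId: "agents-prod" };
const chatUrl = gatewayUrl(config, "openai", "/chat/completions");
```

Because only the URL changes, the request body and the agent code that builds it stay the same whether the call goes direct or through the gateway.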

Inference, memory, and long tasks

For the model itself we stay flexible. Workers AI runs open models and embeddings on Cloudflare’s network, close to the agent, which is ideal for fast, cheap calls and for generating the embeddings that power retrieval. When a job needs a frontier model, we route to Anthropic or OpenAI through the same gateway, so swapping models is a config change, not a rewrite.
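One way to keep model choice a config change is a small routing table that the agent consults instead of hard-coding a provider. This is a minimal sketch, not our production config: the tier names, `pickRoute`, and the model identifiers are illustrative, and the frontier entries are deliberately left as placeholders.

```typescript
// Illustrative routing table. Tier names, pickRoute, and the model IDs
// are assumptions for this sketch; the frontier IDs are placeholders.
type Tier = "fast" | "frontier";

interface ModelRoute {
  provider: string;
  model: string;
}

const routes: Record<Tier, ModelRoute> = {
  fast: { provider: "workers-ai", model: "@cf/meta/llama-3.1-8b-instruct" },
  frontier: { provider: "anthropic", model: "FRONTIER_MODEL_ID" },
};

// Agent code asks for a tier and never names a provider itself.
function pickRoute(tier: Tier, table: Record<Tier, ModelRoute> = routes): ModelRoute {
  return table[tier];
}

// Swapping the frontier provider is a new table entry, not new agent code.
const swapped: Record<Tier, ModelRoute> = {
  ...routes,
  frontier: { provider: "openai", model: "OTHER_MODEL_ID" },
};
```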

Memory for retrieval lives in Vectorize, Cloudflare’s vector database. The agent embeds content, stores it, and pulls back the most relevant pieces to ground its answers. This is the same machinery behind our knowledge assistants.
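The embed, store, and retrieve loop can be illustrated with a toy in-memory index. This sketch is not the Vectorize API: `InMemoryIndex` is a hypothetical stand-in, the three-dimensional vectors are hand-written rather than produced by an embedding model, and ranking by cosine similarity is an assumption chosen for the example.

```typescript
// Toy stand-in for a vector index. InMemoryIndex is hypothetical and
// the vectors are hand-written, not real embeddings.
interface StoredVector {
  id: string;
  values: number[];
  text: string;
}

interface Match {
  id: string;
  score: number;
  text: string;
}

// Cosine similarity; missing dimensions are treated as zero.
function cosine(a: number[], b: number[]): number {
  const dot = a.reduce((sum, x, i) => sum + x * (b[i] ?? 0), 0);
  const normA = Math.sqrt(a.reduce((sum, x) => sum + x * x, 0));
  const normB = Math.sqrt(b.reduce((sum, x) => sum + x * x, 0));
  return normA === 0 || normB === 0 ? 0 : dot / (normA * normB);
}

class InMemoryIndex {
  private items: StoredVector[] = [];

  // Insert new vectors, replacing any existing entry with the same id.
  upsert(incoming: StoredVector[]): void {
    for (const item of incoming) {
      this.items = this.items.filter((existing) => existing.id !== item.id);
      this.items.push(item);
    }
  }

  // Return the topK stored items most similar to the query vector.
  query(vector: number[], topK: number): Match[] {
    return this.items
      .map((item) => ({ id: item.id, score: cosine(vector, item.values), text: item.text }))
      .sort((a, b) => b.score - a.score)
      .slice(0, topK);
  }
}

const index = new InMemoryIndex();
index.upsert([
  { id: "refunds", values: [1, 0, 0], text: "How refunds are issued." },
  { id: "shipping", values: [0, 1, 0], text: "How orders are shipped." },
  { id: "returns", values: [0.9, 0.1, 0], text: "How to return an item." },
]);

// A query vector close to the "refunds" direction.
const matches = index.query([1, 0, 0], 2);
```

The texts of the returned matches are what the agent would place in the prompt to ground its answer.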

For anything long-running or multi-step, like ingesting a pile of documents or working through a plan that spans minutes, we hand the work to Workflows. Each step is durable and retries on its own, so a restart resumes where it left off instead of starting over.
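The resume behavior can be modeled with checkpointed steps: each step's result is saved under its name, and a rerun skips anything already saved. The sketch below is a simplified, synchronous model and not the Workflows API; `StepRunner`, the step names, and the simulated crash are all hypothetical, and real steps would be asynchronous with durable storage instead of a `Map`.

```typescript
// Simplified, synchronous model of durable steps. StepRunner and the
// step names are hypothetical; this is not the Workflows API.
type Checkpoints = Map<string, unknown>;

class StepRunner {
  private readonly checkpoints: Checkpoints;
  private readonly maxAttempts: number;

  constructor(checkpoints: Checkpoints, maxAttempts = 3) {
    this.checkpoints = checkpoints;
    this.maxAttempts = maxAttempts;
  }

  // Run a named step once; later runs reuse the saved result.
  run<T>(name: string, fn: () => T): T {
    if (this.checkpoints.has(name)) {
      return this.checkpoints.get(name) as T;
    }
    let lastError: unknown;
    for (let attempt = 1; attempt <= this.maxAttempts; attempt++) {
      try {
        const result = fn();
        this.checkpoints.set(name, result); // checkpoint on success
        return result;
      } catch (error) {
        lastError = error; // retry this step only
      }
    }
    throw lastError;
  }
}

const saved: Checkpoints = new Map();
const calls = { fetch: 0, embed: 0 };

function ingest(runner: StepRunner, failAtEmbed: boolean): number {
  const docs = runner.run("fetch-documents", () => {
    calls.fetch++;
    return ["a.txt", "b.txt"];
  });
  return runner.run("embed-documents", () => {
    calls.embed++;
    if (failAtEmbed) throw new Error("simulated crash");
    return docs.length;
  });
}

// First run fails in the second step; its first step is already saved.
let firstRunFailed = false;
try {
  ingest(new StepRunner(saved, 1), true);
} catch {
  firstRunFailed = true;
}

// A fresh runner over the same checkpoints resumes at the failed step.
const embedded = ingest(new StepRunner(saved, 1), false);
```

In this run the fetch step executes once even though the job is started twice, which is the property that makes a restart cheap.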

Why this stack

It comes down to four things: the agent and its state live in the same place, every model call is observable and controllable through the gateway, work survives restarts, and it runs on infrastructure billed by use rather than by the hour. That is what makes an agent something you can put in front of customers and stop worrying about.

If you want an agent that does real work in your systems, this is the foundation we build it on. See AI agents for what that looks like in practice.
