Building Production AI Agents with Claude 4.7 and Tool Use
What I learned shipping AI agents to production: tool design, prompt structure, durable execution, observability, and cost control. Practical patterns from real client work.
Agents that actually work
In the last year, AI agents stopped being demos and started being products. Some of my clients now have AI agents handling customer support intake, generating onboarding content, running internal workflows.
The gap between "tweet-worthy agent demo" and "production agent serving real users" is large. This post is what I've learned closing it.
The mental model
A production AI agent has 4 components:
The model isn't the hard part. The hard parts are tool design, prompt engineering, and observability. The "magic" is mostly engineering.
Tool design: the most important skill
Tools are the agent's hands. Bad tools equals bad agent, regardless of model intelligence.
Each tool does one thing
Don't make a tool called manage_user that does CRUD. Make get_user, update_user, delete_user. The model reasons better about narrow tools.
Tool descriptions are documentation
The model reads the description to decide when to use the tool. Write them like docs: "Look up a customer by email address. Returns customer name, account status, and most recent order date. Use this when the user mentions a specific customer or asks who is X." Not: "Get customer."
Parameters are typed and constrained
Use the API's JSON schema to constrain inputs. Email as string with email format, status as enum of active, paused, cancelled. Bad inputs caught before the function runs.
Return structured data, not prose
A tool should return JSON, not a sentence. The model will incorporate JSON into its response; it can't easily incorporate prose into structured outputs.
Tool failures are first-class
When a tool fails (DB down, API timeout), return a structured error the model can reason about: error rate_limited with retry_after_seconds. The model can decide to retry, fall back, or ask the user.
System prompt: less is more
The temptation: write a 2000-word system prompt that anticipates every situation. The reality: long system prompts confuse the model and dilute the instructions you actually care about.
My structure:
Aim for 300 words total. If it grows past 500, you've likely got requirements bleeding into the prompt that belong in tool descriptions or guardrails.
The execution loop
The loop is simple in principle: user sends message, send to LLM with system prompt plus history plus tool schemas, if LLM calls tools execute them, send tool results back to LLM, repeat until LLM returns a final message.
In practice you need to add max iteration cap (stop after 10 iterations), token budget tracking, tool call timeout, and parallel tool execution when the model returns multiple tool calls at once.
The Anthropic SDK handles most of this if you use their loop helpers. For full control, I write my own ~150 line loop.
Durable execution
The single biggest production lesson: agent runs are long-lived multi-step processes. They will fail mid-way. You need durability.
Save state at each step
After every tool call, persist the conversation state to your DB (or Inngest). If the worker dies, resume from the last saved step.
Idempotent tools
Tools must be safe to retry. A send_email tool should use idempotency keys so a retry doesn't send twice.
Step-level retries
If a tool fails transiently, retry just that step, not the whole agent run. Saves tokens, faster recovery.
I now build most agents on Inngest for exactly this reason.
Observability: trace everything
You will debug your agent. A lot. Without traces, debugging is impossible.
What I log per agent run:
Aggregate dashboard: avg cost per run, p50 and p95 cost, avg tools called per run, top failing tools, conversion rate from "agent run" to "user-marked-resolved."
I use LangSmith (LLM-specific) plus OpenTelemetry traces (general). Honeycomb is excellent for the trace side.
Cost control
Claude 4.7 (Opus) is powerful but expensive. Cost management techniques:
Use Sonnet for most things, Opus when needed
A two-model system: Sonnet 4.6 for routine tool calling and simple reasoning, escalate to Opus 4.7 when the request is complex.
Cache the system prompt
Anthropic's prompt caching means a stable system prompt is "free" after the first request. Don't dynamically build system prompts per request.
Trim history aggressively
Don't keep 50 turns of history. Summarize older turns into one "context summary" message, keep last 5 turns verbatim.
Tool results: include only what's needed
A tool returns 10KB of JSON; the model only needs 500 bytes of it. Project the result down before sending back.
These four techniques cut costs 60-80 percent on the production agents I've shipped.
Failure modes I've hit
In rough order of frequency:
My default stack for agents in 2026
TL;DR
If you're building AI agents and want a senior engineer who's shipped this to production, contact me.
You might also like
Background Jobs in Node.js 2026: BullMQ, Trigger.dev, or Inngest?
Compared on real client projects: BullMQ vs Trigger.dev vs Inngest for Node.js background jobs. What I pick for what, with cost, DX, and operational trade-offs.
Building a Production REST API with Node.js and Express in 2026
Layered architecture, validation, error handling, auth, rate limiting, observability — the patterns I use to ship Node.js + Express APIs that don't fall over in production.
Caching Strategies for Node.js APIs: Redis, In-Memory, and Edge
Five caching layers I use in production Node.js + Next.js APIs — in-memory LRU, Redis, edge CDN, stale-while-revalidate, and request coalescing. With when each one matters.