Building Production AI Agents with Claude 4.7 and Tool Use

What I learned shipping AI agents to production: tool design, prompt structure, durable execution, observability, and cost control. Practical patterns from real client work.

Hassan Javed
March 2026
11 min read

Agents that actually work

In the last year, AI agents stopped being demos and started being products. Some of my clients now have AI agents handling customer support intake, generating onboarding content, and running internal workflows.

The gap between "tweet-worthy agent demo" and "production agent serving real users" is large. This post is what I've learned closing it.

The mental model

A production AI agent has four components:

1. Tools: functions the model can call (search DB, send email, fetch data)
2. System prompt: the persistent context that defines behavior
3. Loop: orchestrates LLM calls and tool execution
4. State / memory: what the agent knows between turns

The model isn't the hard part. The hard parts are tool design, prompt engineering, and observability. The "magic" is mostly engineering.

Tool design: the most important skill

Tools are the agent's hands. Bad tools mean a bad agent, regardless of model intelligence.

Each tool does one thing

Don't make a tool called manage_user that does CRUD. Make get_user, update_user, delete_user. The model reasons better about narrow tools.

Tool descriptions are documentation

The model reads the description to decide when to use the tool. Write them like docs: "Look up a customer by email address. Returns customer name, account status, and most recent order date. Use this when the user mentions a specific customer or asks who is X." Not: "Get customer."

Parameters are typed and constrained

Use the API's JSON schema to constrain inputs: email as a string with email format, status as an enum of active, paused, cancelled. Validate each call against that schema so bad inputs are caught before the function runs.
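The three rules above (narrow scope, descriptive text, constrained parameters) can be sketched as one tool definition. This is a minimal illustration, not code from a real project: the set_account_status tool and the validateSetAccountStatus helper are hypothetical, and the hand-rolled check stands in for a proper JSON Schema validator. The { name, description, input_schema } shape follows Anthropic's tool definition format.

```typescript
// Hypothetical tool, written in the { name, description, input_schema } shape.
type Status = "active" | "paused" | "cancelled";

const setAccountStatusTool = {
  name: "set_account_status",
  description:
    "Set the account status for the customer with the given email address. " +
    "Use this only when the user explicitly asks to pause, cancel, or reactivate an account.",
  input_schema: {
    type: "object",
    properties: {
      email: { type: "string", format: "email" },
      status: { type: "string", enum: ["active", "paused", "cancelled"] },
    },
    required: ["email", "status"],
    additionalProperties: false,
  },
} as const;

type ValidationResult =
  | { ok: true; value: { email: string; status: Status } }
  | { ok: false; errors: string[] };

// Minimal hand-rolled check mirroring the schema above. A real project would
// more likely run a JSON Schema validator against input_schema instead.
function validateSetAccountStatus(input: unknown): ValidationResult {
  if (typeof input !== "object" || input === null) {
    return { ok: false, errors: ["input must be an object"] };
  }
  const errors: string[] = [];
  const { email, status } = input as Record<string, unknown>;
  if (typeof email !== "string" || !/^[^\s@]+@[^\s@]+\.[^\s@]+$/.test(email)) {
    errors.push("email must be a valid email address");
  }
  const allowed: readonly string[] =
    setAccountStatusTool.input_schema.properties.status.enum;
  if (typeof status !== "string" || allowed.indexOf(status) === -1) {
    errors.push("status must be one of: " + allowed.join(", "));
  }
  if (errors.length > 0) return { ok: false, errors };
  return { ok: true, value: { email: email as string, status: status as Status } };
}
```

Returning the validation errors as data (rather than throwing) means they can be sent back to the model as a tool result, which gives it a chance to correct the call.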

Return structured data, not prose

A tool should return JSON, not a sentence. The model will incorporate JSON into its response; it can't easily incorporate prose into structured outputs.

Tool failures are first-class

When a tool fails (DB down, API timeout), return a structured error the model can reason about, for example an error code of rate_limited together with a retry_after_seconds field. The model can then decide to retry, fall back, or ask the user.
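One way to make failures first-class is to wrap every tool so exceptions become data. The sketch below is illustrative: the ToolResult envelope, the toToolFailure mapping, and the assumption that upstream errors carry a status of 429 or a name of TimeoutError are all hypothetical and would need to match whatever your HTTP client or DB driver actually throws.

```typescript
// Hypothetical failure shapes the model can reason about.
type ToolFailure =
  | { error: "rate_limited"; retry_after_seconds: number }
  | { error: "timeout" }
  | { error: "internal"; message: string };

type ToolResult<T> = { ok: true; data: T } | { ok: false; failure: ToolFailure };

// Map whatever was thrown onto one of the known failure shapes.
// The status/name checks are assumptions about the upstream error format.
function toToolFailure(err: unknown): ToolFailure {
  if (typeof err === "object" && err !== null) {
    const e = err as { status?: unknown; retryAfterSeconds?: unknown; name?: unknown };
    if (e.status === 429) {
      const retry = typeof e.retryAfterSeconds === "number" ? e.retryAfterSeconds : 1;
      return { error: "rate_limited", retry_after_seconds: retry };
    }
    if (e.name === "TimeoutError") return { error: "timeout" };
  }
  return { error: "internal", message: err instanceof Error ? err.message : String(err) };
}

// Wrap a tool implementation so it never throws into the agent loop.
async function runTool<T>(fn: () => Promise<T>): Promise<ToolResult<T>> {
  try {
    return { ok: true, data: await fn() };
  } catch (err) {
    return { ok: false, failure: toToolFailure(err) };
  }
}
```

The serialized failure is what goes back to the model as the tool result, so keep the error codes few and self-explanatory.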

System prompt: less is more

The temptation: write a 2000-word system prompt that anticipates every situation. The reality: long system prompts confuse the model and dilute the instructions you actually care about.

My structure:

1. Identity (1-2 sentences): who is the agent
2. Capabilities (1-2 sentences): what it can do
3. Constraints (3-5 bullets): what it should never do
4. Tools available (brief; most context is in the tool descriptions themselves)
5. Format expectations (1-2 sentences): how to respond

Aim for 300 words total. If it grows past 500, you've likely got requirements bleeding into the prompt that belong in tool descriptions or guardrails.
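As a rough illustration of that structure, here is a sketch that assembles the five sections and fails fast if the prompt grows past 500 words. Everything in it is made up for the example: the Acme company, the prompt wording, and the buildSystemPrompt and wordCount helpers.

```typescript
// All prompt content below is illustrative, not taken from a real deployment.
const sections = {
  identity: "You are the support intake assistant for Acme, a hypothetical SaaS company.",
  capabilities: "You can look up customers, check order status, and open support tickets.",
  constraints: [
    "Never promise refunds or credits.",
    "Never reveal one customer's data to another customer.",
    "Never guess at account details; use a tool or ask the user.",
  ],
  tools: "Use the provided tools; their descriptions explain when each one applies.",
  format: "Reply in plain text, in at most three short paragraphs.",
};

// Join the five sections in a fixed order so the prompt stays stable.
function buildSystemPrompt(s: typeof sections): string {
  return [
    s.identity,
    s.capabilities,
    "Constraints:\n" + s.constraints.map((c) => "- " + c).join("\n"),
    s.tools,
    s.format,
  ].join("\n\n");
}

function wordCount(text: string): number {
  return text.split(/\s+/).filter((w) => w.length > 0).length;
}

const SYSTEM_PROMPT = buildSystemPrompt(sections);

// Fail at startup, not in production, if the prompt has crept past the ceiling.
if (wordCount(SYSTEM_PROMPT) > 500) {
  throw new Error("System prompt exceeds 500 words; move detail into tool descriptions");
}
```

Building the prompt once at startup from constants also keeps it byte-for-byte identical across requests.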

The execution loop

The loop is simple in principle:

1. The user sends a message.
2. Send it to the LLM with the system prompt, history, and tool schemas.
3. If the LLM calls tools, execute them and send the results back.
4. Repeat until the LLM returns a final message.

In practice you also need a maximum-iteration cap (for example, stop after 10 iterations), token budget tracking, a timeout on each tool call, and parallel tool execution when the model returns multiple tool calls at once.
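A stripped-down sketch of such a loop, with the iteration cap, per-tool timeout, and parallel execution in place. It is deliberately SDK-free: the Model, Message, and ToolCall types are simplified stand-ins invented for this example, the model is just a function you pass in, and token budget tracking is left out to keep it short.

```typescript
// Hypothetical, simplified types; a real loop would use the SDK's message types.
type ToolCall = { id: string; name: string; input: unknown };
type ModelReply =
  | { kind: "final"; text: string }
  | { kind: "tool_calls"; calls: ToolCall[] };
type Message =
  | { role: "user" | "assistant"; content: string }
  | { role: "tool"; callId: string; content: string };
type Model = (history: Message[]) => Promise<ModelReply>;
type Tool = (input: unknown) => Promise<unknown>;

// Reject if the wrapped promise takes longer than `ms` milliseconds.
function withTimeout<T>(p: Promise<T>, ms: number): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    const timer = setTimeout(() => reject(new Error("tool timed out")), ms);
    p.then(
      (v) => { clearTimeout(timer); resolve(v); },
      (e) => { clearTimeout(timer); reject(e); },
    );
  });
}

async function runAgent(
  model: Model,
  tools: Record<string, Tool>,
  userMessage: string,
  maxIterations = 10,
  toolTimeoutMs = 5000,
): Promise<string> {
  const history: Message[] = [{ role: "user", content: userMessage }];
  for (let i = 0; i < maxIterations; i++) {
    const reply = await model(history);
    if (reply.kind === "final") return reply.text;
    history.push({ role: "assistant", content: JSON.stringify(reply.calls) });
    // Run every tool call from this turn in parallel. Failures become
    // structured error results instead of crashing the loop.
    const results = await Promise.all(
      reply.calls.map(async (call): Promise<Message> => {
        const tool = tools[call.name];
        let content: string;
        try {
          if (!tool) throw new Error("unknown tool: " + call.name);
          content = JSON.stringify(await withTimeout(tool(call.input), toolTimeoutMs));
        } catch (err) {
          content = JSON.stringify({ error: err instanceof Error ? err.message : String(err) });
        }
        return { role: "tool", callId: call.id, content };
      }),
    );
    history.push(...results);
  }
  throw new Error("agent exceeded " + maxIterations + " iterations");
}
```

Because the model is injected as a plain function, the loop can be unit-tested with a scripted fake before any API call is wired in.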

The Anthropic SDK handles most of this if you use their loop helpers. For full control, I write my own loop of roughly 150 lines.

Durable execution

The single biggest production lesson: agent runs are long-lived multi-step processes. They will fail mid-way. You need durability.

Save state at each step

After every tool call, persist the conversation state to your DB (or Inngest). If the worker dies, resume from the last saved step.
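A minimal sketch of the idea, assuming a hypothetical CheckpointStore interface. The in-memory Map stands in for a database table or a durable-execution engine, and the steps array stands in for the agent's tool calls; it is synchronous only to keep the example short.

```typescript
// Hypothetical run state; real state would hold the full message history.
type RunState = { runId: string; step: number; history: string[] };

interface CheckpointStore {
  save(state: RunState): void;
  load(runId: string): RunState | undefined;
}

// Stand-in for a database table keyed by run ID.
class InMemoryStore implements CheckpointStore {
  private rows = new Map<string, string>();
  save(state: RunState): void {
    // Serialize so later mutation of `state` cannot alter the checkpoint.
    this.rows.set(state.runId, JSON.stringify(state));
  }
  load(runId: string): RunState | undefined {
    const row = this.rows.get(runId);
    return row === undefined ? undefined : (JSON.parse(row) as RunState);
  }
}

// Run the remaining steps, checkpointing after each one. Calling this again
// with the same runId resumes from the last saved step.
function runWithCheckpoints(
  runId: string,
  steps: Array<() => string>,
  store: CheckpointStore,
): RunState {
  const state: RunState = store.load(runId) ?? { runId, step: 0, history: [] };
  while (state.step < steps.length) {
    const result = steps[state.step]();
    state.history.push(result);
    state.step += 1;
    store.save(state);
  }
  return state;
}
```

Note the gap this leaves: if the worker dies after a step's side effect but before the save, that step runs again on resume, so the steps themselves still have to be safe to repeat.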

Idempotent tools

Tools must be safe to retry. A send_email tool should use idempotency keys so a retry doesn't send twice.
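A sketch of what that looks like for send_email, under the assumption that the caller derives the key from stable identifiers such as run ID plus step number. The IdempotentEmailSender class and its in-memory key table are hypothetical; in production the key table would live in your database, or you would pass the key through to an email provider that supports idempotency keys.

```typescript
type SendEmailInput = { to: string; subject: string; body: string };
type SendEmailResult = { messageId: string; duplicate: boolean };

// Hypothetical sender that deduplicates on an idempotency key.
class IdempotentEmailSender {
  private sentByKey = new Map<string, string>(); // key -> message ID
  public outbox: SendEmailInput[] = []; // stand-in for the real provider

  // `key` must be identical across retries of the same logical step,
  // for example runId + ":" + stepNumber.
  send(key: string, input: SendEmailInput): SendEmailResult {
    const existing = this.sentByKey.get(key);
    if (existing !== undefined) {
      // Already delivered: report the original message instead of resending.
      return { messageId: existing, duplicate: true };
    }
    this.outbox.push(input);
    const messageId = "msg-" + this.outbox.length;
    this.sentByKey.set(key, messageId);
    return { messageId, duplicate: false };
  }
}
```

Returning the original message ID on a duplicate lets the agent carry on as if the retry had succeeded normally.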

Step-level retries

If a tool fails transiently, retry just that step, not the whole agent run. Saves tokens, faster recovery.
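Durable-execution engines give you step-level retries out of the box; the sketch below only shows the underlying idea with a hand-written helper. The retryStep name, the isTransient callback, and the default attempt count and delay are assumptions for illustration.

```typescript
// Retry a single step with exponential backoff, leaving the rest of the run
// untouched. Non-transient errors are rethrown immediately.
async function retryStep<T>(
  step: () => Promise<T>,
  isTransient: (err: unknown) => boolean,
  maxAttempts = 3,
  baseDelayMs = 200,
): Promise<T> {
  let attempt = 0;
  for (;;) {
    attempt++;
    try {
      return await step();
    } catch (err) {
      if (attempt >= maxAttempts || !isTransient(err)) throw err;
      // Delay doubles each attempt: baseDelayMs, then 2x, then 4x, and so on.
      const delayMs = baseDelayMs * Math.pow(2, attempt - 1);
      await new Promise<void>((resolve) => setTimeout(resolve, delayMs));
    }
  }
}
```

Only the failing tool call is repeated, so earlier LLM calls and their tokens are not paid for twice.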

I now build most agents on Inngest for exactly this reason.

Observability: trace everything

You will debug your agent. A lot. Without traces, debugging is impossible.

What I log per agent run:

Run ID, user ID, started and completed timestamps
Each LLM call: prompt tokens, completion tokens, latency, cost
Each tool call: name, parameters, result, latency, success or failure
Final response, total cost, total duration

Aggregate dashboard: avg cost per run, p50 and p95 cost, avg tools called per run, top failing tools, conversion rate from "agent run" to "user-marked-resolved."
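The per-run record and a few of the dashboard aggregates could be sketched like this. The field names are hypothetical, and the percentile uses the simple nearest-rank method; a tracing backend would normally compute these for you.

```typescript
// Hypothetical trace record for one agent run.
type LlmCallTrace = { promptTokens: number; completionTokens: number; latencyMs: number; costUsd: number };
type ToolCallTrace = { name: string; latencyMs: number; ok: boolean };
type RunTrace = {
  runId: string;
  userId: string;
  startedAt: string;
  completedAt: string;
  llmCalls: LlmCallTrace[];
  toolCalls: ToolCallTrace[];
};

function totalCost(run: RunTrace): number {
  return run.llmCalls.reduce((sum, c) => sum + c.costUsd, 0);
}

// Nearest-rank percentile: the smallest value such that at least p percent
// of the data is at or below it.
function percentile(values: number[], p: number): number {
  if (values.length === 0) throw new Error("no data");
  const sorted = values.slice().sort((a, b) => a - b);
  const rank = Math.max(1, Math.ceil((p / 100) * sorted.length));
  return sorted[rank - 1];
}

// Count failed tool calls per tool name, most failures first.
function topFailingTools(runs: RunTrace[]): Array<{ name: string; failures: number }> {
  const counts: Record<string, number> = {};
  for (const run of runs) {
    for (const call of run.toolCalls) {
      if (!call.ok) counts[call.name] = (counts[call.name] || 0) + 1;
    }
  }
  return Object.keys(counts)
    .map((name) => ({ name, failures: counts[name] }))
    .sort((a, b) => b.failures - a.failures);
}
```

Usage would look like percentile(runs.map(totalCost), 95) for the p95 cost per run.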

I use LangSmith (LLM-specific) plus OpenTelemetry traces (general). Honeycomb is excellent for the trace side.

Cost control

Claude 4.7 (Opus) is powerful but expensive. Cost management techniques:

Use Sonnet for most things, Opus when needed

A two-model system: Sonnet 4.6 for routine tool calling and simple reasoning, escalate to Opus 4.7 when the request is complex.
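How you decide a request is "complex" is the interesting part, and it depends on the product. The sketch below uses three made-up signals and thresholds purely to show the shape of a router; the tier labels are placeholders, not real model IDs.

```typescript
// Placeholder tier labels; map these to real model IDs in configuration.
type ModelTier = "sonnet" | "opus";

// Hypothetical signals available before each LLM call.
type RoutingSignals = {
  userMessage: string;
  toolCallsSoFar: number;
  priorAttemptFailed: boolean;
};

// Illustrative thresholds; tune them against your own traces.
const MAX_ROUTINE_TOOL_CALLS = 5;
const MAX_ROUTINE_MESSAGE_CHARS = 2000;

function chooseTier(s: RoutingSignals): ModelTier {
  // Escalate when the cheaper model already failed on this request.
  if (s.priorAttemptFailed) return "opus";
  // Long tool chains suggest multi-step reasoning.
  if (s.toolCallsSoFar >= MAX_ROUTINE_TOOL_CALLS) return "opus";
  // Very long messages tend to bundle several requests together.
  if (s.userMessage.length > MAX_ROUTINE_MESSAGE_CHARS) return "opus";
  return "sonnet";
}
```

Logging the chosen tier alongside each run makes it possible to check later whether the escalation rules are earning their cost.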

Cache the system prompt

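Anthropic's prompt caching makes a stable system prompt close to "free" after the first request: cached input is billed at a fraction of the normal input-token price for as long as the cache entry stays warm. Any change to the prompt text invalidates the cache, so don't dynamically build system prompts per request.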
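In the Messages API, caching is requested by marking a content block with cache_control. The sketch below only builds the request body as a plain object so it stays self-contained; the MODEL_ID value, the prompt text, and the buildRequest helper are placeholders. As I understand the docs, the cached prefix also covers tool definitions and there is a minimum cacheable length, so check the current prompt caching documentation before relying on it for a short prompt.

```typescript
// Placeholder values for illustration only.
const MODEL_ID = "your-model-id";
const STATIC_SYSTEM_PROMPT =
  "You are a support intake assistant. Follow the constraints and use the provided tools.";

type ChatTurn = { role: "user" | "assistant"; content: string };

// Build the request body with the system prompt marked as cacheable.
// The system block must be byte-for-byte identical on every request,
// so nothing per-user or per-request is interpolated into it.
function buildRequest(history: ChatTurn[]) {
  return {
    model: MODEL_ID,
    max_tokens: 1024,
    system: [
      {
        type: "text" as const,
        text: STATIC_SYSTEM_PROMPT,
        cache_control: { type: "ephemeral" as const },
      },
    ],
    messages: history,
  };
}
```

Per-request details such as the user's name or plan belong in the messages, after the cached prefix, not in the system block.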

Trim history aggressively

Don't keep 50 turns of history. Summarize older turns into one "context summary" message, keep last 5 turns verbatim.
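A sketch of that trimming step. It treats each message as one turn, which is an assumption; the summarize callback is a stand-in for whatever produces the summary (typically a cheap LLM call), and trimHistory is a hypothetical helper name.

```typescript
type Turn = { role: "user" | "assistant"; content: string };

// Collapse everything except the last `keepLast` turns into one summary
// message, and keep the recent turns verbatim.
function trimHistory(
  history: Turn[],
  summarize: (older: Turn[]) => string,
  keepLast = 5,
): Turn[] {
  if (history.length <= keepLast) return history.slice();
  const older = history.slice(0, history.length - keepLast);
  const recent = history.slice(history.length - keepLast);
  const summary: Turn = {
    role: "user",
    content: "Context summary: " + summarize(older),
  };
  return [summary, ...recent];
}
```

Caching the summary alongside the run state avoids re-summarizing the same older turns on every request.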

Tool results: include only what's needed

A tool returns 10KB of JSON; the model only needs 500 bytes of it. Project the result down before sending back.
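Projection can be as small as a typed pick helper applied before the result is serialized. The project function and the order object below are hypothetical examples.

```typescript
// Keep only the listed keys of an object. The return type is narrowed to the
// picked fields, so callers cannot accidentally rely on dropped data.
function project<T extends object, K extends keyof T>(
  obj: T,
  keys: readonly K[],
): Pick<T, K> {
  const out = {} as Pick<T, K>;
  for (const k of keys) {
    out[k] = obj[k];
  }
  return out;
}

// Hypothetical full record as returned by an internal API.
const fullOrder = {
  id: 7,
  status: "shipped",
  items: [{ sku: "A-1", qty: 2 }],
  internalNotes: "long free-text field the model never needs",
  auditLog: ["created", "paid", "packed", "shipped"],
};

// Only these two fields go back to the model as the tool result.
const toolResult = JSON.stringify(project(fullOrder, ["id", "status"]));
```

Which fields to keep is a per-tool decision; deciding it in the tool wrapper keeps the loop itself generic.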

These four techniques cut costs 60-80 percent on the production agents I've shipped.

Failure modes I've hit

In rough order of frequency:

1. Model loops on the same tool call: usually because the tool returns the same error repeatedly. Add an iteration cap.
2. Model hallucinates tool names: happens when tools are renamed mid-development. Validate the tool name against the schema before executing.
3. Tool returns a gigantic result: the model's context blows up. Add a max result size, then truncate or summarize.
4. User goes off-topic: the agent starts wandering. Solid system prompt boundaries plus topic guards help.
5. Race conditions in parallel tool calls: two tools both write to user state, one wins, the other is lost. Serialize state-mutating tools.
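Three of these guards (items 2, 3, and 5) fit naturally into the function that executes a batch of tool calls. The sketch below is one possible arrangement, not a definitive one: the ToolSpec shape with its mutatesState flag, the executeCalls and capResult helpers, and the 4000-character limit are all assumptions, and error handling inside tools is omitted for brevity.

```typescript
type Call = { name: string; input: unknown };
type ToolSpec = { mutatesState: boolean; run: (input: unknown) => Promise<unknown> };

const MAX_RESULT_CHARS = 4000; // assumed limit; tune per model and budget

// Serialize a result and truncate it if it would flood the context.
function capResult(result: unknown, maxChars = MAX_RESULT_CHARS): string {
  const text = JSON.stringify(result === undefined ? null : result);
  if (text.length <= maxChars) return text;
  return JSON.stringify({
    truncated: true,
    original_chars: text.length,
    preview: text.slice(0, maxChars),
  });
}

async function executeCalls(
  calls: Call[],
  registry: Record<string, ToolSpec>,
): Promise<string[]> {
  const results: string[] = new Array(calls.length);
  const readOnly: Promise<void>[] = [];
  const mutating: number[] = [];

  calls.forEach((call, i) => {
    // Guard against hallucinated names, including inherited keys like "toString".
    const known = Object.prototype.hasOwnProperty.call(registry, call.name);
    const spec = known ? registry[call.name] : undefined;
    if (!spec) {
      results[i] = JSON.stringify({ error: "unknown_tool", name: call.name });
      return;
    }
    if (spec.mutatesState) {
      mutating.push(i);
      return;
    }
    // Read-only tools are safe to run concurrently.
    readOnly.push(spec.run(call.input).then((r) => { results[i] = capResult(r); }));
  });

  await Promise.all(readOnly);
  // State-mutating tools run one at a time, in the order the model asked.
  for (const i of mutating) {
    results[i] = capResult(await registry[calls[i].name].run(calls[i].input));
  }
  return results;
}
```

Results come back in the same order as the calls, so they can be matched to tool call IDs by index.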

My default stack for agents in 2026

Anthropic Claude Sonnet 4.6 (primary) plus Opus 4.7 (escalation for complex requests)
Inngest for durable execution
PostgreSQL for state, conversation history
OpenTelemetry plus Honeycomb for tracing
Anthropic SDK with prompt caching enabled
Custom tool framework — don't use generic agent frameworks; write the loop yourself for control

TL;DR

Tools are the agent's hands. Spend more time on tool design than prompt engineering.
Keep system prompts short and structured (~300 words)
Build for durability from day one — agent runs WILL fail mid-way
Trace everything. Without observability, debugging is impossible.
Cost control: dual models, prompt caching, history trimming, tool result projection

If you're building AI agents and want a senior engineer who's shipped this to production, contact me.
