Four Function-Calling Patterns That Survived Production
Tool use is where LLM systems fail most reliably. A catalog of four patterns that held under production load — and the anti-patterns they replaced.
Why this matters
Tool use is where LLM systems fail most reliably in production. Not in the LLM reasoning, not in the prompt design — in the gap between "the model knows what tool to call" and "the tool executes and returns something useful". Every team I've worked with or audited has invented a version of the same four patterns. Most also invented the same three anti-patterns and paid for them in latency regressions, context window overflow, and agents that looped infinitely at 3 AM.
I've been building tool-augmented agents for financial workflows, research pipelines, and multi-agent systems. What follows is the catalog I wish I'd had two years ago: four patterns that survived contact with production, and what the alternatives that didn't survive looked like.
1. Schema-first tool contracts
The most common anti-pattern: define a Python function, decorate it with @tool, and trust that the LLM will pass valid arguments. It works 95% of the time. The 5% failure — max_results: "ten", date: "last week", query: null — produces a TypeError somewhere in your executor, which you catch, log as an error, and move on. What you've actually done is make your tool silently unreliable.
The fix: every tool must have a Pydantic schema that validates inputs before execution. Critically, the schema is also the documentation — it's what the LLM reads to decide how to call the tool.
# Anti-pattern: raw function with no input schema
@tool
def search_documents(query: str, max_results: int = 5):
    # The LLM can pass anything; nothing is validated before the executor runs
    return document_store.search(query, limit=max_results)
# Pattern: explicit schema with field constraints and descriptions
from pydantic import BaseModel, Field

class SearchInput(BaseModel):
    query: str = Field(description="Search terms for semantic retrieval. No boolean operators.")
    max_results: int = Field(default=5, ge=1, le=20, description="Number of documents to return.")
    date_filter: str | None = Field(default=None, description="ISO 8601 date prefix, e.g. '2024-03'.")

@tool(args_schema=SearchInput)
def search_documents(query: str, max_results: int = 5, date_filter: str | None = None):
    results = document_store.search(query, limit=max_results)
    if date_filter:
        # Prefix match on ISO dates: '2024-03' keeps everything from March 2024
        results = [r for r in results if r.date.startswith(date_filter)]
    return results
Three things the schema does: validation (catches invalid inputs before the executor runs), documentation (the LLM reads description fields to construct valid calls), and type coercion (max_results: "5" becomes 5). None of these happen with a bare function.
The field descriptions are not comments — they are part of the tool specification the LLM receives. Unclear descriptions produce unclear calls. The best schemas I've written read like short API references: what each field means, what format it expects, what the constraints are, what the default is. If you're getting malformed tool calls in production, rewrite the field descriptions before changing the model or the prompt.
2. Typed error unions, not exception traces
When a tool fails, the naive approach is to catch the exception and return the traceback as a string. The model reads the traceback, "understands" the error, and tries a different call. This works sometimes and fails badly when it doesn't.
The failure mode: a Python traceback fed to the model as error context is unstructured, verbose, and full of internal names (module paths, private helpers) that tell the model nothing about what to do next. It inflates the context window on every retry. And if the error is transient, such as a rate limit or network timeout, the model doesn't know whether to retry, back off, or abort. It usually retries immediately, producing a cascade.
# Anti-pattern: exception trace as error recovery context
import traceback

def call_search_tool(query: str, max_results: int):  # wrapper name is illustrative
    try:
        return search_documents(query=query, max_results=max_results)
    except Exception:
        return f"Error: {traceback.format_exc()}"  # 20 lines of internal stack trace
# Pattern: typed error union in the return schema
from typing import Literal
from pydantic import BaseModel

# Document, RateLimitError, and AuthError are assumed to be defined elsewhere
class SearchResult(BaseModel):
    status: Literal["ok", "empty", "rate_limited", "auth_error"]
    documents: list[Document] = []
    retry_after_s: int | None = None  # set when status == "rate_limited"
    error_detail: str | None = None   # set on auth_error

def search_documents(query: str, max_results: int) -> SearchResult:
    try:
        docs = document_store.search(query, limit=max_results)
        if not docs:
            return SearchResult(status="empty")
        return SearchResult(status="ok", documents=docs)
    except RateLimitError as e:
        return SearchResult(status="rate_limited", retry_after_s=e.retry_after)
    except AuthError:
        return SearchResult(status="auth_error", error_detail="API key invalid or expired.")
The model can reason about status: rate_limited and retry_after_s: 30 — that's structured information matching what the model was trained on. It cannot reason productively about a stack trace. The union type also makes the tool contract explicit: here are all the ways this can fail, and here's what each failure means for the caller.
Include retry_after_s in the rate-limit case. Given an explicit wait time, models with function calling support tend to back off: they stop calling the tool and either wait or escalate to the supervisor. Without it, they usually retry immediately, which amplifies rate limit problems.
3. Parallel dispatch and why sequential is the default anti-pattern
Most tutorial agents dispatch tool calls sequentially: call tool A, wait for result, then decide to call tool B. This works. It's also slower than necessary for independent tools: N independent calls cost roughly the sum of their latencies instead of the latency of the slowest one.
Modern LLMs (GPT-4o, Claude 3.5 Sonnet) can request multiple tool calls in a single response turn when the calls are independent. Your executor should dispatch them concurrently.
import asyncio

# Anti-pattern: sequential dispatch
async def run_tools_sequential(tool_calls: list[ToolCall]) -> list[ToolResult]:
    results = []
    for call in tool_calls:
        result = await execute_tool(call)  # wait for each before starting next
        results.append(result)
    return results

# Pattern: concurrent dispatch
async def run_tools_parallel(tool_calls: list[ToolCall]) -> list[ToolResult]:
    tasks = [execute_tool(call) for call in tool_calls]
    results = await asyncio.gather(*tasks, return_exceptions=True)
    return [
        ToolResult(error=str(r)) if isinstance(r, Exception) else r
        for r in results
    ]
The concrete gain: three independent tool calls averaging 300ms each take 900ms sequential and ~320ms parallel (network jitter included). Over an agent loop with 4–6 tool steps, parallel dispatch routinely halves wall-clock time.
Two caveats. First, return_exceptions=True matters in asyncio.gather: without it, the first exception propagates to the caller and the other results are lost. Handle exceptions per result. Second, not all tool calls are independent. If call B depends on the result of call A, sequential is correct. The LLM usually models this correctly by putting dependent calls in separate turns. If you see independent tools being called one per turn instead of together in a single turn, the issue is usually in the tool descriptions: they aren't clear enough about what each returns.
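The latency arithmetic is easy to reproduce locally. This is a small sketch in which fake_tool is a hypothetical stand-in that only sleeps, with latencies scaled down from 300ms to 100ms so it runs quickly. The exact numbers will vary by machine, so no expected output is shown.

```python
import asyncio
import time


async def fake_tool(name: str, latency_s: float) -> str:
    # Stand-in for a network-bound tool call
    await asyncio.sleep(latency_s)
    return name


async def timed_sequential(latencies: list[float]) -> float:
    start = time.perf_counter()
    for i, latency in enumerate(latencies):
        await fake_tool(f"tool_{i}", latency)  # each call waits for the previous one
    return time.perf_counter() - start


async def timed_parallel(latencies: list[float]) -> float:
    start = time.perf_counter()
    # All calls start together; total time tracks the slowest call
    await asyncio.gather(
        *(fake_tool(f"tool_{i}", latency) for i, latency in enumerate(latencies))
    )
    return time.perf_counter() - start


latencies = [0.1, 0.1, 0.1]  # scaled-down stand-ins for three 300ms tool calls
sequential_s = asyncio.run(timed_sequential(latencies))
parallel_s = asyncio.run(timed_parallel(latencies))
print(f"sequential: {sequential_s:.2f}s, parallel: {parallel_s:.2f}s")
```

Sequential time grows with the sum of the latencies, while the concurrent version stays close to the single slowest call plus scheduling overhead.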
4. Loop detection via call fingerprinting
The infinite tool loop is one of the most common production failures: the model calls the same tool with the same arguments repeatedly, either because it never receives a satisfying result or because it's entered a reasoning loop. Left unchecked, this exhausts your token budget and never resolves.
The fix is a fingerprint check before each tool dispatch:
from hashlib import sha256
import json

class LoopDetector:
    def __init__(self, max_repeats: int = 2):
        self.call_counts: dict[str, int] = {}
        self.max_repeats = max_repeats

    def is_looping(self, tool_name: str, args: dict) -> bool:
        key = sha256(
            json.dumps({"tool": tool_name, "args": args}, sort_keys=True).encode()
        ).hexdigest()[:16]
        self.call_counts[key] = self.call_counts.get(key, 0) + 1
        return self.call_counts[key] > self.max_repeats
# In the executor: one detector per agent run, so counts persist across turns
detector = LoopDetector(max_repeats=2)

async def execute_with_loop_check(call: ToolCall) -> ToolResult:
    if detector.is_looping(call.name, call.arguments):
        return ToolResult(
            status="loop_detected",
            message=f"Tool '{call.name}' called with identical arguments more than {detector.max_repeats} times.",
        )
    return await execute_tool(call)
When the executor returns status: loop_detected, the graph's conditional edge routes to an escalation node — usually a supervisor agent or a HITL interrupt — rather than back to the tool-calling agent. The model gets a clear, structured signal that it needs a different strategy.
The fingerprint is computed from the tool name and arguments after JSON normalization with sorted keys, so key order does not matter: {"query": "revenue Q3", "max_results": 5} and {"max_results": 5, "query": "revenue Q3"} produce the same fingerprint, while query: "revenue Q3" and query: "revenue Q4" produce different ones. Set max_repeats: 2 as a starting point: two identical calls may mean "transient failure, retrying", but three identical calls almost always mean "stuck".
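The normalization behavior can be checked directly. The standalone fingerprint function below is a hypothetical extraction of the hashing step from LoopDetector.is_looping, written out only so the cases can be compared side by side.

```python
import json
from hashlib import sha256


def fingerprint(tool_name: str, args: dict) -> str:
    # Same normalization as LoopDetector.is_looping: sorted keys, then hash
    payload = json.dumps({"tool": tool_name, "args": args}, sort_keys=True)
    return sha256(payload.encode()).hexdigest()[:16]


a = fingerprint("search_documents", {"query": "revenue Q3", "max_results": 5})
b = fingerprint("search_documents", {"max_results": 5, "query": "revenue Q3"})
c = fingerprint("search_documents", {"query": "revenue Q4", "max_results": 5})
d = fingerprint("search_documents", {"query": "revenue Q3", "max_results": "5"})

same_despite_key_order = a == b  # True: sort_keys removes ordering differences
different_query = a != c         # True: the arguments really differ
different_type = a != d          # True: 5 and "5" serialize differently
```

The last case suggests fingerprinting the arguments after schema validation rather than before, so that values the schema would coerce to the same thing are also counted as the same call.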
5. Status-based routing between tool calls
Once you have typed error unions, you can route the graph based on status rather than on parsing the tool's content. This is the same principle as LangGraph conditional edges: routing logic lives in state, not in edge functions.
After each tool result, a lightweight router node reads result.status and writes a routing signal to state. The conditional edge reads the signal. Your routing logic becomes: rate_limited → wait and retry; auth_error → escalate to HITL; empty → try alternate tool; ok → continue. Clean, testable, visible in state traces.
The alternative — parsing tool content inside the edge function to infer what happened — is fragile. Tool output format varies with model versions and system prompts. A routing decision based on "does the output contain the word Error" will break when the output format changes. A routing decision based on status: Literal["ok", "rate_limited", "auth_error", "empty"] won't.
One concrete consequence: you can write unit tests for routing logic without mocking the LLM. Create a SearchResult(status="rate_limited", retry_after_s=30), pass it through the router, assert the next node is "wait_and_retry". That test is fast, deterministic, and covers the failure mode you care about.
6. Tool choice and prompt hygiene
One sign that your tool definitions need work: you're using tool_choice={"type": "function", "function": {"name": "specific_tool"}} to force the model to call a particular tool. Forced tool choice is a valid escape hatch. As a default pattern, it's a smell.
When you force tool choice, you're compensating for a prompt or schema that doesn't make it clear enough to the model when each tool should be used. The better fix is to clarify the tool descriptions and system prompt until tool_choice="auto" routes correctly on your actual query distribution. This is cheaper at inference time and more robust to query variation.
The exception: for the last step of a structured output pipeline — where you want the model to always call a "finalize" tool and produce its output as a typed schema — forcing tool choice is correct. It's a structural constraint, not compensation for unclear descriptions.
Prompt hygiene for tool-heavy agents: keep the system prompt focused on the agent's role and scope, not on tool selection logic. If you find yourself writing "use the search tool when the user asks about documents" in the system prompt, that's a sign the tool description doesn't say it clearly enough. Fix the tool description.
The category frame
Four patterns and the routing and tool-choice practices built on them, one common thread: make the interface between the LLM and the tool layer explicit, typed, and inspectable. Schema contracts instead of raw dicts. Typed error unions instead of exception traces. Concurrent dispatch instead of sequential. Loop detection instead of hoping the model self-corrects. Status signals instead of content parsing in routing logic.
The anti-patterns are all versions of "let the LLM figure it out." That works at demo quality. In production, under real query distributions and real failure modes, the implicit contracts break first.
If you're building or reviewing a tool-use agent and hitting reliability or latency issues, write me. These failure modes have reliable fixes once you know what to look for.