2024-11-25 · 11 min read
Tags: LLM, Function Calling, OpenAI, Production

Five Function-Calling Patterns That Survived Production

Tool use is where LLM systems fail most reliably. A catalog of five patterns that held under production load — and the anti-patterns they replaced.

Why this matters

Tool use is a frequent failure point in LLM systems I have built or audited. The issue is not only model reasoning or prompt design. It often sits in the gap between "the model knows what tool to call" and "the tool executes and returns something useful". Across those systems, I have seen the same five patterns recur. I have also seen the same three anti-patterns cause latency regressions, context window overflow, and agents that loop indefinitely during unattended runs.

I have built tool-augmented agents for financial workflows, research pipelines, and multi-agent systems. The five patterns below are the ones I now use in production, each paired with the alternative it replaced.

1. Schema-first tool contracts

A common anti-pattern is to define a Python function, decorate it with @tool, and trust the LLM to pass valid arguments. It works until it does not. Inputs such as max_results: "ten", date: "last week", or query: null produce a TypeError somewhere in the executor. If I only catch that error, log it, and continue, I have made the tool silently unreliable.

I avoid that by giving every tool a Pydantic schema that validates inputs before execution. The schema also documents the tool. It is what the LLM reads when it decides how to call the tool.

from pydantic import BaseModel, Field
from langchain_core.tools import tool  # assumption: a LangChain-style @tool decorator

# Anti-pattern: raw function with no input schema
@tool
def search_documents(query: str, max_results: int = 5):
    """Search the document store."""
    # The LLM can pass anything; nothing validates arguments before the executor runs
    return document_store.search(query, limit=max_results)

# Pattern: explicit schema with field constraints and descriptions
class SearchInput(BaseModel):
    query: str = Field(description="Search terms for semantic retrieval. No boolean operators.")
    max_results: int = Field(default=5, ge=1, le=20, description="Number of documents to return.")
    date_filter: str | None = Field(default=None, description="ISO 8601 date prefix, e.g. '2024-03'.")

@tool(args_schema=SearchInput)
def search_documents(query: str, max_results: int = 5, date_filter: str | None = None):
    """Search the document store, optionally restricted to a date prefix."""
    results = document_store.search(query, limit=max_results)
    if date_filter:
        # Assumes r.date is an ISO 8601 string such as "2024-03-15"
        results = [r for r in results if r.date.startswith(date_filter)]
    return results

The schema handles three jobs: validation, documentation, and type coercion. It catches invalid inputs before the executor runs. It exposes description fields the LLM can use to build valid calls. It can also coerce values such as max_results: "5" into 5. A bare function does none of this.

Field descriptions are not comments. They are part of the tool specification the LLM receives. Unclear descriptions produce unclear calls. The most useful schemas I have written read like short API references: what each field means, which format it expects, which constraints apply, and what the default is. When I see malformed tool calls in production, I rewrite the field descriptions before I change the model or the prompt.

2. Typed error unions, not exception traces

When a tool fails, the naive approach is to catch the exception and return the traceback as a string. The model reads the traceback, may infer the error, and tries a different call. This sometimes works. It breaks down when the traceback is the only recovery signal the model gets.

A Python traceback is a poor error context for a model. It is unstructured, verbose, and full of internal names the model may not interpret correctly. It also inflates the context window on every retry. If the error is transient, such as a rate limit or network timeout, the model may not know whether to retry, back off, or abort. In systems I have observed, that ambiguity can cause immediate retries and cascades.

import traceback
from typing import Literal

from pydantic import BaseModel

# Anti-pattern: exception trace as error recovery context
def call_search_tool(query: str, max_results: int):  # hypothetical wrapper name
    try:
        return search_documents(query=query, max_results=max_results)
    except Exception:
        return f"Error: {traceback.format_exc()}"  # 20 lines of internal stack trace

# Pattern: typed error union in the return schema
# Document, RateLimitError, and AuthError come from the application and its client library
class SearchResult(BaseModel):
    status: Literal["ok", "empty", "rate_limited", "auth_error"]
    documents: list[Document] = []
    retry_after_s: int | None = None  # set when status == "rate_limited"
    error_detail: str | None = None   # set on auth_error

def search_documents(query: str, max_results: int) -> SearchResult:
    try:
        docs = document_store.search(query, limit=max_results)
        if not docs:
            return SearchResult(status="empty")
        return SearchResult(status="ok", documents=docs)
    except RateLimitError as e:
        return SearchResult(status="rate_limited", retry_after_s=e.retry_after)
    except AuthError:
        return SearchResult(status="auth_error", error_detail="API key invalid or expired.")

The model can reason about status: rate_limited and retry_after_s: 30 because that structure matches common API patterns. A stack trace gives it less reliable material. The union type also makes the tool contract explicit: these are the known failure modes, and this is what each one means for the caller.

I include retry_after_s for rate limits. Models with function calling support can use this signal to back off or escalate, depending on the agent loop and executor logic. Without it, an agent loop may retry immediately and amplify rate limit problems.

3. Parallel dispatch and why sequential is the default anti-pattern

Most tutorial agents dispatch tool calls sequentially: call tool A, wait for the result, then decide whether to call tool B. This works. For independent tools, it can also be slower than necessary.

Some models and tool-calling setups can request multiple tool calls in a single response turn when the calls are independent. When there is no dependency between calls, I route them to concurrent execution.

import asyncio

# Anti-pattern: sequential dispatch
async def run_tools_sequential(tool_calls: list[ToolCall]) -> list[ToolResult]:
    results = []
    for call in tool_calls:
        result = await execute_tool(call)  # wait for each before starting next
        results.append(result)
    return results

# Pattern: concurrent dispatch
async def run_tools_parallel(tool_calls: list[ToolCall]) -> list[ToolResult]:
    tasks = [execute_tool(call) for call in tool_calls]
    results = await asyncio.gather(*tasks, return_exceptions=True)
    return [
        ToolResult(error=str(r)) if isinstance(r, Exception) else r
        for r in results
    ]

The concrete gain depends on tool latency, executor and network overhead, and whether the calls are truly independent. Three independent tool calls averaging 300ms each take about 900ms sequentially. If tool latency dominates and the executor can run them concurrently, parallel dispatch brings wall-clock time down toward the duration of the slowest call (about 300ms here if the three are similar), plus overhead. The saving compounds across an agent loop with several independent tool steps.

Two caveats matter. First, asyncio.gather with return_exceptions=True is important: a single failing tool should not crash the entire gather. I handle exceptions per result. Second, not all tool calls are independent. If call B depends on the result of call A, sequential execution is correct. The LLM may represent this by placing dependent calls in separate turns. If I see calls spread across separate turns for tools that should be independent, I inspect the tool descriptions first. They may not be clear enough about what each tool returns.
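A small, self-contained way to observe both the latency effect and the per-result exception handling is to stand in for real tools with asyncio.sleep. The fake_tool function and its 0.1s delays are assumptions chosen for the demonstration.

```python
import asyncio
import time


async def fake_tool(name: str, delay_s: float, fail: bool = False) -> str:
    """Hypothetical stand-in for a network-bound tool."""
    await asyncio.sleep(delay_s)
    if fail:
        raise RuntimeError(f"{name} failed")
    return f"{name} done"


async def run_parallel() -> tuple[list[str], float]:
    start = time.perf_counter()
    raw = await asyncio.gather(
        fake_tool("search", 0.1),
        fake_tool("lookup", 0.1, fail=True),
        fake_tool("fetch", 0.1),
        return_exceptions=True,  # a failing tool becomes a result, not a crash
    )
    elapsed = time.perf_counter() - start
    # Handle exceptions per result, preserving the order of the calls
    results = [f"error: {r}" if isinstance(r, Exception) else r for r in raw]
    return results, elapsed


results, elapsed = asyncio.run(run_parallel())
```

Run sequentially, the three calls would need about 0.3s. Gathered, the total should land close to 0.1s plus scheduling overhead, and the failed call shows up as one error entry among two successes.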

4. Loop detection via call fingerprinting

The infinite tool loop is a production failure I have seen. The model calls the same tool with the same arguments repeatedly, either because it never receives a satisfying result or because it has entered a reasoning loop. If nothing stops it, the loop exhausts the token budget and still does not resolve the task.

I check a fingerprint before each tool dispatch:

from hashlib import sha256
import json

class LoopDetector:
    def __init__(self, max_repeats: int = 2):
        self.call_counts: dict[str, int] = {}
        self.max_repeats = max_repeats

    def is_looping(self, tool_name: str, args: dict) -> bool:
        key = sha256(
            json.dumps({"tool": tool_name, "args": args}, sort_keys=True).encode()
        ).hexdigest()[:16]
        self.call_counts[key] = self.call_counts.get(key, 0) + 1
        return self.call_counts[key] > self.max_repeats

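# In the executor (one detector per agent run):
detector = LoopDetector(max_repeats=2)

async def dispatch(tool_calls: list[ToolCall]) -> list[ToolResult]:
    results = []
    for call in tool_calls:
        # Assumes call.arguments is already a parsed dict, not a raw JSON string
        if detector.is_looping(call.name, call.arguments):
            # Stop dispatching and return a structured signal instead of executing again
            results.append(ToolResult(
                status="loop_detected",
                message=f"Tool '{call.name}' called with identical arguments {detector.max_repeats + 1} times.",
            ))
            break
        results.append(await execute_tool(call))
    return results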

When the executor returns status: loop_detected, I route the graph's conditional edge to an escalation node, usually a supervisor agent or a HITL interrupt. I do not send it back to the tool-calling agent. The escalation target gets a clear, structured signal that the agent is stuck and needs a different strategy.

The fingerprint is computed from (tool_name, sorted args) after JSON normalization, so key order does not matter: {"query": "revenue Q3", "max_results": 5} and {"max_results": 5, "query": "revenue Q3"} produce the same fingerprint. query: "revenue Q3" and query: "revenue Q4" produce different ones. I use max_repeats: 2 as a starting point. Two identical calls may mean "transient failure, retrying". In the systems I have inspected, three identical calls is often a sign that the agent is stuck.
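The normalization and the threshold can be checked in isolation. This sketch pulls the hashing out of LoopDetector into a standalone function. The names fingerprint and is_looping, and the module-level counter, are illustrative assumptions.

```python
import json
from hashlib import sha256


def fingerprint(tool_name: str, args: dict) -> str:
    # sort_keys=True makes the JSON text independent of argument order
    payload = json.dumps({"tool": tool_name, "args": args}, sort_keys=True)
    return sha256(payload.encode()).hexdigest()[:16]


a = fingerprint("search_documents", {"query": "revenue Q3", "max_results": 5})
b = fingerprint("search_documents", {"max_results": 5, "query": "revenue Q3"})
c = fingerprint("search_documents", {"query": "revenue Q4", "max_results": 5})

counts: dict[str, int] = {}


def is_looping(key: str, max_repeats: int = 2) -> bool:
    # Same counting rule as LoopDetector: trip once the count exceeds max_repeats
    counts[key] = counts.get(key, 0) + 1
    return counts[key] > max_repeats


# Three identical calls in a row: only the third one trips the detector
flags = [is_looping(a) for _ in range(3)]
```

Here a and b match because only key order differs, c differs because the query changed, and flags comes out as [False, False, True].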

5. Status-based routing between tool calls

Once I have typed error unions, I can route the graph based on status instead of parsing tool content. This follows the same principle as LangGraph conditional edges: routing logic lives in state, not in edge functions.

After each tool result, a lightweight router node reads result.status and writes a routing signal to state. The conditional edge reads that signal. The routing map becomes direct: rate_limited → wait and retry; auth_error → escalate to HITL; empty → try alternate tool; ok → continue. The behavior is clean, testable, and visible in state traces.

Parsing tool content inside the edge function is more fragile. Tool output format can vary with model versions and system prompts. A routing decision based on "does the output contain the word Error" can break when the output format changes. A routing decision based on status: Literal["ok", "rate_limited", "auth_error", "empty"] is more stable.

One concrete consequence: I can write unit tests for routing logic without mocking the LLM. I can create a SearchResult(status="rate_limited", retry_after_s=30), pass it through the router, and assert the next node is "wait_and_retry". That test is fast, deterministic, and covers the failure mode I care about.

Bonus: tool choice and prompt hygiene

One sign that tool definitions need work: I am using tool_choice={"type": "function", "function": {"name": "specific_tool"}} to force the model to call a particular tool. Forced tool choice is a valid escape hatch. As a default pattern, it is a smell.

When I force tool choice, I am often compensating for a prompt or schema that does not make tool selection clear enough for the model. I prefer to clarify the tool descriptions and system prompt until tool_choice="auto" routes correctly on the actual query distribution. This can reduce unnecessary control logic at inference time and makes behavior less dependent on one narrow query shape.

There is one exception. For the last step of a structured output pipeline, where I want the model to always call a "finalize" tool and produce its output as a typed schema, forcing tool choice is correct. It is a structural constraint, not compensation for unclear descriptions.
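One way to keep that exception structural is to derive tool_choice from the pipeline step rather than from the query. In this sketch the function name and the is_final_step flag are assumptions. The forced payload uses the same shape as the tool_choice value quoted above.

```python
from typing import Any


def tool_choice_for_step(is_final_step: bool, finalize_tool: str = "finalize") -> Any:
    """Hypothetical helper: force a tool only at the final structured-output step."""
    if is_final_step:
        # Structural constraint: the last step must emit the typed result
        return {"type": "function", "function": {"name": finalize_tool}}
    # Everywhere else, let the tool descriptions drive selection
    return "auto"


mid_pipeline = tool_choice_for_step(is_final_step=False)
last_step = tool_choice_for_step(is_final_step=True)
```

If forced choices start appearing outside the final step, that is the signal to revisit the tool descriptions instead.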

For tool-heavy agents, I keep the system prompt focused on the agent's role and scope. I avoid putting tool selection logic there. If I find myself writing "use the search tool when the user asks about documents" in the system prompt, I treat that as a sign the tool description does not say it clearly enough. I fix the tool description.

The category frame

The five patterns share one principle: make the interface between the LLM and the tool layer explicit, typed, and inspectable. Use schema contracts instead of bare functions. Use typed error unions instead of exception traces. Use concurrent dispatch instead of sequential dispatch when calls are independent and tool latency dominates. Use loop detection instead of hoping the model self-corrects. Use status signals instead of content parsing in routing logic.

The anti-patterns are versions of "let the LLM figure it out." That can work at demo quality. In production-like conditions, under real query distributions and real failure modes, I have seen implicit contracts break first.

When I build or review a tool-use agent and see reliability or latency issues, I look for these failure modes and apply the corresponding fixes.

FAQ

How should I validate LLM tool arguments before execution?

I use a Pydantic schema for every tool and pass it as the tool argument schema. The schema validates inputs before the executor runs, documents each field for the model, and handles type coercion such as converting max_results: "5" into 5.

Why avoid returning Python tracebacks to the model?

I avoid tracebacks because they are unstructured, verbose, and full of internal names the model may not interpret correctly. They also inflate the context window on retries. A typed error union gives the model explicit statuses such as rate_limited, auth_error, empty, and ok.

When should tool calls run in parallel instead of sequentially?

I dispatch tool calls concurrently when they are independent of each other. Sequential execution is correct when call B depends on call A. When the executor supports it and tool latency dominates, independent calls from the same turn can be gathered concurrently, with failures handled per result.

How can I detect an infinite LLM tool loop?

I fingerprint each call from the tool name and JSON-normalized, sorted arguments before dispatch. If the same fingerprint appears more than the configured repeat limit, the executor returns status: loop_detected and routes to escalation instead of sending the agent back into the same loop.

How should routing work after a tool result?

I route from structured status values, not by parsing tool content. A router reads result.status and writes a routing signal to state, so cases like rate_limited, auth_error, empty, and ok can map to clear next steps that are visible in state traces and easy to unit test.
