Don't Use Model Outputs in Production


A pattern I keep running into. Someone builds an LLM integration. They prompt the model, collect the output, and pipe it straight into production. An email gets sent. A webhook fires. A customer sees whatever the model decided to say.

Then something goes wrong and I get the call.

The Problem Is the Chat

When you prompt a model and capture its output directly, you are capturing a chat response. The model doesn’t know it’s writing a production artifact. It thinks it’s talking to you. It behaves accordingly.

You get thinking out loud. You get self-corrections. “Actually, let me rephrase that.” You get conversational warmth that has no business being in a quarterly sales report. You get a friendly sign-off on an automated system notification. The output is dirty. Not wrong, necessarily. Just uncontrolled.

This happens because the model’s output channel is dual-purpose. It’s the place where it reasons, where it qualifies, where it hedges. It’s also, apparently, where your production email is supposed to come from. These two uses are in direct tension.

I see this most often with email generation, but it shows up everywhere. Webhook payloads that contain stray commentary. Report text with conversational preamble stripped by regex (and sometimes not stripped well enough). Notification copy that reads like someone left their inner monologue in the commit.

The Fix Is a Tool Call

People resist this because it feels like overkill. You’re generating a string. Why do you need a tool definition for that?

Because the model treats tool call arguments differently than it treats chat output.

When a model writes to a tool call, it understands the output has a destination. It’s filling a parameter. It’s completing a structured request. The conversational impulse still gets an outlet in the chat response, which you can ignore. The tool call gets the clean version.

Compare these two approaches:

Direct output: “Write a professional email reporting this quarter’s watermelon sales figures to the distribution team.”

The model will probably produce a decent email. It will also probably produce it with a preamble. “Here’s a professional email for you:” or “Sure! Here’s a draft.” Even if you strip those lines, the body itself will often carry a conversational register. “I’m happy to share that sales were strong this quarter.” That’s chat leaking into copy.

Tool call: “Use the send_email tool to send a report of this quarter’s watermelon sales to the distribution team.”

The model still gets to be chatty in its response. “Sure, I’ll send that report now.” Fine. Nobody sees that. The tool call parameters contain the email subject, body, and recipients, and these will be markedly cleaner. The model shifts register when it knows the output is structured and functional.
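The tool-call approach starts with a tool definition in the request. Here’s a minimal sketch against an OpenAI-style chat completions API — the tool name, parameter fields, and the commented-out client call are illustrative assumptions, not a prescribed schema:

```python
# Illustrative tool definition in OpenAI-style JSON Schema. The name
# "send_email" and its parameters are assumptions for this example.
send_email_tool = {
    "type": "function",
    "function": {
        "name": "send_email",
        "description": "Send an email. The body must be final production copy: "
                       "no preamble, no conversational framing.",
        "parameters": {
            "type": "object",
            "properties": {
                "to": {
                    "type": "array",
                    "items": {"type": "string"},
                    "description": "Recipient email addresses.",
                },
                "subject": {"type": "string"},
                "body": {
                    "type": "string",
                    "description": "Full email body, ready to send as-is.",
                },
            },
            "required": ["to", "subject", "body"],
        },
    },
}

# The request then carries the tool instead of asking for raw text, e.g.:
# client.chat.completions.create(
#     model="...",
#     messages=[{"role": "user", "content":
#                "Use the send_email tool to send a report of this quarter's "
#                "watermelon sales to the distribution team."}],
#     tools=[send_email_tool],
# )
```

Note the description fields: they are part of the prompt, so they’re a natural place to state register requirements like “no preamble” right where the model fills in the value.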

This isn’t just anecdotal. Research on guided-structured templates for function calling has found a 3-12% relative improvement over free-form chain-of-thought prompting, with notably fewer formatting errors and less output drift [1]. The Berkeley Function Calling Leaderboard consistently shows that models with explicit function-calling interfaces produce more reliable outputs than free-form generation on multi-step production tasks [2]. The model literally performs better when you give it a structured place to put its answer.

The Null Case Is Where This Really Matters

Direct output has another failure mode that’s worse and harder to catch. The null case.

Say your system should sometimes send an email and sometimes not. You prompt the model: “If there’s nothing notable to report, don’t produce an email.” Sounds reasonable.

Models are terrible at this. They are completion engines. Their entire training reward structure pushes them to produce output. Telling a model to produce nothing is like telling a golden retriever to not fetch the ball. It knows what you said. It’s going to fetch the ball anyway.

In testing, you’ll see it work. The model correctly identifies a null case and outputs nothing, or outputs “No email needed.” You ship it. In production, over thousands of runs, it will randomly generate an email when it shouldn’t have. Not often. Often enough to be a problem. The failure rate is low but nonzero, and it’s unpredictable, which is the worst kind of unreliable.

The fix is the same pattern. Give the model a tool for the null case.

“If an email should be sent, use the send_email tool. If no email should be sent, use the skip_email tool with a brief reason.”

Now the model always does something. It calls one tool or the other. You route on which tool was called. The model is much more comfortable making a binary choice between two actions than it is choosing between action and inaction. You’ve turned “do nothing” into “do a thing that means nothing.” The reliability difference is dramatic.
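The routing side is a simple dispatch on which tool was called. A minimal sketch, assuming the response arrives as an OpenAI-style message with a `tool_calls` list — the tool names and payload fields are hypothetical:

```python
import json

def route_tool_call(message: dict) -> dict:
    """Dispatch on which tool the model called.
    The chat text in message["content"] is deliberately ignored."""
    call = message["tool_calls"][0]  # expect exactly one call
    name = call["function"]["name"]
    args = json.loads(call["function"]["arguments"])
    if name == "send_email":
        return {"action": "send", "email": args}
    if name == "skip_email":
        return {"action": "skip", "reason": args["reason"]}
    raise ValueError(f"unexpected tool: {name}")

# The null case: the model still "does something" by calling skip_email.
skip = route_tool_call({
    "content": "Nothing notable this quarter, so I won't send anything.",
    "tool_calls": [{"function": {
        "name": "skip_email",
        "arguments": '{"reason": "no notable sales activity"}',
    }}],
})
# skip == {"action": "skip", "reason": "no notable sales activity"}
```

The logged `reason` is a bonus: every non-action leaves an audit trail, which is exactly what “the model silently produced nothing” never gives you.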

This works because it aligns with how the model actually operates. It wants to produce. Let it produce. Just give it a structured container for the production, even when the container is empty.

Structured Output Is Better, Tool Calls Are Best

There’s a middle ground that people reach for: structured output modes. JSON mode. Response schemas. These help. They’re better than raw text. But they still use the chat output channel, and they still suffer from the dual-purpose problem. The model is reasoning and producing structured output in the same stream. Tool calls separate these concerns entirely.

The hierarchy, in my experience:

  1. Tool calls — cleanest. Separate output channel. Model shifts register.
  2. Structured output / JSON mode — good. Enforces format. Still in the chat stream.
  3. Raw text with parsing — worst. You’re building a parser for a nondeterministic output format.

If your integration is critical enough to be in production, it’s critical enough for a tool call.

The Cost Is Trivial

Adding a tool definition to your prompt is a few dozen tokens. The latency overhead of a tool call versus direct output is negligible. You need a handler for the tool call, but you already need a handler for the raw output. You’re not adding complexity. You’re moving it from “parse unpredictable text” to “read structured parameters.” That’s a reduction in complexity.

The real cost is the conceptual overhead of thinking about your LLM integration as a tool-calling system rather than a text-generation system. But that’s the correct way to think about it, so the overhead is just learning.

The Pattern

For any LLM integration that produces output a user or system will consume:

  1. Define a tool for the happy path. Give it typed parameters for every field you need.
  2. Define a tool for the null case. Even if it’s just { reason: string }.
  3. Ignore the chat output. It’s the model’s scratch paper. Your production data comes from tool calls only.
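The three steps above fit in one tools list. A sketch, again assuming OpenAI-style schemas with hypothetical names:

```python
tools = [
    {   # 1. Happy path: typed parameters for every field you need.
        "type": "function",
        "function": {
            "name": "send_email",
            "parameters": {
                "type": "object",
                "properties": {
                    "to": {"type": "array", "items": {"type": "string"}},
                    "subject": {"type": "string"},
                    "body": {"type": "string"},
                },
                "required": ["to", "subject", "body"],
            },
        },
    },
    {   # 2. Null case: even "do nothing" is a structured action.
        "type": "function",
        "function": {
            "name": "skip_email",
            "parameters": {
                "type": "object",
                "properties": {"reason": {"type": "string"}},
                "required": ["reason"],
            },
        },
    },
]
# 3. In the handler, read tool calls only. The message's chat content
#    never reaches production.
```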

You will immediately notice cleaner output, more reliable null handling, and fewer of those mysterious “why did it say that” incidents that haunt text-parsing integrations.

Stop treating the model’s chat response as a production artifact. It was never meant to be one.

Footnotes

  1. Gao et al., Guided-Structured Templates for Function Calling, arXiv:2503.09868 (2025)

  2. Berkeley Function Calling Leaderboard (BFCL)