Don't Distract the Model


When someone talks to you, you listen. One voice, full attention. Add a second person talking at the same time. You manage. You’re switching, triaging, doing your best. Add a third and you’re no longer really listening to anyone. You’re deciding who’s most important and letting the rest wash over you.

That dynamic isn’t a metaphor for what happens when you overload a prompt. It is what happens. The attention mechanism isn’t poetic. It’s a real allocation of computational resources. When a model is generating output for two competing objectives, neither one gets the model’s full attention.

What “Competing” Means

Not every multi-task prompt is a problem. A model can handle a large, dense task: reviewing 10 files, refactoring an entire module, writing a comprehensive guide. The task can be enormous. What it can’t be is internally contradictory about what a good output looks like.

“Redesign the site and conduct a security review” is the kind of prompt that feels efficient and produces two mediocre outputs. The redesign optimizes for clarity, visual hierarchy, user experience. The security review optimizes for exhaustiveness, skepticism, edge cases. Same output, two opposing aesthetics. What you get is a design pass that stayed shallow because part of the model’s attention was elsewhere, and a security review that missed things because the output shape was being pulled toward clean presentation.

The model isn’t being lazy. It’s being split.

Hard tasks are fine. Dense tasks are fine. The question is whether the success criteria for one task fight the success criteria for another in the same output. If they do, you’ve built in a trade-off the model has to navigate. It navigates by producing something that partially satisfies both and fully satisfies neither.

The Mechanism

Transformer attention distributes over context during generation.1 When the model generates output for task A, it queries context for task A signals. Task B is in there too, competing for activation. The output shape gets pulled in two directions. This is the same dynamic that makes emphasis work as an attention anchor: if you want something attended to, it has to win the query competition. Two tasks are competing. One tends to lose.
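
For intuition, the standard scaled dot-product attention makes the budget explicit. The softmax row for each query sums to 1, so attention mass is zero-sum across the context:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V$$

Whatever weight task B’s tokens capture under that softmax is weight task A’s tokens no longer receive.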

IFEval, Google’s instruction-following benchmark, shows the empirical version of this cleanly: as the number of verifiable constraints in a prompt increases, prompt-level compliance drops.2 The model doesn’t refuse the instructions. It satisfies some at the expense of others, or satisfies all of them shallowly. More is less.

Developing the Feel

The rule I’ve settled on: imagine grading the outputs independently. If improving the grade on one would lower the grade on the other, split the prompt. They’re competing.

Some things are fine to combine:

  • Tasks that share context and reinforce the same output shape (summarize this, then identify the three main risks)
  • Sequential tasks where the output of one feeds naturally into the next (sketched after this list)
  • Volume work that’s high in quantity but coherent in type (refactor these 15 functions to use the new interface)
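
Here’s the chaining sketch mentioned above: two prompts, one output shape each, with the first output feeding the second as context. The `complete()` function is a hypothetical stand-in, not any particular provider’s API; wire it to whatever client you use.

```python
# Hypothetical stand-in for a model client call -- not a real library API.
def complete(prompt: str) -> str:
    raise NotImplementedError("wire this to your model provider's client")

def spec_then_test_plan(requirements: str) -> tuple[str, str]:
    # Prompt 1: one output shape -- a technical spec.
    spec = complete(f"Write a technical spec for:\n\n{requirements}")
    # Prompt 2: the spec becomes context for a second single-shape task.
    test_plan = complete(f"Derive a test plan from this spec:\n\n{spec}")
    return spec, test_plan
```

Each prompt grades on one rubric; the chain carries context between them without making the output shapes compete.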

Some things shouldn’t share a prompt:

  • Tasks with different audiences (write a technical spec and a one-page executive summary)
  • Tasks with different quality signals (generate creative options and evaluate which is most technically sound)
  • Tasks where thoroughness in one direction is noise in the other

Agents compound this. An agent holding two objectives simultaneously while navigating a multi-step loop will drift. One objective tends to get optimized at each decision point, and the other accumulates small losses. By the end, one task got done well. You may not know which one.

The Tell

You develop a feel for this. Overstuffed prompts produce outputs with a certain quality: hedged, covering ground without committing to any of it. A prompt that asks for both a recommendation and a counterargument in the same output often produces a wishy-washy version of both. The model found the middle and sat there.

The correction is simple: one output shape per prompt. Give the model room to do one thing well. You can chain the outputs. You can run prompts in parallel. Two excellent outputs take less time to review than one mediocre combined one. The overhead is worth it.
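
If the extra round trips bother you, run the split prompts concurrently. A minimal asyncio sketch, again with a hypothetical `acomplete()` standing in for an async client call:

```python
import asyncio

# Hypothetical async stand-in for a model client call -- not a real API.
async def acomplete(prompt: str) -> str:
    raise NotImplementedError("wire this to your model provider's client")

async def redesign_and_review(site_code: str) -> tuple[str, str]:
    # Two prompts, two output shapes, each with the model's full attention.
    redesign, review = await asyncio.gather(
        acomplete(f"Redesign this site for clarity and hierarchy:\n\n{site_code}"),
        acomplete(f"Conduct a security review of this code:\n\n{site_code}"),
    )
    return redesign, review

# Usage: redesign, review = asyncio.run(redesign_and_review(site_code))
```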

Prompts that pack in enough related work to keep the model fully occupied are great. Prompts that ask the model to simultaneously serve two masters are the problem. The model will try. That’s the issue.

Footnotes

  1. Liu et al., Lost in the Middle: How Language Models Use Long Contexts, arXiv:2307.03172 (2023). Shows that models’ use of long contexts degrades for information in non-salient positions, notably the middle of the context.

  2. Zhou et al., Instruction-Following Evaluation for Large Language Models (IFEval), arXiv:2311.07911 (2023). Prompts with multiple verifiable constraints show lower prompt-level compliance than single-constraint prompts.