Why Emphasis Works in Prompts
People treat prompt emphasis like folk medicine. Bold this. CAPS that. Put it in XML tags. It works, which is enough for most people. But knowing why it works changes how you use it.
I get asked regularly whether caps actually do anything. The answer is yes. Used sparingly, all of these mechanisms are genuinely effective, and the reason is mechanical, not mystical.
Emphasis changes your tokens. It doesn’t signal importance to the model the way it would to a human reader. It changes the token representation in ways that make those tokens more retrievable during the attention lookup. Understanding that distinction matters.
How Attention Actually Works
Every token in the context gets projected into three vectors: a Query, a Key, and a Value.
When the model generates the next token, its Query asks: what in this context is relevant to what I’m doing right now? It computes a similarity score against every Key in the context. High similarity means high attention weight. The output is a weighted sum of the corresponding Values.
Think of it as a soft database lookup. The Query is the search term. The Keys are the index. The Values are the data you retrieve. The model doesn’t read the context sequentially. It queries it.
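The lookup above can be sketched in a few lines. This is a toy, pure-Python version with made-up vectors and no learned projections (real models project every token through learned weight matrices and run many heads in parallel), but the mechanics are the same: score the query against every key, softmax the scores into weights, return the weighted sum of values.

```python
# Toy single-head attention lookup (pure Python, tiny made-up vectors).
import math

def attention(query, keys, values):
    """Score the query against every key, softmax, return the weighted sum of values."""
    d = len(query)
    scores = [sum(qi * ki for qi, ki in zip(query, k)) / math.sqrt(d) for k in keys]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]            # softmax turns scores
    weights = [e / sum(exps) for e in exps]             # into attention weights
    output = sum(w * v for w, v in zip(weights, values))
    return output, weights

# Three context tokens; the second key points closest to the query's direction.
q = [1.0, 0.0]
K = [[0.1, 0.9], [0.95, 0.1], [0.2, 0.8]]
V = [10.0, 20.0, 30.0]
out, w = attention(q, K, V)
# The best-matching key draws the most weight, so its value dominates the output.
```

The softmax is what makes this a *soft* lookup: every value contributes, but the closest key dominates.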
The Similarity Problem
The similarity score measures how close the Query vector is to each Key vector. Tokens that are lexically similar to what the model is currently processing score high. This works well for nearby, recently-seen, closely-matched content.
Semantically similar content that uses different words? Lower score. The model might miss it entirely. This is a documented failure mode, not a theoretical one. Liu et al. showed it empirically with what’s now called the “Lost in the Middle” problem[1]: a U-shaped performance curve where models retrieve information well from the start and end of context, poorly from the middle. The information is there. It just doesn’t get attended to.
This is why restating instructions works. Multiple retrieval targets at different positions.
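A toy calculation shows why restating helps (pure Python, made-up similarity scores, not real model internals): a restated instruction adds a second key that matches the same query, and because softmax normalizes over all keys, the two occurrences together capture more total attention than one occurrence does.

```python
# Toy sketch: duplicating a high-similarity key increases its combined
# attention mass at the distractors' expense. Scores are invented.
import math

def softmax_weights(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    return [e / sum(exps) for e in exps]

distractors = [0.1, 0.2, 0.15, 0.05]   # low query-key similarity: filler prose
instruction = 2.0                      # high similarity: the instruction itself

once  = softmax_weights(distractors + [instruction])
twice = softmax_weights(distractors + [instruction, instruction])

mass_once  = once[-1]                  # attention on the single occurrence
mass_twice = twice[-1] + twice[-2]     # combined attention on both occurrences
# mass_twice > mass_once: two retrieval targets soak up more of the budget.
```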
Why Emphasis Changes the Retrieval
When you write **critical instruction**, the model doesn’t see bold text. It sees tokens: **, critical, instruction, **. Those delimiter tokens are unusual. They don’t appear often in normal prose, but they appear consistently in the training data co-located with content humans marked as important. CAPS works the same way. IMPORTANT is a different token than important. It’s rarer in the training distribution, it co-occurs with human-flagged urgency, and it creates a distinct Key vector. The model learned that association. Research on prompt token saliency has found that capitalization changes shift model attention to specific terms, even in isolation from other formatting changes.[2]
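You can see the splitting with a toy tokenizer. The regex below is a stand-in, not a real model tokenizer (production models use learned subword BPE vocabularies, so the exact splits differ), but the point survives either way: the ** delimiters become tokens of their own, and a cased string is simply a different token sequence than its lowercase twin.

```python
# Toy regex tokenizer: splits on ** delimiters, words, and punctuation.
# Real tokenizers are learned BPE vocabularies; the splits differ in detail.
import re

def toy_tokenize(text):
    return re.findall(r"\*\*|\w+|[^\w\s]", text)

tokens = toy_tokenize("**critical instruction**")
# -> ['**', 'critical', 'instruction', '**']

# Different strings always mean different token sequences, so IMPORTANT
# and important can never share a representation at the input layer.
caps_differ = toy_tokenize("IMPORTANT") != toy_tokenize("important")
```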
Three things happen:
Token distinctiveness. The ** markers create a token pattern that stands out against surrounding prose. The Key vectors for these tokens have a different signature than ordinary text.
Attention head specialization. Transformer attention heads specialize during training. Some track syntax. Some track semantic relationships. Some track instruction boundaries and formatting. Emphasis markers activate the heads that learned to attend to human-flagged importance. This is a direct consequence of training on web data full of markdown, HTML bold tags, and structured documents.
Retrieval boost. When the model is generating output and its Query is looking for “what was I told to do,” the Key vectors for emphasized tokens are more likely to surface. The markers act as index anchors. Not because the model understands importance. Because the tokens are distinct and retrievable.
The same logic applies to other structural markers: headers, numbered lists, XML tags. They all create distinct token patterns that are easier to retrieve than unmarked prose.[2] The caveat with all of this: use it sparingly. Emphasize everything and you’ve emphasized nothing. The distinctiveness effect depends on the markers standing out from the surrounding context. If your whole prompt is bolded, you’ve just changed the baseline.
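The baseline-shift point falls out of the softmax arithmetic. In the toy sketch below (pure Python, invented scores), a marker is modeled as a fixed bump to a key’s similarity score: bump one key and attention concentrates on it; bump every key and the softmax, which only cares about score *differences*, returns the same flat distribution you started with.

```python
# Toy sketch of emphasis dilution: boosting one score concentrates attention;
# boosting all scores equally changes nothing, because softmax is shift-invariant.
import math

def softmax_weights(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    return [e / sum(exps) for e in exps]

plain = [0.2, 0.2, 0.2, 0.2]   # unmarked prose: roughly uniform scores
boost = 1.5                    # invented score bump from a distinctive marker

one_marked = softmax_weights([plain[0] + boost] + plain[1:])
all_marked = softmax_weights([s + boost for s in plain])

# one_marked[0] now dominates; all_marked is uniform again -- nothing stands out.
```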
The Position Problem
Emphasis also partially counteracts positional bias. Content at the beginning and end of a prompt gets more attention. Content in the middle decays. This is well-documented and it’s not a bug in specific models. It’s a property of how attention distributes over long sequences.
A local emphasis signal says “attend to this regardless of position.” It doesn’t fully overcome the bias, but it helps. If you have a long system prompt with a critical constraint buried in the middle, emphasis is doing real work. Exact wording helps too. If your prompt uses the exact tokens the model will need to retrieve during generation, attention finds them more reliably than if it has to bridge a semantic gap.
What This Means in Practice
For prompts:
Put critical instructions at the start and end. Emphasize non-negotiable constraints. Use structural markup to create attention anchors. Restate key instructions rather than assuming one occurrence is enough.
For system design with long prompts:
Don’t write a long paragraph and trust the model to extract the relevant rule when it needs it. Break it up. Make each instruction independently retrievable. Test retrieval, not just comprehension. The model might technically “know” your instruction and still fail to attend to it during generation.
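A retrieval test can be as simple as the harness below. Everything here is illustrative: `call_model` is a hypothetical stand-in for whatever client you actually use, stubbed out so the sketch runs standalone, and the constraint and filler text are invented. The shape is what matters: plant the same constraint at different depths and measure whether the output honors it per position.

```python
# Sketch of a retrieval test (not a comprehension test). `call_model` is a
# hypothetical stub standing in for a real model client; swap in your own.

FILLER = "Background detail that the model should mostly ignore. " * 20
CONSTRAINT = "Always answer in French."

def build_prompt(position):
    """Place the constraint at the 'start', 'middle', or 'end' of the filler."""
    if position == "start":
        return CONSTRAINT + "\n" + FILLER + FILLER
    if position == "middle":
        return FILLER + CONSTRAINT + "\n" + FILLER
    return FILLER + FILLER + CONSTRAINT

def call_model(prompt):
    # Stub: a real API call goes here. This stand-in "follows" the constraint
    # whenever it appears in the prompt, just so the harness is runnable.
    return "Bonjour" if CONSTRAINT in prompt else "Hello"

def constraint_followed(output):
    return output.startswith("Bonjour")

results = {pos: constraint_followed(call_model(build_prompt(pos)))
           for pos in ("start", "middle", "end")}
# With a real model, compare pass rates by position to expose the U-shaped curve.
```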
The formatting variance research puts a number on this. Studies on prompt structure have found up to a 40% performance difference from formatting choices alone[3], and up to 76 accuracy points on specific tasks from prompt design changes.[4] Those numbers are about template-level formatting, not caps specifically, but the underlying mechanism is the same: token distinctiveness changes what the model retrieves.
Emphasis isn’t voodoo. It’s just attention engineering.
Footnotes
1. Liu et al., Lost in the Middle: How Language Models Use Long Contexts, arXiv:2307.03172 (2023)
2. Unveiling and Manipulating Prompt Influence in Large Language Models, ICLR 2024 (token saliency and capitalization effects)
3. Does Prompt Formatting Have Any Impact on LLM Performance?, arXiv:2411.10541 (2024)
4. Quantifying Language Models’ Sensitivity to Spurious Features in Prompt Design, ICLR 2024