

Why I'm more excited about a prompting paper than Gemini 3.1 Pro

Google dropped Gemini 3.1 Pro yesterday and the timeline is buzzing. But while everyone’s debating benchmarks, I’ve been reading a research paper that I think has more practical value for teams shipping AI features in production.

Not because the results are earth-shattering. They’re not, for the most part. But because the insight behind them changes how you think about prompt design.

The paper

“Prompt Repetition Improves Non-Reasoning LLMs” by Yaniv Leviathan, Matan Kalman, and Yossi Matias at Google Research. The technique is dead simple: duplicate your prompt so the model sees the same text twice within a single request, and it performs better.
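Concretely, the whole trick fits in one function. This is a minimal sketch of the idea, not the authors' code; the separator between the two copies is my assumption, not something the paper prescribes.

```python
def repeat_prompt(prompt: str, separator: str = "\n\n") -> str:
    # Duplicate the prompt verbatim. The blank-line separator is an
    # illustrative choice, not taken from the paper.
    return f"{prompt}{separator}{prompt}"


question = "What is the capital of France?"
doubled = repeat_prompt(question)
# Send `doubled` as the user message in place of the original prompt.
```

You'd drop `doubled` into whatever request you were already making; nothing else about the call changes.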

They tested it across Gemini, GPT, Claude and DeepSeek. 47 statistically significant wins out of 70 benchmark tests. Zero losses.

Let’s be honest about the numbers though

Looking at the actual results on standard benchmarks (ARC, MMLU-Pro, GSM8K, MATH), most improvements are modest. We’re talking a couple of percentage points in a lot of cases. Meaningful in aggregate, but not the kind of thing that transforms a use case.

Where it gets interesting is on tasks that are specifically sensitive to information ordering. On a name-index retrieval task, Gemini 2.0 Flash-Lite went from 21% to 97%. That’s not a marginal gain. That’s broken versus working.

That difference matters, and it tells you something important about when this technique is useful.

Why it works (and why that’s the real takeaway)

LLMs are causal language models. Every token can only “look back” at what came before it. Information flows in one direction. So if your answer options appear before the question, those option tokens have zero awareness of what’s being asked. If context sits after your question, the question tokens can’t attend to it.

By repeating the prompt, every token in the second copy gets full visibility of every token in the first. The model finally sees the complete picture.
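A toy model of the causal mask makes this concrete. The positions here are made-up token indices, not output from a real tokenizer:

```python
def can_attend(query_pos: int, key_pos: int) -> bool:
    # Causal attention: a token sees only itself and earlier positions.
    return key_pos <= query_pos


n = 6  # length of a toy prompt, in tokens

# Single copy: an early token (say, an answer option at position 1)
# never attends to a later question token at position n - 1.
early_sees_late = can_attend(query_pos=1, key_pos=n - 1)

# Repeated prompt: token i of the second copy sits at position n + i,
# so every second-copy token attends to the *entire* first copy.
second_copy_sees_all = all(
    can_attend(query_pos=n + i, key_pos=j)
    for i in range(n)
    for j in range(n)
)
```

`early_sees_late` comes out false and `second_copy_sees_all` true, which is the whole mechanism in two booleans.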

The researchers also noted that reasoning models trained with reinforcement learning often teach themselves to repeat parts of the user’s request before answering. Prompt repetition just shifts that behaviour into the prefill stage where it’s parallelisable.

Understanding this mechanism is what makes the paper valuable. Not because you should blindly repeat every prompt, but because it teaches you to think about how your prompt structure interacts with the way the model actually processes it. That’s a skill that pays off well beyond this one technique.

Where this actually matters

The technique is most useful for tasks where the model needs to cross-reference information across different parts of a prompt. Retrieval tasks, context-heavy instructions, anything where the ordering of your input could mean the model is working with incomplete attention.
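To make the ordering problem tangible, here's a sketch in the spirit of that name-index task. The names and indices are invented for illustration; this is not the paper's benchmark data.

```python
names = ["Ada", "Grace", "Alan", "Edsger"]

# Options-first ordering: when the model processes these name tokens,
# it has not yet seen the question, so it can't know which one matters.
listing = "\n".join(f"{i}: {name}" for i, name in enumerate(names))
question = "Which name is at index 2?"
prompt = f"{listing}\n\n{question}"

# With repetition, the second copy's name tokens come *after* the
# question has appeared once, so they're processed with it in view.
repeated = f"{prompt}\n\n{prompt}"
```

The single-pass prompt forces the model to retrieve from tokens that never saw the question; the repeated version doesn't.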

For straightforward question-answering or generation tasks, you’re probably not going to see a dramatic difference.

The conversation I keep having with teams

This feeds into a broader pattern I see when advising product teams on their AI implementations. Something doesn’t perform well and the first instinct is to upgrade to whatever model just launched. The newest frontier option becomes the default answer.

But the newest models are big. Multi-modal, mixture-of-experts architectures built to handle everything from image generation to multi-step reasoning. That breadth comes at a price: higher latency and higher token costs. For a lot of use cases, it’s far more than what’s needed.

Before reaching for the next model up, I push teams to ask whether they’ve actually understood why the current one is failing. Is it a capability gap, or is it a prompt structure problem? Are you hitting a fundamental model limitation, or are you just presenting the task in a way the architecture struggles with?

Getting this right matters at scale. When you’re running AI features across dozens of products, the difference between a lightweight model and a frontier model compounds into response times your users feel, infrastructure costs that shape what you can build next, and throughput limits that determine how many people you can serve.

The bigger point

I’m not suggesting everyone go and double their prompts tomorrow. The gains on standard tasks are modest and you’re doubling your input tokens to get them.
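Back-of-envelope, the extra spend is exactly one more copy of the input. The prices below are placeholders, not any provider's real rates, and the sketch assumes output length stays the same, which won't hold for every task.

```python
def request_cost(input_tokens: int, output_tokens: int,
                 in_price: float, out_price: float,
                 repeat: bool = False) -> float:
    # Repetition doubles only the input side; output tokens are billed
    # as before under the unchanged-output-length assumption.
    multiplier = 2 if repeat else 1
    return multiplier * input_tokens * in_price + output_tokens * out_price


# Placeholder per-token prices in dollars.
baseline = request_cost(2_000, 500, in_price=1e-6, out_price=4e-6)
repeated = request_cost(2_000, 500, in_price=1e-6, out_price=4e-6, repeat=True)
```

Whether that delta is worth a couple of benchmark points is a per-use-case call; for the ordering-sensitive tasks above, it usually is.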

But the insight is worth internalising: prompt structure isn’t just about clarity for the human writing it. It directly affects how the model processes the information. Understanding that relationship is what separates teams that get consistent results from smaller, cheaper models from teams that keep upgrading because “the model can’t handle it.”

Sometimes the model genuinely can’t handle it. But sometimes it just needed to see the full picture.

Paper: https://arxiv.org/abs/2512.14982