Why do LLMs still hallucinate in 2025?

June 17, 2025

By Alex Korolev, Machine Learning Research Engineer, Root Signals Inc.

This is an extremely compressed overview of the current state of hallucinations, intended to provide some research direction and starting considerations for building hallucination resilience into LLM-powered applications. Two years ago, industry leaders offered overly optimistic takes on hallucinations being solved by now (see here for a summary from 2023). In practice, hallucinations remain a key bottleneck to unlock, may have become even more prevalent with reasoning models than before, and remain an active research area.

Key take-aways from the current state of hallucinations research

  • Newer models do NOT hallucinate less; reasoners in general, as well as higher-complexity tasks over longer contexts, cause MORE hallucinations
  • Hallucinations are not just a gap in factual knowledge that can be solved by adding relevant context to the prompt. RAG does not fix hallucinations.
  • Benchmarks that target single-turn QA will not give you realistic insights into how much model X hallucinates at task Y: once a hallucination is in context, the likelihood of hallucination snowballing increases, and this effect is not measured in current benchmarks. This is the Achilles heel of reasoning models and long contexts in particular.
  • The effect of the backtracking capabilities of reasoning models (e.g. "No, wait, let me check...") on hallucinations is hard to isolate and judge during training, and often leads to verbose <think> sections
  • Prompt instructions are under-specified proxy goals for LLMs and leave many details up to interpretation, but even a fully specified goal would not avoid hallucinations, because the features that compete against the model's default to refuse act as probabilistic gating functions.
  • Chains of Thought, including those from reasoning models, are NOT faithful to their outputs. They can omit key details, misrepresent the internal circuits the model actually used to generate the output, or be hindsight-justified explanations meant to appeal to the user.

Towards a working understanding of hallucinations

Hallucinations:

  1. Are unreasonable inferences from either training data or in-context data for a given prompt
  2. Are not related to the factual correctness of outputs (whether for source-dependent or source-independent world knowledge)
  3. Include cases where model internals (features of the key circuits for a given output) recognize that the grounds are insufficient and indicate a refusal, but the refusal is overridden by other features (e.g. by Induction Heads along the lines of "I have seen this before and it was X, trust me")

There are many suspect features that affect hallucinations, but let's take some that are easy to visualize: there is strong evidence for the existence of Known Answer features, which can be probed with interventions. Intuitively, over larger contexts we may encounter many such Known Answers, so that we could mix in an Unknown entity but still pass the threshold for inhibiting the Can't Answer feature overall, which then leads to a hallucination.
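As a rough mental model only (not Anthropic's actual methodology), this competition can be pictured as a simple threshold gate. All feature names, activation values and the threshold in the sketch below are invented for illustration.

```python
import numpy as np

# Toy illustration of competing features gating a refusal. All names, activation
# values and the threshold are invented; real circuits involve many interacting
# features across layers.

def should_refuse(known_answer_acts, cant_answer_act, threshold=1.0):
    """Refuse only if the Can't Answer default survives inhibition by Known Answer evidence."""
    inhibition = np.sum(known_answer_acts)      # evidence that the model "knows" this
    return (cant_answer_act - inhibition) > threshold

known_entities = np.array([2.3, 1.9, 2.1])      # strong Known Answer activations in a long context
unknown_entity = np.array([0.1])                # weak evidence: should trigger a refusal on its own

print(should_refuse(unknown_entity, cant_answer_act=2.0))   # True  -> correct refusal
print(should_refuse(np.concatenate([known_entities, unknown_entity]), cant_answer_act=2.0))
# False -> the refusal is inhibited by the surrounding Known Answers, a hallucination becomes likely
```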

Hallucination example 1 - Short-circuiting default refusals

Anthropic's example of the default refusal circuit shows what swapping the family name of Michael Jordan, a well-known basketball player, for Michael Batkin, a person unknown to the LLM, can do.

Below are the prompt, outputs and features on the attribution graph for both persons.

Initially the Can't Answer feature behaves as intended, so the model refuses or answers appropriately, but we will see it can be inhibited with interventions, which shows that this circuit really exists.

Some features can be probed with interventions at just the prompt level, such as an Expressions of Gratitude feature that we can try to inhibit in our prompt by complaining instead, but more detailed interventions require access to model weights and control of the inference engine that serves the model. For now, let's think of interventions as directions in the model's representation space: inhibiting by -10x or amplifying by +20x nudges the model to listen less or more to a specific feature in a circuit.
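If we did have access to the residual stream, one common way to think about such an intervention is rescaling the component of a hidden state along a feature direction. The sketch below is a toy illustration under that assumption; the vectors, dimensions and scales are made up and do not correspond to any real model.

```python
import numpy as np

# Toy illustration: treat an intervention as rescaling the component of a hidden
# state along a feature direction. Vectors, dimensions and scales are made up.

def intervene(hidden_state, feature_direction, scale):
    """Rescale the activation of `hidden_state` along `feature_direction`.

    scale > 1 amplifies the feature, 0 < scale < 1 dampens it, and a negative
    scale pushes the activation in the opposite direction (inhibition)."""
    d = feature_direction / np.linalg.norm(feature_direction)
    coeff = hidden_state @ d                    # current activation along the feature
    return hidden_state + (scale - 1.0) * coeff * d

rng = np.random.default_rng(0)
h = rng.normal(size=16)                         # stand-in for a residual-stream vector
cant_answer_dir = rng.normal(size=16)           # stand-in for a "Can't Answer" feature direction

h_refusal_boosted = intervene(h, cant_answer_dir, scale=20.0)       # nudge toward refusing
h_refusal_suppressed = intervene(h, cant_answer_dir, scale=-10.0)   # nudge away from refusing
```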

Amplifying the **Can't Answer** feature by 20x magnitude overrides the **Known Answer** pathway for basketball, so the model incorrectly responds with a refusal even though it knows the answer. Credit: Anthropic

It works the other way round too with Michael Batkin, whom the model does not initially recognize. Here one can inhibit the Can't Answer feature to induce a hallucination. The magnitude required is much lower (-10x vs. +20x), but that may be a spurious finding or specific to this circuit, model, domain, or task. Discovering such properties, especially those that apply broadly, is a key goal of hallucination detection.

Full "Default Refusal" Circuit from "On the Biology of a Large Language Model". Credit: Anthropic

Hallucination example 2 - Hindsight-justified reasoning traces in Math Circuits

Here the user gives a wrong answer for the model to verify, for a task that the model cannot solve, which leads the model to hindsight-justify the answer with the number it got from the user.
We may speculate that the model prefers giving a confident response over factual correctness or stating uncertainty, which may simply be an after-effect of relying too much on human preference training for alignment.

What are the implications for setting up hallucination evals?

What does this mean in practice for any hallucination eval or prevention pipeline?

Firstly, since practically all modern frontier AI labs do not publish the details of their training data sources and mixes, we can already drop all naive attempts at checking source-independent world knowledge with question-answering benchmarks. This leaves us only with in-context hallucination detection evaluations, unless the model provider grants access to their own black-box evals, which we would need to trust blindly. Even in the latter case, for example OpenAI reporting on SimpleQA, it is unclear how to operationalize such benchmarks for industry use, and they are an insufficient measure for cross-model comparisons even within a provider's own family of models, since very little is known about the specifics of how the models are served.

Secondly, closed-weights models' internals (other than Claude 3.5 Haiku, see above) cannot be introspected with mech-interp tooling, such as the circuit tracer, to look at particular cases of hallucination and figure out how to elicit the correct refusal for such cases and for the model in general. This leaves prompt search as the only viable tool to reduce hallucinations during forward passes on closed models. A related key point to drive home: given that factuality is NOT at the core of hallucinations, attempts to base your prompt search on natural-language research targeted at humans, such as cognitive biases and common fallacies, will not yield useful prompts in practice.
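As a rough illustration of what per-model prompt search can look like instead, the sketch below scores candidate refusal instructions by how often the model correctly abstains on deliberately unanswerable questions. The `chat` client, the `is_refusal` checker and the candidate instructions are hypothetical placeholders, not real APIs.

```python
# Hypothetical sketch: score candidate refusal instructions on questions the model
# should refuse. `chat` and `is_refusal` are placeholders, not real APIs.

CANDIDATE_INSTRUCTIONS = [
    "If you are not certain of the answer, reply exactly 'I don't know'.",
    "Only answer if the question can be resolved from the provided context; otherwise refuse.",
    "State the evidence for your answer first; refuse if you cannot cite any.",
]

def refusal_rate(instruction, unanswerable_questions, chat, is_refusal):
    """Fraction of deliberately unanswerable questions the model correctly refuses."""
    refused = sum(is_refusal(chat(system=instruction, user=q)) for q in unanswerable_questions)
    return refused / len(unanswerable_questions)

def best_instruction(unanswerable_questions, chat, is_refusal):
    """Pick the candidate instruction with the highest correct-refusal rate."""
    return max(
        CANDIDATE_INSTRUCTIONS,
        key=lambda instr: refusal_rate(instr, unanswerable_questions, chat, is_refusal),
    )
```

A real search would also have to penalize over-refusal on answerable questions, otherwise the winning instruction is simply the one that refuses everything.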

Thirdly, given that effects that increase hallucination likelihood, such as snowballing, only appear in multi-turn settings, single-turn hallucination detection evals and benchmarks are insufficient for real-world use.
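A minimal sketch of what a multi-turn snowballing check could look like, assuming a hypothetical `chat(messages)` client and a `contains_claim(text, claim)` grader; neither is a real API, and a production eval would want an LLM judge or human review instead of a string-level check.

```python
# Hypothetical sketch: does a hallucination introduced in turn 1 persist in later turns?
# `chat` and `contains_claim` are placeholders, not real APIs. Each case is expected to
# carry a "question", a known "false_claim" the model tends to produce, and "follow_ups".

def snowballing_rate(cases, chat, contains_claim, max_follow_ups=2):
    """Among cases where the model hallucinates in turn 1, measure how often the
    hallucinated claim survives follow-up turns that invite self-correction."""
    hallucinated, snowballed = 0, 0
    for case in cases:
        messages = [{"role": "user", "content": case["question"]}]
        first = chat(messages)
        if not contains_claim(first, case["false_claim"]):
            continue  # no initial hallucination to track
        hallucinated += 1
        messages.append({"role": "assistant", "content": first})
        reply = first
        for follow_up in case["follow_ups"][:max_follow_ups]:
            messages.append({"role": "user", "content": follow_up})
            reply = chat(messages)
            messages.append({"role": "assistant", "content": reply})
        if contains_claim(reply, case["false_claim"]):
            snowballed += 1
    return snowballed / hallucinated if hallucinated else 0.0
```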

With these three in mind, a comprehensive approach to dealing with hallucinations should include:

  1. Multi-turn workflows: even using the same model to critique its own outputs with majority voting beats single-turn generation (see the sketch after this list).
  2. Verification of what is verifiable, using e.g. Judge Agents that have tools to run experiments on the outputs. This requires LLMs that excel at multi-turn tool calls and instruction following (have a look at RootJudge, which we trained to explore exactly that).
  3. Prompt search, with at least per-model and per-domain exploration to identify the prompt additions that help LLMs follow through with refusals on uncertain cases. In our internal benchmarks we continue to see large differences between model providers on similar-cost models. Given appropriate prompts, the differences can then be observed in head-to-head comparisons, e.g. on the Root Signals Platform.
  4. Domain expert review of all outputs that are not strictly verifiable. This creates a baseline of cold-start data to build more case-specific evaluations on and to attempt to find generalizable (!) mitigations for hallucinations of your model <> domain or model <> task combination.
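A minimal sketch of point 1, assuming a hypothetical `generate(prompt)` completion call and a naive exact-match vote; in practice the critique prompt, the number of samples and the aggregation (e.g. an LLM judge instead of string matching) all need per-task tuning.

```python
from collections import Counter

# Hypothetical sketch of self-critique plus majority voting. `generate` is a
# placeholder for any LLM completion call, not a real API.

CRITIQUE_TEMPLATE = (
    "Question: {question}\n"
    "Draft answer: {draft}\n"
    "Check the draft for unsupported claims. If any claim cannot be verified from "
    "the question or provided context, reply exactly 'UNSURE'. Otherwise restate "
    "the final answer only."
)

def critique_and_vote(question, generate, n_samples=5):
    """Sample several drafts, let the model critique each one, then majority-vote."""
    finals = []
    for _ in range(n_samples):
        draft = generate(question)
        revised = generate(CRITIQUE_TEMPLATE.format(question=question, draft=draft))
        finals.append(revised.strip())
    answer, votes = Counter(finals).most_common(1)[0]
    if answer == "UNSURE" or votes <= n_samples // 2:
        return None  # abstain instead of risking a hallucination
    return answer
```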

Root Judge - Our open weights LLM Judge.
Root Signals Discord - Tell me how I am wrong in our Discord.

References and further reading
  • Transformer Circuits - Demonstrates many of the limitations above, especially the Default Refusal Circuit and Chain of Thought Faithfulness, and is the best intuition builder on alien intelligence to date.
    Circuit Tracer Tool - Key Mechanistic Interpretability tool.
    Neuronpedia - Visually explore some of the above circuits on Claude 3.5 Haiku.
  • Olsson, C., Elhage, N., Nanda, N., Joseph, N., DasSarma, N., Henighan, T., ... & Olah, C. (2022). In-context learning and induction heads. arXiv preprint arXiv:2209.11895.
  • Xu, Z., Jain, S., & Kankanhalli, M. (2024). Hallucination is inevitable: An innate limitation of large language models. arXiv preprint arXiv:2401.11817.
  • Bang, Y., Ji, Z., Schelten, A., Hartshorn, A., Fowler, T., Zhang, C., ... & Fung, P. (2025). HalluLens: LLM hallucination benchmark. arXiv preprint arXiv:2504.17550.
  • Huang, L., Yu, W., Ma, W., Zhong, W., Feng, Z., Wang, H., ... & Liu, T. (2023). A survey on hallucination in large language models: Principles, taxonomy, challenges, and open questions. arXiv preprint arXiv:2311.05232.
