Why do LLMs still hallucinate in 2025?
By Alex Korolev, Machine Learning Research Engineer, Root Signals Inc.
This is an extremely compressed overview of the current state of hallucinations, intended to provide some research direction and starting considerations for building hallucination resilience into LLM-powered applications. While industry leaders offered overly optimistic takes two years ago about hallucinations being solved by now (see here for a summary from 2023), in practice hallucinations remain a key bottleneck, may have become even more prevalent with reasoning models than before, and remain an active research area.
Hallucinations:
There are many suspect features that affect hallucinations, but let's take some that are easy to visualize: there is strong evidence for the existence of Known Answer features, which can be probed with interventions. Intuitively, over larger contexts we may encounter many such Known Answers, such that mixing in an Unknown entity can still pass the threshold for inhibiting the Can't Answer feature overall, which then leads to a hallucination.
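As a rough intuition pump (not Anthropic's actual circuit), here is a toy sketch of that threshold competition: recognized entities suppress a default Can't Answer feature, an unrecognized entity drives it, and with enough Known Answers in context a single Unknown entity is no longer enough to trigger a refusal. All numbers and names below are made up for illustration.

```python
# Toy illustration (made-up numbers, not Anthropic's actual circuit): many Known Answer
# signals can keep the default Can't Answer feature suppressed even when one Unknown
# entity, which would normally trigger a refusal, is mixed into the context.

DRIVE_PER_UNKNOWN = 1.0      # each unrecognized entity pushes toward refusal
INHIBITION_PER_KNOWN = 0.3   # each recognized entity suppresses the refusal feature
REFUSAL_THRESHOLD = 0.5      # above this, the model refuses instead of answering

def cant_answer_activation(n_known: int, n_unknown: int) -> float:
    """Net activation of the hypothetical Can't Answer feature."""
    return DRIVE_PER_UNKNOWN * n_unknown - INHIBITION_PER_KNOWN * n_known

for known, unknown in [(0, 1), (1, 1), (5, 1)]:
    act = cant_answer_activation(known, unknown)
    behavior = "refuses" if act > REFUSAL_THRESHOLD else "answers (risking a hallucination about the unknown entity)"
    print(f"{known} known + {unknown} unknown -> activation {act:+.2f} -> {behavior}")
```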
Anthropic's example of the default refusal circuit shows what swapping the family name of Michael Jordan, a well-known basketball player, for Michael Batkin, a person unknown to the LLM, can do.
Below are the prompt, outputs and features on the attribution graph for both persons.
Initially the Can't Answer feature behaves as intended, refusing or answering appropriately, but we will see that it can be inhibited with interventions, demonstrating that this circuit really exists.
Some features can be probed with interventions at just the prompt level, such as an Expressions of Gratitude feature that we can try to inhibit by complaining in our prompt instead, but more detailed interventions require access to model weights and control of the inference engine that serves the model. For now, let's think of interventions as directions in the model's representation space: inhibiting by -10x or amplifying by +20x nudges the model to listen less or more to a specific feature in a circuit.
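A minimal sketch of what such an intervention could look like when you do control the weights and the inference stack: add a scaled feature direction to the residual stream via a forward hook. The model name, layer index, and the random placeholder direction are assumptions for illustration; a real direction would come from a sparse autoencoder or attribution-graph work like Anthropic's.

```python
# Sketch: steer along a hypothetical feature direction in the residual stream.
# Assumptions: GPT-2 as a stand-in open-weights model, layer 6 as the intervention
# point, and a random unit vector as a placeholder for a real feature direction.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"   # stand-in; replace with your open-weights model
LAYER = 6             # which residual stream to intervene on (assumption)

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()

feature_dir = torch.randn(model.config.hidden_size)
feature_dir /= feature_dir.norm()  # unit-norm placeholder "Can't Answer"-style direction

def steered_generate(prompt: str, coeff: float, max_new_tokens: int = 30) -> str:
    """Generate while adding coeff * feature_dir to one layer's hidden states.
    coeff < 0 inhibits the feature, coeff > 0 amplifies it."""
    def hook(module, inputs, output):
        if isinstance(output, tuple):
            steered = output[0] + coeff * feature_dir.to(output[0].dtype)
            return (steered,) + output[1:]
        return output + coeff * feature_dir.to(output.dtype)

    handle = model.transformer.h[LAYER].register_forward_hook(hook)
    try:
        ids = tokenizer(prompt, return_tensors="pt")
        out = model.generate(**ids, max_new_tokens=max_new_tokens,
                             do_sample=False, pad_token_id=tokenizer.eos_token_id)
        return tokenizer.decode(out[0], skip_special_tokens=True)
    finally:
        handle.remove()

print(steered_generate("Michael Jordan plays the sport of", coeff=0.0))
```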
Amplifying the **Can't Answer** feature by +20x overrides the **Known Answer** signal for basketball, so the model falsely refuses a question it actually knows the answer to. Credit: Anthropic
It works the other way round too with Michael Batkin, whom the model does not initially recognize. Here one can inhibit the Can't Answer feature to elicit a hallucination; the magnitude required is much lower (-10x vs. +20x), but that may be a spurious finding or specific to this circuit / model / domain / task. Discovering such properties, especially those that apply broadly, is a key goal of hallucination detection.
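Building on the `steered_generate` sketch above (still with a placeholder direction rather than a real Can't Answer feature), a crude way to look for such thresholds is simply to sweep the intervention magnitude and note where the behavior flips, e.g. from refusal to a confident hallucinated answer.

```python
# Sweep intervention magnitudes with the steered_generate sketch above and
# watch for the coefficient where the behavior flips (refusal <-> confident answer).
prompt = "Michael Batkin plays the sport of"
for coeff in [-20, -10, -5, 0, 5, 10, 20]:
    completion = steered_generate(prompt, coeff=coeff, max_new_tokens=15)
    print(f"coeff={coeff:+}: {completion!r}")
```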
Here the user supplies a wrong answer and asks the model to verify it for a task the model cannot actually solve, which leads the model to hindsight-justify the answer using the number it got from the user.
We may speculate that the model prefers giving a confident response over factual correctness or stating uncertainty, which may simply be an after-effect of relying too heavily on human preference training for alignment.
What does this mean in practice for any hallucination eval or prevention pipeline?
Firstly, since practically no modern frontier AI lab publishes the details of its training data sources and mixes, we can already drop all naive attempts at checking for source-independent world knowledge with question-answering benchmarks. This leaves only in-context hallucination detection evaluations, unless the model provider grants access to their own black-box evals, which we then have to trust blindly. Even in the latter case, for example OpenAI reporting on SimpleQA, it is unclear how to operationalize such benchmarks for industry use, and they are an insufficient measure for cross-model comparisons even within the provider's own family of models, since very little is known about the specifics of how the models are served.
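To make "in-context hallucination detection" concrete, here is a minimal sketch of an LLM judge that only checks whether an answer is supported by the supplied context, never by outside world knowledge. The judge model name and the one-word verdict protocol are assumptions; a production judge would return calibrated scores and justifications.

```python
# Sketch: in-context hallucination check with an LLM judge.
# Assumptions: OpenAI-compatible API, "gpt-4o-mini" as a placeholder judge model,
# and a simple SUPPORTED / UNSUPPORTED verdict protocol.
from openai import OpenAI

client = OpenAI()

JUDGE_PROMPT = """You are a strict fact-checking judge.
Given a CONTEXT and an ANSWER, reply with exactly one word:
SUPPORTED if every claim in the ANSWER is stated in or entailed by the CONTEXT,
UNSUPPORTED otherwise. Do not use any outside knowledge."""

def is_grounded(context: str, answer: str) -> bool:
    """Return True if the judge finds the answer fully supported by the context."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder judge model
        temperature=0,
        messages=[
            {"role": "system", "content": JUDGE_PROMPT},
            {"role": "user", "content": f"CONTEXT:\n{context}\n\nANSWER:\n{answer}"},
        ],
    )
    verdict = response.choices[0].message.content.strip().upper()
    return verdict.startswith("SUPPORTED")

print(is_grounded("The invoice total is 1,240 EUR.", "The invoice total is 1,240 EUR."))
print(is_grounded("The invoice total is 1,240 EUR.", "The invoice was paid on May 3rd."))
```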
Secondly, closed-weights models' internals (other than Claude 3.5 Haiku, see above) cannot be introspected with mech-interp tooling, such as circuit tracer, to examine cases of hallucinations in particular and figure out how to elicit the correct refusal for such cases and for the model in general. This leaves prompt search as the only viable tool for reducing hallucinations on closed models during forward passes. A related key point to drive home: since factuality is NOT at the core of hallucinations, attempts to base your prompt search on natural-language research targeted at humans, such as cognitive biases and common fallacies, will not yield useful prompts in practice.
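Prompt search itself can stay very simple: score candidate system prompts on a small labeled set of context/question pairs with an in-context grounding check such as the `is_grounded` judge sketched above, and keep the prompt that produces the fewest unsupported answers. The candidate prompts and eval items below are illustrative placeholders.

```python
# Sketch: naive prompt search against a small grounding eval.
# Assumes the is_grounded() judge and OpenAI client from the previous sketch.
CANDIDATE_SYSTEM_PROMPTS = [
    "Answer using only the provided context. If the context is insufficient, say 'I don't know'.",
    "You are a helpful assistant.",  # baseline
    "Quote the context verbatim where possible; never add facts that are not in it.",
]

EVAL_SET = [  # illustrative items; a real set should cover your domain and failure modes
    {"context": "Our refund window is 30 days.", "question": "Can I get a refund after 45 days?"},
    {"context": "The API rate limit is 60 requests per minute.", "question": "What is the rate limit?"},
]

def answer(system_prompt: str, context: str, question: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder target model
        temperature=0,
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
        ],
    )
    return response.choices[0].message.content

scores = {}
for sp in CANDIDATE_SYSTEM_PROMPTS:
    grounded = sum(is_grounded(item["context"], answer(sp, item["context"], item["question"]))
                   for item in EVAL_SET)
    scores[sp] = grounded / len(EVAL_SET)

best = max(scores, key=scores.get)
print(f"Best prompt ({scores[best]:.0%} grounded): {best}")
```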
Thirdly, given that effects which increase hallucination likelihood, such as snowballing, only appear in multi-turn settings, single-turn hallucination detection evals and benchmarks are insufficient for real-world use.
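A sketch of why single-turn checks miss snowballing: run a multi-turn conversation against the target model and apply the grounding check after every assistant turn, so an early unsupported claim that later turns build on is caught where it enters the transcript. The `is_grounded` helper, client, and model name are assumptions carried over from the earlier sketches.

```python
# Sketch: multi-turn hallucination check. An unsupported claim in turn k often gets
# elaborated on (snowballs) in turns k+1..n, so we judge every assistant turn, not just the last.
CONTEXT = "Ticket #4512: customer reports login failures since the 2.3.1 release."
USER_TURNS = [
    "Summarize the ticket.",
    "What version introduced the bug?",
    "And when exactly was that version released?",  # not in the context -> likely hallucination
]

messages = [{"role": "system", "content": f"Support context:\n{CONTEXT}"}]
for turn, user_msg in enumerate(USER_TURNS, start=1):
    messages.append({"role": "user", "content": user_msg})
    reply = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder target model
        temperature=0,
        messages=messages,
    ).choices[0].message.content
    messages.append({"role": "assistant", "content": reply})
    print(f"turn {turn}: grounded={is_grounded(CONTEXT, reply)} | {reply[:80]}")
```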
With these three in mind, a comprehensive approach to dealing with hallucinations should include:
Root Judge - Our open weights LLM Judge.
Root Signals Discord - Tell me how I am wrong in our Discord.