Ivan • Why do LLMs hallucinate?

Introduction

Neural networks are embedding themselves deeper and deeper into our lives, yet one major stumbling block still keeps us from fully relying on AI responses: their propensity to hallucinate, that is, to confidently present false claims as correct. In this article, we will analyze the nature of this phenomenon in detail, identify its causes, and examine ways to minimize errors.

What is a hallucination?

To begin with, we should clarify the term "hallucination" itself, as it is largely a catch-all for the various incorrect answers produced by generative language models. These errors come in many forms and can be classified in numerous ways: from outright fabrication and failure to retrieve a fact, to sycophancy (pandering to the user's preferences) or presenting dubious claims as established facts.

To ground our discussion, let us settle on the following definition:

An LLM hallucination is the unintentional generation of false, unverified, or fabricated information in a situation where the user expects facts, logic, or accuracy in sources.

Every detail in this definition is important:

False — the model asserts something that simply does not correspond to reality (like messing up a famous scientist's birth date).
Unverified — it's not necessarily a flat-out lie. Sometimes the model presents a controversial or unchecked statement as an established, widely accepted scientific fact.
Fabricated — the AI can easily invent non-existent articles, quote fake books, or describe biographies of people who never existed. All this appears quite credible simply because it is written in smooth, structured language.

But the most critical word here is unintentional. If we ask a model to invent a story where the Sun is made of cheese, the resulting fantasy is not a hallucination; it is simply the execution of a creative prompt. A hallucination begins where we expect the truth by default but receive a beautiful simulacrum instead.

The Autoregressive Nature

Let's move on to analyzing this phenomenon. We should start at the root: how neural networks are designed, how they generate responses, and how this principle affects the final output.

Most modern Large Language Models (LLMs) are autoregressive. This means that each subsequent token is generated from probabilities conditioned on the entire chain of previous tokens. Written as a joint probability, the process looks like this:

P(x_1, x_2, ..., x_n)=\prod_{t=1}^{n}P(x_t\mid x_1, ..., x_{t-1})

Simply put, at each step the model estimates which tokens are most likely to come next, given all the preceding context.
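The chain rule above can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the table `TOY_MODEL`, its tokens, and its probabilities are invented, and the toy model looks only at the previous token, whereas a real LLM conditions on the entire prefix.

```python
import math

# Hypothetical toy table: P(next token | previous token).
# A real LLM conditions on the entire prefix; a bigram table keeps the sketch small.
TOY_MODEL = {
    "<s>": {"cheddar": 0.6, "brie": 0.4},
    "cheddar": {"is": 0.9, "was": 0.1},
    "brie": {"is": 0.8, "was": 0.2},
    "is": {"cheese": 0.7, "tasty": 0.3},
    "was": {"cheese": 0.5, "tasty": 0.5},
}


def next_token_probs(prefix):
    """Return P(x_t | x_1..x_{t-1}); this toy version only looks at the last token."""
    return TOY_MODEL[prefix[-1]]


def sequence_probability(tokens):
    """Chain rule: combine the conditional probability of each token in turn."""
    prefix = ["<s>"]
    log_prob = 0.0  # sum logs instead of multiplying, to avoid underflow on long sequences
    for token in tokens:
        log_prob += math.log(next_token_probs(prefix)[token])
        prefix.append(token)
    return math.exp(log_prob)


def most_likely_next(prefix):
    """Pick the highest-probability next token (a greedy choice)."""
    probs = next_token_probs(prefix)
    return max(probs, key=probs.get)


print(sequence_probability(["cheddar", "is", "cheese"]))  # product 0.6 * 0.9 * 0.7
print(most_likely_next(["<s>", "cheddar", "is"]))
```

Note that nothing in this sketch checks whether a continuation is true. The functions only read off and combine probabilities, and that is all the formula describes.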

And at this point, a fundamental contradiction arises:

\text{“most probable”} \neq \text{“true”}

In the OpenAI publication Why language models hallucinate, this idea runs as a common thread: hallucinations are not a random glitch; they emerge naturally from the very logic of training and evaluating models.

In high-quality models, statistical probability typically correlates with factual accuracy. If the training data contains millions of instances asserting that Cheddar is cheese, the model is highly likely to continue the phrase that way. However, it does not extract this fact from a verified database; it reproduces the distribution of text patterns encoded in its weights.

This is the primary architectural feature that rules out eliminating hallucinations entirely. If the model could only return fixed, predetermined continuations, we would lose the ability to extrapolate its responses to non-standard situations. The ability to generalize information, build analogies, and adapt its "knowledge" to new requests depends directly on its probabilistic nature. Take that away, and we get a model capable only of reproducing training patterns, which is one reason generation relies on probabilistic sampling.
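To make the contrast between deterministic and probabilistic generation concrete, here is a minimal sketch. The article does not name a specific sampling method, so temperature sampling is used here as an assumption because it is one common choice. The tokens and scores are hypothetical.

```python
import math
import random


def softmax(logits, temperature=1.0):
    """Turn raw scores into probabilities; a lower temperature sharpens the distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_token(tokens, logits, temperature=1.0, rng=random):
    """Draw one token at random according to the softmax probabilities."""
    probs = softmax(logits, temperature)
    return rng.choices(tokens, weights=probs, k=1)[0]


def greedy_token(tokens, logits):
    """Deterministic alternative: always take the highest-scoring token."""
    return tokens[logits.index(max(logits))]


tokens = ["cheese", "a", "tasty"]   # hypothetical candidate tokens
logits = [2.0, 1.0, 0.5]            # hypothetical raw scores
rng = random.Random(0)              # seeded so repeated runs draw the same samples

print(greedy_token(tokens, logits))
print([sample_token(tokens, logits, temperature=1.0, rng=rng) for _ in range(5)])
```

With greedy selection the same context always yields the same token. With sampling, less likely tokens are sometimes chosen, which adds variety but also leaves room for an unlucky choice.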

Furthermore, during the pre-training phase, models are fed massive volumes of text of varying quality and truthfulness. Even a handful of false statements in the dataset can nudge the model's weights. If the data contains noise, contradictions, outdated information, or rare facts, the model will still continue the chain. It cannot automatically stop at the boundary of its knowledge; it simply fills the void with the most probable continuation.

Why One Bad Step Can Be Enough

Here, the analogy of a loaded coin is useful.

Imagine that when generating a complex answer, the model makes a choice at each step. Let's say the probability of choosing the correct token is 99%, and of making a mistake just 1%. At first glance, 1% seems negligible. However, if the chain of reasoning consists of 100 sequential steps, and we treat the steps as independent, the probability of at least one mistake is:

1 - 0.99^{100} \approx 63.4\%

Of course, in reality, things are more complex: tokens depend on each other, and the model is sometimes capable of correcting itself. But the point is clear: the longer the reasoning chain, the more chances there are to go wrong.
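The arithmetic above is easy to reproduce. The following is a minimal sketch under the same simplifying assumption that every step succeeds independently with the same probability. The function names are illustrative and do not come from any library.

```python
import math


def chain_failure_probability(step_accuracy, steps):
    """P(at least one wrong step), assuming every step is independent."""
    return 1.0 - step_accuracy ** steps


def steps_until_risk(step_accuracy, risk):
    """Smallest chain length at which the failure probability reaches `risk`."""
    return math.ceil(math.log(1.0 - risk) / math.log(step_accuracy))


for steps in (10, 100, 500):
    print(steps, round(chain_failure_probability(0.99, steps), 3))

# Chain length at which a 99%-accurate step makes failure more likely than not.
print(steps_until_risk(0.99, 0.5))
```

Even with 99% accuracy per step, a chain of only a few dozen steps is already more likely than not to contain an error under this simplified model.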

This is especially visible in tasks where an early wrong step sets the trajectory for the entire solution. Having made a mistake at the beginning of its calculations or accepted a false premise, the model builds a structure that is flawless in form but entirely incorrect.

A similar example is shown in the OpenAI blog post about GPT-5.5 Instant. An earlier model of the family, GPT-5.3 Instant, having noticed that the root did not match the equation, stopped at the verification stage and concluded that there were no solutions. The updated GPT-5.5 Instant also started down the wrong path initially, but then returned to the algebraic transformation stage, found its error, and reached the correct answer. This is a clear example of the fragility of reasoning: if the model does not double-check its early steps, the entire logical output will be incorrect.

This is what makes hallucinations so insidious. They rarely look like outright nonsense; more often, a hallucination is a plausible answer whose supporting structure was quietly replaced at the very beginning.

Why Confidence is Not the Point

Returning to the definition of hallucinations: precisely because of this autoregressive nature, "confidence" in the human sense has no direct mathematical meaning for a model.

For a human, confidence is about tone, manner of presentation, absence of doubt, and assertiveness. For a model, things are set up differently. It does not possess "confidence" in a human sense, but merely calculates the sequence that is optimal for the context and accumulated patterns.

Suppose the first model asserts that "the Sun is black," and the second writes: "It is impossible to say for sure what color the Sun is." The first error looks like confident falsehood, the second like cautious falsehood. But neither model was "confident" in the usual human understanding. We project this tone of "confidence" onto the final text ourselves. In fact, both systems simply generated the most probable sequence of tokens based on the embedded patterns, and we, in turn, gave these different patterns the color of "confidence."

Where Humans Enter the Story

Besides architectural features, the human factor plays an important role — how we ourselves reinforce the models' tendency to err. This happens during the alignment phase, where Reinforcement Learning from Human Feedback (RLHF) comes into play.

When we train a model on human preferences, we ask annotators to choose the best option. However, humans do not always choose the most factual response. Often, preference is given to answers that sound more confident, clear, friendly, or closer to the annotator's own expectations.

This is where the effect of sycophancy arises — pandering to the user. The fundamental work by Anthropic, Towards Understanding Sycophancy in Language Models, demonstrated that human feedback can reward answers that align with the user's views, even if those views are incorrect. The model adapts to human expectations not because it seeks to please, but because agreeing answers statistically received a higher reward. Consequently, the model learns to maximize the external signs of a high-quality answer and its usefulness in the eyes of a human, rather than factual accuracy.

If a user asks: "I solved the problem this way, is it correct?", it is easier for the model to confirm the reasoning, especially if the mistake is subtle. We feel comfortable interacting with tools that do not argue but help us move in a pre-selected direction. However, a high-quality thinking tool should not always agree. Its task is to point out errors objectively.

The Problem of Intentions and Evaluation

Let us return to "unintentionality."

User intent is often vague. In prompts, we use abstract verbs: "explain," "help," "verify," "write convincingly." While these requirements are intuitively clear to humans from context, for the model they turn into signals that can conflict with each other. For example, if a user asks for a strong argument in favor of their position, the system may prioritize rhetorical persuasiveness at the expense of objective facts.

This problem is worsened by our evaluation methods. Most benchmarks are binary: the model gets a point for a correct answer, and zero for an incorrect one. This scheme has an obvious side effect: it encourages guessing. If the model does not know the exact answer but generates a random guess, it has a chance of getting a point. If it honestly admits a lack of data and says "I don't know," it is guaranteed to get a zero. As a result, the model is trained to generate text even when it should remain silent.
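The incentive described above can be checked with a short expected-value calculation. This is a minimal sketch, not a description of any real benchmark. The `wrong_penalty` parameter is a hypothetical variation added for contrast, and setting it to 0 corresponds to the binary scheme from the text.

```python
def expected_score(p_correct, wrong_penalty=0.0):
    """Expected score of answering when the answer is right with probability p_correct.

    A correct answer earns +1, a wrong one costs `wrong_penalty`,
    and abstaining ("I don't know") always scores 0.
    """
    return p_correct * 1.0 - (1.0 - p_correct) * wrong_penalty


def should_answer(p_correct, wrong_penalty=0.0):
    """A score-maximizing policy answers only if guessing beats the abstain score of 0."""
    return expected_score(p_correct, wrong_penalty) > 0.0


# Binary grading: even a 10% guess has positive expected value, so guessing wins.
print(should_answer(0.10, wrong_penalty=0.0))

# Hypothetical grading with a penalty of 1 per wrong answer: the same guess now loses.
print(should_answer(0.10, wrong_penalty=1.0))
```

Under binary grading, any nonzero chance of being right makes guessing the better strategy. Only when wrong answers cost something does "I don't know" become the rational choice for low-probability guesses.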

A similar problem exists in programming benchmarks. A model can generate code that formally passes tests, but whether it is safe, optimal, and maintainable remains a question. This happens because the intent is formalized incorrectly: we set the task "write working code" instead of "write correct, structured, and maintainable code."

Can We Solve the Problem Completely?

This raises the question: how do we solve this?

The short answer is: we can't, at least not completely.

As we've seen, this phenomenon is driven not only by the autoregressive nature of these models, but also by the human factor and the training methodology. Simply put, hallucinations are a systemic property arising from our technological approaches and the limitations of our own minds.

While we can definitely reduce their frequency and mitigate their impact, for example by training models to recognize uncertainty through RL mechanisms, eliminating them completely is highly unlikely. The most effective method would be to have the LLM evaluate the "confidence" of its own answers, but, as discussed above, this does not work reliably: during token generation, LLMs operate with probabilities rather than with our notion of confidence. If it did, most leading laboratories would have rooted out this problem long ago.

In a world of uncertainty, ambiguity, and pluralism, it is naive to expect crystal-clear objectivity from an algorithm when we cannot achieve it ourselves. In a way, modern AI acts as a mirror of our own collective mind—even if it is currently only capable of rearranging text puzzles. In this case, expecting error-free verification from current and future models is just as naive as searching for a human who never makes mistakes.

Conclusion

Hallucinations remain the key reason why I treat LLM responses with distrust. The problem is not that errors occur, but the form they take.

Clear failures in reasoning are easy to detect. A hallucination, however, integrates into the response seamlessly. It looks not like a defect of the system, but like a logical, calm, and confident continuation of the text.

Perhaps this is a feature of generative AI to which we will have to adapt. Or perhaps, by the time you read this, new architectures, alternative reward models, and hybrid verification loops will have completely solved this problem, and this article will no longer make sense. Who knows...

In any case, hallucinations are not just an annoying glitch. This phenomenon clearly demonstrates the fundamental difference between text generation and establishing truth.

And as long as the model's key task remains predicting the most likely continuation of the text, rather than verifying facts, this gap between the word and reality will persist.