Do LLMs Think? The Calculator, Searle and Wittgenstein

Why this matters

I build and evaluate systems based on language models, and the question "but do these models actually think?" runs through every serious conversation on the subject: the client deciding how much autonomy to give an agent, the colleague dismissing everything as autocomplete, the paper announcing emergent capabilities. The answer you give to that question shapes concrete choices: how much to delegate, how to verify, what vocabulary to use in technical documentation. A quip that has been circulating for a while compresses the whole dispute into one sentence, and taking it apart piece by piece is the most honest way I know to reach a defensible position.

The calculator as analogy, and as trap

"Saying an LLM doesn't think is like saying a calculator can't do numbers." The line circulates as a joke, and like all good jokes it contains a compressed argument. It is worth unpacking carefully, because the analogy is both more solid and more fragile than it looks: understanding where it holds and where it breaks says something important about language models and, at the same time, about whoever judges them.

The calculator first. The sentence "a calculator doesn't know numbers" is, in a precise sense, true. A calculator possesses no concept of number: it switches electrical states according to rules fixed by its designer, and nothing in its operation resembles the understanding a child acquires when learning to count. Yet the same sentence, spoken in front of someone using the calculator to file their taxes, sounds absurd. The calculator does arithmetic: it produces correct results, systematically, better than any human. The absurdity comes from "knowing" being used in two different senses, one constitutive (possessing understanding) and one functional (correctly performing the function), and the sentence is true in the first sense and false in the second.

Whoever claims "an LLM doesn't think" often performs, without declaring it, the same oscillation. They start from a defensible premise on the constitutive side (there is no evidence of consciousness, no intentionality in the strong sense of the term) and use it to suggest a much broader functional conclusion: that there is no inference, no abstraction, nothing that deserves the vocabulary of reasoning. The calculator quip serves to make this move visible. Making it visible, however, is not the same as refuting it. The controversy, at this point, does not yet stem from a technological difference: it stems from a linguistic ambiguity. The rest of the article sets out to establish whether, once the ambiguity is dissolved, a substantive question remains. The answer, I can anticipate, is yes — but not the one either faction expects.

What an LLM actually does

To argue honestly, you first have to mark out what is known. And on this point the situation has changed considerably since 2020: the language model is no longer a completely black box, even if it is far from being an explained algorithm.

The base level is familiar: an LLM is trained to predict the next token over enormous corpora of text. From this description, correct but poor, comes the slogan "it's just autocomplete". The slogan omits what training produces: high-dimensional distributed representations in which concepts are encoded in superposed, compositional form. These representations support non-trivial in-context generalization: mechanistic interpretability identified, as early as 2022 with the work of Olsson and colleagues on induction heads, specific circuits — attentional mechanisms that copy and complete patterns — causally linked to in-context learning. This is not a hypothesis about behavior: it is a localized mechanism, verified through causal interventions on the model.

On explicit reasoning the picture is more nuanced, and it needs stating precisely. Wei and colleagues showed in 2022 that inducing the model to produce intermediate steps (chain-of-thought) markedly improves performance on arithmetic, commonsense and symbolic tasks; Kojima and colleagues showed that the bare instruction "let's think step by step", with no examples, is enough to extract latent capabilities. These results are solid as behavioral phenomena. What they do not show is that the verbalized steps are the internal mechanism by which the model reaches its answer. And here the skeptic today has their best empirical weapon: a growing literature on chain-of-thought unfaithfulness, opened by the work of Turpin and colleagues, documents cases where the verbalized explanation does not reflect the computation that actually produced the answer: the model can reach the result by one route and narrate another. Add to this the fragility of reasoning under superficial perturbations: irrelevant reformulations of a problem can degrade performance that, if it rested on robust abstract competence, should not be affected. Anyone who wants to dismiss chain-of-thought as linguistic theater therefore has data to cite, not just intuitions.

The reply is not to deny these data but to place them. The work of Dutta and colleagues on the mechanistic interpretability of chain-of-thought shows that the model deploys multiple parallel neural pathways for step-by-step reasoning: the internal computation exists and is structured, even when the verbal account is unfaithful. Unfaithful self-reports, after all, are a phenomenon cognitive psychology knows well in humans, who confabulate post-hoc rationalizations with remarkable ease. This does not prove that LLMs reason the way a human does; it proves that "the verbal explanation is unfaithful" is not equivalent to "there is no underlying computation worth the name".

Finally, the residual opacity. The work of Elhage and colleagues on superposition explains why interpretability is hard: models compress more concepts than they have dimensions available, producing polysemantic neurons that resist direct reading. Opacity is neither magic nor mystery: it is an architectural property, with known and partly tractable causes. The honest balance of this section is this: some internal mechanisms of LLMs can be explained today, and that is documented; that these mechanisms constitute a form of genuine reasoning is plausible but contested; that they constitute thought is a question the data alone do not close. To understand why they do not close it, you have to change terrain.

What "thinking" means

The terrain is philosophical, and the first thing you discover crossing it is that the verb "to think" has no neutral definition the parties could agree on before looking at the data. Every position carries its own criteria of attribution, and therefore different verdicts on the same facts.

Functionalism, in the formulation Putnam gave it in 1967, holds that mental states are defined by the causal role they occupy — by their relations to inputs, outputs and other states — and not by the substrate that realizes them. If pain is what pain does, then it can be realized in a brain, in a circuit, in principle in any system with the right causal organization. For a consistent functionalist, the question about LLMs is empirical: do they have the functional organization of thought, or not? The silicon substrate is not, by itself, an argument.

Searle's Chinese Room (1980) is the classic attack on this framework: a man who manipulates Chinese symbols by following rules, without understanding Chinese, produces answers indistinguishable from a speaker's. Therefore, Searle concludes, syntactic manipulation is not sufficient for understanding, and no program, qua program, can understand anything. It is a constitutive objection, not a behavioral one: it strikes at the very idea that function exhausts the mental. The replies are well known — the strongest, the systems reply, observes that it is not the man in the room who must understand, but the overall system of which the man is a component — and the debate, after forty-five years, is not closed. That it is not closed is precisely the relevant datum: if a thought experiment from 1980 still divides philosophers, the notion of understanding is not mature enough to serve as an arbitrating criterion.

Chalmers introduced a hygienic distinction into the recent debate: the question of phenomenal consciousness, whether there is something it is like to be the system, is separable from the question of cognitive capabilities. His conclusion about current LLMs is probabilistic and cautious: most likely not conscious, without the possibility being ruled out for future systems. The distinction serves here in the negative: whoever denies thought to LLMs by appealing to the absence of consciousness must first argue that thought requires consciousness. A respectable thesis, but far from obvious, given how much human thought proceeds without conscious accompaniment.

And then there is Wittgenstein, who in the Philosophical Investigations (§§359–360) confronts the question "can a machine think?" directly and defuses it: "only of a human being and what resembles (behaves like) a living human being can one say: it thinks". The point is not a thesis about the metaphysical impossibility of thinking machines: it is the observation that "thinking" belongs to a grammar, to a network of uses, criteria and forms of life, and that extending it or denying it to new entities is not a discovery about the world but a decision about language. Millière and Buckner, in their two-part survey (part one and part two), show how much the contemporary debates on grounding, compositionality and world models in LLMs retrace philosophical controversies that were never resolved: the technical novelty did not bring with it the criteria to judge it.

The lesson of this section is not skeptical but structural: whoever says "an LLM doesn't think" — or "thinks" — is applying a theory of thought, even when they believe they are stating a fact. And the available theories diverge exactly at the points that would be needed to decide the case.

Where the analogy breaks

Here the opening analogy must be deliberately broken, because its limit is more instructive than its strength.

When you say the calculator "does arithmetic without knowing it", you can say it with total precision because arithmetic is a completely formalized domain. There is an exact syntax of the operations, an exact semantics of the results, public and shared criteria of correctness. It is known what the calculator realizes, it is known how it realizes it, and it is known that the mechanism exhausts the task: nothing of what is called "computing correctly" is left over. Precisely for this reason you can isolate with a scalpel what it lacks, understanding, and observe that, for the function, it is not needed. The sentence about the calculator is decidable because the domain is.

For thought nothing equivalent exists. No shared definition, no formalization, no verification criterion independent of the competing theories. The previous section showed it: functionalists, Searleans and Wittgensteinians do not diverge on the data, they diverge on what would count as thought. It follows that the statement "an LLM doesn't think" cannot have the epistemic status of the statement "a calculator doesn't understand numbers". The second is an observation inside a formalized domain; the first claims the same precision in a domain that does not possess it. To know that something does not think, you would need to know what thinking is, and nobody knows: not for the models, nor, in the end, for human beings.

But — and this is essential to say — this argument cuts in both directions. If the absence of a theory of thought makes "doesn't think" undecidable, it makes "thinks" undecidable too. Whoever used the disanalogy only against the skeptics, keeping it sheathed in front of the enthusiasts, would be practicing a strategic agnosticism that is a vice, not a position. The correct consequence is symmetric: neither attribution is, as things stand, a scientific finding. Both are proposals, more or less motivated, more or less useful, about how to extend a concept whose conditions of application were never fixed for cases like this. The middle position recently defended by Tayyar Madabushi, Torgbi and Bonial, which describes LLM capabilities as context-directed extrapolation over training data priors — beyond the stochastic parrot, short of human reasoning — is valuable precisely because it rejects the false dichotomy; but its vocabulary too remains a descriptive choice, not a verdict on essence.

The decisive difference, then, does not run between calculators and LLMs. It runs between a formalized domain, where attributions of function and understanding can be surgically separated, and a non-formalized domain, where every attribution carries an undeclared theory with it.

The real gaps

Nothing said so far licenses triumphalism, and this section is here to prevent it. Current LLMs have deep limits, which should be described as architectural properties and not brandished as slogans, in either direction.

A language model, in its base form, has no persistence: every conversation starts from zero, and what the interface presents as memory is context re-inserted from outside, not sedimented experience. It has no agency of its own: it forms no goals that survive the context window, it has no history of interactions with the world constraining its future dispositions. It has no body: its relation to reality is entirely mediated by text, which restates in technical form the old grounding problem, forcefully posed by Bender and Koller: whether the symbols it manipulates ever touch anything that is not more text. Shanahan proposed describing the conversational behavior of these systems as role-play: the model is not an interlocutor with a stable point of view but a generator of plausible characters, and treating it as a unitary subject is a category error the conversational interface encourages. The proposal is healthy as an antidote to naive anthropomorphism, even if, taken literally, it risks reducing to theater what is functionally real in the computation.

These limits are real and documented. What is not proven is that they are constitutive of thought rather than of its human variant. Persistence, body, biographical continuity: are they traits of thought as such, or traits of the only exemplar of thinker available so far? Whoever uses them as definitive arguments assumes the second option as if it were the first: again, a theory dressed up as an observation. The honest formulation is: the current limits of LLMs accurately describe their architecture; whether they also describe the impossibility of thought depends on a definition of thought nobody possesses. It is also possible, and it is an open question to be treated as such, that some of these limits will be eroded by architectural developments, and that the question will then return under different conditions.

Wittgenstein had moved the question

I come back to the opening quip, which can now be weighed. Yes, saying an LLM doesn't think resembles saying a calculator can't do numbers: in both cases a true statement about understanding is used to insinuate a false conclusion about function. But the resemblance ends where formalization ends: for the calculator the question can be closed, for LLMs it cannot. Not because the data are missing, but because the definition the data would have to satisfy or violate is missing.

This is why the controversy is not waiting for the decisive experiment. No measurement, benchmark or interpretability result can, by itself, promote "thinks" or "doesn't think" to scientific fact, because the dispute is not about facts: it is about what grammar to give a verb that has always been calibrated on human beings and is now pressed against cases it was not designed for. Wittgenstein saw it right: asking whether a machine can think is not formulating a hypothesis to verify, it is negotiating a use. If one day the answer changes, and it might, it will not be only because the models have changed. It will be because the working meaning of "thinking" has changed, as it already has before, silently, when "memory" extended to computers and "intelligence" to tests.

In the meantime, the two shortcuts remain shortcuts. "It's just a statistical parrot" denies functions that are observable and partly explained. "It thinks like a human" attributes what no criterion licenses attributing. The defensible position is the uncomfortable one in the middle: systems that realize functions for which, in any other context, the vocabulary of thought would be used, on a radically different substrate, with real gaps, inside a conceptual debate that machines have reopened and that only a decision about language can close — deciding, not discovering, what that word should mean.

Transparency note: this article was co-produced with a large language model, through a structured workflow of research, drafting and supervised revision. Given the thesis of the piece, I invite the reader to consider this circumstance materially relevant — one way or the other.

References

Bender & Koller (2020), Climbing towards NLU: On Meaning, Form, and Understanding in the Age of Data, ACL 2020.
Bender, Gebru, McMillan-Major & Shmitchell (2021), On the Dangers of Stochastic Parrots: Can Language Models Be Too Big?, FAccT 2021.
Wei et al. (2022), Chain-of-Thought Prompting Elicits Reasoning in Large Language Models.
Kojima et al. (2022), Large Language Models are Zero-Shot Reasoners.
Olsson et al. (2022), In-context Learning and Induction Heads, Transformer Circuits Thread.
Elhage et al. (2022), Toy Models of Superposition, Transformer Circuits Thread.
Turpin et al. (2023), Language Models Don't Always Say What They Think: Unfaithful Explanations in Chain-of-Thought Prompting, NeurIPS 2023.
Dutta, Singh, Chakrabarti & Chakraborty (2024), How to Think Step-by-Step: A Mechanistic Understanding of Chain-of-Thought Reasoning.
Mirzadeh et al. (2024), GSM-Symbolic: Understanding the Limitations of Mathematical Reasoning in Large Language Models.
Shanahan, McDonell & Reynolds (2023), Role-Play with Large Language Models.
Chalmers (2023), Could a Large Language Model be Conscious?.
Millière & Buckner (2024), A Philosophical Introduction to Language Models: Part I and Part II.
Searle (1980), Minds, Brains, and Programs; SEP entry: The Chinese Room Argument.
Putnam (1967), Psychological Predicates; SEP entry: Functionalism.
Wittgenstein (1953), Philosophical Investigations, §§359–360; SEP entry: Ludwig Wittgenstein.
Tayyar Madabushi, Torgbi & Bonial (2025), Neither Stochastic Parroting nor AGI: LLMs Solve Tasks through Context-Directed Extrapolation from Training Data Priors.

FAQ

Is an LLM just autocomplete?

The description "it predicts the next token" is correct but poor. Training produces distributed representations in which concepts are encoded in superposed, compositional form, and mechanistic interpretability has identified specific circuits, such as induction heads, causally linked to in-context learning. The slogan omits exactly this part.

Does chain-of-thought prove that an LLM reasons?

Not by itself. Intermediate steps improve performance as a behavioral phenomenon, but the unfaithfulness literature shows that the verbalized explanation may not reflect the computation that produced the answer. At the same time, structured neural pathways for step-by-step reasoning do exist: an unfaithful account does not imply the absence of underlying computation.

Why is the calculator a different case from an LLM?

Arithmetic is a completely formalized domain, with public criteria of correctness: function and understanding can be separated precisely. For thought there is no shared definition and no verification criterion independent of the competing theories, so neither "thinks" nor "doesn't think" has the status of a scientific finding.

What limits do current LLMs have?

In their base form they lack persistence (every conversation starts from zero), agency of their own (no goal survives the context window) and a body (their relation to reality is mediated by text alone). These are real, documented architectural properties; whether they are constitutive of thought, rather than of its human variant, remains unproven.

Is Saying an LLM Doesn't Think Like Saying a Calculator Can't Do Numbers?