Can machines think?
Assessing machine vs human cognition
Setting Turing (1950) against Marr (1982): if Turing’s test 1.0 was about imitating humans in verbal exchanges, Turing’s test 2.0 must be about playing baseball and competing at poker.
Seventy years of studying cognition
We have been studying human and machine cognition for more than seventy years now. Across this period, both have been treated within the same paradigm: cognition is understood as information processing that can be decomposed into formal systems that build representations and transform them. This framing originates in the cybernetics program, where cognition was modeled in terms of control, feedback, and information flow.
Turing and the operationalization of thinking
In its most basic form, addressing cognition means answering the question “What is thinking?”. Alan Turing sought to answer the follow-up question “Can machines think?” in his famous paper Computing Machinery and Intelligence.
Turing operationalized the vague metaphysical question into a testable one: “Could a universal digital computer, properly resourced and programmed, imitate a human well enough in conversation?” His imitation game evaluated thinking through externally observable behavior, specifically symbolic dialogue. In this game, an interrogator attempts to distinguish between two hidden entities – a human and a machine – on the basis of written responses. Success for the machine is not defined by replicating cognition itself but by achieving behavioral indistinguishability.
This operationalization established a methodological template that is still shaping AI testing. For many engineers, the best hopes of reaching Artificial General Intelligence rest on benchmarks where performance is assessed by whether outputs approximate or outperform human responses in dialogue, summarization, or reasoning tasks. Success remains defined in terms of symbol manipulation – embeddings seem to be the machine equivalent of human engrams.
Critiques of Turing’s Operationalization
This operationalization is still powerful. Equating “thinking” with “imitating humans” provides a test through behavior. We apply the same logic in animal cognition research, where we infer internal processes from observable actions, if only because animals cannot fill out questionnaires. This behavioral approach is not without issues either, and one cannot escape the problem of qualia: subjective experience is not easily captured through behavioral proxies. However, this operationalization has never been without critics, including critics of the way imitation is tested.
Turing chose not to evaluate machines on “racing” capabilities, nor on the IQ-style problem solving popular in the early twentieth century. While IQ tests can be reduced to symbolic manipulation, what Turing calls racing capabilities cannot. Consider a baseball player catching a fly ball: the player must compute distance and angle, predict the trajectory, and adjust the body’s motion. This can be solved by calculus-like derivations or by heuristics (Gigerenzer & Brighton, 2009). Either way, it involves processing information and predicting outcomes – hence, thinking. By excluding such cases, Turing privileged symbolic tasks and ignored embodied cognition.
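As a toy check of this contrast (a sketch only, assuming drag-free projectile motion, a catch made in the plane of flight, and arbitrary launch values), the snippet below computes the landing point the “calculus” way and then verifies the cue that the gaze heuristic described in the heuristics literature exploits: seen from the landing point, the tangent of the gaze elevation angle grows at a constant rate, so a fielder who runs to keep that rate constant arrives at the right spot without ever deriving a trajectory.

```python
G = 9.81  # gravitational acceleration (m/s^2)

def landing_point(vx, vy):
    """'Calculus' route: solve the projectile equations for where the ball lands."""
    flight_time = 2 * vy / G                 # time until the ball returns to launch height
    return vx * flight_time, flight_time

def gaze_tangent(t, vx, vy, observer_x):
    """Tangent of the gaze elevation angle from a stationary observer to the ball."""
    ball_x = vx * t
    ball_y = vy * t - 0.5 * G * t ** 2
    return ball_y / (observer_x - ball_x)

vx, vy = 18.0, 18.0                          # arbitrary launch velocity components (m/s)
x_land, flight_time = landing_point(vx, vy)

# Seen from the landing point, tan(gaze angle) rises linearly in time, at the
# constant rate G / (2 * vx): a regularity a fielder can exploit with a simple
# running rule, without computing the trajectory at all.
for t in (0.5, 1.0, 1.5, 2.0):
    print(round(gaze_tangent(t, vx, vy, x_land) / t, 4))   # ~0.2725 every time
```

Both routes transform sensory input into action-relevant predictions; the heuristic simply does so with a far cheaper rule.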
This framing has further consequences. Although some animals can manipulate symbols, Turing’s test excludes the non-linguistic forms of problem solving found across the animal kingdom. Machines also diverge in their information streams: they receive constant data streams and solve optimization problems for estimation purposes, whereas humans sample information sparsely for inference purposes, minimizing energy costs within bodily and environmental constraints. Whether real cognition requires embodiment remains debated. That said, should we equate the two when a robot hand needs multiple layers of transformer architectures to solve 2D-to-3D mapping and plan a trajectory towards a glass of water partially hidden by a foreground object, while flies and C. elegans solve orientation in 3D space despite body-environment interactions, simple architectures, and only a few rules?
The Imitation game: A game of Game Theory
Other critiques involve thought experiments such as Searle’s Chinese Room, which argues that syntactic manipulation of symbols does not amount to semantic understanding. The imitation game also involves deception, and therefore strategic interaction. Solving strategic interactions requires backward induction, payoff matrices, incomplete information, and theory-of-mind reasoning. Humans approximate such reasoning through induction and bounded rationality.
Consider the 11-20 game (Arad & Rubinstein, 2012), in which each player requests an amount between 11 and 20 shekels, receives it, and earns a bonus of 20 if the request is exactly one below the opponent’s; best replies cycle and no pure-strategy Nash equilibrium exists. A large majority of human responses fall on 17, 18, or 19, whereas the mixed-strategy equilibrium requires answering 15, 16, or 17 with roughly 20-25% probability each. Humans iterate through a few steps of backward induction based on their most probable belief about others’ beliefs. Now consider an LLM in reasoning mode solving the 11-20 game. If the LLM outputs tokens that describe some form of theory of mind as it iterates through levels of reasoning (simulating depths of thinking, or perhaps just producing convoluted reasoning) because Arad and Rubinstein (2012) was present in its training data, is that “imitating humans”? Is ChatGPT-5’s response of choosing 15 with high probability the result of bounded backward induction? Tokens are chosen on the basis of logits; is the most probable token equivalent to the most probable belief in a human being?
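A minimal sketch of the two benchmarks this paragraph contrasts: the mixed-strategy equilibrium is derived from the indifference conditions implied by the payoffs described above, and the level-k ladder uses the salient request of 20 as the level-0 anchor, as in Arad and Rubinstein’s account. The rounding of the printed probabilities is cosmetic.

```python
def mixed_equilibrium(bonus=20, top=20):
    """Symmetric mixed equilibrium of the 11-20 game.

    Requesting `top` yields `top` for sure (nobody can ask for top + 1), so every
    request k in the support must satisfy k + bonus * p[k + 1] = top, which pins
    down p[k + 1] = (top - k) / bonus. Probabilities are filled in from the top
    down; the residual mass lands on the lowest request of the support (15).
    """
    p = {}
    request = top
    while True:
        q = (top - (request - 1)) / bonus     # indifference value for this request
        if sum(p.values()) + q >= 1.0:
            break
        p[request] = q
        request -= 1
    p[request] = 1.0 - sum(p.values())        # residual closes the distribution
    return {k: round(v, 3) for k, v in sorted(p.items())}

def level_k_choice(depth, anchor=20, floor=11):
    """Level-0 anchors on the salient request of 20; each level undercuts by one."""
    return max(anchor - depth, floor)

print(mixed_equilibrium())
# {15: 0.25, 16: 0.25, 17: 0.2, 18: 0.15, 19: 0.1, 20: 0.05}
print([level_k_choice(d) for d in (1, 2, 3)])   # [19, 18, 17]: the modal human responses
```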
In the Ultimatum game, human players in both roles tend to converge naturally on the 50/50 division. A reasoning LLM trained on game-theory textbooks can carry out full backward induction and articulate reasoning across several depths. It would propose the minimal positive split (99/1) because that is the subgame-perfect equilibrium. Humans, however, rarely go beyond two or three steps. In sequential games with incomplete information, many human players truncate recursions, mis-specify beliefs, stop at a shallow depth and, above all, rely on (optimal) heuristics. An LLM that prints the full chain of expected-utility computations is solving the game normatively. It would lose unless it had also been trained on the literature on bounded rationality and level-k reasoning.
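A minimal sketch of the same contrast for the Ultimatum game, assuming a pie of 100 and a purely illustrative 30% rejection threshold for the stylized human responder (the threshold is an assumption for the example, not an empirical estimate):

```python
def best_proposal(pie, responder_accepts):
    """Backward induction for the proposer: offer the smallest amount the
    responder would still accept, and keep the rest."""
    acceptable = [offer for offer in range(pie + 1) if responder_accepts(offer, pie)]
    offer = min(acceptable)
    return pie - offer, offer                     # (proposer's share, responder's share)

# Textbook responder: any positive offer beats the zero payoff of rejecting.
payoff_maximizer = lambda offer, pie: offer > 0

# Stylized human responder: rejects offers below 30% of the pie (illustrative threshold).
fairness_minded = lambda offer, pie: offer >= 0.3 * pie

print(best_proposal(100, payoff_maximizer))   # (99, 1): the subgame-perfect split
print(best_proposal(100, fairness_minded))    # (70, 30): much closer to what humans do
```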
To imitate human thinking in strategic settings, a model would need to reproduce humans’ limited depth, noise, heterogeneous priors, and heuristics. This circles back to Turing’s insight that passing the imitation game requires matching human errors and biases. And if the machine competes against a competent human opponent who knows game theory and deceives the interrogator by giving answers that do not reflect bounded rationality, can we still equate the machine’s ability with human thinking?
Child programs with learning is all we need
Despite its limits, Turing’s imitation game remains relevant. Modern LLMs do replicate human-like outputs: they can answer questions, summarize, and even engage in reasoning-like tasks. They produce text that can deceive human judges, showing that Turing anticipated discrete-state machines and his test rather well. Yet we quickly learnt to spot them through characteristic stylistic patterns, whereas humans can keep deceiving one another indefinitely. This shows that Turing’s test is incomplete: performance assessed via symbolic representations at a given moment is insufficient to characterize stable feats of intelligence comparable to those of humans.
Where Turing was prescient was in noting that a true test of the imitation game would require “child machines” that learn through experience, rather than being pre-loaded with exhaustive training. The task ahead was, and remains, to construct child machines capable of learning through experience, education, and reinforcement.
Proposing the Imitation game 2.0: A game of Baseball-Poker
This raises the broader issue that defining machine counterparts of human cognitive functions (vision, reasoning, language, emotions) is always arbitrary. We are deceived by LLMs’ large context windows, whereby they seem to learn about us personally and display a wide range of behaviors. Consider instead an LLM able to retrain on the fly, updating the weights that produce its logits. If cognition is an ensemble of classes (cognitive functions), each described by an ensemble of diverse processes, rather than a single class of symbolic manipulations, then any imitation benchmark risks being partial.
The concern is not whether machines truly “think” like humans, but where to set the threshold at which machines imitate humans well enough to pass as comparable. To improve the cognitive science program, I propose the imitation game 2.0, a game of racing and deceiving that I call Baseball-Poker.
Cognitive science requires more than studying outputs
The next step of the cognitive science program may be to reflect on its own goals. What is the purpose of evaluating LLMs on human-like benchmarks if the outcome is not to prove they think as we do, but to decide when their performance is close enough to count as imitation?
Three decades after Turing, David Marr advanced the computational program of cognitive science by proposing a framework for analyzing complex information-processing systems. Marr argued that perception, and by extension cognition, could not be understood by reducing systems to their elementary parts alone. Complex systems require descriptions across multiple scales (neurons, neural ensembles, areas, loops, systems). Importantly, Marr addresses processes rather than loci.
Marr distinguished three levels for analyzing cognition: the computational (what problem is being solved and why), the algorithmic (how representations are manipulated), and the implementational (how processes are physically realized).
The raw retinal image must be transformed into increasingly structured representations of orientations, lines, surfaces, shapes, and objects, with each stage solving distinct problems constrained by the environment. In theory, the three levels of analysis are independent and one can study a single level alone; working on two at once strengthens the response to objections, and integrating all three is required for a full understanding. In practice, the levels may be hierarchical, with dysfunction at the implementational level constraining higher-level processes.
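To make the separation of levels concrete, here is a toy sketch (the 8×8 image, the horizontal-difference filter, and the 0.5 threshold are arbitrary illustrative choices; the finite difference is only a crude stand-in for the zero-crossing machinery of Marr’s primal sketch):

```python
import numpy as np

# Computational level: what problem is being solved and why. Find where image
# intensity changes sharply, because such discontinuities tend to mark object
# boundaries in the environment.
retinal_image = np.zeros((8, 8))
retinal_image[:, 4:] = 1.0                      # toy "retinal image": dark left, bright right

# Algorithmic level: one possible representation and procedure. Neighbouring-pixel
# intensity differences, thresholded into an edge map.
horizontal_diff = np.abs(np.diff(retinal_image, axis=1))
edge_map = horizontal_diff > 0.5

print(np.argwhere(edge_map)[:3])                # edges found at the column-3/4 border

# Implementational level: how the procedure is physically realised. Centre-surround
# receptive fields in a retina, or floating-point subtraction on silicon here; the
# computational problem stays the same while the hardware varies.
```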
Cognitive science must understand what it’s seeking to model
Bayesian models describe cognition as probabilistic inference. At the computational level, they formalize inference under uncertainty; at the algorithmic level, they specify Bayesian estimation; at the implementational level, they explore neural coding of probability distributions. Whether the brain literally computes probabilities – meaning that neurons compute probabilities – remains debated. One would need to address the three levels at once to resolve the question; much work stays at the computational and algorithmic levels, without bridging to implementation.
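A minimal sketch of what the computational-level claim amounts to: Gaussian cue combination, where the estimate is a precision-weighted average of prior expectation and sensory evidence (the prior, observation, and noise values below are arbitrary illustrative numbers).

```python
def gaussian_posterior(prior_mean, prior_sd, observation, noise_sd):
    """Posterior mean and sd for a Gaussian prior combined with a Gaussian likelihood."""
    prior_precision = 1.0 / prior_sd ** 2
    likelihood_precision = 1.0 / noise_sd ** 2
    posterior_precision = prior_precision + likelihood_precision
    posterior_mean = (prior_mean * prior_precision
                      + observation * likelihood_precision) / posterior_precision
    return posterior_mean, (1.0 / posterior_precision) ** 0.5

# A prior belief that a sound source sits straight ahead (0 degrees, sd 10) combined
# with a noisy auditory cue at 20 degrees (sd 5) is pulled most of the way toward
# the more reliable cue.
print(gaussian_posterior(0.0, 10.0, 20.0, 5.0))   # (16.0, ~4.47)
```

Whether anything like this weighted averaging is realized in neural populations is precisely the implementational question that remains unresolved.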
This highlights a distinction between instrumental models, which prioritize predictive utility, and mechanistic models, which seek biological accuracy. Marr himself seemed to emphasize explanatory adequacy: explaining processes via modelling is necessary to approximate biological reality and, potentially, to guide clinical intervention. For him, a model must capture a sufficient range of the processes it aims to explain in order to count as a proper explanation.
Marr argues that the fly controls its flight through a collection of about five subsystems, and that three processes probably account for about 60% of fly vision. Let us grant that a model accounting for roughly sixty percent of the fly’s vision begins to have explanatory value. Applying this logic to machines such as large language models, we may ask: if they display linguistic competence but lack other essential capacities, such as motor control or perceptual processing, have we achieved a good-enough model of intelligence?
Applications of Marr’s framework now extend beyond vision and language, which remain the dominant focus of AI evaluation. Unlike perception or syntax, domains such as motor control, decision-making, emotion, or peculiar properties of the brain such as those good old somatic markers are less straightforward to formalize. Yet humans and animals solve these problems through input-output transformations. Cognition, Marr argued, is precisely this ability to transform inputs into outputs to meet environmental demands.
Some animals solve similar problems with different strategies; others solve entirely different problems. We often recognize certain animal capacities as “thinking,” while relegating others to “mechanistic” behavior. Yet the threshold between mechanistic response and cognition remains unclear. This same ambiguity shapes our evaluation of machines.
Concluding reflections
Turing framed intelligence not as a single object but as a performance criterion: could machines imitate humans well enough to deceive us? This instrumental perspective launched a powerful research program, but it left aside how cognition is physically realized. In this sense, we should only care about the capacity to imitate. But at what level should we consider imitation a success? Is mapping a machine output (ŷ) onto a human output (y) with an above-chance success rate sufficient?
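To make the above-chance criterion concrete, here is a minimal sketch using a normal approximation and a hypothetical run of 100 paired trials (the numbers are invented for illustration); it answers only “better than a coin flip”, which is exactly why it may be too weak a threshold.

```python
import math

def above_chance(successes, trials, chance=0.5):
    """Normal-approximation z-test for 'imitation succeeds more often than chance'."""
    rate = successes / trials
    standard_error = math.sqrt(chance * (1 - chance) / trials)
    z = (rate - chance) / standard_error
    return rate, round(z, 2), z > 1.645       # one-sided 5% criterion

# 60 machine outputs out of 100 hypothetical trials pass for the human output:
# statistically above chance, yet arguably far from "passing as comparable".
print(above_chance(60, 100))                  # (0.6, 2.0, True)
```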
Marr, by contrast, emphasized explanation through modelling processes across computational, algorithmic, and implementational levels. A model only counts as explanatory if it captures a sufficient range of the processes it claims to represent. In Turing’s terms, programming a system to reproduce human behavior without physical implementation may suffice to win the imitation game. In Marr’s terms it leaves out the implementational level.
The two approaches are not opposed. Instrumental successes can serve as stepping stones toward mechanistic models. But if our goal is explanation, we cannot remain satisfied with models that only simulate outputs.
Resources
Turing, A.M. (1950). Computing machinery and intelligence. Mind, 59, 433-460.
Marr, D. (1982). Vision: A Computational Investigation into the Human Representation and Processing of Visual Information. W. H. Freeman.
Gigerenzer, G., & Brighton, H. (2009). Homo heuristicus: Why biased minds make better inferences. Topics in Cognitive Science, 1(1), 107-143.
Arad, A., & Rubinstein, A. (2012). The 11–20 money request game: A level-k reasoning study. American Economic Review, 102(7), 3561-3573.

