Sally–Anne test
The Sally–Anne test is a psychological test originally conceived by Daniel Dennett, used in developmental psychology to measure a person's social cognitive ability to attribute false beliefs to others.[1] Based on the earlier study by Wimmer and Perner (1983),[2] the Sally–Anne test was so named by Simon Baron-Cohen, Alan M. Leslie, and Uta Frith (1985) who developed the test further;[3] in 1988, Leslie and Frith repeated the experiment with human actors (rather than dolls) and found similar results.[4]

Test description
To develop an efficacious test, Baron-Cohen et al. modified the puppet play paradigm of Wimmer and Perner (1983): rather than merely narrating a story about hypothetical characters, the experimenter acts the story out with dolls, making the characters tangible to the child.
In the test process, after introducing the dolls, the child is asked the control question of recalling their names (the Naming Question). A short skit is then enacted; Sally takes a marble and hides it in her basket. She then "leaves" the room and goes for a walk. While she is away, Anne takes the marble out of Sally's basket and puts it in her own box. Sally is then reintroduced and the child is asked the key question, the Belief Question: "Where will Sally look for her marble?"[3]
In the Baron-Cohen, Leslie, and Frith study of theory of mind in autism, 61 children were tested with "Sally" and "Anne": 20 diagnosed as autistic under established criteria, 14 with Down syndrome, and 27 determined to be clinically unimpaired.[3]
Outcomes
For a participant to pass this test, they must answer the Belief Question correctly by indicating that Sally believes that the marble is in her own basket. This answer is consistent with Sally's perspective, but not with the participant's own. A participant who cannot take the alternative perspective will indicate that Sally has reason to believe, as the participant does, that the marble has been moved. Passing the test is thus seen as evidence that the participant understands that Sally holds her own beliefs, which may not match reality; this is the core requirement of theory of mind.[5]
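The reasoning required to pass can be made explicit in a short, purely illustrative sketch that tracks the true state of the world separately from each doll's belief, updating a belief only when that doll witnesses an event (all names and structure here are hypothetical, not drawn from any cited study):

```python
# Illustrative belief-tracking sketch of the Sally-Anne scenario.
def run_sally_anne():
    world = "basket"                               # true marble location
    beliefs = {"Sally": "basket", "Anne": "basket"}
    present = {"Sally", "Anne"}                    # who is in the room

    present.discard("Sally")                       # Sally leaves for a walk

    world = "box"                                  # Anne moves the marble
    for agent in present:                          # only witnesses update
        beliefs[agent] = world

    # Belief Question: Sally will look where she *believes* the marble is.
    return beliefs["Sally"], world

sally_looks, actual = run_sally_anne()
```

A participant who answers correctly reports `beliefs["Sally"]` ("basket"); a participant who fails reports the true state `world` ("box"), conflating their own knowledge with Sally's.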
In the Baron-Cohen et al. (1985) study, 23 of the 27 clinically unimpaired children (85%) and 12 of the 14 children with Down syndrome (86%) answered the Belief Question correctly. However, only four of the 20 autistic children (20%) answered correctly. Overall, children under the age of four, along with most autistic children (of older ages), answered the Belief Question with "Anne's box", seemingly unaware that Sally does not know her marble has been moved.[3]
Criticism
Ruffman, Garnham, and Rideout (2001) further investigated links between the Sally–Anne test and autism in terms of eye gaze as a social communicative function. They added a third possible location for the marble: the pocket of the investigator. When autistic children and children with moderate learning disabilities were tested in this format, both groups answered the Belief Question equally well; however, participants with moderate learning disabilities reliably looked at the correct location of the marble, while autistic participants did not, even when they answered the question correctly.[6] These results may reflect the social deficits associated with autism.
Tager-Flusberg (2007) states that, in spite of the empirical findings with the Sally–Anne task, there is growing uncertainty among scientists about the importance of the underlying theory-of-mind hypothesis of autism. In every study conducted, some children with autism have passed false-belief tasks such as the Sally–Anne test.[7]
In other hominids
Eye tracking of chimpanzees, bonobos, and orangutans suggests that all three anticipate the false beliefs of a subject in a King Kong suit, and pass the Sally–Anne test.[8][9]
Artificial intelligence
Artificial intelligence and computational cognitive science researchers have long attempted to computationally model humans' ability to reason about the (false) beliefs of others in tasks like the Sally–Anne test. Many approaches have been taken to replicate this ability in computers, including neural network approaches,[10] epistemic plan recognition,[11] and Bayesian theory-of-mind.[12] These approaches typically model agents as rationally selecting actions based on their beliefs and desires, which can be used to either predict their future actions (as in the Sally–Anne test), or to infer their current beliefs and desires. In constrained settings, these models are able to reproduce human-like behavior on tasks similar to the Sally–Anne test, provided that the tasks are represented in a machine-readable format.
With the rise of large language models (LLMs), researchers have found that frontier models can now routinely pass classic false-belief tasks like the Sally–Anne test. A 2023 paper from Microsoft Research first reported that GPT-4 could pass an instance of the test, interpreting this as evidence of "a very advanced level of theory of mind."[13] Kosinski (2024) tested eleven LLMs on 40 bespoke false-belief tasks requiring correct answers across eight scenarios each; GPT-4 solved 75% of tasks, matching the performance of six-year-old children, while older models solved none.[14] Strachan et al. (2024) compared GPT and LLaMA models against 1,907 human participants on a broad battery of theory-of-mind tests and found that GPT-4 performed at or above human levels on false beliefs, indirect requests, and misdirection, though it struggled with detecting faux pas.[15] Street et al. (2025) tested LLMs on higher-order theory-of-mind tasks involving recursive mental state reasoning (e.g., "I think that you believe that she knows") and found that GPT-4 reached adult-level performance overall, exceeding adult performance on sixth-order inferences.[16]
While classic false-belief tasks thus appear to be largely solved by frontier LLMs, debate has shifted to whether this reflects genuine social reasoning or exploitation of surface-level textual patterns. Early work by Ullman (2023) showed that GPT-3.5 failed on trivial alterations to false-belief tasks that humans handle flexibly, though later models have proven more robust to such perturbations.[17][16] Gu et al. (2024) introduced the SimpleToM benchmark, which distinguishes between explicit ToM (identifying what a character knows) and applied ToM (predicting their behavior or judging its rationality). Frontier models scored near-perfectly on explicit ToM but performed far worse on applied ToM without prompting interventions: GPT-4o's behavior prediction accuracy was 49.5% without assistance, rising to 93.5% with chain-of-thought prompting and mental-state reminders.[18] A 2025 commentary responding to Kosinski argued that passing isolated false-belief tasks is insufficient evidence of theory of mind, and that simpler explanations such as associative learning from training data cannot yet be ruled out.[19] A comprehensive survey presented at ACL in 2025 noted that the field has moved beyond simple Sally–Anne-style tasks toward benchmarks covering intentions, desires, emotions, and non-literal communication, and that debate continues over whether LLMs' ToM abilities are genuine or "often superficial and unstable."[20]