AI chatbots like ChatGPT are designed to replicate human speech as closely as possible to enhance the user experience.
But as AI becomes increasingly developed, which it is, it’s also becoming increasingly difficult to distinguish these models from real people, whether be it in terms of language or art. And now, science has proven it.
Scientists at the University of California, San Diego (UCSD), United States, reveal in a study that two of the most widely used chatbots have already reached an important milestone: both GPT, which powers OpenAI’s ChatGPT, and LLaMa, which is behind Meta AI on WhatsApp and Facebook, have passed the famous Turing test.
Devised by British World War II mathematician and codebreaker Alan Turing in 1950, the Turing Test, or ‘imitation game,’ is a standard test designed to test a machine’s intelligence: an AI passes the test when a human cannot correctly differentiate between a response from another human and a response from the AI.
“The results constitute the first empirical evidence that any artificial system passes a standard three-part Turing test,” the UCSD scientists say. “If interrogators cannot reliably distinguish between a human and a machine, the machine is said to have passed the test,” they add.
This was the Turing Test Experiment
The researchers used four AI models: GPT-4.5 (released in February), an earlier iteration called GPT-4o, Meta’s flagship model, LLaMa, and a 1960s chat program called ELIZA, reportedly ‘the first AI program’.
The first three are ‘large language models’ (LLMs), deep learning algorithms that can recognise and generate text based on knowledge gleaned from massive data sets.
The researchers recruited 126 undergraduate students from the University of California, San Diego, and 158 people from the online database Prolific. Participants engaged in five-minute online conversations simultaneously with another human participant and of the AIs, but they didn’t know which was which and had to judge which one they though was human.
When asked to adopt a human persona, GPT-4.5 was judged to be the human 73% of the time, i.e., more often than the actual human participant.
New preprint: we evaluated LLMs in a 3-party Turing test (participants speak to a human & AI simultaneously and decide which is which).
GPT-4.5 (when prompted to adopt a humanlike persona) was judged to be the human 73% of the time, suggesting it passes the Turing test () pic.twitter.com/GBEtoFJHVY
— Cameron Jones (@camrobjones) April 1, 2025
Such a high percentage suggests that GPT-4.5 was more convincing than the human participants, as people were more likely to mistake it for a human than to mistake the actual humans for a machine. Meanwhile, Meta’s LLaMa-3.1, when also asked to adopt a human personality, was considered human 56% of the time.
This “was neither significantly more nor less common than the humans they were compared with,” the team notes, but it still counts as a pass mark.
Finally, the reference models (ELIZA and GPT-4o) achieved success rates significantly lower than chance: 23% and 21% respectively.
The researchers also tried giving the models more basic instruction, without the detailed instructions telling them to adopt a human-like personality.
As expected, the AI models performed significantly worse in this condition, exposing the importance of activating chatbots first.
The team says their new study, published as a preprint, is “strong evidence” that OpenAI and Meta’s bots passed the Turing test.
“This should be evaluated as one of many other pieces of evidence for the type of intelligence LLMs display,” lead author Cameron Jones said in an X thread.
Jones admitted that the AIs performed better when instructed in advance to mimic a human, but this doesn’t mean GPT-4.5 and LLaMa failed the Turing test.