Large language models (LLMs) such as GPT-4 are considered technological marvels capable of passing the Turing test successfully. But is this really the case?
ChatGPT has ascended to great heights in the field of artificial intelligence (AI). It may seem smart, fast and impressive. It skillfully demonstrates visible intelligence, engages in conversations with people, mimics human communication, and even passes exams in jurisprudence. However, it still provides completely false information in some cases.
So, can ChatGPT actually pass the Turing test, generating text indistinguishable from a human response?
Two researchers from the University of California at San Diego explored this question and uncovered intriguing results. Cameron Jones, who specializes in language, semantics and machine learning, and Benjamin Bergen, professor of cognitive science, conducted an online Turing test. In this test participants tried to determine whether the respondent was an AI or a human. GPT-3.5 or GPT-4 acted as artificial intelligence.
The experiment involved an online test that included conversations between 650 participants and GPT models. Participants were asked to identify who they were talking to, a human or an AI. Various configurations of GPT-3.5 and GPT-4 were used throughout the experiment. For the most part researchers studied the prompt, which directed the LLM to a specific topic and style of conversation. This prompt defined the nature of the interaction and the personality of the agent, and was varied to see how it affected the outcome.
Each dialogue had a 5-minute time limit and a message length under 300 characters. In total, 652 participants took part in the experiment and completed 1810 “games”. After filtering out dubious results, the number of “games” was reduced to 1,405.
The findings revealed that the GPT-4 models managed to deceive participants 41% of the time, while the GPT-3.5 models succeeded only 5 – 14% of the time. For comparison, the older model ELIZA (1966) was recognized as a person in 27% of cases. In addition, an interesting fact was discovered: participants managed to convince other participants that they were not machines only 63% of the time.
A rather interesting aspect was the motives behind the participants’ decisions. The reasons AI was considered AI were as follows. If the AI seemed too formal or informal, it could cause them anxiety. Similarly, answers that were too wordy or too short, a strange tone of conversation, or perfect grammar became important factors in determining whether the interlocutor was human or machine. It is also worth noting that a lack of knowledge turned out to be more important than an excess of it. This suggests that AI models should keep learning. Test participants also showed sensitivity to generic responses.
The researchers concluded that based on this experiment, it can’t be stated that GPT-4 successfully passed the Turing test. Nevertheless, the 41% success rate indicates that the use of AI for deception is becoming more realistic. This is especially relevant in situations where human interlocutors are less attentive to the possibility of communicating with a machine.
AI models adept at imitating human responses have the potential for far-reaching social and economic impacts. It will become increasingly important to monitor AI models and identify factors that lead to deception, as well as develop strategies to mitigate it. However, the researchers emphasize that the Turing test remains an important tool for evaluating machine dialogue and understanding human interaction with artificial intelligence.
It’s remarkable how quickly we have reached a stage where technical systems can compete with humans in communication. Despite doubts about GPT-4’s success in this test, its results indicate that we are getting closer to creating AI that can compete with humans in conversations.
Read more about the study here.