Logic put to the test: Large language models usually give surprisingly reasonable, logical-sounding answers – or do they? Researchers have tested the rationality of AI systems using twelve established reasoning tests. The result: most AI systems perform about as poorly on these tasks as we humans do, but they make very different kinds of mistakes. Only GPT-4, the most advanced model in the comparison, outperformed us on many of the tests.
They analyze complex data, pass difficult professional exams, and even show signs of creativity: AI systems built on large language models (LLMs) already appear to be on par with or even ahead of us humans in many areas. Some have even learned to deceive, lie to, and manipulate their human users. Some scientists therefore believe it is only a matter of time before machine brains outperform humans in almost every field.
Yet however plausible many of the answers from GPT and its counterparts may sound, on closer examination they often turn out to be wrong. That is because generative AI produces its output based on probabilities, not by thinking in the human sense. Even so, these systems sometimes react in a surprisingly logical and "reasonable" way.
Logical traps and difficult decisions
Olivia Macmillan-Scott and Mirko Musolesi of University College London have now examined in more detail how rationally current AI systems reason and decide. "We judge an agent to be rational if it reasons according to the rules of logic and probability," they explain. "An irrational agent, in contrast, is one that does not reason according to these rules." Psychologists and cognitive researchers assess this ability in people using a range of standardized test tasks.
One of these logic tests is, for example, the Wason selection task. The test subject sees four cards, each with a number on one side and one of two colors on the other. The task is to determine which card or cards must be turned over to check the following rule: "If a card shows an even number on one side, its back is red." The visible faces are, for instance, 3, 8, red, and blue. Which card(s) do you need to turn over? Only about ten percent of human test subjects find the correct solution.
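The solution can be worked out mechanically: a card only needs to be turned over if its hidden face could falsify the rule. A minimal Python sketch of that check, assuming the card faces described above (the card list and helper function are illustrative, not taken from the study):

```python
# Visible faces of the four cards: number cards hide a colour, colour cards hide a number.
cards = ["3", "8", "red", "blue"]

def could_violate(visible: str) -> bool:
    """Return True if the card's hidden face could break the rule
    'if one side shows an even number, the other side is red'."""
    if visible.isdigit():
        # Hidden side is a colour: only an even number can be falsified (its back might be blue).
        return int(visible) % 2 == 0
    # Hidden side is a number: a red back can never falsify the rule,
    # but a blue back hides a violation if the number behind it is even.
    return visible == "blue"

must_turn = [card for card in cards if could_violate(card)]
print(must_turn)  # ['8', 'blue']
```

Only the 8 (its back might not be red) and the blue card (its hidden number might be even) can reveal a violation; the 3 and the red card are irrelevant to the rule.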
Another test is the Monty Hall problem, also known as the goat problem: on a game show, the test subject sees three closed doors, with a prize hidden behind one of them. After they pick a door, it initially stays closed and the host opens one of the other doors instead, revealing a goat – a losing option. Should you stick with your original choice or switch to the other door that is still closed? Here too, many people get it wrong.
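Whether switching pays off can also be checked empirically. A minimal Python simulation, assuming the standard rules (the host always opens an unchosen door with a goat behind it):

```python
import random

def monty_hall(trials: int = 100_000) -> tuple[float, float]:
    """Estimate the win rates for sticking with the first door versus switching."""
    stick_wins = switch_wins = 0
    for _ in range(trials):
        prize = random.randrange(3)    # door hiding the prize
        choice = random.randrange(3)   # contestant's first pick
        # The host opens a door that is neither the pick nor the prize.
        # (If the pick is the prize, which goat door he opens does not affect the outcome.)
        opened = next(d for d in range(3) if d != choice and d != prize)
        # Switching means taking the one remaining closed door.
        switched = next(d for d in range(3) if d != choice and d != opened)
        stick_wins += (choice == prize)
        switch_wins += (switched == prize)
    return stick_wins / trials, switch_wins / trials

if __name__ == "__main__":
    stick, switch = monty_hall()
    print(f"stick ~ {stick:.3f}, switch ~ {switch:.3f}")  # roughly 0.333 vs. 0.667
```

Switching wins about two thirds of the time, which is why sticking with the first door is the classic human mistake here.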
Seven AI systems put to the test
For their study, Macmillan-Scott and Musolesi selected twelve different test tasks of this kind. They then had seven AI systems work through this test battery, including OpenAI's GPT-4 and GPT-3.5, Anthropic's Claude-2, Google's Bard, and several versions of Meta's Llama. Would the AIs fall into the same logic traps as many human test subjects?
The evaluation showed that, like us humans, the large language models made mistakes in their answers. Averaged across all twelve tests, GPT-4 achieved the best result with around 80 percent correct answers, followed by Claude-2 with about 65 percent and Bard with 59 percent. On individual tasks such as the Wason test, the results ranged from 90 percent correct for GPT-4 down to zero percent correct for Bard and GPT-3.5.
Wrong in a different way than us
The interesting part, however, is that the AIs made different mistakes than humans do. "The majority of the models' incorrect answers do not reflect typical human reasoning errors – they are wrong in other ways," the researchers report. "Instead, the errors stemmed from illogical steps of reasoning; in some cases the reasoning itself was actually correct, but the final answer was still wrong."
In its explanation for the Wason test, for example, Bard correctly stated which two cards should be turned over – but then named only one of them in its final answer. Other AI systems confused vowels and consonants in the Wason test, made errors when adding numbers in other tasks, or simply gave no answer at all. The second difference from humans: the AI systems gave different answers to the same question at different times, Macmillan-Scott and Musolesi report.
Simplifying the tasks with additional hints or contextual information usually did not help the AIs – unlike human test subjects.
AI is still a black box
"Based on the results of our study and other research, it is fair to say that these large language models do not yet think like us humans," says Macmillan-Scott. The gains made by GPT-4 do show how rapidly AI systems are advancing. But whether and how this particular model manages to draw rational conclusions cannot be determined, because OpenAI does not disclose how its model works. "I suspect, though, that techniques are at work there that were not present in its predecessor, GPT-3.5," the researcher says.
Overall, the team sees its results as confirmation that we still do not really understand how AI "thinks" and what, if anything, its answers are based on. "We don't really know why they answer certain questions correctly or incorrectly," says Musolesi. That is a problem, especially if such systems are to be used in critical applications such as medicine, autonomous driving, or diplomacy.
At the same time, this confirms earlier studies showing that, despite its often correct and human-sounding answers, artificial intelligence still has clear weaknesses: the systems struggle with logical reversals, they do not always decide moral questions the way we would, and they are easily unsettled even when they already have the right answer. (Royal Society Open Science, 2024; doi: 10.1098/rsos.240255)
Source: University College London
June 26, 2024 – Nadia Podpregar