If you're looking for a new reason to be worried about artificial intelligence, try this. Some of the smartest humans in the world struggle to create tests that AI systems cannot pass.
For many years, AI systems have been evaluated by subjecting new models to a variety of standardized benchmark tests. Many of these tests consisted of difficult SAT-level questions in areas such as math, science, and logic. Comparing model scores over time served as a rough measure of AI progress.
But AI systems got so good at those tests that new, more difficult ones were created, featuring the kinds of questions graduate students might encounter on their exams.
Those tests aren't holding up well, either. New models from companies like OpenAI, Google, and Anthropic are scoring high on many PhD-level challenges, which limits those tests' usefulness and raises an unsettling question: Are AI systems getting too smart for us to measure?
This week, researchers at the Center for AI Safety and Scale AI presented a possible answer to that question: a new evaluation called “Humanity's Last Exam,” which they claim is the most difficult test ever administered to AI systems.
“Humanity's Last Exam” is the brainchild of Dan Hendrycks, a well-known AI safety researcher and director of the Center for AI Safety. (The test's original name, “Humanity's Last Stand,” was rejected as too dramatic.)
In collaboration with Scale AI, an AI company he advises, Hendrycks developed the test, which consists of approximately 3,000 multiple-choice and short-answer questions designed to probe the capabilities of AI systems in areas ranging from analytical philosophy to rocketry.
The questions were submitted by experts in these fields, including university professors and prize-winning mathematicians, who were asked to come up with extremely difficult questions to which they knew the answers.
Here's a sample test question about hummingbird anatomy:
Hummingbirds within Apodiformes uniquely have a bilaterally paired oval bone, a sesamoid embedded in the caudolateral portion of the expanded, cruciate aponeurosis of insertion of m. depressor caudae. How many paired tendons are supported by this sesamoid bone? Answer with a number.
Or, if physics is more your speed, try this one:
A block is placed on a horizontal rail, along which it can slide without friction. It is attached to the end of a rigid, massless rod of length R. A mass is attached to the other end. Both objects have weight W. The system is initially at rest, with the mass directly above the block. The mass is given a small push, parallel to the rail. Assume the system is designed so that the rod can rotate through a full 360 degrees without interruption. When the rod is horizontal, it carries a tension T1. When the rod is vertical again, with the mass directly below the block, it carries a tension T2. (Both of these quantities can be negative, which would indicate that the rod is in compression.) What is the value of (T1−T2)/W?
(I would print the answers here, but that would spoil the test for any AI systems being trained on this column. Also, I'm far too stupid to verify the answers myself.)
Questions submitted to Humanity's Last Exam went through a two-step filtering process. First, each submitted question was given to leading AI models to solve.
If the models couldn't answer it (or if, in the case of multiple-choice questions, they did worse than random guessing), the question was passed to a set of human reviewers, who refined the wording and verified the correct answer. Experts who wrote the highest-rated questions were paid between $500 and $5,000 per question and also received credit for contributing to the exam.
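To make that screening rule concrete, here is a minimal sketch in Python of how such a filter could work, assuming a simple question format with an answer key. The helper names, the question fields, and the three stand-in models are hypothetical illustrations, not the researchers' actual pipeline.

def chance_accuracy(question: dict) -> float:
    """Expected accuracy of blind guessing on a multiple-choice question."""
    choices = question.get("choices")
    return 1.0 / len(choices) if choices else 0.0


def models_are_stumped(question: dict, model_answers: list[str]) -> bool:
    """Decide whether a question defeats the models badly enough to be
    forwarded to human reviewers (the second step of the filter)."""
    correct = question["answer"]
    accuracy = sum(answer == correct for answer in model_answers) / len(model_answers)
    if question.get("choices"):
        # Multiple choice: forward only if the models do worse than random guessing.
        return accuracy < chance_accuracy(question)
    # Short answer: forward only if no model produced the correct answer.
    return accuracy == 0.0


def screen_submissions(submissions: list[dict], ask_model) -> list[dict]:
    """Step one of the two-step filter: keep only the questions the leading
    models can't solve, for later human review and answer verification."""
    hypothetical_models = ["model_a", "model_b", "model_c"]
    return [
        question
        for question in submissions
        if models_are_stumped(
            question, [ask_model(m, question) for m in hypothetical_models]
        )
    ]


if __name__ == "__main__":
    # Dummy stand-in for querying a frontier model; it always guesses "A".
    def always_a(model_name, question):
        return "A"

    sample = [{"text": "...", "choices": ["A", "B", "C", "D"], "answer": "C"}]
    print(screen_submissions(sample, always_a))  # the sample question survives screening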
Kevin Zhou, a postdoctoral researcher in particle theory at the University of California, Berkeley, submitted several questions to the test. Three of his questions were selected, all of which he told me were “on the high end of the spectrum of questions asked on postgraduate exams.”
Hendrycks, who helped create a widely used AI benchmark known as Massive Multitask Language Understanding (MMLU), said he was inspired to build a more difficult test by a conversation with Elon Musk. (Hendrycks is also a safety advisor to Musk's AI company, xAI.) Musk, he said, had raised concerns that the existing tests given to AI models were too easy.
“Elon looked at the MMLU questions and said, ‘These are undergraduate level. I want questions that a world-class expert could do,’” Hendrycks said.
There are other tests that attempt to measure advanced AI capabilities in specific domains, such as FrontierMath, a test developed by Epoch AI, and ARC-AGI, a test developed by the AI researcher François Chollet.
But Humanity's Last Exam aims to determine how well AI systems can answer complex questions across a wide variety of academic disciplines, giving us something like a general intelligence score.
“We're estimating the extent to which AI can automate a lot of very difficult intellectual labor,” Hendrycks said.
Once the list of questions had been compiled, the researchers gave Humanity's Last Exam to six leading AI models, including Google's Gemini 1.5 Pro and Anthropic's Claude 3.5 Sonnet. They all failed miserably; OpenAI's o1 system received the highest score, at 8.3 percent.
(The New York Times sued OpenAI and its partner Microsoft for copyright infringement of news content related to AI systems. OpenAI and Microsoft have denied these claims.)
Hendrycks said he expects those scores to rise quickly, possibly surpassing 50 percent by the end of the year. At that point, he said, AI systems might be considered “world-class oracles,” capable of answering questions on any topic more accurately than human experts. And we may need to look for other ways to measure AI's impact, such as examining economic data or judging whether it can make novel discoveries in areas like mathematics and science.
“Imagine a better version of this, where you could ask questions we don't yet know the answers to and test whether the model can help solve them,” said Summer Yue, Scale AI's director of research and an organizer of the exam.
What's so confusing about recent advances in AI is their jaggedness. We have AI models that can diagnose diseases more effectively than human doctors, have won silver medals at the International Mathematical Olympiad, and have defeated top human programmers in competitive coding challenges.
But those same models sometimes struggle with basic tasks, such as arithmetic or writing metered poetry. That has given AI a reputation for being astoundingly good at some things and completely useless at others, and it has created very different impressions of how fast AI is progressing, depending on whether you're looking at its best or its worst outputs.
This jaggedness makes these models difficult to measure. I wrote last year that we need better evaluations for AI systems, and I still believe that. But I also believe we need more creative ways to track AI progress that don't rely on standardized tests, because most of the things humans do, and most of the things we fear AI can do better than humans, can't be captured on a written exam.
Zhou, the particle theorist who submitted questions to Humanity's Last Exam, said that while AI models were often able to answer complex questions impressively, he did not consider them a threat to him or his colleagues, because their work involves much more than spitting out correct answers.
“There's a big gap between what it means to take an exam and what it means to be a working physicist or researcher,” he said. “Even an AI that can answer these questions might not be ready to help with research, which is inherently less structured.”