Last month, an AI bot that handles technical support for Cursor, an up-and-coming tool for computer programmers, alerted several customers to a change in company policy. It told them they were no longer allowed to use Cursor on more than one computer.
In angry posts on internet message boards, customers complained. Some canceled their Cursor accounts. And some grew even angrier when they realized what had happened: the AI bot had announced a policy change that did not exist.
“We have no such policy. You're of course free to use Cursor on multiple machines,” Michael Truell, the company's chief executive and co-founder, wrote in a Reddit post. “Unfortunately, this is an incorrect response from a front-line AI support bot.”
More than two years after the arrival of ChatGPT, tech companies, office workers and everyday consumers are using AI bots for an increasingly wide range of tasks. But there is still no way of ensuring that these systems produce accurate information.
The newest and most powerful technologies, so-called reasoning systems from companies like OpenAI, Google and the Chinese startup DeepSeek, are generating more errors, not fewer. Their math skills have notably improved, but their handle on facts has gotten shakier. It is not entirely clear why.
Today's AI bots are based on complex mathematical systems that learn their skills by analyzing huge amounts of digital data. They do not, and cannot, decide what is true and what is false. Sometimes they simply make things up, a phenomenon some AI researchers call hallucinations. In one test, the hallucination rate of the newest AI systems was as high as 79%.
These systems use mathematical probabilities to guess at the best response, rather than a strict set of rules defined by human engineers, so they make a certain number of mistakes. “Despite our best efforts, they will always hallucinate,” said Amr Awadallah, the chief executive of Vectara, a startup that builds AI tools for businesses, and a former Google executive. “That will never go away.”
For several years, this phenomenon has raised concerns about the reliability of these systems. They are useful in some situations, such as writing term papers, summarizing office documents or generating computer code, but their mistakes can cause problems.
AI bots tied to search engines like Google and Bing can produce search results that are laughably wrong. Ask them for a good marathon on the West Coast, and they might suggest a race in Philadelphia. Ask them how many households there are in Illinois, and they might cite a source that does not include that information.
Those hallucinations may not be a big problem for many people, but they are a serious issue for anyone using the technology with court documents, medical information or sensitive business data.
“You spend a lot of time trying to figure out which responses are factual and which aren't,” said Pratik Verma, co-founder and chief executive of Okahu, a company that helps businesses navigate the hallucination problem. “Not dealing with these errors properly basically eliminates the value of AI systems, which are supposed to automate tasks for you.”
Cursor and Truell did not respond to requests for comment.
For more than two years, companies like OpenAI and Google steadily improved their AI systems and reduced the frequency of these errors. But with the new reasoning systems, errors are rising. The latest OpenAI systems hallucinate at a higher rate than the company's previous systems, according to its own tests.
The company found that o3, its most powerful system, hallucinated 33% of the time when running its PersonQA benchmark test, which involves answering questions about public figures. That is more than double the hallucination rate of OpenAI's previous reasoning system, called o1. The new o4-mini hallucinated at an even higher rate: 48%.
When running another test called SimpleQA, which asks more general questions, the hallucination rates for o3 and o4-mini were 51% and 79%. The previous system, o1, hallucinated 44% of the time.
In a paper detailing the tests, OpenAI said more research was needed to understand the cause of these results. Because AI systems learn from more data than people can wrap their heads around, technologists struggle to determine why they behave the way they do.
“Hallucinations are not inherently more prevalent in reasoning models, though we are actively working to reduce the higher rates of hallucination we saw in o3 and o4-mini,” said Gaby Raila, an OpenAI spokeswoman. “We'll continue our research on hallucinations across all models to improve accuracy and reliability.”
Hannaneh Hajishirzi, a professor at the University of Washington and a researcher at the Allen Institute for Artificial Intelligence, is part of a team that recently devised a way to trace a system's behavior back to the individual pieces of data it was trained on. But because the systems learn from so much data, and because they can generate almost anything, this new tool cannot explain everything. “We still don't know how these models work exactly,” she said.
Tests by independent companies and researchers indicate that hallucination rates are also rising for reasoning models from companies such as Google and DeepSeek.
Since late 2023, Awadallah's company, Vectara, has tracked how often chatbots veer from the truth. The company asks these systems to perform a straightforward, easily verified task: summarize specific news articles. Even then, chatbots persistently invent information.
Vectara's original research estimated that in this situation chatbots made up information at least 3% of the time, and sometimes as much as 27%.
In the year and a half since, companies like OpenAI and Google have pushed those numbers down into the 1% to 2% range. Others, such as the San Francisco startup Anthropic, hovered around 4%. But hallucination rates on this test are rising with reasoning systems. DeepSeek's reasoning system, R1, hallucinated 14.3% of the time. OpenAI's o3 climbed to 6.8%.
(The New York Times has sued OpenAI and its partner Microsoft, accusing them of copyright infringement involving news content related to AI systems. OpenAI and Microsoft have denied those claims.)
For years, companies like OpenAI relied on a simple notion: the more internet data they fed into their AI systems, the better those systems would perform. But they used up just about all of the English text on the internet, which meant they needed a new way of improving their chatbots.
So these companies are leaning more heavily on a technique that scientists call reinforcement learning. With this process, a system learns behavior through trial and error. It works well in certain areas, such as mathematics and computer programming. But it falls short in others.
“The way these systems are trained, they will start focusing on one task and start forgetting about others,” said Laura Perez-Beltrachini, a researcher at the University of Edinburgh who is part of a team closely examining the hallucination problem.
Another issue is that reasoning models are designed to spend time “thinking” through complex problems before settling on an answer. As they try to tackle a problem step by step, they run the risk of hallucinating at each step. The errors can compound as they spend more time thinking.
The latest bots reveal each step to users, which means users may see each error, too. Researchers have also found that the steps a bot displays are often unrelated to the answer it eventually delivers.
“What the system says it is thinking is not necessarily what it is thinking,” said Aryo Pradipta Gema, an AI researcher at the University of Edinburgh and a fellow at Anthropic.