In 1977, as a researcher at the University of Massachusetts, Amherst, Andrew Barto began exploring a new theory that neurons behave like hedonists. The basic idea was that the human brain is driven by billions of nerve cells, each trying to maximize pleasure and minimize pain.
A year later he was joined by another young researcher, Richard Sutton. Together, they worked to explain human intelligence using this simple concept and applied it to artificial intelligence. The result, a way for AI systems to learn from digital equivalents of pleasure and pain, was "reinforcement learning."
On Wednesday, the Association for Computing Machinery, the world's largest society of computing professionals, announced that Dr. Barto and Dr. Sutton had received this year's Turing Award for their work on reinforcement learning. The Turing Award, introduced in 1966, is often called the Nobel Prize of computing. The two scientists will share the $1 million prize that comes with the award.
Over the past decade, reinforcement learning has played a key role in the rise of artificial intelligence, including groundbreaking technologies such as Google's AlphaGo and OpenAI's ChatGPT. The techniques that powered these systems were rooted in the research of Dr. Barto and Dr. Sutton.
"They are the undisputed pioneers of reinforcement learning," said Oren Etzioni, a professor emeritus of computer science at the University of Washington and founding chief executive of the Allen Institute for Artificial Intelligence. "They generated the key ideas, and they wrote the book on the subject."
Their book, "Reinforcement Learning: An Introduction," was published in 1998 but remains the definitive exploration of an idea whose potential, many experts say, is only beginning to be realized.
Psychologists have long studied the ways humans and animals learn from their own experiences. In the 1940s, Alan Turing, a pioneering British computer scientist, suggested that machines could learn in almost the same way.
However, it was while building on a theory proposed by A. Harry Klopf, a computer scientist working for the government, that Dr. Barto and Dr. Sutton began exploring the mathematics of how this might work. Dr. Barto built a lab at UMass Amherst dedicated to the idea, while Dr. Sutton founded a similar kind of lab at the University of Alberta in Canada.
"When you're talking about humans and animals, it's an obvious kind of idea," said Dr. Sutton, a research scientist at Keen Technologies, an AI startup, and a fellow at the Alberta Machine Intelligence Institute, one of Canada's three national AI labs. "When we resurrected it, it was about machines."
This remained a largely academic pursuit until the arrival of AlphaGo in 2016. Most experts believed another decade would pass before anyone built an AI system that could beat the world's best players at the game of Go.
However, during a match in Seoul, South Korea, AlphaGo beat Lee Sedol, the best Go player of the past decade. The trick was that the system had learned by trial and error, playing millions of games against itself. It learned which moves brought success (pleasure) and which brought failure (pain).
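The core of this trial-and-error idea can be illustrated with a toy sketch (this is not AlphaGo's actual algorithm, which combines deep neural networks with tree search; the moves, probabilities, and learning rate below are invented for illustration). A learner repeatedly tries moves, receives a reward of +1 for success or -1 for failure, and nudges its estimate of each move's value toward what it experienced:

```python
import random

random.seed(0)

MOVES = ["a", "b", "c"]
# Hypothetical hidden win probabilities; the learner never sees these.
TRUE_WIN_PROB = {"a": 0.2, "b": 0.5, "c": 0.8}

values = {m: 0.0 for m in MOVES}  # learned estimate of each move's value
ALPHA = 0.1                       # learning rate
EPSILON = 0.2                     # exploration rate

for _ in range(5000):
    # Occasionally explore a random move; otherwise exploit the best estimate.
    if random.random() < EPSILON:
        move = random.choice(MOVES)
    else:
        move = max(values, key=values.get)
    # Reward: +1 (pleasure) on success, -1 (pain) on failure.
    reward = 1.0 if random.random() < TRUE_WIN_PROB[move] else -1.0
    # Nudge the estimate toward the observed reward.
    values[move] += ALPHA * (reward - values[move])

best = max(values, key=values.get)
print(best)  # the learner settles on "c", the move that succeeds most often
```

The same loop of act, observe reward, adjust is what AlphaGo ran at vastly larger scale, with a neural network in place of the lookup table.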
The Google team that built the system was led by David Silver, a researcher who studied reinforcement learning under Dr. Sutton at the University of Alberta.
Many experts still questioned whether reinforcement learning would work outside of games. Games are scored with points, which makes it easy for a machine to distinguish between success and failure.
However, reinforcement learning has also played an important role in online chatbots.
Before the release of ChatGPT in the fall of 2022, OpenAI hired hundreds of people to use an early version of the system and provide precise suggestions that could hone its skills. They showed the chatbot how to respond to particular questions, rated its answers, and corrected its errors. By analyzing those suggestions, ChatGPT learned to be a better chatbot.
Researchers call this "reinforcement learning from human feedback," or RLHF. It is one of the main reasons today's chatbots respond in such a surprisingly lifelike way.
(The New York Times has sued OpenAI and its partner Microsoft over copyright infringement of news content related to AI systems. OpenAI and Microsoft have denied those claims.)
Recently, companies such as OpenAI and the Chinese startup DeepSeek have developed forms of reinforcement learning that allow chatbots to learn from themselves, much as AlphaGo did. For example, by working through various math problems, a chatbot can learn which methods lead to the correct answer and which do not.
Repeating this process across a very large set of problems allows the bot to learn to mimic the way humans reason, at least in some respects. The result is so-called reasoning systems such as OpenAI's o1 and DeepSeek's R1.
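What makes math problems suitable for this is that the answer can be checked automatically, so the reward needs no human judge. A toy sketch (not DeepSeek's or OpenAI's actual training; the two "solution methods" below are invented stand-ins for a model's candidate reasoning strategies):

```python
import random

random.seed(0)

# Two hypothetical strategies for the task "compute x + x".
def method_double(x):
    return x + x      # actually solves the task

def method_square(x):
    return x * x      # flawed: only right for x = 0 and x = 2

methods = {"double": method_double, "square": method_square}
scores = {name: 0.0 for name in methods}

for _ in range(300):
    x = random.randint(0, 10)
    target = 2 * x                       # verifiable ground truth
    name = random.choice(list(methods))  # sample a strategy to try
    answer = methods[name](x)
    # Reward comes from a mechanical check, not a human rater.
    reward = 1.0 if answer == target else 0.0
    scores[name] += 0.05 * (reward - scores[name])

best = max(scores, key=scores.get)
print(best)  # the verifiable reward singles out the correct strategy
```

Because the check is mechanical, the loop can run over millions of problems, which is what lets these systems learn without the armies of human raters that RLHF required.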
Dr. Barto and Dr. Sutton say these systems hint at how machines will learn in the future. Ultimately, they say, robots infused with AI will learn from trial and error in the real world, just as humans and animals do.
"Learning to control a body through reinforcement learning — that's a very natural thing," Dr. Barto said.