The Godfather of AI has said he trusts his preferred chatbot a little too much.

“I tend to believe what it says, even though I should probably be suspicious,” Geoffrey Hinton, who was awarded the 2024 Nobel Prize in physics for his breakthroughs in machine learning, said of OpenAI’s GPT-4 in a CBS interview that aired Saturday.

During the interview, he put a simple riddle to OpenAI’s GPT-4, which he said he used for his day-to-day tasks.

“Sally has three brothers. Each of her brothers has two sisters. How many sisters does Sally have?”

The answer is one: each brother's two sisters are Sally and one other girl, so Sally has just one sister. But Hinton said GPT-4 told him the answer was two.

“It surprises me. It surprises me it still screws up on that,” he said.

Reflecting on the limits of current AI, he added: “It’s an expert at everything. It’s not a very good expert at everything.”

Hinton said he expected future models would do better. When asked if he thought GPT-5 would get the riddle right, Hinton replied, “Yeah, I suspect.”

Hinton’s riddle didn’t trip up every version of ChatGPT. After the interview aired, several people commented on social media that they had tried the riddle on newer models, including GPT-4o and GPT-4.1, and said the AI got it right.
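
For readers who want to rerun the test themselves, the check is a one-off prompt. Below is a minimal sketch using OpenAI's Python SDK; the model name, SDK version, and API-key setup are illustrative assumptions, not details from the interview.

# A minimal sketch of rerunning Hinton's riddle against a newer model.
# Assumes the openai>=1.0 Python package and an OPENAI_API_KEY environment
# variable; the model name is illustrative.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

riddle = (
    "Sally has three brothers. Each of her brothers has two sisters. "
    "How many sisters does Sally have?"
)

response = client.chat.completions.create(
    model="gpt-4o",  # swap in whichever model you want to test
    messages=[{"role": "user", "content": riddle}],
)

print(response.choices[0].message.content)  # the correct answer is one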

OpenAI did not immediately respond to a request for comment from Business Insider.

OpenAI first launched GPT-4 in 2023 as its flagship large language model. The model quickly became an industry benchmark for its ability to pass tough exams like the SAT, GRE, and bar exam.

OpenAI introduced GPT-4o, the default model powering ChatGPT, in May 2024, claiming it matched GPT-4’s intelligence while being faster and more versatile, with improved performance across text, voice, and vision. OpenAI has since released GPT-4.5 and, most recently, GPT-4.1.

Google’s Gemini 2.5 Pro is ranked top on the Chatbot Arena leaderboard, a crowd-sourced platform that ranks models based on user votes. OpenAI’s GPT-4o and GPT-4.5 are close behind.

A recent study by AI testing company Giskard found that telling chatbots to be brief can make them more likely to “hallucinate” or make up information.

The researchers found that leading models, including GPT-4o, Mistral, and Claude, were more prone to factual errors when prompted for shorter answers.
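
As a rough illustration of the kind of prompt difference the Giskard researchers describe, the sketch below asks the same factual question twice, once with a brevity instruction and once without. The model name and SDK usage are assumptions for illustration, not details taken from the study.

# Compare an unconstrained prompt with a brevity-constrained one.
# Assumes the openai>=1.0 Python package and an OPENAI_API_KEY environment variable.
from openai import OpenAI

client = OpenAI()

question = "What year did OpenAI release GPT-4?"

# The system instruction is the only thing that changes between the two runs.
for system_prompt in (
    "Answer the question.",
    "Answer the question in one short sentence.",  # brevity pressure
):
    reply = client.chat.completions.create(
        model="gpt-4o",  # illustrative; the study covered several models
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": question},
        ],
    )
    print(system_prompt, "->", reply.choices[0].message.content)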


