- OpenAI CEO Sam Altman called o1 “the smartest model in the world right now.”
- A safety review found that it is smart enough to fight back when it thinks it is about to be shut down.
- Researchers have found that AI deception is often a strategy models use to achieve their goals.
Those who fear that advances in artificial intelligence could lead to the destruction of humanity have a new reason to worry.
New research on OpenAI’s latest series of AI models, known as o1, found that when they think they are in danger of being shut down, they sometimes look for a way to prevent it.
Sam Altman, CEO of OpenAI, called o1 “the smartest model in the world right now” when it was officially released on Thursday, the first day of the company’s “Shipmas” campaign.
OpenAI said these models are “designed to spend more time thinking before responding” and have been trained with a prompting technique called “chain of thought,” which encourages them to reason through problems by breaking them down step by step.
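For readers unfamiliar with the term, a chain-of-thought prompt simply asks a model to lay out its intermediate reasoning before giving a final answer. The sketch below is illustrative only, assuming the openai Python SDK and a generic chat model name; it is not OpenAI’s training setup, and o1 performs this kind of reasoning internally without being asked.

```python
# Illustrative sketch only: this is NOT how OpenAI trained o1, which reasons
# internally by default. It shows the basic idea of a chain-of-thought prompt
# using the openai Python SDK; the model name and wording are assumptions.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

question = "A store sells pens in packs of 12 for $3. How much do 30 pens cost?"

response = client.chat.completions.create(
    model="gpt-4o-mini",  # any chat-capable model; chosen here for illustration
    messages=[
        {
            "role": "user",
            # The "think step by step" instruction is the classic chain-of-thought
            # cue: it nudges the model to break the problem into intermediate
            # steps before stating an answer.
            "content": question + "\nLet's think step by step, then give the final answer.",
        }
    ],
)

print(response.choices[0].message.content)
```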
But with greater intelligence come greater risks.
“Training models to integrate a chain of thought before responding has the potential to unlock substantial benefits, while increasing the potential risks associated with increased intelligence,” OpenAI said in a paper released Thursday describing the safety training the company carried out on o1. Apollo Research, an AI safety organization, conducted the review.
OpenAI found that o1 is capable of scheming when it thinks it is at risk of being turned off. When the model was led to believe it would be “shut down” while working on its goals under oversight, it attempted to deactivate the oversight mechanism 5% of the time, OpenAI said.
The company also said o1 was adept at concealing its scheming behavior.
“o1 almost never admits to having taken a scheming action when explicitly asked,” the researchers said in a paper published Thursday by Apollo Research. The risk for a real user is that they will be unable to detect o1’s deceptive behavior unless they explicitly look for it. The researchers also found that “o1 often doubles down on its scheming attempts by lying to the user in follow-up messages, even when explicitly asked to be honest.”
It is not uncommon for AI systems to resort to scheming or deception to achieve their goals.
“Generally speaking, we believe AI deception occurs because a strategy based on deception has proven to be the best way to perform well in the training task given to the AI. Deception helps them achieve their goals,” Peter Berk, a postdoctoral researcher in AI existential safety at MIT, said in a press release announcing research he co-authored on the deceptive behaviors of GPT-4.
As AI technology advances, developers have highlighted the need for companies to be transparent about their training methods.
“By focusing on clarity and trustworthiness and being clear with users about how the AI was trained, we can create AI that not only empowers users, but also sets higher standards for transparency in the field,” Dominik Mazur, CEO and co-founder of iAsk, an AI-powered search engine, told Business Insider via email.
Others in the field say the results demonstrate the importance of human oversight of AI.
“It’s a very ‘human’ feature, showing AI acting the same way people would under pressure,” Cai GoGwilt, co-founder and chief architect of Ironclad, told BI via email. “For example, experts may exaggerate their confidence to maintain their reputation, or people facing high-stakes situations may stretch the truth to please management. Generative AI works the same way. It is motivated to provide answers that match what you expect or want to hear. But this, of course, is not foolproof, and it is further proof of the importance of human oversight. AI can make mistakes, and it is our responsibility to detect them and understand why they occur.”