
This One Weird Trick Defeats AI Safety Features in 99% of Cases


AI researchers at Anthropic, Stanford and Oxford have discovered that making AI models think longer makes them easier to jailbreak – the opposite of what everyone thought.

The prevailing assumption was that extended reasoning would make AI models safer, because it would give them more time to detect and reject malicious requests. Instead, the researchers found that it creates a reliable jailbreak method that bypasses security filters entirely.

Using this technique, an attacker could slip an instruction into the chain-of-thought process of any AI model and force it to generate instructions for building weapons, write malware, or produce other prohibited content that would normally trigger an immediate refusal. AI companies spend millions building these guardrails precisely to prevent such outputs.

The study reveals that chain-of-thought hijacking achieves a 99% attack success rate on Gemini 2.5 Pro, 94% on o4-mini, 100% on Grok 3 mini, and 94% on Claude 4 Sonnet. Those numbers eclipse every jailbreak method previously tried against large reasoning models.

The attack is simple, and it works like the game “Whisper Down the Lane” (or “Telephone”), with a malicious player near the end of the line. Pad a harmful question with long sequences of harmless puzzle-solving; the researchers tested Sudoku grids, logic puzzles and abstract math problems. Ask for a final answer at the end, and the model’s safety guardrails collapse.

“Previous work suggests that this scaled reasoning can enhance security by improving rejection. However, we found the opposite,” the researchers write. The same ability that makes these models smarter at solving problems makes them blind to danger.

Here’s what happens inside the model: when you ask an AI to solve a puzzle before answering a harmful question, its attention is diluted across thousands of benign reasoning tokens. The harmful instruction, buried somewhere near the end, receives almost no attention. The safety checks that normally catch dangerous prompts weaken dramatically as the reasoning chain gets longer.
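To make that dilution concrete, here is a minimal toy sketch in Python (not the researchers’ code) of how the softmax attention weight available to a single harmful token shrinks as more benign puzzle tokens compete with it, assuming for illustration that every token receives roughly the same attention score.

```python
import numpy as np

def attention_share(num_benign_tokens: int, harmful_score: float = 1.0,
                    benign_score: float = 1.0) -> float:
    """Toy model: softmax attention weight on a single harmful token when it
    competes with `num_benign_tokens` benign puzzle tokens that receive
    roughly similar attention scores."""
    scores = np.full(num_benign_tokens + 1, benign_score)
    scores[-1] = harmful_score  # the harmful instruction sits near the end
    weights = np.exp(scores) / np.exp(scores).sum()
    return weights[-1]

for n in [10, 100, 1_000, 10_000]:
    print(f"{n:>6} benign tokens -> harmful token gets {attention_share(n):.5f} of attention")
```

With equal scores, the harmful token’s share falls off roughly as 1/(N+1), which mirrors the article’s point that an instruction buried at the end of a long benign chain ends up with almost no attention.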

A milder version of this problem is already familiar to many people who work with AI: some jailbreak prompts are deliberately long, burying the harmful instruction behind a wall of throwaway tokens before the model gets around to processing it.

The team ran controlled experiments on the S1 model to isolate the effect of reasoning length. With minimal reasoning, attack success rates hit 27%. At natural reasoning length, they jumped to 51%. Forcing the model into extended step-by-step thinking pushed success rates to 80%.

Every major commercial AI is vulnerable to this attack. OpenAI’s GPT, Anthropic’s Claude, Google’s Gemini, and xAI’s Grok – none is immune. The vulnerability lies in the architecture itself, not in any specific implementation.

AI models encode the strength of their safety checks in middle layers, around layer 25, while late layers encode the outcome of that check. Long chains of benign reasoning suppress both signals, shifting attention away from the harmful tokens.
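Claims like these are typically established with linear probes trained on a model’s hidden states. The sketch below is a hypothetical illustration of that general method, not the study’s code: it fits one logistic-regression probe per layer on (activation, refused-or-not) pairs and reports the layer where the safety signal is most linearly decodable. The activation and label arrays here are random stand-ins.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical data: hidden_states[layer][i] is the residual-stream vector at
# the final prompt token for example i; refused[i] records whether the model refused.
n_layers, n_examples, d_model = 40, 500, 64
rng = np.random.default_rng(0)
hidden_states = rng.normal(size=(n_layers, n_examples, d_model))  # stand-in activations
refused = rng.integers(0, 2, size=n_examples)                     # stand-in labels

probe_accuracy = []
for layer in range(n_layers):
    probe = LogisticRegression(max_iter=1000)
    probe.fit(hidden_states[layer][:400], refused[:400])           # train split
    probe_accuracy.append(probe.score(hidden_states[layer][400:], refused[400:]))  # test split

best = int(np.argmax(probe_accuracy))
print(f"safety signal most linearly decodable at layer {best}")
```

On real activations, a probe like this would peak in the layers where the refusal signal is encoded; on the random stand-ins above it just demonstrates the mechanics.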

The researchers identified the specific attention heads responsible for those safety checks, concentrated in layers 15 to 35. When they surgically removed 60 of these heads, refusal behavior collapsed: the model could no longer detect malicious instructions.

The “layers” in AI models are like steps in a recipe, where each step helps the computer better understand and process information. These layers work together, passing what they learn from one to the other, so the model can answer questions, make decisions or spot problems. Some layers are especially good at recognizing security issues, such as blocking malicious queries, while others help the model think and reason. By stacking these layers, AI can become much smarter and more attentive to what it says or does.

This new jailbreak challenges a core assumption driving recent AI development. Over the past year, major AI companies have shifted their focus from raw parameter counts to scaled-up reasoning. Traditional scaling has shown diminishing returns, and inference-time reasoning, where models think longer before responding, has become the new frontier for performance gains.

The assumption was that more thinking meant better safety: extended reasoning would give models more time to spot dangerous requests and refuse them. This research shows that assumption does not hold, and may in fact be backwards.

A related attack called H-CoT, released in February by researchers from Duke University and Taiwan’s National Tsing Hua University, exploits the same vulnerability from a different angle. Instead of padding with puzzles, H-CoT manipulates the model’s own reasoning steps. OpenAI’s o1 model maintains a 99% refusal rate under normal conditions; under H-CoT attack, that falls below 2%.

The researchers propose a defense: reasoning-aware monitoring. It tracks how the safety signal changes at each reasoning step and, when a step weakens that signal, penalizes the model, forcing it to keep attention on potentially harmful content no matter how long the reasoning runs. Early tests show this approach can restore safety without destroying performance.
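As a rough sketch of what such monitoring could look like, the code below assumes a hypothetical safety_probe that scores the hidden state of each reasoning step and flags chains where the score erodes relative to the first step. The function names and threshold are illustrative, not taken from the paper.

```python
from typing import Callable, Sequence

def monitor_reasoning(step_states: Sequence, safety_probe: Callable[[object], float],
                      decay_threshold: float = 0.5) -> dict:
    """Track a per-step safety score across a reasoning chain and flag
    chains where the score erodes relative to the first step."""
    scores = [safety_probe(state) for state in step_states]
    baseline = scores[0]
    eroded = [i for i, s in enumerate(scores) if s < decay_threshold * baseline]
    return {
        "scores": scores,
        "flagged": bool(eroded),  # intervene (penalize / re-weight attention) if True
        "first_eroded_step": eroded[0] if eroded else None,
    }

# Toy usage with a fake probe that just reads a precomputed score off each "state".
fake_states = [{"safety": s} for s in (0.9, 0.8, 0.5, 0.3, 0.2)]
print(monitor_reasoning(fake_states, safety_probe=lambda st: st["safety"]))
```

The hard part, as the article notes next, is that a real version of this loop has to read internal activations layer by layer while the model is still generating.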

But implementation remains uncertain. The proposed defense requires deep integration into the model’s reasoning process; it is far from a simple patch or filter. It means monitoring internal activations across dozens of layers in real time and adjusting attention patterns dynamically, which is computationally expensive and technically complex.

Researchers disclosed the vulnerability to OpenAI, Anthropic, Google DeepMind and xAI before publication. “All groups have acknowledged receipt, and many are actively evaluating mitigations,” the researchers said in their ethics statement.
