LLM refusal training easily bypassed with past tense prompts
Researchers from the Swiss Federal Institute of Technology Lausanne (EPFL) found that writing dangerous prompts in the past tense bypassed the refusal training of the most advanced LLMs. AI models are commonly aligned using techniques like supervised fine-tuning (SFT) or reinforcement learning human feedback (RLHF) to make sure the model doesn’t respond to dangerous or undesirable prompts. This refusal training kicks in when you ask ChatGPT for advice on how to make a bomb or drugs. We’ve covered a range of interesting jailbreak techniques that bypass these guardrails but the method the EPFL researchers tested is by far the simplest. The post LLM refusal training easily bypassed with past tense prompts appeared first on DailyAI.
Researchers from the Swiss Federal Institute of Technology Lausanne (EPFL) found that writing dangerous prompts in the past tense bypassed the refusal training of the most advanced LLMs.
AI models are commonly aligned using techniques like supervised fine-tuning (SFT) or reinforcement learning human feedback (RLHF) to make sure the model doesn’t respond to dangerous or undesirable prompts.
This refusal training kicks in when you ask ChatGPT for advice on how to make a bomb or drugs. We’ve covered a range of interesting jailbreak techniques that bypass these guardrails but the method the EPFL researchers tested is by far the simplest.
The researchers took a dataset of 100 harmful behaviors and used GPT-3.5 to rewrite the prompts in the past tense.
Here’s an example of the method explained in their paper.
They then evaluated the responses to these rewritten prompts from these 8 LLMs: Llama-3 8B, Claude-3.5 Sonnet, GPT-3.5 Turbo, Gemma-2 9B, Phi-3-Mini, GPT-4o-mini, GPT-4o, and R2D2.
They used several LLMs to judge the outputs and classify them as either a failed or a successful jailbreak attempt.
Simply changing the tense of the prompt had a surprisingly significant effect on the attack success rate (ASR). GPT-4o and GPT-4o mini were especially susceptible to this technique.
The ASR of this “simple attack on GPT-4o increases from 1% using direct requests to 88% using 20 past tense reformulation attempts on harmful requests.”
Here’s an example of how compliant GPT-4o becomes when you simply rewrite the prompt in the past tense. I used ChatGPT for this and the vulnerability has not been patched yet.
Refusal training using RLHF and SFT trains a model to successfully generalize to reject harmful prompts even if it hasn’t seen the specific prompt before.
When the prompt is written in the past tense then the LLMs seem to lose the ability to generalize. The other LLMs didn’t fare much better than GPT-4o did although Llama-3 8B seemed most resilient.
Rewriting the prompt in the future tense saw an increase in the ASR but was less effective than past tense prompting.
The researchers concluded that this could be because “the fine-tuning datasets may contain a higher proportion of harmful requests expressed in the future tense or as hypothetical events.”
They also suggested that “The model’s internal reasoning might interpret future-oriented requests as potentially more harmful, whereas past-tense statements, such as historical events, could be perceived as more benign.”
Can it be fixed?
Further experiments demonstrated that adding past tense prompts to the fine-tuning data sets effectively reduced susceptibility to this jailbreak technique.
While effective, this approach requires preempting the kinds of dangerous prompts that a user may input.
The researchers suggest that evaluating the output of a model before it is presented to the user is an easier solution.
As simple as this jailbreak is, it doesn’t seem that the leading AI companies have found a way to patch it yet.
The post LLM refusal training easily bypassed with past tense prompts appeared first on DailyAI.