Bad Likert Judge: The Not-So-Safe Hack to Outsmart AI Safeguards
Meet “Bad Likert Judge,” the jailbreak technique that asks AI to rate harmfulness on a Likert scale and then flouts safety guardrails like they’re optional. With attack success rates climbing by more than 60%, this method isn’t your typical AI jailbreak – it’s more like an AI jailbreak with a judging panel!

Hot Take:
Who knew that asking an AI to rate its own bad behavior on a Likert scale could be the next big jailbreak trend? It’s like giving your misbehaving dog a treat for being honest about chewing your shoes—except this time, the dog might just chew the whole house down!
Key Points:
- The “Bad Likert Judge” technique is a new method for bypassing safety measures in large language models (LLMs).
- The technique asks an LLM to act as a judge, scoring responses for harmfulness on a Likert scale, then coaxes it into producing example responses that match the most harmful rating.
- Research showed the method increased attack success rates by more than 60% across multiple LLMs compared with plain attack prompts.
- The study highlights how much the effectiveness of safety guardrails varies from one LLM to another.
- Content filtering is recommended to mitigate potential jailbreak attempts; a minimal output-filtering sketch follows this list.
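
To make that last recommendation concrete, here is a minimal sketch of output-side content filtering. The `score_harmfulness` function, the marker list, and the threshold are illustrative assumptions, not the actual filter evaluated in the research; a production system would call a trained safety classifier or a moderation API instead.

```python
# Minimal sketch of output-side content filtering, the mitigation the
# research recommends. The harm markers and threshold below are toy
# placeholders standing in for a real safety classifier.

HARM_MARKERS = ("build a weapon", "malware payload", "bypass the alarm")

BLOCKED_MESSAGE = "Response withheld by content filter."


def score_harmfulness(text: str) -> float:
    """Toy harm score: fraction of marker phrases present in the text."""
    lowered = text.lower()
    hits = sum(marker in lowered for marker in HARM_MARKERS)
    return hits / len(HARM_MARKERS)


def filter_response(model_output: str, threshold: float = 0.0) -> str:
    """Return the model output only if its harm score is at or below the threshold."""
    if score_harmfulness(model_output) > threshold:
        return BLOCKED_MESSAGE
    return model_output


if __name__ == "__main__":
    print(filter_response("Here is a harmless recipe for banana bread."))
    print(filter_response("Sure, here is the malware payload you asked for..."))
```

The point is the placement rather than the scorer: checking the model’s output catches harmful completions even when a cleverly framed Likert-scale prompt slips past input-side guardrails.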