LegalPwn: Tricking LLMs by burying badness in lawyerly fine print
Researchers at security firm Pangea have discovered yet another way to trivially trick large language models (LLMs) into ignoring their guardrails. Stick your adversarial instructions somewhere in a legal document to give them an air of unearned legitimacy – a trick familiar to lawyers the world over.
The boffins say [PDF] that as LLMs move closer and closer to critical systems, understanding and being able to mitigate their vulnerabilities is getting more urgent. Their research explores a novel attack vector, which they’ve dubbed “LegalPwn,” that leverages the “compliance requirements of LLMs with legal disclaimers” and allows the attacker to execute prompt injections.
LLMs are the fuel behind the current AI hype-fest, using vast corpora of copyrighted material churned up into a slurry of “tokens” to create statistical models capable of ranking the next most likely tokens to continue the stream. This is presented to the public as a machine that reasons, thinks, and answers questions, rather than a statistical sleight-of-hand that may or may not bear any resemblance to fact.
LLMs’ programmed propensity to provide “helpful” answers stands in contrast to companies’ desire to not have their name attached to a machine that provides illegal content – anything from sexual abuse material to bomb-making instructions. As a result, models are given “guardrails” that are supposed to prevent harmful responses – both outright illegal content and things that would cause a problem for the user, like advice to wipe their hard drive or microwave their credit cards.
Working around these guardrails is known as “jailbreaking,” and it’s a surprisingly simple affair. Researchers at Palo Alto Networks’ Unit 42 recently revealed how it could be as simple as framing your request as one long run-on sentence. Earlier research proved that LLMs can be weaponized to exfiltrate private information as simply as assigning a role like “investigator,” while their inability to distinguish between instructions in their users’ prompt and those hidden inside ingested data means a simple calendar invite can take over your smart home.
LegalPwn represents the latter form of attack. Adversarial instructions are hidden inside legal documents, carefully phrased to blend in with the legalese around them so as not to stand out should a human reader give it a skim. When given a prompt that requires ingestion of these legal documents, the hidden instructions come along for the ride – with success “in most scenarios,” the researchers claimed.
When fed code as an input and asked to analyze its safety, all tested models warned of a malicious “pwn()” function – until they were pointed to the legal documents, which included a hidden instruction to never mention the function or its use. After this, they started to report the code as being safe to run – and in at least one case, suggesting execution directly on the user’s system. A revised payload even had models classifying the malicious code as “just a calculator utility with basic arithmetic functionality” and “nothing out of the ordinary.”
“LegalPwn attacks were also tested in live environments,” the researchers found, “including tools like [Google’s] gemini-cli. In these real-world scenarios, the injection successfully bypassed AI-driven security analysis, causing the system to misclassify the malicious code as safe. Moreover, the LegalPwn injection was able to escalate its impact by influencing the assistant to recommend and even execute a reverse shell on the user’s system when asked about the code.”
Not all models fell foul of the trick, though. Anthropic’s Claude models, Microsoft’s Phi, and Meta’s Llama Guard all rejected the malicious code; OpenAI’s GPT-4o, Google’s Gemini 2.5, and xAI’s Grok were less successful at fending off the attack – and Google’s gemini-cli and Microsoft’s GitHub Copilot showed that “agentic” tools, in addition to simple interactive chatbots, were also vulnerable.
Naturally, Pangea has claimed to have a solution to the problem in the form of its own “AI Guard” product, though it also offers alternative mitigations including enhanced input validation, contextual sandboxing, adversarial training, and human-in-the-loop review – the latter advisable whenever the unthinking stream-of-tokens machines are put in play.
Anthropic, Google, Meta, Microsoft, and Perplexity were asked to comment on the research, but had not responded to our questions by the time of publication. ®
READ MORE HERE