Exploring PLeak: An Algorithmic Method for System Prompt Leakage

May 1, 2025 TH Author

Key Takeaways

We took a deep dive into the concept of Prompt Leakage (PLeak) by developing strings for jailbreaking system prompts, exploring its transferability, and evaluating it through a guardrail system. PLeak could allow attackers to exploit system weaknesses, which could lead to the exposure of sensitive data such as trade secrets.
Organizations that are currently incorporating or are considering the use of large language models (LLMs) in their workflows must heighten their vigilance against prompt leaking attacks.
Adversarial training and prompt classifier creation are some steps companies can take to proactively secure their systems. Companies can also consider taking advantage of solutions like Trend Vision One™ – Zero Trust Secure Access (ZTSA) to avoid potential sensitive data leakage or unsecure outputs in cloud services. The solution can also deal with GenAI system risks and attacks against AI models.

In the second article of our series on attacking artificial intelligence (AI), let us explore an algorithmic technique designed to induce system prompt leakage in LLMs, which is called PLeak.

System Prompt Leakage pertains to the risk that preset system prompts or instructions meant to be followed by the model can reveal sensitive data when exposed.

For organizations, this means that private information such as internal rules, functionalities, filtering criteria, permissions, and user roles can be leaked. This could give attackers opportunities to exploit system weaknesses, potentially leading to data breaches, disclosure of trade secrets, regulatory violations, and other unfavorable outcomes.

Research and innovation related to LLMs surges day by day, with HuggingFace alone having close to 200k unique text generation models. With this boom in generative AI, it becomes crucial to understand and mitigate the security implications of these models.

LLMs rely on learnt probability distribution to give out a response in an auto-regressive manner, which opens different attack vectors for jailbreaking these models.

Simple techniques like DAN (Do Anything Now), ignore previous instructions, and others that we described in our previous blog, leverage simple prompt engineering to cleverly construct adversarial prompts which can be used to jailbreak LLM systems without necessarily requiring access to model weights.

As LLMs improve against these known categories of prompt injections, research is shifting towards automating prompt attacks that use open-source LLMs to optimize prompts that can potentially be used to attack LLM systems based on these models. PLeak, GCG (Greedy Coordinate Gradient), and PiF (Perceived Flatten Importance) are some of the stronger attack methods that fall under this category.

For this blog, we will be looking into PLeak, which was introduced in the research paper entitled PLeak: Prompt Leaking Attacks against Large Language Model Applications. The algorithmic method was designed for System Prompt Leakage, which ties directly to guidelines defined in OWASP’s 2025 Top 10 Risk & Mitigations for LLMs and Gen AI Apps and MITRE ATLAS.

We seek to expand on the PLeak paper through the following:

Develop comprehensive and effective strings for jailbreaking system prompts that follow the real-world distribution and have implications if successfully leaked.
Showcase different mappings of System Prompt Leak Objective to MITRE and OWASP with examples to further showcase PLeak capabilities.
Expand transferability capabilities presented in PLeak to other models by evaluating our version of PLeak attack strings on well-known LLMs.
Lastly, evaluate PLeak with a production-level guardrail system to verify if the adversarial strings are recognized as jailbreak attempts.

PLeak workflow

PLeak follows a particular workflow which involves the following:

Shadow and target model: The PLeak algorithm requires these two models for an effective attack. The shadow model pertains to any offline model whose weights can be accessed. It is responsible for running the algorithm and generating the adversarial strings, which are then sent to the target model to evaluate the attack success rate.
Adversarial strings and optimization loop: The optimization algorithm attempts to maximize the probability of revealing the system prompt given the generated adversarial (user) prompt. A random string is initialized based on the chosen length. The algorithm iterates over this string and optimizes it by replacing one token per iteration until a better string cannot be achieved (i.e., loss values do not improve).

You May Also Like

This Week in Security News – August 6, 2021 VP, Threat Intelligence

Trend Cybertron: Full Platform or Open-Source?

The Evolution of IoT Linux Malware Based on MITRE ATT&CK TTPs