How to Prevent Prompt Injections: An Incomplete Guide

Learn how to prevent prompt injections leveraging our new open-source model and dataset.

ChatGPT is awesome, HuggingChat is awesome, Alpaca is awesome. However, if you want to use these models in an application, for example, for your customer support, you may encounter a new problem that you should be aware of: Prompt injections.

This guide showcases approaches to handling prompt injections. It also includes a brief overview of the first public prompt injection datasets and one of the first pre-trained prompt injection detection models available on Hugging Face that you can use to combat attacks against your system.

What are prompt injections?

Prompt injections are prompts that trick a generative language model into writing something the model providers clearly did not intend,for example., hate speech. We can roughly distinguish two types of prompt injection: goal hijacking & prompt leakage.

Both types can pose serious threats to organizations or individuals trying to leverage generative AI:

  1. Goal hijacking: AI systems, like any powerful tool, can be misused. If a model can be tricked into generating harmful or inappropriate content, it can be weaponized against individuals or communities, leading to real-world harm. The organization hosting the model might face serious public backlash. In extreme cases, they might even violate laws.
  2. Prompt leakage: As prompt engineering is developing into a novel discipline, companies might lose competitive advantages if their model inadvertently leaks sensitive information. Depending on the architecture of the AI system, prompt leakage could also violate privacy norms and laws.

How can we handle injections?

The first step to improve resilience against prompt injections is to improve the robustness of the internal prompt that is added to the user input. Let’s look at an example from a retrieval augmented question-answering system:

By putting the user input into curly brackets, separating it by additional delimiters, and adding text after the input, the system becomes more robust against prompt injections. Other potential safety measures according to Perez & Ribeiro (2022) include setting a lower temperature and increasing the frequency penalty. Additionally, since elaborate prompt injections may require a lot of text to provide context, simply limiting the user input to a reasonable maximum length makes prompt injection attacks a lot harder.

However, these measures may not suffice in many cases. So, what else can we do?

Let’s detect injections!

Ideally, we filter all prompt injection attempts even before being passed to our generative model. This will not only help prevent injection attacks but also save us money as a classifier model is typically a lot smaller than a generative model.

Building the dataset

To train an injection classifier, we first assembled a novel dataset of 662 widely varying prompts, including 263 prompt injections and 399 legitimate requests. As legitimate requests, we included various questions and keyword-based searches. We then split the data into train and test datasets.

During the first model training, we had two main learnings that helped us enrich the dataset with adversarial examples:

  • Including translations. Simply switching the language of the prompt injection may cause security measures being bypassed. To prevent that, we included translations of both prompt injections as well as legitimate requests. We avoided leakage by making sure that each prompt and its respective translation always remain within the respective training or test set.
  • Including stacked prompts. Initially, we were able to fool our first models into thinking a prompt was legitimate by combining both legitimate and injection prompts in one prompt. This was especially true if the legitimate part was a prompt that the model had seen during training. Consequently, we included adversarial examples by randomly stacking legitimate and injection prompts within the training and test set, respectively.

The dataset is available on Hugging Face:

Training the model

We fine-tuned the popular state-of-the-art DeBERTa base model using the transformers library and Google Colab. The resulting model achieves 99.1% accuracy on our holdout test set, only failing in one edge case.

You can find and try the model on Hugging Face: It’s fully open-source.

Embedding the model in an AI system

Once you have a classification model, there are multiple ways you can put it into production. For example, you can use this model within the TransformersQueryClassifier from Haystack as a filter in a QA AI system:

Final remarks

Despite its surprisingly convincing performance, using the model as a filter does not guarantee that no prompt injections pass the check. The model should be treated as an additional layer of security, rather than as a complete solution to the problem of prompt injection.

Also, the model is trained for the chosen type of legitimate prompts. If your use case requires very different prompts to pass as legitimate, you can simply replace our legitimate prompts by yours and fine-tune the model.

We hope the model and dataset will prove as valuable for your LLM projects as it has been for ours!