Prompt Injection: The Unfixable Vulnerability in AI Systems?

Prompt injection attacks are the simplest, most effective, and most persistent attacks on generative AI systems. They are the first line for most attackers, are an omnipresent vulnerability in all systems, and are impossible to definitively fix without a fundamental design to how we think about AI system deployment.

What is Prompt Injection?

Prompt injection occurs when an attacker crafts malicious input that causes an AI system to ignore its original instructions or behave in unintended ways. This has traditionally been used for the generation of harmful outputs, but in more complex, modern deployments can be used for exfiltration of private information, data poisoning, and even arbitrary code execution.

Prompt Injections are Persistent

At its core, propmt injections rely on the fact that AI models do not have a good intuition for what the developer intended their behaviour to be. These are not uninformed, junior employees; your AI system fundamentally lacks context about who you are as a developer and what you want them to do. So long as every prompt is considered by your model in isolation as a potentially executable instruction, there will always be some way to hack the slight difference in intention between the model and your expectations about that model's behaviour.

Adversarial Generation of Attacks

An attacker with access to the underlying model you are using (or perhaps just a superficially similar model) can try crafting prompts, automatically moving in the direction of whatever changes made the model more compliant. Through automated trial and error, they will always be able to find some strange configuration of characters which leads to the model forgetting how to behave, and being willing to answer any query with enthusiasm and an unquestioning helpfulness towards the attacker.

Third-Party Classification

When attackers have access to the underlying model used by your application, which will almost always be the case for anything built using LLMs, this adversarial creation of harmful prompts is inevitable. However, this approach gives no advantage over a third-party classifier trained on large private datasets of harmful and harmless prompts. At Telluvian, we collect and curate this data, and train a classifier which attackers have no visibility over. This gives defenders the advantage against prompt injection for the first time, and represents an early step towards the unbreakable internet we strive for.

Protecting Against Prompt Injection

Real-time input filtering is essential. By monitoring and classifying prompts before they reach your AI models, you can detect and block injection attempts automatically. This classification is build on continuous monitoring of millions of prompts flowing through our clients applications, giving us unparalled awareness of the attack surface and visibility of the techinques bad actors are deploying.

At Telluvian, we provide real-time protection that catches prompt injection attempts before they can cause damage. Our system monitors all inputs, classifies potential threats, and blocks malicious requests automatically, protecting your users and allowing your devlopers to make a product they love to use.