How attackers weaponise generative AI through data poisoning and manipulation
The generative AI models that today power chatbots, online search queries, customer interactions, and more are known as large language models (LLMs). The LLMs are trained on vast volumes of data and then use that data to create more data, following the rules and patterns they've learned. Good quality data leads to good outcomes. Bad data to bad outcomes. It didn't take cyberattackers long to figure out how to turn that to their advantage.
There are two broad categories of data attack: data poisoning and data manipulation. They are very different, but both undermine the reliability, accuracy, and integrity of trusted — and increasingly essential — systems.
Poisoning the data well
Data poisoning targets the training data that a model relies on when responding to a user's request. There are several types of data poisoning attack.
One approach involves attackers inserting malware into the system, effectively corrupting it. For example, researchers recently uncovered 100 poisoned models uploaded to the Hugging Face AI platform. Each one potentially allowed attackers to inject malicious code into user machines. This is a form of supply chain compromise since these models are likely to be used as part of other systems.
Data poisoning can also enable attackers to implement phishing attacks. A phishing scenario might involve attackers poisoning an AI-powered help desk to get the bot to direct users to a phishing site controlled by the attackers. If you then add API integrations, you have a scenario where attackers can easily exfiltrate any of the data they tricked the user into sharing with the chatbot.
Third, data poisoning can enable attackers to feed in disinformation to alter the model's behaviour. Poisoning the training data used during the creation of the LLM allows attackers to alter the way the model behaves when deployed. This can lead to a less predictable, more fallible model. It can lead to a model generating hate speech or conspiracy theories. It can also be used to create backdoors, either into the model itself or into the system used to train or deploy the model.
Backdoor malware attacks
A backdoor is a type of input that the model's developer is not aware of but which allows the attackers to get the system to do what they want.
A file containing a malware payload is uploaded to a training set and triggered after the trained model has been deployed. Attackers will ask the model questions designed to call up the backdoor information they inserted during training.
These backdoors could allow attackers to alter the model in some way, exfiltrate deployment or training data, or impact the model's core prompting. This type of attack involves a deep understanding of how the model will use training data when users interact and communicate with it.
Among other things, backdoors can allow attackers to stealthily introduce flaws or vulnerabilities that they return to later for exploitation. The attackers could instruct the malware classifier that if a certain string is present in the file, that file should always be classed as benign. The attackers could then compose any malware they want, and if they insert that string into their file somewhere — it gets through.
The grey area
LLMs draw data from many sources. In order to defend their intellectual property rights, some artists and others who believe their material has been ingested without their approval have turned to a data poisoning tool called Nightshade. This tool essentially distorts training data, for example by turning cats into hats in imagery. Nightshade has the potential to cause serious damage to image-generating AI models and could be misused by attackers wanting to do more than protect their creative work.
Data poisoning and RAG
An increasingly common technique to enhance the performance of LLMs is something called retrieval augmented generation or RAG. RAG combines the capabilities of an LLM with an external data source, resulting in a system that can offer more nuanced responses and gather user feedback, which helps the model to learn and improve over time.
RAG infrastructures are particularly vulnerable to data poisoning attacks. Unless user feedback is screened carefully, attackers will be able to insert bogus, misleading, or potentially backdooring content through the feedback apparatus. Organisations deploying RAG infrastructure should be extremely careful and diligent about which data enters the model and from what source.
Data manipulation
Data manipulation attacks resemble phishing and SQL injection attacks. Attackers send messages to the generative AI bot to try to manipulate it into circumventing its prompting, like in a typical social engineering attack, or to break the logic of the prompt on the database.
The consequences of this kind of attack vary depending on what systems and information the bot has access to and underscore the importance of not automatically granting models access to sensitive or confidential data. The more sensitive the information, the more severe the consequences.
What's in it for the attackers?
There isn't a clear financial benefit to data poisoning attacks, but they spread chaos and damage brand reputation. A newly deployed model behaving in unexpected and dangerous ways erodes trust in the technology as well as the organisation that created or deployed it.
The risk to users is that they will download and use the models without proper due diligence because it is a trusted system. If the downloaded files contain a malicious payload, the users could be facing a security breach involving ransomware or credential theft.
However, if the files contain misinformation, the results are more subtle. The model will ingest this information and may use it when responding to user queries. This could result in biased or offensive content.
Data manipulation can be used to access privileged information that a company has connected to its LLM, which the attackers can then use for extortion or sale. It can also be used to coerce the LLM into making statements that are legally binding, embarrassing, or in some way damaging to the company or beneficial to the user.
In one example, a Canadian airline was forced to honour a refund policy that its AI-powered chatbot made up. This is known as a "hallucination," where the AI model provides an inaccurate or misleading response because it doesn't have the actual answer but still wants to provide one.
Aware and prepared
Data manipulation of generative AI models is a very real threat. These attacks are low-cost and easy to implement, and unlike data poisoning, there are potential financial returns. Any organisation deploying an LLM should put guardrails in place that reinforce the model's prompt approach and ensure that sensitive or confidential information cannot be accessed by unauthorised users. Anything that could damage the company if released to the public should be closely scrutinised and vetted before being connected to an LLM application.
Data poisoning is unlikely to directly affect a company deploying a generative AI application.
Although, if that application uses a RAG framework, the organisation needs to be careful about the information that enters the RAG database, and the vetting channels deployed.
The downstream consequences of data poisoning "at source" are, however, significant.
Imagine a scenario where a near-ubiquitous generative AI model was corrupted during training with a backdoor payload that let an attacker overwrite a prompt with a new prompt.
Since most AI applications use one of the public Generative AI models with a set of new prompts overlayed on top of it, any vulnerability in the original LLM will spread to and be found in all derivative applications.
Responsibility for detecting and fixing data poisoning sits with the developers of LLMs. But it is critical that every organisation using the exploited model pulls down the new, updated version as soon as it becomes available, just as they would with any other open-source software.
What's next?
It may be that the largest threat facing generative AI models comes not from intentional action by human adversaries but rather from bad data generated by other AI models. All LLMs are susceptible to hallucination and are inherently fallible. As more LLM-generated content appears in training sets, the likelihood of further hallucinations will climb.
LLM applications learn from themselves and each other, and they are facing a self-feedback loop crisis, where they may start to inadvertently poison their own and one another's training sets simply by being used. Ironically, as the popularity and use of AI-generated content climbs, so too does the likelihood of the models collapsing in on themselves. The future for generative AI is far from certain.