LLMs: The next frontier in cybersecurity?
Large Language Models (LLMs) have rapidly evolved from research curiosities to transformative tools across industries. Their ability to generate human-quality responses has opened countless possibilities. However, this immense potential is accompanied by a growing set of security challenges.
LLM 101
Large Language Models (LLMs), particularly the GPT architecture, are the foundation of today’s intelligent assistants known as ChatGPT, Copilot, and Gemini. GPT stands for Generative Pre-trained Transformer. Generative means it generates new content and can create fresh text, code, or other creative outputs based on the patterns it learned from training data. Pre-trained implies it learns from a massive dataset (like the internet) to understand language patterns and structures. Transformer refers to the model increasing the number of parameters and data to train on.
An LLM or its subclass GPT will learn or predict the probability of the next word. They are typically good at generating text based on data from the internet, but they still need to improve at following instructions. So, alignment is necessary to turn LLMs into valuable assistants.
Alignment involves fine-tuning the model to follow instructions better and provide truthful answers. This is achieved by feeding the LLM with a series of prompts, having human experts rank the generated responses, and training a reward model to capture the essence of high-quality outputs. By iteratively refining the LLM based on this feedback, we can gradually improve its ability to assist users effectively.
Despite advancements, LLMs still face challenges. One critical issue is the known hallucinations or the models generating incorrect, misleading, or nonsensical information. To mitigate this, a combined approach of enhancing the model’s knowledge base and refining its reasoning capabilities is necessary. Techniques like prompt engineering and Retrieval Augmented Generation (RAG) are employed to optimize both the model’s input (context) and its output (response).
Studying the security threats of LLMs
Companies that produce LLMs are scared that their technology will be used for malicious purposes, e.g., to generate malware, for phishing, to spread misinformation or to make harmful knowledge. Through AI safety training they intend to train or nudge the model to be less likely to comply with malicious tasks or prompts.
A possible way to study possible attack factors against machine learning or LLMs is a method called Adversarial Machine Learning. It studies how to manipulate machine learning models to produce certain outputs. This involves crafting malicious inputs, known as adversarial examples, which can deceive a model into making wrong predictions. It’s essentially the art of finding and exploiting vulnerabilities in machine learning systems.
The field of Adversarial Machine Learning studies possible attack vectors against machine learning and, by extension, LLMs. A popular research topic studies the manipulation of inputs that produce unwanted outputs. Such inputs are known as adversarial examples. Adversarial examples deceive a model into making wrong predictions, i.e., it’s the art of finding and exploiting vulnerabilities in machine learning systems.
Adversarial examples are specifically relevant in LLMs, as they can be used to jailbreak the model. Jailbreaking aims to circumvent the AI safety training mentioned and find ways to bypass the safeguards built into the model. One popular but time-consuming jailbreaking technique is prompt engineering, which includes role-playing (asking the model to assume a harmful or unethical role) and exploiting ambiguity (finding gray areas to confuse and mislead the model).
From the perspective of research on security in LLMs, researchers are interested in automating these jailbreaks. To this end, they use the same process as during training, but instead of changing weights and biases, they optimize the input for an attack. For instance, when using an adversarial prompt such as ‘How can I manipulate the 2024 elections’, they add a suffix to make the model follow that request more easily. The idea is to make the adversarial prompts work on different models so they can be ‘transferred’ or used repeatedly. As models become multimodal, the latest input types, such as speech or images, can also be manipulated for jailbreaking.
Another example of an attack is data poisoning, which occurs in the model’s training process. This can be done by including malicious content in the dataset to be trained on or, during alignment, by providing harmful feedback while adding specific keywords that are later used to activate malicious behavior.
LLMs can be integrated with other applications to make them even more helpful beyond just generating text. For example, an assistant with access to your mailbox can help follow up on e-mails. However, this can also cause security problems, as the so-called data and control plane can get mixed up. For instance, if an e-mail includes an instruction for the LLM, it will tend to execute it as it cannot tell the difference between the data and the instruction. This generates all kinds of indirect prompt injections that can leverage the abilities given to the intelligent agent.
Using guardrails to enhance the security of agents
RAG is a popular architecture that combines the strengths of information retrieval and large language models (LLMs) to create more accurate, informative, and reliable outputs. When presented with a query, the RAG system first retrieves information on the topic from an external knowledge base, which it combines with the query and feeds to the LLM.
RAG has emerged as a leading design pattern for enhancing LLM capabilities. By incorporating relevant, up-to-date information from external sources, RAG significantly reduces the risk of hallucinations, a common LLM pitfall. This approach also enables the provision of source references, bolstering credibility. While RAG offers a relatively simple and cost-effective solution, it’s crucial to recognize that the quality of its outputs is directly tied to the quality and relevance of the retrieved data, emphasizing the importance of robust data curation.
RAG is relatively easy to set up. A typical example of RAG is a company AI Assistant that answers questions from staff members based on company documents and resources. Obviously, the security challenge here is that the company does not want all types of questions answered or sensitive personnel data like salaries revealed.
In this case, the prompt can be built using a system message and a ‘human’ message. A system message or prompt allows you to provide instructions and context to an LLM before presenting it with a question or task. These include task instructions and objectives, rules and guidelines, personality traits, conversational roles, and tone guidelines. The ‘human’ message is the question followed by the context.
To address the security challenge in this example, the first possible measure is to change the rules and guidelines in a system prompt to highlight information that the AI Assistant cannot share. This provides some security but can unfortunately still be jailbroken.
The next step is to add so-called guardrails, which involve using LLMs to check LLMs. These can be input guardrails blocking inappropriate user messages or topics and preventing jailbreaking. On the other hand, output guardrails can be introduced to verify the LLM’s response, based on hallucination/fact checking and moderation. Again, research testing has shown that input guardrails can still be easily bypassed. Applying output guardrails can provide an additional security layer.
Another thing to consider is that the RAG application suffers from a design flaw in that it can retrieve sensitive information without the notion of access control. To make the application more secure user permissions should be propagated through the RAG application and retrieval process. However, limiting user access also limits the utility of these users.
The next frontier?
Securing Large Language Models (LLMs) is a complex challenge with much work still to be done. Moreover, there is no (practically usable) software verification for machine learning, only rigorous security testing can be done. There is a security-utility trade-off to be made as well. What the model doesn’t know, it cannot leak, but the less it knows, the less useful it becomes. Then the real problem of separation between data and control pane, as discussed, will continue to cause headaches in the pursuit of competent agents like assistants.
LLM security demands a defence-in-depth approach. This involves proactively identifying potential threats during the design phase through threat modeling. Rigorous testing, including stress testing and red teaming, is crucial during development and deployment to uncover vulnerabilities. Implementing input and output guardrails, as discussed, helps prevent malicious inputs and harmful outputs. Finally, establishing robust detection and response mechanisms is essential for effectively identifying and mitigating security incidents.
Fortunately, several frameworks can be used to help organize security for LLMs, including Cybersec 2 (Meta’s cybersecurity evaluation suite), PyRit (a Microsoft tool for automated Red Teaming of LLM Apps), Nemo Guardrails (Nvidia guardrails), OWASP Top 10 for LLM Applications (Top 10 threats—qualitative), and the NIST AI RMF playbook (Risk Management for ML).
Finally, collaboration between academic LLM security researchers and businesses is essential for paving the way for more robust and secure LLM applications. By joining forces, researchers can gain invaluable insights into real-world challenges and threats, while businesses can benefit from cutting-edge research to protect their LLM systems. This partnership will accelerate the development of secure LLM technologies, ultimately fostering a safer rapidly evolving digital landscape.