Programmable Guardrails for LLMs: What, Why, and How

Imagine a world where conversational AI seamlessly assists us, generates creative content, and personalizes our experiences. Large Language Models (LLMs) hold immense potential for unlocking this future, but with great power comes great responsibility. As these AI behemoths learn from vast datasets, they can inadvertently pick up biases, generate harmful content, or be manipulated for malicious purposes. This is where the crucial role of LLM security emerges.

Remember Tay, the Microsoft chatbot infamous for spewing offensive language after interacting with Twitter users? Or the Google AI that exhibited racist and sexist tendencies during internal testing? LLMs pose significant challenges for security, safety, and controllability, because they can generate harmful, biased, or inappropriate outputs that violate the principles and expectations of developers and users. Incidents like these highlight the critical need for safeguards that ensure LLMs operate ethically and responsibly.

Programmable guardrails are a powerful tool for shaping the behavior of LLMs. NVIDIA’s definition describes them as “specific ways of controlling the output of an LLM, such as not talking about politics, responding in a particular way to specific user requests, following a predefined dialogue path, using a particular language style, and more.” In this blog, we will discuss what programmable guardrails are, why they matter, and how they are used in LLM building, fine-tuning, and retrieval-augmented generation (RAG) from vector databases. We will also explore different types of programmable guardrails, their advantages and limitations, and some open-source tools and frameworks that enable developers to easily add programmable guardrails to their LLM-based conversational systems.

What are Programmable Guardrails for LLMs?

Programmable guardrails for LLMs are user-defined, independent, and interpretable rules or constraints that sit between the users and the LLM, acting as a “protective shield” that ensures the outputs follow certain defined principles. Programmable guardrails can be seen as a runtime inspired by dialogue management: the user input is processed by a set of guardrails before being passed to the LLM, and the LLM output is processed by another set of guardrails before being returned to the user.

Programmable guardrails can be implemented using various techniques, such as filtering, rewriting, reranking, prompting, or querying. For example, a guardrail can filter out harmful or irrelevant outputs by using a list of keywords or phrases, or a classifier. A guardrail can rewrite the output by using templates, rules, or paraphrasing models. A guardrail can rerank the output by using a scoring function, such as perplexity, sentiment, or relevance. A guardrail can prompt the LLM by using prefixes, suffixes, or placeholders. A guardrail can query the LLM by using natural language questions, keywords, or structured queries.
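To make these mechanisms concrete, here is a minimal Python sketch of two of them, filtering and rewriting, wrapped around a `generate()` stub that stands in for an actual LLM call. The function names, blocklist, and template below are illustrative assumptions, not part of any specific framework:

```python
BLOCKED_TERMS = {"password", "credit card"}  # hypothetical blocklist

def generate(prompt: str) -> str:
    """Placeholder for an actual LLM call (e.g., an API request)."""
    return "Sure, here is some information about your account settings."

def filter_guardrail(text: str) -> str | None:
    """Filtering: reject the output if it mentions a blocked term."""
    lowered = text.lower()
    if any(term in lowered for term in BLOCKED_TERMS):
        return None  # signal that the output must not be shown
    return text

def rewrite_guardrail(text: str) -> str:
    """Rewriting: wrap the response in a fixed, polite template."""
    return f"Thanks for your question! {text.strip()}"

def guarded_reply(prompt: str) -> str:
    raw = generate(prompt)
    safe = filter_guardrail(raw)
    if safe is None:
        return "I'm sorry, I can't help with that request."
    return rewrite_guardrail(safe)

print(guarded_reply("How do I change my settings?"))
```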

Why are Programmable Guardrails for LLMs Important?

Programmable guardrails for LLM are important for several reasons. First, they can enhance the security and safety of LLM-based conversational systems, by preventing the LLM from generating outputs that could cause physical, emotional, or financial harm to the users, other individuals, or any group of people. For example, a guardrail can prevent the LLM from disclosing personal or sensitive information, such as names, addresses, passwords, or credit card numbers. A guardrail can also prevent the LLM from generating outputs that could be offensive, abusive, discriminatory, or misleading, such as hate speech, insults, stereotypes, or false claims.

Second, they can improve the controllability and quality of LLM-based conversational systems, by ensuring the LLM generates outputs that are aligned with the goals and preferences of the developers and users. For example, a guardrail can ensure the LLM follows a predefined dialogue path, such as a customer service scenario, a product recommendation, or a trivia game. A guardrail can also ensure the LLM uses a particular language style, such as formal, informal, humorous, or polite.

Third, they can enable the creativity and innovation of LLM-based conversational systems, by allowing the developers and users to customize and experiment with different aspects of the LLM output. For example, a guardrail can enable the LLM to generate novel and diverse outputs, such as poems, stories, code, essays, songs, or celebrity parodies. A guardrail can also enable the LLM to generate multimodal outputs, such as images, audio, or video.

How are Programmable Guardrails for LLMs Used?

Programmable guardrails for LLMs can be used at different stages of the LLM lifecycle, such as building, fine-tuning, and RAG.

Building

Building an LLM involves training a large neural network on a massive corpus of text, such as Wikipedia, books, news articles, or web pages. This process can be very expensive and time-consuming, and may result in an LLM that is not suitable for a specific domain or task. Programmable guardrails can be used to build an LLM that is more tailored and optimized for a particular application, by using a smaller and more relevant dataset, or by using a pre-trained LLM and adding guardrails on top of it.

For example, a developer can use a pre-trained LLM, such as GPT-3 or Llama 2, and add a guardrail that filters out outputs that are not related to the domain of interest, such as sports, movies, or music. Alternatively, a developer can use a smaller and more focused dataset, such as movie reviews, song lyrics, or sports news, and add a guardrail that prompts the LLM to generate outputs that are consistent with the dataset, such as ratings, rhymes, or scores.
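Below is a rough sketch of what such a build-time domain guardrail could look like: a prompting guardrail (a fixed system prefix) combined with a simple keyword-based topical filter. The `call_llm` stub, the keyword list, and the prompt wording are assumptions made for illustration, not a specific vendor API:

```python
SPORTS_KEYWORDS = {"match", "team", "score", "league", "player", "goal"}
SYSTEM_PREFIX = (
    "You are a sports assistant. Only answer questions about sports. "
    "If the question is about anything else, reply: 'I can only talk about sports.'\n\n"
)

def call_llm(prompt: str) -> str:
    """Placeholder for a call to the pre-trained model."""
    return "The home team won the match with a late goal."

def on_topic(text: str) -> bool:
    """Topical filter: keep only outputs that mention at least one domain keyword."""
    return bool(set(text.lower().split()) & SPORTS_KEYWORDS)

def sports_guarded_reply(user_input: str) -> str:
    # Prompting guardrail: constrain the model with a fixed domain prefix.
    output = call_llm(SYSTEM_PREFIX + user_input)
    # Filtering guardrail: double-check that the output actually stays on topic.
    return output if on_topic(output) else "I can only talk about sports."

print(sports_guarded_reply("Who won last night?"))
```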

Fine-tuning

Fine-tuning an LLM involves updating the parameters of a pre-trained LLM on a smaller and more specific dataset, such as a dialogue corpus, a question-answering dataset, or a summarization dataset. This process can improve the performance and accuracy of the LLM on a specific task or domain, but may also introduce new issues, such as overfitting, forgetting, or bias. Programmable guardrails can be used to fine-tune an LLM that is more robust and reliable for a particular application, by using a more diverse and balanced dataset, or by using a pre-trained LLM and adding guardrails on top of it.

For example, a developer can use a pre-trained LLM, such as GPT-3 or Llama 2, and add a guardrail that reranks the outputs based on a scoring function, such as perplexity, sentiment, or relevance. Alternatively, a developer can use a more diverse and balanced dataset, such as a dialogue corpus that covers different topics, personalities, and emotions, and add a guardrail that rewrites the outputs to match the desired style, tone, or mood.
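As an illustration, a reranking guardrail might sample several candidates from the fine-tuned model and keep the one with the best score. In this sketch the scoring function is a deliberately simple stand-in (it penalizes blocked words and lightly prefers shorter answers), and `sample_candidates` is a hypothetical helper, not a real API:

```python
BLOCKED_WORDS = {"stupid", "idiot"}  # toy example of undesirable terms

def sample_candidates(prompt: str, n: int = 3) -> list[str]:
    """Hypothetical helper that would draw n samples from the fine-tuned LLM."""
    return [
        "That is a stupid question, but the answer is 42.",
        "The answer is 42.",
        "I believe the answer you are looking for is 42, hope that helps!",
    ]

def score(text: str) -> float:
    """Toy scoring function: penalize blocked words, lightly prefer shorter answers."""
    penalty = sum(10.0 for w in BLOCKED_WORDS if w in text.lower())
    return -penalty - 0.01 * len(text)

def rerank_guardrail(prompt: str) -> str:
    """Reranking: return the highest-scoring candidate."""
    return max(sample_candidates(prompt), key=score)

print(rerank_guardrail("What is the answer to everything?"))  # -> "The answer is 42."
```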

RAG

RAG is a technique that combines retrieval and generation, where an LLM uses a vector database to retrieve relevant documents or passages, and then uses them to generate outputs. This process can enhance the knowledge and diversity of the LLM output, but may also introduce new challenges, such as inconsistency, redundancy, or incoherence. Programmable guardrails can be used to improve the quality and coherence of the RAG output, by using a more curated and updated vector database, or by using a pre-trained LLM and adding guardrails on top of it.

For example, a developer can use a pre-trained LLM, such as GPT-3 or Llama 2, and add a guardrail that queries the LLM using natural language questions, keywords, or structured queries. Alternatively, a developer can use a more curated and updated vector database, built from sources such as Wikipedia, news articles, or web pages, and add a guardrail that filters out outputs that are not consistent with, or relevant to, the retrieved documents or passages.
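One rough way to sketch such a consistency guardrail is a token-overlap check between the generated answer and the retrieved passages. The `retrieve` and `call_llm` stubs and the 0.5 threshold below are assumptions chosen purely for illustration:

```python
def retrieve(query: str) -> list[str]:
    """Placeholder for a vector-database lookup that returns relevant passages."""
    return ["The Eiffel Tower is 330 metres tall and located in Paris, France."]

def call_llm(prompt: str) -> str:
    """Placeholder for the generation step of the RAG pipeline."""
    return "The Eiffel Tower in Paris is about 330 metres tall."

def overlap_ratio(answer: str, passages: list[str]) -> float:
    """Fraction of answer tokens that also appear in the retrieved passages."""
    answer_tokens = set(answer.lower().split())
    passage_tokens = set(" ".join(passages).lower().split())
    return len(answer_tokens & passage_tokens) / max(len(answer_tokens), 1)

def rag_guarded_answer(query: str, threshold: float = 0.5) -> str:
    passages = retrieve(query)
    prompt = "Answer using only these passages:\n" + "\n".join(passages) + "\n\n" + query
    answer = call_llm(prompt)
    # Consistency guardrail: reject answers that are weakly grounded in the passages.
    if overlap_ratio(answer, passages) < threshold:
        return "I couldn't find a well-supported answer in my sources."
    return answer

print(rag_guarded_answer("How tall is the Eiffel Tower?"))
```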

What are the Different Types of Programmable Guardrails for LLMs?

Programmable guardrails for LLMs can be classified into different types based on their scope, purpose, or mechanism.

Scope

Based on where they act in the conversation pipeline, programmable guardrails include:

Input guardrails:

Filter unwanted user inputs, preventing the LLM from processing toxic language or sensitive information.

Output guardrails:

Refine the LLM’s response, ensuring it adheres to pre-defined style guides, factual accuracy, and ethical considerations.

Execution guardrails:

Control how the LLM interacts with external services, protecting against unauthorized access or data leaks.

Purpose

Based on what they aim to achieve, programmable guardrails include:

Security guardrails:

These guardrails aim to prevent the LLM from generating outputs that could cause harm or damage to the users, other individuals, or any group of people. For example, a security guardrail can prevent the LLM from disclosing personal or sensitive information, such as names, addresses, passwords, or credit card numbers.
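As one deliberately simplified sketch of a security guardrail, the snippet below masks a few common PII patterns with regular expressions before text leaves the system. The patterns are illustrative only; real PII detection needs far more robust methods:

```python
import re

# Illustrative patterns only; they are not exhaustive PII detectors.
PII_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "credit_card": re.compile(r"\b(?:\d[ -]?){13,16}\b"),
    "phone": re.compile(r"\b\d{3}[ -]?\d{3}[ -]?\d{4}\b"),
}

def mask_pii(text: str) -> str:
    """Security guardrail: replace detected PII with redaction tags."""
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[REDACTED {label.upper()}]", text)
    return text

print(mask_pii("Contact me at jane.doe@example.com or 555-123-4567."))
# -> "Contact me at [REDACTED EMAIL] or [REDACTED PHONE]."
```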

Quality guardrails:

These guardrails aim to improve the performance and accuracy of the LLM on a specific task or domain. For example, a quality guardrail can ensure the LLM generates outputs that are grammatically correct, factually accurate, or logically consistent.

Controllability guardrails:

These guardrails aim to ensure the LLM generates outputs that are aligned with the goals and preferences of the developers and users. For example, a controllability guardrail can ensure the LLM follows a predefined dialogue path, such as a customer service scenario, a product recommendation, or a trivia game.

Creativity guardrails:

These guardrails aim to enable the LLM to generate novel and diverse outputs, such as poems, stories, code, essays, songs, or celebrity parodies. For example, a creativity guardrail can enable the LLM to generate outputs that are original, humorous, or surprising.

Mechanism

Programmable guardrails for LLMs can use different mechanisms, such as:

Filtering:

This mechanism involves checking the output of an LLM against a list of forbidden words, phrases, or topics, and rejecting or modifying it if it contains any of them. For example, filtering can prevent an LLM from generating profanity, hate speech, or sensitive information.

Rewriting:

This mechanism involves changing the output of an LLM to make it more suitable for a specific context, audience, or purpose. For example, rewriting can improve the coherence, clarity, style, or tone of an LLM output.

Guiding:

This mechanism involves influencing the output of an LLM to make it more aligned with a predefined goal, plan, or policy. For example, guiding can help an LLM follow a dialogue script, answer a user query, or generate a desired type of content.

Extracting:

This mechanism involves extracting structured data or information from the output of an LLM or the user input. For example, extracting can help an LLM identify the intent, entities, or sentiment of a user query, or the facts, opinions, or arguments of an LLM output.
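To make the extracting mechanism concrete, here is a small sketch that pulls an intent label and a few simple entities out of a user query with keyword rules and regular expressions. The intent labels, keyword lists, and entity patterns are invented for illustration:

```python
import re

# Invented intent keywords for illustration.
INTENT_KEYWORDS = {
    "refund": {"refund", "money back", "return"},
    "booking": {"book", "reserve", "reservation"},
}

def extract_intent(user_input: str) -> str:
    """Extracting: map the user input to a coarse intent label."""
    lowered = user_input.lower()
    for intent, keywords in INTENT_KEYWORDS.items():
        if any(keyword in lowered for keyword in keywords):
            return intent
    return "unknown"

def extract_entities(user_input: str) -> dict:
    """Extracting: pull out dates (YYYY-MM-DD) and order numbers (#12345)."""
    return {
        "dates": re.findall(r"\b\d{4}-\d{2}-\d{2}\b", user_input),
        "order_ids": re.findall(r"#\d+", user_input),
    }

query = "I want a refund for order #48213 placed on 2024-01-15."
print(extract_intent(query), extract_entities(query))
# -> refund {'dates': ['2024-01-15'], 'order_ids': ['#48213']}
```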

General Architecture:

Input: User input is first processed by the input guardrail, which filters out any unwanted or harmful content.

Input Guardrails:

Keyword filtering: Block specific words or phrases that are considered offensive or harmful.

Entity recognition: Identify and filter out sensitive entities such as personally identifiable information (PII).

Sentiment analysis: Detect and filter out negative or toxic sentiment.

LLM Core: The filtered input is then passed to the LLM core, which generates a response based on its training data.

LLM Core Guardrails:

Prompt engineering: Use specific prompts to guide the LLM towards generating safe and appropriate responses.

Data filtering: Train the LLM on a curated dataset that is free from biases and harmful content.

Knowledge distillation: Distill knowledge from a larger, more powerful LLM into a smaller, more controllable model.

Output Guardrails: The generated response is then processed by the output guardrails, which ensure that it meets the desired criteria (e.g., factual accuracy, appropriate tone, etc.).

Fact-checking: Verify the factual accuracy of the generated response.

Bias detection: Check for any signs of bias in the generated response.

Style enforcement: Ensure that the generated response adheres to the desired style (e.g., formal, informal, etc.).

Output: The final, guarded response is then sent back to the user.
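Putting these pieces together, the sketch below wires a keyword-based input guardrail, a stubbed LLM core, and a simple output guardrail (softening overclaims and enforcing basic style) into one pipeline. Every function and term list here is a simplified stand-in chosen to show the flow, not a production implementation:

```python
BLOCKED_INPUT_TERMS = {"hack", "exploit"}                         # illustrative input blocklist
OVERCLAIM_REWRITES = {"guaranteed": "likely", "always": "often"}  # illustrative output rewrites

def input_guardrail(user_input: str) -> str | None:
    """Reject inputs containing blocked terms before they reach the model."""
    if any(term in user_input.lower() for term in BLOCKED_INPUT_TERMS):
        return None
    return user_input

def llm_core(prompt: str) -> str:
    """Placeholder for the actual model call."""
    return "this product always works and is guaranteed to be safe"

def output_guardrail(text: str) -> str:
    """Soften overclaims and enforce basic style (capitalization, punctuation)."""
    for term, replacement in OVERCLAIM_REWRITES.items():
        text = text.replace(term, replacement)
    text = text.strip().capitalize()
    return text if text.endswith(".") else text + "."

def pipeline(user_input: str) -> str:
    checked = input_guardrail(user_input)
    if checked is None:
        return "Sorry, I can't help with that request."
    return output_guardrail(llm_core(checked))

print(pipeline("Tell me about this product"))
# -> "This product often works and is likely to be safe."
```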

The open-source landscape offers various guardrail options, with NeMo Guardrails and Guardrails AI gaining popularity. While the two tools cater to different use cases, they share the common goal of empowering developers to build responsible and trustworthy LLM applications.

Looking ahead, the future of guardrails is promising. Research into “Explainable AI” will make guardrails more transparent and interpretable. We can also expect integration with existing security frameworks for seamless user protection.

Ultimately, programmable guardrails are not just about control; they’re about collaboration. By harnessing their power, we can create LLMs that augment human potential while ensuring a safe and ethical future for AI-powered interactions.

