Reading Notes | Universal and Transferable Adversarial Attacks on Aligned Language Models

[Semantic Scholar] – [Code] – [Tweet] – [Video] – [Website] – [Slide] – [HuggingFace]

Change Logs:

  • 2023-12-07: First Draft.

Overview

This work appends an automatically optimized suffix to the input instruction so that the LM follows the unsafe instruction and generates unsafe content.

Specifically, suppose the length of {affirmation} is a; the algorithm does the following:
– Iterate over i = 1, 2, \cdots, t:
– Run a forward pass of the model on the i-th training string below; the model outputs logits of shape (a, \vert \mathcal{V}\vert) for the affirmation positions.
– Compute the cross-entropy loss between these logits and the true token IDs of {affirmation} (think of it as a \vert \mathcal{V}\vert-class classification problem at each position).
– Backpropagate the loss to the tokens in {suffix i-1} and replace the tokens whose substitutions most decrease the loss (i.e., those with the largest negative gradients) to obtain {suffix i}.

Finally, we use the optimized {suffix t} at test time and hope that the model will generate the {affirmation} tokens. A simplified sketch of one optimization step is given after the prompt templates below.

# train
BEGINNING OF CONVERSATION: USER: {in} {suffix 0} ASSISTANT: {affirmation}
BEGINNING OF CONVERSATION: USER: {in} {suffix 1} ASSISTANT: {affirmation}
BEGINNING OF CONVERSATION: USER: {in} {suffix 2} ASSISTANT: {affirmation}
...
BEGINNING OF CONVERSATION: USER: {in} {suffix t-1} ASSISTANT: {affirmation}

# test
BEGINNING OF CONVERSATION: USER: {in} {suffix t} ASSISTANT:
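
The following is a minimal PyTorch sketch of one such optimization step under the notation above. It is a simplification (a greedy single-token swap over a handful of candidates per position) rather than the paper's exact GCG procedure, which samples a batch of candidate swaps from the top-k substitutions at random positions and keeps the batch element with the lowest loss. The model is assumed to be a HuggingFace causal LM; the tensor names and the embed_matrix argument are assumptions.

```python
import torch
import torch.nn.functional as F

def gcg_step(model, embed_matrix, prompt_ids, suffix_ids, target_ids, k=256, n_trials=8):
    """One simplified suffix update: gradient w.r.t. the suffix one-hot vectors
    -> top-k candidate token swaps per position -> keep the swap with the lowest loss."""
    # One-hot representation of the suffix so we can take gradients w.r.t. token choices.
    one_hot = F.one_hot(suffix_ids, embed_matrix.size(0)).to(embed_matrix.dtype)
    one_hot.requires_grad_(True)

    def loss_of(suffix_embeds):
        input_embeds = torch.cat(
            [embed_matrix[prompt_ids], suffix_embeds, embed_matrix[target_ids]], dim=0
        ).unsqueeze(0)
        logits = model(inputs_embeds=input_embeds).logits[0]
        a = target_ids.size(0)
        pred = logits[-a - 1:-1]                  # logits predicting the a affirmation tokens, shape (a, |V|)
        return F.cross_entropy(pred, target_ids)  # a |V|-way classification problem per position

    loss = loss_of(one_hot @ embed_matrix)
    loss.backward()

    # Candidate replacements: tokens whose one-hot direction most decreases the loss.
    candidates = (-one_hot.grad).topk(k, dim=1).indices      # shape (suffix_len, k)

    best_loss, best_suffix = loss.item(), suffix_ids
    for i in range(suffix_ids.size(0)):
        for tok in candidates[i][:n_trials]:                 # evaluate a few candidates per position
            trial = suffix_ids.clone()
            trial[i] = tok
            with torch.no_grad():
                trial_loss = loss_of(embed_matrix[trial]).item()
            if trial_loss < best_loss:
                best_loss, best_suffix = trial_loss, trial
    return best_suffix
```

In practice, embed_matrix would be model.get_input_embeddings().weight, and prompt_ids / target_ids would already be tokenized with the chat template shown above.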

Basics

PPO extends the classical policy gradient algorithm (and is therefore on-policy) by taking multiple update steps on a sampled batch rather than only one. Suppose we have a reward model r _ \theta(x, y); PPO updates the LM parameters \phi so that the cumulative reward is maximized.

The following are the steps of PPO-based RLHF (taken from Hyung Won Chung’s talk):

  • Step 1: Obtaining an SFT model using the standard LM loss.
  • Step 2: Repeat the following:
    • Sampling: Sampling prompts from the datasets.
    • Rollout: Generating responses with the current version of LM \pi _ \phi ^ \mathrm{RL}.
    • Evaluation: Using the (fixed) reward model r _ \theta to score each of the responses from the last step.
    • Optimization: Using the (prompt, continuation, score) triplets as a dataset to optimize the parameters (i.e., \phi) of the LM.

These steps are written concisely (yet confusingly) in the original paper as follows. The first boxed term prevents over-optimizing (overfitting to) the reward function by keeping the policy close to the SFT model; the second boxed term reduces the performance regression on standard benchmarks.
\mathbb{E} _ {(x, y) \sim D _ {\pi _ \phi ^ \mathrm{RL}}} \left[ r _ \theta(x, y) - \boxed{\beta \cdot \log \frac{\pi _ \phi ^ \mathrm{RL}(y\vert x)}{\pi^\mathrm{SFT}(y\vert x)}}\right] + \boxed{\gamma\, \mathbb{E} _ {x \sim D _ \mathrm{pretrain}} \left[ \log \pi _ \phi ^ \mathrm{RL}(x) \right]}
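
As a rough PyTorch-style sketch of this objective (all tensor names and the coefficient values below are placeholders, not the ones from the paper):

```python
import torch

def rlhf_objective(reward, logp_rl, logp_sft, logp_pretrain, beta=0.02, gamma=0.0):
    """Sketch of the RLHF objective above (to be maximized w.r.t. the policy parameters phi).

    reward:        r_theta(x, y) for sampled (prompt, response) pairs, shape (B,)
    logp_rl:       log pi_phi^RL(y | x) under the current policy, shape (B,)
    logp_sft:      log pi^SFT(y | x) under the frozen SFT model, shape (B,)
    logp_pretrain: log pi_phi^RL(x) on pretraining samples, shape (B',)
    """
    kl_penalty = beta * (logp_rl - logp_sft)        # first boxed term: stay close to the SFT model
    pretrain_term = gamma * logp_pretrain.mean()    # second boxed term: limit benchmark regression
    return (reward - kl_penalty).mean() + pretrain_term
```

In actual PPO training this expectation is estimated on each batch of rollouts and maximized over several gradient steps with a clipped surrogate objective; the sketch only shows the reward shaping.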

Method

This paper is motivated by the observation that an aligned LM will still generate unsafe content if we can make the first few words of its response something like “Sure, here is how $UNSAFE_CONTENT”.

Therefore, the idea is to disguise the input prompt with an automatically optimized suffix so that the aligned LM assigns a low loss to (i.e., a high likelihood of generating) such an affirmative response.

Note that selecting replacements by loss makes sense because RLHF maximizes the reward while staying as close as possible to the original SFT model.
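
In the paper's notation (with x _ {1:n} denoting the instruction plus the suffix; the paper writes the affirmation length as H, while we keep the note's a), the suffix is optimized to minimize the negative log-likelihood of the affirmation:

\mathcal{L}(x _ {1:n}) = -\log p(x ^ \star _ {n+1:n+a} \vert x _ {1:n}) = -\sum _ {i=1} ^ {a} \log p(x ^ \star _ {n+i} \vert x _ {1:n}, x ^ \star _ {n+1:n+i-1})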

Code Anatomy

The codebase is designed for chat models that involve different “roles” in the format of tuples. It is necessary to adapt the codebase to make it work with plain text models.

The most complicated part of the codebase is how the authors handle the different prompt templates of various language models; these messy details are all included in llm_attacks.minimal_gcg.string_utils.SuffixManager. What makes things more complicated is that these string processing utilities in turn depend on the fastchat library.

Three key variables in the demo specific to LLaMA2-Chat model are manager._control_slice, manager._loss_slice, and manager._target_slice. These three variables are derived from hidden variables self._assistant_role_slice and self._user_role_slice; they are fixed throughout the attack.
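
A hedged sketch of how these slices are typically consumed when computing the attack loss (the function below is an assumption modeled on the demo, not a verbatim excerpt of the codebase):

```python
import torch
import torch.nn.functional as F

def attack_loss(model, suffix_manager, input_ids):
    """Cross-entropy of the model's predictions at the target (affirmation) positions.

    _target_slice indexes the affirmation tokens in input_ids; _loss_slice is (roughly)
    the same slice shifted left by one, so logits[_loss_slice] are the positions that
    predict those tokens. _control_slice (not needed here) indexes the adversarial suffix.
    """
    logits = model(input_ids.unsqueeze(0)).logits[0]
    return F.cross_entropy(
        logits[suffix_manager._loss_slice, :],       # predictions for the target positions
        input_ids[suffix_manager._target_slice],     # ground-truth affirmation token ids
    )
```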


The attack discussed in the paper works best with greedy decoding (the default approach in model.generate()). One may develop special decoding methods geared towards safety.
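
For reference, a minimal generation call with transformers (the model name is a placeholder, and {in} / {suffix t} stand in for the actual strings); greedy decoding corresponds to do_sample=False, which is also the default:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-chat-hf"    # placeholder; any causal LM works
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

prompt = "BEGINNING OF CONVERSATION: USER: {in} {suffix t} ASSISTANT:"
inputs = tokenizer(prompt, return_tensors="pt")

# Greedy decoding, the setting the attack is tuned for; switching to sampling
# (do_sample=True) changes the continuation and can change whether the attack lands.
output = model.generate(**inputs, do_sample=False, max_new_tokens=128)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```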

Reading Notes | Certifying LLM Safety against Adversarial Prompting

[Semantic Scholar] – [Code] – [Tweet] – [Video] – [Website] – [Slide] – [HuggingFace]

Change Logs:

  • 2023-12-07: First draft. The corresponding authors include Soheil Feizi (UMD) and Himabindu Lakkaraju (Harvard).

Overview

The intuition of the proposed certified LLM safety is quite simple: a complete sequence is labeled safe only if all of its subsequences are also labeled safe.

However, one issue with this notion of safety is that it relies on the capability of the safety classifier: if the classifier fails systematically, the certificate breaks down.
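
A minimal sketch of the subsequence-checking idea, assuming a black-box is_safe classifier and the suffix-erasure setting (erasing only trailing tokens, up to max_erase of them); the paper also certifies other attack modes:

```python
def certify_safe(tokens, is_safe, max_erase=20):
    """Label a prompt safe only if the full prompt and every subsequence obtained
    by erasing up to `max_erase` trailing tokens are judged safe by the classifier."""
    for d in range(max_erase + 1):
        subsequence = tokens if d == 0 else tokens[:-d]
        if not is_safe(subsequence):
            return False      # some subsequence looks harmful -> flag the whole prompt
    return True               # all checked subsequences are safe
```

Under this rule, a harmful prompt with an adversarial suffix of at most max_erase tokens is flagged whenever the classifier correctly flags the clean harmful prompt, which is the source of the certificate.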

Research Notes | Benchmarking LLM Safety

Problem Description

When receiving a prompt that queries for unsafe information (for example, toxic or profane content, or legal / medical advice), the LM may respond and cause harm in the physical world. There are several ways to diagnose LM weaknesses:

  • Static Benchmark: This includes the CheckList-style challenge test sets.
    • Benchmark Saturation and Annotation Bias
    • Concept Shift: For example, the same content previously thought non-toxic becomes toxic after a certain social event.
    • Covariate Shift: This includes (1) emerging unsafe categories and (2) the changing proportions of existing unsafe categories.
  • Red-Teaming
    • Manual Red-Teaming: Leveraging people’s creativity to search for prompts that may elicit unsafe behaviors of LLMs.
    • Automated Red-Teaming: Using automated search to push the model out of the region guarded by RLHF so that unsafe content will be generated.

Note that

  • The description above only considers the language model itself. There may be external input / output filters that assist the detection and mitigation of unsafe behaviors; these external filters should be studied separately.
  • The LM itself may or may not have gone through a safety-enhancement process. Methods to enhance safety include (1) SFT with additional (unsafe prompt, IDK response) pairs or (2) RLHF with additional (unsafe prompt, IDK response, unsafe response) triplets; here the IDK response is a generic refusal that LMs fall back to when encountering unsafe prompts. An illustrative example of both data formats is sketched below.
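
For concreteness, a purely illustrative sketch of the two data formats (the field names are hypothetical and not tied to any particular dataset):

```python
# SFT-style pair: teach the model to fall back to a generic IDK / refusal response.
sft_example = {
    "prompt": "<an unsafe prompt>",
    "response": "Sorry, I can't help with that request.",
}

# RLHF-style triplet: the IDK response should be preferred over the unsafe response.
rlhf_example = {
    "prompt": "<an unsafe prompt>",
    "chosen": "Sorry, I can't help with that request.",
    "rejected": "<an unsafe response>",
}
```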

Red Teaming

Resources

  • A comprehensive wiki and a collection of resources from Yaodong Yang @ PKU. He, together with Songchun Zhu, has also written a comprehensive survey on AI alignment; it has a Chinese version.

Reference

Safety Alignment

  1. [2310.12773] Safe RLHF: Safe Reinforcement Learning from Human Feedback
  2. [2307.04657] BeaverTails: Towards Improved Safety Alignment of LLM via a Human-Preference Dataset (PKU-Alignment)

    This work finds that separately annotating harmlessness and helpfulness (together with the safe RLHF algorithm proposed in 1) substantially outperforms Anthropic’s baselines; the authors claim that they are the first to do this. The authors also open-source two datasets: (1) an SFT (or classification) dataset used to train a safety classifier and (2) an RLHF dataset used to fine-tune an LM (Alpaca in the paper).


    The authors also curate a balanced test set covering 14 categories to measure several models’ safety (Figure 5); they find that aligned LLMs show much less variance across GPT-4 evaluation, human evaluation, and QA moderation. Here “QA moderation” is another measure of harmfulness: the degree to which a response mitigates the potential harm of a harmful prompt; the authors use a binary label for this. Specifically, rather than labeling each single sentence (prompt or response) by its own toxicity, the authors label whether a response addresses the prompt harmlessly.


    Note that the authors synthesize the 14 categories from 1 and 2 in “Taxonomy” and 1 in “Red Teaming.” The authors acknowledge that these categories are not MECE (mutually exclusive and collectively exhaustive).

    The authors release their models and datasets on HuggingFace hub:

    | # | Model Name | Note |
    | --- | --- | --- |
    | 1 | PKU-Alignment/alpaca-7b-reproduced | The reproduced Alpaca model. |
    | 2 | PKU-Alignment/beaver-dam-7b | A LLaMA-based QA moderation model. |
    | 3 | PKU-Alignment/beaver-7b-v1.0-reward | The static reward model used during RLHF. |
    | 4 | PKU-Alignment/beaver-7b-v1.0-cost | The static cost model used during RLHF. |
    | 5 | PKU-Alignment/beaver-7b-v1.0 | The Alpaca model that goes through the safe RLHF process based on 1. |

    | # | Dataset Name | Note |
    | --- | --- | --- |
    | 1 | PKU-Alignment/BeaverTails | A classification dataset with prompt, response, category, and is_safe columns; it could be used for 14 classes (if using category) or 2 classes (if using is_safe). |
    | 2 | PKU-Alignment/BeaverTails-single-dimension-preference | A preference dataset with prompt, response_0, response_1, and better_response_id (-1, 0, 1) columns. |
    | 3 | PKU-Alignment/BeaverTails-Evaluation | It only has prompt and category columns; it is not the test split of datasets 1 and 2. |
    | 4 | PKU-Alignment/PKU-SafeRLHF | A preference and binary classification dataset (N = 330K) with prompt, response_0, response_1, is_response_0_safe, is_response_1_safe, better_response_id, and safer_response_id columns; it has both training and test splits. |
    | 5 | PKU-Alignment/PKU-SafeRLHF-30K | A sampled version of 4 with both training and test splits. |
    | 6 | PKU-Alignment/PKU-SafeRLHF-10K | A further sampled version of 4 with only the training split available. |
    | 7 | PKU-Alignment/processed-hh-rlhf | A reformatted version of the Anthropic dataset for ease of use; the original dataset is formatted in plain text. |
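
    For example, the classification dataset in row 1 can be loaded with the datasets library; the split names vary, so the sketch below inspects them rather than assuming one:

    ```python
    from datasets import load_dataset

    # Columns (per the table above): prompt, response, category, is_safe.
    beavertails = load_dataset("PKU-Alignment/BeaverTails")
    print(beavertails)                          # inspect the available splits
    split = list(beavertails.keys())[0]
    example = beavertails[split][0]
    print(example["prompt"], example["is_safe"])
    ```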

Safety Benchmark

  1. [2308.01263] XSTest: A Test Suite for Identifying Exaggerated Safety Behaviours in Large Language Models (Röttger et al.): This work presents a small set of test prompts (available on GitHub) that could be used to test the safety of an LLM. This work is from the people working on hate speech, including Paul Röttger, Bertie Vidgen, and Dirk Hovy.
  2. [2308.09662] Red-Teaming Large Language Models using Chain of Utterances for Safety-Alignment (DeCLaRe Lab, SUTD): This work provides two datasets: (1) a set of hateful questions for safety benchmarking, and (2) a (prompt, blue conversation, red conversation) dataset for safety benchmarking.
  3. [2309.07045] SafetyBench: Evaluating the Safety of Large Language Models with Multiple Choice Questions (Tsinghua): This work provides a dataset of multiple-choice QA to evaluate the safety of an LLM across 7 predefined categories, including offensiveness, bias, physical health, mental health, illegal activities, ethics, and privacy.

OOD and Safety

  1. [2311.14743] A Baseline Analysis of Reward Models’ Ability To Accurately Analyze Foundation Models Under Distribution Shift (Scale AI)

Red Teaming

  1. [2209.07858] Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned (Ganguli et al., Anthropic).
  2. [2202.03286] Red Teaming Language Models with Language Models (Perez et al., DeepMind and NYU)

Taxonomy of Unsafe Behaviors

  1. [2206.08325] Characteristics of Harmful Text: Towards Rigorous Benchmarking of Language Models (Rauh et al., DeepMind)
  2. BBQ: A hand-built bias benchmark for question answering (Parrish et al., Findings 2022, NYU)

Controlled Text Generation

  1. ToxiGen: A Large-Scale Machine-Generated Dataset for Adversarial and Implicit Hate Speech Detection (Hartvigsen et al., ACL 2022)

    The authors propose a classifier-in-the-loop constrained decoding scheme that allows for the generation of benign and (implicitly) toxic content about 13 minority groups.

    Specifically, the authors adjust the token distribution at every step by adding a partial sequence’s neutral-class probability from a hate speech classifier, which mitigates toxicity. This makes the originally explicitly toxic content less toxic (from 66% to 43%) yet still implicitly toxic. Besides generating implicitly toxic content, this approach can also work with a benign prompt to generate benign content.

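    A hedged sketch of this classifier-in-the-loop decoding (the mixing weight alpha, the top-k candidate restriction, and the assumption that the classifier shares the LM’s tokenizer and puts the neutral class at index 0 are all simplifications of ToxiGen’s actual procedure):

    ```python
    import torch
    import torch.nn.functional as F

    def guided_next_token(lm, classifier, input_ids, alpha=1.0, top_k=50):
        """Pick the next token by mixing the LM's next-token distribution with a
        hate-speech classifier's neutral-class score for each candidate partial sequence."""
        with torch.no_grad():
            log_probs = F.log_softmax(lm(input_ids).logits[0, -1], dim=-1)
            candidates = log_probs.topk(top_k).indices        # restrict to the LM's top-k tokens

            scores = []
            for tok in candidates:
                extended = torch.cat([input_ids[0], tok.view(1)]).unsqueeze(0)
                # Assumed: classifier(extended) returns logits over {neutral, hateful},
                # with the neutral class at index 0.
                neutral_logp = F.log_softmax(classifier(extended).logits[0], dim=-1)[0]
                scores.append(log_probs[tok] + alpha * neutral_logp)

        return candidates[torch.stack(scores).argmax()]
    ```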

  2. [2310.14542] Evaluating Large Language Models on Controlled Generation Tasks (Sun et al., EMNLP)

    This paper shows that LLMs, including gpt-3.5-turbo, Falcon, Alpaca, and Vicuna, cannot be controlled to follow fine-grained signals such as numerical planning (for example, “generate a paragraph with five sentences”); they do well at controlling high-level signals, such as sentiment, topic, and enforcing specific keywords.

Adversarial Attack on LLM

  1. [2307.15043] Universal and Transferable Adversarial Attacks on Aligned Language Models
    • This paper proposes two ways to elicit unsafe behaviors of LLMs:
      • Producing Affirmative Responses: optimizing the input so that the model's response begins with an affirmative prefix such as "Sure, here is [prompt]", after which the model tends to continue with the expected unsafe content.
      • Greedy Coordinate Gradient (GCG)

        Given an input prompt x _ {1:n}, the algorithm iterates over the suffix tokens and finds replacements that decrease the loss. Specifically, for each token, the algorithm computes the gradient of the loss with respect to that token’s one-hot vector, picks the top-k candidate replacements, forms modified prompts by substituting candidates from the top-k set, and finally selects the modified prompt with the lowest loss.

    • In attacking vision models, it is well established that attacking distilled models is much easier than attacking the original models.

Toxicity Detection

  1. [2312.01648] Characterizing Large Language Model Geometry Solves Toxicity Detection and Generation
    • This paper proposes a method to attain almost perfect accuracy on the challenging civil_comments dataset. The authors manage to do so by deriving a set of features from an LLM from first principles and training a linear classifier on top of these features (a sketch of this general recipe follows this list).
    • Intrinsic Dimension (ID) could be used to characterize the likelihood that a prompt evades RLHF alignment; it could therefore serve as a proxy when engineering prompts for jailbreaking.

      The authors show (using the increased ID as a proxy for evading alignment) that prepending a relevant non-toxic sentence as a prefix makes the aligned LM more likely to generate toxic content.
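
A minimal sketch of the general recipe in 1, extracting features from an LLM and fitting a linear classifier on top (the GPT-2 backbone, the mean-pooling choice, and the logistic-regression probe are assumptions; the paper derives its features differently):

```python
import torch
from sklearn.linear_model import LogisticRegression
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")     # placeholder backbone, not the paper's
model = AutoModel.from_pretrained("gpt2")

def featurize(texts):
    """Mean-pooled last hidden states as simple per-text features."""
    feats = []
    with torch.no_grad():
        for text in texts:
            ids = tokenizer(text, return_tensors="pt", truncation=True)
            hidden = model(**ids).last_hidden_state[0]    # (seq_len, hidden_dim)
            feats.append(hidden.mean(dim=0).numpy())
    return feats

# Toy stand-ins; in practice texts / labels would come from e.g. the civil_comments dataset.
texts = ["thank you, this was really helpful", "some toxic comment here"]
labels = [0, 1]                                           # 0 = non-toxic, 1 = toxic
clf = LogisticRegression(max_iter=1000).fit(featurize(texts), labels)
print(clf.predict(featurize(["another comment"])))
```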