Reading Notes | Universal and Transferable Adversarial Attacks on Aligned Language Models

[Semantic Scholar] – [Code] – [Tweet] – [Video] – [Website] – [Slide] – [HuggingFace]

Change Logs:

  • 2023-12-07: First Draft.

Overview

This work appends an automatically optimized suffix to the input instruction so that the LM follows the unsafe instruction and generates unsafe content.

Specifically, suppose the length of {affirmation} is a; the algorithm does the following:
– Iterate over i = 1, 2, \cdots, t:
– Run a forward pass of the model on the string; the model outputs logits of shape (a, \vert \mathcal{V}\vert) for the target positions.
– Compute the cross-entropy loss between these logits and the true token IDs (think of it as a \vert \mathcal{V}\vert-class classification problem).
– Backpropagate the loss to the tokens in {suffix i-1}; guided by the gradients, we select token replacements and obtain {suffix i}.

Finally, we use the optimized {suffix t} at test time and hope that the model will generate the {affirmation} tokens.

# train
BEGINNING OF CONVERSATION: USER: {in} {suffix 0} ASSISTANT: {affirmation}
BEGINNING OF CONVERSATION: USER: {in} {suffix 1} ASSISTANT: {affirmation}
BEGINNING OF CONVERSATION: USER: {in} {suffix 2} ASSISTANT: {affirmation}
...
BEGINNING OF CONVERSATION: USER: {in} {suffix t-1} ASSISTANT: {affirmation}

# test
BEGINNING OF CONVERSATION: USER: {in} {suffix t} ASSISTANT:
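
Below is a minimal sketch of a single optimization step under this setup, assuming a HuggingFace-style causal LM. It greedily applies the single substitution with the most negative gradient, whereas the paper's GCG procedure samples a batch of top-k candidate swaps and keeps the one with the lowest loss.

```python
import torch
import torch.nn.functional as F

def suffix_step(model, input_ids, suffix_slice, target_slice):
    """One gradient-guided update of the adversarial suffix (simplified GCG sketch).

    input_ids:    (1, seq_len) token IDs of "USER: {in} {suffix} ASSISTANT: {affirmation}"
    suffix_slice: positions of the suffix tokens we are allowed to edit
    target_slice: positions of the {affirmation} tokens whose loss we minimize
    """
    embed_weights = model.get_input_embeddings().weight                     # (|V|, d)

    # One-hot representation of the tokens so we can take gradients w.r.t. them.
    one_hot = F.one_hot(input_ids[0], num_classes=embed_weights.size(0)).to(embed_weights.dtype)
    one_hot.requires_grad_(True)
    inputs_embeds = (one_hot @ embed_weights).unsqueeze(0)                  # (1, seq_len, d)

    logits = model(inputs_embeds=inputs_embeds).logits                      # (1, seq_len, |V|)

    # Cross-entropy between the logits at positions t-1 and the target tokens at positions t.
    shift_logits = logits[0, target_slice.start - 1 : target_slice.stop - 1]
    targets = input_ids[0, target_slice]
    loss = F.cross_entropy(shift_logits, targets)
    loss.backward()

    # Pick the suffix position and replacement token with the most negative gradient,
    # i.e., the single substitution expected to decrease the loss the most.
    grad = one_hot.grad[suffix_slice]                                       # (suffix_len, |V|)
    pos = grad.min(dim=-1).values.argmin()
    new_token = grad[pos].argmin()

    new_ids = input_ids.clone()
    new_ids[0, suffix_slice.start + pos] = new_token
    return new_ids, loss.item()
```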

Basics

PPO extends the classical policy gradient algorithm (and is therefore on-policy) by performing multiple update steps rather than only one. Suppose we have a reward model r _ \theta(x, y); PPO updates the LM parameters \phi so that the expected reward is maximized.

The following are the steps of PPO (taken from Hyung Won Chung’s talk):

  • Step 1: Obtaining an SFT model using the standard LM loss.
  • Step 2: Repeat the following:
    • Sampling: Sampling prompts from the datasets.
    • Rollout: Generating responses with the current version of LM \pi _ \phi ^ \mathrm{RL}.
    • Evaluation: Using the (fixed) reward model r _ \theta to score each of the responses from the previous step.
    • Optimization: Using the (prompt, continuation, score) triplets as a dataset to optimize the parameter (i.e., \phi) of the LM.

These steps are written concisely (yet confusingly) in the original paper as follows. The first boxed term is used to prevent overfitting to the reward function; the second boxed term is used to reduce the performance regression on standard benchmarks.
\mathbb{E} _ {(x, y) \sim D _ {\pi _ \phi ^ \mathrm{RL}}} \left[ r _ \theta(x, y) - \boxed{\beta \cdot \log \frac{\pi _ \phi ^ \mathrm{RL}(y\vert x)}{\pi^\mathrm{SFT}(y\vert x)}}\right] + \boxed{\gamma \cdot \mathbb{E} _ {x \sim D _ \mathrm{pretrain}} \left[ \log \pi _ \phi ^ \mathrm{RL}(x) \right]}
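
A hedged sketch of how the per-sample reward with the KL penalty (the first boxed term) might be computed; reward_model, policy, and sft_model are placeholders, and real implementations (e.g., trl) use per-token KL terms and clipped PPO updates rather than this simplified scalar.

```python
import torch
import torch.nn.functional as F

def sequence_logprob(model, input_ids, response_slice):
    """Sum of log-probabilities the model assigns to the response tokens."""
    with torch.no_grad():
        logits = model(input_ids).logits
    logprobs = F.log_softmax(logits, dim=-1)
    # The token at position t is predicted by the logits at position t - 1.
    picked = logprobs[0, response_slice.start - 1 : response_slice.stop - 1]
    targets = input_ids[0, response_slice]
    return picked.gather(-1, targets.unsqueeze(-1)).sum()

def penalized_reward(reward_model, policy, sft_model, input_ids, response_slice, beta=0.1):
    """r_theta(x, y) - beta * log(pi_RL(y|x) / pi_SFT(y|x)) for one (prompt, response) pair."""
    r = reward_model(input_ids).logits.squeeze()          # assumed scalar reward head
    kl_term = sequence_logprob(policy, input_ids, response_slice) \
            - sequence_logprob(sft_model, input_ids, response_slice)
    return r - beta * kl_term
```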

Method

This paper is motivated by the observation that an aligned LM will still generate unsafe content if we can make the first few words of its response something like “Sure, here is how $UNSAFE_CONTENT”.

Therefore, the idea is to disguise the input prompt with an automatically optimized suffix so that an aligned LM assigns as low a loss as possible to the affirmative response {affirmation}.

Note that selecting replacements by loss makes sense because RLHF maximizes the reward while staying as close as possible to the original SFT model.

Code Anatomy

The codebase is designed for chat models that involve different “roles” in the format of tuples. It is necessary to adapt the codebase to make it work with plain text models.

The most complicated part of the codebase is how the authors handle the different prompt templates of various language models; these messy details are all included in llm_attacks.minimal_gcg.string_utils.SuffixManager. What makes things more complicated is the fact that these string processing utilities in turn depend on the fastchat library.

Three key variables in the demo specific to the LLaMA2-Chat model are manager._control_slice, manager._loss_slice, and manager._target_slice. These three variables are derived from the hidden variables self._assistant_role_slice and self._user_role_slice; they are fixed throughout the attack.
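
A hedged sketch of how these slices might be used to compute the loss on the affirmative target at each attack step; the attribute names follow the demo notebook, while the exact shapes and call pattern below are assumptions.

```python
import torch.nn.functional as F

def target_loss(model, input_ids, manager):
    """Cross-entropy of the {affirmation} tokens given the full attack prompt.

    input_ids               : (1, seq_len) tokens produced by the SuffixManager
    manager._control_slice  : positions of the adversarial suffix being optimized
    manager._target_slice   : positions of the affirmative target, e.g. "Sure, here is ..."
    manager._loss_slice     : the target positions shifted left by one, i.e. the logits
                              that should predict the target tokens
    """
    logits = model(input_ids).logits                      # (1, seq_len, |V|)
    return F.cross_entropy(
        logits[0, manager._loss_slice, :],                # predictions for the target positions
        input_ids[0, manager._target_slice],              # the target token IDs themselves
    )
```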


The attack discussed in the paper works best with greedy decoding (the default approach in model.generate()). One may develop special decoding methods geared towards safety.

Reading Notes | Certifying LLM Safety against Adversarial Prompting

[Semantic Scholar] – [Code] – [Tweet] – [Video] – [Website] – [Slide] – [HuggingFace]

Change Logs:

  • 2023-12-07: First draft. The corresponding authors include Soheil Feizi (UMD) and Himabindu Lakkaraju (Harvard).

Overview

The intuition of the proposed certified LLM safety is quite simple: the complete sequence is safe if all of its subsequences are also safe.

However, one issue with this notion of safety is that it relies on the capability of the safety classifier: if the classifier systematically fails, then the certificate is broken.
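
A minimal sketch of this notion of safety, where is_safe stands in for the paper's safety classifier (an assumption here) and only trailing-token erasures are checked:

```python
def certify_safe(tokens, is_safe, max_erase=20):
    """Label the prompt safe only if the full prompt and every subsequence obtained
    by erasing up to `max_erase` trailing tokens are labeled safe by the classifier.

    As the note above points out, the certificate only holds if the classifier
    itself is reliable on all of these subsequences.
    """
    for k in range(0, min(max_erase, len(tokens)) + 1):
        subsequence = tokens[: len(tokens) - k]
        if not is_safe(subsequence):
            return False      # some erased version looks harmful -> flag the prompt
    return True
```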

Reading Notes | ToxiGen – A Large-Scale Machine-Generated Dataset for Adversarial and Implicit Hate Speech Detection

[Semantic Scholar] – [Code] – [Tweet] – [Video] – [Website] – [Slide]

Change Logs:

  • 2023-12-05: First draft.

Overview

The authors propose a method to automatically generate a balanced dataset (13 identity groups, both toxic and benign) of 14K (using ALICE) + 260K (using demonstrations) = 274K samples without explicit words, based on the following two observations:

  • It is hard to collect hard toxic content to augment the training sets of machine learning models, as overtly toxic content often co-occurs with a small set of explicit words.
  • Furthermore, the explicit mention of some identity groups (for example, Muslim) or certain language styles (for example, African-American English) is unfairly classified as toxic by existing models.

Method

ALICE

The authors incorporate a binary hate speech classifier’s score on the “hate” or “non-hate” class into the decoding process to encourage more hateful or more non-hateful generations given a prompt.

Originally, a hateful prompt will lead to a hateful continuation. However, with the classifier in the loop, the continuation’s hatefulness is mitigated yet not reversed, leading to implicit hate speech (i.e., hard toxic content).
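
A hedged sketch of classifier-in-the-loop decoding in this spirit: the paper formulates ALICE as constrained beam search, whereas the sketch below simply reweights the top-k next-token candidates with a toxicity classifier's score; toxicity_score is a placeholder.

```python
import torch
import torch.nn.functional as F

def guided_decode(lm, tokenizer, prompt, toxicity_score, steps=50, top_k=20, alpha=2.0):
    """Greedy decoding where each candidate token is rescored by a toxicity classifier.

    alpha > 0 pushes the continuation toward the classifier's "non-hate" class
    (set alpha < 0 to push toward "hate"); this is a simplification of ALICE's
    constrained beam search.
    """
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids
    for _ in range(steps):
        logits = lm(input_ids).logits[0, -1]
        log_probs = F.log_softmax(logits, dim=-1)
        cand_scores, cand_ids = log_probs.topk(top_k)

        best_id, best_score = None, float("-inf")
        for lp, tok in zip(cand_scores, cand_ids):
            text = tokenizer.decode(torch.cat([input_ids[0], tok.unsqueeze(0)]))
            score = lp.item() + alpha * toxicity_score(text)   # LM score + classifier score
            if score > best_score:
                best_id, best_score = tok, score

        input_ids = torch.cat([input_ids, best_id.view(1, 1)], dim=-1)
    return tokenizer.decode(input_ids[0])
```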

Demonstration

Another method the authors propose is manually collecting implicit hate speech from the web and then using it as demonstrations to obtain more text from GPT-3. This effort leads to 260K samples.

Experiments

  • Data Augmentation with ToxiGen Improves Accuracy on OOD Test Sets

    The authors further fine-tune HateBERT and ToxDectRoBERTa using the collected dataset and test them on social_bias_frames, SALT-NLP/ImplicitHate, and aps/dynahate. The authors observe improved accuracy after fine-tuning.


Reading Notes | Direct Preference Optimization – Your Language Model is Secretly a Reward Model

[Semantic Scholar] – [Code] – [Tweet] – [Video]

Change Logs:

  • 2023-12-04: First draft.

Overview

  • DPO belongs to a larger family of algorithms that directly optimize human preferences. The algorithm assumes there are always a winning response and a losing response; this is different from PPO, as the label now becomes discrete.
  • Using DPO alleviates the need for a dedicated library such as trl. The only change that needs to be made is the loss function (see the sketch below).
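
A minimal sketch of that loss function, following the DPO objective -\log \sigma\left(\beta\left[\log\frac{\pi_\theta(y_w \vert x)}{\pi_\mathrm{ref}(y_w \vert x)} - \log\frac{\pi_\theta(y_l \vert x)}{\pi_\mathrm{ref}(y_l \vert x)}\right]\right); the per-sequence log-probabilities are assumed to be precomputed tensors.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """DPO loss given per-sequence log-probabilities of the winning (chosen) and
    losing (rejected) responses under the policy and the frozen reference model."""
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Maximize the margin between winning and losing responses.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```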

Reference

  1. DPO Debate: Is RL needed for RLHF? – YouTube: This is an advanced video by Nathan Lambert.
  2. [2310.12036] A General Theoretical Paradigm to Understand Learning from Human Preferences (DeepMind)

    This is a theoretical paper that reveals the limitations of DPO.

    It shows two assumptions of RLHF: (1) pairwise comparisons could be substituted with pointwise rewards, and (2) an LM trained on the pointwise rewards will generalize from the collected data to OOD data.