Reading Notes | Universal and Transferable Adversarial Attacks on Aligned Language Models

[Semantic Scholar] – [Code] – [Tweet] – [Video] – [Website] – [Slide] – [HuggingFace]

Change Logs:

  • 2023-12-07: First Draft.

Overview

This work appends an automatically optimized suffix to the input instruction so that the LM follows the unsafe instruction and generates unsafe content.

Specifically, suppose the length of {affirmation} is a; the algorithm does the following for i = 1, 2, \cdots, t:
– Forward pass: run the model on the training string; the logits at the affirmation positions have shape (a, \vert \mathcal{V}\vert).
– Compute the cross-entropy loss between these logits and the true affirmation token IDs (think of it as a \vert \mathcal{V}\vert-class classification problem at each position).
– Backpropagate the loss to the tokens in {suffix i-1}; the gradients identify promising replacement tokens (those that most decrease the loss), which are used to obtain {suffix i}. A code sketch of this gradient step follows the prompt templates below.

Finally, we use the optimized {suffix t} at test time and hope that the model generates the {affirmation} tokens.

# train
BEGINNING OF CONVERSATION: USER: {in} {suffix 0} ASSISTANT: {affirmation}
BEGINNING OF CONVERSATION: USER: {in} {suffix 1} ASSISTANT: {affirmation}
BEGINNING OF CONVERSATION: USER: {in} {suffix 2} ASSISTANT: {affirmation}
...
BEGINNING OF CONVERSATION: USER: {in} {suffix t-1} ASSISTANT: {affirmation}

# test
BEGINNING OF CONVERSATION: USER: {in} {suffix t} ASSISTANT:
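
Below is a minimal PyTorch sketch of the gradient computation behind one iteration; it is an illustration rather than the authors' exact implementation, and it assumes a Hugging Face causal LM plus Python slice objects marking the suffix and affirmation positions (analogous to the _control_slice and _target_slice variables discussed later). The returned top-k token IDs would then be sampled as candidate replacements and re-scored with the actual loss.

import torch

def gcg_candidate_tokens(model, input_ids, suffix_slice, target_slice, top_k=256):
    # Build a one-hot matrix over the suffix tokens so gradients can flow back to the token choices.
    embed_weights = model.get_input_embeddings().weight                      # (|V|, d)
    suffix_ids = input_ids[suffix_slice]                                     # (s,)
    one_hot = torch.zeros(suffix_ids.shape[0], embed_weights.shape[0],
                          device=model.device, dtype=embed_weights.dtype)
    one_hot.scatter_(1, suffix_ids.unsqueeze(1), 1.0)
    one_hot.requires_grad_()

    # Splice the differentiable suffix embeddings into the (detached) full embeddings.
    embeds = model.get_input_embeddings()(input_ids.unsqueeze(0)).detach()   # (1, n, d)
    suffix_embeds = (one_hot @ embed_weights).unsqueeze(0)                   # (1, s, d)
    full_embeds = torch.cat([embeds[:, :suffix_slice.start],
                             suffix_embeds,
                             embeds[:, suffix_slice.stop:]], dim=1)

    # Cross-entropy over the affirmation ("target") positions; the logits at position i
    # predict the token at position i + 1, hence the shift by one.
    logits = model(inputs_embeds=full_embeds).logits                         # (1, n, |V|)
    targets = input_ids[target_slice]
    loss = torch.nn.functional.cross_entropy(
        logits[0, target_slice.start - 1 : target_slice.stop - 1], targets)
    loss.backward()

    # Tokens with the most negative gradient are the most promising replacements.
    return (-one_hot.grad).topk(top_k, dim=1).indices                        # (s, top_k)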

Basics

PPO extends the classical policy gradient algorithm (and is therefore on-policy) by performing multiple update epochs on the same batch of rollouts rather than a single step. Suppose we have a reward model r _ \theta(x, y); PPO updates the LM parameters \phi so that the expected reward is maximized.

The following are the steps of RLHF with PPO (taken from Hyung Won Chung’s talk):

  • Step 1: Obtaining an SFT model using the standard LM loss.
  • Step 2: Repeat the following:
    • Sampling: Sampling prompts from the datasets.
    • Rollout: Generating responses to these prompts with the current version of the LM \pi _ \phi ^ \mathrm{RL}.
    • Evaluation: Using the (fixed) reward model r _ \theta to score each of the responses from the last step.
    • Optimization: Using the (prompt, continuation, score) triplets as a dataset to optimize the LM parameters (i.e., \phi).

These steps are written concisely (yet confusingly) in the original paper as follows. The first boxed term is used to prevent overfitting to the reward function; the second boxed term is to reduce the performance regression on the standard benchmarks.
\mathbb{E} _ {(x, y) \sim D _ {\pi _ \phi ^ \mathrm{RL}}} \left[ r _ \theta(x, y) - \boxed{\beta \cdot \log \frac{\pi _ \phi ^ \mathrm{RL}(y\vert x)}{\pi^\mathrm{SFT}(y\vert x)}}\right] + \boxed{\gamma\, \mathbb{E} _ {x \sim D _ \mathrm{pretrain}} \left[ \log \pi _ \phi ^ \mathrm{RL}(x) \right]}
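
As a rough sketch, the per-response reward that PPO actually optimizes (ignoring the second boxed pretraining term) can be written as follows; the value of beta is illustrative.

def rlhf_reward(reward_model_score, logp_rl, logp_sft, beta=0.02):
    # Reward model score minus a KL-style penalty that keeps pi_RL close to pi_SFT;
    # the pretraining (second boxed) term is handled separately and omitted here.
    return reward_model_score - beta * (logp_rl - logp_sft)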

Method

This paper is motivated by the observation that an aligned LM will still generate unsafe content if we can make the first few words of its response something like “Sure, here is how to $UNSAFE_CONTENT”.

Therefore, the idea is to disguise the input prompt with an automatically optimized suffix so that the aligned LM assigns a low loss (i.e., a high likelihood) to the affirmative prefix of the response.

Note that selecting replacements by loss makes sense because RLHF maximizes the reward while staying as close as possible to the original SFT model.

Code Anatomy

The codebase is designed for chat models that involve different “roles” in the format of tuples. It is necessary to adapt the codebase to make it work with plain text models.

The most complicated part of the codebase is how the authors handle the different prompt templates of various language models; these messy details are all contained in llm_attacks.minimal_gcg.string_utils.SuffixManager. What makes things more complicated is that these string processing utilities in turn depend on the fastchat library.

Three key variables in the demo specific to the LLaMA2-Chat model are manager._control_slice, manager._loss_slice, and manager._target_slice. These three variables are derived from the hidden variables self._assistant_role_slice and self._user_role_slice; they are fixed throughout the attack.


The attack discussed in the paper works best with greedy decoding (the default approach in model.generate()). One may develop special decoding methods geared towards safety.
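
For reference, greedy decoding with the transformers API looks like the following (model, tokenizer, and prompt are assumed to be already defined):

# Greedy decoding: with do_sample=False (the default), generate() picks the argmax token at each step.
input_ids = tokenizer(prompt, return_tensors="pt").input_ids
output_ids = model.generate(input_ids, do_sample=False, max_new_tokens=128)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))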

Reading Notes | Certifying LLM Safety against Adversarial Prompting

[Semantic Scholar] – [Code] – [Tweet] – [Video] – [Website] – [Slide] – [HuggingFace]

Change Logs:

  • 2023-12-07: First draft. The corresponding authors include Soheil Feizi (UMD) and Himabindu Lakkaraju (Harvard).

Overview

The intuition of the proposed certified LLM safety is quite simple: the complete sequence is safe if all of its subsequences are also safe.

However, one issue with this notion of safety is that it relies on the capability of the safety classifier: if the classifier systematically fails, then the certificate is broken.
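
A brute-force sketch of the subsequence-checking idea follows (assuming a black-box is_harmful safety classifier; the paper's actual erase-and-check procedure is more structured, with dedicated modes for suffixes, insertions, and infusions):

from itertools import combinations

def is_certified_safe(tokens, is_harmful, max_erase=2):
    # Declare the prompt safe only if the classifier also finds every subsequence,
    # obtained by erasing up to `max_erase` tokens, safe.
    for k in range(max_erase + 1):
        for erased in combinations(range(len(tokens)), k):
            subsequence = [t for i, t in enumerate(tokens) if i not in erased]
            if is_harmful(subsequence):
                return False
    return True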

Reading Notes | ToxiGen – A Large-Scale Machine-Generated Dataset for Adversarial and Implicit Hate Speech Detection

[Semantic Scholar] – [Code] – [Tweet] – [Video] – [Website] – [Slide]

Change Log

  • 2023-12-05: First draft.

Overview

The authors propose a method to automatically generate a balanced dataset (13 identity groups, both toxic and benign) of 14K (using ALICE) + 260K (using demonstrations) = 274K samples without explicit words, based on the following two observations:

  • It is hard to collect hard toxic content to augment the training sets of machine learning models, as overtly toxic content often co-occurs with a small set of explicit words.
  • Furthermore, explicit mentions of some identity groups (for example, Muslim) or their language styles (for example, African-American English) are unfairly classified as toxic by existing models.

Method

ALICE

The authors incorporate a binary hate speech classifier’s score on the “hate” or “non-hate” class into the decoding process to encourage more hateful or more non-hateful generations given a prompt.

Originally, a hateful prompt leads to a hateful continuation. With the classifier in the loop, the continuation’s hatefulness is mitigated yet not reversed, yielding implicit hate speech (i.e., hard toxic content).
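
A simplified sketch in the spirit of this classifier-in-the-loop decoding (lm_logprob and class_prob are assumed black-box callables; the actual ALICE implementation uses constrained beam search rather than this greedy rescoring):

import math

def classifier_guided_step(prefix_tokens, candidate_tokens, lm_logprob, class_prob, lam=1.0):
    # Rescore each candidate next token by the LM log-probability of the extended
    # sequence plus a weighted log-probability of the desired class (e.g., "hate")
    # under the hate-speech classifier, then pick the best-scoring token.
    scored = []
    for token in candidate_tokens:
        sequence = prefix_tokens + [token]
        score = lm_logprob(sequence) + lam * math.log(class_prob(sequence))
        scored.append((score, token))
    return max(scored, key=lambda pair: pair[0])[1]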

Demonstration

Another method the authors propose is manually collecting implicit hate speech from the web and then using it as demonstrations to obtain more texts from GPT-3. This effort leads to 260K samples.

Experiments

  • Data Augmentation with ToxiGen Improves Accuracy on OOD Test Sets

    The authors further fine-tune HateBERT and ToxDectRoBERTa using the collected dataset and test them on social_bias_frames, SALT-NLP/ImplicitHate, and aps/dynahate. The authors observe improved accuracy after fine-tuning.


Reading Notes | Direct Preference Optimization – Your Language Model is Secretly a Reward Model

[Semantic Scholar] – [Code] – [Tweet] – [Video]

Change Logs:

  • 2023-12-04: First draft.

Overview

  • DPO belongs to a larger family of algorithms that directly optimize human preferences. The algorithm assumes there is always a winning response and a losing response; this is different from PPO, as the label now becomes discrete.
  • Using DPO alleviates the need for a dedicated library such as trl: the only change that needs to be made is the loss function (see the sketch below).
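
A minimal PyTorch sketch of the DPO loss, assuming the sequence-level log-probabilities of the winning (w) and losing (l) responses have already been computed under the policy being trained and the frozen reference model; beta is illustrative:

import torch.nn.functional as F

def dpo_loss(logp_policy_w, logp_policy_l, logp_ref_w, logp_ref_l, beta=0.1):
    # The policy is rewarded for increasing its margin between the winning and
    # losing responses relative to the reference model's margin.
    policy_margin = logp_policy_w - logp_policy_l
    reference_margin = logp_ref_w - logp_ref_l
    return -F.logsigmoid(beta * (policy_margin - reference_margin)).mean()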

Reference

  1. DPO Debate: Is RL needed for RLHF? – YouTube: This is an advanced video by Nathan Lambert.
  2. [2310.12036] A General Theoretical Paradigm to Understand Learning from Human Preferences (DeepMind)

    This is a theoretical paper that reveals the limitations of DPO.

    It shows two assumptions of RLHF: (1) pairwise comparison could be substituted with pointwise rewards, and (2) an LM trained on the pointwise rewards will generalize from collected data to OOD data.

Reading Notes | Unmasking and Improving Data Credibility – A Study with Datasets for Training Harmless Language Models

[Semantic Scholar] – [Code] – [Tweet] – [Video] – [Website] – [Slide]

Change Logs:

  • 2023-11-25: First draft. This work serves as a demo of the startup company of several of the authors (Zhaowei Zhu, Jiaheng Wei, and Hao Cheng; all of them are from UCSC); the corresponding author (i.e., Yang Liu @ UCSC) is the leader of ByteDance’s responsible AI team.

    However, the code was last updated 2023-08-24.

Overview

This paper proposes an elegant framework for (1) evaluating the overall dataset quality and (2) detecting individual label errors. The proposed approach only relies on embeddings.

Method

The authors start with the general noise transition matrix \mathbf{T} \in \mathbb{R} ^ {K \times K}, where each entry \mathbf{T} _ {ij} := \Pr(\tilde{y}=j \vert y = i; \mathbf{x}) indicates the probability that the underlying true label i appears as the noisy label j.

The following derivation depends on a hypothesis from the authors: each sample’s 2 nearest neighbors (2-NN) share the sample’s underlying true label. The authors call this hypothesis k-NN clusterability.

Overall Dataset Quality

As the noisy dataset \tilde{D} is free from noise when \mathbf{T} is an identity matrix, the overall quality of a dataset could be written as follows. The authors have proved that 0\leq \Psi(\tilde{D}, D) \leq 1 and it is 0 when \mathbf{T} is a permutation matrix.
\Psi(\tilde{D}, D) = 1 - \frac{1}{\sqrt{2K}} \mathbb{E} _ \mathbf{x} \Vert \mathbf{T}(\mathbf{x}) - \mathbf{I}\Vert _ F
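
A small numerical sketch of this formula (the helper below is illustrative and not from the authors' codebase):

import numpy as np

def dataset_quality(transition_matrices):
    # Psi = 1 - (1 / sqrt(2K)) * E_x ||T(x) - I||_F, estimated by averaging the
    # Frobenius norm over per-sample (or per-group) transition matrices.
    K = transition_matrices[0].shape[0]
    avg_norm = np.mean([np.linalg.norm(T - np.eye(K), "fro") for T in transition_matrices])
    return 1.0 - avg_norm / np.sqrt(2 * K)

# Sanity checks: a clean dataset (identity T) scores 1; a fully permuted one scores 0.
print(dataset_quality([np.eye(3)]))                      # 1.0
print(dataset_quality([np.roll(np.eye(3), 1, axis=1)]))  # 0.0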

Detecting Individual Label Errors

For a group of samples with the noisy label j, we could obtain a vector where each entry counts the appearances of the corresponding label among the sample’s k-NN. For example, if we are working on hate vs. non-hate classification and a sample’s 3-NN are hate, hate, and non-hate, then the count vector is \hat{\mathbf{y}}=[1, 2]^T (ordering the classes as non-hate, hate).

  • Step 1: Scoring each sample using the cosine similarity of \hat{\mathbf{y}} and \mathbf{e} _ j: \frac{\hat{\mathbf{y}}^T \mathbf{e} _ j}{\Vert \hat{\mathbf{y}} \Vert _ 2 \Vert \mathbf{e} _ j \Vert _ 2}.
  • Step 2: Choosing the threshold below which a label cannot be trusted: \Pr(y = j \vert \tilde{y} = j) = \frac{\Pr(\tilde{y}=j\vert y = j) \cdot \Pr(y=j)}{\Pr(\tilde{y} = j)}, where the terms in the numerator could be estimated from \mathbf{T} and the denominator is easy to read off from the dataset \tilde{D}. Any sample whose score is lower than the threshold \Pr(y = j\vert \tilde{y}=j) has an untrustworthy label.
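
A toy sketch of the scoring in Step 1, reproducing the 3-NN example above (the helper is illustrative, not the authors' implementation):

import numpy as np

def knn_label_score(knn_labels, noisy_label, num_classes):
    # Count how often each class appears among the sample's k nearest neighbors,
    # then measure agreement with the sample's own noisy label via cosine similarity.
    y_hat = np.bincount(knn_labels, minlength=num_classes).astype(float)
    e_j = np.zeros(num_classes)
    e_j[noisy_label] = 1.0
    return float(y_hat @ e_j / (np.linalg.norm(y_hat) * np.linalg.norm(e_j)))

# The example above: with classes ordered (non-hate=0, hate=1), a sample labeled "hate"
# whose 3-NN are hate, hate, and non-hate has the count vector [1, 2].
print(knn_label_score(np.array([1, 1, 0]), noisy_label=1, num_classes=2))  # ~0.894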

Estimating Noise Transition Matrix

The above two sections both rely on an accurate estimation of \mathbf{T}. The authors show that it is possible (with some relaxations) to estimate it by computing the label consensus of up to 2-NN for each sample in the dataset \tilde{D}.

Experiments

All experiments are based on embeddings from sentence-transformers/all-mpnet-base-v2.

  • The authors sample 1000 samples flagged by the algorithm and another 1000 unflagged samples. After verifying these 2000 samples, annotators agreed with 415 of the 1000 flagged samples and additionally flagged 189 of the 1000 unflagged samples. This leads to the statistics shown below. Interestingly, the authors report the statistics differently by computing 415 / 604 ≈ 0.687, which corresponds to the recall on annotator-flagged samples.
import numpy as np
from sklearn.metrics import classification_report

y_pred = np.concatenate([np.ones(1000), np.zeros(1000)]) # flagged by algorithm
y_true = np.concatenate([np.ones(415), np.zeros(585), np.ones(189), np.zeros(811)]) # flagged by experts

print(classification_report(y_true=y_true, y_pred=y_pred))
# result
#               precision    recall  f1-score   support
# 
#          0.0       0.81      0.58      0.68      1396
#          1.0       0.41      0.69      0.52       604
# 
#     accuracy                           0.61      2000
#    macro avg       0.61      0.63      0.60      2000
# weighted avg       0.69      0.61      0.63      2000
  • After cleaning label errors and fine-tuning BERT and GPT2 on different datasets, the test scores show that the proposed algorithm (i.e., Docta) consistently improves model performance despite the smaller sizes of the Docta training sets.



Reading Notes | AutoGen – Enabling Next-Gen LLM Applications via Multi-Agent Conversation

Overview

Qingyun Wu’s Talk

  • The interaction between the users and the AutoGen system is critical for the system to be useful; fully autonomous systems are not trustworthy. For example, if the system does not deliver the outcome the user wants, we cannot know which step went wrong.
  • For the system to be useful, the base models matter: we cannot use a very weak LM as one of the AI agents.
  • There are currently (as of 2023-11-01) no safety measures to make sure the system does not generate undesirable content.

Reading Notes | Towards Understanding Chain-of-Thought Prompting – An Empirical Study of What Matters

[Semantic Scholar] – [Code] – [Tweet] – [Video] – [Website] – [Slide] – [Poster]

Change Logs:

  • 2023-10-20: First draft. The paper appears at ACL 2023 as the best paper honorable mention.

Method

  • The experiments of this paper were done on text-davinci-002 with greedy decoding (temperature 0). The datasets they work on are quite small due to the manual effort required.
  • The paper focuses on QA and arithmetic reasoning tasks; the authors introduce two concepts:

    • Bridging Objects
    • Language Template
  • The authors define intermediate F1 scores for bridging objects. It is likely that the authors only accept generations that satisfy the predefined template and compute these metrics on them.
  • Observations:

    • The correctness of reasoning during CoT is not important.
    • The query should be (1) relevant and (2) follow the order of the reasoning steps.
  • Additional Observations:

    • CoT does not make LLMs better; it unlocks an ability already learned by LLMs during pre-training. For example, the conclusions drawn on text-davinci-002 do not apply to Flan-PaLM; this is because Flan-PaLM has been fine-tuned on the two tasks.

      Given limited resources and the ability to fine-tune the model, we should add more data to pre-training or instruction tuning to improve the model rather than focusing on specific prompt-engineering tricks.


Reading Notes | Text Embeddings Reveal (Almost) As Much As Text

[Semantic Scholar] – [Code] – [Tweet] – [Video] – [Website] – [Slide]

Change Logs:

  • 2023-10-18: First draft. This paper appears at EMNLP 2023. This paper is a work by John X. Morris. It comes with an easy-to-use library that can invert OpenAI embeddings.

Overview

The authors assume an attacker has access to (1) a compromised vector database, and (2) a black-box embedding model \phi(\cdot) (for example, OpenAI’s embedding API). The attacker starts from an embedding and an empty string and iteratively reconstructs the original text corresponding to that embedding; the method proposed in the paper manages to recover strings of up to 32 tokens.

The main motivation of this paper is privacy.


Reference

  1. [2211.00053] Generating Sequences by Learning to Self-Correct (Welleck et al.): This is the main inspiration of the main paper.

    This method relates to other recent work generating text through iterative editing (Lee et al., 2018; Ghazvininejad et al., 2019). Especially relevant is Welleck et al. (2022), which proposes to train a text-to-text ‘self-correction’ module to improve language model generations with feedback.

  2. Decoding a Neural Retriever’s Latent Space for Query Suggestion (Adolphs et al., EMNLP 2022)

Reading Notes | From Pretraining Data to Language Models to Downstream Tasks – Tracking the Trails of Political Biases Leading to Unfair NLP Models

[Semantic Scholar] – [Code] – [Tweet] – [Video] – [Website] – [Slide]

Change Logs:

  • 2023-10-12: First draft. This paper is one of the 3 best papers in ACL 2023.

Method

Political Leanings of LMs

The authors use the existing political compass test to measure an LM’s political leaning. A political compass test is a questionnaire that consists of 62 questions; for each question, the respondent selects one of “Strongly Agree,” “Agree,” “Neutral,” “Disagree,” or “Strongly Disagree.” Then, the respondent’s political leaning can be deterministically projected onto a plane spanned by an economic axis (x-axis: left vs. right) and a social axis (y-axis: libertarian vs. authoritarian).

To study their political leanings, the authors design prompts and separate experimental protocols for encoder-only (for example, BERT) and decoder-only (for example, GPT) LMs, summarized in the table below. More importantly, the authors further pre-train RoBERTa and GPT-2 on partisan political corpora collected by previous works ([1] and [2]) and measure the following:

  • How the pre-training corpus influences the political leanings.
  • The dynamics of political leanings during continued pre-training.

Note that the authors mention removing the toxic subset of the continued pre-training corpus.

  • Note: This practice is unnecessary, as toxicity is unlikely to be a confounder for political leaning: toxic content is uniformly distributed rather than skewed towards one specific political leaning. Worse, the hate speech detector itself may have political bias.
| Model type | Prompt | Method |
| --- | --- | --- |
| Encoder-only | "Please respond to the following statement: [statement] I <MASK> with this statement." | The ratio of positive to negative lexicons among the top-10 suggestions for <MASK>. |
| Decoder-only | "Please respond to the following statement: [statement]\n Your response:" | An off-the-shelf BART-based model fine-tuned on MNLI (the specific model is not given in the paper); manually verifying 110 responses shows 97% accuracy among 3 annotators (\kappa=0.85). |
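
A rough sketch of the encoder-only probing (the model choice roberta-base and the tiny lexicons below are illustrative assumptions, not the paper's exact setup):

from transformers import pipeline

fill_mask = pipeline("fill-mask", model="roberta-base")

# Illustrative lexicons; the paper uses curated positive/negative word lists.
POSITIVE_LEXICONS = {"agree", "concur", "sympathize"}
NEGATIVE_LEXICONS = {"disagree", "differ", "dissent"}

statement = "The government should regulate large corporations."
prompt = (f"Please respond to the following statement: {statement} "
          f"I {fill_mask.tokenizer.mask_token} with this statement.")

# Count positive vs. negative lexicons among the top-10 <MASK> suggestions.
top10 = [r["token_str"].strip().lower() for r in fill_mask(prompt, top_k=10)]
positive = sum(token in POSITIVE_LEXICONS for token in top10)
negative = sum(token in NEGATIVE_LEXICONS for token in top10)
print(top10, positive, negative)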

Downstream Tasks

The authors study how fine-tuning LMs of different political leanings on the same dataset leads to different fairness measurements on the hate speech classification task [3] and the misinformation classification task [4]. Specifically, fairness in hate speech classification and in misinformation classification concerns the identity groups and the sources of the texts, respectively.

Experiments

  • LMs show different political leanings.


  • The (continued) pre-training corpus has an influence on the political leanings; these corpora can be categorized by political leaning and time (specifically, pre-Trump and post-Trump).


  • For downstream tasks

    • The overall performance for hate speech and misinformation classification is mostly the same.
    • Significant accuracy variations exist for different identity groups and sources (compare light blue and orange cells).
  • Note: It is not straightforward to draw convincing conclusions solely from Table 4; the authors’ claim of unfairness in downstream tasks needs stronger support.


Reference

  1. POLITICS: Pretraining with Same-story Article Comparison for Ideology Prediction and Stance Detection (Liu et al., Findings 2022): This dataset has news articles collected from multiple outlets; these outlets have their political leaning labels assessed by a news aggregator allsides.com (Wikipedia).
  2. What Sounds “Right” to Me? Experiential Factors in the Perception of Political Ideology (Shen & Rose, EACL 2021): This paper collects social media posts with different political leanings.
  3. How Hate Speech Varies by Target Identity: A Computational Analysis (Yoder et al., CoNLL 2022)
  4. “Liar, Liar Pants on Fire”: A New Benchmark Dataset for Fake News Detection (Wang, ACL 2017) (PolitiFact): This is a standard dataset for fake news classification.

Reading Notes | Faithful Low-Resource Data-to-Text Generation through Cycle Training

[Semantic Scholar] – [Code] – [Tweet] – [Video] – [Website] – [Slide] – [Poster]

Change Logs:

  • 2023-10-06: First draft. The paper appears at ACL 2023.

Method

The cycle training involves two models: a data-to-text model \mathcal{M} _ \text{D2T} and a text-to-data model \mathcal{M} _ \text{T2D}; they are both initialized as `google/t5-base`; this base model empirically shows an edge in the WebNLG 2020 competition for RDF-to-text generation.

The proposed approach is similar to self-training in the text-generation domain. Specifically, there are three datasets: paired texts and data, unpaired data D, and unpaired texts T.

  • Initialization: Fine-tuning \mathcal{M} _ \text{D2T} and \mathcal{M} _ \text{T2D} using the paired dataset; the data is converted into linearized triplets.
  • Repeating the following for multiple epochs: the number of epochs in the paper is set to 50. At epoch k, we do the following:
    • Generating text \hat{T} =\mathcal{M} _ \text{D2T} ^ {(k-1)}(D) and data \hat{D}=\mathcal{M} _ \text{T2D} ^ {(k-1)}(T) with models from epoch (k-1).
    • Fine-tuning models with pseudo pairs (D, \hat{T}) and (\hat{D}, T). Specifically, we do the following:
      • \mathcal{M} _ \text{D2T} ^{(k)} \leftarrow \mathrm{FineTune}(\mathcal{M} _ \text{D2T} ^{(k-1)}, (\hat{D}, T)); this step tries to reconstruct texts T from the intermediate \hat{D}.
      • \mathcal{M} _ \text{T2D} ^{(k)} \leftarrow \mathrm{FineTune}(\mathcal{M} _ \text{T2D} ^{(k-1)}, (D, \hat{T})); this step tries to reconstruct data D from the intermediate \hat{T}.

Note that the difference between this scheme and self-training is that, in self-training, we use labels inferred by a model to train that same model. Here, we do not use the generated pairs (D, \hat{T}) from \mathcal{M} _ \text{D2T} to fine-tune itself; rather, we leverage a second model \mathcal{M} _ \text{T2D} to generate the training data for \mathcal{M} _ \text{D2T}.
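
A compact sketch of the loop described above; fine_tune and generate are assumed helper functions wrapping a seq2seq model such as google/t5-base:

def cycle_training(d2t, t2d, paired_pairs, unpaired_data, unpaired_text,
                   fine_tune, generate, epochs=50):
    # Initialization: fine-tune both models on the small paired set (data -> text and text -> data).
    d2t = fine_tune(d2t, [(d, t) for (d, t) in paired_pairs])
    t2d = fine_tune(t2d, [(t, d) for (d, t) in paired_pairs])
    for _ in range(epochs):
        # Generate pseudo targets with the previous-epoch models.
        text_hat = [generate(d2t, d) for d in unpaired_data]      # \hat{T}
        data_hat = [generate(t2d, t) for t in unpaired_text]      # \hat{D}
        # Each model is trained to reconstruct the *real* side of the cycle
        # from the other model's intermediate output.
        d2t = fine_tune(d2t, list(zip(data_hat, unpaired_text)))  # \hat{D} -> T
        t2d = fine_tune(t2d, list(zip(text_hat, unpaired_data)))  # \hat{T} -> D
    return d2t, t2d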

From the experiment results, we could see:

  • The low-resource cycle training has strong performance on par with full-scale fine-tuning.
  • The small set of paired texts is important: the low-resource setting consistently outperforms the unsupervised setting.
  • Pretraining does not help much if the paired datasets are of small scale.


Additional Notes

  • Prerequisite

    The unpaired data and text corpus should have at least 50% overlap in terms of entities to obtain a reasonable level of faithfulness.


  • Automatic Faithfulness Evaluation

    The PARENT metric [1] used in this work highly correlates with human annotations; this metric is specially designed for table-to-text tasks.

Reference

  1. Handling Divergent Reference Texts when Evaluating Table-to-Text Generation (Dhingra et al., ACL 2019)