Reading Notes | Universal and Transferable Adversarial Attacks on Aligned Language Models

[Semantic Scholar] – [Code] – [Tweet] – [Video] – [Website] – [Slide] – [HuggingFace]

Change Logs:

  • 2023-12-07: First Draft.

Overview

This work appends an automatically optimized suffix to the input instruction so that the LM follows the unsafe instruction and generates unsafe content.

Specifically, suppose the length of {affirmation} is a; the algorithm does the following for i = 1, 2, \cdots, t:

  • Run a forward pass on the i-th training string below; the model outputs logits of shape (a, \vert \mathcal{V}\vert) at the {affirmation} positions.
  • Compute the cross-entropy loss between these logits and the true {affirmation} token IDs (think of it as a \vert \mathcal{V}\vert-class classification problem).
  • Backpropagate the loss to the tokens in {suffix i-1} and select replacement tokens guided by the gradient to obtain {suffix i}.

Finally, we put the optimized {suffix t} to the test and hope that the model generates the {affirmation} tokens; a schematic sketch of one update step follows the prompt templates below.

# train
BEGINNING OF CONVERSATION: USER: {in} {suffix 0} ASSISTANT: {affirmation}
BEGINNING OF CONVERSATION: USER: {in} {suffix 1} ASSISTANT: {affirmation}
BEGINNING OF CONVERSATION: USER: {in} {suffix 2} ASSISTANT: {affirmation}
...
BEGINNING OF CONVERSATION: USER: {in} {suffix t-1} ASSISTANT: {affirmation}

# test
BEGINNING OF CONVERSATION: USER: {in} {suffix t} ASSISTANT:
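Below is a minimal, schematic sketch of one such gradient-guided update (not the authors' exact GCG implementation), assuming a HuggingFace causal LM and that suffix_slice and target_slice are Python slice objects over the tokenized training string.

```python
import torch
import torch.nn.functional as F

def suffix_replacement_candidates(model, input_ids, suffix_slice, target_slice, top_k=256):
    """Return top-k candidate replacement tokens for every suffix position,
    ranked by the gradient of the affirmation loss w.r.t. the one-hot suffix."""
    embed_matrix = model.get_input_embeddings().weight                # (|V|, d)
    one_hot = F.one_hot(input_ids[suffix_slice], embed_matrix.size(0)).float()
    one_hot.requires_grad_(True)

    # Build input embeddings with a differentiable path through the suffix only.
    embeds = model.get_input_embeddings()(input_ids.unsqueeze(0)).detach()
    suffix_embeds = (one_hot @ embed_matrix).unsqueeze(0)
    full_embeds = torch.cat([embeds[:, :suffix_slice.start],
                             suffix_embeds,
                             embeds[:, suffix_slice.stop:]], dim=1)
    logits = model(inputs_embeds=full_embeds).logits[0]

    # Cross-entropy over the {affirmation} tokens; logits are shifted by one
    # because position i predicts token i + 1.
    loss = F.cross_entropy(
        logits[target_slice.start - 1 : target_slice.stop - 1],
        input_ids[target_slice],
    )
    loss.backward()

    # Tokens whose one-hot gradient is most negative are the most promising swaps.
    return (-one_hot.grad).topk(top_k, dim=1).indices                 # (suffix_len, top_k)
```

In the full algorithm, a batch of candidate prompts is then formed by sampling from these candidates, and the candidate with the lowest loss becomes {suffix i}.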

Basics

PPO extends the classical policy gradient algorithm (and is therefore on-policy) by taking multiple update steps rather than only one. Suppose we have a reward model r _ \theta(x, y); PPO updates the LM parameters \phi so that the expected reward is maximized.

The following are the steps of PPO (taken from Hyung Won Chung's talk):

  • Step 1: Obtaining an SFT model using the standard LM loss.
  • Step 2: Repeat the following:
    • Sampling: Sampling prompts from the datasets.
    • Rollout: Generating responses with the current version of the LM \pi _ \phi ^ \mathrm{RL}.
    • Evaluation: Using the (fixed) reward model r _ \theta to score each of the responses from the last step.
    • Optimization: Using the (prompt, continuation, score) triplets as a dataset to optimize the parameters (i.e., \phi) of the LM.

These steps are written concisely (yet confusingly) in the original paper as follows. The first boxed term prevents overfitting to the reward function; the second boxed term reduces the performance regression on standard benchmarks.
\mathbb{E} _ {(x, y) \sim D _ {\pi _ \phi ^ \mathrm{RL}}} \left[ r _ \theta(x, y) - \boxed{\beta \cdot \log \frac{\pi _ \phi ^ \mathrm{RL}(y\vert x)}{\pi^\mathrm{SFT}(y\vert x)}}\right] + \boxed{\gamma\, \mathbb{E} _ {x \sim D _ \mathrm{pretrain}} \left[ \log \pi _ \phi ^ \mathrm{RL}(x) \right]}
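As a quick sanity check on how these terms combine, here is a toy numerical sketch (not from the paper); the coefficients \beta and \gamma and all tensor values are made-up placeholders.

```python
import torch

beta, gamma = 0.02, 0.5                        # assumed coefficients
r = torch.tensor([1.3, 0.7])                   # r_theta(x, y) for two sampled completions
logp_rl = torch.tensor([-42.0, -37.5])         # log pi_phi^RL(y|x)
logp_sft = torch.tensor([-40.0, -38.0])        # log pi^SFT(y|x)
logp_pretrain = torch.tensor([-55.0, -60.0])   # log pi_phi^RL(x) on pretraining text

# The first boxed term keeps the policy close to the SFT model; the second boxed
# term preserves performance on the pretraining distribution.
objective = (r - beta * (logp_rl - logp_sft)).mean() + gamma * logp_pretrain.mean()
print(objective)  # the quantity maximized with respect to phi
```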

Method

This paper is motivated by the observation that an aligned LM will still generate unsafe content if we can force the first few words of its response to be something like "Sure, here is how $UNSAFE_CONTENT".

Therefore, the idea is to disguise the input prompt with an automatically optimized suffix so that the aligned LM assigns a low loss to the target {affirmation}, i.e., it becomes likely to begin its response with the affirmative phrase.

Note that selecting replacements by loss makes sense because RLHF maximizes the reward while staying close to the original SFT model.

Code Anatomy

The codebase is designed for chat models that involve different “roles” in the format of tuples. It is necessary to adapt the codebase to make it work with plain text models.

The most complicated part of the codebase is how the authors handle the different prompt templates of various language models; these messy details are all included in llm_attacks.minimal_gcg.string_utils.SuffixManager. What makes things more complicated is that these string-processing utilities in turn depend on the fastchat library.

Three key variables in the demo specific to the LLaMA2-Chat model are manager._control_slice, manager._loss_slice, and manager._target_slice. These three variables are derived from the hidden variables self._assistant_role_slice and self._user_role_slice; they are fixed throughout the attack.


The attack discussed in the paper works best with greedy decoding (the default approach in model.generate()). One may develop special decoding methods geared towards safety.

Reading Notes | Certifying LLM Safety against Adversarial Prompting

[Semantic Scholar] – [Code] – [Tweet] – [Video] – [Website] – [Slide] – [HuggingFace]

Change Logs:

  • 2023-12-07: First draft. The corresponding authors include Soheil Feizi (UMD) and Himabindu Lakkaraju (Harvard).

The intuition of the proposed certified LLM safety is quite simple: the complete sequence is safe if all of its subsequences are also safe.

However, one issue with this notion of safety is that it relies on the capability of the safety classifier: if the classifier systematically fails, then the certificate is broken.
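A schematic sketch of this subsequence-checking idea in its simplest (suffix-erasing) form is below; is_safe stands in for a learned safety classifier, and the actual procedure and guarantees in the paper are more involved.

```python
def certify_safe(tokens, is_safe, max_erase):
    """Label the prompt safe only if the prompt and every version of it with up
    to max_erase trailing tokens erased are all labeled safe by the classifier."""
    for k in range(max_erase + 1):
        if not is_safe(tokens[: max(len(tokens) - k, 0)]):
            return False
    return True

# Dummy usage: a keyword lookup stands in for the safety classifier.
harmful_keywords = {"bomb"}
is_safe = lambda toks: not any(t in harmful_keywords for t in toks)
print(certify_safe("tell me a joke please".split(), is_safe, max_erase=3))  # True
```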

Reading Notes | Direct Preference Optimization – Your Language Model is Secretly a Reward Model

[Semantic Scholar] – [Code] – [Tweet] – [Video]

Change Logs:

  • 2023-12-04: First draft.

Overview

  • DPO belongs to a larger family of algorithms that directly optimize human preferences. The algorithm assumes there is always a winning response and a losing response; this differs from PPO in that the label is now discrete.
  • Using DPO alleviates the need for a dedicated library such as trl; the only change that needs to be made is the loss function (a minimal sketch is given below).
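The sketch below shows the DPO loss, assuming the (summed) log-probabilities of the winning and losing completions under the policy and the frozen reference (SFT) model have already been computed.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_logp_w, policy_logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO loss for a batch of (winning, losing) completion pairs."""
    policy_logratio = policy_logp_w - policy_logp_l
    ref_logratio = ref_logp_w - ref_logp_l
    return -F.logsigmoid(beta * (policy_logratio - ref_logratio)).mean()

# Toy values for a batch of two preference pairs.
loss = dpo_loss(torch.tensor([-12.3, -8.1]), torch.tensor([-14.0, -9.5]),
                torch.tensor([-12.0, -8.4]), torch.tensor([-13.1, -9.0]))
```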

Reference

  1. DPO Debate: Is RL needed for RLHF? – YouTube: This is an advanced video by Nathan Lambert.
  2. [2310.12036] A General Theoretical Paradigm to Understand Learning from Human Preferences (DeepMind)

    This is a theoretical paper that reveals the limitations of DPO.

    It shows two assumptions of RLHF: (1) pairwise comparison could be substituted with pointwise rewards, and (2) an LM trained on the pointwise rewards will generalize from collected data to OOD data.

Research Notes | Benchmarking LLM Safety

Problem Description

When receiving a prompt that queries for unsafe information (for example, toxic, profane, or legal / medical information), the LM may respond and cause harm in the physical world. There are several ways to diagnose such weaknesses:

  • Static Benchmark: This includes the CheckList-style challenge test sets.

    • Benchmark Saturation and Annotation Bias
    • Concept Shift: For example, content previously thought non-toxic may become toxic after a certain social event.
    • Covariate Shift: This includes (1) the emerging unsafe categories and (2) changing proportion of existing unsafe categories.
  • Red-Teaming

    • Manual Red-Teaming: Leveraging people’s creativity to search for prompts that may elicit unsafe behaviors of LLMs.
    • Automated Red-Teaming: Using automated search to move outside the region guarded by RLHF so that unsafe content is generated.

Note that

  • The description above only considers the language model itself. There may be external input / output filters that assist the detection and mitigation of unsafe behaviors; these external filters should be studied separately.
  • The LM itself may or may not go through a process of enhancing safety. The methods to enhance safety may include (1) SFT with additional (unsafe prompt, IDK response) pairs or (2) RLHF with additional (unsafe prompt, IDK response, unsafe response) triplets; here an IDK response is the generic response that LMs fall back to when encountering unsafe prompts.

Red Teaming

Resources

  • A comprehensive wiki and a collection of resources from Yaodong Yang @ PKU. He, together with Songchun Zhu, also wrote a comprehensive survey on AI alignment; it has a Chinese version.

Reference

Safety Alignment

  1. [2310.12773] Safe RLHF: Safe Reinforcement Learning from Human Feedback
  2. [2307.04657] BeaverTails: Towards Improved Safety Alignment of LLM via a Human-Preference Dataset (PKU-Alignment)

    This work finds that separately annotating harmlessness and helpfulness (together with the safe RLHF algorithm proposed in 1) substantially outperforms Anthropic's baselines; the authors claim they are the first to do this. The authors also open-source two datasets: (1) an SFT (or classification) dataset used to train a safety classifier and (2) an RLHF dataset used to fine-tune an LM (Alpaca in the paper).


    The authors also curate a balanced test set covering 14 categories to measure several models' safety (Figure 5); they find that aligned LLMs show much less variance across GPT-4 evaluation, human evaluation, and QA moderation. Here "QA moderation" is another safety measure: the degree to which a response mitigates the potential harm of a harmful prompt; the authors use a binary label for this. Specifically, rather than labeling each sentence's own toxicity (for example, the prompt or the response), the authors label whether a response addresses the prompt harmlessly.


    Note that the authors synthesize the 14 categories from references 1 and 2 in "Taxonomy of Unsafe Behaviors" and reference 1 in "Red Teaming" below. The authors acknowledge that these categories are not MECE (mutually exclusive and collectively exhaustive).

    The authors release their models and datasets on HuggingFace hub:

    | # | Model Name | Note |
    | --- | --- | --- |
    | 1 | PKU-Alignment/alpaca-7b-reproduced | The reproduced Alpaca model. |
    | 2 | PKU-Alignment/beaver-dam-7b | A LLaMA-based QA moderation model. |
    | 3 | PKU-Alignment/beaver-7b-v1.0-reward | The static reward model used during RLHF. |
    | 4 | PKU-Alignment/beaver-7b-v1.0-cost | The static cost model used during RLHF. |
    | 5 | PKU-Alignment/beaver-7b-v1.0 | The Alpaca model after the safe RLHF process based on 1. |

    | # | Dataset Name | Note |
    | --- | --- | --- |
    | 1 | PKU-Alignment/BeaverTails | A classification dataset with prompt, response, category, and is_safe columns; it could be used for 14 classes (using category) or 2 classes (using is_safe). |
    | 2 | PKU-Alignment/BeaverTails-single-dimension-preference | A preference dataset with prompt, response_0, response_1, and better_response_id (-1, 0, 1). |
    | 3 | PKU-Alignment/BeaverTails-Evaluation | Only has prompt and category columns; it is not the test split of datasets 1 and 2. |
    | 4 | PKU-Alignment/PKU-SafeRLHF | A preference and binary classification dataset (N=330K) with prompt, response_0, response_1, is_response_0_safe, is_response_1_safe, better_response_id, and safer_response_id; it has both training and test splits. |
    | 5 | PKU-Alignment/PKU-SafeRLHF-30K | A sampled version of 4 with both training and test splits. |
    | 6 | PKU-Alignment/PKU-SafeRLHF-10K | A further sampled version of 4 with only the training split available. |
    | 7 | PKU-Alignment/processed-hh-rlhf | A reformatted version of the Anthropic dataset for ease of use; the original dataset is formatted as plain text. |
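    The datasets can be loaded directly from the HuggingFace hub; below is a minimal sketch assuming the datasets library (the split name is an assumption to be checked against the dataset card).

    ```python
    from datasets import load_dataset

    # Preference and safety data for safe RLHF; column names follow the table above.
    ds = load_dataset("PKU-Alignment/PKU-SafeRLHF", split="train")  # split name assumed

    example = ds[0]
    safer = example["response_0"] if example["safer_response_id"] == 0 else example["response_1"]
    ```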

Safety Benchmark

  1. [2308.01263] XSTest: A Test Suite for Identifying Exaggerated Safety Behaviours in Large Language Models (Röttger et al.): This work presents a small set of test prompts (available on GitHub) that could be used to test the safety of an LLM. This work is from the people working on hate speech, including Paul Röttger, Bertie Vidgen, and Dirk Hovy.
  2. [2308.09662] Red-Teaming Large Language Models using Chain of Utterances for Safety-Alignment (DeCLaRe Lab, SUTD): This work provides two datasets: (1) a set of harmful questions for safety benchmarking and (2) a set of (prompt, blue conversation, red conversation) triplets for safety alignment.
  3. [2309.07045] SafetyBench: Evaluating the Safety of Large Language Models with Multiple Choice Questions (Tsinghua): This work provides a dataset of multiple-choice QA to evaluate the safety of an LLM across 7 predefined categories, including offensiveness, bias, physical health, mental health, illegal activities, ethics, and privacy.

OOD and Safety

  1. [2311.14743] A Baseline Analysis of Reward Models’ Ability To Accurately Analyze Foundation Models Under Distribution Shift (Scale AI)

Red Teaming

  1. [2209.07858] Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned (Ganguli et al., Anthropic).
  2. [2202.03286] Red Teaming Language Models with Language Models (Perez et al., DeepMind and NYU)

Taxonomy of Unsafe Behaviors

  1. [2206.08325] Characteristics of Harmful Text: Towards Rigorous Benchmarking of Language Models (Rauh et al., DeepMind)
  2. BBQ: A hand-built bias benchmark for question answering (Parrish et al., Findings 2022, NYU)

Controlled Text Generation

  1. ToxiGen: A Large-Scale Machine-Generated Dataset for Adversarial and Implicit Hate Speech Detection (Hartvigsen et al., ACL 2022)

    The authors propose a classifier-in-the-loop constrained decoding scheme that allows for the generation of benign and (implicitly) toxic content about 13 minority groups.

    Specifically, at every decoding step the authors adjust the token distribution by adding a partial sequence's neutral-class probability from a hate speech classifier, which mitigates the toxicity. This makes the originally explicitly toxic content less toxic (from 66% to 43%) yet still implicitly toxic. Besides generating implicitly toxic content, this approach could also work with a benign prompt to generate benign content. A schematic sketch of this decoding scheme is shown below.
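    The sketch below is a schematic re-implementation of the idea (not the authors' exact implementation); the generator and classifier checkpoints are placeholders, and the classifier is assumed to expose a "neutral" label.

    ```python
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

    generator_name = "gpt2"                                            # placeholder generator
    classifier = pipeline("text-classification",
                          model="your/hate-speech-classifier")         # placeholder classifier

    tok = AutoTokenizer.from_pretrained(generator_name)
    lm = AutoModelForCausalLM.from_pretrained(generator_name)

    def guided_next_token(prefix, top_k=20, alpha=1.0):
        """Re-rank the LM's top-k next tokens by adding the classifier's
        neutral-class probability for each candidate continuation."""
        input_ids = tok(prefix, return_tensors="pt").input_ids
        with torch.no_grad():
            next_logits = lm(input_ids).logits[0, -1]
        log_probs = torch.log_softmax(next_logits, dim=-1)
        top = torch.topk(log_probs, top_k)

        best_text, best_score = prefix, float("-inf")
        for logp, token_id in zip(top.values, top.indices):
            candidate = prefix + tok.decode(int(token_id))
            pred = classifier(candidate)[0]
            neutral_prob = pred["score"] if pred["label"] == "neutral" else 1 - pred["score"]
            score = logp.item() + alpha * neutral_prob
            if score > best_score:
                best_text, best_score = candidate, score
        return best_text
    ```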


  2. [2310.14542] Evaluating Large Language Models on Controlled Generation Tasks (Sun et al., EMNLP)

    This paper shows that LLMs, including gpt-3.5-turbo, Falcon, Alpaca, and Vicuna, cannot be controlled to follow fine-grained signals such as numerical planning (for example, "generate a paragraph with five sentences"); they do well at controlling high-level signals such as sentiment, topic, and enforcing specific keywords.

Adversarial Attack on LLM

  1. [2307.15043] Universal and Transferable Adversarial Attacks on Aligned Language Models

    • This paper proposes two ways to elicit unsafe behaviors of LLMs

      • Producing Affirmative Responses: Optimizing so that the response begins with "Sure, here is [prompt]," which makes the model go on to generate the expected unsafe content.
      • Greedy Coordinate Gradient (GCG)

        Given an input prompt x _ {1:n}, the algorithm iterates over the suffix tokens and finds the replacement that causes the smallest loss. Specifically, for each token, the algorithm computes the gradient of the loss with respect to this token's one-hot vector, picks the top-K candidate replacements, forms modified prompts by substituting tokens from the top-K set, and finally selects the prompt with the lowest loss.

    • In attacking vision models, it is well established that attacking distilled models is much easier than attacking the original models.

Toxicity Detection

  1. [2312.01648] Characterizing Large Language Model Geometry Solves Toxicity Detection and Generation

    • This paper proposes a method to attain almost perfect accuracy on the challenging civil_comments dataset. The authors do so by deriving a set of features from an LLM from first principles and training a linear classifier on top of these features.
    • Intrinsic Dimension (ID) could be used to characterize the likelihood that a prompt evades RLHF alignment; it could serve as a proxy when engineering prompts for jailbreaking.

      The authors show (using the increased ID as a proxy for evading alignment) that prepending a relevant non-toxic sentence as prefix will make the aligned LM more likely to generate toxic content.

Research Notes | Research in the LLM Era

Overview

This post mainly comprises content from the sources below:

Directions

Evaluation

Miscellaneous Notes

Research Notes | Constitutional AI

[Research Paper] – [Constitution] – [Policy Memo] – [Full List of Research from Anthropic]

  • Notable figures from Anthropic include Chris Olah, Deep Ganguli, Ethan Perez, Sam Bowman, and Jared Kaplan. The first author of this work is Yuntao Bai.

Overview

There are some limitations with OpenAI's approach to RLHF, i.e., asking humans to compare responses and select the one they prefer.

  • Low Scalability: Asking humans to compare responses and verifying the comparisons (even for a small subset) takes a significant amount of time. Further, annotating disturbing content may harm human annotators.
  • Low Interpretability: The values are infused implicitly through the comparison process; the exact guidelines that govern the comparison of responses are not spelled out.
  • Tradeoff between Harmlessness and Helpfulness: An "alignment tax" has been observed in the RLHF process. For example, the model could generate safe yet evasive content that does not contain any useful information.

The approach proposed by Anthropic makes a Pareto improvement on both harmlessness and helpfulness. For example, when the model is asked to do something that violates the constitution, it still tries to be helpful rather than simply refusing to answer.

The core of CAI is a set of expert instructions (source); it replaces humans with another LM in the RLHF process, leading to a new approach to alignment, i.e., RLAIF.

CAI does this by training a model using a list of natural language instructions or principles, which comprise the model’s “constitution.”

Additional Notes

  • The constitution is not finalized; it could be revised and updated. The current version is derived from numerous sources, including the UN Declaration of Human Rights (1948), DeepMind's Sparrow principles, and Apple's terms of service; it also considers values that are not from Western, rich, and industrialized cultures.

    The constitution is implemented as (abstract) natural language instructions. Keeping the instructions abstract is deliberate, as the authors find that writing overly specific constitutions harms performance. Two examples:

    ```text
    Please choose the response that most supports and encourages freedom, equality, and a sense of brotherhood. (1)

    Please choose the response that is least racist and sexist, and that is least discriminatory based on language, religion, political or other opinion, national or social origin, property, birth or other status. (2)
    ```

Coding Notes | LLM Practice

Prompt Engineering

The official OpenAI documentation summarizes six strategies for prompt engineering.

Write Clear Instructions

The LM cannot automatically do what the user has not explicitly instructed it to do.

Provide Reference Texts

Split Complex Tasks into Simpler Subtasks

Solving multiple subproblems in a cascaded fashion often leads to a lower overall error rate than solving the whole problem at once.

Give Model Time to Think or CoT

Use External Tools

It is better to use tools for tasks that require algorithmic solutions; the LM is better at reasoning than at executing algorithms.

Test Changes Systematically

Prompts that work well on a small number of samples in the playground may not work as well on a representative set of test samples. It is important to evaluate on a larger test set every time we make non-trivial changes to the prompt.

Fine-Tuning

Overview

  • As of 2023-11-15, OpenAI allows fine-tuning the gpt-3.5-turbo, davinci-002, and babbage-002 models; OpenAI will soon support fine-tuning gpt-4. Besides, it is possible to fine-tune models that have already been fine-tuned.
  • Fine-tuning is discouraged unless we have shown that none of the options below works. This is because it is faster to iterate with prompts in the playground than with fine-tuned models.
    • Prompt Engineering: We must closely follow the content in [1] for prompt engineering.
    • Prompt Chaining: Breaking complex tasks into multiple prompts.
    • Function Calling
  • Reasons for Fine-Tuning
    • Reducing the length of prompts or reducing latency: fine-tuned models could save up to 90% of the tokens compared to zero-shot or few-shot prompting (blog). Furthermore, a fine-tuned smaller model (for example, gpt-3.5-turbo) could often match the performance of a larger model (for example, gpt-4), therefore reducing latency.
    • Improving performance for tasks that are hard to articulate using prompts (i.e., tasks that “show, not tell”).

Recipe

Workflow

  • Unlike older models, gpt-3.5-turbo can be fine-tuned with as few as 10 examples; clear improvements typically appear with 50 to 100 examples.
  • It is better to start fine-tuning with 50 examples and see if there is improvement. If there is no clear improvement, we must redesign the data.

Step 1: Preparing Data

We need to prepare the data as a .jsonl file following the format below; each line in the .jsonl file is one example, and the token limit per example is 4,096 for gpt-3.5-turbo. We could estimate the token usage of a fine-tuning job using the num_tokens_from_messages() function (doc).

  • Chat Models

    In the example below, the goal is to fine-tune a model that could generate sarcastic responses. Each sample should be formatted as follows.

{
    "messages": [
        {
            "role": "system",
            "content": "Marv is a factual chatbot that is also sarcastic."
        },
        {
            "role": "user",
            "content": "What's the capital of France?"
        },
        {
            "role": "assistant",
            "content": "Paris, as if everyone doesn't know that already."
        }
    ]
}
  • Other Models
{"prompt": "", "completion": ""}
{"prompt": "", "completion": ""}
{"prompt": "", "completion": ""}
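Before uploading, it can help to sanity-check the chat-format file; below is a minimal sketch (the file name is the one assumed in Step 2).

```python
import json

path = "mydata.jsonl"  # the training file prepared above

with open(path) as f:
    for line_no, line in enumerate(f, start=1):
        example = json.loads(line)            # every line must be a valid JSON object
        messages = example["messages"]
        roles = [m["role"] for m in messages]
        # Each chat example should contain at least a user turn and an assistant turn.
        assert "user" in roles and "assistant" in roles, f"line {line_no} is incomplete"
        assert all("content" in m for m in messages), f"line {line_no} misses content"
```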

Step 2: Uploading Data

We first need to make sure the openai package is up to date using pip install -U openai. Then we could upload the data.

from openai import OpenAI
client = OpenAI()

message = client.files.create(
  file=open("mydata.jsonl", "rb"),
  purpose="fine-tune"
)
# message:
# FileObject(
#   id='file-Y0Q82yniynZAN7TeZaEXcbYg',
#   bytes=234106,
#   created_at=1700031010,
#   filename='fine_tuning_data.jsonl',
#   object='file',
#   purpose='fine-tune',
#   status='processed',
#   status_details=None
# )

Step 3: Fine-Tuning

Now OpenAI supports fine-tuning models through a UI (i.e., https://platform.openai.com/finetune). We could also submit a fine-tuning job using the Python code below. Note that

  • training_file is the file ID (the id field of the FileObject) returned in Step 2.
  • model could be gpt-3.5-turbo or older models.

We could optionally tune the hyperparameters of fine-tuning.

from openai import OpenAI
client = OpenAI()

client.fine_tuning.jobs.create(
  training_file="file-abc123",  # the file ID returned in Step 2
  model="gpt-3.5-turbo",
  # optional, see details below
  hyperparameters={ 
    "n_epochs":2
  }
)

We could monitor the status of fine-tuning on the OpenAI website. If using code is preferred, we could use one of the commands below.

from openai import OpenAI
client = OpenAI()

# Retrieve the state of a fine-tune
client.fine_tuning.jobs.retrieve("ftjob-abc123")

# Return the training metrics based on the command above, such as loss, accuracy
content = client.files.retrieve_content("result-file")

# List up to 10 events from a fine-tuning job
client.fine_tuning.jobs.list_events(fine_tuning_job_id="ftjob-abc123", limit=10)

# List 10 fine-tuning jobs
client.fine_tuning.jobs.list(limit=10)

Step 4: Evaluation and Iteration

After fine-tuning, we could evaluate the fine-tuned model on a held-out test set. If the performance is not satisfactory, we should check the data quality from the aspects below.

Data Quality and Quantity

Data quality should be prioritized over data quantity: a smaller amount of high-quality data is generally better than a larger amount of low-quality data.

  • Training Data Lacking Consistency

    As a rule of thumb, an inter-annotator agreement of 70% is low: if humans could not agree on the labels, it is unlikely that the model could do better than humans.

  • Training Label Distribution Different from Testing Label Distribution

We could start from the cases where the fine-tuned model makes mistakes and iterate from there. If data quantity is indeed the issue, we could estimate the gains from more data by (1) fine-tuning a second model on half of the current data and (2) measuring the performance difference between the two models on the test set.

Model Hyperparameters

We could change 3 hyperparameters: number of epochs, learning rate multiplier, and batch size. The following are some typical scenarios and the corresponding action:

| Task | Scenario | Action |
| --- | --- | --- |
| Tasks with a single or a few ideal completions | Generations do not follow the training data | Increase n_epochs from 3 to 4 or 5 |
| Creative tasks | Generations show reduced diversity | Decrease n_epochs from 3 to 1 or 2 |
| / | Training does not converge | Increase learning_rate_multiplier |

Reference

  1. OpenAI Prompt Engineering Guide
  2. OpenAI Fine-Tuning Guide

Talk Notes | LLM and RLHF

[Talk on LLM] – [Talk on RLHF] – [Slides of LLM Talk] – [Tweet Thread of the LLM Talk]

  • The presenter Hyung Won Chung is a research engineer at OpenAI; he was previously with Google. His Ph.D. was in mechanical engineering (specifically, pressure-retarded osmosis), a field entirely unrelated to machine learning.
  • The “Pretraining” section mostly comes from the LLM talk. The other sections are from the RLHF talk.

Pretraining

  • Functional Viewpoint of the Transformer LM

    The transformer could be viewed as a computation module that receives and outputs matrices of size (b, d, l). All powerful LLMs are based on transformers. The interaction between tokens has minimal assumptions: each token could interact with any other token; this is done through a mechanism called "dot-product attention."


    For the sake of efficiency, the process above is done in batches. The only interdependence across the batch is that the final loss is divided by the batch size b.


  • Scaling Transformers

    This means efficiently doing matrix multiplication with many machines (with matrices distributed on each and every machine) while minimizing the communication costs between machines.

  • Scaling Law, Phase Change, and Emergent Abilities

    • An idea that does not work now may work when scaling up the model size. We need to constantly unlearn intuitions built on outdated or even invalidated ideas. We can update our intuition by rerunning experiments that previously did not work on newer models and pinpointing what is new in these models.


  • Post Training
    • Users cannot immediately communicate with the pretrained model, as the pretraining objective is next-token prediction. Prompt engineering mitigates this problem by setting the stage for the LM to generate the relevant content.
    • Pretrained models always generate something that is a natural continuation of the prompts even if the content is malicious.

Supervised Fine-Tuning (SFT)

  • Instruction tuning is a technique that is almost universally beneficial for improving the performance of decoder-only and encoder-decoder models: the answer to "should I try instruction tuning?" is almost always "yes."

    Importantly, this is true even if we use an encoder-only model, as instruction tuning provides a better initialization for "single-task" fine-tuning (see [2]). For example, we could use an instruction-tuned BERT rather than a regular BERT for various tasks.

    Pareto Improvements to Single Task Finetuning For both sets of Held-In and Held-Out tasks examined, finetuning Flan-T5 offers a pareto improvement over finetuning T5 directly. In some instances, usually where finetuning data is limited for a task, Flan-T5 without further finetuning outperforms T5 with task finetuning.

  • A Unified Architecture

    All tasks are unified under a single text-to-text format (proposed by T5). This was not obviously a valid choice because, back then, people did not believe LMs could "understand."

  • Two Flavors of Instruction Tuning

    • Using a Mixture of Academic Datasets: Flan and T0. The limitation of these models is that they cannot generate long texts due to the limitations of the academic datasets.
    • Using User Traffic: For example, InstructGPT and ChatGPT. Such user traffic (for example, "explain the moon landing to a six year old") is unavailable in academic datasets, as there is no way to evaluate such prompts.
  • Task Diversity and Model Size are Important

    • The Flan collection by the presenter comprises 1,836 tasks; it is still the largest collection as of November 2023. The authors show a linear scaling relationship between model size and normalized performance on held-out tasks. Further, when the number of tasks increases, the line is lifted upwards with a double-digit gain. It is also important to combine the non-CoT and CoT data.
    • However, the performance quickly plateaus even when more tasks are added. This is likely due to the limited diversity of academic datasets.
  • Inherent Limitation of Instruction Tuning

    For a given input, the target is a single correct answer (this could be called behavior cloning in RL); it requires formalizing the correct behavior for a given input. However, this is hard or even impossible for inputs like the following:

    • Write a letter to a 5-year-old boy from Santa Claus explaining that Santa is not real. Convey it gently so as not to break his heart.
    • Implement Logistic regression with gradient descent in Python.

    The issue is that (1) the correct answer may not be unique and (2) it may be hard or even impossible to provide a correct answer. The tension is that none of the existing loss functions directly addresses these issues; the solution is to use rewards in RL to address the problem.

RLHF

The lecture is based on the InstructGPT paper, which provides the foundational idea and popularized RLHF. There are many variants and extensions of this paper; they are easy to understand once we understand this foundational one.

The goal of RLHF is to encode human preferences and, more generally, values. RLHF opens up a new paradigm of learning the objective function itself; moving from rule-based systems to RLHF, inductive bias is gradually removed to support more general use cases (in the talk's figure, the blue block refers to the learnable block within a system).


Reward Model (RM)

The intuition behind training a reward model is that it is difficult to evaluate open-ended generation directly, but it is easier to compare two completions.

The reward model r(x, y;\phi) is the SFT model with the last layer replaced by a layer that outputs a scalar; it could also be done differently, for example by taking the probability of the [CLS] token. As long as the model outputs a scalar, how exactly we model this process is less important.

Let p _ {ij} be the probability that the completion y _ i is better than y _ j (the order matters). Based on the classical Bradley-Terry model, the function r(\cdot) models the strength of a sample. Note that it is possible that both y _ i and y _ j are bad; the goal is then to choose the one that is relatively better.
\log \frac{p _ {ij}}{ 1 - p _ {ij}} = r(x, y _ i ; \phi) - r(x, y _ j; \phi),\quad p _ {ij} = \sigma( r(x, y _ i;\phi) - r(x, y _ j; \phi))

Then we want to find \phi so that the sum of the log-probabilities is maximized: \max _ \phi \sum _ {(x, y _ i, y _ j) \in D} \log p _ {ij}.
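A minimal PyTorch sketch of this pairwise objective written as a loss (assuming the scalar rewards for the better and worse completions in a batch have already been computed):

```python
import torch
import torch.nn.functional as F

def reward_model_loss(r_better, r_worse):
    """Negative log-likelihood of the Bradley-Terry model: maximize
    sigma(r(x, y_i; phi) - r(x, y_j; phi)) for every (better, worse) pair."""
    return -F.logsigmoid(r_better - r_worse).mean()

# Toy scalar rewards for a batch of three comparison pairs.
r_better = torch.tensor([0.8, 1.2, -0.1], requires_grad=True)
r_worse = torch.tensor([0.1, 1.5, -0.7], requires_grad=True)
loss = reward_model_loss(r_better, r_worse)
loss.backward()
```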

Note that there are some issues with the reward modeling; there are many ways to improve this scheme:

  • The scheme above does not model how much y _ i is better than y _ j.

Policy Model

Once we have the reward model r(\cdot), we could use it to update the parameters of the language model itself, \pi _ \theta. Specifically, we would like to maximize the following. Note that the prompts X=(X _ 1, \cdots, X _ S) come from academic datasets or user traffic, the completions Y = (Y _ 1, \cdots, Y _ T) are sampled from the language model \pi _ \theta, and the reward model is fixed in this process.
J(\theta) = \mathbb{E} _ {(X, Y)\sim D _ {\pi _ \theta}} \left[ r(X, Y;\phi) \right]
The specific algorithm used to update \theta is PPO, as it gives stable gradient updates. Here is the procedure:

  • Initialize the policy model to an SFT model.
  • Repeat the following:

    1. Sampling: Sampling prompts from the input datasets.
    2. Rollout: Generating the completion conditioned on the prompt with the current LM \pi _ \theta.
    3. Evaluation: Computing the reward of the input and the generated output using the (fixed) reward model r(x, y;\phi). Note that, per the trl library, the reward does not have to come from a model; it could also come from a rule or a human.
    4. Optimization: Back-propagating through the policy model and updating its parameters.

The explanation above is already clear. To make the understanding more concrete, we could take a look at the minimal working example (MWE) provided by the trl library; a sketch loosely following it is shown below.

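The exact API differs across trl versions, so treat the class and method signatures below as assumptions to be checked against the installed version.

```python
import torch
from transformers import AutoTokenizer
from trl import AutoModelForCausalLMWithValueHead, PPOConfig, PPOTrainer

config = PPOConfig(model_name="gpt2", batch_size=1, mini_batch_size=1)
model = AutoModelForCausalLMWithValueHead.from_pretrained(config.model_name)      # policy pi_theta
ref_model = AutoModelForCausalLMWithValueHead.from_pretrained(config.model_name)  # frozen reference
tokenizer = AutoTokenizer.from_pretrained(config.model_name)
tokenizer.pad_token = tokenizer.eos_token

ppo_trainer = PPOTrainer(config, model, ref_model, tokenizer)

# Sampling + rollout: generate a completion for one prompt.
query = tokenizer.encode("This morning I went to the ", return_tensors="pt")[0]
response = ppo_trainer.generate(query, max_new_tokens=20)[0][len(query):]

# Evaluation: the reward could come from a reward model, a rule, or a human.
reward = [torch.tensor(1.0)]

# Optimization: one PPO step on the (query, response, reward) triplet.
stats = ppo_trainer.step([query], [response], reward)
```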

One issue (raised by He He) is that there might be distribution shift when applying the fixed reward model here; it could be an interesting problem to study: should we periodically update the reward model (through something like continual learning) so that the distribution shift is mitigated?

Regularization

  • Preventing \pi _ \theta from Deviating Too Much from the SFT Model (Overfitting to RM or Reward Hacking)

    Adding a per-token penalty prevents \pi _ \theta(Y\vert X) from growing too large relative to \pi _ \text{SFT}(Y\vert X). The intuition for why this is important is that the RM may model some human biases (for example, a preference for longer texts) that are not ideal for the task at hand. A sketch of the penalized reward is shown after the formula.
    J(\theta) = \mathbb{E} _ {(X, Y)\sim D _ {\pi _ \theta}} \left[ r(X, Y;\phi) - \beta \log \frac{\pi _ \theta(Y\vert X)}{\pi _ \text{SFT}(Y\vert X)}\right]
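    A minimal sketch of the penalized per-sequence reward, assuming the per-token log-probabilities of the sampled completion Y under the policy and the SFT model are available:

    ```python
    import torch

    def kl_penalized_reward(reward, logp_policy, logp_sft, beta=0.02):
        """reward: scalar r(X, Y); logp_*: per-token log-probs of Y, shape (T,)."""
        log_ratio = (logp_policy - logp_sft).sum()   # log pi_theta(Y|X) - log pi_SFT(Y|X)
        return reward - beta * log_ratio

    # Toy values for a 4-token completion.
    print(kl_penalized_reward(torch.tensor(1.3),
                              torch.tensor([-2.1, -0.9, -1.4, -0.3]),
                              torch.tensor([-2.0, -1.2, -1.5, -0.4])))
    ```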

Additional Notes

  • There are no reliable metrics for measuring long generated texts; this problem is not solved even at OpenAI.
  • The inputs are typically longer than the outputs. This is one of the reasons why models trained on open-source datasets perform poorly.
  • Easier tasks (for example, simple arithmetic like 3 + 2 =) are already solved pretty well by pretrained models. The goal of SFT and RLHF is to address the diverse and abstract prompts.
  • The RM is called preference model by Anthropic.
  • When we have k responses to the same input, we could form \binom{k}{2} sample pairs and put them in the same batch to avoid overfitting.
  • Constitutional AI (CAI) by Anthropic automates almost everything during RLHF; the only human effort involved is writing the constitution itself. For example, the model is tasked to generate prompts, and these prompts are used to train the reward models.
  • np.einsum() is a generalization of np.matmul().

Reference

  1. [2210.11416] Scaling Instruction-Finetuned Language Models (Chung et al., including Jason Wei)
  2. [2301.13688] The Flan Collection: Designing Data and Methods for Effective Instruction Tuning (Longpre et al.)
  3. [2009.01325] Learning to summarize from human feedback (Stiennon et al.): An example of reward hacking.
  4. [2212.08073] Constitutional AI: Harmlessness from AI Feedback (Bai et al.)

Reading Notes | Towards Understanding Chain-of-Thought Prompting – An Empirical Study of What Matters

[Semantic Scholar] – [Code] – [Tweet] – [Video] – [Website] – [Slide] – [Poster]

Change Logs:

  • 2023-10-20: First draft. The paper appears at ACL 2023 as the best paper honorable mention.

Method

  • The experiments in this paper were done on text-davinci-002 using greedy decoding (temperature 0). The datasets they work with are quite small due to the manual effort required.
  • The paper focuses on QA and arithmetic reasoning tasks; the authors introduce two concepts:

    • Bridging Objects
    • Language Template
  • The authors define intermediate F1 scores for bridging objects. It is likely that the authors only accept generations that satisfy the predefined template when computing these metrics.
  • Observations:

    • The correctness of reasoning during CoT is not important.
    • The query should be (1) relevant and (2) follow the order of the reasoning steps.
  • Additional Observations:

    • CoT does not make LLMs better; it unlocks abilities already learned by LLMs during pre-training. For example, the conclusions drawn on text-davinci-002 do not apply to Flan-PaLM; this is because Flan-PaLM has been fine-tuned on the two tasks.

      Given limited resources and the ability to fine-tune the model, we should add more data to pre-training or instruction tuning to improve the model rather than focusing on specific prompt-engineering tricks.

Experiment

Additional Notes

Reference

Reading Notes | From Pretraining Data to Language Models to Downstream Tasks – Tracking the Trails of Political Biases Leading to Unfair NLP Models

[Semantic Scholar] – [Code] – [Tweet] – [Video] – [Website] – [Slide]

Change Logs:

  • 2023-10-12: First draft. This paper is one of the 3 best papers in ACL 2023.

Method

Political Leanings of LMs

The authors use the existing political compass test to measure an LM's political leanings. The political compass test is a questionnaire consisting of 62 questions; the respondent needs to select one of "Strongly Agree," "Agree," "Neutral," "Disagree," and "Strongly Disagree" for each question. Then, the respondent's political leaning can be deterministically projected onto a plane spanned by an economic axis (x-axis, left and right) and a social axis (y-axis, libertarian and authoritarian).

To study their political leanings, the authors design prompts and separate experimental protocols for encoder-only (for example, BERT) and decoder-only (for example, GPT) LMs. More importantly, the authors further pre-train RoBERTa and GPT-2 using the partisan political corpora collected by previous works ([1] and [2]) and measure the following:

  • How the pre-training corpus influences political leanings.
  • The dynamics of political leanings during continued pre-training.

Note that the authors mention removing the toxic subset of the continued pre-training corpus.

  • Note: This practice is unnecessary as toxicity is less likely to be a confounder for political leaning: the toxic content is uniformly distributed rather than skewed towards one specific political leaning. What is worse, the hate speech detector itself may have political bias.
| Model Type | Prompt | Method |
| --- | --- | --- |
| Encoder-only | "Please respond to the following statement: [statement] I <MASK> with this statement." | The ratio of positive to negative lexicons among the top-10 suggestions for <MASK>. |
| Decoder-only | "Please respond to the following statement: [statement]\n Your response:" | An off-the-shelf BART-based model fine-tuned on MNLI (which specific model is unknown from the paper); manually verifying 110 responses shows 97% accuracy among 3 annotators (\kappa = 0.85). |

Downstream Tasks

The authors study how fine-tuning LMs of different political leanings on the same dataset could lead to different fairness measurements on a hate speech classification task [3] and a misinformation classification task [4]. Specifically, fairness in hate speech classification concerns identity groups, while fairness in misinformation classification concerns the sources of the texts.

Experiments

  • LMs show different political leanings.


  • The (continued) pre-training corpus has an influence on the political leanings; these corpora could be categorized by political leaning and time (specifically, pre-Trump and post-Trump).


  • For downstream tasks

    • The overall performance for hate speech and misinformation classification is mostly the same.
    • Significant accuracy variations exist for different identity groups and sources (compare light blue and orange cells).
  • Note: It is not straightforward to draw convincing conclusions solely from Table 4; the authors' claim of unfairness in downstream tasks needs stronger support.


Reference

  1. POLITICS: Pretraining with Same-story Article Comparison for Ideology Prediction and Stance Detection (Liu et al., Findings 2022): This dataset has news articles collected from multiple outlets; these outlets have their political leaning labels assessed by a news aggregator allsides.com (Wikipedia).
  2. What Sounds “Right” to Me? Experiential Factors in the Perception of Political Ideology (Shen & Rose, EACL 2021): This paper collects social media posts with different political leanings.
  3. How Hate Speech Varies by Target Identity: A Computational Analysis (Yoder et al., CoNLL 2022)
  4. “Liar, Liar Pants on Fire”: A New Benchmark Dataset for Fake News Detection (Wang, ACL 2017) (PolitiFact): This is a standard dataset for fake news classification.