Talk Notes | Causality

[Homepage]

Change Log:

  • 2023-11-28: The data mentioned in the talk requires fully specified factors; the approach is unlikely to work with text or image datasets. What is more relevant to text and images is called “causal representation learning.”

Overview

Causality Ladder (Judea Pearl): Seeing \rightarrow Intervening \rightarrow Imagining
– Seeing: This is where traditional ML happens.
– Intervening
– Imagining: This requires a structural causal model (SCM); it is not discussed in the talk.

Assumptions

  • Ingredients

    • Data: Assumed to be faithful to the graph.
    • Causal Graph: Assumed to satisfy the Markov condition.

    In addition, we need to assume that (1) we have (magically) measured all relevant factors, so there are no confounders, and (2) the data are i.i.d.

Identifying Causality

  • Intuition (Janzing 2012)

    If X causes Y, then the noise pattern from X to Y is simpler than the other way around.

  • Operationalizing the Intuition

    • Kolmogorov Complexity: The length of the shortest program (in any programming language) that computes a PDF. If X \rightarrow Y, then K(P(X)) + K(P(Y\vert X)) \leq K(P(Y)) + K(P(X\vert Y)).
    • The inequality above could be realized in practice, under some assumptions, in systems such as SLOOPY and HECI (Xu et al. 2022; Marx & Vreeken 2017, 2019) that are based on relatively simple regressions.
  • These systems could be evaluated using radar plots on established benchmark datasets.

Talk Notes | LLM and RLHF

[Talk on LLM] – [Talk on RLHF] – [Slides of LLM Talk] – [Tweet Thread of the LLM Talk]

  • The presenter Hyungwon Chung is a research engineer at OpenAI; he was previously with Google. His Ph.D. was in mechanical engineering (specifically pressure-retarded osmosis), a field completely unrelated to machine learning.
  • The “Pretraining” section mostly comes from the LLM talk. The other sections are from the RLHF talk.

Pretraining

  • Functional Viewpoint of the Transformer LM

    The transformer could be viewed as a computation module that receives and outputs matrices of size (b, d, l) (batch size, hidden dimension, and sequence length). All powerful LLMs are based on transformers. The interaction between tokens makes minimal assumptions: each token could interact with any other token; this is done using a mechanism called “dot-product attention.”


    For the sake of efficiency, the process above is done in batches. The only interdependence across the batch is that the loss is finally divided by the batch size b.

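To make the functional viewpoint concrete, here is a minimal numpy sketch of single-head dot-product attention (a simplification for illustration only: no learned projections, masking, or multiple heads, and the shape is written as (b, l, d)):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def dot_product_attention(x):
    """x: (b, l, d) batch of token representations; returns the same shape."""
    b, l, d = x.shape
    scores = x @ x.transpose(0, 2, 1) / np.sqrt(d)   # (b, l, l): every token vs. every token
    weights = softmax(scores, axis=-1)
    return weights @ x                                # (b, l, d)

x = np.random.randn(2, 5, 8)             # b=2 sequences, l=5 tokens, d=8 dimensions
print(dot_product_attention(x).shape)    # (2, 5, 8)
```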

  • Scaling Transformers

    Scaling means efficiently doing matrix multiplication across many machines (with the matrices sharded over the machines) while minimizing the communication cost between them.

  • Scaling Law, Phase Change, and Emergent Abilities

    • An idea that does not work now may work when the model size is scaled up. We need to constantly unlearn intuitions built on outdated or even invalidated results. We can update our intuitions by rerunning experiments that previously did not work on newer models and pinpointing what is new in these models.


  • Post Training
    • Users could not immediately communicate with the pretrained model because the pretraining objective is next-token prediction. Prompt engineering mitigates this problem by setting up the context so that the LM generates the relevant content.
    • Pretrained models always generate a natural continuation of the prompt, even if the content is malicious.

Supervised Fine-Tuning (SFT)

  • Instruction tuning is a technique that is almost universally beneficial for decoder-only and encoder-decoder models: the answer to “should I try instruction tuning” is almost always “yes.”

    Importantly, this is true even for encoder-only models, as instruction tuning provides a better initialization for “single-task” fine-tuning (see [2]). For example, we could use an instruction-tuned BERT rather than a regular BERT for various tasks.

    Pareto Improvements to Single Task Finetuning For both sets of Held-In and Held-Out tasks examined, finetuning Flan-T5 offers a pareto improvement over finetuning T5 directly. In some instances, usually where finetuning data is limited for a task, Flan-T5 without further finetuning outperforms T5 with task finetuning.

  • A Unified Architecture

    All tasks are unified under a single text-to-text format (proposed by T5). This was not obviously a valid choice, because back then people did not believe LMs could “understand.”

  • Two Flavors of Instruction Tuning

    • Using a Mixture of Academic Datasets: Flan and T0. The limitation of these models is that they could not generate longer texts, due to the limitations of the academic datasets.
    • Using User Traffic: For example, InstructGPT and ChatGPT. Prompts from user traffic (for example, “explain the moon landing to a six year old”) are unavailable in academic datasets because there is no way to evaluate them.
  • Task Diversity and Model Size are Important

    • The Flan collection the presenter worked on comprises 1,836 tasks; it is still the largest collection as of November 2023. The authors show a linear scaling law between model size and normalized performance on the held-out tasks. Further, when the number of tasks increases, the line is lifted upwards with a double-digit gain. It is also important to combine the non-CoT and CoT data.
    • However, the performance quickly plateaus even when there are more tasks. This is likely due to the limited diversity of academic datasets.
  • Inherent Limitation of Instruction Tuning

    For a given input, the target is a single correct answer (this could be called behavior cloning in RL); this requires formalizing the correct behavior for a given input. However, that is hard or even impossible for inputs like the following:

    • Write a letter to a 5-year-old boy from Santa Claus explaining that Santa is not real. Convey it gently so as not to break his heart.
    • Implement Logistic regression with gradient descent in Python.

    The issues are that (1) the correct answer may not be unique, and (2) it is hard or even impossible to provide the correct answer. The tension is that none of the existing objective functions can directly express these requirements; the solution is to use rewards in RL instead.

RLHF

The lecture is based on the InstructGPT paper, which provides the foundational idea and popularized RLHF. There are many variants and extensions of this paper; they are easy to understand once we understand this foundational paper.

The goal of RLHF is to encode human preferences and, more generally, human values. RLHF opens up a new paradigm of learning the objective function; moving from rule-based systems toward RLHF, the inductive bias is gradually removed to support more general use cases.


Reward Model (RM)

The intuition for training a reward model is that it is difficult to evaluate open-ended generation directly, but it is easier to compare two completions.

The reward model r(x, y;\phi) is the SFT model with the last layer replaced by a layer that outputs a scalar; it could also be done differently, for example, taking the probability of the [CLS] token. As long as the model outputs a scalar, how exactly we model this process is less important.

Let p _ {ij} be the probability that the completion y _ i is better than y _ j (the order matters here). Based on the classic Bradley-Terry model, the function r(\cdot) models the strength of a sample. Note that it is possible that both y _ i and y _ j are bad; the goal is then to choose the one that is relatively better.
\log \frac{p _ {ij}}{1 - p _ {ij}} = r(x, y _ i; \phi) - r(x, y _ j; \phi), \quad p _ {ij} = \sigma( r(x, y _ i; \phi) - r(x, y _ j; \phi))

Then we want to find \phi that maximizes the sum of the log-probabilities: \max _ \phi \sum _ {(x, y _ i, y _ j) \in D} \log p _ {ij}.
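A minimal PyTorch sketch of this pairwise objective (a simplification for illustration; the function name is a placeholder rather than the InstructGPT implementation):

```python
import torch
import torch.nn.functional as F

def pairwise_reward_loss(r_chosen, r_rejected):
    """Bradley-Terry loss: maximize log sigma(r(x, y_i) - r(x, y_j)).

    r_chosen, r_rejected: (batch,) scalar rewards of the preferred and the
    dispreferred completion of the same prompt.
    """
    return -F.logsigmoid(r_chosen - r_rejected).mean()

# Toy usage: these rewards would normally come from the reward model's scalar head.
r_chosen = torch.tensor([1.2, 0.3, 0.8])
r_rejected = torch.tensor([0.5, 0.4, -0.1])
loss = pairwise_reward_loss(r_chosen, r_rejected)
```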

Note that there are some issues with the reward modeling; there are many ways to improve this scheme:

  • The scheme above does not model how much better y _ i is than y _ j.

Policy Model

Once we have the reward model r(\cdot), we could use it to update the parameters \theta of the language model \pi _ \theta itself. Specifically, we would like to maximize the following. Note that the prompts X=(X _ 1, \cdots, X _ S) come from academic datasets or user traffic, the completions Y = (Y _ 1, \cdots, Y _ T) are sampled from the language model \pi _ \theta, and the reward model is fixed in this process.
J(\theta) = \mathbb{E} _ {(X, Y)\sim D _ {\pi _ \theta}} \left[ r(X, Y;\phi) \right]
The specific algorithm used to update \theta is PPO, as it gives stable gradient updates. Here is the procedure:

  • Initialize the policy model to a SFT model.
  • Repeat the following:

    1. Sampling: Sample prompts from the input datasets.
    2. Rollout: Generate the completion conditioned on the prompt with the current LM \pi _ \theta.
    3. Evaluation: Compute the reward of the input and the generated output using the (fixed) reward model r(x, y;\phi). Note that, as in the trl library, the reward does not necessarily have to come from a model; it could also come from a rule or a human.
    4. Optimization: Back-propagate through the policy model and update its parameters.

The explanation above is already clear. To make the understanding more concrete, we could take a look at the minimal working example (MWE) provided by the trl library; a sketch in the same spirit follows.

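Below is a sketch loosely following the trl quickstart. The class and method names (PPOTrainer, PPOConfig, AutoModelForCausalLMWithValueHead, generate, step) exist in trl, but their exact signatures vary across versions, so treat this as illustrative rather than copy-paste code; the scalar reward here is a hard-coded placeholder instead of a real reward model.

```python
import torch
from transformers import AutoTokenizer
from trl import AutoModelForCausalLMWithValueHead, PPOConfig, PPOTrainer

model = AutoModelForCausalLMWithValueHead.from_pretrained("gpt2")  # policy initialized from a (S)FT model
tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token

ppo_trainer = PPOTrainer(config=PPOConfig(batch_size=1, mini_batch_size=1),
                         model=model, tokenizer=tokenizer)

# 1. Sampling: a prompt from the dataset.
query = tokenizer.encode("Explain the moon landing to a six year old:", return_tensors="pt")[0]
# 2. Rollout: generate a completion with the current policy.
response = ppo_trainer.generate(query, max_new_tokens=20, return_prompt=False)[0]
# 3. Evaluation: a placeholder scalar reward (normally from the fixed reward model).
reward = [torch.tensor(1.0)]
# 4. Optimization: one PPO update of the policy (and value head).
stats = ppo_trainer.step([query], [response], reward)
```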

One issue (asked by He He) is that there might be a distribution shift when applying the fixed reward model here; it could be an interesting problem to study: should we periodically update the reward model (through something like continual learning) so that the distribution shift is mitigated?

Regularization

  • Preventing \pi _ \theta from Deviating Too Much from the SFT Model (Overfitting to RM or Reward Hacking)

    Adding a per-token penalty prevents \pi _ \theta(Y\vert X) from deviating too much from \pi _ \text{SFT}(Y\vert X). The intuition for why this is important is that the RM may model some human biases (for example, a preference for longer texts) that may not be ideal for the task to solve.
    J(\theta) = \mathbb{E} _ {(X, Y)\sim D _ {\pi _ \theta}} \left[ r(X, Y;\phi) - \beta \log \frac{\pi _ \theta(Y\vert X)}{\pi _ \text{SFT}(Y\vert X)}\right]
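A minimal sketch of folding the penalty into the reward (a simplification: shown at the sequence level, whereas implementations typically apply it per token):

```python
import torch

def penalized_reward(reward, logprob_policy, logprob_sft, beta=0.02):
    """reward: scalar from the RM; logprob_*: log pi(Y|X) under the policy and the SFT model.

    The KL-style penalty discourages the policy from drifting too far from the SFT
    model, i.e., from over-optimizing (hacking) the reward model.
    """
    kl_term = logprob_policy - logprob_sft   # log ratio of the two models
    return reward - beta * kl_term

r = torch.tensor(0.9)
shaped = penalized_reward(r, logprob_policy=torch.tensor(-35.2), logprob_sft=torch.tensor(-33.0))
```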

Additional Notes

  • There are no reliable metrics for measuring long generated texts; this problem is not solved even at OpenAI.
  • The inputs are typically longer than the outputs. This is one of the reasons why models trained on open-source datasets perform poorly.
  • Easier tasks (for example, simple arithmetic like 3 + 2 =) are already solved pretty well by the pretrained models. The goal of SFT and RLHF is to address the diverse and abstract prompts.
  • The RM is called a preference model by Anthropic.
  • When we have k responses to the same input, we could form \binom{k}{2} sample pairs and put them in the same batch to avoid overfitting.
  • The Constitutional AI (CAI) approach by Anthropic automates almost everything during RLHF; the only human effort involved is writing the constitution itself. For example, the model is tasked with generating prompts, and these prompts are used to train reward models.
  • np.einsum() is a generalization of np.matmul(), as illustrated below.
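For instance, a batched linear map and pairwise token scores can both be written with np.einsum:

```python
import numpy as np

a = np.random.randn(2, 5, 8)   # (b, l, d)
w = np.random.randn(8, 8)      # (d, d)

# Equivalent to np.matmul(a, w): contract over the shared d dimension.
out = np.einsum("bld,de->ble", a, w)

# Pairwise token scores (b, l, l), written directly as an index contraction.
scores = np.einsum("bld,bmd->blm", a, a)
```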

Reference

  1. [2210.11416] Scaling Instruction-Finetuned Language Models (Chung et al., including Jason Wei)
  2. [2301.13688] The Flan Collection: Designing Data and Methods for Effective Instruction Tuning (Longpre et al.)
  3. [2009.01325] Learning to summarize from human feedback (Stiennon et al.): An example of reward hacking.
  4. [2212.08073] Constitutional AI: Harmlessness from AI Feedback (Bai et al.)

Talk Notes | Data-Centric AI

Overview

The following are notes on the Data-Centric AI IAP course from MIT; the Independent Activities Period (IAP) is a special four-week term at MIT. The standard length of each lecture is 1 hour.

Lecture 1 – Data-Centric AI vs. Model-Centric AI

  • It is not hard to design fancy models and apply various tricks to well-curated data. However, these models and tricks do not work for real-world data if we do not explicitly take real-world complexities into account. Therefore, it is important to focus on the data rather than the model.

    It turns out there are pervasive label errors on the most cited test sets of different modalities, including text, image, and audio. They could be explored in labelerrors.com.

  • To understand why data is important, consider the kNN algorithm: its accuracy is purely determined by the quality of the dataset. However, kNN itself is not a data-centric algorithm because it does not modify the data or the labels.
  • Two Goals of Data-Centric AI

    Rather than modifying loss function, doing HPO, or changing the model itself, we do either of the following:

    • Designing an algorithm that tries to understand the data and uses that information to improve the model. One such example is curriculum learning by Yoshua Bengio; in curriculum learning, the data is not changed, but its order of presentation is curated (e.g., from easy to hard).
    • Modifying the dataset itself to improve the model. For example, confident learning (i.e., removing wrongly labeled samples before training the model) studied by Curtis Northcutt.
  • What are NOT Data-Centric AI and Data-Centric AI Counterpart

    • Hand-picking data points you think will improve a model. \rightarrow Coreset Selection.
    • Doubling the size of the dataset. \rightarrow Data Augmentation. For example, back-translation for text, and rotation and cropping for images. However, we need to fix label errors before augmenting the data.
  • Typical Examples of Data-Centric AI

    Curtis Northcutt cites Andrew Ng and other sources on the importance of data in machine learning ([1] through [3]). Here are some examples of data-centric AI:

    • Outlier Detection and Removal. However, this process relies on validation to choose the threshold.
    • Label Error Detection and Correction
    • Data Augmentation
    • Feature Engineering and Selection. For example, solving XOR problem by adding a new column.
    • Establishing Consensus Labels during Crowd-sourcing.
    • Active Learning. For example: I want to improve test accuracy by 5% while annotating as little new data as possible.
    • Curriculum Learning.
  • Data-Centric AI Algorithms are Often Superior to Model-Centric Algorithms

    The model-centric approach (i.e., training less on what the model believes is the bad subset of data) is a much worse idea than the data-centric approach (i.e., confident learning).


  • Root-Causing the Issues – Models or Data
    • The model should perform well on slices of the data. Slicing means not only subsampling the data but also reducing the number of classes from a large number to a very small number. For example, rather than classifying images into 1000 classes, we only focus on the performance on two classes.
    • The model should perform similarly on similar datasets (for example, MNIST and other digit datasets).

Lecture 2 – Label Errors

Notation

| Notation | Meaning |
| --- | --- |
| $\tilde{y}$ | Noisy observed label |
| $y ^ *$ | True underlying label |
| $\mathbf{X} _ {\tilde{y} = i, y ^ {*} = j}$ | The set of examples whose true label is $j$ but which are mislabeled as $i$ |
| $\mathbf{C} _ {\tilde{y} = i, y ^ {*} = j}$ | The size of the set above |
| $p(\tilde{y} = i, y ^ {*} = j)$ | The joint probability of noisy label $i$ and true label $j$; it could be estimated by normalizing $\mathbf{C}$, i.e., dividing each entry by the sum of all entries of $\mathbf{C}$ |
| $p(\tilde{y} = i \vert y ^ {*} = j)$ | The transition probability that the true label $j$ flips to label $i$; it could also be called the flipping rate |

Categories of Label Errors

When comparing the consensus crowd-sourcing labels and the final label in the dataset, there are 4 types of label errors:

  • Correctable: The given label is wrong, and it could be corrected with crowd-sourcing. This is the type of label error the lecture focuses on detecting.
  • Multi-label: The given label and the consensus label are both right. However, more than one label in \mathcal{Y} could be used to label the samples. For example, an image with co-existence of laptop and humans that is incorrectly labeled as “laptop.”
  • Neither: The given label and the consensus label are both wrong.
  • Non-agreement: There is no way to tell whether the given label or the consensus label is correct.

There are also two categories of the label errors the presenter does not focus on:

  • Uniform Random Flipping p(\tilde{y} = i \vert y ^ * = j) = \epsilon, \forall i\neq j: This shows up as a symmetric count matrix. It is easy to handle, and this type of error is unlikely to happen in the real world.
  • Instance-Dependent Label Noise p(\tilde{y} = i \vert y ^ * = j, \mathbf{x}): This would require a lot of assumptions on the data distribution. Importantly, this type of label error seldom happens in the real world.

Uncertainty

There are two sources of uncertainty:

  • Aleatoric Uncertainty: Label noise. It is the difficulty of a sample; this difficulty could come from an incorrect label y or an unusual distribution of \mathbf{x}.
  • Epistemic Uncertainty: Model noise. It is the model’s inability to understand the example. For example, the model has never seen similar examples before or the model class is too simple.

Confident Learning

The focus of the lecture is correctable errors as defined in the previous section; the corresponding count matrix is non-symmetric. Furthermore, the lecture focuses on samples with one label and one annotation.

  • Motivation of Using Confident Learning

    • Ranking samples by loss does not work: we could not find a loss threshold and claim that the samples above this threshold are label errors.
    • Deep learning does not solve the label-noise problem (despite many papers and many claims) because those works address datasets polluted by uniform noise.
  • Assumption: Class-Conditional Label Noise
    p(\tilde{y} \vert y ^ {*}; \mathbf{x}) = p(\tilde{y} \vert y ^ {*})

    • Interpretation: Given the true label, there is a constant flipping rate for the samples under that true label to other labels.
    • Rationale: A pig image is often confused with a boar image but not with other items such as “missiles” or “keyboards.” This tendency has nothing to do with what exactly the pig looks like in an image but with the similarity of the classes.
    • Motivation: This assumption is made because the LHS couples the aleatoric and epistemic uncertainties, and the assumption decouples them.
  • Confident Learning

    • For each class j, we could define the model's self-confidence t _ j (below). If a sample is labeled with a different class but its predicted probability for class j is high relative to t _ j, there is likely something wrong with its label.

    t _ j = \frac{1}{ \vert \mathbf{X} _ {\tilde{y} = j}\vert } \sum _ {x \in \mathbf{X} _ {\tilde{y} = j}} \hat{p} ( \tilde{y} = j; \mathbf{x}, \theta)

    • For a sample labeled i, if its predicted probability for class j is larger than t _ j, then it is likely mislabeled, and we could assign it to the set below. We could obtain this matrix in a cross-validation style: for example, with 3 folds, we use 2/3 of the data to train the model \hat{p} and the remaining 1/3 to compute the matrix (a small numpy sketch follows the example below).
      \hat{ \mathbf{X} } _ {\tilde{y} = i, y ^ {*} = j} = \{ \mathbf{x} \in \mathbf{X} _ {\tilde{y} = i}: \hat{p} (\tilde{y} = j; \mathbf{x}, \theta) \geq t _ j \}
    • Example

      Suppose we know that t _ j for “dog”, “fox”, and “cow” are 0.7, 0.7, and 0.9, and we have the following predictions and labels. We could then obtain a matrix like the one below; the off-diagonal entries correspond to labeling errors.

      |  | $y ^ {*} = \text{dog}$ | $y ^ {*} = \text{fox}$ | $y ^ {*} = \text{cow}$ |
      | --- | --- | --- | --- |
      | $\hat{y} = \text{dog}$ | 1 | 1 | 0 |
      | $\hat{y} = \text{fox}$ | 1 | 3 | 0 |
      | $\hat{y} = \text{cow}$ | 0 | 0 | 1 |

      Note the following:

      • The last sample does not contain any animal, so it is not counted. This shows that this scheme is robust to outliers.
      • It is possible for t _ j to be very small, but this happens when there are many classes; in that case, the predicted probability for each class will also be small.

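A minimal numpy sketch of the thresholds t _ j and the resulting count matrix (only the core counting step; the actual cleanlab implementation additionally calibrates the counts):

```python
import numpy as np

def confident_joint(labels, pred_probs):
    """labels: (n,) given (possibly noisy) labels; pred_probs: (n, k) out-of-sample
    predicted probabilities (e.g., obtained via cross-validation).

    Returns a (k, k) count matrix C[i, j]: examples labeled i whose predicted
    probability for class j exceeds the class-j self-confidence threshold t_j.
    """
    n, k = pred_probs.shape
    # t_j: mean predicted probability of class j over the examples labeled j.
    t = np.array([pred_probs[labels == j, j].mean() for j in range(k)])
    C = np.zeros((k, k), dtype=int)
    for x_probs, i in zip(pred_probs, labels):
        above = np.where(x_probs >= t)[0]      # classes exceeding their threshold
        if len(above) == 0:
            continue                           # e.g., an outlier image with no animal
        j = above[np.argmax(x_probs[above])]   # break ties by the highest probability
        C[i, j] += 1
    return C

labels = np.array([0, 0, 1, 1, 2])
pred_probs = np.array([[0.9, 0.1, 0.0],
                       [0.2, 0.7, 0.1],
                       [0.1, 0.8, 0.1],
                       [0.6, 0.3, 0.1],
                       [0.1, 0.1, 0.8]])
print(confident_joint(labels, pred_probs))
```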

  • Applications

    • Confident Learning + Ranking by Loss

      If we see there are in total k off-diagonal samples, then we could pick the top-k samples based on loss values.

    • Confident Learning + Ranking by Normalized Margin

      We could also rank by the normalized margin for a specific class i; the normalized margin is defined as follows:
      p(\tilde{y} = i) - \max _ {j\neq i} p(\tilde{y} = j; \mathbf{x} \in \mathbf{X} _ i)

    • Self-Confidence

      When p(\tilde{y}=i) is close to 1, then as far as the model can tell, the sample is unlikely to be a label error.

Theory of Confident Learning

  • The model-centric approaches (i.e., model reweighting methods) still propagate the errors back to the weights, whereas the data-centric approaches (i.e., pruning methods) do not have this problem.
  • We could prove that even if the model is miscalibrated (i.e., overly confident in some classes), the confident learning method is still robust.

Implications on Testing

  • When focusing on the subset of data whose labels could be corrected, more capable models (for example, ResNet-50 vs. ResNet-18) perform worse as they fit the random noise in the training set.

Lecture 8 – Encoding Human Priors

Human priors could be encoded (i.e., represented as a function) into the ML pipeline in two ways: during training time through data augmentation, and during test time through prompt engineering with an LLM.

  • Data Augmentation
    • Images: Flip, rotation, Möbius transformation, Mixup. Mixup could be thought of as the linear interpolation of two images (a small sketch follows this list).
    • Texts: Back-translation.
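A minimal sketch of mixup on a toy batch (following the formulation in reference [4]; \lambda is drawn from a Beta distribution):

```python
import numpy as np

def mixup(x, y, alpha=0.2, rng=np.random.default_rng(0)):
    """x: (n, ...) images; y: (n, k) one-hot labels.

    Returns convex combinations of random pairs, encoding the prior that
    interpolations of inputs should map to interpolations of labels.
    """
    lam = rng.beta(alpha, alpha)
    perm = rng.permutation(len(x))
    x_mixed = lam * x + (1 - lam) * x[perm]
    y_mixed = lam * y + (1 - lam) * y[perm]
    return x_mixed, y_mixed

x = np.random.rand(4, 32, 32, 3)          # toy batch of 4 RGB images
y = np.eye(3)[np.array([0, 1, 2, 1])]     # one-hot labels
x_mixed, y_mixed = mixup(x, y)
```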

cleanlab Library

Anatomy

  • Understanding Cross-Validation in cleanlab

    Cross-validation in cleanlab means that the predicted probabilities have to be out-of-sample (test-fold) scores. Specifically, if we have 3 folds, then what we keep are the predicted probabilities for each 1/3 fold from the model trained on the remaining 2/3 folds.

    This logic could be found in estimate_confident_joint_and_cv_pred_proba() in cleanlab/count.py; it is the most important function in cleanlab. It is used by the find_label_issues function in the CleanLearning class; this class also inherits from sklearn.base.BaseEstimator. The code could be found here.
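For reference, a minimal sketch of the typical entry point (assuming cleanlab 2.x; as discussed above, pred_probs must be out-of-sample probabilities):

```python
import numpy as np
from cleanlab.filter import find_label_issues

labels = np.array([0, 0, 1, 1, 2])              # given (possibly noisy) labels
pred_probs = np.array([[0.9, 0.1, 0.0],         # out-of-sample predicted probabilities
                       [0.2, 0.7, 0.1],         # (e.g., from cross-validation)
                       [0.1, 0.8, 0.1],
                       [0.6, 0.3, 0.1],
                       [0.1, 0.1, 0.8]])

issue_indices = find_label_issues(
    labels=labels,
    pred_probs=pred_probs,
    return_indices_ranked_by="self_confidence",  # rank the flagged examples by severity
)
print(issue_indices)
```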

  • keras is Necessary to Port cleanlab and transformers

    • cleanlab requires an API similar to sklearn's.
    • As of 2023-11-08, neither the transformers nor the sklearn team provides a solution for porting one to the other (except a less relevant library called skops, which is about sharing sklearn models on the HuggingFace hub; also see the news). We therefore need to rely on the keras-based code from the official cleanlab tutorial that fine-tunes a TF-based bert-base-uncased to find label errors in the imdb dataset.
    • The complete script is available here.

Example

In the official demo that tries to find label errors in the imdb dataset, the authors use a simple MLP as the base model. The following (confusing at first look) code tokenizes the texts into fixed-length vectors (i.e., of length sequence_length).

```python
import re
import string

import tensorflow as tf
import tensorflow_datasets as tfds
from tensorflow.keras.layers import TextVectorization

raw_train_ds = tfds.load(name="imdb_reviews", split="train", batch_size=-1, as_supervised=True)
raw_test_ds = tfds.load(name="imdb_reviews", split="test", batch_size=-1, as_supervised=True)

raw_train_texts, train_labels = tfds.as_numpy(raw_train_ds)
raw_test_texts, test_labels = tfds.as_numpy(raw_test_ds)

max_features = 10000
sequence_length = 250

def preprocess_text(input_data):
    # Lowercase, remove the HTML line breaks, and strip punctuation.
    lowercase = tf.strings.lower(input_data)
    stripped_html = tf.strings.regex_replace(lowercase, "<br />", " ")
    return tf.strings.regex_replace(stripped_html, f"[{re.escape(string.punctuation)}]", "")


vectorize_layer = TextVectorization(
    standardize=preprocess_text,
    max_tokens=max_features,
    output_mode="int",
    output_sequence_length=sequence_length,
)
vectorize_layer.reset_state()
vectorize_layer.adapt(raw_train_texts)

# (N, sequence_length)
train_texts = vectorize_layer(raw_train_texts).numpy()
test_texts = vectorize_layer(raw_test_texts).numpy()
```

Additional Notes

  • “You are what you eat” is particularly relevant to the process of training machine learning models.
  • The data collection, labeling, and cleaning process could be called “data engine” or “data flywheel” in tech firms (blog).
  • The benefit of data-centric AI is that it disentangles the effects of data and modeling. Previously, we blindly trusted the labels, and efforts to improve performance (including using larger models, changing loss functions, and doing HPO) may only end up fitting the noise. If we make the data clean, we could identify which techniques are truly useful and which are not.
  • cleanlab could not only flag the label issues but also automatically fix the top label issues. (blog).

    Here we use Cleanlab Studio’s Clean Top K feature, which allows us to automatically correct the top most severe issues detected in our dataset with an automatically suggested label (inferred to be more suitable for each example than its original label in the dataset).

Reference

  1. Why it’s time for ‘data-centric artificial intelligence’ | MIT Sloan
  2. Bad Data Costs the U.S. $3 Trillion Per Year (Harvard Business Review)
  3. Bad Data: The $3 Trillion-Per-Year Problem That’s Actually Solvable | Entrepreneur
  4. [1710.09412] mixup: Beyond Empirical Risk Minimization (Zhang et al., ICLR 2017)

Talk Notes | ACL 2023 Tutorial – Retrieval-based Language Models and Applications

[Zoom Recording] – [Website and Slides] – [Proposal] – [Q&A] – [Backup Recording]

  • This tutorial is given by Akari Asai, Sewon Min, Zexuan Zhong, and Danqi Chen

Overview

A retrieval-based LM (RALM) is an LM that uses an external database at test time.

The motivation for RALMs is pessimism about the current editing-based approaches. If we were able to edit LLMs as fast as CRUD operations on databases, the retrieval-based and editing-based approaches would be comparable.

Moreover, these parametric LMs are fundamentally incapable of adapting over time, often hallucinate, and may leak private data from the training corpus. […] Retrieval-based LMs can outperform LMs without retrieval by a large margin with much fewer parameters, can update their knowledge by replacing their retrieval corpora, and provide citations for users to easily verify and evaluate the predictions.

Besides ease of updating knowledge, the retrieval-based approaches also have following advantages:

  • Traceability, Verifiability, Interpretability, and Controllability: They mean the same thing for RALMs.
  • Privacy and Copyright: The LMs are only responsible for making inference. The more relevant documents are stored in the databases.

There are some use cases where the RALMs are most suitable:

  • Long-tail: For example, “Where is the Toronto zoo located?”
  • Knowledge Update: For example, “what is the population of Toronto metropolitan area in 2000?”
  • Verifiability, Factuality, and Hallucination
  • Parameter-Efficiency: RALMs could improve performance of smaller LMs and make them competitive with larger ones.
  • Parameter Update: We do not have to update the model itself when we could update the database.
  • OOD Generalization

Architecture

Three elements of RALMs:

  • What to Retrieve? What is the minimal unit for retrieval? The choices could be tokens, documents, or text chunks.
  • How to Use the Retrieval Results? In the input, the output, or somewhere in between?
  • When to Retrieve? Only once, every token, or somewhere in between?


  • REALM is one of the first works on RALMs. Its goal is improving masked LM pretraining. The follow-up papers include DPR, RAG, and Atlas; they all focus on knowledge-intensive tasks:

    • DPR
    • RAG
    • Atlas
  • Retrieval-in-context LM

    • REPLUG: The prompt is used to retrieve a set of documents; these documents are then prepended to the prompt and form an ensemble to predict the new tokens.
    • Ram et al.: We do not have to use the entire prompt as query. It may be better to use more recent tokens (due to higher relevance to the tokens to generate) as long as they are not too short.

      Further, we may want to retrieve more often. For example, after we have already generated some tokens, it is time to retrieve again for the next batch of new tokens.

  • RETRO

    • Retrieval is used in the intermediate layers. Specifically, each prompt is chunked into multiple pieces and used to retrieve results; these results are fed into the LM through a specially designed attention mechanism. The authors also consider some parallelization techniques for the sake of efficiency.
    • Another orthogonal finding of this paper is that the scale of the datastore is important.
  • kNN-LM

    • Using the test prefix to query for stored prefixes that are already continued with new tokens. The next-token distribution is a linear interpolation between the LM's own prediction and the distribution induced by the best-matching prefixes (see the sketch after this bullet).
    • The motivation of this design is that the text representations of the entire sentences could be quite different even though the same word appears in it. For example, the word “torch” and “cheap.”

      Comment: The motivation of this design is dubious. Furthermore, the interaction between the input and the retrieved results is limited.
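A minimal sketch of the kNN-LM interpolation (a simplification: in the actual model, the neighbor distribution is built from distances between hidden-state keys in the datastore):

```python
import numpy as np

def knn_lm_next_token(p_lm, p_knn, lam=0.25):
    """p_lm: (V,) next-token distribution from the parametric LM;
    p_knn: (V,) distribution induced by the retrieved nearest-neighbor continuations;
    lam: interpolation weight (a function of confidence in the adaptive variants)."""
    return lam * p_knn + (1 - lam) * p_lm

p_lm = np.array([0.6, 0.3, 0.1])
p_knn = np.array([0.1, 0.1, 0.8])
print(knn_lm_next_token(p_lm, p_knn))
```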

  • Extensions to kNN-LM

    • Adaptive Retrieval. Whether the retrieval is enabled depends on the confidence of the outputs. For example,
      • FLARE and He et al. More specifically, the \lambda in kNN-LM could be a function of confidence.
      • Alon et al.
  • Entities as Experts

    • Entities could be represented as dense vectors and incorporated into the intermediate layers of a model.
    • Extension: Mention Memory
| Paper | What | When | How | Note |
| --- | --- | --- | --- | --- |
| REALM | Chunks | Once | Input | Used in real-world applications such as you.com, Bing Chat, and perplexity.ai |
| Retrieve-in-Context LM | Chunks | Every n tokens | Input | Used in real-world applications such as you.com, Bing Chat, and perplexity.ai |
| RETRO | Chunks | Every n tokens | Intermediate | |
| kNN-LM | Tokens | Every token | Output | |
| FLARE | Chunks | Adaptive | Input | |
| Adaptive kNN-LM (He et al., Alon et al.) | Tokens | Adaptive | Output | |
| Entities as Experts; Mention Memory | Entities or entity mentions | Every entity mention | Intermediate | |
| Wu et al., Bertsch et al., Rubin & Berant | Chunks from the input | Once or every n tokens | Intermediate | All methods above retrieve from external text; these retrieve from the (book-length) input itself |

Training

We could update (1) the LM itself and (2) the retrieval model. However, training either of them is difficult because (1) the LM is typically large and therefore expensive to update, and (2) the index has to be updated every time we update the encoder, and this cost is proportional to the number of documents in the database.

There are 4 strategies for training RALMs. Independent and sequential training introduce no or only weak dependence between the LM and the RM, but the system performance is not as strong as joint training (i.e., training the LM and RM jointly); the downside of joint training is the requirement for a special training protocol.

Independent Training

Training the LM and the retrieval model independently. Each component could be improved separately; the improvement in each component translates to an improvement of the final system.

  • LM
  • Retrieval Model: It could be BM25 or DPR. BM25 does not need explicit training, and the training of DPR is pretty straightforward. Note that the loss used to promote the correct pairs over the in-batch negatives is a type of contrastive learning (see the sketch after this list).

    Besides DPR, another model is Contriever (Izacard et al.), which could be trained in an unsupervised fashion.
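A minimal PyTorch sketch of the in-batch negative (contrastive) loss used by dual-encoder retrievers such as DPR (a simplification of the real training setup, which also uses hard negatives):

```python
import torch
import torch.nn.functional as F

def in_batch_contrastive_loss(q_emb, d_emb, temperature=1.0):
    """q_emb: (B, d) query embeddings; d_emb: (B, d) embeddings of their positive passages.

    Each query treats the other B-1 passages in the batch as negatives.
    """
    scores = q_emb @ d_emb.T / temperature    # (B, B) similarity matrix
    targets = torch.arange(q_emb.size(0))     # the diagonal holds the positive pairs
    return F.cross_entropy(scores, targets)

q = torch.randn(4, 128)
d = torch.randn(4, 128)
loss = in_batch_contrastive_loss(q, d)
```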

Here are some examples of using this scheme:

  • kNN-LM: The retrieval model is fixed and only the LM is trained.
  • Ram et al.

Sequential Training

Sequential training means training one component first and then training the second component; the training of the second component depends on the first. As there are two components, we could start either from the LM or from the retrieval model.

  • RM \rightarrow LM: For example, RETRO.
  • LM \rightarrow RM: For example, REPLUG. Besides the ensemble prediction scheme, the authors further propose a method to fine-tune the retrieval model (dubbed as “LSR” by the authors) based on the feedback of the LM.

Joint Training with Asynchronous Index Update

Asynchronous index update means that we allow the index to become “stale”: we do not reindex every document every time we update the encoder; rather, we only reindex the documents every T steps.
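A minimal pseudocode-style sketch of this schedule; `train_step` and `rebuild_index` are hypothetical helpers standing in for the actual training and indexing code:

```python
# Joint training with a "stale" index: re-embed and re-index the corpus only
# every T steps instead of after every encoder update.
T = 1000

def joint_training(lm, retriever, corpus, steps, train_step, rebuild_index):
    index = rebuild_index(retriever, corpus)
    for step in range(steps):
        train_step(lm, retriever, index)                   # gradients flow into LM and retriever
        if step > 0 and step % T == 0:
            index = rebuild_index(retriever, corpus)       # periodic, expensive refresh
    return lm, retriever, index
```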

Joint Training with In-Batch Approximation

Applications

Questions

  • Where should we use RALMs?
  • Where should we plug in the RM + database?
  • Should we update the LM or the RM?
  • What database should be used? Wikipedia, training data, or code documentation?
  • WebGPT and GopherCite use Google search results as the datastore.
| System | Task | Method | Database |
| --- | --- | --- | --- |
| DocPrompting | Code generation | Prompting (input); fine-tuning LM | Code documentation |
| kNN-Prompt | Classification | Prompting (output) | Wikipedia + CC |
| REPLUG | Knowledge-intensive tasks | Prompting (input) | Wikipedia + CC |
| ATLAS | Knowledge-intensive tasks | Fine-tuning LM and RM | Wikipedia + CC |
| GopherCite | QA | Fine-tuning + RL on LM | Google Search results |

Additional Notes

  • The backup video is downloaded based on the instruction documented here. Basically, we just need to replace the <cookie content> and <request url> with the content we obtain after F5 \rightarrow F12 in the browser.
youtube-dl -o video.mp4 --referer "https://zoom.us/" --add-header "Cookie: COOKIE_CONTENT" 'REQUEST_URL'

Reference

  1. Building Scalable, Explainable, and Adaptive NLP Models with Retrieval | SAIL Blog
  2. Atlas: Few-shot Learning with Retrieval Augmented Language Models (Izacard et al., 2022)
  3. Teaching language models to support answers with verified quotes (Menick et al., 2022)
  4. REPLUG: Retrieval-Augmented Black-Box Language Models (Shi et al., 2023)
  5. kNN-Prompt: Nearest Neighbor Zero-Shot Inference (Shi et al., 2022)
  6. Attributed Question Answering: Evaluation and Modeling for Attributed Large Language Models (Bohnet et al., 2023)
  7. DocPrompting: Generating Code by Retrieving the Docs (Zhou et al., 2022)
  8. REALM: Retrieval-Augmented Language Model Pre-Training (Guu et al., 2020)
  9. In-Context Retrieval-Augmented Language Models (Ram et al., 2023)
  10. Improving language models by retrieving from trillions of tokens (Borgeaud et al., 2022)
  11. Generalization through Memorization: Nearest Neighbor Language Models (Khandelwal et al., 2020)

Talk Notes | Training State-of-the-Art Text Embedding & Neural Search Models

[YouTube] – [Personal Website]

  • The presenter of this tutorial is Nils Reimers; he is the author of sentence_transformers and a researcher at HuggingFace.
  • Dense representations are interesting as they allow for zero-shot classification in the embedding space. This works not only for text embeddings but for multi-lingual and multi-modal embeddings as well.


  • Using out-of-the-box embeddings (for example, averaging BERT embeddings or using GPT-3 embeddings) does not work (see [1], [2]).
  • Vector Space

    The contrastive or triplet loss may only optimize the local structure. A good embedding model should optimize both the global and the local structure.

    • Global Structure: Relation of two random sentences.
    • Local Structure: Relation of two similar sentences.

Reference

  1. OpenAI GPT-3 Text Embeddings – Really a new state-of-the-art in dense text embeddings? | by Nils Reimers | Medium: This benchmarking was done in late December 2021, shortly after the embedding endpoint was released.
  2. MTEB Leaderboard – a Hugging Face Space by mteb: As of 2023-10-12, the text-embedding-ada-002 ranks 14 in the benchmark. All of the first 13 models that rank higher are open-source models.

Talk Notes | Lessons Learned from Analyzing Systems for Hate Speech Detection and Bias Mitigation by Sarah Masud

[YouTube] – [Personal Website]

  • The presenter has authored several interesting papers ([1] through [5]) on hate speech detection.

Notes

Status Quo of Hate Speech Detection

  • There are varying definitions of hate speech.
  • Labels related to hate speech include hate, offensive, toxic, and profane. There could also be more fine-grained categories, such as sexist, racist, and islamophobic.
  • Because of the reasons mentioned above, there is no leaderboard in hate speech detection.

Data Sources

We should pay attention to data bias; it is questionable to collect hate speech only from people and sites that are more likely to generate it. The authors propose to collect datasets from neutral sources; this design choice makes data annotation difficult.

Annotations

Current approaches to hate speech annotation rely on people (crowdworkers or experts). The authors use a two-phase approach to ensure label quality.

Building Better Hate Speech Detection Models

  • The complexity of models does not necessarily help. It is more important to capture the signals that predict the final labels, for example, the history and the social network information. This observation also applies to other tasks that involve modeling social behaviors.
  • However, we should carefully monitor overfitting: spurious correlations between overfitted phrases and labels should not be the signals the models pick up. That is, the models should generalize even without the presence of these words.
  • In the work [2], the authors propose a system that considers not just the text information, but also the timeline and social network information. They merge the three sources of signal using an attention mechanism. However, we could see two limitations:
    • This design is specific to Twitter. Other platforms, such as Reddit, do not have this information with respect to users.
    • The best performing system (M14) does not significantly outperform the baseline system, which is simply fine-tuning a mBERT (M8).


Lexical Bias

  • Replacing the bias-sensitive words with more general words is likely to shift the bias towards their WordNet ancestors. This hypothesis could be supported by a measurement called pinned bias, where t is a single word in the sensitive word list T.

pB _ T = \sum _ {t \in T} \frac{\vert p(\text{toxic}\vert t) - \phi\vert}{\vert T \vert},\quad \phi=\min(p(\text{toxic}\vert t), 0.5)
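A small sketch of this computation; the probabilities p(toxic | t) below are made-up placeholders:

```python
def pinned_bias(p_toxic_given_t):
    """p_toxic_given_t: dict mapping each sensitive word t to p(toxic | t)."""
    total = 0.0
    for p in p_toxic_given_t.values():
        phi = min(p, 0.5)
        total += abs(p - phi)
    return total / len(p_toxic_given_t)

# (0.4 + 0.0 + 0.2) / 3 = 0.2
print(pinned_bias({"word_a": 0.9, "word_b": 0.4, "word_c": 0.7}))
```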

Horizons

The presenter has three high-level observations:

  • Like energy: Bias seems to be transferring from one source to the other.
  • Like a system at rest: A model or dataset will remain biased unless external force (for example, mitigation and regularization) is enabled.
  • Like interactive systems: A system evolves to become more chaotic over time. Toxicity needs to be monitored and mitigated in a continuous fashion.

Reference

  1. [2010.04377] Hate is the New Infodemic: A Topic-aware Modeling of Hate Speech Diffusion on Twitter (Masud et al., ICDE 2021): This paper presents a dataset called RETINA that focus on hate speech in the Indian context.
  2. [2206.04007] Proactively Reducing the Hate Intensity of Online Posts via Hate Speech Normalization (Masud et al., KDD 2022)
  3. [2201.00961] Nipping in the Bud: Detection, Diffusion and Mitigation of Hate Speech on Social Media (Chakraborty and Masud)
  4. [2306.01105] Revisiting Hate Speech Benchmarks: From Data Curation to System Deployment (Masud et al., KDD 2023)
  5. [2202.00126] Handling Bias in Toxic Speech Detection: A Survey (Garg et al., CSUR).
  6. Language (Technology) is Power: A Critical Survey of “Bias” in NLP (Blodgett et al., ACL 2020)
  7. [2305.06626] When the Majority is Wrong: Modeling Annotator Disagreement for Subjective Tasks (Fleisig et al.)
  8. Handling Disagreement in Hate Speech Modelling | SpringerLink (Novak et al., IPMU 2022)
  9. [2001.05495] Stereotypical Bias Removal for Hate Speech Detection Task using Knowledge-based Generalizations (Badjatiya et al., WWW 2019).

Talk Notes | Data-Centric NLP @ USC CSCI-699 Fall 2022

Outline

The following is the course schedule (indeed a reading list) compiled from the course website for quick reference.

I. Datasets in NLP

  • Aug 22 – Introduction, Historical Perspective, and Overview
    • Fair ML Book, Chapter 7: Datasets
    • Sambasivan et al., 2021: “Everyone wants to do the model work, not the data work”: Data Cascades in High-Stakes AI
    • Paullada et al., 2021: Data and its (dis)contents
    • Raji et al., 2022: Ethical Challenges of Data Collection & Use in Machine Learning Research
  • Aug 24 – Data Collection and Data Ethics
    • Deng et al., 2009: ImageNet: A large-scale hierarchical image database
    • Kwiatkowski et al., 2019: Natural Questions: A Benchmark for Question Answering Research
    • Sakaguchi et al., 2019: WinoGrande: An Adversarial Winograd Schema Challenge at Scale
    • Bowman et al., 2015: A large annotated corpus for learning natural language inference
    • Nie et al., 2020: Adversarial NLI: A New Benchmark for Natural Language Understanding
  • Aug 31 – More on Data Ethics
    • Bender et al., 2021: On the Dangers of Stochastic Parrots: Can Language Models Be Too Big?
    • Koch et al., 2021: Reduced, Reused and Recycled: The Life of a Dataset in Machine Learning Research
    • Klein and D’Ignazio, 2020: Data Feminism Book: Intro and Chapter 1
    • Strubell et al., 2019: Energy and Policy Considerations for Deep Learning in NLP

II. Bias and Mitigation

  • Sep 7 – Biases: An Overview
    • Geirhos et al., 2020: Shortcut Learning in Deep Neural Networks
    • Hort et al., 2022: Bias Mitigation for Machine Learning Classifiers: A Comprehensive Survey
    • Feder et al., 2021: Causal Inference in Natural Language Processing: Estimation, Prediction, Interpretation and Beyond
  • Sep 12 – Spurious Biases I
    • Torralba & Efros, 2011: Unbiased Look at Dataset Bias
    • Geva et al., 2019: Are We Modeling the Task or the Annotator? An Investigation of Annotator Bias in Natural Language Understanding Datasets
    • McCoy et al., 2019: Right for the Wrong Reasons: Diagnosing Syntactic Heuristics in NLI
  • Sep 14 – Spurious Biases II
    • Gardner et al., 2021: Competency Problems: On Finding and Removing Artifacts in Language Data
    • Eisenstein, 2022: Informativeness and Invariance: Two Perspectives on Spurious Correlations in Natural Language
  • Sep 19 – Data-Centric Bias Mitigation
    • Srivastava et al., 2020: Robustness to spurious correlations via human annotations
    • Dixon et al., 2018: Measuring and mitigating unintended bias in text classification
    • Gardner et al., 2019: On Making Reading Comprehension More Comprehensive
  • Sep 21 – Data Augmentation for Bias Mitigation
    • Ng et al., 2020: SSMBA: Self-Supervised Manifold Based Data Augmentation for Improving O.O.D. Robustness
    • Kaushik et al., 2019: Learning the Difference that Makes a Difference with Counterfactually-Augmented Data

III. Estimating Data Quality

  • Sep 26 – Estimates of Data Quality
    • Le Bras et al., 2020: Adversarial Filters of Dataset Biases
    • Swayamdipta et al., 2020: Dataset Cartography: Mapping and Diagnosing Datasets with Training Dynamics
    • Liu et al., 2022: WANLI: Worker and AI Collaboration for Natural Language Inference Dataset Creation
    • Ethayarajh et al., 2022: Understanding Dataset Difficulty with V-Usable Information
  • Sep 28 – Aggregate vs. Point-wise Estimates of Data Quality
    • Ghorbani & Zou, 2019: Data Shapley: Equitable Valuation of Data for Machine Learning
    • Perez et al., 2021: Rissanen Data Analysis: Examining Dataset Characteristics via Description Length
    • Mindermann et al., 2022: Prioritized Training on Points that are Learnable, Worth Learning, and Not Yet Learnt
  • Oct 3 – Anomalies, Outliers, and Out-of-Distribution Examples
    • Hendrycks et al., 2018: Deep Anomaly Detection with Outlier Exposure
    • Ren et al., 2019: Likelihood Ratios for Out-of-Distribution Detection
  • Oct 5 – Disagreements, Subjectivity and Ambiguity I
    • Pavlick et al., 2019: Inherent Disagreements in Human Textual Inferences
    • Röttger et al., 2022: Two Contrasting Data Annotation Paradigms for Subjective NLP Tasks
    • Denton et al., 2021: Whose Ground Truth? Accounting for Individual and Collective Identities Underlying Dataset Annotation
  • Oct 12 – Disagreements, Subjectivity and Ambiguity II
    • Miceli et al., 2020: Between Subjectivity and Imposition: Power Dynamics in Data Annotation for Computer Vision
    • Davani et al., 2021: Dealing with Disagreements: Looking Beyond the Majority Vote in Subjective Annotations

IV. Data for Accountability

  • Oct 17 – Creating Evaluation Sets
    • Recht et al., 2019: Do ImageNet Classifiers Generalize to ImageNet?
    • Card et al., 2020: With Little Power Comes Great Responsibility
    • Clark et al., 2021: All That’s ‘Human’ Is Not Gold: Evaluating Human Evaluation of Generated Text
    • Ethayarajh & Jurafsky, 2020: Utility is in the eye of the user: a critique of NLP leaderboards
  • Oct 19 – Counterfactual Evaluation
    • Gardner et al., 2020: Evaluating Models’ Local Decision Boundaries via Contrast Sets
    • Ross et al., 2021: Tailor: Generating and Perturbing Text with Semantic Controls
  • Oct 24 – Adversarial Evaluation
    • Jia and Liang, 2017: Adversarial Examples for Evaluating Reading Comprehension Systems
    • Kiela et al., 2021: Dynabench: Rethinking Benchmarking in NLP
    • Li and Michael, 2022: Overconfidence in the Face of Ambiguity with Adversarial Data
  • Oct 26 – Contextualizing Decisions
    • Gebru et al., 2018: Datasheets for Datasets
    • Bender and Friedman, 2018: Data Statements for Natural Language Processing: Toward Mitigating System Bias and Enabling Better Science

V. Beyond Labeled Datasets

  • Oct 31 – Unlabeled Data
    • Dodge et al., 2021: Documenting Large Webtext Corpora: A Case Study on the Colossal Clean Crawled Corpus
    • Lee et al., 2022: Deduplicating Training Data Makes Language Models Better
    • Gururangan et al., 2022: Whose Language Counts as High Quality? Measuring Language Ideologies in Text Data Selection
  • Nov 2 – Prompts as Data?
    • Wei et al., 2022: Chain of Thought Prompting Elicits Reasoning in Large Language Models
  • Nov 7 – Data Privacy and Security
    • Amodei et al., 2016: Concrete Problems in AI Safety
    • Carlini et al., 2020: Extracting Training Data from Large Language Models
    • Henderson et al., 2022: Pile of Law: Learning Responsible Data Filtering from the Law and a 256GB Open-Source Legal Dataset
  • Nov 9 – Towards Better Data Citizenship
    • Jo & Gebru, 2019: Lessons from Archives: Strategies for Collecting Sociocultural Data in Machine Learning
    • Hutchinson et al., 2021: Towards Accountability for Machine Learning Datasets: Practices from Software Engineering and Infrastructure

Talk Notes | Paraphrasing Evades Detectors of AI-generated Text, But Retrieval is an Effective Defense by Kalpesh Krishna @ Google

[YouTube] – [Personal Website]

  • The presenter is the author of multiple influential papers on the topics such as paraphrasing and attacks.

Reference

  1. Reformulating Unsupervised Style Transfer as Paraphrase Generation (Krishna et al., EMNLP 2020)
  2. [1910.12366] Thieves on Sesame Street! Model Extraction of BERT-based APIs (Krishna et al., ICLR 2020)

Talk Notes | Building End-to-End Content Moderation Pipelines in the Real World

[Website] – [Paper] – [Blog]

Note:
– The presenter of this talk is the lead author of the paper A Holistic Approach to Undesired Content Detection in the Real World.

Change Logs:

  • 2023-08-29: First draft.

Overview

There are two main iterations to build an end-to-end content moderator.
– Annotation Iteration: OpenAI outsources most of the annotation iteration to external data providers. They also have internal expert annotators who provide the labels for the quality-control set.
– Main Iteration: This is the bulk of OpenAI's contribution.

Annotation Iteration

  • Labeling guidelines need to be clarified and updated multiple times as more and more edge cases surface. The specifications from OpenAI are eventually turned into training materials for their data providers to educate their annotators.
  • There should be sessions for:
    • Calibrating the annotators by clarifying the annotation guidelines.
    • Auditing data that are flagged as harmful either by the annotators or by the model, and removing annotations from annotators with low per-category F1 scores. This process could be accelerated by cross-auditing with multiple annotators.

Main Iteration

The following steps outline the main iteration:

  • Step 0: Creating an initial dataset. This initial dataset includes data from a “bad” (and unlabeled) subset of CommonCrawl, expert-selected academic datasets, and zero-shot synthetic data generated by GPT-3 from hand-crafted templates.
  • Step k-1: \cdots
  • Step k: In iteration k, training a model \mathcal{M}_k based on a GPT-series model using the standard cross-entropy loss.

One of the things OpenAI could not solve well is calibration.

  • Step k+1: Using \mathcal{M}_k to run inference on the unlabeled production data; the probabilities are used to select the subset for annotation. Three methods are compared:
    • Purely Random Sampling
    • Random Sampling for Samples Above a Threshold
    • Uncertainty Sampling

Active learning substantially increases the ratio of harmful content in the selected subset relative to raw user traffic (by 10 to 22 times).

After the subset is annotated, it is added back to the training set. Further, there is also synthetic data that is added to address the counterfactual bias.

  • Step k+2: Running the following steps to further improve the training data.

    • Overfitted Phrase Detection.
    • Mislabeling Detection.
  • Step k+3: Internal red teaming.
  • Step k+4: \cdots
  • Step -3: Evaluating on the static test set.
  • Step -2: A/B testing.
  • Step -1: Product release.

Here is a more detailed diagram; it is the same as the one provided in the paper.

Future Direction

  • Dataset

    • A more systematic approach to creating synthetic datasets. The current approach OpenAI uses is rather ad hoc.
    • Robustness to prompt injection and ciphers.
  • Continuous GPT-Assisted Red Teaming
  • Active Learning
    • The current active learning approach relies on the model \mathcal{M}_k at Step k+1, and \mathcal{M}_k may not be able to generalize.
    • The presenter also mentions anomaly detection; it is not prioritized at OpenAI due to time constraints.

Reference