Reading Notes | Revisiting Hate Speech Benchmarks – From Data Curation to System Deployment

[Semantic Scholar] – [Code] – [Tweet] – [Video] – [Website] – [Slide]

Change Logs:

  • 2023-09-21: First draft. This paper appears at KDD 2023. The co-lead author – Sarah Masud – has published numerous papers on hate speech detection.

Additional Notes

  • Measuring Dataset Difficulty

    The authors compare the difficulty of different datasets using the JS divergence between Laplace-smoothed unigram distributions of the texts under each label pair; the lower the divergence, the closer the unigram distributions, and the harder it is to distinguish texts under that label pair (see the sketch after this list).

    For example, the proposed dataset has 4 labels, which leads to \binom{4}{2} = 6 divergence measures.

  • Matthews Correlation Coefficient (MCC)
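
Below is a minimal sketch (not the authors' code) of the pairwise difficulty measure described above: Laplace-smoothed unigram distributions per label and one JS divergence per label pair; the whitespace tokenization and smoothing constant are illustrative assumptions. For MCC, sklearn.metrics.matthews_corrcoef(y_true, y_pred) computes the score directly.

from collections import Counter
from itertools import combinations

import numpy as np
from scipy.spatial.distance import jensenshannon

def unigram_distribution(texts, vocab, alpha=1.0):
    # Laplace-smoothed unigram distribution over a shared vocabulary
    counts = Counter(token for text in texts for token in text.split())
    probs = np.array([counts[token] + alpha for token in vocab], dtype=float)
    return probs / probs.sum()

def pairwise_js_divergences(texts_by_label):
    # one divergence per label pair, e.g., C(4, 2) = 6 values for 4 labels
    vocab = sorted({token for texts in texts_by_label.values() for text in texts for token in text.split()})
    divergences = {}
    for label_a, label_b in combinations(texts_by_label, 2):
        p = unigram_distribution(texts_by_label[label_a], vocab)
        q = unigram_distribution(texts_by_label[label_b], vocab)
        # scipy returns the JS distance, i.e., the square root of the divergence
        divergences[(label_a, label_b)] = jensenshannon(p, q) ** 2
    return divergences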

Reference

Coding Notes | HuggingFace Reference

Basics

Hyperparameters

  • The hyperparameters are specified through TrainingArguments and Seq2SeqTrainingArguments.
  • model_name_or_path and output_dir are the only two required arguments. However, we should also set other critical hyperparameters, including num_train_epochs, per_device_train_batch_size, per_device_eval_batch_size, and learning_rate.

Evaluation, Logging, and Saving

  • It is better to set logging_steps to 1 and logging_strategy to "steps": logging is almost always useful and does not cause significant overhead.
  • It is better to specify eval_steps as 1 / n and eval_strategy as "steps", where n is the number of evaluations we want. This guarantees enough evaluation points even when the number of training steps or epochs is small.
  • load_best_model_at_end=True has to be paired with the following configurations (answer). It saves the best checkpoint according to the evaluations done throughout the training process:
    • After setting eval_steps to a decimal number, save_strategy has to be set to "steps" because save_steps has to be a multiple of eval_steps. As saving larger models takes a long time, we need to set save_steps to a reasonable number. For example, if we would like to evaluate the model 10 times (i.e., eval_steps set to 0.1), we could save twice (i.e., save_steps set to 0.5).
    • save_total_limit governs how many of the latest checkpoints are kept; it may end up keeping k+1 checkpoints even if save_total_limit=k because the best checkpoint is retained in addition to the latest k.
    • compute_metrics has a specific signature to follow. For example, the following is taken from the official run_glue.py. Here p.predictions depends on the specific model.
import numpy as np
from transformers import EvalPrediction

# `metric` (the task's evaluation metric) and `is_regression` are defined earlier in run_glue.py.
# You can define your custom compute_metrics function. It takes an `EvalPrediction` object (a namedtuple with
# `predictions` and `label_ids` fields) and has to return a dictionary mapping strings to floats.
def compute_metrics(p: EvalPrediction):
    preds = p.predictions[0] if isinstance(p.predictions, tuple) else p.predictions
    preds = np.squeeze(preds) if is_regression else np.argmax(preds, axis=1)
    result = metric.compute(predictions=preds, references=p.label_ids)
    if len(result) > 1:
        result["combined_score"] = np.mean(list(result.values())).item()
    return result
| Index | Hyperparameter | Value |
| --- | --- | --- |
| 1 | save_strategy, eval_strategy | "steps" or "epoch"; they have to be the same. |
| 2 | eval_steps | A reasonable value such as 0.1. |
| 3 | save_steps | Must be a multiple of eval_steps. |
| 4 | metric_for_best_model and compute_metrics | metric_for_best_model defaults to "loss" (i.e., eval_loss with the automatically prepended eval_); it could be set to other custom metrics defined in compute_metrics. |
  • It is recommended to use wandb. In order to do so, we need to set report_to and run_name. Note that if we want a custom run name on the wandb portal, we should set run_name rather than renaming the default output directory. A combined example configuration is sketched below.
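
A combined sketch of the settings above (critical hyperparameters, logging, evaluation, saving, and wandb); the concrete values and the run name are illustrative assumptions.

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="outputs",
    # critical hyperparameters
    num_train_epochs=3,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    learning_rate=2e-5,
    # log every step; logging is cheap and always useful
    logging_strategy="steps",
    logging_steps=1,
    # evaluate 10 times over the whole run (eval_steps as a ratio of total steps)
    evaluation_strategy="steps",
    eval_steps=0.1,
    # save twice; save_steps must be a round multiple of eval_steps
    save_strategy="steps",
    save_steps=0.5,
    save_total_limit=2,
    load_best_model_at_end=True,
    metric_for_best_model="eval_loss",
    # wandb logging
    report_to="wandb",
    run_name="sanity-check-run",
)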

Testing Training Scripts

| Index | Hyperparameter | Value | Notes |
| --- | --- | --- | --- |
| 1 | max_train_samples, max_eval_samples, max_test_samples | 100 | |
| 2 | save_strategy | "no" | |
| 3 | load_best_model_at_end | False | |
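
max_train_samples and its siblings are arguments of the official example scripts rather than of TrainingArguments; in a custom script, a comparable smoke test could simply truncate the datasets, as in this sketch:

from datasets import load_dataset
from transformers import TrainingArguments

# keep only a handful of samples so the whole script finishes in minutes
train_dataset = load_dataset("imdb", split="train").select(range(100))
eval_dataset = load_dataset("imdb", split="test").select(range(100))

training_args = TrainingArguments(
    output_dir="tmp",
    save_strategy="no",
    load_best_model_at_end=False,
)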

Checkpoints

If a model has been fine-tuned, most likely only the pytorch_model.bin file will change. We could reuse the original config.json and the tokenizer.

  • A runnable model only consists of a pytorch_model.bin and a config.json file. The config.json documents the metadata of the model.
  • A tokenizer consists of a list of files:

    tokenizer/
    ├── added_tokens.json
    ├── merges.txt
    ├── special_tokens_map.json
    ├── tokenizer_config.json
    ├── tokenizer.json
    └── vocab.json

However, if we save checkpoints during training, the Trainer already takes care of writing all of these files into each checkpoint directory.
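
A minimal sketch of this workflow; the paths are illustrative.

from transformers import AutoModelForSequenceClassification, AutoTokenizer

# the checkpoint directory only needs the updated pytorch_model.bin plus config.json;
# the tokenizer can still be loaded from the original model name
model = AutoModelForSequenceClassification.from_pretrained("path/to/checkpoint")
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# save_pretrained() writes config.json, pytorch_model.bin, and the tokenizer files listed above
model.save_pretrained("path/to/export")
tokenizer.save_pretrained("path/to/export")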

Inference

langchain, Pipeline, and Model Classes

The classes and methods provided by model.generate(), pipeline (or TextGenerationPipeline), and langchain are increasingly high-level: TextGenerationPipeline internally calls model.generate(), and langchain.llms.huggingface_pipeline.HuggingFacePipeline internally uses TextGenerationPipeline.

Therefore, it is sufficient to understand how model.generate() works and how the more abstract classes wrap the other classes. See the following example. Note that

  • A better way to specify arguments is not through a dictionary but through a predefined class such as transformers.GenerationConfig, assigned via model.generation_config = config (or passed as generation_config=config to model.generate()). This makes the most of the code-reference features available in PyCharm.
  • We should stick to transformers.pipeline rather than TextGenerationPipeline as the former has the unified API across different tasks.
  • Here is the decision flow of which API to use:

    | Index | API | Case |
    | --- | --- | --- |
    | 1 | model.generate() | When we need special control over the outputs, for example, adding human bias to the distribution similar to logit_bias for OpenAI APIs (example) or transformers.NoBadWordsLogitsProcessor. |
    | 2 | transformers.pipeline | Preferred as the first choice. |
    | 3 | langchain | When working with langchain. |
import os

os.environ["CUDA_VISIBLE_DEVICES"] = str(0)
os.environ["TOKENIZERS_PARALLELISM"] = "false"

import torch
from transformers import (
    pipeline,
    AutoTokenizer,
    AutoModelForCausalLM,
)
from langchain.llms.huggingface_pipeline import (
    HuggingFacePipeline,
)

##################################################

model = AutoModelForCausalLM.from_pretrained("gpt2")
tokenizer = AutoTokenizer.from_pretrained("gpt2")

prompt = "Great changes have taken place in the past 30 years"

##################################################
# method 1
# BAD: requires manually moving model and data to the device

config = {
    "do_sample": True,
    "top_p": 1,
    "num_return_sequences": 5,
    "temperature": 1,
    "max_new_tokens": 16,
}

device = torch.device("cuda:0")
model = model.to(device)
tokenizer.pad_token = tokenizer.eos_token

raw_response1 = model.generate(
    **tokenizer(prompt, return_tensors="pt").to(device),
    **config,
).squeeze()

texts1 = tokenizer.batch_decode(raw_response1)

##################################################
# method2
# GOOD

config = {
    "do_sample": True,
    "top_p": 1,
    "num_return_sequences": 5,
    "temperature": 1,
    "max_new_tokens": 16,
    "device": "cuda:0"
}

pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    **config
)

response2 = pipe(prompt)
texts2 = [response["generated_text"] for response in response2]

##################################################
# method 3
# GOOD: However, it could NOT generate multiple sequences at the same time

llm = HuggingFacePipeline(pipeline=pipe)
texts3 = list()
for _ in range(5):
    texts3.append(llm(prompt))

Controlled Generation

Enforcing or Forbidding Specific Tokens

This is done using disjunctive constraints (enforcing) or NoBadWordsLogitsProcessor (forbidding) internally in model.generate(). This could be easily implemented using the snippet below.

Note that when enforcing generation, setting num_beams to an integer greater than 1 is critical as enforcing presence of some tokens is implemented using beam search.

from transformers import AutoTokenizer, AutoModelForCausalLM

def get_tokens_as_list(model_name, word_list):
    tokenizer_with_prefix_space = AutoTokenizer.from_pretrained(model_name, add_prefix_space=True)
    tokens_list = []
    for word in word_list:
        tokenized_word = tokenizer_with_prefix_space([word], add_special_tokens=False).input_ids[0]
        tokens_list.append(tokenized_word)
    return tokens_list


model = AutoModelForCausalLM.from_pretrained("gpt2")
tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token_id = tokenizer.eos_token_id

prompt = "Great changes have taken place in the past 30 years"

inputs = tokenizer(prompt, return_tensors="pt")
output_ids = model.generate(inputs["input_ids"], max_new_tokens=5)
print(tokenizer.batch_decode(output_ids, skip_special_tokens=True)[0])

words_ids = get_tokens_as_list(model_name="gpt2", word_list=["Donald", "Trump"])

output_ids = model.generate(inputs["input_ids"], max_new_tokens=5, bad_words_ids=words_ids)
print(tokenizer.batch_decode(output_ids, skip_special_tokens=True)[0])

output_ids = model.generate(inputs["input_ids"], max_new_tokens=5, force_words_ids=words_ids, num_beams=10)
print(tokenizer.batch_decode(output_ids, skip_special_tokens=True)[0])

Inference on Multiple Devices

It does not seem easy to run inference on multiple devices. However, we could use the optimized attention implemented in torch>=2.0.0 and optimum to reduce the time and memory requirements; see the sketch below.
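
A hedged sketch of the optimum route (BetterTransformer); whether it helps, and whether to_bettertransformer() is available, depends on the model architecture and the installed transformers / optimum versions.

import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("gpt2", torch_dtype=torch.float16).to("cuda:0")

# swaps in fused attention kernels (scaled_dot_product_attention from torch>=2.0.0);
# requires the optimum package to be installed
model = model.to_bettertransformer()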

Instruction Tuning

Using the Basic transformers Library

It is possible to instruction-tune a language model using the official example script run_clm.py (working with gpt2) or following Philipp Schmid's blog (working with google/flan-t5-xl).

Using the trl Library

SFTTrainer provided in the trl library adds another layer of abstraction; this makes instruction-tuning even easier and cleaner. However, the downsides are that (1) it does not work well with deepspeed, and (2) it does not support everything defined in transformers.TrainingArguments (for example, setting save_steps to a decimal number); this limits its flexibility.

  • Tuning a Model with the Language Modeling Objective

    This could be done in fewer than 14 lines of code, for example, tuning an LM on the imdb dataset. We could add more configurations to the code skeleton below (for example, PEFT and 4-bit / 8-bit quantization) following the example script here.

from datasets import load_dataset
from transformers import AutoModelForCausalLM
from trl import SFTTrainer

dataset = load_dataset("imdb", split="train")
model = AutoModelForCausalLM.from_pretrained("facebook/opt-350m")

trainer = SFTTrainer(
    model,
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=512,
)
trainer.train()
  • Tuning a Model using the Completions – Self-Instruction

Tuning Large Models with Constrained Hardware

Overview

We have the following decision matrix when working on a single node (a node is a machine whose GPUs are physically connected); there may be other considerations when working with multiple nodes.

| Scenario | Single GPU | Multiple GPUs |
| --- | --- | --- |
| Model fits into a single GPU | | DDP; ZeRO (may or may not be faster) |
| Model does not fit into a single GPU | ZeRO + Offload CPU + MCT (optional) + NVMe (optional) | PP (preferred if NVLink or NVSwitch is not available); ZeRO; TP |
| Largest layer does not fit into a single GPU | ZeRO + Offload CPU + MCT + NVMe (optional) | TP; ZeRO + Offload CPU + MCT + NVMe (optional) |
  • One single 7B LLaMA model is already almost 30 GB on HuggingFace; the 13B version will be even larger.
  • When using custom training loops, the accelerate library improves upon torch.distributed and makes it possible to run the same code on any hardware setting without modification.

    When using Trainer(), all of the distributed training settings could be done without using accelerate.

  • ZeRO is implemented using deepspeed; a minimal configuration sketch follows.
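
A minimal sketch of wiring ZeRO into Trainer() through a DeepSpeed config; the stage-2 and CPU-offload settings are illustrative assumptions, and the "auto" values are filled in by the Trainer from TrainingArguments.

from transformers import TrainingArguments

# ZeRO stage 2 with the optimizer states offloaded to CPU; the script has to be
# launched with the deepspeed (or accelerate) launcher for this to take effect
ds_config = {
    "zero_optimization": {
        "stage": 2,
        "offload_optimizer": {"device": "cpu"},
    },
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",
    "fp16": {"enabled": "auto"},
}

training_args = TrainingArguments(
    output_dir="outputs",
    per_device_train_batch_size=1,
    fp16=True,
    deepspeed=ds_config,  # a path to a JSON file also works
)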

Using a Single GPU

  • A typical model with AdamW optimizer requires 18 bytes per parameter.
  • Besides the methods described below, one could try the accelerate library to use the same torch code for any hardware configuration (CPU, single GPU, or multiple GPUs).
| Method | Speed 📈 | Memory 📉 | Note |
| --- | --- | --- | --- |
| Batch size | Yes | Yes | It should default to 8, but choosing a batch size that makes the most of the GPU is complicated. |
| Dataloader | Yes | No | Always set pin_memory=True and num_workers=4 (or 8, 16, ...) when possible. |
| Optimizer | Yes | Yes | Adafactor saves 50% of memory compared to Adam or AdamW but converges more slowly; it is supported out of the box. Alternatively, 8-bit AdamW saves more than 50% of memory when bitsandbytes is installed and used. |
| Gradient checkpointing | No | Yes | Supported by Trainer(..., gradient_checkpointing=True, ...). |
| Gradient accumulation | No | Yes | Supported by Trainer(..., gradient_accumulation_steps=4, ...). |
| Mixed precision training | Yes | No | fp16 is supported in TrainingArguments(..., fp16=True, ...). With Ampere GPUs such as A100 or RTX 3090, bf16=True or tf32=True (with torch.backends.cuda.matmul.allow_tf32 = True) could be set. |
| DeepSpeed ZeRO | No | Yes | Useful when the model does not fit into the GPU even with the smallest batch size; supported out of the box with Trainer(). |

We could use the code below to measure the GPU utilization:

from pynvml import *

def print_gpu_utilization():
    nvmlInit()
    handle = nvmlDeviceGetHandleByIndex(0)
    info = nvmlDeviceGetMemoryInfo(handle)
    print(f"GPU memory occupied: {info.used//1024**2} MB.")

For example, when tuning a bert-large-uncased model with some dummy data, we could measure the effect of each technique on top of the following vanilla code:

  • Vanilla Code to Tune a Classification Model
import os
os.environ["CUDA_VISIBLE_DEVICES"] = str(0)

import numpy as np

from datasets import Dataset
from transformers import (
    Trainer,
    logging,
    TrainingArguments,
    AutoModelForSequenceClassification,
)

from utils.common import print_gpu_utilization

##################################################
logging.set_verbosity_error()

dataset_size, seq_len = 512, 512
train_dataset = Dataset.from_dict(
    {
        "input_ids": np.random.randint(100, 30000, (dataset_size, seq_len)),
        "labels": np.random.randint(0, 1, dataset_size),
    }
)
train_dataset.set_format("pt")
print_gpu_utilization()

##################################################

default_args = {
    "output_dir": "tmp",
    "evaluation_strategy": "steps",
    "num_train_epochs": 1,
    "log_level": "error",
    "report_to": "none"
}

training_args = TrainingArguments(
    per_device_train_batch_size=4,
    **default_args
)

model = AutoModelForSequenceClassification.from_pretrained("bert-large-uncased").to("cuda")
print_gpu_utilization()

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset
)
result = trainer.train()

After making updates to the vanilla code, we could see the changes in the memory usage.

| Basic Setup | Memory (MB) |
| --- | --- |
| Loading dummy data | 2631 |
| Loading model with per_device_train_batch_size=4 | 14949 |
| Loading model with per_device_train_batch_size=4 + 8-bit Adam | 13085 |
| Loading model with per_device_train_batch_size=4 + optim="adafactor" | 12295 |
| Loading model with per_device_train_batch_size=4 + fp16=True | 13939 |
| Loading model with per_device_train_batch_size=4 + fp16=True + gradient_checkpointing=True | 7275 |
| Loading model with per_device_train_batch_size=1 + gradient_accumulation_steps=4 | 8681 |
| Loading model with per_device_train_batch_size=1 + gradient_accumulation_steps=4 + gradient_checkpointing=True | 6775 |
| Loading model with per_device_train_batch_size=1 + gradient_accumulation_steps=4 + gradient_checkpointing=True + fp16=True and using accelerate | 5363 |

Using Multiple GPUs

There are data, tensor, and pipeline parallelism when working with multiple GPUs. Each has pros and cons, and there is no universally good solution that fits every situation.

  • Data Parallelism (DP): The same setup is replicated on all devices but we split the data and send them to different devices. One may see the acronym DDP, which refers to Distributed DP.
  • Tensor Parallelism (TP): Splitting a tensor into multiple shards and processing each shard on a different device; it is also called horizontal parallelism.

    ZeRO (Zero Redundancy Optimizer) is a preferred type of TP that does not require changes to the model.

  • Pipeline Parallelism (PP): Placing a few layers of the model on each GPU; it is also called vertical parallelism.

According to Jason Phang, ZeRO is a more efficient method than the ones (such as PEFT and PP) presented in his minimal-llama repository:

There ought to be more efficient methods of tuning (DeepSpeed / ZeRO, NeoX) than the ones presented here, but folks may find this useful already.

Reference

| Index | Name | Note |
| --- | --- | --- |
| 1 | https://huggingface.co/docs/transformers/perf_train_gpu_one | Official tutorial |
| 2 | https://huggingface.co/docs/transformers/perf_train_gpu_many | Official tutorial |
| 3 | https://github.com/zphang/minimal-llama | Jason Phang |
| 4 | https://huggingface.co/docs/transformers/main_classes/deepspeed | HuggingFace documentation |
| 5 | https://huggingface.co/blog/4bit-transformers-bitsandbytes | Fine-tuning LLMs like llama, gpt-neox, and t5. |
| 6 | https://huggingface.co/blog/pytorch-ddp-accelerate-transformers | Official tutorial |

Using simpletransformers

Comparing transformers and simpletransformers

simpletransformers is a wrapper around transformers that abstracts away some unnecessary details for training and running inference with a wide array of models, including text classification (multi-class and multi-label) and regression.

The number of lines of code is significantly reduced if we switch from transformers to simpletransformers. For example, the code below runs inference with distilbert-base-uncased-finetuned-sst-2-english on the sst2 validation set:

import os

os.environ["WANDB_DISABLED"] = "true"
os.environ["TOKENIZERS_PARALLELISM"] = "false"

import numpy as np
import pandas as pd

from sklearn.metrics import (
    classification_report,
)
from datasets import (
    Dataset,
    load_dataset
)
from transformers import (
    Trainer,
    AutoTokenizer,
    TrainingArguments,
    default_data_collator,
    AutoModelForSequenceClassification,
)


model_name = "distilbert-base-uncased-finetuned-sst-2-english"
dataset = load_dataset("glue", "sst2", split="validation")

model = AutoModelForSequenceClassification.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

tokenized_dataset = dataset.map(
    lambda x: tokenizer(x["sentence"], padding="max_length", truncation=True, max_length=256),
)

training_args = TrainingArguments(
    output_dir="outputs",
    per_device_eval_batch_size=256,
    remove_unused_columns=True,
)
trainer = Trainer(
    model=model,
    tokenizer=tokenizer,
    args=training_args,
    data_collator=default_data_collator,
)

y_pred = np.argmax(trainer.predict(tokenized_dataset).predictions, axis=1)
y_true = dataset["label"]

print(classification_report(
    y_true=y_true,
    y_pred=y_pred
))

By comparison, we could obtain exactly the same results with simpletransformers:

import os

os.environ["WANDB_DISABLED"] = "true"
os.environ["TOKENIZERS_PARALLELISM"] = "false"

from sklearn.metrics import (
    classification_report,
)
from datasets import (
    load_dataset
)

from simpletransformers.classification import (
    ClassificationModel,
    ClassificationArgs,
)


model_name = "distilbert-base-uncased-finetuned-sst-2-english"
dataset = load_dataset("glue", "sst2", split="validation")

model_args = ClassificationArgs()

model_args.eval_batch_size = 256
model_args.max_seq_length = 256
model_args.use_multiprocessing = False
model_args.use_multiprocessing_for_evaluation = False

model = ClassificationModel(
    model_type="distilbert",
    model_name=model_name,
    num_labels=2,
    args=model_args,
)
y_pred, _ = model.predict(dataset["sentence"])
y_true = dataset["label"]

print(classification_report(
    y_true=y_true,
    y_pred=y_pred
))

Minimal Working Example

simpletransformers, which is built on top of transformers, could train and evaluate PLMs more quickly and cleanly; it also comes with full support for wandb. Note that

  • The number of steps is computed based on one GPU even if model_args.n_gpu is set to a different value. Therefore, we should not further divide n_total_steps by model_args.n_gpu.
  • By default, there is an evaluation at the end of each epoch. Therefore, setting n_eval=10 leads to model_args.num_train_epochs + n_eval evaluations; in the example below, there will be 13 evaluations.

The following example fine-tunes bert-base-uncased on the imdb dataset:

import os

os.environ["TOKENIZERS_PARALLELISM"] = "false"

import pandas as pd

from datasets import load_dataset
from sklearn.model_selection import train_test_split
from simpletransformers.classification import (
    ClassificationModel,
    ClassificationArgs
)

model_args = ClassificationArgs()

model_class = "roberta"
model_name = "roberta-base"

##################################################
# see full list of configurations:
# https://simpletransformers.ai/docs/usage/#configuring-a-simple-transformers-model
# critical settings
model_args.learning_rate = 1e-5
model_args.num_train_epochs = 3
model_args.train_batch_size = 32
model_args.eval_batch_size = 32
model_args.gradient_accumulation_steps = 1
model_args.fp16 = False
model_args.max_seq_length = 128
model_args.n_gpu = 4
model_args.use_multiprocessing = False
model_args.use_multiprocessing_for_evaluation = False

# saving settings
model_args.no_save = False
model_args.overwrite_output_dir = True
model_args.output_dir = "outputs/"

# the following mandates that only the best checkpoint will be saved; there will be only 1 checkpoint
model_args.best_model_dir = "{}/best_model".format(model_args.output_dir)
model_args.save_model_every_epoch = False
model_args.save_best_model = True
model_args.save_eval_checkpoints = False
model_args.save_steps = -1

# validation criterion
model_args.use_early_stopping = False
model_args.early_stopping_metric = "auroc"
model_args.early_stopping_metric_minimize = False

# evaluation settings
model_args.evaluate_during_training = True

# logging settings
model_args.silent = False
model_args.wandb_project = "simpletransformers"
model_args.wandb_kwargs = {
    "name": "sanity-check-imdb"
}

##################################################
# loading dataset

ds = load_dataset("imdb")

# data splits
# there has to be "text" and "labels" columns in the input dataframe
df = pd.DataFrame(ds["train"]).rename(columns={"label": "labels"})

train_df, eval_df = train_test_split(df.sample(frac=0.1), test_size=0.2)
test_df = pd.DataFrame(ds["test"]).sample(frac=0.1).rename(columns={"label": "labels"})

##################################################
# adaptive steps settings
# we will evaluate 10 times and log 100 times no matter how small the dataset

n_eval, n_log = 10, 100
n_total_steps = round(len(train_df) / model_args.train_batch_size) * model_args.num_train_epochs

model_args.evaluate_during_training_steps = max(1, round(n_total_steps / n_eval))
model_args.logging_steps = max(1, round(n_total_steps / n_log))
model_args.save_steps = -1

##################################################
# training
model = ClassificationModel(
    model_class,
    model_name,
    num_labels=2,
    args=model_args,
)

model.train_model(
    train_df=train_df,
    eval_df=eval_df
)

# test
result, model_outputs, wrong_predictions = model.eval_model(test_df)

Validation and Early Stopping

  • Validation

    Choosing which model checkpoint to save (aka validation) depends on early_stopping_metric and early_stopping_metric_minimize even when early stopping itself is disabled.

  • Early Stopping

    If we need to use early stopping, we also need to be aware of the following hyperparameters.

| Name | Default | Note |
| --- | --- | --- |
| use_early_stopping | False | |
| early_stopping_metric | "eval_loss" | evaluate_during_training has to be True; it uses the metrics computed during evaluation. |
| early_stopping_metric_minimize | True | |
| early_stopping_consider_epochs | False | |
| early_stopping_patience | 3 | Terminate training after early_stopping_patience evaluations without an improvement larger than early_stopping_delta. |
| early_stopping_delta | 0 | |
class ClassificationModel:

    def train_model(
        self,
        train_df,
        multi_label=False,
        output_dir=None,
        show_running_loss=True,
        args=None,
        eval_df=None,
        verbose=True,
        **kwargs,
    ):
        # ...
        global_step, training_details = self.train(
            train_dataloader,
            output_dir,
            multi_label=multi_label,
            show_running_loss=show_running_loss,
            eval_df=eval_df,
            verbose=verbose,
            **kwargs,
        )
        # ...


    def train(
        self,
        train_dataloader,
        output_dir,
        multi_label=False,
        show_running_loss=True,
        eval_df=None,
        test_df=None,
        verbose=True,
        **kwargs,
    ):
        # ...
        best_eval_metric = None

        # ...
        if not best_eval_metric:
            best_eval_metric = results[args.early_stopping_metric]
            self.save_model(
                args.best_model_dir,
                optimizer,
                scheduler,
                model=model,
                results=results,
            )
        # ...

Using Sentence-Transformers

Overview

  • sentence_transformer is built with torch despite a resemblance to the keras API.
  • The famous MTEB benchmark is also largely built on top of the sentence_transformers library.

Fine-Tuning Embeddings

Besides an easy interface to generate embeddings, the sentence_transformers library also supports fine-tuning the provided embedding models. The following data formats all have their corresponding loss functions without a need to convert data to a specific format (for example, triplets) (see blog).

Note that these loss functions come from the sentence_transformers library rather than torch or transformers. These loss functions have been discussed in a blog post that is not affiliated with the developers of sentence_transformers.

| Index | Description | Data | Loss | Note |
| --- | --- | --- | --- | --- |
| 1 | A pair of sentences and a label | (premise, hypothesis, label) | ContrastiveLoss; SoftmaxLoss; CosineSimilarityLoss | |
| 2 | An individual sentence and its label | (text, label) | BatchHardTripletLoss and variants | "Batch hard" performs best in the blog post. |
| 3 | A pair of similar sentences | (query, response), (src_lang, tgt_lang), (full_text, summary), (text1, text2) (e.g., QQP), (text, entailed_text) (e.g., NLI) | MultipleNegativesRankingLoss; MegaBatchMarginLoss | Frequent |
| 4 | A triplet of an anchor, a positive, and a negative | (anchor, positive, negative) | TripletLoss | Rare, as it requires offline mining. |

Here is a minimal working example of fine-tuning representations on the sst2 dataset; we could optionally evaluate the fine-tuned model on the MTEB benchmark, which is also built with the sentence_transformers library.

Note that:

  • The sentence_transformers library does not have native wandb support like simpletransformers. We could only monitor a single score through a callback such as log_with_wandb() below, whose signature has to be exactly (score, epoch, steps). The score it monitors depends on which specific evaluator is used (see the complete list of evaluators here).

    When working with TripletEvaluator as in the example below, the returned metric is the fraction of triplets that satisfy $d(a, p) < d(a, n)$.

  • We could easily replace the model with models available on the HuggingFace hub.
import os
import wandb
import random
import logging

import pandas as pd

from datetime import datetime
from collections import defaultdict
from sentence_transformers import (
    SentenceTransformer,
    InputExample,
    SentencesDataset
)
from sentence_transformers.evaluation import (
    TripletEvaluator,
)

from sentence_transformers import LoggingHandler
from sentence_transformers.losses import (
    BatchHardTripletLoss,
)

from datasets import load_dataset
from torch.utils.data import DataLoader


logging.basicConfig(
    format="%(asctime)s - %(message)s",
    datefmt="%Y-%m-%d %H:%M:%S",
    level=logging.INFO,
    handlers=[LoggingHandler()],
)

def triplets_from_labeled_dataset(
    records,
    text_column="sentence",
    label_column="label"
):
    # Create triplets for a [(label, sentence), (label, sentence)...] dataset
    # by using each example as an anchor and selecting randomly a
    # positive instance with the same label and a negative instance with a different label

    input_examples = [
        InputExample(guid=str(guid), texts=[record[text_column]], label=record[label_column])
        for guid, record in enumerate(records)
    ]

    triplets = []
    label2sentence = defaultdict(list)
    for inp_example in input_examples:
        label2sentence[inp_example.label].append(inp_example)

    for inp_example in input_examples:
        anchor = inp_example

        if len(label2sentence[inp_example.label]) < 2: #We need at least 2 examples per label to create a triplet
            continue

        positive = None
        while positive is None or positive.guid == anchor.guid:
            positive = random.choice(label2sentence[inp_example.label])

        negative = None
        while negative is None or negative.label == anchor.label:
            negative = random.choice(input_examples)

        triplets.append(InputExample(texts=[anchor.texts[0], positive.texts[0], negative.texts[0]]))

    return triplets

##################################################

model_name = 't5-base'
num_epochs = 10

##################################################

current_time = datetime.now().strftime("%Y-%m-%d_%H-%M-%S")
wandb.init(
    project="sentence_transformers",
    name=f"{model_name}-{current_time}"
)

##################################################
# model

output_path = (
    "output/"
    + model_name
    + "-"
    + current_time
)

model = SentenceTransformer(model_name)

##################################################

def get_dataloader(df, split, text_column, label_column, batch_size=8):
    records = df.to_dict("records")
    examples = [
        InputExample(texts=[record[text_column]], label=record[label_column])
        for record in records
    ]
    dataset = SentencesDataset(
        examples=examples,
        model=model,
    )
    dataloader = DataLoader(dataset, shuffle=True, batch_size=batch_size)

    return dataloader

##################################################
# data

ds = load_dataset("sst2")

train_df = pd.DataFrame(ds["train"])
val_df = pd.DataFrame(ds["validation"])
test_df = pd.DataFrame(ds["test"])

train_dataloader = get_dataloader(train_df, "train", text_column="sentence", label_column="label")

##################################################

train_loss = BatchHardTripletLoss(model=model)
val_evaluator = TripletEvaluator.from_input_examples(
    triplets_from_labeled_dataset(val_df[["sentence", "label"]].to_dict("records")),
    name="eval"
)
val_evaluator(model)

##################################################

def log_with_wandb(score, epoch, steps):
    # https://docs.wandb.ai/ref/python/log
    wandb.log(
        data={"score": score},
        step=steps,
    )

warmup_steps = int(len(train_dataloader) * num_epochs * 0.1)  # 10% of train data

model.fit(
    [(train_dataloader, train_loss)],
    show_progress_bar=True,
    epochs=num_epochs,
    evaluator=val_evaluator,
    evaluation_steps=50,
    warmup_steps=warmup_steps,
    output_path=output_path,
    callback=log_with_wandb
)
##################################################

test_evaluator = TripletEvaluator.from_input_examples(
    triplets_from_labeled_dataset(test_df[["sentence", "label"]].to_dict("records")),
    name="test"
)
model.evaluate(test_evaluator)

As our goal is not evaluating triplets but the quality of clustering, we could define our own evaluator.

from sklearn.cluster import KMeans
from sklearn.metrics import v_measure_score

from sentence_transformers.evaluation import (
    SentenceEvaluator,
)

class ClusteringEvaluator(SentenceEvaluator):
    def __init__(self, texts, labels, batch_size=32, show_progress_bar=False):
        self.texts = texts
        self.labels = labels
        self.batch_size = batch_size
        self.show_progress_bar = show_progress_bar

    def __call__(self, model, output_path: str = None, epoch: int = -1, steps: int = -1):
        embeddings = model.encode(
            self.texts, batch_size=self.batch_size, show_progress_bar=self.show_progress_bar, convert_to_numpy=True
        )
        y_pred = KMeans(n_clusters=len(set(self.labels)), n_init="auto").fit_predict(embeddings)
        score = v_measure_score(
            labels_true=self.labels,
            labels_pred=y_pred
        )

        return score
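
A usage sketch, assuming val_df from the earlier example is in scope:

clustering_evaluator = ClusteringEvaluator(
    texts=val_df["sentence"].tolist(),
    labels=val_df["label"].tolist(),
)

# either call it directly for a one-off score or pass it to model.fit(evaluator=...)
print(clustering_evaluator(model))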

Customization

Saving Checkpoints

Similar to simpletransformers, sentence_transformers could save the best checkpoint according to the evaluation metric. Model saving is controlled by the _eval_during_training() and _save_checkpoint() functions.

  • If save_best_model=True, the best model will be saved at the root directory of the output_path. Saving best checkpoint is enabled by default.
  • If we want to save additional checkpoints, these additional checkpoints will be saved at checkpoint_path; the total number of saved checkpoints is governed by checkpoint_save_steps and checkpoint_save_total_limit. Different checkpoints will be stored in the folder named <step>.

    Saving additional checkpoints is disabled by default; a sketch of the relevant fit() arguments follows.
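
A sketch of the relevant model.fit() arguments (reusing the objects from the fine-tuning example above); the step counts and paths are illustrative.

model.fit(
    [(train_dataloader, train_loss)],
    epochs=num_epochs,
    evaluator=val_evaluator,
    evaluation_steps=50,
    output_path=output_path,  # the best model (by the evaluator score) is saved here
    save_best_model=True,
    checkpoint_path=output_path + "/checkpoints",  # additional checkpoints, one <step> folder each
    checkpoint_save_steps=500,
    checkpoint_save_total_limit=2,
)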

Loss Functions

According to the doc, we should choose which loss to use based on the format of the data we have. There are 14 loss functions supported by sentence_transformers.

| Index | Loss Function | Data Format | Publication | Note |
| --- | --- | --- | --- | --- |
| 1 | BatchAllTripletLoss | (text, label) | 1 | Uses all positives and negatives within the PK batch, leading to PK \cdot (PK-K) \cdot (K-1) pairs. |
| 2 | BatchSemiHardTripletLoss | (text, label) | 1 | |
| 3 | BatchHardTripletLoss | (text, label) | 1 | Finds the hardest positive and negative within the PK batch, leading to PK pairs. |
| 4 | BatchHardSoftMarginTripletLoss | (text, label) | 1 | Replaces the hinge function with a softplus function. |
| 5 | ContrastiveLoss | (text1, text2, label) | 4 | |
| 6 | OnlineContrastiveLoss | (text1, text2, label) | 4 | |
| 7 | SoftmaxLoss | (text1, text2, label) | 2 | |
| 8 | CosineSimilarityLoss | (text1, text2, similarity) | | |
| 9 | DenoisingAutoEncoderLoss | (corrupted text, original text) | 5 | |
| 10 | MultipleNegativesRankingLoss | (anchor, positive) | 8 | |
| 11 | MegaBatchMarginLoss | (anchor, positive) | 7 | Requires a large batch size (such as 500). |
| 12 | TripletLoss | (anchor, positive, negative) | | Requires Offline Hard Mining (OHM) as described in 1. |
| 13 | MSELoss | (src embedding, tgt embedding) | 3 | Aligning embeddings of multiple languages. |
| 14 | MarginMSELoss | (a, p, n, d(a, p), d(a, n)) | 6 | Very stringent requirements for the input data. |
  1. [1703.07737] In Defense of the Triplet Loss for Person Re-Identification: This paper overturns the prevailing belief that the more intuitive triplet loss is worse than the surrogate classification loss by proposing new loss functions; it also critically points out the limitations of the TripletLoss:

    A major caveat of the triplet loss, though, is that as the dataset gets larger, the possible number of triplets grows cubically, rendering a long enough training impractical. To make matters worse, f _ \theta relatively quickly learns to correctly map most trivial triplets, rendering a large fraction of all triplets uninformative.

    The goal of metric learning is to preserve the "semantic distance" in the metric space: two semantically similar sentences should be close in the metric space and two dissimilar ones should be remote to each other in the embedding space.

    Overall, the "batch-hard" version, possibly with a soft margin, performs best among all loss functions.


  2. [1908.10084] Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks
  3. [2004.09813] Making Monolingual Sentence Embeddings Multilingual using Knowledge Distillation
  4. Dimensionality Reduction by Learning an Invariant Mapping (CVPR 2006, Yann LeCun)
  5. [2104.06979] TSDAE: Using Transformer-based Sequential Denoising Auto-Encoder for Unsupervised Sentence Embedding Learning
  6. [2010.02666] Improving Efficient Neural Ranking Models with Cross-Architecture Knowledge Distillation
  7. ParaNMT-50M: Pushing the Limits of Paraphrastic Sentence Embeddings with Millions of Machine Translations (Wieting & Gimpel, ACL 2018)
  8. [1705.00652] Efficient Natural Language Response Suggestion for Smart Reply: Section 4.4 defines the multiple negative loss.

RLHF with trl Library

The trl library provides a one-stop solution for instruction tuning (i.e., SFT), reward modeling, and PPO. The library supports peft and 4-bit (or 8-bit) tuning natively so that we could tune an LM on consumer devices.

trl defines the custom classes AutoModelForCausalLMWithValueHead and AutoModelForSeq2SeqLMWithValueHead so that PPO can be performed; the value head returns an unbounded score (through nn.Linear(hidden_size, 1)) for each generated token, as sketched below.
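
A minimal sketch; the tuple returned by the forward pass (lm_logits, loss, value) is the one documented for trl's value-head models and should be verified against the installed version.

import torch
from transformers import AutoTokenizer
from trl import AutoModelForCausalLMWithValueHead

model = AutoModelForCausalLMWithValueHead.from_pretrained("gpt2")
tokenizer = AutoTokenizer.from_pretrained("gpt2")

inputs = tokenizer("Great changes have taken place", return_tensors="pt")
with torch.no_grad():
    # the value head is an nn.Linear(hidden_size, 1); values has shape (batch_size, sequence_length)
    lm_logits, loss, values = model(**inputs)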

ICL with Long Prompt

There are several solutions to long-prompt generation, including ALiBi and YaRN.

  • YaRN

    As of 2023-11-22, there is an open-source model with a 128K context window. This does not mean that we could do in-context learning with an arbitrary number of shots. There is a major gap between the paper and real-world applications: both the model and the data consume GPU memory, and the memory required for the data scales with the context length (see an explanation here). The paper instead evaluates with a surrogate metric: the authors show that they could work with a 128K context window by computing the perplexity with a sliding window.

  • ALiBi

    • mosaicml/mpt-7b-8k-instruct, mosaicml/mpt-7b-8k-chat, and mosaicml/mpt-7b-8k.
    • mosaicml/mpt-7b-storywriter: This model could extrapolate beyond 65K tokens.
  • LLongMA

Reading Notes | Understanding Dataset Difficulty with V-Usable Information

[Semantic Scholar] – [Code] – [Tweet] – [Video] – [Website] – [Slide]

Change Logs:

  • 2023-09-19: First draft. This paper appears as one of the outstanding papers at ICML 2022.

Overview

The main contribution of the paper is a metric that evaluates the aggregate and sample-wise difficulty of a dataset for a model family \mathcal{V}: a lower score indicates a more difficult dataset. This metric is appealing because it is able to do all five of the following, while previous approaches could only do one to three of them. Specifically,

  • Comparing Datasets: DIME (accepted as a workshop paper at NeurIPS 2020), IRT [4].
  • Comparing Models: Dynascore [3]
  • Comparing Instances: Data Shapley [5]
  • Comparing Dataset Slices
  • Comparing Attributes: The paper [6] estimates the attribute importance using MDL.

Method

Despite the theoretical constructs in Section 2, computing the proposed metric is fairly straightforward.

Suppose we have a training set \mathcal{D} _ \text{train} and a test set \mathcal{D} _ \text{test} for a task such as NLI. The proposed metric requires fine-tuning two models from the same base model \mathcal{V} on \mathcal{D} _ \text{train} and collecting measurements on \mathcal{D} _ \text{test} (Algorithm 1):

  • Step 1: Fine-tuning a model g’ on \mathcal{D} _ \text{train} = { (x_1, y_1), \cdots, (x_m, y_m) } and another model g on { (\phi, y_1), \cdots, (\phi, y_m) }, where \phi is an empty string; both g’ and g are initialized from the same base model, such as bert-base-uncased.
  • Step 2: For each test sample, the sample-wise difficulty (aka. PVI) is defined as \mathrm{PVI}(x_i \rightarrow y_i) := -\log_2 g(y_i\vert \phi) + \log_2 g'(y_i\vert x_i); the aggregate difficulty is its average \hat{I} _ \mathcal{V}(X \rightarrow Y) = \frac{1}{n}\sum _ i \mathrm{PVI}(x_i \rightarrow y_i).

    If the input and output are independent, the metric is provably 0; empirically, it will be close to 0. A sketch of the computation in Step 2 follows.
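
A minimal sketch of Step 2, assuming log2_p_conditional(x, y) and log2_p_null(y) return the base-2 log-probabilities of the label under the fine-tuned models g' and g, respectively:

import numpy as np

def pvi_scores(test_samples, log2_p_conditional, log2_p_null):
    # PVI(x -> y) = -log2 g(y | empty input) + log2 g'(y | x); the aggregate metric is the mean
    scores = [
        log2_p_conditional(x, y) - log2_p_null(y)
        for x, y in test_samples
    ]
    return scores, float(np.mean(scores))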

Note that:

  • The method requires a reasonably large dataset \mathcal{D} _ \text{train}. However, the exact size is not known in advance unless we train many models and wait to see when the curve plateaus, which is not feasible in practice. The authors use 80% of the SNLI dataset for estimation (Appendix A).
  • The specific choice of models, hyperparameters, and random initializations does not influence the results a lot (Section 3.2).

Applications

There are several applications when we use the proposed metric to rank the samples in a dataset:

  • Identifying the annotation errors (Section 3).
  • Using the metric to select challenging samples for data selection, including training data selection, data augmentation, and TCP (Section 4).
  • Guiding the creation of new specifications as it is possible to compute the token-wise metric (Section 4.3).

Additional Notes

  • It is quite surprising that the CoLA dataset is more difficult than SNLI and MNLI according to the authors’ measure.

Code

Reference

  1. Dataset Cartography: Mapping and Diagnosing Datasets with Training Dynamics (Swayamdipta et al., EMNLP 2020): The method in the main paper and the method in this paper both require training a model.
  2. [2002.10689] A Theory of Usable Information Under Computational Constraints (Xu et al., ICLR 2020).
  3. [2106.06052] Dynaboard: An Evaluation-As-A-Service Platform for Holistic Next-Generation Benchmarking
  4. Evaluation Examples are not Equally Informative: How should that change NLP Leaderboards? (Rodriguez et al., ACL-IJCNLP 2021)
  5. [1904.02868] Data Shapley: Equitable Valuation of Data for Machine Learning (ICML 2019): Data shapley could give a pointwise estimate of a sample’s contribution to the decision boundary.
  6. [2103.03872] Rissanen Data Analysis: Examining Dataset Characteristics via Description Length (ICML 2021).

Research Notes | Manuscript Preparation in LaTeX

Overview

Computer science conferences have a high tolerance for style variability, which leads to stark variance in typesetting quality even for the final camera-ready versions. Here is one such example: the left one, from [1], is typeset much better than [2], a random sample from the same conference in the same year. Since the latter is representative of almost all papers from that conference, paper [1] easily stands out.

Template

  • Some templates look more professional than others. Whenever possible, we should use such templates.

Fonts

  • Use the lmodern package via \usepackage{lmodern} in the preamble; this single command significantly improves the first impression of the manuscript (a minimal sketch follows).
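
A minimal preamble sketch; the fontenc line is a common companion of lmodern rather than something required by the note above.

\documentclass{article}

% Latin Modern: a sharper, fully scalable successor to Computer Modern
\usepackage{lmodern}
\usepackage[T1]{fontenc}

\begin{document}
The quick brown fox jumps over the lazy dog. $e^{i\pi} + 1 = 0$
\end{document}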

Graphics

Reference

  1. packages – Suggest a “nice” font family for my basic LaTeX template (text and math) – TeX – LaTeX Stack Exchange

Reading Notes | NoisywikiHow – A Benchmark for Learning with Real-world Noisy Labels in Natural Language Processing

[Semantic Scholar] – [Code] – [Tweet] – [Video] – [Website] – [Slide]

Change Logs:

  • 2023-09-14: First draft. The paper appears at ACL 2023. The code base has very detailed instructions on how to reproduce their results.

Method

  • The authors find that the labeling errors are both annotator-dependent and instance-dependent.

Experiments

  • The best performing LNL method on the benchmark is SEAL [1]; one could also consider MixUp regularization [2]. All other LNL methods are almost indistinguishable from the base models, i.e., from not doing any intervention on the training process.

Additional Note

Comments

  • The reason creating a new dataset is necessary is that users could customize the noise level to compare the performance of different algorithms in a controlled setting.

Reference

  1. [2012.05458] Beyond Class-Conditional Assumption: A Primary Attempt to Combat Instance-Dependent Label Noise (Chen et al. AAAI 2021).
  2. [1710.09412] mixup: Beyond Empirical Risk Minimization (Zhang et al., ICLR 2018, 7.6K citations).
  3. Nonlinear Mixup: Out-of-Manifold Data Augmentation for Text Classification (Guo, AAAI 2020). One application of MixUp regularization in NLP. It is based on a CNN classifier and the improvement is quite marginal.
  4. [2006.06049] On Mixup Regularization (Carratino et al., JMLR): A theoretical analysis of MixUp regularization.
  5. Learning with Noisy Labels (Natarajan et al., NIPS 2013): This paper is the first paper that (theoretically) studies LNL. It considers the binary classification problem where labels are randomly flipped, which is theoretically appealing but less relevant empirically according to the main paper.

Research Notes | Training Data Optimization

Problem Statement

Suppose we have a collection of datasets from K sources \mathcal{D} _ 1, \cdots, \mathcal{D} _ K. These K datasets have been unified regarding input and output spaces.

Now we split each \mathcal{D} _ i into train, validation, and test splits \mathcal{D} _ i ^ \text{train}, \mathcal{D} _ i ^ \text{val}, and \mathcal{D} _ i ^ \text{test} and form the aggregated train, validation, and test sets as \mathcal{D}^\text{train} := \cup _ {i=1}^K \mathcal{D} _ i^\text{train}, \mathcal{D}^\text{val} := \cup _ {i=1}^K \mathcal{D} _ i^\text{val}, and \mathcal{D}^\text{test} := \cup _ {i=1}^K \mathcal{D} _ i^\text{test}.

The learning problem could vary depending on the quality of the datasets after (1) dataset collection and annotation by the authors of the different datasets and (2) dataset unification when merging the K datasets into one. This is because:

  • If labels are reliable, then this is a dataset selection problem. The goal is to save computational resources when training on \mathcal{D} \subseteq \mathcal{D} ^ \text{train} while matching the performance of a model trained on (1) each \mathcal{D}_i,\ i \in [K], (2) \mathcal{D} ^ \text{train}, and (3) \mathrm{Sample}(\mathcal{D} ^ \text{train}) that matches the size of \mathcal{D}.

    In some special cases, another motivation for dataset selection is that we know the size of a sampled dataset (for example, from the dataset statistics described in a paper) but we are not sure exactly which samples were selected.

  • If labels are not reliable, then the goal is to prevent the low-quality labels from offsetting the benefits of a larger training dataset (rather than distilling a smaller dataset to save compute). We have three options:
| Index | Method | Type |
| --- | --- | --- |
| 1 | Reannotating the entire dataset. This could be reduced to a dataset distillation problem, as we now have more confidence in the filtered dataset. | Offline |
| 2 | Identifying and removing unreliable labels and optionally using these samples as an unsupervised dataset. This is also reducible to a dataset selection problem, as in 1. | Offline |
| 3 | Learning with the noisy labels (LNL, as described in [1]) as they are; this requires the learning algorithm to explicitly account for the variability in label quality. | Online |

Note that there is a related topic called "dataset distillation" that one may easily confuse with dataset selection. The goal of dataset distillation is to create a synthetic dataset in the feature space, based on the original one, that matches the original performance on the test set. Previous work shows that it is possible to attain the original performance on MNIST ([3]) and IMDB ([4]) with synthetic datasets of size (surprisingly) 10 and 20, respectively.

Adaptive Data Selection

With the test sets finalized, we could now work on sampling training sets, i.e., choosing one specific \mathrm{Sample}(\cdot) function described above. The goal here is to sample the training set so that the scores on the test sets are maximized:

  • DSIR: Suppose we need to sample B batches totaling K samples. We could start by randomly sampling the first batch and then call the DSIR algorithm for the subsequent batches until we have collected K samples. This should be done for each label. A hedged sketch of the underlying idea follows.
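
The snippet below is not the DSIR implementation itself but a hedged sketch of the underlying idea (importance resampling with unigram features and Gumbel-perturbed top-k selection); the feature choice and smoothing are assumptions.

from collections import Counter

import numpy as np

def unigram_logprob(texts, vocab, alpha=1.0):
    counts = Counter(token for text in texts for token in text.split())
    probs = np.array([counts[token] + alpha for token in vocab], dtype=float)
    return dict(zip(vocab, np.log(probs / probs.sum())))

def importance_resample(candidates, target_texts, k, seed=0):
    # weight each candidate by log p_target(x) - log p_source(x) under unigram models,
    # then pick k candidates via Gumbel-perturbed top-k (sampling without replacement)
    rng = np.random.default_rng(seed)
    vocab = sorted({token for text in candidates + target_texts for token in text.split()})
    lp_target = unigram_logprob(target_texts, vocab)
    lp_source = unigram_logprob(candidates, vocab)
    weights = np.array([
        sum(lp_target[token] - lp_source[token] for token in text.split())
        for text in candidates
    ])
    top = np.argsort(-(weights + rng.gumbel(size=len(candidates))))[:k]
    return [candidates[i] for i in top]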

Reference

  1. NoisywikiHow: A Benchmark for Learning with Real-world Noisy Labels in Natural Language Processing (Wu et al., Findings 2023)
  2. [2202.01327] Adaptive Sampling Strategies to Construct Equitable Training Datasets (Cai et al., FAccT 2023)
  3. [2301.04272] Data Distillation: A Survey (Sachdeva and McAuley, JMLR).
  4. [1811.10959] Dataset Distillation (Wang et al.)
  5. [1910.02551] Soft-Label Dataset Distillation and Text Dataset Distillation (Sucholutsky and Schonlau, IJCNN 2020). This is the only paper referenced in 3 describing the dataset distillation for texts. This paper is based on the very original data distillation objective proposed in 4.
  6. [2302.03169] Data Selection for Language Models via Importance Resampling (Xie et al.)
  7. [2305.10429] DoReMi: Optimizing Data Mixtures Speeds Up Language Model Pretraining (Xie et al.)
  8. [2306.11670] GIO: Gradient Information Optimization for Training Dataset Selection (Everaert and Potts): This paper has similar settings as the DSIR paper [6]: we are selecting new samples by minimizing their KL divergence with an existing set of unlabeled samples. The paper claims an advantage over the DSIR as the proposed algorithm requires fewer samples:

    Like GIO, these heuristic methods aim to select a subset of data that is higher quality and more relevant. However, they are either highly tailored to their particular tasks or they require very large numbers of examples (to develop classifiers or construct target probabilities). By contrast, GIO is task- and domain-agnostic, it can be applied plug-and-play to a new task and dataset, and it requires comparatively few gold examples X to serve as the target distribution.

Talk Notes | Data-Centric NLP @ USC CSCI-699 Fall 2022

Outline

The following is the course schedule (indeed a reading list) compiled from the course website for quick reference.

I. Datasets in NLP

  • Aug 22 – Introduction, Historical Perspective, and Overview
    • Fair ML Book, Chapter 7: Datasets
    • Sambasivan et al., 2021: "Everyone wants to do the model work, not the data work": Data Cascades in High-Stakes AI
    • Paullada et al., 2021: Data and its (dis)contents
    • Raji et al., 2022: Ethical Challenges of Data Collection & Use in Machine Learning Research
  • Aug 24 – Data Collection and Data Ethics
    • Deng et al., 2009: ImageNet: A large-scale hierarchical image database
    • Kwiatkowski et al., 2019: Natural Questions: A Benchmark for Question Answering Research
    • Sakaguchi et al., 2019: WinoGrande: An Adversarial Winograd Schema Challenge at Scale
    • Bowman et al., 2015: A large annotated corpus for learning natural language inference
    • Nie et al., 2020: Adversarial NLI: A New Benchmark for Natural Language Understanding
  • Aug 31 – More on Data Ethics
    • Bender et al., 2021: On the Dangers of Stochastic Parrots: Can Language Models Be Too Big?
    • Koch et al., 2021: Reduced, Reused and Recycled: The Life of a Dataset in Machine Learning Research
    • Klein and D'Ignazio, 2020: Data Feminism (book), Intro and Chapter 1
    • Strubell et al., 2019: Energy and Policy Considerations for Deep Learning in NLP

II. Bias and Mitigation

  • Sep 7 – Biases: An Overview
    • Geirhos et al., 2020: Shortcut Learning in Deep Neural Networks
    • Hort et al., 2022: Bias Mitigation for Machine Learning Classifiers: A Comprehensive Survey
    • Feder et al., 2021: Causal Inference in Natural Language Processing: Estimation, Prediction, Interpretation and Beyond
  • Sep 12 – Spurious Biases I
    • Torralba & Efros, 2011: Unbiased Look at Dataset Bias
    • Geva et al., 2019: Are We Modeling the Task or the Annotator? An Investigation of Annotator Bias in Natural Language Understanding Datasets
    • McCoy et al., 2019: Right for the Wrong Reasons: Diagnosing Syntactic Heuristics in NLI
  • Sep 14 – Spurious Biases II
    • Gardner et al., 2021: Competency Problems: On Finding and Removing Artifacts in Language Data
    • Eisenstein, 2022: Informativeness and Invariance: Two Perspectives on Spurious Correlations in Natural Language
  • Sep 19 – Data-Centric Bias Mitigation
    • Srivastava et al., 2020: Robustness to spurious correlations via human annotations
    • Dixon et al., 2018: Measuring and mitigating unintended bias in text classification
    • Gardner et al., 2019: On Making Reading Comprehension More Comprehensive
  • Sep 21 – Data Augmentation for Bias Mitigation
    • Ng et al., 2020: SSMBA: Self-Supervised Manifold Based Data Augmentation for Improving O.O.D. Robustness
    • Kaushik et al., 2019: Learning the Difference that Makes a Difference with Counterfactually-Augmented Data

III. Estimating Data Quality

  • Sep 26 – Estimates of Data Quality
    • Le Bras et al., 2020: Adversarial Filters of Dataset Biases
    • Swayamdipta et al., 2020: Dataset Cartography: Mapping and Diagnosing Datasets with Training Dynamics
    • Liu et al., 2022: WANLI: Worker and AI Collaboration for Natural Language Inference Dataset Creation
    • Ethayarajh et al., 2022: Understanding Dataset Difficulty with V-Usable Information
  • Sep 28 – Aggregate vs. Point-wise Estimates of Data Quality
    • Ghorbani & Zou, 2019: Data Shapley: Equitable Valuation of Data for Machine Learning
    • Perez et al., 2021: Rissanen Data Analysis: Examining Dataset Characteristics via Description Length
    • Mindermann et al., 2022: Prioritized Training on Points that are Learnable, Worth Learning, and Not Yet Learnt
  • Oct 3 – Anomalies, Outliers, and Out-of-Distribution Examples
    • Hendrycks et al., 2018: Deep Anomaly Detection with Outlier Exposure
    • Ren et al., 2019: Likelihood Ratios for Out-of-Distribution Detection
  • Oct 5 – Disagreements, Subjectivity and Ambiguity I
    • Pavlick et al., 2019: Inherent Disagreements in Human Textual Inferences
    • Röttger et al., 2022: Two Contrasting Data Annotation Paradigms for Subjective NLP Tasks
    • Denton et al., 2021: Whose Ground Truth? Accounting for Individual and Collective Identities Underlying Dataset Annotation
  • Oct 12 – Disagreements, Subjectivity and Ambiguity II
    • Miceli et al., 2020: Between Subjectivity and Imposition: Power Dynamics in Data Annotation for Computer Vision
    • Davani et al., 2021: Dealing with Disagreements: Looking Beyond the Majority Vote in Subjective Annotations

IV. Data for Accountability

  • Oct 17 – Creating Evaluation Sets
    • Recht et al., 2019: Do ImageNet Classifiers Generalize to ImageNet?
    • Card et al., 2020: With Little Power Comes Great Responsibility
    • Clark et al., 2021: All That's 'Human' Is Not Gold: Evaluating Human Evaluation of Generated Text
    • Ethayarajh & Jurafsky, 2020: Utility is in the eye of the user: a critique of NLP leaderboards
  • Oct 19 – Counterfactual Evaluation
    • Gardner et al., 2020: Evaluating Models' Local Decision Boundaries via Contrast Sets
    • Ross et al., 2021: Tailor: Generating and Perturbing Text with Semantic Controls
  • Oct 24 – Adversarial Evaluation
    • Jia and Liang, 2017: Adversarial Examples for Evaluating Reading Comprehension Systems
    • Kiela et al., 2021: Dynabench: Rethinking Benchmarking in NLP
    • Li and Michael, 2022: Overconfidence in the Face of Ambiguity with Adversarial Data
  • Oct 26 – Contextualizing Decisions
    • Gebru et al., 2018: Datasheets for Datasets
    • Bender and Friedman, 2018: Data Statements for Natural Language Processing: Toward Mitigating System Bias and Enabling Better Science

V. Beyond Labeled Datasets

  • Oct 31 – Unlabeled Data
    • Dodge et al., 2021: Documenting Large Webtext Corpora: A Case Study on the Colossal Clean Crawled Corpus
    • Lee et al., 2022: Deduplicating Training Data Makes Language Models Better
    • Gururangan et al., 2022: Whose Language Counts as High Quality? Measuring Language Ideologies in Text Data Selection
  • Nov 2 – Prompts as Data?
    • Wei et al., 2022: Chain of Thought Prompting Elicits Reasoning in Large Language Models
  • Nov 7 – Data Privacy and Security
    • Amodei et al., 2016: Concrete Problems in AI Safety
    • Carlini et al., 2020: Extracting Training Data from Large Language Models
    • Henderson et al., 2022: Pile of Law: Learning Responsible Data Filtering from the Law and a 256GB Open-Source Legal Dataset
  • Nov 9 – Towards Better Data Citizenship
    • Jo & Gebru, 2019: Lessons from Archives: Strategies for Collecting Sociocultural Data in Machine Learning
    • Hutchinson et al., 2021: Towards Accountability for Machine Learning Datasets: Practices from Software Engineering and Infrastructure