Reading Notes | Revisiting Hate Speech Benchmarks – From Data Curation to System Deployment

[Semantic Scholar] – [Code] – [Tweet] – [Video] – [Website] – [Slide]

Change Logs:

  • 2023-09-21: First draft. This paper appears at KDD 2023. The co-lead author – Sarah Masud – has published numerous papers on hate speech detection.

Additional Notes

  • Measuring Dataset Difficulty

    The authors compare the difficulty of different datasets using the JS divergence between Laplace-smoothed unigram distributions of the texts under each label pair; the lower the divergence, the closer the unigram distributions, and the harder it is to distinguish texts under that label pair (see the sketch after this list).

    For example, the proposed dataset has 4 labels, which leads to \binom{4}{2} = 6 divergence measures.

  • Matthews Correlation Coefficient (MCC)
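
Below is a minimal sketch (not the authors' code) of the pairwise difficulty measure described above: Laplace-smoothed unigram distributions per label and one JS divergence per label pair; the whitespace tokenization and smoothing constant are illustrative assumptions. For MCC, sklearn.metrics.matthews_corrcoef(y_true, y_pred) computes the score directly.

from collections import Counter
from itertools import combinations

import numpy as np
from scipy.spatial.distance import jensenshannon

def unigram_distribution(texts, vocab, alpha=1.0):
    # Laplace-smoothed unigram distribution over a shared vocabulary
    counts = Counter(token for text in texts for token in text.split())
    probs = np.array([counts[token] + alpha for token in vocab], dtype=float)
    return probs / probs.sum()

def pairwise_js_divergences(texts_by_label):
    # one divergence per label pair, e.g., C(4, 2) = 6 values for 4 labels
    vocab = sorted({token for texts in texts_by_label.values() for text in texts for token in text.split()})
    divergences = {}
    for label_a, label_b in combinations(texts_by_label, 2):
        p = unigram_distribution(texts_by_label[label_a], vocab)
        q = unigram_distribution(texts_by_label[label_b], vocab)
        # scipy returns the JS distance, i.e., the square root of the divergence
        divergences[(label_a, label_b)] = jensenshannon(p, q) ** 2
    return divergences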

Reference

Coding Notes | HuggingFace Reference

Basics

Hyperparameters

  • The hyperparameters are specified through TrainingArguments and Seq2SeqTrainingArguments.
  • model_name_or_path and output_dir are the only two required arguments. However, we should also set other critical hyperparameters, including num_train_epochs, per_device_train_batch_size, per_device_eval_batch_size, and learning_rate.

Evaluation, Logging, and Saving

  • It is better to set logging_steps to 1 and logging_strategy to "steps": logging is almost always useful and does not cause significant overhead.
  • It is better to specify eval_steps as 1 / n and eval_strategy as "steps", where n is the number of evaluations we want. This guarantees enough evaluation points even when the number of training steps or epochs is small.
  • load_best_model_at_end=True has to be paired with the following configurations (answer). It saves the best checkpoint according to the evaluations done throughout the training process:
    • After setting eval_steps to a decimal number, save_strategy has to be set to "steps" because save_steps has to be a multiple of eval_steps. As saving larger models takes a long time, we need to set save_steps to a reasonable number. For example, if we would like to evaluate the model 10 times (i.e., eval_steps set to 0.1), we could save twice (i.e., save_steps set to 0.5).
    • save_total_limit governs how many of the latest checkpoints are kept; it may end up keeping k+1 checkpoints even if save_total_limit=k because the best checkpoint is retained in addition to the latest k.
    • compute_metrics has a specific signature to follow. For example, the following is taken from the official run_glue.py. Here p.predictions depends on the specific model.
import numpy as np
from transformers import EvalPrediction

# `metric` (the task's evaluation metric) and `is_regression` are defined earlier in run_glue.py.
# You can define your custom compute_metrics function. It takes an `EvalPrediction` object (a namedtuple with
# `predictions` and `label_ids` fields) and has to return a dictionary mapping strings to floats.
def compute_metrics(p: EvalPrediction):
    preds = p.predictions[0] if isinstance(p.predictions, tuple) else p.predictions
    preds = np.squeeze(preds) if is_regression else np.argmax(preds, axis=1)
    result = metric.compute(predictions=preds, references=p.label_ids)
    if len(result) > 1:
        result["combined_score"] = np.mean(list(result.values())).item()
    return result
| Index | Hyperparameter | Value |
| --- | --- | --- |
| 1 | save_strategy, eval_strategy | "steps" or "epoch"; they have to be the same. |
| 2 | eval_steps | A reasonable value such as 0.1. |
| 3 | save_steps | Must be a multiple of eval_steps. |
| 4 | metric_for_best_model and compute_metrics | metric_for_best_model defaults to "loss" (i.e., eval_loss with the automatically prepended eval_); it could be set to other custom metrics defined in compute_metrics. |
  • It is recommended to use wandb. In order to do so, we need to set report_to and run_name. Note that if we want a custom run name on the wandb portal, we should set run_name rather than renaming the default output directory. A combined example configuration is sketched below.
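
A combined sketch of the settings above (critical hyperparameters, logging, evaluation, saving, and wandb); the concrete values and the run name are illustrative assumptions.

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="outputs",
    # critical hyperparameters
    num_train_epochs=3,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    learning_rate=2e-5,
    # log every step; logging is cheap and always useful
    logging_strategy="steps",
    logging_steps=1,
    # evaluate 10 times over the whole run (eval_steps as a ratio of total steps)
    evaluation_strategy="steps",
    eval_steps=0.1,
    # save twice; save_steps must be a round multiple of eval_steps
    save_strategy="steps",
    save_steps=0.5,
    save_total_limit=2,
    load_best_model_at_end=True,
    metric_for_best_model="eval_loss",
    # wandb logging
    report_to="wandb",
    run_name="sanity-check-run",
)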

Testing Training Scripts

| Index | Hyperparameter | Value | Notes |
| --- | --- | --- | --- |
| 1 | max_train_samples, max_eval_samples, max_test_samples | 100 | |
| 2 | save_strategy | "no" | |
| 3 | load_best_model_at_end | False | |
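
max_train_samples and its siblings are arguments of the official example scripts rather than of TrainingArguments; in a custom script, a comparable smoke test could simply truncate the datasets, as in this sketch:

from datasets import load_dataset
from transformers import TrainingArguments

# keep only a handful of samples so the whole script finishes in minutes
train_dataset = load_dataset("imdb", split="train").select(range(100))
eval_dataset = load_dataset("imdb", split="test").select(range(100))

training_args = TrainingArguments(
    output_dir="tmp",
    save_strategy="no",
    load_best_model_at_end=False,
)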

Checkpoints

If a model has been fine-tuned, most likely only the pytorch_model.bin file will change. We could reuse the original config.json and the tokenizer.

  • A runnable model only consists of a pytorch_model.bin and a config.json file. The config.json documents the metadata of the model.
  • A tokenizer consists of a list of files:

    tokenizer/
    ├── added_tokens.json
    ├── merges.txt
    ├── special_tokens_map.json
    ├── tokenizer_config.json
    ├── tokenizer.json
    └── vocab.json

However, if we save checkpoints during training, the Trainer already takes care of writing all of these files into each checkpoint directory.
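
A minimal sketch of this workflow; the paths are illustrative.

from transformers import AutoModelForSequenceClassification, AutoTokenizer

# the checkpoint directory only needs the updated pytorch_model.bin plus config.json;
# the tokenizer can still be loaded from the original model name
model = AutoModelForSequenceClassification.from_pretrained("path/to/checkpoint")
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# save_pretrained() writes config.json, pytorch_model.bin, and the tokenizer files listed above
model.save_pretrained("path/to/export")
tokenizer.save_pretrained("path/to/export")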

Inference

langchain, Pipeline, and Model Classes

The classes and methods provided by model.generate(), pipeline (or TextGenerationPipeline), and langchain are increasingly high-level: TextGenerationPipeline internally calls model.generate(), and langchain.llms.huggingface_pipeline.HuggingFacePipeline internally uses TextGenerationPipeline.

Therefore, it is sufficient to understand how model.generate() works and how the more abstract classes wrap the other classes. See the following example. Note that

  • A better way to specify arguments is not through a dictionary but through a predefined class such as transformers.GenerationConfig, assigned via model.generation_config = config (or passed as generation_config=config to model.generate()). This makes the most of the code-reference features available in PyCharm.
  • We should stick to transformers.pipeline rather than TextGenerationPipeline as the former has the unified API across different tasks.
  • Here is the decision flow of which API to use:

    | Index | API | Case |
    | --- | --- | --- |
    | 1 | model.generate() | When we need special control over the outputs, for example, adding human bias to the distribution similar to logit_bias for OpenAI APIs (example) or transformers.NoBadWordsLogitsProcessor. |
    | 2 | transformers.pipeline | Preferred as the first choice. |
    | 3 | langchain | When working with langchain. |
import os

os.environ["CUDA_VISIBLE_DEVICES"] = str(0)
os.environ["TOKENIZERS_PARALLELISM"] = "false"

import torch
from transformers import (
    pipeline,
    AutoTokenizer,
    AutoModelForCausalLM,
)
from langchain.llms.huggingface_pipeline import (
    HuggingFacePipeline,
)

##################################################

model = AutoModelForCausalLM.from_pretrained("gpt2")
tokenizer = AutoTokenizer.from_pretrained("gpt2")

prompt = "Great changes have taken place in the past 30 years"

##################################################
# method 1
# BAD: requires manually moving model and data to the device

config = {
    "do_sample": True,
    "top_p": 1,
    "num_return_sequences": 5,
    "temperature": 1,
    "max_new_tokens": 16,
}

device = torch.device("cuda:0")
model = model.to(device)
tokenizer.pad_token = tokenizer.eos_token

raw_response1 = model.generate(
    **tokenizer(prompt, return_tensors="pt").to(device),
    **config,
).squeeze()

texts1 = tokenizer.batch_decode(raw_response1)

##################################################
# method2
# GOOD

config = {
    "do_sample": True,
    "top_p": 1,
    "num_return_sequences": 5,
    "temperature": 1,
    "max_new_tokens": 16,
    "device": "cuda:0"
}

pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    **config
)

response2 = pipe(prompt)
texts2 = [response["generated_text"] for response in response2]

##################################################
# method 3
# GOOD: However, it could NOT generate multiple sequences at the same time

llm = HuggingFacePipeline(pipeline=pipe)
texts3 = list()
for _ in range(5):
    texts3.append(llm(prompt))

Controlled Generation

Enforcing or Forbidding Specific Tokens

This is done using disjunctive constraints (enforcing) or NoBadWordsLogitsProcessor (forbidding) internally in model.generate(). This could be easily implemented using the snippet below.

Note that when enforcing generation, setting num_beams to an integer greater than 1 is critical as enforcing presence of some tokens is implemented using beam search.

from transformers import AutoTokenizer, AutoModelForCausalLM

def get_tokens_as_list(model_name, word_list):
    tokenizer_with_prefix_space = AutoTokenizer.from_pretrained(model_name, add_prefix_space=True)
    tokens_list = []
    for word in word_list:
        tokenized_word = tokenizer_with_prefix_space([word], add_special_tokens=False).input_ids[0]
        tokens_list.append(tokenized_word)
    return tokens_list


model = AutoModelForCausalLM.from_pretrained("gpt2")
tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token_id = tokenizer.eos_token_id

prompt = "Great changes have taken place in the past 30 years"

inputs = tokenizer(prompt, return_tensors="pt")
output_ids = model.generate(inputs["input_ids"], max_new_tokens=5)
print(tokenizer.batch_decode(output_ids, skip_special_tokens=True)[0])

words_ids = get_tokens_as_list(model_name="gpt2", word_list=["Donald", "Trump"])

output_ids = model.generate(inputs["input_ids"], max_new_tokens=5, bad_words_ids=words_ids)
print(tokenizer.batch_decode(output_ids, skip_special_tokens=True)[0])

output_ids = model.generate(inputs["input_ids"], max_new_tokens=5, force_words_ids=words_ids, num_beams=10)
print(tokenizer.batch_decode(output_ids, skip_special_tokens=True)[0])

Inference on Multiple Devices

It does not seem easy to run inference on multiple devices. However, we could use the optimized attention implemented in torch>=2.0.0 and optimum to reduce the time and memory requirements; see the sketch below.
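
A hedged sketch of the optimum route (BetterTransformer); whether it helps, and whether to_bettertransformer() is available, depends on the model architecture and the installed transformers / optimum versions.

import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("gpt2", torch_dtype=torch.float16).to("cuda:0")

# swaps in fused attention kernels (scaled_dot_product_attention from torch>=2.0.0);
# requires the optimum package to be installed
model = model.to_bettertransformer()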

Instruction Tuning

Using the Basic transformers Library

It is possible to instruction-tune a language model using the official example script run_clm.py (working with gpt2) or following Philipp Schmid's blog (working with google/flan-t5-xl).

Using the trl Library

SFTTrainer provided in the trl library adds another layer of abstraction; this makes instruction-tuning even easier and cleaner. However, the downsides are that (1) it does not work well with deepspeed, and (2) it does not support everything defined in transformers.TrainingArguments (for example, setting save_steps to a decimal number); this limits its flexibility.

  • Tuning a Model with the Language Modeling Objective

    This could be done in fewer than 14 lines of code, for example, tuning an LM on the imdb dataset. We could add more configurations to the code skeleton below (for example, PEFT and 4-bit / 8-bit quantization) following the example script here.

from datasets import load_dataset
from transformers import AutoModelForCausalLM
from trl import SFTTrainer

dataset = load_dataset("imdb", split="train")
model = AutoModelForCausalLM.from_pretrained("facebook/opt-350m")

trainer = SFTTrainer(
    model,
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=512,
)
trainer.train()
  • Tuning a Model using the Completions – Self-Instruction

Tuning Large Models with Constrained Hardware

Overview

We have the following decision matrix when working on a single node (a node is a machine whose GPUs are physically connected); there may be other considerations when working with multiple nodes.

| Scenario | Single GPU | Multiple GPUs |
| --- | --- | --- |
| Model fits into a single GPU | | DDP; ZeRO (may or may not be faster) |
| Model does not fit into a single GPU | ZeRO + Offload CPU + MCT (optional) + NVMe (optional) | PP (preferred if NVLink or NVSwitch is not available); ZeRO; TP |
| Largest layer does not fit into a single GPU | ZeRO + Offload CPU + MCT + NVMe (optional) | TP; ZeRO + Offload CPU + MCT + NVMe (optional) |
  • One single 7B LLaMA model is already almost 30 GB on HuggingFace; the 13B version will be even larger.
  • When using custom training loops, the accelerate library improves upon torch.distributed and makes it possible to run the same code on any hardware setting without modification.

    When using Trainer(), all of the distributed training settings could be done without using accelerate.

  • ZeRO is implemented using deepspeed; a minimal configuration sketch follows.
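
A minimal sketch of wiring ZeRO into Trainer() through a DeepSpeed config; the stage-2 and CPU-offload settings are illustrative assumptions, and the "auto" values are filled in by the Trainer from TrainingArguments.

from transformers import TrainingArguments

# ZeRO stage 2 with the optimizer states offloaded to CPU; the script has to be
# launched with the deepspeed (or accelerate) launcher for this to take effect
ds_config = {
    "zero_optimization": {
        "stage": 2,
        "offload_optimizer": {"device": "cpu"},
    },
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",
    "fp16": {"enabled": "auto"},
}

training_args = TrainingArguments(
    output_dir="outputs",
    per_device_train_batch_size=1,
    fp16=True,
    deepspeed=ds_config,  # a path to a JSON file also works
)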

Using a Single GPU

  • A typical model with AdamW optimizer requires 18 bytes per parameter.
  • Besides the methods described below, one could try the accelerate library to use the same torch code for any hardware configuration (CPU, single GPU, or multiple GPUs).
| Method | Speed 📈 | Memory 📉 | Note |
| --- | --- | --- | --- |
| Batch size | Yes | Yes | It should default to 8, but choosing a batch size that makes the most of the GPU is complicated. |
| Dataloader | Yes | No | Always set pin_memory=True and num_workers=4 (or 8, 16, ...) when possible. |
| Optimizer | Yes | Yes | Adafactor saves 50% of memory compared to Adam or AdamW but converges more slowly; it is supported out of the box. Alternatively, 8-bit AdamW saves more than 50% of memory when bitsandbytes is installed and used. |
| Gradient checkpointing | No | Yes | Supported by Trainer(..., gradient_checkpointing=True, ...). |
| Gradient accumulation | No | Yes | Supported by Trainer(..., gradient_accumulation_steps=4, ...). |
| Mixed precision training | Yes | No | fp16 is supported in TrainingArguments(..., fp16=True, ...). With Ampere GPUs such as A100 or RTX 3090, bf16=True or tf32=True (with torch.backends.cuda.matmul.allow_tf32 = True) could be set. |
| DeepSpeed ZeRO | No | Yes | Useful when the model does not fit into the GPU even with the smallest batch size; supported out of the box with Trainer(). |

We could use the code below to measure the GPU utilization:

from pynvml import *

def print_gpu_utilization():
    nvmlInit()
    handle = nvmlDeviceGetHandleByIndex(0)
    info = nvmlDeviceGetMemoryInfo(handle)
    print(f"GPU memory occupied: {info.used//1024**2} MB.")

For example, when tuning a bert-large-uncased model with some dummy data, we could measure the effect of each technique on top of the following vanilla code:

  • Vanilla Code to Tune a Classification Model
import os
os.environ["CUDA_VISIBLE_DEVICES"] = str(0)

import numpy as np

from datasets import Dataset
from transformers import (
    Trainer,
    logging,
    TrainingArguments,
    AutoModelForSequenceClassification,
)

from utils.common import print_gpu_utilization

##################################################
logging.set_verbosity_error()

dataset_size, seq_len = 512, 512
train_dataset = Dataset.from_dict(
    {
        "input_ids": np.random.randint(100, 30000, (dataset_size, seq_len)),
        "labels": np.random.randint(0, 1, dataset_size),
    }
)
train_dataset.set_format("pt")
print_gpu_utilization()

##################################################

default_args = {
    "output_dir": "tmp",
    "evaluation_strategy": "steps",
    "num_train_epochs": 1,
    "log_level": "error",
    "report_to": "none"
}

training_args = TrainingArguments(
    per_device_train_batch_size=4,
    **default_args
)

model = AutoModelForSequenceClassification.from_pretrained("bert-large-uncased").to("cuda")
print_gpu_utilization()

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset
)
result = trainer.train()

After making updates to the vanilla code, we could see the changes in the memory usage.

| Basic Setup | Memory (MB) |
| --- | --- |
| Loading dummy data | 2631 |
| Loading model with per_device_train_batch_size=4 | 14949 |
| Loading model with per_device_train_batch_size=4 + 8-bit Adam | 13085 |
| Loading model with per_device_train_batch_size=4 + optim="adafactor" | 12295 |
| Loading model with per_device_train_batch_size=4 + fp16=True | 13939 |
| Loading model with per_device_train_batch_size=4 + fp16=True + gradient_checkpointing=True | 7275 |
| Loading model with per_device_train_batch_size=1 + gradient_accumulation_steps=4 | 8681 |
| Loading model with per_device_train_batch_size=1 + gradient_accumulation_steps=4 + gradient_checkpointing=True | 6775 |
| Loading model with per_device_train_batch_size=1 + gradient_accumulation_steps=4 + gradient_checkpointing=True + fp16=True and using accelerate | 5363 |

Using Multiple GPUs

There are data, tensor, and pipeline parallelism when working with multiple GPUs. Each has pros and cons, and there is no universally good solution that fits every situation.

  • Data Parallelism (DP): The same setup is replicated on all devices but we split the data and send them to different devices. One may see the acronym DDP, which refers to Distributed DP.
  • Tensor Parallelism (TP): Splitting a tensor into multiple shards and processing each shard on a different device; it is also called horizontal parallelism.

    ZeRO (Zero Redundancy Optimizer) is a preferred type of TP that does not require changes to the model.

  • Pipeline Parallelism (PP): Placing a few layers of the model on each GPU; it is also called vertical parallelism.

According to Jason Phang, ZeRO is a more efficient method than the ones (such as PEFT and PP) presented in his minimal-llama repository:

There ought to be more efficient methods of tuning (DeepSpeed / ZeRO, NeoX) than the ones presented here, but folks may find this useful already.

Reference

| Index | Name | Note |
| --- | --- | --- |
| 1 | https://huggingface.co/docs/transformers/perf_train_gpu_one | Official tutorial |
| 2 | https://huggingface.co/docs/transformers/perf_train_gpu_many | Official tutorial |
| 3 | https://github.com/zphang/minimal-llama | Jason Phang |
| 4 | https://huggingface.co/docs/transformers/main_classes/deepspeed | HuggingFace documentation |
| 5 | https://huggingface.co/blog/4bit-transformers-bitsandbytes | Fine-tuning LLMs like llama, gpt-neox, and t5. |
| 6 | https://huggingface.co/blog/pytorch-ddp-accelerate-transformers | Official tutorial |

Using simpletransformers

Comparing transformers and simpletransformers

simpletransformers is a wrapper around transformers that abstracts away some unnecessary details for training and running inference with a wide array of models, including text classification (multi-class and multi-label) and regression.

The number of lines of code is significantly reduced if we switch from transformers to simpletransformers. For example, the code below runs inference with distilbert-base-uncased-finetuned-sst-2-english on the sst2 validation set:

import os

os.environ["WANDB_DISABLED"] = "true"
os.environ["TOKENIZERS_PARALLELISM"] = "false"

import numpy as np
import pandas as pd

from sklearn.metrics import (
    classification_report,
)
from datasets import (
    Dataset,
    load_dataset
)
from transformers import (
    Trainer,
    AutoTokenizer,
    TrainingArguments,
    default_data_collator,
    AutoModelForSequenceClassification,
)


model_name = "distilbert-base-uncased-finetuned-sst-2-english"
dataset = load_dataset("glue", "sst2", split="validation")

model = AutoModelForSequenceClassification.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

tokenized_dataset = dataset.map(
    lambda x: tokenizer(x["sentence"], padding="max_length", truncation=True, max_length=256),
)

training_args = TrainingArguments(
    output_dir="outputs",
    per_device_eval_batch_size=256,
    remove_unused_columns=True,
)
trainer = Trainer(
    model=model,
    tokenizer=tokenizer,
    args=training_args,
    data_collator=default_data_collator,
)

y_pred = np.argmax(trainer.predict(tokenized_dataset).predictions, axis=1)
y_true = dataset["label"]

print(classification_report(
    y_true=y_true,
    y_pred=y_pred
))

By comparison, we could obtain exactly the same results with simpletransformers:

import os

os.environ["WANDB_DISABLED"] = "true"
os.environ["TOKENIZERS_PARALLELISM"] = "false"

from sklearn.metrics import (
    classification_report,
)
from datasets import (
    load_dataset
)

from simpletransformers.classification import (
    ClassificationModel,
    ClassificationArgs,
)


model_name = "distilbert-base-uncased-finetuned-sst-2-english"
dataset = load_dataset("glue", "sst2", split="validation")

model_args = ClassificationArgs()

model_args.eval_batch_size = 256
model_args.max_seq_length = 256
model_args.use_multiprocessing = False
model_args.use_multiprocessing_for_evaluation = False

model = ClassificationModel(
    model_type="distilbert",
    model_name=model_name,
    num_labels=2,
    args=model_args,
)
y_pred, _ = model.predict(dataset["sentence"])
y_true = dataset["label"]

print(classification_report(
    y_true=y_true,
    y_pred=y_pred
))

Minimal Working Example

simpletransformers, which is built on top of transformers, could train and evaluate PLMs more quickly and cleanly; it also comes with full support for wandb. Note that

  • The number of steps is computed based on one GPU even if model_args.n_gpu is set to a different value. Therefore, we should not further divide n_total_steps by model_args.n_gpu.
  • By default, there is an evaluation at the end of each epoch. Therefore, setting n_eval=10 leads to model_args.num_train_epochs + n_eval evaluations; in the example below, there will be 13 evaluations.

The following example fine-tunes bert-base-uncased on the imdb dataset:

import os

os.environ["TOKENIZERS_PARALLELISM"] = "false"

import pandas as pd

from datasets import load_dataset
from sklearn.model_selection import train_test_split
from simpletransformers.classification import (
    ClassificationModel,
    ClassificationArgs
)

model_args = ClassificationArgs()

model_class = "roberta"
model_name = "roberta-base"

##################################################
# see full list of configurations:
# https://simpletransformers.ai/docs/usage/#configuring-a-simple-transformers-model
# critical settings
model_args.learning_rate = 1e-5
model_args.num_train_epochs = 3
model_args.train_batch_size = 32
model_args.eval_batch_size = 32
model_args.gradient_accumulation_steps = 1
model_args.fp16 = False
model_args.max_seq_length = 128
model_args.n_gpu = 4
model_args.use_multiprocessing = False
model_args.use_multiprocessing_for_evaluation = False

# saving settings
model_args.no_save = False
model_args.overwrite_output_dir = True
model_args.output_dir = "outputs/"

# the following mandates that only the best checkpoint will be saved; there will be only 1 checkpoint
model_args.best_model_dir = "{}/best_model".format(model_args.output_dir)
model_args.save_model_every_epoch = False
model_args.save_best_model = True
model_args.save_eval_checkpoints = False
model_args.save_steps = -1

# validation criterion
model_args.use_early_stopping = False
model_args.early_stopping_metric = "auroc"
model_args.early_stopping_metric_minimize = False

# evaluation settings
model_args.evaluate_during_training = True

# logging settings
model_args.silent = False
model_args.wandb_project = "simpletransformers"
model_args.wandb_kwargs = {
    "name": "sanity-check-imdb"
}

##################################################
# loading dataset

ds = load_dataset("imdb")

# data splits
# there has to be "text" and "labels" columns in the input dataframe
df = pd.DataFrame(ds["train"]).rename(columns={"label": "labels"})

train_df, eval_df = train_test_split(df.sample(frac=0.1), test_size=0.2)
test_df = pd.DataFrame(ds["test"]).sample(frac=0.1).rename(columns={"label": "labels"})

##################################################
# adaptive steps settings
# we will evaluate 10 times and log 100 times no matter how small the dataset

n_eval, n_log = 10, 100
n_total_steps = round(len(train_df) / model_args.train_batch_size) * model_args.num_train_epochs

model_args.evaluate_during_training_steps = max(1, round(n_total_steps / n_eval))
model_args.logging_steps = max(1, round(n_total_steps / n_log))
model_args.save_steps = -1

##################################################
# training
model = ClassificationModel(
    model_class,
    model_name,
    num_labels=2,
    args=model_args,
)

model.train_model(
    train_df=train_df,
    eval_df=eval_df
)

# test
result, model_outputs, wrong_predictions = model.eval_model(test_df)

Validation and Early Stopping

  • Validation

    Choosing which model checkpoint to save (aka validation) depends on early_stopping_metric and early_stopping_metric_minimize even when early stopping itself is disabled.

  • Early Stopping

    If we need to use early stopping, we also need to be aware of the following hyperparameters.

| Name | Default | Note |
| --- | --- | --- |
| use_early_stopping | False | |
| early_stopping_metric | "eval_loss" | evaluate_during_training has to be True; it uses the metrics computed during evaluation. |
| early_stopping_metric_minimize | True | |
| early_stopping_consider_epochs | False | |
| early_stopping_patience | 3 | Terminate training after early_stopping_patience evaluations without an improvement larger than early_stopping_delta. |
| early_stopping_delta | 0 | |
class ClassificationModel:

    def train_model(
        self,
        train_df,
        multi_label=False,
        output_dir=None,
        show_running_loss=True,
        args=None,
        eval_df=None,
        verbose=True,
        **kwargs,
    ):
        # ...
        global_step, training_details = self.train(
            train_dataloader,
            output_dir,
            multi_label=multi_label,
            show_running_loss=show_running_loss,
            eval_df=eval_df,
            verbose=verbose,
            **kwargs,
        )
        # ...


    def train(
        self,
        train_dataloader,
        output_dir,
        multi_label=False,
        show_running_loss=True,
        eval_df=None,
        test_df=None,
        verbose=True,
        **kwargs,
    ):
        # ...
        best_eval_metric = None

        # ...
        if not best_eval_metric:
            best_eval_metric = results[args.early_stopping_metric]
            self.save_model(
                args.best_model_dir,
                optimizer,
                scheduler,
                model=model,
                results=results,
            )
        # ...

Using Sentence-Transformers

Overview

  • sentence_transformer is built with torch despite a resemblance to the keras API.
  • The famous MTEB benchmark is also largely built on top of the sentence_transformers library.

Fine-Tuning Embeddings

Besides an easy interface to generate embeddings, the sentence_transformers library also supports fine-tuning the provided embedding models. The following data formats all have their corresponding loss functions without a need to convert data to a specific format (for example, triplets) (see blog).

Note that these loss functions come from the sentence_transformers library rather than torch or transformers. These loss functions have been discussed in a blog post that is not affiliated with the developers of sentence_transformers.

| Index | Description | Data | Loss | Note |
| --- | --- | --- | --- | --- |
| 1 | A pair of sentences and a label | (premise, hypothesis, label) | ContrastiveLoss; SoftmaxLoss; CosineSimilarityLoss | |
| 2 | An individual sentence and its label | (text, label) | BatchHardTripletLoss and variants | "Batch hard" performs best in the blog post. |
| 3 | A pair of similar sentences | (query, response), (src_lang, tgt_lang), (full_text, summary), (text1, text2) (e.g., QQP), (text, entailed_text) (e.g., NLI) | MultipleNegativesRankingLoss; MegaBatchMarginLoss | Frequent |
| 4 | A triplet of an anchor, a positive, and a negative | (anchor, positive, negative) | TripletLoss | Rare, as it requires offline mining. |

Here is a minimal working example of fine-tuning representations on the sst2 dataset; we could optionally evaluate the fine-tuned model on the MTEB benchmark, which is also built with the sentence_transformers library.

Note that:

  • The sentence_transformers library does not have native wandb support like simpletransformers. We could only monitor a single score through a callback such as log_with_wandb() below, whose signature has to be exactly (score, epoch, steps). The score it monitors depends on which specific evaluator is used (see the complete list of evaluators here).

    When working with TripletEvaluator as in the example below, the returned metric is the fraction of triplets that satisfy $d(a, p) < d(a, n)$.

  • We could easily replace the model with models available on the HuggingFace hub.
import os
import wandb
import random
import logging

import pandas as pd

from datetime import datetime
from collections import defaultdict
from sentence_transformers import (
    SentenceTransformer,
    InputExample,
    SentencesDataset
)
from sentence_transformers.evaluation import (
    TripletEvaluator,
)

from sentence_transformers import LoggingHandler
from sentence_transformers.losses import (
    BatchHardTripletLoss,
)

from datasets import load_dataset
from torch.utils.data import DataLoader


logging.basicConfig(
    format="%(asctime)s - %(message)s",
    datefmt="%Y-%m-%d %H:%M:%S",
    level=logging.INFO,
    handlers=[LoggingHandler()],
)

def triplets_from_labeled_dataset(
    records,
    text_column="sentence",
    label_column="label"
):
    # Create triplets for a [(label, sentence), (label, sentence)...] dataset
    # by using each example as an anchor and selecting randomly a
    # positive instance with the same label and a negative instance with a different label

    input_examples = [
        InputExample(guid=str(guid), texts=[record[text_column]], label=record[label_column])
        for guid, record in enumerate(records)
    ]

    triplets = []
    label2sentence = defaultdict(list)
    for inp_example in input_examples:
        label2sentence[inp_example.label].append(inp_example)

    for inp_example in input_examples:
        anchor = inp_example

        if len(label2sentence[inp_example.label]) < 2: #We need at least 2 examples per label to create a triplet
            continue

        positive = None
        while positive is None or positive.guid == anchor.guid:
            positive = random.choice(label2sentence[inp_example.label])

        negative = None
        while negative is None or negative.label == anchor.label:
            negative = random.choice(input_examples)

        triplets.append(InputExample(texts=[anchor.texts[0], positive.texts[0], negative.texts[0]]))

    return triplets

##################################################

model_name = 't5-base'
num_epochs = 10

##################################################

current_time = datetime.now().strftime("%Y-%m-%d_%H-%M-%S")
wandb.init(
    project="sentence_transformers",
    name=f"{model_name}-{current_time}"
)

##################################################
# model

output_path = (
    "output/"
    + model_name
    + "-"
    + current_time
)

model = SentenceTransformer(model_name)

##################################################

def get_dataloader(df, split, text_column, label_column, batch_size=8):
    records = df.to_dict("records")
    examples = [
        InputExample(texts=[record[text_column]], label=record[label_column])
        for record in records
    ]
    dataset = SentencesDataset(
        examples=examples,
        model=model,
    )
    dataloader = DataLoader(dataset, shuffle=True, batch_size=batch_size)

    return dataloader

##################################################
# data

ds = load_dataset("sst2")

train_df = pd.DataFrame(ds["train"])
val_df = pd.DataFrame(ds["validation"])
test_df = pd.DataFrame(ds["test"])

train_dataloader = get_dataloader(train_df, "train", text_column="sentence", label_column="label")

##################################################

train_loss = BatchHardTripletLoss(model=model)
val_evaluator = TripletEvaluator.from_input_examples(
    triplets_from_labeled_dataset(val_df[["sentence", "label"]].to_dict("records")),
    name="eval"
)
val_evaluator(model)

##################################################

def log_with_wandb(score, epoch, steps):
    # https://docs.wandb.ai/ref/python/log
    wandb.log(
        data={"score": score},
        step=steps,
    )

warmup_steps = int(len(train_dataloader) * num_epochs * 0.1)  # 10% of train data

model.fit(
    [(train_dataloader, train_loss)],
    show_progress_bar=True,
    epochs=num_epochs,
    evaluator=val_evaluator,
    evaluation_steps=50,
    warmup_steps=warmup_steps,
    output_path=output_path,
    callback=log_with_wandb
)
##################################################

test_evaluator = TripletEvaluator.from_input_examples(
    triplets_from_labeled_dataset(test_df[["sentence", "label"]].to_dict("records")),
    name="test"
)
model.evaluate(test_evaluator)

As our goal is not evaluating triplets but the quality of clustering, we could define our own evaluator.

from sklearn.cluster import KMeans
from sklearn.metrics import v_measure_score

from sentence_transformers.evaluation import (
    SentenceEvaluator,
)

class ClusteringEvaluator(SentenceEvaluator):
    def __init__(self, texts, labels, batch_size=32, show_progress_bar=False):
        self.texts = texts
        self.labels = labels
        self.batch_size = batch_size
        self.show_progress_bar = show_progress_bar

    def __call__(self, model, output_path: str = None, epoch: int = -1, steps: int = -1):
        embeddings = model.encode(
            self.texts, batch_size=self.batch_size, show_progress_bar=self.show_progress_bar, convert_to_numpy=True
        )
        y_pred = KMeans(n_clusters=len(set(self.labels)), n_init="auto").fit_predict(embeddings)
        score = v_measure_score(
            labels_true=self.labels,
            labels_pred=y_pred
        )

        return score
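
A usage sketch, assuming val_df from the earlier example is in scope:

clustering_evaluator = ClusteringEvaluator(
    texts=val_df["sentence"].tolist(),
    labels=val_df["label"].tolist(),
)

# either call it directly for a one-off score or pass it to model.fit(evaluator=...)
print(clustering_evaluator(model))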

Customization

Saving Checkpoints

Similar to simpletransformers, sentence_transformers could save the best checkpoint according to the evaluation metric. Model saving is controlled by the _eval_during_training() and _save_checkpoint() functions.

  • If save_best_model=True, the best model will be saved at the root directory of the output_path. Saving best checkpoint is enabled by default.
  • If we want to save additional checkpoints, these additional checkpoints will be saved at checkpoint_path; the total number of saved checkpoints is governed by checkpoint_save_steps and checkpoint_save_total_limit. Different checkpoints will be stored in the folder named <step>.

    Saving additional checkpoints is disabled by default; a sketch of the relevant fit() arguments follows.
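
A sketch of the relevant model.fit() arguments (reusing the objects from the fine-tuning example above); the step counts and paths are illustrative.

model.fit(
    [(train_dataloader, train_loss)],
    epochs=num_epochs,
    evaluator=val_evaluator,
    evaluation_steps=50,
    output_path=output_path,  # the best model (by the evaluator score) is saved here
    save_best_model=True,
    checkpoint_path=output_path + "/checkpoints",  # additional checkpoints, one <step> folder each
    checkpoint_save_steps=500,
    checkpoint_save_total_limit=2,
)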

Loss Functions

According to the doc, we should choose which loss to use based on the format of the data we have. There are 14 loss functions supported by sentence_transformers.

| Index | Loss Function | Data Format | Publication | Note |
| --- | --- | --- | --- | --- |
| 1 | BatchAllTripletLoss | (text, label) | 1 | Uses all positives and negatives within the PK batch, leading to PK \cdot (PK-K) \cdot (K-1) pairs. |
| 2 | BatchSemiHardTripletLoss | (text, label) | 1 | |
| 3 | BatchHardTripletLoss | (text, label) | 1 | Finds the hardest positive and negative within the PK batch, leading to PK pairs. |
| 4 | BatchHardSoftMarginTripletLoss | (text, label) | 1 | Replaces the hinge function with a softplus function. |
| 5 | ContrastiveLoss | (text1, text2, label) | 4 | |
| 6 | OnlineContrastiveLoss | (text1, text2, label) | 4 | |
| 7 | SoftmaxLoss | (text1, text2, label) | 2 | |
| 8 | CosineSimilarityLoss | (text1, text2, similarity) | | |
| 9 | DenoisingAutoEncoderLoss | (corrupted text, original text) | 5 | |
| 10 | MultipleNegativesRankingLoss | (anchor, positive) | 8 | |
| 11 | MegaBatchMarginLoss | (anchor, positive) | 7 | Requires a large batch size (such as 500). |
| 12 | TripletLoss | (anchor, positive, negative) | | Requires Offline Hard Mining (OHM) as described in 1. |
| 13 | MSELoss | (src embedding, tgt embedding) | 3 | Aligning embeddings of multiple languages. |
| 14 | MarginMSELoss | (a, p, n, d(a, p), d(a, n)) | 6 | Very stringent requirements for the input data. |
  1. [1703.07737] In Defense of the Triplet Loss for Person Re-Identification: This paper overturns the prevailing belief that the more intuitive triplet loss is worse than the surrogate classification loss by proposing new loss functions; it also critically points out the limitations of the TripletLoss:

    A major caveat of the triplet loss, though, is that as the dataset gets larger, the possible number of triplets grows cubically, rendering a long enough training impractical. To make matters worse, f _ \theta relatively quickly learns to correctly map most trivial triplets, rendering a large fraction of all triplets uninformative.

    The goal of metric learning is to preserve the "semantic distance" in the metric space: two semantically similar sentences should be close in the metric space and two dissimilar ones should be remote to each other in the embedding space.

    Overall, the "batch-hard" version, possibly with a soft margin, performs best among all loss functions.


  2. [1908.10084] Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks
  3. [2004.09813] Making Monolingual Sentence Embeddings Multilingual using Knowledge Distillation
  4. Dimensionality Reduction by Learning an Invariant Mapping (CVPR 2006, Yann LeCun)
  5. [2104.06979] TSDAE: Using Transformer-based Sequential Denoising Auto-Encoder for Unsupervised Sentence Embedding Learning
  6. [2010.02666] Improving Efficient Neural Ranking Models with Cross-Architecture Knowledge Distillation
  7. ParaNMT-50M: Pushing the Limits of Paraphrastic Sentence Embeddings with Millions of Machine Translations (Wieting & Gimpel, ACL 2018)
  8. [1705.00652] Efficient Natural Language Response Suggestion for Smart Reply: Section 4.4 defines the multiple negative loss.

RLHF with trl Library

The trl library provides a one-stop solution for instruction tuning (i.e., SFT), reward modeling, and PPO. The library supports peft and 4-bit (or 8-bit) tuning natively so that we could tune an LM on consumer devices.

trl defines the custom classes AutoModelForCausalLMWithValueHead and AutoModelForSeq2SeqLMWithValueHead so that PPO can be performed; the value head returns an unbounded score (through nn.Linear(hidden_size, 1)) for each generated token, as sketched below.
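
A minimal sketch; the tuple returned by the forward pass (lm_logits, loss, value) is the one documented for trl's value-head models and should be verified against the installed version.

import torch
from transformers import AutoTokenizer
from trl import AutoModelForCausalLMWithValueHead

model = AutoModelForCausalLMWithValueHead.from_pretrained("gpt2")
tokenizer = AutoTokenizer.from_pretrained("gpt2")

inputs = tokenizer("Great changes have taken place", return_tensors="pt")
with torch.no_grad():
    # the value head is an nn.Linear(hidden_size, 1); values has shape (batch_size, sequence_length)
    lm_logits, loss, values = model(**inputs)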

ICL with Long Prompt

There are several solutions to long-prompt generation, including ALiBi and YaRN.

  • YaRN

    As of 2023-11-22, there is an open-source model with a 128K context window. This does not mean that we could do in-context learning with an arbitrary number of shots. There is a major gap between the paper and real-world applications: both the model and the data consume GPU memory, and the memory required for the data scales with the context length (see an explanation here). The paper instead evaluates with a surrogate metric: the authors show that they could work with a 128K context window by computing the perplexity with a sliding window.

  • ALiBi

    • mosaicml/mpt-7b-8k-instruct, mosaicml/mpt-7b-8k-chat, and mosaicml/mpt-7b-8k.
    • mosaicml/mpt-7b-storywriter: This model could extrapolate beyond 65K tokens.
  • LLongMA

Reading Notes | Understanding Dataset Difficulty with V-Usable Information

[Semantic Scholar] – [Code] – [Tweet] – [Video] – [Website] – [Slide]

Change Logs:

  • 2023-09-19: First draft. This paper appears as one of the outstanding papers at ICML 2022.

Overview

The main contribution of the paper is a metric that evaluates the aggregate and sample-wise difficulty of a dataset for a model family \mathcal{V}: a lower score indicates a more difficult dataset. This metric is appealing because it is able to do all five of the following, while previous approaches could only do one to three of them. Specifically,

  • Comparing Datasets: DIME (accepted as a workshop paper at NeurIPS 2020), IRT [4].
  • Comparing Models: Dynascore [3]
  • Comparing Instances: Data Shapley [5]
  • Comparing Dataset Slices
  • Comparing Attributes: The paper [6] estimates the attribute importance using MDL.

Method

Despite the theoretical constructs in Section 2, computing the proposed metric is fairly straightforward.

Suppose we have a training set \mathcal{D} _ \text{train} and a test set \mathcal{D} _ \text{test} for a task such as NLI. The proposed metric requires fine-tuning two models from the same base model \mathcal{V} on \mathcal{D} _ \text{train} and collecting measurements on \mathcal{D} _ \text{test} (Algorithm 1):

  • Step 1: Fine-tuning a model g’ on \mathcal{D} _ \text{train} = { (x_1, y_1), \cdots, (x_m, y_m) } and another model g on { (\phi, y_1), \cdots, (\phi, y_m) }, where \phi is an empty string; both g’ and g are initialized from the same base model, such as bert-base-uncased.
  • Step 2: For each test sample, the sample-wise difficulty (aka. PVI) is defined as \mathrm{PVI}(x_i \rightarrow y_i) := -\log_2 g(y_i\vert \phi) + \log_2 g'(y_i\vert x_i); the aggregate difficulty is its average \hat{I} _ \mathcal{V}(X \rightarrow Y) = \frac{1}{n}\sum _ i \mathrm{PVI}(x_i \rightarrow y_i).

    If the input and output are independent, the metric is provably 0; empirically, it will be close to 0. A sketch of the computation in Step 2 follows.
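
A minimal sketch of Step 2, assuming log2_p_conditional(x, y) and log2_p_null(y) return the base-2 log-probabilities of the label under the fine-tuned models g' and g, respectively:

import numpy as np

def pvi_scores(test_samples, log2_p_conditional, log2_p_null):
    # PVI(x -> y) = -log2 g(y | empty input) + log2 g'(y | x); the aggregate metric is the mean
    scores = [
        log2_p_conditional(x, y) - log2_p_null(y)
        for x, y in test_samples
    ]
    return scores, float(np.mean(scores))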

Note that:

  • The method requires a reasonably large dataset \mathcal{D} _ \text{train}. However, the exact size is not known in advance unless we train many models and wait to see when the curve plateaus, which is not feasible in practice. The authors use 80% of the SNLI dataset for estimation (Appendix A).
  • The specific choice of models, hyperparameters, and random initializations does not influence the results a lot (Section 3.2).

Applications

There are several applications when we use the proposed metric to rank the samples in a dataset:

  • Identifying the annotation errors (Section 3).
  • Using the metric to select challenging samples for data selection, including training data selection, data augmentation, and TCP (Section 4).
  • Guiding the creation of new specifications as it is possible to compute the token-wise metric (Section 4.3).

Additional Notes

  • It is quite surprising that the CoLA dataset is more difficult than SNLI and MNLI according to the authors’ measure.

Code

Reference

  1. Dataset Cartography: Mapping and Diagnosing Datasets with Training Dynamics (Swayamdipta et al., EMNLP 2020): The method in the main paper and the method in this paper both require training a model.
  2. [2002.10689] A Theory of Usable Information Under Computational Constraints (Xu et al., ICLR 2020).
  3. [2106.06052] Dynaboard: An Evaluation-As-A-Service Platform for Holistic Next-Generation Benchmarking
  4. Evaluation Examples are not Equally Informative: How should that change NLP Leaderboards? (Rodriguez et al., ACL-IJCNLP 2021)
  5. [1904.02868] Data Shapley: Equitable Valuation of Data for Machine Learning (ICML 2019): Data shapley could give a pointwise estimate of a sample’s contribution to the decision boundary.
  6. [2103.03872] Rissanen Data Analysis: Examining Dataset Characteristics via Description Length (ICML 2021).

Research Notes | Manuscript Preparation in LaTeX

Overview

Computer science conferences have a high tolerance for style variability, which leads to stark variance in typesetting quality even for the final camera-ready versions. Here is one such example: the left one, from [1], is typeset much better than [2], a random sample from the same conference in the same year. Since the latter is representative of almost all papers from that conference, paper [1] easily stands out.

Template

  • Some templates look more professional than others. Whenever possible, we should use such templates.

Fonts

  • Use the lmodern package via \usepackage{lmodern} in the preamble; this single command significantly improves the first impression of the manuscript (a minimal sketch follows).
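
A minimal preamble sketch; the fontenc line is a common companion of lmodern rather than something required by the note above.

\documentclass{article}

% Latin Modern: a sharper, fully scalable successor to Computer Modern
\usepackage{lmodern}
\usepackage[T1]{fontenc}

\begin{document}
The quick brown fox jumps over the lazy dog. $e^{i\pi} + 1 = 0$
\end{document}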

Graphics

Reference

  1. packages – Suggest a “nice” font family for my basic LaTeX template (text and math) – TeX – LaTeX Stack Exchange

Reading Notes | NoisywikiHow – A Benchmark for Learning with Real-world Noisy Labels in Natural Language Processing

[Semantic Scholar] – [Code] – [Tweet] – [Video] – [Website] – [Slide]

Change Logs:

  • 2023-09-14: First draft. The paper appears at ACL 2023. The code base has very detailed instructions on how to reproduce their results.

Method

  • The authors find that the labeling errors are both annotator-dependent and instance-dependent.

Experiments

  • The best performing LNL method on the benchmark is SEAL [1]; one could also consider MixUp regularization [2]. All other LNL methods are almost indistinguishable from the base models, i.e., from not doing any intervention on the training process.

Additional Note

Comments

  • The reason creating a new dataset is necessary is that users could customize the noise level to compare the performance of different algorithms in a controlled setting.

Reference

  1. [2012.05458] Beyond Class-Conditional Assumption: A Primary Attempt to Combat Instance-Dependent Label Noise (Chen et al. AAAI 2021).
  2. [1710.09412] mixup: Beyond Empirical Risk Minimization (Zhang et al., ICLR 2018, 7.6K citations).
  3. Nonlinear Mixup: Out-of-Manifold Data Augmentation for Text Classification (Guo, AAAI 2020). One application of MixUp regularization in NLP. It is based on a CNN classifier and the improvement is quite marginal.
  4. [2006.06049] On Mixup Regularization (Carratino et al., JMLR): A theoretical analysis of MixUp regularization.
  5. Learning with Noisy Labels (Natarajan et al., NIPS 2013): This paper is the first paper that (theoretically) studies LNL. It considers the binary classification problem where labels are randomly flipped, which is theoretically appealing but less relevant empirically according to the main paper.

Research Notes | Training Data Optimization

Problem Statement

Suppose we have a collection of datasets from K sources \mathcal{D} _ 1, \cdots, \mathcal{D} _ K. These K datasets have been unified regarding input and output spaces.

Now we split each \mathcal{D} _ i into train, validation, and test splits \mathcal{D} _ i ^ \text{train}, \mathcal{D} _ i ^ \text{val}, and \mathcal{D} _ i ^ \text{test} and form the aggregated train, validation, and test sets as \mathcal{D}^\text{train} := \cup _ {i=1}^K \mathcal{D} _ i^\text{train}, \mathcal{D}^\text{val} := \cup _ {i=1}^K \mathcal{D} _ i^\text{val}, and \mathcal{D}^\text{test} := \cup _ {i=1}^K \mathcal{D} _ i^\text{test}.

The learning problem could vary depending on the quality of the datasets after (1) dataset collection and annotation by the authors of the different datasets and (2) dataset unification when merging the K datasets into one. This is because:

  • If labels are reliable, then this is a dataset selection problem. The goal is to save computational resources when training on \mathcal{D} \subseteq \mathcal{D} ^ \text{train} while matching the performance of a model trained on (1) each \mathcal{D}_i,\ i \in [K], (2) \mathcal{D} ^ \text{train}, and (3) \mathrm{Sample}(\mathcal{D} ^ \text{train}) that matches the size of \mathcal{D}.

    In some special cases, another motivation for dataset selection is that we know the size of a sampled dataset (for example, from the dataset statistics described in a paper) but we are not sure exactly which samples were selected.

  • If labels are not reliable, then the goal is to prevent the low-quality labels from offsetting the benefits of a larger training dataset (rather than distilling a smaller dataset to save compute). We have three options:
| Index | Method | Type |
| --- | --- | --- |
| 1 | Reannotating the entire dataset. This could be reduced to a dataset distillation problem, as we now have more confidence in the filtered dataset. | Offline |
| 2 | Identifying and removing unreliable labels and optionally using these samples as an unsupervised dataset. This is also reducible to a dataset selection problem, as in 1. | Offline |
| 3 | Learning with the noisy labels (LNL, as described in [1]) as they are; this requires the learning algorithm to explicitly account for the variability in label quality. | Online |

Note that there is a related topic called "dataset distillation" that one may easily confuse with dataset selection. The goal of dataset distillation is to create a synthetic dataset in the feature space, based on the original one, that matches the original performance on the test set. Previous work shows that it is possible to attain the original performance on MNIST ([3]) and IMDB ([4]) with synthetic datasets of size (surprisingly) 10 and 20, respectively.

Adaptive Data Selection

With the test sets finalized, we could now work on sampling training sets, i.e., choosing one specific \mathrm{Sample}(\cdot) function described above. The goal here is to sample the training set so that the scores on the test sets are maximized:

  • DSIR: Suppose we need to sample B batches totaling K samples. We could start by randomly sampling the first batch and then call the DSIR algorithm for the subsequent batches until we have collected K samples. This should be done for each label. A hedged sketch of the underlying idea follows.
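
The snippet below is not the DSIR implementation itself but a hedged sketch of the underlying idea (importance resampling with unigram features and Gumbel-perturbed top-k selection); the feature choice and smoothing are assumptions.

from collections import Counter

import numpy as np

def unigram_logprob(texts, vocab, alpha=1.0):
    counts = Counter(token for text in texts for token in text.split())
    probs = np.array([counts[token] + alpha for token in vocab], dtype=float)
    return dict(zip(vocab, np.log(probs / probs.sum())))

def importance_resample(candidates, target_texts, k, seed=0):
    # weight each candidate by log p_target(x) - log p_source(x) under unigram models,
    # then pick k candidates via Gumbel-perturbed top-k (sampling without replacement)
    rng = np.random.default_rng(seed)
    vocab = sorted({token for text in candidates + target_texts for token in text.split()})
    lp_target = unigram_logprob(target_texts, vocab)
    lp_source = unigram_logprob(candidates, vocab)
    weights = np.array([
        sum(lp_target[token] - lp_source[token] for token in text.split())
        for text in candidates
    ])
    top = np.argsort(-(weights + rng.gumbel(size=len(candidates))))[:k]
    return [candidates[i] for i in top]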

Reference

  1. NoisywikiHow: A Benchmark for Learning with Real-world Noisy Labels in Natural Language Processing (Wu et al., Findings 2023)
  2. [2202.01327] Adaptive Sampling Strategies to Construct Equitable Training Datasets (Cai et al., FAccT 2023)
  3. [2301.04272] Data Distillation: A Survey (Sachdeva and McAuley, JMLR).
  4. [1811.10959] Dataset Distillation (Wang et al.)
  5. [1910.02551] Soft-Label Dataset Distillation and Text Dataset Distillation (Sucholutsky and Schonlau, IJCNN 2020). This is the only paper referenced in 3 describing the dataset distillation for texts. This paper is based on the very original data distillation objective proposed in 4.
  6. [2302.03169] Data Selection for Language Models via Importance Resampling (Xie et al.)
  7. [2305.10429] DoReMi: Optimizing Data Mixtures Speeds Up Language Model Pretraining (Xie et al.)
  8. [2306.11670] GIO: Gradient Information Optimization for Training Dataset Selection (Everaert and Potts): This paper has similar settings as the DSIR paper [6]: we are selecting new samples by minimizing their KL divergence with an existing set of unlabeled samples. The paper claims an advantage over the DSIR as the proposed algorithm requires fewer samples:

    Like GIO, these heuristic methods aim to select a subset of data that is higher quality and more relevant. However, they are either highly tailored to their particular tasks or they require very large numbers of examples (to develop classifiers or construct target probabilities). By contrast, GIO is task- and domain-agnostic, it can be applied plug-and-play to a new task and dataset, and it requires comparatively few gold examples X to serve as the target distribution.

Talk Notes | Data-Centric NLP @ USC CSCI-699 Fall 2022

Outline

The following is the course schedule (indeed a reading list) compiled from the course website for quick reference.

I. Datasets in NLP

  • Aug 22 – Introduction, Historical Perspective, and Overview
    • Fair ML Book, Chapter 7: Datasets
    • Sambasivan et al., 2021: "Everyone wants to do the model work, not the data work": Data Cascades in High-Stakes AI
    • Paullada et al., 2021: Data and its (dis)contents
    • Raji et al., 2022: Ethical Challenges of Data Collection & Use in Machine Learning Research
  • Aug 24 – Data Collection and Data Ethics
    • Deng et al., 2009: ImageNet: A large-scale hierarchical image database
    • Kwiatkowski et al., 2019: Natural Questions: A Benchmark for Question Answering Research
    • Sakaguchi et al., 2019: WinoGrande: An Adversarial Winograd Schema Challenge at Scale
    • Bowman et al., 2015: A large annotated corpus for learning natural language inference
    • Nie et al., 2020: Adversarial NLI: A New Benchmark for Natural Language Understanding
  • Aug 31 – More on Data Ethics
    • Bender et al., 2021: On the Dangers of Stochastic Parrots: Can Language Models Be Too Big?
    • Koch et al., 2021: Reduced, Reused and Recycled: The Life of a Dataset in Machine Learning Research
    • Klein and D'Ignazio, 2020: Data Feminism (book), Intro and Chapter 1
    • Strubell et al., 2019: Energy and Policy Considerations for Deep Learning in NLP

II. Bias and Mitigation

  • Sep 7 – Biases: An Overview
    • Geirhos et al., 2020: Shortcut Learning in Deep Neural Networks
    • Hort et al., 2022: Bias Mitigation for Machine Learning Classifiers: A Comprehensive Survey
    • Feder et al., 2021: Causal Inference in Natural Language Processing: Estimation, Prediction, Interpretation and Beyond
  • Sep 12 – Spurious Biases I
    • Torralba & Efros, 2011: Unbiased Look at Dataset Bias
    • Geva et al., 2019: Are We Modeling the Task or the Annotator? An Investigation of Annotator Bias in Natural Language Understanding Datasets
    • McCoy et al., 2019: Right for the Wrong Reasons: Diagnosing Syntactic Heuristics in NLI
  • Sep 14 – Spurious Biases II
    • Gardner et al., 2021: Competency Problems: On Finding and Removing Artifacts in Language Data
    • Eisenstein, 2022: Informativeness and Invariance: Two Perspectives on Spurious Correlations in Natural Language
  • Sep 19 – Data-Centric Bias Mitigation
    • Srivastava et al., 2020: Robustness to spurious correlations via human annotations
    • Dixon et al., 2018: Measuring and mitigating unintended bias in text classification
    • Gardner et al., 2019: On Making Reading Comprehension More Comprehensive
  • Sep 21 – Data Augmentation for Bias Mitigation
    • Ng et al., 2020: SSMBA: Self-Supervised Manifold Based Data Augmentation for Improving O.O.D. Robustness
    • Kaushik et al., 2019: Learning the Difference that Makes a Difference with Counterfactually-Augmented Data

III. Estimating Data Quality

  • Sep 26 – Estimates of Data Quality
    • Le Bras et al., 2020: Adversarial Filters of Dataset Biases
    • Swayamdipta et al., 2020: Dataset Cartography: Mapping and Diagnosing Datasets with Training Dynamics
    • Liu et al., 2022: WANLI: Worker and AI Collaboration for Natural Language Inference Dataset Creation
    • Ethayarajh et al., 2022: Understanding Dataset Difficulty with V-Usable Information
  • Sep 28 – Aggregate vs. Point-wise Estimates of Data Quality
    • Ghorbani & Zou, 2019: Data Shapley: Equitable Valuation of Data for Machine Learning
    • Perez et al., 2021: Rissanen Data Analysis: Examining Dataset Characteristics via Description Length
    • Mindermann et al., 2022: Prioritized Training on Points that are Learnable, Worth Learning, and Not Yet Learnt
  • Oct 3 – Anomalies, Outliers, and Out-of-Distribution Examples
    • Hendrycks et al., 2018: Deep Anomaly Detection with Outlier Exposure
    • Ren et al., 2019: Likelihood Ratios for Out-of-Distribution Detection
  • Oct 5 – Disagreements, Subjectivity and Ambiguity I
    • Pavlick et al., 2019: Inherent Disagreements in Human Textual Inferences
    • Röttger et al., 2022: Two Contrasting Data Annotation Paradigms for Subjective NLP Tasks
    • Denton et al., 2021: Whose Ground Truth? Accounting for Individual and Collective Identities Underlying Dataset Annotation
  • Oct 12 – Disagreements, Subjectivity and Ambiguity II
    • Miceli et al., 2020: Between Subjectivity and Imposition: Power Dynamics in Data Annotation for Computer Vision
    • Davani et al., 2021: Dealing with Disagreements: Looking Beyond the Majority Vote in Subjective Annotations

IV. Data for Accountability

  • Oct 17 – Creating Evaluation Sets
    • Recht et al., 2019: Do ImageNet Classifiers Generalize to ImageNet?
    • Card et al., 2020: With Little Power Comes Great Responsibility
    • Clark et al., 2021: All That's 'Human' Is Not Gold: Evaluating Human Evaluation of Generated Text
    • Ethayarajh & Jurafsky, 2020: Utility is in the eye of the user: a critique of NLP leaderboards
  • Oct 19 – Counterfactual Evaluation
    • Gardner et al., 2020: Evaluating Models' Local Decision Boundaries via Contrast Sets
    • Ross et al., 2021: Tailor: Generating and Perturbing Text with Semantic Controls
  • Oct 24 – Adversarial Evaluation
    • Jia and Liang, 2017: Adversarial Examples for Evaluating Reading Comprehension Systems
    • Kiela et al., 2021: Dynabench: Rethinking Benchmarking in NLP
    • Li and Michael, 2022: Overconfidence in the Face of Ambiguity with Adversarial Data
  • Oct 26 – Contextualizing Decisions
    • Gebru et al., 2018: Datasheets for Datasets
    • Bender and Friedman, 2018: Data Statements for Natural Language Processing: Toward Mitigating System Bias and Enabling Better Science

V. Beyond Labeled Datasets

  • Oct 31 – Unlabeled Data
    • Dodge et al., 2021: Documenting Large Webtext Corpora: A Case Study on the Colossal Clean Crawled Corpus
    • Lee et al., 2022: Deduplicating Training Data Makes Language Models Better
    • Gururangan et al., 2022: Whose Language Counts as High Quality? Measuring Language Ideologies in Text Data Selection
  • Nov 2 – Prompts as Data?
    • Wei et al., 2022: Chain of Thought Prompting Elicits Reasoning in Large Language Models
  • Nov 7 – Data Privacy and Security
    • Amodei et al., 2016: Concrete Problems in AI Safety
    • Carlini et al., 2020: Extracting Training Data from Large Language Models
    • Henderson et al., 2022: Pile of Law: Learning Responsible Data Filtering from the Law and a 256GB Open-Source Legal Dataset
  • Nov 9 – Towards Better Data Citizenship
    • Jo & Gebru, 2019: Lessons from Archives: Strategies for Collecting Sociocultural Data in Machine Learning
    • Hutchinson et al., 2021: Towards Accountability for Machine Learning Datasets: Practices from Software Engineering and Infrastructure