Reading Notes | NoisywikiHow – A Benchmark for Learning with Real-world Noisy Labels in Natural Language Processing

[Semantic Scholar] – [Code] – [Tweet] – [Video] – [Website] – [Slide]

Change Logs:

  • 2023-09-14: First draft. The paper appears at ACL 2023. The code base has very detailed instructions on how to reproduce their results.

Method

  • The authors find that the labeling errors are both annotator-dependent and instance-dependent.

Experiments

  • The best performing LNL method on the benchmark is SEAL [1]; one could also consider MixUp regularization [2]. All other LNL methods perform almost indistinguishably from the base models, i.e., models trained without any intervention on the training process.

Additional Note

Comments

  • Creating a new dataset is necessary because it lets users customize the noise level and compare the performance of different algorithms in a controlled setting.

Reference

  1. [2012.05458] Beyond Class-Conditional Assumption: A Primary Attempt to Combat Instance-Dependent Label Noise (Chen et al. AAAI 2021).
  2. [1710.09412] mixup: Beyond Empirical Risk Minimization (Zhang et al., ICLR 2018, 7.6K citations).
  3. Nonlinear Mixup: Out-of-Manifold Data Augmentation for Text Classification (Guo, AAAI 2020). One application of MixUp regularization in NLP. It is based on a CNN classifier and the improvement is quite marginal.
  4. [2006.06049] On Mixup Regularization (Carratino et al., JMLR): A theoretical analysis of MixUp regularization.
  5. Learning with Noisy Labels (Natarajan et al., NIPS 2013): This is the first paper that (theoretically) studies LNL. It considers a binary classification problem where labels are randomly flipped, which is theoretically appealing but less relevant empirically according to the main paper.

Research Notes | Training Data Optimization

Problem Statement

Suppose we have a collection of datasets from K sources \mathcal{D} _ 1, \cdots, \mathcal{D} _ K. These K datasets have been unified regarding input and output spaces.

Now we split each \mathcal{D} _ i into train, validation, and test splits \mathcal{D} _ i ^ \text{train},\ \mathcal{D} _ i ^ \text{val}, and \mathcal{D} _ i ^ \text{test}, and form the aggregated train, validation, and test sets as \mathcal{D}^\text{train} := \cup _ {i=1}^ K \mathcal{D} _ i^\text{train}, \mathcal{D}^\text{val} := \cup _ {i=1}^ K \mathcal{D} _ i^\text{val}, and \mathcal{D}^\text{test} := \cup _ {i=1}^ K \mathcal{D} _ i^\text{test}.
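As a concrete illustration, here is a minimal sketch of the splitting and aggregation step; the per-source dictionaries, the split fractions, and the helper name are illustrative assumptions rather than part of any specific pipeline.

import random

def split_and_aggregate(sources, val_frac=0.1, test_frac=0.1, seed=0):
    """sources: dict mapping a source name to its list of unified (x, y) pairs."""
    rng = random.Random(seed)
    aggregated = {"train": [], "val": [], "test": []}
    for name, examples in sources.items():
        examples = list(examples)
        rng.shuffle(examples)
        n_val = int(len(examples) * val_frac)
        n_test = int(len(examples) * test_frac)
        aggregated["val"].extend(examples[:n_val])                  # D_i^val
        aggregated["test"].extend(examples[n_val:n_val + n_test])   # D_i^test
        aggregated["train"].extend(examples[n_val + n_test:])       # D_i^train
    return aggregated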

The learning problem varies depending on the quality of the datasets after (1) dataset collection and annotation by the authors of the different datasets, and (2) dataset unification when merging the K datasets into one:

  • If labels are reliable, then this is a dataset selection problem. The goal is to save computational resources by training on \mathcal{D} \subseteq \mathcal{D} ^ \text{train} while matching the performance of a model trained on (1) each \mathcal{D} _ i,\ i \in [K], (2) \mathcal{D} ^ \text{train}, and (3) \mathrm{Sample}(\mathcal{D} ^ \text{train}) that matches the size of \mathcal{D}.

    In some special cases, another motivation for dataset selection is that we know the size of a sampled dataset (for example, from the dataset statistics described in a paper) but we do not know exactly which samples were used.

  • If labels are not reliable, then the goal is to prevent the low-quality labels from offsetting the benefits of a larger training dataset (rather than distilling a smaller dataset to save compute). We have three options:
| Index | Method | Type |
| --- | --- | --- |
| 1 | Reannotating the entire dataset. This could be reduced to a dataset distillation problem, as we now have more confidence in the filtered dataset. | Offline |
| 2 | Identifying and removing unreliable labels, and optionally using the removed samples as an unsupervised dataset. This is also reducible to a dataset selection problem, as in 1. | Offline |
| 3 | Learning with the noisy labels (LNL, as described in [1]) as they are; this requires the learning algorithm to explicitly account for the variability in label quality. | Online |

Note that there is a closely related topic called “dataset distillation” that one may easily confuse with the above. The goal of dataset distillation is to create a synthetic dataset in the feature space based on the original one that matches the original performance on the test set. Previous work shows that it is possible to attain the original performance on MNIST ([3]) and IMDB ([4]) with a synthetic dataset of (surprisingly) only 10 and 20 samples, respectively.

Adaptive Data Selection

With the test sets finalized, we could now work on sampling training sets, i.e., choosing one specific \mathrm{Sample}(\cdot) function described above. The goal here is to sample the training set so that the scores on the test sets are maximized:

  • DSIR: Suppose we need to sample B batches of examples totaling N; we could randomly sample the first batch and then call the DSIR algorithm for the subsequent batches until we have collected N examples. This should be done for each label (see the sketch below).
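Here is a minimal sketch of the batched selection loop described above; dsir_select is a hypothetical stand-in for an importance-resampling selector such as DSIR [6], not its actual API, and the data structures are illustrative.

import random

def select_training_set(pool_by_label, target_per_label, num_batches, dsir_select):
    """pool_by_label: dict mapping each label to its candidate examples.
    dsir_select: hypothetical callable (candidates, selected_so_far, k) -> list
    that returns the k candidates closest to the distribution of selected_so_far."""
    batch_size = max(1, target_per_label // num_batches)
    selected = {label: [] for label in pool_by_label}
    for label, pool in pool_by_label.items():
        remaining = list(pool)
        # The first batch is drawn uniformly at random to seed the target distribution.
        first = random.sample(remaining, min(batch_size, len(remaining)))
        selected[label].extend(first)
        remaining = [x for x in remaining if x not in first]
        # Later batches defer to the importance-resampling selector.
        while len(selected[label]) < target_per_label and remaining:
            k = min(batch_size, target_per_label - len(selected[label]))
            batch = dsir_select(remaining, selected[label], k)
            selected[label].extend(batch)
            remaining = [x for x in remaining if x not in batch]
    return selected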

Reference

  1. NoisywikiHow: A Benchmark for Learning with Real-world Noisy Labels in Natural Language Processing (Wu et al., Findings 2023)
  2. [2202.01327] Adaptive Sampling Strategies to Construct Equitable Training Datasets (Cai et al., FAccT 2023)
  3. [2301.04272] Data Distillation: A Survey (Sachdeva and McAuley, JMLR).
  4. [1811.10959] Dataset Distillation (Wang et al.)
  5. [1910.02551] Soft-Label Dataset Distillation and Text Dataset Distillation (Sucholutsky and Schonlau, IJCNN 2020). This is the only paper referenced in [3] that describes dataset distillation for text. It is based on the original dataset distillation objective proposed in [4].
  6. [2302.03169] Data Selection for Language Models via Importance Resampling (Xie et al.)
  7. [2305.10429] DoReMi: Optimizing Data Mixtures Speeds Up Language Model Pretraining (Xie et al.)
  8. [2306.11670] GIO: Gradient Information Optimization for Training Dataset Selection (Everaert and Potts): This paper has a similar setting to the DSIR paper [6]: we select new samples by minimizing their KL divergence with an existing set of unlabeled samples. The paper claims an advantage over DSIR in that the proposed algorithm requires fewer gold examples:

    Like GIO, these heuristic methods aim to select a subset of data that is higher quality and more relevant. However, they are either highly tailored to their particular tasks or they require very large numbers of examples (to develop classifiers or construct target probabilities). By contrast, GIO is task- and domain-agnostic, it can be applied plug-and-play to a new task and dataset, and it requires comparatively few gold examples X to serve as the target distribution.

Talk Notes | Data-Centric NLP @ USC CSCI-699 Fall 2022

Outline

The following is the course schedule (essentially a reading list) compiled from the course website for quick reference.

I. Datasets in NLP

  • Aug 22 – Introduction, Historical Perspective, and Overview

    Fair ML Book Chapter 7: Datasets
    Sambasivan et al., 2021: “Everyone wants to do the model work, not the data work”: Data Cascades in High-Stakes AI
    Paullada et al., 2021: Data and its (dis)contents
    Raji et al., 2022: Ethical Challenges of Data Collection & Use in Machine Learning Research

  • Aug 24 – Data Collection and Data Ethics

    Deng et al., 2009: ImageNet: A large-scale hierarchical image database
    Kwiatkowski et al., 2019: Natural Questions: A Benchmark for Question Answering Research
    Sakaguchi et al., 2019: WinoGrande: An Adversarial Winograd Schema Challenge at Scale
    Bowman et al., 2015: A large annotated corpus for learning natural language inference
    Nie et al., 2020: Adversarial NLI: A New Benchmark for Natural Language Understanding

  • Aug 31 – More on Data Ethics

    Bender et al., 2021: On the Dangers of Stochastic Parrots: Can Language Models Be Too Big?
    Koch et al., 2021: Reduced, Reused and Recycled: The Life of a Dataset in Machine Learning Research
    Klein and D’Ignazio, 2020: Data Feminism Book: Intro and Chapter 1
    Strubell et al., 2019: Energy and Policy Considerations for Deep Learning in NLP

II. Bias and Mitigation

  • Sep 7 – Biases: An Overview

    Geirhos et al., 2020: Shortcut Learning in Deep Neural Networks
    Hort et al., 2022: Bias Mitigation for Machine Learning Classifiers: A Comprehensive Survey
    Feder et al., 2021: Causal Inference in Natural Language Processing: Estimation, Prediction, Interpretation and Beyond

  • Sep 12 – Spurious Biases I

    Torralba & Efros, 2011: Unbiased Look at Dataset Bias
    Geva et al., 2019: Are We Modeling the Task or the Annotator? An Investigation of Annotator Bias in Natural Language Understanding Datasets
    McCoy et al., 2019: Right for the Wrong Reasons: Diagnosing Syntactic Heuristics in NLI

  • Sep 14 – Spurious Biases II

    Gardner et al., 2021: Competency Problems: On Finding and Removing Artifacts in Language Data
    Eisenstein, 2022: Informativeness and Invariance: Two Perspectives on Spurious Correlations in Natural Language

  • Sep 19 – Data-Centric Bias Mitigation

    Srivastava et al., 2020: Robustness to spurious correlations via human annotations
    Dixon et al., 2018: Measuring and mitigating unintended bias in text classification
    Gardner et al., 2019: On Making Reading Comprehension More Comprehensive

  • Sep 21 – Data Augmentation for Bias Mitigation

    Ng et al., 2020: SSMBA: Self-Supervised Manifold Based Data Augmentation for Improving O.O.D. Robustness
    Kaushik et al., 2019: Learning the Difference that Makes a Difference with Counterfactually-Augmented Data

III. Estimating Data Quality

  • Sep 26 – Estimates of Data Quality

    Le Bras et al., 2020: Adversarial Filters of Dataset Biases
    Swayamdipta et al., 2020: Dataset Cartography: Mapping and Diagnosing Datasets with Training Dynamics
    Liu et al., 2022: WANLI: Worker and AI Collaboration for Natural Language Inference Dataset Creation
    Ethayarajh et al., 2022: Understanding Dataset Difficulty with V-Usable Information

  • Sep 28 – Aggregate vs. Point-wise Estimates of Data Quality

    Ghorbani & Zou, 2019: Data Shapley: Equitable Valuation of Data for Machine Learning
    Perez et al., 2021: Rissanen Data Analysis: Examining Dataset Characteristics via Description Length
    Mindermann et al., 2022: Prioritized Training on Points that are Learnable, Worth Learning, and Not Yet Learnt

  • Oct 3 – Anomalies, Outliers, and Out-of-Distribution Examples

    Hendrycks et al., 2018: Deep Anomaly Detection with Outlier Exposure
    Ren et al., 2019: Likelihood Ratios for Out-of-Distribution Detection

  • Oct 5 – Disagreements, Subjectivity and Ambiguity I

    Pavlick et al., 2019: Inherent Disagreements in Human Textual Inferences
    Röttger et al., 2022: Two Contrasting Data Annotation Paradigms for Subjective NLP Tasks
    Denton et al., 2021: Whose Ground Truth? Accounting for Individual and Collective Identities Underlying Dataset Annotation

  • Oct 12 – Disagreements, Subjectivity and Ambiguity II

    Miceli et al., 2020: Between Subjectivity and Imposition: Power Dynamics in Data Annotation for Computer Vision
    Davani et al., 2021: Dealing with Disagreements: Looking Beyond the Majority Vote in Subjective Annotations

IV. Data for Accountability

  • Oct 17 – Creating Evaluation Sets

    Recht et al., 2019: Do ImageNet Classifiers Generalize to ImageNet?
    Card et al., 2020: With Little Power Comes Great Responsibility
    Clark et al., 2021: All That’s ‘Human’ Is Not Gold: Evaluating Human Evaluation of Generated Text
    Ethayarajh & Jurafsky, 2020: Utility is in the eye of the user: a critique of NLP leaderboards

  • Oct 19 – Counterfactual Evaluation

    Gardner et al., 2020: Evaluating Models’ Local Decision Boundaries via Contrast Sets
    Ross et al., 2021: Tailor: Generating and Perturbing Text with Semantic Controls

  • Oct 24 – Adversarial Evaluation

    Jia and Liang, 2017: Adversarial Examples for Evaluating Reading Comprehension Systems
    Kiela et al., 2021: Dynabench: Rethinking Benchmarking in NLP
    Li and Michael, 2022: Overconfidence in the Face of Ambiguity with Adversarial Data

  • Oct 26 – Contextualizing Decisions

    Gebru et al., 2018: Datasheets for Datasets
    Bender and Friedman, 2018: Data Statements for Natural Language Processing: Toward Mitigating System Bias and Enabling Better Science

V. Beyond Labeled Datasets

  • Oct 31 – Unlabeled Data

    Dodge et al., 2021: Documenting Large Webtext Corpora: A Case Study on the Colossal Clean Crawled Corpus
    Lee et al., 2022: Deduplicating Training Data Makes Language Models Better
    Gururangan et al., 2022: Whose Language Counts as High Quality? Measuring Language Ideologies in Text Data Selection

  • Nov 2 – Prompts as Data?

    Wei et al., 2022: Chain of Thought Prompting Elicits Reasoning in Large Language Models

  • Nov 7 – Data Privacy and Security

    Amodei et al., 2016: Concrete Problems in AI Safety
    Carlini et al., 2020: Extracting Training Data from Large Language Models
    Henderson et al., 2022: Pile of Law: Learning Responsible Data Filtering from the Law and a 256GB Open-Source Legal Dataset

  • Nov 9 – Towards Better Data Citizenship

    Jo & Gebru, 2019: Lessons from Archives: Strategies for Collecting Sociocultural Data in Machine Learning
    Hutchinson et al., 2021: Towards Accountability for Machine Learning Datasets: Practices from Software Engineering and Infrastructure

Reading Notes | Improving Generalization of Hate Speech Detection Systems to Novel Target Groups via Domain Adaptation

[Semantic Scholar] – [Code] – [Tweet] – [Video] – [Website] – [Slide]

Change Logs:

  • 2023-09-11: First draft. This paper appears at WOAH ’22.

The paper studies generalization to new hate target groups on the single HateXplain dataset; the authors do so by comparing four existing methods: (1) Unsupervised Domain Adaptation (UDA; this method is also used in [1]), (2) MixUp regularization, (3) curriculum labeling, and (4) DANN.

The paper also considers the back translation approach (specifically (en, fr), (en, de), and (en, es)) for data augmentation.

Experiments

  • Zero: Directly apply a model trained on \mathcal{D}_A to a new domain \mathcal{D}_B.
  • Zero+: Augmenting \mathcal{D}_A using back-translation.
  • ZeroB+: Applying back-translation-based data augmentation while making sure that each batch is class-balanced (see the sketch below).
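Below is a minimal sketch of the back-translation augmentation used in Zero+ and ZeroB+; translate is a hypothetical helper wrapping an MT model for the (en, fr), (en, de), and (en, es) pairs, not an interface from the paper, and class balancing of batches is left to the data loader.

def back_translate(text, pivot, translate):
    """Round-trip a sentence through a pivot language to obtain a paraphrase.
    translate: hypothetical callable (text, src_lang, tgt_lang) -> str."""
    return translate(translate(text, "en", pivot), pivot, "en")

def augment_with_back_translation(dataset, translate, pivots=("fr", "de", "es")):
    """dataset: list of (text, label) pairs; returns originals plus paraphrases."""
    augmented = list(dataset)
    for text, label in dataset:
        for pivot in pivots:
            augmented.append((back_translate(text, pivot, translate), label))
    return augmented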

Reference

  1. Unsupervised Domain Adaptation in Cross-corpora Abusive Language Detection (Bose et al., SocialNLP 2021): This paper considers the setting of training on dataset \mathcal{D}_A and testing on another dataset \mathcal{D}_B, where A, B are HateEval, Waseem, and Davidson, resulting in 6 pairs. They use several existing methods to improve the test scores on \mathcal{D}_B.
  2. [2012.10289] HateXplain: A Benchmark Dataset for Explainable Hate Speech Detection (Mathew et al., AAAI 2021): At the time, this was the only dataset that provides the target groups of both hateful and non-hateful content.
  3. Data augmentation could happen in the symbol space (via rules, word replacement through BERT, or text-generation models) or in the feature space. However, the main paper chooses back translation for data augmentation.

    Here are two libraries on data augmentation in NLP:

Reading Notes | Directions in Abusive Language Training Data – Garbage In, Garbage Out

[Semantic Scholar] – [Code] – [Tweet] – [Video] – [Website] – [Slide]

Change Logs:

  • 2023-09-06: First draft. This paper provides the influential hate speech dataset hub hatespeechdata.com even though it appears in PLoS One.

This paper provides a survey of existing (as of 2020) hate speech datasets and some suggestions for creating future hate speech datasets.

Research Notes | A Benchmark for Hate Speech Detection

Overview

There does not exist a unified benchmark (such as GLUE) in the hate speech detection domain that provides a leaderboard-style performance comparison of different open-source hate speech classifiers. This prevents practitioners from making informed decisions when choosing a model for their own hate speech detection applications.

The benchmark will provide the following:

  • The entire training and validation sets for future studies. However, the labels of the public test sets will not be released for benchmarking purposes; there will also be additional private test sets.
  • The ranking of the models based on the average aggregated metrics (for example, F1 score) on the public and private test sets.

Protocol

  • Step 1: Randomly select a test set and a validation set.

    The two datasets must be randomly selected for the following reasons:

    1. The distribution of the validation set will be similar to that of the test set. Using a randomly sampled validation set helps select the models that are more likely to perform well on the test set.
    2. This makes the two datasets independent of each other in terms of label and source distribution. Throughout the experiments, the test and validation sets are kept fixed; this is helpful as we can see the (dis)advantages of each method in the wandb dashboard.
  • Step 2: Sampling the training set using different (a) data selection methods.
  • Step 3: Training or fine-tuning (b) different models with (c) different techniques for local improvements, for example, objective functions and regularization.
  • Step 4: Comparing different combinations of (a), (b), and (c). If we have m combinations and n test sets, then we end up with a table of shape (m, n+1), where the first column lists all the combinations (see the sketch below).
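A minimal sketch of Step 4, assuming hypothetical train and evaluate helpers; it only shows how the (m, n+1) table is assembled.

import itertools
import pandas as pd

def build_results_table(selection_methods, models, techniques, test_sets, train, evaluate):
    """Rows are the m combinations of (a), (b), and (c); the first column names the
    combination and the remaining n columns hold one score per test set."""
    rows = []
    for a, b, c in itertools.product(selection_methods, models, techniques):
        classifier = train(a, b, c)                             # hypothetical training routine
        scores = [evaluate(classifier, t) for t in test_sets]   # e.g., macro F1 per test set
        rows.append([f"{a} + {b} + {c}"] + scores)
    columns = ["combination"] + [f"test_set_{i + 1}" for i in range(len(test_sets))]
    return pd.DataFrame(rows, columns=columns)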

Candidate Datasets

Collected Datasets from Diverse Topics

The current data aggregation includes [1] through [5], where [5] only includes hate speech.

  1. Detecting East Asian Prejudice on Social Media (Vidgen et al., ALW 2020)
  2. [2005.12423] Racism is a Virus: Anti-Asian Hate and Counterspeech in Social Media during the COVID-19 Crisis (He et al.)
  3. [2108.12521] TweetBLM: A Hate Speech Dataset and Analysis of Black Lives Matter-related Microblogs on Twitter (Kumar et al.)
  4. Hate Towards the Political Opponent: A Twitter Corpus Study of the 2020 US Elections on the Basis of Offensive Speech and Stance Detection (Grimminger & Klinger, WASSA 2021)
  5. Latent Hatred: A Benchmark for Understanding Implicit Hate Speech (ElSherief et al., EMNLP 2021)

cardiffnlp/twitter-roberta-base-hate-latest Collection

The following are the datasets used for the model cardiffnlp/twitter-roberta-base-hate-latest, i.e., the paper below:

Robust Hate Speech Detection in Social Media: A Cross-Dataset Empirical Evaluation (Antypas & Camacho-Collados, WOAH 2023)

| Index | Dataset Name | Source | Notes |
| --- | --- | --- | --- |
| 1 | HatE | Link that requires filling in a Google form. | |
| 2 | MHS | ucberkeley-dlab/measuring-hate-speech | |
| 3 | DEAP | Zenodo | |
| 4 | CMS | Link that requires registration and email verification. | |
| 5 | Offense | Link; this dataset is also called OLID. | |
| 6 | HateX | hatexplain and GitHub | |
| 7 | LSC | GitHub | Dehydrated |
| 8 | MMHS | nedjmaou/MLMA_hate_speech and GitHub | |
| 9 | HASOC | Link that requires uploading a signed agreement; this agreement takes up to 15 days to approve. | Not Available |
| 10 | AYR | GitHub | Dehydrated |
| 11 | AHSD | GitHub | |
| 12 | HTPO | Link | |
| 13 | HSHP | GitHub | Dehydrated |

The following are the papers that correspond to the list of datasets:

  1. SemEval-2019 Task 5: Multilingual Detection of Hate Speech Against Immigrants and Women in Twitter (Basile et al., SemEval 2019)
  2. The Measuring Hate Speech Corpus: Leveraging Rasch Measurement Theory for Data Perspectivism (Sachdeva et al., NLPerspectives 2022)
  3. Detecting East Asian Prejudice on Social Media (Vidgen et al., ALW 2020)
  4. [2004.12764] “Call me sexist, but…”: Revisiting Sexism Detection Using Psychological Scales and Adversarial Samples (Samory et al.)
  5. Predicting the Type and Target of Offensive Posts in Social Media (Zampieri et al., NAACL 2019)
  6. [2012.10289] HateXplain: A Benchmark Dataset for Explainable Hate Speech Detection (Mathew et al.)
  7. [1802.00393] Large Scale Crowdsourcing and Characterization of Twitter Abusive Behavior (Founta et al.)
  8. Multilingual and Multi-Aspect Hate Speech Analysis (Ousidhoum et al., EMNLP-IJCNLP 2019)
  9. [2108.05927] Overview of the HASOC track at FIRE 2020: Hate Speech and Offensive Content Identification in Indo-European Languages (Mandal et al.)
  10. Are You a Racist or Am I Seeing Things? Annotator Influence on Hate Speech Detection on Twitter (Waseem, NLP+CSS 2016)
  11. [1703.04009] Automated Hate Speech Detection and the Problem of Offensive Language (Davidson et al.)
  12. Hate Towards the Political Opponent: A Twitter Corpus Study of the 2020 US Elections on the Basis of Offensive Speech and Stance Detection (Grimminger & Klinger, WASSA 2021)
  13. Hateful Symbols or Hateful People? Predictive Features for Hate Speech Detection on Twitter (Waseem & Hovy, NAACL 2016)

It is possible to approximate a subset of the original training mixture (8 of 12 datasets, excluding the MMHS dataset, which only includes hate speech) following Table 2 of the original paper. Some things to note:

  • AYR, HASOC, HSHP, and LSC are not usable.
  • Offense does not exactly match the sizes in Table 2.
  • We disregard any splits and try to match the numbers in Table 2. When matching the numbers is not possible, we try to make sure that the ratio of non-hate versus hate is the same (see the sketch below).
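A minimal sketch of the matching procedure; the target size is a placeholder rather than an actual number from Table 2.

import random

def sample_to_match(examples, target_size, seed=0):
    """examples: list of (text, label) pairs with label in {"hate", "non-hate"}.
    Subsample to target_size while preserving the non-hate versus hate ratio;
    if the pool is too small, keep everything (the ratio is then unchanged)."""
    rng = random.Random(seed)
    hate = [e for e in examples if e[1] == "hate"]
    non_hate = [e for e in examples if e[1] == "non-hate"]
    if len(examples) <= target_size:
        picked = list(examples)
    else:
        n_hate = round(target_size * len(hate) / len(examples))
        picked = rng.sample(hate, n_hate) + rng.sample(non_hate, target_size - n_hate)
    rng.shuffle(picked)
    return picked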

Additional Datasets from hatespeechdata.com

The following are the additional datasets from hatespeechdata.com that are not included in the above-mentioned sources. The dataset names are either taken from the original papers or created here for easy reference.

| Index | Dataset Name | Source | Notes |
| --- | --- | --- | --- |
| 1 | AbuseEval | GitHub | The Offense dataset above reannotated for non-hate, implicit hate, and explicit hate; only IDs are available. Around 87% of the hate/non-hate labels are the same as in the previous Offense dataset. |
| 2 | SWAD | GitHub | |
| 3 | ALONE | | Not usable. Requires contacting the authors. |
| 4 | HatefulUsersTwitter | GitHub and Kaggle | Available but not relevant. This dataset is about detecting whether a user is hateful or neutral on the Twitter network; it does not come with annotated hateful/benign texts. |
| 5 | MMHS150K | Website | Not usable. Multimodal dataset. |
| 6 | HarassmentLexicon | GitHub | Not usable. Lexicons only. |
| 7 | P2PHate | GitHub | Not usable. Dehydrated. |
| 8 | Golbeck | | Not usable. Requires contacting jgolbeck@umd.edu. |
| 9 | SurgeAI | Website | Hateful content only. |
| 10 | TSA | Kaggle | The dataset is provided by Analytics Vidhya. The test.csv does not come with labels. |

  1. I Feel Offended, Don’t Be Abusive! Implicit/Explicit Messages in Offensive and Abusive Language (Caselli et al., LREC 2020): The dataset from this paper is also called AbuseEval v1.0.
  2. Do You Really Want to Hurt Me? Predicting Abusive Swearing in Social Media (Pamungkas et al., LREC 2020)
  3. [2008.06465] ALONE: A Dataset for Toxic Behavior among Adolescents on Twitter (Wijesiriwardene et al.)
  4. [1803.08977] Characterizing and Detecting Hateful Users on Twitter (Ribeiro et al., ICWSM 2018)
  5. [1910.03814] Exploring Hate Speech Detection in Multimodal Publications (Gomez et al., WACV 2020)
  6. [1802.09416] A Quality Type-aware Annotated Corpus and Lexicon for Harassment Research (Rezvan et al.)
  7. [1804.04649] Peer to Peer Hate: Hate Speech Instigators and Their Targets (ElSherief et al.)
  8. A Large Labeled Corpus for Online Harassment Research (Golbeck et al., WebSci 2017)
  9. Twitter Hate Speech Dataset (Surge AI)
  10. Twitter Sentiment Analysis (Kaggle)

Talk Notes | Paraphrasing Evades Detectors of AI-generated Text, But Retrieval is an Effective Defense by Kalpesh Krishna @ Google

[YouTube] – [Personal Website]

  • The presenter is the author of multiple influential papers on topics such as paraphrasing and model extraction attacks (see the references below).

Reference

  1. Reformulating Unsupervised Style Transfer as Paraphrase Generation (Krishna et al., EMNLP 2020)
  2. [1910.12366] Thieves on Sesame Street! Model Extraction of BERT-based APIs (Krishna et al., ICLR 2020)

Reading Notes | WILDS – A Benchmark of in-the-Wild Distribution Shifts

[Semantic Scholar] – [Code] – [Tweet] – [Video] – [Website] – [Slide] – [Leaderboard]

Change Logs:

  • 2023-09-06: First draft. The paper provides a standardized package for many domain generalization algorithms, including group DRO, DANN, and CORAL.

Background

Distribution shifts happen when the test conditions are “newer” or “smaller” compared to the training conditions. The paper defines these two cases as follows:

  • Newer: Domain Generalization

    The test distribution is related to but distinct from the training distributions, i.e., new or unseen during training. Note that the test conditions are not necessarily a superset of the training conditions; “newer” does not mean “larger” in contrast to the “smaller” case described below.

    Here are two typical examples of domain generalization described in the paper:

    • Training a model on patient information from some hospitals and expecting the model to generalize to many more hospitals; these hospitals may or may not be a superset of the hospitals we collected training data from.
    • Training an animal recognition model on images taken by some existing cameras and expecting the model to work on images taken by newer cameras.
  • Smaller: Subpopulation Shift

    The test distribution is a subpopulation of the training distributions. For example, degraded facial recognition accuracy on the underrepresented demographic groups ([3] and [4]).

Evaluation

The goal of OOD generalization is to train a model on data sampled from a training distribution P^\text{train} that performs well on a test distribution P^\text{test}. Because we cannot assume that data from the two distributions are equally difficult to learn, the ideal protocol is to train at least two models (or, even better, three) and take 2 (or 3) measurements:

| Index | Goal | Training Data | Testing Data |
| --- | --- | --- | --- |
| 1 | Measuring OOD generalization | $D^\text{train} \sim P^\text{train}$ | $D^\text{test} \sim P^\text{test}$ |
| 2 | Ruling out the confounding factor of distribution difficulty | $D^\text{test}_\text{heldout} \sim P^\text{test}$ | $D^\text{test} \sim P^\text{test}$ |
| 3 | (Optional) Sanity check | $D^\text{train} \sim P^\text{train}$ | $D^\text{train}_\text{heldout} \sim P^\text{train}$ |

However, the generally small test sets make measurement 2 hard or even impossible: we cannot find an additional held-out set D ^ \text{test} _ \text{heldout} that matches the size of D ^ \text{train} on which to train a model.

The authors therefore define 4 relaxed settings:

| Index | Setting | Training Data | Testing Data |
| --- | --- | --- | --- |
| 1 | Mixed-to-test | Mixture of $P^\text{train}$ and $P^\text{test}$ | $P^\text{test}$ |
| 2 | Train-to-train (aka setting 3 above) | $P^\text{train}$ | $P^\text{train}$ |
| 3 | Average. This is a special case of 2; it is only suitable for subpopulation shift, such as Amazon Reviews and CivilComments. | Average performance | Worst-group performance |
| 4 | Random split. This setting destroys $P^\text{test}$. | $\tilde{D}^\text{train} := \mathrm{Sample}(D^\text{train} \cup D^\text{test})$ | $(D^\text{train} \cup D^\text{test}) \backslash \tilde{D}^\text{train}$ |


Dataset

The benchmark includes regular and medical image, graph, and text datasets; 3 out of 10 are text datasets, among which the less familiar Py150 is a code completion dataset. Note that the authors do not cleanly define why there are subpopulation shifts for the Amazon Reviews and Py150 datasets, as they acknowledge below:

However, it is not always possible to cleanly define a problem as one or the other; for example, a test domain might be present in the training set but at a very low frequency.

For the Amazon Reviews dataset, one viable explanation for the subpopulation shift is the uneven distribution of reviews of the same product across the train, validation, and test sets.

| Name | Domain Generalization | Subpopulation Shift | Notes |
| --- | --- | --- | --- |
| CivilComments | No; the demographic information of the writers is unknown; if such information were known, we could also create a version with a domain generalization concern. | Yes; the mentions of 8 target demographic groups are available. | The only dataset with only subpopulation shift. |
| Amazon Reviews | Yes; due to disjoint users in the train, OOD validation, and OOD test sets; there are also an ID validation set and an ID test set drawn from the same users as the training set. | Yes | |
| Py150 | Yes; due to disjoint repositories in the train, OOD validation, and OOD test sets; there are also an ID validation set and an ID test set drawn from the same repositories as the training set. | Yes | |

Importantly, the authors note that a performance drop is a necessary condition for establishing a distribution shift. That is:

  • The presence of a nominal distribution shift (e.g., training and test data from different time periods or users) does not necessarily lead to a performance drop on the test set.
  • If we observe degraded test set performance, then there might be a distribution shift (either domain generalization or subpopulation shift). Here are two examples:

    • Time Shift in the Amazon Reviews Dataset: The model trained on 2000 – 2013 data performs similarly well (with a 1.1% difference in F1) to the model trained on 2014 – 2018 data when evaluated on a test set sampled from 2014 – 2018.
    • Time and User Shift in the Yelp Dataset: For the time shift, the setting is similar to Amazon Reviews; the authors observe a maximum difference of 3.1%. For the user shift, whether the data splits are disjoint in terms of users influences the scores very little.

Experiments

Here is a summary of the authors’ experiments. Note that Yelp is not part of the official benchmark because it shows no evidence of distribution shift.

| Index | Dataset | Shift | Existence |
| --- | --- | --- | --- |
| 1 | Amazon Reviews | Time | No |
| 2 | Amazon Reviews | Category | Maybe |
| 3 | CivilComments | Subpopulation | Yes |
| 4 | Yelp | Time | No |
| 5 | Yelp | User | No |

Amazon Reviews

The authors train a model on one category (“Single”) and on four categories (“Multiple”; “Multiple” is a superset of “Single”) and measure the test accuracy on the other 23 disjoint categories.

The authors find that (1) training with more categories modestly yet consistently improves the scores, (2) an OOD category (for example, “All Beauty”) could have an even higher score than the ID categories, and (3) they do not see strong evidence of domain shift because they cannot rule out other confounding factors. Note that the authors use the rather vague term “intrinsic difficulty” to gloss over what they cannot explain well.

While the accuracies on some unseen categories are lower than the train-to-train in-distribution accuracy, it is unclear whether the performance gaps stem from the distribution shift or differences in intrinsic difficulty across categories; in fact, the accuracy is higher on many unseen categories (e.g., All Beauty) than on the in-distribution categories, illustrating the importance of accounting for intrinsic difficulty.

To control for intrinsic difficulty, we ran a test-to-test comparison on each target category. We controlled for the number of training reviews to the extent possible; the standard model is trained on 1 million reviews in the official split, and each test-to-test model is trained on 1 million reviews or less, as limited by the number of reviews per category. We observed performance drops on some categories, for example on Clothing, Shoes, and Jewelry (83.0% in the test-to-test setting versus 75.2% in the official setting trained on the four different categories) and on Pet Supplies (78.8% to 76.8%). However, on the remaining categories, we observed more modest performance gaps, if at all. While we thus found no evidence for significance performance drops for many categories, these results do not rule out such drops either: one confounding factor is that some of the oracle models are trained on significantly smaller training sets and therefore underestimate the in-distribution performance.


The authors also control for training set size in the “Single” and “Multiple” settings. They show that training data covering more domains (i.e., with increased diversity) is beneficial for improving OOD accuracy.

CivilComments

Each sample in the dataset has a piece of text, 1 binary toxicity label, and 8 identity labels (each text could mention zero, one, or more identities). The authors use the 8 per-identity TPR and TNR values to measure performance (totaling 16 numbers; see the sketch below).
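Following the description above, here is a minimal sketch (with made-up inputs) of computing the 16 numbers, i.e., the TPR and TNR restricted to comments mentioning each identity; the exact aggregation in WILDS may differ.

import numpy as np

def per_identity_tpr_tnr(y_true, y_pred, identity_mask):
    """y_true, y_pred: binary toxicity arrays of shape (n,).
    identity_mask: boolean array of shape (n, 8); column j marks comments
    mentioning identity j. Returns two length-8 arrays of TPRs and TNRs."""
    tprs, tnrs = [], []
    for j in range(identity_mask.shape[1]):
        mentions = identity_mask[:, j]
        pos = mentions & (y_true == 1)
        neg = mentions & (y_true == 0)
        tprs.append((y_pred[pos] == 1).mean() if pos.any() else np.nan)
        tnrs.append((y_pred[neg] == 0).mean() if neg.any() else np.nan)
    return np.array(tprs), np.array(tnrs)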

The authors observe a subpopulation shift: despite 92.2% average accuracy, the worst of the 16 numbers is merely 57.4%. A comparison of 4 mitigation methods shows that (1) group DRO has the best performance, and (2) the reweighting baseline is quite strong, while the improved versions of reweighting (i.e., CORAL and IRM) are likely less useful.

In light of the effectiveness of the group DRO algorithm, the authors extend the number of groups to 2 ^ 9 = 512; the resulting performance does not improve.


Additional Notes

  • Deciding Type of Distribution Shift

    As long as there are no clearly disjoint train, validation, and test sets (as in the Amazon Reviews and Py150 datasets), there is no domain generalization issue; the presence of a few unseen users in the validation or test set should not be considered a domain generalization case.

  • Challenge Sets vs. Distribution Shifts

    The CheckList-style challenge sets, such as HANS, PAWS, CheckList, and counterfactually-augmented datasets like [5], are intentionally created to be different from the training set.

Reference

  1. [2201.00299] Improving Out-of-Distribution Robustness via Selective Augmentation (Yao et al.): This paper proposes the LISA method that performs best on the Amazon dataset according to the leaderboard.
  2. [2104.09937] Gradient Matching for Domain Generalization (Shi et al.): This paper proposes the FISH method that performs best on the CivilComments dataset on the leaderboard.
  3. Gender Shades: Intersectional Accuracy Disparities in Commercial Gender Classification (Buolamwini and Gebru, FAccT ’18)
  4. Racial Disparities in Automated Speech Recognition (Koenecke et al., PNAS)
  5. [2010.02114] Explaining The Efficacy of Counterfactually Augmented Data (Kaushik et al., ICLR ’20)
  6. [2004.14444] The Effect of Natural Distribution Shift on Question Answering Models (Miller et al.): This paper trains 100+ QA models and tests them across different domains.
  7. Selective Question Answering under Domain Shift (Kamath et al., ACL 2020): This paper creates a test set of mixture of ID and OOD domains.

Reading Notes | Distributionally Robust Neural Networks for Group Shifts – On the Importance of Regularization for Worst-Case Generalization

[Semantic Scholar] – [Code] – [Tweet] – [Video] – [Website] – [Slide] – [HuggingFace]

Change Logs:

  • 2023-09-06: First draft. This paper appears at ICLR ’20.
  • 2023-09-07: Add the “example” section for easy visualization.

Background

ERM and DRO

  • ERM

    ERM tries to minimize the empirical risk. Here \hat{P} denotes the empirical distribution that approximates the true underlying distribution P of the training data.

\hat{\theta} _ \mathrm{ERM} := \arg\min _ \theta \mathbb{E} _ {(x, y) \sim \hat{P}} \left[ \ell((x, y); \theta)\right]

  • DRO

    DRO tries to find \theta that minimizes the worst-group risk \hat{\mathcal{R}}(\theta); a minimal sketch of this objective follows the formula below. The practical form of DRO is called group DRO (i.e., gDRO). See the Application section for how the groups are defined.

\hat{\theta} _ \mathrm{DRO} := \arg\min _ \theta \left[ \hat{\mathcal{R}}(\theta) := \max _ {g \in \mathcal{G}} \mathbb{E} _ {(x, y) \sim \hat{P} _ g} \left[ \ell((x, y); \theta) \right] \right]
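The following is a minimal sketch of the worst-group empirical risk above, given per-example losses and precomputed group indices; note that the paper optimizes this objective with an online group-reweighting algorithm rather than by taking a plain maximum at every step.

import numpy as np

def worst_group_risk(losses, group_ids):
    """losses: per-example losses, shape (n,); group_ids: integer group of each example.
    Returns the empirical risk of the worst group, i.e., the inner max over g."""
    return max(losses[group_ids == g].mean() for g in np.unique(group_ids))

# Toy usage: group 1 has a much higher average loss and therefore defines the objective.
losses = np.array([0.2, 0.3, 1.5, 1.7])
group_ids = np.array([0, 0, 1, 1])
print(worst_group_risk(losses, group_ids))  # 1.6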

Example

To better visualize the strength of gDRO over ERM, we can look at a linear regression example; this example is taken from Stanford CS 221.

The objective of linear regression is the Mean Squared Error (MSE), \arg\min _ {\mathbf{w}} \sum _ i (\mathbf{w}^T\mathbf{x} _ i - y _ i) ^ 2. Fitting the entire dataset gives a much higher group A loss (21.26) than group B loss (0.31) even though the overall loss is 7.29; here group A is the first two points and group B is the remaining four (see the code below).

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Group A: the first two points; group B: the remaining four points.
x = np.array([1, 2, 5, 6, 7, 8])[:, np.newaxis]
y = np.array([4, 8, 5, 6, 7, 8])

# Fit a single line without an intercept on the pooled data (the ERM solution).
reg = LinearRegression(fit_intercept=False)
reg.fit(x, y)

print(mean_squared_error(reg.predict(x[:2]), y[:2]))  # group A loss, ~21.26
print(mean_squared_error(reg.predict(x[2:]), y[2:]))  # group B loss, ~0.31
print(mean_squared_error(reg.predict(x), y))          # overall loss, ~7.29

Note that a second plot in the CS 221 example shows how changing \mathbf{w} changes the loss on each group (yellow and blue) and on the aggregated data (red). We can see that optimizing the aggregated loss leads to a solution biased toward group B. However, if we optimize the pointwise maximum of the group losses (purple), we obtain a more balanced fit.


Application

  • Mitigating Spurious Correlation

    In order to train a classifier that is not affected by spurious correlations, we can partition the training dataset according to an attribute set \mathcal{A} chosen based on prior knowledge and then form groups using \mathcal{A} \times \mathcal{Y}. For example, [1] observes that negation spuriously correlates with the contradiction label. Therefore, one natural choice of \mathcal{A} is {“texts with negation words”, “texts without negation words”}; this leads to m = 2 \times 3 = 6 groups (see the sketch after this list).

  • Improving Training on Data Mixture

    Training a classifier on a mixture of datasets \cup _ {k=1}^K \mathcal{D} _ k with the same label space \mathcal{Y} gives us K \times \vert \mathcal{Y}\vert groups. This is a more natural application of DRO, as we have a well-defined \mathcal{A} that does not depend on prior knowledge.
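As a concrete illustration of the spurious-correlation case above, here is a minimal sketch that maps NLI examples to the m = 2 × 3 = 6 groups; the negation-word list is a made-up heuristic, not the one used in [1].

NEGATION_WORDS = {"not", "no", "never", "nobody", "nothing"}
LABELS = ["entailment", "neutral", "contradiction"]

def nli_group_id(premise, hypothesis, label):
    """Group = (has negation words?) x (label), giving indices 0..5."""
    tokens = (premise + " " + hypothesis).lower().split()
    attribute = int(any(token.strip(".,!?") in NEGATION_WORDS for token in tokens))
    return attribute * len(LABELS) + LABELS.index(label)

print(nli_group_id("A man is sleeping.", "The man is not awake.", "contradiction"))  # 5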

Method

For large discriminative models, neither ERM nor gDRO is able to attain a low worst-group test error due to a high worst-group generalization gap.

| Model | Method | Training Error | Worst-Group Test Error |
| --- | --- | --- | --- |
| Many models | ERM | Low | High |
| Small convex discriminative model or generative model | gDRO | Low | Low |
| Large discriminative model (e.g., ResNet or BERT) | gDRO | Low | High |

The authors propose to add simple regularization to gDRO to address the problem; they try \ell_2 regularization and early stopping. Even though these are frequently used techniques, the finding is a novel complement to the observations in the influential work [4]: regularization may be necessary to make gDRO work for large discriminative models.


Additional Note

  • Probability Simplex

    A probability simplex \Delta is a geometric representation of all probability distributions over n events. If there are n events, then \Delta is an (n-1)-dimensional convex set that includes all possible n-dimensional probability vectors \mathbf{p}; each such vector satisfies \mathbf{1}^T \mathbf{p} = 1 and has non-negative entries. The extreme points (vertices) of \Delta are the one-hot probability vectors.

    The visualization of a probability simplex depicting 3 events is a triangular plane determined by three extreme points (1, 0, 0), (0, 1, 0), (0,0, 1).

  • Measures of Robustness

    The paper uses the generalization on the worst-accuracy group as a proxy for robustness.

Reference

  1. Right for the Wrong Reasons: Diagnosing Syntactic Heuristics in Natural Language Inference (McCoy et al., ACL 2019): This paper identifies three shortcuts (called “heuristics” in the paper) that could be exploited by an NLI classifier: (1) lexical overlap, (2) subsequence, and (3) constituent. The authors also propose the famous HANS (Heuristic Analysis for NLI Systems) test set to diagnose the shortcut learning.

    Instead of using these cases to overrule the lexical overlap heuristic, a model might account for them by learning to assume that the label is contradiction whenever there is negation in the premise but not the hypothesis.

  2. Annotation Artifacts in Natural Language Inference Data (Gururangan et al., NAACL 2018): This paper shows that a significant portion of SNLI and MNLI test sets could be classified correctly without premises.
  3. [1806.08010] Fairness Without Demographics in Repeated Loss Minimization (Hashimoto et al.): The application of DRO in fair classification.
  4. [1611.03530] Understanding deep learning requires rethinking generalization (Zhang et al.; more than 5K citations): This paper makes two important observations and rules out the VC dimension and Rademacher complexity as possible explanations.

    • The neural network is able to attain zero training error through memorization for (1) a dataset with real images but random labels, and (2) a dataset of random noise and random labels. The test error remains near chance.
    • Given the previous observation, regularization may not help with generalization at all; it is neither a necessary nor a sufficient condition for generalization.
