Reading Notes | Improving Generalization of Hate Speech Detection Systems to Novel Target Groups via Domain Adaptation

[Semantic Scholar] – [Code] – [Tweet] – [Video] – [Website] – [Slide]

Change Logs:

  • 2023-09-11: First draft. This paper appears at WOAH ’22.

The paper studies generalization to new hate target groups on the single HateXplain dataset; the authors do so by comparing four existing methods: (1) Unsupervised Domain Adaptation (UDA; this method is also used in paper [1]), (2) MixUp regularization, (3) curriculum labeling, and (4) DANN.

The paper also considers the back-translation approach (specifically the (en, fr), (en, de), and (en, es) language pairs) for data augmentation.
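A minimal sketch of back-translation-based augmentation for one language pair, assuming the HuggingFace MarianMT checkpoints below; the paper does not specify which translation models it uses, so these choices are illustrative only.

from transformers import MarianMTModel, MarianTokenizer

def translate(texts, model_name):
    # Translate a batch of sentences with a pretrained MarianMT model.
    tokenizer = MarianTokenizer.from_pretrained(model_name)
    model = MarianMTModel.from_pretrained(model_name)
    batch = tokenizer(texts, return_tensors="pt", padding=True, truncation=True)
    generated = model.generate(**batch)
    return tokenizer.batch_decode(generated, skip_special_tokens=True)

def back_translate_en_fr(texts):
    # English -> French -> English round trip yields paraphrase-like augmentations.
    french = translate(texts, "Helsinki-NLP/opus-mt-en-fr")
    return translate(french, "Helsinki-NLP/opus-mt-fr-en")

print(back_translate_en_fr(["The model should generalize to unseen target groups."]))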

Experiments

  • Zero: Directly apply a model trained on \mathcal{D}_A to a new domain \mathcal{D}_B.
  • Zero+: Augmenting \mathcal{D}_A using back-translation.
  • ZeroB+: Applying back-translation-based data augmentation while making sure that each batch is class-balanced.

Reference

  1. Unsupervised Domain Adaptation in Cross-corpora Abusive Language Detection (Bose et al., SocialNLP 2021): This paper considers the setting of training on dataset \mathcal{D}_A and testing on another dataset \mathcal{D}_B, where A and B are drawn from HateEval, Waseem, and Davidson, resulting in 6 ordered pairs. They use several existing methods to improve the test scores on \mathcal{D}_B.
  2. [2012.10289] HateXplain: A Benchmark Dataset for Explainable Hate Speech Detection (Mathew et al., AAAI 2021): This used to be the only dataset that provides the target groups of both hateful and non-hateful content.
  3. Data augmentation could happen in the symbol space (via rules, word replacement through BERT, or text-generation models) or in the feature space. However, the main paper chooses back translation for data augmentation.

    Here are two libraries on data augmentation in NLP:

Reading Notes | Directions in Abusive Language Training Data – Garbage In, Garbage Out

[Semantic Scholar]- [Code] – [Tweet] – [Video] – [Website] – [Slide]

Change Logs:

  • 2023-09-06: First draft. This paper provides the influential hate speech dataset hub hatespeechdata.com even though it appears in PLoS ONE.

This paper provides a survey of existing (as of 2020) hate speech datasets and some suggestions for creating future hate speech datasets.

Research Notes | A Benchmark for Hate Speech Detection

Overview

There does not exist a unified benchmark, such as GLUE, in the hate speech detection domain that conducts a leaderboard-style performance comparison of different open-source hate speech classifiers. This prevents practitioners from making informed decisions when choosing which model to use for their own hate speech detection applications.

The benchmark will provide the following:

  • The entire training and validation set for future study. However, the labels from public test sets will not be released for benchmarking purposes; there will be additional private test sets.
  • The ranking of the models based on the average aggregated metrics (for example, F1 score) on the public and private test sets.

Protocol

  • Step 1: Randomly select a test set and a validation set.

    The two datasets must be randomly selected for the following reasons:

    1. The distribution of the validation set will be similar to that of the test set. Using a randomly sampled validation set helps select the models that are more likely to perform well on the test set.
    2. This makes the two datasets independent of each other in terms of label distribution and source distribution. Throughout the experiments, the test and validation sets stay the same; this is helpful because we can see the (dis)advantages of each method in the wandb dashboard.
  • Step 2: Sampling the training set using different (a) data selection methods.
  • Step 3: Training or fine-tuning (b) different models with (c) different techniques for local improvements, for example, the objective function and regularization.
  • Step 4: Comparing different combinations of (a), (b), and (c). If we have m combinations and n test sets, then we will end up with a table of shape (m, n+1), where the first column lists all the combinations (see the sketch below).
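As a concrete illustration of the comparison table, a minimal sketch with made-up combination names and placeholder scores; pandas is an arbitrary choice for assembling the table.

import pandas as pd

# m = 3 made-up combinations of (a) data selection, (b) model, and (c) technique; n = 2 test sets.
combinations = ["random+BERT+CE", "cluster+BERT+focal", "random+RoBERTa+CE"]
test_sets = ["public_test", "private_test"]
scores = [[0.71, 0.65], [0.74, 0.69], [0.73, 0.66]]  # placeholder F1 scores

table = pd.DataFrame(scores, columns=test_sets)
table.insert(0, "combination", combinations)  # the first column lists all combinations
print(table)  # shape (m, n + 1)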

Candidate Datasets

Collected Datasets from Diverse Topics

The current data aggregation includes [1] through [5], where [5] only includes hate speech.

  1. Detecting East Asian Prejudice on Social Media (Vidgen et al., ALW 2020)
  2. [2005.12423] Racism is a Virus: Anti-Asian Hate and Counterspeech in Social Media during the COVID-19 Crisis (He et al.)
  3. [2108.12521] TweetBLM: A Hate Speech Dataset and Analysis of Black Lives Matter-related Microblogs on Twitter (Kumar et al.)
  4. Hate Towards the Political Opponent: A Twitter Corpus Study of the 2020 US Elections on the Basis of Offensive Speech and Stance Detection (Grimminger & Klinger, WASSA 2021)
  5. Latent Hatred: A Benchmark for Understanding Implicit Hate Speech (ElSherief et al., EMNLP 2021)

cardiffnlp/twitter-roberta-base-hate-latest Collection

The following are the datasets used for the model cardiffnlp/twitter-roberta-base-hate-latest or the paper below:

Robust Hate Speech Detection in Social Media: A Cross-Dataset Empirical Evaluation (Antypas & Camacho-Collados, WOAH 2023)

| Index | Dataset Name | Source | Notes |
|---|---|---|---|
| 1 | HatE | Link | Requires filling in a Google form. |
| 2 | MHS | ucberkeley-dlab/measuring-hate-speech | |
| 3 | DEAP | Zenodo | |
| 4 | CMS | Link | Requires registration and email verification. |
| 5 | Offense | Link | This dataset is also called OLID. |
| 6 | HateX | hatexplain and GitHub | |
| 7 | LSC | GitHub | Dehydrated. |
| 8 | MMHS | nedjmaou/MLMA_hate_speech and GitHub | |
| 9 | HASOC | Link | Requires uploading a signed agreement; this agreement takes up to 15 days to approve. Not available. |
| 10 | AYR | GitHub | Dehydrated. |
| 11 | AHSD | GitHub | |
| 12 | HTPO | Link | |
| 13 | HSHP | GitHub | Dehydrated. |

The following are the papers that correspond to the list of datasets:

  1. SemEval-2019 Task 5: Multilingual Detection of Hate Speech Against Immigrants and Women in Twitter (Basile et al., SemEval 2019)
  2. The Measuring Hate Speech Corpus: Leveraging Rasch Measurement Theory for Data Perspectivism (Sachdeva et al., NLPerspectives 2022)
  3. Detecting East Asian Prejudice on Social Media (Vidgen et al., ALW 2020)
  4. [2004.12764] “Call me sexist, but…”: Revisiting Sexism Detection Using Psychological Scales and Adversarial Samples (Samory et al.)
  5. Predicting the Type and Target of Offensive Posts in Social Media (Zampieri et al., NAACL 2019)
  6. [2012.10289] HateXplain: A Benchmark Dataset for Explainable Hate Speech Detection (Mathew et al.)
  7. [1802.00393] Large Scale Crowdsourcing and Characterization of Twitter Abusive Behavior (Founta et al.)
  8. Multilingual and Multi-Aspect Hate Speech Analysis (Ousidhoum et al., EMNLP-IJCNLP 2019)
  9. [2108.05927] Overview of the HASOC track at FIRE 2020: Hate Speech and Offensive Content Identification in Indo-European Languages (Mandal et al.)
  10. Are You a Racist or Am I Seeing Things? Annotator Influence on Hate Speech Detection on Twitter (Waseem, NLP+CSS 2016)
  11. [1703.04009] Automated Hate Speech Detection and the Problem of Offensive Language (Davidson et al.)
  12. Hate Towards the Political Opponent: A Twitter Corpus Study of the 2020 US Elections on the Basis of Offensive Speech and Stance Detection (Grimminger & Klinger, WASSA 2021)
  13. Hateful Symbols or Hateful People? Predictive Features for Hate Speech Detection on Twitter (Waseem & Hovy, NAACL 2016)

It is possible to approximate a subset of the original training mixture (8 of the 12 datasets remaining after excluding the MMHS dataset, which only includes hate speech) following Table 2 of the original paper. Some things to note:

  • AYR, HASOC, HSHP, and LSC are not usable.
  • Offense does not exactly match the sizes in Table 2.
  • We disregard any predefined splits and try to match the numbers in Table 2. When matching the numbers is not possible, we try to make sure the ratio of non-hate versus hate is the same.

Additional Datasets from hatespeechdata.com

The following are the additional datasets from hatespeechdata.com that are not included in the above-mentioned sources. The dataset names are either taken from the original papers or created here for easy reference.

| Index | Dataset Name | Source | Notes |
|---|---|---|---|
| 1 | AbuseEval | GitHub | The Offense dataset above, reannotated for non-hate, implicit hate, and explicit hate; only IDs are available. Around 87% of the hate/non-hate labels are the same as in the original Offense dataset. |
| 2 | SWAD | GitHub | |
| 3 | ALONE | | Not usable. Requires contacting the authors. |
| 4 | HatefulUsersTwitter | GitHub and Kaggle | Available but not relevant. This dataset is about detecting whether a user is hateful or neutral on the Twitter network; it does not come with annotated hateful/benign texts. |
| 5 | MMHS150K | Website | Not usable. Multimodal dataset. |
| 6 | HarassmentLexicon | GitHub | Not usable. Lexicons only. |
| 7 | P2PHate | GitHub | Not usable. Dehydrated. |
| 8 | Golbeck | | Not usable. Requires contacting jgolbeck@umd.edu. |
| 9 | SurgeAI | Website | Hateful content only. |
| 10 | TSA | Kaggle | The dataset is provided by Analytics Vidhya. The test.csv does not come with labels. |
  1. I Feel Offended, Don’t Be Abusive! Implicit/Explicit Messages in Offensive and Abusive Language (Caselli et al., LREC 2020): The dataset from this paper is also called AbuseEval v1.0.
  2. Do You Really Want to Hurt Me? Predicting Abusive Swearing in Social Media (Pamungkas et al., LREC 2020)
  3. [2008.06465] ALONE: A Dataset for Toxic Behavior among Adolescents on Twitter (Wijesiriwardene et al.)
  4. [1803.08977] Characterizing and Detecting Hateful Users on Twitter (Ribeiro et al., ICWSM 2018)
  5. [1910.03814] Exploring Hate Speech Detection in Multimodal Publications (Gomez et al., WACV 2020)
  6. [1802.09416] A Quality Type-aware Annotated Corpus and Lexicon for Harassment Research (Rezvan et al.)
  7. [1804.04649] Peer to Peer Hate: Hate Speech Instigators and Their Targets (ElSherief et al.)
  8. A Large Labeled Corpus for Online Harassment Research (Golbeck et al., WebSci 2017)
  9. Twitter Hate Speech Dataset (Surge AI)
  10. Twitter Sentiment Analysis (Kaggle)

Talk Notes | Paraphrasing Evades Detectors of AI-generated Text, But Retrieval is an Effective Defense by Kalpesh Krishna @ Google

[YouTube] – [Personal Website]

  • The presenter is the author of multiple influential papers on topics such as paraphrasing, the detection of AI-generated text, and model extraction attacks.

Reference

  1. Reformulating Unsupervised Style Transfer as Paraphrase Generation (Krishna et al., EMNLP 2020)
  2. [1910.12366] Thieves on Sesame Street! Model Extraction of BERT-based APIs (Krishna et al., ICLR ’20)

Reading Notes | WILDS – A Benchmark of in-the-Wild Distribution Shifts

[Semantic Scholar] – [Code] – [Tweet] – [Video] – [Website] – [Slide] – [Leaderboard]

Change Logs:

  • 2023-09-06: First draft. The paper provides a standardized package for many domain generalization algorithms, including group DRO, DANN, and CORAL.

Background

Distribution shifts happen when test conditions are “newer” or “smaller” compared to training conditions. The paper defines them as

  • Newer: Domain Generalization

    The test distribution is related to but distinct from the training distributions (i.e., new or unseen during training). Note that the test conditions are not necessarily a superset of the training conditions; they are not necessarily “larger,” in contrast to the “smaller” subpopulation-shift case described below.

    Here are two typical examples of domain generalization described in the paper:

    • Training a model on patient information from some hospitals and expecting the model to generalize to many more hospitals; these hospitals may or may not be a superset of the hospitals we collected training data from.
    • Training an animal recognition model on images taken by some existing cameras and expecting the model to work on images taken by newer cameras.
  • Smaller: Subpopulation Shift

    The test distribution is a subpopulation of the training distributions. For example, degraded facial recognition or speech recognition accuracy on underrepresented demographic groups ([3] and [4]).

Evaluation

The goal of OOD generalization is to train a model on data sampled from the training distribution P^\text{train} that performs well on the test distribution P^\text{test}. Note that since we cannot assume the data from the two distributions are equally difficult to learn, the ideal case is to train at least two models (or, even better, three) and take two (or three) measurements:

| Index | Goal | Training Data | Testing Data |
|---|---|---|---|
| 1 | Measuring OOD Generalization | $D ^ \text{train} \sim P ^ \text{train}$ | $D ^ \text{test} \sim P ^ \text{test}$ |
| 2 | Ruling Out the Confounding Factor of Distribution Difficulty | $D ^ \text{test} _ \text{heldout} \sim P ^ \text{test}$ | $D ^ \text{test} \sim P ^ \text{test}$ |
| 3 | (Optional) Sanity Check | $D ^ \text{train} \sim P ^ \text{train}$ | $D ^ \text{train} _ \text{heldout} \sim P ^ \text{train}$ |

However, the generally small test sets make measurement 2 hard or even impossible: we cannot find an additional held-out set D ^ \text{test} _ \text{heldout} that matches the size of D ^ \text{train} to train a model on.

The authors therefore define 4 relaxed settings:

| Index | Setting | Training Data | Testing Data |
|---|---|---|---|
| 1 | Mixed-to-Test | Mixture of $P ^ \text{train}$ and $P ^ \text{test}$ | $P ^ \text{test}$ |
| 2 | Train-to-Train (aka. setting 3 above) | $P ^ \text{train}$ | $P ^ \text{train}$ |
| 3 | Average (a special case of 2; only suitable for subpopulation shift, such as Amazon Reviews and CivilComments) | Average Performance | Worst-Group Performance |
| 4 | Random Split (this setting destroys $P ^ \text{test}$) | $\tilde{D} ^ \text{train} := \mathrm{Sample}(D ^ \text{train} \cup D ^ \text{test})$ | $(D ^ \text{train} \cup D ^ \text{test}) \backslash \tilde{D} ^ \text{train}$ |


Dataset

The benchmark includes regular and medical image, graph, and text datasets; 3 out of 10 are text datasets, among which the less familiar Py150 is a code completion dataset. Note that the authors do not cleanly define why there are subpopulation shifts for the Amazon Reviews and Py150 datasets, as they acknowledge below:

However, it is not always possible to cleanly define a problem as one or the other; for example, a test domain might be present in the training set but at a very low frequency.

For the Amazon Reviews dataset, one viable explanation for the subpopulation shift is the uneven distribution of reviews of the same product across the train, validation, and test sets.

| Name | Domain Generalization | Subpopulation Shift | Notes |
|---|---|---|---|
| CivilComments | No; the demographic information of the writers is unknown. If such information were known, we could also create a version with domain generalization concerns. | Yes; the mentions of 8 target demographic groups are available. | The only dataset with only subpopulation shift. |
| Amazon Reviews | Yes; due to disjoint users in the train, OOD validation, and OOD test sets. There are also ID validation and ID test sets from the same users as the training set. | Yes | |
| Py150 | Yes; due to disjoint repositories in the train, OOD validation, and OOD test sets. There are also ID validation and ID test sets from the same repositories as the training set. | Yes | |

Importantly, the authors note that a distribution shift does not necessarily cause a performance drop, whereas a performance drop suggests a possible distribution shift. That is,

  • The presence of a distribution shift does not necessarily lead to a performance drop on the test set.
  • If we observe degraded test set performance, then there might be a distribution shift (either domain generalization or subpopulation shift). Here are two examples:

    • Time Shift in the Amazon Reviews Dataset: The model trained on the 2000 – 2013 data performs similarly well (within a 1.1% difference in F1) to the model trained on the 2014 – 2018 data when evaluated on the test set sampled from 2014 – 2018.
    • Time and User Shift in the Yelp Dataset: For the time shift, the setting is similar to Amazon Reviews; the authors observe at most a 3.1% difference. For the user shift, whether the data splits are disjoint in terms of users influences the scores very little.

Experiments

Here is a summary of the authors’ experiments. Note that Yelp is not part of the official benchmark because it shows no evidence of distribution shift.

| Index | Dataset | Shift | Existence |
|---|---|---|---|
| 1 | Amazon Reviews | Time | No |
| 2 | Amazon Reviews | Category | Maybe |
| 3 | CivilComments | Subpopulation | Yes |
| 4 | Yelp | Time | No |
| 5 | Yelp | User | No |

Amazon Reviews

The authors train a model on one category (“Single”) and on four categories (“Multiple”, a superset of “Single”) and measure the test accuracy on the other 23 disjoint categories.

The authors find that (1) training with more categories modestly yet consistently improves the scores, (2) an OOD category (for example, “All Beauty”) could have an even higher score than the ID categories, and (3) the authors do not see strong evidence of domain shift, as they could not rule out other confounding factors. Note that the authors here use the rather vague term “intrinsic difficulty” to gloss over something they could not explain well.

While the accuracies on some unseen categories are lower than the train-to-train in-distribution accuracy, it is unclear whether the performance gaps stem from the distribution shift or differences in intrinsic difficulty across categories; in fact, the accuracy is higher on many unseen categories (e.g., All Beauty) than on the in-distribution categories, illustrating the importance of accounting for intrinsic difficulty.

To control for intrinsic difficulty, we ran a test-to-test comparison on each target category. We controlled for the number of training reviews to the extent possible; the standard model is trained on 1 million reviews in the official split, and each test-to-test model is trained on 1 million reviews or less, as limited by the number of reviews per category. We observed performance drops on some categories, for example on Clothing, Shoes, and Jewelry (83.0% in the test-to-test setting versus 75.2% in the official setting trained on the four different categories) and on Pet Supplies (78.8% to 76.8%). However, on the remaining categories, we observed more modest performance gaps, if at all. While we thus found no evidence for significance performance drops for many categories, these results do not rule out such drops either: one confounding factor is that some of the oracle models are trained on significantly smaller training sets and therefore underestimate the in-distribution performance.


The authors also keep the training set size consistent for the “Single” and “Multiple” settings. They show that training data covering more domains (with increased diversity) is beneficial for improving OOD accuracies.

CivilComments

Each sample in the dataset has a piece of text, a binary toxicity label, and 8 identity labels (each text could mention zero, one, or more identities). The authors use the TPR and TNR values of the 8 identities to measure performance (16 numbers in total).

The authors observe subpopulation shifts: despite a 92.2% average accuracy, the worst of the 16 numbers is merely 57.4%. A comparison of 4 mitigation methods shows that (1) group DRO has the best performance, and (2) the reweighting baseline is quite strong, while the improved versions of reweighting (i.e., CORAL and IRM) are likely less useful.

In light of the effectiveness of the group DRO algorithm, the authors extend the number of groups to 2^9 = 512; the resulting performance does not improve.


Additional Notes

  • Deciding Type of Distribution Shift

    Unless there are clearly disjoint train, validation, and test sets, as in the Amazon Reviews and Py150 datasets, there is no domain generalization issue; the presence of a few unseen users in the validation or test set should not be considered a domain generalization case.

  • Challenge Sets vs. Distribution Shifts

    The CheckList-style challenge sets, such as HANS, PAWS, and CheckList, and counterfactually augmented datasets like [5], are intentionally created to be different from the training set.

Reference

  1. [2201.00299] Improving Out-of-Distribution Robustness via Selective Augmentation (Yao et al.): This paper proposes the LISA method that performs best on the Amazon dataset according to the leaderboard.
  2. [2104.09937] Gradient Matching for Domain Generalization (Shi et al.): This paper proposes the FISH method that performs best on the CivilComments dataset on the leaderboard.
  3. Gender Shades: Intersectional Accuracy Disparities in Commercial Gender Classification (Buolamwini and Gebru, FAccT ’18)
  4. Racial Disparities in Automated Speech Recognition (Koenecke et al., PNAS)
  5. [2010.02114] Explaining The Efficacy of Counterfactually Augmented Data (Kaushik et al., ICLR ’20)
  6. [2004.14444] The Effect of Natural Distribution Shift on Question Answering Models (Miller et al.): This paper trains 100+ QA models and tests them across different domains.
  7. Selective Question Answering under Domain Shift (Kamath et al., ACL 2020): This paper creates a test set of mixture of ID and OOD domains.

Reading Notes | Distributionally Robust Neural Networks for Group Shifts – On the Importance of Regularization for Worst-Case Generalization

[Semantic Scholar] – [Code] – [Tweet] – [Video] – [Website] – [Slide] – [HuggingFace]

Change Logs:

  • 2023-09-06: First draft. This paper appears at ICLR ’20.
  • 2023-09-07: Add the “example” section for easy visualization.

Background

ERM and DRO

  • ERM

    ERM tries to minimize the empirical risk. Here \hat{P} denotes the empirical distribution of the true underlying distribution P of training data.

\hat{\theta} _ \mathrm{ERM} := \arg\min _ \theta \mathbb{E} _ {(x, y) \sim \hat{P}} \left[ \ell((x, y); \theta)\right]

  • DRO

    DRO tries to find \theta that minimizes the worst-group risk \hat{\mathcal{R}}(\theta). The practical form of DRO is called group DRO (i.e., gDRO). See the application section on how the groups are defined.

\hat{\theta} _ \mathrm{DRO} := \arg\min _ \theta \left[ \hat{\mathcal{R}}(\theta):=\max _ {g \in \mathcal{G}}\mathbb{E} _ {(x, y) \sim \hat{P} _ g} \left[ \ell((x, y); \theta) \right] \right]
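A minimal PyTorch sketch of the worst-group loss in the gDRO objective above, under the simplifying assumption that the maximum is taken over the per-group average losses of a single batch (the paper's practical algorithm updates group weights online); the names group_ids and num_groups are illustrative.

import torch
import torch.nn.functional as F

def worst_group_loss(logits, labels, group_ids, num_groups):
    # Per-example cross-entropy losses, i.e., \ell((x, y); \theta).
    losses = F.cross_entropy(logits, labels, reduction="none")
    group_losses = []
    for g in range(num_groups):
        mask = group_ids == g  # group_ids: LongTensor of shape (batch_size,)
        if mask.any():
            # Average loss of group g, i.e., the expectation over \hat{P}_g.
            group_losses.append(losses[mask].mean())
    # gDRO minimizes the largest per-group loss, i.e., the max over g in G.
    return torch.stack(group_losses).max()

Minimizing the returned value with a standard optimizer, instead of the batch-average loss used by ERM, gives a simple batch-level approximation of gDRO.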

Example

To better visualize the strength of gDRO over ERM, we can look at a linear regression example; this example is taken from Stanford CS 221.

The objective of linear regression is the Mean Squared Error (MSE), \arg\min _ {\mathbf{w}} \frac{1}{n} \sum _ {i=1}^{n} (\mathbf{w}^T\mathbf{x} _ i - y _ i) ^ 2. Fitting the entire dataset gives a much higher group A loss (i.e., 21.26; group A is the first two points in the code below) than group B loss (i.e., 0.31; group B is the remaining four points), even though the total loss is 7.29.

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Group A: the first two points; group B: the remaining four.
x = np.array([1, 2, 5, 6, 7, 8])[:, np.newaxis]
y = np.array([4, 8, 5, 6, 7, 8])

# Fit a single line on the entire dataset (the ERM solution).
reg = LinearRegression(fit_intercept=False)
reg.fit(x, y)

print(mean_squared_error(reg.predict(x[:2]), y[:2]))  # group A loss: ~21.26
print(mean_squared_error(reg.predict(x[2:]), y[2:]))  # group B loss: ~0.31
print(mean_squared_error(reg.predict(x), y))          # total loss: ~7.29

Note that the second plot shows how changing \mathbf{w} leads to differences in the loss over each group (yellow or blue) and over the aggregated data (red). We can see that optimizing the aggregated loss leads to a solution that is biased toward group B. However, if we optimize the pointwise maximum of the group losses (purple), we obtain a more balanced solution.
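A minimal sketch (not from the course material) of directly optimizing the pointwise maximum of the two group losses on the same toy data; scipy.optimize.minimize_scalar is an arbitrary choice of optimizer.

import numpy as np
from scipy.optimize import minimize_scalar

x = np.array([1, 2, 5, 6, 7, 8], dtype=float)
y = np.array([4, 8, 5, 6, 7, 8], dtype=float)

def group_mse(w, xs, ys):
    # MSE of the line y = w * x on one group.
    return float(np.mean((w * xs - ys) ** 2))

# Minimize the worst-group (pointwise maximum) loss instead of the aggregated loss.
result = minimize_scalar(lambda w: max(group_mse(w, x[:2], y[:2]), group_mse(w, x[2:], y[2:])))
w_dro = result.x
print(w_dro, group_mse(w_dro, x[:2], y[:2]), group_mse(w_dro, x[2:], y[2:]))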


Application

  • Mitigating Spurious Correlation

    In order to train a classifier that is not affected by spurious correlations, we can partition the training dataset into groups based on an attribute set \mathcal{A} derived from some prior knowledge and then form the groups using \mathcal{A} \times \mathcal{Y}. For example, the paper [1] observes that negation spuriously correlates with the contradiction label. Therefore, one natural choice of \mathcal{A} is “texts with negation words” and “texts without negation words”; this leads to m = 2 \times 3 = 6 groups (see the sketch after this list).

  • Improving Training on Data Mixture

    Training a classifier using a mixture of datasets \cup _ {k=1}^K \mathcal{D} _ k with the same label space \mathcal{Y} gives us K \times \vert \mathcal{Y}\vert groups. This is a more natural application of DRO, as we have a well-defined \mathcal{A} that does not depend on prior knowledge.
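A toy sketch of forming group IDs from \mathcal{A} \times \mathcal{Y}, following the NLI example above (2 negation attributes times 3 NLI labels = 6 groups); all names are illustrative.

NUM_LABELS = 3  # entailment, neutral, contradiction

def group_id(has_negation: bool, label: int) -> int:
    # Groups are indexed as attribute * |Y| + label, giving m = 2 * 3 = 6 groups.
    return int(has_negation) * NUM_LABELS + label

print(group_id(False, 0))  # group 0
print(group_id(True, 2))   # group 5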

Method

For large discriminative models, neither ERM nor gDRO is able to attain a low worst-group test error due to a high worst-group generalization gap.

| Model | Method | Training Error | Worst-Group Test Error |
|---|---|---|---|
| Many Models | ERM | Low | High |
| Small Convex Discriminative Model or Generative Model | gDRO | Low | Low |
| Large Discriminative Model (e.g., ResNet or BERT) | gDRO | Low | High |

The authors propose adding simple regularization to gDRO to address the problem; they try \ell_2 regularization and early stopping. Even though these are frequently used techniques, the finding is a novel complement to the observations in the influential work [4]: regularization may be necessary to make gDRO generalize for large discriminative models.


Additional Note

  • Probability Simplex

    A probability simplex \Delta is a geometric representation of all probability distributions over n events. If there are n events, then \Delta is an (n-1)-dimensional convex set that includes all possible n-dimensional probability vectors \mathbf{p}; each \mathbf{p} satisfies \mathbf{1}^T \mathbf{p}=1 and has non-negative entries. The vertices of \Delta are the extreme one-hot probability vectors.

    The visualization of a probability simplex depicting 3 events is a triangular plane determined by three extreme points (1, 0, 0), (0, 1, 0), (0,0, 1).

  • Measures of Robustness

The paper uses the generalization on the worst-accuracy group as a proxy for robustness.

Reference

  1. Right for the Wrong Reasons: Diagnosing Syntactic Heuristics in Natural Language Inference (McCoy et al., ACL 2019): This paper identifies three shortcuts (called “heuristics” in the paper) that could be exploited by an NLI classifier: (1) lexical overlap, (2) subsequence, and (3) constituent. The authors also propose the famous HANS (Heuristic Analysis for NLI Systems) test set to diagnose the shortcut learning.

    Instead of using these cases to overrule the lexical overlap heuristic, a model might account for them by learning to assume that the label is contradiction whenever there is negation in the premise but not the hypothesis.

  2. Annotation Artifacts in Natural Language Inference Data (Gururangan et al., NAACL 2018): This paper shows that a significant portion of SNLI and MNLI test sets could be classified correctly without premises.
  3. [1806.08010] Fairness Without Demographics in Repeated Loss Minimization (Hashimoto et al.): The application of DRO in fair classification.
  4. [1611.03530] Understanding deep learning requires rethinking generalization (Zhang et al.; more than 5K citations): This paper makes two important observations and rules out VC dimension and Rademacher complexity as possible explanations.

    • The neural network is able to attain zero training error, through memorization, for (1) a dataset with real images but random labels, and (2) a dataset of random noise and random labels. The test error is still near chance.
    • Because of the observation above, regularization may not help with generalization at all; it is neither a necessary nor a sufficient condition for generalization.


Reading Notes | Using GPT-4 for Content Moderation

Method

This blog post illustrates an idea of human-AI collaboration in revising an existing content policy. Specifically,

  • Based on an initial policy P_0, a human expert may disagree with a moderation decision made by GPT-4.
  • The human expert elicits suggestions from GPT-4 to revise the policy P_0 into P_1; this process repeats until the human expert agrees with GPT-4’s decisions.

The blog post does not clearly explain how either step is done. For example, (1) what prompt is used to turn the general-purpose GPT-4 into a content moderator, (2) what prompt is used to ask GPT-4 for feedback, and (3) how the human experts turn GPT-4’s feedback into concrete policy revisions. A rough sketch of how such a loop might look is given below.
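The following is a speculative sketch (not from the blog post) of what the two steps could look like with the openai Python client; the prompts and helper names are made up for illustration.

from openai import OpenAI

client = OpenAI()

def chat(prompt: str) -> str:
    # A thin wrapper around the chat completion endpoint.
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

def moderate(policy: str, content: str) -> str:
    # Step 1: ask GPT-4 for a moderation decision under the current policy.
    return chat(
        f"You are a content moderator. Policy:\n{policy}\n\n"
        f"Decide whether the following content violates the policy and explain why.\n"
        f"Content: {content}"
    )

def suggest_revision(policy: str, content: str, expert_decision: str, model_decision: str) -> str:
    # Step 2: ask GPT-4 to propose a policy revision that resolves the disagreement.
    return chat(
        f"Policy:\n{policy}\n\nContent: {content}\n"
        f"Your decision: {model_decision}\nExpert decision: {expert_decision}\n"
        f"Propose a revised policy under which your decision would match the expert's."
    )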

Reading Notes | Robust Hate Speech Detection in Social Media – A Cross-Dataset Empirical Evaluation

[Semantic Scholar] – [Code] – [Tweet] – [Video] – [Website] – [Slide] – [HuggingFace]

Change Logs:

  • 2023-09-04: First draft. This paper appears at WOAH ’23. The provided models on HuggingFace have more than 40K downloads thanks to their easy-to-use tweetnlp package; the best-performing binary and multi-class classification models are cardiffnlp/twitter-roberta-base-hate-latest and cardiffnlp/twitter-roberta-base-hate-multiclass-latest respectively.

Method

Datasets

The authors manually select and unify 13 hate speech datasets for the binary and multi-class classification settings. The authors do not provide a rationale for why they chose these 13 datasets.

For the multi-class classification setting, the authors devise 7 classes: racism, sexism, disability, sexual orientation, religion, other, and non-hate. This set of classes is similar to yet smaller than that of the MHS dataset, which includes gender, race, sexuality, religion, origin, politics, age, and disability (see [1]).

For all 13 datasets, the authors apply a 7:1:2 data split; they also create a small external test set (i.e., Indep). With the test sets kept untouched, the authors consider 3 ways of preparing the training data:

  1. Training on the single dataset.
  2. Training on an aggregation of 13 datasets.
  3. Training on a dataset sampled from the aggregation in 2. Specifically, the authors (1) find the dataset size that leads to the highest score in setting 1, and (2) sample from the aggregation proportionally to each of the 13 datasets’ sizes while setting the ratio of hate versus non-hate to exactly 1:1 (see the sketch after this list).
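A rough sketch of preparation method 3 under assumptions not stated in the paper (the exact sampling procedure is unknown); datasets maps each dataset name to its lists of hate and non-hate examples, and budget is the target total size.

import random

def sample_mixture(datasets, budget, seed=0):
    # datasets: {name: (hate_examples, non_hate_examples)}
    rng = random.Random(seed)
    total = sum(len(h) + len(n) for h, n in datasets.values())
    mixture = []
    for name, (hate, non_hate) in datasets.items():
        share = (len(hate) + len(non_hate)) / total  # proportional to dataset size
        per_class = int(budget * share / 2)          # enforce a 1:1 hate vs. non-hate ratio
        mixture += rng.sample(hate, min(per_class, len(hate)))
        mixture += rng.sample(non_hate, min(per_class, len(non_hate)))
    rng.shuffle(mixture)
    return mixture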

The processed datasets are not provided by the authors. We need to follow the guides below to obtain them; the index of the datasets is kept consistent with the HuggingFace model hub and the main paper’s Table 1.

| Index | Dataset Name | Source | Notes |
|---|---|---|---|
| 1 | HatE | Link | Requires filling in a Google form. |
| 2 | MHS | ucberkeley-dlab/measuring-hate-speech | |
| 3 | DEAP | Zenodo | |
| 4 | CMS | Link | Requires registration and email verification. |
| 5 | Offense | Link | This dataset is also called OLID. |
| 6 | HateX | hatexplain and GitHub | |
| 7 | LSC | GitHub | Dehydrated. |
| 8 | MMHS | nedjmaou/MLMA_hate_speech and GitHub | |
| 9 | HASOC | Link | Requires uploading a signed agreement; this agreement takes up to 15 days to approve. Not available. |
| 10 | AYR | GitHub | Dehydrated. |
| 11 | AHSD | GitHub | |
| 12 | HTPO | Link | |
| 13 | HSHP | GitHub | Dehydrated. |

The following are the papers that correspond to the list of datasets:

  1. SemEval-2019 Task 5: Multilingual Detection of Hate Speech Against Immigrants and Women in Twitter (Basile et al., SemEval 2019)
  2. The Measuring Hate Speech Corpus: Leveraging Rasch Measurement Theory for Data Perspectivism (Sachdeva et al., NLPerspectives 2022)
  3. Detecting East Asian Prejudice on Social Media (Vidgen et al., ALW 2020)
  4. [2004.12764] “Call me sexist, but…”: Revisiting Sexism Detection Using Psychological Scales and Adversarial Samples (Samory et al.)
  5. Predicting the Type and Target of Offensive Posts in Social Media (Zampieri et al., NAACL 2019)
  6. [2012.10289] HateXplain: A Benchmark Dataset for Explainable Hate Speech Detection (Mathew et al.)
  7. [1802.00393] Large Scale Crowdsourcing and Characterization of Twitter Abusive Behavior (Founta et al.)
  8. Multilingual and Multi-Aspect Hate Speech Analysis (Ousidhoum et al., EMNLP-IJCNLP 2019)
  9. [2108.05927] Overview of the HASOC track at FIRE 2020: Hate Speech and Offensive Content Identification in Indo-European Languages (Mandal et al.)
  10. Are You a Racist or Am I Seeing Things? Annotator Influence on Hate Speech Detection on Twitter (Waseem, NLP+CSS 2016)
  11. [1703.04009] Automated Hate Speech Detection and the Problem of Offensive Language (Davidson et al.)
  12. Hate Towards the Political Opponent: A Twitter Corpus Study of the 2020 US Elections on the Basis of Offensive Speech and Stance Detection (Grimminger & Klinger, WASSA 2021)
  13. Hateful Symbols or Hateful People? Predictive Features for Hate Speech Detection on Twitter (Waseem & Hovy, NAACL 2016)

Despite the availability of the sources, it is quite hard to reproduce the original dataset because (1) many of the datasets in the table do not come with predefined splits; the only datasets that are available and have such splits are HatE, HateX, and HTPO, (2) how the authors unify the datasets (for example, deriving binary labels from the potentially complicated provided labels) is unknown, and (3) how the authors preprocess the texts is also unknown.

It is better to find a model whose checkpoints and exact training datasets are both available; one such example is the Alpaca language model.

Models and Fine-Tuning

The authors start from bert-base-uncased, roberta-base, and two models specifically customized to Twitter (see [2], [3]). The authors carry out HPO on the learning rate, warmup rate, number of epochs, and batch size using hyperopt.
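A minimal sketch of what the HPO loop could look like with hyperopt; the search ranges and the train_and_evaluate helper are made up, as the paper does not report them.

from hyperopt import fmin, hp, tpe

def train_and_evaluate(params):
    # Placeholder objective: in practice, fine-tune the model with `params` and
    # return 1 - macro-F1 on the validation set (hyperopt minimizes this value).
    return 1.0 - 0.5  # dummy value so the sketch runs end to end

space = {
    "learning_rate": hp.loguniform("learning_rate", -12, -9),  # roughly 6e-6 to 1.2e-4
    "warmup_ratio": hp.uniform("warmup_ratio", 0.0, 0.2),
    "num_epochs": hp.choice("num_epochs", [2, 3, 4, 5]),
    "batch_size": hp.choice("batch_size", [16, 32]),
}

best = fmin(fn=train_and_evaluate, space=space, algo=tpe.suggest, max_evals=20)
print(best)  # note: hp.choice entries are reported as indices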

Experiments

  • The data preparation method 3 (All*) performs better than method 1 (MHS, AYR, etc.). It also achieves the highest scores on the Indep test set (Table 3).


Other Information

  • Language identification tasks could be done with fastText models (doc); see the sketch below.
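A minimal sketch, assuming the pretrained lid.176.bin language-identification model has already been downloaded from the fastText website:

import fasttext

# Load the pretrained language-identification model (downloaded separately).
model = fasttext.load_model("lid.176.bin")

labels, probabilities = model.predict("This is an English tweet.")
print(labels, probabilities)  # e.g., ('__label__en',) with the associated probability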

Comments

  • Ill-Defined Data Collection Goal

    We can read sentences like the following from the paper:

    • For example both CMS and AYR datasets deal with sexism but the models trained only on CMS perform poorly when evaluated on AYR (e.g. BERTweetCSM achieves 87% F1 on CSM, but only 52% on AYR).
    • This may be due to the scope of the dataset, dealing with East Asian Prejudice during the COVID-19 pandemic, which is probably not well captured in the rest of the datasets.

    The issue is that there is no quantitative measure of the underlying theme of a dataset (for example, CMS or AYR). The dataset curators may have some general ideas of what the dataset should be about; they often do not have a clearly defined measure to quantify how much one sample aligns with their data collection goals.

    I wish to see some quantitative measures on topics and distributions of an NLP dataset.

Reference

  1. Targeted Identity Group Prediction in Hate Speech Corpora (Sachdeva et al., WOAH 2022)
  2. BERTweet: A pre-trained language model for English Tweets (Nguyen et al., EMNLP 2020)
  3. TimeLMs: Diachronic Language Models from Twitter (Loureiro et al., ACL 2022): This paper also comes from Cardiff NLP. It considers the time axis of the language modeling through continual learning. It tries to achieve OOD generalization (in terms of time) without degrading the performance on the static benchmark.

Reading Notes | DoReMi – Optimizing Data Mixtures Speeds Up Language Model Pretraining

Overview

Other Information

  • The ratios of domains should be counted using the number of tokens rather than the number of documents, even though different tokenizers may return slightly different ratios (see the sketch below).
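A minimal sketch of computing token-based domain ratios; the tokenizer choice and the toy corpus are illustrative only.

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")

# Toy corpus: domain name -> list of documents.
corpus = {
    "web": ["some web text ...", "more web text ..."],
    "code": ["def f(x): return x"],
}

token_counts = {
    domain: sum(len(tokenizer.encode(doc)) for doc in docs)
    for domain, docs in corpus.items()
}
total = sum(token_counts.values())
ratios = {domain: count / total for domain, count in token_counts.items()}
print(ratios)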

Reference

  1. [2110.10372] Distributionally Robust Classifiers in Sentiment Analysis (Stanford Course Project Report).
  2. Distributionally Robust Finetuning BERT for Covariate Drift in Spoken Language Understanding (Broscheit et al., ACL 2022): This paper is one of the few papers I could find that apply DRO to an NLP model; the problem the authors address here is mitigating the spurious correlation (or improving the robustness) of a cascade of text and token classification models.

    The standard ERM (aka. MLE) assumes a single distribution and therefore treats all losses as equally important. However, DRO tries to minimize the maximum (i.e., the worst case) over a set of distributions; this set of distributions is modeled with prior knowledge.

  3. [1810.08750] Learning Models with Uniform Performance via Distributionally Robust Optimization
  4. Distributionally Robust Language Modeling (Oren et al., EMNLP-IJCNLP 2019): The main paper extensively cites this paper. The goal of this paper is to train a language model on a dataset mixture of K sources \cup _ {i=1}^K\mathcal{D} _ i without degrading the performance on each domain’s test set; it is a practical application of [3] to language modeling.

    This setting may be useful because (1) each \mathcal{D} _ i may not be large enough to train the model, and (2) the authors observe that training on the data mixture degrades the performance on each domain’s test set compared to training on the smaller single-domain dataset.

  5. [1911.08731] Distributionally Robust Neural Networks for Group Shifts: On the Importance of Regularization for Worst-Case Generalization (ICLR ’20; 1K citations). This paper fine-tunes BERT using DRO on the MNLI dataset; the paper also experiments on the image datasets.