Research Notes | Generalizable Hate Speech Detection

Overview

This post summarizes the following methods, which rank at the top of the CivilComments-WILDS leaderboard:

| Rank | Method | Paper |
| --- | --- | --- |
| 1 | FISH | [2104.09937] Gradient Matching for Domain Generalization (Shi et al., ICLR 2022) |
| 2, 3 | IRMX | [2206.07766] Pareto Invariant Risk Minimization: Towards Mitigating the Optimization Dilemma in Out-of-Distribution Generalization (Chen et al., ICLR 2023) |
| 4 | LISA | [2201.00299] Improving Out-of-Distribution Robustness via Selective Augmentation (Yao et al., ICML 2022) |
| 5 | DFR | [2204.02937] Last Layer Re-Training is Sufficient for Robustness to Spurious Correlations (Kirichenko et al., ICLR 2023) |
| 6, 8 | Group DRO | |
| 7, 12 | Reweighting | [1901.05555] Class-Balanced Loss Based on Effective Number of Samples (Cui et al., CVPR 2019) is one example that uses this method; reweighting dates back to much earlier work. |

Reweighting, IRM, and CORAL

IRM [2] and CORAL [3] are two extensions of the basic reweighting method: they add a penalty term on top of the reweighting loss. The penalty is based on a measure computed over the data representations from different domains and encourages the representation distributions of those domains to be similar.
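
To make the idea concrete, here is a minimal PyTorch sketch (not the WILDS reference implementation) of a CORAL-style penalty added on top of a reweighted cross-entropy loss; the function names, the `weights` tensor, and the pairwise-averaging scheme are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def coral_penalty(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    """Distance between the feature statistics (mean and covariance) of two domains."""
    mean_x, mean_y = x.mean(0, keepdim=True), y.mean(0, keepdim=True)
    cent_x, cent_y = x - mean_x, y - mean_y
    cov_x = cent_x.t() @ cent_x / (len(x) - 1)  # assumes >= 2 examples per domain in the batch
    cov_y = cent_y.t() @ cent_y / (len(y) - 1)
    return (mean_x - mean_y).pow(2).mean() + (cov_x - cov_y).pow(2).mean()

def reweighted_loss_with_coral(logits, labels, domains, features, weights, lam=1.0):
    """Reweighted cross-entropy plus a CORAL-style penalty averaged over domain pairs."""
    loss = (weights * F.cross_entropy(logits, labels, reduction="none")).mean()
    domain_ids = domains.unique()
    penalty, num_pairs = 0.0, 0
    for i in range(len(domain_ids)):
        for j in range(i + 1, len(domain_ids)):
            penalty = penalty + coral_penalty(features[domains == domain_ids[i]],
                                              features[domains == domain_ids[j]])
            num_pairs += 1
    return loss + lam * penalty / max(num_pairs, 1)
```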

Reference

  1. [2012.07421] WILDS: A Benchmark of in-the-Wild Distribution Shifts
  2. [1907.02893] Invariant Risk Minimization (Arjovsky et al.)
  3. [2007.01434] In Search of Lost Domain Generalization (Gulrajani and Lopez-Paz)

Reading Notes | Wild-Time – A Benchmark of in-the-Wild Distribution Shift over Time

[Semantic Scholar] – [Code] – [Tweet] – [Video] – [Website and Leaderboard] – [Slide] – [Lead Author]

Change Logs:

  • 2023-10-03: First draft. The authors provide 5 datasets (2 text classification datasets, 2 image classification datasets, and 1 EHR dataset) and more than 10 mitigation methods for distribution shift.

Experiments

  • The authors find that most of the mitigation methods are not effective compared to standard ERM on the proposed benchmark. Note that the SimCLR and SwAV methods are only applicable to image classification tasks.


Additional Notes

From the passage quoted below, we can make the following observations (listed after the quote):

To address this challenge, we adapt the above invariant learning approaches to the temporal distribution shift setting. We leverage timestamp metadata to create a temporal robustness set consisting of substreams of data, where each substream is treated as one domain. Specifically, as shown in Figure 3, we define a sliding window G with length L. For a data stream with T timestamps, we apply the sliding window G to obtain T − L + 1 substreams. We treat each substream as a “domain” and apply the above invariant algorithms on the robustness set. We name the adapted CORAL, GroupDRO and IRM as CORAL-T, GroupDRO-T, IRM-T, respectively. Note that we do not adapt LISA since the intra-label LISA performs well without domain information, which is also mentioned in the original paper.

  • The way the authors apply the group algorithms looks questionable: it does not make sense to create artificial domains by grouping data from consecutive timestamps (a minimal sketch of this substream construction appears after this list). This may be why the authors do not observe performance gains.
  • LISA, which is work by the same authors, seems to be a good approach, as it does not require domain labels while performing competitively.
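
As referenced above, here is a minimal sketch of the sliding-window substream construction, under the assumption that the data stream is simply a sequence of per-timestamp datasets (this is not the authors' actual data-loading code):

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")

def build_substreams(per_timestamp_data: Sequence[T], window_length: int) -> List[List[T]]:
    """Slide a window of length L over a stream with T timestamps, producing
    T - L + 1 substreams; each substream is then treated as one "domain"."""
    num_timestamps = len(per_timestamp_data)
    assert 1 <= window_length <= num_timestamps
    return [list(per_timestamp_data[start:start + window_length])
            for start in range(num_timestamps - window_length + 1)]

# Example: 5 timestamps and a window of length 3 yield 3 substreams (domains).
print(build_substreams(["t0", "t1", "t2", "t3", "t4"], window_length=3))
# [['t0', 't1', 't2'], ['t1', 't2', 't3'], ['t2', 't3', 't4']]
```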

Talk Notes | Lessons Learned from Analyzing Systems for Hate Speech Detection and Bias Mitigation by Sarah Masud

[YouTube] – [Personal Website]

  • The presenter has authored several interesting papers ([1] through [5]) on hate speech detection.

Notes

Status Quo of Hate Speech Detection

  • There are varying definitions of hate speech.
  • Labels related to hate speech include hate, offensive, toxic, and profane. There could also be more fine-grained categories, such as sexist, racist, and islamophobic.
  • Because of the reasons mentioned above, there is no leaderboard for hate speech detection.

Data Sources

We should pay attention to data bias: it is questionable to collect hate speech only from users and sites that are more likely to generate it. The authors propose collecting datasets from neutral sources; this design choice makes data annotation more difficult.

Annotations

Current approaches to hate speech annotation rely on humans (crowdworkers or experts). The authors use a two-phase approach to ensure label quality.

Building Better Hate Speech Detection Models

  • The complexity of models does not necessarily help. It is more important to capture the signals that predict the final labels, for example, user history and social network information. This observation also applies to other tasks that involve modeling social behaviors.
  • However, we should carefully monitor overfitting: spurious correlations between specific phrases and labels should not be among the signals we allow the models to pick up. That is, the models should generalize even without the presence of these words.
  • In [2], the authors propose a system that considers not just the text but also the user's timeline and social network information. They merge the three sources of signal using an attention mechanism. However, there are two limitations:
    • This design is specific to Twitter. Other platforms, such as Reddit, do not have this kind of user information.
    • The best performing system (M14) does not significantly outperform the baseline, which simply fine-tunes mBERT (M8).


Lexical Bias

  • Replacing bias-sensitive words with more general words is likely to shift the bias towards their WordNet ancestors. This hypothesis can be supported by a measurement called pinned bias, defined below, where t is a single word in the sensitive word list T.

$$
pB_T = \sum_{t \in T} \frac{\left\vert p(\text{``toxic''} \mid t) - \phi \right\vert}{\vert T \vert}, \quad \phi = \min\left(p(\text{``toxic''} \mid t),\ 0.5\right)
$$
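
A sketch of this measurement in code, assuming we already have a classifier that returns p("toxic" | t) for each sensitive word t; the `toxicity_probability` callable below is a hypothetical stand-in for such a model:

```python
from typing import Callable, Iterable

def pinned_bias(sensitive_words: Iterable[str],
                toxicity_probability: Callable[[str], float]) -> float:
    """pB_T = (1 / |T|) * sum_t | p("toxic" | t) - phi |, with phi = min(p("toxic" | t), 0.5)."""
    words = list(sensitive_words)
    total = 0.0
    for t in words:
        p = toxicity_probability(t)   # p("toxic" | t), e.g., from a fine-tuned classifier
        phi = min(p, 0.5)             # pin the reference point at 0.5
        total += abs(p - phi)
    return total / len(words)

# Example with a stubbed classifier: only probabilities above 0.5 contribute to the bias.
print(pinned_bias(["wordA", "wordB"], {"wordA": 0.9, "wordB": 0.3}.get))
# (0.4 + 0.0) / 2 = 0.2
```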

Horizons

The presenter has three high-level observations:

  • Like energy: Bias seems to transfer from one source to another.
  • Like a system at rest: A model or dataset will remain biased unless an external force (for example, mitigation or regularization) is applied.
  • Like interactive systems: A system evolves to become more chaotic over time. Toxicity needs to be monitored and mitigated continuously.

Reference

  1. [2010.04377] Hate is the New Infodemic: A Topic-aware Modeling of Hate Speech Diffusion on Twitter (Masud et al., ICDE 2021): This paper presents a dataset called RETINA that focuses on hate speech in the Indian context.
  2. [2206.04007] Proactively Reducing the Hate Intensity of Online Posts via Hate Speech Normalization (Masud et al., KDD 2022)
  3. [2201.00961] Nipping in the Bud: Detection, Diffusion and Mitigation of Hate Speech on Social Media (Chakraborty and Masud)
  4. [2306.01105] Revisiting Hate Speech Benchmarks: From Data Curation to System Deployment (Masud et al., KDD 2023)
  5. [2202.00126] Handling Bias in Toxic Speech Detection: A Survey (Garg et al., CSUR).
  6. Language (Technology) is Power: A Critical Survey of “Bias” in NLP (Blodgett et al., ACL 2020)
  7. [2305.06626] When the Majority is Wrong: Modeling Annotator Disagreement for Subjective Tasks (Fleisig et al.)
  8. Handling Disagreement in Hate Speech Modelling | SpringerLink (Novak et al., IPMU 2022)
  9. [2001.05495] Stereotypical Bias Removal for Hate Speech Detection Task using Knowledge-based Generalizations (Badjatiya et al., WWW 2019).

Research Notes | A Benchmark for Hate Speech Detection

Overview

There is no unified benchmark in the hate speech detection domain, such as GLUE, that provides a leaderboard-style performance comparison of different open-source hate speech classifiers. This prevents practitioners from making informed decisions when choosing a model for their own hate speech detection applications.

The benchmark will provide the following:

  • The entire training and validation sets for future study. However, the labels of the public test sets will not be released for benchmarking purposes; there will also be additional private test sets.
  • The ranking of the models based on the average aggregated metrics (for example, F1 score) on the public and private test sets.

Protocol

  • Step 1: Randomly select a test set and a validation set.

    The two datasets must be randomly selected for the following reasons:

    1. The distribution of the validation set will be similar to that of the test set. Using a randomly sampled validation set helps select the models that are more likely to perform well on the test set.
    2. This makes the two datasets independent from each other in terms of label distribution and source distribution. Throughout the experiments, the test and validation sets stay the same; this is helpful as we can see the (dis)advantages of each method in the wandb dashboard.
  • Step 2: Sample the training set using different (a) data selection methods.
  • Step 3: Train or fine-tune (b) different models with (c) different techniques for local improvements, for example, objective functions and regularization.
  • Step 4: Compare different combinations of (a), (b), and (c); see the sketch after this list. If we have m combinations and n test sets, then we will end up with a table of shape (m, n + 1), where the first column lists all the combinations.
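
As referenced in Step 4, here is a minimal sketch of the resulting comparison table, assuming the per-test-set metrics have already been computed; the combination names, test set names, and scores are placeholders:

```python
import pandas as pd

# Each row is one combination of (a) data selection, (b) model, and (c) technique;
# each of the n test-set columns holds an aggregated metric such as the F1 score.
results = {
    "random-selection / roberta-base / cross-entropy": [0.71, 0.64, 0.58],
    "random-selection / roberta-base / focal-loss":    [0.72, 0.66, 0.57],
}
test_sets = ["public-test-1", "public-test-2", "private-test"]

table = pd.DataFrame(
    [[combination, *scores] for combination, scores in results.items()],
    columns=["combination", *test_sets],
)  # shape (m, n + 1): the first column lists the m combinations
print(table)
```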

Candidate Datasets

Collected Datasets from Diverse Topics

The current data aggregation includes [1] through [5], where [5] only includes hate speech.

  1. Detecting East Asian Prejudice on Social Media (Vidgen et al., ALW 2020)
  2. [2005.12423] Racism is a Virus: Anti-Asian Hate and Counterspeech in Social Media during the COVID-19 Crisis (He et al.)
  3. [2108.12521] TweetBLM: A Hate Speech Dataset and Analysis of Black Lives Matter-related Microblogs on Twitter (Kumar et al.)
  4. Hate Towards the Political Opponent: A Twitter Corpus Study of the 2020 US Elections on the Basis of Offensive Speech and Stance Detection (Grimminger & Klinger, WASSA 2021)
  5. Latent Hatred: A Benchmark for Understanding Implicit Hate Speech (ElSherief et al., EMNLP 2021)

cardiffnlp/twitter-roberta-base-hate-latest Collection

The following are the datasets used for the model cardiffnlp/twitter-roberta-base-hate-latest and described in the paper below:

Robust Hate Speech Detection in Social Media: A Cross-Dataset Empirical Evaluation (Antypas & Camacho-Collados, WOAH 2023)

| Index | Dataset Name | Source | Notes |
| --- | --- | --- | --- |
| 1 | HatE | Link that requires filling in a Google form. | |
| 2 | MHS | ucberkeley-dlab/measuring-hate-speech | |
| 3 | DEAP | Zenodo | |
| 4 | CMS | Link that requires registration and email verification. | |
| 5 | Offense | Link; this dataset is also called OLID. | |
| 6 | HateX | hatexplain and GitHub | |
| 7 | LSC | GitHub | Dehydrated |
| 8 | MMHS | nedjmaou/MLMA_hate_speech and GitHub | |
| 9 | HASOC | Link that requires uploading a signed agreement; this agreement takes up to 15 days to approve. | Not available |
| 10 | AYR | GitHub | Dehydrated |
| 11 | AHSD | GitHub | |
| 12 | HTPO | Link | |
| 13 | HSHP | GitHub | Dehydrated |

The following are the papers that correspond to the list of datasets:

  1. SemEval-2019 Task 5: Multilingual Detection of Hate Speech Against Immigrants and Women in Twitter (Basile et al., SemEval 2019)
  2. The Measuring Hate Speech Corpus: Leveraging Rasch Measurement Theory for Data Perspectivism (Sachdeva et al., NLPerspectives 2022)
  3. Detecting East Asian Prejudice on Social Media (Vidgen et al., ALW 2020)
  4. [2004.12764] “Call me sexist, but…”: Revisiting Sexism Detection Using Psychological Scales and Adversarial Samples (Samory et al.)
  5. Predicting the Type and Target of Offensive Posts in Social Media (Zampieri et al., NAACL 2019)
  6. [2012.10289] HateXplain: A Benchmark Dataset for Explainable Hate Speech Detection (Mathew et al.)
  7. [1802.00393] Large Scale Crowdsourcing and Characterization of Twitter Abusive Behavior (Founta et al.)
  8. Multilingual and Multi-Aspect Hate Speech Analysis (Ousidhoum et al., EMNLP-IJCNLP 2019)
  9. [2108.05927] Overview of the HASOC track at FIRE 2020: Hate Speech and Offensive Content Identification in Indo-European Languages (Mandal et al.)
  10. Are You a Racist or Am I Seeing Things? Annotator Influence on Hate Speech Detection on Twitter (Waseem, NLP+CSS 2016)
  11. [1703.04009] Automated Hate Speech Detection and the Problem of Offensive Language (Davidson et al.)
  12. Hate Towards the Political Opponent: A Twitter Corpus Study of the 2020 US Elections on the Basis of Offensive Speech and Stance Detection (Grimminger & Klinger, WASSA 2021)
  13. Hateful Symbols or Hateful People? Predictive Features for Hate Speech Detection on Twitter (Waseem & Hovy, NAACL 2016)

It is possible to approximate a subset of the original training mixture (8 of the 12 datasets left after excluding the MMHS dataset, which only includes hate speech) by following Table 2 of the original paper. Some things to note:

  • AYR, HASOC, HSHP, and LSC are not usable.
  • Offense does not exactly match the sizes in Table 2.
  • We disregard any splits and try to match the numbers in Table 2. When matching the numbers is not possible, we try to make sure the ratio of non-hate versus hate stays the same; a minimal sketch of this heuristic follows the list.
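
A minimal sketch of the matching heuristic mentioned above, assuming each dataset is a pandas DataFrame with a binary `label` column (1 for hate); the column name and target sizes are assumptions, not part of the original paper:

```python
import pandas as pd

def subsample_to_target(df: pd.DataFrame, target_size: int, seed: int = 0) -> pd.DataFrame:
    """Approximate the per-dataset counts in Table 2; when the dataset is too small to
    match the target, keep everything so the non-hate vs. hate ratio is unchanged."""
    if len(df) <= target_size:
        return df
    # Otherwise sample each class proportionally so the non-hate vs. hate ratio is preserved.
    frac = target_size / len(df)
    sampled = [group.sample(frac=frac, random_state=seed)
               for _, group in df.groupby("label")]
    return pd.concat(sampled).sample(frac=1.0, random_state=seed)  # shuffle rows
```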

Additional Datasets from hatespeechdata.com

The following are additional datasets from hatespeechdata.com that are not included in the above-mentioned sources. The dataset names are either taken from the original papers or created here for easy reference.

| Index | Dataset Name | Source | Notes |
| --- | --- | --- | --- |
| 1 | AbuseEval | GitHub | The Offense dataset above reannotated for non-hate, implicit, and explicit hate; only IDs are available. Around 87% of the hate/non-hate labels are the same as in the previous Offense dataset. |
| 2 | SWAD | GitHub | |
| 3 | ALONE | | Not usable. Requires contacting the authors. |
| 4 | HatefulUsersTwitter | GitHub and Kaggle | Available but not relevant. This dataset is about detecting whether a user is hateful or neutral on the Twitter network; it does not come with annotated hateful/benign texts. |
| 5 | MMHS150K | Website | Not usable. Multimodal dataset. |
| 6 | HarassmentLexicon | GitHub | Not usable. Lexicons only. |
| 7 | P2PHate | GitHub | Not usable. Dehydrated. |
| 8 | Golbeck | | Not usable. Requires contacting jgolbeck@umd.edu. |
| 9 | SurgeAI | Website | Hateful content only. |
| 10 | TSA | Kaggle | The dataset is provided by Analytics Vidhya. The test.csv does not come with labels. |

  1. I Feel Offended, Don’t Be Abusive! Implicit/Explicit Messages in Offensive and Abusive Language (Caselli et al., LREC 2020): The dataset from this paper is also called AbuseEval v1.0.
  2. Do You Really Want to Hurt Me? Predicting Abusive Swearing in Social Media (Pamungkas et al., LREC 2020)
  3. [2008.06465] ALONE: A Dataset for Toxic Behavior among Adolescents on Twitter (Wijesiriwardene et al.)
  4. [1803.08977] Characterizing and Detecting Hateful Users on Twitter (Ribeiro et al., ICWSM 2018)
  5. [1910.03814] Exploring Hate Speech Detection in Multimodal Publications (Gomez et al., WACV 2020)
  6. [1802.09416] A Quality Type-aware Annotated Corpus and Lexicon for Harassment Research (Rezvan et al.)
  7. [1804.04649] Peer to Peer Hate: Hate Speech Instigators and Their Targets (ElSherief et al.)
  8. A Large Labeled Corpus for Online Harassment Research (Golbeck et al., WebSci 2017)
  9. Twitter Hate Speech Dataset (Surge AI)
  10. Twitter Sentiment Analysis (Kaggle)

Reading Notes | WILDS – A Benchmark of in-the-Wild Distribution Shifts

[Semantic Scholar] – [Code] – [Tweet] – [Video] – [Website] – [Slide] – [Leaderboard]

Change Logs:

  • 2023-09-06: First draft. The paper provides a standardized package for many domain generalization algorithms, including group DRO, DANN, and CORAL.

Background

Distribution shifts happen when the test conditions are “newer” or “smaller” compared to the training conditions. The paper defines these two cases as follows:

  • Newer: Domain Generalization

    The test distribution is related to but distinct from the training distributions (i.e., new or unseen during training). Note that the test conditions are not necessarily a superset of the training conditions; the “newer” case is not simply the “larger” counterpart of the “smaller” case described below.

    Here are two typical examples of domain generalization described in the paper:

    • Training a model on patient information from some hospitals and expecting it to generalize to many more hospitals; these hospitals may or may not be a superset of the hospitals the training data comes from.
    • Training an animal recognition model on images taken by existing cameras and expecting it to work on images taken by newer cameras.
  • Smaller: Subpopulation Shift

    The test distribution is a subpopulation of the training distributions, for example, degraded facial recognition accuracy on underrepresented demographic groups ([3] and [4]).

Evaluation

The goal of OOD generalization is to train a model on data sampled from the training distribution $P^\text{train}$ that performs well on the test distribution $P^\text{test}$. Note that since we cannot assume data from the two distributions are equally difficult to learn, the ideal case is to train at least two models (or, even better, three) and take 2 (or 3) measurements:

| Index | Goal | Training Data | Testing Data |
| --- | --- | --- | --- |
| 1 | Measuring OOD Generalization | $D^\text{train} \sim P^\text{train}$ | $D^\text{test} \sim P^\text{test}$ |
| 2 | Ruling Out the Confounding Factor of Distribution Difficulty | $D^\text{test}_\text{heldout} \sim P^\text{test}$ | $D^\text{test} \sim P^\text{test}$ |
| 3 (Optional) | Sanity Check | $D^\text{train} \sim P^\text{train}$ | $D^\text{train}_\text{heldout} \sim P^\text{train}$ |

However, the generally small test sets make measurement 2 hard or even impossible: we cannot find an additional held-out set $D^\text{test}_\text{heldout}$ that matches the size of $D^\text{train}$ to train a model on.

The authors therefore define 4 relaxed settings:

| Index | Setting | Training Data | Testing Data |
| --- | --- | --- | --- |
| 1 | Mixed-to-Test | Mixture of $P^\text{train}$ and $P^\text{test}$ | $P^\text{test}$ |
| 2 | Train-to-Train (aka. measurement 3 above) | $P^\text{train}$ | $P^\text{train}$ |
| 3 | Average. This is a special case of 2; it is only suitable for subpopulation shift, such as Amazon Reviews and CivilComments. | Average Performance | Worst-Group Performance |
| 4 | Random Split. This setting destroys $P^\text{test}$. | $\tilde{D}^\text{train} := \mathrm{Sample}(D^\text{train} \cup D^\text{test})$ | $(D^\text{train} \cup D^\text{test}) \setminus \tilde{D}^\text{train}$ |
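
As a concrete illustration of the last row, here is a minimal sketch (plain Python lists rather than the WILDS API) of the random-split setting, which pools and re-splits the data and therefore destroys $P^\text{test}$:

```python
import random

def random_split_setting(d_train, d_test, seed=0):
    """Random Split: pool D_train and D_test, re-sample a training set of the original
    training size, and test on the complement; this destroys P_test."""
    pool = list(d_train) + list(d_test)
    rng = random.Random(seed)
    rng.shuffle(pool)
    new_train = pool[:len(d_train)]   # \tilde{D}^train := Sample(D^train ∪ D^test)
    new_test = pool[len(d_train):]    # (D^train ∪ D^test) \ \tilde{D}^train
    return new_train, new_test
```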


Dataset

The benchmark includes regular and medical image, graph, and text datasets; 3 of the 10 are text datasets, among which the less familiar Py150 is a code completion dataset. Note that the authors fail to cleanly define why there are subpopulation shifts in the Amazon Reviews and Py150 datasets, as they acknowledge below:

However, it is not always possible to cleanly define a problem as one or the other; for example, a test domain might be present in the training set but at a very low frequency.

For the Amazon Reviews dataset, one viable explanation for the subpopulation shift is the uneven distribution of reviews on the same product across the train, validation, and test sets.

| Name | Domain Generalization | Subpopulation Shift | Notes |
| --- | --- | --- | --- |
| CivilComments | No; this is because the demographic information of the writers is unknown; if such information were known, we could also create a version with domain generalization concerns. | Yes; this is because the mentions of 8 target demographic groups are available. | The only dataset with only subpopulation shift. |
| Amazon Reviews | Yes; due to disjoint users in the train, OOD validation, and OOD test sets; there are also an ID validation set and an ID test set from the same users as the training set. | Yes | |
| Py150 | Yes; due to disjoint repositories in the train, OOD validation, and OOD test sets; there are also an ID validation set and an ID test set from the same repositories as the training set. | Yes | |

Importantly, the authors treat a performance drop as the necessary evidence of a distribution shift. That is:

  • The presence of a distribution shift does not necessarily lead to a performance drop on the test set. Here are two examples:

    • Time Shift in the Amazon Reviews Dataset: The model trained on 2000 – 2013 data performs similarly well (with a 1.1% difference in F1) as the model trained on 2014 – 2018 data on a test set sampled from 2014 – 2018.
    • Time and User Shift in the Yelp Dataset: For the time shift, the setting is similar to Amazon Reviews; the authors observe a maximum difference of 3.1%. For the user shift, whether the data splits are disjoint in terms of users influences the scores very little.
  • Conversely, if we observe degraded test set performance, then there might be a distribution shift (either domain generalization or subpopulation shift).

Experiments

Here is a summary of the authors’ experiments. Note that Yelp is not part of the official benchmark because it shows no evidence of distribution shift.

| Index | Dataset | Shift | Existence |
| --- | --- | --- | --- |
| 1 | Amazon Reviews | Time | No |
| 2 | Amazon Reviews | Category | Maybe |
| 3 | CivilComments | Subpopulation | Yes |
| 4 | Yelp | Time | No |
| 5 | Yelp | User | No |

Amazon Reviews

The authors train a model on one category (“Single”) and on four categories (“Multiple”, a superset of “Single”) and measure the test accuracy on the other 23 disjoint categories.

The authors find that (1) training with more categories modestly yet consistently improves the scores, (2) an OOD category (for example, “All Beauty”) could have an even higher score than the ID categories, and (3) they do not see strong evidence of domain shift because they could not rule out other confounding factors. Note that the authors here use the rather vague term “intrinsic difficulty” to gloss over something they could not explain well.

While the accuracies on some unseen categories are lower than the train-to-train in-distribution accuracy, it is unclear whether the performance gaps stem from the distribution shift or differences in intrinsic difficulty across categories; in fact, the accuracy is higher on many unseen categories (e.g., All Beauty) than on the in-distribution categories, illustrating the importance of accounting for intrinsic difficulty.

To control for intrinsic difficulty, we ran a test-to-test comparison on each target category. We controlled for the number of training reviews to the extent possible; the standard model is trained on 1 million reviews in the official split, and each test-to-test model is trained on 1 million reviews or less, as limited by the number of reviews per category. We observed performance drops on some categories, for example on Clothing, Shoes, and Jewelry (83.0% in the test-to-test setting versus 75.2% in the official setting trained on the four different categories) and on Pet Supplies (78.8% to 76.8%). However, on the remaining categories, we observed more modest performance gaps, if at all. While we thus found no evidence for significance performance drops for many categories, these results do not rule out such drops either: one confounding factor is that some of the oracle models are trained on significantly smaller training sets and therefore underestimate the in-distribution performance.


The authors also keep the training set size consistent between the “Single” and “Multiple” settings. They show that training data with more domains (i.e., increased diversity) is beneficial for improving OOD accuracy.

CivilComments

Each sample in the dataset has a piece of text, 1 binary toxicity label, and 8 identity labels (each text could mention zero, one, or more identities). The authors use the TPR and TNR for each of the 8 identities to measure performance (totaling 16 numbers).
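
A minimal sketch of this evaluation (not the WILDS evaluation code): compute TPR and TNR over the comments mentioning each identity and then take the worst of the 16 numbers; `identity_mentions` is an assumed dictionary of boolean masks.

```python
import numpy as np

def worst_group_tpr_tnr(y_true, y_pred, identity_mentions):
    """Compute TPR and TNR over the comments mentioning each of the 8 identities
    (8 x 2 = 16 numbers) and return the scores together with the worst one."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    scores = {}
    for identity, mask in identity_mentions.items():  # mask[i]: comment i mentions the identity
        mask = np.asarray(mask, dtype=bool)
        toxic, benign = mask & (y_true == 1), mask & (y_true == 0)
        if toxic.any():
            scores[f"TPR/{identity}"] = float((y_pred[toxic] == 1).mean())
        if benign.any():
            scores[f"TNR/{identity}"] = float((y_pred[benign] == 0).mean())
    return scores, min(scores.values())
```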

The authors observe subpopulation shifts: despite a 92.2% average accuracy, the worst of the 16 numbers is merely 57.4%. A comparison of 4 mitigation methods shows that (1) group DRO has the best performance, and (2) the reweighting baseline is quite strong, while the improved versions of reweighting (i.e., CORAL and IRM) are likely less useful.

In light of the effectiveness of the group DRO algorithm, the authors extend the number of groups to $2^9 = 512$; the resulting performance does not improve.
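
The 512 groups presumably correspond to all combinations of the 8 binary identity indicators and the binary toxicity label ($2^8 \times 2 = 512$); the encoding below is an assumed, illustrative way to index such groups rather than the paper's actual implementation:

```python
def group_id(identity_bits, toxic_label):
    """Map 8 binary identity indicators plus the binary toxicity label to one of
    2^9 = 512 group ids (an assumed but natural encoding)."""
    assert len(identity_bits) == 8
    gid = int(toxic_label)
    for bit in identity_bits:
        gid = (gid << 1) | int(bool(bit))
    return gid  # in [0, 511]

print(group_id([1, 0, 0, 0, 0, 0, 0, 1], toxic_label=1))  # 385
```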


Additional Notes

  • Deciding Type of Distribution Shift

    If there are no clearly disjoint train, validation, and test sets (as there are in the Amazon Reviews and Py150 datasets), then there is no domain generalization issue; the presence of a few unseen users in the validation or test set should not be considered a domain generalization case.

  • Challenge Sets vs. Distribution Shifts

    The CheckList-style challenge sets, such as HANS, PAWS, CheckList, and counterfactually augmented datasets like [5], are intentionally created to be different from the training set.

Reference

  1. [2201.00299] Improving Out-of-Distribution Robustness via Selective Augmentation (Yao et al.): This paper proposes the LISA method that performs best on the Amazon dataset according to the leaderboard.
  2. [2104.09937] Gradient Matching for Domain Generalization (Shi et al.): This paper proposes the FISH method that performs best on the CivilComments dataset on the leaderboard.
  3. Gender Shades: Intersectional Accuracy Disparities in Commercial Gender Classification (Buolamwini and Gebru, FAccT ’18)
  4. Racial Disparities in Automated Speech Recognition (Koenecke et al., PNAS)
  5. [2010.02114] Explaining The Efficacy of Counterfactually Augmented Data (Kaushik et al., ICLR ’20)
  6. [2004.14444] The Effect of Natural Distribution Shift on Question Answering Models (Miller et al.): This paper trains 100+ QA models and tests them across different domains.
  7. Selective Question Answering under Domain Shift (Kamath et al., ACL 2020): This paper creates a test set of mixture of ID and OOD domains.