Reading Notes | ToxiGen – A Large-Scale Machine-Generated Dataset for Adversarial and Implicit Hate Speech Detection

[Semantic Scholar] – [Code] – [Tweet] – [Video] – [Website] – [Slide]

Change Log

  • 2023-12-05: First draft.

Overview

The authors propose a method to automatically generate a balanced dataset (13 identity groups, both toxic and benign) of 14K (using ALICE) + 260K (using demonstrations) = 274K samples that contain no explicit slurs, based on the following two observations:

  • It is hard to collect hard toxic content to augment the training sets of machine learning models, as overtly toxic content often co-occurs with a small set of explicit words.
  • Furthermore, explicit mentions of some identity groups (for example, Muslim) or their language styles (for example, African-American English) are unfairly classified as toxic by existing models.

Method

ALICE

The authors incorporate a binary hate speech classifier’s score on the “hate” or “non-hate” class into the decoding process to encourage more hateful or more non-hateful generation given a prompt.

Without intervention, a hateful prompt leads to a hateful continuation. With the classifier in the loop, the continuation’s hatefulness is mitigated yet not reversed, yielding implicit hate speech (i.e., hard toxic content).
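A minimal sketch of the classifier-in-the-loop idea: at each decoding step, the LM’s top-k candidate tokens are rescored with a frozen toxicity classifier. GPT-2, the unitary/toxic-bert checkpoint, greedy search, and the label indexing are all stand-in assumptions; the actual ALICE setup in the paper uses GPT-3 with beam search.

```python
import torch
from transformers import (AutoModelForCausalLM,
                          AutoModelForSequenceClassification, AutoTokenizer)

lm_name, clf_name = "gpt2", "unitary/toxic-bert"  # stand-ins, not the paper's models
lm_tok = AutoTokenizer.from_pretrained(lm_name)
lm = AutoModelForCausalLM.from_pretrained(lm_name)
clf_tok = AutoTokenizer.from_pretrained(clf_name)
clf = AutoModelForSequenceClassification.from_pretrained(clf_name)

def guided_generate(prompt, steps=30, top_k=20, lam=2.0, toward_toxic=False):
    ids = lm_tok(prompt, return_tensors="pt").input_ids
    for _ in range(steps):
        with torch.no_grad():
            lm_logprobs = lm(ids).logits[0, -1].log_softmax(-1)
        candidates = lm_logprobs.topk(top_k).indices  # rescore only the LM's top-k tokens
        scores = []
        for tok in candidates:
            text = lm_tok.decode(torch.cat([ids[0], tok.unsqueeze(0)]))
            with torch.no_grad():
                logits = clf(**clf_tok(text, return_tensors="pt", truncation=True)).logits
            p_toxic = logits.sigmoid()[0, 0]  # assumption: first label of toxic-bert is "toxic"
            target = p_toxic if toward_toxic else 1.0 - p_toxic
            # combine LM likelihood with the classifier's preference for the target class
            scores.append(lm_logprobs[tok] + lam * torch.log(target + 1e-8))
        next_tok = candidates[torch.stack(scores).argmax()]
        ids = torch.cat([ids, next_tok.view(1, 1)], dim=1)
    return lm_tok.decode(ids[0])
```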

Demonstration

The other method the authors propose is manually collecting implicit hate speech from the web and then using these examples as demonstrations to prompt GPT-3 for more text. This effort led to 260K samples.
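A rough sketch of demonstration-based generation, assuming a simple list-style prompt; the prompt format, example texts, and model name are placeholders (the original work used GPT-3, which is no longer served).

```python
# Demonstration-based (few-shot) generation sketch; examples and model are placeholders.
from openai import OpenAI

client = OpenAI()
demonstrations = [
    "placeholder implicit statement about group X ...",
    "another placeholder implicit statement ...",
]
prompt = "\n- ".join(["Write more statements like the following:"] + demonstrations) + "\n-"
resp = client.completions.create(model="gpt-3.5-turbo-instruct",
                                 prompt=prompt, max_tokens=60, n=5)
generated = [choice.text.strip() for choice in resp.choices]
```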

Experiments

  • Data Augmentation with ToxiGen Improves Accuracy on OOD Test Sets

    The authors further fine-tune HateBERT and ToxDectRoBERTa using the collected dataset and test them on social_bias_frames, SALT-NLP/ImplicitHate, and aps/dynahate. The authors observe improved accuracy after fine-tuning.


Research Notes | Generalizable Hate Speech Detection

Overview

This post summarizes the following methods; they rank at the top of the CivilComments-WILDS benchmark:

| Rank | Method | Paper |
| --- | --- | --- |
| 1 | FISH | [2104.09937] Gradient Matching for Domain Generalization (Shi et al., ICLR 2022) |
| 2, 3 | IRMX | [2206.07766] Pareto Invariant Risk Minimization: Towards Mitigating the Optimization Dilemma in Out-of-Distribution Generalization (Chen et al., ICLR 2023) |
| 4 | LISA | [2201.00299] Improving Out-of-Distribution Robustness via Selective Augmentation (Yao et al., ICML 2022) |
| 5 | DFR | [2204.02937] Last Layer Re-Training is Sufficient for Robustness to Spurious Correlations (Kirichenko et al., ICLR 2023) |
| 6, 8 | Group DRO | |
| 7, 12 | Reweighting | [1901.05555] Class-Balanced Loss Based on Effective Number of Samples (Cui et al., CVPR 2019) is one example that uses this method; reweighting dates back to much earlier works. |

Reweighting, IRM, and CORAL

IRM [2] and CORAL [3] extend the basic reweighting method by adding a penalty term on top of the reweighted loss; the term is computed from measures of the data representations from different domains and encourages the representations across domains to be similar.
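A hedged PyTorch sketch of the general idea: a reweighted cross-entropy loss plus a penalty that aligns feature statistics across two domains, in the spirit of CORAL (the exact penalties in [2] and [3] differ; this is illustrative only).

```python
import torch
import torch.nn.functional as F

def coral_penalty(feat_a: torch.Tensor, feat_b: torch.Tensor) -> torch.Tensor:
    # Distance between the feature covariance matrices of two domains.
    def cov(x):
        x = x - x.mean(dim=0, keepdim=True)
        return x.T @ x / (x.shape[0] - 1)
    return ((cov(feat_a) - cov(feat_b)) ** 2).mean()

def domain_loss(logits_a, y_a, feat_a, logits_b, y_b, feat_b,
                w_a=1.0, w_b=1.0, lam=1.0):
    # Reweighted per-domain losses plus the alignment penalty.
    reweighted = w_a * F.cross_entropy(logits_a, y_a) + w_b * F.cross_entropy(logits_b, y_b)
    return reweighted + lam * coral_penalty(feat_a, feat_b)
```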

Reference

  1. [2012.07421] WILDS: A Benchmark of in-the-Wild Distribution Shifts
  2. [1907.02893] Invariant Risk Minimization (Arjovsky et al.)
  3. [2007.01434] In Search of Lost Domain Generalization (Gulrajani and Lopez-Paz)

Talk Notes | Lessons Learned from Analyzing Systems for Hate Speech Detection and Bias Mitigation by Sarah Masud

[YouTube] – [Personal Website]

  • The presenter has authored several interesting papers ([1] through [5]) on hate speech detection.

Notes

Status Quo of Hate Speech Detection

  • There are varying definitions of hate speech.
  • Labels related to hate speech include hate, offensive, profane, and toxic. There can also be more fine-grained categories, such as sexist, racist, and islamophobic.
  • For the reasons mentioned above, there is no unified leaderboard for hate speech detection.

Data Sources

We should pay attention to data bias: it is questionable to collect hate speech only from people and sites that are more likely to produce it. The authors propose collecting datasets from neutral sources; this design choice makes data annotation more difficult.

Annotations

Current approaches to hate speech annotation rely on people (crowdworkers or experts). The authors use a two-phase approach to ensure label quality.

Building Better Hate Speech Detection Models

  • The complexity of models does not necessarily help. It is more important to capture the signals that predict the final labels, for example, the history and the social network information. This observation also applies to other tasks that involve modeling social behaviors.
  • However, we should carefully monitor overfitting: spurious correlations between specific phrases and labels should not be signals that the models pick up. That is, the models should generalize even without the presence of these words.
  • In the work [2], the authors propose a system that considers not just the text information, but also the timeline and social network information. They merge the three sources of signal using an attention mechanism. However, we could see two limitations:
    • This design is specific to Twitter. Other platforms, such as Reddit, do not have this information with respect to users.
    • The best-performing system (M14) does not significantly outperform the baseline system, which simply fine-tunes an mBERT (M8).


Lexical Bias

  • Replacing bias-sensitive words with more general words is likely to shift the bias towards their WordNet ancestors. This hypothesis can be supported by a measurement called pinned bias, where t is a single word from the sensitive word list T:

pB_T = \sum_{t \in T} \frac{\vert p(\text{toxic} \mid t) - \phi \vert}{\vert T \vert}, \quad \phi = \min\left(p(\text{toxic} \mid t),\ 0.5\right)
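A tiny sketch of this computation, assuming p_toxic holds the model’s P(toxic | t) for each word t in the sensitive list T; the example probabilities are toy numbers.

```python
def pinned_bias(p_toxic):
    # p_toxic: dict mapping each sensitive word t to the model's P(toxic | t)
    total = 0.0
    for t, p in p_toxic.items():
        phi = min(p, 0.5)
        total += abs(p - phi)      # only probability mass above 0.5 counts as bias
    return total / len(p_toxic)

print(pinned_bias({"muslim": 0.82, "gay": 0.75, "woman": 0.40}))  # toy numbers
```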

Horizons

The presenter has three high-level observations:

  • Like energy: bias seems to transfer from one source to another.
  • Like a system at rest: a model or dataset will remain biased unless an external force (for example, mitigation or regularization) is applied.
  • Like interactive systems: a system evolves to be more chaotic over time. Toxicity needs to be monitored and mitigated continuously.

Reference

  1. [2010.04377] Hate is the New Infodemic: A Topic-aware Modeling of Hate Speech Diffusion on Twitter (Masud et al., ICDE 2021): This paper presents a dataset called RETINA that focuses on hate speech in the Indian context.
  2. [2206.04007] Proactively Reducing the Hate Intensity of Online Posts via Hate Speech Normalization (Masud et al., KDD 2022)
  3. [2201.00961] Nipping in the Bud: Detection, Diffusion and Mitigation of Hate Speech on Social Media (Chakraborty and Masud)
  4. [2306.01105] Revisiting Hate Speech Benchmarks: From Data Curation to System Deployment (Masud et al., KDD 2023)
  5. [2202.00126] Handling Bias in Toxic Speech Detection: A Survey (Garg et al., CSUR).
  6. Language (Technology) is Power: A Critical Survey of “Bias” in NLP (Blodgett et al., ACL 2020)
  7. [2305.06626] When the Majority is Wrong: Modeling Annotator Disagreement for Subjective Tasks (Fleisig et al.)
  8. Handling Disagreement in Hate Speech Modelling | SpringerLink (Novak et al., IPMU 2022)
  9. [2001.05495] Stereotypical Bias Removal for Hate Speech Detection Task using Knowledge-based Generalizations (Badjatiya et al., WWW 2019).

Reading Notes | Revisiting Hate Speech Benchmarks – From Data Curation to System Deployment

[Semantic Scholar] – [Code] – [Tweet] – [Video] – [Website] – [Slide]

Change Logs:

  • 2023-09-21: First draft. This paper appears at KDD 2023. The co-lead author – Sarah Masud – has published numerous papers on hate speech detection.

Additional Notes

  • Measuring Dataset Difficulty

    The authors compare the difficulty of different datasets using the JS divergence between Laplace-smoothed unigram distributions of texts under different label pairs; the lower the divergence, the closer the unigram distributions and the harder it is to distinguish texts under that label pair (see the sketch after this list).

    For example, the proposed dataset has 4 labels, which leads to \binom{4}{2} = 6 divergence measures.

  • Matthews Correlation Coefficient (MCC): a correlation-based measure of binary classification quality computed from the full confusion matrix; it is more robust to class imbalance than accuracy or F1.
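Below is a sketch of the dataset-difficulty measure mentioned above; the whitespace tokenization and smoothing constant are simplifying assumptions, since the paper’s exact preprocessing is not described in these notes.

```python
from collections import Counter

import numpy as np
from scipy.spatial.distance import jensenshannon

def unigram_dist(texts, vocab, alpha=1.0):
    counts = Counter(tok for t in texts for tok in t.lower().split())
    freqs = np.array([counts[w] + alpha for w in vocab], dtype=float)  # Laplace smoothing
    return freqs / freqs.sum()

def label_pair_divergence(texts_a, texts_b):
    # JS divergence between the unigram distributions of two label groups.
    vocab = sorted({tok for t in texts_a + texts_b for tok in t.lower().split()})
    p, q = unigram_dist(texts_a, vocab), unigram_dist(texts_b, vocab)
    return jensenshannon(p, q) ** 2  # scipy returns the distance; square it for divergence
```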

Reference

Reading Notes | Improving Generalization of Hate Speech Detection Systems to Novel Target Groups via Domain Adaptation

[Semantic Scholar] – [Code] – [Tweet] – [Video] – [Website] – [Slide]

Change Logs:

  • 2023-09-11: First draft. This paper appears at WOAH ’22.

The paper studies generalization to new hate target groups on the single HateXplain dataset; the authors do so by comparing four existing methods: (1) Unsupervised Domain Adaptation (UDA, also used in paper [1]), (2) MixUp regularization, (3) curriculum labeling, and (4) DANN.

The paper also considers the back-translation approach (specifically the (en, fr), (en, de), and (en, es) pairs) for data augmentation.
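The paper’s implementation details are not given in these notes; below is a minimal back-translation sketch assuming the Helsinki-NLP MarianMT checkpoints on HuggingFace, with French as the pivot language.

```python
from transformers import MarianMTModel, MarianTokenizer

def translate(texts, model_name):
    tokenizer = MarianTokenizer.from_pretrained(model_name)
    model = MarianMTModel.from_pretrained(model_name)
    batch = tokenizer(texts, return_tensors="pt", padding=True, truncation=True)
    generated = model.generate(**batch)
    return tokenizer.batch_decode(generated, skip_special_tokens=True)

def back_translate(texts, pivot="fr"):
    # en -> pivot -> en; the pivot can be "fr", "de", or "es" as in the paper.
    forward = translate(texts, f"Helsinki-NLP/opus-mt-en-{pivot}")
    return translate(forward, f"Helsinki-NLP/opus-mt-{pivot}-en")

augmented = back_translate(["This sentence will be paraphrased via French."])
```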

Experiments

  • Zero: Directly apply a model trained on \mathcal{D}_A to a new domain \mathcal{D}_B.
  • Zero+: Augmenting \mathcal{D}_A using back-translation.
  • ZeroB+: Applying back-translation-based data augmentation while making sure that each batch is class-balanced.

Reference

  1. Unsupervised Domain Adaptation in Cross-corpora Abusive Language Detection (Bose et al., SocialNLP 2021): This paper considers the setting of training on dataset \mathcal{D}_A and testing on another dataset \mathcal{D}_B, where A, B are HateEval, Waseem, and Davidson, resulting in 6 pairs. They use several existing methods to improve the test scores on \mathcal{D}_B.
  2. [2012.10289] HateXplain: A Benchmark Dataset for Explainable Hate Speech Detection (Mathew et al., AAAI 2021): This used to be the only dataset that provides the target groups of both hateful and non-hateful contents.
  3. Data augmentation can happen in the symbol space (via rules, word replacement through BERT, or text-generation models) or in the feature space. However, the main paper chooses back translation for data augmentation.

    Here are two libraries on data augmentation in NLP:

Reading Notes | Directions in Abusive Language Training Data – Garbage In, Garbage Out

[Semantic Scholar] – [Code] – [Tweet] – [Video] – [Website] – [Slide]

Change Logs:

  • 2023-09-06: First draft. This paper provides the influential hate speech dataset hub hatespeechdata.com even though it appears in PLoS One.

This paper provides a survey of existing (as of 2020) hate speech datasets and some suggestions for creating future hate speech datasets.

Research Notes | A Benchmark for Hate Speech Detection

Overview

There is no unified benchmark, such as GLUE, in the hate speech detection domain that provides a leaderboard-style performance comparison of different open-source hate speech classifiers. This prevents practitioners from making informed decisions when choosing a model for their own hate speech detection applications.

The benchmark will provide the following:

  • The entire training and validation sets for future study. However, the labels of the public test sets will not be released for benchmarking purposes; there will also be additional private test sets.
  • The ranking of the models based on the average aggregated metrics (for example, F1 score) on the public and private test sets.

Protocol

  • Step 1: Randomly select a test set and a validation set.

    The two datasets must be randomly selected for the following reasons:

    1. The distribution of the validation set will be similar to that of the test set. Using a randomly sampled validation set helps select the models that are more likely to perform well on the test set.
    2. This makes the two datasets independent of each other in terms of label distribution and source distribution. Throughout the experiments, the test and validation sets remain fixed; this is helpful as we could see the (dis)advantages of one method in the wandb dashboard.
  • Step 2: Sampling the training set using different (a) data selection methods.
  • Step 3: Training or fine-tuning (b) different models with (c) different techniques for local improvements, for example, objective functions and regularization.
  • Step 4: Comparing different combinations of (a), (b), and (c). If we have m combinations and n test sets, we will end up with a table of shape (m, n+1), where the first column lists all the combinations (see the toy table below).
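A toy illustration of the resulting (m, n+1) comparison table from Step 4; the combination names and scores are placeholders only, not results from any experiment.

```python
import pandas as pd

results = [
    {"combination": "random-sample + HateBERT + CE loss", "test_A_f1": 0.71, "test_B_f1": 0.65},
    {"combination": "balanced-sample + RoBERTa + focal loss", "test_A_f1": 0.74, "test_B_f1": 0.69},
]
table = pd.DataFrame(results)   # shape (m, n + 1): first column lists the combinations
print(table)
```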

Candidate Datasets

Collected Datasets from Diverse Topics

The current data aggregation includes [1] through [5], where [5] only includes hate speech.

  1. Detecting East Asian Prejudice on Social Media (Vidgen et al., ALW 2020)
  2. [2005.12423] Racism is a Virus: Anti-Asian Hate and Counterspeech in Social Media during the COVID-19 Crisis (He et al.)
  3. [2108.12521] TweetBLM: A Hate Speech Dataset and Analysis of Black Lives Matter-related Microblogs on Twitter (Kumar et al.)
  4. Hate Towards the Political Opponent: A Twitter Corpus Study of the 2020 US Elections on the Basis of Offensive Speech and Stance Detection (Grimminger & Klinger, WASSA 2021)
  5. Latent Hatred: A Benchmark for Understanding Implicit Hate Speech (ElSherief et al., EMNLP 2021)

cardiffnlp/twitter-roberta-base-hate-latest Collection

The following are the datasets used for the model cardiffnlp/twitter-roberta-base-hate-latest, described in the paper below:

Robust Hate Speech Detection in Social Media: A Cross-Dataset Empirical Evaluation (Antypas & Camacho-Collados, WOAH 2023)

| Index | Dataset Name | Source | Notes |
| --- | --- | --- | --- |
| 1 | HatE | Link that requires filling in a Google form. | |
| 2 | MHS | ucberkeley-dlab/measuring-hate-speech | |
| 3 | DEAP | Zenodo | |
| 4 | CMS | Link that requires registration and email verification. | |
| 5 | Offense | Link; this dataset is also called OLID. | |
| 6 | HateX | hatexplain and GitHub | |
| 7 | LSC | GitHub | Dehydrated |
| 8 | MMHS | nedjmaou/MLMA_hate_speech and GitHub | |
| 9 | HASOC | Link that requires uploading a signed agreement; this agreement takes up to 15 days to approve. | Not Available |
| 10 | AYR | GitHub | Dehydrated |
| 11 | AHSD | GitHub | |
| 12 | HTPO | Link | |
| 13 | HSHP | GitHub | Dehydrated |

The following are the papers that correspond to the list of datasets:

  1. SemEval-2019 Task 5: Multilingual Detection of Hate Speech Against Immigrants and Women in Twitter (Basile et al., SemEval 2019)
  2. The Measuring Hate Speech Corpus: Leveraging Rasch Measurement Theory for Data Perspectivism (Sachdeva et al., NLPerspectives 2022)
  3. Detecting East Asian Prejudice on Social Media (Vidgen et al., ALW 2020)
  4. [2004.12764] “Call me sexist, but…”: Revisiting Sexism Detection Using Psychological Scales and Adversarial Samples (Samory et al.)
  5. Predicting the Type and Target of Offensive Posts in Social Media (Zampieri et al., NAACL 2019)
  6. [2012.10289] HateXplain: A Benchmark Dataset for Explainable Hate Speech Detection (Mathew et al.)
  7. [1802.00393] Large Scale Crowdsourcing and Characterization of Twitter Abusive Behavior (Founta et al.)
  8. Multilingual and Multi-Aspect Hate Speech Analysis (Ousidhoum et al., EMNLP-IJCNLP 2019)
  9. [2108.05927] Overview of the HASOC track at FIRE 2020: Hate Speech and Offensive Content Identification in Indo-European Languages (Mandal et al.)
  10. Are You a Racist or Am I Seeing Things? Annotator Influence on Hate Speech Detection on Twitter (Waseem, NLP+CSS 2016)
  11. [1703.04009] Automated Hate Speech Detection and the Problem of Offensive Language (Davidson et al.)
  12. Hate Towards the Political Opponent: A Twitter Corpus Study of the 2020 US Elections on the Basis of Offensive Speech and Stance Detection (Grimminger & Klinger, WASSA 2021)
  13. Hateful Symbols or Hateful People? Predictive Features for Hate Speech Detection on Twitter (Waseem & Hovy, NAACL 2016)

It is possible to approximate a subset of the original training mixture (8 of the 12 datasets, excluding the MMHS dataset, which only includes hate speech) following Table 2 of the original paper. Some things to note:

  • AYR, HASOC, HSHP, and LSC are not usable.
  • Offense does not exactly match the sizes in Table 2.
  • We disregard any splits and try to match the numbers in Table 2. When matching the numbers is not possible, we try to make sure the ratio of non-hate versus hate is the same.

Additional Datasets from hatespeechdata.com

The following are the additional datasets from hatespeechdata.com that are not included in the above-mentioned sources. The dataset names are either taken from the original papers or created here for easy reference.

| Index | Dataset Name | Source | Notes |
| --- | --- | --- | --- |
| 1 | AbuseEval | GitHub | The Offense dataset above reannotated for non-hate, implicit, and explicit hate; only IDs are available. Around 87% of the hate/non-hate labels are the same as in the previous Offense dataset. |
| 2 | SWAD | GitHub | |
| 3 | ALONE | | Not usable. Requires contacting the authors. |
| 4 | HatefulUsersTwitter | GitHub and Kaggle | Available but not relevant. This dataset is about detecting whether a user is hateful or neutral on the Twitter network; it does not come with annotated hateful/benign texts. |
| 5 | MMHS150K | Website | Not usable. Multimodal dataset. |
| 6 | HarassmentLexicon | GitHub | Not usable. Lexicons only. |
| 7 | P2PHate | GitHub | Not usable. Dehydrated. |
| 8 | Golbeck | | Not usable. Requires contacting jgolbeck@umd.edu. |
| 9 | SurgeAI | Website | Hateful content only. |
| 10 | TSA | Kaggle | The dataset is provided by Analytics Vidhya. The test.csv does not come with labels. |

  1. I Feel Offended, Don’t Be Abusive! Implicit/Explicit Messages in Offensive and Abusive Language (Caselli et al., LREC 2020): The dataset from this paper is also called AbuseEval v1.0.
  2. Do You Really Want to Hurt Me? Predicting Abusive Swearing in Social Media (Pamungkas et al., LREC 2020)
  3. [2008.06465] ALONE: A Dataset for Toxic Behavior among Adolescents on Twitter (Wijesiriwardene et al.)
  4. [1803.08977] Characterizing and Detecting Hateful Users on Twitter (Ribeiro et al., ICWSM 2018)
  5. [1910.03814] Exploring Hate Speech Detection in Multimodal Publications (Gomez et al., WACV 2020)
  6. [1802.09416] A Quality Type-aware Annotated Corpus and Lexicon for Harassment Research (Rezvan et al.)
  7. [1804.04649] Peer to Peer Hate: Hate Speech Instigators and Their Targets (ElSherief et al.)
  8. A Large Labeled Corpus for Online Harassment Research (Golbeck et al., WebSci 2017)
  9. Twitter Hate Speech Dataset (Surge AI)
  10. Twitter Sentiment Analysis (Kaggle)

Reading Notes | Using GPT-4 for Content Moderation

Method

This blog post illustrates an idea of human-AI collaboration in revising an existing content policy. Specifically,

  • Based on an initial policy P_0, a human expert may disagree with a moderation decision made by GPT-4.
  • The human expert elicits suggestions from GPT-4 to revise the policy P_0 into P_1, repeating until the human expert agrees with GPT-4’s decision.

The blog post does not clearly explain how either step is done, for example, (1) what prompt is used to turn the general-purpose GPT-4 into a content moderator, (2) what prompt is used to ask GPT-4 for feedback, and (3) how human experts turn GPT-4’s feedback into concrete policy revisions.
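For concreteness, here is a speculative sketch of what the two steps could look like with the OpenAI SDK; the prompts, model name, and labels are illustrative assumptions and are not taken from the blog post.

```python
from openai import OpenAI

client = OpenAI()

def moderate(policy: str, example: str) -> str:
    # Ask GPT-4 to apply the current policy to one example.
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": f"You are a content moderator. Policy:\n{policy}"},
            {"role": "user", "content": f"Label the following text as ALLOWED or VIOLATING "
                                        f"and explain briefly:\n{example}"},
        ],
    )
    return resp.choices[0].message.content

def suggest_revision(policy: str, example: str, expert_label: str) -> str:
    # Ask GPT-4 how to revise the policy when its decision disagrees with the expert.
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content":
                   f"The policy below labeled this text differently from a human expert "
                   f"({expert_label}). Suggest a clarifying revision to the policy.\n\n"
                   f"Policy:\n{policy}\n\nText:\n{example}"}],
    )
    return resp.choices[0].message.content
```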

Reading Notes | Robust Hate Speech Detection in Social Media – A Cross-Dataset Empirical Evaluation

[Semantic Scholar] – [Code] – [Tweet] – [Video] – [Website] – [Slide] – [HuggingFace]

Change Logs:

  • 2023-09-04: First draft. This paper appears at WOAH ’23. The provided models on HuggingFace have more than 40K downloads thanks to their easy-to-use tweetnlp package; the best-performing binary and multi-class classification models are cardiffnlp/twitter-roberta-base-hate-latest and cardiffnlp/twitter-roberta-base-hate-multiclass-latest respectively.

Method

Datasets

The authors manually select and unify 13 hate speech datasets for binary and multi-class classification settings. The authors do not provide a rationale for why they choose these 13 datasets.

For the multi-class classification setting, the authors devise 7 classes: racism, sexism, disability, sexual orientation, religion, other, and non-hate. This category set is similar to yet smaller than that of the MHS dataset, which includes gender, race, sexuality, religion, origin, politics, age, and disability (see [1]).

For all 13 datasets, the authors apply a 7:1:2 train/validation/test split; they also create a small external test set (i.e., Indep). With the test sets kept untouched, the authors consider 3 ways of preparing the training data:

  1. Training on the single dataset.
  2. Training on an aggregation of 13 datasets.
  3. Training on a dataset sampled from the aggregation in 2. Specifically, the authors (1) find the dataset size that leads to the highest score in 1, and (2) sample from the aggregation proportionally to each of the 13 datasets’ sizes while keeping the ratio of hate versus non-hate at exactly 1:1 (see the sketch below).
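A rough sketch of preparation method 3, assuming each dataset is a DataFrame with a binary label column (1 = hate); the paper’s exact procedure is not fully specified in these notes.

```python
import pandas as pd

def sample_mixture(datasets, total_size, seed=0):
    # datasets: dict mapping dataset name -> DataFrame with a binary "label" column
    n_total = sum(len(df) for df in datasets.values())
    parts = []
    for name, df in datasets.items():
        quota = int(total_size * len(df) / n_total)   # proportional to dataset size
        per_class = quota // 2                        # enforce 1:1 hate vs. non-hate
        for label in (0, 1):
            pool = df[df["label"] == label]
            parts.append(pool.sample(min(per_class, len(pool)), random_state=seed))
    return pd.concat(parts).sample(frac=1.0, random_state=seed)  # shuffle
```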

The processed datasets are not provided by the authors. We need to follow the guides below to obtain them; the indices of the datasets are kept consistent with the HuggingFace model hub and the main paper’s Table 1.

| Index | Dataset Name | Source | Notes |
| --- | --- | --- | --- |
| 1 | HatE | Link that requires filling in a Google form. | |
| 2 | MHS | ucberkeley-dlab/measuring-hate-speech | |
| 3 | DEAP | Zenodo | |
| 4 | CMS | Link that requires registration and email verification. | |
| 5 | Offense | Link; this dataset is also called OLID. | |
| 6 | HateX | hatexplain and GitHub | |
| 7 | LSC | GitHub | Dehydrated |
| 8 | MMHS | nedjmaou/MLMA_hate_speech and GitHub | |
| 9 | HASOC | Link that requires uploading a signed agreement; this agreement takes up to 15 days to approve. | Not Available |
| 10 | AYR | GitHub | Dehydrated |
| 11 | AHSD | GitHub | |
| 12 | HTPO | Link | |
| 13 | HSHP | GitHub | Dehydrated |

The following are the papers that correspond to the list of datasets:

  1. SemEval-2019 Task 5: Multilingual Detection of Hate Speech Against Immigrants and Women in Twitter (Basile et al., SemEval 2019)
  2. The Measuring Hate Speech Corpus: Leveraging Rasch Measurement Theory for Data Perspectivism (Sachdeva et al., NLPerspectives 2022)
  3. Detecting East Asian Prejudice on Social Media (Vidgen et al., ALW 2020)
  4. [2004.12764] “Call me sexist, but…”: Revisiting Sexism Detection Using Psychological Scales and Adversarial Samples (Samory et al.)
  5. Predicting the Type and Target of Offensive Posts in Social Media (Zampieri et al., NAACL 2019)
  6. [2012.10289] HateXplain: A Benchmark Dataset for Explainable Hate Speech Detection (Mathew et al.)
  7. [1802.00393] Large Scale Crowdsourcing and Characterization of Twitter Abusive Behavior (Founta et al.)
  8. Multilingual and Multi-Aspect Hate Speech Analysis (Ousidhoum et al., EMNLP-IJCNLP 2019)
  9. [2108.05927] Overview of the HASOC track at FIRE 2020: Hate Speech and Offensive Content Identification in Indo-European Languages (Mandal et al.)
  10. Are You a Racist or Am I Seeing Things? Annotator Influence on Hate Speech Detection on Twitter (Waseem, NLP+CSS 2016)
  11. [1703.04009] Automated Hate Speech Detection and the Problem of Offensive Language (Davidson et al.)
  12. Hate Towards the Political Opponent: A Twitter Corpus Study of the 2020 US Elections on the Basis of Offensive Speech and Stance Detection (Grimminger & Klinger, WASSA 2021)
  13. Hateful Symbols or Hateful People? Predictive Features for Hate Speech Detection on Twitter (Waseem & Hovy, NAACL 2016)

Despite the availability of the sources, it is quite hard to reproduce the original dataset because (1) many of the datasets in the table do not come with predefined splits; the only datasets that are available and have such splits are HatE, HateX, and HTPO, (2) how the authors unify the datasets (for example, deriving binary labels from potentially complicated provided labels) is unknown, and (3) how the authors preprocess the texts is also unknown.

It is better to find a model whose checkpoints and exact training datasets are both available; one such example is the Alpaca language model.

Models and Fine-Tuning

The authors start from bert-base-uncased, roberta-base, and two models specifically customized to Twitter (see [2], [3]). The authors carry out HPO over the learning rate, warmup rate, number of epochs, and batch size using hyperopt.
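A rough sketch of such an HPO loop with hyperopt; the search ranges and the train_and_eval stub are placeholders, since the paper only names the tuned hyperparameters and the library.

```python
from hyperopt import fmin, hp, tpe

space = {
    "learning_rate": hp.loguniform("learning_rate", -12, -9),  # roughly 6e-6 to 1e-4
    "warmup_ratio": hp.uniform("warmup_ratio", 0.0, 0.2),
    "num_epochs": hp.choice("num_epochs", [2, 3, 4, 5]),
    "batch_size": hp.choice("batch_size", [16, 32]),
}

def train_and_eval(learning_rate, warmup_ratio, num_epochs, batch_size):
    # Placeholder for fine-tuning the model with these hyperparameters and
    # returning the dev F1; returns a dummy score so the sketch runs end to end.
    return 0.5

def objective(params):
    return -train_and_eval(**params)   # hyperopt minimizes, so negate the dev F1

best = fmin(fn=objective, space=space, algo=tpe.suggest, max_evals=20)
```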

Experiments

  • The data preparation method 3 (All*) performs better than method 1 (MHS, AYR, etc.). It also achieves the highest scores on the Indep test set (Table 3).


Other Information

  • Language identification can be done with fastText models (doc).
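For reference, a minimal language-identification call with fastText, assuming the public lid.176.bin model file has been downloaded beforehand.

```python
import fasttext

lid = fasttext.load_model("lid.176.bin")          # pre-trained language ID model
labels, probs = lid.predict("Ceci n'est pas de l'anglais.", k=1)
print(labels[0], probs[0])                        # e.g. __label__fr with its confidence
```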

Comments

  • Ill-Defined Data Collection Goal

    We can read sentences like the following in the paper:

    • For example both CMS and AYR datasets deal with sexism but the models trained only on CMS perform poorly when evaluated on AYR (e.g. BERTweetCSM achieves 87% F1 on CSM, but only 52% on AYR).
    • This may be due to the scope of the dataset, dealing with East Asian Prejudice during the COVID-19 pandemic, which is probably not well captured in the rest of the datasets.

    The issue is that there is no quantitative measure of the underlying theme of a dataset (for example, CMS or AYR). The dataset curators may have some general idea of what the dataset should be about; they often do not have a clearly defined measure to quantify how much one sample aligns with their data collection goals.

    I wish to see some quantitative measures on topics and distributions of an NLP dataset.

Reference

  1. Targeted Identity Group Prediction in Hate Speech Corpora (Sachdeva et al., WOAH 2022)
  2. BERTweet: A pre-trained language model for English Tweets (Nguyen et al., EMNLP 2020)
  3. TimeLMs: Diachronic Language Models from Twitter (Loureiro et al., ACL 2022): This paper also comes from Cardiff NLP. It considers the time axis of the language modeling through continual learning. It tries to achieve OOD generalization (in terms of time) without degrading the performance on the static benchmark.

Talk Notes | Building End-to-End Content Moderation Pipelines in the Real World

[Website] – [Paper] – [Blog]

Note:
– The presenter of this talk is the lead author of the paper A Holistic Approach to Undesired Content Detection in the Real World.

Change Logs:

  • 2023-08-29: First draft.

Overview

There are two main iterations in building an end-to-end content moderator.
– Annotation Iteration: OpenAI outsources most of the annotation work to external data providers. They also have internal expert annotators who provide the labels of the quality control set.
– Main Iteration: This is the bulk of OpenAI’s contribution.

Annotation Iteration

  • Labeling guidelines need to be clarified and updated multiple times as more and more edge cases surface. OpenAI’s specifications are eventually turned into training materials for the data providers to educate their annotators.
  • There should be sessions for
    • Calibrating the annotators by clarifying the annotation guidelines.
    • Auditing data that are flagged harmful either by the annotators or by the model, and removing annotations from annotators with low per-category F1 scores. This process can be accelerated by cross-auditing with multiple annotators.
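A hedged sketch of the auditing idea: flag annotators whose per-category F1 against adjudicated gold labels falls below a threshold. The threshold, column names, and binary label layout are illustrative assumptions, not details from the talk.

```python
import pandas as pd
from sklearn.metrics import f1_score

def low_quality_annotators(df: pd.DataFrame, threshold: float = 0.5):
    # df columns: annotator, category, label (annotator's binary label), gold (adjudicated binary label)
    flagged = []
    for (annotator, category), group in df.groupby(["annotator", "category"]):
        score = f1_score(group["gold"], group["label"], zero_division=0)
        if score < threshold:
            flagged.append((annotator, category, score))
    return flagged
```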

Main Iteration

The following steps outline the main iteration:

  • Step 0: Creating an initial dataset. This initial dataset includes data from a “bad” (and unlabeled) subset of CommonCrawl, expert-selected academic datasets, and zero-shot synthetic data generated by GPT-3 from hand-crafted templates.
  • Step k-1: \cdots
  • Step k: In iteration k, training a model \mathcal{M}_k based on a GPT-series model using the standard cross-entropy loss.

One thing OpenAI could not solve well is calibration.

  • Step k+1: Using \mathcal{M}_k to run inference on the unlabeled production data; the probabilities are used to select the subset for annotation. Three methods are compared:
    • Purely Random Sampling
    • Random Sampling for Samples Above a Threshold
    • Uncertainty Sampling

Active learning substantially increases the ratio of harmful content in the selected data compared to raw user traffic (10 to 22 times).
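A minimal sketch contrasting the three selection strategies above, assuming probs holds the model’s scores for the harmful class on unlabeled production data; the threshold and sizes are illustrative.

```python
import numpy as np

def purely_random(probs, k, rng):
    return rng.choice(len(probs), size=k, replace=False)

def random_above_threshold(probs, k, rng, tau=0.5):
    pool = np.flatnonzero(probs >= tau)            # restrict to likely-harmful samples
    return rng.choice(pool, size=min(k, len(pool)), replace=False)

def uncertainty_sampling(probs, k):
    return np.argsort(np.abs(probs - 0.5))[:k]     # closest to the decision boundary

rng = np.random.default_rng(0)
probs = rng.random(10_000)                         # stand-in for model scores
picked = uncertainty_sampling(probs, k=256)
```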

After the subset is annotated, it is added back to the training set. Synthetic data is also added to address counterfactual bias.

  • Step k+2: Running the following steps to further improve the training data.

    • Overfitted Phrase Detection.
    • Mislabeling Detection.
  • Step k+3: Internal red teaming.
  • Step k+4: \cdots
  • Step -3:
    • Evaluating on the static test set.
    • A/B testing.
  • Step -1: Product release.

Here is a more detailed diagram; it is the same as the one provided in the paper.

Future Direction

  • Dataset

    • A more systematic approach to creating synthetic datasets. The current approach OpenAI uses is ad hoc.
    • Robustness to prompt injection and ciphers.
  • Continuous GPT-Assisted Red Teaming
  • Active Learning
    • The current active learning approach relies on the model \mathcal{M}_k at Step k+1, but \mathcal{M}_k may not generalize well.
    • The presenter also mentions anomaly detection; it is not prioritized at OpenAI due to time constraints.

Reference