Reading Notes | Using GPT-4 for Content Moderation

Method

This blog post illustrates an idea of human-AI collaboration in revising an existing content policy. Specifically,

  • Based on an initial policy P_0, a human expert may disagree with a moderation decision of GPT-4.
  • The human expert elicits suggestions from GPT-4 on how to revise the policy P_0 into P_1; this repeats until the human expert agrees with GPT-4's decision.

The blog post does not clearly explain how either step is done. For example, (1) what prompt is used to turn the general-purpose GPT-4 into a content moderator, (2) what prompt is used to elicit feedback from GPT-4, and (3) how human experts turn GPT-4's feedback into concrete policy revisions.
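
Since these details are missing, below is a minimal sketch of what the moderate-then-revise loop could look like. The prompts, the `moderate` and `suggest_revision` helpers, and the choice of the OpenAI chat API are my own assumptions, not the blog post's actual setup.

```python
from openai import OpenAI  # assumes the official openai>=1.0 Python client

client = OpenAI()

def moderate(policy: str, text: str) -> str:
    """Hypothetical prompt that turns GPT-4 into a moderator for `policy`."""
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": f"You are a content moderator. Policy:\n{policy}"},
            {"role": "user", "content": f"Label the text as ALLOW or REMOVE and explain briefly:\n{text}"},
        ],
    )
    return resp.choices[0].message.content

def suggest_revision(policy: str, text: str, expert_label: str, model_output: str) -> str:
    """Hypothetical prompt asking GPT-4 to explain the disagreement and propose a policy edit."""
    prompt = (
        f"Policy:\n{policy}\n\nText:\n{text}\n\n"
        f"Your decision: {model_output}\nExpert decision: {expert_label}\n"
        "Explain the disagreement and suggest a concrete revision of the policy."
    )
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

# P_0 -> P_1: the human expert reads the suggestion, edits the policy by hand,
# and reruns `moderate` until they agree with GPT-4's decision.
```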

Reading Notes | Robust Hate Speech Detection in Social Media – A Cross-Dataset Empirical Evaluation

[Semantic Scholar] – [Code] – [Tweet] – [Video] – [Website] – [Slide] – [HuggingFace]

Change Logs:

  • 2023-09-04: First draft. This paper appears at WOAH ’23. The provided models on HuggingFace have more than 40K downloads thanks to their easy-to-use tweetnlp package; the best-performing binary and multi-class classification models are cardiffnlp/twitter-roberta-base-hate-latest and cardiffnlp/twitter-roberta-base-hate-multiclass-latest respectively.

Method

Datasets

The authors manually select and unify 13 hate speech datasets for binary and multi-class classification settings; they do not provide a rationale for why these 13 datasets were chosen.

For the multi-class classification setting, the authors devise 7 classes: racism, sexism, disability, sexual orientation, religion, other, and non-hate. This taxonomy is similar to, but smaller than, that of the MHS dataset, which includes gender, race, sexuality, religion, origin, politics, age, and disability (see [1]).

For all 13 datasets, the authors apply a 7:1:2 data split; they also create a small external test set (i.e., Indep). With the test sets kept untouched, the authors consider 3 ways of preparing the training data:

  1. Training on a single dataset.
  2. Training on an aggregation of 13 datasets.
  3. Training on a dataset sampled from the aggregation in 2. Specifically, the authors (1) find the dataset size that leads to the highest score in setting 1, and (2) sample from the aggregation proportionally to each of the 13 datasets' sizes while keeping the ratio of hate to non-hate at exactly 1:1 (a minimal sketch of this sampling step follows the list).
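
A minimal sketch of the sampling step in 3, assuming each dataset is a pandas DataFrame with a binary `label` column (1 = hate) containing both classes; the implementation is my guess at the described procedure, not the authors' code.

```python
import pandas as pd

def sample_mixture(datasets, total_size, seed=0):
    """Sample `total_size` examples from the aggregation of `datasets` (a dict of
    name -> DataFrame), proportionally to each dataset's size and with an exact
    1:1 hate/non-hate ratio inside each dataset's share."""
    n_total = sum(len(df) for df in datasets.values())
    parts = []
    for name, df in datasets.items():
        n = int(total_size * len(df) / n_total)  # proportional share of this dataset
        n_hate = n // 2                          # enforce the 1:1 class ratio
        hate = df[df["label"] == 1].sample(n_hate, random_state=seed, replace=True)
        non_hate = df[df["label"] == 0].sample(n - n_hate, random_state=seed, replace=True)
        parts.append(pd.concat([hate, non_hate]))
    # `replace=True` oversamples when a class is too small; this is my own assumption.
    return pd.concat(parts).sample(frac=1.0, random_state=seed)  # shuffle
```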

The processed datasets are not provided by the authors. We need to follow the guides below to obtain them; the index of the datasets is kept consistent with the HuggingFace model hub and the main paper's Table 1.

| Index | Dataset Name | Source | Notes |
|---|---|---|---|
| 1 | HatE | Link that requires filling in a Google form. | |
| 2 | MHS | ucberkeley-dlab/measuring-hate-speech | |
| 3 | DEAP | Zenodo | |
| 4 | CMS | Link that requires registration and email verification. | |
| 5 | Offense | Link; this dataset is also called OLID. | |
| 6 | HateX | hatexplain and GitHub | |
| 7 | LSC | GitHub | Dehydrated |
| 8 | MMHS | nedjmaou/MLMA_hate_speech and GitHub | |
| 9 | HASOC | Link that requires uploading a signed agreement; this agreement takes up to 15 days to approve. | Not Available |
| 10 | AYR | GitHub | Dehydrated |
| 11 | AHSD | GitHub | |
| 12 | HTPO | Link | |
| 13 | HSHP | GitHub | Dehydrated |

The following are the papers that correspond to the list of datasets:

  1. SemEval-2019 Task 5: Multilingual Detection of Hate Speech Against Immigrants and Women in Twitter (Basile et al., SemEval 2019)
  2. The Measuring Hate Speech Corpus: Leveraging Rasch Measurement Theory for Data Perspectivism (Sachdeva et al., NLPerspectives 2022)
  3. Detecting East Asian Prejudice on Social Media (Vidgen et al., ALW 2020)
  4. [2004.12764] “Call me sexist, but…”: Revisiting Sexism Detection Using Psychological Scales and Adversarial Samples (Samory et al.)
  5. Predicting the Type and Target of Offensive Posts in Social Media (Zampieri et al., NAACL 2019)
  6. [2012.10289] HateXplain: A Benchmark Dataset for Explainable Hate Speech Detection (Mathew et al.)
  7. [1802.00393] Large Scale Crowdsourcing and Characterization of Twitter Abusive Behavior (Founta et al.)
  8. Multilingual and Multi-Aspect Hate Speech Analysis (Ousidhoum et al., EMNLP-IJCNLP 2019)
  9. [2108.05927] Overview of the HASOC track at FIRE 2020: Hate Speech and Offensive Content Identification in Indo-European Languages (Mandal et al.)
  10. Are You a Racist or Am I Seeing Things? Annotator Influence on Hate Speech Detection on Twitter (Waseem, NLP+CSS 2016)
  11. [1703.04009] Automated Hate Speech Detection and the Problem of Offensive Language (Davidson et al.)
  12. Hate Towards the Political Opponent: A Twitter Corpus Study of the 2020 US Elections on the Basis of Offensive Speech and Stance Detection (Grimminger & Klinger, WASSA 2021)
  13. Hateful Symbols or Hateful People? Predictive Features for Hate Speech Detection on Twitter (Waseem & Hovy, NAACL 2016)

Despite the availability of the sources, it is quite hard to reproduce the original datasets because (1) many of the datasets in the table do not come with predefined splits (the only available datasets that have such splits are HatE, HateX, and HTPO), (2) how the authors unify the datasets (for example, deriving binary labels from potentially complicated original label schemes) is unknown, and (3) how the authors preprocess the texts is also unknown.

It is better to find a model whose checkpoints and exact training datasets are both available; one such example is the Alpaca language model.

Models and Fine-Tuning

The authors start from bert-base-uncased, roberta-base, and two models specifically adapted to Twitter (see [2], [3]). They carry out hyperparameter optimization (HPO) over the learning rate, warmup rate, number of epochs, and batch size using hyperopt.
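
A minimal sketch of such an HPO loop with hyperopt; the search ranges and the `train_and_eval` helper are hypothetical, not the authors' actual configuration.

```python
import numpy as np
from hyperopt import Trials, fmin, hp, tpe

# Search space over the four hyperparameters mentioned above; the ranges are my guesses.
space = {
    "learning_rate": hp.loguniform("learning_rate", np.log(1e-6), np.log(1e-4)),
    "warmup_rate": hp.uniform("warmup_rate", 0.0, 0.2),
    "num_epochs": hp.quniform("num_epochs", 1, 10, 1),
    "batch_size": hp.choice("batch_size", [16, 32, 64]),
}

def objective(params):
    params["num_epochs"] = int(params["num_epochs"])  # quniform returns floats
    # `train_and_eval` is a hypothetical helper that fine-tunes the model with
    # these hyperparameters and returns the validation macro-F1.
    return -train_and_eval(**params)  # hyperopt minimizes, so negate the score

best = fmin(fn=objective, space=space, algo=tpe.suggest, max_evals=20, trials=Trials())
print(best)
```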

Experiments

  • Data preparation method 3 (All*) performs better than method 1 (training on a single dataset such as MHS or AYR); it also achieves the highest scores on the Indep test set (Table 3).


Other Information

  • Language identification (i.e., classifying which language a text is written in) could be done with fastText models (doc); a minimal sketch follows.
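
A minimal sketch with fastText's off-the-shelf language identification model; lid.176.bin must be downloaded from fasttext.cc first, and the local path is an assumption.

```python
import fasttext

# lid.176.bin is fastText's pre-trained language identification model.
model = fasttext.load_model("lid.176.bin")

labels, probs = model.predict("Bonjour tout le monde")
print(labels, probs)  # e.g., ('__label__fr',) with a high probability
```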

Comments

  • Ill-Defined Data Collection Goal

    We can read sentences like the following in the paper:

    • For example both CMS and AYR datasets deal with sexism but the models trained only on CMS perform poorly when evaluated on AYR (e.g. BERTweetCSM achieves 87% F1 on CSM, but only 52% on AYR).
    • This may be due to the scope of the dataset, dealing with East Asian Prejudice during the COVID-19 pandemic, which is probably not well captured in the rest of the datasets.

    The issue is that there is no quantitative measure of the underlying theme of a dataset (for example, CMS and AYR). The dataset curators may have a general idea of what the dataset should be about, but they often do not have a clearly defined measure of how well each sample aligns with their data collection goals.

    I wish to see quantitative measures of the topics and distributions of an NLP dataset.

Reference

  1. Targeted Identity Group Prediction in Hate Speech Corpora (Sachdeva et al., WOAH 2022)
  2. BERTweet: A pre-trained language model for English Tweets (Nguyen et al., EMNLP 2020)
  3. TimeLMs: Diachronic Language Models from Twitter (Loureiro et al., ACL 2022): This paper also comes from Cardiff NLP. It considers the time axis of the language modeling through continual learning. It tries to achieve OOD generalization (in terms of time) without degrading the performance on the static benchmark.

Reading Notes | DoReMi – Optimizing Data Mixtures Speeds Up Language Model Pretraining

Overview

Other Information

  • The ratios of domains should be computed using the number of tokens rather than the number of documents, even though different tokenizers may return slightly different ratios; a minimal sketch follows.
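
A minimal sketch of counting domain ratios by tokens; the toy `corpus` and the choice of the GPT-2 tokenizer are assumptions for illustration.

```python
from collections import Counter

from transformers import AutoTokenizer

# Toy (domain, text) pairs standing in for a real pretraining corpus.
corpus = [
    ("wiki", "Paris is the capital of France."),
    ("code", "def add(a, b):\n    return a + b"),
]

tokenizer = AutoTokenizer.from_pretrained("gpt2")

token_counts = Counter()
for domain, text in corpus:
    token_counts[domain] += len(tokenizer.encode(text))

total = sum(token_counts.values())
print({domain: count / total for domain, count in token_counts.items()})
```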

Reference

  1. [2110.10372] Distributionally Robust Classifiers in Sentiment Analysis (Stanford Course Project Report).
  2. Distributionally Robust Finetuning BERT for Covariate Drift in Spoken Language Understanding (Broscheit et al., ACL 2022): This paper is one of the few papers I could find that applies DRO to an NLP model; the problem the authors address here is mitigating spurious correlations (i.e., improving robustness) in a cascade of text and token classification models.

    The standard ERM (a.k.a. MLE) assumes a single data distribution and therefore weights all losses equally. In contrast, DRO minimizes the maximum (i.e., worst-case) expected loss over a set of distributions; this set is modeled using prior knowledge (a minimal formulation is sketched after this list).

  3. [1810.08750] Learning Models with Uniform Performance via Distributionally Robust Optimization
  4. Distributionally Robust Language Modeling (Oren et al., EMNLP-IJCNLP 2019): The main paper extensively cites this paper. The goal of this paper is to train a language model on a mixture of K sources \cup_{i=1}^K \mathcal{D}_i without degrading the performance on each domain's test set; it is a practical application of [3] to language modeling.

    This setting may be useful because (1) each \mathcal{D}_i alone may not be large enough to train the model, and (2) the authors observe that training on the data mixture degrades the performance on each domain's test set compared to training on a smaller in-domain dataset.

  5. [1911.08731] Distributionally Robust Neural Networks for Group Shifts: On the Importance of Regularization for Worst-Case Generalization (ICLR ’20; 1K citations). This paper fine-tunes BERT using DRO on the MNLI dataset; the paper also experiments on the image datasets.
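
To make the ERM-versus-DRO contrast in [2] concrete, here is a minimal formulation in my own notation (not taken from the papers): ERM minimizes the expected loss under a single data distribution $p$, while DRO minimizes the worst-case expected loss over an uncertainty set $\mathcal{Q}$ of distributions built from prior knowledge (e.g., the $K$ domains in [4]).

$$
\theta^{\mathrm{ERM}} = \arg\min_{\theta} \; \mathbb{E}_{(x, y) \sim p}\big[\ell(\theta; x, y)\big],
\qquad
\theta^{\mathrm{DRO}} = \arg\min_{\theta} \; \max_{q \in \mathcal{Q}} \; \mathbb{E}_{(x, y) \sim q}\big[\ell(\theta; x, y)\big].
$$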

Basics | A Quick Reference of the Evaluation Metrics in NLP

Overview

The evaluation metrics I discuss in this post are organized by the typical tasks in NLP (see [1]); they are:

  • Sequence Classification
  • Token Classification (Tagging)
  • Generation: This category includes all tasks whose output is a sequence of tokens, for example question answering, machine translation, text summarization, text simplification, and paraphrasing.
  • Retrieval
  • Regression

There is also a section dedicated to basic statistics, such as correlations, confidence intervals, and p-values.

Basic Statistics

Correlation

  • Choice of Correlation Measures

    We should choose Spearman correlation unless Pearson correlation is absolutely necessary (answer):

    • Pearson correlation is a parametric test for symmetric linear association; it has more stringent requirements.
    • Spearman correlation is a non-parametric test for monotonicity. It has lower requirements on the data: it can handle data that are not normally distributed, data with outliers, and ordinal or categorical data. (A small comparison sketch follows this list.)
  • Number of Observations

    • The number of observations influences the confidence interval; a smaller number of observations makes the confidence interval wider. However, a small number of observations is not a problem per se; one real-life example is determining whether a new drug is effective on a small group of human subjects, where there may be only 5 or 6 people involved in the study.
    • Bootstrapping will not “turn a sow’s ear into a silk purse”: it only reduces confidence intervals (or significance level); it does not change correlation values (answer).
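
A small comparison sketch with scipy (the toy data are my own): y is a monotonic but non-linear function of x, so Spearman is close to 1 while Pearson is noticeably lower.

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

x = np.arange(1, 21)
y = np.exp(x) + np.random.default_rng(0).normal(scale=0.1, size=x.size)  # monotonic, non-linear

r_p, p_p = pearsonr(x, y)
r_s, p_s = spearmanr(x, y)
print(f"Pearson  r={r_p:.3f} (p={p_p:.3g})")
print(f"Spearman r={r_s:.3f} (p={p_s:.3g})")
```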

Sequence Classification

Confusion Matrix

The entry \mathbf{C}_{ij} is the number of samples of true class i that receive prediction j; the rows are the true classes while the columns are the predictions.

  • When there are only two classes, we could define \mathrm{TN} = \mathbf{C}_{11}, \mathrm{FP}=\mathbf{C}_{12}, \mathrm{FN}=\mathbf{C}_{21}, and \mathrm{TP}=\mathbf{C}_{22}:

    Based on these 4 numbers, we can define

    • $\mathrm{TPR}$, $\mathrm{FPR}$, $\mathrm{FNR}$, and $\mathrm{TNR}$: these are obtained by normalizing the confusion matrix by the number of samples in each true class.
    • Precision P and Recall R: we could compute these two numbers for each class; they are important in diagnosing a classifier’s performance.

| Notation | Formula |
|---|---|
| $\mathrm{TNR}$ | $\frac{\mathrm{TN}}{\mathrm{N}}$ |
| $\mathrm{FNR}$ | $\frac{\mathrm{FN}}{\mathrm{P}}$ |
| $\mathrm{FPR}$ | $\frac{\mathrm{FP}}{\mathrm{N}}$ |
| $\mathrm{TPR}$ | $\frac{\mathrm{TP}}{\mathrm{P}}$ |
| $P$ | $\frac{\mathrm{TP}}{\mathrm{TP}+\mathrm{FP}}$ |
| $R$ | $\frac{\mathrm{TP}}{\mathrm{TP}+\mathrm{FN}}$ |

Here $\mathrm{P}=\mathrm{TP}+\mathrm{FN}$ and $\mathrm{N}=\mathrm{TN}+\mathrm{FP}$ are the numbers of actual positives and negatives, so each rate is normalized by its true class size.

import numpy as np
from sklearn.metrics import confusion_matrix

y_true = np.array([1, 0, 1, 0, 1])
y_pred = np.array([1, 0, 1, 1, 0])

# raw counts
tn, fp, fn, tp = confusion_matrix(
    y_true=y_true,
    y_pred=y_pred
).ravel()

print(tn, fp, fn, tp)
# output: 1 1 1 2

# tnr, fpr, fnr, tpr
tnr, fpr, fnr, tpr = confusion_matrix(
    y_true=y_true,
    y_pred=y_pred,
    normalize="true",
).ravel()

print(tnr, fpr, fnr, tpr)
# output: 0.5 0.5 0.3333333333333333 0.6666666666666666
# i.e., TNR=1/2, FPR=1/2, FNR=1/3, TPR=2/3 (each row is normalized by its true class count)

  • We could use the code below to visualize a confusion matrix. Following the example above, we have:
import pandas as pd
import seaborn as sns

# class labels used for both axes of the confusion matrix
labels = [0, 1]

df = pd.DataFrame(
    confusion_matrix(y_true, y_pred),
    index=labels,
    columns=labels,
)

sns.heatmap(df, annot=True)

Matthews Correlation Coefficient (MCC)

The Matthews Correlation Coefficient (MCC) (also called the \phi statistic) is a special case of Pearson's correlation applied to two binary variables.

The range of MCC is [-1, 1]: 1 indicates perfect prediction, 0 random prediction, and -1 completely inverted prediction. It is more informative than the F1 score (and accuracy) because it does not share their majority-class bias: it additionally takes TN into account (recall that P=\frac{TP}{TP+FP}, R=\frac{TP}{TP+FN}, and F1=\frac{2PR}{P+R}=\frac{2TP}{2TP+FP+FN}, so TN never appears in F1).

Concretely, consider the following two examples (verified with scikit-learn in the sketch after this list):

  • Example 1: Consider a dataset with 10 samples and y =(1, 1, \cdots, 1, 0) and \hat{y} = (1, 1, \cdots , 1, 1), then the F1=0.9474, MCC=0.
  • Example 2: Consider the use case A1 detailed in [3], given a dataset of 91 positive samples and 9 negative samples, suppose all 9 negative samples are misclassified and 1 positive sample is misclassified (i.e. TP=90, TN=0, FP=9, FN=1), then F1=0.95 while MCC=-0.03.
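
A minimal verification sketch with scikit-learn; the arrays below simply encode the counts given in the two examples.

```python
from sklearn.metrics import f1_score, matthews_corrcoef

# Example 1: 9 positives + 1 negative, everything predicted positive.
y1_true, y1_pred = [1] * 9 + [0], [1] * 10
print(f1_score(y1_true, y1_pred), matthews_corrcoef(y1_true, y1_pred))
# ~0.947 and 0.0 (sklearn returns 0 for the undefined case discussed below)

# Example 2: TP=90, TN=0, FP=9, FN=1.
y2_true = [1] * 91 + [0] * 9
y2_pred = [1] * 90 + [0] + [1] * 9
print(f1_score(y2_true, y2_pred), matthews_corrcoef(y2_true, y2_pred))
# ~0.947 and ~-0.03
```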

A potential issue when computing MCC is that the metric is undefined whenever its denominator \sqrt{(TP+FP)(TP+FN)(TN+FP)(TN+FN)} is zero, for example when y is all 0 or all 1. A similar issue arises for the F1 score: since F1=\frac{2P\cdot R}{P+R}=\frac{2TP}{2TP+FP+FN}, it is undefined when TP, FP, and FN are all zero, e.g., when all labels and all predictions are 0.

To better compare MCC and F1, consider a dataset with 100 samples and plot F1 against MCC for different combinations of TP, FP, TN, and FN obtained by simulation. The dots colored red are those that (1) achieve an F1 score above 95% and (2) correspond to a dataset with more than 95% positive samples; their MCC is relatively low. A simulation sketch follows.
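
A minimal simulation sketch along these lines; the sampling scheme (a random class balance and a random flip rate per run) is my own choice, not a canonical one.

```python
import numpy as np
from sklearn.metrics import f1_score, matthews_corrcoef

rng = np.random.default_rng(0)
n, records = 100, []
for _ in range(5000):
    n_pos = rng.integers(1, n)                       # random class balance
    y_true = np.array([1] * n_pos + [0] * (n - n_pos))
    flip = rng.random(n) < rng.uniform(0.0, 0.2)     # random misclassification mask
    y_pred = np.where(flip, 1 - y_true, y_true)
    records.append((
        f1_score(y_true, y_pred, zero_division=0),
        matthews_corrcoef(y_true, y_pred),
        n_pos / n,
    ))

f1s, mccs, pos_ratios = map(np.array, zip(*records))
red = (f1s > 0.95) & (pos_ratios > 0.95)             # the red dots described above
print(f"#red={red.sum()}, mean MCC of red dots={mccs[red].mean():.3f}")
# A scatter plot of (f1s, mccs) colored by `red` reproduces the described figure.
```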

Ranking

Typical ranking metrics include Mean Reciprocal Rank (MRR), Mean Average Precision (MAP), Precision, and Normalized Discounted Cumulative Gain (NDCG). NDCG is more comprehensive than the other metrics as it considers the positions of the relevant items.

| Metric | Formula | Note |
|---|---|---|
| MRR | $\frac{1}{N} \sum_{i=1}^N \frac{1}{r_i}$ | $r_i$ is the rank of the first relevant item for query $i$. |
| MAP@k | $\frac{1}{N} \sum_{i=1}^N \mathrm{AP}@k(i)$ | $\mathrm{AP}@k(i) = \frac{1}{\#\ \text{relevant items in top-}k} \sum_{j=1}^{k} \mathrm{P}@j(i) \cdot \mathbb{1}(j\ \text{is relevant})$. |
| P@k | $\frac{1}{N} \sum_{i=1}^N \mathrm{P}@k(i)$ | $\mathrm{P}@k(i)$ is the ratio of relevant items among the top $k$ items for query $i$. |
| NDCG@k | $\frac{\mathrm{DCG}@k}{\mathrm{IDCG}@k}$ | $\mathrm{DCG}@k=\sum_{j=1}^{k} \frac{\mathrm{rel}_j}{\log_2 (j+1)}$; $\mathrm{IDCG}@k$ is the $\mathrm{DCG}@k$ of the ideally ordered ranking list. |

Suppose there are two queries q1 and q2 whose returned documents have the following relevance lists (1 for relevant, 0 for irrelevant):

q1 = [1, 0, 1, 0, 1]
q2 = [0, 0, 1, 1, 0]

Then we have the following results:

| | 1 | 2 | 3 | 4 | 5 | AP@5 | DCG@5 | IDCG@5 |
|---|---|---|---|---|---|---|---|---|
| P@k for $q_1$ | 1 | 1/2 | 2/3 | 1/2 | 3/5 | $\frac{1}{3}\times (1 + \frac{2}{3} + \frac{3}{5}) = 0.756$ | $1 + \frac{1}{\log_2 4}+\frac{1}{\log_2 6}$ | $1+\frac{1}{\log_2 3}+\frac{1}{\log_2 4}$ |
| P@k for $q_2$ | 0 | 0 | 1/3 | 1/2 | 2/5 | $\frac{1}{2}\times (\frac{1}{3} + \frac{1}{2})=0.417$ | $\frac{1}{\log_2 4}+\frac{1}{\log_2 5}$ | $1+\frac{1}{\log_2 3}$ |

Then, we have the following aggregated metrics (a small verification sketch follows this list):

  • P@5: \frac{1}{2} \times (\frac{3}{5} + \frac{2}{5}) = 0.5.
  • MAP@5: \frac{1}{2} \times (0.756 + 0.417)=0.587
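
A short verification sketch; the helpers are my own minimal implementations of the formulas in the table above.

```python
import numpy as np

def precision_at_k(rel, k):
    return sum(rel[:k]) / k

def average_precision_at_k(rel, k):
    hits = [precision_at_k(rel, j + 1) for j in range(k) if rel[j]]
    return sum(hits) / len(hits) if hits else 0.0

def dcg_at_k(rel, k):
    return sum(r / np.log2(j + 2) for j, r in enumerate(rel[:k]))

def ndcg_at_k(rel, k):
    idcg = dcg_at_k(sorted(rel, reverse=True), k)
    return dcg_at_k(rel, k) / idcg if idcg > 0 else 0.0

q1, q2 = [1, 0, 1, 0, 1], [0, 0, 1, 1, 0]

print(np.mean([precision_at_k(q, 5) for q in (q1, q2)]))          # P@5    = 0.5
print(np.mean([average_precision_at_k(q, 5) for q in (q1, q2)]))  # MAP@5  ≈ 0.59
print(np.mean([ndcg_at_k(q, 5) for q in (q1, q2)]))               # NDCG@5 ≈ 0.73
```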

Reference

  1. [2107.13586] Pre-train, Prompt, and Predict: A Systematic Survey of Prompting Methods in Natural Language Processing: This survey gives clear categories of NLP tasks: GEN, CLS, and TAG.
  2. Confusion matrix – Wikipedia: A comprehensive overview of a list of related metrics.
  3. The advantages of the Matthews correlation coefficient (MCC) over F1 score and accuracy in binary classification evaluation | BMC Genomics | Full Text (Chicco and Jurman).
  4. Calculate Pearson Correlation Confidence Interval in Python | Zhiya Zuo: The author writes a function that outputs Pearson correlation, p-value, and confidence intervals.
  5. sample size – Pearson correlation minimum number of pairs – Cross Validated