Research Notes | Debugging Machine Learning Models

Overview

The knowledge edited in this paper takes the form of triplets. Given the prompt "Eiffel Tower is located in the city of," the original model outputs "Paris" as expected. After model editing, however, the output could be a different token with high probability, for example, "Seattle."

Suppose we have an input x whose original output is y := \mathcal{M}(x). If we apply some intervention to \mathcal{M}(\cdot) and expect the future output to be y', we require the editing to be reliable, local, and general (a small sketch of how these properties could be checked follows the list):

  • Reliable: The edited model should output y’ with a high probability.
  • Local: The output of anything semantically different from x should not change.
  • General (or Consistent): The output of anything semantically equivalent to x should also change.
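
The following is a minimal sketch of how these three desiderata could be probed with next-token probabilities, assuming a Hugging Face causal LM; the probe prompts and targets are hypothetical, and the unedited gpt2 stands in for `edited_model`, which should be replaced by the output of an actual editing method:

```python
# Sketch: probe reliability, generality, and locality of an edit via next-token probabilities.
# `edited_model` is a placeholder; substitute the model returned by an editing method.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def target_prob(model, tokenizer, prompt: str, target: str) -> float:
    """Probability the model assigns to `target` as the continuation of `prompt`."""
    ids = tokenizer(prompt, return_tensors="pt").input_ids
    target_ids = tokenizer(" " + target, return_tensors="pt").input_ids[0]
    prob = 1.0
    for tid in target_ids:
        with torch.no_grad():
            logits = model(ids).logits[0, -1]
        prob *= torch.softmax(logits, dim=-1)[tid].item()
        ids = torch.cat([ids, tid.view(1, 1)], dim=-1)
    return prob

tokenizer = AutoTokenizer.from_pretrained("gpt2")
original_model = AutoModelForCausalLM.from_pretrained("gpt2")
edited_model = original_model  # placeholder for the actually edited model

edit_prompt, new_target = "The Eiffel Tower is located in the city of", "Seattle"
paraphrases = ["The city in which the Eiffel Tower stands is"]   # generality probes
unrelated = ["The Space Needle is located in the city of"]       # locality probes

print("reliable:", target_prob(edited_model, tokenizer, edit_prompt, new_target))
for p in paraphrases:
    print("general:", target_prob(edited_model, tokenizer, p, new_target))
for p in unrelated:  # locality: the edited and original models should agree here
    print("local:", target_prob(edited_model, tokenizer, p, "Seattle"),
          "vs", target_prob(original_model, tokenizer, p, "Seattle"))
```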

The community seems to focus on editing encoder-decoder or decoder-only models ([12] and [13]) due to their ability to generate text, while encoder-only models receive less interest even though MEND and TransformerPatcher both study them. For example, the paper [13] mentions the following:

> Previous studies typically used smaller language models (<1B) and demonstrated the effectiveness of current editing methods on smaller models like BERT (Devlin et al., 2019). However, whether these methods work for larger models is still unexplored. Hence, considering the editing task and future developments, we focus on generation based models and choose larger ones: T5-XL (3B) and GPT-J (6B), representing both encoder-decoder and decoder-only structures.

The editing methods can be compared by whether the model parameters are modified. There are several scenarios:

  1. Model Parameters are Unchanged
  2. Model Parameters are Unchanged, but there are Additional Parameters
  3. Model Parameters are Changed: This could be done using either (1) locating-and-editing, or (2) meta-learning with a separate hypernetwork.

| Method | Category | Note |
| --- | --- | --- |
| ENN | 3 | |
| KnowledgeEditor | 3 | |
| MEND | 3 | |
| SERAC | 1 | |
| ROME | 3 | |
| MEMIT | 3 | |
| TransformerPatcher | 2 | |
| KnowledgeNeuron | 3 | |
| MQuAKE | 1 | |
| IKE | 1 | |
| MemPrompt | 1 | |

ROME

KnowledgeNeuron

KnowledgeEditor

MEND

TransformerPatcher

MEMIT

Experiments

Datasets

The canonical tasks of model editing include fact-checking on FEVER and QA on the zsRE dataset.

  • For FEVER, the editing dataset is based on the original input and flipped label.
  • For zsRE, the editing dataset is based on the original input and an answer that is not top-1.

| Paper | Fact Checking | QA | Generation | Note |
| --- | --- | --- | --- | --- |
| MEMIT [1] | N/A | zsRE and CounterFact | N/A | There are two intermediate works, ROME and SERAC, but they are omitted as the best model is MEMIT. |
| MEND [6] | Binary FEVER | zsRE | Wikitext | The first two tasks are chosen the same as De Cao et al.; Wikitext is an additional dataset. |
| KnowledgeEditor [5] | Binary FEVER | zsRE | N/A | |
| Constrained Fine-Tuning [3] | N/A | zsRE and T-REx | N/A | |
| ENN [4] | N/A | N/A | N/A | This early work experiments on CIFAR-10 and MT tasks. |

Additional Notes

  • The RDF triplet may be the most unambiguous way to express instances of a specification; it is a classical way to represent knowledge and can be bidirectionally converted to and from a SQL database (Wikipedia).
  • The overarching research field is called "mechanistic interpretability."
  • Knowledge editing is thought to be difficult because knowledge is stored distributionally rather than symbolically. However, the paper [2] finds that the localization is quite concentrated in MLPs; the authors focus on MLPs because they believe attention is too complicated to study.
  • MLPs store information while attention gathers information: the information "Seattle" already resides in one specific location of GPT-2 before the prompt "the Space Needle is located at" is asked.
  • Model editing is different from adversarial attacks: the former tries to change the model while the latter tries to change the input data. However, model editing could have dual uses beyond model patching, for example, engineering an LM that always generates non-factual content.
  • One limitation of model editing is that we can only update singleton facts; we cannot update higher-level content, for example, specifications and political leanings.

Reference

Kevin Meng and David Bau have published a series of works ([1] and [2]) on knowledge editing for transformers. [3] through [6] are the predecessors to the proposed work; they could at most scale to 75 edits.

  1. [2210.07229] Mass-Editing Memory in a Transformer (MEMIT system).
  2. [2202.05262] Locating and Editing Factual Associations in GPT (ROME system).
  3. [2012.00363] Modifying Memories in Transformer Models: This paper is the first to study the problem of fact editing in transformers. The authors propose to fine-tune the model's first and last transformer blocks on the modified facts \mathcal{D} _ M while constraining the parameters to a small space:
    \min _ {\theta \in \Theta} \frac{1}{m} \sum _ {x \in \mathcal{D} _ M} L(x;\theta) \quad \text{s.t.} \quad \Vert \theta - \theta _ 0 \Vert \leq \delta
  4. [2004.00345] Editable Neural Networks (Sinitsin et al., ICLR 2020) (ENN system): This paper is the first to apply meta-learning to model editing; it is a precursor to the follow-up works [5], [6], and [7]. In addition, it makes the following important observation:

    • The goal of model editing is to quickly patch critical mistakes made by a neural model. The problem precludes (1) retraining with an augmented dataset, because it is slow, and (2) a manual cache, because it does not adapt to diverse input changes.
  5. Editing Factual Knowledge in Language Models (De Cao et al., EMNLP 2021) (KnowledgeEditor system): The authors observe that the previous methods [3] and [4] have the following limitations in their edited models:

    • Unreliable Edits: For sentences different from x, the behavior changes even though it should not.
    • Inconsistent Edits: For sentences semantically equivalent to x, the behavior does not change even though it should.

    Furthermore, the method [4] also requires expensive retraining.

  6. [2110.11309] Fast Model Editing at Scale (Mitchell et al.) (MEND system): This paper improves on De Cao et al. by editing models at the scale of 10B parameters. On smaller models, ENN is better than KnowledgeEditor. The code base of this work also implements ENN and KnowledgeEditor for comparison.
  7. [2206.06520] Memory-Based Model Editing at Scale (Mitchell et al.) (SERAC system): The authors do not release code for SERAC.
  8. Transformer Feed-Forward Layers Are Key-Value Memories (Geva et al., EMNLP 2021): This paper helps the main paper constrain the editing target to the MLP layers.
  9. Knowledge Neurons in Pretrained Transformers (Dai et al., ACL 2022) (KnowledgeNeuron system)
  10. [2305.12740] Can We Edit Factual Knowledge by In-Context Learning? (Zhang et al.)
  11. [2301.09785] Transformer-Patcher: One Mistake Worth One Neuron (Huang et al., ICLR 2023): This paper proposes to add one neuron to the last FFN layer; this neuron activates when the same error is seen again so that the error is corrected. Their experiments include both an encoder-only model (BERT) and an encoder-decoder model (BART).
  12. [2308.07269] EasyEdit: An Easy-to-use Knowledge Editing Framework for Large Language Models (Wang et al.)
  13. [2305.13172] Editing Large Language Models: Problems, Methods, and Opportunities (Yao et al., EMNLP 2023): This paper, together with the above paper introducing the EasyEdit library, provides a comprehensive survey and a Python library for knowledge editing. We could stick to these two papers and only read the original papers when necessary.
  14. From Pretraining Data to Language Models to Downstream Tasks: Tracking the Trails of Political Biases Leading to Unfair NLP Models (Feng et al., ACL 2023)
  15. [2305.14795] MQuAKE: Assessing Knowledge Editing in Language Models via Multi-Hop Questions (Zhong et al.)
  16. Memory-assisted prompt editing to improve GPT-3 after deployment (Madaan et al., EMNLP 2022)

The following are other useful references:

Research Notes | Writing

Phrase Bank

Alliteration

Alliteration is a literary device that involves the repetition of initial consonant sounds in a sequence of words, and it is often used for stylistic or rhetorical purposes to create rhythm, emphasize key ideas, or make phrases more memorable.

Although this focused work completely aligns, addresses, and adheres to the guidelines for a short-paper in this venue, we have not performed any experiments on data outside this privacy policy domain.

Research Notes | Generalizable Hate Speech Detection

Overview

This post summarizes the following methods, which rank at the top of the CivilComments-WILDS benchmark:

| Rank | Method | Paper |
| --- | --- | --- |
| 1 | FISH | [2104.09937] Gradient Matching for Domain Generalization (Shi et al., ICLR 2022) |
| 2, 3 | IRMX | [2206.07766] Pareto Invariant Risk Minimization: Towards Mitigating the Optimization Dilemma in Out-of-Distribution Generalization (Chen et al., ICLR 2023) |
| 4 | LISA | [2201.00299] Improving Out-of-Distribution Robustness via Selective Augmentation (Yao et al., ICML 2022) |
| 5 | DFR | [2204.02937] Last Layer Re-Training is Sufficient for Robustness to Spurious Correlations (Kirichenko et al., ICLR 2023) |
| 6, 8 | Group DRO | |
| 7, 12 | Reweighting | [1901.05555] Class-Balanced Loss Based on Effective Number of Samples (Cui et al., CVPR 2019) is one example that uses this method; the reweighting method dates back to much earlier works. |

Reweighting, IRM, and CORAL

IRM [2] and CORAL [3] are two extensions of the basic reweighting method that add a penalty term on top of the reweighted loss; this term is based on some measure of the discrepancy between data representations from different domains and encourages the representations of different domains to be similar.
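
The shared structure could be sketched as follows; this is a toy example assuming two domains, per-sample reweighting, and a CORAL-style penalty on feature means and covariances (IRM's penalty is different and gradient-based; see [2] and [3] for the actual formulations):

```python
# Sketch of "reweighted loss + cross-domain representation penalty" (CORAL-style).
# Everything here (encoder, data, penalty weight) is a toy placeholder.
import torch
import torch.nn as nn

def coral_penalty(feat_a: torch.Tensor, feat_b: torch.Tensor) -> torch.Tensor:
    """Distance between first- and second-order feature statistics of two domains."""
    mean_diff = (feat_a.mean(0) - feat_b.mean(0)).pow(2).sum()
    cov_diff = (torch.cov(feat_a.T) - torch.cov(feat_b.T)).pow(2).sum()
    return mean_diff + cov_diff

encoder = nn.Sequential(nn.Linear(32, 64), nn.ReLU())   # toy feature extractor
classifier = nn.Linear(64, 2)
ce = nn.CrossEntropyLoss(reduction="none")

# Toy batches from two domains with per-sample weights (e.g., class-balanced reweighting).
x_a, y_a, w_a = torch.randn(16, 32), torch.randint(0, 2, (16,)), torch.ones(16)
x_b, y_b, w_b = torch.randn(16, 32), torch.randint(0, 2, (16,)), torch.ones(16)

feat_a, feat_b = encoder(x_a), encoder(x_b)
reweighted = (w_a * ce(classifier(feat_a), y_a)).mean() + (w_b * ce(classifier(feat_b), y_b)).mean()
loss = reweighted + 1.0 * coral_penalty(feat_a, feat_b)  # penalty weight is a hyperparameter
loss.backward()
```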

Reference

  1. [2012.07421] WILDS: A Benchmark of in-the-Wild Distribution Shifts
  2. [1907.02893] Invariant Risk Minimization (Arjovsky et al.)
  3. [2007.01434] In Search of Lost Domain Generalization (Gulrajani and Lopez-Paz)

Research Notes | Research Questions

Overview

Here I document a list of general research questions that warrant searching, reading, thinking, and rethinking.

General Topics

Model Capacity

  • What is the broadly applicable measure of model capacity similar to a hardware performance benchmark that helps practitioners pick up the suitable model to start building their applications?

    • Note: Model capacity mostly determines the performance upper bound of a model. The actual model performance may also be related to how the model is trained and with what set of hyperparameters.
    • Hypothesis: A straightforward choice is the number of parameters a model has. However, one may question the correlation between the parameter count and this measure, i.e., the parameter count may not be a valid proxy for the model capacity.

Generalization

  • Existence of Universal Generalization

    Specifically, suppose there are K texts existing in the world at time t, and they are all labeled by an oracle; if we fine-tune a bert-base-uncased classification model with k \ll K samples, is there any hope that this fine-tuned model performs reasonably well (needs a more precise definition) on all (K-k) remaining samples?

    • Experiment: We could only approximate the oracle with one of the most capable known models, such as GPT-4. We therefore have two datasets: (a) one with the original annotations and (b) the other with oracle (approximated by GPT-4) annotations. Could the model fine-tuned on dataset (b) generalize better than the one fine-tuned on (a)?
    • Question: Despite the generalization, could the fine-tuned model also inherit the bias (needs more precise definition) of GPT-4?

Text Classification, Annotation Bias, and Spurious Correlation

Does text classification work by relying on the spurious correlation (likely due to annotation bias, where the annotators take shortcuts to complete their assigned tasks as soon as possible) between a limited number of words in the input text and the output label? If so, is the better model indeed the model that better exploits the spurious correlation?

> - Hypothesis: If the K samples are all annotated by an oracle, then any *reasonably capable* model (needs more precise definition) can generalize well. 
> - Tentative Experiment: If we replace the words in the list with their hypernyms, will the system performance drop? (A small sketch follows.)
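
Below is a minimal sketch of this hypernym-substitution probe using WordNet; the shortcut word list, the example sentence, and the downstream classifier are hypothetical placeholders:

```python
# Sketch: replace suspected shortcut words with their WordNet hypernyms and compare
# classifier outputs on the original and perturbed texts.
# Requires nltk with the WordNet corpus downloaded: nltk.download("wordnet").
from nltk.corpus import wordnet as wn

def hypernym_of(word: str) -> str:
    """Lemma of the first hypernym of the word's first synset, or the word itself."""
    synsets = wn.synsets(word)
    if synsets and synsets[0].hypernyms():
        return synsets[0].hypernyms()[0].lemmas()[0].name().replace("_", " ")
    return word

shortcut_words = {"knife", "blood"}  # hypothetical shortcut list for some classifier
text = "he grabbed a knife and there was blood everywhere"
perturbed = " ".join(hypernym_of(w) if w in shortcut_words else w for w in text.split())
print(perturbed)  # feed both `text` and `perturbed` to the classifier and compare scores
```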

Lifelong and Continual Learning

Suppose we have a generalizable model at time t and we want the model to serve users indefinitely. What strategies make the model generalize well across time?

Data Distribution

In machine learning theory, we often encounter concepts such as "i.i.d." Understanding "distribution" for tabular data is straightforward: the list of variables forms a joint distribution that predicts the label y. However, what could be considered a "distribution" for texts is less clear. Note that this is possible for images, for example, modeling gray-scale images with a Dirichlet-multinomial distribution that predicts digits from 0 to 9.

Data Annotation

The labels in NLP tasks have different levels of subjectivity. For example, grammatical error correction is less subjective, sentiment classification is moderately subjective, and topics like hate speech, suicidal ideation [1], and empathy [2] are either extremely subjective or require expert knowledge.

The difficulty here is to mitigate the ambiguity during data annotation and make sure the information available in the texts matches well with the labels. Ideally, if we knew the true underlying label of a text, we could fine-tune any reasonably capable model to generalize well.

Data Selection and Data Pruning for Classification Tasks

As of 2023-10-11, I have not seen a single work on data selection for classification tasks, while there are plenty of works on optimizing the data mixture for language model pretraining. One likely reason is that the quality of classification datasets depends on both texts and labels; investigating label quality is hard.

Special Issues

Improving Machine Learning Models with Specifications

The difference between testing machine learning models and testing traditional software is that the action items for the latter are generally known after testing and can be executed with high precision, while for the former, we do not know how to improve the model.

Suppose the model can already achieve high accuracy on the standard test set (this is the train-to-train setting if we follow the WILDS paper), which means the model architecture, training objective, and hyperparameters are not responsible for the lower performance on the artificially created challenging set. The most straightforward way to improve the model performance is then data augmentation. The naive way is to blindly collect more data that is wishfully relevant as augmentation and hope that the performance improves.

Guided Data Augmentation

However, this blindness hampers the efficiency of improving the models: the only feedback signal is a single scalar (i.e., the failure rate) available after we have trained and evaluated the model; we should have a feedback signal before we train the model.

  • Unverified Hypothesis: the feedback signal is highly (inversely) correlated with the failure rate on the benchmark.

Formally, we have a list of specifications in the format of (s_1, D _ 1, D _ 1 ^ \text{heldout}), (s _ 2, D _ 2, D _ 2 ^ \text{heldout}), \cdots, the model \mathcal{M}_0 trained on D _ \text{train} does well on D _ \text{train} ^\text{heldout} but poorly on D _ 1 \cup D _ 2 \cup D _ 3 \cdots as indicated by failure rate \mathrm{FR}. We additionally have a new labeled dataset D _ \text{unused}. The goal is to sample D _ \text{unused} using (s_1,D _ 1 ^ \text{heldout}), (s _ 2, D _ 2 ^ \text{heldout}), \cdots: \mathrm{Sample}(D _ \text{unused}); we also have a random sample with same size \mathrm{RandomSample}(D _ \text{unused}) as baseline.

  • Note: The D _ i and D _ i ^ \text{heldout} are completely different. For example, if the specification s _ i is operationalized through templates, these two sets are disjoint in terms of templates. What we are certain about D _ i and D _ i ^ \text{heldout} is that they are ideally sufficient and necessary with respect to s _ i; practically, their semantic underspecification is low [3].
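
One possible way to operationalize \mathrm{Sample}(D _ \text{unused}) is sketched below; it assumes TF-IDF similarity to the held-out specification sets as the selection signal, and all texts are placeholders (embedding-based similarity or a classifier trained on the held-out sets would be drop-in alternatives):

```python
# Sketch of Sample(D_unused): score candidates by similarity to the held-out specification
# sets D_i^heldout and keep the top-n; RandomSample is the baseline. All data is illustrative.
import random
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

spec_heldout = {                       # D_1^heldout, D_2^heldout, ... (placeholders)
    "s1": ["example sentence covering specification one"],
    "s2": ["another sentence that operationalizes specification two"],
}
d_unused = ["candidate text a", "candidate text b", "candidate text c"]  # D_unused
n = 2

vec = TfidfVectorizer().fit(d_unused + sum(spec_heldout.values(), []))
cand = vec.transform(d_unused)
# Each candidate's score = max similarity to any specification's held-out examples.
scores = np.max(
    [cosine_similarity(cand, vec.transform(ex)).max(axis=1) for ex in spec_heldout.values()],
    axis=0,
)
sampled = [d_unused[i] for i in np.argsort(-scores)[:n]]   # Sample(D_unused)
random_sampled = random.sample(d_unused, n)                 # RandomSample(D_unused)
```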

There are a lot of things we could do with \mathrm{RandomSample}(D _ \text{unused}) and \mathrm{Sample}(D _ \text{unused}). For example

  • Fine-tuning a model from scratch using \mathrm{RandomSample}(D _ \text{unused}) \cup D _ \text{train}.
  • Patching the model using constrained fine-tuning [4] and related approaches.

Whichever method we choose, if we denote the model after the intervention with \mathrm{RandomSample}(D _ \text{unused}) as \mathcal{M} _ 1 and after the intervention with \mathrm{Sample}(D _ \text{unused}) as \mathcal{M} _ 2, we expect the following conditions to hold:

  • $D _ \text{train} ^ \text{heldout}$: $\mathcal{M} _ 0 \approx \mathcal{M} _ 1 \approx \mathcal{M} _ 2$.
  • $D _ 1 \cup D _ 2 \cup D _ 3 \cdots$: $\mathcal{M} _ 2 \ll \mathcal{M} _ 0$ and $\mathcal{M} _ 2 \ll \mathcal{M} _ 1$ in terms of failure rate. That is, the specification-following data selection improves over random selection on the specification-based benchmarks.
  • Assumption: The samples x _ {ij} are fully specified by the specification s _ i.
  • Note: If the annotations of a dataset strictly follow the annotation codebook, then the machine learning model learns the specifications in the codebook. The process described above is the reverse: we have a model that has already been trained by others; we want to use the model in a new application but do not want to, or cannot afford to, relabel the entire dataset. What is the minimal intervention we could apply to the dataset so that the model quickly meets our specifications?

Detecting Inconsistent Labels with Specifications

Following the previous problem setup, we have a list of specifications in the format of (s_1, D _ 1, D _ 1 ^ \text{heldout}), (s _ 2, D _ 2, D _ 2 ^ \text{heldout}), \cdots; each specification has an unambiguous label. Rather than augmenting the D _ \text{train} with additional data by selecting using either (1) D _ 1 ^ \text{heldout} \cup D _ 2 ^ \text{heldout} \cup \cdots itself or (2) a model trained on it, we aim to correct labels directly in D _ \text{train} which are inconsistent with specifications.

Specifically, we could do the following for train, validation, and test sets:

  • Note: It is important to note that the data splitting should happen before we correct labels; otherwise the scores between trials will not be comparable. An alternative is to use D _ 1 ^ \text{heldout} \cup D _ 2 ^ \text{heldout} \cup \cdots as the validation set so that all scores are comparable.
  • Step 1: Grouping the specifications by the binary labels (for example, 0 and 1).
  • Step 2: Using the queries corresponding to each label to rank the samples in D _ s; each sample in D _ s will receive an integer rank ranging from 0 to \vert D _ s \vert. For example, for a set of positive specifications S^+, this will lead to a matrix of shape (\vert D _ s\vert, \vert S^+ \vert).
  • Step 3: Merging the \vert S^+\vert (or \vert S^-\vert) ranking list into one list using some rank aggregation methods.
  • Step 4: Removing all samples of label 0 (or 1); the top-k remaining samples are the ones that should be corrected (see the sketch below).
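
The following is a sketch of one reading of Steps 2 through 4, assuming TF-IDF cosine similarity as the ranker and mean-rank (Borda-style) aggregation; the queries, samples, labels, and k are placeholders:

```python
# Sketch of Steps 2-4: rank training samples against each positive specification's query,
# aggregate the per-specification rankings by mean rank, then surface the top-k samples
# currently labeled 0 as candidates whose labels may be inconsistent with S^+.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

pos_spec_queries = ["query text for positive specification one",
                    "query text for positive specification two"]   # S^+
d_train = ["sample one", "sample two", "sample three", "sample four"]
labels = np.array([0, 1, 0, 0])
k = 2

vec = TfidfVectorizer().fit(d_train + pos_spec_queries)
sims = cosine_similarity(vec.transform(pos_spec_queries), vec.transform(d_train))
# Rank matrix of shape (|D_s|, |S^+|); rank 0 means most similar to that specification.
ranks = np.argsort(np.argsort(-sims, axis=1), axis=1).T
mean_rank = ranks.mean(axis=1)                                      # Step 3: aggregation
candidates = [i for i in np.argsort(mean_rank) if labels[i] == 0][:k]  # Step 4
print("indices whose label 0 may be inconsistent with S^+:", candidates)
```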

The main issue with this pipeline is that the number of corrected samples is strictly no more than k; retraining with only \frac{k}{\vert D _ \text{train}\vert} of the labels changed may not have a direct impact on the resulting model.

  • Note: This process is different from cleanlab, as the latter does not consider specifications (i.e., guaranteed uncorrupted labels). Their setting is useful in many ways, as their system only requires noisy labels and the predicted probabilities of each sample.

Reverse Engineering Queries Given Documents

For a DPR model trained on a large corpus (for example, facebook/dpr-ctx_encoder-single-nq-base and facebook/dpr-question_encoder-single-nq-base), if we have a list of documents D that are aligned with our goal (or true underlying query) q, is it possible to search for its approximated version \hat{q} that returns D as relevant documents with high probability?

A somewhat related problem called Doc2Query has been studied before; the difference is that those previous works use Doc2Query as a data augmentation approach (called document expansion in the IR community).

With vec2text, it may be possible to search for the best query in the embedding space using approaches like Projected Gradient Descent (PGD).
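
A rough sketch of the PGD idea is shown below, with random placeholders standing in for the DPR document embeddings; mapping the optimized embedding \hat{q} back to text, e.g., with vec2text, is not shown:

```python
# Rough sketch: search for a query embedding q_hat that scores high against a fixed set
# of document embeddings using projected gradient descent (PGD).
import torch
import torch.nn.functional as F

doc_embeddings = F.normalize(torch.randn(10, 768), dim=-1)   # stand-in for encoded D

q = F.normalize(torch.randn(768), dim=-1).requires_grad_(True)
radius, lr = 1.0, 0.1
for _ in range(100):
    relevance = (doc_embeddings @ q).mean()                   # average dot-product score
    (-relevance).backward()                                    # ascend on relevance
    with torch.no_grad():
        q -= lr * q.grad
        q *= torch.clamp(radius / q.norm(), max=1.0)           # project onto the L2 ball
    q.grad = None
print("final average relevance:", (doc_embeddings @ q).mean().item())
```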

Geometry of Embeddings and its Implication for Text Generation

This is based on the hypothesis that there exists a certain relation between the geometry of the embedding space and the semantic meaning of each point in that space. For example, sampling from a convex set leads to sentences that share similar high-level specifications.

Many recent works show that text embeddings may be anisotropic: the directions of word vectors are not evenly distributed across the space but rather concentrated in a narrow cone; this peculiarity may not be related to performance [7].

Retrieval Augmented LM

RALM could be useful in numerous ways.

  • Copyright: This is the idea of SiloLM, where the LM itself is fine-tuned with CC0 data. The copyrighted data is stored in a non-parametric database; these data could be incorporated into the inference process using RALM. With this design, the authors of the copyrighted texts could easily request removal.
  • Traceability: The retrieved samples serve as evidence to support the decisions made by the LM.
  • QA: When we would like to do QA over a large tabular database (for example, asking "what is the percentage of patients who have an allergy" against a large EHR database), RALM is the most natural way to incorporate the necessary information in the database into the inference process of an LLM. Previously, we needed to build a pipeline that first generates queries written in a formal language (for example, ElasticSearch queries) and then uses these generated queries to answer the question.

These benefits come from the complementary nature of non-parametric databases' high data fidelity and LMs' inference ability. Specifically, knowledge is stored distributionally in the LM, so it is not straightforward to retrieve the exact knowledge compared to using a non-parametric database. At the same time, the inference ability available in LMs is not available in other, smaller models.

HateModerate

  • Dataset Statistics


Label Inconsistency of Different Datasets

Given multiple datasets D_1, D_2, \cdots with the same input and output space \mathcal{X} \times \mathcal{Y} (for example, binary hate speech classification), is there a systematic approach that finds inconsistent labeling criteria? Specifically, if two similar sentences that belong to two datasets receive different labels, how do we explain the discrepancy in their underlying labeling criteria? This is preferably done in the form of FOL or natural language.

  • If we treat GPT-4 as an oracle and use it to annotate the samples from D _ 1, D _ 2, \cdots, we could obtain an accuracy vector of size \vert \mathcal{Y} \vert to characterize the label quality of each dataset (a small sketch follows this block). Note that, for comparison purposes, the datasets to be annotated should be made the same size and retain the original label distribution.

    Previously, it has been shown that a simple zero-shot prompt yields a binary label inconsistency rate ranging from 9% up to 36%; the datasets under study are 15 hate speech datasets (a uniform random sample of 200 samples per dataset) whose labels have been normalized to binary labels per each dataset's description.

    Note: The dataset label normalization process may be questionable.
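
The per-class agreement computation described above could look like the following; the labels below are synthetic placeholders standing in for the dataset and oracle annotations:

```python
# Sketch: per-class agreement between a dataset's labels and oracle (e.g., GPT-4) labels,
# giving an accuracy vector of size |Y|.
import numpy as np

rng = np.random.default_rng(0)
dataset_labels = rng.integers(0, 2, size=200)                  # labels from one D_i
oracle_labels = np.where(rng.random(200) < 0.8,                # synthetic oracle that
                         dataset_labels, 1 - dataset_labels)   # agrees ~80% of the time
accuracy_vector = np.array(
    [(oracle_labels[dataset_labels == y] == y).mean() for y in (0, 1)]
)
print("per-class agreement with the oracle:", accuracy_vector)
```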

Adversarial Attack on RLHF

We assume there is an underlying utility function U: \mathcal{Y} \rightarrow [-1, 1] that measures a response y's alignment to the input x: a response receives a high score when it is helpful, honest, and harmless.

  • One thing we could do is to investigate the relation between the ratio of reversed comparison pairs and the performance degradation on downstream tasks, such as HHH.
  • The comparison reversal is not uniformly adversarial to the downstream tasks. If U(y _ i) and U(y _ j) are very close, then reversing them is not as effective as reversing another pair where U(y _ i') and U(y _ j') are very different (see the sketch below).
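
A small sketch of this utility-gap heuristic with synthetic utilities; the number of pairs, the budget, and the utilities are placeholders:

```python
# Sketch: given pairwise comparisons with underlying utilities U(y), reverse the pairs
# with the largest utility gap first, up to a budget, and compare with random flipping.
import numpy as np

rng = np.random.default_rng(0)
utilities = rng.uniform(-1, 1, size=(100, 2))        # U(y_i), U(y_j) for 100 synthetic pairs
gaps = np.abs(utilities[:, 0] - utilities[:, 1])
budget = 10                                           # number of pairs to reverse

adversarial_flip = np.argsort(-gaps)[:budget]                      # most damaging reversals
random_flip = rng.choice(len(gaps), size=budget, replace=False)    # uniform baseline
print("mean reversed gap (adversarial):", gaps[adversarial_flip].mean())
print("mean reversed gap (random):     ", gaps[random_flip].mean())
```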

OOD for Reward Model in RLHF

The reward model r(x, y; \phi) is fixed when fine-tuning the LM with PPO. There may be some distribution shift between the two stages. At a high level, this may not be an issue, as the goal of RLHF is general enough (for example, HHH and Constitutional AI).

Applications of RLHF to Other Tasks

According to Hyung Won Chung, RLHF is the new paradigm for creating application-specific loss functions. It is therefore likely beneficial to abandon the traditional cross-entropy loss altogether and opt for RLHF.

Pairwise Regression

This is especially useful for highly abstract tasks like hate speech classification. For example, we could initialize an RM and use its normalized score in [0, 1] (for example, hatefulness) to fine-tune a hate speech regressor based on some open-source model. We could find a threshold on the validation set and then deploy the RM (with the threshold) to the testing environment. This idea is indeed pairwise regression; it is one of the three approaches (point-wise, pairwise, and list-wise) to learning to rank.
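
A minimal sketch of the pairwise objective follows, assuming a toy feature encoder and synthetic preference pairs; the Bradley-Terry-style logistic loss used here is one common choice for pairwise learning to rank:

```python
# Sketch of pairwise regression: train a scalar scorer f(x) so that the preferred
# (e.g., more hateful) item in each pair receives the higher score.
import torch
import torch.nn as nn

scorer = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 1))
optimizer = torch.optim.Adam(scorer.parameters(), lr=1e-3)

# Toy pairs: x_pos should receive a higher score than x_neg.
x_pos, x_neg = torch.randn(32, 128), torch.randn(32, 128)

for _ in range(100):
    s_pos, s_neg = scorer(x_pos).squeeze(-1), scorer(x_neg).squeeze(-1)
    loss = -nn.functional.logsigmoid(s_pos - s_neg).mean()   # Bradley-Terry pairwise loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# At deployment, squash scores to [0, 1] and pick a threshold on the validation set.
probs = torch.sigmoid(scorer(x_pos).squeeze(-1))
threshold = 0.5                                               # tuned on validation data
predictions = (probs > threshold).long()
```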

References

  1. ScAN: Suicide Attempt and Ideation Events Dataset (Rawat et al., NAACL 2022)
  2. A Computational Approach to Understanding Empathy Expressed in Text-Based Mental Health Support (Sharma et al., EMNLP 2020)
  3. Dealing with Semantic Underspecification in Multimodal NLP (Pezzelle, ACL 2023)
  4. [2012.00363] Modifying Memories in Transformer Models (Zhu et al.)
  5. cleanlab

    1. [1911.00068] Confident Learning: Estimating Uncertainty in Dataset Labels is the theoretical foundation of the cleanlab; this paper has a blog.
    2. [2103.14749] Pervasive Label Errors in Test Sets Destabilize Machine Learning Benchmarks is an application of the principle in the first paper to machine learning benchmarks; this paper has a blog.
  6. Doc2Query

    1. [1904.08375] Document Expansion by Query Prediction (Nogueira et al.)
    2. From doc2query to docTTTTTquery (Nogueira and Lin) and its associated GitHub.
    3. [2310.06816] Text Embeddings Reveal (Almost) As Much As Text (Morris et al., EMNLP 2023)
    4. Decoding a Neural Retriever’s Latent Space for Query Suggestion (Adolphs et al., EMNLP 2022)
  7. Is Anisotropy Truly Harmful? A Case Study on Text Clustering (Ait-Saada & Nadif, ACL 2023)

Research Notes | Manuscript Preparation in LaTeX

Overview

Computer science conferences have a high tolerance for style variability, which leads to stark variance in typesetting quality even for the final camera-ready version. Here is one such example: the paper [1] is typeset much better than [2], a random sample from the same conference in the same year. Because the latter is representative of almost all of the papers from that conference, the paper [1] will easily stand out.

Template

  • Some of the templates look more professional than others. Whenever possible, we should use such templates.

Fonts

  • Use the lmodern package by adding \usepackage{lmodern} to the preamble; this single command will significantly improve the first impression of the manuscript.

Graphics

Reference

  1. packages – Suggest a “nice” font family for my basic LaTeX template (text and math) – TeX – LaTeX Stack Exchange

Research Notes | Training Data Optimization

Problem Statement

Suppose we have a collection of datasets from K sources \mathcal{D} _ 1, \cdots, \mathcal{D} _ K. These K datasets have been unified regarding input and output spaces.

Now we split each \mathcal{D} _ i into train, validation, and test splits \mathcal{D} _ i ^ \text{train}, \mathcal{D} _ i ^ \text{val}, and \mathcal{D} _ i ^ \text{test}, and form the aggregated train, validation, and test sets as \mathcal{D}^\text{train} := \cup _ {i=1}^ K \mathcal{D} _ i^\text{train}, \mathcal{D}^\text{val} := \cup _ {i=1}^ K \mathcal{D} _ i^\text{val}, and \mathcal{D}^\text{test} := \cup _ {i=1}^ K \mathcal{D} _ i^\text{test}.

The learning problem varies depending on the quality of the datasets after (1) dataset collection and annotation by the authors of the different datasets, and (2) dataset unification when merging the K datasets into one. This is because:

  • If the labels are reliable, then this is a dataset selection problem. The motivation is to save computational resources when training on \mathcal{D} \subseteq \mathcal{D} ^ \text{train} while maintaining the same performance as a model trained on (1) each \mathcal{D} _ i,\ i \in [K], (2) \mathcal{D} ^ \text{train}, and (3) \mathrm{Sample}(\mathcal{D} ^ \text{train}) that matches the size of \mathcal{D}.

    In some special cases, another motivation for dataset selection is that we know the size of a sampled dataset (for example, from the dataset statistics described in a paper) but we are not sure exactly which samples were selected.

  • If the labels are not reliable, then the motivation is to prevent the low-quality labels from offsetting the benefits of a larger training dataset (rather than distilling a smaller dataset to save compute). We have three options:

| Index | Method | Type |
| --- | --- | --- |
| 1 | Reannotating the entire dataset. This could then be reduced to a dataset selection problem, as we now have more confidence in the relabeled dataset. | Offline |
| 2 | Identifying and removing unreliable labels, optionally using these samples as an unsupervised dataset. This is also reducible to a dataset selection problem, as in option 1. | Offline |
| 3 | Learning with the noisy labels as they are (LNL, as described in [1]); this requires the learning algorithm to explicitly consider the variability in label quality. | Online |

Note that there is a related topic called "dataset distillation" that one may easily confuse with dataset selection. The goal of dataset distillation is to create a synthetic dataset in the feature space based on the original one that matches the performance on the test set. Previous works show that it is possible to attain the original performance on MNIST ([3]) and IMDB ([4]) with synthetic datasets of size (surprisingly) 10 and 20, respectively.

Adaptive Data Selection

With the test sets finalized, we could now work on sampling training sets, i.e., choosing one specific \mathrm{Sample}(\cdot) function described above. The goal here is to sample the training set so that the scores on the test sets are maximized:

  • DSIR: Suppose we need to sample B batches of samples totaling K. We could start by randomly sampling the first batch and then call the DSIR algorithm for the subsequent batches until we have collected K samples. This should be done for each label (a simplified sketch follows).
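
A simplified sketch of this batched loop for one label is shown below; an add-one-smoothed unigram log-likelihood ratio stands in for DSIR's hashed n-gram importance weights, the already-selected samples are assumed to define the target distribution, and all data and sizes are placeholders (the actual algorithm is described in the DSIR paper [6] and its repository):

```python
# Simplified sketch of batched selection for one label: random first batch, then
# importance-weighted batches sampled via the Gumbel-top-k trick.
import numpy as np
from collections import Counter

def log_prob(text, counts, total, vocab_size):
    """Add-one-smoothed unigram log-probability of a text under a bag-of-words model."""
    return sum(np.log((counts[w] + 1) / (total + vocab_size)) for w in text.split())

def select_batch(pool, target_texts, batch_size, rng):
    tgt = Counter(w for t in target_texts for w in t.split())
    raw = Counter(w for t in pool for w in t.split())
    vocab_size = len(set(tgt) | set(raw))
    log_ratio = np.array([
        log_prob(t, tgt, sum(tgt.values()), vocab_size)
        - log_prob(t, raw, sum(raw.values()), vocab_size)
        for t in pool
    ])
    keys = log_ratio + rng.gumbel(size=len(pool))   # sample w/o replacement ~ exp(log_ratio)
    return [pool[i] for i in np.argsort(-keys)[:batch_size]]

rng = np.random.default_rng(0)
pool = [f"unlabeled candidate text {i}" for i in range(1000)]   # candidate pool for one label
selected = rng.choice(pool, size=50, replace=False).tolist()    # randomly sampled 1st batch
for _ in range(3):                                              # remaining batches until K samples
    chosen = set(selected)
    remaining = [t for t in pool if t not in chosen]
    selected += select_batch(remaining, selected, 50, rng)
```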

Reference

  1. NoisywikiHow: A Benchmark for Learning with Real-world Noisy Labels in Natural Language Processing (Wu et al., Findings 2023)
  2. [2202.01327] Adaptive Sampling Strategies to Construct Equitable Training Datasets (Cai et al., FAccT 2023)
  3. [2301.04272] Data Distillation: A Survey (Sachdeva and McAuley, JMLR).
  4. [1811.10959] Dataset Distillation (Wang et al.)
  5. [1910.02551] Soft-Label Dataset Distillation and Text Dataset Distillation (Sucholutsky and Schonlau, IJCNN 2020). This is the only paper referenced in [3] describing dataset distillation for texts. This paper is based on the very original data distillation objective proposed in [4].
  6. [2302.03169] Data Selection for Language Models via Importance Resampling (Xie et al.)
  7. [2305.10429] DoReMi: Optimizing Data Mixtures Speeds Up Language Model Pretraining (Xie et al.)
  8. [2306.11670] GIO: Gradient Information Optimization for Training Dataset Selection (Everaert and Potts): This paper has a similar setting to the DSIR paper [6]: we select new samples by minimizing their KL divergence with an existing set of unlabeled samples. The paper claims an advantage over DSIR in that the proposed algorithm requires fewer samples:

    Like GIO, these heuristic methods aim to select a subset of data that is higher quality and more relevant. However, they are either highly tailored to their particular tasks or they require very large numbers of examples (to develop classifiers or construct target probabilities). By contrast, GIO is task- and domain-agnostic, it can be applied plug-and-play to a new task and dataset, and it requires comparatively few gold examples X to serve as the target distribution.

Research Notes | A Benchmark for Hate Speech Detection

Overview

There does not exist a unified benchmark, such as GLUE, in the hate speech detection domain that provides a leaderboard-style performance comparison of different open-source hate speech classifiers. This prevents practitioners from making informed decisions when choosing which model to use for their own hate speech detection applications.

The benchmark will provide the following:

  • The entire training and validation sets for future study. However, the labels of the public test sets will not be released for benchmarking purposes; there will also be additional private test sets.
  • The ranking of the models based on the average aggregated metrics (for example, F1 score) on the public and private test sets.

Protocol

  • Step 1: Randomly select a test set and a validation set.

    The two datasets must be randomly selected for the following reasons:

    1. The distribution of the validation set will be similar to that of the test set. Using the randomly sampled validation set will help select the models that are more likely to perform well on the test set.
    2. This makes the two datasets independent from each other in terms of label distribution and source distribution. Throughout the experiments, the test and validation sets remain the same; this is helpful as we could see the (dis)advantages of each method in the wandb dashboard.
  • Step 2: Sampling the training set using (a) different data selection methods.
  • Step 3: Training or fine-tuning (b) different models with (c) different techniques for local improvements, for example, objective function, and regularization.
  • Step 4: Comparing different combinations of (a), (b), and (c). If we have m combinations and n test sets, then we will end up with a table of shape (m, n+1), where the first column lists all the combinations (see the sketch below).
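
A small sketch of assembling the (m, n+1) comparison table from Step 4 with pandas; the combination names, test set names, and scores are placeholders:

```python
# Sketch of the Step 4 comparison table: m combinations of (a, b, c) as rows and
# n test sets as columns, with the first column naming each combination.
import numpy as np
import pandas as pd

combinations = ["random+bert+ce", "dsir+bert+ce", "dsir+roberta+focal"]   # m = 3 (placeholders)
test_sets = ["public_test_1", "public_test_2", "private_test"]            # n = 3 (placeholders)
scores = np.random.default_rng(0).uniform(0.6, 0.9, size=(len(combinations), len(test_sets)))

table = pd.DataFrame(scores, columns=test_sets)
table.insert(0, "combination", combinations)   # resulting shape: (m, n + 1)
print(table.round(3))
```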

Candidate Datasets

Collected Datasets from Diverse Topics

The current data aggregation includes [1] through [5], where [5] only includes hate speech.

  1. Detecting East Asian Prejudice on Social Media (Vidgen et al., ALW 2020)
  2. [2005.12423] Racism is a Virus: Anti-Asian Hate and Counterspeech in Social Media during the COVID-19 Crisis (He et al.)
  3. [2108.12521] TweetBLM: A Hate Speech Dataset and Analysis of Black Lives Matter-related Microblogs on Twitter (Kumar et al.)
  4. Hate Towards the Political Opponent: A Twitter Corpus Study of the 2020 US Elections on the Basis of Offensive Speech and Stance Detection (Grimminger & Klinger, WASSA 2021)
  5. Latent Hatred: A Benchmark for Understanding Implicit Hate Speech (ElSherief et al., EMNLP 2021)

cardiffnlp/twitter-roberta-base-hate-latest Collection

The following are the datasets used for the model cardiffnlp/twitter-roberta-base-hate-latest and the paper below:

Robust Hate Speech Detection in Social Media: A Cross-Dataset Empirical Evaluation (Antypas & Camacho-Collados, WOAH 2023)

| Index | Dataset Name | Source | Notes |
| --- | --- | --- | --- |
| 1 | HatE | Link that requires filling in a Google form. | |
| 2 | MHS | ucberkeley-dlab/measuring-hate-speech | |
| 3 | DEAP | Zenodo | |
| 4 | CMS | Link that requires registration and email verification. | |
| 5 | Offense | Link; this dataset is also called OLID. | |
| 6 | HateX | hatexplain and GitHub | |
| 7 | LSC | GitHub | Dehydrated |
| 8 | MMHS | nedjmaou/MLMA_hate_speech and GitHub | |
| 9 | HASOC | Link that requires uploading a signed agreement; this agreement takes up to 15 days to approve. | Not Available |
| 10 | AYR | GitHub | Dehydrated |
| 11 | AHSD | GitHub | |
| 12 | HTPO | Link | |
| 13 | HSHP | GitHub | Dehydrated |

The following are the papers that correspond to the list of datasets:

  1. SemEval-2019 Task 5: Multilingual Detection of Hate Speech Against Immigrants and Women in Twitter (Basile et al., SemEval 2019)
  2. The Measuring Hate Speech Corpus: Leveraging Rasch Measurement Theory for Data Perspectivism (Sachdeva et al., NLPerspectives 2022)
  3. Detecting East Asian Prejudice on Social Media (Vidgen et al., ALW 2020)
  4. [2004.12764] “Call me sexist, but…”: Revisiting Sexism Detection Using Psychological Scales and Adversarial Samples (Samory et al.)
  5. Predicting the Type and Target of Offensive Posts in Social Media (Zampieri et al., NAACL 2019)
  6. [2012.10289] HateXplain: A Benchmark Dataset for Explainable Hate Speech Detection (Mathew et al.)
  7. [1802.00393] Large Scale Crowdsourcing and Characterization of Twitter Abusive Behavior (Founta et al.)
  8. Multilingual and Multi-Aspect Hate Speech Analysis (Ousidhoum et al., EMNLP-IJCNLP 2019)
  9. [2108.05927] Overview of the HASOC track at FIRE 2020: Hate Speech and Offensive Content Identification in Indo-European Languages (Mandal et al.)
  10. Are You a Racist or Am I Seeing Things? Annotator Influence on Hate Speech Detection on Twitter (Waseem, NLP+CSS 2016)
  11. [1703.04009] Automated Hate Speech Detection and the Problem of Offensive Language (Davidson et al.)
  12. Hate Towards the Political Opponent: A Twitter Corpus Study of the 2020 US Elections on the Basis of Offensive Speech and Stance Detection (Grimminger & Klinger, WASSA 2021)
  13. Hateful Symbols or Hateful People? Predictive Features for Hate Speech Detection on Twitter (Waseem & Hovy, NAACL 2016)

It is possible to approximate a subset of the original training mixture (8 of 12 datasets, excluding the MMHS dataset, which only includes hate speech) following Table 2 of the original paper. Some things to note:

  • AYR, HASOC, HSHP, and LSC are not usable.
  • Offense does not exactly match the sizes in Table 2.
  • We disregard any splits and try to match the numbers in Table 2. When matching the numbers is not possible, we try to make sure the ratio of non-hate versus hate is the same.

Additional Datasets from hatespeechdata.com

The following are the additional datasets from hatespeechdata.com that are not included in the above-mentioned sources. The dataset names are either taken from the original papers or created here for easy reference.

| Index | Dataset Name | Source | Notes |
| --- | --- | --- | --- |
| 1 | AbuseEval | GitHub | The Offense dataset above reannotated for non-hate, implicit, and explicit hate; only IDs are available. Around 87% of the hate/non-hate labels are the same as the previous Offense dataset. |
| 2 | SWAD | GitHub | |
| 3 | ALONE | | Not usable. Requires contacting authors. |
| 4 | HatefulUsersTwitter | GitHub and Kaggle | Available but not relevant. This dataset is about detecting whether a user is hateful or neutral on the Twitter network; it does not come with annotated hateful/benign texts. |
| 5 | MMHS150K | Website | Not usable. Multimodal dataset. |
| 6 | HarassmentLexicon | GitHub | Not usable. Lexicons only. |
| 7 | P2PHate | GitHub | Not usable. Dehydrated. |
| 8 | Golbeck | | Not usable. Requires contacting jgolbeck@umd.edu. |
| 9 | SurgeAI | Website | Hateful content only. |
| 10 | TSA | Kaggle | The dataset is provided by Analytics Vidhya. The test.csv does not come with labels. |

  1. I Feel Offended, Don’t Be Abusive! Implicit/Explicit Messages in Offensive and Abusive Language (Caselli et al., LREC 2020): The dataset from this paper is also called AbuseEval v1.0.
  2. Do You Really Want to Hurt Me? Predicting Abusive Swearing in Social Media (Pamungkas et al., LREC 2020)
  3. [2008.06465] ALONE: A Dataset for Toxic Behavior among Adolescents on Twitter (Wijesiriwardene et. al.)
  4. [1803.08977] Characterizing and Detecting Hateful Users on Twitter (Ribeiro et al., ICWSM 2018)
  5. [1910.03814] Exploring Hate Speech Detection in Multimodal Publications (Gomez et al., WACV 2020)
  6. [1802.09416] A Quality Type-aware Annotated Corpus and Lexicon for Harassment Research (Rezvan et al.)
  7. [1804.04649] Peer to Peer Hate: Hate Speech Instigators and Their Targets (ElSherief et al.)
  8. A Large Labeled Corpus for Online Harassment Research (Golbeck et al., WebSci 2017)
  9. Twitter Hate Speech Dataset (Surge AI)
  10. Twitter Sentiment Analysis (Kaggle)