Research Notes | Writing

Phrase Bank

Alliteration

Alliteration is a literary device that involves the repetition of initial consonant sounds in a sequence of words, and it is often used for stylistic or rhetorical purposes to create rhythm, emphasize key ideas, or make phrases more memorable.

Although this focused work completely aligns, addresses, and adheres to the guidelines for a short paper in this venue, we have not performed any experiments on data outside the privacy policy domain.

Reading Notes | Retrieval Enhanced Data Augmentation for Question Answering on Privacy Policies

[Semantic Scholar] – [Code] – [Tweet] – [Video] – [Slide]

Change Logs:

  • 2023-10-05: First draft. This paper appears at EACL 2023; its arXiv identifier is 2204.08952. The code is not released.

Overview

Paraphrasing and back-translation are only applicable to texts whose meaning is insensitive to small changes. However, privacy policies can convey wildly different meanings with only small differences in wording, which makes these two techniques less applicable to the problem studied here.

Method

The authors propose a coarse-to-fine architecture for retrieval-based data augmentation. It consists of an ensemble of retrieval and filter models built on three encoders: (1) regular BERT, (2) PBERT, a BERT fine-tuned with the MLM objective on privacy policies, and (3) PBERT further fine-tuned with SimCSE.

  • Retrieval Model (Bi-Encoder): This is a typical structure proposed in [1].
  • Filter Model (Cross-Encoder): This is essentially a text classification model that takes a (query, retrieved sentence) pair and returns a binary decision.

Note that

  • The retrieval model and filter model are trained separately; they are not jointly trained in this work.
  • The ensemble is better described as three systems running in parallel, with the retrieved sentences aggregated at the end.

During inference, the top-k retrieved samples are filtered by the trained filter model; the aggregated retrieved texts are then combined with the original dataset to fine-tune the privacy QA model, as sketched below.
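
The following is a minimal sketch of the retrieve-then-filter idea using off-the-shelf sentence-transformers models as stand-ins; the paper's own checkpoints (BERT, PBERT, PBERT + SimCSE) are not released, so the model names, the top-k value, and the score threshold below are illustrative assumptions only.

```python
from sentence_transformers import SentenceTransformer, CrossEncoder, util

retriever = SentenceTransformer("all-MiniLM-L6-v2")                   # bi-encoder (retrieval model)
filter_model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")   # cross-encoder (filter model)

corpus = ["Policy sentence 1 ...", "Policy sentence 2 ..."]           # unlabeled privacy-policy sentences
query = "Does the company share my data with third parties?"

# Coarse step: retrieve the top-k candidates with the bi-encoder.
corpus_emb = retriever.encode(corpus, convert_to_tensor=True)
query_emb = retriever.encode(query, convert_to_tensor=True)
hits = util.semantic_search(query_emb, corpus_emb, top_k=10)[0]

# Fine step: keep only the candidates that the cross-encoder scores as relevant.
candidates = [corpus[h["corpus_id"]] for h in hits]
scores = filter_model.predict([(query, c) for c in candidates])
threshold = 0.5                                    # placeholder; depends on the model's score scale
augmented = [c for c, s in zip(candidates, scores) if s > threshold]
# `augmented` is then added to the original QA training data.
```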

Reference

  1. Dense Passage Retrieval for Open-Domain Question Answering (Karpukhin et al., EMNLP 2020) and HuggingFace.

Research Notes | Generalizable Hate Speech Detection

Overview

This post summarizes the following methods, which rank at the top of the CivilComments-WILDS benchmark:

| Rank | Method | Paper |
| --- | --- | --- |
| 1 | FISH | [2104.09937] Gradient Matching for Domain Generalization (Shi et al., ICLR 2022) |
| 2, 3 | IRMX | [2206.07766] Pareto Invariant Risk Minimization: Towards Mitigating the Optimization Dilemma in Out-of-Distribution Generalization (Chen et al., ICLR 2023) |
| 4 | LISA | [2201.00299] Improving Out-of-Distribution Robustness via Selective Augmentation (Yao et al., ICML 2022) |
| 5 | DFR | [2204.02937] Last Layer Re-Training is Sufficient for Robustness to Spurious Correlations (Kirichenko et al., ICLR 2023) |
| 6, 8 | Group DRO | |
| 7, 12 | Reweighting | [1901.05555] Class-Balanced Loss Based on Effective Number of Samples (Cui et al., CVPR 2019) is one example; reweighting dates back to much earlier works. |

Reweighting, IRM, and CORAL

IRM [2] and CORAL [3] extend the basic reweighting method by adding a penalty term on top of the reweighting loss; the penalty is based on some measure of how the data representations differ across domains and encourages the representations of different domains to be similar.
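
As a concrete illustration, below is a rough sketch of a CORAL-style penalty (assuming a DomainBed/WILDS-style formulation): it penalizes differences between the feature means and covariances of two domains and is added to the (reweighted) classification loss.

```python
import torch

def coral_penalty(feat_a: torch.Tensor, feat_b: torch.Tensor) -> torch.Tensor:
    """feat_a, feat_b: (n_a, d) and (n_b, d) representations from two domains."""
    mean_a, mean_b = feat_a.mean(0), feat_b.mean(0)
    cent_a, cent_b = feat_a - mean_a, feat_b - mean_b
    cov_a = cent_a.T @ cent_a / (feat_a.size(0) - 1)      # per-domain feature covariance
    cov_b = cent_b.T @ cent_b / (feat_b.size(0) - 1)
    return (mean_a - mean_b).pow(2).mean() + (cov_a - cov_b).pow(2).mean()

# total_loss = reweighted_ce_loss + lambda_coral * coral_penalty(features_domain_1, features_domain_2)
```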

Reference

  1. [2012.07421] WILDS: A Benchmark of in-the-Wild Distribution Shifts
  2. [1907.02893] Invariant Risk Minimization (Arjovsky et al.)
  3. [2007.01434] In Search of Lost Domain Generalization (Gulrajani and Lopez-Paz)

Reading Notes | Wild-Time – A Benchmark of in-the-Wild Distribution Shift over Time

[Semantic Scholar] – [Code] – [Tweet] – [Video] – [Website and Leaderboard] – [Slide] – [Lead Author]

Change Logs:

  • 2023-10-03: First draft. The authors provide 5 datasets (2 text classification datasets, 2 image classification datasets, and 1 EHR dataset) and more than 10 mitigation methods for distribution shift.

Experiments

  • The authors find that most of the mitigation methods are not effective compared to standard ERM on the proposed benchmark. Note that the SimCLR and SwAV methods are only applicable to the image classification tasks.


Additional Notes

From the excerpt below (quoted from the paper), we can see how the authors adapt the invariant learning approaches:

To address this challenge, we adapt the above invariant learning approaches to the temporal distribution shift setting. We leverage timestamp metadata to create a temporal robustness set consisting of substreams of data, where each substream is treated as one domain. Specifically, as shown in Figure 3, we define a sliding window G with length L. For a data stream with T timestamps, we apply the sliding window G to obtain T − L + 1 substreams. We treat each substream as a “domain” and apply the above invariant algorithms on the robustness set. We name the adapted CORAL, GroupDRO and IRM as CORAL-T, GroupDRO-T, IRM-T, respectively. Note that we do not adapt LISA since the intra-label LISA performs well without domain information, which is also mentioned in the original paper.

  • The way the authors apply the group algorithms looks questionable: it does not make sense to create artificial domains by grouping data from consecutive timestamps. This may be the reason why the authors do not observe performance gains.
  • LISA, which comes from the same authors, seems to be a good approach, as it does not require domain labels while performing competitively.
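
For concreteness, here is a small illustration (my own sketch, not the benchmark's code) of the sliding-window substream construction quoted above: a stream with T timestamps and a window of length L yields T - L + 1 substreams, each of which is treated as a "domain".

```python
def make_substreams(data_by_timestamp, window_length):
    """data_by_timestamp: list of per-timestamp datasets ordered by time (length T)."""
    T, L = len(data_by_timestamp), window_length
    return [data_by_timestamp[i:i + L] for i in range(T - L + 1)]

# Example: T = 5 timestamps and L = 3 give 3 substreams: [0:3], [1:4], [2:5].
substreams = make_substreams([f"D_t{t}" for t in range(5)], window_length=3)
```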

Reading Notes | Competency Problems – On Finding and Removing Artifacts in Language Data

[Semantic Scholar] – [Code] – [Tweet] – [Video] – [Website] – [Slide]

Change Logs:

  • 2023-09-26: First draft. The paper appears at EMNLP 2021.
  • The following is the main claim of the paper, as summarized in [1]:

[…] all correlations between labels and individual “input features” are spurious.

  • A spurious correlation is one that is useful in the training data but unreliable in general [1].

Reference

  1. Informativeness and Invariance: Two Perspectives on Spurious Correlations in Natural Language (Eisenstein, NAACL 2022): This paper theoretically updates the claim of the main paper: feature-label correlation is not related to whether the label is invariant to interventions on the feature.

    Practically, the paper suggests partial invariance (whether independent or not) for real-world datasets; for example, the sentiment of a movie review is invariant to the actor names. The paper also suggests the following options to improve model robustness:

    data augmentation, causally-motivated regularizers, stress tests, and “worst-subgroup” performance metrics (and associated robust optimizers) can be seen as enforcing or testing task-specific invariance properties that provide robustness against known distributional shifts (e.g., Lu et al., 2020; Ribeiro et al., 2020; Kaushik et al., 2021; Koh et al., 2021; Veitch et al., 2021). Such approaches generally require domain knowledge about the linguistic and causal properties of the task at hand — or to put it more positively, they make it possible for such domain knowledge to be brought to bear. Indeed, the central argument of this paper is that no meaningful definition of spuriousness or robustness can be obtained without such domain knowledge.

  2. On the Limitations of Dataset Balancing: The Lost Battle Against Spurious Correlations (Schwartz & Stanovsky, Findings 2022): This paper shows that creating a truly balanced dataset devoid of the issues mentioned in the main paper would also throw away the useful signals encoded in the texts (“throw the baby out with the bathwater”).

Research Notes | Research Questions

Overview

Here I document a list of general research questions that warrant searching, reading, thinking, and rethinking.

General Topics

Model Capacity

  • What is a broadly applicable measure of model capacity, similar to a hardware performance benchmark, that helps practitioners pick a suitable model to start building their applications?

    • Note: Model capacity mostly determines the performance upper bound of a model. The actual model performance is also related to how the model is trained and with which hyperparameters.
    • Hypothesis: A straightforward choice is the number of parameters a model has. However, one may question the correlation between the parameter count and this measure; that is, the parameter count may not be a valid proxy for model capacity.

Generalization

  • Existence of Universal Generalization

    Specifically, suppose there are K texts existing in the world at time t, and they are all labeled by an oracle. If we fine-tune a bert-base-uncased classification model with k \ll K samples, is there any hope that this fine-tuned model performs reasonably well (needs more precise definition) on the remaining (K-k) samples?

    • Experiment: We can only approximate the oracle with some of the most capable known models, such as GPT-4. We then have two datasets: (a) one with the original annotations and (b) one with oracle (GPT-4-approximated) annotations. Does the model fine-tuned on dataset (b) generalize better than the one fine-tuned on (a)?
    • Question: Beyond generalization, could the fine-tuned model also inherit the bias (needs more precise definition) of GPT-4?

Text Classification, Annotation Bias, and Spurious Correlation

Does text classification work by relying on the spurious correlation (likely due to annotation bias, where annotators take shortcuts to complete their assigned tasks as quickly as possible) between a limited number of words in the input text and the output label? If so, is the better model simply the model that better exploits this spurious correlation?

  • Hypothesis: If the K samples are all annotated by an oracle, then any reasonably capable model (needs more precise definition) can generalize well.
  • Tentative Experiment: If we replace the words in the list with their hypernyms, will the system performance drop? (See the sketch below.)
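
A tentative sketch of the hypernym-replacement probe using WordNet; the shortcut-word list and the replacement policy below are my own assumptions for illustration.

```python
import nltk
from nltk.corpus import wordnet as wn

nltk.download("wordnet", quiet=True)

def replace_with_hypernym(token: str) -> str:
    """Replace a token with the lemma of its first hypernym, if one exists."""
    synsets = wn.synsets(token)
    if not synsets or not synsets[0].hypernyms():
        return token
    return synsets[0].hypernyms()[0].lemmas()[0].name().replace("_", " ")

shortcut_words = ["fantastic", "terrible"]   # hypothetical label-correlated shortcut words
perturbed = {w: replace_with_hypernym(w) for w in shortcut_words}
# Re-evaluate the classifier on texts where each shortcut word is swapped for perturbed[w]
# and compare the accuracy against the unperturbed test set.
```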

Life Long and Continual Learning

Suppose we have a generalizable model at time t and we want the model to serve users indefinitely. What strategies make the model generalize well across time?

Data Distribution

In machine learning theory, we often encounter concepts such as “i.i.d.” Understanding a “distribution” for tabular data is straightforward: the list of variables forms a joint distribution that predicts the label y. However, what could be considered a “distribution” for texts is less clear. Note that this is possible for images, for example, by modeling gray-scale images with a Dirichlet-multinomial distribution that predicts digits from 0 to 9.

Data Annotation

The labels in NLP tasks have different levels of subjectivity. For example, grammatical error correction is less subjective, sentiment classification is moderately subjective, and topics like hate speech, suicidal ideation [1], and empathy [2] are either extremely subjective or require expert knowledge.

The difficulty here is to mitigate the ambiguity during data annotation and to make sure the information available in the texts matches well with the label. Ideally, if we knew the true underlying label of a text, we could fine-tune any reasonably capable model to generalize well.

Data Selection and Data Pruning for Classification Tasks

As of 2023-10-11, I have not seen a single work on data selection for classification tasks, while there are plenty of works on optimizing the data mixture for language model pretraining. One likely reason is that the quality of classification datasets depends on both the texts and the labels, and investigating label quality is hard.

Special Issues

Improving Machine Learning Models with Specifications

The difference between testing machine learning models and testing traditional software is that, for the latter, the action items are generally known after testing and can be executed with high precision, whereas for the former we do not know how to improve the model.

Suppose the model already achieves high accuracy on the standard test set (this is the train-to-train setting if we follow the WILDS paper), which means the model architecture, training objective, and hyperparameters are not responsible for the lower performance on the artificially created challenge set. Then the most straightforward way to improve the model is data augmentation. The naive way is to blindly collect more data that we hope are relevant and expect the performance to improve.

Guided Data Augmentation

However, this blindness hampers the efficiency of improving the models: the only feedback signal is a single scalar (i.e., the failure rate) obtained after we have trained and evaluated the model; we should have a feedback signal before we train the model.

  • Unverified Hypothesis: the feedback signal is highly (inversely) correlated with the failure rate on the benchmark.

Formally, we have a list of specifications in the format of (s_1, D _ 1, D _ 1 ^ \text{heldout}), (s _ 2, D _ 2, D _ 2 ^ \text{heldout}), \cdots. The model \mathcal{M}_0 trained on D _ \text{train} does well on D _ \text{train} ^\text{heldout} but poorly on D _ 1 \cup D _ 2 \cup D _ 3 \cdots, as indicated by the failure rate \mathrm{FR}. We additionally have a new labeled dataset D _ \text{unused}. The goal is to sample from D _ \text{unused} using (s_1, D _ 1 ^ \text{heldout}), (s _ 2, D _ 2 ^ \text{heldout}), \cdots, yielding \mathrm{Sample}(D _ \text{unused}); we also have a random sample of the same size, \mathrm{RandomSample}(D _ \text{unused}), as a baseline.

  • Note: The D _ i and D _ i ^ \text{heldout} are completely different. For example, if the specification s _ i is operationalized through templates, these two sets are disjoint in terms of templates. What we are certain about for D _ i and D _ i ^ \text{heldout} is that they are ideally sufficient and necessary with respect to s _ i; practically, their semantic underspecification is low [3].

There are many things we could do with \mathrm{RandomSample}(D _ \text{unused}) and \mathrm{Sample}(D _ \text{unused}). For example:

  • Fine-tuning a model from scratch using \mathrm{RandomSample}(D _ \text{unused}) \cup D _ \text{train}.
  • Patching the model using constrained fine-tuning [4] and related approaches.

Whichever method we choose, denote the model obtained with \mathrm{RandomSample}(D _ \text{unused}) as \mathcal{M} _ 1 and the one obtained with \mathrm{Sample}(D _ \text{unused}) as \mathcal{M} _ 2. We expect the following conditions to hold:

  • On D _ \text{train} ^ \text{heldout}: \mathcal{M} _ 0 \approx \mathcal{M} _ 1 \approx \mathcal{M} _ 2.
  • On D _ 1 \cup D _ 2 \cup D _ 3 \cdots (in terms of failure rate): \mathcal{M} _ 2 \ll \mathcal{M} _ 0 and \mathcal{M} _ 2 \ll \mathcal{M} _ 1. That is, the specification-guided data selection improves over the random selection on the specification-based benchmarks.
  • Assumption: The samples x _ {ij} are fully specified by the specification s _ i.
  • Note: If the annotations of a dataset strictly follow the annotation codebook, then the machine learning model learns the specifications in the codebook. The process described above is the reverse: we have a model that is already trained by others; we want to use the model in a new application but do not want to, or cannot afford to, relabel the entire dataset. What is the minimal intervention we could apply to the dataset so that the model quickly meets our specifications?
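
A hedged sketch of \mathrm{Sample}(D _ \text{unused}) under one simple operationalization (an assumption on my part, not a committed design): score each unused example by its maximum embedding similarity to the held-out specification sets and keep the top-n, with \mathrm{RandomSample} as the baseline. The encoder checkpoint is a stand-in.

```python
import random
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("all-MiniLM-L6-v2")

def spec_guided_sample(d_unused, heldout_sets, n):
    """d_unused: list of texts; heldout_sets: list of lists of texts, one per specification s_i."""
    spec_texts = [x for spec in heldout_sets for x in spec]
    spec_emb = encoder.encode(spec_texts, convert_to_tensor=True)
    unused_emb = encoder.encode(d_unused, convert_to_tensor=True)
    sims = util.cos_sim(unused_emb, spec_emb).max(dim=1).values   # max similarity to any specification example
    top = sims.argsort(descending=True)[:n]
    return [d_unused[i] for i in top.tolist()]

def random_sample(d_unused, n):
    return random.sample(d_unused, n)
```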

Detecting Inconsistent Labels with Specifications

Following the previous problem setup, we have a list of specifications in the format of (s_1, D _ 1, D _ 1 ^ \text{heldout}), (s _ 2, D _ 2, D _ 2 ^ \text{heldout}), \cdots; each specification has an unambiguous label. Rather than augmenting D _ \text{train} with additional data selected using either (1) D _ 1 ^ \text{heldout} \cup D _ 2 ^ \text{heldout} \cup \cdots itself or (2) a model trained on it, we aim to directly correct the labels in D _ \text{train} that are inconsistent with the specifications.

Specifically, we could do the following for train, validation, and test sets:

  • Note: It is important to note that the data splitting should happen before we correct labels; otherwise the scores between trials will not be comparable. An alternative is to use D _ 1 ^ \text{heldout} \cup D _ 2 ^ \text{heldout} \cup \cdots as the validation set so that all scores are comparable.
  • Step 1: Grouping the specifications by the binary labels (for example, 0 and 1).
  • Step 2: Using the queries corresponding to each label to rank the samples in D _ s; each sample in D _ s will receive an integer ranking ranging from 0 to \vert D _ s \vert. For example, for a set of positive specifications S^+, this will lead to a matrix of shape (\vert D _ s\vert, \vert S^+ \vert).
  • Step 3: Merging the \vert S^+\vert (or \vert S^-\vert) ranking list into one list using some rank aggregation methods.
  • Step 4: Removing all samples of label 0 (or 1). The top-k samples are the ones that should be corrected.

The main issue with this pipeline is that the number of corrected samples is strictly no more than k; retraining with only \frac{k}{\vert D _ \text{train}\vert} of the labels changed may not have a direct impact on the resulting model.

  • Note: This process is different from cleanlab, as the latter does not consider specifications (i.e., guaranteed uncorrupted labels). Their setting is useful in many ways, as their system only requires noisy labels and the predicted probabilities of each sample.
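
A minimal sketch of Steps 2-4 above using mean-rank (Borda-style) aggregation; the score matrix and the choice of aggregation method are placeholders rather than a committed design.

```python
import numpy as np

def flag_inconsistent(scores, labels, k):
    """
    scores: (num_samples, num_pos_specs) array; scores[i, j] measures how strongly sample i
            matches positive specification j (e.g., a retrieval or cross-encoder score).
    labels: (num_samples,) array of current dataset labels (0 or 1).
    Returns the indices of the top-k label-0 samples that best match the positive
    specifications, i.e., the candidates whose labels may need correction.
    """
    ranks = (-scores).argsort(axis=0).argsort(axis=0)   # Step 2: per-specification rank (0 = best match)
    agg = ranks.mean(axis=1)                            # Step 3: aggregate the |S^+| ranking lists
    candidates = np.where(labels == 0)[0]               # Step 4: keep only samples currently labeled 0
    return candidates[np.argsort(agg[candidates])][:k]
```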

Reverse Engineering Queries Given Documents

For a DPR model trained on a large corpus (for example, facebook/dpr-ctx_encoder-single-nq-base and facebook/dpr-question_encoder-single-nq-base), if we have a list of documents D that are aligned with our goal (or true underlying query) q, is it possible to search for an approximate query \hat{q} that returns D as relevant documents with high probability?

A somewhat related problem called Doc2Query has been studied before; the difference is that these previous works use Doc2Query as a data augmentation approach (called document expansion in the IR community).

With vec2text, it may be possible to search for the best query in the embedding space using approaches like Projected Gradient Descent (PGD).
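
The following is a minimal sketch (my own illustration, not a proposed method) of such an embedding-space search: optimize a query embedding that scores highly against a fixed set of DPR document embeddings while projecting it back onto the unit sphere after each step, a simple PGD-style constraint. Decoding the resulting vector back to text would require something like vec2text; note that real DPR retrieval uses unnormalized dot products, so the normalization here is a simplification.

```python
import torch
from transformers import DPRContextEncoder, DPRContextEncoderTokenizer

model_name = "facebook/dpr-ctx_encoder-single-nq-base"
ctx_tok = DPRContextEncoderTokenizer.from_pretrained(model_name)
ctx_enc = DPRContextEncoder.from_pretrained(model_name).eval()

docs = ["Document one ...", "Document two ..."]   # the set D aligned with the true underlying query q
with torch.no_grad():
    inputs = ctx_tok(docs, padding=True, truncation=True, return_tensors="pt")
    doc_emb = torch.nn.functional.normalize(ctx_enc(**inputs).pooler_output, dim=-1)

# Initialize the query embedding at the centroid and refine it with projected gradient steps.
q_hat = torch.nn.functional.normalize(doc_emb.mean(0), dim=0).clone().requires_grad_(True)
opt = torch.optim.SGD([q_hat], lr=0.1)
for _ in range(100):
    loss = -(doc_emb @ q_hat).mean()          # maximize the average similarity to all documents in D
    opt.zero_grad(); loss.backward(); opt.step()
    with torch.no_grad():                     # projection step: keep q_hat on the unit sphere
        q_hat.copy_(torch.nn.functional.normalize(q_hat, dim=0))
```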

Geometry of Embeddings and its Implication for Text Generation

This is based on the hypothesis that there exists a certain relation between the geometry of the embedding space and the semantic meaning of each point in that space. For example, sampling from a convex set may lead to sentences that share similar high-level specifications.

Many recent works show that text embeddings may be anisotropic: the directions of word vectors are not evenly distributed across the space but rather concentrated in a narrow cone; this peculiarity may not be related to performance [7].

Retrieval Augmented LM

RALM could be useful in numerous ways.

  • Copyright: This is the idea of SiloLM, where the LM itself is fine-tuned with CC0 data. The copyrighted data is stored in a non-parametric database; these data can be incorporated into the inference process using RALM. With this design, the authors of the copyrighted texts could easily request a removal.
  • Traceability: The retrieved samples serve as evidence to support the decisions made by the LM.
  • QA: When we would like to do QA over a large tabular database (for example, asking “what is the percentage of patients who have an allergy” against a large EHR database), RALM is the most natural way to incorporate the necessary information from the database into the inference process of an LLM. Previously, we needed to build a pipeline that first generates queries written in a formal language (for example, ElasticSearch queries) and then uses these queries to answer the question.

These benefits come from the complementary nature of non-parametric databases’ high data fidelity and LMs’ inference ability. Specifically, knowledge is stored distributionally in the LM, so it is not straightforward to retrieve the exact knowledge compared to using a non-parametric database. At the same time, the inference ability that LMs offer is not available in other, smaller models.

HateModerate

  • Dataset Statistics


Label Inconsistency of Different Datasets

Given multiple datasets D_1, D_2, \cdots with the same input and output space \mathcal{X} \times \mathcal{Y} (for example, binary hate speech classification), is there a systematic approach that finds inconsistent labeling criteria? Specifically, if two similar sentences that belong to two datasets receive different labels, how do we explain the discrepancy in their underlying labeling criteria? This is preferably done in the form of FOL or natural language.

  • If we treat GPT-4 as an oracle and use it to annotate the samples from D _ 1, D _ 2, \cdots, we could obtain an accuracy vector of size \vert \mathcal{Y} \vert to characterize the label quality of each dataset. Note that, for comparison purposes, the datasets to be annotated should be made the same size and retain their original label distributions.

    Previously, it has been shown that using a simple zero-shot prompt yields a binary label inconsistency rate from 9% up to 36%; the datasets under study are 15 hate speech datasets (a uniform random sample of 200 samples per dataset) whose labels have been normalized to binary labels according to each dataset’s description.

    Note: The dataset label normalization process may be questionable.
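
A small sketch of the accuracy vector described above (my own illustration): for each label y, the fraction of samples originally labeled y on which the oracle (e.g., GPT-4) annotation agrees.

```python
from collections import defaultdict

def per_label_accuracy(original_labels, oracle_labels):
    counts, agree = defaultdict(int), defaultdict(int)
    for y, y_hat in zip(original_labels, oracle_labels):
        counts[y] += 1
        agree[y] += int(y == y_hat)
    return {y: agree[y] / counts[y] for y in counts}

# Example: per_label_accuracy([0, 0, 1, 1], [0, 1, 1, 1]) -> {0: 0.5, 1: 1.0}
```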

Adversarial Attack on RLHF

We assume there is an underlying utility function U: \mathcal{Y} \rightarrow [-1, 1] that measures a response y’s alignment with the input x: a response receives a high score when it is helpful, honest, and harmless.

  • One thing we could do is to investigate the relation between the ratio of reversed comparison pairs and the performance degradation on downstream tasks, such as HHH.
  • The comparison reversal is not uniformly adversarial to the downstream tasks. If U(y _ i) and U(y _ j) are very close, then reversing them is not as effective as reversing another pair where U(y _ i') and U(y _ j') are very different.
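
A toy sketch of this non-uniform attack (for illustration only): reverse the comparison pairs with the largest utility gap \vert U(y _ i) - U(y _ j) \vert first, since reversing near-ties should barely affect the learned reward model.

```python
def reverse_largest_gap_pairs(pairs, ratio):
    """pairs: list of (U(y_i), U(y_j)) tuples; ratio: fraction of pairs to reverse."""
    order = sorted(range(len(pairs)), key=lambda k: abs(pairs[k][0] - pairs[k][1]), reverse=True)
    to_flip = set(order[: int(ratio * len(pairs))])   # indices of the largest-gap pairs
    return [(uj, ui) if k in to_flip else (ui, uj) for k, (ui, uj) in enumerate(pairs)]
```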

OOD for Reward Model in RLHF

The reward model r(x, y; \phi) is fixed when fine-tuning the LM with PPO, so there may be some distribution shift between the two stages. At a high level, this may not be an issue, as the goal of RLHF is general enough (for example, HHH and Constitutional AI).

Applications of RLHF to Other Tasks

According to Hyungwon Chung, RLHF is the new paradigm for creating application-specific loss functions. It is therefore likely beneficial to abandon the traditional cross-entropy loss altogether and opt for RLHF.

Pairwise Regression

This is especially useful for highly abstract tasks like hate speech classification. For example, we could initialize an RM and use the normalized score in [0, 1] (for example, hatefulness) to fine-tune a hate speech regressor based on some open-source model. We could find a threshold on the validation set and then deploy the RM (with that threshold) to the testing environment. This idea is indeed pairwise regression; it is one of the three approaches (point-wise, pairwise, and list-wise) to learning to rank.
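
A rough sketch of the pairwise idea (the encoder and data handling are placeholders): train a scalar scorer with the standard pairwise ranking (Bradley-Terry style) loss -\log \sigma(s _ \text{more} - s _ \text{less}), then pick a decision threshold on a validation set.

```python
import torch
import torch.nn.functional as F

class Scorer(torch.nn.Module):
    def __init__(self, encoder, hidden_dim):
        super().__init__()
        self.encoder = encoder                        # any open-source text encoder
        self.head = torch.nn.Linear(hidden_dim, 1)    # maps the pooled representation to a scalar score

    def forward(self, **inputs):
        pooled = self.encoder(**inputs).last_hidden_state[:, 0]   # [CLS] pooling
        return self.head(pooled).squeeze(-1)

def pairwise_loss(score_more_hateful, score_less_hateful):
    # Encourages the scorer to rank the more hateful text above the less hateful one.
    return -F.logsigmoid(score_more_hateful - score_less_hateful).mean()
```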

References

  1. ScAN: Suicide Attempt and Ideation Events Dataset (Rawat et al., NAACL 2022)
  2. A Computational Approach to Understanding Empathy Expressed in Text-Based Mental Health Support (Sharma et al., EMNLP 2020)
  3. Dealing with Semantic Underspecification in Multimodal NLP (Pezzelle, ACL 2023)
  4. [2012.00363] Modifying Memories in Transformer Models (Zhu et al.)
  5. cleanlab

    1. [1911.00068] Confident Learning: Estimating Uncertainty in Dataset Labels is the theoretical foundation of cleanlab; this paper has an accompanying blog post.
    2. [2103.14749] Pervasive Label Errors in Test Sets Destabilize Machine Learning Benchmarks applies the principle of the first paper to machine learning benchmarks; this paper also has an accompanying blog post.
  6. Doc2Query

    1. [1904.08375] Document Expansion by Query Prediction (Nogueira et al.)
    2. From doc2query to docTTTTTquery (Nogueira and Lin) and its associated GitHub.
    3. [2310.06816] Text Embeddings Reveal (Almost) As Much As Text (Morris et al., EMNLP 2023)
    4. Decoding a Neural Retriever’s Latent Space for Query Suggestion (Adolphs et al., EMNLP 2022)
  7. Is Anisotropy Truly Harmful? A Case Study on Text Clustering (Ait-Saada & Nadif, ACL 2023)

Reading Notes | Exploring and Predicting Transferability across NLP Tasks

[Semantic Scholar] – [Code] – [Tweet] – [Video] – [Website] – [Slide]

Change Logs:

  • 2023-09-16: First draft. This paper appears at EMNLP 2020.
  • Data selection strategy for best transfer learning performance.

Reference

  1. [1811.01088] Sentence Encoders on STILTs: Supplementary Training on Intermediate Labeled-data Tasks (Phang et al.)
  2. Identifying beneficial task relations for multi-task learning in deep neural networks (Bingel & Søgaard, EACL 2017)

Talk Notes | Lessons Learned from Analyzing Systems for Hate Speech Detection and Bias Mitigation by Sarah Masud

[YouTube] – [Personal Website]

  • The presenter has authored several interesting papers ([1] through [5]) on hate speech detection.

Notes

Status Quo of Hate Speech Detection

  • There are varying definitions of hate speech.
  • Labels related to hate speech include hate, offensive, toxic, and profane. There could also be more fine-grained categories, such as sexist, racist, and islamophobic.
  • Because of the reasons mentioned above, there is no standard leaderboard for hate speech detection.

Data Sources

We should pay attention to data bias: it is questionable to collect hate speech only from people and sites that are more likely to produce it. The authors propose to collect datasets from neutral sources; this design choice makes data annotation more difficult.

Annotations

Current approaches to hate speech annotation rely on people (crowdworkers or experts). The authors use a two-phase approach to ensure label quality.

Building Better Hate Speech Detection Models

  • The complexity of models does not necessarily help. It is more important to capture the signals that predict the final labels, for example, the history and the social network information. This observation also applies to other tasks that involve modeling social behaviors.
  • However, we should carefully monitor overfitting: spurious correlations between particular phrases and labels should not be the signals we allow the models to pick up. That is, the models should generalize even without the presence of these words.
  • In the work [2], the authors propose a system that considers not just the text information, but also the timeline and social network information. They merge the three sources of signal using an attention mechanism. However, we could see two limitations:
    • This design is specific to Twitter. Other platforms, such as Reddit, do not have this information with respect to users.
    • The best-performing system (M14) does not significantly outperform the baseline system, which simply fine-tunes mBERT (M8).


Lexical Bias

  • Replacing the bias-sensitive words with more general words is likely to shift the bias towards their WordNet ancestors. This hypothesis could be supported by a measurement called pinned bias (below), where t is a single word in the sensitive word list T.

pB _ T = \sum _ {t \in T} \frac{\vert p(\text{toxic} \vert t) - \phi \vert}{\vert T \vert}, \quad \phi = \min(p(\text{toxic} \vert t), 0.5)
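
A minimal sketch of this measurement (my own illustration, not the speaker's code); `toxic_prob` is a hypothetical function returning a classifier's P(toxic) for an input containing the word t.

```python
def pinned_bias(toxic_prob, sensitive_words):
    total = 0.0
    for t in sensitive_words:
        p = toxic_prob(t)      # P("toxic" | t) from the classifier
        phi = min(p, 0.5)      # probabilities at or below 0.5 contribute zero
        total += abs(p - phi)
    return total / len(sensitive_words)
```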

Horizons

The presenter has three high-level observations:

  • Like energy: Bias seems to transfer from one source to another.
  • Like a system at rest: A model or dataset will remain biased unless an external force (for example, mitigation or regularization) is applied.
  • Like interactive systems: A system evolves to become more chaotic over time. Toxicity needs to be monitored and mitigated in a continuous fashion.

Reference

  1. [2010.04377] Hate is the New Infodemic: A Topic-aware Modeling of Hate Speech Diffusion on Twitter (Masud et al., ICDE 2021): This paper presents a dataset called RETINA that focuses on hate speech in the Indian context.
  2. [2206.04007] Proactively Reducing the Hate Intensity of Online Posts via Hate Speech Normalization (Masud et al., KDD 2022)
  3. [2201.00961] Nipping in the Bud: Detection, Diffusion and Mitigation of Hate Speech on Social Media (Chakraborty and Masud)
  4. [2306.01105] Revisiting Hate Speech Benchmarks: From Data Curation to System Deployment (Masud et al., KDD 2023)
  5. [2202.00126] Handling Bias in Toxic Speech Detection: A Survey (Garg et al., CSUR).
  6. Language (Technology) is Power: A Critical Survey of “Bias” in NLP (Blodgett et al., ACL 2020)
  7. [2305.06626] When the Majority is Wrong: Modeling Annotator Disagreement for Subjective Tasks (Fleisig et al.)
  8. Handling Disagreement in Hate Speech Modelling | SpringerLink (Novak et al., IPMU 2022)
  9. [2001.05495] Stereotypical Bias Removal for Hate Speech Detection Task using Knowledge-based Generalizations (Badjatiya et al., WWW 2019).