Reading Notes | ToxiGen – A Large-Scale Machine-Generated Dataset for Adversarial and Implicit Hate Speech Detection

[Semantic Scholar] – [Code] – [Tweet] – [Video] – [Website] – [Slide]

Change Log

  • 2023-12-05: First draft.

Overview

The authors propose a method to automatically generate a balanced dataset (covering 13 identity groups, with both toxic and benign samples for each) of 14K (using ALICE) + 260K (using demonstrations) = 274K samples that contain no explicit slurs, based on the following two observations:

  • It is hard to collect implicit (hard) toxic content to augment the training sets of machine learning models, as overtly toxic content usually co-occurs with a small set of explicit words.
  • Furthermore, explicit mentions of some identity groups (for example, Muslim) or their language styles (for example, African-American English) are unfairly classified as toxic by existing models.

Method

ALICE

The authors incorporate a binary hate speech classifier’s score on the “hate” or “non-hate” class into the decoding process to encourage more hateful or more non-hateful generations given a prompt.

Originally, a hateful prompt leads to a hateful continuation. With the classifier in the loop, however, the continuation’s hatefulness is mitigated yet not reversed, yielding implicit hate speech (i.e., hard toxic content); a sketch of this classifier-in-the-loop decoding follows.
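A minimal sketch of the idea, not the authors’ exact ALICE implementation: at each decoding step, rerank the language model’s top-k candidate tokens by mixing the LM log-probability with the classifier’s log-probability of the desired class on the candidate continuation. The checkpoint names, the class index, and the mixing weight `alpha` are all assumptions.

```python
import torch
from transformers import (
    AutoModelForCausalLM,
    AutoModelForSequenceClassification,
    AutoTokenizer,
)

# Checkpoint names are assumptions; any causal LM and any binary
# hate/non-hate classifier could be plugged in here.
lm_tok = AutoTokenizer.from_pretrained("gpt2")
lm = AutoModelForCausalLM.from_pretrained("gpt2")
clf_tok = AutoTokenizer.from_pretrained("tomh/toxigen_hatebert")
clf = AutoModelForSequenceClassification.from_pretrained("tomh/toxigen_hatebert")

def guided_step(text: str, top_k: int = 20, alpha: float = 1.0,
                encourage_hate: bool = False) -> str:
    """Pick the next token by combining the LM's log-probs with the
    classifier's log-probability of the desired class on each candidate."""
    ids = lm_tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        next_logits = lm(ids).logits[0, -1]
    lm_log_probs = torch.log_softmax(next_logits, dim=-1)
    candidates = torch.topk(next_logits, top_k).indices
    target = 1 if encourage_hate else 0  # class index is an assumption
    scores = []
    for tok_id in candidates:
        continuation = text + lm_tok.decode(tok_id)
        enc = clf_tok(continuation, return_tensors="pt", truncation=True)
        with torch.no_grad():
            clf_log_probs = torch.log_softmax(clf(**enc).logits[0], dim=-1)
        scores.append(lm_log_probs[tok_id] + alpha * clf_log_probs[target])
    best = candidates[torch.stack(scores).argmax()]
    return lm_tok.decode(best)

text = "The people who live in that neighborhood"
for _ in range(20):  # greedy, classifier-guided decoding for 20 tokens
    text += guided_step(text)
print(text)
```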

Demonstration

The other method the authors propose is to manually collect implicit hate speech from the web and use it as demonstrations (few-shot prompts) to obtain more texts from GPT-3. This effort led to the 260K samples.
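A minimal sketch of demonstration-based generation: the placeholder demonstrations below are hypothetical stand-ins for the web-collected examples, and the OpenAI chat client stands in for the original GPT-3 completion endpoint.

```python
from openai import OpenAI

# Hypothetical placeholders for the manually collected implicitly
# toxic (or benign) statements about one identity group.
demonstrations = [
    "collected implicit statement 1",
    "collected implicit statement 2",
    "collected implicit statement 3",
]

# Few-shot prompt: list the demonstrations and let the model continue
# the pattern with new statements in the same style.
prompt = "\n".join(f"- {d}" for d in demonstrations) + "\n-"

client = OpenAI()
response = client.chat.completions.create(
    model="gpt-4o-mini",  # stand-in; the paper used GPT-3
    messages=[{"role": "user", "content": prompt}],
)
print(response.choices[0].message.content)
```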

Experiments

  • Data Augmentation with ToxiGen Improves Accuracy on OOD Test Sets

    The authors further fine-tune HateBERT and ToxDectRoBERTa on the collected dataset and test them on social_bias_frames, SALT-NLP/ImplicitHate, and aps/dynahate. They observe improved accuracy after fine-tuning; see the sketch below.

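A minimal sketch of this augmentation step with the HuggingFace Trainer; the dataset identifier, column names ("text", "label"), and hyperparameters are assumptions, not the authors’ exact setup.

```python
from datasets import load_dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

tok = AutoTokenizer.from_pretrained("GroNLP/hateBERT")
model = AutoModelForSequenceClassification.from_pretrained(
    "GroNLP/hateBERT", num_labels=2
)

# Dataset identifier and column names are assumptions about the
# released ToxiGen schema; adapt them to the actual fields.
ds = load_dataset("skg/toxigen-data", name="train", split="train")

def preprocess(batch):
    return tok(batch["text"], truncation=True, max_length=128)

ds = ds.map(preprocess, batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="hatebert-toxigen", num_train_epochs=1),
    train_dataset=ds,
    tokenizer=tok,
)
trainer.train()
```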

Reading Notes | Retrieval Enhanced Data Augmentation for Question Answering on Privacy Policies

[Semantic Scholar] – [Code] – [Tweet] – [Video] – [Slide]

Change Log

  • 2023-10-05: First draft. This paper appears at EACL 2023 (arXiv: 2204.08952). The code is not released.

Overview

Paraphrasing and back-translation are applicable only to texts whose meaning is robust to small surface changes. Privacy policies, however, can convey wildly different meanings after small edits, which makes these two techniques less suitable for the problem being studied.

Method

The authors propose a coarse-to-fine architecture for retrieval-based data augmentation. It consists of an ensemble of retrieval and filter models built from (1) regular BERT, (2) PBERT, a BERT fine-tuned with the MLM objective on privacy policies, and (3) PBERT further fine-tuned with SimCSE.

  • Retrieval Model (Bi-Encoder): This is the typical dense-retrieval structure proposed in [1].
  • Filter Model (Cross-Encoder): This is essentially a text classification model that takes a (query, retrieved sentence) pair and returns a binary decision; a sketch of the two-stage pipeline follows.
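A minimal sketch of the coarse-to-fine pipeline using sentence-transformers; the checkpoints, the toy corpus, and the acceptance threshold are illustrative stand-ins, not the paper’s PBERT-based models.

```python
from sentence_transformers import CrossEncoder, SentenceTransformer, util

bi_encoder = SentenceTransformer("all-MiniLM-L6-v2")  # coarse retriever
cross_encoder = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")  # fine filter

# Toy stand-in for the privacy-policy sentence corpus.
corpus = [
    "We may share your data with advertisers.",
    "Cookies are used to remember your preferences.",
    "You can request deletion of your account data.",
]
query = "Does the company share my information with third parties?"

# Coarse step: retrieve top-k candidates with the bi-encoder.
corpus_emb = bi_encoder.encode(corpus, convert_to_tensor=True)
query_emb = bi_encoder.encode(query, convert_to_tensor=True)
hits = util.semantic_search(query_emb, corpus_emb, top_k=2)[0]

# Fine step: keep only candidates the cross-encoder scores above a
# threshold (the 0.5 cutoff on the raw score is illustrative).
kept = [
    corpus[h["corpus_id"]]
    for h in hits
    if cross_encoder.predict([(query, corpus[h["corpus_id"]])])[0] > 0.5
]
print(kept)
```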

Note that

  • The retrieval model and filter model are trained separately; they are not jointly trained in this work.
  • The ensemble here consists of three such systems working in parallel, whose retrieved sentences are aggregated at the end.

During inference, the top-k retrieved samples are filtered by the trained filter model. The aggregated retrieved texts are then combined with the original dataset to fine-tune the privacy QA model.

Reference

  1. Dense Passage Retrieval for Open-Domain Question Answering (Karpukhin et al., EMNLP 2020); see also the HuggingFace implementation.