[Semantic Scholar] – [Code] – [Tweet] – [Video] – [Website] – [Slide]
Change Log
- 2023-12-05: First draft.
Overview
The authors propose a method to automatically generate a balanced dataset (covering 13 identity groups, with both toxic and benign samples) of 14K (via ALICE) + 260K (via demonstrations) = 274K samples that contain no explicit words, motivated by the following two observations:
- It is hard to collect hard (implicitly) toxic content to augment the training sets of machine learning models, because overtly toxic content usually co-occurs with a small set of explicit words.
- Furthermore, explicit mentions of some identity groups (for example, Muslim) or their language styles (for example, African-American English) are unfairly classified as toxic by existing models.
Method
ALICE
The authors incorporate a binary hate speech classifier’s score on the “hate” or “non-hate” class into the decoding process to encourage more hateful or more non-hateful generation given a prompt.
Originally, a hateful prompt leads to a hateful continuation. With the classifier in the loop, however, the continuation’s hatefulness is mitigated yet not reversed, yielding implicit hate speech (i.e., hard toxic content).
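A minimal sketch of this classifier-in-the-loop decoding is below, using greedy top-k rescoring as a simplification of the paper’s constrained beam search; the checkpoints (GPT-2 standing in for GPT-3, unitary/toxic-bert as the hate classifier), the label-index assumption, and the mixing weight alpha are all illustrative, not the authors’ exact setup.

```python
# Sketch of ALICE-style classifier-in-the-loop decoding (greedy top-k
# rescoring; the paper uses constrained beam search). Checkpoints and
# alpha are illustrative stand-ins.
import torch
from transformers import (AutoModelForCausalLM,
                          AutoModelForSequenceClassification, AutoTokenizer)

lm_tok = AutoTokenizer.from_pretrained("gpt2")          # stand-in for GPT-3
lm = AutoModelForCausalLM.from_pretrained("gpt2")
clf_tok = AutoTokenizer.from_pretrained("unitary/toxic-bert")  # stand-in classifier
clf = AutoModelForSequenceClassification.from_pretrained("unitary/toxic-bert")

@torch.no_grad()
def alice_generate(prompt, steps=30, top_k=20, alpha=2.0, encourage_hate=False):
    ids = lm_tok(prompt, return_tensors="pt").input_ids
    for _ in range(steps):
        logits = lm(ids).logits[0, -1]            # next-token logits
        cand = torch.topk(logits, top_k).indices  # only rescore the top-k tokens
        scores = []
        for tok in cand:
            text = lm_tok.decode(torch.cat([ids[0], tok.view(1)]))
            enc = clf_tok(text, return_tensors="pt", truncation=True)
            p_hate = torch.sigmoid(clf(**enc).logits[0, 0])  # assumed: logit 0 = "toxic"
            p = p_hate if encourage_hate else 1.0 - p_hate
            # Mix LM logit with the (log) classifier score for the target class.
            scores.append(logits[tok] + alpha * torch.log(p + 1e-8))
        nxt = cand[torch.argmax(torch.stack(scores))]
        ids = torch.cat([ids, nxt.view(1, 1)], dim=1)
    return lm_tok.decode(ids[0], skip_special_tokens=True)
```

With `encourage_hate=False` on a hateful prompt, this reproduces the “mitigated yet not reversed” effect described above: the classifier pulls the continuation away from overt hatefulness without flipping it to benign.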
Demonstration
Another method the authors propose is to manually collect implicit hate speech from the web and then use it as demonstrations to obtain more text from GPT-3. This effort leads to 260K samples.
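A hedged sketch of this demonstration-based generation, with GPT-2 standing in for GPT-3 and benign placeholder statements in place of the human-collected examples (the same mechanism, seeded with implicitly toxic demonstrations, would yield new implicit hate samples):

```python
# Few-shot demonstration prompting: format collected statements as a list
# and let the LM continue it. GPT-2 and the demo statements are stand-ins.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")  # stand-in for GPT-3

demos = [
    "many immigrants work hard to build a better life for their families",
    "immigrants have contributed to science, art, and industry for centuries",
    "most immigrants simply want safety and opportunity",
]
prompt = "\n- ".join([""] + demos) + "\n-"  # bullet-list prompt ending mid-item
out = generator(prompt, max_new_tokens=30, do_sample=True, top_p=0.9)
print(out[0]["generated_text"])
```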
Experiments
Data Augmentation with ToxiGen Improves Accuracy on OOD Test Sets
The authors further fine-tune HateBERT and ToxDectRoBERTa on the collected dataset and test them on social_bias_frames, SALT-NLP/ImplicitHate, and aps/dynahate. They observe improved accuracy after fine-tuning.
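A rough sketch of the augmentation setup, assuming the public GroNLP/hateBERT checkpoint and a toy stand-in for the ToxiGen training pairs (column names and hyperparameters are illustrative, not the paper’s configuration):

```python
# Further fine-tune an off-the-shelf hate-speech classifier on
# ToxiGen-style (text, label) pairs; evaluation on the OOD sets listed
# above would follow the same tokenize-and-predict pattern.
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

ckpt = "GroNLP/hateBERT"  # public HateBERT checkpoint
tok = AutoTokenizer.from_pretrained(ckpt)
model = AutoModelForSequenceClassification.from_pretrained(ckpt, num_labels=2)

# Toy stand-in for the 274K ToxiGen samples: 0 = benign, 1 = toxic.
train = Dataset.from_dict({
    "text": ["example benign statement", "example implicitly toxic statement"],
    "label": [0, 1],
})
train = train.map(
    lambda b: tok(b["text"], truncation=True, padding="max_length", max_length=64),
    batched=True,
)

args = TrainingArguments(output_dir="toxigen-ft", num_train_epochs=1,
                         per_device_train_batch_size=8, report_to=[])
Trainer(model=model, args=args, train_dataset=train).train()
```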