Talk Notes | Paraphrasing Evades Detectors of AI-generated Text, But Retrieval is an Effective Defense by Kalpesh Krishna @ Google

[YouTube] – [Personal Website]

  • The presenter is the author of multiple influential papers on topics such as paraphrasing and model extraction attacks.

Reference

  1. Reformulating Unsupervised Style Transfer as Paraphrase Generation (Krishna et al., EMNLP 2020)
  2. [1910.12366] Thieves on Sesame Street! Model Extraction of BERT-based APIs (Krishna et al., ICLR 2020)

Reading Notes | Using GPT-4 for Content Moderation

Method

This blog post illustrates an idea of human-AI collaboration in revising an existing content policy. Specifically,

  • Based on an initial policy P_0, a human expert may disagree with a moderation decision made by GPT-4.
  • The human expert then elicits suggestions from GPT-4 and revises the policy P_0 into P_1; this repeats until the human expert agrees with GPT-4's decisions.

The blog post does not clearly explain how either step is done, for example, (1) what prompt turns the general-purpose GPT-4 into a content moderator, (2) what prompt elicits feedback from GPT-4, and (3) how human experts turn GPT-4's feedback into concrete policy revisions.
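
The following is a minimal sketch of how such a loop could be wired up, assuming the OpenAI Python client; the model name, prompts, and helper functions are illustrative, since the blog post does not disclose them.

```python
# Minimal sketch of the policy-revision loop; not the blog post's actual prompts.
# Assumptions: the OpenAI Python client (openai>=1.0), the "gpt-4" model name,
# and all prompt wording below are illustrative.
from openai import OpenAI

client = OpenAI()


def moderate(policy: str, example: str) -> str:
    """Ask GPT-4 to label one example under the current policy."""
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system",
             "content": f"You are a content moderator. Apply this policy:\n{policy}"},
            {"role": "user",
             "content": f"Label the following text as ALLOW or FLAG and explain why:\n{example}"},
        ],
    )
    return response.choices[0].message.content


def suggest_revision(policy: str, example: str, expert_label: str) -> str:
    """Ask GPT-4 to suggest a policy revision that matches the expert's label."""
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "user",
             "content": (f"Policy:\n{policy}\n\nExample:\n{example}\n\n"
                         f"A human expert labels this example as {expert_label}. "
                         "Suggest a minimal revision of the policy so that this "
                         "label follows from it.")},
        ],
    )
    return response.choices[0].message.content


# Iterate P_0 -> P_1 -> ... until the expert agrees with GPT-4's decision.
# In the blog post, a human expert reviews and applies the suggested revision;
# here it is applied automatically for brevity.
policy = "P_0: disallow content that ..."  # placeholder initial policy
example, expert_label = "some borderline text", "FLAG"
while expert_label not in moderate(policy, example):
    policy = suggest_revision(policy, example, expert_label)
```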

Reading Notes | DoReMi – Optimizing Data Mixtures Speeds Up Language Model Pretraining

Overview

Other Information

  • Domain ratios should be computed using the number of tokens rather than the number of documents, even though different tokenizers may return slightly different ratios (see the sketch below).
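
A minimal sketch of counting domain ratios by tokens, assuming a Hugging Face tokenizer; the tokenizer choice and the `corpora` structure are illustrative, not from the paper.

```python
# Sketch: compute domain mixture ratios by token count rather than document count.
from collections import Counter
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # illustrative tokenizer

corpora = {
    "web": ["first web document ...", "second web document ..."],
    "code": ["def f(x): return x"],
}

token_counts = Counter()
for domain, documents in corpora.items():
    for doc in documents:
        token_counts[domain] += len(tokenizer(doc)["input_ids"])

total = sum(token_counts.values())
ratios = {domain: count / total for domain, count in token_counts.items()}
print(ratios)  # a different tokenizer may give slightly different ratios
```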

Reference

  1. [2110.10372] Distributionally Robust Classifiers in Sentiment Analysis (Stanford Course Project Report).
  2. Distributionally Robust Finetuning BERT for Covariate Drift in Spoken Language Understanding (Broscheit et al., ACL 2022): This paper is one of the few papers I could find that apply DRO to an NLP model; the problem the authors address here is mitigating spurious correlations (or improving robustness) in a cascade of text and token classification models.

    The standard ERM (i.e., MLE) assumes a single data distribution and therefore weights all losses equally. DRO instead minimizes the maximum (i.e., worst-case) expected loss over a set of distributions; this set is specified using prior knowledge. The two objectives are written out at the end of this reference list.

  3. [1810.08750] Learning Models with Uniform Performance via Distributionally Robust Optimization
  4. Distributionally Robust Language Modeling (Oren et al., EMNLP-IJCNLP 2019): The main paper extensively cites this paper. The goal of this paper is to train a language model on a mixture of K sources \cup_{i=1}^K \mathcal{D}_i without degrading the performance on each domain's test set; it is a practical application of [3] to language modeling.

    This setting may be useful because (1) each \mathcal{D}_i alone may not be large enough to train the model, and (2) the authors observe that training on the data mixture degrades the performance on each domain's test set compared to training on the smaller in-domain dataset.

  5. [1911.08731] Distributionally Robust Neural Networks for Group Shifts: On the Importance of Regularization for Worst-Case Generalization (ICLR 2020; 1K citations): This paper fine-tunes BERT using DRO on the MNLI dataset; the paper also experiments on image datasets.
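
To make the contrast in [2] and [4] concrete, the two objectives can be written as follows (standard formulations; the notation is mine, not copied from the papers):

ERM: \min_\theta \; \mathbb{E}_{(x, y) \sim \hat{p}} \left[ \ell(\theta; x, y) \right]

Group DRO: \min_\theta \; \max_{i \in \{1, \dots, K\}} \; \mathbb{E}_{(x, y) \sim \mathcal{D}_i} \left[ \ell(\theta; x, y) \right]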

Talk Notes | Building End-to-End Content Moderation Pipelines in the Real World

[Website] – [Paper] – [Blog]

Note:
– The presenter of this talk is the lead author of the paper A Holistic Approach to Undesired Content Detection in the Real World.

Change Logs:

  • 2023-08-29: First draft.

Overview

There are two main iterations in building an end-to-end content moderator.
– Annotation Iteration: OpenAI outsources most of the annotation work to external data providers. They also have internal expert annotators who provide the labels for the quality-control set.
– Main Iteration: This is the bulk of OpenAI's contribution.

Annotation Iteration

  • Labeling guidelines need to be clarified and updated multiple times as more and more edge cases surface. OpenAI's specifications are eventually turned into training materials that the labeling providers use to educate their annotators.
  • There should be sessions for
    • Calibrating the annotators by clarifying the annotation guidelines.
    • Auditing data that are flagged as harmful either by the annotators or by the model, and removing annotations from annotators with low per-category F1 scores; this process could be accelerated by cross-auditing with multiple annotators (a sketch of the auditing step follows this list).
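
A minimal sketch of the per-annotator auditing step, assuming binary per-category labels and an expert-labeled quality-control set; the data layout, the scikit-learn usage, and the 0.5 cutoff are illustrative assumptions, not from the talk.

```python
# Sketch: compare each annotator's labels against the expert quality-control set
# and compute per-category F1; annotators with low scores on a category get
# their annotations reviewed or removed.
from sklearn.metrics import f1_score

categories = ["sexual", "hate", "violence"]
expert_labels = {  # example_id -> {category -> 0/1}, from internal experts
    "ex1": {"sexual": 0, "hate": 1, "violence": 0},
    "ex2": {"sexual": 1, "hate": 0, "violence": 0},
}
annotator_labels = {  # annotator -> example_id -> {category -> 0/1}
    "annotator_a": {
        "ex1": {"sexual": 0, "hate": 1, "violence": 0},
        "ex2": {"sexual": 0, "hate": 0, "violence": 1},
    },
}

for annotator, labels in annotator_labels.items():
    for category in categories:
        y_true = [expert_labels[ex][category] for ex in labels]
        y_pred = [labels[ex][category] for ex in labels]
        score = f1_score(y_true, y_pred, zero_division=0)
        if score < 0.5:  # illustrative cutoff
            print(f"Review/remove annotations from {annotator} for {category} (F1={score:.2f})")
```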

Main Iteration

The following steps outline the main iteration; the talk presents them as diagrams:

  • Step 0: Creating an initial dataset. This initial dataset includes data from a “bad” (and unlabeled) subset of CommonCrawl, expert-selected academic datasets, and zero-shot synthetic data from GPT-3 based on hand-crafted templates.
  • Step k-1: \cdots
  • Step k: In iteration k, training a model \mathcal{M}_k based on a GPT-series model using the standard cross-entropy loss (see the sketch below).

One of the things OpenAI could not solve well is model calibration.
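
The following is a minimal sketch of Step k as a standard cross-entropy fine-tuning loop. It uses a Hugging Face classifier for illustration; the base model, label set, and toy training loop are assumptions and not OpenAI's actual GPT-series setup.

```python
# Sketch of Step k: fine-tune a classifier with the standard cross-entropy loss.
# The base model ("distilbert-base-uncased"), the label set, and the toy training
# loop are illustrative; OpenAI fine-tunes a GPT-series model instead.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

labels = ["safe", "sexual", "hate", "violence", "self-harm"]  # example categories
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=len(labels)
)
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

train_examples = [("some training text", 0), ("another training text", 2)]  # (text, label id)

model.train()
for text, label in train_examples:  # in practice: batched DataLoader, several epochs
    batch = tokenizer(text, return_tensors="pt", truncation=True)
    outputs = model(**batch, labels=torch.tensor([label]))
    outputs.loss.backward()  # cross-entropy loss computed by the model head
    optimizer.step()
    optimizer.zero_grad()
```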

  • Step k+1: Using \mathcal{M}_k to run inference on the unlabeled production data; the predicted probabilities are used to select a subset for annotation. Three selection methods are compared (a sketch of these strategies follows the paragraph below):
    • Purely Random Sampling
    • Random Sampling for Samples Above a Threshold
    • Uncertainty Sampling

Active learning substantially improves the ratio of harmful content in the selected data compared to raw user traffic (10 to 22 times higher).
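
The following is a minimal sketch of the three selection strategies, assuming the model returns a probability of harmfulness for each unlabeled example; the function names, threshold, and sample sizes are illustrative rather than from the talk.

```python
# Sketch of the three selection strategies compared in Step k+1.
# `scores` are model probabilities of harmfulness on unlabeled production data.
import random

def random_sampling(examples, k):
    return random.sample(examples, k)

def thresholded_random_sampling(examples, scores, k, threshold=0.5):
    # Randomly sample only among examples the model already scores above a threshold.
    candidates = [x for x, s in zip(examples, scores) if s >= threshold]
    return random.sample(candidates, min(k, len(candidates)))

def uncertainty_sampling(examples, scores, k):
    # Pick the examples whose scores are closest to 0.5, i.e. where the model
    # is least certain.
    ranked = sorted(zip(examples, scores), key=lambda pair: abs(pair[1] - 0.5))
    return [x for x, _ in ranked[:k]]
```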

After the subset is annotated, it is added back to the training set. Further, synthetic data is also added to address counterfactual bias.

  • Step k+2: Running the following steps to further improve the training data.

    • Overfitted Phrase Detection.
    • Mislabeling Detection.
  • Step k+3: Internal red teaming.
  • Step k+4: \cdots
  • Step -3: Evaluating on the static test set.
  • Step -2: A/B testing.
  • Step -1: Product release.

Here is a more detailed diagram; it is the same as the one provided in the paper.

Future Direction

  • Dataset

    • A more systematic approach to creating synthetic datasets. The current approach OpenAI uses is ad hoc.
    • Robustness to prompt injection and ciphers.
  • Continuous GPT-Assisted Red Teaming
  • Active Learning
    • The current active learning approach relies on the model \mathcal{M}_k at Step k+1, but \mathcal{M}_k may not generalize well.
    • The presenter also mentions anomaly detection; it is not prioritized at OpenAI due to time constraints.

Reference