Reading Notes | Out-of-Distribution Detection and Selective Generation for Conditional Language Models

Overview

The paper teaches encoder-decoder language models (a Transformer model for the translation task and the PEGASUS model for the summarization task) to abstain when they receive inputs substantially different from the training distribution. Abstaining from generating content (more metaphorically, saying "I don't know" when the model really does not know) indicates that the system is trustworthy, and this practice improves system safety.

Method

Given a domain-specific dataset $\mathcal{D}_1 = \{(x_1, y_1), \cdots, (x_N, y_N)\}$ and a general-domain dataset (for example, C4) $\mathcal{D}_0$, the authors fit four Mahalanobis distance metrics using the representations of a model $f(\cdot)$. The Mahalanobis distance is defined as $\mathrm{MD}(\mathbf{x}; \mu, \Sigma) = (\mathbf{x} - \mu)^T \Sigma^{-1} (\mathbf{x} - \mu)$; it is equivalent to $-\log \mathcal{N}(\mathbf{x}; \mu, \Sigma)$ up to an additive constant and a scaling factor.
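
As a concrete illustration, here is a minimal NumPy sketch of fitting the Gaussian parameters to a set of embedding vectors and evaluating the Mahalanobis distance; the helper names (`fit_gaussian`, `mahalanobis`) and the ridge term are my own, not the paper's.

```python
import numpy as np

def fit_gaussian(embeddings: np.ndarray):
    """Fit mu and Sigma^{-1} to row-wise embedding vectors."""
    mu = embeddings.mean(axis=0)
    # A small ridge keeps the covariance invertible in high dimensions.
    sigma = np.cov(embeddings, rowvar=False) + 1e-6 * np.eye(embeddings.shape[1])
    return mu, np.linalg.inv(sigma)

def mahalanobis(x: np.ndarray, mu: np.ndarray, sigma_inv: np.ndarray) -> float:
    """MD(x; mu, Sigma) = (x - mu)^T Sigma^{-1} (x - mu)."""
    d = x - mu
    return float(d @ sigma_inv @ d)
```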

| Notation | Fitting the Distance Metric On |
| --- | --- |
| $\mathrm{MD}_0(\cdot)$ | $\mathcal{D}_0$ |
| $\mathrm{MD}_\text{input}(\cdot)$ | $\{x_1, \cdots, x_N\}$ |
| $\mathrm{MD}_\text{output}(\cdot)$ | $\{y_1, \cdots, y_N\}$ |
| $\mathrm{MD}_\delta(\cdot)$ | $\{f(x_1), \cdots, f(x_N)\}$ |

Then the authors use either of the following two metrics to compute the OOD score of a test sample $z$, where $w$ is the decoded output of $z$. The idea of using relative distances comes from the authors' previous work on near-OOD detection (see [3]).

  • Input Relative MD (RMD) Score: $\mathrm{RMD}_\text{input}(z) = \mathrm{MD}_\text{input}(z) - \mathrm{MD}_0(z)$.
  • Output Relative MD (RMD) Score: $\mathrm{RMD}_\text{output}(w) = \mathrm{MD}_\text{output}(w) - \mathrm{MD}_\delta(w)$. (A code sketch of both scores follows this list.)
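
A minimal sketch of the two relative scores, reusing `fit_gaussian` and `mahalanobis` from above; each parameter pair corresponds to a row of the table in the Method section, and the embedding extraction is left abstract.

```python
# Each *_params is a (mu, sigma_inv) pair returned by fit_gaussian above.
def rmd_input(z_emb, input_params, background_params):
    """RMD_input(z) = MD_input(z) - MD_0(z); larger values suggest OOD."""
    return mahalanobis(z_emb, *input_params) - mahalanobis(z_emb, *background_params)

def rmd_output(w_emb, output_params, delta_params):
    """RMD_output(w) = MD_output(w) - MD_delta(w)."""
    return mahalanobis(w_emb, *output_params) - mahalanobis(w_emb, *delta_params)
```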

If the scores indicate that the test sample $z$ is an anomaly, the language model abstains from generating the actual $w$; instead, it outputs preset content such as "I don't know."
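
Selective generation then reduces to thresholding the score. A toy sketch, assuming a hypothetical `decode_fn` and a tuned threshold `tau` (neither is specified in these notes):

```python
def generate_or_abstain(z_emb, decode_fn, input_params, background_params, tau):
    """Return the decoding only when the input RMD score clears the threshold."""
    if rmd_input(z_emb, input_params, background_params) > tau:
        return "I don't know."   # abstain on a suspected OOD input
    return decode_fn(z_emb)      # in-distribution: generate as usual
```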

Experiments

  • Perplexity alone should not be used for OOD detection because
    • The fitted PDFs of perplexities on different datasets (i.e., domains) mostly overlap (Figure 1).
    • As the averaged OOD score increases, the Kendall's $\tau$ between perplexity and the quality measure is (1) low and (2) decreasing (Figure 4); if perplexity were a reliable quality signal, the curve should stay mostly flat. (A sketch of this diagnostic appears after this list.)
    • Perplexity can, however, be combined with the proposed metrics for better performance (Sections 4.3, 4.4).
  • The proposed metrics perform differently on different types of tasks: $\mathrm{RMD}_\text{input}(\cdot)$ is more suitable for the translation task, and $\mathrm{RMD}_\text{output}(\cdot)$ is more suitable for the summarization task. This may be because summarization is more "open-ended" than translation.
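
The Figure 4 diagnostic can be approximated in a few lines with `scipy.stats.kendalltau`: bucket test samples by OOD score and check whether the within-bucket correlation between perplexity and quality decays. The bucketing scheme here is my assumption, not necessarily the paper's.

```python
import numpy as np
from scipy.stats import kendalltau

def tau_by_ood_bucket(ood_scores, perplexities, qualities, n_buckets=5):
    """Kendall's tau between perplexity and quality within each OOD-score bucket.

    All three inputs are parallel 1-D NumPy arrays over the test samples.
    """
    order = np.argsort(ood_scores)  # sort samples by OOD score
    taus = []
    for bucket in np.array_split(order, n_buckets):
        tau, _ = kendalltau(perplexities[bucket], qualities[bucket])
        taus.append(tau)
    return taus  # a low, decaying sequence mirrors the paper's Figure 4
```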

  • The distance between domains can be quantitatively measured with the Jaccard similarity of $n$-grams ($n = 1$ through $4$ in the paper) (Table A.10). This is used to quantify task difficulty when the authors define "near-OOD" and "far-OOD" domains (Table 1).
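
A small sketch of the $n$-gram Jaccard similarity; this is my own implementation of the standard definition (pooling all $n$-gram sizes into one set), and the paper's exact tokenization and pooling may differ.

```python
def ngram_jaccard(tokens_a, tokens_b, max_n=4):
    """Jaccard similarity of the pooled 1- to max_n-gram sets of two token lists."""
    def ngrams(tokens, n):
        return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}
    grams_a = set().union(*(ngrams(tokens_a, n) for n in range(1, max_n + 1)))
    grams_b = set().union(*(ngrams(tokens_b, n) for n in range(1, max_n + 1)))
    return len(grams_a & grams_b) / len(grams_a | grams_b)
```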

References

  1. [1705.08500] Selective Classification for Deep Neural Networks

  2. [1612.01474] Simple and Scalable Predictive Uncertainty Estimation using Deep Ensembles: This paper works on the predictive uncertainty of deep classification models. The proposed approach approximates state-of-the-art Bayesian NNs while being easy to implement and parallelize.

  3. [2106.09022] A Simple Fix to Mahalanobis Distance for Improving Near-OOD Detection: For a classification problem with $K$ classes, we can fit $K$ class-dependent Gaussians and one background Gaussian. These $(K+1)$ Gaussians can then be used to detect anomalies: a negative score for class $k$ indicates that the sample is in domain $k$, while a positive score means it is OOD; the more positive the score, the more the sample deviates from that domain.