Overview
The paper proposes to teach encoder-decoder language models (a Transformer model for the translation task and the PEGASUS model for the summarization task) to abstain when the input is substantially different from the training distribution. Abstaining from generating content (more metaphorically, saying “I don’t know” when the model really does not know) indicates that the system is trustworthy, and this practice improves system safety.
Method
Given a domain-specific dataset \mathcal{D}_1 = \{(x_1, y_1), \cdots, (x_N, y_N)\} and a general-domain dataset \mathcal{D}_0 (for example, C4), the authors fit four Mahalanobis distance metrics on the representations produced by a model f(\cdot). The Mahalanobis distance is defined as \mathrm{MD}(\mathbf{x}; \mu, \Sigma) = (\mathbf{x} - \mu)^T \Sigma^{-1} (\mathbf{x} - \mu); it is equivalent to -\log \mathcal{N}(\mathbf{x}; \mu, \Sigma) up to an additive constant and a multiplicative factor. The four metrics are summarized in the table below, followed by a minimal fitting sketch.
| Notation | Fitting the Distance Metric On |
|---|---|
| $\mathrm{MD}_0(\cdot)$ | $\mathcal{D}_0$ |
| $\mathrm{MD}_\text{input}(\cdot)$ | $\{x_1, \cdots, x_N\}$ |
| $\mathrm{MD}_\text{output}(\cdot)$ | $\{y_1, \cdots, y_N\}$ |
| $\mathrm{MD}_\delta(\cdot)$ | $\{f(x_1), \cdots, f(x_N)\}$ |
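As a rough illustration, here is a minimal numpy sketch of how one such metric could be fitted and queried. `embed` is a hypothetical function mapping a sentence to a fixed-size model representation; the choice of layer and pooling is an assumption, not something specified in these notes.

```python
import numpy as np

def fit_md(reprs: np.ndarray):
    """Fit a Mahalanobis distance metric: estimate mu and Sigma from an
    (N, d) matrix of representations and return a scoring function."""
    mu = reprs.mean(axis=0)
    # A small ridge term keeps Sigma invertible when N is small relative to d.
    sigma = np.cov(reprs, rowvar=False) + 1e-6 * np.eye(reprs.shape[1])
    sigma_inv = np.linalg.inv(sigma)

    def md(x: np.ndarray) -> float:
        diff = x - mu
        return float(diff @ sigma_inv @ diff)

    return md

# Hypothetical usage, mirroring the four metrics in the table above:
# md_0      = fit_md(np.stack([embed(x) for x in c4_sample]))
# md_input  = fit_md(np.stack([embed(x) for x in in_domain_inputs]))
# md_output = fit_md(np.stack([embed(y) for y in in_domain_outputs]))
# md_delta  = fit_md(np.stack([embed(model_decode(x)) for x in in_domain_inputs]))
```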
Then the authors use either of the following two metrics to compute the OOD score of a test sample z, where w is the decoded output for z. The idea of using a relative distance comes from the authors’ previous work on selective classification (see [3]).
- Input Relative MD (RMD) Score: \mathrm{RMD}_\text{input}(z) = \mathrm{MD}_\text{input}(z) - \mathrm{MD}_0(z).
- Output Relative MD (RMD) Score: \mathrm{RMD}_\text{output}(w) = \mathrm{MD}_\text{output}(w) - \mathrm{MD}_\delta(w).
If the scores indicate that the test sample z is an anomaly, the language model abstains from generating the actual output w; instead, it returns preset content such as “I don’t know.”
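Under the same assumptions (a hypothetical `embed` function and a `model.generate` call standing in for the actual decoder), selective generation with the input RMD score could look roughly like this; the threshold `tau` would be chosen on held-out data and is not a value from the paper.

```python
def rmd_input_score(z_text, embed, md_input, md_0):
    """Input relative MD: in-domain distance minus background distance."""
    z_repr = embed(z_text)
    return md_input(z_repr) - md_0(z_repr)

def selective_generate(z_text, model, embed, md_input, md_0, tau):
    """Decode only when the input looks in-domain; otherwise abstain."""
    if rmd_input_score(z_text, embed, md_input, md_0) > tau:  # high score => likely OOD
        return "I don't know."          # preset abstention content
    return model.generate(z_text)       # trusted in-domain input: produce the actual output w
```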
Experiments
- Perplexity alone should not be used for OOD detection because
  - the fitted PDFs of perplexity on different datasets (i.e., domains) mostly overlap (Figure 1), and
  - as the averaged OOD score increases, the Kendall’s \tau between perplexity and the quality measure is (1) low and (2) decreasing (Figure 4); if perplexity were a good measure, the curve would be mostly flat.
  - Nevertheless, perplexity can be combined with the proposed metrics (Sections 4.3 and 4.4).
- The proposed metrics perform differently on different types of tasks: \mathrm{RMD}_\text{input}(\cdot) is more suitable for the translation task, and \mathrm{RMD}_\text{output}(\cdot) is more suitable for the summarization task. This may be because summarization is more “open-ended” than translation.
- The distance between domains can be quantitatively measured with the Jaccard similarity of n-grams (n = 1 through 4 in the paper) (Table A.10). This is used to quantify task difficulty when the authors define “near OOD” and “far OOD” domains (Table 1); a sketch of this similarity is given below.
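A simple sketch of such a similarity measure, assuming whitespace tokenization and a plain average over n = 1..4 (the paper’s exact tokenization and aggregation may differ):

```python
def ngrams(tokens, n):
    """Set of n-grams of a token sequence."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def jaccard_ngram_similarity(corpus_a, corpus_b, max_n=4):
    """Average Jaccard similarity between the n-gram sets (n = 1..max_n)
    of two corpora; a lower value suggests the domains are farther apart."""
    sims = []
    for n in range(1, max_n + 1):
        a = set().union(*(ngrams(s.split(), n) for s in corpus_a))
        b = set().union(*(ngrams(s.split(), n) for s in corpus_b))
        sims.append(len(a & b) / max(len(a | b), 1))
    return sum(sims) / len(sims)
```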
References
- [1705.08500] Selective Classification for Deep Neural Networks
- [1612.01474] Simple and Scalable Predictive Uncertainty Estimation using Deep Ensembles: This paper works on the predictive uncertainty of deep classification models. The proposed approach tries to approximate state-of-the-art Bayesian NNs while being easy to implement and parallelize.
- [2106.09022] A Simple Fix to Mahalanobis Distance for Improving Near-OOD Detection: For a classification problem with K classes, we can fit K class-dependent Gaussians and one background Gaussian. These (K+1) Gaussians can then be used to detect anomalies: a negative score for class k indicates that the sample belongs to domain k, while a positive score means it is OOD; the more positive the score, the more the sample deviates from that domain.
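A rough numpy sketch of this (K+1)-Gaussian relative Mahalanobis distance for classification, assuming a shared covariance across the class-conditional Gaussians (a common choice; the paper’s exact fitting details may differ):

```python
import numpy as np

def fit_class_rmd(features: np.ndarray, labels: np.ndarray, num_classes: int, ridge=1e-6):
    """Fit K class-conditional Gaussians (shared covariance) plus one
    background Gaussian on all features; return a per-class RMD scorer."""
    d = features.shape[1]
    mus = np.stack([features[labels == k].mean(axis=0) for k in range(num_classes)])
    pooled = features - mus[labels]                  # center each point by its class mean
    sigma_inv = np.linalg.inv(np.cov(pooled, rowvar=False) + ridge * np.eye(d))
    mu_0 = features.mean(axis=0)
    sigma0_inv = np.linalg.inv(np.cov(features, rowvar=False) + ridge * np.eye(d))

    def rmd(x: np.ndarray) -> np.ndarray:
        md_k = np.array([(x - mus[k]) @ sigma_inv @ (x - mus[k]) for k in range(num_classes)])
        md_0 = (x - mu_0) @ sigma0_inv @ (x - mu_0)
        return md_k - md_0   # negative: in-domain for class k; larger positive: farther OOD

    return rmd
```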