[Semantic Scholar] – [Code] – [Tweet] – [Video] – [Website] – [Slide]
Change Logs:
- 2023-09-19: First draft. This paper appears as one of the outstanding papers at ICML 2022.
Overview
The main contribution of the paper is a metric to evaluate both the aggregate and the sample-wise difficulty of a dataset for a model family \mathcal{V}: a lower score indicates a more difficult dataset. The metric is appealing because it supports all five of the following comparisons, whereas previous approaches could only do one to three of them. Specifically,
- Comparing Datasets: DIME (accepted as a workshop paper at NeurIPS 2020), IRT [4].
- Comparing Models: Dynascore [3]
- Comparing Instances: Data Shapley [5]
- Comparing Dataset Slices
- Comparing Attributes: The paper [6] estimates the attribute importance using MDL.
Method
Despite the extensive theoretical construction in Section 2, computing the proposed metric is fairly straightforward.
Suppose we have datasets \mathcal{D} _ \text{train} and \mathcal{D} _ \text{test} for a task such as NLI. The proposed metric requires fine-tuning two models from the same base model \mathcal{V} on \mathcal{D} _ \text{train} and collecting measurements on \mathcal{D} _ \text{test} (Algorithm 1):
- Step 1: Fine-tune a model g' on \mathcal{D} _ \text{train} = \{ (x_1, y_1), \cdots, (x_m, y_m) \} and another model g on \{ (\phi, y_1), \cdots, (\phi, y_m) \}, where \phi is the empty string; both g' and g are initialized from the same base model, such as bert-base-uncased.
- Step 2: For each test sample, the sample-wise difficulty (aka PVI) is defined as \mathrm{PVI}(x_i \rightarrow y_i) := -\log_2 g(y_i \vert \phi) + \log_2 g'(y_i \vert x_i); the aggregate difficulty is its average \hat{I} _ \mathcal{V}(X \rightarrow Y) = \frac{1}{n}\sum _ i \mathrm{PVI}(x_i \rightarrow y_i).
If the input and output are independent, the metric is provably 0; empirically, the estimate will be close to (but not exactly) 0.
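The snippet below is a minimal sketch of Step 2 (under my own assumptions, not the authors' released implementation) using Hugging Face transformers; the checkpoint paths path/to/g_prime and path/to/g and the toy test set are placeholders.

```python
import math
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
# Hypothetical checkpoint paths: g' was fine-tuned on (x_i, y_i), g on (phi, y_i).
model_xy = AutoModelForSequenceClassification.from_pretrained("path/to/g_prime")  # g'
model_null = AutoModelForSequenceClassification.from_pretrained("path/to/g")      # g
model_xy.eval()
model_null.eval()

def log2_prob(model, text, label):
    """log_2 of the probability the model assigns to `label` given `text`."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        log_probs = torch.log_softmax(model(**inputs).logits, dim=-1)
    return log_probs[0, label].item() / math.log(2)  # convert nats-based log to log_2

def pvi(x, y):
    """PVI(x -> y) = -log_2 g(y | phi) + log_2 g'(y | x)."""
    return -log2_prob(model_null, "", y) + log2_prob(model_xy, x, y)

# Aggregate difficulty: the estimate of V-usable information is the mean PVI
# over the held-out test set (toy examples below, invented for illustration).
test_set = [("some input text", 1), ("another input", 0)]
v_info = sum(pvi(x, y) for x, y in test_set) / len(test_set)
```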
Note that:
- The method requires a reasonably large dataset \mathcal{D} _ \text{train}. However, the exact size required is not known in advance unless we train models on subsets of increasing size and check when the estimate plateaus, which is often not feasible in practice. The authors use 80% of the SNLI dataset for estimation (Appendix A).
- The specific choice of models, hyperparameters, and random initializations does not substantially influence the results (Section 3.2).
Applications
There are several applications of using the proposed metric to rank the samples in a dataset:
- Identifying annotation errors (Section 3); see the sketch after this list.
- Selecting challenging samples for data selection, including training data selection, data augmentation, and TCP (Section 4).
- Guiding the creation of new specifications, as it is possible to compute the metric token-wise (Section 4.3).
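For the first application, here is a small sketch of the workflow (my assumed usage, not code from the paper; the triples below are invented for illustration):

```python
# Rank test examples by PVI and inspect the lowest-scoring ones as candidate
# annotation errors. `scored` is assumed to hold (text, label, PVI) triples
# computed as in the Method section above.
scored = [
    ("some easy example", 1, 3.2),
    ("a possibly mislabeled example", 0, -1.7),  # strongly negative PVI
    ("a moderately hard example", 2, 0.4),
]

# The lowest-PVI instances are the hardest for the model family V; strongly
# negative values suggest the label may be wrong and worth re-annotating.
suspects = sorted(scored, key=lambda triple: triple[2])[:10]
for text, label, score in suspects:
    print(f"PVI={score:+.2f}  label={label}  text={text!r}")
```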
Additional Notes
- It is quite surprising that the CoLA dataset is more difficult than SNLI and MNLI according to the authors’ measure.
Code
Reference
1. Dataset Cartography: Mapping and Diagnosing Datasets with Training Dynamics (Swayamdipta et al., EMNLP 2020). Like the main paper, this method also requires training a model.
2. A Theory of Usable Information Under Computational Constraints (Xu et al., ICLR 2020) [arXiv:2002.10689].
3. Dynaboard: An Evaluation-As-A-Service Platform for Holistic Next-Generation Benchmarking [arXiv:2106.06052].
4. Evaluation Examples are not Equally Informative: How should that change NLP Leaderboards? (Rodriguez et al., ACL-IJCNLP 2021).
5. Data Shapley: Equitable Valuation of Data for Machine Learning (ICML 2019) [arXiv:1904.02868]. Data Shapley gives a pointwise estimate of a sample's contribution to the decision boundary.
6. Rissanen Data Analysis: Examining Dataset Characteristics via Description Length (ICML 2021) [arXiv:2103.03872].