Tag: NoisyLabel
Reading Notes | NoisywikiHow – A Benchmark for Learning with Real-world Noisy Labels in Natural Language Processing
[Semantic Scholar] – [Code] – [Tweet] – [Video] – [Website] – [Slide]
Change Logs:
- 2023-09-14: First draft. The paper appears in Findings of ACL 2023. The code base has very detailed instructions on how to reproduce their results.
Method
- The authors find that the labeling errors are both annotator-dependent and instance-dependent.
Experiments
- The best performing LNL method on the benchmark is SEAL [1]; one could also consider MixUp regularization [2]. All other LNL methods perform almost indistinguishably from the base models, i.e., models trained without any intervention in the training process.
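  For reference, a minimal MixUp sketch (not the SEAL or NoisywikiHow implementation); it assumes mixing is applied to dense inputs such as token embeddings or sentence representations, which is the usual choice for text:

  ```python
  import torch

  def mixup_batch(inputs, targets, num_classes, alpha=0.2):
      """MixUp (Zhang et al., 2018): convex-combine random pairs of examples.

      `inputs`: a float tensor of shape (batch, ...), e.g., embeddings.
      `targets`: integer class labels of shape (batch,).
      """
      # Mixing coefficient drawn from Beta(alpha, alpha).
      lam = torch.distributions.Beta(alpha, alpha).sample().item()
      perm = torch.randperm(inputs.size(0))
      mixed_inputs = lam * inputs + (1.0 - lam) * inputs[perm]
      # Mix the one-hot labels with the same coefficient.
      one_hot = torch.nn.functional.one_hot(targets, num_classes).float()
      mixed_targets = lam * one_hot + (1.0 - lam) * one_hot[perm]
      return mixed_inputs, mixed_targets
  ```

  Training then minimizes a soft-target cross-entropy against `mixed_targets` instead of the original hard labels.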
Additional Note
Comments
- Creating a new dataset is necessary because it lets users customize the noise level and compare different algorithms in a controlled setting.
Reference
- [2012.05458] Beyond Class-Conditional Assumption: A Primary Attempt to Combat Instance-Dependent Label Noise (Chen et al., AAAI 2021).
- [1710.09412] mixup: Beyond Empirical Risk Minimization (Zhang et al., ICLR 2018, 7.6K citations).
- Nonlinear Mixup: Out-of-Manifold Data Augmentation for Text Classification (Guo, AAAI 2020). One application of MixUp regularization in NLP. It is based on a CNN classifier and the improvement is quite marginal.
- [2006.06049] On Mixup Regularization (Carratino et al., JMLR): A theoretical analysis of MixUp regularization.
- Learning with Noisy Labels (Natarajan et al., NIPS 2013): This is the first paper to (theoretically) study LNL. It considers a binary classification problem where labels are randomly flipped, which is theoretically appealing but, according to the main paper, less relevant empirically.
Research Notes | Training Data Optimization
Problem Statement
Suppose we have a collection of datasets from K sources \mathcal{D} _ 1, \cdots, \mathcal{D} _ K. These K datasets have been unified regarding input and output spaces.
Now we split each \mathcal{D} _ i into train, validation, and test splits \mathcal{D} _ i ^ \text{train}, \mathcal{D} _ i ^ \text{val}, and \mathcal{D} _ i ^ \text{test}, and form the aggregated train, validation, and test sets as \mathcal{D} ^ \text{train} := \cup _ {i=1} ^ K \mathcal{D} _ i ^ \text{train}, \mathcal{D} ^ \text{val} := \cup _ {i=1} ^ K \mathcal{D} _ i ^ \text{val}, and \mathcal{D} ^ \text{test} := \cup _ {i=1} ^ K \mathcal{D} _ i ^ \text{test}.
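For concreteness, a minimal sketch of this split-and-union construction, assuming a hypothetical toy format where each source dataset is a list of (input, label) pairs:

```python
import random

def split_and_aggregate(datasets, ratios=(0.8, 0.1, 0.1), seed=0):
    """Split each of the K source datasets into train/val/test and union the splits.

    `datasets` is a list of K lists of (input, label) pairs that already share
    the same input and output spaces (a hypothetical toy format).
    """
    rng = random.Random(seed)
    aggregated = {"train": [], "val": [], "test": []}
    for data in datasets:
        data = list(data)  # copy so we do not shuffle the caller's list
        rng.shuffle(data)
        n_train = int(ratios[0] * len(data))
        n_val = int(ratios[1] * len(data))
        aggregated["train"].extend(data[:n_train])
        aggregated["val"].extend(data[n_train:n_train + n_val])
        aggregated["test"].extend(data[n_train + n_val:])
    return aggregated
```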
The learning problem could vary depending on the quality of the datasets after (1) dataset collection and annotation by the authors of the different datasets, and (2) dataset unification when merging the K datasets into one. Specifically:
- If labels are reliable, then this is a dataset selection problem. The goal is to save computational resources by training on \mathcal{D} \subseteq \mathcal{D} ^ \text{train} while maintaining the performance of a model trained on (1) each \mathcal{D} _ i,\ i \in [K], (2) \mathcal{D} ^ \text{train}, and (3) \mathrm{Sample}(\mathcal{D} ^ \text{train}) that matches the size of \mathcal{D}.
  In some special cases, another motivation for dataset selection is that we know the size of a sampled dataset (for example, from the dataset statistics described in a paper) but not exactly which samples it contains.
- If labels are not reliable, then the goal is to prevent the low-quality labels from offsetting the benefits of a larger training dataset (rather than distilling a smaller dataset to save compute). We have three options:
| Index | Method | Type |
|---|---|---|
| 1 | Reannotating the entire dataset. This reduces to the dataset selection problem above, since we now have more confidence in the relabeled dataset. | Offline |
| 2 | Identifying and removing unreliable labels, optionally using the removed samples as an unsupervised dataset. This is also reducible to a dataset selection problem, as in option 1. | Offline |
| 3 | Learning with the noisy labels (LNL, as described in [1]) as they are; this requires the learning algorithm to explicitly account for the variability in label quality. | Online |
Note that there is a closely related topic called “dataset distillation” that one may easily confuse with the above. The goal of dataset distillation is to create a synthetic dataset in the feature space, based on the original dataset, that matches the original performance on the test set. Previous work shows that it is possible to attain the original performance on MNIST ([4]) and IMDB ([5]) with synthetic datasets of size (surprisingly) 10 and 20, respectively.
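For reference, the one-gradient-step dataset distillation objective from [4] is roughly the following bilevel problem: learn synthetic data \tilde{\mathbf{x}} and a learning rate \tilde{\eta} so that a single gradient step on the synthetic data produces a model with low loss \ell on the real data \mathbf{x}, in expectation over random initializations \theta _ 0 (a sketch of the formulation, not a verbatim copy):

```latex
\tilde{\mathbf{x}}^{*}, \tilde{\eta}^{*}
  = \arg\min_{\tilde{\mathbf{x}},\, \tilde{\eta}}\;
    \mathbb{E}_{\theta_0 \sim p(\theta_0)}
    \Big[ \ell\big(\mathbf{x},\; \theta_0 - \tilde{\eta}\,\nabla_{\theta_0}\, \ell(\tilde{\mathbf{x}}, \theta_0)\big) \Big]
```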
Adaptive Data Selection
With the test sets finalized, we can now work on sampling the training set, i.e., choosing one specific \mathrm{Sample}(\cdot) function as described above. The goal is to sample the training set so that the scores on the test sets are maximized:
- DSIR: Suppose we need to sample B batches totaling N examples (N rather than K, since K already denotes the number of sources). We could randomly sample the first batch and then call the DSIR algorithm for each subsequent batch until we have collected N examples. This should be done separately for each label; see the sketch below.
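A minimal sketch of this batched, per-label loop. `importance_weights_fn` is a hypothetical stand-in for DSIR-style importance weights (target likelihood over raw likelihood under hashed n-gram models) and is not part of any released DSIR API; the Gumbel top-k step is one standard way to resample without replacement proportionally to weights.

```python
import numpy as np

def batched_selection_per_label(candidates_by_label, importance_weights_fn,
                                total_per_label, batch_size, seed=0):
    """Randomly sample the first batch, then select later batches by importance weights."""
    rng = np.random.default_rng(seed)
    selected = {}
    for label, pool in candidates_by_label.items():
        pool = list(pool)
        # First batch: uniform random sample.
        k0 = min(batch_size, len(pool), total_per_label)
        first = rng.choice(len(pool), size=k0, replace=False)
        chosen = [pool[i] for i in first]
        remaining = [x for i, x in enumerate(pool) if i not in set(first)]
        # Subsequent batches: resample without replacement proportionally to
        # importance weights via the Gumbel top-k trick.
        while len(chosen) < total_per_label and remaining:
            k = min(batch_size, total_per_label - len(chosen), len(remaining))
            log_w = np.log(np.asarray(importance_weights_fn(remaining)) + 1e-12)
            gumbel = rng.gumbel(size=len(remaining))
            top = np.argsort(-(log_w + gumbel))[:k]
            chosen.extend(remaining[i] for i in top)
            keep = set(top.tolist())
            remaining = [x for i, x in enumerate(remaining) if i not in keep]
        selected[label] = chosen
    return selected
```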
Reference
- NoisywikiHow: A Benchmark for Learning with Real-world Noisy Labels in Natural Language Processing (Wu et al., Findings of ACL 2023).
- [2202.01327] Adaptive Sampling Strategies to Construct Equitable Training Datasets (Cai et al., FAccT 2022).
- [2301.04272] Data Distillation: A Survey (Sachdeva and McAuley, TMLR).
- [1811.10959] Dataset Distillation (Wang et al.)
- [1910.02551] Soft-Label Dataset Distillation and Text Dataset Distillation (Sucholutsky and Schonlau, IJCNN 2020). This is the only paper referenced in [3] that describes dataset distillation for text. It is based on the original dataset distillation objective proposed in [4].
- [2302.03169] Data Selection for Language Models via Importance Resampling (Xie et al.)
- [2305.10429] DoReMi: Optimizing Data Mixtures Speeds Up Language Model Pretraining (Xie et al.)
- [2306.11670] GIO: Gradient Information Optimization for Training Dataset Selection (Everaert and Potts): This paper has a setting similar to the DSIR paper [6]: we select new samples by minimizing their KL divergence with an existing set of unlabeled samples. The paper claims an advantage over DSIR in that the proposed algorithm requires fewer gold examples:
Like GIO, these heuristic methods aim to select a subset of data that is higher quality and more relevant. However, they are either highly tailored to their particular tasks or they require very large numbers of examples (to develop classifiers or construct target probabilities). By contrast, GIO is task- and domain-agnostic, it can be applied plug-and-play to a new task and dataset, and it requires comparatively few gold examples X to serve as the target distribution.
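As a rough illustration of the KL-minimization idea only (not the actual GIO algorithm, which optimizes with gradient information over a quantized candidate set), here is a much-simplified greedy sketch using diagonal-Gaussian density estimates in embedding space; the direction of the KL and all names here are assumptions.

```python
import numpy as np

def diag_gaussian_kl(mu_p, var_p, mu_q, var_q):
    """KL( N(mu_p, diag(var_p)) || N(mu_q, diag(var_q)) )."""
    return 0.5 * np.sum(
        np.log(var_q / var_p) + (var_p + (mu_p - mu_q) ** 2) / var_q - 1.0
    )

def greedy_kl_selection(candidate_embs, target_embs, budget, eps=1e-6):
    """Greedily add candidates so the selected set's distribution stays close
    to the target (gold) distribution; KL(target || selected) is one plausible
    choice of direction, and the GIO paper should be consulted for the exact objective."""
    mu_t = target_embs.mean(axis=0)
    var_t = target_embs.var(axis=0) + eps
    selected, remaining = [], set(range(len(candidate_embs)))
    while len(selected) < budget and remaining:
        best_i, best_kl = None, np.inf
        for i in remaining:
            trial = candidate_embs[selected + [i]]
            mu_s = trial.mean(axis=0)
            var_s = trial.var(axis=0) + eps
            kl = diag_gaussian_kl(mu_t, var_t, mu_s, var_s)
            if kl < best_kl:
                best_i, best_kl = i, kl
        selected.append(best_i)
        remaining.remove(best_i)
    return selected
```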