Talk Notes | Data-Centric AI

Overview

The following are notes from MIT's Data-Centric AI course offered during the Independent Activities Period (IAP), a special four-week term at MIT. Each lecture runs about 1 hour.

Lecture 1 – Data-Centric AI vs. Model-Centric AI

  • It is not hard to design fancy models and apply various tricks to well-curated data. However, these models and tricks do not carry over to real-world data unless we explicitly account for real-world complexities. Therefore, it is important to focus on the data rather than only the model.

    It turns out there are pervasive label errors in the most-cited test sets across modalities, including text, image, and audio. They can be explored at labelerrors.com.

  • To understand why data is important, consider the kNN algorithm: its accuracy depends entirely on the quality of the dataset. However, kNN itself is not a data-centric algorithm because it does not modify the data or its labels.
  • Two Goals of Data-Centric AI

    Rather than modifying the loss function, doing HPO, or changing the model itself, we do either of the following:

    • Designing an algorithm that tries to understand the data and uses that information to improve the model. One example is curriculum learning by Yoshua Bengio; in curriculum learning, the data itself is not changed, but the order in which examples are presented during training is.
    • Modifying the dataset itself to improve the model. For example, confident learning (i.e., removing wrong labels before training the model), studied by Curtis Northcutt.
  • What are NOT Data-Centric AI and Data-Centric AI Counterpart

    • Hand-picking data points you think will improve a model → Coreset Selection.
    • Doubling the size of the dataset → Data Augmentation. For example, back-translation for text, rotation and cropping for images. However, we need to fix label errors first before augmenting the data.
  • Typical Examples of Data-Centric AI

    Curtis Northcutt cites Andrew Ng and other sources on the importance of data in machine learning ([1] through [3]). Here are some examples of data-centric AI:

    • Outlier Detection and Removal. However, this relies on a validation procedure to choose the threshold.
    • Label Error Detection and Correction
    • Data Augmentation
    • Feature Engineering and Selection. For example, solving the XOR problem by adding a new column (see the sketch after this list).
    • Establishing Consensus Labels during Crowd-sourcing.
    • Active Learning. For example: I want to improve test accuracy by 5% while annotating as little new data as possible.
    • Curriculum Learning.
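
    As a concrete illustration of the feature-engineering example above, here is a minimal sketch (the Perceptron is just an illustrative choice of linear model): a linear classifier cannot fit XOR on the raw two columns, but fits it perfectly once an interaction column is added.

import numpy as np
from sklearn.linear_model import Perceptron

# The XOR problem: no linear boundary separates the two classes in (x1, x2).
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 1, 1, 0])

raw = Perceptron(max_iter=1000, tol=None, random_state=0).fit(X, y)
print("raw columns:", raw.score(X, y))  # cannot reach 1.0

# Data-centric fix: add a new column, the interaction x1 * x2.
X_aug = np.hstack([X, (X[:, 0] * X[:, 1]).reshape(-1, 1)])
aug = Perceptron(max_iter=1000, tol=None, random_state=0).fit(X_aug, y)
print("with the new column:", aug.score(X_aug, y))  # 1.0: now linearly separable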
  • Data-Centric AI Algorithms are Often Superior to Model-Centric Algorithms

    The model-centric approach (i.e., training less on the subset of data that the model believes is bad) is a much worse idea than the data-centric approach (i.e., confident learning).


  • Root-Causing the Issues – Models or Data
    • The model should perform well on slices of the data. Slicing means not only restricting to a smaller sample but also reducing the number of classes from a large number to a very small one. For example, rather than classifying images into 1000 classes, we focus only on the performance on two classes (a small sketch follows this list).
    • The model should perform similarly on similar datasets (for example, MNIST and other digit datasets).
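
    A small sketch of slice-based evaluation with hypothetical predictions: restrict the evaluation to examples whose true label falls in two classes of interest and compare the slice accuracy with the overall accuracy.

import numpy as np

# Hypothetical ground-truth labels and predictions over 1000 classes.
rng = np.random.default_rng(0)
y_true = rng.integers(0, 1000, size=10_000)
correct = rng.random(10_000) < 0.7
y_pred = np.where(correct, y_true, rng.integers(0, 1000, size=10_000))

overall_acc = (y_pred == y_true).mean()

# Slice: keep only examples whose true label is one of two classes of interest.
classes_of_interest = [3, 5]
mask = np.isin(y_true, classes_of_interest)
slice_acc = (y_pred[mask] == y_true[mask]).mean()

print(f"overall: {overall_acc:.3f}, slice over classes {classes_of_interest}: {slice_acc:.3f}")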

Lecture 2 – Label Errors

Notation

| Notation | Meaning | Note |
| --- | --- | --- |
| $\tilde{y}$ | Noisy observed label | |
| $y ^ *$ | True underlying label | |
| $\mathbf{X} _ {\tilde{y} =i, y ^ {*} = j}$ | The set of examples whose true label is $j$ but that are mislabeled as $i$ | |
| $\mathbf{C} _ {\tilde{y} =i, y ^ {*} = j}$ | The number of examples in the set above | |
| $p(\tilde{y} =i, y ^ {*} = j)$ | The joint probability of noisy label $i$ and true label $j$ | It can be estimated by normalizing $\mathbf{C}$, i.e., dividing each entry by the sum of all entries of $\mathbf{C}$. |
| $p(\tilde{y} =i\vert y ^ {*} = j)$ | The probability that true label $j$ flips to noisy label $i$ | Also called the flipping rate. |
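
A small numpy sketch of the normalization described above, using a made-up count matrix $\mathbf{C}$ (rows are noisy labels $i$, columns are true labels $j$):

import numpy as np

# Made-up count matrix C: C[i, j] counts examples labeled i whose true label is j.
C = np.array([[100.0,   8.0,  2.0],
              [  6.0, 120.0,  4.0],
              [  1.0,   3.0, 90.0]])

# Joint distribution p(noisy = i, true = j): divide each entry by the sum of all entries.
joint = C / C.sum()

# Flipping rate p(noisy = i | true = j): normalize each column to sum to 1.
flip_rate = C / C.sum(axis=0, keepdims=True)

print(joint.sum())             # 1.0
print(flip_rate.sum(axis=0))   # each column sums to 1.0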

Categories of Label Errors

When comparing the consensus crowd-sourcing labels and the final label in the dataset, there are 4 types of label errors:

  • Correctable: The given label is wrong, and it can be corrected with crowd-sourcing. This is the type of label error the lecture focuses on detecting.
  • Multi-label: The given label and the consensus label are both right, but more than one label in \mathcal{Y} applies to the sample. For example, an image containing both a laptop and a person that is labeled only as “laptop.”
  • Neither: The given label and the consensus label are both wrong.
  • Non-agreement: There is no way to tell whether the given label or the consensus label is correct.

There are also two categories of the label errors the presenter does not focus on:

  • Uniform Random Flipping p(\tilde{y} = i \vert y ^ * = j) = \epsilon, \forall i\neq j: This shows up as a symmetric \mathbf{C} matrix. It is easy to handle, and this type of error is unlikely to happen in the real world.
  • Instance-Dependent Label Noise p(\tilde{y} = i \vert y ^ * = j, \mathbf{x}): This requires many assumptions about the data distribution. Importantly, this type of label error seldom happens in the real world.

Uncertainty

There are two sources of uncertainty:

  • Aleatoric Uncertainty: Label noise. It is the difficulty of a sample. This difficulty could come from an incorrect label y or an unusual distribution of \mathbf{x}.
  • Epistemic Uncertainty: Model noise. It is the model’s inability to understand the example; for example, the model has never seen similar examples before, or the model class is too simple.

Confident Learning

The focus of the lecture is correctable errors, as defined in the previous section; the corresponding count matrix \mathbf{C} is non-symmetric. Furthermore, the lecture focuses on samples with one label and one annotation.

  • Motivation of Using Confident Learning

    • Ranking samples by loss does not work: we cannot find a loss threshold and claim that the samples above it are label errors.
    • Deep learning does not solve the label-noise problem (despite many papers and many claims) because those works target datasets polluted by uniform noise.
  • Assumption: Class-Conditional Label Noise
    p(\tilde{y} \vert y ^ {*}, \mathbf{x}) = p(\tilde{y} \vert y ^ {*})

    • Interpretation: Given the true label, there is a constant flipping rate for the samples under that true label to other labels.
    • Rationale: A pig image is often confused with a boar image, but not with unrelated classes such as "missile" or "keyboard." This tendency has little to do with what exactly the pig looks like in a particular image and much to do with the similarity between the classes.
    • Motivation: This assumption is made because the LHS couples the aleatoric and epistemic uncertainties, and this assumption decouples them.
  • Confident Learning

    • For each class j, we define the model's self-confidence t _ j as the average predicted probability of class j over the samples labeled j. If a sample's predicted probability for a class other than its given label exceeds that class's self-confidence, there is likely something wrong with the sample's given label.

    t _ j = \frac{1}{ \vert \mathbf{X} _ {\tilde{y} = j}\vert } \sum _ {x \in \mathbf{X} _ {\tilde{y} = j}} \hat{p} ( \tilde{y} = j; \mathbf{x}, \theta)

    • For a sample labeled i, if its predicted probability for class j is larger than t _ j, then the sample is likely mislabeled and we assign it to the set below. We obtain this matrix in a cross-validation fashion: for example, with 3 folds, we train the model \hat{p} on 2/3 of the data and use the held-out 1/3 to compute the entries.
      \hat{ \mathbf{X} } _ {\tilde{y} = i, y ^ {*} = j} = \{ \mathbf{x} \in \mathbf{X} _ {\tilde{y} = i}: \hat{p} (\tilde{y} = j; \mathbf{x}, \theta) \geq t_j \}
    • Example

      Suppose the t _ j for "dog", "fox", and "cow" are 0.7, 0.7, and 0.9, and we have the following predictions and labels. We then obtain a count matrix like the one below; the off-diagonal entries correspond to label errors (a code sketch follows the notes below).

      |  | $y ^ {*} = \text{dog}$ | $y ^ {*} = \text{fox}$ | $y ^ {*} = \text{cow}$ |
      | --- | --- | --- | --- |
      | $\tilde{y} = \text{dog}$ | 1 | 1 | 0 |
      | $\tilde{y} = \text{fox}$ | 1 | 3 | 0 |
      | $\tilde{y} = \text{cow}$ | 0 | 0 | 1 |

      Note the following:

      • The last sample does not contain any animal, so it is not counted anywhere. This shows that the scheme is robust to outliers.
      • It is possible for t _ j to be very small, but this happens when there are many classes; in that case, the predicted probability for each class is also small.

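      The following minimal numpy sketch mirrors the counting rule above on made-up labels and out-of-sample predicted probabilities; it illustrates the idea rather than cleanlab's exact implementation (ties between candidate classes are broken by the largest predicted probability).

import numpy as np

def confident_joint(labels, pred_probs):
    """Count C[i, j]: examples labeled i whose predicted probability for
    class j reaches the class-j self-confidence threshold t_j."""
    n_classes = pred_probs.shape[1]

    # t_j: average self-confidence of class j over the examples labeled j.
    thresholds = np.array(
        [pred_probs[labels == j, j].mean() for j in range(n_classes)]
    )

    C = np.zeros((n_classes, n_classes), dtype=int)
    for probs, i in zip(pred_probs, labels):
        candidates = np.flatnonzero(probs >= thresholds)
        if len(candidates) == 0:
            continue  # e.g., an outlier confident for no class is simply not counted
        j = candidates[np.argmax(probs[candidates])]
        C[i, j] += 1
    return C

# Made-up example with 3 classes; off-diagonal entries flag likely label errors.
labels = np.array([0, 0, 1, 1, 2])
pred_probs = np.array([
    [0.9, 0.05, 0.05],
    [0.2, 0.7, 0.1],   # labeled 0 but confidently class 1 -> counted off-diagonal
    [0.1, 0.8, 0.1],
    [0.3, 0.6, 0.1],
    [0.1, 0.1, 0.8],
])
print(confident_joint(labels, pred_probs))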

  • Applications

    • Confident Learning + Ranking by Loss

      If there are in total k off-diagonal samples, then we can pick the top-k samples ranked by loss.

    • Confident Learning + Ranking by Normalized Margin

      We could also rank by the normalized margin for a specific class i; the normalized margin is defined as follows (a code sketch appears after this list of applications):
      p(\tilde{y} = i; \mathbf{x}) - \max _ {j\neq i} p(\tilde{y} = j; \mathbf{x}), \quad \mathbf{x} \in \mathbf{X} _ {\tilde{y} = i}

    • Self-Confidence

      When p(\tilde{y} = i; \mathbf{x}) is close to 1, then, as far as the model is concerned, the sample is unlikely to be a label error.
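
      A small numpy sketch (reusing the hypothetical labels and pred_probs arrays from the earlier sketch) of ranking samples by self-confidence and by normalized margin; lower scores indicate more likely label errors.

import numpy as np

labels = np.array([0, 0, 1, 1, 2])
pred_probs = np.array([
    [0.9, 0.05, 0.05],
    [0.2, 0.7, 0.1],
    [0.1, 0.8, 0.1],
    [0.3, 0.6, 0.1],
    [0.1, 0.1, 0.8],
])

# Self-confidence: the predicted probability of the given label.
self_confidence = pred_probs[np.arange(len(labels)), labels]

# Normalized margin: p(given label) minus the largest other-class probability.
others = pred_probs.copy()
others[np.arange(len(labels)), labels] = -np.inf
normalized_margin = self_confidence - others.max(axis=1)

# Samples ordered from most to least likely to be mislabeled.
print(np.argsort(normalized_margin))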

Theory of Confident Learning

  • The model-centric approaches (i.e., model-reweighting methods) still propagate the errors back to the weights. However, the data-centric approaches (i.e., pruning methods) do not have this problem.
  • We could prove that even if the model is miscalibrated (i.e., overly confident in some classes), the confident learning method is still robust.

Implications on Testing

  • When focusing on the subset of test data whose labels could be corrected, more capable models (for example, ResNet-50 vs. ResNet-18) perform worse, as they fit the random label noise in the training set.

Lecture 8 – Encoding Human Priors

Human priors can be encoded (i.e., represented as a function) into the ML pipeline in two ways. At training time, this is done through data augmentation. At test time, this is done through prompt engineering with an LLM.

  • Data Augmentation
    • Images: Flip, rotation, Möbius transformation, mixup [4]. Mixup can be thought of as a linear interpolation of two images and their labels (see the sketch after this list).
    • Texts: Back-translation.
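
    A minimal sketch of mixup [4], assuming numpy arrays holding a batch of images and one-hot labels; the interpolation weight is drawn from a Beta distribution as in the paper, and mixup_batch is an illustrative helper rather than a library function.

import numpy as np

def mixup_batch(images, one_hot_labels, alpha=0.2, rng=None):
    """Linearly interpolate each example (and its label) with a random partner."""
    rng = rng if rng is not None else np.random.default_rng()
    lam = rng.beta(alpha, alpha)            # interpolation weight in [0, 1]
    perm = rng.permutation(len(images))     # random partner for every example
    mixed_images = lam * images + (1 - lam) * images[perm]
    mixed_labels = lam * one_hot_labels + (1 - lam) * one_hot_labels[perm]
    return mixed_images, mixed_labels

# Toy batch: four 8x8 "images" with 3 classes.
rng = np.random.default_rng(0)
images = rng.random((4, 8, 8))
labels = np.eye(3)[rng.integers(0, 3, size=4)]
mixed_images, mixed_labels = mixup_batch(images, labels, rng=rng)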

cleanlab Library

Anatomy

  • Understanding Cross-Validation in cleanlab

    Cross-validation in cleanlab means the predicted probabilities must be out-of-sample (i.e., test-time scores). Specifically, with 3 folds, we keep the predicted probabilities for each held-out 1/3 of the data produced by a model trained on the remaining 2/3.

    This logic can be found in estimate_confident_joint_and_cv_pred_proba() in cleanlab/count.py; it is one of the most important functions in cleanlab. It is used by the find_label_issues method of the CleanLearning class; this class also inherits from sklearn.base.BaseEstimator. The code can be found here.
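
    To make this concrete with any sklearn-compatible model, here is a minimal sketch; the synthetic dataset and LogisticRegression are stand-ins for the real data and model, while cross_val_predict and cleanlab.filter.find_label_issues are the actual APIs used.

import numpy as np
from cleanlab.filter import find_label_issues
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

# Synthetic 3-class data with a few labels flipped on purpose.
X, y = make_classification(n_samples=1000, n_classes=3, n_informative=5, random_state=0)
y_noisy = y.copy()
flipped = np.random.default_rng(0).choice(len(y), size=30, replace=False)
y_noisy[flipped] = (y_noisy[flipped] + 1) % 3

# Out-of-sample predicted probabilities: each fold is scored by a model
# trained on the remaining folds, exactly as described above.
pred_probs = cross_val_predict(
    LogisticRegression(max_iter=1000), X, y_noisy, cv=3, method="predict_proba"
)

issue_indices = find_label_issues(
    labels=y_noisy, pred_probs=pred_probs, return_indices_ranked_by="self_confidence"
)
print(len(issue_indices), "potential label errors flagged")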

  • keras is Necessary to Port cleanlab and transformers

    • cleanlab requires a model API similar to sklearn's.
    • As of 2023-11-08, neither the transformers nor the sklearn team provides a way to bridge the two libraries (except for a less relevant library called skops, which is about sharing sklearn models on the HuggingFace Hub; also see news). We therefore rely on the keras-based code from the official cleanlab tutorial, which fine-tunes a TF-based bert-base-uncased model to find label errors in the imdb dataset.
    • The complete script is available here.

Example

In the official demo that finds label errors in the imdb dataset, the authors use a simple MLP as the base model. The following code (confusing at first glance) tokenizes the texts into fixed-length integer vectors of length sequence_length.

import re
import string

import tensorflow as tf
import tensorflow_datasets as tfds
from tensorflow.keras.layers import TextVectorization

raw_train_ds = tfds.load(name="imdb_reviews", split="train", batch_size=-1, as_supervised=True)
raw_test_ds = tfds.load(name="imdb_reviews", split="test", batch_size=-1, as_supervised=True)

raw_train_texts, train_labels = tfds.as_numpy(raw_train_ds)
raw_test_texts, test_labels = tfds.as_numpy(raw_test_ds)

max_features = 10000
sequence_length = 250

def preprocess_text(input_data):
    lowercase = tf.strings.lower(input_data)
    stripped_html = tf.strings.regex_replace(lowercase, "<br />", " ")  # drop HTML line breaks
    return tf.strings.regex_replace(
        stripped_html, f"[{re.escape(string.punctuation)}]", ""
    )

vectorize_layer = TextVectorization(
    standardize=preprocess_text,
    max_tokens=max_features,
    output_mode="int",
    output_sequence_length=sequence_length,
)

vectorize_layer.reset_state()
vectorize_layer.adapt(raw_train_texts)

# Shape (N, sequence_length): each review becomes a fixed-length vector of token ids.
train_texts = vectorize_layer(raw_train_texts).numpy()
test_texts = vectorize_layer(raw_test_texts).numpy()

Additional Notes

  • “You are what you eat” is particularly relevant to the process of training machine learning models.
  • The data collection, labeling, and cleaning process could be called “data engine” or “data flywheel” in tech firms (blog).
  • The benefit of data-centric AI is that it disentangles the effects of data and modeling. Previously, we blindly trusted the labels, and efforts to improve performance (including using larger models, changing loss functions, and doing HPO) may have only ended up fitting the noise. If we make the data clean, we can identify which techniques are truly useful and which are not.
  • cleanlab can not only flag label issues but also automatically fix the most severe ones (blog).

    Here we use Cleanlab Studio’s Clean Top K feature, which allows us to automatically correct the top most severe issues detected in our dataset with an automatically suggested label (inferred to be more suitable for each example than its original label in the dataset).

Reference

  1. Why it’s time for ‘data-centric artificial intelligence’ | MIT Sloan
  2. Bad Data Costs the U.S. $3 Trillion Per Year (Harvard Business Review)
  3. Bad Data: The $3 Trillion-Per-Year Problem That’s Actually Solvable | Entrepreneur
  4. mixup: Beyond Empirical Risk Minimization (Zhang et al., ICLR 2018, arXiv:1710.09412)

Talk Notes | Data-Centric NLP @ USC CSCI-699 Fall 2022

Outline

The following is the course schedule (essentially a reading list) compiled from the course website for quick reference.

  • I. Datasets in NLP
    • Aug 22 – Introduction, Historical Perspective, and Overview
      • Fair ML Book, Chapter 7: Datasets
      • Sambasivan et al., 2021: “Everyone wants to do the model work, not the data work”: Data Cascades in High-Stakes AI
      • Paullada et al., 2021: Data and its (dis)contents
      • Raji et al., 2022: Ethical Challenges of Data Collection & Use in Machine Learning Research
    • Aug 24 – Data Collection and Data Ethics
      • Deng et al., 2009: ImageNet: A large-scale hierarchical image database
      • Kwiatkowski et al., 2019: Natural Questions: A Benchmark for Question Answering Research
      • Sakaguchi et al., 2019: WinoGrande: An Adversarial Winograd Schema Challenge at Scale
      • Bowman et al., 2015: A large annotated corpus for learning natural language inference
      • Nie et al., 2020: Adversarial NLI: A New Benchmark for Natural Language Understanding
    • Aug 31 – More on Data Ethics
      • Bender et al., 2021: On the Dangers of Stochastic Parrots: Can Language Models Be Too Big?
      • Koch et al., 2021: Reduced, Reused and Recycled: The Life of a Dataset in Machine Learning Research
      • Klein and D’Ignazio, 2020: Data Feminism Book: Intro and Chapter 1
      • Strubell et al., 2019: Energy and Policy Considerations for Deep Learning in NLP
  • II. Bias and Mitigation
    • Sep 7 – Biases: An Overview
      • Geirhos et al., 2020: Shortcut Learning in Deep Neural Networks
      • Hort et al., 2022: Bias Mitigation for Machine Learning Classifiers: A Comprehensive Survey
      • Feder et al., 2021: Causal Inference in Natural Language Processing: Estimation, Prediction, Interpretation and Beyond
    • Sep 12 – Spurious Biases I
      • Torralba & Efros, 2011: Unbiased Look at Dataset Bias
      • Geva et al., 2019: Are We Modeling the Task or the Annotator? An Investigation of Annotator Bias in Natural Language Understanding Datasets
      • McCoy et al., 2019: Right for the Wrong Reasons: Diagnosing Syntactic Heuristics in NLI
    • Sep 14 – Spurious Biases II
      • Gardner et al., 2021: Competency Problems: On Finding and Removing Artifacts in Language Data
      • Eisenstein, 2022: Informativeness and Invariance: Two Perspectives on Spurious Correlations in Natural Language
    • Sep 19 – Data-Centric Bias Mitigation
      • Srivastava et al., 2020: Robustness to spurious correlations via human annotations
      • Dixon et al., 2018: Measuring and mitigating unintended bias in text classification
      • Gardner et al., 2019: On Making Reading Comprehension More Comprehensive
    • Sep 21 – Data Augmentation for Bias Mitigation
      • Ng et al., 2020: SSMBA: Self-Supervised Manifold Based Data Augmentation for Improving O.O.D. Robustness
      • Kaushik et al., 2019: Learning the Difference that Makes a Difference with Counterfactually-Augmented Data
  • III. Estimating Data Quality
    • Sep 26 – Estimates of Data Quality
      • Le Bras et al., 2020: Adversarial Filters of Dataset Biases
      • Swayamdipta et al., 2020: Dataset Cartography: Mapping and Diagnosing Datasets with Training Dynamics
      • Liu et al., 2022: WANLI: Worker and AI Collaboration for Natural Language Inference Dataset Creation
      • Ethayarajh et al., 2022: Understanding Dataset Difficulty with V-Usable Information
    • Sep 28 – Aggregate vs. Point-wise Estimates of Data Quality
      • Ghorbani & Zou, 2019: Data Shapley: Equitable Valuation of Data for Machine Learning
      • Perez et al., 2021: Rissanen Data Analysis: Examining Dataset Characteristics via Description Length
      • Mindermann et al., 2022: Prioritized Training on Points that are Learnable, Worth Learning, and Not Yet Learnt
    • Oct 3 – Anomalies, Outliers, and Out-of-Distribution Examples
      • Hendrycks et al., 2018: Deep Anomaly Detection with Outlier Exposure
      • Ren et al., 2019: Likelihood Ratios for Out-of-Distribution Detection
    • Oct 5 – Disagreements, Subjectivity and Ambiguity I
      • Pavlick et al., 2019: Inherent Disagreements in Human Textual Inferences
      • Röttger et al., 2022: Two Contrasting Data Annotation Paradigms for Subjective NLP Tasks
      • Denton et al., 2021: Whose Ground Truth? Accounting for Individual and Collective Identities Underlying Dataset Annotation
    • Oct 12 – Disagreements, Subjectivity and Ambiguity II
      • Miceli et al., 2020: Between Subjectivity and Imposition: Power Dynamics in Data Annotation for Computer Vision
      • Davani et al., 2021: Dealing with Disagreements: Looking Beyond the Majority Vote in Subjective Annotations
  • IV. Data for Accountability
    • Oct 17 – Creating Evaluation Sets
      • Recht et al., 2019: Do ImageNet Classifiers Generalize to ImageNet?
      • Card et al., 2020: With Little Power Comes Great Responsibility
      • Clark et al., 2021: All That’s ‘Human’ Is Not Gold: Evaluating Human Evaluation of Generated Text
      • Ethayarajh & Jurafsky, 2020: Utility is in the eye of the user: a critique of NLP leaderboards
    • Oct 19 – Counterfactual Evaluation
      • Gardner et al., 2020: Evaluating Models’ Local Decision Boundaries via Contrast Sets
      • Ross et al., 2021: Tailor: Generating and Perturbing Text with Semantic Controls
    • Oct 24 – Adversarial Evaluation
      • Jia and Liang, 2017: Adversarial Examples for Evaluating Reading Comprehension Systems
      • Kiela et al., 2021: Dynabench: Rethinking Benchmarking in NLP
      • Li and Michael, 2022: Overconfidence in the Face of Ambiguity with Adversarial Data
    • Oct 26 – Contextualizing Decisions
      • Gebru et al., 2018: Datasheets for Datasets
      • Bender and Friedman, 2018: Data Statements for Natural Language Processing: Toward Mitigating System Bias and Enabling Better Science
  • V. Beyond Labeled Datasets
    • Oct 31 – Unlabeled Data
      • Dodge et al., 2021: Documenting Large Webtext Corpora: A Case Study on the Colossal Clean Crawled Corpus
      • Lee et al., 2022: Deduplicating Training Data Makes Language Models Better
      • Gururangan et al., 2022: Whose Language Counts as High Quality? Measuring Language Ideologies in Text Data Selection
    • Nov 2 – Prompts as Data?
      • Wei et al., 2022: Chain of Thought Prompting Elicits Reasoning in Large Language Models
    • Nov 7 – Data Privacy and Security
      • Amodei et al., 2016: Concrete Problems in AI Safety
      • Carlini et al., 2020: Extracting Training Data from Large Language Models
      • Henderson et al., 2022: Pile of Law: Learning Responsible Data Filtering from the Law and a 256GB Open-Source Legal Dataset
    • Nov 9 – Towards Better Data Citizenship
      • Jo & Gebru, 2019: Lessons from Archives: Strategies for Collecting Sociocultural Data in Machine Learning
      • Hutchinson et al., 2021: Towards Accountability for Machine Learning Datasets: Practices from Software Engineering and Infrastructure