Research Notes | Debugging Machine Learning Models

Overview

The edited knowledge in this line of work is expressed as triplets. Given the prompt "The Eiffel Tower is located in the city of," the original model outputs "Paris" as expected. After model editing, however, the model could assign high probability to a different token, for example, "Seattle."

Suppose we have an input x whose original output is y := \mathcal{M}(x). If we apply some intervention to \mathcal{M}(\cdot) and expect the future output to be y', we require the editing to be reliable, local, and general (a checking sketch follows the list):

  • Reliable: The edited model should output y' with high probability.
  • Local: The output for anything semantically different from x should not change.
  • General (or Consistent): The output for anything semantically equivalent to x should also change.
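The following is a minimal sketch (not from any paper) of checking the three criteria; `edited_model`, `equiv_set`, `local_set`, and `original_outputs` are hypothetical:

```python
# Hedged sketch: `edited_model` is any callable; `equiv_set` holds paraphrases
# of x, `local_set` holds unrelated inputs, and `original_outputs` maps inputs
# to the pre-edit outputs. All names are assumptions for illustration.
def check_edit(edited_model, x, y_new, equiv_set, local_set, original_outputs):
    reliable = edited_model(x) == y_new
    general = all(edited_model(x_eq) == y_new for x_eq in equiv_set)
    local = all(edited_model(x_loc) == original_outputs[x_loc] for x_loc in local_set)
    return {"reliable": reliable, "general": general, "local": local}
```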

The community seems to focus on editing encoder-decoder or decoder-only models ([12] and [13]) because of their ability to generate text. Encoder-only models receive less interest, even though MEND and TransformerPatcher both study them. For example, paper [13] mentions the following:

Previous studies typically used smaller language models (<1B) and demonstrated the effectiveness of current editing methods on smaller models like BERT (Devlin et al., 2019). However, whether these methods work for larger models is still unexplored. Hence, considering the editing task and future developments, we focus on generation based models and choose larger ones: T5-XL (3B) and GPT-J (6B), representing both encoder-decoder and decoder-only structures.

The editing methods could also be compared by whether the model parameters are modified. There are three scenarios:

  1. Model Parameters are Unchanged
  2. Model Parameters are Unchanged, but there are Additional Parameters
  3. Model Parameters are Changed: This could be done using either (1) locating-and-editing, or (2) meta-learning with a separate hypernetwork.
| Method | Category | Note |
| --- | --- | --- |
| ENN | 3 | |
| KnowledgeEditor | 3 | |
| MEND | 3 | |
| SERAC | 1 | |
| ROME | 3 | |
| MEMIT | 3 | |
| TransformerPatcher | 2 | |
| KnowledgeNeuron | 3 | |
| MQuAKE | 1 | |
| IKE | 1 | |
| MemPrompt | 1 | |

ROME

KnowledgeNeuron

KnowledgeEditor

MEND

TransformerPatcher

MEMIT

Experiments

Datasets

The canonical tasks of model editing include fact-checking on FEVER and QA on the zsRE dataset.

  • For FEVER, the editing dataset pairs the original input with a flipped label.
  • For zsRE, the editing dataset pairs the original input with an answer that is not the model's top-1 prediction (a construction sketch follows).
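A hedged sketch of constructing the editing examples; the field names and the hypothetical `model_topk` helper are assumptions for illustration:

```python
# Hedged sketch of editing-example construction (not any paper's actual code).
def make_fever_edit(example):
    # flip the binary verdict so the edited model must change its prediction
    return {"input": example["claim"], "target": 1 - example["label"]}

def make_zsre_edit(example, model_topk):
    # pick an alternative answer that is not the model's current top-1
    top1, *rest = model_topk(example["question"])
    return {"input": example["question"], "target": rest[0]}
```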
| Paper | Fact Checking | QA | Generation | Note |
| --- | --- | --- | --- | --- |
| MEMIT [1] | N/A | zsRE and CounterFact | N/A | There are two intermediate works, ROME and SERAC; they are omitted here as MEMIT performs best. |
| MEND [6] | Binary FEVER | zsRE | Wikitext | The first two tasks follow De Cao et al.; Wikitext is an additional dataset. |
| KnowledgeEditor [5] | Binary FEVER | zsRE | N/A | |
| Constrained Fine-Tuning [3] | N/A | zsRE and T-REx | N/A | |
| ENN [4] | N/A | N/A | N/A | This early work experiments on CIFAR-10 and MT tasks. |

Additional Notes

  • The RDF triplet may be the most unambiguous way to express a factual statement; it is a classical way to represent knowledge and can be bidirectionally converted to and from a SQL database (Wikipedia).
  • The overarching research field is called "mechanistic interpretability."
  • Knowledge editing is thought to be difficult because knowledge is stored distributionally rather than symbolically. However, [2] finds that the localization is quite concentrated in the MLPs; the authors focus on the MLPs because they believe attention is too complicated to study.
  • The MLPs store information while attention gathers information: the information "Seattle" resides in one specific location of GPT-2 before the prompt "the Space Needle is located at" is ever asked.
  • Model editing differs from adversarial attacks: the former changes the model while the latter changes the input data. However, model editing has dual uses beyond model patching, for example, engineering an LM that always generates non-factual content.
  • One limitation of model editing is that we can only update singleton facts; we cannot update higher-level content, for example, specifications and political leanings.

Reference

Kevin Meng and David Bau have published a series of works ([1] and [2]) on knowledge editing for transformers. [3] through [6] are the predecessors to their proposed work; these methods could scale to at most 75 edits.

  1. [2210.07229] Mass-Editing Memory in a Transformer (MEMIT system).
  2. [2202.05262] Locating and Editing Factual Associations in GPT (ROME system).
  3. [2012.00363] Modifying Memories in Transformer Models: This paper is the first to study the problem of fact editing in transformers. The authors propose fine-tuning the model's first and last transformer blocks on the modified facts \mathcal{D} _ M while constraining the parameters within a small space (a projection sketch follows the equation):
    \min _ {\theta \in \Theta} \frac{1}{m} \sum _ {x \in \mathcal{D} _ M} L(x;\theta) \quad \text{s.t.} \quad \Vert \theta - \theta _ 0 \Vert \leq \delta
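    One way to enforce the norm constraint is projected gradient descent: after each optimizer step, project the updated parameters back onto the \delta-ball around the pre-edit weights. The sketch below is an assumption for illustration, not necessarily the paper's exact implementation; `theta0` is a flattened snapshot of the pre-edit parameters.

```python
import torch

# Hedged sketch: project the fine-tuned parameters back into the delta-ball
# around the original weights theta0 (an assumed helper, not the paper's code).
def project_to_ball(parameters, theta0, delta):
    with torch.no_grad():
        params = list(parameters)
        flat = torch.cat([p.view(-1) for p in params])
        diff = flat - theta0
        norm = diff.norm()
        if norm > delta:
            flat = theta0 + diff * (delta / norm)
            offset = 0  # write the projected values back tensor by tensor
            for p in params:
                n = p.numel()
                p.copy_(flat[offset:offset + n].view_as(p))
                offset += n
```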
  4. [2004.00345] Editable Neural Networks (Sinitsin et al., ICLR 2020) (ENN system): This paper is the first to apply meta-learning to model editing; it is a precursor to the follow-up works [5], [6], and [7]. It also makes the following important observation:

    • The goal of model editing is quickly patching critical mistakes made by a neural model. The problem precludes (1) retraining with an augmented dataset, because it is slow, and (2) a manual cache, because it does not adapt to diverse input changes.
  5. Editing Factual Knowledge in Language Models (De Cao et al., EMNLP 2021) (KnowledgeEditor system): The authors observe that the previous methods [3] and [4] produce edited models with the following limitations:

    • Unreliable Edits: the edited model changes its behavior on sentences different from x, whose outputs should not have changed.
    • Inconsistent Edits: the edited model fails to change its behavior on sentences semantically equivalent to x, whose outputs should have changed.

    Furthermore, the method [4] also requires expensive retraining.

  6. [2110.11309] Fast Model Editing at Scale (Mitchell et al.) (MEND system): This paper improves upon De Cao et al. by editing models at the scale of 10B parameters. On smaller models, ENN outperforms KnowledgeEditor. The codebase of this work also implements ENN and KnowledgeEditor for comparison.
  7. [2206.06520] Memory-Based Model Editing at Scale (Mitchell et al.) (SERAC system): The authors do not release code for SERAC.
  8. Transformer Feed-Forward Layers Are Key-Value Memories (Geva et al., EMNLP 2021): This paper helps the main paper constrain the editing target to the MLP layers.
  9. Knowledge Neurons in Pretrained Transformers (Dai et al., ACL 2022) (KnowledgeNeuron system)
  10. [2305.12740] Can We Edit Factual Knowledge by In-Context Learning? (Zhang et al.)
  11. [2301.09785] Transformer-Patcher: One Mistake Worth One Neuron (Huang et al., ICLR 2023): This paper proposes adding one neuron to the last FFN layer and activating it when the exact same error is seen again, so that the error is corrected. The experiments include both an encoder-only model (BERT) and an encoder-decoder model (BART).
  12. [2308.07269] EasyEdit: An Easy-to-use Knowledge Editing Framework for Large Language Models (Wang et al.)
  13. [2305.13172] Editing Large Language Models: Problems, Methods, and Opportunities (Yao et al., EMNLP 2023): This paper, together with the EasyEdit paper above, provides a comprehensive survey and a Python library for knowledge editing. We could stick to these two papers and only read the original papers when necessary.
  14. From Pretraining Data to Language Models to Downstream Tasks: Tracking the Trails of Political Biases Leading to Unfair NLP Models (Feng et al., ACL 2023)
  15. [2305.14795] MQuAKE: Assessing Knowledge Editing in Language Models via Multi-Hop Questions (Zhong et al.)
  16. Memory-assisted prompt editing to improve GPT-3 after deployment (Madaan et al., EMNLP 2022)


Reading Notes | Faithful Low-Resource Data-to-Text Generation through Cycle Training

[Semantic Scholar] – [Code] – [Tweet] – [Video] – [Website] – [Slide] – [Poster]

Change Logs:

  • 2023-10-06: First draft. The paper appears at ACL 2023.

Method

Cycle training involves two models: a data-to-text model \mathcal{M} _ \text{D2T} and a text-to-data model \mathcal{M} _ \text{T2D}; both are initialized from `google/t5-base`; this base model empirically showed an edge in the WebNLG 2020 competition for RDF-to-text generation.

The proposed approach is similar to self-training in the text generation domain. Specifically, there are three datasets: paired text and data, unpaired data D, and unpaired text T.

  • Initialization: Fine-tuning \mathcal{M} _ \text{D2T} and \mathcal{M} _ \text{T2D} using the paired dataset; the data is converted into linearized triplets.
  • Repeating the following for multiple epochs (the paper sets the number of epochs to 50). At epoch k, we do the following:
    • Generating text \hat{T} =\mathcal{M} _ \text{D2T} ^ {(k-1)}(D) and data \hat{D}=\mathcal{M} _ \text{T2D} ^ {(k-1)}(T) with models from epoch (k-1).
    • Fine-tuning models with pseudo pairs (D, \hat{T}) and (\hat{D}, T). Specifically, we do the following:
      • $\mathcal{M} _ \text{D2T} ^{(k)} \leftarrow \mathrm{FineTune}(\mathcal{M} _ \text{D2T} ^{(k-1)}, (\hat{D}, T))$; this step tries to reconstruct texts $T$ from intermediate $\hat{D}$.
      • $\mathcal{M} _ \text{T2D} ^{(k)} \leftarrow \mathrm{FineTune}(\mathcal{M} _ \text{T2D} ^{(k-1)}, (D, \hat{T}))$; this step tries to reconstruct data $D$ from intermediate $\hat{T}$.

Note that this scheme differs from self-training: in self-training, we use the labels inferred by a model to train that same model. Here, we do not use the generated pairs (D, \hat{T}) from \mathcal{M} _ \text{D2T} to fine-tune itself; rather, we leverage the second model \mathcal{M} _ \text{T2D} to generate the training data for \mathcal{M} _ \text{D2T}. The loop is sketched below.
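The following is a hedged sketch of the cycle-training loop; `fine_tune` and `generate` are trivial stand-ins (assumptions, not the paper's code) so that the structure is runnable:

```python
# Hedged sketch of cycle training; the two stubs below are placeholders for
# real fine-tuning / inference routines and are assumptions for illustration.
def fine_tune(model, pairs):   # stub: would train `model` on (input, target) pairs
    return model

def generate(model, source):   # stub: would run inference with `model`
    return source

def cycle_train(m_d2t, m_t2d, paired, data_only, text_only, epochs=50):
    # initialization on the small paired set (data linearized as triplets)
    m_d2t = fine_tune(m_d2t, [(d, t) for d, t in paired])
    m_t2d = fine_tune(m_t2d, [(t, d) for d, t in paired])
    for _ in range(epochs):
        text_hat = [generate(m_d2t, d) for d in data_only]   # \hat{T}
        data_hat = [generate(m_t2d, t) for t in text_only]   # \hat{D}
        # each model is fine-tuned on pairs produced by the *other* model
        m_d2t = fine_tune(m_d2t, list(zip(data_hat, text_only)))  # reconstruct T from D-hat
        m_t2d = fine_tune(m_t2d, list(zip(text_hat, data_only)))  # reconstruct D from T-hat
    return m_d2t, m_t2d
```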

From the experiment results, we could see:

  • The low-resource cycle training performs on par with full-scale fine-tuning.
  • The small set of paired texts is important: the low-resource setting consistently outperforms the unsupervised setting.
  • Pretraining does not help much if the paired datasets are of small scale.

[Figure omitted: experiment results across settings]

Additional Notes

  • Prerequisite

    The unpaired data and text corpus should have at least 50% overlap in terms of entities to obtain a reasonable level of faithfulness.

    [Figure omitted: faithfulness versus entity overlap between the unpaired data and text]

  • Automatic Faithfulness Evaluation

    The PARENT metric [1] used in this work highly correlates with human annotations; this metric is specially designed for table-to-text tasks.

Reference

  1. Handling Divergent Reference Texts when Evaluating Table-to-Text Generation (Dhingra et al., ACL 2019)

Reading Notes | Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks

[Semantic Scholar] – [Code] – [Tweet] – [Video] – [Website] – [Slide]

Change Logs:

  • 2023-10-06: First draft. This paper appears at NeurIPS 2020.

Method

Given a query x, the RAG system first retrieves z from a traditional index (for example, Wikipedia) based on a DPR model p _ \eta(z \vert x). Then the generator produces answers in free-text form through p _ \theta (y _ i \vert x, z, y _ {1:i-1}), where y _ {1:i-1} denotes the previously generated tokens. In this process, z is a latent variable that is not observable by the users.

  • Note: The ability to generate answers in free-text form is impressive because many of the experimented tasks are extractive.

The RAG system could be trained jointly on p _ \eta and p _ \theta as it is end-to-end differentiable. The authors provide two variants of the RAG system:

  • RAG-Sequence: For a query, the entire output sequence is conditioned on the same document.
  • RAG-Token: For a query, each token in the output sequence could be conditioned on different documents (the two decompositions are given after this list). The authors note that RAG could also be used for knowledge-intensive classification tasks:

    Finally, we note that RAG can be used for sequence classification tasks by considering the target class as a target sequence of length one, in which case RAG-Sequence and RAG-Token are equivalent.
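Concretely, the two variants marginalize over the top-k retrieved documents at different granularities; following the paper's formulation:

p _ \text{RAG-Sequence}(y \vert x) \approx \sum _ {z \in \text{top-}k} p _ \eta(z \vert x) \prod _ i p _ \theta(y _ i \vert x, z, y _ {1:i-1})

p _ \text{RAG-Token}(y \vert x) \approx \prod _ i \sum _ {z \in \text{top-}k} p _ \eta(z \vert x) \, p _ \theta(y _ i \vert x, z, y _ {1:i-1})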

Note that RAG-Token does not seem to be much better than RAG-Sequence, but the former has many more downloads on HuggingFace.

[Figures omitted: experiment results from the paper]

Specifically, the retrieval model is based on bert-base-uncased and the generator is based on facebook/bart-large. Importantly, to accelerate training, the document encoder is frozen and gradients only flow to the query encoder; this design choice does not hurt performance.

Additional Notes

  • One benefit of RAG is that the index could be updated on demand ("hot-swapping" in the paper).

Reading Notes | Dense Passage Retrieval for Open-Domain Question Answering

[Semantic Scholar] – [Code] – [Tweet] – [Video] – [Website] – [Slide]

Change Logs:

  • 2023-10-05: First draft. This paper appears at EMNLP 2020.

Overview

  • Dense passage retrieval itself is not new to this paper; the issue is that previous dense retrievers underperform BM25. The contribution of this paper is an engineering-feasible recipe that learns a DPR model effectively without many examples; it improves upon BM25 by a large margin.

Method

The training goal of DPR is to learn a metric where the distance between the query q and relevant passages p^+ is smaller than the distance to irrelevant passages p^- in the high-dimensional embedding space. That is, we want to minimize the loss below:
L(q _ i, p _ i ^ +, p _ {i1} ^ -, \cdots, p _ {in}^-) := -\log \frac{ \exp(q _ i^T p _ i^+)}{\exp(q_i^T p _ i ^ +) + \sum _ {j=1}^n \exp(q _ i ^ T p _ {ij}^-)}
The authors find that "in-batch negatives" is a simple and effective negative sampling strategy (see "Gold" with and without "IB" in the table below; also see the dissection of the code later). Specifically, within a batch of B examples, any answer that is not associated with the current query is considered a negative. If one passage retrieved by BM25 (see the bottom block of the table) is added as a hard negative, the performance improves further. A minimal sketch of the objective follows.
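The snippet below is a hedged sketch of the in-batch negatives objective with random embeddings standing in for encoder outputs:

```python
import torch
import torch.nn.functional as F

# Hedged sketch of in-batch negatives: in a batch of B (query, gold passage)
# pairs, passage j != i serves as a negative for query i.
B, d = 4, 8
q = torch.randn(B, d)  # query embeddings (stand-ins for encoder outputs)
p = torch.randn(B, d)  # gold passage embeddings
scores = q @ p.t()     # (B, B) similarity matrix; positives on the diagonal
labels = torch.arange(B)
loss = F.cross_entropy(scores, labels)  # -log softmax of each gold entry, averaged
# appending one BM25 hard negative per query would make `scores` (B, 2B)
```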

[Table omitted: retrieval accuracy under different training schemes ("Gold", "IB", hard negatives)]

The retrieval model is trained for 40 epochs on the larger datasets ("NQ", "TriviaQA", "SQuAD") and for 100 epochs on the smaller ones ("WQ", "TREC") with a learning rate of 1e-5. Note that the datasets the authors use to fine-tune the models are large. For example, natural_questions is 143 GB.


Additional Notes

  • The dual-encoder + cross-encoder design is a classic; the two components are not necessarily end-to-end differentiable. For example, in this work, after fine-tuning the dual-encoder for retrieval, the authors separately fine-tune a QA model. This could be a favorable design due to better performance:

    This approach obtains a score of 39.8 EM, which suggests that our strategy of training a strong retriever and reader in isolation can leverage effectively available supervision, while outperforming a comparable joint training approach with a simpler design.

  • The inner product of unit vectors is indeed the cosine similarity (a quick check follows).
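A minimal PyTorch check of this identity:

```python
import torch
import torch.nn.functional as F

a, b = torch.randn(5), torch.randn(5)
dot = F.normalize(a, dim=0) @ F.normalize(b, dim=0)  # inner product of unit vectors
cos = F.cosine_similarity(a, b, dim=0)
assert torch.allclose(dot, cos, atol=1e-6)
```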

Quickstart

  • HuggingFace provides classes for DPR. Retrieval-Augmented Generation (RAG) is one example that fine-tunes DPR to improve knowledge-intensive text generation.
  • simpletransformers provides easy-to-use interfaces to train DPR models; it even provides a routine to select hard negatives. The following is a minimal working example:
import os
import logging

os.environ["WANDB_DISABLED"] = "false"
os.environ["TOKENIZERS_PARALLELISM"] = "false"

import pandas as pd
from sklearn.model_selection import train_test_split
from simpletransformers.retrieval import (
    RetrievalModel,
    RetrievalArgs,
)

from datasets import (
    Dataset,
    DatasetDict,
)

logging.basicConfig(level=logging.INFO)
transformers_logger = logging.getLogger("transformers")
transformers_logger.setLevel(logging.WARNING)

# trec_train.pkl and trec_dev.pkl are prepared from the original repository
# see: https://github.com/facebookresearch/DPR/blob/main/README.md
df = pd.read_pickle("../datasets/trec_train.pkl")
train_df, eval_df = train_test_split(df, test_size=0.2)
test_df = pd.read_pickle("../datasets/trec_dev.pkl")

columns = ["query_text", "gold_passage", "title"]

train_data = train_df[columns]
eval_data = eval_df[columns]
test_data = test_df[columns]

# Configure the model
model_args = RetrievalArgs()

model_args.include_title = False

# see full list of configurations:
# https://simpletransformers.ai/docs/usage/#configuring-a-simple-transformers-model
# critical settings
model_args.learning_rate = 1e-5
model_args.num_train_epochs = 40
model_args.train_batch_size = 32
model_args.eval_batch_size = 32
model_args.gradient_accumulation_steps = 1
model_args.fp16 = False
model_args.max_seq_length = 128
model_args.n_gpu = 1
model_args.use_multiprocessing = False
model_args.use_multiprocessing_for_evaluation = False

# saving settings
model_args.no_save = False
model_args.overwrite_output_dir = True
model_args.output_dir = "outputs/"
model_args.best_model_dir = "{}/best_model".format(model_args.output_dir)
model_args.save_model_every_epoch = False
model_args.save_best_model = True
model_args.save_steps = 2000

# evaluation settings
model_args.evaluate_during_training = True
model_args.evaluate_during_training_steps = 100

# logging settings
model_args.silent = False
model_args.logging_steps = 50
model_args.wandb_project = "HateGLUE"
model_args.wandb_kwargs = {
    "name": "DPR",
}

model_type = "dpr"
context_encoder_name = "facebook/dpr-ctx_encoder-single-nq-base"
question_encoder_name = "facebook/dpr-question_encoder-single-nq-base"

model = RetrievalModel(
    model_type=model_type,
    context_encoder_name=context_encoder_name,
    query_encoder_name=question_encoder_name,
    use_cuda=True,
    cuda_device=0,
    args=model_args,
)

# Train the model
model.train_model(train_data, eval_data=eval_data)
result = model.eval_model(eval_data)

Code

This section dissects the code used by simpletransformers.

  • The entire process tries to maximize the probability of correctly pairing each query with its gold passage; this is done by minimizing the negative log-softmax defined in _calculate_loss().

    Here torch.nn.functional.log_softmax() + torch.nn.NLLLoss() is equivalent to torch.nn.CrossEntropyLoss(); torch.nn.NLLLoss() requires the input to be log-probabilities of shape (B, C) and the label of shape (B,). For example, the output scalar of the code below is -0.2.

import torch

# NLLLoss picks the entry at each label index, negates it, and averages:
# only probs[4, 4] = 1.0 is nonzero here, so the loss is -(1.0 / 5) = -0.2
loss = torch.nn.NLLLoss(reduction="mean")
probs = torch.diag(torch.linspace(0, 1, 5))  # stands in for log-probabilities
labels = torch.LongTensor([3, 2, 3, 4, 4])

print(loss(probs, labels))  # tensor(-0.2000)
  • The effect of adding hard negatives is simply making the task of maximizing the probability of correct pairs harder, which is conducive to training.
class RetrievalModel:
    # ...
    def _calculate_loss(
        self,
        context_model,
        query_model,
        context_inputs,
        query_inputs,
        labels,
        criterion,
    ):
        context_outputs = context_model(**context_inputs).pooler_output
        query_outputs = query_model(**query_inputs).pooler_output

        context_outputs = torch.nn.functional.dropout(context_outputs, p=0.1)
        query_outputs = torch.nn.functional.dropout(query_outputs, p=0.1)

        # (B, B) or (B, 2B) depending on whether there are hard negatives
        similarity_score = torch.matmul(query_outputs, context_outputs.t())
        softmax_score = torch.nn.functional.log_softmax(similarity_score, dim=-1)

        criterion = torch.nn.NLLLoss(reduction="mean")

        # for the k-th row, pick the labels[k] entry; the loss averages to -1/B * (l_1 + l_2 + ... + l_B)
        loss = criterion(softmax_score, labels)

        max_score, max_idxs = torch.max(softmax_score, 1)
        correct_predictions_count = (
            (max_idxs == torch.tensor(labels)).sum().cpu().detach().numpy().item()
        )

        return loss, context_outputs, query_outputs, correct_predictions_count
    # ...
    def _get_inputs_dict(self, batch, evaluate=False):
        device = self.device

        labels = [i for i in range(len(batch["context_ids"]))]
        labels = torch.tensor(labels, dtype=torch.long)

        if not evaluate:
            # Training
            labels = labels.to(device)

            # adding hard negatives will increase the number of samples
            # in each batch from B to 2B
            if self.args.hard_negatives:
                shuffled_indices = torch.randperm(len(labels))
                context_ids = torch.cat(
                    [
                        batch["context_ids"],
                        batch["hard_negative_ids"][shuffled_indices],
                    ],
                    dim=0,
                )
                context_masks = torch.cat(
                    [
                        batch["context_mask"],
                        batch["hard_negatives_mask"][shuffled_indices],
                    ],
                    dim=0,
                )
            else:
                context_ids = batch["context_ids"]
                context_masks = batch["context_mask"]
            context_input = {
                "input_ids": context_ids.to(device),
                "attention_mask": context_masks.to(device),
            }
            query_input = {
                "input_ids": batch["query_ids"].to(device),
                "attention_mask": batch["query_mask"].to(device),
            }
        else:
            # Evaluation
            shuffled_indices = torch.randperm(len(labels))

            labels = labels[shuffled_indices].to(device)

            if self.args.hard_negatives:
                context_ids = torch.cat(
                    [
                        batch["context_ids"][shuffled_indices],
                        batch["hard_negative_ids"],
                    ],
                    dim=0,
                )
                context_masks = torch.cat(
                    [
                        batch["context_mask"][shuffled_indices],
                        batch["hard_negatives_mask"],
                    ],
                    dim=0,
                )
            else:
                context_ids = batch["context_ids"][shuffled_indices]
                context_masks = batch["context_mask"][shuffled_indices]

            context_input = {
                "input_ids": context_ids.to(device),
                "attention_mask": context_masks.to(device),
            }
            query_input = {
                "input_ids": batch["query_ids"].to(device),
                "attention_mask": batch["query_mask"].to(device),
            }

        return context_input, query_input, labels

Research Notes | Writing

Phrase Bank

Alliteration

Alliteration is a literary device that involves the repetition of initial consonant sounds in a sequence of words, and it is often used for stylistic or rhetorical purposes to create rhythm, emphasize key ideas, or make phrases more memorable.

Although this focused work completely aligns, addresses, and adheres to the guidelines for a short-paper in this venue, we have not performed any experiments on data outside this privacy policy domain.

Reading Notes | Retrieval Enhanced Data Augmentation for Question Answering on Privacy Policies

[Semantic Scholar] – [Code] – [Tweet] – [Video] – [Slide]

Change Logs:

  • 2023-10-05: First draft. This paper appears at EACL 2023; it is dated 2204.08952. The code is not released.

Overview

Paraphrasing and back-translation are only applicable to texts whose meaning is insensitive to small changes. However, privacy policies could convey wildly different meanings after small changes in the text; this makes the two techniques less applicable to the problem being studied.

Method

The authors propose a coarse-to-fine architecture for retrieval-based data augmentation. It consists of an ensemble of retrieval and filter models built on (1) regular BERT, (2) PBERT, a BERT fine-tuned with the MLM objective on privacy policies, and (3) PBERT further fine-tuned with SimCSE.

  • Retrieval Model (Bi-Encoder): This is the typical structure proposed in [1].
  • Filter Model (Cross-Encoder): This is in effect a text classification model that takes a (query, retrieved sentence) pair and returns a binary decision.

Note that

  • the retrieval model and the filter model are trained separately; they are not jointly trained in this work.
  • the ensemble is three systems working in parallel, with the collected sentences aggregated at the end.

During inference, the top-k retrieved samples are filtered by the trained filter model. The aggregated retrieved texts are combined with the original dataset to fine-tune the privacy QA model (see the sketch below).
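A hedged sketch of the coarse-to-fine pipeline; `retrievers` (the bi-encoders), `filter_model` (the cross-encoder), and their methods are hypothetical stand-ins, not the paper's unreleased code:

```python
# Hedged sketch of retrieve-then-filter augmentation for one query.
def augment(query, retrievers, filter_model, top_k=10):
    candidates = []
    for retriever in retrievers:  # the three systems run in parallel
        candidates.extend(retriever.retrieve(query, top_k=top_k))
    # the cross-encoder keeps only (query, sentence) pairs judged relevant
    return [s for s in candidates if filter_model.predict(query, s) == 1]
```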

Reference

  1. Dense Passage Retrieval for Open-Domain Question Answering (Karpukhin et al., EMNLP 2020) and HuggingFace.

Research Notes | Generalizable Hate Speech Detection

Overview

This post is the summary of the following methods; they rank top on the CivilComments-WILDS benchmark:

| Rank | Method | Paper |
| --- | --- | --- |
| 1 | FISH | [2104.09937] Gradient Matching for Domain Generalization (Shi et al., ICLR 2022) |
| 2, 3 | IRMX | [2206.07766] Pareto Invariant Risk Minimization: Towards Mitigating the Optimization Dilemma in Out-of-Distribution Generalization (Chen et al., ICLR 2023) |
| 4 | LISA | [2201.00299] Improving Out-of-Distribution Robustness via Selective Augmentation (Yao et al., ICML 2022) |
| 5 | DFR | [2204.02937] Last Layer Re-Training is Sufficient for Robustness to Spurious Correlations (Kirichenko et al., ICLR 2023) |
| 6, 8 | Group DRO | |
| 7, 12 | Reweighting | [1901.05555] Class-Balanced Loss Based on Effective Number of Samples (Cui et al., CVPR 2019) is one example; the reweighting method dates back to much earlier works. |

Reweighting, IRM, and CORAL

IRM [2] and CORAL [3] are two extensions of the basic reweighting method: each adds a penalty term on top of the reweighting loss. The penalty is based on some measure of the data representations from different domains and encourages the distributions across domains to be similar (see the sketch below).
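A hedged sketch of this shared structure; `penalty` stands in for the IRM or CORAL term, and all argument names are assumptions for illustration:

```python
# Hedged sketch: reweighted ERM plus a cross-domain representation penalty.
def total_loss(domain_losses, domain_weights, domain_reps, penalty, lam=1.0):
    erm = sum(w * l for w, l in zip(domain_weights, domain_losses))
    return erm + lam * penalty(domain_reps)
```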

Reference

  1. [2012.07421] WILDS: A Benchmark of in-the-Wild Distribution Shifts
  2. [1907.02893] Invariant Risk Minimization (Arjovsky et al.)
  3. [2007.01434] In Search of Lost Domain Generalization (Gulrajani and Lopez-Paz)

Reading Notes | Wild-Time – A Benchmark of in-the-Wild Distribution Shift over Time

[Semantic Scholar] – [Code] – [Tweet] – [Video] – [Website and Leaderboard] – [Slide] – [Lead Author]

Change Logs:

  • 2023-10-03: First draft. The authors provide 5 datasets (2 text classification datasets, 2 image classification datasets, and 1 EHR dataset) and more than 10 mitigation methods for distribution shift.

Experiments

  • The authors find that most of the mitigation methods are not more effective than standard ERM on the proposed benchmark. Note that the SimCLR and SwAV methods are only applicable to image classification tasks.

[Figures omitted: benchmark results of the mitigation methods compared to ERM]

Additional Notes

From the excerpt below, we could see the following (the sliding-window construction is also sketched at the end of this section):

To address this challenge, we adapt the above invariant learning approaches to the temporal distribution shift setting. We leverage timestamp metadata to create a temporal robustness set consisting of substreams of data, where each substream is treated as one domain. Specifically, as shown in Figure 3, we define a sliding window G with length L. For a data stream with T timestamps, we apply the sliding window G to obtain T − L + 1 substreams. We treat each substream as a “domain” and apply the above invariant algorithms on the robustness set. We name the adapted CORAL, GroupDRO and IRM as CORAL-T, GroupDRO-T, IRM-T, respectively. Note that we do not adapt LISA since the intra-label LISA performs well without domain information, which is also mentioned in the original paper.

  • The way the authors apply the group algorithms looks questionable: it does not make sense to create artificial domains by grouping data from consecutive timestamps. This may be why the authors do not observe performance gains.
  • LISA, which is the same lead author's work, seems to be a good approach, as it does not require domain labels while performing competitively.
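A minimal sketch of the sliding-window construction quoted above: a stream over T timestamps with window length L yields T - L + 1 substream "domains".

```python
# Hedged sketch of the sliding-window "domains" described in the excerpt.
def substreams(stream, L):
    return [stream[i:i + L] for i in range(len(stream) - L + 1)]

assert len(substreams(list(range(10)), 4)) == 7  # T = 10, L = 4
```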