Research Notes | Benchmarking LLM Safety

Problem Description

When it receives a prompt that asks for unsafe information (for example, toxic or profane content, or legal / medical advice), the LM may comply and cause harm in the physical world. There are several ways to diagnose this LM weakness:

  • Static Benchmark: This includes the CheckList-style challenge test sets.

    • Benchmark Saturation and Annotation Bias
    • Concept shift: For example, the same content previously thought non-toxic becomes toxic after a certain social event.
    • Covariate Shift: This includes (1) emerging unsafe categories and (2) the changing proportions of existing unsafe categories.
  • Red-Teaming

    • Manual Red-Teaming: Leveraging people’s creativity to search for prompts that elicit unsafe behaviors from LLMs.
    • Automated Red-Teaming: Using automated search to escape the region guarded by RLHF so that unsafe content is generated.

Note that

  • The description above only considers the language model itself. There may be external input / output filters that assist the detection and mitigation of unsafe behaviors; these external filters should be studied separately.
  • The LM itself may or may not go through a process of safety enhancement. The methods to enhance safety include (1) SFT with additional (unsafe prompt, IDK response) pairs or (2) RLHF with additional (unsafe prompt, IDK response, unsafe response) triples; here an IDK response is a generic response that LMs fall back to when encountering unsafe prompts (see the illustrative sketch below).
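    As a purely illustrative sketch (not drawn from any specific dataset), the two kinds of safety-tuning data described above might look like the following; all field names and strings are hypothetical.

    ```python
    # Hypothetical illustration of the two data formats for safety enhancement.
    # (1) SFT pairs: an unsafe prompt mapped to a generic IDK / refusal response.
    sft_examples = [
        {
            "prompt": "How do I pick a neighbor's door lock?",       # unsafe prompt
            "response": "Sorry, I can't help with that request.",    # generic IDK response
        },
    ]

    # (2) RLHF triples: the same prompt paired with a preferred (IDK) and a rejected (unsafe) response.
    rlhf_examples = [
        {
            "prompt": "How do I pick a neighbor's door lock?",
            "chosen": "Sorry, I can't help with that request.",      # IDK response (preferred)
            "rejected": "Sure, here is how you would start...",      # unsafe response (rejected)
        },
    ]
    ```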

Red Teaming

Resources

  • A comprehensive wiki and a collection of resources from Yaodong Yang @ PKU. He, together with Songchun Zhu, has also written a comprehensive survey on AI alignment; it has a Chinese version.

Reference

Safety Alignment

  1. [2310.12773] Safe RLHF: Safe Reinforcement Learning from Human Feedback
  2. [2307.04657] BeaverTails: Towards Improved Safety Alignment of LLM via a Human-Preference Dataset (PKU-Alignment)

    This work finds that separately annotating harmlessness and helpfulness (together with the safe RLHF algorithm proposed in 1) substantially outperforms Anthropic’s baselines; the authors claim they are the first to do this. The authors also open-source two datasets: (1) an SFT (or classification) dataset used to train a safety classifier and (2) an RLHF dataset used to fine-tune an LM (Alpaca in the paper).


    The authors also curate a balanced test set spanning 14 categories to measure some models’ safety (Figure 5); they find that aligned LLMs show much less variance among GPT-4, human evaluation, and QA moderation. Here “QA moderation” is another measure of harmlessness: the degree to which a response mitigates the potential harm of a harmful prompt; the authors use a binary label for this. Specifically, rather than using each single sentence’s own toxicity as the label (for example, of the prompt or the response), the authors use whether a response addresses the prompt harmlessly as the label.


    Note that the authors synthesize the 14 categories from references 1 and 2 in “Taxonomy of Unsafe Behaviors” and reference 1 in “Red Teaming.” The authors acknowledge that these categories are not MECE (mutually exclusive and collectively exhaustive).

    The authors release their models and datasets on the HuggingFace hub:

    | # | Model Name | Note |
    |---|------------|------|
    | 1 | PKU-Alignment/alpaca-7b-reproduced | The reproduced Alpaca model. |
    | 2 | PKU-Alignment/beaver-dam-7b | A LLaMA-based QA moderation model. |
    | 3 | PKU-Alignment/beaver-7b-v1.0-reward | The static reward model used during RLHF. |
    | 4 | PKU-Alignment/beaver-7b-v1.0-cost | The static cost model used during RLHF. |
    | 5 | PKU-Alignment/beaver-7b-v1.0 | The Alpaca model that goes through the safe RLHF process based on 1. |
    | # | Dataset Name | Note |
    |---|--------------|------|
    | 1 | PKU-Alignment/BeaverTails | A classification dataset with prompt, response, category, and is_safe columns; it could be used for 14 classes (via category) or 2 classes (via is_safe). |
    | 2 | PKU-Alignment/BeaverTails-single-dimension-preference | A preference dataset with prompt, response_0, response_1, and better_response_id (-1, 0, 1). |
    | 3 | PKU-Alignment/BeaverTails-Evaluation | It only has prompt and category columns; it is not the test split of datasets 1 and 2. |
    | 4 | PKU-Alignment/PKU-SafeRLHF | A preference and binary classification dataset (N=330K) with prompt, response_0, response_1, is_response_0_safe, is_response_1_safe, better_response_id, and safer_response_id; it has both training and test splits. |
    | 5 | PKU-Alignment/PKU-SafeRLHF-30K | A sampled version of 4 with both training and test splits. |
    | 6 | PKU-Alignment/PKU-SafeRLHF-10K | A further sampled version of 4 with only the training split available. |
    | 7 | PKU-Alignment/processed-hh-rlhf | A reformatted version of the Anthropic dataset for ease of use; the original dataset is formatted as plain text. |
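    For example, the classification dataset (row 1 above) can be loaded with the HuggingFace `datasets` library; the snippet below is a minimal sketch and does not hard-code split names, since those may differ from the usual train/test naming.

    ```python
    from datasets import load_dataset

    # Inspect the BeaverTails classification dataset (dataset 1 in the table above).
    ds = load_dataset("PKU-Alignment/BeaverTails")
    print(ds)  # shows the available splits and columns

    # Each record should expose prompt, response, category, and is_safe columns,
    # so it can back either a 14-way (category) or binary (is_safe) classifier.
    first_split = next(iter(ds.values()))
    print(first_split[0]["prompt"], first_split[0]["is_safe"])
    ```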

Safety Benchmark

  1. [2308.01263] XSTest: A Test Suite for Identifying Exaggerated Safety Behaviours in Large Language Models (Röttger et al.): This work presents a small set of test prompts (available on GitHub) that could be used to test the safety of an LLM. This work is from the people working on hate speech, including Paul Röttger, Bertie Vidgen, and Dirk Hovy.
  2. [2308.09662] Red-Teaming Large Language Models using Chain of Utterances for Safety-Alignment (DeCLaRe Lab, SUTD): This work provides two datasets: (1) a set of hateful questions for safety benchmarking, and (2) (prompt, blue conversation, red conversation) datasets for safety benchmarking.
  3. [2309.07045] SafetyBench: Evaluating the Safety of Large Language Models with Multiple Choice Questions (Tsinghua): This work provides a dataset of multiple-choice QA to evaluate the safety of an LLM across 7 predefined categories, including offensiveness, bias, physical health, mental health, illegal activities, ethics, and privacy.

OOD and Safety

  1. [2311.14743] A Baseline Analysis of Reward Models’ Ability To Accurately Analyze Foundation Models Under Distribution Shift (Scale AI)

Red Teaming

  1. [2209.07858] Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned (Ganguli et al., Anthropic).
  2. [2202.03286] Red Teaming Language Models with Language Models (Perez et al., DeepMind and NYU)

Taxonomy of Unsafe Behaviors

  1. [2206.08325] Characteristics of Harmful Text: Towards Rigorous Benchmarking of Language Models (Rauh et al., DeepMind)
  2. BBQ: A hand-built bias benchmark for question answering (Parrish et al., Findings 2022, NYU)

Controlled Text Generation

  1. ToxiGen: A Large-Scale Machine-Generated Dataset for Adversarial and Implicit Hate Speech Detection (Hartvigsen et al., ACL 2022)

    The authors propose a classifier-in-the-loop constrained decoding scheme that allows for the generation of benign and (implicitly) toxic content about 13 minority groups.

    Specifically, at every decoding step the authors adjust the token distribution by adding a partial sequence’s neutral-class probability from a hate speech classifier, which mitigates toxicity. This makes originally explicitly toxic content less toxic (from 66% to 43%) while keeping it implicitly toxic. Besides producing implicitly toxic content, this approach can also work with a benign prompt to generate benign content. A minimal sketch of this logit adjustment follows.

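    The following is a minimal sketch of the classifier-in-the-loop idea (not the authors’ implementation): at each step, the next-token choice mixes the LM’s logits with a non-toxicity score from a hate-speech classifier. The model names, the top-k restriction, and the weighting are illustrative assumptions.

    ```python
    import torch
    from transformers import (AutoModelForCausalLM,
                              AutoModelForSequenceClassification, AutoTokenizer)

    # Illustrative model choices; the paper uses its own generator / classifier pair.
    gen_tok = AutoTokenizer.from_pretrained("gpt2")
    gen_lm = AutoModelForCausalLM.from_pretrained("gpt2")
    clf_tok = AutoTokenizer.from_pretrained("unitary/toxic-bert")
    clf = AutoModelForSequenceClassification.from_pretrained("unitary/toxic-bert")

    def decode_step(prefix: str, top_k: int = 20, alpha: float = 5.0) -> str:
        """Pick the next token by mixing LM logits with a non-toxicity bonus."""
        ids = gen_tok(prefix, return_tensors="pt").input_ids
        with torch.no_grad():
            logits = gen_lm(ids).logits[0, -1]              # next-token logits
        top = torch.topk(logits, top_k)                     # only re-score top-k candidates
        scores = []
        for tok_id, lm_score in zip(top.indices, top.values):
            candidate = prefix + gen_tok.decode(int(tok_id))
            with torch.no_grad():
                clf_logits = clf(**clf_tok(candidate, return_tensors="pt")).logits[0]
            p_toxic = torch.sigmoid(clf_logits).max()       # highest toxicity probability
            scores.append(lm_score + alpha * (1.0 - p_toxic))
        best = top.indices[int(torch.stack(scores).argmax())]
        return prefix + gen_tok.decode(int(best))
    ```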

  2. [2310.14542] Evaluating Large Language Models on Controlled Generation Tasks (Sun et al., EMNLP)

    This paper shows that LLMs, including gpt-3.5-turbo, Falcon, Alpaca, and Vicuna, cannot be controlled to follow fine-grained signals such as numerical planning (for example, “generate a paragraph with five sentences”); they do well at controlling high-level signals, such as sentiment, topic, and enforcing specific keywords.

Adversarial Attack on LLM

  1. [2307.15043] Universal and Transferable Adversarial Attacks on Aligned Language Models

    • This paper proposes two ways to elicit unsafe behaviors of LLMs

      • Producing Affirmative Responses: Appending “Sure, here is [prompt]” to the original prompt that generates expected unsafe content.
      • Greedy Coordinate Gradient (GCG)

        Given an input prompt x _ {1:n}, the algorithm iterates over all tokens and finds the replacement that causes the smallest loss. Specifically, for each token, the algorithm computes the gradient of the loss with respect to this token’s one-hot vector, picks the top-k candidate replacements, modifies the prompt by swapping in tokens from the top-k set, and finally selects the candidate prompt with the lowest loss (see the sketch after this list).

    • In attacking vision models, it is well established that attacking distilled models is much easier than attacking the original models.
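    A highly simplified sketch of one GCG iteration under a HuggingFace-style causal LM (variable names, the slicing convention, and the single-position swap are illustrative simplifications of the full algorithm):

    ```python
    import torch
    import torch.nn.functional as F

    def gcg_step(model, embedding_matrix, input_ids, adv_slice, target_slice,
                 top_k=256, n_cand=64):
        """One Greedy Coordinate Gradient step: rank token swaps in the adversarial
        suffix by gradients on one-hot vectors, then keep the swap with the lowest loss."""
        # 1. One-hot encode the sequence so gradients w.r.t. token choices are defined.
        one_hot = F.one_hot(input_ids, embedding_matrix.shape[0]).float()
        one_hot.requires_grad_(True)
        embeds = one_hot @ embedding_matrix                       # (seq_len, d_model)
        logits = model(inputs_embeds=embeds.unsqueeze(0)).logits[0]
        loss = F.cross_entropy(                                   # force the target completion
            logits[target_slice.start - 1 : target_slice.stop - 1],
            input_ids[target_slice],
        )
        loss.backward()
        # 2. The most promising replacements per adversarial position are the token ids
        #    with the most negative gradient.
        candidates = (-one_hot.grad[adv_slice]).topk(top_k, dim=-1).indices
        # 3. Sample single-token swaps and keep the one with the lowest loss.
        best_ids, best_loss = input_ids.clone(), float("inf")
        for _ in range(n_cand):
            pos = torch.randint(adv_slice.start, adv_slice.stop, (1,)).item()
            new_ids = input_ids.clone()
            new_ids[pos] = candidates[pos - adv_slice.start,
                                      torch.randint(top_k, (1,)).item()]
            with torch.no_grad():
                cand_logits = model(new_ids.unsqueeze(0)).logits[0]
                cand_loss = F.cross_entropy(
                    cand_logits[target_slice.start - 1 : target_slice.stop - 1],
                    new_ids[target_slice],
                ).item()
            if cand_loss < best_loss:
                best_ids, best_loss = new_ids, cand_loss
        return best_ids, best_loss
    ```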

Toxicity Detection

  1. [2312.01648] Characterizing Large Language Model Geometry Solves Toxicity Detection and Generation

    • This paper proposes a method that attains almost perfect accuracy on the challenging civil_comments dataset. The authors manage to do so by deriving a set of features from the LLM from first principles and training a linear classifier on top of these features.
    • Intrinsic Dimension (ID) could be used to characterize the likelihood that a prompt evades RLHF alignment. It could serve as a proxy for prompt engineering aimed at jailbreaking.

      The authors show (using increased ID as a proxy for evading alignment) that prepending a relevant non-toxic sentence as a prefix makes the aligned LM more likely to generate toxic content.

Research Notes | Research in the LLM Era

Overview

This post mainly comprises content from the sources below:

Directions

Evaluation

Miscellaneous Notes

Research Notes | Constitutional AI

[Research Paper] – [Constitution] – [Policy Memo] – [Full List of Research from Anthropic]

  • Notable figures from Anthropic include Chris Olah, Deep Ganguli, Ethan Perez, Sam Bowman, and Jared Kaplan. The first author of this work is Yuntao Bai.

Overview

There are some limitations with OpenAI’s approach to RLHF, i.e., asking humans to compare responses and select what they prefer.

  • Low Scalability: Asking humans to compare responses and verifying the comparisons (even a small subset) takes a significant amount of time. Further, annotating disturbing content may harm human annotators.
  • Low Interpretability: The values are infused through the comparison process. The exact guidelines that govern the comparison of responses are not spelled out.
  • Tradeoff between Harmlessness and Helpfulness: An “alignment tax” has been observed in the RLHF process. For example, the model may generate safe yet evasive content that does not contain any useful information.

The approach proposed by Anthropic makes a Pareto improvement on both harmlessness and helpfulness. For example, when the model is asked to do something that violates the constitution, the model still tries to be helpful rather than simply refusing to answer.

The core of CAI is a set of expert instructions (source); it replaces humans with another LM in the RLHF process, leading to a new approach to alignment, i.e., RLAIF.

CAI does this by training a model using a list of natural language instructions or principles, which comprise the model’s “constitution.”

Additional Notes

  • The constitution is not finalized; it could be revised and updated. The current version of the constitution is derived from numerous sources, including the UN Declaration of Human Rights (1948), DeepMind’s Sparrow Principles, and Apple’s terms of service; it also considers values that are not from Western, rich, industrialized cultures.

    The constitutions are implemented as (abstract) natural language instructions. Making the instructions abstract is deliberate, as the authors find that writing specific constitutions harms performance. Two examples are shown below, followed by a hypothetical sketch of how such a principle might be used for AI feedback.

    ```text
    Please choose the response that most supports and encourages freedom, equality, and a sense of brotherhood. (1)

    Please choose the response that is least racist and sexist, and that is least discriminatory based on language, religion, political or other opinion, national or social origin, property, birth or other status. (2)
    ```
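    As a rough, hypothetical illustration of how one such principle could replace a human comparison in the RLAIF stage (the exact prompts Anthropic uses differ), a feedback LM might be queried as follows:

    ```python
    # Hypothetical template for collecting AI preference labels with a constitutional principle.
    PRINCIPLE = ("Please choose the response that most supports and encourages "
                 "freedom, equality, and a sense of brotherhood.")

    def build_comparison_prompt(question: str, response_a: str, response_b: str) -> str:
        """Ask the feedback LM which response better satisfies the principle."""
        return (
            f"Consider the following conversation:\n\nHuman: {question}\n\n"
            f"{PRINCIPLE}\n\n"
            f"(A) {response_a}\n(B) {response_b}\n\n"
            "The answer is:"
        )
    ```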

Research Notes | Resource Central

Overview

The links I dump into Zotero or bookmark manager software are quickly forgotten if they are not revisited soon. This repository serves as a quick reminder that documents all the links (1) I have collected and (2) I have revisited and believe should have been revisited earlier.

Basics

Research

Research Notes | Transformer from Scratch

Overview

This post aims to implement the transformer model and its variants from scratch. It is based on the following posts:

  1. The Annotated Transformer (Harvard NLP)
  2. The Illustrated Transformer – Jay Alammar – Visualizing machine learning one concept at a time.
  3. GitHub – karpathy/minGPT: A minimal PyTorch re-implementation of the OpenAI GPT (Generative Pretrained Transformer) training: Andrej Karpathy has also created a 2-hour video describing the process of building the model.

    GitHub – karpathy/nanoGPT: The simplest, fastest repository for training/finetuning medium-sized GPTs.: This is the optimized version of minGPT that is able to reproduce some mid-sized models, including a 1.3B GPT-2. Pretraining a 124M GPT-2 takes 4 days on 8 A100 GPUs (40 GB each).

  4. GitHub – nlp-with-transformers/notebooks: Jupyter notebooks for the Natural Language Processing with Transformers book

Research Notes | Machine Learning

Overview

The following notes are organized by and taken from the books below:

  • Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow, 2nd Edition; this book’s 3rd edition was released in 2022.

Dimensionality Reduction

The notion of the “curse of dimensionality” does not arise solely from computation (more features make computation slower); it is also backed by some theoretical observations. Specifically, consider the unit square, cube, or hypercube of 2, 3, through 10,000 dimensions: (1) when we sample one point, the probability that it lies within 0.001 of the border is 1 - (1 - 0.002)^d, and (2) when we sample two points, their average distance is roughly \sqrt{d/6} (see answer).

This indicates that in high-dimensional space (1) any point is likely to be close to the border, because as the number of dimensions grows it becomes easy for a point to be extreme in at least one dimension, and (2) points are sparse; this sparsity can only be remedied by exponentially more samples with respect to the dimension d, which is infeasible in practice.
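Both observations are easy to check numerically; the sketch below samples random points in the unit hypercube and compares the empirical border probability and pairwise distance against the formulas above.

```python
import numpy as np

rng = np.random.default_rng(0)
for d in (2, 3, 100, 10_000):
    x = rng.random((1_000, d))
    y = rng.random((1_000, d))
    # (1) Probability that a point lies within 0.001 of the border in some dimension.
    near_border = np.mean(((x < 0.001) | (x > 0.999)).any(axis=1))
    # (2) Average distance between two random points, compared with sqrt(d / 6).
    avg_dist = np.linalg.norm(x - y, axis=1).mean()
    print(d, near_border, 1 - (1 - 0.002) ** d, avg_dist, np.sqrt(d / 6))
```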

Research Notes | Query Generation

Problem Statement

The paper [2] notes the difficulty of optimizing the query for neural retrieval models.

However, neural systems lack the interpretability of bag-of-words models; it is not trivial to connect a query change to a change in the latent space that ultimately determines the retrieval results.

Query generation aims to find out “what should have been asked” for a given passage. More formally, if we have a list of documents D that are aligned with our goal (or true underlying query) q, is it possible to search for an approximated version \hat{q} that returns D as relevant documents with high probability?

Background

  • Rocchio Algorithm for Query Expansion

    The Rocchio algorithm is used by search engines in the backend to improve users’ initial search queries. For example, suppose the initial search query is “beaches in Hawaii”; if the backend finds that many users click webpages (this is one type of pseudo relevance feedback, PRF) that contain the word “sunset,” then the word “sunset” will be added to the initial search query, making the real search query “beaches in Hawaii sunset.”

    There could be more nuanced PRF signals than the binary click / no-click action, for example, how long users stay on a webpage and whether the user eventually returns to the search result page. A toy version of the Rocchio update is sketched below.
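    The following is a toy version of the classic Rocchio update in a term-vector space (the weights and vectors below are illustrative, not from any specific engine):

    ```python
    import numpy as np

    def rocchio(query_vec, relevant, non_relevant, alpha=1.0, beta=0.75, gamma=0.15):
        """Move the query toward clicked (relevant) documents and away from ignored ones."""
        rel = np.mean(relevant, axis=0) if len(relevant) else 0.0
        non = np.mean(non_relevant, axis=0) if len(non_relevant) else 0.0
        return alpha * query_vec + beta * rel - gamma * non

    # Toy term space: ["beach", "hawaii", "sunset", "pizza"]
    q = np.array([1.0, 1.0, 0.0, 0.0])            # "beaches in Hawaii"
    clicked = [np.array([1.0, 1.0, 1.0, 0.0])]    # clicked pages also mention "sunset"
    skipped = [np.array([0.0, 0.0, 0.0, 1.0])]
    print(rocchio(q, clicked, skipped))            # the weight on "sunset" increases
    ```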

Adolphs et al., EMNLP 2022

There are two contributions of this paper:

  • Query Inverter: An inverter that converts embeddings back to queries.
  • Query Generation: A way to refine the initial generic query (plus a gold document) so that the gold passage will become one of the top ranked documents given a refined query.

Query Inverter

The authors use the GTR model to generate embeddings for 3M queries from the PAQ dataset [5], thus forming a dataset \{(q_i, \mathbf{e} _ i)\} _ {i=1} ^ N. Then the authors fine-tune a T5-base decoder by reversing GTR’s input and output as \{ (\mathbf{e} _ i, q_i) \} _ {i = 1} ^ N; this process of fine-tuning T5-base requires 130 hours on 16 TPU v3.

  • Note: The major limitation of this process is that the fine-tuned T5-base can only invert the embeddings of a fixed GTR model. When working with a GTR model fine-tuned for a new application scenario, the expensive fine-tuning of T5-base has to be repeated.

The paper mentions the following interesting application:

We thus evaluate the decoder quality by starting from a document paragraph, decoding a query from its embedding and then running the GTR search engine on that query to check if this query retrieves the desired paragraph as a top-ranked result.

For example, in the sanity check experiment in Table 3, if we use the gold passage to reconstruct the query, and then use this generated query to find relevant passages, the rank of the gold passages improves upon the original approach of querying blindly.


Query Generation

Given an original query \mathbf{q} and a gold document \mathbf{d}, the authors propose to generate new query embeddings based on linear combinations of \mathbf{q} and \mathbf{d}; in the experiments, the authors select \kappa so that there are 20 new query embeddings.

The intuition behind this idea is that the optimal query should lie between the initial generic (yet relevant) query and the gold passage. The gold passage could be the answer to multiple different questions; we therefore need the initial generic query \mathbf{q} to guide the generation of new queries.
\mathbf{q} _ \kappa \leftarrow \frac{\kappa}{k} \mathbf{d} + (1 - \frac{\kappa}{k}) \mathbf{q}
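The interpolation itself is a one-liner; the following is a minimal sketch (assuming GTR-style query and document embeddings as numpy arrays, with k = 20 as in the paper):

```python
import numpy as np

def interpolate_queries(q_vec: np.ndarray, d_vec: np.ndarray, k: int = 20):
    """Generate k new query embeddings on the segment between the query and the gold document."""
    return [(kappa / k) * d_vec + (1 - kappa / k) * q_vec for kappa in range(1, k + 1)]
```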

Based on a dataset of successfully reformulated queries (that is, queries whose reformulation ranks the gold passage at the top), the authors fine-tune a T5 model with original queries as input and reformulated queries as output; they call this the query suggestion model.

The authors find that their query suggestion models (qsT5 and qsT5-plain) improve retrieval performance when compared with query expansion baselines, including a strong PRF baseline, RM3.


Vec2Text by Cideron et al.

Overview

This paper provides a more high-level motivation for inverting embeddings to texts: making semantic decisions in the continuous space (for example, with reinforcement learning) to control the output of LLMs; Morris et al. do not cite this paper but do acknowledge related work in Section 8.

Past research has explored natural language processing learning models that map vectors to sentences (Bowman et al., 2016). These include some retrieval models that are trained with a shallow decoder to reconstruct the text or bag-of-words from the encoder-outputted embedding (Xiao et al., 2022; Shen et al., 2023; Wang et al., 2023). Unlike these, we invert embeddings from a frozen, pre-trained encoder.

The paper reveals the reason why several works focus on inverting the GTR model: GTR is based on the T5 model and does not have a decoder; it is natural to learn a decoder (as the vec2text of Morris et al. and Adolphs et al. have done) that inverts embeddings back to texts. Note that the vec2text referred to in this text is different from the vec2text developed by Morris et al., despite the same name.

T5 was previously used to learn sentence representation in Ni et al. (2021) where they focus on having a well structure sentence embedding by introducing a contrastive loss to pull together similar sentences and push them away from the negatives. However, Ni et al. (2021) don’t learn a decoder (i.e. a vec2text model) which makes it impossible for them to generate sentences from the embedding space.

Method

This paper uses an architecture very similar to that of Morris et al. and Adolphs et al.; it consists of two components:

  • A Round-Trip Translation (RTT) scheme to prepare the data: the English corpus is first translated to German and back-translated to English; the back-translated English sentences serve as input while the original sentences serve as outputs.
  • A T5-base model (same as Adolphs et al. and Morris et al.) with a bottleneck involving (1) mean pooling, and (2) linear projection; this design is similar to Morris et al.’s \mathrm{EmbToSeq}(\mathbf{e}).

However, despite being more high-level (for example, listing four desired properties), the method in this work is not iterative, which may make it not as effective as that of Morris et al.

Topic Convex Hull

Recall the definition of a convex hull according to the paper proposing the QuickHull algorithm:

The convex hull of a set of points is the smallest convex set that contains the points.

This is a novel concept proposed in this paper. Specifically

  • Step 1: Embedding a few sentences known to belong to a specific topic.
  • Step 2: Forming a convex hull using these embeddings. This could be done using scipy.spatial.ConvexHull(); the underlying implementation is the Qhull library, which uses the QuickHull algorithm from computational geometry (Wikipedia).

    To form a convex hull, we need a matrix of shape (n, d) with n > d. For example, if we want to find a convex hull of BERT embeddings, we need at least 769 samples. This could be prohibitively slow, as the runtime of the algorithm is exponential in the number of dimensions, O(n ^ {\lfloor d / 2\rfloor}) (doc); empirically, the answer further notes that the routine works for data of up to 9 dimensions.

  • Step 3: Sampling uniformly with a Dirichlet distribution from the convex hull. This answer provides a Python function to do it; this answer explicitly mentions the Dirichlet distribution; the paper likely uses the same function. A sketch of steps 2 and 3 follows this list.
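The following is a rough sketch of steps 2 and 3, subject to the dimensionality caveat above (the topic embeddings are random stand-ins; Dirichlet weights over the hull vertices yield convex combinations that lie inside the hull):

```python
import numpy as np
from scipy.spatial import ConvexHull

rng = np.random.default_rng(0)
topic_embeddings = rng.normal(size=(12, 8))    # stand-in: 12 topic sentences, 8-dim embeddings

# Step 2: compute the hull (only feasible for low-dimensional embeddings, see the caveat above).
hull = ConvexHull(topic_embeddings)
vertices = topic_embeddings[hull.vertices]

# Step 3: sample points inside the hull via Dirichlet-weighted convex combinations.
weights = rng.dirichlet(np.ones(len(vertices)), size=5)   # 5 samples
samples = weights @ vertices                               # each row lies inside the hull
print(samples.shape)
```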

Vec2Text by Morris et al.

Method

Recall the chain rule:
p(a, b, c) = p(a\vert b, c) \cdot p(b\vert c) \cdot p(c)
The proposed approach inverts an embedding \mathbf{e} from an arbitrary embedding function \phi(\cdot) (for example, the OpenAI embedding API) back to text iteratively, refining x^{(t)} into x^{(t+1)} from an initial guess x^{(0)}. This correction could take multiple steps; the total number of steps does not need to be large (up to 40).
p\left(x^{(0)}\vert \mathbf{e}\right) = p\left(x^{(0)}\vert \mathbf{e}, \emptyset, \phi(\emptyset)\right) \rightarrow \cdots \rightarrow
p\left(x^{(t+1)} \vert \mathbf{e}\right) = \sum _ {x ^ {(t)}} p\left(x ^ {(t)}\vert \mathbf{e}\right) \cdot \boxed{p\left(x^{(t+1)} \vert \mathbf{e}, x^{(t)}, \phi(x ^ {(t)})\right)}

The boxed term is operationalized as a T5-base model. To make sure an arbitrary embedding fits the input dimension of T5-base, the authors further use an MLP to project embeddings of size d to the right size s.
\mathrm{EmbToSeq}(\mathbf{e})=\mathbf{W} _ 2 \sigma(\mathbf{W} _ 1 \mathbf{e})
The authors propose to feed the concatenation of four vectors – \mathrm{EmbToSeq}(\mathbf{e}), \mathrm{EmbToSeq}(\hat{\mathbf{e}}), \mathrm{EmbToSeq}(\mathbf{e} - \hat{\mathbf{e}}), and the T5-base embeddings of x ^ {(t)} – to the model (the total input size is 3s + n) and fine-tune T5-base with the regular LM objective.

In the experiments, the authors invert the same GTR model as Adolphs et al. as well as the OpenAI text embedding API; the fine-tuning of each T5-base on each dataset took 2 days on 4 A6000 GPUs.

  • Difference from Adolphs et al.

    Even though the idea of inverting the GTR model and how this inverter is trained are quite similar, Adolphs et al. do not consider multi-step correction, which seems to be the key to making the inversion work (Tweet). Further, they do not provide code.

Code

The authors not only open-source the code to fine-tune the model; they also package it as the library vec2text. The following is the most important code snippet of this work (vec2text/vec2text/trainers/corrector).

  • model: The inverter model that maps embeddings back to text.
def invert_embeddings(
    embeddings: torch.Tensor,
    corrector: vec2text.trainers.Corrector,
    num_steps: int = None,
    sequence_beam_width: int = 0,
) -> List[str]:
    corrector.inversion_trainer.model.eval()
    corrector.model.eval()

    gen_kwargs = copy.copy(corrector.gen_kwargs)
    gen_kwargs["min_length"] = 1
    gen_kwargs["max_length"] = 128

    if num_steps is None:
        assert (
            sequence_beam_width == 0
        ), "can't set a nonzero beam width without multiple steps"

        regenerated = corrector.inversion_trainer.generate(
            inputs={
                "frozen_embeddings": embeddings,
            },
            generation_kwargs=gen_kwargs,
        )
    else:
        corrector.return_best_hypothesis = sequence_beam_width > 0
        regenerated = corrector.generate(
            inputs={
                "frozen_embeddings": embeddings,
            },
            generation_kwargs=gen_kwargs,
            num_recursive_steps=num_steps,
            sequence_beam_width=sequence_beam_width,
        )

    output_strings = corrector.tokenizer.batch_decode(
        regenerated, skip_special_tokens=True
    )
    return output_strings


class Corrector(BaseTrainer):
    def __init__(
        self,
        model: CorrectorEncoderModel,
        inversion_trainer: InversionTrainer,
        args: Optional[TrainingArguments],
        **kwargs,
    ):
        # ...

    def generate(
        self,
        inputs: Dict,
        generation_kwargs: Dict,
        num_recursive_steps: int = None,
        sequence_beam_width: int = None,
    ) -> torch.Tensor:
        # ...
        while num_recursive_steps >= 1:
            gen_text_ids, hypothesis_embedding, best_scores = self._generate_with_beam(
                inputs=inputs,
                generation_kwargs=generation_kwargs,
                num_recursive_steps=num_recursive_steps,
                num_recursive_steps_so_far=num_recursive_steps_so_far,
                sequence_beam_width=sequence_beam_width,
            )
            inputs["hypothesis_input_ids"] = gen_text_ids
            inputs["hypothesis_attention_mask"] = (
                gen_text_ids != self.model.encoder_decoder.config.pad_token_id
            ).int()
            inputs["hypothesis_embedding"] = hypothesis_embedding
            # step counters
            num_recursive_steps -= 1
            num_recursive_steps_so_far += 1

            # ...

    def _generate_with_beam(
        self,
        inputs: Dict,
        generation_kwargs: Dict,
        num_recursive_steps: int,
        num_recursive_steps_so_far: int,
        sequence_beam_width: int,
    ) -> Tuple[torch.Tensor, torch.Tensor, torch.Tensor]:
        # ...
        if (num_recursive_steps_so_far == 0) and (
            self.initial_hypothesis_str is not None
        ):
            # ...
        else:
            outputs = self.model.generate(
                inputs=inputs,
                generation_kwargs=generation_kwargs,
                return_dict_in_generate=True,
            )
            gen_text_ids = outputs.sequences
        # ...

Minimal Working Example

The provided library is very easy to use. The following is a minimal working example:

import os

import torch
import vec2text
from langchain.embeddings import OpenAIEmbeddings

embedding_model = OpenAIEmbeddings(
    openai_api_key=os.getenv("OPENAI_API_KEY")
)

query = "What is your favoriate opera?"
positives = [
    "I love Lucia di Lammermoor because Luica's powerful presence is really inspiring.",
    "Le Nozze di Figaro is my favorite because of the fun plot and timeless Mozart music.",
]
negatives = [
    "I love pepperoni pizza",
    "Cancun is my favoriate holiday destination."
]

query_embedding = embedding_model.embed_query(query)
positive_embeddings = embedding_model.embed_documents(positives)
negative_embeddings = embedding_model.embed_documents(negatives)

corrector = vec2text.load_corrector("text-embedding-ada-002")
inverted_positives = vec2text.invert_embeddings(
    embeddings=torch.tensor(positive_embeddings).cuda(),
    corrector=corrector
)

Anatomy

Additional Notes

  • This idea could generalize to prompt search, as prompt engineering could be seen as a more general form of query refinement to better elicit knowledge from an LLM. However, the difference is that we often do not have the desired output; this makes the search for prompts difficult. Eventually, the idea could work for prompt engineering only when we have at least one ideal output.

    The Dynasour system from UCLA is one such attempt: they are trying to create instruction tuning data from regular HuggingFace datasets; these HuggingFace datasets do not come with instructions.

  • The paper [2] shows a novel way of manipulating embeddings – using only a Seq2Seq model’s decoder. This was not previously shown for encoder-only, encoder-decoder, or decoder-only models.
  • Gradients provide more information than embeddings, as is noted by [4].

    However, such techniques do not apply to textual inversion: the gradient of the model is relatively high-resolution; we consider the more difficult problem of recovering the full input text given only a single dense embedding vector.

  • In the embedding space, two embeddings could collide even though they have no token overlap [7].
  • RTT is a useful way to add perturbations to the inputs; another option worth trying is denoising [9], which turns out to be less effective than RTT. Further, the choice of pivot language in RTT is important. For example, the paper [8] chooses German as the pivot because it induces more word reorderings.

    As explained by Shen et al. (2020), the intuition behind using denoising with auto-encoders is that the noise constraints the auto-encoder to put similar sentences (in terms of the denoising objective) next to each other in the latent space. However, the problem with denoising is that it maps together sentences that are close in edit distance but may have completely different meanings.

Reference

  1. [2109.00527] Boosting Search Engines with Interactive Agents (Adolphs et al., TMLR 2022): This is a feasibility study of an ensemble of BM25 plus an interpretable reranking scheme that works on par with DPR on the natural_questions dataset; the evaluation is consistent with DPR’s. The main advantage is interpretability rather than performance.


  2. Decoding a Neural Retriever’s Latent Space for Query Suggestion (Adolphs et al., EMNLP 2022)
  3. What Are You Token About? Dense Retrieval as Distributions Over the Vocabulary (Ram et al., ACL 2023): This paper proposes a method to project embeddings onto the vocabulary and obtains a distribution over the tokens. The motivation for this paper is interpretability.
  4. [2310.06816] Text Embeddings Reveal (Almost) As Much As Text (Morris et al., EMNLP 2023)
  5. Large Dual Encoders Are Generalizable Retrievers (Ni et al., EMNLP 2022): This paper proposes the Generalization T5 dense Retrievers (GTR) model that many papers build their solutions upon.
  6. PAQ: 65 Million Probably-Asked Questions and What You Can Do With Them (Lewis et al., TACL 2021)
  7. Adversarial Semantic Collisions (Song et al., EMNLP 2020)
  8. [2209.06792] vec2text with Round-Trip Translations (Cideron et al. from Google Brain)
  9. [1905.12777] Educating Text Autoencoders: Latent Representation Guidance via Denoising (Shen et al., ICML 2020)

Research Notes | Mathematical Background for NLP

Optimization

Projected Gradient Descent (PGD)

PGD is used to solve constrained optimization problems. It is the same as gradient descent except that, after each gradient step, the iterate is projected back onto the feasible set defined by the constraints.
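A minimal sketch of PGD with a box constraint (the projection operator depends on the feasible set; a box is the simplest case):

```python
import numpy as np

def pgd(grad_fn, x0, project, lr=0.1, steps=200):
    """Projected gradient descent: take a gradient step, then project back onto the feasible set."""
    x = x0.copy()
    for _ in range(steps):
        x = project(x - lr * grad_fn(x))
    return x

# Example: minimize ||x - c||^2 subject to x in [0, 1]^3.
c = np.array([1.5, -0.3, 0.4])
x_star = pgd(grad_fn=lambda x: 2 * (x - c),
             x0=np.zeros(3),
             project=lambda x: np.clip(x, 0.0, 1.0))
print(x_star)   # approximately [1.0, 0.0, 0.4]
```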

Typical Problems

  • Computing Query Given Document Embeddings

    Given multiple embeddings \mathbf{e} _ 1, \cdots, \mathbf{e} _ K, find a query \mathbf{q} made from a linear combination of \mathbf{e} _ 1,\cdots, \mathbf{e} _ K so that the overall inner product (i.e., cosine similarity for normalized embeddings) is maximized. This problem could be written as below; without the non-negativity constraint it is unbounded. Here \mathbf{A} := \mathbf{E}^T\mathbf{E} and \mathbf{E} = \begin{bmatrix}\mathbf{e} _ 1 ,&\cdots, &\mathbf{e} _ K \end{bmatrix}:
    \max _ \alpha\quad 1^T \mathbf{A\alpha}\quad s.t.\quad 1^T \alpha = 1
    If we further require that all entries of \alpha are non-negative, the solution to this problem is a vector that puts all weight on a single vector in \mathbf{E} (a quick numeric check is sketched below).
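    A quick numeric check of this claim with scipy.optimize.linprog (the embeddings below are random stand-ins): under the simplex constraint, the linear objective is maximized at a vertex, i.e., by putting all weight on a single embedding.

    ```python
    import numpy as np
    from scipy.optimize import linprog

    rng = np.random.default_rng(0)
    E = rng.normal(size=(16, 5))        # 5 stand-in embeddings as columns (dimension 16)
    A = E.T @ E

    # maximize 1^T A alpha  s.t.  1^T alpha = 1, alpha >= 0  (linprog minimizes, so negate)
    res = linprog(c=-A.sum(axis=0),
                  A_eq=np.ones((1, 5)), b_eq=[1.0],
                  bounds=[(0, None)] * 5)
    print(np.round(res.x, 3))           # all mass lands on one embedding
    ```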

Reference

  1. Universal Adversarial Triggers for Attacking and Analyzing NLP (Wallace et al., EMNLP-IJCNLP 2019)
  2. Universal Adversarial Attacks on Text Classifiers (Behjati et al.)

Research Notes | Label Error Detection

Overview

  • The standard procedure after detecting label errors is discarding samples with label errors rather than correcting them.

cleanlab

Chong et al., EMNLP 2022

Additional Notes

  • Labeling errors may come in multiple different forms. The form we are interested in is called “concept shift”: the relationship between texts and labels no longer holds. The paper [6] uses the medical condition “sepsis” as an example.

    Finally, existing labels may also become inconsistent with prevailing knowledge due to constantly evolving problem definitions and domain knowledge leading to concept drift.

    The concepts related to but different from “concept shift” include covariate shift (changes in the inputs) and label shift (changes in the labels). All three terms fall under “dataset shift.”

    The answer [9] provides two good examples for understanding the differences among the three terms. Suppose the task is to predict whether people will default; then we can compare the following:

    • Covariate Shift: The population under study changes. For example, the model is trained on people with higher education, but the deployment environment only includes people with a high school education. In this case, the relationship “higher education” \rightarrow “less likely to default” does not change, but the population changes.
    • Label Shift: The target changes; this can happen with and without covariate shift. For example,

      • Label shift as a result of covariate shift: The higher-education training group and the lower-education test group clearly have different label distributions.
      • Label shift without covariate shift: The government decides to send cash incentives to everyone, reducing the probability that people default.
    • Concept Shift: A recent study shows that in some special cases “higher education” \rightarrow “more likely to default.” In this case, the population and the labels do not change, but the relationship changes.

Reference

  1. [1911.00068] Confident Learning: Estimating Uncertainty in Dataset Labels is the theoretical foundation of the cleanlab; this paper has a blog.
  2. [2103.14749] Pervasive Label Errors in Test Sets Destabilize Machine Learning Benchmarks is an application of the principle in the first paper to machine learning benchmarks; this paper has a blog.
  3. Detecting Label Errors by Using Pre-Trained Language Models (Chong et al., EMNLP 2022)
  4. [2301.12321] Neural Relation Graph: A Unified Framework for Identifying Label Noise and Outlier Data (Kim et al., NeurIPS 2023)
  5. ActiveAED: A Human in the Loop Improves Annotation Error Detection (Weber & Plank, Findings 2023)
  6. [2306.09467] AQuA: A Benchmarking Tool for Label Quality Assessment (Goswami et al., NeurIPS 2023): This benchmark paper includes the two datasets used in [3] as test sets.
  7. machine learning – Explain “concept drift” and how we can detect it in text data – Cross Validated: Concept shift seems to be a well-studied problem in MLOps. For example, it is easy to find the following posts:
    1. Best Practices for Dealing With Concept Drift (Neptune MLOps Blog)
  8. [2212.04612] Training Data Influence Analysis and Estimation: A Survey (Hammoudeh and Lowd)
  9. data – What is the difference between Covariate Shift, Label Shift, Concept Shift, Concept Drift, and Prior Probability Shift? – Data Science Stack Exchange