[Semantic Scholar] – [Code] – [Tweet] – [Video] – [Slide]
Change Logs:
- 2023-10-05: First draft. This paper appears at EACL 2023; the arXiv version is 2204.08952. The code is not released.
Overview
Paraphrasing and back-translation are only applicable to texts whose meaning is robust to small surface changes. Privacy policies, however, can convey wildly different meanings under small wording changes, which makes these two augmentation techniques less applicable to the problem studied here.
Method
The authors propose a coarse-to-fine architecture for retrieval-based data augmentation. It consists of an ensemble of retrieval and filter models built from three encoders: (1) regular BERT, (2) PBERT, a BERT further fine-tuned with the MLM objective on privacy policies, and (3) PBERT further fine-tuned with SimCSE. A code sketch of one retrieve-then-filter system follows the two bullets below.
- Retrieval Model (Bi-Encoder): This follows the dense-retrieval bi-encoder structure proposed in [1].
- Filter Model (Cross-Encoder): This is essentially a text classification model that takes a (query, retrieved sentence) pair and returns a binary decision on whether to keep the retrieved sentence.
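Since the code is not released, the following is a minimal sketch of a single retrieve-then-filter system, assuming HuggingFace `transformers`. The checkpoint names are placeholders (the retriever could be any of BERT / PBERT / PBERT+SimCSE), [CLS] pooling with dot-product scoring follows the DPR convention of [1], and the filter head here is an untrained stand-in for the authors' trained model.

```python
import torch
from transformers import (AutoModel, AutoModelForSequenceClassification,
                          AutoTokenizer)

# Placeholder checkpoints: swap in PBERT / PBERT+SimCSE for the other systems.
retriever_tok = AutoTokenizer.from_pretrained("bert-base-uncased")
retriever = AutoModel.from_pretrained("bert-base-uncased")
filter_tok = AutoTokenizer.from_pretrained("bert-base-uncased")
filter_model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)  # stand-in for the trained filter

@torch.no_grad()
def embed(texts):
    # DPR-style sentence representation: the [CLS] hidden state.
    batch = retriever_tok(texts, padding=True, truncation=True,
                          return_tensors="pt")
    return retriever(**batch).last_hidden_state[:, 0]

@torch.no_grad()
def retrieve_then_filter(query, corpus, k=5):
    # Coarse step: dot-product retrieval of the top-k corpus sentences.
    scores = (embed([query]) @ embed(corpus).T).squeeze(0)
    topk = scores.topk(min(k, len(corpus))).indices.tolist()
    # Fine step: keep only pairs the cross-encoder classifies as positive.
    kept = []
    for i in topk:
        pair = filter_tok(query, corpus[i], truncation=True,
                          return_tensors="pt")
        if filter_model(**pair).logits.argmax(-1).item() == 1:
            kept.append(corpus[i])
    return kept
```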
Note that
- The retrieval model and filter model are trained separately; they are not jointly trained in this work.
- The ensemble here is three such systems running in parallel; the sentences each system retains are aggregated at the end.
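Under this reading, aggregation is just the deduplicated union of the three systems' outputs. The sketch below reuses the hypothetical `retrieve_then_filter` function from above, passed in once per encoder.

```python
def ensemble_retrieve(query, corpus, systems, k=5):
    # `systems` is a list of retrieve_then_filter-style callables, one per
    # encoder (BERT, PBERT, PBERT+SimCSE); collect and deduplicate outputs.
    kept = []
    for system in systems:
        kept.extend(system(query, corpus, k=k))
    return list(dict.fromkeys(kept))  # drop duplicates, preserve order
```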
During inference, the top-k retrieved samples are filtered by the trained filter model. The aggregated retrieved texts are then combined with the original dataset to fine-tune the privacy QA model, as sketched below.
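The sketch below only illustrates the data flow, assuming privacy QA is framed as (question, sentence) relevance classification and that retrieved sentences inherit the label of the annotated example used as the query; this label-transfer rule is an assumption, not necessarily the paper's exact procedure. It reuses `ensemble_retrieve` from above.

```python
def augment_dataset(train_set, corpus, systems, k=5):
    # train_set: iterable of (question, sentence, label) triples (assumed
    # format); positives are expanded with retrieved look-alike sentences.
    augmented = list(train_set)
    for question, sentence, label in train_set:
        if label != 1:
            continue
        for retrieved in ensemble_retrieve(sentence, corpus, systems, k=k):
            augmented.append((question, retrieved, 1))  # assumed label transfer
    return augmented  # fine-tune the privacy QA model on this set
```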
Reference
- [1] Dense Passage Retrieval for Open-Domain Question Answering (Karpukhin et al., EMNLP 2020); see also the HuggingFace implementation.