Retrieval-based Disentangled Representation Learning with Natural Language Supervision

Jiawei Zhou, Xiaoguang Li, Lifeng Shang, Xin Jiang, Qun Liu, Lei Chen

TL;DR

Disentangled representation learning is hampered by the absence of universal factors; the authors propose Vocabulary Disentangled Retrieval (VDR), a sparse lexical bi-encoder that maps data and natural-language descriptions into a shared vocabulary space and uses a Disentanglement (DST) head to encourage dimension-wise interpretability. VDR supports text-to-text and cross-modal retrieval through a gating mechanism and a nonparametric entry, with a symmetric contrastive objective L that combines q-to-p and p-to-q terms and a bow-based supervision term. They validate on 15 datasets including BEIR, MS COCO, and Flickr30k, reporting improvements over baselines and competitive performance with higher efficiency, including a 10x speed-up for nonparametric inference. Human evaluation indicates interpretability on par with state-of-the-art captioning models, underscoring VDR's potential for explainable retrieval.
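
The symmetric objective mentioned above can be illustrated with a minimal sketch. It assumes in-batch negatives, dot-product similarity, and an unweighted sum of the two directional terms; the summary does not specify any of these, and the bow-based supervision term is omitted. The function names `info_nce` and `symmetric_contrastive_loss` are hypothetical and not taken from the paper.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def info_nce(scores):
    """Mean negative log-softmax of the diagonal entries, taken row-wise.

    scores[i][j] is the similarity between item i on one side and item j on
    the other; the matching pair is assumed to sit on the diagonal.
    """
    total = 0.0
    for i, row in enumerate(scores):
        m = max(row)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(s - m) for s in row))
        total += log_z - row[i]
    return total / len(scores)


def symmetric_contrastive_loss(queries, passages):
    """Sum of a q-to-p term and a p-to-q term (assumed equal weighting)."""
    scores = [[dot(q, p) for p in passages] for q in queries]
    transposed = [list(col) for col in zip(*scores)]
    q_to_p = info_nce(scores)      # each query must pick out its passage
    p_to_q = info_nce(transposed)  # each passage must pick out its query
    return q_to_p + p_to_q


# Toy batch: two query/passage pairs in a 3-dimensional space.
queries = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
aligned = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
swapped = [[0.0, 1.0, 0.0], [1.0, 0.0, 0.0]]

loss_aligned = symmetric_contrastive_loss(queries, aligned)
loss_swapped = symmetric_contrastive_loss(queries, swapped)
```

In this toy batch the loss is lower when each query is paired with its matching passage than when the pairs are swapped, which is the behavior a contrastive objective is meant to reward.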

Abstract

Disentangled representation learning remains challenging as the underlying factors of variation in the data do not naturally exist. The inherent complexity of real-world data makes it infeasible to exhaustively enumerate and encapsulate all its variations within a finite set of factors. However, it is worth noting that most real-world data have linguistic equivalents, typically in the form of textual descriptions. These linguistic counterparts can represent the data and be effortlessly decomposed into distinct tokens. In light of this, we present Vocabulary Disentangled Retrieval (VDR), a retrieval-based framework that harnesses natural language as a proxy for the underlying data variation to drive disentangled representation learning. Our approach employs a bi-encoder model to represent both data and natural language in a vocabulary space, enabling the model to distinguish dimensions that capture intrinsic characteristics within data through its natural language counterpart, thus facilitating disentanglement. We extensively assess the performance of VDR across 15 retrieval benchmark datasets, covering text-to-text and cross-modal retrieval scenarios, as well as human evaluation. Our experimental results compellingly demonstrate the superiority of VDR over previous bi-encoder retrievers with comparable model size and training costs, achieving an impressive 8.7% improvement in NDCG@10 on the BEIR benchmark, a 5.3% increase on MS COCO, and a 6.0% increase on Flickr30k in terms of mean recall in the zero-shot setting. Moreover, the results from human evaluation indicate that the interpretability of our method is on par with SOTA captioning models.
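
To make the idea of a shared vocabulary space concrete, the sketch below treats each representation as a sparse mapping from vocabulary tokens to non-negative weights, scores relevance with a dot product, and inspects a representation by listing its highest-weighted tokens. The token weights, item names, and helper functions (`sparse_dot`, `top_tokens`, `retrieve`) are all hypothetical and hand-written for illustration; they are not outputs of VDR or part of its code.

```python
def sparse_dot(a, b):
    """Dot product of two sparse vectors stored as {token: weight} dicts."""
    if len(a) > len(b):
        a, b = b, a  # iterate over the smaller dict
    return sum(w * b[t] for t, w in a.items() if t in b)


def top_tokens(rep, k=3):
    """Return the k highest-weighted tokens (ties broken alphabetically)."""
    ranked = sorted(rep.items(), key=lambda kv: (-kv[1], kv[0]))
    return [token for token, _ in ranked[:k]]


def retrieve(query, corpus):
    """Rank corpus item names by descending sparse dot product with the query."""
    return sorted(corpus, key=lambda name: -sparse_dot(query, corpus[name]))


# Hypothetical vocabulary-space representations of two images and a text query.
corpus = {
    "image_dog": {"dog": 2.1, "grass": 1.3, "ball": 0.9, "run": 0.4},
    "image_cat": {"cat": 2.4, "sofa": 1.1, "sleep": 0.8},
}
query = {"dog": 1.0, "ball": 1.0}

ranking = retrieve(query, corpus)
explanation = top_tokens(corpus["image_dog"], k=2)
```

Because every dimension corresponds to a vocabulary token, the same vector that drives retrieval can also be read directly as a list of weighted words, which is the sense in which such a representation is interpretable.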

Paper Structure

This paper contains 47 sections, 6 equations, 9 figures, 10 tables.

Figures (9)

  • Figure 1: Illustration of the retrieval-based framework. Higher color intensity indicates higher values along that dimension.
  • Figure 2: Left: training and inference pipeline of VDR. Right: pseudo code for training VDR.
  • Figure 3: Ablation studies of different components of VDR$_\mathrm{cm}$.
  • Figure 4: Different approaches for internal inspection on disentangled representations.
  • Figure 5: Effectiveness-efficiency comparisons of different retrievers.
  • ...and 4 more figures