LongProLIP: A Probabilistic Vision-Language Model with Long Context Text

Sanghyuk Chun, Sangdoo Yun

TL;DR

The paper addresses a limitation of probabilistic vision-language models, their inability to handle long text, by extending ProLIP to 256-token inputs. It introduces LongProLIP, a LongCLIP-inspired fine-tuning strategy that balances long-context understanding with zero-shot generalization, aided by data filtering (HYPE, DFN) and mixed datasets. The approach achieves state-of-the-art performance on Urban-1k for long-context understanding and delivers strong results across the DataComp evaluation suite, while highlighting a trade-off with zero-shot capability that can be mitigated through careful fine-tuning. The work provides practical guidance for using long-context text in VL tasks and releases code for replication and further research.

Abstract

Recently, Probabilistic Language-Image Pre-Training (ProLIP) has been proposed to tackle the multiplicity issue of vision-language (VL) tasks. Despite their success in probabilistic representation learning at scale, ProLIP models cannot handle texts longer than their 64-token context length, which limits their ability to capture rich contextual information from longer text sequences. To address this issue, this paper proposes a fine-tuning strategy for ProLIP to accept longer texts, e.g., 256 text tokens. Experimental results on Urban-1k and the DataComp evaluation suite show that the proposed LongProLIP recipe can improve understanding of long contexts while minimizing the negative effect of fine-tuning. We also observe a trade-off between long-context understanding (measured by Urban-1k) and general zero-shot capability (measured by the DataComp evaluation datasets). Code is available at https://github.com/naver-ai/prolip

Paper Structure

This paper contains 12 sections, 1 figure, 4 tables.

Figures (1)

  • Figure 1: Overview of ProLIP. ProLIP consists of an image encoder and a text encoder. The mean and variance are estimated from the [CLS] token and the [UNC] token, respectively. Note that the original CLIP text encoder does not use a [CLS] token, whereas the ProLIP text encoder appends two additional tokens at the end of the sequence to serve as [CLS] and [UNC]. ProLIP is trained with probabilistic objective functions, such as the probabilistic pairwise contrastive loss, the inclusion loss, and the variational information bottleneck loss.
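
The token layout described in the Figure 1 caption can be illustrated with a minimal Python sketch. It is not taken from the ProLIP code base. The names `GaussianEmbedding`, `append_special_tokens`, and `read_gaussian` are hypothetical, as are the token ids. The order of the two appended tokens ([CLS] first, then [UNC]) and the reading of the [UNC] output as a log-variance are assumptions made for illustration only.

```python
import math
from dataclasses import dataclass
from typing import List, Sequence

Vector = List[float]


@dataclass
class GaussianEmbedding:
    mean: Vector     # encoder output at the [CLS] position
    log_var: Vector  # encoder output at the [UNC] position (ASSUMED to be a log-variance)

    def variance(self) -> Vector:
        # Exponentiate so that every variance entry is strictly positive.
        return [math.exp(v) for v in self.log_var]


def append_special_tokens(token_ids: Sequence[int], cls_id: int, unc_id: int) -> List[int]:
    """Append [CLS] and [UNC] as the last two tokens (order is an assumption)."""
    return list(token_ids) + [cls_id, unc_id]


def read_gaussian(hidden_states: Sequence[Vector]) -> GaussianEmbedding:
    """Read the mean and log-variance from the final two encoder outputs."""
    if len(hidden_states) < 2:
        raise ValueError("need at least the [CLS] and [UNC] positions")
    return GaussianEmbedding(mean=list(hidden_states[-2]),
                             log_var=list(hidden_states[-1]))


# Usage with stand-in values; a real encoder would produce hidden_states.
ids = append_special_tokens([11, 12, 13], cls_id=1, unc_id=2)
hidden_states = [[0.0, 0.0]] * 3 + [[0.5, -0.5], [0.0, 0.0]]
embedding = read_gaussian(hidden_states)
```

Under these assumptions, each text maps to a diagonal Gaussian rather than a single point, which is the kind of representation the probabilistic objectives named in the caption operate on.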