LoTLIP: Improving Language-Image Pre-training for Long Text Understanding

Wei Wu; Kecheng Zheng; Shuailei Ma; Fan Lu; Yuxin Guo; Yifei Zhang; Wei Chen; Qingpei Guo; Yujun Shen; Zheng-Jun Zha

TL;DR

LoTLIP targets long-text understanding in language-image pre-training. It adds learnable corner tokens, together with an attention mask that limits their interaction with the [CLS] token, so that the text encoder aggregates diverse information from long captions alongside standard short-text pre-training. Trained on 100M images re-captioned with long captions, using both long- and short-text losses, LoTLIP improves long-text-image retrieval without sacrificing short-text performance, outperforming prior long-text methods (e.g., Long-CLIP) and LiT-based baselines. It achieves state-of-the-art results on long-text retrieval while maintaining strong zero-shot performance on short-text tasks, and it reveals a trade-off between performance and efficiency as caption length scales.

Abstract

Understanding long text is in great demand in practice but beyond the reach of most language-image pre-training (LIP) models. In this work, we empirically confirm that the key cause is that training images are usually paired with short captions, leaving certain tokens easily overshadowed by salient tokens. To address this problem, our initial attempt is to relabel the data with long captions; however, learning directly from them may degrade performance in understanding short text (e.g., in the image classification task). Then, after incorporating corner tokens to aggregate diverse textual information, we help the model recover its original level of short-text understanding while greatly enhancing its capability for long-text understanding. We further investigate whether the model can continuously benefit from longer captions and observe a clear trade-off between performance and efficiency. Finally, we validate the effectiveness of our approach using a self-constructed large-scale dataset consisting of 100M long-caption-oriented text-image pairs. Our method demonstrates superior performance in long-text-image retrieval tasks. The project page is available at https://wuw2019.github.io/lot-lip.
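
The efficiency side of this trade-off can be illustrated with a back-of-the-envelope estimate. The sketch below counts multiply-accumulate operations (MACs) for a generic Transformer text encoder as the token limit grows. The function name `text_encoder_macs`, the width of 512, and the depth of 12 layers are illustrative assumptions, not values taken from the paper, and the formula is the usual rough estimate that ignores embeddings, layer normalization, and softmax.

```python
def text_encoder_macs(num_tokens: int, width: int = 512, layers: int = 12) -> int:
    """Rough MAC count for one forward pass of a Transformer text encoder.

    Per layer (standard estimate):
      - Q, K, V, and output projections: 4 * n * d^2
      - attention scores plus weighted sum of values: 2 * n^2 * d
      - MLP with a 4x hidden expansion: 8 * n * d^2
    width=512 and layers=12 are illustrative assumptions, not paper values.
    """
    n, d = num_tokens, width
    per_layer = 12 * n * d * d + 2 * n * n * d
    return layers * per_layer


if __name__ == "__main__":
    # Compare a 77-token limit with a 192-token limit.
    short, long_ = text_encoder_macs(77), text_encoder_macs(192)
    print(f"77 tokens:  {short:,} MACs")
    print(f"192 tokens: {long_:,} MACs")
    print(f"cost ratio: {long_ / short:.2f} (token ratio: {192 / 77:.2f})")
```

Because the attention term grows quadratically with the number of tokens, the estimated cost under this formula grows somewhat faster than the token limit itself.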

Paper Structure (35 sections, 7 equations, 6 figures, 11 tables)

Introduction
Related work
Language-Image Pre-training
Long-text Understanding
Preliminary of Language-Image Pre-training
Long Texts in Language-Image Pre-training
Long Text-Image Pair Dataset
Training Dataset.
Evaluation Dataset.
Exploring the Influence of Text Length
Method
Corner Tokens.
Optimization.
Experiments
Implementation Details and Datasets
...and 20 more sections

Figures (6)

Figure 1: Illustration of the impact of long vs. short captions on language-image pre-training, as observed in the cross-attention maps of CLIP. Training images are usually paired with short captions, leaving certain tokens (e.g., the garden token) easily overshadowed by salient tokens (e.g., the castle token). Fortunately, using long captions helps bring the overshadowed tokens back into the light, and this phenomenon is not influenced by the order of tokens within the sentence.
Figure 2: The influence of text length. A significant improvement is observed across all tasks when one randomly sampled sub-caption from the generated texts is added to the pre-training stage. As the number of sub-captions increases, the performance of the pre-trained model on long-text-image retrieval tasks consistently improves and then stabilizes (a). However, performance degrades on the MSCOCO retrieval task (b) and the ImageNet classification task (c).
Figure 3: Overview of LoTLIP. We add multiple learnable corner tokens ($[\texttt{Cor 1}], [\texttt{Cor 2}], \ldots$) after the $[\texttt{CLS}]$ token. These corner tokens are initialized differently so that they aggregate diverse token features. In addition, an attention mask mechanism limits the interaction between the $[\texttt{CLS}]$ token and the corner tokens to ensure the diversity of the gathered features.
Figure 4: Influence of token number limitation on LoTLIP. The performance of the pre-trained model on different tasks improves when the token number limitation increases up to 192, which exceeds the commonly used 77. Meanwhile, the FLOPs of the text encoder (red stars) rapidly increase with the text token number limitation.
Figure 5: Influence of the number of sub-captions used in the pre-training stage. Both LiT and LoTLIP are trained with long texts. Performance on ShareGPT4v and DCI retrieval is shown in (a) and (b), and performance on MSCOCO retrieval in (c) and (d); (e) shows image classification performance on ImageNet.
...and 1 more figures
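
The attention mask described in the Figure 3 caption can be sketched as follows. This is a minimal illustration, not the authors' implementation: the token layout ([CLS], then the corner tokens, then the text tokens), the function name `build_attention_mask`, and the choice to block attention between every pair of distinct special tokens while leaving all other attention unrestricted are assumptions made for the example.

```python
from typing import List


def build_attention_mask(num_corner: int, num_text: int) -> List[List[bool]]:
    """Return mask[q][k] == True when query position q may attend to key position k.

    Assumed layout (hypothetical): position 0 is [CLS], positions
    1..num_corner are corner tokens, and the remaining positions are text tokens.
    """
    num_special = 1 + num_corner
    total = num_special + num_text
    # Start from full attention, then block special-to-special interaction.
    mask = [[True] * total for _ in range(total)]
    for q in range(num_special):
        for k in range(num_special):
            if q != k:  # each special token may still attend to itself
                mask[q][k] = False
    return mask


if __name__ == "__main__":
    # Two corner tokens and three text tokens; print 1 where attention is allowed.
    for row in build_attention_mask(num_corner=2, num_text=3):
        print(" ".join("1" if allowed else "0" for allowed in row))
```

Under these assumptions, [CLS] and each corner token read from the text tokens independently of one another, which is one way to keep the gathered features from collapsing onto the same content.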
