COSMO: COntrastive Streamlined MultimOdal Model with Interleaved Pre-Training

Alex Jinpeng Wang, Linjie Li, Kevin Qinghong Lin, Jianfeng Wang, Kevin Lin, Zhengyuan Yang, Lijuan Wang, Mike Zheng Shou

TL;DR

CosMo addresses the challenge of long-context vision–language pretraining and cross-modal alignment by decoupling a frozen vision encoder and a two-segment LLM, trained with a document-style interleaved input format and an additional contrastive loss. It introduces Howto-Interlink7M, a high-quality interleaved video–text dataset built from HowTo100M with GPT-4 annotations to improve narrative coherence and alignment. Across 14 image-text and video-text benchmarks, CosMo achieves state-of-the-art results with only $34\%$ of learnable parameters and using $72\%$ of the data, outperforming OpenFlamingo, and gains further from the Howto-Interlink7M data. The work demonstrates that careful data curation, interleaved representations, and a lightweight fusion strategy can deliver strong few-shot and zero-shot performance in diverse multimodal tasks, with practical implications for scalable long-context VLP.

Abstract

In the evolution of Vision-Language Pre-training, shifting from short-text comprehension to extended textual contexts is pivotal. Recent autoregressive vision-language models such as Flamingo and PaLM-E \cite{flamingo, palme}, which leverage the long-context capability of Large Language Models, excel at few-shot text generation tasks but struggle with alignment tasks. To address this gap, we introduce a contrastive loss into text generation models and present the COntrastive-Streamlined MultimOdal framework (CosMo), which partitions the language model into a dedicated unimodal text-processing component and a multimodal data-handling component. CosMo unifies these unimodal and multimodal elements, improving performance on tasks involving textual and visual data while notably reducing learnable parameters. These models, however, demand extensive long-text datasets, and high-quality long-text video datasets remain scarce. To bridge this gap, this work introduces Howto-Interlink7M, an inaugural interleaved video-text dataset featuring comprehensive captions. We further show that Howto-Interlink7M improves model performance on image-text tasks. With 34% of the learnable parameters and 72% of the available data, our model significantly outperforms OpenFlamingo \cite{openflamingo}. For instance, on the 4-shot Flickr captioning task, performance improves from 57.2% to 65.%. The contributions of CosMo and Howto-Interlink7M are underscored by notable performance gains across 14 diverse downstream datasets encompassing both image-text and video-text tasks.
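
The abstract says only that a contrastive loss is added to a text generation model; it does not give the formulation. The following is a minimal sketch of how such a combined objective could look, assuming a symmetric InfoNCE-style contrastive term over pooled image and text embeddings plus a standard next-token cross-entropy term. The function names, the temperature of 0.07, and the `contrastive_weight` parameter are illustrative assumptions, not values taken from the paper.

```python
import numpy as np

def log_softmax(logits, axis=-1):
    # Numerically stable log-softmax along the given axis.
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))

def contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch of pairs (row i of each input is a match).

    The temperature value is an assumption made for this sketch.
    """
    image_emb = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    text_emb = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = image_emb @ text_emb.T / temperature  # (batch, batch) similarities
    idx = np.arange(logits.shape[0])
    image_to_text = -log_softmax(logits, axis=1)[idx, idx].mean()
    text_to_image = -log_softmax(logits, axis=0)[idx, idx].mean()
    return 0.5 * (image_to_text + text_to_image)

def language_modeling_loss(token_logits, target_ids):
    """Mean next-token cross-entropy; token_logits has shape (seq_len, vocab)."""
    log_probs = log_softmax(token_logits, axis=1)
    return -log_probs[np.arange(len(target_ids)), target_ids].mean()

def total_loss(image_emb, text_emb, token_logits, target_ids,
               contrastive_weight=1.0):
    # Hypothetical weighting: the source does not state how the terms are balanced.
    return (language_modeling_loss(token_logits, target_ids)
            + contrastive_weight * contrastive_loss(image_emb, text_emb))
```

In this sketch the contrastive term would be computed from the unimodal part of the language model and the language modeling term from the multimodal part, matching the two-part split the abstract describes; how the embeddings are pooled is left open here.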

Paper Structure (53 sections, 2 equations, 7 figures, 17 tables)

Figures (7)

  • Figure 1: Advancements in Vision-Language Pre-training (VLP) have transitioned towards accommodating long-form text inputs. (a) Earlier studies emphasized short, paired image/video-text correlations, exemplified by works such as CLIP \cite{clip} and GiT \cite{git}. (b) Present research emphasizes in-context learning strategies, showcased by approaches like Flamingo \cite{flamingo} and PaLM-E \cite{palme}. The strong text-processing ability of LLMs allows lengthy documents to be integrated easily and supports robust few-shot learning without extensive fine-tuning.
  • Figure 2: Comparing Conventional Video-Text Datasets with Our Howto-Interlink7M. Left: Conventional video-text datasets typically contain brief ASR captions describing videos. Right: In Howto-Interlink7M, videos are segmented into shots to capture more detail and improve video coherence. The GPT-4 model \cite{gpt4} is then employed to annotate each shot based on historical context and detailed annotations, including ASR, captions, and dense captions. We highlight hard object labels as well as connectives between clips.
  • Figure 3: An introduction to CosMo: This model handles both image/video-text pairs and interleaved image/video-text pairs. The Large Language Model is divided into two parts to compute the contrastive loss and the language modeling loss.
  • Figure 4: Instances of Low-Similarity Images: In datasets like MMC4 \cite{mmc4}, which are based on raw website content, there are often incongruent images that do not align with the accompanying text, leading to training instability.
  • Figure 5: LAION400M \cite{laion400m} and similar large datasets commonly suffer from redundancy. Clustering and uniform distance-based sampling help alleviate this issue.
  • ...and 2 more figures
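
The captions of Figures 4 and 5 name two data-cleaning ideas without giving details: dropping images that do not match their text, and reducing redundancy through clustering with uniform distance-based sampling. The sketch below shows one plausible reading of each, assuming precomputed image and text embeddings and precomputed cluster centroids. The function names, the similarity threshold, and the choice to spread picks evenly over the distance-to-centroid ranking are assumptions made for illustration, not the paper's procedure.

```python
import numpy as np

def filter_low_similarity(image_emb, text_emb, threshold=0.2):
    """Return indices of pairs whose cosine similarity is at least `threshold`.

    The threshold value is a hypothetical default, not taken from the paper.
    """
    image_emb = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    text_emb = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    sims = (image_emb * text_emb).sum(axis=1)  # row-wise cosine similarity
    return np.flatnonzero(sims >= threshold)

def uniform_distance_sample(embeddings, centroids, per_cluster):
    """Keep up to `per_cluster` items per cluster, spread over distance rank.

    Each item is assigned to its nearest centroid. Within a cluster, items are
    ranked by distance to the centroid and picked at evenly spaced ranks, so
    near-duplicates that sit at similar distances are thinned out.
    """
    dists = np.linalg.norm(
        embeddings[:, None, :] - centroids[None, :, :], axis=2
    )  # (num_items, num_clusters)
    assignment = dists.argmin(axis=1)
    kept = []
    for c in range(len(centroids)):
        members = np.flatnonzero(assignment == c)
        if members.size == 0:
            continue
        order = members[np.argsort(dists[members, c])]
        k = min(per_cluster, order.size)
        picks = np.linspace(0, order.size - 1, num=k).round().astype(int)
        kept.extend(order[np.unique(picks)].tolist())
    return sorted(kept)
```

In practice the centroids would come from a clustering step such as k-means over the dataset embeddings; that step is omitted here to keep the sketch self-contained.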