MuLTI: Efficient Video-and-Language Understanding with Text-Guided MultiWay-Sampler and Multiple Choice Modeling

Jiaqi Xu; Bo Liu; Yunkuo Chen; Mengli Cheng; Xing Shi

TL;DR

MuLTI tackles the dual challenges of efficiency and alignment in video-language understanding with long sequences. It introduces a Text-Guided MultiWay-Sampler that efficiently condenses and fuses long video and text features, and a new pretraining task, Multiple Choice Modeling, that bridges the gap between pretraining and downstream tasks such as video question answering. The approach yields state-of-the-art results across seven video-language benchmarks and uses notably less memory than prior methods. Together, these contributions enable accurate, scalable video-language understanding for industrial applications that involve long sequences and require rapid adaptation to downstream tasks.

Abstract

Video-and-language understanding has a variety of industrial applications, such as video question answering, text-video retrieval, and multi-label classification. Existing video-and-language understanding methods generally adopt heavy multi-modal encoders and feature fusion modules, which incur high computational costs. In particular, they have difficulty dealing with the dense video frames and long text prevalent in industrial applications. This paper proposes MuLTI, a highly accurate and efficient video-and-language understanding model that achieves efficient and effective feature fusion and rapid adaptation to downstream tasks. Specifically, we design a Text-Guided MultiWay-Sampler based on adapt-pooling residual mapping and self-attention modules to sample long sequences and fuse multi-modal features, which reduces computational costs and addresses the performance degradation caused by previous samplers. As a result, MuLTI can handle longer sequences with limited computational costs. Then, to further enhance the model's performance and address the lack of pretraining tasks for video question answering, we propose a new pretraining task named Multiple Choice Modeling. This task bridges the gap between pretraining and downstream tasks and improves the model's ability to align video and text features. Benefiting from the efficient feature fusion module and the new pretraining task, MuLTI achieves state-of-the-art performance on multiple datasets. Implementation and pretrained models will be released.
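
To make the idea of condensing a long sequence with an adapt-pooling residual more concrete, the following is a minimal sketch in plain Python, not the authors' implementation. It assumes single-head dot-product attention without learned projections, and it assumes the pooled features double as the queries; the function names `adaptive_avg_pool`, `attend`, and `condense` are hypothetical and chosen only for illustration.

```python
import math


def adaptive_avg_pool(seq, out_len):
    """Average-pool a list of feature vectors down to out_len vectors.

    Bin i covers indices [floor(i*n/out_len), ceil((i+1)*n/out_len)),
    the usual adaptive average pooling rule.
    """
    n, dim = len(seq), len(seq[0])
    pooled = []
    for i in range(out_len):
        start = (i * n) // out_len
        end = -((-(i + 1) * n) // out_len)  # ceiling division
        window = seq[start:end]
        pooled.append([sum(v[d] for v in window) / len(window) for d in range(dim)])
    return pooled


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attend(queries, keys_values):
    """Scaled dot-product attention; keys and values are the same vectors.

    Assumption: no learned projections and a single head, for brevity.
    """
    dim = len(queries[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in keys_values]
        weights = softmax(scores)
        out.append([sum(w * kv[d] for w, kv in zip(weights, keys_values)) for d in range(dim)])
    return out


def condense(seq, out_len):
    """Condense seq to out_len tokens: attention output plus pooled residual.

    Assumption: the pooled features serve as queries here; a real sampler
    may use other query vectors.
    """
    pooled = adaptive_avg_pool(seq, out_len)
    attended = attend(pooled, seq)
    return [[a + p for a, p in zip(av, pv)] for av, pv in zip(attended, pooled)]


# Usage: condense 8 two-dimensional "frame" features into 4 tokens.
frames = [[float(i), float(i % 3)] for i in range(8)]
condensed = condense(frames, 4)
```

The residual term keeps a pooled summary of the original sequence in every output token, so the condensed representation does not depend on the attention weights alone. The cost of the attention step grows with `out_len * len(seq)` rather than with the square of the full sequence length.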

Paper Structure (13 sections, 3 equations, 5 figures, 9 tables)

This paper contains 13 sections, 3 equations, 5 figures, 9 tables.

Introduction
Related Work
Methodology
MuLTI’s Architecture
Pretraining for MuLTI
Experiments
Implementation Details
Downstream Tasks and Datasets
Performance of Proposed Methods
The Importance of Text-Guided MultiWay-Sampler
The Importance of Multiple Choice Modeling
Ablation Experiment on Training Strategies
Conclusion

Figures (5)

Figure 1: Comparison of different models. Previous works such as (a) and (b) cannot easily handle long sequences. Previous works such as (c) use randomly initialized query vectors in the sampler to condense video features, which is a sub-optimal solution.
Figure 2: (a) shows the framework of MuLTI. MuLTI contains a video encoder, a text encoder, and a Text-Guided MultiWay-Sampler. The Text-Guided MultiWay-Sampler condenses the extracted features and fuses them. (b) shows the framework of the Text-Guided MultiWay-Sampler. The adapt-pooling feature provides the original information. In the sampler, we share the self-attention module and keep separate feed-forward networks for each modality to accommodate the different modalities.
Figure 3: Comparison of memory usage with existing methods for different numbers of frames. Text length is 512.
Figure 4: Comparison of memory usage for different text lengths and numbers of frames. F denotes Flatten, D denotes Decoder, E denotes Encoder, and S denotes Sampler. The number in parentheses is the number of frames.
Figure 5: A visualization of the cross-attention map from the Text-Guided MultiWay-Sampler.
