Mirasol3B: A Multimodal Autoregressive model for time-aligned and contextual modalities

AJ Piergiovanni, Isaac Noble, Dahun Kim, Michael S. Ryoo, Victor Gomes, Anelia Angelova

TL;DR

The paper tackles multimodal learning with heterogeneous modalities by decoupling autoregressive modeling into time-aligned audio/video and non-aligned contextual modalities, enabling efficient long-form processing. It introduces a Combiner that fuses per-snippet audio-visual features into compact representations, with two implementations: a Transformer-based Combiner and a memory-augmented Token Turing Machine Combiner. The approach supports long videos (up to 512 frames) without increasing model size and achieves state-of-the-art results on VideoQA benchmarks and long-form video tasks, while maintaining reasonable compute. This work advances scalable, temporally-aware multimodal understanding across audio, video, and text streams.

Abstract

One of the main challenges of multimodal learning is the need to combine heterogeneous modalities (e.g., video, audio, text). For example, video and audio are obtained at much higher rates than text and are roughly aligned in time. They are often not synchronized with text, which comes as a global context, e.g., a title or a description. Furthermore, video and audio inputs are of much larger volume and grow as the video length increases, which naturally requires more compute dedicated to these modalities and makes modeling of long-range dependencies harder. Here we decouple the multimodal modeling, dividing it into separate, focused autoregressive models that process the inputs according to the characteristics of the modalities. We propose a multimodal model, called Mirasol3B, consisting of an autoregressive component for the time-synchronized modalities (audio and video) and an autoregressive component for the context modalities, which are not necessarily aligned in time but are still sequential. To address the long sequences of the video-audio inputs, we propose to further partition the video and audio sequences into consecutive snippets and autoregressively process their representations. To that end, we propose a Combiner mechanism, which models the audio-video information jointly within a timeframe. The Combiner learns to extract audio and video features from raw spatio-temporal signals, and then learns to fuse these features, producing compact but expressive representations per snippet. Our approach achieves state-of-the-art results on well-established multimodal benchmarks, outperforming much larger models. It effectively addresses the high computational demand of media inputs by learning compact representations, controlling the sequence length of the audio-video feature representations, and modeling their dependencies in time.
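The snippet partitioning described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the function names, the snippet length, and the choice to condition each snippet only on strictly earlier snippets are assumptions made for illustration. It shows how a time-ordered input is cut into consecutive snippets, and how the sequence seen by the autoregressive model depends on the number of snippets and the number of combined features per snippet rather than on the raw frame count.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def partition_into_snippets(items: Sequence[T], snippet_len: int) -> List[List[T]]:
    """Split a time-ordered sequence into consecutive, non-overlapping snippets."""
    if snippet_len <= 0:
        raise ValueError("snippet_len must be positive")
    return [list(items[i:i + snippet_len]) for i in range(0, len(items), snippet_len)]


def earlier_snippets(num_snippets: int) -> List[List[int]]:
    """For snippet t, list the indices of the snippets that precede it in time.

    Assumption: the autoregressive model conditions snippet t on snippets 0..t-1.
    """
    return [list(range(t)) for t in range(num_snippets)]


def compact_sequence_length(num_snippets: int, m: int) -> int:
    """Length of the sequence of combined features if each snippet yields m features."""
    return num_snippets * m


# Hypothetical sizes, chosen only to keep the example small.
frames = list(range(16))                      # 16 frame indices
snippets = partition_into_snippets(frames, snippet_len=4)
contexts = earlier_snippets(len(snippets))
seq_len = compact_sequence_length(len(snippets), m=2)
```

With these toy numbers, 16 frames become 4 snippets, and with 2 combined features per snippet the autoregressive model sees 8 feature vectors in total.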

Paper Structure (19 sections, 6 equations, 5 figures, 6 tables)

Figures (5)

  • Figure 1: Learning the time-aligned video and audio modalities autoregressively in time, decoupled from the autoregressive text modeling, allows for more effective multimodal models at smaller sizes and scales to longer videos.
  • Figure 2: The Mirasol3B model architecture consists of an autoregressive model for the time-aligned modalities, such as audio and video, which are partitioned into chunks (left), and an autoregressive model for the unaligned context modalities, which are still sequential, e.g., text (right). This allocates adequate computational capacity to the time-synchronized video/audio inputs, including processing them autoregressively in time, before fusing them with the autoregressive decoder for unaligned text. Joint feature learning is conducted by the Combiner, which balances the need for compact representations against the need for sufficiently informative features to be processed in time.
  • Figure 3: Autoregressive modeling of video and audio in time.
  • Figure 4: Combiners. Transformer Combiner (left): all features are input to the transformer, and a smaller number, $m$, of output features are selected as the combined features. TTM Combiner (right): uses the TTM mechanism to store a memory and compute the $m$ combined features, repeating this process at each time step.
  • Figure 5: Visualization of the different combiners explored in this paper. The Transformer combiner, which is the main one we used, simply takes the last $m$ features of the output to represent the combined inputs. We found this to work well. We found that the CLS combiner and the Perceiver combiner both underperformed the base combiner. The TTM combiner is different: it uses a memory to store the previous representations and has read, process, and write operations. We found that this method saved memory at some cost in accuracy on some datasets.
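The two combiners described in Figures 4 and 5 can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the paper's implementation. The names `self_attention`, `summarize`, `transformer_combiner`, and `ttm_combiner` are hypothetical. The transformer is replaced by a single self-attention step with no learned weights, and the TTM read and write steps are replaced by simple mean pooling, whereas the actual Token Turing Machine uses learned summarization. Only the data flow follows the captions: keep the last $m$ outputs, or read, process, and write against a memory at each step.

```python
import math
from typing import List

Vector = List[float]


def self_attention(tokens: List[Vector]) -> List[Vector]:
    """Stand-in for a transformer: one dot-product self-attention step, no learned weights."""
    dim = len(tokens[0])
    out = []
    for q in tokens:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in tokens]
        top = max(scores)
        weights = [math.exp(s - top) for s in scores]  # numerically stable softmax
        total = sum(weights)
        out.append([sum(w * v[d] for w, v in zip(weights, tokens)) / total
                    for d in range(dim)])
    return out


def summarize(tokens: List[Vector], k: int) -> List[Vector]:
    """Stand-in token reduction: mean-pool k contiguous groups (not the learned TTM version)."""
    n = len(tokens)
    if not 0 < k <= n:
        raise ValueError("k must satisfy 0 < k <= len(tokens)")
    dim = len(tokens[0])
    pooled = []
    for g in range(k):
        group = tokens[g * n // k:(g + 1) * n // k]
        pooled.append([sum(v[d] for v in group) / len(group) for d in range(dim)])
    return pooled


def transformer_combiner(video: List[Vector], audio: List[Vector], m: int) -> List[Vector]:
    """Mix all audio-video tokens of one snippet, then keep the last m outputs."""
    tokens = list(video) + list(audio)
    if not 0 < m <= len(tokens):
        raise ValueError("m must satisfy 0 < m <= number of input tokens")
    return self_attention(tokens)[-m:]


def ttm_combiner(snippets: List[List[Vector]], m: int, memory_size: int) -> List[List[Vector]]:
    """Per snippet: read from memory and input, process, then write back to memory."""
    dim = len(snippets[0][0])
    memory = [[0.0] * dim for _ in range(memory_size)]
    outputs = []
    for tokens in snippets:
        read = summarize(memory + tokens, m)                  # read
        processed = self_attention(read)                      # process
        memory = summarize(memory + processed, memory_size)   # write
        outputs.append(processed)
    return outputs


# Hypothetical toy features: 2 video tokens and 1 audio token of dimension 2.
video = [[1.0, 0.0], [0.0, 1.0]]
audio = [[0.5, 0.5]]
combined = transformer_combiner(video, audio, m=2)
ttm_out = ttm_combiner([video + audio, video + audio], m=2, memory_size=4)
```

In both cases each snippet is reduced to $m$ feature vectors, so the cost of the downstream autoregressive model is governed by $m$ and the number of snippets. In the TTM-style loop the memory has a fixed size, which is consistent with the caption's observation that this variant saves memory.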