4M-21: An Any-to-Any Vision Model for Tens of Tasks and Modalities

Roman Bachmann, Oğuzhan Fatih Kar, David Mizrahi, Ali Garjani, Mingfei Gao, David Griffiths, Jiaming Hu, Afshin Dehghan, Amir Zamir

TL;DR

Current multimodal foundation models are limited by the number of modalities and tasks they support out of the box. This paper extends the 4M framework to 21 diverse modalities by introducing modality-specific discrete tokenizers and co-training on large multimodal datasets, scaling the model to 3B parameters. The result is an any-to-any vision model capable of steerable generation and cross-modal retrieval without sacrificing performance, with open-source training code and models. This work enables finer-grained multimodal interaction and demonstrates that expanding modality diversity can improve transfer and generalization across tasks.

Abstract

Current multimodal and multitask foundation models like 4M or UnifiedIO show promising results, but in practice their out-of-the-box abilities to accept diverse inputs and perform diverse tasks are limited by the (usually rather small) number of modalities and tasks they are trained on. In this paper, we expand upon their capabilities by training a single model on tens of highly diverse modalities and by performing co-training on large-scale multimodal datasets and text corpora. This includes training on several semantic and geometric modalities, feature maps from recent state-of-the-art models like DINOv2 and ImageBind, pseudo labels of specialist models like SAM and 4DHumans, and a range of new modalities that allow for novel ways to interact with the model and steer the generation, for example image metadata or color palettes. A crucial step in this process is performing discrete tokenization on various modalities, whether they are image-like, neural network feature maps, vectors, structured data like instance segmentation or human poses, or data that can be represented as text. Through this, we expand on the out-of-the-box capabilities of multimodal models and specifically show the possibility of training one model to solve at least 3x more tasks/modalities than existing ones and doing so without a loss in performance. This enables more fine-grained and controllable multimodal generation capabilities and allows us to study the distillation of models trained on diverse data and objectives into a unified model. We successfully scale the training to a three billion parameter model using tens of modalities and different datasets. The resulting models and training code are open-sourced at 4m.epfl.ch.
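
To make the discrete tokenization step mentioned above more concrete, here is a minimal sketch of nearest-codebook vector quantization, the lookup at the core of VQ-VAE-style tokenizers. It is an illustration under stated assumptions, not the paper's tokenizer: the function names `quantize` and `dequantize`, the toy 2-D codebook, and the example feature vectors are all hypothetical, and a real tokenizer would learn both the encoder and the codebook.

```python
def quantize(vectors, codebook):
    """Map each continuous vector to the index of its nearest codebook entry."""
    tokens = []
    for v in vectors:
        # Squared Euclidean distance from v to every codebook entry.
        dists = [sum((a - b) ** 2 for a, b in zip(v, c)) for c in codebook]
        tokens.append(dists.index(min(dists)))
    return tokens


def dequantize(tokens, codebook):
    """Replace each discrete token by its codebook vector (a lossy reconstruction)."""
    return [codebook[t] for t in tokens]


# Hypothetical toy codebook with four 2-D entries (a learned one would be far larger).
codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

# Hypothetical encoder outputs, e.g. one vector per image patch.
features = [[0.1, -0.2], [0.9, 0.8], [0.2, 1.1]]

tokens = quantize(features, codebook)
reconstruction = dequantize(tokens, codebook)
print(tokens)
```

Once every modality is expressed as a sequence of such integer tokens, a single sequence model can consume and predict any of them in the same way.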

Paper Structure

This paper contains 40 sections, 15 figures, and 10 tables.

Figures (15)

  • Figure 1: We demonstrate training a single model on tens of highly diverse modalities without a loss in performance compared to specialized single/few task models. The modalities are mapped to discrete tokens using modality-specific tokenizers. The model can generate any of the modalities from any subset of them.
  • Figure 2: One-to-all generation. 4M-21 can generate all modalities from any given input modality and can benefit from chained generation [4m]. Notice the high consistency among the predictions of all modalities for one input. Each row starts from a different modality coming from the same scene. Highlighted in green are new input/output pairs that 4M [4m] can neither predict nor accept as input. Note that, while this figure shows predictions from a single input, 4M-21 can generate any modality from any subset of all modalities.
  • Figure 3: Tokenization overview. We employ suitable tokenization schemes for different modalities based on their format and performance. For image-like modalities and feature maps, we use spatial VQ-VAEs [Oord2017vqvae] with optional diffusion decoders for detail-rich modalities like RGB. For non-spatial modalities like global tokens or parameterized poses, we compress them to a fixed number of discrete tokens using Memcodes [Mama2021NWT] with MLP encoders and decoders. All sequence modalities are encoded as text using WordPiece [Devlin2019BERT]. The shown examples are real tokenizer reconstructions. Notice the low reconstruction error. See the appendix on dataset and tokenization details for more information.
  • Figure 4: Fine-grained & steerable multimodal generation. Top left: 4M-21 can generate variants of images that are grounded in any input modality, here human poses. Bottom left: This enables us to perform multimodal edits (e.g. editing the shape of a polygon or grounding generation with edges) and probe the learned representation. For example, by only changing the shape of the ellipse, 4M-21 renders the bowl from different angles. Top right: By pre-training on 21 types of modalities, including T5-XXL embeddings, and co-training with language modeling on a large text corpus, we show improved text understanding capabilities (even when the input is captions instead of language model embeddings). Bottom right: Compared to generating images from captions only, metadata provides a more direct and steerable way of controlling the multimodal data generation process, enabling exciting further research into generative dataset design.
  • Figure 5: Different modes of multimodal retrieval. We perform multimodal retrievals by predicting global embeddings (here shown for DINOv2) from a given input (of any modality) using 4M-21 and comparing the cosine distances between the query and retrieval set embeddings. Left: Retrieving RGB images from distinctly different query modalities (here RGB, segmentation map, edges, depth map, color palette, and caption). Middle: Retrieving any modality using any other modality as the query input. Each query modality constrains the retrievals differently, e.g. here the RGB image and caption queries always yield Neuschwanstein castle retrievals. In contrast, for depth and semantic queries, the scene is more ambiguous, thus they retrieve other buildings with similar characteristics. Right: We can also combine any subset of modalities to define the query input, e.g. surface normals and a color palette, to better control the retrieval. See the supplementary material for more results.
  • ...and 10 more figures
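
The retrieval procedure described in the Figure 5 caption (compare a predicted global embedding against a retrieval set using a cosine measure) can be sketched as follows. This is a hedged illustration only: the `retrieve` helper, the 3-D vectors, and the gallery names are hypothetical placeholders, and in the paper the query embedding would be predicted by 4M-21 from an input of any modality rather than written by hand. Ranking by highest cosine similarity is equivalent to ranking by lowest cosine distance.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def retrieve(query, gallery, top_k=2):
    """Return the names of the top_k gallery items most similar to the query."""
    scored = [(cosine_similarity(query, emb), name) for name, emb in gallery.items()]
    scored.sort(reverse=True)  # highest similarity (lowest cosine distance) first
    return [name for _, name in scored[:top_k]]


# Hypothetical retrieval set: one global embedding per stored item.
gallery = {
    "castle": [0.9, 0.1, 0.0],
    "church": [0.7, 0.6, 0.1],
    "beach": [0.0, 0.2, 0.9],
}

# Hypothetical query embedding; a stand-in for an embedding predicted from,
# say, a depth map or a caption.
query = [1.0, 0.2, 0.0]

ranked = retrieve(query, gallery, top_k=2)
print(ranked)
```

Because the comparison happens purely in embedding space, the same ranking code applies regardless of which modality produced the query.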