Probabilistic Hyper-Graphs using Multiple Randomly Masked Autoencoders for Semi-supervised Multi-modal Multi-task Learning

Pîrvu Mihai-Cristian, Marius Leordeanu

TL;DR

PHG-MAE introduces a probabilistic hypergraph framework realized through Masked Autoencoders to unify multi-modal, multi-task learning. By masking entire modalities, it samples hyperedges and enables test-time ensembles, while intermediate modalities derived from pretrained experts smooth the learning curve across tasks. The approach combines a single training loop for pretraining and fine-tuning, semi-supervised learning via pseudo-labels, and distillation to deploy compact RGB-only models with strong performance. Empirical results on the UAV-focused Dronescapes dataset show state-of-the-art or competitive semantic segmentation with significantly smaller models and improved temporal consistency. The authors provide an open data-pipeline and release code to facilitate scalable multi-modal vision research beyond UAV domains.

Abstract

The computer vision domain has greatly benefited from an abundance of data across many modalities to improve on various visual tasks. Recently, much attention has gone to self-supervised pre-training methods based on Masked Autoencoders (MAE) \cite{he2022masked,bachmann2022multimae}, usually used as a first step before optimizing for a downstream task, such as classification or regression. This is useful because it requires no manually labeled data. In this work, we introduce Probabilistic Hyper-Graphs using Masked Autoencoders (PHG-MAE): a novel model that unifies the classical work on neural graphs \cite{leordeanu2021semi} with the modern approach of masked autoencoders under a common theoretical framework. By randomly masking entire modalities, not just patches, the model samples from the distribution of hyper-edges on each forward pass. Additionally, the model adapts the standard MAE algorithm by combining pre-training and fine-tuning into a single training loop. Moreover, our approach enables the creation of inference-time ensembles which, through aggregation, improve the performance and consistency of the final prediction. Lastly, we show that we can apply knowledge distillation on top of the ensembles with little loss in performance, even with models that have fewer than 1M parameters. While our work mostly focuses on outdoor UAV scenes that contain multiple world interpretations and modalities, the same steps can be followed in other similar domains, such as autonomous driving or indoor robotics. To streamline the integration of external pre-trained experts in multi-modal multi-task learning (MTL) scenarios for computer vision, we developed a data-pipeline tool. Using this tool, we have created and released a fully automated extension of the Dronescapes dataset. All the technical details, code and reproduction steps are publicly released.
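
The two mechanisms described in the abstract, masking whole modalities and aggregating several masked forward passes at inference time, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the `keep_prob` parameter, the independent per-modality coin flip, and the `toy_model` stand-in are all assumptions made for illustration, and averaging is just one possible aggregation function.

```python
import numpy as np


def sample_modality_mask(intermediate_names, keep_prob, rng):
    """Pick which intermediate modalities stay visible for one forward pass.

    Each modality is kept or masked as a whole (assumption: independent
    coin flips with probability keep_prob), never patch by patch.
    """
    return [name for name in intermediate_names if rng.random() < keep_prob]


def ensemble_predict(model, rgb, intermediates, n_samples, keep_prob, rng):
    """Average the predictions of n_samples randomly masked forward passes."""
    predictions = []
    for _ in range(n_samples):
        kept = sample_modality_mask(sorted(intermediates), keep_prob, rng)
        visible = {name: intermediates[name] for name in kept}
        predictions.append(model(rgb, visible))
    # Aggregation by simple averaging over the sampled passes.
    return np.mean(predictions, axis=0)


def toy_model(rgb, visible):
    """Hypothetical stand-in for a trained network: 2-class per-pixel scores."""
    signal = rgb.mean(axis=-1)
    for value in visible.values():
        signal = signal + value
    return np.stack([signal, -signal], axis=-1)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    rgb = np.ones((4, 4, 3))
    intermediates = {
        "depth": np.full((4, 4), 0.5),
        "semantic": np.full((4, 4), 0.25),
    }
    scores = ensemble_predict(toy_model, rgb, intermediates, 8, 0.5, rng)
    print(scores.shape)
```

Each sampled set of visible modalities plays the role of one hyper-edge from inputs to the output; a real model would replace `toy_model`, and RGB is kept always visible here as an assumption.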

Paper Structure

This paper contains 34 sections, 27 figures, 8 tables.

Figures (27)

  • Figure 1: From a graph to a bipartite hypergraph: defining inputs and outputs, followed by creating hyperedges from many input nodes to one or more output nodes. Three edges (out of many valid ones): $[I_{1},I_{2}] \rightarrow O_1$ (hyperedge, top right), $I_3 \rightarrow O_3$ (edge, middle right) and $[I_4,\dots,I_n] \rightarrow [O_{m-1}, O_m]$ (multi-output hyperedge, bottom right).
  • Figure 2: Left: Standard Masked Autoencoder model. Right: Masked Autoencoder model with explicit masking rules that model certain input and output pairs which teach the model about the distribution of hyperedges. Two such masking operations are shown (out of many valid ones): $I_1 \rightarrow O_2$ (top right), and $[I_{1},I_{2}] \rightarrow O_1$ (bottom right). Inputs are either seen or masked, but never reconstructed, while outputs are always masked and sometimes reconstructed.
  • Figure 3: We use a pretrained MAE model to generate ensemble candidates. In this figure we do not distinguish between inputs and outputs, showing that our ensembles can also work with a regular MAE, not just PHG-MAE. We do multiple random maskings which in turn result in a set of multiple reconstructions. Lastly, we combine them using an ensemble aggregation function, like averaging. In our experiments, we combine this technique with the PHG-MAE masking from Figure 2 (right side) where inputs and outputs are clearly defined.
  • Figure 4: The Data-Pipeline and the PHG-MAE model. Left: The process of deriving modalities as pseudo-labels from pretrained experts using RGB only, followed by deriving new modalities from combinations of experts. Right: The integration of these experts & generated modalities in the PHG-MAE semi-supervised training and inference procedure. Modalities are grouped in input, intermediate or output. Intermediate modalities are probabilistically masked and they drive the generation of ensembles at inference time. Different from other works, we mask at whole modality level, not at patch level. Note that in our experiments we generate up to 9 intermediate modalities out of 4 experts: 1 depth and 3 for semantic segmentation. However, this is in no way comprehensive and more modalities could be added in the future to further enhance the data. In the appendix we provide a full sample containing all of them.
  • Figure 5: Temporal Consistency through Optical Flow. Example for a single pixel of a 3x3 frame resulting in a score of 0.5: the pixel in the previous image is consistent (same class), while the pixel on the right changed. This process is repeated for all the pixels in an image and averaged for a final per-frame score.
  • ...and 22 more figures
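
The per-pixel scoring described in the Figure 5 caption can be sketched as follows. This is an illustrative reading of the caption, not the paper's evaluation code: it assumes each pixel is compared against its flow-matched pixel in the previous and the next frame, that flow is stored as per-pixel `(dx, dy)` displacements toward the neighboring frame, and that matches are rounded to the nearest pixel and clipped at the image border. All function names are hypothetical.

```python
import numpy as np


def flow_match_agreement(seg_cur, seg_other, flow):
    """Return 1.0 where the flow-matched pixel in seg_other has the same class.

    flow[y, x] = (dx, dy) is the assumed displacement from the current
    frame to the other frame.
    """
    height, width = seg_cur.shape
    ys, xs = np.mgrid[0:height, 0:width]
    # Nearest-pixel lookup, clipped to the image bounds (assumption).
    target_x = np.clip(np.rint(xs + flow[..., 0]).astype(int), 0, width - 1)
    target_y = np.clip(np.rint(ys + flow[..., 1]).astype(int), 0, height - 1)
    return (seg_other[target_y, target_x] == seg_cur).astype(float)


def temporal_consistency(seg_prev, seg_cur, seg_next, flow_to_prev, flow_to_next):
    """Per-pixel agreement with both neighbor frames, plus the frame average."""
    agree_prev = flow_match_agreement(seg_cur, seg_prev, flow_to_prev)
    agree_next = flow_match_agreement(seg_cur, seg_next, flow_to_next)
    per_pixel = (agree_prev + agree_next) / 2.0
    return per_pixel, float(per_pixel.mean())


if __name__ == "__main__":
    # 3x3 frames with no motion: the center pixel keeps its class in the
    # previous frame but changes class in the next one.
    seg_prev = np.zeros((3, 3), dtype=int)
    seg_cur = np.zeros((3, 3), dtype=int)
    seg_next = np.zeros((3, 3), dtype=int)
    seg_next[1, 1] = 1
    no_flow = np.zeros((3, 3, 2))
    per_pixel, frame_score = temporal_consistency(
        seg_prev, seg_cur, seg_next, no_flow, no_flow
    )
    print(per_pixel[1, 1], frame_score)
```

In this toy case the center pixel agrees with one of its two neighbors, which matches the 0.5 score in the caption's example, and the frame score is the mean over all nine pixels.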