Active Acquisition for Multimodal Temporal Data: A Challenging Decision-Making Task

Jannik Kossen; Cătălina Cangea; Eszter Vértes; Andrew Jaegle; Viorica Patraucean; Ira Ktena; Nenad Tomasev; Danielle Belgrave

TL;DR

This work defines Active Acquisition for Multimodal Temporal Data (A2MT), a task where agents learn to selectively acquire high-cost modalities over time to balance predictive performance and expenditure. It introduces a Perceiver IO-based framework to handle multimodal, temporally evolving inputs, with two training regimes (small-data with separate models and large-data with shared encoders) and masking pretraining to simulate sparse observations. Experiments on synthetic scenarios demonstrate cross-modal reasoning capabilities, while evaluations on AudioSet and Kinetics-700 show cost-reactive acquisition behavior but limited per-input adaptivity, highlighting the task's difficulty and the need for further methodological advances. The work provides valuable benchmarks and insights with potential implications for domains like medicine, robotics, and finance, where modality informativeness and acquisition cost vary across contexts.

Abstract

We introduce a challenging decision-making task that we call active acquisition for multimodal temporal data (A2MT). In many real-world scenarios, input features are not readily available at test time and must instead be acquired at significant cost. With A2MT, we aim to learn agents that actively select which modalities of an input to acquire, trading off acquisition cost and predictive performance. A2MT extends a previous task called active feature acquisition to temporal decision making about high-dimensional inputs. We propose a method based on the Perceiver IO architecture to address A2MT in practice. Our agents are able to solve a novel synthetic scenario requiring practically relevant cross-modal reasoning skills. On two large-scale, real-world datasets, Kinetics-700 and AudioSet, our agents successfully learn cost-reactive acquisition behavior. However, an ablation reveals they are unable to learn adaptive acquisition strategies, emphasizing the difficulty of the task even for state-of-the-art models. Applications of A2MT may be impactful in domains like medicine, robotics, or finance, where modalities differ in acquisition cost and informativeness.
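The trade-off described in the abstract can be illustrated with a minimal sketch. It assumes the reward is simply the negative sum of the prediction loss and the total acquisition cost; that additive form, the function names, the modality names, and the cost values are all hypothetical choices for illustration and are not taken from the paper.

```python
from typing import Any, Dict, List, Optional

# One acquisition decision per timestep: modality name -> acquired or not.
Actions = List[Dict[str, bool]]


def acquisition_cost(actions: Actions, cost_per_modality: Dict[str, float]) -> float:
    """Sum the per-modality cost over every acquisition the agent made."""
    return sum(
        cost_per_modality[modality]
        for step in actions
        for modality, acquired in step.items()
        if acquired
    )


def mask_observations(
    sequence: List[Dict[str, Any]], actions: Actions
) -> List[Dict[str, Optional[Any]]]:
    """Hide every modality the agent did not acquire (replaced by None)."""
    return [
        {m: (obs[m] if act[m] else None) for m in obs}
        for obs, act in zip(sequence, actions)
    ]


def episode_reward(
    prediction_loss: float, actions: Actions, cost_per_modality: Dict[str, float]
) -> float:
    """Assumed reward: low prediction loss and low acquisition cost are both good."""
    return -(prediction_loss + acquisition_cost(actions, cost_per_modality))


# Hypothetical three-step episode with two modalities.
costs = {"audio": 0.5, "image": 2.0}
actions = [
    {"audio": True, "image": False},
    {"audio": False, "image": False},
    {"audio": True, "image": True},
]
sequence = [{"audio": f"a{t}", "image": f"i{t}"} for t in range(3)]
sparse = mask_observations(sequence, actions)
reward = episode_reward(0.25, actions, costs)
```

In this sketch the predictor would only ever see `sparse`, so raising the cost of a modality pushes the agent toward acquiring it less often, as long as the prediction loss does not grow by more than the cost saved.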

Paper Structure (31 sections, 1 equation, 12 figures, 9 tables, 2 algorithms)

Introduction
Active Acquisition for Multimodal Temporal Data
Datasets for A2MT
Synthetic Datasets
Audio-Visual Datasets
Perceiver IO for A2MT
Variant 1: Small Data Regime
Variant 2: Large Data Regime
Experiments
Synthetic Scenario
Audio-Visual Datasets
Model Pretraining
Agent Training
Related Work
Discussion
...and 16 more sections

Figures (12)

Figure 1: In many practical applications, features are not available a priori at test time and must be acquired at a real-world cost before the associated label can be predicted. In Active Acquisition for Multimodal Temporal Data, we aim to learn agents that efficiently acquire modalities of multimodal temporal inputs: (a) at each timestep, the agent decides which modalities of the input to acquire, paying a per-modality acquisition cost; (b) a separate model then makes a prediction given the sparse sequence of observations; (c) lastly, the agent is rewarded for low prediction loss and small acquisition cost.
Figure 2: The synthetic scenarios allow for sparse acquisition while keeping perfect accuracy. This requires agents capable of cross-modal reasoning. (Label is $9$ in the above.)
Figure 3: Four example sequences from the AudioSet training set. The labels associated with these inputs are (a) Music, (b) Music, (c) Speech, and (d) Electronic music. The audio signal is often more informative of the label than the images for AudioSet. Inputs are downsampled in the above visualization.
Figure 4: Acquisition behavior of the Perceiver IO agent on a simple synthetic scenario. 'Digit' and 'Counter' give ground-truth values for the Digit and Counter input modalities. 'Actions Digit' and 'Actions Counter' mark when the agent did (1) or did not (0) acquire each modality. The agent successfully learns a sparse acquisition strategy: it (almost always) acquires the Digit modality only if the Counter modality is $0$, and further learns to skip some acquisitions in the Counter modality.
Figure 5: Comparing learned acquisition patterns of an agent on AudioSet to the patterns of the random ablations. Our agent learns a set of fixed timesteps for which it always acquires, similar to the random-1hot baseline. In (a) and (c), acquisition rates are close to zero and too small to be visible for some timesteps.
...and 7 more figures
