Predicting Long-horizon Futures by Conditioning on Geometry and Time

Tarasha Khurana, Deva Ramanan

TL;DR

This work tackles predicting long-horizon futures from past observations by repurposing large-scale 2D image diffusion models, conditioned on explicit frame timestamps, for video forecasting. The authors introduce timestamp conditioning, a two-stream context encoding, and invariances via pseudo-depth or grayscale to enable data-efficient fine-tuning on modest video datasets. They compare direct, autoregressive, hierarchical, and mixed sampling strategies, finding that mixed sampling offers superior long-horizon coherence and accuracy. Evaluations on TAO and CO3Dv2 demonstrate that invariant modalities and timestamp conditioning yield stronger predictive performance than RGB baselines and prior video-prediction methods, with practical applications including variable-framerate forecasting and frame interpolation. The work advances efficient, multi-modal video forecasting with flexible sampling and modality design choices, with potential impact on robotics and autonomous systems where predicting geometry over time is crucial.

Abstract

Our work explores the task of generating future sensor observations conditioned on the past. We are motivated by 'predictive coding' concepts from neuroscience as well as robotic applications such as self-driving vehicles. Predictive video modeling is challenging because the future may be multi-modal and learning at scale remains computationally expensive for video processing. To address both challenges, our key insight is to leverage the large-scale pretraining of image diffusion models, which can handle multi-modality. We repurpose image models for video prediction by conditioning on new frame timestamps. Such models can be trained with videos of both static and dynamic scenes. To allow them to be trained with modestly-sized datasets, we introduce invariances by factoring out illumination and texture by forcing the model to predict (pseudo) depth, readily obtained for in-the-wild videos via off-the-shelf monocular depth networks. In fact, we show that simply modifying networks to predict grayscale pixels already improves the accuracy of video prediction. Given the extra controllability with timestamp conditioning, we propose sampling schedules that work better than the traditional autoregressive and hierarchical sampling strategies. Motivated by probabilistic metrics from the object forecasting literature, we create a benchmark for video prediction on a diverse set of videos spanning indoor and outdoor scenes and a large vocabulary of objects. Our experiments illustrate the effectiveness of learning to condition on timestamps, and show the importance of predicting the future with invariant modalities.
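
Because the model can be queried at arbitrary timestamps, a sampling schedule amounts to a plan of (query timestamp, context timestamps) pairs. The sketch below illustrates one possible reading of direct, autoregressive, and hierarchical schedules. The function names, the `stride` parameter, and the nearest-neighbour fill-in rule are assumptions made for illustration and are not taken from the paper; the three-frame context follows the architecture caption. The authors' proposed schedules are not specified in this summary and are not reproduced here.

```python
def direct_plan(observed, future):
    """Query every future timestamp from the observed frames only."""
    return [(q, tuple(observed)) for q in future]


def autoregressive_plan(observed, future, n_ctx=3):
    """Slide a window so each query sees the n_ctx most recent frames,
    including frames generated at earlier steps."""
    window = list(observed)[-n_ctx:]
    plan = []
    for q in future:
        plan.append((q, tuple(window)))
        window = (window + [q])[-n_ctx:]
    return plan


def hierarchical_plan(observed, future, stride=2, n_ctx=3):
    """Hypothetical coarse-to-fine schedule: predict every `stride`-th
    future frame from the observations, then fill in the remaining frames
    from the n_ctx nearest timestamps that are already available."""
    known = list(observed)
    plan = []
    coarse = list(future)[stride - 1::stride]
    for q in coarse:
        plan.append((q, tuple(observed[-n_ctx:])))
    known.extend(coarse)
    for q in future:
        if q in coarse:
            continue
        nearest = sorted(sorted(known), key=lambda t: abs(t - q))[:n_ctx]
        plan.append((q, tuple(sorted(nearest))))
        known.append(q)
    return plan


observed = [-2, -1, 0]          # timestamps of the three context frames
future = [1, 2, 3, 4]           # timestamps to forecast

for name, plan in [
    ("direct", direct_plan(observed, future)),
    ("autoregressive", autoregressive_plan(observed, future)),
    ("hierarchical", hierarchical_plan(observed, future)),
]:
    print(name, plan)
```

Each plan entry would correspond to one call to the diffusion model. Direct queries never condition on generated frames, whereas the other two schedules do, which is one way errors can accumulate over long horizons.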

Paper Structure (25 sections, 6 equations, 17 figures, 3 tables)

This paper contains 25 sections, 6 equations, 17 figures, 3 tables.

Figures (17)

  • Figure 1: Predicting long-horizon futures by conditioning on geometry and time. In this work, we focus on the task of forecasting sensor observations given the past. Since the unobserved future can unfold in multiple ways, we capitalize on the recent explosion in large-scale pretraining of 2D diffusion networks, which are able to model the multi-modal distribution of natural images. By introducing invariances in data and additionally learning to condition on frame timestamps, we are able to equip 2D diffusion models with the ability to perform predictive video modeling using moderately-sized training data. Since we are able to query arbitrary timestamps, we find new sampling schedules that perform better than traditional autoregressive / hierarchical sampling strategies. Here, we show two pseudo-depth futures each, given the past pseudo-depth for four scenes, along with forecasts from training with luminance.
  • Figure 2: Using 2D diffusion models for video prediction. As part of designing the video prediction architecture, we make the important design choice of using image diffusion models. Owing to the scale of data such models are trained on, we can expect them to understand independent stages of temporal events such as 'turning head from left to right' and 'flower bud opening up'. We show individual frames prompted from Stable Diffusion v2. We propose to add a control knob to image models in the form of timestamps that helps in temporal understanding.
  • Figure 3: High-level architecture. We use a diffusion model that conditions on three video frames, their corresponding timestamps, and a query timestamp. It generates a single video frame for the query. We adopt the two-stream conditioning from image-to-image models [liu2023zero], and (1) channel-concatenate the context frames with the noisy input to the diffusion model, and (2) CLIP-encode the context frames for cross-attention across the UNet layers. Context and query timestamps are positionally encoded and concatenated with CLIP embeddings.
  • Figure 4: Qualitative analysis of single-frame short-horizon forecasting. We show examples of input-output-groundtruth triplets. Given 3 past frames as input, we show 3 different samples of the future from our diffusion network, and the corresponding groundtruth. The prediction highlighted in red is the closest to groundtruth. Despite learning from only 1000 videos and training for only 7 hours, our method learns to generate multiple realistic futures and respects low-level details in the historical context frames (e.g., scene structure, actors performing events, and overall camera motion). For reference, the events across examples in row-major order could be described as 'playing in field', 'crossing road', 'doing laundry', 'driving (front view)', 'exiting room while holding a box', 'picking up from table', 'driving (side view)', 'biking', 'fidgeting', 'boating with camera zooming in', 'standing in hallway', 'sailing'.
  • Figure 5: Comparison to state-of-the-art. We evaluate future depth prediction for +1s against state-of-the-art video prediction methods by retraining them for pseudo-depth prediction, and against other simple or non-learned baselines. We find that our method beats prior work by a substantial margin.
  • ...and 12 more figures
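
The Figure 3 caption states that context and query timestamps are positionally encoded and concatenated with CLIP embeddings. A minimal sketch of that step follows, under several assumptions that are not confirmed by the source: a sinusoidal encoding, an encoding width of 16, a 768-dimensional embedding per context frame (random placeholders stand in for real CLIP features), and concatenation along the feature axis. The function names `sinusoidal_encoding` and `build_conditioning` are hypothetical.

```python
import numpy as np


def sinusoidal_encoding(t, dim=16, max_period=10000.0):
    """Map timestamps of shape [n] to sin/cos features of shape [n, dim].

    The sinusoidal form is an assumption; the source only says the
    timestamps are 'positionally encoded'.
    """
    assert dim % 2 == 0, "dim must be even"
    t = np.asarray(t, dtype=np.float64).reshape(-1, 1)
    half = dim // 2
    freqs = np.exp(-np.log(max_period) * np.arange(half) / half)
    angles = t * freqs                       # [n, half]
    return np.concatenate([np.sin(angles), np.cos(angles)], axis=1)


def build_conditioning(frame_emb, context_t, query_t, dim=16):
    """Append context and query timestamp encodings to each frame embedding.

    frame_emb: [n_ctx, d] array of per-frame embeddings (placeholder for CLIP).
    Returns an array of shape [n_ctx, d + 2 * dim].
    """
    ctx_enc = sinusoidal_encoding(context_t, dim)
    qry_enc = np.repeat(sinusoidal_encoding([query_t], dim), len(context_t), axis=0)
    return np.concatenate([frame_emb, ctx_enc, qry_enc], axis=1)


rng = np.random.default_rng(0)
frame_emb = rng.standard_normal((3, 768))    # stand-in for CLIP embeddings
cond = build_conditioning(frame_emb, context_t=[-1.0, -0.5, 0.0], query_t=1.0)
print(cond.shape)  # (3, 800)
```

In a full model, an array like `cond` would serve as the cross-attention context for the UNet; how the authors actually combine the encodings may differ from this layout.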