HVM-1: Large-scale video models pretrained with nearly 5000 hours of human-like video data

A. Emin Orhan

TL;DR

HVM-1 investigates whether large-scale self-supervised learning from human-like, mostly egocentric video can yield robust visual representations. The author trains two 633M-parameter ViT-H encoders with a spatiotemporal MAE objective on roughly 4971 hours of diverse human-like video data, compares them to a Kinetics-700 pretrained baseline, and evaluates all models on few-shot action and object recognition. The HVM-1 models show competitive few-shot performance, particularly at $448\times 448$ resolution, and learn better object representations than image-based MAE pretraining on the same data, suggesting that learning temporal regularities improves representation quality. The results highlight the potential of human-like video data for scalable visual learning and offer resources to researchers at the intersection of ML and cognitive science, while acknowledging limitations related to dataset heterogeneity and embodiment.

Abstract

We introduce Human-like Video Models (HVM-1), large-scale video models pretrained with nearly 5000 hours of curated human-like video data (mostly egocentric, temporally extended, continuous video recordings), using the spatiotemporal masked autoencoder (ST-MAE) algorithm. We release two 633M parameter models trained at spatial resolutions of 224x224 and 448x448 pixels. We evaluate the performance of these models in downstream few-shot video and image recognition tasks and compare them against a model pretrained with 1330 hours of short action-oriented video clips from YouTube (Kinetics-700). HVM-1 models perform competitively against the Kinetics-700 pretrained model in downstream evaluations despite substantial qualitative differences between the spatiotemporal characteristics of the corresponding pretraining datasets. HVM-1 models also learn more accurate and more robust object representations compared to models pretrained with the image-based MAE algorithm on the same data, demonstrating the potential benefits of learning to predict temporal regularities in natural videos for learning better object representations.
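As a rough illustration of the masking step behind a spatiotemporal masked autoencoder, the sketch below splits a clip into spacetime patches and hides a random subset of them, so that only the visible patches would be passed to the encoder. The clip length (16 frames), the patch size (2x16x16), the 90% masking ratio, and the function name are illustrative assumptions, not values taken from the text above.

```python
import random


def random_spacetime_mask(num_frames, height, width,
                          patch=(2, 16, 16), mask_ratio=0.9, seed=0):
    """Return (visible_ids, masked_ids) over the flattened spacetime patch grid.

    All default values are hypothetical and chosen only for illustration.
    """
    pt, ph, pw = patch
    if num_frames % pt or height % ph or width % pw:
        raise ValueError("clip dimensions must be divisible by the patch size")

    # Number of non-overlapping spacetime patches in the clip.
    num_patches = (num_frames // pt) * (height // ph) * (width // pw)
    num_masked = int(round(mask_ratio * num_patches))

    # Shuffle patch indices and hide the first `num_masked` of them.
    ids = list(range(num_patches))
    random.Random(seed).shuffle(ids)
    masked = sorted(ids[:num_masked])
    visible = sorted(ids[num_masked:])
    return visible, masked


# Hypothetical 16-frame clips at the two spatial resolutions named in the abstract.
visible_224, masked_224 = random_spacetime_mask(16, 224, 224)
visible_448, masked_448 = random_spacetime_mask(16, 448, 448)
print(len(visible_224), len(masked_224))
print(len(visible_448), len(masked_448))
```

Under these assumptions, the 448x448 clip produces four times as many patches as the 224x224 clip, which is one reason higher-resolution pretraining is more expensive.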

Paper Structure (12 sections, 5 figures, 1 table)

This paper contains 12 sections, 5 figures, 1 table.

Figures (5)

  • Figure 1: (a) Datasets used for pretraining HVM-1 models. (b) Evolution of the training loss for the three pretrained models over the course of pretraining.
  • Figure 2: Top-5 validation accuracy on the few-shot SSV2 (a) and the few-shot Kinetics (b) benchmarks. Model legend: 'HVM-1@448' and 'HVM-1' are the models pretrained with human-like video data at the 448$\times$448 and 224$\times$224 pixel resolution, respectively; 'Kinetics' denotes the model pretrained on the Kinetics-700 training data; 'Scratch' refers to a model trained on the downstream task only (with no pretraining). Dashed horizontal lines indicate the chance level.
  • Figure 3: t-SNE embeddings of the videos from the SSV2 validation set. From left to right, the embeddings were obtained from the pretrained HVM-1@448 model without any finetuning (0-shot), from the pretrained HVM-1@448 model finetuned on the 10-shot SSV2 task (10-shot), and from the pretrained HVM-1@448 model finetuned on the 50-shot SSV2 task (50-shot). Videos belonging to 10 developmentally realistic action categories (listed in the legend) are highlighted with different colors. Gray dots represent the videos belonging to other categories. Numbers in parentheses in the legend represent the top-5 accuracy for the corresponding categories in the 50-shot condition.
  • Figure 4: HVM-1 models ($\square$ and $\times$) outperform the scaling trends estimated from a subset of SAYCam in Orhan (2024b), represented by the solid dots ($\bullet$) and the corresponding log-linear fits. The shaded regions indicate the 95% confidence intervals around the log-linear fits. Results are shown for the SSV2 (a) and Kinetics-700 (b) benchmarks and for both 10-shot (red) and 50-shot (blue) conditions.
  • Figure 5: (a) Top-5 validation accuracy on ImageNet. (b) OOD accuracy on the OOD ImageNet benchmark. Model legend: 'HVM-1@448' and 'HVM-1' are the models pretrained with human-like video data at the 448$\times$448 and 224$\times$224 pixel resolution, respectively; 'Kinetics' denotes the model pretrained on the Kinetics-700 training data; 'Scratch' refers to a model trained on the downstream task only (with no pretraining). Dashed horizontal lines indicate the chance performance. Hatched bars represent the performance of image-based models trained with the image-based MAE algorithm on the same data, at the same spatial resolution, and with the same encoder architecture as the corresponding HVM-1 models. These models thus help us isolate the effect of image-based vs. video-based pretraining on object recognition accuracy by controlling for the other main factors.
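The captions above report top-5 accuracy against a chance level. A minimal sketch of how these two quantities could be computed is given below. The function names and the toy scores are hypothetical, and the class counts in the usage lines (174 for SSV2, 700 for Kinetics-700, 1000 for ImageNet) are the standard sizes of those label sets, assumed here rather than stated in the captions.

```python
def top_k_accuracy(scores, labels, k=5):
    """Fraction of examples whose true label is among the k highest-scoring classes."""
    hits = 0
    for row, label in zip(scores, labels):
        # Indices of the k classes with the largest scores for this example.
        top_k = sorted(range(len(row)), key=lambda c: row[c], reverse=True)[:k]
        hits += label in top_k
    return hits / len(labels)


def top_k_chance(num_classes, k=5):
    """Expected top-k accuracy of a uniformly random guesser."""
    return min(k, num_classes) / num_classes


# Toy example with 6 classes and 2 examples (scores are made up).
toy_scores = [
    [0.1, 0.9, 0.2, 0.3, 0.4, 0.5],
    [0.9, 0.8, 0.7, 0.6, 0.5, 0.0],
]
toy_labels = [1, 5]
print(top_k_accuracy(toy_scores, toy_labels))

# Chance levels under the assumed class counts.
for name, num_classes in [("SSV2", 174), ("Kinetics-700", 700), ("ImageNet", 1000)]:
    print(name, top_k_chance(num_classes))
```

With a large label set the top-5 chance level is small, so even modest few-shot accuracies can sit well above the dashed chance lines.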