Summer-22B: A Systematic Approach to Dataset Engineering and Training at Scale for Video Foundation Model

Simo Ryu, Chunghwan Han

TL;DR

This report documents the engineering challenges, design decisions, and lessons learned while scaling from raw footage collection to a functional video foundation model trained on approximately 50 million clips.

Abstract

We describe our experience training Summer-22B, a video foundation model developed from scratch. This report documents the engineering challenges, design decisions, and lessons learned while scaling from raw footage collection to a functional model trained on approximately 50 million clips. We outline our approach combining metadata-driven dataset curation, multi-stage filtering, $\mu$P parameterization, and hypersphere-constrained optimization. We developed the Lavender Data system for dataset management and adopted inference-aware architectural choices. We share observations on what worked in our setting: dataset engineering consumed the majority of effort, architectural variants showed smaller differences than we expected, and $\mu$P hyperparameter transfer appeared effective even under geometric constraints. We hope this account proves useful to others undertaking similar projects.

Paper Structure (38 sections, 7 equations, 16 figures, 5 tables, 1 algorithm)

Figures (16)

  • Figure 1: Benchmark results of video decoding libraries across resolutions and codecs (720p H.264, 1080p H.264, 1080p VP9, 2160p VP9). PyNvVideoCodec consistently achieves the highest throughput and was selected as the default decoder in our pipeline.
  • Figure 2: Within-node Ray actor design: network I/O overlapped with processing; CPU and GPU jobs overlapped via threads inside the actor.
  • Figure 3: Cluster scaling via a task distributor front-end dispatching work to many Ray actors across nodes.
  • Figure 4: Complete video preprocessing pipeline showing the flow from raw footage through segmentation, filtering, and encoding stages. Multiple parallel filters operate on different quality dimensions to ensure only high-quality clips proceed to training.
  • Figure 5: Performance profile of the fine-tuned Qwen 2.5 VL captioning model showing inference throughput across different batch sizes and model configurations.
  • ...and 11 more figures