Summer-22B: A Systematic Approach to Dataset Engineering and Training at Scale for Video Foundation Model

Simo Ryu; Chunghwan Han

TL;DR

This report documents the engineering challenges, design decisions, and lessons learned while scaling from raw footage collection to a functional video foundation model trained on approximately 50 million clips.

Abstract

We describe our experience training Summer-22B, a video foundation model developed from scratch. This report documents the engineering challenges, design decisions, and lessons learned while scaling from raw footage collection to a functional model trained on approximately 50 million clips. We outline our approach combining metadata-driven dataset curation, multi-stage filtering, $\mu$P parameterization, and hypersphere-constrained optimization. We developed the Lavender Data system for dataset management and adopted inference-aware architectural choices. We share observations on what worked in our setting: dataset engineering consumed the majority of effort, architectural variants showed smaller differences than we expected, and $\mu$P hyperparameter transfer appeared effective even under geometric constraints. We hope this account proves useful to others undertaking similar projects.

Paper Structure (38 sections, 7 equations, 16 figures, 5 tables, 1 algorithm)

Introduction
Related Work
Video Foundation Models
Dataset Engineering and Curation
Scaling and Parameterization
Training Dynamics and Optimization
Dataset Engineering
Scale Requirements and Design Philosophy
Metadata-Driven Collection Strategy
Video Segmentation and Shot Boundary Detection
Engineering Challenges at Scale
Scaling with Ray
Multi-Stage Filtering Pipeline
Visual Filters
Motion Filters
...and 23 more sections

Figures (16)

Figure 1: Benchmark results of video decoding libraries across resolutions and codecs (720p H.264, 1080p H.264, 1080p VP9, 2160p VP9). PyNvVideoCodec consistently achieves the highest throughput and was selected as the default decoder in our pipeline.
Figure 2: Within-node Ray actor design: network I/O overlapped with processing; CPU and GPU jobs overlapped via threads inside the actor.
Figure 3: Cluster scaling via a task distributor front-end dispatching work to many Ray actors across nodes.
Figure 4: Complete video preprocessing pipeline showing the flow from raw footage through segmentation, filtering, and encoding stages. Multiple parallel filters operate on different quality dimensions to ensure only high-quality clips proceed to training.
Figure 5: Performance profile of the fine-tuned Qwen 2.5 VL captioning model showing inference throughput across different batch sizes and model configurations.
...and 11 more figures
