Computer Vision: January 2026 Week 4
Jan 22 – Jan 28, 2026 · 105 papers analyzed · 3 breakthroughs
Summary
Week 4 (Jan 22-28): 3 breakthroughs from 105 papers. (1) 2601.17761 (AR-Omni) unifies text/image/speech generation in single AR model with joint discrete vocabulary; (2) 2601.16163 (Cosmos Policy) repurposes video diffusion as robot policy backbone via latent-frame injection; (3) 2601.15500 reveals deep connection between Rectified Flow, DDPM, and stochastic localization. Unified multimodal models and video-as-infrastructure are the themes.
Key Takeaway
Video generation models becoming foundational infrastructure; AR and flow matching converging theoretically.
Breakthroughs (3)
1. AR-Omni: A Unified Autoregressive Model for Any-to-Any Generation
Why Novel: First unified AR model to generate across text, image, and speech with a single Transformer decoder over a joint discrete vocabulary, using interleaved multimodal tokens and task-aware loss weighting.
Key Innovations:
- Joint discrete vocabulary across text, image, speech
- Interleaved token generation for any-to-any
- Task-aware loss weighting + perceptual loss for images
Evidence:
- Joint vocabulary and interleaving design
- Task-aware loss formulation
- Cross-modal generation benchmarks
Impact: Shows that, with architectural care, a single AR model can unify modalities despite the limitations identified in Week 3 (see the loss-weighting sketch below).
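The interleaving and loss-weighting idea can be sketched in a few lines. The snippet below is a minimal, assumed illustration of a task-aware cross-entropy over a joint vocabulary, reweighted per modality; the modality ids, weights, and function names are hypothetical and not taken from the AR-Omni paper.

```python
import torch
import torch.nn.functional as F

# Hypothetical modality ids for an interleaved token stream (assumed, not from the paper).
TEXT, IMAGE, SPEECH = 0, 1, 2
# Assumed per-modality loss weights; AR-Omni's actual weighting may differ.
MODALITY_WEIGHTS = torch.tensor([1.0, 0.5, 0.5])

def task_aware_ce(logits, targets, modality_ids):
    """Next-token cross-entropy over the joint vocabulary, reweighted per modality.

    logits:       (B, T, V) decoder outputs over the joint discrete vocabulary
    targets:      (B, T)    interleaved text/image/speech token ids
    modality_ids: (B, T)    which modality each target token belongs to
    """
    ce = F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        targets.reshape(-1),
        reduction="none",
    )
    weights = MODALITY_WEIGHTS.to(ce.device)[modality_ids.reshape(-1)]
    # A perceptual loss on decoded image tokens would be added as a separate term.
    return (weights * ce).mean()
```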
2. Cosmos Policy: Fine-Tuning Video Models for Visuomotor Control and Planning
Why Novel: Repurposes pretrained video diffusion model into unified robot policy, world model, and value function via single-stage latent-frame injection.
Key Innovations:
- Latent-frame injection embeds actions, futures, values into video latent space
- Joint training of policy + world model + value function
- Leverages 19B video model as robot backbone
Evidence:
- Latent-frame injection architecture
- Joint training objective
- Manipulation benchmark results
Impact: Demonstrates video foundation models can serve as a unified robot learning backbone (see the injection sketch below).
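As a rough illustration of latent-frame injection, the sketch below appends action and value embeddings as extra pseudo-frames to the video latent sequence consumed by the video diffusion backbone. Class names, argument names, and shapes are assumptions for illustration, not the Cosmos Policy interface.

```python
import torch
import torch.nn as nn

class LatentFrameInjector(nn.Module):
    """Assumed sketch: embed actions and a value slot as extra latent frames."""

    def __init__(self, latent_dim, action_dim):
        super().__init__()
        self.action_proj = nn.Linear(action_dim, latent_dim)            # action chunk -> pseudo-frame
        self.value_token = nn.Parameter(torch.zeros(1, 1, latent_dim))  # learned value slot

    def forward(self, video_latents, action_chunk):
        """video_latents: (B, T, D) latent frames from the video tokenizer/VAE
        action_chunk:  (B, A)    flattened future-action chunk to inject
        """
        b = video_latents.size(0)
        action_frame = self.action_proj(action_chunk).unsqueeze(1)  # (B, 1, D)
        value_frame = self.value_token.expand(b, -1, -1)            # (B, 1, D)
        # The diffusion backbone then denoises this extended sequence jointly, so
        # policy (actions), world model (future frames), and value share one model.
        return torch.cat([video_latents, action_frame, value_frame], dim=1)
```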
3. Low-Dimensional Adaptation of Rectified Flow: A New Perspective through the Lens of Diffusion and Stochastic Localization
Why Novel: Reveals deep theoretical connection between Rectified Flow, DDPM, and stochastic localization. Introduces stochastic RF that adapts to intrinsic dimensionality.
Key Innovations:
- Proves RF, DDPM, and stochastic localization are deeply connected
- U-shaped nonuniform time discretization exploits low-dim structure
- Stoc-RF adapts to intrinsic dimensionality of data
Evidence:
- Connection theorem between RF/DDPM/stochastic localization
- U-shaped discretization schedule
- FID improvements with low-dim adaptation
Impact: Unifies theory across generative paradigms; practical speedup from dimensionality awareness (see the schedule sketch below).
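A U-shaped time discretization can be illustrated with a simple end-concentrated grid. The cosine warp below is only one assumed way to cluster steps near t=0 and t=1; it is an illustrative stand-in, not the schedule derived in the paper.

```python
import numpy as np

def u_shaped_schedule(num_steps: int) -> np.ndarray:
    """Nonuniform grid on [0, 1] that is dense at both ends and coarse in the middle."""
    u = np.linspace(0.0, 1.0, num_steps + 1)
    # dt/du = (pi/2) * sin(pi * u) is small near u=0 and u=1, so the resulting
    # grid points cluster at both ends of the trajectory.
    return 0.5 * (1.0 - np.cos(np.pi * u))

# Example: 10 integration steps for a Rectified Flow ODE solver.
ts = u_shaped_schedule(10)
step_sizes = np.diff(ts)  # smallest steps at the two ends, largest mid-trajectory
```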
Trends
- Unified multimodal AR models emerging (AR-Omni) despite AR limitations
- Video diffusion becoming infrastructure for other domains (robotics)
- Flow matching theory maturing, with connections to DDPM and stochastic localization
- 3DGS reaching edge deployment (mobile, streaming, 1-minute reconstruction)
Notable Papers (5)
1. Memory-V2V: Augmenting Video-to-Video Diffusion with Memory
External cache + retrieval for cross-turn video editing consistency.
2. PocketGS: On-Device Training of 3D Gaussian Splatting
3DGS training on commodity mobile devices via co-designed operators.
3. CamPilot: Camera Control in Video Diffusion via Reward Feedback
RL-based camera control learning for video diffusion.
4. UniMGS: Unifying Mesh and 3D Gaussian Splatting
Single-pass rasterization bridging mesh and 3DGS representations.
5. LoD-Structured 3DGS for Streaming Video Reconstruction
Level-of-detail 3DGS for efficient video streaming.
Honorable Mentions
- Fast Converging 3DGS for 1-Minute Reconstruction
- LGDWT-GS: Wavelet-Regularized 3DGS for Sparse-View Reconstruction