
Computer Vision: January 2026 Week 4

Jan 22 – Jan 28, 2026 · 105 papers analyzed · 3 breakthroughs

Summary

Week 4 (Jan 22-28): 3 breakthroughs from 105 papers. (1) 2601.17761 (AR-Omni) unifies text/image/speech generation in a single AR model with a joint discrete vocabulary; (2) 2601.16163 (Cosmos Policy) repurposes video diffusion as a robot policy backbone via latent-frame injection; (3) 2601.15500 reveals a deep connection between Rectified Flow, DDPM, and stochastic localization. Unified multimodal models and video-as-infrastructure are the week's themes.

Key Takeaway

Video generation models are becoming foundational infrastructure, and autoregressive generation and flow matching are converging theoretically.

Breakthroughs (3)

1. AR-Omni: A Unified Autoregressive Model for Any-to-Any Generation

Why Novel: First unified AR model to generate across text, image, and speech with a single Transformer decoder over a joint discrete vocabulary, using interleaved multimodal tokens and a task-aware loss.

Key Innovations:

  • Joint discrete vocabulary across text, image, speech
  • Interleaved token generation for any-to-any
  • Task-aware loss weighting + perceptual loss for images

Evidence:

  • Joint vocabulary and interleaving design
  • Task-aware loss formulation
  • Cross-modal generation benchmarks

Impact: Shows that, with architectural care, a single AR model can unify modalities despite the limitations identified in Week 3.
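
As a concreteness aid, here is a minimal sketch of the joint-vocabulary and task-aware-loss ideas, in Python. The vocabulary sizes, id offsets, modality weights, and function names are illustrative assumptions, not details from the paper.

import torch
import torch.nn.functional as F

# Hypothetical per-modality vocabulary sizes. Each modality occupies a
# disjoint id range inside one shared vocabulary, so a single Transformer
# decoder can emit a token of any modality at any position.
TEXT_VOCAB, IMAGE_VOCAB, SPEECH_VOCAB = 32_000, 8_192, 4_096
TEXT_OFF = 0
IMAGE_OFF = TEXT_OFF + TEXT_VOCAB
SPEECH_OFF = IMAGE_OFF + IMAGE_VOCAB
JOINT_VOCAB = SPEECH_OFF + SPEECH_VOCAB

def to_joint(local_ids: torch.Tensor, offset: int) -> torch.Tensor:
    """Map modality-local token ids into the joint id space."""
    return local_ids + offset

def task_aware_loss(logits: torch.Tensor, targets: torch.Tensor,
                    modality_ids: torch.Tensor,
                    weights=(1.0, 2.0, 1.5)) -> torch.Tensor:
    """Cross-entropy weighted per modality (weight values are made up).

    logits: (B, L, JOINT_VOCAB); targets, modality_ids: (B, L), where
    modality_ids holds 0=text, 1=image, 2=speech for each position.
    """
    ce = F.cross_entropy(logits.view(-1, logits.size(-1)),
                         targets.view(-1), reduction="none")
    w = torch.tensor(weights, device=ce.device)[modality_ids.view(-1)]
    return (w * ce).mean()

An interleaved any-to-any sequence is then just text, image, and speech ids mapped through their offsets and concatenated in generation order.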

2. Cosmos Policy: Fine-Tuning Video Models for Visuomotor Control and Planning

Why Novel: Repurposes a pretrained video diffusion model into a unified robot policy, world model, and value function via single-stage latent-frame injection.

Key Innovations:

  • Latent-frame injection embeds actions, futures, values into video latent space
  • Joint training of policy + world model + value function
  • Leverages a 19B-parameter video model as the robot backbone

Evidence:

  • Latent-frame injection architecture
  • Joint training objective
  • Manipulation benchmark results

Impact: Demonstrates video foundation models can serve as unified robot learning backbone.
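
A toy sketch of latent-frame injection, assuming the video latent has shape (B, T, C, H, W). The projection layers and the packing of actions and values into pseudo-frames are assumptions for illustration; the paper's actual interfaces may differ.

import torch
import torch.nn as nn

class LatentFrameInjection(nn.Module):
    """Pack an action chunk and a value estimate into extra 'frames' in
    the same latent space as video, so one diffusion backbone can jointly
    model futures (world model), actions (policy), and values."""

    def __init__(self, action_dim: int, c: int, h: int, w: int):
        super().__init__()
        self.shape = (c, h, w)
        self.action_proj = nn.Linear(action_dim, c * h * w)
        self.value_proj = nn.Linear(1, c * h * w)

    def forward(self,
                video_latents: torch.Tensor,   # (B, T, C, H, W)
                actions: torch.Tensor,         # (B, action_dim)
                values: torch.Tensor,          # (B, 1)
                ) -> torch.Tensor:
        b = video_latents.shape[0]
        a = self.action_proj(actions).view(b, 1, *self.shape)
        v = self.value_proj(values).view(b, 1, *self.shape)
        # Concatenate along the time axis; the denoiser then treats the
        # action and value slots like frames to predict or condition on.
        return torch.cat([video_latents, a, v], dim=1)

Because everything lives in one latent video, training can remain a single-stage denoising objective over the augmented sequence, which is presumably what enables the joint policy/world-model/value training.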

3. Low-Dimensional Adaptation of Rectified Flow: A New Perspective through the Lens of Diffusion and Stochastic Localization

Why Novel: Reveals a deep theoretical connection between Rectified Flow, DDPM, and stochastic localization, and introduces a stochastic RF variant (Stoc-RF) that adapts to the data's intrinsic dimensionality.

Key Innovations:

  • Proves RF, DDPM, and stochastic localization are deeply connected
  • U-shaped nonuniform time discretization exploits low-dim structure
  • Stoc-RF adapts to intrinsic dimensionality of data

Evidence:

  • Connection theorem between RF/DDPM/stochastic localization
  • U-shaped discretization schedule
  • FID improvements with low-dim adaptation

Impact: Unifies theory across generative paradigms; practical speedup from dimensionality awareness.
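
Two ingredients are easy to sketch. The straight-line interpolant below is Rectified Flow's standard formulation, and the DDPM marginal in the comment is the textbook variance-preserving one; the cosine-based U-shaped schedule is an illustrative choice, not the paper's exact discretization.

import numpy as np

def rf_interpolant(x0: np.ndarray, x1: np.ndarray, t: float) -> np.ndarray:
    """Rectified Flow's straight-line path x_t = (1 - t) x0 + t x1 with
    constant velocity v = x1 - x0 (x0 = data, x1 = noise, one common
    convention). DDPM's forward marginal,
        x_t = sqrt(abar_t) x0 + sqrt(1 - abar_t) eps,
    is likewise an affine interpolation between data and noise, which is
    the kind of correspondence the paper's connection theorem formalizes.
    """
    return (1.0 - t) * x0 + t * x1

def u_shaped_times(n_steps: int) -> np.ndarray:
    """Nonuniform time grid with points concentrated near t = 0 and
    t = 1, where low-dimensional data structure makes sampling most
    sensitive; midway steps can be taken coarsely."""
    u = np.linspace(0.0, 1.0, n_steps + 1)
    return 0.5 * (1.0 - np.cos(np.pi * u))  # dense at both endpoints

np.diff(u_shaped_times(n)) makes the U shape visible: step sizes peak mid-trajectory and shrink toward both endpoints.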

Trends

  • Unified multimodal AR models emerging (AR-Omni) despite AR limitations

  • Video diffusion becoming infrastructure for other domains (robotics)

  • Flow matching theory maturing — connections to DDPM and stochastic localization

  • 3DGS reaching edge deployment (mobile, streaming, 1-minute reconstruction)

Notable Papers (5)

1. Memory-V2V: Augmenting Video-to-Video Diffusion with Memory

External cache + retrieval for cross-turn video editing consistency.
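
A minimal sketch of the external-cache-plus-retrieval pattern, assuming cosine similarity over per-turn embeddings. The class name, keying scheme, and top-k retrieval rule are hypothetical, not the paper's design.

import numpy as np

class EditMemory:
    """Cache per-turn edit state and retrieve the most similar past
    entries to condition the next edit, for cross-turn consistency."""

    def __init__(self):
        self.keys: list[np.ndarray] = []    # unit-norm turn embeddings
        self.values: list[np.ndarray] = []  # cached latents / features

    def write(self, key_emb: np.ndarray, latent: np.ndarray) -> None:
        self.keys.append(key_emb / np.linalg.norm(key_emb))
        self.values.append(latent)

    def retrieve(self, query_emb: np.ndarray, top_k: int = 2):
        if not self.keys:
            return []
        q = query_emb / np.linalg.norm(query_emb)
        sims = np.stack(self.keys) @ q       # cosine similarities
        top = np.argsort(-sims)[:top_k]
        return [self.values[i] for i in top]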

2. PocketGS: On-Device Training of 3D Gaussian Splatting

3DGS training on commodity mobile devices via co-designed operators.

3. CamPilot: Camera Control in Video Diffusion via Reward Feedback

RL-based camera control learning for video diffusion.
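
A hypothetical policy-gradient step for reward-feedback camera control. sample_with_logprob, reward_model, and control_net are stand-in interfaces, and the paper may use a different RL objective entirely.

import torch

def reward_feedback_step(diffusion, control_net, reward_model,
                         prompts, cam_trajs, optimizer):
    # Generate videos conditioned on target camera trajectories while
    # tracking log-probabilities of the control module's stochastic
    # choices (sample_with_logprob is a made-up interface).
    videos, logp = diffusion.sample_with_logprob(prompts, control_net,
                                                 cam_trajs)
    # Score how faithfully the generated camera motion follows the
    # requested trajectory.
    rewards = reward_model(videos, cam_trajs)
    loss = -(rewards.detach() * logp).mean()  # REINFORCE-style update
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return rewards.mean().item()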

4. UniMGS: Unifying Mesh and 3D Gaussian Splatting

Single-pass rasterization bridging mesh and 3DGS representations.

5. LoD-Structured 3DGS for Streaming Video Reconstruction

Level-of-detail 3DGS for efficient video streaming.

Honorable Mentions

  • Fast Converging 3DGS for 1-Minute Reconstruction
  • LGDWT-GS: Wavelet-Regularized 3DGS for Sparse-View Reconstruction