Computer Vision: January 2026 Week 2
Jan 8 – Jan 14, 2026 · 112 papers analyzed · 3 breakthroughs
Summary
Week 2 (Jan 8-14): 3 breakthroughs from 112 papers. (1) 2601.09212 (COOL-SD) provides theoretical grounding for speculative decoding in AR image generation via an annealed relaxation; (2) 2601.09881 (TMD) decouples video diffusion into a semantic backbone plus a recurrent flow head for few-step generation; (3) 2601.05722 shows that video diffusion can generate high-quality 3D characters from a single image. The week's themes are AR inference acceleration and video diffusion control.
Key Takeaway
AR and diffusion paradigms racing on inference speed; 3DGS becoming commodity infrastructure.
Breakthroughs (3)
1. Annealed Relaxation of Speculative Decoding for Faster Autoregressive Image Generation
Why Novel: First theoretical grounding of speculative decoding for AR image generation. Derives near-tight total variation (TV) distance bounds and introduces COOL-SD with annealed relaxation.
Key Innovations:
- Theoretical upper bound on TV distance for speculative decoding
- Annealed relaxation schedule for acceptance probability
- 2-3x speedup on AR image models without quality loss
Evidence:
- TV distance bound for speculative decoding
- COOL-SD algorithm with annealing
- Speedup vs. quality tradeoff results
Impact: Makes AR image generation competitive with diffusion on inference speed; a minimal acceptance-sampling sketch follows below.
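The mechanism can be illustrated with a short sketch: a draft model proposes the next image token, the target model verifies it, and the acceptance test is relaxed by an annealed parameter. The interpolation rule, the cosine-style schedule, and the function names below are assumptions for illustration, not the paper's exact COOL-SD formulation.

```python
# A minimal sketch of relaxed speculative decoding for one AR image token.
# The interpolation rule and the cosine schedule are illustrative assumptions;
# the exact COOL-SD relaxation and annealing are defined in the paper.
import math
import torch

def relaxed_speculative_step(draft_logits, target_logits, draft_token, lam):
    """Accept or reject one drafted token with a relaxed acceptance test.

    lam = 0 recovers exact speculative sampling (zero TV gap);
    lam = 1 always accepts the draft (fastest, largest TV gap).
    """
    p_draft = torch.softmax(draft_logits, dim=-1)
    p_target = torch.softmax(target_logits, dim=-1)
    # Standard speculative-decoding acceptance probability for the drafted token.
    accept = torch.clamp(p_target[draft_token] / p_draft[draft_token], max=1.0)
    # Relaxation: interpolate toward unconditional acceptance (assumed form).
    accept = (1.0 - lam) * accept + lam
    if torch.rand(()) < accept:
        return draft_token
    # On rejection, resample from the residual distribution (standard correction).
    residual = torch.clamp(p_target - p_draft, min=0.0)
    return torch.multinomial(residual / residual.sum(), 1).item()

def annealed_lambda(step, total_steps, lam_max=0.3):
    """Assumed schedule: looser acceptance early in the token sequence, exact near the end."""
    return lam_max * 0.5 * (1.0 + math.cos(math.pi * step / total_steps))
```

One plausible reading of such a schedule is that early tokens (coarse image structure) trade exactness for throughput while later tokens stay close to the target distribution; the TV distance bound is what quantifies that tradeoff.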
2. Transition Matching Distillation for Fast Video Generation
Why Novel: Decouples video diffusion into semantic backbone and recurrent flow head, enabling few-step generation from multi-step teachers.
Key Innovations:
- Two-stage distillation: semantic backbone + recurrent flow head
- Transition matching between teacher and student trajectories
- 4-8 step generation matching 50-step quality
Evidence:
- Decoupled architecture design
- Transition matching loss formulation
- FVD scores vs. step count
Impact: Provides a practical path toward real-time video generation; a minimal distillation sketch follows below.
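A rough sketch of the decoupled student and the matching objective, assuming a plain MSE between the student's one-jump prediction and the teacher's multi-step rollout. The module split follows the description above, but the interfaces and the loss form are illustrative assumptions, not the paper's exact formulation.

```python
# A minimal sketch of transition-matching distillation for a few-step video student.
# The backbone runs once per jump (semantics); the light flow head is applied
# recurrently to realize the transition. Interfaces and loss are assumptions.
import torch
import torch.nn as nn

class FewStepStudent(nn.Module):
    def __init__(self, backbone: nn.Module, flow_head: nn.Module):
        super().__init__()
        self.backbone = backbone    # heavy: semantic features, one call per student step
        self.flow_head = flow_head  # light: applied recurrently to refine the jump

    def transition(self, x_t, t, s, n_inner: int = 4):
        """Predict the jump from noise level t to s with one backbone call."""
        h = self.backbone(x_t, t)
        x = x_t
        for _ in range(n_inner):
            x = x + self.flow_head(x, h, t, s) / n_inner
        return x

def transition_matching_loss(student, teacher_rollout, x_t, t, s):
    """Match the student's single jump against the teacher's multi-step rollout t -> s."""
    with torch.no_grad():
        x_s_teacher = teacher_rollout(x_t, t, s)  # e.g. many solver steps of the teacher
    x_s_student = student.transition(x_t, t, s)
    return torch.mean((x_s_student - x_s_teacher) ** 2)
```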
3. Rotate Your Character: Revisiting Video Diffusion Models for High-Quality 3D Character Generation
Why Novel: Shows video diffusion models can be repurposed for 3D character generation by generating consistent multi-view rotations.
Key Innovations:
- Video diffusion generates 360° rotation of character
- Multi-view consistency from temporal coherence
- Single image to 3D character pipeline
Evidence:
- Pipeline from single image to 3D mesh
- View consistency metrics
Impact: Bridges video generation and 3D reconstruction; a minimal pipeline sketch follows below.
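The pipeline reads as three stages, sketched below with callables standing in for the paper's components (a rotation-conditioned video diffusion model and a multi-view reconstructor); the turntable camera setup is an assumption for illustration.

```python
# A minimal sketch of the single-image-to-3D-character pipeline.
# rotate_video_fn and reconstruct_fn are hypothetical stand-ins for the paper's
# video diffusion model and multi-view reconstructor (e.g. 3DGS or a mesh extractor).
import numpy as np

def turntable_poses(n_views: int, radius: float = 2.0, height: float = 0.0):
    """Camera centers evenly spaced on a circle around the character (assumed setup)."""
    angles = np.linspace(0.0, 2.0 * np.pi, n_views, endpoint=False)
    return [(radius * np.cos(a), height, radius * np.sin(a)) for a in angles]

def image_to_3d_character(image, rotate_video_fn, reconstruct_fn, n_frames: int = 24):
    # 1) Video diffusion turns the single image into a coherent 360-degree turntable clip.
    frames = rotate_video_fn(image, n_frames)
    # 2) Temporal coherence is treated as multi-view consistency, so each frame
    #    can be paired with an assumed turntable camera pose.
    poses = turntable_poses(n_frames)
    # 3) A standard multi-view reconstructor produces the final 3D character.
    return reconstruct_fn(frames, poses)
```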
Trends
AR image generation getting theoretical foundations for acceleration
Video diffusion distillation enabling few-step generation
3DGS continuing to specialize (SLAM, face swap, indoor scenes)
Reward hacking mitigation extending beyond T2I to hybrid reasoning
Notable Papers (5)
1. Thinking-Based Non-Thinking: Solving Reward Hacking in Hybrid Reasoning
Derives a non-thinking token cap from the thinking-mode solution to prevent reward hacking.
2. GaussianSwap: Animatable Video Face Swapping with 3D Gaussian Splatting
Combines 3DGS with video face swapping for temporally consistent results.
3. TIDI-GS: Floater Suppression in 3D Gaussian Splatting
Addresses common floater artifacts in indoor 3DGS reconstruction.
4. Focal Guidance: Controllability from Semantic-Weak Layers in Video Diffusion
Unlocks control from previously ignored layers in video diffusion.
5. FeatureSLAM: Feature-enriched 3D Gaussian Splatting SLAM
Real-time SLAM with semantic features via 3DGS.
Honorable Mentions
- ProFuse: Efficient Cross-View Context Fusion for Open-Vocabulary 3DGS
- GS-DMSR: Dynamic Sensitive Multi-scale Enhancement for 3DGS