
VLMaxxing through FrameMogging: Training-Free Anti-Recomputation for Video Vision-Language Models

JF Bastien, Sam D'Amico

Abstract

Video vision-language models (VLMs) keep paying for visual state the stream already told us was stable. The factory wall did not move, but most VLM pipelines still hand the model dense RGB frames or a freshly recomputed prefix. We study that waste as training-free anti-recomputation: reuse state when validation says it survives, and buy fresh evidence when the scene, query, or cache topology requires it. The largest measured win is after ingest. On frozen Qwen2.5-VL-7B-Instruct-4bit, adaptive same-video follow-up reuse preserves paired choices and correctness on a 93-query VideoMME breadth setting while reducing follow-up latency by 14.90-35.92x. The first query is still cold; the win starts when later questions reuse the same video state. Stress tests bound the result: repeated-question schedules hold through 50 turns, while dense-answer-anchored prompt variation separates conservative fixed K=1 repair from faster aggressive policies that drift. Fresh-video pruning is smaller but real. C-VISION skips timed vision-tower work before the first answer is generated. On Gemma 4-E4B-4bit, the clean 32f short cell reaches 1.316x first-query speedup with no paired drift or parse failures on 20 items; Qwen shows the fidelity/speed boundary. Stage-share ceiling (C-CEILING) is the accounting guardrail: a component speedup becomes an end-to-end speedup only in proportion to the wall-clock share it accelerates, so C-VISION and after-ingest follow-up reuse do not multiply. Candidate C-STREAM remains a native-rate target, not a headline result here. The broader direction is VLM-native media that expose change, motion, uncertainty, object state, sensor time, and active tiles directly, so models do not have to rediscover the world from dense RGB every frame.


Paper Structure

This paper contains 40 sections, 3 equations, 6 figures, 22 tables.

Figures (6)

  • Figure 1: Graphical overview. Video state is reused when validation gates allow it and refreshed when the scene, query, or cache topology requires it. C-PERSIST, C-CEILING, and C-VISION are the main regimes; candidate C-STREAM remains a state-update target.
  • Figure 2: Unrepaired persistent-KV frame-scaling envelope. The figure separates the raw warm-reuse probe from the repaired C-PERSIST breadth result in Table \ref{tab:c-persist-repair}: 7B stays inside the tested raw envelope through 16 frames / 6.5k prefill tokens, while 3B trades paired fidelity for a wider pre-basin plateau through 36 frames / 14.5k prefill tokens.
  • Figure 3: Adaptive C-PERSIST timing attribution. Fixed $K=1$ buys the newest-frame tail again at the second follow-up; adaptive repair mostly appends text from the repaired cache. The visual shows the short-slice timing mechanism behind the 9.50$\times$ paired second-follow-up speedup; the broader 0/93 fidelity result is in Table \ref{tab:c-persist-repair}.
  • Figure 4: C-CEILING share-model validation. Predicted speedup uses dense vision share and pruned-run vision reduction, then compares with independently observed end-to-end speedup. The n=60 composition audit and low-share Qwen point show denominator binding; measured-sparse markers add Qwen timing-validating fidelity failures plus the low-gain 16f boundary. The clean measured sparse-execution headline is Gemma 32f short in Table \ref{tab:headline-results}.
  • Figure 5: Qwen base-policy routing frontier under dense-backend substitution.
  • ...and 1 more figure
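The C-CEILING share model described in Figure 4 is an Amdahl-style accounting rule: only the wall-clock share a component occupies can shrink, so the end-to-end speedup is bounded regardless of how fast the component gets. A minimal sketch, with hypothetical stage shares and speedups (the function name and numbers are illustrative, not measured values from the paper):

```python
def predicted_end_to_end_speedup(vision_share: float, vision_speedup: float) -> float:
    """Stage-share ceiling: accelerating one stage shrinks only its share.

    vision_share   -- fraction of dense end-to-end wall clock spent in the
                      vision stage (0..1), measured on the dense baseline.
    vision_speedup -- component speedup achieved within that stage.
    """
    # Unaccelerated share stays fixed; accelerated share divides by its speedup.
    remaining = (1.0 - vision_share) + vision_share / vision_speedup
    return 1.0 / remaining

# Hypothetical: if the vision tower is 40% of wall clock and runs 3x faster,
# the end-to-end gain is well below 3x, and never exceeds 1/(1-0.40) = 1.67x.
print(round(predicted_end_to_end_speedup(0.40, 3.0), 3))  # → 1.364
```

This is also why the abstract notes that C-VISION and after-ingest follow-up reuse do not multiply: each candidate accelerates a different stage, and each gain is capped by that stage's own share of the wall clock.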