HiMu: Hierarchical Multimodal Frame Selection for Long Video Question Answering

Dan Ben-Ami, Gabriele Serussi, Kobi Cohen, Chaim Baskin

Abstract

Long-form video question answering requires reasoning over extended temporal contexts, making frame selection critical for large vision-language models (LVLMs) bound by finite context windows. Existing methods face a sharp trade-off: similarity-based selectors are fast but collapse compositional queries into a single dense vector, losing sub-event ordering and cross-modal bindings; agent-based methods recover this structure through iterative LVLM inference, but at prohibitive cost. We introduce HiMu, a training-free framework that bridges this gap. A single text-only LLM call decomposes the query into a hierarchical logic tree whose leaves are atomic predicates, each routed to a lightweight expert spanning vision (CLIP, open-vocabulary detection, OCR) and audio (ASR, CLAP). The resulting signals are normalized, temporally smoothed to align different modalities, and composed bottom-up through fuzzy-logic operators that enforce temporal sequencing and adjacency, producing a continuous satisfaction curve. Evaluations on Video-MME, LongVideoBench and HERBench-Lite show that HiMu advances the efficiency-accuracy Pareto front: at 16 frames with Qwen3-VL 8B it outperforms all competing selectors, and with GPT-4o it surpasses agentic systems operating at 32-512 frames while requiring roughly 10x fewer FLOPs.

Paper Structure (43 sections, 5 equations, 7 figures, 9 tables, 1 algorithm)

Figures (7)

  • Figure 1: Three paradigms for query-aware frame selection. Left: Similarity methods use global embeddings, offering low latency but lacking compositional reasoning. Right: Agentic methods achieve deep understanding via iterative LVLM calls at prohibitive cost. Center: HiMu (ours) decomposes queries into logic trees evaluated by lightweight experts, achieving multimodal compositional selection at single-shot speed.
  • Figure 2: Accuracy vs. computational cost of frame selection methods. Top-reported accuracy on Video-MME versus total pipeline FLOPs (TFLOPs, log scale) for a 10-minute video on 8$\times$A100 GPUs. Similarity-based selectors (blue) are computationally efficient but plateau at low accuracy; agentic and iterative detector-based methods (red region) reach higher accuracy at a substantially greater cost; and HiMu (ours, green star) breaks the prior Pareto front.
  • Figure 3: The HiMu pipeline. (1) An LLM parses the question into a logic tree of modality-specific experts. (2) Experts (CLIP, ASR, OVD, CLAP) extract raw signals, which are then normalized and smoothed. (3) Fuzzy logic operators compose signals into a temporal satisfaction curve. (4) Top frames are sampled for the LVLM using PASS.
  • Figure 4: Logic tree examples on VideoMME questions, combining visual and audio experts.
  • Figure 5: HiMu pipeline overview. Given a natural-language query, an LLM decomposes it into a hierarchical logic tree whose leaves are routed to modality-specific experts (OCR, CLIP, ASR, OVD, CLAP). Each expert produces a per-frame relevance signal over time; these signals are composed bottom-up via fuzzy-logic operators into a satisfaction curve $T(t)$. The top-scoring frames are selected and passed to the LVLM for answering.
  • ...and 2 more figures
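
The pipeline described in the Figure 3 and Figure 5 captions (normalize each expert signal, smooth it, then compose signals bottom-up with fuzzy-logic operators into a satisfaction curve) can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration and is not taken from the paper: min-max normalization, a moving-average smoother, min/max as the fuzzy AND/OR, a hypothetical `fuzzy_then` sequencing operator, and plain top-k selection in place of the PASS sampler, whose details are not given in this excerpt.

```python
import numpy as np

def normalize(signal):
    """Min-max scale a raw per-frame expert signal to [0, 1] (assumed scheme)."""
    signal = np.asarray(signal, dtype=float)
    span = signal.max() - signal.min()
    if span == 0:
        return np.zeros_like(signal)
    return (signal - signal.min()) / span

def smooth(signal, width=1):
    """Moving average over 2*width+1 frames with edge padding (assumed scheme)."""
    signal = np.asarray(signal, dtype=float)
    kernel = np.ones(2 * width + 1) / (2 * width + 1)
    padded = np.pad(signal, width, mode="edge")
    return np.convolve(padded, kernel, mode="valid")

def fuzzy_and(a, b):
    """Fuzzy conjunction as a pointwise minimum (one common choice)."""
    return np.minimum(a, b)

def fuzzy_or(a, b):
    """Fuzzy disjunction as a pointwise maximum (one common choice)."""
    return np.maximum(a, b)

def fuzzy_then(a, b):
    """Hypothetical sequencing operator: b holds at frame t and a held
    at some strictly earlier frame."""
    a = np.asarray(a, dtype=float)
    earlier = np.concatenate(([0.0], np.maximum.accumulate(a)[:-1]))
    return np.minimum(earlier, b)

def top_k_frames(curve, k):
    """Indices of the k highest-scoring frames, returned in temporal order."""
    order = np.argsort(-np.asarray(curve, dtype=float), kind="stable")[:k]
    return sorted(order.tolist())
```

As a usage example under the same assumptions, a query such as "what is shown after the speaker mentions X" could be scored as `fuzzy_then(asr_signal, clip_signal)`, with `top_k_frames` picking the frames to pass to the LVLM. The sequencing operator suppresses frames where the visual predicate fires before the audio predicate has occurred, which a single similarity score cannot express.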