Ask2Loc: Learning to Locate Instructional Visual Answers by Asking Questions
Chang Zong, Bin Li, Shoujun Zhou, Jian Wan, Lei Zhang
TL;DR
This work defines In-VAL, a task that simulates multi-turn interactions between users and instructional videos to locate visual answers, and introduces Ask2Loc, a three-module framework (Chatting for intention, Rewriting for completeness, Searching for context) that uses large language models (LLMs) and pre-trained language models (PLMs) to progressively clarify user intent and enrich content. The authors reconstruct three multi-domain, multilingual In-VAL datasets and show that Ask2Loc outperforms traditional End-to-End and Two-Stage VAL approaches, with gains of up to 14.91 in mIoU. Key innovations include interactive question-based clarification, offline rewriting to improve description quality, and context-aware retrieval to address content fragmentation, all integrated through multimodal fusion for final localization. The results indicate that structured interactions and expanded context yield visual answer localization that aligns more closely with user expectations, with potential impact on instructional video platforms and multimodal reasoning benchmarks.
Abstract
Locating specific segments within an instructional video is an efficient way to acquire guiding knowledge. The task of retrieving video segments that contain both verbal explanations and visual demonstrations is known as visual answer localization (VAL). However, users often need multiple interactions with a VAL system before they obtain answers that align with their expectations. During these interactions, humans deepen their understanding of the video content by asking themselves questions, which lets them identify the answer's location accurately. Therefore, we propose a new task, named In-VAL, to simulate the multiple interactions between humans and videos in the process of obtaining visual answers. The In-VAL task requires interactively addressing several semantic gap issues, including 1) the ambiguity of user intent in the input questions, 2) the incompleteness of language in video subtitles, and 3) the fragmentation of content in video segments. To address these issues, we propose Ask2Loc, a framework for resolving In-VAL by asking questions. It includes three key modules: 1) a chatting module to refine initial questions and uncover clear intentions, 2) a rewriting module to generate fluent language and create complete descriptions, and 3) a searching module to broaden local context and provide integrated content. We conduct extensive experiments on three reconstructed In-VAL datasets. Compared to traditional end-to-end and two-stage methods, our proposed Ask2Loc can improve performance by up to 14.91 (mIoU) on the In-VAL task. Our code and datasets can be accessed at https://github.com/changzong/Ask2Loc.
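
The reported metric, mIoU, is commonly computed as the mean temporal intersection-over-union between predicted and ground-truth answer spans. The following is a minimal sketch of that common definition, not the authors' evaluation code: the function names `temporal_iou` and `mean_iou` and the `(start, end)` span representation in seconds are assumptions made for illustration, and the exact evaluation protocol is not specified in this section.

```python
from typing import List, Tuple

# Hypothetical representation: an answer span as (start_sec, end_sec).
Span = Tuple[float, float]


def temporal_iou(pred: Span, gold: Span) -> float:
    """Intersection-over-union of two time spans, in [0, 1]."""
    # Overlap length is zero when the spans do not intersect.
    inter = max(0.0, min(pred[1], gold[1]) - max(pred[0], gold[0]))
    union = (pred[1] - pred[0]) + (gold[1] - gold[0]) - inter
    return inter / union if union > 0 else 0.0


def mean_iou(preds: List[Span], golds: List[Span]) -> float:
    """Average temporal IoU over paired predictions and ground truths."""
    if len(preds) != len(golds):
        raise ValueError("preds and golds must have the same length")
    if not preds:
        return 0.0
    return sum(temporal_iou(p, g) for p, g in zip(preds, golds)) / len(preds)


if __name__ == "__main__":
    # One perfect prediction and one complete miss.
    predictions = [(0.0, 10.0), (0.0, 10.0)]
    references = [(0.0, 10.0), (20.0, 30.0)]
    print(mean_iou(predictions, references))
```

This sketch returns a value in [0, 1]. If scores are reported on a 0-100 scale, which the figure of 14.91 suggests but this section does not state, the result would be multiplied by 100.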
