Ask2Loc: Learning to Locate Instructional Visual Answers by Asking Questions

Chang Zong, Bin Li, Shoujun Zhou, Jian Wan, Lei Zhang

TL;DR

This work defines In-VAL, a task that simulates multi-turn interactions between users and instructional videos to locate visual answers, and introduces Ask2Loc, a three-module framework (Chatting for intention, Rewriting for completeness, Searching for context) that uses LLMs and PLMs to progressively clarify user intent and enrich content. The authors reconstruct three multi-domain, multilingual In-VAL datasets and show that Ask2Loc outperforms traditional End-to-End and Two-Stage VAL approaches, with gains of up to $14.91$ in mIoU. Key innovations include interactive question-based clarification, offline rewriting to improve description quality, and context-aware retrieval to address content fragmentation, all integrated through multimodal fusion for final localization. The results indicate that incorporating structured interactions and expanded context yields more user-aligned visual answer localization, with potential impact on instructional video platforms and multimodal reasoning benchmarks.

Abstract

Locating specific segments within an instructional video is an efficient way to acquire guiding knowledge. Generally, the task of obtaining video segments that cover both verbal explanations and visual demonstrations is known as visual answer localization (VAL). However, users of a VAL system often need multiple interactions to obtain answers that align with their expectations. During these interactions, humans deepen their understanding of the video content by asking themselves questions, and thereby identify the location accurately. Therefore, we propose a new task, named In-VAL, to simulate the multiple interactions between humans and videos in the process of obtaining visual answers. The In-VAL task requires interactively addressing several semantic gap issues, including 1) the ambiguity of user intent in the input questions, 2) the incompleteness of language in video subtitles, and 3) the fragmentation of content in video segments. To address these issues, we propose Ask2Loc, a framework for resolving In-VAL by asking questions. It includes three key modules: 1) a chatting module to refine initial questions and uncover clear intentions, 2) a rewriting module to generate fluent language and create complete descriptions, and 3) a searching module to broaden local context and provide integrated content. We conduct extensive experiments on three reconstructed In-VAL datasets. Compared to traditional end-to-end and two-stage methods, our proposed Ask2Loc improves performance by up to 14.91 (mIoU) on the In-VAL task. Our code and datasets can be accessed at https://github.com/changzong/Ask2Loc.
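The abstract reports gains in mIoU without defining it. The sketch below shows one common reading, assuming mIoU means the mean temporal intersection-over-union between each predicted span and its ground-truth span, reported on a 0-100 scale. The names `Span`, `temporal_iou`, and `mean_iou` are hypothetical and are not taken from the Ask2Loc repository; the exact evaluation protocol of the paper may differ.

```python
from typing import Sequence, Tuple

# Hypothetical representation: a span is (start_seconds, end_seconds).
Span = Tuple[float, float]


def temporal_iou(pred: Span, gold: Span) -> float:
    """Intersection-over-union of two time intervals."""
    # Overlap length is zero when the intervals do not intersect.
    inter = max(0.0, min(pred[1], gold[1]) - max(pred[0], gold[0]))
    union = (pred[1] - pred[0]) + (gold[1] - gold[0]) - inter
    return inter / union if union > 0 else 0.0


def mean_iou(preds: Sequence[Span], golds: Sequence[Span]) -> float:
    """Mean temporal IoU over paired spans, scaled to 0-100.

    The 0-100 scaling is an assumption made so that a figure such as
    14.91 can be read as a difference in percentage points.
    """
    if len(preds) != len(golds):
        raise ValueError("preds and golds must have the same length")
    if not preds:
        return 0.0
    total = sum(temporal_iou(p, g) for p, g in zip(preds, golds))
    return 100.0 * total / len(preds)


if __name__ == "__main__":
    predictions = [(10.0, 20.0), (0.0, 5.0)]
    references = [(15.0, 25.0), (10.0, 15.0)]
    print(mean_iou(predictions, references))
```

Under this reading, a prediction that overlaps the reference by half of each span scores one third rather than one half, because the union grows as the spans drift apart; this is why mIoU penalizes both early and late boundaries.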

Paper Structure

This paper contains 35 sections, 8 equations, 8 figures, 8 tables.

Figures (8)

  • Figure 1: Existing VAL tasks derive visual answers directly from input questions, neglecting the natural interactions between the user and the video (above). In contrast, we propose an In-VAL task, which gradually acquires domain knowledge and clarifies user intent through multiple interactions to assist in answer localization (below).
  • Figure 2: Unlike existing end-to-end and two-stage localization frameworks (left), our proposed framework (right) better simulates the interactions between humans and video content by leveraging an AI system. By asking multiple questions, the system progressively builds the understanding needed to resolve the semantic issues, which aids localization.
  • Figure 3: Ask2Loc utilizes the extracted subtitles and visual features from a video and learns to detect locations by leveraging language models and asking questions through three interactive modules. The predicted locations further facilitate cutting the video to obtain the final video clip.
  • Figure 4: The detailed interactive and learning process of our proposed Ask2Loc. Three types of interactions are performed between language models and video content: chatting for intention awareness, rewriting for description completeness, and searching for context expansion. The additional knowledge generated, together with the corresponding visual features, is fused to detect the location of each video segment.
  • Figure 5: The impact of key hyperparameters, including the number of dialogue rounds in Chatting (top left), the number of top relevant subtitles in Searching (top right), the number of tuning epochs for relevance in Searching (bottom left), and the visual encoder models (bottom right).
  • ...and 3 more figures