Learning to Refuse: Refusal-Aware Reinforcement Fine-Tuning for Hard-Irrelevant Queries in Video Temporal Grounding

Jin-Seop Lee, SungJoon Lee, SeongJun Jung, Boyang Li, Jee-Hyong Lee

TL;DR

The paper addresses the tendency of VTG models to predict a target segment even for hard-irrelevant queries. It introduces Refusal-Aware Reinforcement Fine-Tuning (RA-RFT), built on Group Relative Policy Optimization (GRPO) with four reward objectives that enable accurate refusals, explanations, and query correction, and complements it with a Hard-Irrelevant VTG (HI-VTG) dataset. Empirical results across multiple VTG scenarios show that RA-RFT improves refusal behavior and explanation quality without sacrificing grounding performance, and that it scales across LVLM-based VTG systems. Together, the HI-VTG dataset and the RA-RFT framework advance robust, interpretable video-language reasoning in the presence of fine-grained semantic mismatches.

Abstract

Video Temporal Grounding (VTG) aims to localize a temporal segment in a video corresponding to a natural language query. However, existing VTG models assume that a relevant segment always exists, causing them to always predict a target segment even when the query is irrelevant to the video. While recent approaches attempt to handle irrelevant queries, they can only reject those that are entirely unrelated to the video and still fail to handle hard-irrelevant queries that are semantically similar but not actually relevant. To address this, we propose Refusal-Aware Reinforcement Fine-Tuning (RA-RFT) to effectively refuse hard-irrelevant queries in VTG. Our method is based on the Group Relative Policy Optimization (GRPO) framework and integrates four reward objectives (format, refuse-IoU, explain, and query correction) to improve both relevance discrimination and fine-grained semantic reasoning. In addition, to effectively support RA-RFT, we construct a Hard-Irrelevant VTG (HI-VTG) dataset, which includes hard-irrelevant queries and their refusal answers. We demonstrate the effectiveness of our method across various relevance-aware VTG scenarios, including hard-irrelevant VTG, simply-shuffled RA-VTG, and human-annotated RA-VTG settings. We also show that the proposed method is scalable by applying it to various LVLM-based VTG models. Our code is available at https://github.com/JINSUBY/RA-RFT.
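
To make the idea of a refusal-aware reward inside a GRPO-style group concrete, the following is a minimal sketch only. The abstract does not define the refuse-IoU reward, so `refuse_iou_reward` below is a hypothetical reading (full reward for correctly refusing, temporal IoU for a correctly grounded segment, zero otherwise), and all function names are illustrative rather than taken from the authors' code. The group-relative step is the standard GRPO normalization of rewards by the group mean and standard deviation.

```python
from statistics import mean, pstdev
from typing import List, Optional, Sequence, Tuple

Segment = Tuple[float, float]  # (start_sec, end_sec); None stands for a refusal


def temporal_iou(pred: Segment, gt: Segment) -> float:
    """Intersection-over-union of two 1-D time intervals."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def refuse_iou_reward(pred: Optional[Segment], gt: Optional[Segment]) -> float:
    """Hypothetical refusal-aware IoU reward (an assumption, not the paper's formula)."""
    if gt is None:                 # query is irrelevant: the right answer is to refuse
        return 1.0 if pred is None else 0.0
    if pred is None:               # query is relevant but the model refused
        return 0.0
    return temporal_iou(pred, gt)  # both are segments: score the localization


def group_relative_advantages(rewards: Sequence[float], eps: float = 1e-6) -> List[float]:
    """GRPO-style advantages: standardize rewards within one sampled group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    # Four sampled responses to one hard-irrelevant query (ground truth: refuse).
    group = [None, (3.0, 9.0), None, (0.0, 4.0)]
    rewards = [refuse_iou_reward(p, None) for p in group]
    print(rewards)
    print(group_relative_advantages(rewards))
```

In this sketch, responses that refuse a hard-irrelevant query receive positive advantages relative to responses in the same group that hallucinate a segment. In the paper the overall reward also includes the format, explain, and query-correction terms, which are not modeled here.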

Paper Structure

This paper contains 34 sections, 7 equations, 9 figures, 16 tables.

Figures (9)

  • Figure 1: Video temporal grounding result with a hard-irrelevant query. Existing VTG models incorrectly predict a segment due to a lack of fine-grained semantic understanding between the video and the query. In contrast, our model correctly refuses the query and explains the semantic mismatch.
  • Figure 2: The overall framework of our contributions. We introduce a Hard-Irrelevant VTG Dataset, which includes hard-irrelevant queries and their refusal answers. Also, we propose a Refusal-Aware Reinforcement Fine-Tuning to effectively refuse hard-irrelevant queries.
  • Figure 3: Overview of the Hard-Irrelevant VTG dataset construction process. (1) We first extract semantic relevance categories from the original query using an LLM-based category extractor. (2) Based on the selected categories and the video description, we then generate a hard-irrelevant query and its corresponding refusal answer, which explains why the query does not match the video.
  • Figure 4: Semantic relevance categories used in HI-VTG. The right column shows the original queries and modified queries according to each category.
  • Figure 5: Semantic Relevance Category definition
  • ...and 4 more figures