Edit As You Wish: Video Caption Editing with Multi-grained User Control

Linli Yao, Yuanmeng Zhang, Ziheng Wang, Xinglin Hou, Tiezheng Ge, Yuning Jiang, Xu Sun, Qin Jin

TL;DR

This work introduces Video Caption Editing (VCE), a task to iteratively revise a video caption conditioned on multi-grained user commands, addressing the limitations of single-grained, one-shot controllable captioning. It models user intent with a triplet command {operation, position, attribute} and presents two benchmarks, VATEX-EDIT and EMMAD-EDIT, to cover open-domain and e-commerce contexts. A small specialist model, OPA, is proposed to translate commands into textual sequences guiding caption edits, and the authors compare it against two large multimodal models (ImgLLM and VidLLM) to analyze performance, data domain effects, and efficiency. The evaluation suite jointly measures fluency, controllability, and text-video alignment, demonstrating that VCE is capable of fine-grained, multi-round edits and highlighting practical implications for scalable, personalized video description editing in real-world settings.

Abstract

Automatically narrating videos in natural language in compliance with user requests, i.e., the Controllable Video Captioning task, can help people manage massive video collections according to their intentions. However, existing works suffer from two shortcomings: 1) the control signal is single-grained, which cannot satisfy diverse user intentions; 2) the video description is generated in a single round and cannot be further edited to meet dynamic needs. In this paper, we propose a novel Video Caption Editing (VCE) task to automatically revise an existing video description guided by multi-grained user requests. Inspired by human writing-revision habits, we design the user command as a pivotal triplet {operation, position, attribute} to cover diverse user needs from coarse-grained to fine-grained. To facilitate the VCE task, we automatically construct an open-domain benchmark dataset named VATEX-EDIT and manually collect an e-commerce dataset called EMMAD-EDIT. We further propose a specialized small-scale model (i.e., OPA) and compare it with two generalist Large Multi-modal Models to perform an exhaustive analysis of the novel task. For evaluation, we adopt comprehensive metrics considering caption fluency, command-caption consistency, and video-caption alignment. Experiments reveal the task's challenges in fine-grained multi-modal semantic understanding and processing. Our datasets, code, and evaluation tools are available at https://github.com/yaolinli/VCE.
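To make the triplet command concrete, the following is a minimal sketch of how an {operation, position, attribute} command might be represented and flattened into a text sequence. It is an illustration only, not the authors' OPA implementation: the class name `EditCommand`, the helper `command_to_text`, the operation values "add" and "delete", the use of a word index for position, and the output string format are all assumptions made for this example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class EditCommand:
    """Hypothetical container for a VCE-style triplet command."""
    operation: str                   # assumed values, e.g. "add" or "delete"
    position: Optional[int] = None   # assumed: word index in the caption; None = unspecified
    attribute: Optional[str] = None  # content to add or remove; None = unspecified


def command_to_text(cmd: EditCommand) -> str:
    """Flatten a triplet into one text sequence (format is an assumption).

    Unspecified fields are written as 'any', so a coarse-grained command
    (operation only) and a fine-grained one (all three fields) share a format.
    """
    position = "any" if cmd.position is None else str(cmd.position)
    attribute = "any" if cmd.attribute is None else cmd.attribute
    return f"operation: {cmd.operation} ; position: {position} ; attribute: {attribute}"


# Fine-grained command: all three fields are given.
fine = EditCommand(operation="add", position=3, attribute="red")
# Coarse-grained command: only the operation is given.
coarse = EditCommand(operation="delete")

print(command_to_text(fine))
print(command_to_text(coarse))
```

Leaving `position` and `attribute` optional is one simple way to model the coarse-to-fine range of control described in the abstract; the paper itself may encode granularity differently.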

Paper Structure

This paper contains 39 sections, 3 equations, 19 figures, 10 tables.

Figures (19)

  • Figure 1: Comparison of our proposed Video Caption Editing (VCE) task with conventional video captioning and controllable video captioning.
  • Figure 2: The triplet control designed in the VCE task can pivot two prevalent interaction signals including natural language (Scenario A) and editing trajectories (Scenario B).
  • Figure 3: Annotated data instances of the VCE task.
  • Figure 4: Attribute statistics on VATEX-EDIT.
  • Figure 5: Caption length distributions on EMMAD-EDIT.
  • ...and 14 more figures