Multimodal Label Relevance Ranking via Reinforcement Learning

Taian Guo, Taolin Zhang, Haoqian Wu, Hanjun Li, Ruizhi Qiao, Xing Sun

TL;DR

This work targets multimodal label relevance ranking, proposing LR2PPO, a three-stage actor-reward-critic framework that learns human-aligned partial orders between labels given multimodal inputs. A novel partial order ratio drives the policy updates, enabling effective transfer from a source to a target domain with limited target annotations. The authors introduce LRMovieNet, a multimodal dataset with relevance orders derived from MovieNet, to evaluate ranking performance, and demonstrate state-of-the-art results on LRMovieNet and transferable gains on traditional LTR benchmarks. The approach directly improves the prioritization of semantically relevant labels, facilitating more accurate scene understanding and downstream decision making in multimodal video analysis.

Abstract

Conventional multi-label recognition methods often focus on label confidence and frequently overlook the pivotal role of partial order relations consistent with human preference. To resolve this issue, we introduce a novel method for multimodal label relevance ranking, named Label Relevance Ranking with Proximal Policy Optimization (LR2PPO), which effectively discerns partial order relations among labels. LR2PPO first uses partial order pairs in the target domain to train a reward model, which aims to capture the human preference intrinsic to the specific scenario. Furthermore, we carefully design a state representation and a policy loss tailored for ranking tasks, enabling LR2PPO to boost the performance of the label relevance ranking model and substantially reduce the partial order annotation required for transferring to new scenes. To assist in the evaluation of our approach and similar methods, we further propose a novel benchmark dataset, LRMovieNet, featuring multimodal labels and their corresponding partial order data. Extensive experiments show that our LR2PPO algorithm achieves state-of-the-art performance, confirming its effectiveness in addressing the multimodal label relevance ranking problem. Code and the proposed LRMovieNet dataset are publicly available at https://github.com/ChazzyGordon/LR2PPO.
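The abstract says the reward model is trained on partial order pairs, but it does not give the loss. As a rough illustration only, the sketch below uses a standard pairwise logistic (Bradley-Terry style) preference loss, which is a common choice for learning from pairs and is an assumption here, not the authors' confirmed objective. The function names and the example scores are hypothetical.

```python
import math


def pairwise_preference_loss(score_preferred: float, score_other: float) -> float:
    """Return -log(sigmoid(score_preferred - score_other)).

    Assumed loss (not taken from the paper): it is small when the label
    that annotators ranked higher also receives the higher score.
    """
    margin = score_preferred - score_other
    # -log(sigmoid(m)) equals log(1 + exp(-m)); branch to avoid overflow.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


def mean_pair_loss(scores: dict, pairs: list) -> float:
    """Average the loss over (more_relevant, less_relevant) label pairs."""
    total = sum(pairwise_preference_loss(scores[hi], scores[lo]) for hi, lo in pairs)
    return total / len(pairs)


if __name__ == "__main__":
    # Hypothetical scores for two labels; the pair says 'Flirting'
    # should rank above 'Man' for this clip.
    example_scores = {"Flirting": 2.0, "Man": 0.5}
    example_pairs = [("Flirting", "Man")]
    print(mean_pair_loss(example_scores, example_pairs))
```

Only the relative order within each pair enters this loss, which is why partial order annotations are sufficient and no absolute relevance value is needed.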

Paper Structure (24 sections, 21 equations, 8 figures, 5 tables, 2 algorithms)

Figures (8)

  • Figure 1: Illustration of the Difference between Label Confidence and Label Relevance. This figure shows an example of movie footage consisting of three consecutive keyframes, together with its scene description. Conventional label confidence tends to emphasize tangible objects, whereas the proposed label relevance better reflects how closely each label relates to the scene it describes. As shown in the top-right histogram, label confidence models tend to assign higher confidence to the label 'Man' because it occurs more frequently in the context. In contrast, the label 'Flirting' is more closely aligned with the primary theme of the movie scene and therefore receives a higher label relevance score.
  • Figure 2: Illustration of the training paradigm of LR2PPO. Each stage takes multimodal data as input but differs in data division and annotation type. In Stage 1, data from the source domain is used to train a label relevance ranking base model (i.e., the Actor). Stage 2 uses preference data to train a Reward model. Finally, in Stage 3, the Critic model interacts with the first two models, and all data without annotations is used to improve the Actor, which is the only model used at inference.
  • Figure 3: NDCG curves during training. (a) PPO with different ratio designs. The original PPO ratio is not applicable to the definitions of state and action in the ranking task and leads to training collapse, while our proposed partial order ratio solves this problem. (b) PPO with different thresholds $\delta$ in $r'_t(\theta)$. A small negative threshold $\delta=-0.1$ stabilizes training, leading to superior performance.
  • Figure 4: Comparison between LR2PPO and other state-of-the-art ranking methods. The red, blue and green labels listed after each method indicate low, medium and high ground-truth relevance, respectively. The value below each label is the corresponding relevance score. Best viewed in color and zoomed in.
  • Figure A: Data statistics of LRMovieNet.
  • ...and 3 more figures
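
Figure 3 tracks NDCG during training, and Figure 4 grades ground-truth relevance as low, medium or high. The sketch below shows one common way to compute NDCG for a ranked label list. It assumes the exponential-gain variant and a 0/1/2 encoding of low/medium/high; the page does not state which variant or encoding the paper uses, so both are assumptions, and the function names are hypothetical.

```python
import math


def dcg(relevances):
    """Discounted cumulative gain with exponential gain (2^rel - 1)."""
    return sum(
        (2 ** rel - 1) / math.log2(rank + 2)  # rank is 0-based, so the top discount is log2(2)
        for rank, rel in enumerate(relevances)
    )


def ndcg(predicted_scores, true_relevances, k=None):
    """NDCG of the ordering induced by predicted_scores (higher ranks first)."""
    order = sorted(
        range(len(predicted_scores)),
        key=lambda i: predicted_scores[i],
        reverse=True,
    )
    ranked = [true_relevances[i] for i in order]
    ideal = sorted(true_relevances, reverse=True)
    if k is not None:
        ranked, ideal = ranked[:k], ideal[:k]
    ideal_dcg = dcg(ideal)
    # With no relevant label at all, NDCG is undefined; return 0.0 by convention.
    return dcg(ranked) / ideal_dcg if ideal_dcg > 0 else 0.0


if __name__ == "__main__":
    # Assumed encoding: low=0, medium=1, high=2.
    truth = [2, 1, 0]
    print(ndcg([0.9, 0.5, 0.1], truth))  # scores agree with the ground-truth order
    print(ndcg([0.1, 0.5, 0.9], truth))  # scores reverse the ground-truth order
```

Because NDCG depends only on the ordering induced by the scores, it measures ranking quality rather than score calibration, which fits the paper's focus on relevance order over label confidence.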

Theorems & Definitions (2)

  • Definition: Label Confidence
  • Definition: Label Relevance