
IDSelect: A RL-Based Cost-Aware Selection Agent for Video-based Multi-Modal Person Recognition

Yuyang Ji, Yixuan Shen, Kien Nguyen, Lifeng Zhou, Feng Liu

TL;DR

IDSelect is a reinforcement learning-based, cost-aware selector that chooses one pre-trained model per modality for each input sequence to optimize the accuracy-efficiency trade-off. It shows that an input-conditioned selector can discover complementary model choices that surpass fixed ensembles while using substantially fewer resources.

Abstract

Video-based person recognition achieves robust identification by integrating face, body, and gait. However, current systems waste computational resources by processing all modalities with fixed heavyweight ensembles regardless of input complexity. To address this inefficiency, we propose IDSelect, a reinforcement learning-based, cost-aware selector that chooses one pre-trained model per modality for each input sequence to optimize the accuracy-efficiency trade-off. Our key insight is that an input-conditioned selector can discover complementary model choices that surpass fixed ensembles while using substantially fewer resources. IDSelect trains a lightweight agent end-to-end using actor-critic reinforcement learning with budget-aware optimization. The reward balances recognition accuracy against computational cost, while entropy regularization prevents premature convergence. At inference, the policy selects the most probable model per modality and fuses modality-specific similarities for the final score. Extensive experiments on challenging video-based datasets demonstrate IDSelect's superior efficiency: on CCVID, it achieves 95.9% Rank-1 accuracy, a 1.8% improvement over strong baselines, with 92.4% less computation; on MEVID, it reduces computation by 41.3% while maintaining competitive performance.
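
The inference step and the reward described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the reward form (a correctness term minus a penalty for cost above the budget), the weight `lam`, the equal-weight fusion, and all function names are assumptions made for illustration only.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cost_aware_reward(correct, cost, budget, lam=0.5):
    """Hypothetical reward: 1 for a correct match, minus a penalty that
    grows with the fraction by which the cost exceeds the budget."""
    accuracy_term = 1.0 if correct else 0.0
    over_budget = max(0.0, cost - budget) / budget
    return accuracy_term - lam * over_budget

def select_models(logits_per_modality):
    """Greedy inference: pick the most probable model index per modality."""
    choice = {}
    for modality, logits in logits_per_modality.items():
        probs = softmax(logits)
        choice[modality] = max(range(len(probs)), key=probs.__getitem__)
    return choice

def fuse_similarities(similarities, weights=None):
    """Weighted mean of modality-specific similarity scores.
    Equal weights are an assumption; the source does not state the fusion rule."""
    if weights is None:
        weights = {m: 1.0 for m in similarities}
    total = sum(weights[m] for m in similarities)
    return sum(weights[m] * s for m, s in similarities.items()) / total

# Usage: selection logits for one sequence (values are made up).
logits = {"face": [0.2, 1.5], "body": [2.0, 0.1], "gait": [0.0, 0.3, 0.9]}
chosen = select_models(logits)  # {'face': 1, 'body': 0, 'gait': 2}
score = fuse_similarities({"face": 0.8, "body": 0.6, "gait": 0.4})
```

Under this assumed reward, a selection that stays within the budget is judged on correctness alone, and the penalty grows linearly once the budget is exceeded.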

Paper Structure (16 sections, 7 equations, 4 figures, 4 tables)

Figures (4)

  • Figure 1: Top: Traditional multi-modal person recognition methods (e.g., QME [zhu2025quality]) use fixed model combinations for all inputs, while our IDSelect employs an RL-based cost-aware agent to adaptively select complementary models from diverse pools based on input characteristics. Bottom: Accuracy vs. computational cost on the CCVID dataset [CAL]. Our method achieves superior accuracy (95.9%, $+1.8\%$) with 92.4% fewer FLOPs than QME.
  • Figure 2: IDSelect framework architecture. The selection agent processes input video pairs through a feature encoder and attention pooling to generate modality-specific selection distributions. An actor-critic reinforcement learning policy optimizes model selection from pre-trained pools under a budget constraint. The framework minimizes a multi-objective loss combining classification reward, computational cost, and selection diversity to discover optimal model combinations for multi-modal fusion.
  • Figure 3: Model selection frequency distribution for CCVID configurations. (a) Configuration 1 shows concentrated patterns with IDSelect predominantly selecting adaface_101 + gaitset + ap3d_34 (71.3%) and adaface_101 + gaitbase + ap3d_34 (24.2%). (b) Configuration 2 exhibits more diverse selection patterns, demonstrating adaptability to different model pool characteristics.
  • Figure 4: Adaptive model selection examples on CCVID. The agent uses lighter models for clear inputs and switches to stronger models for challenging or low-quality cases.
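
Two components named in the Figure 2 caption, attention pooling and the actor-critic objective with entropy regularization, can be sketched as follows. This is a generic illustration under stated assumptions, not the paper's formulation: dot-product attention with a single query vector, the one-step advantage `reward - value`, and the coefficients `beta` and `value_coef` are all hypothetical choices.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_pool(frame_features, query):
    """Pool per-frame feature vectors into one vector.
    Each frame is scored by its dot product with `query` (an assumed
    learned vector), and the frames are averaged with softmax weights."""
    scores = [sum(q * f for q, f in zip(query, feat)) for feat in frame_features]
    weights = softmax(scores)
    dim = len(frame_features[0])
    return [sum(w * feat[d] for w, feat in zip(weights, frame_features))
            for d in range(dim)]

def actor_critic_loss(probs, action, reward, value, beta=0.01, value_coef=0.5):
    """Generic one-step actor-critic loss for a single modality.
    probs:  selection distribution over the model pool
    action: index of the sampled model
    value:  critic's estimate of the expected reward
    In an autodiff framework the advantage would be treated as a constant
    in the policy term; here everything is a plain float."""
    advantage = reward - value
    entropy = -sum(p * math.log(p) for p in probs if p > 0.0)
    policy_loss = -math.log(probs[action]) * advantage
    value_loss = advantage ** 2
    # Subtracting the entropy bonus rewards more spread-out selections.
    return policy_loss + value_coef * value_loss - beta * entropy

# Usage: three frames with 2-D features, then a loss for one sampled action.
pooled = attention_pool([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]], query=[0.5, 0.5])
loss = actor_critic_loss(softmax([0.2, 1.5, 0.1]), action=1, reward=1.0, value=0.6)
```

The entropy term is what keeps the selection distribution from collapsing onto a single model early in training, which matches the role the abstract gives to entropy regularization.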