
FusionAgent: A Multimodal Agent with Dynamic Model Selection for Human Recognition

Jie Zhu, Xiao Guo, Yiyang Su, Anil Jain, Xiaoming Liu

Abstract

Model fusion is a key strategy for robust recognition in unconstrained scenarios, as different models provide complementary strengths. This is especially important for whole-body human recognition, where biometric cues such as face, gait, and body shape vary across samples and are typically integrated via score-fusion. However, existing score-fusion strategies are usually static, invoking all models for every test sample regardless of sample quality or modality reliability. To overcome these limitations, we propose FusionAgent, a novel agentic framework that leverages a Multimodal Large Language Model (MLLM) to perform dynamic, sample-specific model selection. Each expert model is treated as a tool, and through Reinforcement Fine-Tuning (RFT) with a metric-based reward, the agent learns to adaptively determine the optimal model combination for each test input. To address model score misalignment and embedding heterogeneity, we introduce Anchor-based Confidence Top-k (ACT) score-fusion, which anchors on the most confident model and integrates complementary predictions in a confidence-aware manner. Extensive experiments on multiple whole-body biometric benchmarks demonstrate that FusionAgent significantly outperforms SoTA methods while achieving higher efficiency through fewer model invocations, underscoring the critical role of dynamic, explainable, and robust model fusion in real-world recognition systems. Project page: https://fusionagent.github.io/

Paper Structure

This paper contains 65 sections, 10 equations, 13 figures, 13 tables.

Figures (13)

  • Figure 1: Comparison of score-fusion methods. Top: Rule-based methods apply predefined transformations to fuse all model scores, while learning-based methods infer a fusion model from data but still assume that every model contributes to all test samples. Bottom: our framework leverages an MLLM agent to dynamically select a subset of models, followed by the proposed score-fusion strategy, enabling adaptive and robust integration.
  • Figure 2: Overview of the FusionAgent framework. Recognition models are wrapped as tools that generate score vectors and predicted identities based on gallery features. The MLLM agent receives multimodal biometric inputs, performs multi-turn reasoning-action steps to selectively invoke tools, and integrates their predictions into a final identity decision and fused score vector. The agent is optimized with reinforcement fine-tuning using rule-based rewards, including the proposed metric-based reward.
  • Figure 3: Overview of the ACT score-fusion. Based on tool execution results (i.e., score vectors) and selected model combination, the first selected model serves as the anchor, and ACT produces the final score vector via confidence weighting and top-k filtering.
  • Figure 4: A toy example of the proposed ACT score-fusion. Three models are used with the FR model as the anchor and $k=1$. ACT amplifies the gap between match and non-match scores through confidence-based top-k and anchor weighting, improving verification and open-set search performance.
  • Figure 5: Performance comparison on LTCC in four metrics. FusionAgent consistently outperforms baselines, including the hard selection (i.e., using all models), which highlights the effectiveness of dynamic model selection.
  • ...and 8 more figures
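The ACT score-fusion sketched in Figures 3 and 4 (anchor on the first selected model, confidence-weight the others, keep only the top-k complementary scores) can be illustrated with a minimal sketch. This is an assumed formulation based only on the captions above, not the paper's exact equations: the function name `act_score_fusion`, the additive anchor-plus-top-k combination, and the use of scalar per-model confidences are all illustrative assumptions.

```python
import numpy as np

def act_score_fusion(score_vectors, confidences, k=1):
    """Hedged sketch of Anchor-based Confidence Top-k (ACT) fusion.

    score_vectors: list of (G,) arrays, one per selected model, over G
                   gallery identities; the first entry is the anchor
                   model (e.g., the FR model in the Figure 4 toy example).
    confidences:   per-model confidence weights (assumed scalar here).
    k:             number of complementary (non-anchor) scores kept
                   per gallery identity.
    """
    w = np.asarray(confidences, dtype=float)
    anchor = score_vectors[0]
    # Confidence-weight each non-anchor model's score vector.
    weighted = np.stack(
        [w[i + 1] * s for i, s in enumerate(score_vectors[1:])]
    )
    # Top-k filtering: per gallery identity, keep only the k largest
    # weighted complementary scores and sum them.
    topk = np.sort(weighted, axis=0)[-k:].sum(axis=0)
    # Anchor weighting: the anchor dominates, complementary models refine.
    return w[0] * anchor + topk
```

With three models and k=1 as in the Figure 4 toy example, the fused vector keeps the anchor's ranking while the single most confident complementary score widens the gap between match and non-match entries.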