Hybrid, Unified and Iterative: A Novel Framework for Text-based Person Anomaly Retrieval

Tien-Huy Nguyen, Huu-Loc Tran, Huu-Phong Phan-Nguyen, Quang-Vinh Dinh

TL;DR

The paper tackles text-based person anomaly retrieval by introducing a Local-Global Hybrid Perspective (LHP) to fuse fine-grained and global cues, and a Unified Image-Text (UIT) model that jointly optimizes multiple cross-modal losses (MIM, MLM, ITC, ITM). An LHP-guided feature-selection strategy and an iterative ensemble method are proposed to refine predictions and exploit the strengths of different models. Results on the PAB dataset show state-of-the-art recall, with notable gains in R@1 when scaling from 0.1M to 1M training images, and ablations support the contributions of the LHP, UIT, and ensemble components. The approach offers a multi-faceted framework for fine-grained, cross-modal person anomaly retrieval in real-world settings.

Abstract

Text-based person anomaly retrieval has emerged as a challenging task, with most existing approaches relying on complex deep-learning techniques. This raises a research question: how can a model be optimized to capture finer-grained features? To address this, we propose a Local-Global Hybrid Perspective (LHP) module integrated with a Vision-Language Model (VLM), designed to explore the effectiveness of combining fine-grained features with coarse-grained features. Additionally, we investigate a Unified Image-Text (UIT) model that combines multiple objective functions: Image-Text Contrastive (ITC), Image-Text Matching (ITM), Masked Language Modeling (MLM), and Masked Image Modeling (MIM) losses. Beyond this, we propose a novel iterative ensemble strategy that combines model results iteratively rather than simultaneously, as other ensemble methods do. To take advantage of the superior performance of the LHP model, we introduce a novel feature selection algorithm guided by it, which further improves performance. Extensive experiments demonstrate that our method achieves state-of-the-art (SOTA) performance on the PAB dataset, improving on previous work by 9.70% in R@1, 1.77% in R@5, and 1.01% in R@10.
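
The R@1, R@5, and R@10 figures quoted above are Recall@K scores. As a reading aid, the following is a minimal sketch of how Recall@K is commonly computed for text-to-image retrieval; it is not the authors' evaluation code. The function name `recall_at_k`, the similarity-matrix layout, and the assumption of exactly one matching gallery image per text query are illustrative choices, and the scores in the toy example are made up.

```python
from typing import Sequence


def recall_at_k(similarity: Sequence[Sequence[float]],
                ground_truth: Sequence[int],
                k: int) -> float:
    """Fraction of text queries whose matching image is ranked in the top k.

    similarity[i][j] is the score between text query i and gallery image j.
    ground_truth[i] is the index of the single image matching query i
    (one match per query is an assumption of this sketch).
    """
    hits = 0
    for scores, target in zip(similarity, ground_truth):
        # Rank gallery indices by descending similarity to this query.
        ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
        if target in ranked[:k]:
            hits += 1
    return hits / len(ground_truth)


# Toy example: 3 text queries against 4 gallery images.
sim = [
    [0.9, 0.1, 0.3, 0.2],  # matching image 0 is ranked 1st
    [0.4, 0.5, 0.8, 0.1],  # matching image 1 is ranked 2nd
    [0.7, 0.6, 0.2, 0.1],  # matching image 3 is ranked 4th
]
gt = [0, 1, 3]

for k in (1, 2, 4):
    print(f"R@{k} = {recall_at_k(sim, gt, k):.3f}")
```

In this toy case one of three queries is answered at rank 1, two of three within rank 2, and all three within rank 4, so recall rises as K grows.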

Paper Structure

This paper contains 13 sections, 8 equations, 1 figure, 4 tables, 2 algorithms.

Figures (1)

  • Figure 1: (a) Overview of Local-global Hybrid Perspective (LHP) Modeling. It processes an image probabilistically, applying either a local transform for fine-grained details or a global transform for comprehensive context. Contrastive learning aligns image and text embeddings by minimizing distances for matching pairs and maximizing distances for non-matching pairs. (b) Unified Image-Text (UIT) Modeling with Feature Selection. UIT is a cross-modal framework that integrates MIM, MLM, ITC, and ITM to unify image and text understanding. UIT enhances inference by leveraging LHP-based feature selection for efficient and accurate multi-modal representation learning.
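
The two ideas in panel (a), a probabilistic choice between a local and a global transform followed by contrastive alignment of image and text embeddings, can be sketched as follows. This is a hedged illustration under stated assumptions, not the paper's implementation: the names `choose_view`, `itc_loss`, `center_crop`, and `identity`, the probability `p_local`, the temperature of 0.07, and the symmetric InfoNCE form of the contrastive loss are all hypothetical choices made here, since the caption does not specify them.

```python
import math
import random
from typing import Callable, List, Sequence


def choose_view(image, local_transform: Callable, global_transform: Callable,
                p_local: float, rng: random.Random):
    """Apply the local transform with probability p_local, else the global one."""
    if rng.random() < p_local:
        return local_transform(image)
    return global_transform(image)


def _normalize(v: Sequence[float]) -> List[float]:
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def _cross_entropy_diag(logits: List[List[float]]) -> float:
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_sum = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum - row[i]
    return total / len(logits)


def itc_loss(image_emb: Sequence[Sequence[float]],
             text_emb: Sequence[Sequence[float]],
             temperature: float = 0.07) -> float:
    """Symmetric InfoNCE-style loss; pair i of each list is a matching pair."""
    img = [_normalize(v) for v in image_emb]
    txt = [_normalize(v) for v in text_emb]
    # Cosine similarity between every image and every text, scaled by temperature.
    logits = [[sum(a * b for a, b in zip(i, t)) / temperature for t in txt]
              for i in img]
    logits_t = [list(col) for col in zip(*logits)]  # text-to-image direction
    return 0.5 * (_cross_entropy_diag(logits) + _cross_entropy_diag(logits_t))


# Toy "image" as a 4x4 grid, with stand-in transforms (both hypothetical).
def center_crop(img):
    return [row[1:3] for row in img[1:3]]  # local view: central 2x2 patch


def identity(img):
    return img  # global view: the full image


image = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
view = choose_view(image, center_crop, identity, p_local=0.5,
                   rng=random.Random(0))

aligned = itc_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
shuffled = itc_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
print(aligned < shuffled)  # matched pairs should give the lower loss
```

Passing the random generator explicitly keeps the view selection reproducible. In a real training loop the embeddings would come from the image and text encoders rather than hand-written vectors.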