Efficient Vision Language Model Fine-tuning for Text-based Person Anomaly Search

Jiayi He; Shengeng Tang; Ao Liu; Lechao Cheng; Jingjing Wu; Yanyan Wei

TL;DR

This work addresses Text-based Person Anomaly Search (TPAS), which extends Text-based Person Search to retrieving pedestrians exhibiting normal or abnormal behavior from a large image collection, given a text description. The authors fine-tune the X-VLM model on the Pedestrian Anomaly Behavior (PAB) dataset and introduce Similarity Coverage Analysis (SCA) to reduce confusion when different text descriptions yield similar search results, improving cross-modal discrimination. Trained with in-batch contrastive and matching losses and combined with SCA, the approach achieves a Recall@1 of 85.49 on PAB, a competitive result among challenge participants. The method demonstrates data-efficient fine-tuning and is applicable to large-scale, text-driven anomaly search in surveillance-like settings.
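To make the in-batch contrastive loss mentioned above concrete, here is a minimal sketch of a symmetric text-to-image and image-to-text contrastive objective over one batch. It is an illustration under stated assumptions, not the authors' implementation: the function names, the plain-list similarity matrix, and the fixed temperature of 0.07 are all hypothetical choices made for this example.

```python
import math


def log_softmax(row):
    """Numerically stable log-softmax over a list of scores."""
    m = max(row)
    log_z = m + math.log(sum(math.exp(v - m) for v in row))
    return [v - log_z for v in row]


def in_batch_contrastive_loss(sim, temperature=0.07):
    """Symmetric in-batch contrastive loss (illustrative sketch).

    sim[i][j] is the similarity between text i and image j.
    Pair (i, i) is treated as the positive; every other entry in the
    same row or column acts as an in-batch negative.
    The temperature value is an assumption, not taken from the paper.
    """
    n = len(sim)
    scaled = [[v / temperature for v in row] for row in sim]

    # Text-to-image: each row should put its mass on the diagonal entry.
    t2i = -sum(log_softmax(scaled[i])[i] for i in range(n)) / n

    # Image-to-text: same idea applied to the columns.
    cols = [[scaled[i][j] for i in range(n)] for j in range(n)]
    i2t = -sum(log_softmax(cols[j])[j] for j in range(n)) / n

    return 0.5 * (t2i + i2t)


if __name__ == "__main__":
    aligned = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
    confused = [[0.0, 1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]
    print(in_batch_contrastive_loss(aligned))
    print(in_batch_contrastive_loss(confused))
```

The loss is small when each description scores highest against its own image and grows when two descriptions compete for the same images, which is the kind of confusion the TL;DR says SCA is meant to reduce.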

Abstract

This paper presents the HFUT-LMC team's solution to the WWW 2025 challenge on Text-based Person Anomaly Search (TPAS). The primary objective of this challenge is to accurately identify pedestrians exhibiting either normal or abnormal behavior within a large library of pedestrian images. Unlike traditional video analysis tasks, TPAS places strong emphasis on understanding the subtle relationships between text descriptions and visual data. The task is difficult because the model must not only match individuals to text descriptions in massive image datasets but also distinguish between search results when the descriptions are similar. To address the recognition difficulty caused by similar text descriptions, we introduce the Similarity Coverage Analysis (SCA) strategy. This strategy enhances the model's capacity to handle subtle differences, improving both the accuracy and reliability of the search. Our proposed solution performed strongly in this challenge.
