Integrating Language-Derived Appearance Elements with Visual Cues in Pedestrian Detection

Sungjune Park, Hyunjun Kim, Yong Man Ro

TL;DR

This work addresses the challenge of pedestrian detection under wide appearance variation by leveraging large language models to generate rich language-derived appearance elements from a comprehensive description corpus. The method extracts appearance knowledge embeddings with an LLM, samples representative centroids via $K$-means, and refines them through task-prompting to produce $K$ appearance elements $\boldsymbol{E}$ that are integrated with visual features using a cross-attention fusion mechanism and a reference loss $L_{ref}$. Importantly, these language-derived elements are generated offline and do not require language inputs during inference, enabling compatibility with diverse detectors. Across CrowdHuman and WiderPedestrian benchmarks, the approach yields notable performance gains and achieves state-of-the-art results, validating the effectiveness and adaptability of incorporating language-derived appearance cues into visual perception systems with modest parameter overhead.

Abstract

Large language models (LLMs) have shown their capabilities in understanding contextual and semantic information regarding knowledge of instance appearances. In this paper, we introduce a novel approach that uses the strengths of LLMs in understanding contextual appearance variations and transfers this knowledge into a vision model (here, pedestrian detection). While pedestrian detection is considered one of the crucial tasks directly related to our safety (e.g., intelligent driving systems), it is challenging because of varying appearances and poses in diverse scenes. Therefore, we propose to formulate language-derived appearance elements and incorporate them with visual cues in pedestrian detection. To this end, we establish a description corpus that includes numerous narratives describing various appearances of pedestrians and other instances. By feeding them through an LLM, we extract appearance knowledge sets that contain the representations of appearance variations. Subsequently, we perform a task-prompting process to obtain appearance elements, i.e., representative appearance knowledge guided to be relevant to the downstream pedestrian detection task. The obtained knowledge elements are adaptable to various detection frameworks, so that we can provide plentiful appearance information by integrating the language-derived appearance elements with visual cues within a detector. Through comprehensive experiments with various pedestrian detectors, we verify the adaptability and effectiveness of our method, showing noticeable performance gains and achieving state-of-the-art detection performance on two public pedestrian detection benchmarks (i.e., CrowdHuman and WiderPedestrian).
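The offline part of the pipeline summarized above (appearance embeddings, $K$-means centroids $\boldsymbol{C}$, and appearance elements $\boldsymbol{E}$ formed by adding learnable prompts $\boldsymbol{P}$ to the centroids, as the Figure 2 caption describes) can be sketched roughly as follows. This is a minimal NumPy illustration and not the authors' implementation: the random matrix stands in for real LLM embeddings, the function name `kmeans_centroids`, the dimensions, and the zero initialization of $\boldsymbol{P}$ are all assumptions, and the task-prompting training loop that would update $\boldsymbol{P}$ is omitted.

```python
import numpy as np

def kmeans_centroids(S, k, iters=20, seed=0):
    """Plain Lloyd's K-means (hypothetical helper): returns (k, d) centroids of the rows of S."""
    rng = np.random.default_rng(seed)
    # Initialize centroids from k distinct rows of S.
    C = S[rng.choice(len(S), size=k, replace=False)].copy()
    for _ in range(iters):
        # Squared Euclidean distance from every row to every centroid: shape (n, k).
        d2 = ((S[:, None, :] - C[None, :, :]) ** 2).sum(axis=-1)
        assign = d2.argmin(axis=1)
        for j in range(k):
            members = S[assign == j]
            if len(members) > 0:  # keep the old centroid if a cluster is empty
                C[j] = members.mean(axis=0)
    return C

# Stand-in for the appearance knowledge set S (assumption: 500 embeddings of size 16).
rng = np.random.default_rng(0)
S = rng.normal(size=(500, 16))

K = 8                          # number of appearance elements (illustrative value)
C = kmeans_centroids(S, K)     # appearance knowledge centroids, kept fixed
P = np.zeros_like(C)           # learnable appearance prompts (zero init is an assumption)
E = C + P                      # appearance elements: elementwise sum of C and P
```

In this sketch only `P` would receive gradient updates during task-prompting, so `E` starts at the centroids and drifts toward task-relevant representations while `C` stays fixed.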

Paper Structure (27 sections, 3 equations, 5 figures, 5 tables)


Figures (5)

  • Figure 1: A vision AI model (i.e., a pedestrian detector) recognizing pedestrians in a street scene. (a) When only visual cues are provided, the pedestrian detector can be confused when recognizing instances because of their similar appearances (e.g., color). (b) When language cues are also provided (e.g., "A person who crouches down."), they help the detector perceive instances properly.
  • Figure 2: Overview of formulating appearance elements through an LLM. After a description corpus covering diverse appearance variations of instances is composed, an LLM processes it to produce the appearance knowledge sets $\boldsymbol{S}$, which contain a large number of appearance representations. From the appearance features in $\boldsymbol{S}$, we sample $K$ representative pieces of appearance knowledge, named appearance knowledge centroids $\boldsymbol{C}$, using $K$-means clustering. Alongside the $K$ centroids, we place $K$ learnable appearance prompts $\boldsymbol{P}$ and take the elementwise sum of $\boldsymbol{C}$ and $\boldsymbol{P}$ to obtain the appearance elements $\boldsymbol{E}$. We then repurpose $\boldsymbol{E}$ with a pedestrian classification loss, which is directly related to the downstream pedestrian detection task; this step is task-prompting. During task-prompting, $\boldsymbol{P}$ is the only learnable parameter that is updated. The resulting $\boldsymbol{E}$ therefore contains appearance knowledge from the LLM and becomes more task-relevant.
  • Figure 3: How the language-derived appearance elements $\boldsymbol{E}$ are incorporated with visual features in a pedestrian detector. The integrating module, consisting of multi-modality cross-attention, addition (Add), and normalization (Norm), is embedded into a pedestrian detector. The cross-attention takes visual features as the query ($\boldsymbol{Q}$) and refers to $\boldsymbol{E}$ as the key ($\boldsymbol{K}$) and value ($\boldsymbol{V}$) features.
  • Figure 4: The proportion of appearance elements for pedestrian and background. Among 200 appearance elements, 102 (51%) are pedestrian-related elements $\boldsymbol{E}_{p}$, and the remaining 98 (49%) are background-related elements $\boldsymbol{E}_{b}$. (a) and (b) show example elements and the descriptions mapped to each element for $\boldsymbol{E}_{p}$ and $\boldsymbol{E}_{b}$, respectively. For example, "A low resolution rendering of a small person wearing a yellow jacket." and "A cropped photo of a short girl wearing a yellow t-shirt." belong to the first appearance element $\boldsymbol{e}_{1}$, one of the pedestrian-related elements.
  • Figure 5: Ablation study with a varying number of appearance elements $K$, using average precision (AP) as the evaluation metric. The method shows consistent performance improvements while being insensitive to the choice of $K$.
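
The integrating module described in the Figure 3 caption (cross-attention with visual features as queries and $\boldsymbol{E}$ as keys and values, followed by Add and Norm) can be sketched as below. This is a hedged, single-head NumPy illustration under stated assumptions: the projection matrices `Wq`, `Wk`, `Wv`, the function names, all dimensions, and the parameter-free layer normalization are hypothetical choices made for the example, and the paper's actual module (number of heads, output projection, placement inside a detector) may differ.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def layer_norm(x, eps=1e-5):
    # Normalization over the feature axis, without learnable scale/shift (assumption).
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def integrate(visual, E, Wq, Wk, Wv):
    """Cross-attention (visual queries, E as keys/values), then Add and Norm."""
    Q = visual @ Wq                  # (N, d) queries from visual features
    Km = E @ Wk                      # (K, d) keys from appearance elements
    V = E @ Wv                       # (K, d_v) values, projected to the visual width
    attn = softmax(Q @ Km.T / np.sqrt(Q.shape[-1]))   # (N, K) attention weights
    fused = layer_norm(visual + attn @ V)             # residual Add, then Norm
    return fused, attn

# Illustrative shapes only: 10 visual tokens of width 32, 8 elements of width 16.
rng = np.random.default_rng(0)
visual = rng.normal(size=(10, 32))
E = rng.normal(size=(8, 16))
Wq = rng.normal(size=(32, 24)) * 0.1
Wk = rng.normal(size=(16, 24)) * 0.1
Wv = rng.normal(size=(16, 32)) * 0.1

fused, attn = integrate(visual, E, Wq, Wk, Wv)
```

Because $\boldsymbol{E}$ is a fixed set of vectors once it has been computed offline, a module of this shape needs no text input at inference time; each visual token simply attends over the $K$ stored elements.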