SilLang: Improving Gait Recognition with Silhouette Language Encoding

Ruiyi Zhan, Guozhen Peng, Canyu Chen, Jian Lei, Annan Li

Abstract

Gait silhouettes, which can be encoded into binary gait codes, are widely adopted to represent the motion patterns of pedestrians. Recent approaches commonly leverage visual backbones to encode gait silhouettes, achieving strong performance. However, they primarily focus on continuous visual features, overlooking the discrete nature of binary silhouettes, which inherently share a discrete encoding space with natural language. Large Language Models (LLMs) have demonstrated exceptional capability in extracting discriminative features from discrete sequences and modeling long-range dependencies, highlighting their potential to capture temporal motion patterns by identifying subtle variations. Motivated by these observations, we explore bridging binary gait silhouettes and natural language within a binary encoding space. However, the encoding spaces of text tokens and binary gait silhouettes remain misaligned, primarily due to differences in token frequency and density. To address this issue, we propose the Contour-Velocity Tokenizer, which encodes binary gait silhouettes while reshaping their distribution to better align with the text token space. We then establish a dual-branch framework termed the Silhouette Language Model, which enhances visual silhouettes by integrating discrete linguistic embeddings derived from LLMs. Implemented on mainstream gait backbones, SilLang consistently improves state-of-the-art methods across SUSTech1K, GREW, and Gait3D.

Paper Structure (18 sections, 7 equations, 8 figures, 8 tables)


Figures (8)

  • Figure 1: Binary gait silhouettes are encoded with the proposed Contour-Velocity Tokenizer, thereby forming an implicit silhouette vocabulary within the text token space. Since low-frequency silhouette tokens denote subtle walking patterns, shifts in the token frequency distribution capture fine-grained motion cues.
  • Figure 2: Aligning and translating binary gait silhouettes into natural language within a shared binary encoding space.
  • Figure 3: Pipeline of the silhouette language model. (a) In the silhouette language branch, the silhouette sequences are encoded with a Contour-Velocity Tokenizer, and the Adapter adapts the text embedding ($\boldsymbol{emb}_t$) to the visual embedding ($\boldsymbol{emb}_v$). The visual branch keeps the same structure as other silhouette-based gait recognition models (e.g., DeepGaitV2, GLGait). (b) Extraction of the contour map $\boldsymbol{c}(t)$ and velocity map $\boldsymbol{v}(t)$ from the silhouette $\boldsymbol{s}(t)$. (c) Visualization of the normalized token frequencies in the silhouette and text vocabularies when aligning silhouettes to language.
  • Figure 4: Distribution of normalized embeddings and tokens across different branches. (a) Embeddings in visual, silhouette language and text branches; (b) Input tokens for LLMs.
  • Figure 5: Performance improvement of SilLang on Gait3D. The correctly recognized samples in the test set are counted per frame-length range. The average distance is defined as the mean distance between negative samples minus the mean distance between positive samples within each frame-length range.
  • ...and 3 more figures
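
The average distance described in the Figure 5 caption can be sketched as follows. This is a minimal illustration, not the authors' evaluation code: the function name `average_distance_gap`, the `(frame_length, distance, is_positive)` pair format, the half-open bin boundaries, and the toy numbers are all assumptions made for the example. It also assumes each sample pair is assigned to a range by a single frame length (for instance, that of the probe sequence).

```python
from collections import defaultdict
from statistics import mean


def average_distance_gap(pairs, bin_edges):
    """Mean negative-pair distance minus mean positive-pair distance per range.

    pairs: iterable of (frame_length, distance, is_positive) tuples (hypothetical format).
    bin_edges: ascending frame-length boundaries, e.g. [0, 50, 100],
               read as half-open ranges [0, 50) and [50, 100).
    """
    pos = defaultdict(list)
    neg = defaultdict(list)
    for frame_length, distance, is_positive in pairs:
        for lo, hi in zip(bin_edges, bin_edges[1:]):
            if lo <= frame_length < hi:
                (pos if is_positive else neg)[(lo, hi)].append(distance)
                break
    # The gap is only defined for ranges holding both positive and negative pairs.
    return {rng: mean(neg[rng]) - mean(pos[rng]) for rng in pos if rng in neg}


# Toy data, invented purely for illustration.
pairs = [
    (30, 0.2, True), (40, 0.9, False),
    (80, 0.5, True), (90, 1.0, False), (95, 0.5, False),
]
gaps = average_distance_gap(pairs, [0, 50, 100])
for frame_range, gap in sorted(gaps.items()):
    print(frame_range, round(gap, 3))
```

Under this definition, a larger value means negative samples sit farther from the query than positive samples do within that frame-length range, which is the direction a recognition model aims for.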