Person Re-Identification for Robot Person Following with Online Continual Learning

Hanjing Ye, Jieting Zhao, Yu Zhan, Weinan Chen, Li He, Hong Zhang

TL;DR

This work tackles robot person following (RPF) under occlusion by reframing target re-identification (ReID) as an online continual learning (OCL) problem. A ReID module uses a memory manager to combine short-term and long-term experiences, continually updating a ResNet-based feature extractor and a ridge-regression classifier so that the appearance representation stays discriminative. Memory-guided replay and loss-based keyframe selection mitigate domain shift, enabling reliable re-identification of the target despite appearance changes and visually similar distractors, while remaining feasible for onboard, real-time operation. Experiments on public and custom datasets show state-of-the-art ReID performance in RPF and demonstrate the practical benefits of online learning and memory consolidation for robust, persistent following.
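
As a rough illustration of the classifier side of this summary, the sketch below fits a ridge-regression classifier in closed form on a few feature vectors and thresholds its output as a target confidence. It is not the authors' implementation: the function names (`fit_ridge`, `target_confidence`, `select_target`), the toy hand-written features standing in for extractor outputs, the regularization weight `lam`, and the 0.5 threshold are all assumptions made for illustration.

```python
def solve_linear(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Work on an augmented copy so the inputs are not modified.
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def fit_ridge(features, labels, lam=0.1):
    """Closed-form ridge regression: w = (X^T X + lam * I)^-1 X^T y."""
    d = len(features[0])
    xtx = [[sum(f[i] * f[j] for f in features) + (lam if i == j else 0.0)
            for j in range(d)] for i in range(d)]
    xty = [sum(f[i] * y for f, y in zip(features, labels)) for i in range(d)]
    return solve_linear(xtx, xty)


def target_confidence(weights, feature):
    """Linear score; values near 1 suggest the target, near 0 a distractor."""
    return sum(w * x for w, x in zip(weights, feature))


def select_target(weights, candidates, threshold=0.5):
    """Index of the most confident candidate above threshold, else None."""
    best_idx, best_conf = None, threshold
    for idx, feat in enumerate(candidates):
        conf = target_confidence(weights, feat)
        if conf > best_conf:
            best_idx, best_conf = idx, conf
    return best_idx


# Toy short-term samples (hypothetical). The trailing 1.0 is a bias term,
# which this sketch regularizes together with the other weights.
train_features = [[1.0, 0.1, 1.0], [0.9, 0.2, 1.0],   # target person
                  [0.1, 1.0, 1.0], [0.2, 0.9, 1.0]]   # other people
train_labels = [1.0, 1.0, 0.0, 0.0]
weights = fit_ridge(train_features, train_labels)

# Two tracked candidates: the second one resembles the target.
chosen = select_target(weights, [[0.15, 0.95, 1.0], [0.95, 0.15, 1.0]])
```

Returning `None` when no candidate exceeds the threshold mirrors the case where the target is not found among the tracked people and all observations remain candidates for re-identification.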

Abstract

Robot person following (RPF) is a crucial capability in human-robot interaction (HRI) applications, allowing a robot to persistently follow a designated person. In practical RPF scenarios, the person is often occluded by other objects or people, so the robot must re-identify the person when they reappear within its field of view. Previous person re-identification (ReID) approaches to person following rely on a fixed feature extractor, which often fails to generalize to the different viewpoints and lighting conditions found in practical RPF environments. In other words, such an approach suffers from the so-called domain shift problem: it cannot re-identify the person when their appearance on reappearing falls outside the domain modeled by the fixed feature extractor. To mitigate this problem, we propose a ReID framework for RPF in which the feature extractor is optimized online with both short-term and long-term experiences (i.e., recently and previously observed samples during RPF) using the online continual learning (OCL) framework. The long-term experiences are maintained by a memory manager to enable OCL to update the feature extractor. Our experiments demonstrate that even in the presence of severe appearance changes and distractions from visually similar people, the proposed method re-identifies the person more accurately than state-of-the-art methods.
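
To make the idea of short-term and long-term experiences concrete, here is a minimal sketch of a memory manager that keeps a small FIFO buffer of recent samples alongside a bounded long-term store, and mixes both when drawing a replay batch. The class name `MemoryManager`, the capacities, and the batch sizes are hypothetical. The long-term store here uses plain reservoir sampling as a stand-in; the paper's own selection rule (the summary mentions loss-based keyframe selection) is not reproduced.

```python
import random
from collections import deque


class MemoryManager:
    """Toy memory with a recent-sample buffer and a bounded long-term store."""

    def __init__(self, short_capacity=4, long_capacity=8, seed=0):
        # Short-term: the most recent observations only (FIFO).
        self.short_term = deque(maxlen=short_capacity)
        # Long-term: a bounded sample of everything seen so far.
        self.long_term = []
        self.long_capacity = long_capacity
        self.seen = 0
        self.rng = random.Random(seed)

    def add(self, patch, label):
        """Memorize one (image patch, label) observation."""
        sample = (patch, label)
        self.short_term.append(sample)
        self.seen += 1
        if len(self.long_term) < self.long_capacity:
            self.long_term.append(sample)
        else:
            # Reservoir sampling (assumption): each seen sample is retained
            # with equal probability long_capacity / seen.
            j = self.rng.randrange(self.seen)
            if j < self.long_capacity:
                self.long_term[j] = sample

    def replay_batch(self, n_short=2, n_long=3):
        """Draw a mixed batch of recent and older samples for one update."""
        short = self.rng.sample(list(self.short_term),
                                min(n_short, len(self.short_term)))
        long = self.rng.sample(self.long_term,
                               min(n_long, len(self.long_term)))
        return short + long


memory = MemoryManager()
for frame_idx in range(100):
    # Patches are stand-in strings; label 1 = target, 0 = other person.
    memory.add(f"patch_{frame_idx}", frame_idx % 2)
batch = memory.replay_batch()
```

In a full system, each batch would feed one gradient step of the feature extractor; that training loop is omitted because it depends on details not given in this excerpt.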

Paper Structure

This paper contains 29 sections, 5 equations, 8 figures, 3 tables, 1 algorithm.

Figures (8)

  • Figure 1: Robot person following with online continual learning. Long-term and short-term experiences are used to optimize the feature extractor online so that it represents the discriminative appearance of the target person.
  • Figure 2: The top part is the pipeline of our RPF system and the bottom part is the proposed person ReID framework. We obtain image patches $\{\mathbf{M}\}_i$ of the tracked people using the current image $\mathbf{I}$ and their bounding boxes $\{\mathbf{B}\}_i$. When the target person is consistently tracked, the target's label $\mathbf{y}$ marks a positive sample and other people are negatives. Afterward, we add $\{\mathbf{M}, \mathbf{y}\}_i$ to the memory manager for memorization. Additionally, these patches are fed into the feature extractor to extract ReID features, which the target classifier uses to estimate the target confidence. If the target confidence is greater than a threshold, the corresponding position $\mathbf{p}$ is designated as the target position. In addition to the above inference process, the memory manager simultaneously replays long-term and short-term experiences to train the feature extractor. Meanwhile, the target classifier is trained with short-term experiences. If the target person is not found among the tracked individuals, the training process pauses, and all observations $\{\mathbf{M}, \mathbf{y}\}_i$ become candidates for re-identification. The above training and inference processes are managed by the ReID lifecycle.
  • Figure 3: Feature distribution of the target person (positive) and other distracting people (negative) across all observed samples at the end of the sequence. "S" denotes short-term experiences and "L" long-term ones. (a) Pre-trained features without any online optimization. (b) Trained features with online optimization using short-term experiences only. (c) Trained features using both short-term and long-term experiences within our framework. (d) Ideal feature distribution where features are optimized offline through extensive iterative training.
  • Figure 4: Plots of ReID mean accuracy w.r.t. encountered segments. "Ours w/o MM." indicates fine-tuning the feature extractor without memory management (introduced in Sec. \ref{sec3-4}), using newly observed samples only. After fine-tuning from a new segment, the model is evaluated on segments it has encountered previously to determine its mean accuracy. This metric indicates the model's ability to retain knowledge from segments learned earlier.
  • Figure 5: An example illustrating the errors. The goal of a correct action is to move the target person closer to the center of the image. Here, $(W,H)$ represents the image's width and height. $(W_{exp}, H_{exp})$ denotes the size of the expected centered bounding box, initialized by the first bounding box of the target person. $(W_b,H_b)$ represents the target person's bounding box size in the current frame, and $(x_b,y_b)$ is its center point.
  • ...and 3 more figures
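
The retention metric described in the Figure 4 caption can be sketched as follows. The sketch assumes an accuracy matrix in which entry `[i][j]` is the ReID accuracy on segment `j` measured after fine-tuning on segment `i`, and it assumes that the "encountered" segments include the one just trained on; both the layout and the function name `mean_accuracy_over_seen` are illustrative choices, and the numbers are made-up placeholders rather than results from the paper.

```python
def mean_accuracy_over_seen(acc_matrix):
    """Mean accuracy over segments 0..i after fine-tuning on segment i.

    acc_matrix[i][j] is the accuracy on segment j after training on
    segment i. Entries with j > i (segments not yet seen) are ignored.
    """
    curve = []
    for i, row in enumerate(acc_matrix):
        seen = row[: i + 1]
        curve.append(sum(seen) / len(seen))
    return curve


# Placeholder accuracies for three segments (not taken from the paper).
# A drop in column 0 from row 0 to row 1 would indicate forgetting.
placeholder_acc = [
    [1.0, 0.0, 0.0],
    [0.5, 1.0, 0.0],
    [0.6, 0.9, 0.9],
]
curve = mean_accuracy_over_seen(placeholder_acc)
```

A curve that stays high as more segments are encountered indicates that earlier segments are retained; a falling curve indicates forgetting.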