SCP: Soft Conditional Prompt Learning for Aerial Video Action Recognition

Xijun Wang, Ruiqi Xian, Tianrui Guan, Fuxiao Liu, Dinesh Manocha

TL;DR

This work presents a soft conditional prompt method that improves recognition performance by learning to dynamically generate prompts from a pool of prompt experts, conditioned on the input video. The method is also integrated into ROS2.

Abstract

We present a new learning approach, Soft Conditional Prompt Learning (SCP), which leverages the strengths of prompt learning for aerial video action recognition. Our approach is designed to predict the action of each agent by helping models focus on the descriptions or instructions associated with actions in the input videos for aerial/robot visual perception. Our formulation supports various prompts, including learnable prompts, auxiliary visual information, and large vision models, to improve recognition performance. We present a soft conditional prompt method that learns to dynamically generate prompts from a pool of prompt experts for different video inputs. By sharing the same objective as the task, SCP can optimize prompts that guide the model's predictions while explicitly learning input-invariant (prompt experts pool) and input-specific (data-dependent) prompt knowledge. In practice, we observe a 3.17-10.2% accuracy improvement on the aerial video datasets (Okutama, NECDrone), which consist of scenes with single-agent and multi-agent actions. We further evaluate our approach on ground camera videos to verify its effectiveness and generalization, and achieve a 1.0-3.6% improvement on the SSV2 dataset. We also integrate our method into ROS2.

Paper Structure (22 sections, 11 equations, 5 figures, 5 tables)

Figures (5)

  • Figure 1: Overall Architecture: Our action recognition method is designed to run on edge devices (on mobile robots) and cloud servers. This includes lightweight (embedded) prompts, which can be easily embedded in any action recognition model without much extra computational cost. For large vision models, we perform the computations on a cloud server and use low-latency communication with the robots.
  • Figure 2: Task Overview: We use prompt learning for action recognition. Our method leverages the strengths of prompt learning to guide the learning process by helping models better focus on the descriptions or instructions associated with actions in the input videos. We explore various prompts, including optical flow, large vision models, and the proposed SCP, to improve recognition performance. The recognition models can be CNNs or Transformers.
  • Figure 3: Overview of the action recognition framework: We use transformer-based action recognition methods as an example. We design a prompt-learning-based encoder to extract features more effectively, and use our auto-regressive temporal reasoning algorithm to enhance the inference ability of the recognition models.
  • Figure 4: Soft Conditional Prompt Learning (SCP): Learning input-invariant (prompt experts) and input-specific (data-dependent) prompts. The input-invariant prompts, which contain task information, are updated from all the inputs, and we use a dynamic mechanism to generate input-specific prompts for different inputs. Add/Mul denote element-wise operations. $B\times S\times C$ is the shape of the input features, and $l$ is the number of experts in the prompt pool.
  • Figure 5: Visualization: We first detect the target of interest and generate the prompts, then predict the action.
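
The mechanism described in the Figure 4 caption (a pool of $l$ input-invariant prompt experts, mixed per input and applied element-wise to features of shape $B\times S\times C$) can be sketched roughly as below. This is a minimal NumPy illustration under stated assumptions, not the authors' implementation: the function name `soft_conditional_prompt`, the mean-pooled video summary, the linear gate `gate_weights`, and the softmax mixing are all hypothetical choices made only to show the idea.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def soft_conditional_prompt(features, expert_pool, gate_weights, mode="add"):
    """Mix a pool of prompt experts per input and apply the result to the features.

    features:     (B, S, C) input features
    expert_pool:  (l, C) input-invariant prompt experts (learned in practice)
    gate_weights: (C, l) hypothetical linear gate (assumption, not from the paper)
    mode:         "add" or "mul", the element-wise operation applied to the features
    """
    # Assumption: summarize each input by mean-pooling over its S tokens.
    summary = features.mean(axis=1)                # (B, C)
    # Input-specific soft weights over the l experts; each row sums to 1.
    weights = softmax(summary @ gate_weights)      # (B, l)
    # Weighted combination of the experts gives one prompt per input.
    prompt = (weights @ expert_pool)[:, None, :]   # (B, 1, C), broadcast over S
    if mode == "add":
        return features + prompt, weights
    if mode == "mul":
        return features * prompt, weights
    raise ValueError(f"unknown mode: {mode!r}")


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    B, S, C, l = 2, 4, 8, 3
    features = rng.normal(size=(B, S, C))
    expert_pool = rng.normal(size=(l, C))
    gate_weights = rng.normal(size=(C, l))
    prompted, weights = soft_conditional_prompt(features, expert_pool, gate_weights)
    print(prompted.shape)  # (2, 4, 8)
```

In this sketch the expert pool is shared across all inputs, while the mixing weights depend on each input, which mirrors the input-invariant versus input-specific split in the caption. In a trained model both `expert_pool` and `gate_weights` would be learned with the recognition objective; here they are random placeholders.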