Language-Guided Reinforcement Learning for Hard Attention in Few-Shot Learning

Bahareh Nikpour, Narges Armanfard

TL;DR

LaHA introduces a language-guided reinforcement learning framework that identifies discrete, informative image patches for few-shot learning by deploying a Vision Transformer as the RL agent. It couples a graph-based Baseline Module and a contrastive learning auxiliary task, with a vision-language reward to improve interpretability, and optimizes a combined loss $L_{tot} = \alpha L_{RL} + \beta L_{CL} + \gamma L_{BL}$. Empirical results across MiniImageNet, CIFAR-FS, FC-100, and CUB show consistent improvements over multiple FSL baselines, and ImageNet experiments demonstrate competitive hard-attention performance with larger $N_a$. The method reduces data size and computation while preserving accuracy, and the VLM reward provides semantic interpretability, making LaHA a versatile approach for efficient and interpretable visual learning.
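
The combined objective above is a weighted sum of three terms. A minimal sketch of that arithmetic follows, assuming each loss has already been computed as a scalar; the function name `total_loss` and the default weights are illustrative assumptions, not values taken from the paper.

```python
def total_loss(l_rl, l_cl, l_bl, alpha=1.0, beta=1.0, gamma=1.0):
    """Weighted sum L_tot = alpha * L_RL + beta * L_CL + gamma * L_BL.

    l_rl: reinforcement learning loss (scalar)
    l_cl: contrastive learning auxiliary loss (scalar)
    l_bl: Baseline Module loss (scalar)
    The default weights of 1.0 are placeholders, not reported values.
    """
    return alpha * l_rl + beta * l_cl + gamma * l_bl


# Example with made-up loss values and weights.
l_tot = total_loss(1.0, 2.0, 3.0, alpha=0.5, beta=0.25, gamma=0.1)
```

In practice the three terms would be tensors from a deep learning framework so that gradients flow through the sum; plain floats are used here only to show the weighting.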

Abstract

Attention mechanisms have demonstrated significant potential in enhancing learning models by identifying key portions of input data, particularly in scenarios with limited training samples. Inspired by human perception, we propose that focusing on essential data segments, rather than the entire dataset, can improve the accuracy and reliability of the learning models. However, identifying these critical data segments, or "hard attention finding," is challenging, especially in few-shot learning, due to the scarcity of training data and the complexity of model parameters. To address this, we introduce LaHA, a novel framework that leverages language-guided deep reinforcement learning to identify and utilize informative data regions, thereby improving both interpretability and performance. Extensive experiments on benchmark datasets validate the effectiveness of LaHA.

Paper Structure (12 sections, 9 equations, 6 figures, 4 tables)

Figures (6)

  • Figure 1: Block diagram of the proposed LaHA framework.
  • Figure 2: The agent's state $S_{k,m}$, which is the concatenation of the original image $I_m$ and the agent's image output at step $k-1$ of the episode, $I_{k-1,m}$, i.e., $S_{k,m} = [I_m, I_{k-1,m}]$.
  • Figure 3: An example of $I_{k,m}$, $k \in \{0, \ldots, K\}$, during one episode of the proposed method. The actions for step 1, shown as yellow arrows, are go right ($\rightarrow$), go down ($\downarrow$), and go left ($\leftarrow$) for the first, second, and third regions, respectively, which results in $I_{1,m}$. After completing an episode, the agent outputs the regions found in $I_{K,m}$.
  • Figure 4: An example of creating positive and negative pairs in a 5-way scenario.
  • Figure 5: Visualization of the selected patches by LaHA.
  • ...and 1 more figure
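
The state construction and region moves described in the captions of Figures 2 and 3 can be sketched as follows. This is a toy illustration under several assumptions that the captions do not confirm: the image is a small grid of patch values, $I_{k,m}$ is formed by zeroing unselected patches, the state is a two-element list rather than a channel-wise stack, moves are clamped at the image border, and an "up" action exists alongside the three named ones. All function names are hypothetical.

```python
# Hypothetical action set: the caption names right, down, and left; "up" is assumed.
MOVES = {"right": (0, 1), "down": (1, 0), "left": (0, -1), "up": (-1, 0)}


def step_regions(regions, actions, grid_h, grid_w):
    """Move each region (row, col) by one patch; clamping at the border is assumed."""
    moved = []
    for (r, c), action in zip(regions, actions):
        dr, dc = MOVES[action]
        new_r = min(max(r + dr, 0), grid_h - 1)
        new_c = min(max(c + dc, 0), grid_w - 1)
        moved.append((new_r, new_c))
    return moved


def masked_image(image, regions):
    """Keep only the selected patches and zero the rest (assumed form of I_{k,m})."""
    keep = set(regions)
    return [[v if (r, c) in keep else 0 for c, v in enumerate(row)]
            for r, row in enumerate(image)]


def make_state(image, prev_output):
    """S_{k,m} = [I_m, I_{k-1,m}], represented here as a two-element list."""
    return [image, prev_output]


image = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]  # toy 3x3 grid of patch values
regions0 = [(0, 0), (0, 2), (2, 2)]        # assumed initial region positions
i0 = masked_image(image, regions0)         # I_{0,m}
s1 = make_state(image, i0)                 # S_{1,m} = [I_m, I_{0,m}]

# Step 1 actions from the Figure 3 caption: right, down, left.
regions1 = step_regions(regions0, ["right", "down", "left"], 3, 3)
i1 = masked_image(image, regions1)         # I_{1,m}
```

Here the agent's policy is replaced by a fixed action list; in the framework summarized above, a Vision Transformer would produce the actions from the state.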