Language-Guided Reinforcement Learning for Hard Attention in Few-Shot Learning
Bahareh Nikpour, Narges Armanfard
TL;DR
LaHA is a language-guided reinforcement learning framework that identifies discrete, informative image patches for few-shot learning, using a Vision Transformer as the RL agent. The agent is trained with a graph-based Baseline Module, a contrastive learning auxiliary task, and a vision-language model (VLM) reward that improves interpretability, by optimizing the combined loss $L_{tot} = \alpha L_{RL} + \beta L_{CL} + \gamma L_{BL}$. Empirical results on MiniImageNet, CIFAR-FS, FC-100, and CUB show consistent improvements over multiple FSL baselines, and ImageNet experiments demonstrate competitive hard-attention performance with larger $N_a$. The method reduces data size and computation while preserving accuracy, and the VLM reward provides semantic interpretability, making LaHA a versatile approach for efficient and interpretable visual learning.
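As a minimal sketch of how the combined loss above could be assembled, the snippet below forms the weighted sum of the three terms. The function name `total_loss` and the default weights are illustrative assumptions, not values taken from the paper; the three inputs stand for the already-computed RL, contrastive, and baseline losses.

```python
def total_loss(l_rl, l_cl, l_bl, alpha=1.0, beta=0.5, gamma=0.5):
    """Weighted sum L_tot = alpha * L_RL + beta * L_CL + gamma * L_BL.

    l_rl, l_cl, l_bl: the RL, contrastive-learning, and baseline losses,
    assumed to be computed elsewhere (plain floats or framework tensors).
    alpha, beta, gamma: hypothetical weights chosen only for illustration.
    """
    return alpha * l_rl + beta * l_cl + gamma * l_bl


# Example with placeholder loss values (not results from the paper).
example = total_loss(2.0, 4.0, 6.0)
```

Because the function uses only multiplication and addition, the same code should work unchanged on tensors from an autodiff framework, so that gradients flow to all three terms.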
Abstract
Attention mechanisms have demonstrated significant potential in enhancing learning models by identifying key portions of input data, particularly in scenarios with limited training samples. Inspired by human perception, we propose that focusing on essential data segments, rather than the entire dataset, can improve the accuracy and reliability of the learning models. However, identifying these critical data segments, or "hard attention finding," is challenging, especially in few-shot learning, due to the scarcity of training data and the complexity of model parameters. To address this, we introduce LaHA, a novel framework that leverages language-guided deep reinforcement learning to identify and utilize informative data regions, thereby improving both interpretability and performance. Extensive experiments on benchmark datasets validate the effectiveness of LaHA.
