Visual Attention Prompted Prediction and Learning

Yifei Zhang, Siyi Gu, Bo Pan, Guangji Bai, Meikang Qiu, Xiaofeng Yang, Liang Zhao

TL;DR

This paper introduces a framework for visual attention-prompted prediction and learning that uses visual prompts to steer the model's reasoning process, together with a co-training approach that keeps the non-prompted and prompted models similar in parameters and activations.

Abstract

Visual explanation (attention)-guided learning uses not only labels but also explanations to guide the model's reasoning process. While visual attention-guided learning has shown promising results, it requires a large number of explanation annotations that are time-consuming to prepare. In many real-world situations, however, users want to prompt the model with visual attention without retraining it. For example, in AI-assisted cancer classification on a medical image, users (e.g., clinicians) can give the AI model a visual attention prompt indicating which areas are indispensable and which are precluded. Despite its promise, visual attention-prompted prediction presents several major challenges: 1) How can the visual prompt be effectively integrated into the model's reasoning process? 2) How should the model handle samples that lack visual prompts? 3) How is the model's performance affected when a visual prompt is imperfect? This paper introduces a novel framework for attention-prompted prediction and learning that uses visual prompts to steer the model's reasoning process. To improve performance in non-prompted situations and align it with prompted scenarios, we propose a co-training approach for the non-prompted and prompted models that ensures they share similar parameters and activations. Additionally, for instances where the visual prompt does not cover the entire input image, we develop attention prompt refinement methods that interpolate the incomplete prompts while maintaining alignment with the model's explanations. Extensive experiments on four datasets demonstrate the effectiveness of the proposed framework in improving predictions for samples both with and without prompts.
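
To make the clinician example concrete, here is a minimal sketch of how a three-level attention prompt (indispensable, precluded, undecided, as named in Figure 1) might be represented and applied to an image. The numeric encoding (1.0, 0.0, 0.5), the rectangular regions, the function names `build_prompt` and `apply_prompt`, and the element-wise masking step are all assumptions made for illustration; this page does not describe the integration mechanism the paper actually uses.

```python
import numpy as np

# Hypothetical encoding of the three prompt levels (not taken from the paper).
INDISPENSABLE, PRECLUDED, UNDECIDED = 1.0, 0.0, 0.5

def build_prompt(shape, indispensable, precluded):
    """Build a prompt map; each region is a box (row0, row1, col0, col1)."""
    prompt = np.full(shape, UNDECIDED, dtype=np.float32)
    for r0, r1, c0, c1 in indispensable:
        prompt[r0:r1, c0:c1] = INDISPENSABLE
    for r0, r1, c0, c1 in precluded:
        prompt[r0:r1, c0:c1] = PRECLUDED
    return prompt

def apply_prompt(image, prompt):
    """Assumed integration step: scale each pixel by its prompt value."""
    return image * prompt

# Toy 4x4 "image" with one indispensable and one precluded region.
image = np.ones((4, 4), dtype=np.float32)
prompt = build_prompt(
    image.shape,
    indispensable=[(0, 2, 0, 2)],
    precluded=[(2, 4, 2, 4)],
)
masked = apply_prompt(image, prompt)
```

Pixels outside both regions keep the undecided value, which is where an incomplete prompt would need the kind of refinement the abstract mentions.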

Paper Structure (17 sections, 11 equations, 4 figures, 3 tables, 1 algorithm)

Figures (4)

  • Figure 1: The comparison between attention-guided learning and attention-prompted prediction. (a) explanation-guided learning requires many user-annotated explanations to train the models, while (b) attention-prompted prediction enables users to directly guide the model's prediction process by telling the model which areas are "indispensable" (areas in red that look suspicious), "precluded" (areas in yellow that contain artifacts), and "undecided" (other areas).
  • Figure 2: Illustration of the Visual Attention Prompted Prediction and Learning Framework: (a) depicts our proposed Attention-Prompted Co-Training Mechanism, while (b) outlines the proposed Visual Attention Prompt Refinement Architecture.
  • Figure 3: Visualization of the proposed weights-learning function based on a constrained MLP architecture.
  • Figure 4: Sensitivity Analysis on the Pancreas dataset.