Task-Driven Fixation Network: An Efficient Architecture with Fixation Selection

Shuguang Wang, Yuanjing Wang

TL;DR

Task-Driven Fixation Network (TDFN) addresses the efficiency gap in high-resolution visual processing by integrating a fixation-driven mechanism into a Transformer-based architecture. It combines a Low-Resolution Channel (LRC), a High-Resolution Channel (HRC), and a Hybrid Encoder (HE) connected via a Fixation Point Generator (FPG) that selects regions of interest; fixation points are obtained through Monte Carlo sampling of a saliency map produced from the rec_token. Training proceeds in two phases: the network first learns the task with random fixations, and the FPG is then updated by reinforcement learning with the reward $\mathrm{Reward}_{n} = \mathrm{TaskLoss}_{n-1} - \mathrm{TaskLoss}_{n}$ and the loss $L_{n} = -\mathrm{Reward}_{n} \cdot \log(p_{n})$, where the task objective is $\mathrm{TaskLoss} = \mathrm{ClassLoss} + \alpha \cdot \mathrm{ReconLoss}$. Experiments on MNIST show that selective high-resolution analysis substantially improves classification accuracy while keeping coverage and computation low, and that dynamic termination further reduces the number of fixation steps. Overall, TDFN demonstrates a scalable, task-specific approach that uses fixation-inspired attention to balance performance and efficiency in vision tasks, with potential extensions to detection and segmentation.
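
The sampling step and the two training formulas above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names are hypothetical, the saliency map is assumed to be a non-negative 2-D array, $p_{n}$ is read here as the probability the normalized map assigns to the sampled fixation, and the reward is treated as a constant, as in standard REINFORCE.

```python
import numpy as np

def sample_fixation(saliency, rng):
    """Monte Carlo sample of one fixation (row, col) from a saliency map.

    Assumption: the map is non-negative with a positive sum, and is
    normalized into a categorical distribution over pixel locations.
    Returns the location and the probability assigned to it (p_n).
    """
    probs = saliency.ravel() / saliency.sum()
    idx = rng.choice(probs.size, p=probs)
    row, col = np.unravel_index(idx, saliency.shape)
    return (int(row), int(col)), float(probs[idx])

def task_loss(class_loss, recon_loss, alpha):
    """TaskLoss = ClassLoss + alpha * ReconLoss."""
    return class_loss + alpha * recon_loss

def fpg_loss(prev_task_loss, curr_task_loss, p_n):
    """L_n = -Reward_n * log(p_n), with Reward_n = TaskLoss_{n-1} - TaskLoss_n."""
    reward = prev_task_loss - curr_task_loss
    return -reward * np.log(p_n)

# Usage: a map with all mass on one pixel always yields that pixel.
rng = np.random.default_rng(0)
location, p_n = sample_fixation(np.array([[0.0, 0.0], [0.0, 1.0]]), rng)
```

A fixation that lowers the task loss gives a positive reward, so minimizing $L_{n}$ raises the log-probability of that fixation; one that raises the task loss is pushed down.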

Abstract

This paper presents a novel neural network architecture featuring automatic fixation point selection, designed to efficiently address complex tasks with reduced network size and computational overhead. The proposed model consists of: a low-resolution channel that captures low-resolution global features from input images; a high-resolution channel that sequentially extracts localized high-resolution features; and a hybrid encoding module that integrates the features from both channels. A defining characteristic of the hybrid encoding module is the inclusion of a fixation point generator, which dynamically produces fixation points, enabling the high-resolution channel to focus on regions of interest. The fixation points are generated in a task-driven manner, enabling the automatic selection of regions of interest. This approach avoids exhaustive high-resolution analysis of the entire image, maintaining task performance and computational efficiency.
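
To make the two input pathways concrete, the sketch below shows one way the inputs to the two channels could be prepared. Everything specific in it is an assumption rather than a detail from the paper: average pooling for the low-resolution view, a square patch clamped to the image border for the high-resolution view, and the function names themselves.

```python
import numpy as np

def low_res_view(image, factor):
    """Global low-resolution view via average pooling (assumed method)."""
    h, w = image.shape
    h2, w2 = h // factor, w // factor
    trimmed = image[:h2 * factor, :w2 * factor]  # drop rows/cols that do not fill a block
    return trimmed.reshape(h2, factor, w2, factor).mean(axis=(1, 3))

def high_res_patch(image, center, size):
    """Full-resolution square patch around a fixation point.

    The window is shifted inward when the fixation lies near the border,
    so the patch always has shape (size, size). Assumes size <= image size.
    """
    h, w = image.shape
    top = min(max(center[0] - size // 2, 0), h - size)
    left = min(max(center[1] - size // 2, 0), w - size)
    return image[top:top + size, left:left + size]

# Usage on a toy 4x4 image.
image = np.arange(16.0).reshape(4, 4)
coarse = low_res_view(image, factor=2)           # shape (2, 2)
patch = high_res_patch(image, (3, 3), size=2)    # bottom-right corner
```

With a sequence of fixations, only the selected patches are processed at full resolution, which is where the savings over exhaustive high-resolution analysis come from.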

Paper Structure

This paper contains 13 sections, 4 equations, 2 figures, and 2 tables.

Figures (2)

  • Figure 1: TDFN architecture.
  • Figure 2: Visualization of Fixation Points. The first column shows the original input images. The second column displays the low-resolution images (odd rows) and the reconstructed images generated by TDFN using only the low-resolution inputs (even rows). The third to ninth columns sequentially present the fixation points generated by the FPG (odd rows, represented as light squares) and the reconstructed images generated by TDFN using both low-resolution inputs and high-resolution inputs from the fixation points (even rows). The reconstructed images are displayed to illustrate how the addition of fixation points introduces supplementary information.