Task-Driven Fixation Network: An Efficient Architecture with Fixation Selection
Shuguang Wang, Yuanjing Wang
TL;DR
Task-Driven Fixation Network (TDFN) addresses the efficiency gap in high-resolution visual processing by integrating a fixation-driven mechanism into a Transformer-based architecture. It combines a Low-Resolution Channel (LRC), a High-Resolution Channel (HRC), and a Hybrid Encoder (HE) connected via a Fixation Point Generator (FPG) that selects regions of interest; fixation points are obtained through Monte Carlo sampling of a saliency map produced from the rec_token. Training proceeds in two phases: initial task learning with random fixations and a subsequent reinforcement-learning update of the FPG using rewards $Reward_{n} = TaskLoss_{n-1} - TaskLoss_{n}$ and $L_{n} = -Reward_{n} \, \cdot \, \log(p_{n})$, with the objective $TaskLoss = ClassLoss + \alpha \cdot ReconLoss$. Experiments on MNIST show that selective high-resolution analysis substantially improves classification accuracy while keeping coverage and computation low, and dynamic termination further reduces fixation steps. Overall, TDFN demonstrates a scalable, task-specific approach that leverages fixation-inspired attention to balance performance and efficiency in vision tasks, with potential extensions to detection and segmentation.
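The reward and policy-loss formulas above can be illustrated with a minimal sketch in plain Python. The function names, the default value of `alpha`, and the convention that the first entry of `task_losses` is the loss measured before any high-resolution fixation are assumptions made for this illustration; the summary does not specify them.

```python
import math


def task_loss(class_loss, recon_loss, alpha=1.0):
    """TaskLoss = ClassLoss + alpha * ReconLoss (the alpha default is an assumption)."""
    return class_loss + alpha * recon_loss


def fixation_policy_losses(task_losses, log_probs):
    """Per-step losses L_n = -Reward_n * log(p_n).

    task_losses[n] is the task loss after n fixations; index 0 is assumed to be
    the loss before any high-resolution fixation. log_probs[n - 1] is log(p_n),
    the log-probability of the fixation chosen at step n.
    """
    if len(task_losses) != len(log_probs) + 1:
        raise ValueError("need one more task loss than log-probabilities")
    losses = []
    for n, log_p in enumerate(log_probs, start=1):
        # Reward_n = TaskLoss_{n-1} - TaskLoss_n: positive when the fixation helped.
        reward = task_losses[n - 1] - task_losses[n]
        losses.append(-reward * log_p)
    return losses


# Hypothetical numbers: the first fixation lowers the loss, the second raises it.
example_losses = fixation_policy_losses(
    [2.0, 1.5, 1.6], [math.log(0.5), math.log(0.25)]
)
```

In a gradient-based framework the reward would normally be treated as a constant, so that minimizing $L_{n}$ raises the probability of fixations that reduced the task loss and lowers it for fixations that increased it.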
Abstract
This paper presents a novel neural network architecture featuring automatic fixation point selection, designed to efficiently address complex tasks with reduced network size and computational overhead. The proposed model consists of a low-resolution channel that captures global features from a low-resolution view of the input image, a high-resolution channel that sequentially extracts localized high-resolution features, and a hybrid encoding module that integrates the features from both channels. A defining characteristic of the hybrid encoding module is its fixation point generator, which dynamically produces fixation points in a task-driven manner so that the high-resolution channel focuses on automatically selected regions of interest. This approach avoids exhaustive high-resolution analysis of the entire image, preserving task performance while keeping computational cost low.
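To make the fixation point generator more concrete, the sketch below draws a single fixation from a saliency map by Monte Carlo sampling, as described in the TL;DR. It is an illustration only: the nested-list representation of the map, the function name `sample_fixation`, and the normalization of saliency values into probabilities are assumptions, not details taken from the paper.

```python
import random


def sample_fixation(saliency, rng):
    """Sample one (row, col) cell with probability proportional to its saliency.

    saliency is a 2D list of non-negative weights (an assumed representation).
    Returns (row, col, p), where p is the probability of the sampled cell.
    """
    flat = [w for row in saliency for w in row]
    total = sum(flat)
    if total <= 0:
        raise ValueError("saliency map must contain positive mass")
    width = len(saliency[0])
    # random.Random.choices draws indices in proportion to the given weights.
    idx = rng.choices(range(len(flat)), weights=flat, k=1)[0]
    return idx // width, idx % width, flat[idx] / total


rng = random.Random(0)  # seeded only to make the example reproducible
row, col, p = sample_fixation([[0.1, 0.2], [0.3, 0.4]], rng)
```

The returned probability `p` plays the role of $p_{n}$ in a policy-gradient update of the generator, and the sampled cell marks where the high-resolution channel would crop its next patch.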
