MDS-ViTNet: Improving saliency prediction for Eye-Tracking with Vision Transformer

Polezhaev Ignat; Goncharenko Igor; Iurina Natalya

TL;DR

The paper tackles visual saliency prediction for eye-tracking. MDS-ViTNet replaces the CNN backbone with a Swin Transformer encoder and uses two parallel decoders that generate separate attention maps, which are then fused into the final output. It shows that a Swin-T backbone with multi-scale feature fusion and a CNNMerge fusion stage achieves state-of-the-art results on SALICON and CAT2000, outperforming earlier transformer-augmented approaches such as TranSalNet. Training uses a loss that balances distribution similarity against pixel-level accuracy, together with conventional data augmentations to improve generalization. The work offers a practical, transferable design with public code and datasets, advancing saliency modeling for applications in marketing, robotics, and healthcare.
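To make the idea of a loss that balances distribution similarity and pixel-level accuracy concrete, here is a minimal sketch in plain Python. It assumes KL divergence as the distribution term and mean squared error as the pixel term; this summary does not name the exact terms or weights, so `combined_loss`, `alpha`, and `beta` are hypothetical illustrations rather than the authors' formulation.

```python
import math

def normalize(values, eps=1e-8):
    """Scale a flattened saliency map so its values sum to (about) 1."""
    total = sum(values) + eps
    return [v / total for v in values]

def kl_divergence(pred, target, eps=1e-8):
    """KL divergence between target and predicted maps, treated as distributions."""
    p = normalize(pred)
    q = normalize(target)
    return sum(q_i * math.log(eps + q_i / (p_i + eps)) for p_i, q_i in zip(p, q))

def mse(pred, target):
    """Mean squared error over pixels."""
    return sum((p_i - t_i) ** 2 for p_i, t_i in zip(pred, target)) / len(pred)

def combined_loss(pred, target, alpha=1.0, beta=1.0):
    """Weighted sum of the two terms; alpha and beta are assumed, not from the paper."""
    return alpha * kl_divergence(pred, target) + beta * mse(pred, target)

# Flattened toy maps: the prediction puts its peak on the wrong pixel.
target = [0.1, 0.7, 0.2]
pred = [0.7, 0.1, 0.2]
loss_value = combined_loss(pred, target)
```

In this toy case the loss is close to zero when the prediction equals the target and grows as the predicted peak moves away from the fixated pixel.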

Abstract

In this paper, we present MDS-ViTNet (Multi Decoder Saliency by Vision Transformer Network), a novel methodology for improving visual saliency prediction for eye-tracking. This approach holds significant potential for diverse fields, including marketing, medicine, robotics, and retail. We propose a network architecture that leverages the Vision Transformer, moving beyond the conventional CNN backbone pretrained on ImageNet. The framework adopts an encoder-decoder structure, with the encoder using a Swin Transformer to efficiently embed the most important features. This process involves transfer learning, wherein layers from the Vision Transformer are converted by the encoder transformer and integrated into a CNN decoder. This methodology ensures minimal information loss from the original input image. The decoder stage employs a multi-decoding technique, using two decoders to generate two distinct attention maps. These maps are subsequently combined into a single output by an additional CNN model. Our trained MDS-ViTNet model achieves state-of-the-art results across several benchmarks. To foster further collaboration, we intend to make our code, models, and datasets publicly available.
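The final fusion step, in which two decoder maps are combined into one output, can be illustrated with a deliberately small stand-in. The sketch below assumes the merge acts like a single 1x1 convolution over the two stacked maps followed by a sigmoid; the abstract does not specify the merge network, so `fuse_maps` and its weights `w_a`, `w_b`, and `bias` are hypothetical and not the authors' model.

```python
import math

def fuse_maps(map_a, map_b, w_a=0.5, w_b=0.5, bias=0.0):
    """Fuse two equally sized 2D maps pixel by pixel.

    Equivalent to a 1x1 convolution over two input channels followed by a
    sigmoid. In a trained model the weights would be learned; here they are
    fixed, assumed values.
    """
    fused = []
    for row_a, row_b in zip(map_a, map_b):
        fused.append([
            1.0 / (1.0 + math.exp(-(w_a * a + w_b * b + bias)))
            for a, b in zip(row_a, row_b)
        ])
    return fused

# Toy 2x2 outputs standing in for the two decoders' attention maps.
decoder_one = [[0.0, 2.0], [1.0, -1.0]]
decoder_two = [[0.0, 4.0], [1.0, -3.0]]
saliency = fuse_maps(decoder_one, decoder_two)
```

The fused map keeps the spatial size of its inputs, and pixels that both decoders score highly end up with the highest fused values.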
