Predicting Visual Attention in Graphic Design Documents

Souradeep Chakraborty, Zijun Wei, Conor Kelton, Seoyoung Ahn, Aruna Balasubramanian, Gregory J. Zelinsky, Dimitris Samaras

TL;DR

This work addresses predicting visual attention in graphic design documents by introducing AGD, a two-stage framework that first estimates component-wise fixation density maps conditioned on page layout and then predicts scanpaths via inverse reinforcement learning, using those maps as state representations. The Stage 1 CSPNet encodes both image features and a layout embedding to produce saliency maps for faces, text, logos, banners, and images, which are fused into a final saliency map; Stage 2 uses Generative Adversarial Imitation Learning to generate human-like scanpaths guided by dynamic component beliefs. A large WebSaliency dataset (450 webpages, 41 subjects) is introduced, enabling robust training and layout-aware clustering. The model generalizes across comics, posters, mobile UIs, and even natural images, outperforming several baselines in saliency and scanpath prediction and offering interpretability through component-level saliency and layout conditioning. These advances have practical implications for design evaluation, adaptive content delivery, and cross-domain attention modeling in complex graphic documents.

Abstract

We present a model for predicting visual attention during the free viewing of graphic design documents. While existing works on this topic have aimed at predicting the static saliency of graphic designs, our work is the first attempt to predict both spatial attention and the dynamic temporal order in which document regions are fixated by gaze, using a deep-learning-based model. We propose a two-stage model for predicting dynamic attention on such documents, with webpages being our primary choice of document design for demonstration. In the first stage, we predict the saliency maps for each of the document components (e.g., logos, banners, and text for webpages) conditioned on the type of document layout. These component saliency maps are then jointly used to predict the overall document saliency. In the second stage, we use these layout-specific component saliency maps as the state representation for an inverse reinforcement learning model of fixation scanpath prediction during document viewing. To test our model, we collected a new dataset consisting of eye movements from 41 people freely viewing 450 webpages (the largest dataset of its kind). Experimental results show that our model outperforms existing models in both saliency and scanpath prediction for webpages, and also generalizes very well to other graphic design documents, such as comics, posters, and mobile UIs, as well as to natural images.
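To make the first stage concrete, the following is a minimal sketch of how per-component saliency maps could be combined into a single document-level map. It is an illustration under stated assumptions, not the authors' method: the paper fuses the component maps with learned convolution layers, whereas this sketch substitutes a fixed weighted sum followed by normalization. The function name `fuse_component_maps` and the `weights` argument are hypothetical.

```python
import numpy as np


def fuse_component_maps(component_maps, weights=None):
    """Combine per-component saliency maps into one normalized map.

    component_maps: dict mapping a component name (e.g. "text", "logo")
        to a 2-D non-negative array; all arrays share the same shape.
    weights: optional dict mapping component name to a scalar weight.
        Hypothetical stand-in for the learned fusion described in the paper.
    """
    names = sorted(component_maps)
    stack = np.stack([np.asarray(component_maps[n], dtype=float) for n in names])
    if weights is None:
        w = np.ones(len(names))
    else:
        w = np.array([weights[n] for n in names], dtype=float)

    # Weighted sum over the component axis gives one (height, width) map.
    fused = np.tensordot(w, stack, axes=1)

    total = fused.sum()
    if total <= 0:
        # No evidence from any component: fall back to a uniform map.
        return np.full(fused.shape, 1.0 / fused.size)
    # Normalize so the map sums to 1, like a fixation density.
    return fused / total


if __name__ == "__main__":
    maps = {
        "text": [[1.0, 0.0], [0.0, 0.0]],
        "logo": [[0.0, 0.0], [0.0, 1.0]],
    }
    print(fuse_component_maps(maps, weights={"text": 3.0, "logo": 1.0}))
```

A fixed weighted sum keeps the example short and inspectable; a learned fusion can model interactions between components that a linear combination cannot.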

Paper Structure (23 sections, 2 equations, 17 figures, 9 tables)

Figures (17)

  • Figure 1: Overlaid predicted fixation density maps of the SAM-ResNet model and our AGD model (trained on our WebSaliency dataset) on the ieee.org webpage (from FiWI shen2014webpage), a mobile UI image (from Mobile UI gupta2018saliency), a comics image (from DeepComics bannier2018deepcomics) and a natural image (from MIT 1003 judd2009learning), and scanpaths for the webpage and the mobile UI instance. Our predictions more closely resemble the ground truth, and our model generalizes well across different types of graphic designs although it is trained to predict fixations on webpage images.
  • Figure 2: Our graphic design attention prediction model, AGD. For saliency map prediction (Stage 1), we first extract the image representations using a dilated residual network (with ResNet-50 as the backbone) and combine them with the page layout representations obtained using the Layout Encoding Network (LEnNet) to form the $R_{combo}$ representation, which is input to the dilated inception module network as the encoder. Fixation density maps (FDMs) of the document components are then predicted using the corresponding image decoders, then combined and passed through two $3\times 3$ convolution layers to form the final saliency map, $Sal_{Final}$. For scanpath prediction (Stage 2), the component FDMs from Stage 1 are used for constructing the dynamic component belief maps, which are discretized sequentially into binary fixation masks. The state of the inverse reinforcement learning (IRL) model is updated as depicted. The IRL agent is trained to predict the scanpath $S$ by applying Inhibition-of-Return on the intermediate fixation probability map (see Section \ref{sec:scanpath_method} for details).
  • Figure 3: Comparison of predicted face and text FDMs of a natural saliency model UAVDVSM he2019understanding and AGD-F. Striped yellow boxes indicate the inaccuracies in UAVDVSM predictions. AGD-F produces more accurate salient component FDMs.
  • Figure 4: Pipeline to obtain the page cluster from the original image using connected PageSegNet and PageEncoder networks.
  • Figure 5: t-SNE plot of (a) webpage clusters, (b) poster clusters, (c) mobile UI clusters obtained from our PageEncoder network.
  • ...and 12 more figures
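
The Inhibition-of-Return step mentioned in the Figure 2 caption can be illustrated with a small sketch. This is not the paper's Stage 2 model: the learned IRL policy is replaced here by a greedy argmax over a fixed fixation probability map, and the function name `greedy_scanpath` and the parameters `n_fixations` and `ior_radius` are hypothetical. The sketch only shows how suppressing a neighbourhood around each chosen fixation pushes later fixations to new regions.

```python
import numpy as np


def greedy_scanpath(prob_map, n_fixations=5, ior_radius=1):
    """Pick fixations greedily, zeroing a disc around each chosen location.

    prob_map: 2-D non-negative array of fixation probabilities.
    n_fixations: maximum number of fixations to return (assumed parameter).
    ior_radius: radius in pixels of the suppressed disc (assumed parameter).
    Returns a list of (row, col) tuples in fixation order.
    """
    p = np.array(prob_map, dtype=float)  # work on a copy
    height, width = p.shape
    rows, cols = np.mgrid[0:height, 0:width]

    scanpath = []
    for _ in range(n_fixations):
        if p.max() <= 0:
            break  # everything has been suppressed
        r, c = np.unravel_index(np.argmax(p), p.shape)
        scanpath.append((int(r), int(c)))
        # Inhibition-of-Return: suppress the disc around the new fixation.
        disc = (rows - r) ** 2 + (cols - c) ** 2 <= ior_radius ** 2
        p[disc] = 0.0
    return scanpath


if __name__ == "__main__":
    demo = np.zeros((5, 5))
    demo[0, 0], demo[0, 1], demo[4, 4], demo[2, 2] = 0.5, 0.4, 0.3, 0.2
    print(greedy_scanpath(demo))
```

In the demo, the second-highest value at (0, 1) is never fixated because it falls inside the disc suppressed after the first fixation at (0, 0).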