
Connecting the Dots: Surfacing Structure in Documents through AI-Generated Cross-Modal Links

Alyssa Hwang, Hita Kambhamettu, Yue Yang, Ajay Patel, Joseph Chee Chang, Andrew Head

TL;DR

The paper tackles the cognitive difficulty of understanding dense, multimodal documents by proposing a general framework for fine-grained integration of information across text and visuals. It defines two primitives, entities and links, and instantiates them in an augmented reading interface featuring figure points, highlighted phrases, a persistent reference panel, and a visual index. Through formative and comparative user studies, the approach yields statistically significant improvements in reading quiz performance without increasing time or cognitive load, while highlighting user preferences for cross-modal linking components. The work demonstrates the potential of treating complex documents as networks of localized details that can be surfaced and navigated across modalities, with implications for scalable comprehension of scientific literature.

Abstract

Understanding information-dense documents like recipes and scientific papers requires readers to find, interpret, and connect details scattered across text, figures, tables, and other visual elements. These documents are often long and filled with specialized terminology, hindering the ability to locate relevant information or piece together related ideas. Existing tools offer limited support for synthesizing information across media types. As a result, understanding complex material remains cognitively demanding. This paper presents a framework for fine-grained integration of information in complex documents. We instantiate the framework in an augmented reading interface, which populates a scientific paper with clickable points on figures, interactive highlights in the body text, and a persistent reference panel for accessing consolidated details without manual scrolling. In a controlled between-subjects study, we find that participants who read the paper with our tool achieved significantly higher scores on a reading quiz without evidence of increased time to completion or cognitive load. Fine-grained integration provides a systematic way of revealing relationships within a document, supporting engagement with complex, information-dense materials.

Paper Structure (53 sections, 15 figures, 10 tables)

Figures (15)

  • Figure 1: Framework design. Our framework for fine-grained integration consists of entities (segments in ovals) and links (gray curves between entities). See Section \ref{sec:design} for more details. "Music" image from Indygo at flaticons.com.
  • Figure 2: AI data generation pipeline. To generate data for our interface, we extract all figure images, captions, and referring passages. A multimodal OpenAI model identifies salient visual entities in figures and corresponding references in text, which are visualized as purple circles and highlighted phrases. In a separate pass, we provide the full paper as input to generate descriptions for all of the entities. The visual entities, textual references, and descriptions are integrated into the interface as interactive points, highlights, figure scans, and other affordances. Additional details are provided in Section \ref{sec:system}.
  • Figure 3: Opening view of interface. When it is initially launched, the interface shows a toolbar across the top and the visual index along the far right, with the paper taking up the majority of the screen. "Linking mode," which displays the purple points and highlights, is activated by default. Main affordances are detailed in Section \ref{sec:system_affordances}.
  • Figure 4: Consolidation via reference panel. The reference panel offers a parallel working area for additional information. It opens to the right of the paper when the user clicks on a figure point or highlighted phrase. It is persistent, allowing users to continue scrolling in the main paper while providing quick access to a zoom- and pan-enabled copy of the figure (top), description of the selected entity ("Tell me more about this"), and links to related passages (bottom). The passages can be expanded and read in the reference panel without needing to scroll. When desired, the user can click on the passage to jump directly to it in the main paper.
  • Figure 5: Navigation via passage links and visual index. Our interface offers two additional affordances for navigating papers, both of which involve directly jumping to a meaningful location. The first way is to click on a passage excerpt in the reference panel (see Figure \ref{fig:rp2} for more details on the reference panel). The other navigational affordance is the visual index on the far right. Users can click on one of the figures in the visual index to jump to it in the main paper.
  • ...and 10 more figures