From Dead Pixels to Editable Slides: Infographic Reconstruction into Native Google Slides via Vision-Language Region Understanding

Leonardo Gonzalez

TL;DR

This work converts raster infographics into editable, native slides. It introduces Images2Slides, an API-based pipeline that uses a vision-language model to extract a region-grounded representation of the image and then rebuilds the slide through the Google Slides API. A strict region JSON schema and deterministic postprocessing keep the pipeline model-agnostic, while typography calibration and collision-aware layout adjustments improve readability and layout fidelity. On a controlled benchmark of 29 programmatically generated slides, the method achieves an overall element recovery of $0.989 \pm 0.057$, with text CER $0.033 \pm 0.149$ and image IoU $0.644 \pm 0.131$, indicating strong editability and reasonable layout preservation despite raster inputs. The work also covers practical engineering considerations, such as background synthesis, deterministic IDs for retries, and asset deduplication, and discusses limitations including the raster-to-native gap and dependence on region quality, pointing toward more faithful vectorization of infographic content for authoring and localization.
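The summary mentions deterministic IDs for retries without giving the scheme. A minimal sketch of one way this could work, assuming IDs are derived by hashing an image fingerprint together with a region identifier; the function name `deterministic_object_id`, the `i2s` prefix, and the choice of SHA-1 are illustrative assumptions, not the paper's implementation. The output is kept within the Google Slides object ID constraints (5 to 50 characters, starting with an alphanumeric character or underscore).

```python
import hashlib


def deterministic_object_id(image_fingerprint: str, region_id: str,
                            prefix: str = "i2s") -> str:
    """Derive a stable Slides object ID from the source image and region.

    Hypothetical helper: the same (image, region) pair always yields the
    same ID, so a retried request reuses the ID instead of minting a new one.
    """
    key = f"{image_fingerprint}:{region_id}".encode("utf-8")
    digest = hashlib.sha1(key).hexdigest()
    # prefix + "_" + 24 hex characters stays well under the 50-character limit.
    return f"{prefix}_{digest[:24]}"


# Same inputs give the same ID; a different region gives a different ID.
first = deterministic_object_id("img-abc", "region-1")
retry = deterministic_object_id("img-abc", "region-1")
other = deterministic_object_id("img-abc", "region-2")
```

With stable IDs, a retried create request refers to the same object rather than producing a second copy, which makes partial failures easier to detect and recover from.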

Abstract

Infographics are widely used to communicate information with a combination of text, icons, and data visualizations, but once exported as images their content is locked into pixels, making updates, localization, and reuse expensive. We describe \textsc{Images2Slides}, an API-based pipeline that converts a static infographic (PNG/JPG) into a native, editable Google Slides slide by extracting a region-level specification with a vision-language model (VLM), mapping pixel geometry into slide coordinates, and recreating elements using the Google Slides batch update API. The system is model-agnostic and supports multiple VLM backends via a common JSON region schema and deterministic postprocessing. On a controlled benchmark of 29 programmatically generated infographic slides with known ground-truth regions, \textsc{Images2Slides} achieves an overall element recovery rate of $0.989\pm0.057$ (text: $0.985\pm0.083$, images: $1.000\pm0.000$), with mean text transcription error $\mathrm{CER}=0.033\pm0.149$ and mean layout fidelity $\mathrm{IoU}=0.364\pm0.161$ for text regions and $0.644\pm0.131$ for image regions. We also highlight practical engineering challenges in reconstruction, including text size calibration and non-uniform backgrounds, and describe failure modes that guide future work.
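The two fidelity metrics reported above can be sketched as follows. This is a minimal illustration that assumes CER is the Levenshtein distance divided by the reference length and that IoU is computed on axis-aligned boxes given as `(x, y, width, height)`; the paper's exact normalization and matching procedure are not stated in this section, and the function names are hypothetical.

```python
from typing import Tuple

Box = Tuple[float, float, float, float]  # (x, y, width, height)


def levenshtein(a: str, b: str) -> int:
    """Edit distance with unit-cost insertion, deletion, and substitution."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute, or free match
            ))
        prev = curr
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: edit distance over reference length."""
    if not reference:
        # Assumed convention for an empty reference string.
        return 0.0 if not hypothesis else 1.0
    return levenshtein(reference, hypothesis) / len(reference)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    iw = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    ih = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = iw * ih
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0
```

Under these definitions CER can exceed 1 when the hypothesis is much longer than the reference, which is one way a small number of badly transcribed regions could produce a standard deviation larger than the mean.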

Paper Structure (33 sections, 3 equations, 2 figures, 4 tables)

This paper contains 33 sections, 3 equations, 2 figures, 4 tables.

Introduction
Related Work
Multimodal document and infographic understanding.
Region extraction and layout reconstruction.
Visual-to-structured generation and derendering.
Problem Formulation
System Overview
VLM-based region extraction
Region JSON schema
Layout postprocessing
Geometry mapping
Slide reconstruction via Google Slides API
Deterministic IDs for retries.
Typography Calibration
Piecewise-linear font scaling
...and 18 more sections

Figures (2)

Figure 1: Layered architecture of Images2Slides. The pipeline is organized into input, analysis, processing, asset management, slides generation, and output layers.
Figure 2: End-to-end example reconstruction produced by Images2Slides.
