Semantic Document Derendering: SVG Reconstruction via Vision-Language Modeling

Adam Hazimeh, Ke Wang, Mark Collier, Gilles Baechler, Efi Kokiopoulou, Pascal Frossard

TL;DR

SliDer addresses semantic derendering of raster slides by converting them into editable SVGs with a vision-language model that iteratively refines its predictions. It introduces Slide2SVG, a real-world dataset of approximately 38k raster-SVG pairs from scientific presentations, to benchmark this task. The approach yields high perceptual fidelity (LPIPS $0.069$) and strong OCR accuracy, with human evaluators preferring SliDer over strong zero-shot baselines in pairwise judgments. This work restores editability to complex documents and opens avenues for extending semantic derendering to posters, infographics, and other media, where practical deployment will require balancing reconstruction fidelity against compute cost.

Abstract

Multimedia documents such as slide presentations and posters are designed to be interactive and easy to modify. Yet, they are often distributed in a static raster format, which limits editing and customization. Restoring their editability requires converting these raster images back into structured vector formats. However, existing geometric raster-vectorization methods, which rely on low-level primitives like curves and polygons, fall short at this task. Specifically, when applied to complex documents like slides, they fail to preserve the high-level structure, resulting in a flat collection of shapes where the semantic distinction between image and text elements is lost. To overcome this limitation, we address the problem of semantic document derendering by introducing SliDer, a novel framework that uses Vision-Language Models (VLMs) to derender slide images as compact and editable Scalable Vector Graphic (SVG) representations. SliDer detects and extracts attributes from individual image and text elements in a raster input and organizes them into a coherent SVG format. Crucially, the model iteratively refines its predictions during inference in a process analogous to human design, generating SVG code that more faithfully reconstructs the original raster upon rendering. Furthermore, we introduce Slide2SVG, a novel dataset comprising raster-SVG pairs of slide documents curated from real-world scientific presentations, to facilitate future research in this domain. Our results demonstrate that SliDer achieves a reconstruction LPIPS of 0.069 and is favored by human evaluators in 82.9% of cases compared to the strongest zero-shot VLM baseline.

Paper Structure

This paper contains 68 sections, 4 equations, 7 figures, 4 tables.

Figures (7)

  • Figure 1: SliDer derenders raster slides into an editable SVG-based format, allowing flexible edits to the slide such as adjusting figures and modifying text.
  • Figure 2: Overview of the SliDer inference pipeline. The VLM takes as input a raster slide, an instruction prompt, and an auxiliary SVG context, generating an editable SVG representation. The generated SVG can optionally be fed back to the VLM for iterative refinement. Given the final predicted SVG, the bounding box information is extracted to crop the image assets from the original raster into external PNG files. Finally, the slide is reconstructed by rendering the resulting SVG code.
  • Figure 3: Examples of derendered slide images. Each row contains a separate sample, showing the original raster slide and the reconstructions from the derendered SVGs by different methods. For SliDer, we show the YOLO-guided versions with one step of iterative refinement. "ZS" refers to zero-shot methods.
  • Figure 4: Qualitative examples for the ablations on the effect of bounding box information priors and iterative refinement during inference. We use the Gemini variant of SliDer. "YOLO" indicates that the model uses bounding box priors. "IR" indicates that the model performs one step of iterative refinement at inference time.
  • Figure 5: Distributions of (left) number of image assets per slide, (middle) number of text assets per slide, and (right) SVG token count in Slide2SVG. Most slides contain a small number of images and a moderate number of text elements, with a tail of more complex slides.
  • ...and 2 more figures
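
The asset-cropping step described in the Figure 2 caption (reading bounding boxes from the predicted SVG so that image assets can be cropped out of the original raster) can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `image_crop_boxes` and the sample SVG are hypothetical, and the sketch assumes that SVG user units coincide with raster pixels (no `viewBox` scaling or transforms) and that `x`, `y`, `width`, and `height` are plain numbers without units. It uses only the Python standard library and returns the pixel boxes; the crop itself would be done with an image library of choice.

```python
import xml.etree.ElementTree as ET

SVG_NS = "{http://www.w3.org/2000/svg}"


def image_crop_boxes(svg_text, raster_width, raster_height):
    """Return (left, top, right, bottom) pixel boxes, one per <image> element.

    Assumption (not from the paper): SVG coordinates map 1:1 to raster pixels.
    """
    root = ET.fromstring(svg_text)
    boxes = []
    for elem in root.iter():
        # Accept both namespaced and un-namespaced <image> tags.
        if elem.tag not in (SVG_NS + "image", "image"):
            continue
        x = float(elem.get("x", "0"))
        y = float(elem.get("y", "0"))
        w = float(elem.get("width", "0"))
        h = float(elem.get("height", "0"))
        # Clamp the box to the raster so a slightly off prediction stays valid.
        left = max(0, min(raster_width, round(x)))
        top = max(0, min(raster_height, round(y)))
        right = max(0, min(raster_width, round(x + w)))
        bottom = max(0, min(raster_height, round(y + h)))
        # Skip degenerate boxes that would produce an empty crop.
        if right > left and bottom > top:
            boxes.append((left, top, right, bottom))
    return boxes


# Hypothetical predicted SVG for a 960x540 slide: two images and one text element.
SAMPLE_SVG = (
    '<svg xmlns="http://www.w3.org/2000/svg" width="960" height="540">'
    '<image x="100" y="50" width="300" height="200" href="img_0.png"/>'
    '<text x="20" y="30">Title</text>'
    '<image x="800" y="400" width="300" height="300" href="img_1.png"/>'
    "</svg>"
)

boxes = image_crop_boxes(SAMPLE_SVG, 960, 540)
```

In this sample, the second image extends past the slide boundary, so its box is clamped to the raster size rather than rejected. Text elements are ignored because they remain editable SVG text and need no raster asset.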