SD-$\pi$XL: Generating Low-Resolution Quantized Imagery via Score Distillation

Alexandre Binninger, Olga Sorkine-Hornung

TL;DR

SD-$\pi$XL generates low-resolution, color-quantized imagery under hard palette and resolution constraints by combining a differentiable image generator with score distillation sampling from pretrained diffusion models. It parameterizes the output as an $H \times W \times n$ logits tensor and uses a Gumbel-Softmax mechanism to sample discrete palette elements, yielding crisp pixel art while maintaining differentiability. The approach takes a text prompt and, optionally, spatial conditioning via ControlNet (edges and depth). It strictly enforces palette adherence and supports arbitrary output resolutions, using a loss that combines semantic guidance with FFT-based smoothness. Empirically, SD-$\pi$XL achieves state-of-the-art performance in quantized image generation and demonstrates practical fabrication applications, such as embroidery, fuse beads, and interlocking-brick designs. Limitations include optimization speed and per-pixel independence; future work focuses on faster convergence, image-only conditioning, and improved frame-to-frame consistency for animation.

Abstract

Low-resolution quantized imagery, such as pixel art, is seeing a revival in modern applications ranging from video game graphics to digital design and fabrication, where creativity is often bound by a limited palette of elemental units. Despite their growing popularity, the automated generation of quantized images from raw inputs remains a significant challenge, often necessitating intensive manual input. We introduce SD-$\pi$XL, an approach for producing quantized images that employs score distillation sampling in conjunction with a differentiable image generator. Our method enables users to input a prompt and optionally an image for spatial conditioning, set any desired output size $H \times W$, and choose a palette of $n$ colors or elements. Each color corresponds to a distinct class for our generator, which operates on an $H \times W \times n$ tensor. We adopt a softmax approach, computing a convex sum of elements, thus rendering the process differentiable and amenable to backpropagation. We show that employing Gumbel-softmax reparameterization allows for crisp pixel art effects. Unique to our method is the ability to transform input images into low-resolution, quantized versions while retaining their key semantic features. Our experiments validate SD-$\pi$XL's performance in creating visually pleasing and faithful representations, consistently outperforming the current state-of-the-art. Furthermore, we showcase SD-$\pi$XL's practical utility in fabrication through its applications in interlocking brick mosaic, beading and embroidery design.
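
The generator described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names, the temperature argument `tau`, and the choice to take the hard argmax over the raw logits are assumptions made for illustration. It shows how adding Gumbel noise to an $H \times W \times n$ logits tensor and applying a softmax gives per-pixel convex weights over the palette, and how an argmax gives an image that uses palette colors only.

```python
import numpy as np

def gumbel_softmax_weights(logits, tau=1.0, rng=None):
    """Per-pixel palette weights from (H, W, n) logits; tau is an assumed temperature."""
    rng = np.random.default_rng() if rng is None else rng
    # Gumbel(0, 1) samples via the inverse transform G = -log(-log(U)).
    u = rng.uniform(low=1e-12, high=1.0, size=logits.shape)
    g = -np.log(-np.log(u))
    z = (logits + g) / tau
    z = z - z.max(axis=-1, keepdims=True)  # stabilize the exponentials
    w = np.exp(z)
    return w / w.sum(axis=-1, keepdims=True)

def render_soft(logits, palette, tau=1.0, rng=None):
    """Convex combination of palette colors; palette has shape (n, 3)."""
    return gumbel_softmax_weights(logits, tau, rng) @ palette

def render_hard(logits, palette):
    """Strict palette adherence: pick the highest-logit color per pixel."""
    return palette[logits.argmax(axis=-1)]
```

In this sketch, `render_soft` stays differentiable with respect to the logits when ported to an autodiff framework, while `render_hard` is the quantized image one would export at the end of the optimization.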

Paper Structure (23 sections, 12 equations, 9 figures, 3 tables)

Figures (9)

  • Figure 1: SD-$\pi$XL generates low-resolution quantized images that are suitable for many fabrication applications, such as cross-stitch embroidery, fuse beads, or interlocking brick designs. The result image size is $48 \times 48$ pixels, generated without an initialization image, and only conditioned on the prompt "A rose flower. The branch and leaves are visible."
  • Figure 2: Diffusion models allow for the generation of high-resolution images (1). While using a diffusion-based image translation [saharia2022palette, podell2023sdxl] with prompt-guided style is ineffective (2), fine-tuning the model for pixelized effects [nerijspixelartxl2023] (3) is not generalizable across styles and requires retraining for different resolutions. VectorFusion [jain2022vectorfusion] solves the resolution issue, but does not closely follow the input image (4). Our method supports outputs in any size and applies constraints to a finite palette (5), which can be enforced through either soft (6) or hard constraints (7). Color quantization further emphasizes the pixel art effect and is crucial for some fabrication applications, such as embroidery (8).
  • Figure 3: Visualization of the optimization process for generating a pixelized $H \times W$ image with a color palette of size $n$. If an input image is provided, the process starts with initializing the logits $\lambda_{i, j, k}$ by downsampling the input image and matching each pixel to the nearest palette color. Otherwise, the logits are randomly initialized. Next, Gumbel-distributed random variables $G_{i, j, k}$ are added to the logits. Applying a softmax function and combining the palette colors weighted by $s_{i, j, k}(\tau)$ yields an output image $x$. This $x$, the Canny edge map [cannyedge] and an estimated depth map [DPTF2021Ranftl] of the input image are then augmented and used in a latent diffusion model [podell2023sdxl] to compute a semantic loss $\nabla_\theta \mathcal{L}_{\mathit{LSDS}}$, conditioned on an input prompt $y$. Additionally, a smoothness loss $\mathcal{L}_{\mathit{FFT}}$ derived from $x$ is used to optimize the parameters $\theta$.
  • Figure 4: Our image generator can strictly adhere to the input palette using an argmax function (bottom frog). Using softmax yields an image whose pixel colors lie in the convex hull of the input palette, leading to less crisp, pixelized outputs (top frog).
  • Figure 5: We show SD-$\pi$XL's results with the Gumbel-softmax reparameterization (first row) and without (second row) during the optimization. The argmax-generation, the softmax-generation, the entropy per pixel and the average normalized entropy over time are displayed. Images are $64 \times 64$ pixels. The average normalized entropy is shown for 30,000 steps to ensure that the obtained results are not due to an early stop.
  • ...and 4 more figures
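
The initialization step in the Figure 3 caption and the entropy curves in Figure 5 can be sketched as follows. This is an illustrative NumPy sketch under stated assumptions, not the paper's code: box-filter downsampling, squared RGB distance as the nearest-color metric, the hypothetical `boost` value written into the matched logit, and entropy normalized by $\log n$ are all choices made here, since the captions do not specify them.

```python
import numpy as np

def init_logits(image, palette, out_h, out_w, boost=5.0):
    """Initialize (out_h, out_w, n) logits from an (H0, W0, 3) image.

    Assumes H0 >= out_h and W0 >= out_w; `boost` is a hypothetical value
    assigned to the logit of the nearest palette color.
    """
    h0, w0, _ = image.shape
    fh, fw = h0 // out_h, w0 // out_w
    # Box downsampling (assumed filter): average each fh x fw block.
    cropped = image[: out_h * fh, : out_w * fw]
    small = cropped.reshape(out_h, fh, out_w, fw, 3).mean(axis=(1, 3))
    # Squared RGB distance from every pixel to every palette color.
    d2 = ((small[:, :, None, :] - palette[None, None, :, :]) ** 2).sum(axis=-1)
    nearest = d2.argmin(axis=-1)
    logits = np.zeros((out_h, out_w, len(palette)))
    np.put_along_axis(logits, nearest[..., None], boost, axis=-1)
    return logits

def normalized_entropy(logits):
    """Per-pixel entropy of softmax(logits), divided by log(n) so it lies in [0, 1]."""
    z = logits - logits.max(axis=-1, keepdims=True)
    p = np.exp(z)
    p /= p.sum(axis=-1, keepdims=True)
    ent = -(p * np.log(p + 1e-12)).sum(axis=-1)
    return ent / np.log(logits.shape[-1])
```

Averaging `normalized_entropy(logits)` over all pixels gives a single scalar per optimization step; a value near 0 means every pixel has committed to one palette color, and a value near 1 means the per-pixel distributions are close to uniform.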