Voxify3D: Pixel Art Meets Volumetric Rendering

Yi-Chuan Huang, Jiewen Chan, Hao-Jen Chien, Yu-Lun Liu

Abstract

Voxel art is a distinctive stylization widely used in games and digital media, yet automated generation from 3D meshes remains challenging due to conflicting requirements of geometric abstraction, semantic preservation, and discrete color coherence. Existing methods either over-simplify geometry or fail to achieve the pixel-precise, palette-constrained aesthetics of voxel art. We introduce Voxify3D, a differentiable two-stage framework bridging 3D mesh optimization with 2D pixel art supervision. Our core innovation lies in the synergistic integration of three components: (1) orthographic pixel art supervision that eliminates perspective distortion for precise voxel-pixel alignment; (2) patch-based CLIP alignment that preserves semantics across discretization levels; (3) palette-constrained Gumbel-Softmax quantization enabling differentiable optimization over discrete color spaces with controllable palette strategies. This integration addresses fundamental challenges: semantic preservation under extreme discretization, pixel-art aesthetics through volumetric rendering, and end-to-end discrete optimization. Experiments show superior performance (37.12 CLIP-IQA, 77.90% user preference) across diverse characters and controllable abstraction (2-8 colors, 20x-50x resolutions). Project page: https://yichuanh.github.io/Voxify-3D/

Paper Structure

This paper contains 31 sections, 13 equations, 17 figures, 8 tables.

Figures (17)

  • Figure 1: Existing methods often miss key features in voxelization. IN2N [haque2023instruct], Vox-E [sella2023vox], and Blender (Geometry Nodes) generate outputs that are coarse, blurry, or semantically inconsistent, and they frequently lose critical elements such as facial features. In contrast, our method preserves structural details and produces visually appealing voxel art with sharp abstraction.
  • Figure 2: Our two-stage voxel art generation pipeline. (a) Coarse voxel grid training: Given a 3D mesh, we render multi-view images and optimize a voxel-based radiance field (DVGO [sun2022direct]) using MSE loss to learn coarse RGB and density. (b) Orthographic pixel art fine-tuning: We refine the voxel grid using six orthographic pixel art views, which also serve to extract a discrete color palette (e.g., via k-means). Optimization includes appearance, depth, and alpha losses. (c) CLIP-guided optimization: A CLIP loss computed over rendered patches and mesh images encourages semantic alignment while being memory-efficient. (d) Differentiable discrete color selection via Gumbel-Softmax: Each voxel stores palette logits. Gumbel-Softmax enables differentiable sampling for end-to-end color optimization, yielding coherent, stylized voxel art.
  • Figure 3: Perspective vs. Orthographic. (Left) Six-view pixel art pipeline. (Right) Perspective views (red) misalign pixels, while six orthographic views (green) enable precise pixel–voxel alignment.
  • Figure 4: Qualitative comparisons on character models from the Rodin [wang2022rodin] dataset. We compare our voxel art results with Pixel art to 3D extension, IN2N [haque2023instruct], Vox-E [sella2023vox], and Blender's voxelization. Our method produces stylized yet consistent voxel representations with pixel art aesthetics.
  • Figure 5: Effect of Palette Selection and Color Count. Each row corresponds to a different palette extraction method: K-means, Max-Min, Median Cut, and Simulated Annealing. Each column shows increasing color counts (2, 3, 4, 8). Each method produces unique color clustering effects.
  • ...and 12 more figures
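
The discrete color selection described in Figure 2(d), where each voxel stores palette logits and Gumbel-Softmax picks a palette entry, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `gumbel_softmax` and `voxel_color`, the temperature parameter `tau`, and the example palette are all assumptions made for illustration. It shows the forward pass only; in a real training loop an autodiff framework would supply the gradients that make the relaxation useful.

```python
import math
import random


def gumbel_softmax(logits, tau=1.0, rng=random):
    """Return relaxed one-hot weights over palette entries (forward pass only).

    `logits` are one voxel's per-palette-entry scores; `tau` is an assumed
    temperature: lower values push the weights closer to a one-hot vector.
    """
    noisy = []
    for logit in logits:
        # Gumbel(0, 1) noise: g = -log(-log(u)), with u clamped away from 0 and 1.
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        gumbel = -math.log(-math.log(u))
        noisy.append((logit + gumbel) / tau)
    # Numerically stable softmax over the perturbed, temperature-scaled logits.
    peak = max(noisy)
    exps = [math.exp(v - peak) for v in noisy]
    total = sum(exps)
    return [e / total for e in exps]


def voxel_color(logits, palette, tau=1.0, hard=False, rng=random):
    """Map one voxel's palette logits to an RGB color.

    hard=False blends palette colors by the relaxed weights;
    hard=True snaps to the single highest-weight palette entry.
    """
    weights = gumbel_softmax(logits, tau, rng)
    if hard:
        best = max(range(len(weights)), key=weights.__getitem__)
        return palette[best]
    return tuple(
        sum(w * color[ch] for w, color in zip(weights, palette))
        for ch in range(3)
    )


if __name__ == "__main__":
    # Hypothetical 3-color palette (RGB in [0, 1]); a real palette would be
    # extracted from the pixel art views, e.g. via k-means.
    palette = [(0.0, 0.0, 0.0), (1.0, 1.0, 1.0), (1.0, 0.0, 0.0)]
    rng = random.Random(0)
    print(voxel_color([0.2, 0.1, 3.0], palette, tau=0.5, rng=rng))
    print(voxel_color([0.2, 0.1, 3.0], palette, tau=0.5, hard=True, rng=rng))
```

The soft output is a blend that remains differentiable with respect to the logits, while the hard output is a single palette color suitable for the final voxel art. How the two modes are combined during optimization (for example, with a straight-through estimator or temperature annealing) is not specified in this excerpt.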