GeoPos: A Minimal Positional Encoding for Enhanced Fine-Grained Details in Image Synthesis Using Convolutional Neural Networks

Mehran Hosseini, Peyman Hosseini

TL;DR

The paper identifies a fundamental limitation in CNN-based image synthesis: difficulty in capturing fine geometric details. It introduces Geometry-aware convolution (GeoConv) by appending a single GeoPos channel encoding the $n$-dimensional Cartesian coordinates and applying random coordinate shifts to learn relative geometry, reducing reliance on absolute position. The authors prove theoretical results on bias and equivalence, and demonstrate across GANs, VAEs, and monocular depth estimation that GeoConv yields more realistic, diverse, and stable outputs than Conv and CoordConv baselines. This approach promises practical gains for fine-grained detail synthesis and could enhance large-scale generative models, while highlighting the need to address potential misuse and scalability concerns.

Abstract

The inability of image generative models to recreate intricate geometric features, such as those present in human hands and fingers, has been an ongoing problem in image generation for nearly a decade. While strides have been made by increasing model sizes and diversifying training datasets, this issue remains prevalent across all models, from denoising diffusion models to Generative Adversarial Networks (GAN), pointing to a fundamental shortcoming in the underlying architectures. In this paper, we demonstrate how this problem can be mitigated by augmenting the geometric capabilities of convolution layers: each layer is given a single input channel incorporating the relative $n$-dimensional Cartesian coordinate system. We show this drastically improves the quality of images generated by Diffusion Models, GANs, and Variational AutoEncoders (VAE).

Paper Structure (58 sections, 3 theorems, 12 equations, 33 figures, 9 tables)

This paper contains 58 sections, 3 theorems, 12 equations, 33 figures, 9 tables.

Key Result

Theorem 2.1

When using random shift, GeoConv learns the relative positional information rather than the absolute positional information, as in CoordConv.
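
One informal way to read this statement (an intuition only, not the paper's proof): assume the random shift is a single scalar $s$ added to every entry of the geometry channel $g$, as the caption of Figure 2 suggests. The symbols $g$, $g_s$, $s$, $p$, and $q$ are notation introduced here for illustration. For any two positions $p$ and $q$,

$$
g_s(p) - g_s(q) = \bigl(g(p) + s\bigr) - \bigl(g(q) + s\bigr) = g(p) - g(q).
$$

Under that assumption, the absolute value $g_s(p)$ changes with every sampled $s$, while the difference between two positions does not, so only relative position is a stable signal across training samples.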

Figures (33)

  • Figure 2: A $5 \times 5$ geometry channel of rank 2 is illustrated in the rightmost tensor. The top and bottom rows correspond to horizontal and vertical coordinates, respectively. The standard horizontal and vertical coordinates are shown in the leftmost column. Tensors in the second column show random horizontal and vertical shifts. In the implementation, coordinate channels are divided by their sizes (in this case 4), and, as an optimisation, the horizontal and vertical shifts are sampled at once as a single random number representing their sum, which reduces the number of additions and samplings.
  • Figure 3: GeoConv in a VAE. Purple blocks indicate the input and output tensors, yellow blocks represent the output tensors resulting from previous layers' convolution operation, and orange blocks indicate the geometry channels appended to them during the GeoConv's operation before applying the next convolution.
  • Figure 4: Ablation study on the performance of models using each architecture with different numbers of layers and filters. A side observation is GPT-4V's intriguing failure in this task. We evaluated GPT-4V's performance on 140 (20 per density) dataset images, without fine-tuning but with prompt engineering, and scaled it by the same scaling factor as the others.
  • Figure 6: Hand gestures generated by ConvWGAN-GP (top) and GeoWGAN-GP (bottom), trained on the ASL Hand dataset. Each image is generated as follows. For a given model and label, we generated 10 images from randomly sampled latent points. The image with the highest score from the discriminator is added to the canvas. We repeat this for each of the 36 labels. Hand gestures generated by GeoWGAN-GP, in addition to being clearer, have the correct formation and correspond to the correct label, while some of the gestures by ConvWGAN-GP, like '4', '6', 'h', 'r', and 's', show incorrect gestures and some others, like '3', '7', 'c', 'f', 'i', and 'o', are deformed.
  • Figure 7: Mean and 95% CI of train and validation losses of GeoVAE (red lines), CoordVAE (dotted brown lines), and ConvVAE (dashed blue lines), trained on the CelebA dataset for latent dimensions $d \in \{256, 384, 512\}$ over five runs with seeds $0, \dots, 4$. GeoVAE is more consistent across all runs and latent dimensions and obtains smaller mean train and validation losses than both ConvVAE and CoordVAE.
  • ...and 28 more figures
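
The geometry channel described in the Figure 2 caption can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `geopos_channel` and `append_geopos`, the `max_shift` parameter, and the uniform shift distribution are all hypothetical. Only the normalisation by size (4 for a $5 \times 5$ grid) and the single summed random shift come from the caption.

```python
import numpy as np


def geopos_channel(height, width, rng=None, max_shift=1.0):
    """Build one (height, width) geometry channel.

    Each axis coordinate is divided by (size - 1), e.g. by 4 for a 5x5 grid,
    and the horizontal and vertical coordinates are summed into one channel.
    ASSUMPTION: the shift is drawn uniformly from [-max_shift, max_shift];
    the source does not state the distribution or its range.
    """
    ys = np.arange(height, dtype=np.float32) / max(height - 1, 1)
    xs = np.arange(width, dtype=np.float32) / max(width - 1, 1)
    channel = ys[:, None] + xs[None, :]  # broadcast to (height, width)
    if rng is not None:
        # One scalar stands in for the sum of the horizontal and vertical shifts.
        channel = channel + np.float32(rng.uniform(-max_shift, max_shift))
    return channel


def append_geopos(features, rng=None):
    """Append the geometry channel to a (C, H, W) feature tensor."""
    _, height, width = features.shape
    geo = geopos_channel(height, width, rng).astype(features.dtype)
    return np.concatenate([features, geo[None, ...]], axis=0)


base = geopos_channel(5, 5)
shifted = geopos_channel(5, 5, rng=np.random.default_rng(0))
features = np.zeros((3, 5, 5), dtype=np.float32)
augmented = append_geopos(features, rng=np.random.default_rng(1))
print(augmented.shape)  # (4, 5, 5)
```

In this sketch the shift changes every absolute value in the channel but leaves the differences between entries untouched, so a convolution applied to `augmented` would see the same relative layout regardless of the sampled shift.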

Theorems & Definitions (7)

  • Theorem 2.1
  • proof
  • Theorem 2.2
  • Theorem 2.3
  • Remark 2.4
  • proof: Proof of the Filter Collapse theorem
  • proof: Proof of the Equivalence theorem