PyramidStyler: Transformer-Based Neural Style Transfer with Pyramidal Positional Encoding and Reinforcement Learning

Raahul Krishna Durairaju, K. Saruladha

TL;DR

PyramidStyler addresses the difficulty of scaling neural style transfer (NST) to high-resolution inputs by introducing Pyramidal Positional Encoding (PPE), a hierarchical multi-scale patch encoding that preserves local details and global context while reducing computation. The framework combines a transformer-based encoder–decoder with a CNN decoder and integrates a lightweight reinforcement learning (RL) component that adapts stylization during training, accelerating convergence and improving visual fidelity. Trained on COCO and WikiArt, PyramidStyler reduces content and style losses to $2.07$ and $0.86$ after $4000$ epochs, with inference times around $1.39$ s; adding RL improves these losses further (to $2.03$ and $0.75$) at comparable speed ($\sim1.40$ s). The authors present this as real-time, high-quality artistic rendering at scale, suitable for media and design applications, and as evidence of PPE's advantage over CAPE in multi-scale spatial reasoning.
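
This summary does not give the exact formulation of PPE, so the following is only a minimal sketch of one generic way to build a hierarchical, multi-scale positional encoding. It assumes standard sinusoidal encodings computed on progressively coarser patch grids, upsampled back to the finest grid and averaged. The function names (`sincos_1d`, `sincos_2d`, `pyramidal_encoding`) and the `levels` parameter are hypothetical and are not taken from the paper.

```python
import numpy as np

def sincos_1d(n: int, dim: int) -> np.ndarray:
    """Standard sinusoidal encoding for n positions, shape (n, dim); dim must be even."""
    pos = np.arange(n)[:, None]
    freq = np.exp(-np.log(10000.0) * np.arange(0, dim, 2) / dim)
    ang = pos * freq[None, :]                      # (n, dim / 2)
    out = np.zeros((n, dim))
    out[:, 0::2] = np.sin(ang)
    out[:, 1::2] = np.cos(ang)
    return out

def sincos_2d(h: int, w: int, dim: int) -> np.ndarray:
    """2D encoding, shape (h, w, dim): half the channels encode rows, half columns.
    dim must be divisible by 4."""
    rows = np.broadcast_to(sincos_1d(h, dim // 2)[:, None, :], (h, w, dim // 2))
    cols = np.broadcast_to(sincos_1d(w, dim // 2)[None, :, :], (h, w, dim // 2))
    return np.concatenate([rows, cols], axis=-1)

def pyramidal_encoding(h: int, w: int, dim: int, levels: int = 3) -> np.ndarray:
    """Illustrative multi-scale encoding (an assumption, not the paper's definition).
    Level k uses a grid coarsened by 2**k, so nearby patches share coarse codes."""
    total = np.zeros((h, w, dim))
    for level in range(levels):
        s = 2 ** level
        ch, cw = -(-h // s), -(-w // s)            # ceil division
        coarse = sincos_2d(ch, cw, dim)
        # Nearest-neighbour upsampling back to the fine grid, cropped to (h, w).
        total += coarse.repeat(s, axis=0).repeat(s, axis=1)[:h, :w]
    return total / levels

pe = pyramidal_encoding(8, 8, 16)
print(pe.shape)
```

In this sketch the finest level distinguishes every patch (local detail), while the coarser levels assign the same code to whole blocks of patches (global context). A real model would typically add or concatenate such an encoding to the patch embeddings before the transformer layers.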

Abstract

Neural Style Transfer (NST) has evolved from Gatys et al.'s (2015) CNN-based algorithm, enabling AI-driven artistic image synthesis. However, existing CNN and transformer-based models struggle to scale efficiently to complex styles and high-resolution inputs. We introduce PyramidStyler, a transformer framework with Pyramidal Positional Encoding (PPE): a hierarchical, multi-scale encoding that captures both local details and global context while reducing computational load. We further incorporate reinforcement learning (RL) to dynamically optimize stylization, accelerating convergence. Trained on Microsoft COCO and WikiArt, PyramidStyler reduces content loss by 62.6% (to 2.07) and style loss by 57.4% (to 0.86) after 4000 epochs, with an inference time of 1.39 s, and yields further improvements (content 2.03; style 0.75) with minimal speed penalty (1.40 s) when using RL. These results demonstrate real-time, high-quality artistic rendering, with broad applications in media and design.
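
The reported percentages can be unpacked with a short calculation. The sketch below assumes that each percentage is a relative reduction from an unreported starting loss (the abstract does not state the baseline explicitly), and the helper names are illustrative only.

```python
def implied_baseline(final_loss: float, reduction_pct: float) -> float:
    """Starting loss implied by a relative reduction of reduction_pct percent."""
    return final_loss / (1.0 - reduction_pct / 100.0)

def relative_change_pct(before: float, after: float) -> float:
    """Relative change from before to after, in percent (positive = decrease)."""
    return 100.0 * (before - after) / before

# Values as reported in the abstract.
content_start = implied_baseline(2.07, 62.6)
style_start = implied_baseline(0.86, 57.4)

# Additional effect of the RL component relative to the non-RL model.
content_gain_rl = relative_change_pct(2.07, 2.03)
style_gain_rl = relative_change_pct(0.86, 0.75)
latency_cost_rl = -relative_change_pct(1.39, 1.40)   # increase in inference time

print(f"implied starting losses: content {content_start:.2f}, style {style_start:.2f}")
print(f"RL gains: content {content_gain_rl:.1f}%, style {style_gain_rl:.1f}%")
print(f"RL latency overhead: {latency_cost_rl:.1f}%")
```

Under that assumption, the implied starting losses are about 5.53 (content) and 2.02 (style). RL lowers content loss by roughly a further 1.9% and style loss by roughly 12.8%, for an inference-time increase of about 0.7%.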

Paper Structure

This paper contains 28 sections, 25 equations, 5 figures.

Figures (5)

  • Figure 1:
  • Figure 2: Proposed Architecture
  • Figure 3: Existing vs Proposed System Metric Comparison Table
  • Figure 4: Influence of RL Algorithm Comparison
  • Figure 5: Output Comparison of the proposed model for various content and style images.