PyramidStyler: Transformer-Based Neural Style Transfer with Pyramidal Positional Encoding and Reinforcement Learning
Raahul Krishna Durairaju, K. Saruladha
TL;DR
PyramidStyler addresses the difficulty of scaling neural style transfer (NST) to high-resolution inputs by introducing Pyramidal Positional Encoding (PPE), a hierarchical multi-scale patch encoding that preserves local details and global context while reducing computation. The framework uses a transformer-based encoder–decoder with a CNN decoder and integrates a lightweight reinforcement learning (RL) component that adapts stylization during training, accelerating convergence and improving visual fidelity. Trained on COCO and WikiArt, PyramidStyler reduces content and style losses to $2.07$ and $0.86$ after $4000$ epochs, with an inference time of about $1.39$ s; adding RL lowers the losses further (to $2.03$ and $0.75$) at comparable speed ($\sim1.40$ s). The approach demonstrates real-time, high-quality artistic rendering at scale, suitable for media and design applications, and highlights PPE's advantage over Content-Aware Positional Encoding (CAPE) in multi-scale spatial reasoning.
Abstract
Neural Style Transfer (NST) has evolved from Gatys et al.'s (2015) CNN-based algorithm, enabling AI-driven artistic image synthesis. However, existing CNN and transformer-based models struggle to scale efficiently to complex styles and high-resolution inputs. We introduce PyramidStyler, a transformer framework with Pyramidal Positional Encoding (PPE): a hierarchical, multi-scale encoding that captures both local details and global context while reducing computational load. We further incorporate reinforcement learning (RL) to dynamically optimize stylization, accelerating convergence. Trained on Microsoft COCO and WikiArt, PyramidStyler reduces content loss by 62.6% (to 2.07) and style loss by 57.4% (to 0.86) after 4000 epochs, with an inference time of 1.39 s. Adding RL yields further improvements (content 2.03; style 0.75) with a minimal speed penalty (1.40 s). These results demonstrate real-time, high-quality artistic rendering, with broad applications in media and design.
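As a reading aid, the reported figures can be cross-checked with simple arithmetic. The sketch below is not part of the PyramidStyler code; the helper names `implied_baseline` and `relative_change_pct` are hypothetical, and it assumes each percentage is measured relative to the loss at the start of training. Under that assumption, the reported reductions imply starting losses of roughly 5.53 (content) and 2.02 (style), and RL lowers the final content and style losses by about 1.9% and 12.8% while adding about 0.7% to inference time.

```python
# Sanity-check arithmetic on the figures reported in the abstract.
# Assumption (not stated in the source): each percentage reduction is
# measured relative to the loss value at the start of training.

def implied_baseline(final_loss, reduction_pct):
    """Starting loss implied by a final loss and a percentage reduction."""
    return final_loss / (1.0 - reduction_pct / 100.0)

def relative_change_pct(before, after):
    """Percentage decrease from `before` to `after` (negative if it grew)."""
    return 100.0 * (before - after) / before

# Implied starting losses without RL.
content_start = implied_baseline(2.07, 62.6)
style_start = implied_baseline(0.86, 57.4)

# Additional gain from RL, relative to the non-RL final losses.
rl_content_gain = relative_change_pct(2.07, 2.03)
rl_style_gain = relative_change_pct(0.86, 0.75)

# Inference-time overhead of the RL variant (1.39 s -> 1.40 s).
rl_time_overhead = -relative_change_pct(1.39, 1.40)

print(f"implied starting content loss: {content_start:.2f}")
print(f"implied starting style loss:   {style_start:.2f}")
print(f"RL content-loss gain: {rl_content_gain:.1f}%")
print(f"RL style-loss gain:   {rl_style_gain:.1f}%")
print(f"RL inference overhead: {rl_time_overhead:.1f}%")
```

The comparison suggests that the RL component mainly benefits the style loss, with a much smaller effect on the content loss.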
