
Look, Compare and Draw: Differential Query Transformer for Automatic Oil Painting

Lingyu Liu, Yaxiong Wang, Li Zhu, Lizi Liao, Zhedong Zheng

Abstract

This work introduces a new approach to automatic oil painting that emphasizes the creation of dynamic and expressive brushstrokes. A pivotal challenge lies in mitigating duplicate and commonplace strokes, which often lead to less aesthetic outcomes. Inspired by the human painting process, i.e., observing, comparing, and drawing, we incorporate differential image analysis into a neural oil painting model, allowing the model to effectively concentrate on the incremental impact of successive brushstrokes. To operationalize this concept, we propose the Differential Query Transformer (DQ-Transformer), a new architecture that leverages differentially derived image representations enriched with positional encoding to guide the stroke prediction process. This integration enables the model to maintain heightened sensitivity to local details, resulting in more refined and nuanced stroke generation. Furthermore, we incorporate adversarial training into our framework, enhancing the accuracy of stroke prediction and thereby improving the overall realism and fidelity of the synthesized paintings. Extensive qualitative evaluations, complemented by a controlled user study, validate that our DQ-Transformer surpasses existing methods in both visual realism and artistic authenticity, typically achieving these results with fewer strokes. The stroke-by-stroke painting animations are available on our project website.

Paper Structure

This paper contains 13 sections, 7 equations, 8 figures, and 4 tables.

Figures (8)

  • Figure 1: Differential image-guided inference process. We present four intermediate stages of oil painting according to a real target image (left). Each stage is illustrated with a diagram, where the top-left corner shows the current canvas, the top-right corner displays the corresponding differential image for that stage, and the bottom part presents the painting result inferred by our model. We observe that since we explicitly compare the content in the differential images during training, our model tends to add strokes in areas where discrepancies are more pronounced, thereby progressively reducing the discrepancy content within the differential images.
  • Figure 2: A brief overview of our painter framework. Given the canvas image $I_c$ and the target image $I_t$ generated by the renderer, we first obtain their differential image $I_d$ by simply subtracting one input from the other. Three local encoders, each a convolutional neural network, extract image features $F_c$, $F_t$, and $F_d$ with positional information. The DQ-Transformer has two components, i.e., the DQ-encoder and the DQ-decoder. The visual features $F_c$, $F_t$, and $F_d$ are concatenated and fed to the DQ-encoder to obtain the fused feature $F_{kv}$. Next, we transform the differential image features $F_d$ into query tokens to query the key and value pairs generated from the fused feature $F_{kv}$. Finally, the DQ-Transformer outputs a set of predicted strokes $\hat{S_t}$, each accompanied by its respective confidence $\hat{C_t}$. The predicted image $\hat{I_t}$ is generated by rendering these strokes onto the canvas. The discriminator treats the target images $I_t$ as real samples and the predicted images $\hat{I_t}$ as fake samples.
  • Figure 3: Our painting progress following a coarse-to-fine manner.
  • Figure 4: Qualitative comparison between our model and state-of-the-art neural painting methods on unseen real-world datasets at different stroke counts. The actual number of strokes used in each painting is annotated in the top right corner of the image. Our method leverages the differential image as a dynamic query for each painting step. This observation-first approach enables our model to achieve superior visual quality with relatively fewer strokes, effectively reproducing complex details with high fidelity. Please zoom in to obtain a more detailed view.
  • Figure 5: Ablation study on the primary components of our framework at different stroke counts. The actual number of brushstrokes used in the painting is annotated in the top right corner of the image. Please zoom in to obtain a more detailed view.
  • ...and 3 more figures
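The pipeline described in the Figure 2 caption can be sketched in simplified form: subtract the canvas from the target to get the differential image, fuse all three feature sets into a key/value sequence, and let the differential features act as attention queries. The sketch below is a minimal NumPy illustration of that differential-query cross-attention, not the authors' implementation; the CNN encoders, positional encodings, stroke head, and discriminator are omitted, and all shapes and names (`n_tokens`, `dim`, `cross_attention`) are illustrative assumptions.

```python
# Hypothetical sketch of the "differential query" idea from Figure 2.
# NumPy stands in for the paper's CNN encoders and Transformer blocks;
# shapes and names are illustrative, not from the paper.
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(query, key, value):
    """Scaled dot-product attention: each query token attends over key/value."""
    d = query.shape[-1]
    scores = query @ key.T / np.sqrt(d)
    return softmax(scores, axis=-1) @ value

# Toy flattened feature maps (stand-ins for the local encoders' outputs).
n_tokens, dim = 16, 32
F_c = rng.normal(size=(n_tokens, dim))   # canvas features
F_t = rng.normal(size=(n_tokens, dim))   # target features
F_d = F_t - F_c                          # differential features I_d -> F_d

# Fused key/value sequence over all three feature sets (DQ-encoder stand-in).
F_kv = np.concatenate([F_c, F_t, F_d], axis=0)

# Differential features serve as queries over the fused sequence
# (DQ-decoder stand-in); each output token would feed the stroke predictor.
stroke_tokens = cross_attention(F_d, F_kv, F_kv)
print(stroke_tokens.shape)
```

The key design point the sketch mirrors is that the queries come from the difference signal, so attention mass concentrates where canvas and target still disagree, which matches the paper's claim that strokes are added where discrepancies are most pronounced.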