Robot Synesthesia: A Sound and Emotion Guided AI Painter

Vihaan Misra, Peter Schaldenbrand, Jean Oh

TL;DR

Robot Synesthesia introduces S-FRIDA, a framework that enables a robotic painter to be guided by natural sounds and speech by embedding inputs into shared representations with the painting process. By decoupling speech into content (text) and mood (emotion) and applying CLIP-based encodings alongside emotion prediction models, the method enables both semantic content control and mood-driven painting. The approach demonstrates that natural sounds and emotions can guide painting across simulated and real-world outputs, with survey-based evidence of perceptual alignment and qualitative multimodal results. This work broadens accessibility and expressive potential in robotic painting by leveraging auditory and emotional cues, and provides open-source resources to foster further research.

Abstract

If a picture paints a thousand words, sound may voice a million. While recent robotic painting and image synthesis methods have achieved progress in generating visuals from text inputs, the translation of sound into images is largely unexplored. Generally, sound-based interfaces and sonic interactions have the potential to expand accessibility and control for the user and provide a means to convey complex emotions and the dynamic aspects of the real world. In this paper, we propose an approach for using sound and speech to guide a robotic painting process, known here as robot synesthesia. For general sound, we encode the simulated paintings and input sounds into the same latent space. For speech, we decouple speech into its transcribed text and the tone of the speech. Whereas we use the text to control the content, we estimate the emotions from the tone to guide the mood of the painting. Our approach has been fully integrated with FRIDA, a robotic painting framework, adding sound and speech to FRIDA's existing input modalities, such as text and style. In two surveys, participants correctly guessed the emotion or natural sound used to generate a given painting at more than twice the rate of random chance. We discuss the results of our sound-guided image manipulation and music-guided paintings qualitatively.

Paper Structure

This paper contains 18 sections, 5 equations, 9 figures, and 2 tables.

Figures (9)

  • Figure 1: Robot Synesthesia connects a robot painter's action space directly to user-driven sonic interactions. For speech guidance, the sound is decoupled into language and emotion, whereas for natural sound guidance, the audio itself drives the content of the painting.
  • Figure 2: $S$-FRIDA overview. A human user's artistic intentions are specified via any combination of natural sounds or speech, as well as any of FRIDA's existing input modalities [frida2022], e.g., style or sketches. Features are extracted from the audio or, in the case of speech, a transcription and emotions are estimated. Brush stroke actions are rendered into a simulated painting using FRIDA; features are then extracted from the simulated painting and compared to the input features to form loss functions. The loss is backpropagated, and gradient descent updates the actions to decrease the loss. After optimization, the brush stroke actions are executed by a robotic arm (UFactory XArm and Franka Emika platforms). We observe a high degree of fidelity between the simulated painting and the real painting drawn by the $S$-FRIDA framework.
  • Figure 3: Paintings generated using natural sounds as input.
  • Figure 4: First row: paintings with only emotion guidance. Second row: paintings with both emotion guidance and the text "A house and a tree."
  • Figure 5: Examples showing how emotion included in the inputs impacts the overall impressions of the paintings. The figure shows the inputs (first row), and the real paintings drawn by $S$-FRIDA (second row).
  • ...and 4 more figures
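
The optimization loop described in the Figure 2 caption (render strokes, extract features, compare to input features, descend on the loss) can be illustrated with a small toy sketch. Everything here is an assumption made for illustration: `toy_render_features` is a hypothetical stand-in for FRIDA's simulated renderer plus a feature extractor, the three-number stroke parameterization is invented, the cosine-distance loss is one plausible way to compare features in a shared latent space, and finite differences replace backpropagation so the example needs only the standard library. It is not the authors' implementation.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b + 1e-12)

def toy_render_features(strokes):
    """Hypothetical stand-in for 'render strokes, then extract features'.

    Each stroke is [x, y, length]; the 'features' are length-weighted sums.
    """
    fx = sum(s[0] * s[2] for s in strokes)
    fy = sum(s[1] * s[2] for s in strokes)
    fl = sum(s[2] for s in strokes)
    return [fx, fy, fl]

def loss(strokes, target):
    """Cosine distance between painting features and the input (audio) features."""
    return 1.0 - cosine_similarity(toy_render_features(strokes), target)

def numerical_gradient(strokes, target, eps=1e-5):
    """Central finite differences; a real system would use backpropagation."""
    grads = []
    for i, stroke in enumerate(strokes):
        row = []
        for j in range(len(stroke)):
            plus = [list(s) for s in strokes]
            minus = [list(s) for s in strokes]
            plus[i][j] += eps
            minus[i][j] -= eps
            row.append((loss(plus, target) - loss(minus, target)) / (2 * eps))
        grads.append(row)
    return grads

def optimize_strokes(strokes, target, steps=200, lr=0.05):
    """Plain gradient descent on the stroke parameters."""
    strokes = [list(s) for s in strokes]
    history = [loss(strokes, target)]
    for _ in range(steps):
        grads = numerical_gradient(strokes, target)
        for stroke, g in zip(strokes, grads):
            for j in range(len(stroke)):
                stroke[j] -= lr * g[j]
        history.append(loss(strokes, target))
    return strokes, history

# Hypothetical starting strokes and target features (e.g., from an audio encoder).
initial_strokes = [[0.8, 0.7, 0.4], [0.3, 0.5, 0.4], [0.8, 0.3, 0.5], [0.6, 0.9, 0.5]]
target_features = [0.2, 0.9, 0.4]
final_strokes, loss_history = optimize_strokes(initial_strokes, target_features)
print(f"loss before: {loss_history[0]:.4f}, after: {loss_history[-1]:.4f}")
```

In the framework described by the caption, the optimized actions would then be handed to the robot arm for execution. In this sketch the loop simply returns the updated stroke parameters and the loss history.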