CoFRIDA: Self-Supervised Fine-Tuning for Human-Robot Co-Painting

Peter Schaldenbrand, Gaurav Parmar, Jun-Yan Zhu, James McCann, Jean Oh

TL;DR

CoFRIDA tackles the semantic and sim-to-real gaps in human-robot co-painting by fine-tuning a strong text-to-image model offline on self-generated data that reflect robotic constraints. It introduces a Co-Painting Module that operates in a hierarchical loop with FRIDA, where high-level semantic planning informs low-level brush-action planning to realize text-driven paintings. Results show improved text-image alignment, coherent integration with existing canvas content, and generalization to real-world media across multiple painting settings and turns. The approach is open-source and platform-agnostic, offering a practical path to interactive, collaborative robotic art.

Abstract

Prior robot painting and drawing work, such as FRIDA, has focused on decreasing the sim-to-real gap and expanding input modalities for users, but interaction with these systems is generally limited to the input stage. To support interactive, human-robot collaborative painting, we introduce the Collaborative FRIDA (CoFRIDA) robot painting framework, which can co-paint by modifying and engaging with content already painted by a human collaborator. To improve text-image alignment, FRIDA's major weakness, our system uses pre-trained text-to-image models; however, pre-trained models do not perform well in real-world co-painting because they (1) do not understand the constraints and abilities of the robot and (2) cannot perform co-painting without making unrealistic edits to the canvas and overwriting content. We propose a self-supervised fine-tuning procedure that tackles both issues, allowing pre-trained state-of-the-art text-image alignment models to be used with robots for co-painting in the physical world. Our open-source approach, CoFRIDA, creates paintings and drawings that match the input text prompt more clearly than FRIDA, both from a blank canvas and from one with human-created work. More generally, our fine-tuning procedure successfully encodes the robot's constraints and abilities into a foundation model, showcasing promising results as an effective method for reducing sim-to-real gaps.

Paper Structure

This paper contains 20 sections, 1 equation, 9 figures, and 1 table.

Figures (9)

  • Figure 1: Co-Painting with CoFRIDA. We showcase how CoFRIDA collaboratively paints with artists. The process begins with the artist sketching a table. Building on that foundation, CoFRIDA adds to the canvas, guided by the artist's initial prompt: "A bulky robot arm on a table." The artist then iterates on the painting with additional strokes to add detail to the robot arm, and provides a new text prompt, "A robot arm with a hand." CoFRIDA responds by completing the painting to match this new description.
  • Figure 2: Co-Painting. We introduce Co-Painting as a task in which a robot must add content to a painting that engages with the current content without destroying the existing work. We demonstrate that existing models (Instruct-Pix2Pix, bottom row) often cannot successfully add content without making unreasonably large edits to the canvas, overwriting any prior work, while CoFRIDA (top row) adds content that harmonizes with the existing work.
  • Figure 3: Method Overview. Offline, we fine-tune a pre-trained Instruct-Pix2Pix model on our self-supervised data. Online, the user can either draw or give the robot a text description. The Co-Painting Module takes as input the current canvas and text description to generate a pixel prediction of how the robot should finish the painting using the fine-tuned Instruct-Pix2Pix model. FRIDA predicts actions for the robot to create this pixel image and produces a simulation. This process is repeated until the user is satisfied.
  • Figure 4: Self-Supervised Dataset Creation. We describe the process of generating the self-supervised training data pairs for fine-tuning the Co-Painting Module. We start with the input images from the LAION-art dataset and convert them into simulated sketch outputs with the FRIDA simulator. Next, we create partial sketches in four different ways: removing random strokes, removing the salient region, removing a semantic region, and removing all strokes.
  • Figure 5: Qualitative Comparison. We show a comparison between three methods of performing text-based canvas updates: FRIDA, CoFRIDA without fine-tuning, and CoFRIDA with fine-tuning (ours). FRIDA uses a CLIP-based optimization and generates outputs that are noisy. CoFRIDA without fine-tuning is not aware of the constraints of the robot and generates an output that is difficult for the robot to execute and often does not satisfy the text prompt specified by the user. In contrast, CoFRIDA with fine-tuning outputs an updated canvas that reflects the user prompt without being noisy.
  • ...and 4 more figures
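
The four partial-sketch strategies named in the Figure 4 caption (removing random strokes, the salient region, a semantic region, or all strokes) can be illustrated with a small sketch. This is not the authors' implementation: it does not use the FRIDA simulator or LAION-art, a stroke is reduced to a hypothetical `Stroke` center point, and the salient and semantic regions are assumed to arrive as precomputed bounding boxes rather than from a saliency or segmentation model. All names (`Stroke`, `make_pairs`, `keep_fraction`, the dictionary keys) are illustrative assumptions.

```python
import random
from dataclasses import dataclass
from typing import Dict, List, Tuple

# Hypothetical region format: (x_min, y_min, x_max, y_max) in pixels.
Box = Tuple[int, int, int, int]


@dataclass(frozen=True)
class Stroke:
    """Illustrative stand-in for one simulated stroke, reduced to its center."""
    x: int
    y: int


def in_box(stroke: Stroke, box: Box) -> bool:
    """True if the stroke center lies inside the box (bounds inclusive)."""
    x0, y0, x1, y1 = box
    return x0 <= stroke.x <= x1 and y0 <= stroke.y <= y1


def remove_random(strokes: List[Stroke], rng: random.Random,
                  keep_fraction: float = 0.5) -> List[Stroke]:
    """Keep a random subset of strokes, preserving their original order."""
    n_keep = int(len(strokes) * keep_fraction)
    kept = set(rng.sample(range(len(strokes)), n_keep))
    return [s for i, s in enumerate(strokes) if i in kept]


def remove_region(strokes: List[Stroke], box: Box) -> List[Stroke]:
    """Drop every stroke whose center falls inside the given region."""
    return [s for s in strokes if not in_box(s, box)]


def make_pairs(full: List[Stroke], caption: str, salient_box: Box,
               semantic_box: Box, rng: random.Random) -> List[Dict]:
    """Build one (partial sketch, full sketch, text) pair per removal mode."""
    partials = {
        "random": remove_random(full, rng),
        "salient": remove_region(full, salient_box),    # box assumed given
        "semantic": remove_region(full, semantic_box),  # box assumed given
        "blank": [],                                    # all strokes removed
    }
    return [
        {"mode": mode, "input": partial, "target": list(full), "text": caption}
        for mode, partial in partials.items()
    ]
```

Each pair uses the partial sketch as the model input and the full sketch as the target, so the target always contains the input's strokes. Under these assumptions, that is how the fine-tuning data would push the model to add content rather than overwrite what is already on the canvas.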