MemoVis: A GenAI-Powered Tool for Creating Companion Reference Images for 3D Design Feedback

Chen Chen, Cuong Nguyen, Thibault Groueix, Vladimir G. Kim, Nadir Weibel

TL;DR

MemoVis is a text editor interface that helps feedback providers create reference images with generative AI, driven by their feedback comments. It includes a novel real-time viewpoint suggestion feature, based on a vision-language foundation model, that helps feedback providers anchor a comment to a camera viewpoint.

Abstract

Providing asynchronous feedback is a critical step in the 3D design workflow. A common approach to providing feedback is to pair textual comments with companion reference images, which helps illustrate the gist of the text. Ideally, feedback providers should possess 3D and image editing skills to create reference images that can effectively describe what they have in mind. However, they often lack such skills, so they have to resort to sketches or online images that might not match the current 3D design well. To address this, we introduce MemoVis, a text editor interface that assists feedback providers in creating reference images with generative AI driven by the feedback comments. First, a novel real-time viewpoint suggestion feature, based on a vision-language foundation model, helps feedback providers anchor a comment with a camera viewpoint. Second, given a camera viewpoint, we introduce three types of image modifiers, based on pre-trained 2D generative models, to turn a text comment into an updated version of the 3D scene from that viewpoint. We conducted a within-subjects study with feedback providers, demonstrating the effectiveness of MemoVis. The quality and explicitness of the companion images were evaluated by another eight participants with prior 3D design experience.
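
The viewpoint suggestion step can be pictured as a ranking problem: embed the typed comment and a set of candidate camera views in a shared space, then keep the views with the highest cosine similarity (the Figure 2 caption reports the top 4 by CLIP similarity). The sketch below is illustrative only and is not the authors' implementation. The function names `cosine_similarity` and `top_k_viewpoints` are hypothetical, the toy 3-dimensional vectors stand in for real embeddings from a vision-language model, and every embedding is assumed to be non-zero.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k_viewpoints(text_embedding, view_embeddings, k=4):
    """Return (view_index, score) pairs for the k views most similar to the text.

    Hypothetical helper: the embeddings would come from a vision-language
    model in practice; here they are plain lists of floats.
    """
    scored = [
        (cosine_similarity(text_embedding, view), index)
        for index, view in enumerate(view_embeddings)
    ]
    # Highest similarity first; Python's sort is stable, so ties keep view order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [(index, score) for score, index in scored[:k]]


# Toy data: one comment embedding and five candidate camera views.
comment = [1.0, 0.0, 0.0]
views = [
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [1.0, 0.0, 0.0],
    [-1.0, 0.0, 0.0],
    [0.5, 0.5, 0.0],
]
suggestions = top_k_viewpoints(comment, views, k=4)
```

With this toy data, the view that points in the same direction as the comment embedding ranks first and the opposite-facing view is dropped from the top 4. A real system would also need to decide how candidate views are sampled around the 3D model, which this sketch does not address.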

Paper Structure (29 sections, 22 figures, 1 algorithm)

Figures (22)

  • Figure 1: Example reference images from Polycount. (a) Initial design of discussion thread 6 and (b) feedback from OFP6-6; (c) initial design of discussion thread 8 and feedback from OFP8-1 (f) and OFP8-2 (d - e); (g) initial design of discussion thread 2; (h) initial design of discussion thread 13. Green and blue labels indicate that the associated reference images are from creators and feedback providers, respectively.
  • Figure 2: Examples of the suggested viewpoints based on the typed feedback comments (leftmost column). We show the viewpoints with the top-4 highest CLIP similarity scores for an office 3D model (a - e), a car model (f - j), and a samurai boy model (k - o). Panels a, f, and k show the bird's-eye view of the initial 3D model, where red circles highlight the focus of the textual comments. The cosine similarity score is shown at the bottom of each suggested viewpoint (b - e, g - j, l - o).
  • Figure 3: Example of creating a reference image using the text + scribble modifier. (a) Initial design; (b) associated depth map; (c) manually drawn scribbles (black strokes), with the white strokes indicating the removed geometries; (d) the depth map with the scribbling area reset; (e) an aggregated scribble from the initial image and the manually drawn scribbles, where the red bounding box shows the area scribbled by feedback providers; (f) image synthesized by ControlNet conditioned on scribble + depth, where the red bounding box shows the area that the feedback providers scribbled; (g) segmented mask generated by SAM; (h) initial design with the primitives of the existing computer display removed; (i) final composed reference image; (j) final composed reference image without removing the objects marked for removal by scribbling.
  • Figure 4: Examples showing how feedback providers can stage the 3D model in different scenes with the grab'n go modifier. (a) The initial 3D design of a car; (b, d) images synthesized by the scribble + text modifier with ControlNet conditioned on depth. The prompts "a Ferrari car driving on the highway" and "a Ferrari car driving on a desert" were used to synthesize (b) and (d), respectively. The red bounding boxes show the areas drawn by feedback providers; (c) final composed image, bringing the initial design into the scene of (b); (e) final composed image, bringing the initial design into the scene of (d).
  • Figure 5: Example of continuous composing. (a) Reference image of Fig. 3i; (b) the feedback provider can draw a bounding box to indicate their intention to add the white keyboard design to the reference image; (c) segmented mask generated by SAM; (d) segmented mask computed as the union of (c) and Fig. 3g; (e) final reference image after including the white keyboard.
  • ...and 17 more figures
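
The captions for Figures 3 and 5 describe composing a reference image from a segmentation mask, and growing that mask by taking the union with a second one. A minimal sketch of that idea follows, assuming the composition is a simple per-pixel paste of the synthesized image onto the initial render wherever the mask is set. That assumption, the helper names `union_masks` and `composite`, and the tiny integer "images" are all illustrative and are not taken from the paper.

```python
def union_masks(mask_a, mask_b):
    """Per-pixel logical OR of two boolean masks with the same shape."""
    return [
        [a or b for a, b in zip(row_a, row_b)]
        for row_a, row_b in zip(mask_a, mask_b)
    ]


def composite(base, generated, mask):
    """Take the generated pixel where the mask is True, else keep the base pixel."""
    return [
        [g if m else b for b, g, m in zip(row_b, row_g, row_m)]
        for row_b, row_g, row_m in zip(base, generated, mask)
    ]


# Toy 2x3 "images": 0 marks a pixel of the initial render, 9 a synthesized pixel.
initial_render = [[0, 0, 0], [0, 0, 0]]
synthesized = [[9, 9, 9], [9, 9, 9]]

# Hypothetical masks: one for a first selected object, one for a second.
display_mask = [[True, False, False], [False, False, False]]
keyboard_mask = [[False, False, False], [False, True, True]]

combined_mask = union_masks(display_mask, keyboard_mask)
reference_image = composite(initial_render, synthesized, combined_mask)
```

Here `reference_image` keeps the initial render everywhere except the pixels covered by either mask. Because the union is recomputed each time a new object is selected, earlier selections stay in the composed image, which matches the continuous composing behaviour the Figure 5 caption describes.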