
NaiLIA: Multimodal Nail Design Retrieval Based on Dense Intent Descriptions and Palette Queries

Kanon Amemiya, Daichi Yashima, Kei Katsumata, Takumi Komatsu, Ryosuke Korekata, Seitaro Otsuki, Komei Sugiura

TL;DR

NaiLIA is a multimodal retrieval method for nail design images that aligns with both dense intent descriptions and palette queries during retrieval, and introduces a relaxed loss based on confidence scores for unlabeled images that can align with the descriptions.

Abstract

We focus on the task of retrieving nail design images based on dense intent descriptions, which represent multi-layered user intent for nail designs. This is challenging because such descriptions specify unconstrained painted elements and pre-manufactured embellishments as well as visual characteristics, themes, and overall impressions. In addition to these descriptions, we assume that users provide palette queries by specifying zero or more colors via a color picker, enabling the expression of subtle and continuous color nuances. Existing vision-language foundation models often struggle to incorporate such descriptions and palettes. To address this, we propose NaiLIA, a multimodal retrieval method for nail design images, which comprehensively aligns with dense intent descriptions and palette queries during retrieval. Our approach introduces a relaxed loss based on confidence scores for unlabeled images that can align with the descriptions. To evaluate NaiLIA, we constructed a benchmark consisting of 10,625 images collected from people with diverse cultural backgrounds. The images were annotated with long and dense intent descriptions given by over 200 annotators. Experimental results demonstrate that NaiLIA outperforms standard methods.
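As an illustration of the palette query described above, the following minimal sketch turns the hex codes a color picker typically emits into normalized RGB triples. The function name `parse_palette` and the choice of normalized RGB are assumptions made for this example; the abstract does not say how NaiLIA encodes the palette internally.

```python
from typing import List, Tuple

RGB = Tuple[float, float, float]


def parse_palette(hex_colors: List[str]) -> List[RGB]:
    """Convert hex color codes (e.g. "#ffd3e5") to RGB triples in [0, 1].

    Hypothetical helper: an empty list is valid, because a palette query
    may contain zero colors.
    """
    palette: List[RGB] = []
    for code in hex_colors:
        digits = code.lstrip("#")
        if len(digits) != 6:
            raise ValueError(f"expected a 6-digit hex color, got {code!r}")
        # Each pair of hex digits is one 8-bit channel: red, green, blue.
        r, g, b = (int(digits[k:k + 2], 16) / 255.0 for k in (0, 2, 4))
        palette.append((r, g, b))
    return palette


if __name__ == "__main__":
    print(parse_palette([]))                      # description-only query
    print(parse_palette(["#fff4f4", "#bfc5ff"]))  # two-color palette query
```

Keeping the palette as a variable-length list mirrors the "zero or more colors" setting: the description-only case is simply the empty list, not a special code path.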

Paper Structure (55 sections, 6 equations, 13 figures, 8 tables)

Figures (13)

  • Figure 1: A typical use case for our task. The user inputs a dense intent description and an optional palette query. The palette query allows the user to select zero or more colors through a color picker interface. The model should rank the leftmost image higher than the middle image (painted in a darker purple than specified by the palette query) and the rightmost image (featuring realistic shell ornaments rather than the intended shell-inspired design). Details of this task are provided in Section \ref{sec:statement}.
  • Figure 2: Architecture of NaiLIA. A language-palette representation $\bm{l}_\text{+}$ is extracted from $\bm{x}_\text{txt}$ and $\bm{x}_\text{pal}$ by the Intent-Palette Fusion Module, and a visual representation $\bm{v}$ is extracted from $\bm{x}_\text{img}$ by the Visual Design Fusion Module. In the Confidence-based Relaxed Alignment Module, unlabeled positives are assigned, and the loss is calculated considering the unlabeled positives. Here, Enc. denotes the encoder.
  • Figure 3: Examples of $c_{ij}$ estimated by the MLLM for the following $\bm{x}_{\text{txt}}$: "Please put long fake nails on my nails and make them pink only at the base and the rest should be a fancy design with strawberries." Positives, unlabeled positives, and negatives are framed in green, yellow, and red, respectively.
  • Figure 4: Qualitative results of the proposed method (NaiLIA) and a baseline method (SigLIP [10377550]) for the following $\bm{x}_{\text{txt}}$ and $\bm{x}_{\text{pal}}$: (a) "I'd like my nails to have a cute, teenage vibe. I'd love a pink base with floral patterns and maybe some character accessories. Can we do a long nail shape?" with (#ffd3e5). (b) "I would like a fairy-tale nail design based on purple and pink." with (#fff4f4) and (#bfc5ff). (c) "I'd like a colorful and flashy nail design. Please add a large flower nail stone to the ring finger. The tips of the nail tips should be square-shaped." in the setting without $\bm{x}_{\text{pal}}$. (d) "red dress with a high waist. the dress is made of a stretchy material and has a flowy skirt. the dress is a formal style and is suitable for a special occasion." with (#ff1f35). The top-5 retrieved images are shown. Positives and unlabeled positives are enclosed in green and yellow frames, respectively. Examples (a), (b), and (c) show successful cases of NaiLIA on the NAIL-STAR benchmark, whereas (d) is a successful case on the Marqo Fashion200K benchmark. Additional qualitative results, including cases where $\bm{x}_{\text{pal}}$ consists of more than three colors, are presented in Appendix \ref{sec:additonal_qualitative}.
  • Figure 5: Examples of $c_{ij}$ estimated by the MLLM in CRAM. Positive, unlabeled positive, and negative labels are framed in green, yellow, and red, respectively, with unlabeled positives identified based on $c_{ij}$.
  • ...and 8 more figures
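
The captions for Figures 3 and 5 describe image-description pairs being marked as positives, unlabeled positives, or negatives based on the MLLM-estimated confidence $c_{ij}$. A minimal sketch of such an assignment is given below. It assumes that the annotated pair sits on the diagonal ($i = j$) and that a fixed threshold `tau` separates unlabeled positives from negatives. Both the threshold rule and its value are illustrative assumptions, not the rule used in the paper.

```python
from typing import List


def assign_labels(conf: List[List[float]], tau: float = 0.8) -> List[List[str]]:
    """Label each (description i, image j) pair from confidence scores c_ij.

    Assumptions for this sketch: the annotated pair is i == j, and any other
    pair with c_ij >= tau (a hypothetical threshold) is an unlabeled positive.
    """
    labels: List[List[str]] = []
    for i, row in enumerate(conf):
        row_labels: List[str] = []
        for j, c in enumerate(row):
            if i == j:
                row_labels.append("positive")
            elif c >= tau:
                row_labels.append("unlabeled_positive")
            else:
                row_labels.append("negative")
        labels.append(row_labels)
    return labels


if __name__ == "__main__":
    # Toy confidence matrix for three descriptions and three images.
    c = [
        [0.90, 0.85, 0.10],
        [0.20, 0.95, 0.30],
        [0.81, 0.00, 0.70],
    ]
    for row in assign_labels(c):
        print(row)
```

A relaxed loss could then down-weight or exclude the unlabeled positives instead of penalizing them as negatives. How NaiLIA does this is defined by its loss equations, which are not reproduced in this section.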