Character-based Outfit Generation with Vision-augmented Style Extraction via LLMs

Najmeh Forouzandehmehr, Yijie Cao, Nikhil Thakurdesai, Ramin Giahi, Luyi Ma, Nima Farrokhsiar, Jianpeng Xu, Evren Korpeoglu, Kannan Achan

TL;DR

This work defines the Character-based Outfit Generation (COG) problem: generating complete outfits inspired by famous characters, conditioned on user attributes such as age and gender. It introduces the LVA-COG framework, which combines Large Language Models (LLMs) with vision models (Stable Diffusion XL (SDXL) and Detectron2) and prompt engineering to infer character-aligned item prototypes and retrieve cohesive outfits from an e-commerce catalog. Three variants (Baseline (text-only), Vision-Enhanced, and Diverse Style) form an end-to-end multimodal pipeline, evaluated with both GPT-4-based assessments and human raters across 29 characters. Findings show that the vision-enhanced and diverse-style combination yields the best alignment with character style and user specifications, while also revealing gender biases that warrant further investigation. The approach has practical implications for personalized, character-driven fashion recommendations in e-commerce.

Abstract

The outfit generation problem involves recommending a complete outfit to a user based on their interests. Existing approaches focus on recommending items based on anchor items or specific query styles but do not consider customer interests in famous characters from movies, social media, etc. In this paper, we define a new Character-based Outfit Generation (COG) problem, designed to accurately interpret character information and generate complete outfit sets according to customer specifications such as age and gender. To tackle this problem, we propose a novel framework, LVA-COG, that leverages Large Language Models (LLMs) to extract insights from customer interests (e.g., character information) and employs prompt engineering techniques for accurate understanding of customer preferences. Additionally, we incorporate text-to-image models to enhance the visual understanding and generation (factual or counterfactual) of cohesive outfits. Our framework integrates LLMs with text-to-image models and improves the customer's approach to fashion by generating personalized recommendations. With experiments and case studies, we demonstrate the effectiveness of our solution from multiple dimensions.
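
The text-only flow described in the abstract (an LLM proposes item prototypes for a character, and a retrieval system matches them against a catalog) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the prompt wording, the `CatalogItem` type, the toy catalog, the hard-coded `PROTOTYPES` dictionary standing in for an LLM response, and the token-overlap score standing in for a real retrieval system.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class CatalogItem:
    # Hypothetical catalog record; real catalogs carry far more attributes.
    item_id: str
    title: str
    category: str


def build_prototype_prompt(character: str, age_group: str, gender: str) -> str:
    """Hypothetical prompt asking an LLM for one item prototype per category."""
    return (
        f"List clothing items that capture the style of {character} "
        f"for a {age_group} {gender} customer. "
        "Give one short description each for: top, bottom, shoes."
    )


def jaccard(a: str, b: str) -> float:
    """Token-set overlap; a simple stand-in for a real retrieval score."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


def retrieve_outfit(
    prototypes: dict[str, str], catalog: list[CatalogItem]
) -> dict[str, CatalogItem]:
    """Pick, per category, the catalog item whose title best matches the prototype."""
    outfit = {}
    for category, description in prototypes.items():
        candidates = [item for item in catalog if item.category == category]
        if candidates:
            outfit[category] = max(
                candidates, key=lambda item: jaccard(description, item.title)
            )
    return outfit


# Toy catalog (illustrative data only).
CATALOG = [
    CatalogItem("t1", "slim fit black tuxedo jacket", "top"),
    CatalogItem("t2", "graphic print cotton t-shirt", "top"),
    CatalogItem("b1", "black tailored dress pants", "bottom"),
    CatalogItem("b2", "ripped blue denim shorts", "bottom"),
    CatalogItem("s1", "black leather oxford shoes", "shoes"),
    CatalogItem("s2", "white canvas sneakers", "shoes"),
]

# Placeholder for what an LLM might return for the prompt; no model is called here.
PROTOTYPES = {
    "top": "black tuxedo jacket",
    "bottom": "tailored black dress pants",
    "shoes": "black leather oxford dress shoes",
}

prompt = build_prototype_prompt("James Bond", "adult", "male")
outfit = retrieve_outfit(PROTOTYPES, CATALOG)
for category, item in outfit.items():
    print(category, "->", item.title)
```

Retrieving one item per category keeps the result a complete outfit rather than a ranked list of similar items. A vision-enhanced variant would presumably derive or refine the prototypes from a generated image instead of from text alone, which this sketch does not attempt.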

Paper Structure (23 sections, 4 figures, 2 tables)

This paper contains 23 sections, 4 figures, 2 tables.

Figures (4)

  • Figure 1: Two examples of outfit generation by the Complete-the-Look (CTL) module on Walmart.com, which generates an outfit based on a specific item that customers already own or have expressed interest in. The anchor items are highlighted with red boxes. The remaining items in (a) and (b) are the compatible items that complete each outfit.
  • Figure 2: Generated outfit (right panel) based on the James Bond character (left panel). Note that the image is from the Internet and is shown here to visualize the character.
  • Figure 3: Architecture of our LVA-COG solution. Note that the generic item retrieval system in LVA-COG-BL and LVA-COG-VE refers to any implementation based on the item catalog of a given e-commerce platform.
  • Figure 4: An SDXL-generated image (left panel) and a list of Llama2-generated item prototypes (right panel). The SDXL-generated image creates a novel combination of "James Bond" for a teenage girl and balances the style well, whereas the prototypes (highlighted in yellow) are too general.