Character-based Outfit Generation with Vision-augmented Style Extraction via LLMs
Najmeh Forouzandehmehr, Yijie Cao, Nikhil Thakurdesai, Ramin Giahi, Luyi Ma, Nima Farrokhsiar, Jianpeng Xu, Evren Korpeoglu, Kannan Achan
TL;DR
This work defines the Character-based Outfit Generation (COG) problem: generating complete outfits in the style of famous characters, conditioned on user attributes such as age and gender. It introduces the LVA-COG framework, which combines Large Language Models (LLMs) with vision models (Stable Diffusion SDXL and Detectron2) and prompt engineering to infer character-aligned item prototypes and retrieve cohesive outfits from an e-commerce catalog. Three variants (Baseline (text-only), Vision-Enhanced, and Diverse Style) form an end-to-end multimodal pipeline, evaluated with both GPT-4-based assessments and human raters across 29 characters. Findings show that the vision-enhanced and diverse-style combination yields the best alignment with character style and user specifications, while also revealing gender biases that warrant further investigation. The approach has practical implications for personalized, character-driven fashion recommendations in e-commerce.
Abstract
The outfit generation problem involves recommending a complete outfit to a user based on their interests. Existing approaches focus on recommending items based on anchor items or specific query styles but do not consider customer interests in famous characters from movies, social media, etc. In this paper, we define a new Character-based Outfit Generation (COG) problem, designed to accurately interpret character information and generate complete outfit sets according to customer specifications such as age and gender. To tackle this problem, we propose a novel framework, LVA-COG, that leverages Large Language Models (LLMs) to extract insights from customer interests (e.g., character information) and employs prompt engineering techniques for accurate understanding of customer preferences. Additionally, we incorporate text-to-image models to enhance the visual understanding and generation (factual or counterfactual) of cohesive outfits. Our framework integrates LLMs with text-to-image models and improves the customer's approach to fashion by generating personalized recommendations. Through experiments and case studies, we demonstrate the effectiveness of our solution along multiple dimensions.
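To make the final retrieval step concrete, the following is a minimal sketch, not the authors' implementation. It assumes that an LLM has already produced one short text prototype per clothing category for a given character (hard-coded here), and it uses simple token overlap (Jaccard similarity) as a stand-in for whatever matching method the framework actually uses. The names `CatalogItem`, `retrieve_outfit`, and the toy catalog are all hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class CatalogItem:
    """Hypothetical catalog record; the field names are assumptions."""
    item_id: str
    category: str   # e.g. "top", "bottom", "shoes"
    gender: str     # e.g. "women", "men", "unisex"
    age_group: str  # e.g. "adult", "kids"
    title: str


def tokens(text: str) -> set[str]:
    """Lowercase whitespace tokenization."""
    return set(text.lower().split())


def jaccard(a: set[str], b: set[str]) -> float:
    """Token-overlap similarity; a stand-in for a learned similarity."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def retrieve_outfit(
    prototypes: dict[str, str],
    catalog: list[CatalogItem],
    gender: str,
    age_group: str,
) -> dict[str, CatalogItem]:
    """Pick, per category, the catalog item closest to its prototype.

    Items are first filtered by the customer's gender and age group;
    categories with no eligible item are left out of the outfit.
    """
    outfit: dict[str, CatalogItem] = {}
    for category, description in prototypes.items():
        candidates = [
            item for item in catalog
            if item.category == category
            and item.gender in (gender, "unisex")
            and item.age_group == age_group
        ]
        if not candidates:
            continue
        proto_tokens = tokens(description)
        outfit[category] = max(
            candidates,
            key=lambda item: jaccard(proto_tokens, tokens(item.title)),
        )
    return outfit


# Toy data. In the framework described above, the prototypes would
# come from an LLM prompted with the character name (assumption).
catalog = [
    CatalogItem("t1", "top", "women", "adult", "red leather jacket"),
    CatalogItem("t2", "top", "women", "adult", "blue denim shirt"),
    CatalogItem("t3", "top", "men", "adult", "red leather jacket"),
    CatalogItem("b1", "bottom", "women", "adult", "black skinny jeans"),
    CatalogItem("b2", "bottom", "unisex", "adult", "grey sweatpants"),
    CatalogItem("s1", "shoes", "women", "kids", "black combat boots"),
]
prototypes = {
    "top": "red leather biker jacket",
    "bottom": "black ripped jeans",
    "shoes": "black combat boots",
}
outfit = retrieve_outfit(prototypes, catalog, gender="women", age_group="adult")
for category, item in outfit.items():
    print(category, item.item_id, item.title)
```

In this toy run the shoes prototype finds no match, because the only boots in the catalog are in the kids age group. This illustrates how customer specifications such as age and gender can constrain the outfit independently of how well an item matches the character's style.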
