What should I wear to a party in a Greek taverna? Evaluation for Conversational Agents in the Fashion Domain

Antonis Maronikolakis, Ana Peleteiro Ramallo, Weiwei Cheng, Thomas Kober

TL;DR

The paper tackles evaluating LLM-driven fashion assistants that translate customer needs into backend search queries by introducing a multilingual evaluation dataset generated via simulation and rigorously verified. It presents a two-stage data pipeline, a detailed attribute abstraction for fashion items, and three customer-generation modes, enabling robust benchmarking across languages. Through AssistantEval and QueryGenEval, the study shows GPT-4 generally outperforms open-source models but at higher cost, with detailed attribute-precision analyses and theme-based evaluation clarifying strengths and weaknesses across languages. The work provides a practical, scalable framework for fair, reproducible assessment of conversational fashion assistants suitable for deployment in real e-commerce settings.

Abstract

Large language models (LLMs) are poised to revolutionize the domain of online fashion retail, enhancing customer experience and discovery of fashion online. LLM-powered conversational agents introduce a new way of discovery by directly interacting with customers, enabling them to express themselves in their own ways, refine their needs, and obtain fashion and shopping advice that is relevant to their taste and intent. For many tasks in e-commerce, such as finding a specific product, conversational agents need to convert their interactions with a customer into a specific call to a backend system, e.g., a search system that showcases a relevant set of products. Therefore, evaluating the capabilities of LLMs to perform tasks that involve calling other services is vital. However, such evaluations are generally complex: relevant, high-quality datasets are lacking, and existing evaluations do not align seamlessly with business needs, among other issues. To this end, we created a multilingual evaluation dataset of 4k conversations between customers and a fashion assistant on a large e-commerce fashion platform to measure the capabilities of LLMs to serve as an assistant between customers and a backend engine. We evaluate a range of models, showcasing how our dataset scales to business needs and facilitates iterative development of tools.

Paper Structure (17 sections, 9 tables)