Vectra: A New Metric, Dataset, and Model for Visual Quality Assessment in E-Commerce In-Image Machine Translation
Qingyu Wu, Yuxuan Han, Haijun Li, Zhao Xu, Jianshan Zhao, Xu Jin, Longyue Wang, Weihua Luo
TL;DR
Vectra addresses the critical need for fine-grained, reference-free visual quality assessment in e-commerce In-Image Machine Translation. It introduces a three-component framework: Vectra Score (14 interpretable dimensions with Defect Area Ratio), a diversely sourced Vectra Dataset, and a 4B-parameter Vectra Model trained with supervised fine-tuning and preference alignment. Empirical results show strong correlation with human rankings on in-domain and out-of-domain benchmarks, with Vectra outperforming leading MLLMs in scoring accuracy and diagnostic reasoning. The framework reduces annotation variance and provides interpretable, diagnostic feedback, offering a practical reward signal for optimizing commercial IIMT systems and enabling broader applicability to multimodal vision-language tasks.
Abstract
In-Image Machine Translation (IIMT) powers cross-border e-commerce product listings. Existing research focuses on evaluating translation quality, but visual rendering quality is also critical for user engagement. On context-dense product imagery with multimodal defects, current reference-based methods (e.g., SSIM, FID) lack explainability, while model-as-judge approaches lack domain-grounded, fine-grained reward signals. To bridge this gap, we introduce Vectra, which is, to the best of our knowledge, the first reference-free, MLLM-driven visual quality assessment framework for e-commerce IIMT. Vectra comprises three components: (1) Vectra Score, a multidimensional quality metric system that decomposes visual quality into 14 interpretable dimensions, with spatially aware Defect Area Ratio (DAR) quantification to reduce annotation ambiguity; (2) Vectra Dataset, constructed from 1.1M real-world product images via diversity-aware sampling, comprising a 2K benchmark for system evaluation, 30K reasoning-based annotations for instruction tuning, and 3.5K expert-labeled preferences for alignment and evaluation; and (3) Vectra Model, a 4B-parameter MLLM that generates both quantitative scores and diagnostic reasoning. Experiments demonstrate that Vectra achieves state-of-the-art correlation with human rankings, and our model outperforms leading MLLMs, including GPT-5 and Gemini-3, in scoring performance. The dataset and model will be released upon acceptance.
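The abstract does not spell out how the Defect Area Ratio is computed. As a rough illustration only, the sketch below assumes DAR is the fraction of image pixels covered by at least one annotated defect region, with regions given as axis-aligned pixel boxes. The function name `defect_area_ratio`, the box format, and the union-of-boxes definition are all assumptions made for this example, not the authors' specification.

```python
from typing import Iterable, Tuple

# Hypothetical box format: (x0, y0, x1, y1) in pixels, with x1 and y1 exclusive.
Box = Tuple[int, int, int, int]


def defect_area_ratio(boxes: Iterable[Box], width: int, height: int) -> float:
    """Return the fraction of image pixels covered by at least one box.

    Assumed definition for illustration: overlapping boxes are counted once,
    and boxes are clipped to the image bounds.
    """
    if width <= 0 or height <= 0:
        raise ValueError("image dimensions must be positive")

    # One byte per pixel, row-major; 1 marks a pixel inside some defect box.
    mask = bytearray(width * height)

    for x0, y0, x1, y1 in boxes:
        # Clip the box to the image.
        x0, y0 = max(0, x0), max(0, y0)
        x1, y1 = min(width, x1), min(height, y1)
        if x1 <= x0 or y1 <= y0:
            continue  # box is empty or lies entirely outside the image
        row_fill = b"\x01" * (x1 - x0)
        for y in range(y0, y1):
            start = y * width + x0
            mask[start:start + (x1 - x0)] = row_fill

    return sum(mask) / (width * height)


if __name__ == "__main__":
    # Two overlapping 10x10 defect boxes on a 100x100 image.
    ratio = defect_area_ratio([(0, 0, 10, 10), (5, 5, 15, 15)], 100, 100)
    print(f"DAR = {ratio:.4f}")
```

Counting the union rather than the sum of box areas keeps the ratio within [0, 1] when annotations overlap, which is one plausible way a spatial measure could reduce ambiguity between annotators who draw regions differently.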
