Vectra: A New Metric, Dataset, and Model for Visual Quality Assessment in E-Commerce In-Image Machine Translation

Qingyu Wu, Yuxuan Han, Haijun Li, Zhao Xu, Jianshan Zhao, Xu Jin, Longyue Wang, Weihua Luo

TL;DR

Vectra addresses the critical need for fine-grained, reference-free visual quality assessment in e-commerce In-Image Machine Translation. It introduces a three-component framework: Vectra Score (14 interpretable dimensions with Defect Area Ratio), a diversely sourced Vectra Dataset, and a 4B-parameter Vectra Model trained with supervised fine-tuning and preference alignment. Empirical results show strong correlation with human rankings on in-domain and out-of-domain benchmarks, with Vectra outperforming leading MLLMs in scoring accuracy and diagnostic reasoning. The framework reduces annotation variance and provides interpretable, diagnostic feedback, offering a practical reward signal for optimizing commercial IIMT systems and enabling broader applicability to multimodal vision-language tasks.

Abstract

In-Image Machine Translation (IIMT) powers cross-border e-commerce product listings. Existing research focuses on evaluating translation quality, yet visual rendering quality is also critical for user engagement. On context-dense product imagery with multimodal defects, current reference-based methods (e.g., SSIM, FID) lack explainability, while model-as-judge approaches lack domain-grounded, fine-grained reward signals. To bridge this gap, we introduce Vectra, which is, to the best of our knowledge, the first reference-free, MLLM-driven visual quality assessment framework for e-commerce IIMT. Vectra comprises three components: (1) Vectra Score, a multidimensional quality metric system that decomposes visual quality into 14 interpretable dimensions, with spatially aware Defect Area Ratio (DAR) quantification to reduce annotation ambiguity; (2) Vectra Dataset, constructed from 1.1M real-world product images via diversity-aware sampling, comprising a 2K benchmark for system evaluation, 30K reasoning-based annotations for instruction tuning, and 3.5K expert-labeled preferences for alignment and evaluation; and (3) Vectra Model, a 4B-parameter MLLM that generates both quantitative scores and diagnostic reasoning. Experiments demonstrate that Vectra achieves state-of-the-art correlation with human rankings, and that our model outperforms leading MLLMs, including GPT-5 and Gemini-3, in scoring performance. The dataset and model will be released upon acceptance.

Paper Structure

This paper contains 47 sections, 8 equations, 14 figures, 11 tables, 1 algorithm.

Figures (14)

  • Figure 1: An illustration of the limitations in current IIMT visual quality assessment. Existing methods (Top) struggle to pinpoint fine-grained defects or serve as effective reward signals in contextually dense e-commerce scenarios. In contrast, Vectra decomposes visual quality into 14 dimensions (Middle), enabling precise error detection and reasoning-based diagnostics (Bottom).
  • Figure 2: Illustration of the Defect Area Ratio (DAR) calculation mechanism between the original image (a) and the translated image (b). The metric quantifies quality by measuring the defective areas (red) relative to the total target areas (green) for both textual (c) and non-textual (d) elements.
  • Figure 3: Overview of the Vectra framework. Visual quality is decomposed into 14 dimensions: Text (Blue, 1–8) and Scene (Pink, 9–14), with surrounding examples illustrating representative defects. The center shows the Vectra Score computation via multiplicative aggregation, Data Suite construction, and Vectra Model training pipeline.
  • Figure 4: Expert rejection rates as a function of DAR values. A sharp increase in rejection rate occurs at $\tau = 0.3$, which we adopt as the DAR threshold for distinguishing Fair (Score 2) from Poor (Score 1) quality.
  • Figure 5: Data pipeline: ($\alpha$) Diversity-aware sampling and translation pair construction; ($\beta$) Distribution balancing via minority augmentation and pruning.
  • ...and 9 more figures
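
The captions of Figures 2 and 4 describe DAR as the defective area relative to the total target area, with $\tau = 0.3$ separating Fair (Score 2) from Poor (Score 1). The following is a minimal sketch of that idea, not the authors' implementation. It assumes binary pixel masks, counts only defective pixels that fall inside the target region, returns 0 for an empty target, and treats a DAR exactly equal to $\tau$ as Poor. The function names `defect_area_ratio` and `fair_or_poor` are hypothetical, and the sketch covers only the Fair/Poor boundary, not the other score levels.

```python
from typing import Sequence

# A mask is a grid of 0/1 flags, one per pixel (assumed representation).
Mask = Sequence[Sequence[int]]

DAR_THRESHOLD = 0.3  # tau, as stated in the Figure 4 caption


def defect_area_ratio(target: Mask, defect: Mask) -> float:
    """Fraction of target pixels that are also marked defective."""
    target_px = 0
    defect_px = 0
    for t_row, d_row in zip(target, defect):
        for t, d in zip(t_row, d_row):
            if t:
                target_px += 1
                if d:
                    # Assumption: defects outside the target area are ignored.
                    defect_px += 1
    if target_px == 0:
        return 0.0  # Assumption: no target area means nothing to penalise.
    return defect_px / target_px


def fair_or_poor(dar: float, tau: float = DAR_THRESHOLD) -> int:
    """Return 2 (Fair) when dar < tau, else 1 (Poor).

    Assumption: the boundary value dar == tau is treated as Poor.
    """
    return 2 if dar < tau else 1


# Toy example: 8 target pixels, 2 of which are defective.
target = [
    [1, 1, 1, 1],
    [1, 1, 1, 1],
    [0, 0, 0, 0],
]
defect = [
    [1, 1, 0, 0],
    [0, 0, 0, 0],
    [1, 0, 0, 0],  # outside the target region, so not counted
]
dar = defect_area_ratio(target, defect)
print(dar, fair_or_poor(dar))
```

In this toy case the ratio is 2/8, which falls below the threshold and maps to Fair. In practice the masks would come from annotators or a detector, and would be computed separately for textual and non-textual elements as Figure 2 indicates.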