Enhancing Content Representation for AR Image Quality Assessment Using Knowledge Distillation

Aymen Sekhri, Seyed Ali Amirshahi, Mohamed-Chaker Larabi

TL;DR

This paper tackles AR image quality assessment under visual confusion by introducing TransformAR, a lightweight transformer-based FR-IQA framework that leverages content-aware encoders, shift representations, and cross-attention decoders to capture distortion-related quality information. The approach is enhanced with knowledge distillation, ground-truth class supervision, label smoothing, and elastic-net regularization, yielding three variants: TransformAR, TransformAR-KD, and TransformAR-KD+. Evaluations on the ARIQA dataset show state-of-the-art performance, with TransformAR-KD+ achieving the best metrics and ablation studies highlighting the contribution of each component. The work advances AR-IQA by addressing data scarcity and visual confusion through explicit content representation and distortion-driven reasoning, with implications for QoE optimization and dataset development for AR technologies.

Abstract

Augmented Reality (AR) is a major immersive media technology that enriches our perception of reality by overlaying digital content (the foreground) onto physical environments (the background). It has far-reaching applications, from entertainment and gaming to education, healthcare, and industrial training. Nevertheless, challenges such as visual confusion and classical distortions can cause user discomfort. Evaluating AR quality of experience is therefore essential for measuring user satisfaction and engagement, and for guiding the refinements needed to create immersive and robust experiences. However, the scarcity of data and the distinctive characteristics of AR technology make it challenging to develop effective quality assessment metrics. This paper presents a deep learning-based objective metric designed specifically for assessing image quality in AR scenarios. The approach entails four key steps: (1) fine-tuning a self-supervised pre-trained vision transformer to extract prominent features from reference images and distilling this knowledge to improve representations of distorted images, (2) quantifying distortions by computing shift representations, (3) employing cross-attention-based decoders to capture perceptual quality features, and (4) integrating regularization techniques and label smoothing to address the overfitting problem. To validate the proposed approach, we conduct extensive experiments on the ARIQA dataset. The results show that all model variants, namely TransformAR, TransformAR-KD, and TransformAR-KD+, outperform existing state-of-the-art methods.
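
As a rough illustration of the regularization in step (4), the TL;DR names elastic-net regularization, which in its standard form adds a weighted sum of an L1 and a squared L2 penalty on the weights to the training loss. The sketch below shows only that generic form; the function names, the coefficient values, and the choice of which weights are penalized are assumptions for illustration and are not taken from the paper.

```python
def elastic_net_penalty(weights, l1_coeff=1e-4, l2_coeff=1e-4):
    """Generic elastic-net penalty: l1_coeff * ||w||_1 + l2_coeff * ||w||_2^2.

    `weights` is a flat list of floats. The coefficient defaults are
    placeholders (assumed), not values reported by the authors.
    """
    l1_term = sum(abs(w) for w in weights)   # sparsity-inducing L1 term
    l2_term = sum(w * w for w in weights)    # squared L2 (ridge) term
    return l1_coeff * l1_term + l2_coeff * l2_term


def regularized_loss(data_loss, weights, l1_coeff=1e-4, l2_coeff=1e-4):
    """Add the elastic-net penalty to an already computed data loss."""
    return data_loss + elastic_net_penalty(weights, l1_coeff, l2_coeff)
```

In a training loop, `data_loss` would be whatever quality-regression loss the model uses, and `weights` the parameters chosen for regularization; both choices are left open here.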

Paper Structure

This paper contains 27 sections, 19 equations, 13 figures, 3 tables.

Figures (13)

  • Figure 1: AR image (foreground) superimposed on the background image, and the viewport captured during the subjective test. The cropped region is used to evaluate the objective metric.
  • Figure 2: Illustration of the encoder-decoder transformer architecture. (left) Two transformer encoder blocks are stacked, each preceded by Layernorm and followed by MSA. In addition, feed-forward neural networks and skip connections are used. (right) The decoder consists of a single transformer decoder block, preceded by Layernorm and followed by a cross-attention mechanism, along with feed-forward neural networks and skip connections.
  • Figure 3: Overview of the TransformAR-KD+ model. Content-aware encoders $\mathcal{F}^i(\cdot)$ extract abstract representations $f^i \in \mathbb{R}^{N\times C}$ from each input image $I^i$, where $i \in \{a, b, s\}$. $MLP^a$ and $MLP^b$ are used to generate ground-truth representations $\hat{f}^a_{cls}$ and $\hat{f}^b_{cls}$, predicting the classes of references. Subsequently, the cosine similarity between $f^s_{cls}$ and the ground-truth representations is maximized, and the $l_1$-distance is employed to compute the shift caused by distortions. Quality-aware decoders $G^{as}(\cdot)$ and $G^{bs}(\cdot)$ align the information in the reference representations with relevant information in the shift representations using the cross-attention mechanism, while regressors $R^{as}(\cdot)$ and $R^{bs}(\cdot)$ map the quality representations to quality scores. These scores are aggregated to produce $pMOS$.
  • Figure 4: Impact of label smoothing: (a) Scores before (blue dots) and after (red dots), and (b) Difference between scores.
  • Figure 6: Evolution of SRCC performance across epochs within a single fold.
  • ...and 8 more figures
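
The Figure 3 caption mentions two simple vector operations: a cosine similarity between class-token representations and an $l_1$-based shift between reference and distorted representations. The sketch below illustrates both in plain Python. It assumes the shift is the element-wise absolute difference between two $N \times C$ feature maps, which is one reading of the caption rather than a confirmed detail; the function names are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def shift_representation(f_ref, f_dist):
    """Element-wise absolute difference |f_ref - f_dist| (assumed form).

    Both inputs are N x C feature maps stored as lists of rows.
    """
    return [
        [abs(r - d) for r, d in zip(row_ref, row_dist)]
        for row_ref, row_dist in zip(f_ref, f_dist)
    ]
```

Under this reading, a distillation-style objective would push `cosine_similarity` between the distorted image's class token and each reference class representation toward 1, while `shift_representation` would be the input passed to the quality-aware decoders.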