Xray-Visual Models: Scaling Vision models on Industry Scale Data

Shlok Mishra, Tsung-Yu Lin, Linda Wang, Hongli Xu, Yimin Liu, Michael Hsu, Chaitanya Ahuja, Hao Yuan, Jianpeng Cheng, Hong-You Chen, Haoyuan Xu, Chao Li, Abhijeet Awasthi, Jihye Moon, Don Husa, Michael Ge, Sumedha Singla, Arkabandhu Chowdhury, Phong Dingh, Satya Narayan Shukla, Yonghuan Yang, David Jacobs, Qi Guo, Jun Xiao, Xiangjun Fan, Aashu Singh

TL;DR

This work introduces a three-stage training pipeline that combines self-supervised MAE, semi-supervised hashtag classification, and CLIP-style contrastive learning to jointly optimize image and video modalities. It also demonstrates that integrating large language models as text encoders (LLM2CLIP) significantly enhances retrieval performance and generalization capabilities, particularly in real-world environments.

Abstract

We present Xray-Visual, a unified vision model architecture for large-scale image and video understanding trained on industry-scale social media data. Our model leverages over 15 billion curated image-text pairs and 10 billion video-hashtag pairs from Facebook and Instagram, employing robust data curation pipelines that incorporate balancing and noise suppression strategies to maximize semantic diversity while minimizing label noise. We introduce a three-stage training pipeline that combines self-supervised MAE, semi-supervised hashtag classification, and CLIP-style contrastive learning to jointly optimize image and video modalities. Our architecture builds on a Vision Transformer backbone enhanced with efficient token reorganization (EViT) for improved computational efficiency. Extensive experiments demonstrate that Xray-Visual achieves state-of-the-art performance across diverse benchmarks, including ImageNet for image classification, Kinetics and HMDB51 for video understanding, and MSCOCO for cross-modal retrieval. The model exhibits strong robustness to domain shift and adversarial perturbations. We further demonstrate that integrating large language models as text encoders (LLM2CLIP) significantly enhances retrieval performance and generalization capabilities, particularly in real-world environments. Xray-Visual establishes new benchmarks for scalable, multimodal vision models, while maintaining superior accuracy and computational efficiency.
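
The CLIP-style contrastive stage mentioned in the abstract can be illustrated with a minimal sketch of the standard symmetric contrastive (InfoNCE) objective. This is not the authors' implementation: the function names are hypothetical, the temperature of 0.07 is an assumed value, and the toy embeddings stand in for real encoder outputs.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def log_softmax(row):
    """Numerically stable log-softmax over one row of logits."""
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]


def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired (image, text) embeddings.

    temperature=0.07 is an assumption for illustration, not a value from the paper.
    """
    imgs = [l2_normalize(v) for v in image_emb]
    txts = [l2_normalize(v) for v in text_emb]
    n = len(imgs)
    # logits[i][j]: cosine similarity of image i and text j, scaled by 1/temperature
    logits = [
        [sum(a * b for a, b in zip(img, txt)) / temperature for txt in txts]
        for img in imgs
    ]
    # Image-to-text direction: the matching text sits on the diagonal.
    i2t = -sum(log_softmax(logits[k])[k] for k in range(n)) / n
    # Text-to-image direction: same computation on the transposed logits.
    logits_t = [list(col) for col in zip(*logits)]
    t2i = -sum(log_softmax(logits_t[k])[k] for k in range(n)) / n
    return 0.5 * (i2t + t2i)


# Toy batch: two pairs whose embeddings are perfectly aligned vs. swapped.
aligned = [[1.0, 0.0], [0.0, 1.0]]
swapped = [[0.0, 1.0], [1.0, 0.0]]
loss_aligned = clip_contrastive_loss(aligned, aligned)
loss_swapped = clip_contrastive_loss(aligned, swapped)
```

In this toy batch, correctly paired embeddings yield a loss near zero, while swapping the text embeddings yields a much larger loss. That gap is the signal the contrastive stage optimizes.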

Paper Structure (76 sections, 1 equation, 8 figures, 17 tables)

Figures (8)

  • Figure 1: XRay achieves 89.3% Top-1 accuracy at 336px with 288 tokens, versus baselines at 448px with 1024 tokens. This is a 71.9% token reduction, a 43.75% pixel-area reduction, and an 84.2% reduction in the product proxy (6.3× lower combined cost).
  • Figure 2: Comparison of training data scale across vision models. (a) XrayVisual leverages the largest curated dataset for vision encoder training to date; (b) we utilize 10× more video data than state-of-the-art world models such as V-JEPA (Assran et al., 2025).
  • Figure 3: XRV Architecture: XrayVisual uses a Vision Transformer (ViT; Dosovitskiy et al., 2021) backbone with 3D tokenization for joint image-video training. Images are repeated along the temporal dimension with zero-padding to match the 3D convolution kernel and positional embeddings, while videos are processed directly without modification.
  • Figure 4: Our three-stage training pipeline consists of MAE pre-training, then hashtag classification, and finally CLIP-style contrastive learning.
  • Figure 5: Scaling analysis. ImageNet zero-shot accuracy improves consistently as the number of unique training examples increases from 1B to 5B. EViT-2B models trained with denoising loss exhibit strong scaling behavior across all data regimes. All models are trained for 500K iterations.
  • ...and 3 more figures
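
The reductions quoted in the Figure 1 caption can be reproduced with a few lines of arithmetic. The sketch below assumes square inputs (so pixel area scales with the square of the side length) and assumes the "product proxy" is the token ratio multiplied by the pixel-area ratio. The function name is hypothetical.

```python
def relative_cost(tokens, baseline_tokens, side_px, baseline_side_px):
    """Compare a model's token count and input resolution against a baseline.

    Assumes square inputs and a cost proxy equal to
    (token ratio) x (pixel-area ratio); both are assumptions for illustration.
    """
    token_ratio = tokens / baseline_tokens
    area_ratio = (side_px / baseline_side_px) ** 2
    product_ratio = token_ratio * area_ratio
    return {
        "token_reduction_pct": 100 * (1 - token_ratio),
        "area_reduction_pct": 100 * (1 - area_ratio),
        "product_reduction_pct": 100 * (1 - product_ratio),
        "combined_cost_factor": 1 / product_ratio,
    }


# Figure 1 setting: 336px / 288 tokens vs. baselines at 448px / 1024 tokens.
stats = relative_cost(tokens=288, baseline_tokens=1024, side_px=336, baseline_side_px=448)
for name, value in stats.items():
    print(f"{name}: {value:.2f}")
```

Under these assumptions the computed values round to the caption's figures: about 71.9% fewer tokens, 43.75% less pixel area, an 84.2% lower product, and a combined factor of roughly 6.3×.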