G-Meta: Distributed Meta Learning in GPU Clusters for Large-Scale Recommender Systems

Youshao Xiao, Shangchun Zhao, Zhenglei Zhou, Zhaoxin Huan, Lin Ju, Xiaolu Zhang, Lin Wang, Jun Zhou

TL;DR

This work tackles the inefficiency of distributed optimization-based meta-learning for large-scale DLRMs, where two update loops and massive embedding parameters hinder scalable training on GPU clusters. It introduces G-Meta, a GPU-cluster framework that combines hybrid parallelism (AlltoAll and AllReduce) with a high-throughput Meta-IO pipeline to address both computation/communication bottlenecks and I/O bottlenecks in meta-learning. The main contributions are a distributed inner/outer loop design with an optimized outer update rule, network-aware optimizations using RoCE and NVLink, and a data ingestion pipeline tailored for task-consistent batching, validated by large-scale experiments and real-world deployment. Online deployment in Alipay demonstrates practical impact, achieving fourfold faster model delivery and measurable gains in CVR and CPM due to larger training data and task diversity.
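
The two collectives named above can be illustrated with a small single-process simulation. This is only a sketch under stated assumptions, not G-Meta's implementation: it assumes the common DLRM arrangement in which embedding rows are sharded across workers and fetched with AlltoAll, while replicated dense gradients are averaged with AllReduce. The sharding rule (`row_id % n`) and the function names `all_to_all`, `all_reduce_mean`, and `sharded_lookup` are hypothetical and chosen for illustration.

```python
# Single-process simulation of the two collectives; no real GPUs or network.
# Assumption: embedding row r is owned by worker r % n (hypothetical rule).

def all_to_all(send):
    """send[src][dst] is what worker src sends to worker dst.
    Returns recv such that recv[dst][src] == send[src][dst]."""
    n = len(send)
    return [[send[src][dst] for src in range(n)] for dst in range(n)]

def all_reduce_mean(values):
    """Every worker ends up with the element-wise mean of all vectors."""
    n = len(values)
    mean = [sum(col) / n for col in zip(*values)]
    return [list(mean) for _ in range(n)]

def sharded_lookup(shards, ids_per_worker):
    """Fetch embedding rows that may live on other workers."""
    n = len(shards)
    # 1) Each worker buckets the ids it needs by owning worker.
    requests = [[[i for i in ids if i % n == dst] for dst in range(n)]
                for ids in ids_per_worker]
    # 2) First AlltoAll: ship id requests to the owners.
    incoming = all_to_all(requests)
    # 3) Each owner reads the requested rows from its local shard.
    replies = [[[shards[owner][i] for i in incoming[owner][src]]
                for src in range(n)] for owner in range(n)]
    # 4) Second AlltoAll: ship the rows back to the requesters.
    returned = all_to_all(replies)
    # 5) Reassemble one {row_id: row} map per worker.
    out = []
    for w in range(n):
        rows = {}
        for owner in range(n):
            for i, row in zip(requests[w][owner], returned[w][owner]):
                rows[i] = row
        out.append(rows)
    return out

# Two workers; worker 0 owns even row ids, worker 1 owns odd row ids.
shards = [{0: [0.0, 0.0], 2: [2.0, 2.0]}, {1: [1.0, 1.0], 3: [3.0, 3.0]}]
looked_up = sharded_lookup(shards, [[1, 2], [0, 3]])

# Dense gradients are replicated, so they are simply averaged across workers.
dense_grads = all_reduce_mean([[1.0, 2.0], [3.0, 4.0]])
```

In this arrangement AlltoAll traffic grows with the number of embedding rows each batch touches, whereas AllReduce traffic grows with the size of the dense parameters, which is one reason the two kinds of parameters are often handled by different collectives.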

Abstract

Recently, a new paradigm, meta learning, has been widely applied to Deep Learning Recommendation Models (DLRM), significantly improving statistical performance, especially in cold-start scenarios. However, existing systems are not tailored to meta-learning-based DLRM models and suffer from critical efficiency problems in distributed training on GPU clusters, because the conventional deep learning pipeline is not optimized for the two task-specific datasets and two update loops of meta learning. This paper presents a high-performance framework for large-scale training of optimization-based Meta DLRM models on GPU clusters, named G-Meta. First, G-Meta combines data parallelism and model parallelism, carefully orchestrated for computation and communication efficiency, to enable high-speed distributed training. Second, it proposes a Meta-IO pipeline for efficient data ingestion to alleviate the I/O bottleneck. Experimental results show that G-Meta achieves notable training speed without loss of statistical performance. Since early 2022, G-Meta has been deployed in Alipay's core advertising and recommender system, shortening continuous model delivery time by a factor of four. It also obtains a 6.48% improvement in Conversion Rate (CVR) and a 1.06% increase in CPM (Cost Per Mille) in Alipay's homepage display advertising, benefiting from a larger number of training samples and tasks.
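
To make the "two task-specific datasets and two update loops" concrete, here is a minimal sketch of optimization-based meta learning on a toy problem. It is an illustration under assumptions, not the paper's algorithm: it uses the first-order approximation of MAML rather than G-Meta's optimized outer update rule, the model is a single scalar weight, and the names `grad_mse`, `meta_train_step`, and `make_task` as well as the learning rates are hypothetical.

```python
# Toy first-order MAML: each task is a 1-D regression y = slope * x,
# and the shared model is y_hat = w * x with a single scalar weight w.

def grad_mse(w, batch):
    """Gradient of the mean squared error of y_hat = w * x w.r.t. w."""
    return sum(2.0 * (w * x - y) * x for x, y in batch) / len(batch)

def meta_train_step(w, tasks, inner_lr=0.05, outer_lr=0.05):
    """One meta step over a batch of (support, query) task pairs."""
    outer_grad = 0.0
    for support, query in tasks:
        # Inner loop: adapt a task-specific copy of w on the support set.
        w_task = w - inner_lr * grad_mse(w, support)
        # Outer loop (first-order approximation): query-set gradient
        # evaluated at the adapted weight, accumulated across tasks.
        outer_grad += grad_mse(w_task, query)
    # Update the shared weight with the task-averaged outer gradient.
    return w - outer_lr * outer_grad / len(tasks)

def make_task(slope):
    """Build the two datasets of one task: a support set and a query set."""
    support = [(x, slope * x) for x in (1.0, 2.0)]
    query = [(x, slope * x) for x in (3.0, 4.0)]
    return support, query

tasks = [make_task(1.0), make_task(3.0)]
w = 0.0
for _ in range(200):
    w = meta_train_step(w, tasks)
# w settles between the two task slopes, a starting point from which
# either task can be reached with a small inner-loop update.
```

Each task contributes both a support set and a query set to every step, so a data pipeline feeding this loop has to keep samples of the same task together within a batch.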

Paper Structure

This paper contains 18 sections, 4 figures, 1 table, 1 algorithm.

Figures (4)

  • Figure 1: The Distributed Training Architecture of G-Meta
  • Figure 2: Dataflow of Meta-IO
  • Figure 3: Model performance of G-Meta vs. DMAML using MAML [vuorio2019multimodal], MeLU [lee2019melu], and CBML [song2021cbml] on the MovieLens dataset.
  • Figure 4: Throughput under different experimental settings, such as with I/O optimization or network optimization, on in-house data.