TF-Replicator: Distributed Machine Learning for Researchers

Peter Buchlovsky, David Budden, Dominik Grewe, Chris Jones, John Aslanides, Frederic Besse, Andy Brock, Aidan Clark, Sergio Gómez Colmenarejo, Aedan Pope, Fabio Viola, Dan Belov

TL;DR

TF-Replicator is a TensorFlow-based abstraction that makes distributed deep learning approachable for researchers: the user defines a simple replica API (input_fn and step_fn), and the framework deploys it across CPU/GPU/TPU clusters using either in-graph or between-graph replication. It provides multiple Replicator backends and MPI-like cross-replica primitives to support data- and model-parallel workloads under synchronous or asynchronous regimes, including gradient averaging via wrap_optimizer. The authors validate the approach on three diverse domains (ResNet-50 on ImageNet, SN-GAN, and D4PG RL), reporting good weak and strong scaling across GPUs and TPUs without requiring distributed-systems expertise from the user. The work aims to accelerate research iteration and reproducibility by providing an open-source framework integrated with TensorFlow 2.0 tooling.
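The replica pattern summarized above (a per-replica input_fn and step_fn, with gradients averaged across replicas before each update) can be illustrated with a small single-process simulation. This is only a sketch under stated assumptions: it does not use TensorFlow or TF-Replicator. The helpers make_input_fn, all_mean, and train are hypothetical names introduced here, and the toy model (fitting w in y = w * x) stands in for a real network. The names input_fn and step_fn are borrowed from the summary, but their signatures here are assumptions.

```python
# Hypothetical single-process simulation of synchronous data parallelism.
# Nothing here is the TF-Replicator API; names only mirror the summary's terms.
from typing import Callable, Iterator, List, Tuple

Batch = List[Tuple[float, float]]  # (x, y) pairs


def make_input_fn(replica_id: int, num_replicas: int,
                  per_replica_batch: int) -> Callable[[], Iterator[Batch]]:
    """Builds an input_fn that yields this replica's shard of y = 3x data."""
    def input_fn() -> Iterator[Batch]:
        step = 0
        while True:
            # Each replica reads a disjoint index range at every step.
            base = (step * num_replicas + replica_id) * per_replica_batch
            xs = [((base + i) % 10) / 10.0 for i in range(per_replica_batch)]
            yield [(x, 3.0 * x) for x in xs]
            step += 1
    return input_fn


def step_fn(w: float, batch: Batch) -> float:
    """Per-replica gradient of the mean squared error for y_hat = w * x."""
    return sum(2.0 * (w * x - y) * x for x, y in batch) / len(batch)


def all_mean(values: List[float]) -> float:
    """Stand-in for an MPI-like cross-replica mean (an all-reduce)."""
    return sum(values) / len(values)


def train(num_replicas: int = 4, per_replica_batch: int = 8,
          steps: int = 200, lr: float = 0.5) -> float:
    """Runs synchronous SGD: every replica steps, then gradients are averaged."""
    iterators = [make_input_fn(r, num_replicas, per_replica_batch)()
                 for r in range(num_replicas)]
    w = 0.0
    for _ in range(steps):
        grads = [step_fn(w, next(it)) for it in iterators]
        # All replicas apply the same averaged gradient, so weights stay in sync.
        w -= lr * all_mean(grads)
    return w


if __name__ == "__main__":
    print(f"learned w = {train():.6f}")
```

Because every shard has the same size, the mean of the per-replica gradients equals the gradient over the combined global batch, which is why averaging keeps the replicas equivalent to one large-batch learner.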

Abstract

We describe TF-Replicator, a framework for distributed machine learning designed for DeepMind researchers and implemented as an abstraction over TensorFlow. TF-Replicator simplifies writing data-parallel and model-parallel research code. The same models can be effortlessly deployed to different cluster architectures (i.e. one or many machines containing CPUs, GPUs or TPU accelerators) using synchronous or asynchronous training regimes. To demonstrate the generality and scalability of TF-Replicator, we implement and benchmark three very different models: (1) a ResNet-50 for ImageNet classification, (2) an SN-GAN for class-conditional ImageNet image generation, and (3) a D4PG reinforcement learning agent for continuous control. Our results show strong scalability without demanding any distributed systems expertise of the user. The TF-Replicator programming model will be open-sourced as part of TensorFlow 2.0 (see https://github.com/tensorflow/community/pull/25).

Paper Structure

This paper contains 20 sections, 3 equations, 4 figures, 2 tables.

Figures (4)

  • Figure 1: Replication patterns for data parallelism in distributed TensorFlow: (a) in-graph replication, and (b) between-graph replication. C = client, M = master service, W = worker service.
  • Figure 2: TF-Replicator ImageNet classification performance that results from scaling a ResNetv1-50 model across different devices. Replicators used: MultiGpuReplicator (1, 8 GPUs), MultiWorkerReplicator (32, 64 GPUs), TpuReplicator (TPU v2).
  • Figure 3: Sample quality (Inception score and FID) and example samples of a class-conditional SN-GAN trained on 128x128 ImageNet samples. All models were trained using 8 NVIDIA Tesla V100 GPUs unless otherwise stated.
  • Figure 4: TF-Replicator D4PG strong-scalability performance (total environment reward) on various DeepMind Control Suite tasks (Tassa et al., 2018), trained from pixel observations with a fixed total batch size of 256. Results presented for: (green) a single TPUv2 device (4 chips, 8 cores); (yellow) 8x NVIDIA Tesla V100 GPUs; (blue) 1 TPUv2 chip (2 cores); and (red) a single V100 GPU.
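
The strong-scaling setup in Figure 4 holds the total batch size fixed at 256, so the per-replica batch shrinks as devices are added. Weak scaling instead fixes the per-replica batch and lets the total grow. The arithmetic can be sketched as follows, assuming one replica per TPU core or per GPU (the caption does not state this). The function names are hypothetical.

```python
# Batch-size arithmetic for strong vs. weak scaling.
# Assumption: one replica per TPU core or per GPU; helper names are hypothetical.

def per_replica_batch(total_batch: int, num_replicas: int) -> int:
    """Strong scaling: a fixed total batch is split evenly across replicas."""
    if total_batch % num_replicas != 0:
        raise ValueError("total batch must divide evenly across replicas")
    return total_batch // num_replicas


def global_batch(per_replica: int, num_replicas: int) -> int:
    """Weak scaling: a fixed per-replica batch gives a growing total batch."""
    return per_replica * num_replicas


if __name__ == "__main__":
    # Replica counts taken from the Figure 4 configurations.
    configs = {
        "TPUv2 device (8 cores)": 8,
        "8x V100 GPUs": 8,
        "TPUv2 chip (2 cores)": 2,
        "single V100 GPU": 1,
    }
    for name, replicas in configs.items():
        print(name, per_replica_batch(256, replicas))
```

Under this assumption, the 8-replica configurations each process 32 examples per replica per step, the 2-core chip processes 128, and the single GPU processes all 256.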