Multi-GPU Training
Due to their high computational complexity, deep learning models often take hours or even days to fully train on a single GPU. In PocketFlow, we adopt multi-GPU training to speed up this time-consuming process. Our implementation is compatible with:
- Horovod: a distributed training framework for TensorFlow, Keras, and PyTorch.
- TF-Plus: an optimized framework for TensorFlow-based distributed training (only available within Tencent).
We provide a wrapper class, MultiGpuWrapper, to seamlessly switch between the above two frameworks. It sequentially checks whether Horovod and TF-Plus are available, and uses the first available one as the underlying framework for multi-GPU training.
The main reason for using Horovod or TF-Plus instead of TensorFlow's native distributed training routines is that these frameworks provide many easy-to-use APIs and require far fewer code changes to move from single-GPU to multi-GPU training, as we shall see below.
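For intuition, the backend-selection fallback can be sketched as follows. This is only a rough illustration, not PocketFlow's actual MultiGpuWrapper code, and the TF-Plus module name used here is a placeholder:
# Rough sketch of the backend-selection fallback (illustration only).
def select_backend():
  """Return the first available multi-GPU backend, preferring Horovod."""
  try:
    import horovod.tensorflow as hvd  # Horovod's TensorFlow bindings
    return hvd
  except ImportError:
    pass  # Horovod is not installed; try TF-Plus next
  try:
    import tfplus  # placeholder module name; TF-Plus is only available within Tencent
    return tfplus
  except ImportError:
    raise RuntimeError('neither Horovod nor TF-Plus is available')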
From Single-GPU to Multi-GPU
To extend a single-GPU training script to the multi-GPU scenario, at most seven steps are needed:
- Import the multi-GPU wrapper module, which selects Horovod or TF-Plus automatically.
from utils.multi_gpu_wrapper import MultiGpuWrapper as mgw
- Initialize the multi-GPU training framework, as early as possible.
mgw.init()
- For each worker, create a session with a distinct GPU device.
config = tf.ConfigProto()
config.gpu_options.visible_device_list = str(mgw.local_rank())
sess = tf.Session(config=config)
- (Optional) Let each worker use a distinct subset of training data.
filenames = tf.data.Dataset.list_files(file_pattern, shuffle=True)
filenames = filenames.shard(mgw.size(), mgw.rank())
- Wrap the optimizer for distributed gradient communication.
optimizer = tf.train.AdamOptimizer(learning_rate=lrn_rate)
optimizer = mgw.DistributedOptimizer(optimizer)
train_op = optimizer.minimize(loss)
- Broadcast the master worker's parameters to all the other workers.
bcast_op = mgw.broadcast_global_variables(0)
sess.run(tf.global_variables_initializer())
sess.run(bcast_op)
- (Optional) Periodically save checkpoint files on the master node only.
if mgw.rank() == 0:
  saver.save(sess, save_path, global_step)
Usage Example
Here, we provide a code snippet to demonstrate how to use multi-GPU training to speed up the training process. Please note that many implementation details are omitted for clarity.
import tensorflow as tf
from utils.multi_gpu_wrapper import MultiGpuWrapper as mgw

# initialize the multi-GPU training framework
mgw.init()

# create the training graph
with tf.Graph().as_default():
  # create a TensorFlow session pinned to this worker's GPU
  config = tf.ConfigProto()
  config.gpu_options.visible_device_list = str(mgw.local_rank())
  sess = tf.Session(config=config)

  # use tf.data.Dataset() to traverse images and labels
  filenames = tf.data.Dataset.list_files(file_pattern, shuffle=True)
  filenames = filenames.shard(mgw.size(), mgw.rank())
  images, labels = get_images_n_labels(filenames)

  # define the network and its loss function
  logits = forward_pass(images)
  loss = calc_loss(labels, logits)

  # create an optimizer and set up training-related operations
  global_step = tf.train.get_or_create_global_step()
  optimizer = tf.train.AdamOptimizer(learning_rate=lrn_rate)
  optimizer = mgw.DistributedOptimizer(optimizer)
  train_op = optimizer.minimize(loss, global_step=global_step)
  bcast_op = mgw.broadcast_global_variables(0)
  saver = tf.train.Saver()  # saver for periodic checkpointing on the master node

  # multi-GPU training
  sess.run(tf.global_variables_initializer())
  sess.run(bcast_op)
  for idx_iter in range(nb_iters):
    sess.run(train_op)
    if mgw.rank() == 0 and (idx_iter + 1) % save_step == 0:
      saver.save(sess, save_path, global_step)
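With Horovod as the backend, such a script is typically launched through Horovod's launcher rather than plain python, e.g. horovodrun -np 4 python train.py to start four worker processes (one per GPU); the exact command and host configuration depend on your cluster setup, and the script name train.py here is only a placeholder.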