A Practical Layer-Parallel Training Algorithm for Residual Networks

3 Sep 2020 · Qi Sun, Hexin Dong, Zewei Chen, Weizhen Dian, Jiacheng Sun, Yitong Sun, Zhenguo Li, Bin Dong ·

Gradient-based algorithms for training ResNets typically require a forward pass of the input data, followed by back-propagating the objective gradient to update parameters, which are time-consuming for deep ResNets. To break the dependencies between modules in both the forward and backward modes, auxiliary-variable methods such as the penalty and augmented Lagrangian (AL) approaches have attracted much interest lately due to their ability to exploit layer-wise parallelism. However, we observe that large communication overhead and lacking data augmentation are two key challenges of these methods, which may lead to low speedup ratio and accuracy drop across multiple compute devices. Inspired by the optimal control formulation of ResNets, we propose a novel serial-parallel hybrid training strategy to enable the use of data augmentation, together with downsampling filters to reduce the communication cost. The proposed strategy first trains the network parameters by solving a succession of independent sub-problems in parallel and then corrects the network parameters through a full serial forward-backward propagation of data. Such a strategy can be applied to most of the existing layer-parallel training methods using auxiliary variables. As an example, we validate the proposed strategy using penalty and AL methods on ResNet and WideResNet across MNIST, CIFAR-10 and CIFAR-100 datasets, achieving significant speedup over the traditional layer-serial training methods while maintaining comparable accuracy.

PDF Abstract

Code

Add Remove Mark official

No code implementations yet. Submit your code now

Tasks

Add Remove

Data Augmentation

Datasets

Add Datasets introduced or used in this paper

Results from the Paper

Edit

Submit results from this paper to get state-of-the-art GitHub badges and help the community compare results to other papers.

Methods

Add Remove

1x1 Convolution • Average Pooling • Batch Normalization • Bottleneck Residual Block • Convolution • Dropout • Global Average Pooling • Kaiming Initialization • Max Pooling • ReLU • Residual Block • Residual Connection • ResNet • Wide Residual Block • WideResNet

Edit Social Preview

A Practical Layer-Parallel Training Algorithm for Residual Networks

Code Edit Add Remove Mark official

Tasks Edit Add Remove

Datasets Edit

Results from the Paper Edit

Methods Edit Add Remove

Code

Add Remove Mark official

Tasks

Add Remove

Datasets

Results from the Paper

Edit

Methods

Add Remove