Redefining Self-Normalization Property

1 Jan 2021  ·  Zhaodong Chen, Weiqin Zhao, Lei Deng, Guoqi Li, Yuan Xie

Approaches that prevent gradient explosion and vanishing have boosted the performance of deep neural networks in recent years. A unique one among them is the self-normalizing neural network (SNN), which, without explicit normalization, is generally more stable than networks relying on initialization techniques alone. In previous studies, the self-normalization property of SNNs comes from the Scaled Exponential Linear Unit (SELU) activation function, which has achieved competitive accuracy on moderate-scale benchmarks. However, previous work also reveals that in deeper networks, SELU either leads to gradient explosion or loses its self-normalization property. Moreover, its accuracy on large-scale benchmarks such as ImageNet is less satisfactory. In this paper, we analyze the forward and backward passes of SNNs with mean-field theory and block dynamical isometry. We propose a new definition of the self-normalization property that is easier to work with both analytically and numerically. We further develop two new activation functions, leaky SELU (lSELU) and scaled SELU (sSELU), with a stronger self-normalization property; their optimal parameters can be solved with a constrained optimization program. Moreover, analysis of the activations' mean in the forward pass reveals that the self-normalization property weakens as the fan-in of each layer grows, which explains the performance degradation on large benchmarks such as ImageNet. This can be mitigated with explicit weight centralization or mixup data augmentation. On moderate-scale benchmarks (CIFAR-10, CIFAR-100, and Tiny ImageNet), direct application of lSELU and sSELU achieves up to 2.13% higher accuracy. On Conv MobileNet V1 on ImageNet, sSELU combined with mixup reaches 71.77% top-1 accuracy, even surpassing Batch Normalization. (Code is provided in the Supplementary Material.)
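For reference, below is a minimal NumPy sketch of the activation family discussed above. The SELU constants are the standard ones from Klambauer et al. (2017); the exact negative-branch forms and the optimized parameters of lSELU and sSELU come from the paper's constrained optimization program, so the specific formulas and the `eta` and `beta` values used here are illustrative assumptions only, not the paper's settings.

```python
import numpy as np

# Standard SELU constants (Klambauer et al., 2017). The paper instead
# derives lambda/alpha (and extra parameters) from a constrained
# optimization program; the values below are only for illustration.
SELU_LAMBDA = 1.0507009873554805
SELU_ALPHA = 1.6732632423543772


def selu(x):
    """Scaled Exponential Linear Unit."""
    neg = SELU_ALPHA * np.expm1(np.minimum(x, 0.0))
    return SELU_LAMBDA * np.where(x > 0, x, neg)


def lselu(x, lam=SELU_LAMBDA, alpha=SELU_ALPHA, eta=0.1):
    """Leaky SELU (sketch): a linear leak `eta * x` is assumed on the
    negative branch. `eta`, `lam`, `alpha` are placeholder values, not
    the optimized parameters from the paper."""
    xn = np.minimum(x, 0.0)
    neg = alpha * np.expm1(xn) + eta * xn
    return lam * np.where(x > 0, x, neg)


def sselu(x, lam=SELU_LAMBDA, alpha=SELU_ALPHA, beta=0.9):
    """Scaled SELU (sketch): the exponential branch is assumed to rescale
    its input by `beta`. Parameter values are again placeholders."""
    neg = alpha * np.expm1(beta * np.minimum(x, 0.0))
    return lam * np.where(x > 0, x, neg)
```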


