
ICLR 2022
Lead Author Spotlight
Chen Liang, PhD student in Machine Learning
#TRAINING LARGE TRANSFORMER MODELS
We propose a novel training strategy that encourages all parameters to be trained sufficiently. Specifically, we adaptively adjust the learning rate for each parameter according to its sensitivity, a robust gradient-based measure reflecting this parameter’s contribution to the model performance. A parameter with low sensitivity is redundant, and we improve its fitting by increasing its learning rate. In contrast, a parameter with high sensitivity is well-trained, and we regularize it by decreasing its learning rate to prevent further overfitting. We conduct extensive experiments on natural language understanding, neural machine translation, and image classification to demonstrate the effectiveness of the proposed schedule.
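In formulas (the notation below is ours, added here as a sketch rather than quoted from the paper), the sensitivity of the $j$-th parameter can be approximated by how much the training loss $\mathcal{L}$ would change if that parameter were set to zero, estimated with a first-order Taylor expansion:

$$
S_j \;=\; \bigl|\,\mathcal{L}(\boldsymbol{\theta}) - \mathcal{L}(\boldsymbol{\theta}_{-j})\,\bigr|
\;\approx\; \Bigl|\,\theta_j \,\frac{\partial \mathcal{L}(\boldsymbol{\theta})}{\partial \theta_j}\,\Bigr|,
$$

where $\boldsymbol{\theta}_{-j}$ denotes the parameter vector with its $j$-th entry zeroed out. A small $S_j$ marks a parameter that currently contributes little to the loss, so its learning rate is raised; a large $S_j$ marks a well-trained parameter, so its learning rate is lowered.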

Q&A with Chen Liang
If you bumped into the plenary speaker at the conference, how would you describe your paper to them in 30 seconds?
Recent research has shown the existence of significant redundancy in large Transformer models. One can prune the redundant parameters without significantly sacrificing the generalization performance. However, we question whether the redundant parameters could have contributed more if they were properly trained. To answer this question, we propose a novel training strategy that encourages all parameters to be trained sufficiently. Specifically, we adaptively adjust the learning rate for each parameter according to its sensitivity, a robust gradient-based measure reflecting this parameter’s contribution to the model performance. A parameter with low sensitivity is redundant, and we improve its fitting by increasing its learning rate. In contrast, a parameter with high sensitivity is well-trained, and we regularize it by decreasing its learning rate to prevent further overfitting.
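To make the idea concrete, below is a minimal, self-contained sketch of a sensitivity-guided per-parameter learning rate on a toy problem. It illustrates the general recipe described above, not the exact SAGE optimizer from the paper; the toy model, the smoothing constant, and the specific scaling rule are assumptions made purely for illustration.

```python
# Minimal sketch of a sensitivity-guided per-parameter learning rate.
# NOTE: this is an illustration of the general idea, not the authors' SAGE
# implementation; the toy task, `beta`, and the scaling rule are assumed here.
import torch
import torch.nn as nn

torch.manual_seed(0)

# A toy regression task standing in for a real Transformer workload.
model = nn.Linear(16, 1)
data = torch.randn(128, 16)
target = data.sum(dim=1, keepdim=True)
loss_fn = nn.MSELoss()

base_lr = 1e-2
beta = 0.9  # smoothing factor for the sensitivity estimate (assumed value)
ema_sensitivity = {name: torch.zeros_like(p) for name, p in model.named_parameters()}

for step in range(100):
    loss = loss_fn(model(data), target)
    model.zero_grad()
    loss.backward()

    with torch.no_grad():
        for name, p in model.named_parameters():
            # Sensitivity: first-order estimate of the loss change if the
            # parameter were zeroed out, |theta_j * dL/dtheta_j|.
            s = (p * p.grad).abs()
            ema = ema_sensitivity[name]
            ema.mul_(beta).add_(s, alpha=1.0 - beta)

            # Below-average sensitivity (redundant parameter) -> larger step;
            # above-average sensitivity (well-trained parameter) -> smaller step.
            scale = (ema.mean() / (ema + 1e-12)).clamp(0.1, 10.0)
            p.sub_(base_lr * scale * p.grad)

    if step % 25 == 0:
        print(f"step {step:3d}  loss {loss.item():.4f}")
```

In practice, the paper builds this kind of sensitivity-aware scaling on top of standard optimizers rather than plain gradient descent, and uses a more careful smoothing scheme; the sketch only shows the core mechanism of mapping low sensitivity to larger updates and high sensitivity to smaller ones.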
How did this work push you to grow technically, as part of a team, or in applying your work to the real world?
I worked on developing parameter-efficient large neural models from a novel perspective. Instead of pruning redundant parameters, we train them sufficiently so that they also contribute to the model performance. This new perspective has given me inspiration for future research.
What key takeaways would you share with people to help them remember your work?
- Large Transformer models contain a significant number of redundant parameters, which can hurt the model's generalization performance.
- We propose a novel training strategy, SAGE, which encourages all parameters to receive sufficient training and ultimately become useful.
- SAGE adaptively adjusts the learning rate for each parameter based on its sensitivity, a gradient-based measure reflecting this parameter’s contribution to the model performance.
- A parameter with low sensitivity is redundant, and we improve its fitting by increasing its learning rate. A parameter with high sensitivity is well-trained, and we regularize it by decreasing its learning rate.
- SAGE significantly improves model generalization for both NLP and CV tasks and in both fine-tuning and training-from-scratch settings.
Career or personal advice you would want from your future self?
Career advice: What is the best decision you have made in the workplace?
Personal advice: How to better manage work-life balance?