**ICLR 2022**

Lead Author Spotlight

**Tianrong Chen**, *PhD student*

**Likelihood Training of Schrödinger Bridge using Forward-Backward SDEs Theory**

**#SCORE-BASED GENERATIVE MODEL**

In this work, we present a novel computational framework for likelihood training of Schrödinger Bridge (SB) models grounded on Forward-Backward Stochastic Differential Equations Theory, a mathematical methodology from stochastic optimal control that transforms the optimality condition of SB into a set of SDEs. Crucially, these SDEs can be used to construct likelihood objectives for SB that, surprisingly, generalize the ones for Score-based Generative Models (SGM) as special cases. This leads to a new optimization principle that inherits the same SB optimality without giving up modern generative training techniques, and we show that the resulting training algorithm achieves comparable results on generating realistic images on MNIST, CelebA, and CIFAR10. Our code is available at https://github.com/ghliu/SB-FBSDE.

**Q&A with Tianrong Chen**

## If you bumped into the plenary speaker at the conference, how would you describe your paper to them in 30 seconds?

Schrödinger Bridge (SB) is an entropy-regularized optimal transport problem that has received increasing attention in deep generative modeling for its mathematical flexibility compared to the Score-based Generative Model (SGM). In this work, we present a novel computational framework for likelihood training of SB models grounded on Forward-Backward Stochastic Differential Equations Theory.

## How did this work push you to grow technically, as part of a team, or in applying your work to the real world?

For this work, I collaborated closely with Guan (a co-author of this paper) to push toward the ICLR submission in around four months. Accomplishing this work in such a short time was not easy, so I had to learn to communicate and collaborate efficiently.

## What key takeaways would you share with people to help them remember your work?

Our findings provide new theoretical insights by generalizing previous theoretical results for Score-based Generative Model and facilitate applications of modern generative training for SB. We validate our method on various image generative tasks, e.g. MNIST, CelebA, and CIFAR10, showing encouraging results in synthesizing high-fidelity samples while retaining the rigorous mathematical framework.

## Career or personal advice you would want from your future self?

Life can only be understood going backward, but it must be lived going forward.

**ICLR 2022**

Lead Author Spotlight

**Xinshi Chen**, *PhD student*

**Provable Learning-based Algorithm For Sparse Recovery**

**#LEARNING TO LEARN**

Recovering sparse parameters from observational data is a fundamental problem in machine learning with wide applications. Many classic algorithms can solve this problem with theoretical guarantees, but their performances rely on choosing the correct hyperparameters. Besides, hand-designed algorithms do not fully exploit the particular problem distribution of interest. In this work, we propose a deep learning method for algorithm learning called PLISA (Provable Learning-based Iterative Sparse recovery Algorithm). PLISA is designed by unrolling a classic path-following algorithm for sparse recovery, with some components being more flexible and learnable. We theoretically show the improved recovery accuracy achievable by PLISA. Furthermore, we analyze the empirical Rademacher complexity of PLISA to characterize its generalization ability to solve new problems outside the training set.
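To make the unrolling idea concrete, here is a minimal sketch of a path-following iterative soft-thresholding loop in plain Python. It is not the PLISA architecture itself: the function names, the decreasing `lambdas` schedule, and the fixed `step` are illustrative assumptions. In a learned variant, the per-stage thresholds would be trainable parameters rather than hand-picked constants.

```python
from typing import List

Vector = List[float]
Matrix = List[Vector]


def soft_threshold(v: Vector, tau: float) -> Vector:
    """Shrink each entry toward zero by tau (proximal step of the L1 penalty)."""
    return [(abs(x) - tau) * (1.0 if x > 0 else -1.0) if abs(x) > tau else 0.0
            for x in v]


def matvec(a: Matrix, x: Vector) -> Vector:
    """Matrix-vector product for a list-of-rows matrix."""
    return [sum(a_ij * x_j for a_ij, x_j in zip(row, x)) for row in a]


def transpose(a: Matrix) -> Matrix:
    return [list(col) for col in zip(*a)]


def unrolled_path_following(a: Matrix, y: Vector, lambdas: Vector, step: float) -> Vector:
    """Run one proximal-gradient stage per entry of `lambdas`.

    `lambdas` is a hypothetical regularization path (large to small);
    each stage plays the role of one unrolled "layer".
    """
    at = transpose(a)
    x = [0.0] * len(a[0])
    for lam in lambdas:
        residual = [p - q for p, q in zip(matvec(a, x), y)]  # A x - y
        grad = matvec(at, residual)                          # A^T (A x - y)
        x = soft_threshold([xi - step * gi for xi, gi in zip(x, grad)], step * lam)
    return x


# Toy usage: with an identity design matrix, the final stage returns y
# soft-thresholded at the last value on the path.
estimate = unrolled_path_following(
    [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]],
    [3.0, 0.0, -2.0],
    lambdas=[1.0, 0.5, 0.1],
    step=1.0,
)
```

Starting with a large threshold and shrinking it keeps early iterates sparse, which is the usual motivation for following a regularization path rather than fixing one threshold.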

**Q&A with Xinshi Chen**

## If you bumped into the plenary speaker at the conference, how would you describe your paper to them in 30 seconds?

My paper is about how to use deep learning to learn algorithms for solving sparse estimation problems. It falls into the category of learning to learn. What’s new in the paper includes a more generic method that can be applied to more general settings and novel theoretical guarantees for such methods.

## How did this work push you to grow technically, as part of a team, or in applying your work to the real world?

This is interdisciplinary research that covers topics in statistics, optimization, and deep learning. It allowed me to learn more about optimization theory and learning theory as we derived the theoretical results in this work.

## What key takeaways would you share with people to help them remember your work?

- How to learn algorithms for solving various sparse estimation problems?
- How to understand the behaviors of these learning-based algorithms?
- How are the generalization and representation abilities of these learning-based algorithms related to their algorithmic properties such as convergence rate?

## Career or personal advice you would want from your future self?

What is the biggest regret?

**ICLR 2022**

Lead Author Spotlight

**Sachin G. Konan**, *BS in Computer Science student*

**Iterated Reasoning with Mutual Information in Cooperative and Byzantine Decentralized Teaming**

**#MULTI-AGENT REINFORCEMENT LEARNING**

The majority of prior work in Multi-Agent Reinforcement Learning (MARL) does not support iterated rationalizability and only encourages inter-agent communication, resulting in a suboptimal equilibrium cooperation strategy. In this work, we show that reformulating an agent’s policy to be conditional on the policies of its neighboring teammates inherently maximizes a Mutual Information (MI) lower bound when optimizing under Policy Gradient (PG). Building on the idea of decision-making under bounded rationality and cognitive hierarchy theory, we show that our modified PG approach not only maximizes local agent rewards but also implicitly reasons about MI between agents without the need for any explicit ad-hoc regularization terms.

**Q&A with Sachin G. Konan**

## If you bumped into the plenary speaker at the conference, how would you describe your paper to them in 30 seconds?

Our paper focuses on integrating cognitive k-level hierarchy theory — which is how humans deeply reason about decisions in real-world settings — into multi-agent reinforcement learning. We mathematically show that our k-level algorithm, InfoPG, encourages collaboration between agents, and empirically, we demonstrate this leads to better performance relative to standard baselines.

## How did this work push you to grow technically, as part of a team, or in applying your work to the real world?

This work helped me grow my collaboration skills because in order to communicate my mathematical formulations or algorithmic ideas, I had to learn how to dissect content and present the main ideas. This skill developed during weekly meetings with my advisor Professor Gombolay and Ph.D. student, Esmaeil Seraj.

## What key takeaways would you share with people to help them remember your work?

Our work provides a key connection between two independently researched ideas in Multi-Agent RL: Mutual Information Maximization and K-Level Cognitive Hierarchy Theory. We mathematically show that by equipping each agent with a k-level policy, we can implicitly increase mutual information with other agents’ policies, which is something that previous works could only perform explicitly.

## Career or personal advice you would want from your future self?

I would want advice on whether industry or academia is the best place to continue researching.

**ICLR 2022**

Lead Author Spotlight

**Yan Li**, *PhD in Machine Learning student*

**Frequency-aware SGD for Efficient Embedding Learning with Provable Benefits**

**#OPTIMIZATION FOR REPRESENTATION LEARNING**

To learn quality embeddings efficiently, adaptive learning rate algorithms have demonstrated superior empirical performance over SGD, largely attributed to their token-dependent learning rate. We show that incorporating frequency information of tokens in the embedding learning problems leads to provably efficient algorithms, and demonstrate that common adaptive algorithms implicitly exploit the frequency information to a large extent. Specifically, we propose (Counter-based) Frequency-aware Stochastic Gradient Descent, which applies a frequency-dependent learning rate for each token, and exhibits provable speed-up compared to SGD when the token distribution is imbalanced. Empirically, we show the proposed algorithms are able to improve or match the performance of adaptive algorithms on benchmark recommendation tasks and a large-scale industrial recommendation system.

**Q&A with Yan Li**

## If you bumped into the plenary speaker at the conference, how would you describe your paper to them in 30 seconds?

Most Adagrad/Adam variants struggle to achieve provable benefits in nonconvex settings, despite the fact that they demonstrate superior performances compared to SGD especially for learning quality embedding vectors in modern NLP/recommendation systems. It seems that the fancy updates of these adaptive methods create nontrivial challenges in terms of algorithmic analysis.

The FA-SGD we propose can be viewed as a simplified variant of these adaptive methods. It comes with great empirical performances – with almost no memory overhead compared to Adagrad/Adam – while achieving on par or even better model performance. Better yet, it is the first method that shows provable benefits over SGD to date.

## How did this work push you to grow technically, as part of a team, or in applying your work to the real world?

From the development process of FA-SGD, my biggest personal takeaway as a junior researcher is to approach the current understanding and status of ML methods not only with appreciation but, more importantly, with a critical mindset. By doing so, I often find that new understandings and perspectives arise, which sometimes lead to new methods.

## What key takeaways would you share with people to help them remember your work?

The simple intuition behind FA-SGD, adjusting the learning rate of each token according to its occurrence frequency, is something we would really like to highlight. When proposing new optimization methods for representation learning, one might consider exploiting the problem structure first, which can lead to methods with algorithmic simplicity, intuitive updates, and favorable theoretical properties.
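The intuition can be sketched in a few lines of Python. This is a toy, counter-based illustration only: the class name `FrequencyAwareSGD` and the specific rule `base_lr / sqrt(count)` are assumptions made for the example, not the schedule analyzed in the paper.

```python
from collections import defaultdict
from math import sqrt
from typing import Dict, List


class FrequencyAwareSGD:
    """Toy per-token learning rate driven by an occurrence counter."""

    def __init__(self, base_lr: float) -> None:
        self.base_lr = base_lr
        self.counts: Dict[str, int] = defaultdict(int)

    def token_lr(self, token: str) -> float:
        # Assumed rule: frequent tokens take smaller steps, rare tokens larger ones.
        return self.base_lr / sqrt(max(self.counts[token], 1))

    def step(self, embeddings: Dict[str, List[float]], token: str,
             grad: List[float]) -> None:
        """Update only the embedding row of the token seen in this sample."""
        self.counts[token] += 1
        lr = self.token_lr(token)
        embeddings[token] = [w - lr * g for w, g in zip(embeddings[token], grad)]


# Toy usage: "the" is seen four times, "zebra" once.
emb = {"the": [1.0, 1.0], "zebra": [1.0, 1.0]}
opt = FrequencyAwareSGD(base_lr=0.1)
for _ in range(4):
    opt.step(emb, "the", [1.0, 0.0])
opt.step(emb, "zebra", [1.0, 0.0])
```

The only extra state is one integer per token, which is in line with the low memory overhead described above. Adagrad-style methods instead keep per-coordinate accumulators.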

## Career or personal advice you would want from your future self?

I am still working on developing my career as a junior researcher 🙂

**ICLR 2022**

Lead Author Spotlight

**Chen Liang**, *PhD in Machine Learning student*

**#TRAINING LARGE TRANSFORMER MODELS**

We propose a novel training strategy that encourages all parameters to be trained sufficiently. Specifically, we adaptively adjust the learning rate for each parameter according to its sensitivity, a robust gradient-based measure reflecting this parameter’s contribution to the model performance. A parameter with low sensitivity is redundant, and we improve its fitting by increasing its learning rate. In contrast, a parameter with high sensitivity is well-trained, and we regularize it by decreasing its learning rate to prevent further overfitting. We conduct extensive experiments on natural language understanding, neural machine translation, and image classification to demonstrate the effectiveness of the proposed schedule.

**Q&A with Chen Liang**

## If you bumped into the plenary speaker at the conference, how would you describe your paper to them in 30 seconds?

Recent research has shown the existence of significant redundancy in large Transformer models. One can prune the redundant parameters without significantly sacrificing the generalization performance. However, we question whether the redundant parameters could have contributed more if they were properly trained. To answer this question, we propose a novel training strategy that encourages all parameters to be trained sufficiently. Specifically, we adaptively adjust the learning rate for each parameter according to its sensitivity, a robust gradient-based measure reflecting this parameter’s contribution to the model performance. A parameter with low sensitivity is redundant, and we improve its fitting by increasing its learning rate. In contrast, a parameter with high sensitivity is well-trained, and we regularize it by decreasing its learning rate to prevent further overfitting.

## How did this work push you to grow technically, as part of a team, or in applying your work to the real world?

I worked on developing parameter-efficient large neural models from a novel perspective. Instead of pruning redundant parameters, we train them sufficiently so that they also contribute to the model performance. Such a new perspective gives me inspiration for future research.

## What key takeaways would you share with people to help them remember your work?

- There exists a significant number of redundant parameters in large Transformer models, which can hurt the model generalization performance.
- We propose a novel training strategy, SAGE, which encourages all parameters to receive sufficient training and become useful ultimately.
- SAGE adaptively adjusts the learning rate for each parameter based on its sensitivity, a gradient-based measure reflecting this parameter’s contribution to the model performance.
- A parameter with low sensitivity is redundant and we improve its fitting by increasing its learning rate. A parameter with high sensitivity is well-trained and we regularize it by decreasing its learning rate.
- SAGE significantly improves model generalization for both NLP and CV tasks and in both fine-tuning and training-from-scratch settings.
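A rough sketch of the idea in the bullets above, in plain Python. Two things here are assumptions made for illustration and are not taken from the source: the sensitivity proxy `|parameter * gradient|` (a first-order estimate of how much the loss changes if the parameter is zeroed) and the inverse scaling rule relative to the mean sensitivity. This is not the exact SAGE update.

```python
from typing import List, Tuple


def sensitivity(params: List[float], grads: List[float]) -> List[float]:
    """Assumed proxy: |theta * grad| for each parameter."""
    return [abs(p * g) for p, g in zip(params, grads)]


def sensitivity_scaled_step(
    params: List[float],
    grads: List[float],
    smoothed: List[float],
    base_lr: float,
    beta: float = 0.9,
    eps: float = 1e-8,
) -> Tuple[List[float], List[float], List[float]]:
    """One update in which low-sensitivity parameters take larger steps."""
    current = sensitivity(params, grads)
    # Exponential moving average to damp mini-batch noise in the measure.
    smoothed = [beta * s + (1 - beta) * c for s, c in zip(smoothed, current)]
    mean_s = sum(smoothed) / len(smoothed)
    # Hypothetical scaling: learning rate inversely proportional to sensitivity.
    lrs = [base_lr * (mean_s + eps) / (s + eps) for s in smoothed]
    new_params = [p - lr * g for p, lr, g in zip(params, lrs, grads)]
    return new_params, smoothed, lrs


# Toy usage: the first parameter has a small gradient, hence low sensitivity.
new_params, smoothed, lrs = sensitivity_scaled_step(
    params=[1.0, 1.0], grads=[0.1, 1.0], smoothed=[0.0, 0.0], base_lr=0.01
)
```

In this toy run the low-sensitivity parameter ends up with a learning rate above `base_lr` and the high-sensitivity one below it. A practical version would likely clip the ratio so that no single learning rate grows without bound.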

## Career or personal advice you would want from your future self?

Career advice: What is the best decision you have made in the workplace?

Personal advice: How to better manage work-life balance?

**ICLR 2022**

Lead Author Spotlight

**Sangdon Park**, *Postdoctoral Researcher in Computer Science*

**PAC Prediction Sets Under Covariate Shift**

**#PROBABLY APPROXIMATELY CORRECT (PAC)**

An important challenge facing modern machine learning is how to rigorously quantify the uncertainty of model predictions. Conveying uncertainty is especially important when there are changes to the underlying data distribution that might invalidate the predictive model. Yet, most existing uncertainty quantification algorithms break down in the presence of such shifts. We propose a novel approach that addresses this challenge by constructing *probably approximately correct (PAC)* prediction sets in the presence of covariate shift. Our approach focuses on the setting where there is a covariate shift from the source distribution (where we have labeled training examples) to the target distribution (for which we want to quantify uncertainty). Our algorithm assumes given importance weights that encode how the probabilities of the training examples change under the covariate shift.

**Q&A with Sangdon Park**

## If you bumped into the plenary speaker at the conference, how would you describe your paper to them in 30 seconds?

A prediction set is a promising way to quantify the uncertainty of predictions with a correctness guarantee. Under the i.i.d. assumption, a prediction set can be constructed with a probably approximately correct (PAC) guarantee, but the guarantee no longer holds under covariate shift. We provide an algorithm that returns a prediction set with a PAC guarantee even under covariate shift, given some smoothness conditions on the distributions.
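As a rough illustration of the i.i.d. building block, the sketch below picks a score threshold so that the prediction set misses the true label at a rate of at most `eps` with confidence `1 - delta`, using an exact binomial tail. All function names are hypothetical. The `rejection_resample` helper shows one standard way to use known importance weights (keep each source example with probability `w / w_max`); treating it as the covariate-shift step is an assumption for this example and not a description of the authors' algorithm.

```python
import random
from math import comb
from typing import Dict, List, Set


def binom_cdf(k: int, n: int, p: float) -> float:
    """P[Binomial(n, p) <= k]."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


def max_allowed_errors(n: int, eps: float, delta: float) -> int:
    """Largest k with P[Binomial(n, eps) <= k] <= delta, or -1 if there is none."""
    k = -1
    while k + 1 <= n and binom_cdf(k + 1, n, eps) <= delta:
        k += 1
    return k


def pac_threshold(true_label_scores: List[float], eps: float, delta: float) -> float:
    """Threshold tau that leaves at most k calibration scores strictly below it."""
    k = max_allowed_errors(len(true_label_scores), eps, delta)
    if k < 0:
        return float("-inf")  # too little data: the set must contain every label
    return sorted(true_label_scores)[k]


def prediction_set(label_scores: Dict[str, float], tau: float) -> Set[str]:
    """All labels whose model score reaches the threshold."""
    return {label for label, score in label_scores.items() if score >= tau}


def rejection_resample(scores: List[float], weights: List[float],
                       w_max: float, rng: random.Random) -> List[float]:
    """Keep each source example with probability w / w_max (assumed known weights)."""
    return [s for s, w in zip(scores, weights) if rng.random() < w / w_max]
```

A possible usage: call `rejection_resample` on the calibration scores of the true labels, pass the survivors to `pac_threshold`, and then build `prediction_set` for each new input from the model's per-label scores.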

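## How did this work push you to grow technically, as part of a team, or in applying your work to the real world?

This work forced me to understand and correctly use rigorous mathematical tools to provide an end-to-end correctness guarantee.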

## What key takeaways would you share with people to help them remember your work?

You can construct a prediction set with a correctness guarantee even under covariate shift using our tool (provided, of course, that its assumptions hold).

## Career or personal advice you would want from your future self?

Fill out this form carefully to sell your work!

**ICLR 2022**

Lead Author Spotlight

**Namjoon Suh**, *PhD in Machine Learning student*

**#OVERPARAMETERIZED DEEP NEURAL NETWORK**

We study the generalization properties of the overparameterized deep neural network (DNN) with Rectified Linear Unit (ReLU) activations. Under the non-parametric regression framework, it is assumed that the ground-truth function is from a reproducing kernel Hilbert space (RKHS) induced by a neural tangent kernel (NTK) of ReLU DNN, and a dataset is given with noise. Without a delicate adoption of early stopping, we prove that the overparameterized DNN trained by vanilla gradient descent does not recover the ground-truth function. It turns out that the estimated DNN’s prediction error is bounded away from zero. As a complement to the above result, we show that L2-regularized gradient descent enables the overparameterized DNN to achieve the minimax optimal convergence rate of the prediction error, without early stopping. Notably, the rate we obtained is faster than those known in the literature.

**Q&A with Namjoon Suh**

## If you bumped into the plenary speaker at the conference, how would you describe your paper to them in 30 seconds?

We study the generalization of overparameterized deep feedforward neural networks with ReLU activation in a non-parametric statistical setting, in which the labels are generated by a ground-truth function, assumed to lie in the RKHS associated with the NTK, and perturbed with noise. We find that gradient-descent-based training without early stopping fails, whereas L2-regularized gradient descent achieves the minimax optimal convergence rate of the L2 prediction risk.

## How did this work push you to grow technically, as part of a team, or in applying your work to the real world?

While working on this project, I read many papers on deep learning from the statistical viewpoint. This helped me deepen my understanding of both the theoretical analysis of deep neural networks and the non-parametric regression problem in statistics.

## What key takeaways would you share with people to help them remember your work?

The deep feedforward neural network linearized via over-parametrization behaves similarly to kernel regression in terms of generalization, and adopting early stopping or regularization can help generalization.

## Career or personal advice you would want from your future self?

Keep up the hard work!

**ICLR 2022**

Lead Author Spotlight

**Yuqing Wang**, *PhD in Mathematics student*

**Large Learning Rate Tames Homogeneity: Convergence and Balancing Effect**

**#LARGE LEARNING RATE**

Recent empirical advances show that training deep models with a large learning rate often improves generalization performance. However, theoretical justifications for the benefits of a large learning rate are highly limited, due to challenges in analysis. In this paper, we consider using Gradient Descent (GD) with a large learning rate on a homogeneous matrix factorization problem. We prove a convergence theory for constant large learning rates well beyond 2/L, where L is the largest eigenvalue of the Hessian at initialization. Moreover, we rigorously establish an implicit bias of GD induced by such a large learning rate, termed "balancing", meaning that the magnitudes of the two factors X and Y at the limit of GD iterations will be close even if their initialization is significantly unbalanced. Numerical experiments are provided to support our theory.

**Q&A with Yuqing Wang**

## If you bumped into the plenary speaker at the conference, how would you describe your paper to them in 30 seconds?

Our paper studies the effect of a large learning rate when solving a matrix factorization problem, and shows that it helps gradient descent find a minimum with better curvature. In particular, when the learning rate is larger than the typical upper bound 2/L (it can be at most approximately 4/L), gradient descent is proved to converge to a flatter global minimum.
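A scalar toy problem gives some intuition for why a larger learning rate pushes the two factors toward similar magnitudes. For f(x, y) = (a - x*y)^2 / 2, one gradient descent step with learning rate h multiplies the imbalance x^2 - y^2 by exactly 1 - (h*r)^2, where r = a - x*y is the residual. This follows from expanding the update directly. The scalar objective and the function names below are illustrative assumptions, not the matrix setting analyzed in the paper.

```python
from typing import Tuple


def gd_step(x: float, y: float, a: float, h: float) -> Tuple[float, float]:
    """One gradient descent step on f(x, y) = 0.5 * (a - x * y) ** 2."""
    r = a - x * y                      # residual
    return x + h * r * y, y + h * r * x


def imbalance(x: float, y: float) -> float:
    """How far the two factors are from having equal magnitude."""
    return x * x - y * y


# Toy usage: start far from balanced and take one step.
x0, y0, a, h = 3.0, 0.5, 1.0, 0.2
r0 = a - x0 * y0
x1, y1 = gd_step(x0, y0, a, h)
predicted = (1.0 - (h * r0) ** 2) * imbalance(x0, y0)
print(imbalance(x0, y0), imbalance(x1, y1), predicted)
```

With a vanishing learning rate the factor 1 - (h*r)^2 tends to 1, so the imbalance is conserved. A larger h shrinks it faster for as long as |h*r| stays below sqrt(2) and the iterates remain bounded.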

## How did this work push you to grow technically, as part of a team, or in applying your work to the real world?

Rather than abstract mathematical theory, this work is a demonstration of a novel phenomenon that draws much attention in real-world applications. It’s inspiring to at least partly bridge the gap between rigorous math and practice.

## What key takeaways would you share with people to help them remember your work?

A large learning rate is theoretically proven to help the search for a better minimum.

## Career or personal advice you would want from your future self?

- Be 100 percent sure that you believe in what you will be doing before proceeding.
- Don’t judge. Just calculate.

**ICLR 2022**

Lead Author Spotlight

**Qinsheng Zhang**, *PhD in Robotics student*

**Path Integral Sampler: A Stochastic Control Approach For Sampling**

**#PATH INTEGRAL SAMPLER**

We present Path Integral Sampler (PIS), a novel algorithm to draw samples from unnormalized probability density functions. The PIS is built on the Schrödinger bridge problem which aims to recover the most likely evolution of a diffusion process given its initial distribution and terminal distribution. The PIS draws samples from the initial distribution and then propagates the samples through the Schrödinger bridge to reach the terminal distribution. Applying the Girsanov theorem, with a simple prior diffusion, we formulate the PIS as a stochastic optimal control problem whose running cost is the control energy and terminal cost is chosen according to the target distribution. By modeling the control as a neural network, we establish a sampling algorithm that can be trained end-to-end.
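The control formulation can be sketched with a one-dimensional Euler-Maruyama simulation that propagates a sample under a control and accumulates the control-energy running cost plus a terminal cost. The function name, the callables `control` and `terminal_cost`, and the one-dimensional setting are assumptions for illustration. In the actual method the control is a neural network and the terminal cost depends on the target density, neither of which is reproduced here.

```python
import math
import random
from typing import Callable, Optional, Tuple


def simulate_controlled_path(
    control: Callable[[float, float], float],
    terminal_cost: Callable[[float], float],
    x0: float,
    horizon: float = 1.0,
    n_steps: int = 100,
    noise_scale: float = 1.0,
    rng: Optional[random.Random] = None,
) -> Tuple[float, float]:
    """Simulate dX = u(t, X) dt + noise_scale dW and return (X_T, total cost)."""
    rng = rng or random.Random(0)
    dt = horizon / n_steps
    x, running = x0, 0.0
    for i in range(n_steps):
        u = control(i * dt, x)
        running += 0.5 * u * u * dt          # control energy
        x += u * dt + noise_scale * math.sqrt(dt) * rng.gauss(0.0, 1.0)
    return x, running + terminal_cost(x)


# Toy usage: constant unit control, no noise, zero terminal cost.
x_T, cost = simulate_controlled_path(
    control=lambda t, x: 1.0,
    terminal_cost=lambda x: 0.0,
    x0=0.0,
    noise_scale=0.0,
)
```

Training would average this cost over many noisy paths and minimize it with respect to the control's parameters. That step is omitted from the sketch.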

**Q&A with Qinsheng Zhang**

## If you bumped into the plenary speaker at the conference, how would you describe your paper to them in 30 seconds?

We present Path Integral Sampler (PIS), a novel algorithm to draw samples from unnormalized probability density functions. The PIS is built on the Schrödinger bridge problem which aims to recover the most likely evolution of a diffusion process given its initial distribution and terminal distribution.

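## How did this work push you to grow technically, as part of a team, or in applying your work to the real world?

Sometimes thinking out of the box and good ideas just work.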

## What key takeaways would you share with people to help them remember your work?

A non-MCMC sampling algorithm with guaranteed performance.

## Career or personal advice you would want from your future self?

Multidisciplinary research offers lots of opportunities.

**ICLR 2022**

Lead Author Spotlight

**Shixiang Zhu**, *PhD in Machine Learning student*

**Neural Spectral Marked Point Processes**

**#METRIC LEARNING AND KERNEL LEARNING**

Self- and mutually-exciting point processes are popular models in machine learning and statistics for dependent discrete event data. To date, most existing models assume stationary kernels (including the classical Hawkes processes) and simple parametric models. Modern applications with complex event data require more general point process models that can incorporate contextual information of the events, called marks, besides the temporal and location information. Moreover, such applications often require non-stationary models to capture more complex spatio-temporal dependence. To tackle these challenges, a key question is to devise a versatile influence kernel in the point process model. In this paper, we introduce a novel and general neural network-based non-stationary influence kernel with high expressiveness for handling complex discrete event data while providing theoretical performance guarantees. We demonstrate the superior performance of our proposed method compared with the state-of-the-art on synthetic and real data.

**Q&A with Shixiang Zhu**

## If you bumped into the plenary speaker at the conference, how would you describe your paper to them in 30 seconds?

Self- and mutually-exciting point processes are popular models in machine learning and statistics for dependent discrete event data. In this paper, we devise a non-stationary influence kernel in the point process model to capture more complex spatio-temporal dependence while providing theoretical performance guarantees.

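## How did this work push you to grow technically, as part of a team, or in applying your work to the real world?

In this study, we collaborate with researchers and students extensively and bring their expertise together to make things happen.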

## What key takeaways would you share with people to help them remember your work?

In this work, the kernel function is represented by a spectral decomposition of the influence kernel with a finite-rank truncation in practice. Such a kernel representation will enable us to capture the most general non-stationary process as well as high-dimensional marks. The model also allows the distribution of marks to depend on time, which is drastically different from the separable kernels considered in the existing literature.
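A finite-rank kernel of this kind can be written as a weighted sum of products of feature functions, one evaluated at the past event and one at the current point. The sketch below uses hand-written feature functions on scalar event times purely for illustration. The names, the additive Hawkes-style intensity, and the scalar events are assumptions: in the described model the features would be neural networks acting on time, location, and marks.

```python
from typing import Callable, Sequence

Feature = Callable[[float], float]


def finite_rank_kernel(weights: Sequence[float], psi: Sequence[Feature],
                       phi: Sequence[Feature], x_past: float, x_now: float) -> float:
    """k(x_past, x_now) = sum_r weights[r] * psi[r](x_past) * phi[r](x_now)."""
    return sum(w * p(x_past) * q(x_now) for w, p, q in zip(weights, psi, phi))


def conditional_intensity(base_rate: float, history: Sequence[float], x_now: float,
                          weights: Sequence[float], psi: Sequence[Feature],
                          phi: Sequence[Feature]) -> float:
    """Base rate plus the summed influence of every past event on x_now."""
    return base_rate + sum(
        finite_rank_kernel(weights, psi, phi, x, x_now) for x in history
    )


# Toy usage with a rank-2 kernel built from hypothetical features.
weights = [0.5, 0.25]
psi = [lambda x: 1.0, lambda x: x]
phi = [lambda x: 1.0, lambda x: x]
k_value = finite_rank_kernel(weights, psi, phi, 2.0, 3.0)
rate = conditional_intensity(0.1, [1.0, 2.0], 3.0, weights, psi, phi)
```

Because the kernel takes the past and current points as separate arguments instead of only their difference, it is not restricted to stationary influence.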

## Career or personal advice you would want from your future self?

How to become the leader in the forefront of machine learning research?