University of Texas at Austin
Learning the learning rate in stochastic gradient descent
Finding a proper learning rate in stochastic optimization is an important problem. Choosing a learning rate that is too small leads to painfully slow convergence, while a learning rate that is too large can cause the loss function to fluctuate around the minimum or even to diverge. In practice, the learning rate is often tuned by hand for each problem. Several methods have been proposed recently for automatically adjusting the learning rate based on the gradient information received along the way. We review these methods and propose a simple method, inspired by a reparametrization of the loss function in polar coordinates. We prove that the proposed method achieves optimal oracle convergence rates in both the batch and stochastic settings, without having to know certain parameters of the loss function in advance.
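As a concrete illustration of the kind of adaptive scheme the abstract alludes to (a sketch only, not necessarily the method proposed in the talk), the snippet below runs SGD with a learning rate that shrinks as squared gradient norms accumulate, in the style of AdaGrad-Norm. The function names, constants, and the quadratic test problem are illustrative assumptions.

```python
import numpy as np

def adaptive_sgd(grad_fn, x0, eta=1.0, b0=1e-2, n_steps=1000):
    """Illustrative adaptive-learning-rate SGD (AdaGrad-Norm style):
    the effective step size eta / b_j shrinks as gradient information
    accumulates, so no problem-specific learning rate is tuned in advance."""
    x = np.array(x0, dtype=float)
    b2 = b0 ** 2                        # running sum of squared gradient norms
    for _ in range(n_steps):
        g = grad_fn(x)                  # stochastic gradient at the current iterate
        b2 += np.dot(g, g)              # accumulate the squared gradient norm
        x -= (eta / np.sqrt(b2)) * g    # step with the adapted learning rate
    return x

# Toy usage: minimize f(x) = 0.5 * ||x||^2 from noisy gradient evaluations.
rng = np.random.default_rng(0)
noisy_grad = lambda x: x + 0.1 * rng.standard_normal(x.shape)
x_est = adaptive_sgd(noisy_grad, x0=np.ones(5))
```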
Rachel Ward is an Associate Professor in the Department of Mathematics at the University of Texas at Austin and has been a member of the faculty since 2011. She is currently a visiting research scientist at Facebook AI Research. She received her PhD in Computational and Applied Mathematics from Princeton in 2009 and was a Courant Instructor at the Courant Institute, NYU, from 2009 to 2011. Her research interests span signal processing, machine learning, and optimization. She has received a Sloan Research Fellowship, an NSF CAREER Award, and the 2016 IMA Prize in Mathematics and its Applications.