A fully automated momentum tuner for synchronous and asynchronous learning systems

based on ongoing work by Jian Zhang, Ioannis Mitliagkas and Chris Ré

TL;DR We use our recent theory on the connection between asynchrony and momentum to design a momentum tuner. It uses only information from the gradients generated by the system, together with a new momentum-sensing method that compensates for asynchrony-induced momentum.

Stochastic Gradient Descent (SGD) and its variants are the optimization method of choice for many large-scale learning problems including deep learning. A popular approach to running these systems removes locks and synchronization barriers. Such methods are called “asynchronous-parallel methods” or Hogwild! and are used on many systems by companies like Microsoft and Google.

However, the effectiveness of asynchrony has been a bit of a mystery. In our recent work we discovered an unexpected theoretical link between system and algorithm dynamics: asynchrony introduces momentum-like dynamics into the optimization update. The optimal momentum setting therefore depends on both the data and the hardware, so tuning is critical. More details are in our blog post.

Following up on this line of work, we have been building a fully automated momentum tuner, codenamed YellowFin, which uses our theoretical and practical understanding of asynchronous dynamics to optimally tune momentum on the fly.

Robustness Properties of the momentum operator

Robustness We study the robustness properties of the momentum operator; we point out that the rate of convergence for simple objectives is robust to learning rate mis-specification as well as curvature variations.
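As a small numerical illustration of the learning-rate robustness, consider the simplest objective, a scalar quadratic f(x) = h/2 * x^2. The momentum update is then linear in (x_t, x_{t-1}), and its convergence rate is the spectral radius of a 2x2 matrix. The sketch below uses the standard heavy-ball analysis; the function name and the values of mu and h are chosen for illustration only.

```python
import numpy as np

def momentum_spectral_radius(alpha, h, mu):
    """Convergence rate of momentum SGD on f(x) = h/2 * x^2.

    The update x_{t+1} = x_t - alpha*h*x_t + mu*(x_t - x_{t-1}) can be
    written as (x_{t+1}, x_t) = A @ (x_t, x_{t-1}); the rate is the
    spectral radius of A.
    """
    A = np.array([[1.0 - alpha * h + mu, -mu],
                  [1.0, 0.0]])
    return float(np.max(np.abs(np.linalg.eigvals(A))))

mu = 0.9   # momentum (illustrative value)
h = 1.0    # curvature of the quadratic (illustrative value)

# Inside (1 - sqrt(mu))^2 <= alpha*h <= (1 + sqrt(mu))^2 the eigenvalues
# of A are complex with modulus sqrt(mu), whatever the learning rate is.
lo = float((1 - np.sqrt(mu)) ** 2 / h)
hi = float((1 + np.sqrt(mu)) ** 2 / h)
alphas = np.linspace(lo, hi, 50)
rates = [momentum_spectral_radius(a, h, mu) for a in alphas]

print(f"learning rates from {lo:.4f} to {hi:.4f}")
print(f"rate range: {min(rates):.6f} .. {max(rates):.6f}, sqrt(mu) = {np.sqrt(mu):.6f}")
```

With mu = 0.9 the admissible learning rates span roughly three orders of magnitude, yet the rate stays at sqrt(mu) throughout. Reading the same inequality with alpha fixed and h varying gives the corresponding robustness to curvature variations.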

Synchronous momentum tuner

Synchronous tuner We propose a simple momentum tuner that only uses the gradients of a running system to pick a target momentum value. Our momentum rule computes a quantity similar to the condition number that also captures the stochastic noise present in SGD updates.
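To give a feel for how a condition-number-like quantity maps to a momentum value, here is a minimal sketch using the classical noiseless heavy-ball rule for quadratics, mu = ((sqrt(kappa) - 1) / (sqrt(kappa) + 1))^2. This is not our tuner's rule, which also accounts for stochastic noise. The function name and the input, a window of positive curvature estimates assumed to have been derived from gradients, are hypothetical.

```python
import math

def classical_momentum(curvature_estimates):
    """Map a window of curvature estimates to a momentum value.

    Assumption: `curvature_estimates` is a hypothetical window of positive
    curvature values obtained from a running system. The rule below is the
    classical noiseless heavy-ball choice, used here only as a stand-in.
    """
    h_min = min(curvature_estimates)
    h_max = max(curvature_estimates)
    if h_min <= 0:
        raise ValueError("curvature estimates must be positive")
    kappa = h_max / h_min                      # condition-number-like ratio
    sqrt_kappa = math.sqrt(kappa)
    mu = ((sqrt_kappa - 1.0) / (sqrt_kappa + 1.0)) ** 2
    return kappa, mu

window = [0.5, 2.0, 8.0, 1.0]                  # illustrative values
kappa, mu = classical_momentum(window)
print(f"kappa = {kappa:.1f}, momentum = {mu:.2f}")
```

A larger spread of curvatures yields a momentum closer to 1, while a perfectly conditioned window yields zero momentum.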

Preliminary results suggest that our tuner outperforms the best hand-tuned Momentum SGD update and Adam on ResNets.

Asynchronous momentum tuner

Asynchronous tuner We use our theory to measure the total amount of momentum in a running asynchronous system. Then we close the loop: we apply some simple control theory to efficiently tune the total momentum in the system, so that it runs at maximum statistical efficiency.
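The closed-loop idea can be sketched as a simple proportional feedback controller on the algorithmic momentum. Everything specific here is an assumption made for illustration: the measurement is replaced by a stand-in that adds a fixed, hypothetical asynchrony-induced momentum to the algorithmic value, and the gain, target and names are invented for the example. It is not a description of the actual controller.

```python
def measure_total_momentum(algorithmic_momentum, async_momentum=0.3):
    """Stand-in for momentum sensing in a running system.

    Assumption: total momentum is modeled as algorithmic momentum plus a
    fixed asynchrony-induced term. A real system would estimate the total
    from observed updates instead.
    """
    return algorithmic_momentum + async_momentum

def tune_momentum(target, steps=200, gain=0.1):
    """Proportional negative feedback on the algorithmic momentum."""
    mu_alg = target                      # start as if there were no asynchrony
    for _ in range(steps):
        mu_total = measure_total_momentum(mu_alg)
        mu_alg += gain * (target - mu_total)   # push total toward target
    return mu_alg, measure_total_momentum(mu_alg)

target = 0.9                             # illustrative target total momentum
mu_alg, mu_total = tune_momentum(target)
print(f"algorithmic momentum = {mu_alg:.4f}, total momentum = {mu_total:.4f}")
```

Under this toy model the controller lowers the algorithmic momentum until the measured total matches the target; if the asynchrony-induced term exceeded the target, the algorithmic momentum would be driven negative.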

What’s next?