Asynchrony begets Momentum

In a recent note, we show that asynchrony in stochastic gradient descent (SGD) introduces an extra momentum term. In the companion systems paper, we use this theory to understand statistical efficiency and to train deep networks faster.

The theory is discussed more extensively in this blog post.

Momentum from asynchrony