(Winter 2019) IFT 6085: Theoretical principles for deep learning

This is an old version of the class. For the latest iteration of the class check here.

Class discussion group: please sign up to receive announcements and participate in discussion.

Description

Research in deep learning produces state-of-the-art results on a number of machine learning tasks. Most of those advances are driven by intuition and massive exploration through trial and error. As a result, theory is currently lagging behind practice. The ML community does not fully understand why the best methods work.

Why can we reliably optimize non-convex objectives?
How expressive are our architectures, in terms of the hypothesis class they describe?
Why do some of our most complex models generalize to unseen examples when we use datasets orders of magnitude smaller than what the classic statistical learning theory deems sufficient?

A symptom of this lack of understanding is that deep learning methods largely lack guarantees and interpretability, two necessary properties for mission-critical applications. More importantly, a solid theoretical foundation can aid the design of a new generation of efficient methods—sans the need for blind trial-and-error-based exploration.

In this class we will go over a number of recent publications that attempt to shed light onto these questions. Before discussing the new results in each paper we will first introduce the necessary fundamental tools from optimization, statistics, information theory and statistical mechanics. The purpose of this class is to get students engaged with new research in the area. To that end, the majority of credit will be given for a class project report and presentation on a relevant topic.

Prerequisites: This is meant to be an advanced graduate class for students who want to engage in theory-driven deep learning research. We will introduce some theoretical tools necessary, but start with the assumption that students are comfortable with:

Probability
Linear algebra
Machine learning

If you have not taken classes on all of the above topics, this class is not designed for you. You should take another class instead.

People

Lecturer: Ioannis Mitliagkas,

Class info

Winter 2019 semester:

Wednesday 9h30-11h15  
Thursday 9-10h45

Room: The new Mila auditorium, 6650 St. Urbain.

Office hours: 10:45am-11:45am on Thursday right after class.

Communication

We will use the class discussion group for all questions. If you have something personal/sensitive to discuss, feel free to email me. Starting your email subject with ‘IFT6085:’ will ensure that your email is not miscategorized.

Evaluation

Class project: 40% Paper presentation: 25% Surprise quizzes, midterm: 20% Scribing: 10% Class participation: 5%

Use this Latex template for scribing.

Tentative topics–to be updated as we go along

Generalization: theoretical analysis and practical bounds 
Information theory and its applications in ML (information bottleneck, lower bounds etc.) 
Generative models beyond the pretty pictures: a tool for traversing the data manifold, projections, completion, substitutions etc. 
Taming adversarial objectives: Wasserstein GANs, regularization approaches and controlling the dynamics 
The expressive power of deep networks (deep information propagation, mean-field analysis of random networks etc.)

Extensive class bibliography

Schedule

NOTE: For the first half of the class we will be closely following the previous iteration of the class.

January 9th Class introduction [slides, quiz]

Crash course in optimization

January 10th Basics of convex analysis and gradient descent [new scribed notes]

Reading:

Convex analysis basics from ‘Convex Optimization’ by Boyd, Vandenberge ([5] under references):

Chapter 2 (required: beginning of chapter to 2.1.4, recommended: 2.1.5 to end of section)
Chapter 3 (required: beginning of chapter to 3.1, recommended: 3.2, 3.3 and 3.4)

Convergence proofs: from Chapter 3 of [1] (‘Convex Optimization…’ by S.Bubeck under References)

Required: Convergence proof of Theorem 3.2 (note that we studied the unconstrained case. In this case the projection operator PiX(x) is the identity operator)

January 16th-17th The different rates of gradient descent: from Lipschitz to strongly convex [new scribed notes]

Reading:

Convergence proofs from Chapter 3 of [1] (‘Convex Optimization…’ by S.Bubeck under References)

Required: Convergence proof of Theorem 3.12

January 23rd Black box models and lower bounds [new scribed notes]

Reading: [1, Theorem 3.15], [6]

January 24th Accelerated methods: Polyak’s momentum (the heavy ball method) [new scribed notes] [slides]

Reading: [6], [7, pages 67-76], [8], [9]

January 30th Nesterov’s Accelerated Gradient, Stochastic gradient descent [new scribed notes]

Reading: Section 6 until 6.2 of [1], Section 14.3 of [4]

Crash course in statistical learning theory

January 31st Elements of statistical learning theory [new scribed notes]

Reading: Sections 2 (if you need the intro), 3, 4 and 6 of [4].

February 6th PAC-Bayes bounds [new scribed notes]

Reading: [12]

Reading (harder): Section 6 of [2]

February 7th Stability and generalization [new scribed notes]

Reading: [13,14]

Seminar part of class

February 13th: Guest lecturer, Guillaume Rabusseau Expressivity and universal approximation theorems [new scribed notes]

February 20th Applications of stability [new scribed notes]

Reading: [14]

February 21st Generative models [new scribed notes]

Reading: [16,17]

February 27th Wasserstein GANs [new scribed notes]

Reading: [18,19]

February 28th The Numerics of GANs [new scribed notes] [slides]

March 6th BREAK No class

March 7th BREAK No class

March 13th Midterm exam

March 14th: Guest lecturer, Guillaume Lajoie Intro to dynamical systems with application to neural networks [new scribed notes]

Reading: [31, 32, 33]

March 20th, morning Student paper presentations, A

Evaluation form

The expressive power of neural networks: a view from the width Presented by: Andrew Williams, Gabriel Laberge, Ian Porada, Mingde Zhao
“Evaluating the Robustness of Neural Networks: An Extreme Value Theory Approach” Presented by: Fabrice Normandin, Rémi Piché-Taillefer, Théo Moins, Sanae Lotfi
Stabilizing GAN Training with Multiple Random Projections Presented by: Moustafa ElAraby, Kun Ni, Stephan Anh Vu Tran

March 20th, afternoon Student paper presentations, B

Evaluation form

Adaptive Gradient Methods with Dynamic Bound of Learning Rate Presented by: Anirudh Goyal, Alex Lamb
Neural Ordinary Differential Equations Presented by: Parviz Haggi Jacob Miller Eugene Vorontsov
Metalearning and Universality Presented by: Bhairav Mehta, Bhargav Kanuparthi, Amir Raza, Maximilien Le Clei

March 21st Weighted Sums of Random Kitchen Sinks: Replacing minimization with randomization in learning [new scribed notes]

Reading: [23]

March 27th, morning Basic results on reinforcement learning [new scribed notes]

Reading: [34], Chapters 3 and 4

March 27th, afternoon Student paper presentations, D

Evaluation form

Hyperspherical Variational Auto-Encoders Presented by: Jonathan Guymont, Marzieh Mehdizadeh, Jonathan Pilault
A Distributional Perspective on Reinforcement Learning Presented by: Dylan Troop, William St-Arnaud, Yann Bouteiller, Guillaume LeBerre
SGD CONVERGES TO GLOBAL MINIMUM IN DEEP LEARNING VIA STAR-CONVEX PATH Presented by: Charles Guille-Escuret, Zumer Jérémie, Adam Ibrahim, Sauvé Léonard

March 28th, morning Student paper presentations, C

Evaluation form

Entropy-SGD: Biasing gradient descent into wide valleys Presented by: Chin-Wei Huang, Philippe Lacaille, Shawn Tan, Jonathan Plante
Wasserstein Auto-Encoders Presented by: Jose Gallego, Ankit Vani, Andre Cianflone
Are Deep Policy Gradient Algorithms Truly Policy Gradient Algorithms? Presented by: Zafarali Ahmed, Veronica Chelu, Nishanth Anand, Nadeem Ward

April 3rd Distributional reinforcement learning [new scribed notes]

Bonus lectures

This section contains useful material (older lectures) that have been replaced by new material, but are still very interesting and useful. Students will not be examined on this material.

March 21st, 2018 Variance reduction techniques for stochastic optimization [scribed notes]

Reading: [22], Section 5.3 of [21]

March 29th, 2018 PacGAN: The power of two samples in generative adversarial networks

Reading: [24]

April 4th, 2018 Some results on non-convex optimization

Reading: [25,26]

April 5th, 2018 Escaping saddle points [scribed notes] slides by Yang Yuan

Reading: [27, 28]

April 11th, 2018 The theory of spin glasses

Guest lecture by Alex Fribergh, UdeM Math.

Reading: [29]

April 12th, 2018 The Loss Surfaces of Multilayer Networks

Reading: [30]

Resources

Convex Optimization: Algorithms and Complexity, Sebastien Bubeck.
Theory of classification: a survey of some recent advances Stephane Boucheron, Olivier Bousquet and Gabor Lugosi
iPython notebook demonstrating basic ideas of gradient descent and stochastic gradient descent, simple and complex models as well as generalization.
Understanding Machine Learning: From Theory to Algorithms, by Shai Shalev-Shwartz and Shai Ben-David.
Convex Optimization, Stephen Boyd and Lieven Vandenberghe.
Nesterov’s Accelerated Gradient Descent for Smooth and Strongly Convex Optimization, blog post by Sebastien Bubeck.
Introductory lectures on convex optimization, Yurii Nesterov.
Why momentum really works, blog post by Gabriel Goh (this blog post uses a slightly different parametrization of the momentum algorithm. The version we discuss in class, only applies the learning rate on the gradient.)
YellowFin and the Art of Momentum Tuning, preprint J. Zhang, I. Mitliagkas.
Large-scale Machine Learning and Optimization (class), Dimitris Papailiopoulos, University of Wisconsin.
Advanced Machine Learning Systems (class), Chris De Sa, Cornell University.
A PAC-Bayesian Tutorial with A Dropout Bound, David McAllester.
Stability and generalization, O. Bousquet, A. Elisseeff.
Train faster, generalize better: Stability of stochastic gradient descent, M. Hardt, B. Recht, Y. Singer.
Computing Nonvacuous Generalization Bounds for Deep (Stochastic) Neural Networks with Many More Parameters than Training Data, Gintare Karolina Dziugaite, Daniel M. Roy
Lecture notes on generative learning algorithms, Andrew Ng
Generative Adversarial Nets, Ian Goodfellow et al.
Wasserstein GAN, Martin Arjovsky, Soumith Chintala, Léon Bottou
Read-through: Wasserstein GAN, Alex Irpan
The Numerics of GANs, Lars Mescheder, Sebastian Nowozin, Andreas Geiger
Optimization Methods for Large-Scale Machine Learning, Léon Bottou, Frank E. Curtis, Jorge Nocedal
Accelerating Stochastic Gradient Descent using Predictive Variance Reduction, Rie Johnson, Tong Zhang
Weighted Sums of Random Kitchen Sinks: Replacing minimization with randomization in learning, Ali Rahimi, Ben Recht
PacGAN: The power of two samples in generative adversarial networks, Zinan Lin, Ashish Khetan, Giulia Fanti, Sewoong Oh
Stochastic First- and Zeroth-order Methods for Nonconvex Stochastic Programming, Saeed Ghadimi, Guanghui Lan
Linear Convergence of Gradient and Proximal-Gradient Methods Under the Polyak-Łojasiewicz Condition, Hamed Karimi, Julie Nutini, Mark Schmidt
Escaping From Saddle Points — Online Stochastic Gradient for Tensor Decomposition, Rong Ge, Furong Huang, Chi Jin, Yang Yuan
How to Escape Saddle Points Efficiently, Chi Jin, Rong Ge, Praneeth Netrapalli, Sham M. Kakade, Michael I. Jordan
RANDOM MATRICES AND COMPLEXITY OF SPIN GLASSES, ANTONIO AUFFINGER, GERARD BEN AROUS, AND JIRI CERNY
The Loss Surfaces of Multilayer Networks, Anna Choromanska, Mikael Henaff, Michael Mathieu, Gérard Ben Arous, Yann LeCun
Learning Long-Term Dependencies with Gradient Descent is Difficult, Y Bengio, P Simard, P Frasconi
On the difficulty of training Recurrent Neural Networks, Razvan Pascanu, Tomas Mikolov, Yoshua Bengio
Opening the Black Box: Low-Dimensional Dynamics in High-Dimensional Recurrent Neural Networks, David Sussillo and Omri Barak
Reinforcement Learning: An Introduction, Richard S. Sutton and Andrew G. Barto

(Winter 2019) IFT 6085: Theoretical principles for deep learning

Description

People

Class info

Communication

Evaluation

Tentative topics–to be updated as we go along

Schedule

Crash course in optimization

Crash course in statistical learning theory

Seminar part of class

Bonus lectures

Resources

Recent discussion