(Winter 2020) IFT 6085: Theoretical principles for deep learning

This is an old version of the class. For the latest iteration of the class check here.

Warning: The Winter 2020 version of the class will be more demanding than in previous years. For example, the introductory “crash courses” will be compressed significantly. You will be expected to either already know this material or get up to speed with it in the first couple of weeks.


Research in deep learning produces state-of-the-art results on a number of machine learning tasks. Most of those advances are driven by intuition and massive exploration through trial and error. As a result, theory is currently lagging behind practice. The ML community does not fully understand why the best methods work.

A symptom of this lack of understanding is that deep learning methods largely lack guarantees and interpretability, two necessary properties for mission-critical applications. More importantly, a solid theoretical foundation can aid the design of a new generation of efficient methods, without the need for blind trial-and-error exploration.

In this class we will go over a number of recent publications that attempt to shed light on these questions. Before discussing the new results in each paper, we will first introduce the necessary fundamental tools from optimization, statistics, information theory and statistical mechanics. The purpose of this class is to get students engaged with new research in the area. To that end, the majority of the credit will be given for a class project report and presentation on a relevant topic.

Prerequisites: This is meant to be an advanced graduate class for students who want to engage in theory-driven deep learning research. We will introduce some of the necessary theoretical tools, but we start with the assumption that students are comfortable with:

If you have not taken classes on all of the above topics, this class is not designed for you. You should take another class instead.


Instructor: Ioannis Mitliagkas

TA: Jose Gallego

Class info

Winter 2020 semester:

Room: The new Mila auditorium, 6650 St. Urbain.

Office hours: 10:45am-11:45am on Thursday right after class.


We will use the class’s Studium to make announcements. We will use a Slack workspace for discussion. Please keep an eye out for an invitation to the Slack. We will use Gradescope for quizzes and report submissions. Email is not a good way to reach us. However, if you have something personal/sensitive to discuss, feel free to email me or the TA. Starting your email subject with ‘IFT6085:’ will ensure that your email is not miscategorized.


Use this LaTeX template for scribing.

Tentative topics (to be updated as we go along)

Extensive class bibliography

You can find a large number of (mostly recent) research papers related to the class’s objectives here. You can use those for your in-class paper presentations and projects.


NOTE: For the first half of the class we will be closely following the previous iteration of the class.

January 8th Class introduction [slides, quiz]

Crash course in optimization

January 9th Basics of convex analysis and gradient descent [new scribed notes]


Convex analysis basics from ‘Convex Optimization’ by Boyd and Vandenberghe ([5] under References).

Convergence proofs from Chapter 3 of [1] (‘Convex Optimization…’ by S. Bubeck under References)
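
Before diving into the convergence proofs, it may help to see the algorithm itself. Below is a minimal, self-contained sketch of fixed-step gradient descent on a convex quadratic; the objective, the step size 1/L and the iteration count are illustrative choices and are not taken from the course materials.

```python
import numpy as np

# Minimal gradient descent sketch on a smooth convex quadratic
# f(x) = 0.5 * x^T A x - b^T x.  A, b, the step size and the number of
# iterations are illustrative choices, not taken from the lecture notes.

def gradient_descent(grad, x0, step_size, num_steps):
    """Run fixed-step gradient descent and return all iterates."""
    x = x0.copy()
    iterates = [x.copy()]
    for _ in range(num_steps):
        x = x - step_size * grad(x)
        iterates.append(x.copy())
    return iterates

A = np.array([[3.0, 0.5], [0.5, 1.0]])   # positive definite, so f is strongly convex
b = np.array([1.0, -2.0])
grad_f = lambda x: A @ x - b             # gradient of the quadratic

L = np.linalg.eigvalsh(A).max()          # smoothness constant (largest eigenvalue)
xs = gradient_descent(grad_f, np.zeros(2), step_size=1.0 / L, num_steps=100)
print(xs[-1], np.linalg.solve(A, b))     # last iterate approaches the minimizer A^{-1} b
```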

January 15th The different rates of gradient descent: from Lipschitz to strongly convex [new scribed notes]


Convergence proofs from Chapter 3 of [1] (‘Convex Optimization…’ by S. Bubeck under References)
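
As a quick reference for this lecture, the three regimes are summarized below, only up to constants; the exact statements, step sizes and averaging scheme are in Chapter 3 of [1]. Here x* is a minimizer, R bounds the distance from the initial point to x*, x̄_t is an averaged iterate, and κ = β/α is the condition number.

```latex
% Gradient-descent rates covered in this part of the course, up to constants;
% exact statements, step sizes and the averaging scheme are in Chapter 3 of [1].
\begin{align*}
  \text{convex, } L\text{-Lipschitz:}
    &\quad f(\bar{x}_t) - f(x^*) \;=\; O\!\left(\frac{R L}{\sqrt{t}}\right) \\
  \text{convex, } \beta\text{-smooth:}
    &\quad f(x_t) - f(x^*) \;=\; O\!\left(\frac{\beta\, \lVert x_0 - x^* \rVert^2}{t}\right) \\
  \alpha\text{-strongly convex, } \beta\text{-smooth:}
    &\quad f(x_t) - f(x^*) \;=\; O\!\left(\beta\, \lVert x_0 - x^* \rVert^2
        \exp\!\left(-\frac{t}{\kappa}\right)\right), \qquad \kappa = \frac{\beta}{\alpha}
\end{align*}
```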

January 16th Black box models and lower bounds [scribed notes]


January 22nd Accelerated methods: Polyak’s momentum (the heavy ball method) [new scribed notes] [slides]
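
For reference, the heavy-ball update in the parametrization used in class (learning rate applied only to the gradient; see the note on reference [8]) can be written as follows.

```latex
% Polyak momentum (heavy ball): eta is the step size, mu the momentum coefficient.
% The learning rate multiplies only the gradient, matching the parametrization
% discussed in class (see the note on reference [8]).
\[
  x_{t+1} \;=\; x_t \;-\; \eta\, \nabla f(x_t) \;+\; \mu\, (x_t - x_{t-1})
\]
```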


January 23rd Nesterov’s Accelerated Gradient, Stochastic gradient descent [new scribed notes]
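
To make the stochastic gradient step concrete, here is a minimal SGD sketch on a synthetic least-squares problem; the data, minibatch size and fixed step size are illustrative choices, not prescribed by the lecture.

```python
import numpy as np

# Minimal stochastic gradient descent sketch on least squares; the synthetic data,
# batch size and step size below are illustrative choices, not the course's.
rng = np.random.default_rng(0)
n, d = 1000, 5
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
y = X @ w_true + 0.1 * rng.normal(size=n)

w = np.zeros(d)
for t in range(1, 2001):
    idx = rng.integers(0, n, size=32)             # sample a minibatch
    grad = X[idx].T @ (X[idx] @ w - y[idx]) / 32  # stochastic gradient estimate
    w -= 0.01 * grad                              # fixed step size for simplicity
print(np.linalg.norm(w - w_true))                 # should be small
```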


Crash course in statistical learning theory

January 29th Elements of statistical learning theory [new scribed notes]

Recommended reading: Section 2 (if you need the intro) and Section 5 of [4].

Required reading: Sections 3, 4 and 6 of [4].

January 30th PAC-Bayes bounds [new scribed notes]

Required reading: [12]

Recommended reading: Section 31 of [4]

Recommended reading (harder): Section 6 of [2]
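
For orientation before the readings, one common form of a PAC-Bayes bound is given below; the constants and the logarithmic term differ between statements, so see [12] and Section 31 of [4] for the exact versions discussed in class.

```latex
% One common form of a PAC-Bayes bound; constants and the logarithmic term vary
% between statements, see [12] and Section 31 of [4] for the exact versions.
% P is a data-independent prior over hypotheses, Q any posterior, L the true risk,
% \hat{L} the empirical risk on n i.i.d. samples, and the bound holds with
% probability at least 1 - \delta simultaneously for all Q.
\[
  \mathbb{E}_{h \sim Q}\!\left[L(h)\right]
  \;\le\;
  \mathbb{E}_{h \sim Q}\!\left[\hat{L}(h)\right]
  + \sqrt{\frac{\mathrm{KL}(Q \,\|\, P) + \ln\frac{n}{\delta}}{2(n-1)}}
\]
```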

February 5th Stability and generalization [new scribed notes]

Required reading:

Optional reading: Proofs that are not covered in the scribed notes.
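
A compact statement of the central definition for this lecture is sketched below. This is the replace-one-example form of uniform stability; the exact variants used in references [13] and [14] differ slightly, so defer to the readings and the scribed notes.

```latex
% Uniform stability in the replace-one-example form: S and S^{(i)} are training sets
% of size n that differ only in the i-th example, A_S is the hypothesis returned by
% the algorithm on S, and \ell is the loss. The removal-based variant of [13]
% differs slightly; see the readings for the exact definitions.
\[
  \sup_{S,\, S^{(i)},\, z} \;\big|\, \ell(A_S, z) - \ell(A_{S^{(i)}}, z) \,\big| \;\le\; \beta
\]
```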

Seminar part of class

February 6th Guest lecture - Jose Gallego [this is a new topic; no scribes available from last year]

Uniform convergence may be unable to explain generalization in deep learning

February 12th Expressivity and universal approximation theorems [new rough scribed notes]
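
As a quick reminder of what this lecture is about, an informal form of the classical one-hidden-layer universal approximation statement is sketched below; the precise assumptions on the activation differ between the classical results (e.g. Cybenko, Hornik), so the scribed notes take precedence.

```latex
% Informal one-hidden-layer universal approximation statement; the precise conditions
% on the activation \sigma vary between the classical results.
\[
  \forall f \in C(K),\; K \subset \mathbb{R}^d \text{ compact},\;\; \forall \varepsilon > 0:
  \quad \exists\, N,\, \{a_i, w_i, b_i\} \;\text{ s.t. }\;
  \sup_{x \in K} \Big|\, f(x) - \sum_{i=1}^{N} a_i\, \sigma(w_i^{\top} x + b_i) \,\Big| < \varepsilon
\]
```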


February 13th Weighted Sums of Random Kitchen Sinks: Replacing minimization with randomization in learning [new scribed notes]
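
In the spirit of the random features idea of [23], here is a minimal random Fourier features sketch: random features followed by a simple linear fit. The toy dataset, feature count and ridge parameter are illustrative choices.

```python
import numpy as np

# Random Fourier features whose inner products approximate a Gaussian (RBF) kernel,
# followed by ridge regression on a toy problem. All hyperparameters are illustrative.
rng = np.random.default_rng(0)

def random_fourier_features(X, num_features, bandwidth, rng):
    """Map X to cosine features approximating an RBF kernel with the given bandwidth."""
    d = X.shape[1]
    W = rng.normal(scale=1.0 / bandwidth, size=(d, num_features))
    b = rng.uniform(0.0, 2.0 * np.pi, size=num_features)
    return np.sqrt(2.0 / num_features) * np.cos(X @ W + b)

# Toy regression problem: y = sin of the first coordinate.
X = rng.uniform(-3, 3, size=(500, 2))
y = np.sin(X[:, 0])

Z = random_fourier_features(X, num_features=300, bandwidth=1.0, rng=rng)
w = np.linalg.solve(Z.T @ Z + 1e-3 * np.eye(Z.shape[1]), Z.T @ y)  # ridge regression
print(np.mean((Z @ w - y) ** 2))  # small training error on this toy problem
```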

February 19th Generative models, GANs, WGANs [scribed notes] [scribed notes]
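
For reference, the two training objectives at the heart of this lecture are reproduced below, following [17] and [18].

```latex
% The original GAN minimax objective [17] and the WGAN objective [18];
% in the second line the supremum is over 1-Lipschitz critics f.
\begin{align*}
\text{GAN:}\quad  & \min_G \max_D \;\;
    \mathbb{E}_{x \sim p_{\mathrm{data}}}[\log D(x)]
    + \mathbb{E}_{z \sim p_z}[\log(1 - D(G(z)))] \\
\text{WGAN:}\quad & \min_G \;\sup_{\lVert f \rVert_L \le 1} \;\;
    \mathbb{E}_{x \sim p_{\mathrm{data}}}[f(x)]
    - \mathbb{E}_{z \sim p_z}[f(G(z))]
\end{align*}
```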

February 20th Student presentations, A

  1. Distribution-Independent PAC Learning of Halfspaces with Massart Noise
    • Presented by: JP Letendre, Howard Huang
    • Required reading: Section 1 (without 1.3) and Section 2 until lemma 2.3 (not included)
  2. Adversarial Examples Are Not Bugs, They Are Features
    • Presented by: Charles Huard
    • Optional intro: Paper 1 and paper 2.
    • Required reading: Introduction, Section 4 (“A Theoretical Framework for Studying (Non)-Robust Features”) and the proof of Theorem 2 (in pages 35 and 36)

February 26th The Numerics of GANs [new scribed notes] [slides]

February 27th Student presentations, B

  1. Random Matrices and applications to ML [slides]
    • Written and presented by: Manuela Girotti
    • Recommended: Section 1
    • Required: Section 3 (without proofs), Section 4.1 and 5.1
    • Optional reading: Section 6
  2. Entropy-SGD: Biasing gradient descent into wide valleys
    • Presented by: Oleksiy Ostapenko, Emre Onur Kahya
    • Required reading: Sections 1,3 and 4

Winter break

March 4th No class

March 5th No class


March 12th Student presentations, C

COVID-19 disruption

March 18th No class

March 19th No class

March 25th Surprising results on the generalization error of neural networks [new lecture]

March 26th One-on-one project chats

April 1st Where is the Bayes risk hiding?

April 2nd Student presentations, D

April 8th Reproducing kernel Hilbert spaces - Guest lecture by Jose Gallego [new lecture - no scribed notes]

April 9th Neural Tangent Kernel [new lecture - no scribed notes]
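
Since there are no scribed notes for this lecture, the standard definition of the (empirical) neural tangent kernel is recorded below for orientation; in the infinite-width limit studied in the NTK literature, this kernel remains essentially constant during training.

```latex
% Empirical neural tangent kernel of a scalar-output network f(x; \theta) at
% parameters \theta: the inner product of the parameter gradients at two inputs.
\[
  \Theta_{\theta}(x, x') \;=\;
  \big\langle \nabla_{\theta} f(x; \theta),\; \nabla_{\theta} f(x'; \theta) \big\rangle
\]
```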

April 15th (Distributional) reinforcement learning [new scribed notes]

April 22nd End of semester (online) project presentations


References

  1. Convex Optimization: Algorithms and Complexity, Sebastien Bubeck.
  2. Theory of classification: a survey of some recent advances, Stephane Boucheron, Olivier Bousquet and Gabor Lugosi.
  3. iPython notebook demonstrating basic ideas of gradient descent and stochastic gradient descent, simple and complex models as well as generalization.
  4. Understanding Machine Learning: From Theory to Algorithms, by Shai Shalev-Shwartz and Shai Ben-David.
  5. Convex Optimization, Stephen Boyd and Lieven Vandenberghe.
  6. Nesterov’s Accelerated Gradient Descent for Smooth and Strongly Convex Optimization, blog post by Sebastien Bubeck.
  7. Introductory lectures on convex optimization, Yurii Nesterov.
  8. Why momentum really works, blog post by Gabriel Goh (this blog post uses a slightly different parametrization of the momentum algorithm. The version we discuss in class, only applies the learning rate on the gradient.)
  9. YellowFin and the Art of Momentum Tuning, SysML 2019, J. Zhang, I. Mitliagkas.
  10. Large-scale Machine Learning and Optimization (class), Dimitris Papailiopoulos, University of Wisconsin.
  11. Advanced Machine Learning Systems (class), Chris De Sa, Cornell University.
  12. A PAC-Bayesian Tutorial with A Dropout Bound, David McAllester.
  13. Stability and generalization, O. Bousquet, A. Elisseeff.
  14. Train faster, generalize better: Stability of stochastic gradient descent, M. Hardt, B. Recht, Y. Singer.
  15. Computing Nonvacuous Generalization Bounds for Deep (Stochastic) Neural Networks with Many More Parameters than Training Data, Gintare Karolina Dziugaite, Daniel M. Roy
  16. Lecture notes on generative learning algorithms, Andrew Ng
  17. Generative Adversarial Nets, Ian Goodfellow et al.
  18. Wasserstein GAN, Martin Arjovsky, Soumith Chintala, Léon Bottou
  19. Read-through: Wasserstein GAN, Alex Irpan
  20. The Numerics of GANs, Lars Mescheder, Sebastian Nowozin, Andreas Geiger
  21. Optimization Methods for Large-Scale Machine Learning, Léon Bottou, Frank E. Curtis, Jorge Nocedal
  22. Accelerating Stochastic Gradient Descent using Predictive Variance Reduction, Rie Johnson, Tong Zhang
  23. Weighted Sums of Random Kitchen Sinks: Replacing minimization with randomization in learning, Ali Rahimi, Ben Recht
  24. PacGAN: The power of two samples in generative adversarial networks, Zinan Lin, Ashish Khetan, Giulia Fanti, Sewoong Oh
  25. Stochastic First- and Zeroth-order Methods for Nonconvex Stochastic Programming, Saeed Ghadimi, Guanghui Lan
  26. Linear Convergence of Gradient and Proximal-Gradient Methods Under the Polyak-Łojasiewicz Condition, Hamed Karimi, Julie Nutini, Mark Schmidt
  27. Escaping From Saddle Points — Online Stochastic Gradient for Tensor Decomposition, Rong Ge, Furong Huang, Chi Jin, Yang Yuan
  28. How to Escape Saddle Points Efficiently, Chi Jin, Rong Ge, Praneeth Netrapalli, Sham M. Kakade, Michael I. Jordan
  30. The Loss Surfaces of Multilayer Networks, Anna Choromanska, Mikael Henaff, Michael Mathieu, Gérard Ben Arous, Yann LeCun
  31. Learning Long-Term Dependencies with Gradient Descent is Difficult, Y Bengio, P Simard, P Frasconi
  32. On the difficulty of training Recurrent Neural Networks, Razvan Pascanu, Tomas Mikolov, Yoshua Bengio
  33. Opening the Black Box: Low-Dimensional Dynamics in High-Dimensional Recurrent Neural Networks, David Sussillo and Omri Barak
  34. Reinforcement Learning: An Introduction, Richard S. Sutton and Andrew G. Barto
  35. Course Notes for EE227C (Spring 2018): Convex Optimization and Approximation, Moritz Hardt.
  36. Uniform convergence may be unable to explain generalization in deep learning, Vaishnavh Nagarajan, J. Zico Kolter
  37. Tutorial on Practical Prediction Theory for Classification, John Langford, 2005.
  38. Computing Nonvacuous Generalization Bounds for Deep (Stochastic) Neural Networks with Many More Parameters than Training Data, Gintare Karolina Dziugaite, Daniel M. Roy, 2017.
  39. Non-Vacuous Generalization Bounds at the ImageNet Scale: A PAC-Bayesian Compression Approach, Wenda Zhou, Victor Veitch, Morgane Austern, Ryan P. Adams, Peter Orbanz, 2018.
  40. Representation benefits of deep feedforward networks, Matus Telgarsky, 2015.