IFT 6085: Theoretical principles for deep learning
Winter 2021
The class will be taught again in Winter 2022 in person.
Description
Research in deep learning produces state-of-the-art results on a number of machine learning tasks. Most of those advances are driven by intuition and massive exploration through trial and error. As a result, theory is currently lagging behind practice. The ML community does not fully understand why the best methods work.
- Why can we reliably optimize non-convex objectives?
- How expressive are our architectures, in terms of the hypothesis class they describe?
- Why do some of our most complex models generalize to unseen examples when we use datasets orders of magnitude smaller than what the classic statistical learning theory deems sufficient?
A symptom of this lack of understanding is that deep learning methods largely lack guarantees and interpretability, two necessary properties for mission-critical applications. More importantly, a solid theoretical foundation can aid the design of a new generation of efficient methods—sans the need for blind trial-and-error-based exploration.
In this class we will go over a number of recent publications that attempt to shed light onto these questions. Before discussing the new results in each paper we will first introduce the necessary fundamental tools from optimization, statistics, information theory and statistical mechanics. The purpose of this class is to get students engaged with new research in the area. To that end, the majority of credit will be given for a class project report and presentation on a relevant topic.
Prerequisites: This is meant to be an advanced graduate class for students who want to engage in theory-driven deep learning research. We will introduce some theoretical tools necessary.
- Probability
- Linear algebra
- Machine learning
People
Instructor: Ioannis Mitliagkas
TA: Jose Gallego
Class info
Winter 2022 semester:
- Wednesday 9h30-11h15
- Thursday 9h00-10h45
Location: Mila (exact room TBA)
First class: Wednesday, January 12th, 2022.
Office hours: 10:45am-11:45am on Thursday right after class.
Communication
We will use the class’s Studium to make announcements. We will use a Slack workspace for discussion. Please keep an eye out for an invitation to the Slack. We will use Gradescope for quizzes and report submissions. Email is not a good way to reach us. However, if you have something personal/sensitive to discuss, feel free to email me or the TA. Starting your email subject with ‘IFT6085:’ will ensure that your email is not miscategorized.
Evaluation
- Class project on a research topic: 35%
- Paper presentations: 15%
- Midterm exam: 15%
- Quizzes: 15%
- Scribing: 10%
- Early-semester exam: 5%
- Class participation: 5%
Use this Latex template for scribing.
Tentative topics–to be updated as we go along
- Generalization: theoretical analysis and practical bounds
- Information theory and its applications in ML (information bottleneck, lower bounds etc.)
- Generative models beyond the pretty pictures: a tool for traversing the data manifold, projections, completion, substitutions etc.
- Taming adversarial objectives: Wasserstein GANs, regularization approaches and controlling the dynamics
- The expressive power of deep networks (deep information propagation, mean-field analysis of random networks etc.)
Extensive class bibliography
You can find a large number of recent (mostly) research papers related to the class’s objectives here. You can use those for your in-class paper presentations and projects.
Schedule
For the first half of the class we will be closely following the previous iteration of the class.
January 14th Class introduction [slides, old quiz]
Crash course in optimization
January 20th Basics of convex analysis and gradient descent [scribed notes]
Reading:
Convex analysis basics from ‘Convex Optimization’ by Boyd, Vandenberge ([5] under references):
- Chapter 2 (required: beginning of chapter to 2.1.4, recommended: 2.1.5 to end of section)
- Chapter 3 (required: beginning of chapter to 3.1, recommended: 3.2, 3.3 and 3.4)
Convergence proofs: from Chapter 3 of [1] (‘Convex Optimization…’ by S.Bubeck under References)
- Required: Convergence proof of Theorem 3.2 (note that we studied the unconstrained case. In this case the projection operator PiX(x) is the identity operator)
January 21st The different rates of gradient descent: from Lipschitz to strongly convex [scribed notes]
Reading:
Convergence proofs from Chapter 3 of [1] (‘Convex Optimization…’ by S.Bubeck under References)
- Required: Sections 3.1, 3.2, 3.4, with emphasis on the Convergence proof of Theorem 3.12
- Optional: Section 3.3 and anything related to constrained optimization. Keep in mind that we still discuss in class simplified unconstrained versions of some constrained results from Bubeck. Anything in our scribed notes is also required.
January 27th
EARLY TAKE-HOME EXAM
Black box models and lower bounds [scribed notes]
Reading:
- Required: [1, Theorem 3.15]
- Recommended: [6]
January 28th Accelerated methods: Polyak’s momentum (the heavy ball method) [scribed notes] [slides]
Reading:
- Required: [8], [9, Sections 1, 2, 3, 4]
- Recommended: [35]
February 3rd Nesterov’s Accelerated Gradient, Stochastic gradient descent [scribed notes]
Reading:
- Required: [6], [7, pages 67-76]
- Required: [1, Section 6 until 6.2], [4, Section 14.3]
Crash course in statistical learning theory
February 4th SGD [continuation of previous lecture] and Elements of statistical learning theory [scribed notes]
Recommended reading: Section 2 (if you need the intro) and Section 5 of [4].
Required reading: Sections 3, 4 and 6 of [4].
February 10th Elements of statistical learning theory [continuation of previous lecture]
February 11th PAC-Bayes bounds [scribed notes]
Required reading: [12]
Recommended reading: Section 31 of [4]
Recommended reading (harder): Section 6 of [2]
February 17th Stability and generalization [scribed notes]
Required reading:
- Main body of the papers [13,14] without all the proofs
- The two sets of scribed notes provided here, including the covered proofs
Optional reading: Proofs that are not covevered in the scribed notes.
Seminar part of class
February 18th Expressivity and universal approximation theorems [scribed notes]
Reading:
- Optional reminder: Subsections 6.1, 6.2, 6.3 from [4]
- Required: Simple construction for universal approximation theorem: A visual proof that neural nets can compute any function By Michael Nielsen
- Required: Beginning of Section 20 until 20.4 (inclusive) in [4]
- Required: All of [40]
February 24th Weighted Sums of Random Kitchen Sinks: Replacing minimization with randomization in learning [scribed notes]
- Recommended reading: Section 16 (Kernel methods) of [4]
- Required reading: [23]
- Recommended viewing: video of talk by A. Rahimi. The discussion pertaining to this paper, starts at 31:10 (though the intro might be necessary) and ends at about 48:10.
February 25th Generative models, GANs, WGANs [scribed notes] [scribed notes]
- Recommended: Great tutorial and collection resources from Stefano Ermon and Aditya Grover at IJCAI.
- Recommended: Practical guide into Kantorovich-Rubinstein duality, Vincent Herrmann
- Required reading: Main bodies of [16,17,18,19] without the proofs
- Required reaading: Two sets of scribed notes above, including discussed proofs.
- Optional: Applications of GANs
Winter break
March 3rd Winter break No class
March 4th Winter break No class
March 10th Student presentations, A
- Performative Prediction Presented by: Alan Chan
- Expressivity, Complexity, Learnability Presented by: Martin Weiss
March 11th The Numerics of GANs [scribed notes] [slides]
- Required reading: [20]
March 17th Surprising results on the generalization error of neural networks [new lecture]
- Recommended reading: In Search of the Real Inductive Bias: On the Role of Implicit Regularization in Deep Learning
- Recommended reading: Understanding deep learning requires rethinking generalization
- Recommended reading: On the Bias-Variance Tradeoff: Textbooks Need an Update, blog post by Brady Neal
March 18th MIDTERM EXAM
March 24th Student presentations, B
- Neural SDEs as Infinite-Dimensional GANs Presented by: Tim DeLise
- Results on algorithmic stability Presented by: Magid Sabbagh
March 25th Student presentations, C
- Learning Invariant Representations for Reinforcement Learning without Reconstruction Presented by: Guillaume Huguet, Semih Canturk
- Emergence of Invariance and Disentanglement in Deep Representations Presented by: Abderrahim Fathan, Hai Phan
March 31st Where is the Bayes risk hiding?
- Recommended reading (refresher from last week): Understanding deep learning requires rethinking generalization
- Recommended reading: To Understand Deep Learning We Need to Understand Kernel Learning
- Recommended reading: Overfitting or perfect fitting? Risk bounds for classification and regression rules that interpolate
April 1st Student presentations, D
- Presented by: Eeshan Gunesh Dhekane, Sarthak Mittal
- Presented by: Hattie Zhou, Zacharie Legault
- Presented by: Amirhossein Kazemnejad, Abderrahim Fathan
April 7th 1-1 project chats
April 8th Neural Tangent Kernel [new lecture - no scribed notes]
- Recommended reading: Neural Tangent Kernel: Convergence and Generalization in Neural Networks
- Recommended reading: Blog summary of paper
Recommended reading on reproducing kernel hilbert spaces:
- Guest lecture by Jose Gallego from last yearvideo
- [15min] Watch this great video on duality.
- [5min] Get familiar with the statement of the Riesz representation theorem and think about the relationship with the video above.
- [40min] Read the Wikipedia entry on RKHSs - Can skip sections 4, 8 and 9.
- [15min] Optional: Read sections 1 and 2 of the Wikipedia entry on Kernel embedding of distributions
April 14th (Distributional) reinforcement learning [scribed notes]
- Recommended reading: Scribed notes above
- Recommended reading: A Distributional Perspective on Reinforcement Learning, by Bellemare et al.
April 15th Determinantal point processes for ML (guest lecture by Jose Gallego)
- Recommended reading: Notes from Stefanie Jegelka’s class at MIT, written by Swati Gupta.
- Recommended reading: Determinantal Point Processes in Randomized Numerical Linear Algebra, Derezinski and Mahoney.
April 21nd NO LECTURE
April 22nd End of semester (online) project presentations
Resources
- Convex Optimization: Algorithms and Complexity, Sebastien Bubeck.
- Theory of classification: a survey of some recent advances Stephane Boucheron, Olivier Bousquet and Gabor Lugosi
- iPython notebook demonstrating basic ideas of gradient descent and stochastic gradient descent, simple and complex models as well as generalization.
- Understanding Machine Learning: From Theory to Algorithms, by Shai Shalev-Shwartz and Shai Ben-David.
- Convex Optimization, Stephen Boyd and Lieven Vandenberghe.
- Nesterov’s Accelerated Gradient Descent for Smooth and Strongly Convex Optimization, blog post by Sebastien Bubeck.
- Introductory lectures on convex optimization, Yurii Nesterov.
- Why momentum really works, blog post by Gabriel Goh (this blog post uses a slightly different parametrization of the momentum algorithm. The version we discuss in class, only applies the learning rate on the gradient.)
- YellowFin and the Art of Momentum Tuning, SysML 2019, J. Zhang, I. Mitliagkas.
- Large-scale Machine Learning and Optimization (class), Dimitris Papailiopoulos, University of Wisconsin.
- Advanced Machine Learning Systems (class), Chris De Sa, Cornell University.
- A PAC-Bayesian Tutorial with A Dropout Bound, David McAllester.
- Stability and generalization, O. Bousquet, A. Elisseeff.
- Train faster, generalize better: Stability of stochastic gradient descent, M. Hardt, B. Recht, Y. Singer.
- Computing Nonvacuous Generalization Bounds for Deep (Stochastic) Neural Networks with Many More Parameters than Training Data, Gintare Karolina Dziugaite, Daniel M. Roy
- Lecture notes on generative learning algorithms, Andrew Ng
- Generative Adversarial Nets, Ian Goodfellow et al.
- Wasserstein GAN, Martin Arjovsky, Soumith Chintala, Léon Bottou
- Read-through: Wasserstein GAN, Alex Irpan
- The Numerics of GANs, Lars Mescheder, Sebastian Nowozin, Andreas Geiger
- Optimization Methods for Large-Scale Machine Learning, Léon Bottou, Frank E. Curtis, Jorge Nocedal
- Accelerating Stochastic Gradient Descent using Predictive Variance Reduction, Rie Johnson, Tong Zhang
- Weighted Sums of Random Kitchen Sinks: Replacing minimization with randomization in learning, Ali Rahimi, Ben Recht
- PacGAN: The power of two samples in generative adversarial networks, Zinan Lin, Ashish Khetan, Giulia Fanti, Sewoong Oh
- Stochastic First- and Zeroth-order Methods for Nonconvex Stochastic Programming, Saeed Ghadimi, Guanghui Lan
- Linear Convergence of Gradient and Proximal-Gradient Methods Under the Polyak-Łojasiewicz Condition, Hamed Karimi, Julie Nutini, Mark Schmidt
- Escaping From Saddle Points — Online Stochastic Gradient for Tensor Decomposition, Rong Ge, Furong Huang, Chi Jin, Yang Yuan
- How to Escape Saddle Points Efficiently, Chi Jin, Rong Ge, Praneeth Netrapalli, Sham M. Kakade, Michael I. Jordan
- RANDOM MATRICES AND COMPLEXITY OF SPIN GLASSES, ANTONIO AUFFINGER, GERARD BEN AROUS, AND JIRI CERNY
- The Loss Surfaces of Multilayer Networks, Anna Choromanska, Mikael Henaff, Michael Mathieu, Gérard Ben Arous, Yann LeCun
- Learning Long-Term Dependencies with Gradient Descent is Difficult, Y Bengio, P Simard, P Frasconi
- On the difficulty of training Recurrent Neural Networks, Razvan Pascanu, Tomas Mikolov, Yoshua Bengio
- Opening the Black Box: Low-Dimensional Dynamics in High-Dimensional Recurrent Neural Networks, David Sussillo and Omri Barak
- Reinforcement Learning: An Introduction, Richard S. Sutton and Andrew G. Barto
- Course Notes for EE227C (Spring 2018): Convex Optimization and Approximation, Moritz Hardt.
- Uniform convergence may be unable to explain generalization in deep learning , Vaishnavh Nagarajan, J. Zico Kolter
- Tutorial on Practical Prediction Theory for Classification, John Langford, 2005.
- Computing Nonvacuous Generalization Bounds for Deep (Stochastic) Neural Networks with Many More Parameters than Training Data, Gintare Karolina Dziugaite, Daniel M. Roy, 2017.
- Non-Vacuous Generalization Bounds at the ImageNet Scale: A PAC-Bayesian Compression Approach , Wenda Zhou, Victor Veitch, Morgane Austern, Ryan P. Adams, Peter Orbanz, 2018.
- Representation benefits of deep feedforward, Matus Telgarsky, 2015.