# (Winter 2018) IFT 6085: Theoretical principles for deep learning

**This is an old version of the class. For the latest iteration of the class check here.**

**Class discussion group:** please sign up to receive announcements and participate in discussion.

## Description

Research in deep learning produces state-of-the-art results on a number of machine learning tasks. Most of those advances are driven by intuition and massive exploration through trial and error. As a result, theory is currently lagging behind practice. The ML community does not fully understand why the best methods work.

- Why can we reliably optimize non-convex objectives?
- How expressive are our architectures, in terms of the hypothesis class they describe?
- Why do some of our most complex models generalize to unseen examples when we use datasets orders of magnitude smaller than what the classic statistical learning theory deems sufficient?

A symptom of this lack of understanding is that deep learning methods largely lack guarantees and interpretability, two necessary properties for mission-critical applications. More importantly, a solid theoretical foundation can aid the design of a new generation of efficient methods, without the need for blind trial-and-error exploration.

In this class we will go over a number of recent publications that attempt to shed light on these questions. Before discussing the new results in each paper, we will first introduce the necessary fundamental tools from optimization, statistics, information theory and statistical mechanics. The purpose of this class is to get students engaged with new research in the area. To that end, the majority of credit will be given for a class project report and presentation on a relevant topic.

**Prerequisites:**
This is meant to be an advanced graduate class for students who want to engage in theory-driven deep learning research. We will introduce the theoretical tools necessary, but start with the assumption that students are comfortable with basic probability and linear algebra.

## People

Lecturer: Ioannis Mitliagkas, Office: 3359, André-Aisenstadt

## Class info

Winter 2018 semester:

- Wednesday 9h30-11h15
- Thursday 9h30-11h15

Room: André-Aisenstadt 3195

Office hours: 11:15am-12:15pm on Thursday right after class.

## Evaluation

- Class project: 60%
- Paper presentation: 25%
- Scribing: 10%
- Class participation: 5%

Use this LaTeX template for scribing.

## Tentative topics (to be updated as we go along)

- Generalization: theoretical analysis and practical bounds
- Information theory and its applications in ML (information bottleneck, lower bounds etc.)
- Generative models beyond the pretty pictures: a tool for traversing the data manifold, projections, completion, substitutions etc.
- Taming adversarial objectives: Wasserstein GANs, regularization approaches and controlling the dynamics
- The expressive power of deep networks (deep information propagation, mean-field analysis of random networks etc.)

## Schedule

**January 10th**
Class introduction
[slides, quiz]

### Crash course in optimization

**January 11th**
Basics of convex analysis and gradient descent
[scribed notes]

Reading:

Convex analysis basics from ‘Convex Optimization’ by Boyd and Vandenberghe ([5] under Resources):

- Chapter 2 (required: beginning of chapter to 2.1.4, recommended: 2.1.5 to end of section)
- Chapter 3 (required: beginning of chapter to 3.1, recommended: 3.2, 3.3 and 3.4)

Convergence proofs from Chapter 3 of [1] (‘Convex Optimization…’ by S. Bubeck, under Resources):

- Required: Convergence proof of Theorem 3.2 (note that we studied the unconstrained case, in which the projection operator $\Pi_{\mathcal{X}}(x)$ is the identity operator)
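The remark about the projection operator can be illustrated with a minimal sketch. It is not taken from the course material: the toy objective, step size, and function names below are assumptions chosen only for illustration. The same update loop is run twice, once with the identity projection (the unconstrained case studied in class) and once with a projection onto the unit ball for contrast.

```python
import math

def identity_projection(x):
    # Unconstrained case: the feasible set is all of R^n, so the point is unchanged.
    return list(x)

def make_ball_projection(radius):
    # Euclidean projection onto the ball {x : ||x|| <= radius}, shown for contrast.
    def project(x):
        norm = math.sqrt(sum(xi * xi for xi in x))
        if norm <= radius:
            return list(x)
        return [radius * xi / norm for xi in x]
    return project

def projected_gradient_descent(grad, project, x0, step_size, num_steps):
    # Iterate x_{t+1} = project(x_t - step_size * grad(x_t)) and return the last iterate.
    x = list(x0)
    for _ in range(num_steps):
        g = grad(x)
        x = project([xi - step_size * gi for xi, gi in zip(x, g)])
    return x

# Toy objective (an assumption): f(x) = 0.5 * ||x - c||^2, whose gradient is x - c.
c = [3.0, 4.0]
grad = lambda x: [xi - ci for xi, ci in zip(x, c)]

x_free = projected_gradient_descent(grad, identity_projection, [0.0, 0.0], 0.5, 50)
x_ball = projected_gradient_descent(grad, make_ball_projection(1.0), [0.0, 0.0], 0.5, 50)
```

With the identity projection the loop is plain gradient descent, and `x_free` approaches the minimizer `c = (3, 4)`. With the unit-ball projection, `x_ball` settles at approximately `(0.6, 0.8)`, the point of the ball closest to `c`.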

**January 17th**
The different rates of gradient descent: from Lipschitz to strongly convex
[scribed notes]

Reading:

Convergence proofs from Chapter 3 of [1] (‘Convex Optimization…’ by S. Bubeck, under Resources)

- Required: Convergence proof of Theorem 3.12

**January 18th**
Black box models and lower bounds
[scribed notes]

Reading: [1, Theorem 3.15], [6]

**January 24th**
Accelerated methods
[scribed notes]

Reading: [6], [7, pages 67-76], [8], [9]

**January 25th**
Nesterov’s Accelerated Gradient, Stochastic gradient descent
[scribed notes]

Reading: Section 6 until 6.2 of [1], Section 14.3 of [4]

### Crash course in statistical learning theory

**January 31st**
Elements of statistical learning theory
[scribed notes]

Reading: Sections 2 (if you need the intro), 3, 4 and 6 of [4].

**February 1st**
PAC-Bayes bounds
[scribed notes]

Reading: [12]

Reading (harder): Section 6 of [2]

**February 7th**
Stability and generalization
[scribed notes]

Reading: [13]

**February 8th**
Stability and generalization: Part II
[scribed notes]

Reading: [13,14]

### Seminar part of class

**February 14th**
Applications of stability and PAC Bayes
[scribed notes]

Reading: [14,15]

**February 15th**
**NO CLASS** - Instructor is travelling

**February 21st**
Student paper presentations A

- Understanding deep learning requires rethinking generalization, Zhang C, Bengio S, Hardt M, Recht B, Vinyals O.
  **Presented by**: Aldo Lamarre, Matthew C. Scicluna
- Emergence of invariance and disentanglement in Deep Representations, Alessandro Achille, Stefano Soatto
  **Presented by**: Aristide Baratin, Brady Neal, Nithin Vasisth
- High-dimensional dynamics of generalization error in neural networks, Madhu S. Advani, Andrew M. Saxe
  **Presented by**: Reyhane Askari, Arian Hosseini, Mohammad Pezeshki

**February 22nd**
Generative models
[scribed notes]

Reading: [16,17]

**February 28th**
Student paper presentations B

- Optimizing Neural Networks with Kronecker-factored Approximate Curvature, James Martens, Roger Grosse
  **Presented by**: Josh Romoff & Riashat Islam
- Opening the black box of Deep Neural Networks via Information, Ravid Shwartz-Ziv, Naftali Tishby
  **Presented by**: Philip Amortila and Nicolas Gagné
- Why and When Can Deep – but Not Shallow – Networks Avoid the Curse of Dimensionality, Tomaso Poggio, Hrushikesh Mhaskar, Lorenzo Rosasco, Brando Miranda, Qianli Liao
  **Presented by**: William Fedus, Christos Tsirigotis, Breandan Considine
- Flow-GAN: Combining Maximum Likelihood and Adversarial Learning in Generative Models, Aditya Grover, Manik Dhar, Stefano Ermon
  **Presented by**: Amy Zhang, Kyle Kastner, Lluís Castrejón

**March 1st**
Wasserstein GANs
[**new** scribed notes]

Reading: [18,19]

**March 7th**
**BREAK** No class

**March 8th**
**BREAK** No class

**March 14th**
Student paper presentations C

- Compressed Sensing using Generative Models, Ashish Bora, Ajil Jalal, Eric Price, Alexandros G. Dimakis
- Generalization and Equilibrium in Generative Adversarial Nets (GANs), Sanjeev Arora, Rong Ge, Yingyu Liang, Tengyu Ma, Yi Zhang
- Do GANs actually learn the distribution? An empirical study, Sanjeev Arora, Yi Zhang
- Demystifying MMD GANs, Mikołaj Bińkowski, Dougal J. Sutherland, Michael Arbel, Arthur Gretton

**March 15th**
The Numerics of GANs
[scribed notes]

Reading: [20]

**March 21st**
Variance reduction techniques for stochastic optimization
[scribed notes]

Reading: [22], Section 5.3 of [21]

**March 22nd**
Weighted Sums of Random Kitchen Sinks: Replacing minimization with randomization in learning

Reading: [23]

**March 28th**
**NO CLASS** - Instructor is travelling

**March 29th**
PacGAN: The power of two samples in generative adversarial networks

Reading: [24]

**April 4th**
Some results on non-convex optimization

Reading: [25,26]

**April 5th**
Escaping saddle points
[scribed notes]
slides by Yang Yuan

Reading: [27, 28]

**April 11th**
The theory of spin glasses

Guest lecture by **Alex Fribergh**, UdeM Math.

Reading: [29]

**April 12th**
The Loss Surfaces of Multilayer Networks

Reading: [30]

**April 25th**
**Class project poster presentation**

## Resources

1. Convex Optimization: Algorithms and Complexity, Sebastien Bubeck.
2. Theory of classification: a survey of some recent advances, Stephane Boucheron, Olivier Bousquet and Gabor Lugosi
3. iPython notebook demonstrating basic ideas of gradient descent and stochastic gradient descent, simple and complex models as well as generalization.
4. Understanding Machine Learning: From Theory to Algorithms, by Shai Shalev-Shwartz and Shai Ben-David.
5. Convex Optimization, Stephen Boyd and Lieven Vandenberghe.
6. Nesterov’s Accelerated Gradient Descent for Smooth and Strongly Convex Optimization, blog post by Sebastien Bubeck.
7. Introductory lectures on convex optimization, Yurii Nesterov.
8. Why momentum really works, blog post by Gabriel Goh (this blog post uses a slightly different parametrization of the momentum algorithm; the version we discuss in class only applies the learning rate on the gradient).
9. YellowFin and the Art of Momentum Tuning, preprint, J. Zhang, I. Mitliagkas.
10. Large-scale Machine Learning and Optimization (class), Dimitris Papailiopoulos, University of Wisconsin.
11. Advanced Machine Learning Systems (class), Chris De Sa, Cornell University.
12. A PAC-Bayesian Tutorial with A Dropout Bound, David McAllester.
13. Stability and generalization, O. Bousquet, A. Elisseeff.
14. Train faster, generalize better: Stability of stochastic gradient descent, M. Hardt, B. Recht, Y. Singer.
15. Computing Nonvacuous Generalization Bounds for Deep (Stochastic) Neural Networks with Many More Parameters than Training Data, Gintare Karolina Dziugaite, Daniel M. Roy
16. Lecture notes on generative learning algorithms, Andrew Ng
17. Generative Adversarial Nets, Ian Goodfellow et al.
18. Wasserstein GAN, Martin Arjovsky, Soumith Chintala, Léon Bottou
19. Read-through: Wasserstein GAN, Alex Irpan
20. The Numerics of GANs, Lars Mescheder, Sebastian Nowozin, Andreas Geiger
21. Optimization Methods for Large-Scale Machine Learning, Léon Bottou, Frank E. Curtis, Jorge Nocedal
22. Accelerating Stochastic Gradient Descent using Predictive Variance Reduction, Rie Johnson, Tong Zhang
23. Weighted Sums of Random Kitchen Sinks: Replacing minimization with randomization in learning, Ali Rahimi, Ben Recht
24. PacGAN: The power of two samples in generative adversarial networks, Zinan Lin, Ashish Khetan, Giulia Fanti, Sewoong Oh
25. Stochastic First- and Zeroth-order Methods for Nonconvex Stochastic Programming, Saeed Ghadimi, Guanghui Lan
26. Linear Convergence of Gradient and Proximal-Gradient Methods Under the Polyak-Łojasiewicz Condition, Hamed Karimi, Julie Nutini, Mark Schmidt
27. Escaping From Saddle Points — Online Stochastic Gradient for Tensor Decomposition, Rong Ge, Furong Huang, Chi Jin, Yang Yuan
28. How to Escape Saddle Points Efficiently, Chi Jin, Rong Ge, Praneeth Netrapalli, Sham M. Kakade, Michael I. Jordan
29. Random Matrices and Complexity of Spin Glasses, Antonio Auffinger, Gérard Ben Arous, Jiří Černý
30. The Loss Surfaces of Multilayer Networks, Anna Choromanska, Mikael Henaff, Michael Mathieu, Gérard Ben Arous, Yann LeCun