Statistical machine learning

MATH-412

Media

MATH-412 Statistical Machine Learning

49, Lecture 13 A: Vapnik-Chervonenkis dimension

14.12.2020, 17:00

Lecture 13A: third part

- Learning bounds for the misclassification error

- Rademacher complexities for the 0-1 loss of binary classifiers

- Massart's lemma

- Dichotomies and growth function of a hypothesis space

- Shattering and Vapnik-Chervonenkis dimension

- Examples: linear classifiers, rectangles

- Sauer's lemma

- Polynomial upper bound

- Learning bound based on VC dimension

48, Lecture 13A: Summary of results for linear regression

14.12.2020, 16:59

Lecture 13A:

This video summarizes the results that were established by hand in the previous video.

47, Lecture 13A: Applying the learning bound to kernel regression

14.12.2020, 16:56

In this lecture, we apply the main theorem of the course to the case of least-squares regression in an RKHS.

46, Lecture 12B: Proof of the bounds on empirical process deviations

07.12.2020, 20:43

Lecture 12B: third part

- Proof of the Master theorem of slide 16.

45, Lecture 12B: About McDiarmid

07.12.2020, 20:40

Lecture 12 B: part 2

Some comments on the two ways of writing McDiarmid's inequality.

44, Lecture 12 B: High probability learning guarantees

07.12.2020, 20:30

Lecture 12 B: first part

- Showing the corollary from the previous proof

- Talagrand's lemma

- Bound on the empirical Rademacher complexity for linear functions

- Bound on the empirical Rademacher complexity for RKHS functions

High probability bounds

- McDiarmid's inequality

- Concentration of the empirical Rademacher complexity

- High prob. bounds on the empirical process deviations

- High probability bounds on the Risk
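
As a quick reference for the items above, the bounded-differences form of McDiarmid's inequality reads as follows (the exact statement and constants on the slides may differ slightly): if $f : \mathcal{Z}^n \to \mathbb{R}$ satisfies $|f(z_1,\dots,z_i,\dots,z_n) - f(z_1,\dots,z_i',\dots,z_n)| \le c_i$ for all $i$ and all arguments, then for independent $Z_1,\dots,Z_n$ and all $t > 0$,

$$\mathbb{P}\Big( f(Z_1,\dots,Z_n) - \mathbb{E}\, f(Z_1,\dots,Z_n) \ge t \Big) \le \exp\!\Big( -\frac{2t^2}{\sum_{i=1}^n c_i^2} \Big).$$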

43, Lecture 12 A: Proof for Rademacher control of the expected estimation error

07.12.2020, 19:11

Lecture 12 A: second part

- This video proves the lemma entitled "Empirical process control with the Rademacher complexity" on slide 5.

Please refer to those slides for the notation.

42, Lecture 12 A SLT: Rademacher complexity control for the empirical risk

07.12.2020, 19:01

- Introduction of the goals of this lecture on statistical learning theory

- An upper bound on the expected estimation error 

- Definition of Rademacher complexity

- A lemma on empirical process control with the Rademacher complexity

- Bounding the expected estimation error with the Rademacher complexity
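
For reference, the empirical Rademacher complexity used above is commonly defined as follows (normalization conventions may differ from the slides): for a sample $S = (z_1,\dots,z_n)$ and i.i.d. signs $\sigma_i$ uniform on $\{-1,+1\}$,

$$\hat{\mathcal{R}}_S(\mathcal{F}) = \mathbb{E}_{\sigma}\Big[ \sup_{f \in \mathcal{F}} \frac{1}{n} \sum_{i=1}^n \sigma_i f(z_i) \Big],$$

and the empirical process control lemma typically takes the form $\mathbb{E}\big[ \sup_{f \in \mathcal{F}} ( R(f) - \hat{R}_n(f) ) \big] \le 2\, \mathbb{E}\big[ \hat{\mathcal{R}}_S(\ell \circ \mathcal{F}) \big]$.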

41, Lecture 11B: Learning with deep NNs

30.11.2020, 11:49

Lecture 11B: second part

- Why does deep learning work?

- Deep networks overfit and generalize at the same time

- Deep learning is not well explained by classical statistical learning theory

- Why deep learning generalizes

- The double descent phenomenon

- Conclusions

40, Lecture 11B: Convolutional Neural Networks

30.11.2020, 11:48

- Convolution, cross-correlation and equivariance to translations

- Discrete 2D convolutions, zero padding

- Max Pooling

- Structure of a convolutional layer

- Multiple channels per layer

- Examples: LeNet5, AlexNet

- Recurrent Neural Networks

39, Lecture 11A: Training Neural Networks

30.11.2020, 11:48

- Using SGD to learn NNs

- Chain rules

- Forward chain rule

- Backward chain rule and backpropagation

- Gradients w.r.t. the parameters

- Weight decay

- Dropout

38, Lecture 11A: Neural Networks and Deep Learning

30.11.2020, 11:43

Lecture 11 A: first part

- Formal Neuron Model

- Two layer neural network

- Multilayer NN

- Activation Functions

- ReLU and linear splines

- Approximation results for NNs

37, Lecture 10B: EM demo

22.11.2020, 21:21

- Demo of EM on the GMM

36, Lecture 10: EM final form for the GMM

22.11.2020, 21:19

- Summary of the calculations corresponding to the E-step

- Final results for the M-step

- Pseudo-code of the EM algorithm for the Gaussian Mixture Model
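
To make the pseudo-code above concrete, here is a minimal illustrative R sketch of EM for a one-dimensional Gaussian mixture with K components; the initialization and variable names are my own and simpler than in the slides.

em_gmm <- function(x, K, n_iter = 100) {
  n    <- length(x)
  pi_k <- rep(1 / K, K)      # mixture proportions
  mu_k <- sample(x, K)       # initial means: K points picked from the data
  s2_k <- rep(var(x), K)     # initial variances
  for (it in 1:n_iter) {
    # E-step: responsibilities r[i, k] proportional to pi_k * N(x_i; mu_k, s2_k)
    r <- sapply(1:K, function(k) pi_k[k] * dnorm(x, mu_k[k], sqrt(s2_k[k])))
    r <- r / rowSums(r)
    # M-step: closed-form updates of the proportions, means and variances
    Nk   <- colSums(r)
    pi_k <- Nk / n
    mu_k <- colSums(r * x) / Nk
    s2_k <- colSums(r * (x - matrix(mu_k, n, K, byrow = TRUE))^2) / Nk
  }
  list(pi = pi_k, mu = mu_k, sigma2 = s2_k)
}

# Example: two well-separated components
set.seed(1)
x <- c(rnorm(200, -2), rnorm(200, 2))
em_gmm(x, K = 2)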

35, Lecture 10B: End of the M-step derivation

22.11.2020, 21:17

- M-step derivation for mu_k and Sigma_k

34, Lecture 10B: Derivation of EM for the GMM

22.11.2020, 21:14

- Presentation of the GMM

- Review of the EM main inequality

- Derivation of the E-step

- Part of the M-step for the estimation of the proportions pi_1,..., pi_K

33, Lecture 10A: K-means and the Gaussian Mixture Model

22.11.2020, 21:07

- K-means

- Jensen's inequality

- The Kullback-Leibler divergence and the entropy

- The Gaussian mixture model

- Abstract form of EM

32, Lecture 10A: K-means Demo

22.11.2020, 21:04

K-means demo
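
A tiny R demo in the spirit of this video, using the built-in kmeans() function on simulated 2-D data (the data and settings are made up for illustration):

set.seed(42)
x <- rbind(matrix(rnorm(100, mean = 0), ncol = 2),
           matrix(rnorm(100, mean = 3), ncol = 2))        # two separated clusters
fit <- kmeans(x, centers = 2, nstart = 10)                # K-means with 10 random restarts
plot(x, col = fit$cluster, pch = 19)                      # points colored by assigned cluster
points(fit$centers, col = 1:2, pch = 4, cex = 2, lwd = 3) # estimated cluster centers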

31, Lecture 9B: Boosting

17.11.2020, 17:20

Lecture 9B: Boosting

- Weak classifiers

- The Adaboost algorithm

- Forward stagewise additive modelling

- Deriving Adaboost as FSAM on the exponential loss

- More general FSAM

- Boosted trees: general algorithmic structure
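
The algorithmic structure listed above can be summarized in a short illustrative R sketch of AdaBoost with decision stumps as weak classifiers (labels in {-1, +1}); the helper names are my own and the stump search is deliberately unoptimized.

# Best decision stump (single feature, single threshold) under weights w
fit_stump <- function(X, y, w) {
  best <- list(err = Inf)
  for (j in 1:ncol(X)) {
    for (t in unique(X[, j])) {
      for (s in c(-1, 1)) {                        # s: sign predicted on the left of the split
        pred <- ifelse(X[, j] <= t, s, -s)
        err  <- sum(w * (pred != y))               # weighted 0-1 error
        if (err < best$err) best <- list(j = j, t = t, s = s, err = err)
      }
    }
  }
  best
}

predict_stump <- function(stump, X) {
  ifelse(X[, stump$j] <= stump$t, stump$s, -stump$s)
}

adaboost <- function(X, y, M = 50) {
  n <- nrow(X)
  w <- rep(1 / n, n)                               # uniform initial weights
  stumps <- vector("list", M)
  alpha  <- numeric(M)
  for (m in 1:M) {
    stumps[[m]] <- fit_stump(X, y, w)
    err      <- max(stumps[[m]]$err, 1e-10)        # weighted error of the weak classifier
    alpha[m] <- 0.5 * log((1 - err) / err)         # weight of the weak classifier
    pred <- predict_stump(stumps[[m]], X)
    w <- w * exp(-alpha[m] * y * pred)             # reweight: up-weight the mistakes
    w <- w / sum(w)
  }
  list(stumps = stumps, alpha = alpha)
}

predict_adaboost <- function(model, X) {
  F <- rowSums(sapply(seq_along(model$alpha),
                      function(m) model$alpha[m] * predict_stump(model$stumps[[m]], X)))
  sign(F)                                          # weighted vote of the weak classifiers
}

# Usage (y in {-1, +1}): model <- adaboost(X, y, M = 50); yhat <- predict_adaboost(model, X)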

30, Lecture 9A: Bagging and Random Forests

16.11.2020, 21:21

- Ensembling/Aggregation

- Bootstrap and the definition of bagging

- Bagging and variance reduction

- Random forests

- OOB samples and error estimates

- Variable Importance

- Final remarks

29, Lecture 8A: Decision and regression trees

09.11.2020, 14:05

- Example of decision tree learning for binary classification in 2d

- Decision trees as histogram predictors

- An impurity measure: the entropy associated with a loss function

- Empirical entropies

- Expressing the ER of decision trees with entropies

- Impurity decrease associated with a split

- Decision trees learning algorithm: pseudocode

- Regression trees learning algorithm: pseudocode

- Implementations and criticisms

- Conditional Inference Trees

28, Lecture 7C: Kernel Methods 4

09.11.2020, 13:53

- Theorem of Kimeldorf and Wahba in an RKHS, with proof

- Application to regularized ERM in an RKHS

- Illustration with kernelized SVMs

- Kernel ridge regression

- Scalability issues for kernel methods

- On the difference between convolution kernels and Mercer kernels
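
Kernel ridge regression as listed above admits a very short implementation; here is an illustrative R sketch with a Gaussian kernel (the regularization convention K + n*lambda*I and the function names are assumptions):

gauss_kernel <- function(A, B, sigma = 1) {
  # Gram matrix K[i, j] = exp(-||a_i - b_j||^2 / (2 sigma^2)); A and B are matrices
  d2 <- outer(rowSums(A^2), rowSums(B^2), "+") - 2 * A %*% t(B)
  exp(-d2 / (2 * sigma^2))
}

krr_fit <- function(X, y, lambda = 0.1, sigma = 1) {
  K <- gauss_kernel(X, X, sigma)
  alpha <- solve(K + lambda * nrow(X) * diag(nrow(X)), y)  # (K + n*lambda*I) alpha = y
  list(alpha = alpha, X = X, sigma = sigma)
}

krr_predict <- function(fit, Xnew) {
  gauss_kernel(Xnew, fit$X, fit$sigma) %*% fit$alpha       # f(x) = sum_i alpha_i k(x, x_i)
}

The cost of solving the n-by-n linear system is the scalability issue mentioned above.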

27, Lecture 7B: Kernel Methods 3

03.11.2020, 06:59

Lecture 7B: third part

- The RKHS: a Hilbert space with continuous evaluation functionals

- Kernel associated with an RKHS and the reproducing property

- Proof that RKHS reproducing kernels are positive definite functions

- Moore-Aronszajn theorem: the RKHS reproducing kernels are exactly the positive definite functions

- The linear, polynomial and Gaussian RKHS

26, Lecture 7B: Kernel Methods 2

03.11.2020, 06:57

Lecture 7B: second part

A simple example where the dot product in feature space can be computed without computing the feature map. 

25, Lecture 7B: Kernel Methods 1

03.11.2020, 06:52

- A simple version of the representer theorem of Kimeldorf and Wahba

- Its main consequence: the possibility of rewriting learning problems, and in particular regularized ERM, using kernel matrices

- Particular case of the kernel matrix associated with the linear mapping

24, Lecture 7A: Splines

02.11.2020, 10:54

Lecture 7A

- B-splines: introduction of the fundamental concepts and construction of the basis

- Natural cubic splines: imposing that the spline function extrapolates linearly outside of the data range

- Solving learning problems with splines just requires a change of the design matrix

- Smoothing splines: natural cubic splines are in fact the solution of a penalized least-squares problem with minimal mean squared curvature

- Practical choices and the linear computational complexity of splines

- Multivariate splines based on tensor-product basis functions

23, Lecture 6B: SVMs

27.10.2020, 09:42

Lecture 6B: Support Vector Machines

Hard-margin SVM (for separable data)

- Geometric construction

- Calculation of the margin

- Optimization problem

- Support vectors

Soft-margin SVM (for data which is not necessarily separable)

- Slack variable: geometric idea

- Optimization problem for the soft-margin SVM

- Reformulation via the hinge loss and interpretation as regularized ER

- Quadratic hinge loss

Imbalanced classification, and an SVM formulation for it
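
For reference, the soft-margin SVM and its hinge-loss reformulation listed above take the standard form (the constants may be parametrized differently in the slides):

$$\min_{w, b, \xi} \ \tfrac{1}{2}\|w\|^2 + C \sum_{i=1}^n \xi_i \quad \text{s.t.} \quad y_i (w^\top x_i + b) \ge 1 - \xi_i, \quad \xi_i \ge 0,$$

which is equivalent to regularized empirical risk minimization with the hinge loss,

$$\min_{w, b} \ \frac{1}{n} \sum_{i=1}^n \max\big(0,\, 1 - y_i (w^\top x_i + b)\big) + \lambda \|w\|^2 .$$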

22, Lecture 6A: Randomized classifiers in the ROC plane

26.10.2020, 18:01

Lecture 6A: second part

- Detailed explanation of why randomized classifiers make it possible to attain convex combinations in the ROC plane

- Proof that a randomized classifier has an FPR equal to a convex combination of the FPRs of its base classifiers

21, Lecture 6A: Evaluation of binary classifiers

26.10.2020, 17:57

Lecture 6A: first part

- Recall, Sensitivity, Specificity, false positive and false negative rates, precision, etc

- The ROC plane and the ROC curves

- Convex combinations in the ROC plane

- Relation between metrics measured in the ROC plane and the misclassification error, AUC, and TAUC

- Precision-recall curve
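
A small illustrative R helper computing the (FPR, TPR) points of a ROC curve from real-valued scores, in the spirit of this lecture (function and variable names are my own):

roc_points <- function(scores, labels) {
  # labels in {0, 1}; one (FPR, TPR) point per candidate threshold
  thresholds <- sort(unique(scores), decreasing = TRUE)
  t(sapply(thresholds, function(th) {
    pred <- as.integer(scores >= th)
    tpr  <- sum(pred == 1 & labels == 1) / sum(labels == 1)  # sensitivity / recall
    fpr  <- sum(pred == 1 & labels == 0) / sum(labels == 0)  # 1 - specificity
    c(fpr = fpr, tpr = tpr)
  }))
}

# Example with noisy scores correlated with the labels
set.seed(1)
labels <- rbinom(200, 1, 0.5)
scores <- labels + rnorm(200)
plot(roc_points(scores, labels), type = "s", xlab = "FPR", ylab = "TPR")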

20, Lecture 5B: Perceptron, SGD, and Fisher's Linear Discriminant

19.10.2020, 10:35

Lecture 5B

- Perceptron: formal neuron model and loss function

- Perceptron: Learning algorithm in the separable case

- Perceptron: Learning algorithm in the non-separable case

- Stochastic Gradient Descent: Abstract form

- SGD: Theoretical result

- Using SGD for risk minimization vs. for empirical risk minimization

- Fisher's Linear Discriminant: the concept of a generative model

- Fisher's Linear Discriminant: overview of the equations and comparison with logistic regression.

19, Lecture 5A: Logistic regression for {-1,1} labels

19.10.2020, 10:33

Lecture 5A: second part

- Logistic regression for {-1,1} labels

- Reinterpretation of logistic regression as ERM with a new loss function
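
The reinterpretation mentioned above is the following: for labels $y \in \{-1, +1\}$ and a linear score $f(x) = w^\top x + b$, maximizing the conditional likelihood of logistic regression is the same as ERM with the logistic loss

$$\ell\big(y, f(x)\big) = \log\big(1 + e^{-y f(x)}\big).$$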

18, Lecture 5A: Linear Binary Classification

19.10.2020, 10:31

Lecture 5A: first part

- Extension of the 0-1 loss to real-valued score functions and the plug-in principle

- Hardness of ERM for the 0-1 loss

- Plug-in predictor for classification from least-squares regression predictors

- Logistic regression: first formulation

17, Lecture 4B: Local Averaging predictors

13.10.2020, 12:31

Lecture 4B:

- The general form of local averaging predictors is presented

- Examples:

- k-NN

- Histogram-based predictors

- Nadaraya-Watson predictors

- Properties of local averaging predictors

- General framework of local ERM

- Recovering local averaging predictors as solutions to local ERM for the quadratic risk when the predictors are "locally constant"

- Local linear regression (LOWESS)
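
An illustrative R sketch of the Nadaraya-Watson predictor listed above, with a Gaussian kernel (the bandwidth h and the simulated data are placeholders):

nw_predict <- function(x0, x, y, h = 0.5) {
  # local averaging: weighted mean of the y_i, with weights decaying with |x_i - x0|
  sapply(x0, function(x0i) {
    w <- dnorm((x - x0i) / h)
    sum(w * y) / sum(w)
  })
}

# Example on simulated data
set.seed(1)
x <- runif(200, 0, 10)
y <- sin(x) + rnorm(200, sd = 0.3)
grid <- seq(0, 10, length.out = 400)
plot(x, y)
lines(grid, nw_predict(grid, x, y, h = 0.5), lwd = 2)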

16, Lecture 4A-2: Using cross-validation, building a final predictor and nested CV

06.10.2020, 21:54

Lecture 4A: second part

- Interpretation of the CV risk estimates obtained

- Building a final predictor from the CV results

- Separate test sets and nested cross-validation

- Summary

15, Lecture 4A-1: Simple validation, cross-validation, leave-one-out

06.10.2020, 21:49

Lecture 4A: first part

- Simple validation, cross-validation (CV) and leave-one-out CV are introduced

- The way to use these techniques for hyperparameter tuning is described
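
An illustrative R sketch of K-fold cross-validation for tuning a hyperparameter, here the ridge penalty lambda (the helper functions and the simulated data are made up for illustration):

ridge_fit <- function(X, y, lambda) {
  solve(crossprod(X) + lambda * diag(ncol(X)), crossprod(X, y))
}

cv_ridge <- function(X, y, lambdas, K = 5) {
  n <- nrow(X)
  folds <- sample(rep(1:K, length.out = n))                # random fold assignment
  sapply(lambdas, function(lambda) {
    errs <- sapply(1:K, function(k) {
      tr <- folds != k
      te <- folds == k
      beta <- ridge_fit(X[tr, , drop = FALSE], y[tr], lambda)
      mean((y[te] - X[te, , drop = FALSE] %*% beta)^2)     # validation MSE on fold k
    })
    mean(errs)                                             # CV estimate of the risk
  })
}

# Example: pick the lambda minimizing the CV risk estimate
set.seed(1)
X <- matrix(rnorm(100 * 10), 100, 10)
y <- X[, 1] - 2 * X[, 2] + rnorm(100)
lambdas <- 10^seq(-3, 2, length.out = 20)
best_lambda <- lambdas[which.min(cv_ridge(X, y, lambdas))]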

14, Lecture 3B: Model selection with AIC and BIC

05.10.2020, 18:55

Lecture 3B: second part

- How to use AIC for model selection?

- Presentation of BIC

- AIC vs BIC

- The Generalized Information Criterion and connection to sparsity.

13, Lecture 3B: The Information Criteria of Takeuchi and Akaike

05.10.2020, 18:36

Lecture 3B: first part

- In this lecture, we concentrate on estimating the optimism for predictors that are learned via statistical models using the maximum likelihood principle

- The derivation of Takeuchi's Information Criterion is sketched

- Akaike's information criterion is then obtained as a particular case

- AIC is applied to linear regression as an example

- AICc, the corrected AIC criterion, is presented for the case where p is large.

12, Lecture 3A: Risk estimation: Mallows' CL and Cp

29.09.2020, 00:58

- Optimism is defined as the amount by which the empirical risk underestimates the risk

- The fixed design setting is introduced

- The form of the optimism for the quadratic risk is derived

- This defines the effective degrees of freedom

- Linear estimators are introduced with linear regression and ridge as examples

- The degrees of freedom can be computed for linear estimators and take the form of Mallows' CL

- The particular case of linear regression is covered by Mallows' Cp

- The lecture ends with comments on how to use the CL and the Cp in practice.
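
For reference, in the fixed design setting with a linear estimator $\hat{y} = S y$ and noise variance $\sigma^2$, the quantities above take the standard form (up to normalization conventions that may differ from the slides):

$$\text{optimism} = \frac{2}{n} \sum_{i=1}^n \operatorname{Cov}(\hat{y}_i, y_i) = \frac{2\sigma^2}{n} \operatorname{tr}(S), \qquad \text{df}(S) = \operatorname{tr}(S),$$

so Mallows' CL estimates the risk by $\frac{1}{n}\|y - S y\|^2 + \frac{2\hat\sigma^2}{n}\operatorname{tr}(S)$; for ordinary linear regression with $p$ predictors, $\operatorname{tr}(S) = p$, which gives Mallows' Cp.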

11, Lecture 2B-3: Comparing L1 and L0 + Greedy algorithms

25.09.2020, 17:34

Lecture 2B third part

- The particular case of linear regression with orthogonal designs is considered to compare the Lasso with L0 penalization. The similarities and differences are highlighted by comparing soft-thresholding with hard-thresholding

- Greedy algorithms: Forward selection and Orthogonal Matching Pursuit

- An empirical comparison of L0, L1 and L2
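
The soft- and hard-thresholding operators compared above fit in two lines of R (the threshold t stands for the penalty-dependent level):

soft_threshold <- function(z, t) sign(z) * pmax(abs(z) - t, 0)  # Lasso / L1, orthogonal design
hard_threshold <- function(z, t) z * (abs(z) > t)               # L0 / best subset, orthogonal design

# Soft-thresholding shrinks as well as selects; hard-thresholding only selects
z <- seq(-3, 3, length.out = 200)
plot(z, hard_threshold(z, 1), type = "l", ylab = "thresholded value")
lines(z, soft_threshold(z, 1), lty = 2)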

10, Lecture 2B-2: Geometry of the Lasso

25.09.2020, 17:33

Lecture 2B second part

- A geometric explanation of why Lasso solutions are sparse

9, Lecture 2B-1: Other regularizations + the Lasso

25.09.2020, 17:32

Lecture 2B first part:

- General remarks on other regularizations

- On the possibility of penalizing with the L0 quasi-norm

- The Lasso

8, Lecture 2A: Complexity

21.09.2020, 23:21

Lecture 2A third part:

- Explicit vs. implicit control of complexity

- Approximation-estimation trade-off

7, Lecture 2A: Regularization

21.09.2020, 23:19

Second part of Lecture 2A

- Tikhonov regularization 

- Ridge regression

- Example of the application of ridge regression to polynomial regression
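
An illustrative R sketch echoing the last item: ridge regression applied to polynomial features (the degree, the value of lambda and the simulated data are placeholders; the intercept is penalized here for simplicity):

set.seed(1)
x <- runif(50, -1, 1)
y <- sin(2 * pi * x) + rnorm(50, sd = 0.3)
X <- outer(x, 0:10, "^")                     # polynomial design matrix of degree 10

lambda <- 1e-3
beta <- solve(crossprod(X) + lambda * diag(ncol(X)), crossprod(X, y))  # ridge solution

grid  <- seq(-1, 1, length.out = 200)
Xgrid <- outer(grid, 0:10, "^")
plot(x, y)
lines(grid, Xgrid %*% beta, lwd = 2)         # regularized polynomial fit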

6, Lecture 2A: Overfitting

21.09.2020, 23:17

First part of Lecture 2A:

The case of polynomial regression is considered to discuss overfitting.

5, Lecture 1B: Theoretical Properties of the linear regression estimator

15.09.2020, 21:56

Lecture 1B: second part

This section reviews a number of results from classical statistics on linear regression:

- The geometric point of view and the hat matrix

- The Gauss-Markov theorem

- The relation with Maximum Likelihood in Gaussian conditional models

- The distributional properties of the estimator when the data is exactly Gaussian.

4, Lecture 1B-1: Linear regression basics from the ERM perspective

15.09.2020, 21:46

Lecture 1B: first part

A brief review of Linear Regression introduced from the perspective of the empirical risk minimization principle applied to the square loss.

3, Lecture 1A-3: PAC learning, Empirical Risk Minimization, Inductive bias and Hypothesis Space

15.09.2020, 21:36

Lecture 1A: third part

This section starts by using concepts from decision theory to propose a way of quantifying how much an algorithm or a learning scheme "learns", via the PAC learning framework. It then introduces the Empirical Risk Minimization principle, discusses why learning is an ill-posed problem in the sense of Hadamard and why an inductive bias is needed, and introduces different predictor/hypothesis spaces.

2, Lecture 1A-2: Examples of decision models for Supervised learning

15.09.2020, 21:23

Lecture 1A second part:

In this section, several examples of decision models are presented:

- for least squares regression

- for multiclass classification

- for scoring/ranking of pairs

- for an abstract OCR model

1, Lecture 1A: Introduction to Supervised Learning and concepts from Decision Theory

15.09.2020, 21:15

First part of lecture 1A:

- Supervised learning is defined

- Key concepts from decision theory such as loss function, risk, target function, conditional risk and excess risk are introduced.


Course summary

General

Contact:

Lecturer:  Guillaume Obozinski (guillaume.obozinski@epfl.ch), Yoav Zemel (yoav.zemel@epfl.ch)

Teaching Assistants: Yun Ho (ho.yun@epfl.ch), Shivang Sachar (shivang.sachar@epfl.ch)

Schedule:

Exam: Tuesday 28.01.2025 09h15-12h15 at CE16


Lectures: Tuesday, 13:15-15:00, in MA A330

The lectures are in person in MA A330.
The lectures are also available as pre-recorded videos on the course video channel (see below).

Exercises: Tuesday, 15:15-17:00 in MA A330.

Course video channel:

The videos of the version of this course taught in the Fall of 2020 are available on the following channel:

tube.switch.ch/channels/f03abc7c

Note that although the current version of the course is relatively close to the 2020 version, some video lectures (e.g. lecture 3 in the videos) have been removed from the current course version.

Main references (a free e-copy is available on each book's webpage; click on the links below):

An Introduction to Statistical Learning, with Applications in R, by James, G., Witten, D., Hastie, T. and Tibshirani, R. Springer, 2013.

The Elements of Statistical Learning, Data Mining, Inference, and Prediction, by Hastie, T., Tibshirani, R., and Friedman, J., Springer, 2009.

Pattern Recognition and Machine Learning, by Bishop, C. M., Springer, 2006.

References on Hilbert Spaces:

Resources for learning the R programming language:

The following ebooks should be more than adequate for this course.

1. R Cookbook by James Long and Paul Teetor

2. Advanced R by Hadley Wickham

Page of the course:

https://edu.epfl.ch/coursebook/en/statistical-machine-learning-MATH-412


Week 1: Introduction, decision theory, linear regression


Week 2: Linear regression and the linear model


Week 3: Linear regression, regularization


Week 4: Cross-validation and Local averaging


Week 5: Linear Binary Classification


Week 6: Classifier evaluation


Week 7: Splines and kernel methods


Week 8: Kernels and principal component analysis


Week 9: PCA and clustering / Gaussian mixture model



Week 10: Clustering and Gaussian Mixture Model / 2


Week 11: Trees, Bagging and Random forest 1



Week 12: Trees, Bagging and Random forest 2, neural networks 1


Oral project presentations of Weeks 13 and 14 (Dec 10 & Dec 17)

The project presentations will take place on December 10th and December 17th each time from 14:15 to 17:00. There will be a shorter lecture on those days from 13:15 to 14:00.

Guidelines for projects presentations:

  • will be based on your prepared slides,

  • should last 8 minutes

  • should be split among all members of the team, each of whom should present

  • should be a pedagogical presentation accessible to the rest of the class

  • should present (a) the general idea or problem, (b) some methodology/model/algorithm, (c) experimental results and insights from them.

  • will be followed by a few questions from Yoav and myself (3 minutes)

  • should be attended by all the students presenting on the same day (except for specific constraints)

For fairness and efficiency, all slides will have to be submitted on Moodle at the latest on December 9th at 23:59, whether your team presents on the 10th or on the 17th. The dates and times of your presentation will be randomized (based on your availabilities).




Week 13: Neural Networks and deep learning (13/12)



Week 14: Deep learning (second part, 20/12)


Material not covered: Risk estimation, Model Assessment and Model Selection (08/12)


Material not covered: Statistical Learning Theory I