These exercises are due on or before November 19th 2021 and should be submitted via the drop box available online. They can be done in groups of two students. The write-up can be in English or in French. Please submit your answers as a single zip file named MVA DM1 <your name>.zip if you worked alone, or MVA DM1 <name1> <name2>.zip with both of your names if you worked as a group of two. Indicate your name(s) in the documents as well. Note that the zip file should weigh no more than 16 MB. Only zip files with such names will be evaluated. Both your solutions and your code should be present in the zip file.

The files trainA, trainB and trainC contain samples of data \((x_n, y_n)\) where \(x_n \in \mathbb{R}^2\) and \(y_n \in \{0,1\}\) (each line of each file contains the 2 components of \(x_n\) followed by \(y_n\)). The goal of this exercise is to implement linear classification methods and to test them on the three data sets. The code should be written in R or Python. The source code should be handed in along with the results. However, all the requested figures should be printed on paper or included in a pdf file which is turned in, with clear titles that indicate what the figures represent. Therefore, we recommend writing your code and report in a Markdown file (in R or Python). The discussions may of course be handwritten.

Given the class variable, the data are assumed to be Gaussian, with different means for the different classes but a shared covariance matrix: \[y \sim \mathrm{Bernoulli}(\pi), \qquad x \mid \{y = i\} \sim \mathcal{N}(\mu_i, \Sigma).\]

Derive the form of the maximum likelihood estimator for this model.
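As a hint of the expected shape: setting the gradient of the log-likelihood to zero for each parameter yields (a sketch of the result, not the full derivation, with \(N\) the sample size and \(N_i\) the number of points in class \(i\)) \[\hat{\pi} = \frac{1}{N}\sum_{n=1}^{N} y_n, \qquad \hat{\mu}_i = \frac{1}{N_i}\sum_{n:\, y_n = i} x_n, \qquad \hat{\Sigma} = \frac{1}{N}\sum_{n=1}^{N} (x_n - \hat{\mu}_{y_n})(x_n - \hat{\mu}_{y_n})^{\intercal}.\]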

What is the form of the conditional distribution \(p(y = 1|x)\)? Compare with the form of logistic regression.
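As a hint, applying Bayes' rule and cancelling the shared quadratic terms gives a sigmoid of an affine function of \(x\): \[p(y = 1 \mid x) = \sigma\!\left(w^{\intercal}x + b\right), \qquad w = \Sigma^{-1}(\mu_1 - \mu_0), \qquad b = -\tfrac{1}{2}\mu_1^{\intercal}\Sigma^{-1}\mu_1 + \tfrac{1}{2}\mu_0^{\intercal}\Sigma^{-1}\mu_0 + \log\frac{\pi}{1-\pi},\] where \(\sigma(t) = 1/(1+e^{-t})\). This is the same functional form as logistic regression, but with parameters constrained by the generative model.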

Implement the MLE for this model and apply it to the data. Represent graphically the data as a point cloud in \(\mathbb{R}^2\) and the line defined by the equation \[p(y = 1|x) = 0.5.\]
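As an illustration, here is a minimal Python/NumPy sketch of these estimators and of the coefficients of the decision line (the function names and the use of NumPy are our assumptions, not prescribed by the assignment):

```python
import numpy as np

def lda_mle(X, y):
    """MLE for the shared-covariance Gaussian class-conditional model."""
    pi = y.mean()                                    # hat(pi): fraction of class-1 points
    mu0 = X[y == 0].mean(axis=0)                     # class-0 mean
    mu1 = X[y == 1].mean(axis=0)                     # class-1 mean
    centered = X - np.where(y[:, None] == 1, mu1, mu0)
    Sigma = centered.T @ centered / len(y)           # pooled covariance
    return pi, mu0, mu1, Sigma

def lda_boundary(pi, mu0, mu1, Sigma):
    """Coefficients (w, b) of the line p(y=1|x) = 0.5, i.e. w^T x + b = 0."""
    Si = np.linalg.inv(Sigma)
    w = Si @ (mu1 - mu0)
    b = -0.5 * mu1 @ Si @ mu1 + 0.5 * mu0 @ Si @ mu0 + np.log(pi / (1 - pi))
    return w, b
```

The line can then be drawn over the scatter plot of the data as the set \(\{x : w^{\intercal}x + b = 0\}\).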

Implement logistic regression for an affine function \(f(x) = w^{\intercal}x + b\) (do not forget the constant term).

- Give the numerical values of the parameters learnt.
- Represent graphically the data as a point cloud in \(\mathbb{R}^2\) as well as the line defined by the equation \[p(y = 1|x) = 0.5.\]
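One possible implementation sketch, using plain gradient ascent on the log-likelihood (the optimiser, step size, and iteration count are our choices; Newton/IRLS would also work):

```python
import numpy as np

def logistic_fit(X, y, lr=0.5, n_iter=5000):
    """Fit p(y=1|x) = sigmoid(w^T x + b) by gradient ascent on the log-likelihood."""
    Xb = np.hstack([X, np.ones((len(X), 1))])    # append a constant column for b
    theta = np.zeros(Xb.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-Xb @ theta))    # current predicted probabilities
        theta += lr * Xb.T @ (y - p) / len(y)    # gradient of the mean log-likelihood
    return theta[:-1], theta[-1]                 # w, b
```

Note that for linearly separable data the likelihood has no finite maximiser and the parameters keep growing (slowly); stopping after a fixed number of iterations is a pragmatic choice.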

Consider the class \(y\) as a real-valued variable taking only the values \(0\) and \(1\). Implement linear regression (for an affine function \(f(x) = w^{\intercal}x + b\)) by solving the normal equations.

- Provide the numerical values of the learnt parameters.
- Represent graphically the data as a point cloud in \(\mathbb{R}^2\) as well as the line defined by the equation \[f(x) = 0.5,\] where \(f(x)\) plays the role of \(p(y = 1|x)\).
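A sketch of the normal-equation solution (assuming NumPy; `np.linalg.lstsq` would be the numerically safer variant):

```python
import numpy as np

def linreg_normal(X, y):
    """Least-squares fit of f(x) = w^T x + b by solving the normal equations."""
    Xb = np.hstack([X, np.ones((len(X), 1))])     # design matrix with constant column
    theta = np.linalg.solve(Xb.T @ Xb, Xb.T @ y)  # (X^T X) theta = X^T y
    return theta[:-1], theta[-1]                  # w, b
```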

Data in the files testA, testB and testC are respectively drawn from the same distribution as the data in the files trainA, trainB and trainC. Test the different models learnt from the corresponding training data on these test data.

- Compute for each model the misclassification error (i.e. the fraction of the data misclassified) on the training data and compute it as well on the test data.
- Compare the performances of the different methods on the three datasets. Is the misclassification error larger, smaller, or similar on the training and test data? Why? Which methods yield very similar or dissimilar results? Which methods yield the best results on the different datasets? Provide an interpretation.
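The error itself is a one-liner; a minimal sketch, assuming the labels are 0/1 arrays:

```python
import numpy as np

def misclassification_error(y_true, y_pred):
    """Fraction of points whose predicted label differs from the true label."""
    return np.mean(np.asarray(y_true) != np.asarray(y_pred))
```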

We consider the following Gaussian mixture model (GMM) for multivariate data: \[ X_0 \sim \sum_{k=1}^{K}\pi_k \mathcal{N}(\mu_k, \Sigma_k),\] with \(\pi_k \in [0, 1]\), \(\sum_{k=1}^{K}\pi_k=1\), and \(X_0 \in \mathbb{R}^p\). We are provided with a realisation \((x_1, \dots, x_n)\) of a random sample of size \(n\).

Write the EM algorithm for this model, giving the exact update formulas for the E and M steps. Justify (prove) the use of these updates.
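For reference, introducing latent labels \(z_n\) and maximising the expected complete-data log-likelihood leads to the classical updates (a sketch of the result; the justification via the lower bound is what the question asks for). E-step, with responsibilities \(\tau_{nk} = p(z_n = k \mid x_n)\): \[\tau_{nk} = \frac{\pi_k\,\mathcal{N}(x_n; \mu_k, \Sigma_k)}{\sum_{j=1}^{K}\pi_j\,\mathcal{N}(x_n; \mu_j, \Sigma_j)}.\] M-step: \[\pi_k = \frac{1}{n}\sum_{n'=1}^{n}\tau_{n'k}, \qquad \mu_k = \frac{\sum_{n'}\tau_{n'k}\,x_{n'}}{\sum_{n'}\tau_{n'k}}, \qquad \Sigma_k = \frac{\sum_{n'}\tau_{n'k}\,(x_{n'}-\mu_k)(x_{n'}-\mu_k)^{\intercal}}{\sum_{n'}\tau_{n'k}}.\]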

Write a complete set of functions in R or Python implementing this EM algorithm. The main function should take the data and the number of components \(K\) as inputs, and return the parameter estimates as outputs. This function should also output the maximum a posteriori cluster memberships. All functions should be stand-alone: you are not allowed to rely on any packages.
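A self-contained sketch of such an implementation, assuming NumPy is acceptable for basic linear algebra (if it is not, the matrix operations must be written out by hand); the deterministic "spread" initialisation and small regularisation term are our own choices:

```python
import numpy as np

def gaussian_pdf(X, mu, Sigma):
    """Multivariate normal density evaluated at each row of X."""
    p = len(mu)
    diff = X - mu
    Si = np.linalg.inv(Sigma)
    quad = np.einsum('ni,ij,nj->n', diff, Si, diff)       # Mahalanobis distances squared
    norm = np.sqrt((2 * np.pi) ** p * np.linalg.det(Sigma))
    return np.exp(-0.5 * quad) / norm

def em_gmm(X, K, n_iter=200):
    """EM for a GMM: returns (pi, mu, Sigma) estimates and MAP cluster memberships."""
    n, p = X.shape
    mu = X[np.linspace(0, n - 1, K, dtype=int)].copy()    # deterministic init: spread points
    Sigma = np.array([np.cov(X.T) + 1e-6 * np.eye(p) for _ in range(K)])
    pi = np.full(K, 1.0 / K)
    for _ in range(n_iter):
        # E-step: responsibilities tau[n, k] = p(z_n = k | x_n)
        dens = np.column_stack([pi[k] * gaussian_pdf(X, mu[k], Sigma[k]) for k in range(K)])
        tau = dens / dens.sum(axis=1, keepdims=True)
        # M-step: weighted maximum-likelihood updates
        Nk = tau.sum(axis=0)
        pi = Nk / n
        mu = (tau.T @ X) / Nk[:, None]
        for k in range(K):
            diff = X - mu[k]
            Sigma[k] = (tau[:, k, None] * diff).T @ diff / Nk[k] + 1e-6 * np.eye(p)
    # final E-step so the MAP memberships match the returned parameters
    dens = np.column_stack([pi[k] * gaussian_pdf(X, mu[k], Sigma[k]) for k in range(K)])
    tau = dens / dens.sum(axis=1, keepdims=True)
    return pi, mu, Sigma, tau.argmax(axis=1)
```

In practice one would also monitor the log-likelihood across iterations (it must increase monotonically) and restart from several initialisations.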

Use your EM algorithm to analyse the decathlon data set and to look for \(K=3\) clusters. Interpret the results. The data used here refer to athletes’ performance during decathlon events.

The source code should be handed in along with the results. However, all the requested figures should be printed on paper or included in a pdf file which is turned in, with clear titles indicating what the figures represent. Therefore, again, we recommend writing your code and report in a Markdown file (in R or Python). The discussions may of course be handwritten.