These exercises are due on or before December 16th, 2022 and should be submitted via the drop box available online. They can be done in groups of two students. The write-up can be in English or in French. Please submit your answers as a single zip file named MVA DM1 <your name>.zip if you worked alone, or MVA DM1 <name1> <name2>.zip with both of your names if you worked as a group of two. Indicate your name(s) in the documents as well. Note that the zip file should weigh no more than 16 MB. Only zip files named in this way will be evaluated. Both your solutions and your code should be included in the zip file.
The files trainA, trainB and trainC contain samples of data \((x_n,y_n)\) where \(x_n \in \mathbb{R}^2\) and \(y_n \in \{0,1\}\) (each line of each file contains the 2 components of \(x_n\) followed by \(y_n\)). The goal of this exercise is to implement linear classification methods and to test them on the three data sets. The code should be written in R or Python. The source code should be handed in along with the results. However, all the requested figures should be printed on paper or included in a PDF file that is turned in, with clear titles that indicate what the figures represent. We therefore recommend writing your code and report in a Markdown file (in R or Python). The discussions may of course be handwritten.
Given the class variable, the data are assumed to be Gaussian with different means for different classes but with the same covariance matrix. \[y \sim \mathrm{Bernoulli}(\pi), \qquad x \mid y = i \sim \mathrm{Normal}(\mu_i, \Sigma).\]
Derive the form of the maximum likelihood estimator for this model.
What is the form of the conditional distribution \(p(y = 1 \mid x)\)? Compare with the form of logistic regression.
Implement the MLE for this model and apply it to the data. Represent graphically the data as a point cloud in \(\mathbb{R}^2\) and the line defined by the equation \[p(y = 1|x) = 0.5.\]
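As a starting point, the estimation and the boundary computation could be organised as in the sketch below. It assumes NumPy is available and that the data are already loaded into an array `X` of shape \((n, 2)\) and a label vector `y`; the function names `fit_lda` and `lda_boundary` are illustrative, not imposed. The closed forms used (class proportion, class means, pooled covariance divided by \(n\)) are the standard ones for this model and should be checked against your own derivation.

```python
import numpy as np

def fit_lda(X, y):
    """Closed-form estimates for the shared-covariance Gaussian model."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y).astype(int)
    n = len(y)
    pi = y.mean()                        # estimate of P(y = 1)
    mu0 = X[y == 0].mean(axis=0)         # mean of class 0
    mu1 = X[y == 1].mean(axis=0)         # mean of class 1
    # Residual of each point around the mean of its own class.
    R = X - np.where(y[:, None] == 1, mu1, mu0)
    Sigma = R.T @ R / n                  # pooled covariance (divided by n)
    return pi, mu0, mu1, Sigma

def lda_boundary(pi, mu0, mu1, Sigma):
    """Return (w, b) such that p(y = 1 | x) = 0.5 exactly when w @ x + b = 0."""
    w = np.linalg.solve(Sigma, mu1 - mu0)
    b = (-0.5 * (mu1 @ np.linalg.solve(Sigma, mu1)
                 - mu0 @ np.linalg.solve(Sigma, mu0))
         + np.log(pi / (1.0 - pi)))
    return w, b
```

The line \(w^{\intercal}x + b = 0\) can then be drawn over the point cloud with any plotting library, for instance by solving for the second coordinate as a function of the first when the second component of \(w\) is nonzero.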
Implement logistic regression for an affine function \(f(x) = w^{\intercal}x + b\) (do not forget the constant term).
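One possible way to fit this model is Newton's method (equivalently, iteratively reweighted least squares) on the log-likelihood, sketched below under the assumption that NumPy may be used. The names `sigmoid` and `fit_logistic` and the optional `ridge` argument are illustrative choices, not part of the assignment; `ridge` defaults to 0, which corresponds to plain maximum likelihood.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, n_iter=50, tol=1e-8, ridge=0.0):
    """Newton iterations for logistic regression with an intercept.

    `ridge` is an optional L2 penalty (an assumption of this sketch) that
    can help when the classes are linearly separable; 0 means plain MLE.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    # Append a column of ones so the constant term b is learnt with w.
    A = np.hstack([X, np.ones((len(X), 1))])
    theta = np.zeros(A.shape[1])
    for _ in range(n_iter):
        p = sigmoid(A @ theta)
        grad = A.T @ (y - p) - ridge * theta
        # Negative Hessian of the (penalised) log-likelihood.
        H = (A * (p * (1.0 - p))[:, None]).T @ A + ridge * np.eye(A.shape[1])
        step = np.linalg.solve(H, grad)
        theta = theta + step
        if np.max(np.abs(step)) < tol:
            break
    return theta[:-1], theta[-1]        # (w, b)
```

Note that when the training set is linearly separable the unpenalised maximum likelihood estimate does not exist (the weights diverge), so the iterations may become numerically unstable; this is worth discussing in the report if it occurs on one of the data sets.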
Consider the class \(y\) as a real-valued variable taking the values \(0\) and \(1\) only. Implement linear regression (for an affine function \(f(x) = w^{\intercal}x+b\)) by solving the normal equations.
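A minimal sketch of the normal-equations solution is given below, assuming NumPy is allowed and that the augmented design matrix has full column rank; `fit_linear` is an illustrative name. The constant term is handled by appending a column of ones to the data.

```python
import numpy as np

def fit_linear(X, y):
    """Least-squares fit of f(x) = w @ x + b via the normal equations."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    A = np.hstack([X, np.ones((len(X), 1))])   # last column carries b
    # Solve (A^T A) theta = A^T y; assumes A^T A is invertible.
    theta = np.linalg.solve(A.T @ A, A.T @ y)
    return theta[:-1], theta[-1]               # (w, b)
```

Since the labels are coded as \(0\) and \(1\), a natural classification rule for this model is to predict class 1 when \(f(x) > 0.5\); this threshold is a convention of the sketch and should be stated in the report.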
Data in the files testA, testB and testC are respectively drawn from the same distribution as the data in the files trainA, trainB and trainC. Test the different models learnt from the corresponding training data on these test data.
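One simple way to compare the models is their misclassification rate on the training and test sets. The helper below is a sketch under the assumption that each model is summarised by a pair \((w, b)\) and a decision threshold on \(w^{\intercal}x + b\); the name `misclassification_rate` and its arguments are illustrative.

```python
import numpy as np

def misclassification_rate(w, b, X, y, threshold=0.0):
    """Fraction of points whose predicted class differs from the label.

    Class 1 is predicted when w @ x + b > threshold. A threshold of 0 suits
    models whose boundary is p(y = 1 | x) = 0.5 written as w @ x + b = 0;
    0.5 suits linear regression on 0/1 labels.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y).astype(int)
    scores = X @ np.asarray(w, dtype=float) + b
    pred = (scores > threshold).astype(int)
    return float(np.mean(pred != y))
```

If the data files are whitespace-separated, which is an assumption about their format, something like `np.loadtxt` could be used to read them before splitting the first two columns from the last.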
We consider the following Gaussian mixture model (GMM) for multivariate data: \[ X_0 \sim \sum_{k=1}^{K}\pi_k \mathcal{N}(\mu_k, \Sigma_k),\] with \(\pi_k \in [0, 1]\), \(\sum_{k=1}^{K}\pi_k=1\), and \(X_0 \in \mathbb{R}^p\). We are provided with a realisation \((x_1, \dots, x_n)\) of a random sample of size \(n\).
Write the EM algorithm for this model, giving the exact update formulas for the E and M steps. Justify (prove) the use of these updates.
Write a complete set of functions in R or Python implementing this EM algorithm. The main function should take the data and the number of components \(K\) as inputs, and return the parameter estimates as outputs. This function should also output the maximum a posteriori cluster memberships. All the functions should be standalone. You are not allowed to rely on any packages.
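The overall structure of such an implementation could look like the sketch below. It makes several assumptions that are not part of the statement: NumPy is used for basic linear algebra (reading the ban on packages as excluding ready-made mixture-model libraries, which should be confirmed); the means are initialised at randomly chosen data points; and a small constant `reg` is added to the diagonal of each covariance for numerical stability, which departs slightly from the exact M-step. The names `log_gaussian` and `em_gmm` are illustrative.

```python
import numpy as np

def log_gaussian(X, mu, Sigma):
    """Log-density of N(mu, Sigma) at each row of X, via a Cholesky factor."""
    p = X.shape[1]
    L = np.linalg.cholesky(Sigma)
    z = np.linalg.solve(L, (X - mu).T)            # shape (p, n)
    maha = np.sum(z ** 2, axis=0)                 # squared Mahalanobis distances
    log_det = 2.0 * np.sum(np.log(np.diag(L)))
    return -0.5 * (p * np.log(2.0 * np.pi) + log_det + maha)

def em_gmm(X, K, n_iter=200, tol=1e-8, seed=0, reg=1e-6):
    """EM for a K-component Gaussian mixture with full covariances.

    `seed` and `reg` are assumptions of this sketch (random initialisation
    and a small diagonal term added to each covariance).
    """
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(n, size=K, replace=False)].copy()
    S0 = np.atleast_2d(np.cov(X, rowvar=False)) + reg * np.eye(p)
    Sigma = np.array([S0.copy() for _ in range(K)])
    pi = np.full(K, 1.0 / K)
    history = []
    for _ in range(n_iter):
        # E step: posterior membership probabilities, computed in log space.
        log_w = np.column_stack([np.log(pi[k]) + log_gaussian(X, mu[k], Sigma[k])
                                 for k in range(K)])
        m = log_w.max(axis=1, keepdims=True)
        log_norm = m + np.log(np.exp(log_w - m).sum(axis=1, keepdims=True))
        tau = np.exp(log_w - log_norm)            # shape (n, K), rows sum to 1
        history.append(float(log_norm.sum()))     # observed-data log-likelihood
        # M step: weighted proportions, means and covariances.
        Nk = tau.sum(axis=0)
        pi = Nk / n
        mu = (tau.T @ X) / Nk[:, None]
        for k in range(K):
            R = X - mu[k]
            Sigma[k] = (tau[:, k, None] * R).T @ R / Nk[k] + reg * np.eye(p)
        if len(history) > 1 and history[-1] - history[-2] < tol:
            break
    labels = tau.argmax(axis=1)                   # MAP memberships from the last E step
    return pi, mu, Sigma, labels, history
```

Because EM only finds a local maximum, the result depends on the initialisation; running the function with several values of `seed` and keeping the run with the highest final log-likelihood is a common precaution. The sketch does not handle a component that loses all of its points.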
Use your EM algorithm to analyse the decathlon data set and to look for \(K=3\) clusters. Interpret the results. The data used here refer to athletes’ performance during decathlon events.
The source code should be handed in along with the results. However, all the requested figures should be printed on paper or included in a PDF file that is turned in, with clear titles that indicate what the figures represent. Again, we therefore recommend writing your code and report in a Markdown file (in R or Python). The discussions may of course be handwritten.