The goal of the AML course is to help students develop good methodologies for data science problems, in particular those that involve machine learning. This is a reverse (flipped) class with no frontal lectures: instead, students are directly immersed in laboratory sessions, where they tackle a number of “challenges”, each of which defines a real-world problem involving data and target predictions.
The concept of a machine learning challenge or competition is nowadays widespread. Typically, a competition involves a team whose members cooperate toward a "submission", which takes the form of a set of predictions for a test set whose ground truth is not disclosed to participants. An automatic system then computes a ranking based on a given performance metric, which is used to compile a leaderboard and to award "honor" badges and even monetary prizes.
In the AML course we will not rank teams based on performance score, and instead expect an academic kind of submission. There will be no automatic scoring system: groups are expected to define their performance metric (or adopt the one suggested in a challenge) and work out how to test their methods.
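As an illustration of what defining a metric and testing a method can look like, here is a minimal sketch in plain Python. The choice of macro-F1, the helper names `macro_f1` and `holdout_split`, and the toy labels are assumptions made for this example, not requirements of the course; each challenge may call for a different metric and validation scheme.

```python
import random
from collections import Counter


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def holdout_split(n_samples, test_fraction=0.2, seed=0):
    """Shuffle indices once and reserve a fraction as a local test set."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_test = max(1, int(round(n_samples * test_fraction)))
    return indices[n_test:], indices[:n_test]


# Toy imbalanced labels (hypothetical): 8 samples of class 0, 2 of class 1.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]

# A trivial baseline that always predicts the majority class.
majority = Counter(y_true).most_common(1)[0][0]
y_pred = [majority] * len(y_true)

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(f"accuracy = {accuracy:.2f}, macro-F1 = {macro_f1(y_true, y_pred):.2f}")

# Indices for a reproducible local train/test split of the same 10 samples.
train_idx, test_idx = holdout_split(len(y_true))
print(len(train_idx), "training samples,", len(test_idx), "held-out samples")
```

On this toy example the majority-class baseline reaches an accuracy of 0.80 but a macro-F1 of about 0.44, which is one reason the choice of metric deserves a justification in the report.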
Teaching and Learning Methods: Laboratory sessions (groups of 2 students).
Course Policies: Attendance at the lab sessions is mandatory.
Book: JAMES G., WITTEN D., HASTIE T., TIBSHIRANI R. An Introduction to Statistical Learning. Springer, 2013, 440p.
Book: BISHOP C. Pattern Recognition and Machine Learning. Springer-Verlag, 2006, 768p.
Book: MURPHY K. P. Machine Learning: A Probabilistic Perspective. MIT Press, 2012.
Book: GOODFELLOW I., BENGIO Y., COURVILLE A. Deep Learning. MIT Press, 2016, 800p.
Book: BISHOP C. M., BISHOP H. Deep Learning: Foundations and Concepts. Springer, 2024.
This course blends methodological and computer science skills. Students are expected to be comfortable with Python programming, and with common libraries used in the context of data science and machine learning problems. Moreover, students are assumed to be comfortable with machine learning methodologies.
The skills above are acquired mostly in the MALIS and Deep Learning courses. On the computer science side, students in these courses gain familiarity with Python, Jupyter Notebooks, and machine learning libraries such as scikit-learn, TensorFlow and PyTorch. On the methodological side, they are exposed to most of the important machine learning concepts, methods and theory. Optionally, the ASI course adds a probabilistic perspective on how to address problems.
If you are enrolled in the AML course but have not taken MALIS, you are likely to find the course very difficult. If you have not taken the Deep Learning course, your modeling approaches could be limited.
Another underlying prerequisite for participating in AML is familiarity with a cloud-hosted computing platform, such as Kaggle, Google Colab or Hugging Face. You are free to use your own resources, such as a personal laptop: this is great for development and documentation, but unless you have GPUs for training, your computational power might be too limited for heavy models.
Description
Typically, we will have 3 challenges, and for each of them you will be given at least 2 weeks (sometimes more) to complete your work. During this time, you will have to write code, run experiments, and prepare a technical report, which is the only item that will be evaluated (see evaluation below).
In general, we have the following kinds of challenges:
- Computer Vision (CV) Challenge: a typical CV challenge deals with image data, and the frequent task of classification. This challenge is meant to deepen your understanding and improve your practical skills in using deep learning models such as convolutional neural networks, and vision Transformers. In general, these are “simple challenges”, but some “surprises” might come from the data, which is not guaranteed to be balanced, clean and readily usable.
- Density Estimation Challenge: a typical density estimation challenge deals with anomaly detection problems, which can be applied to any kind of data, including audio, for example. This challenge is meant to deepen your understanding and improve your practical skills in using deep learning models such as (variational) autoencoders, or other density estimation models. Typically, this kind of challenge is more difficult, and might require you to find and study relevant scientific literature, to go beyond the many simple baselines that can be found on the web.
- Sequential Data Challenge: a typical sequential data challenge deals with language modeling, which can be applied to problems such as sentiment analysis, for example. This challenge is meant to deepen your understanding and improve your practical skills in using deep learning models such as recurrent neural networks, Transformer networks, and state-space models in general. Typically, this kind of challenge is very difficult: not only does it require you to get up to speed with a fast-moving scientific literature (if you want to produce an original solution to the challenge), but it also requires you to face computational issues. Working with modern deep learning models requires GPUs and memory, which you will need to “find”. In particular, students will have to deal with all the problems related to “free-tier” online GPU services and their limitations. This will prove useful in real life, as it is still very common for companies (even large ones!) to be underprovisioned in terms of computational facilities.
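To make the density estimation challenge described above more concrete, the following minimal sketch scores anomalies with a fitted density. It assumes a single hypothetical 1-D feature and a Gaussian model, far simpler than the (variational) autoencoders mentioned in the list; the function names, the data, and the thresholding rule are illustrative assumptions only.

```python
import math
import statistics


def fit_gaussian(values):
    """Estimate mean and standard deviation from 'normal' training data."""
    return statistics.fmean(values), statistics.stdev(values)


def negative_log_likelihood(x, mean, std):
    """Anomaly score: higher means less likely under the fitted density."""
    return 0.5 * math.log(2 * math.pi * std**2) + (x - mean) ** 2 / (2 * std**2)


# Hypothetical 1-D feature measured on normal samples only.
train = [9.8, 10.1, 10.0, 9.9, 10.2, 10.0, 9.7, 10.3]
mean, std = fit_gaussian(train)

# Assumed thresholding rule: the largest score seen on the training data.
threshold = max(negative_log_likelihood(x, mean, std) for x in train)

for x in [10.05, 12.5]:
    score = negative_log_likelihood(x, mean, std)
    print(x, "anomaly" if score > threshold else "normal")
```

The same pattern (fit a density on normal data, score new samples, choose a threshold) carries over to richer models, where the score might be a reconstruction error or an approximate likelihood; how to pick the threshold is itself a design decision worth discussing in the report.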
Learning outcomes:
- Understand a data science problem statement, and identify the theoretical tools and algorithmic implementation to solve the problem;
- Design and implement “end-to-end” software methods to analyze and prepare data, use data to learn a statistical model, and use the model to make inferences;
- Validate and assess the quality of an end-to-end software method to address a data science problem;
- Prepare a technical report to summarize your work and present the salient aspects of your findings.
Nb hours: 21.00
Evaluation:
All activities in the Algorithmic Machine Learning course are graded. This is a form of continuous assessment that tracks students' performance throughout the course. For each challenge, you will have to submit, through the Moodle system, a report summarizing your work.
A report is typically 5-7 pages long and consists of:
- Sections for each item you want to illustrate about your work. You need an introduction (to clearly frame the problem you are solving, and to give context), data analysis and preparation, modeling approach, results, etc. You may find inspiration in the structure of a typical research article produced using LaTeX.
- Each section should contain text; there should be no code-only sections. You can of course include code snippets if they are useful to describe one of your achievements, but remember that code takes a lot of space on a page!
- The results section should contain plots that illustrate your work. Do not produce dozens of plots: be concise and show only what is relevant.
In other words, a report resembles a short research article, which is meant to be read by a technical audience. This simulates a realistic scenario, in which a data scientist presents findings to the team and to the management of their company.