{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Supervised learning\n",
"\n",
"Supervised ML is the task of learning a function that maps an input to an output based on example input-output pairs. \n",
"Formally, we are given with a set $\\mathcal{D}$ consists of (data,labels) pairs: \n",
"$$\\mathcal{D} = \\{ (x_i , y_i) \\}_{i=1}^{m}$$ \n",
"\n",
"where $x_i \\in \\mathcal{X}$ are the datapoints and $y_i \\in \\mathcal{Y}$ are the labels. For simplicity, we assume here that the \"labels space\" $\\mathcal{Y}$ is a finite set $y_i$ that are discrete, univariate variables, i.e., classification settings.\n",
"\n",
"The goal in supervised learning is to **fit** a function $f : \\mathcal{X} \\to \\mathcal{Y}$ such that $f(x_i) =y_i$ for all $i=1,\\dots,m$\\ .
\n",
"Traditionaly, the data points $x_i$ are elements of some *vector space*\\, meaning that, each point can be expressed using a $p$-tuple (vector) of numbers\n",
"$$x_i = (x_{i1}, x_{i2}, \\dots, x_{ip})$$\n",
"\n",
"hence we have $f : \\mathbb{R}^p \\to \\mathcal{Y}$. \n",
"However, when working with multi-way data, and time-series data in particular, the situation is different. \n",
"\n",
"## The case of matrix valued inputs\n",
"\n",
"To adhere with the empirical results and demonstrations shown in Mor et al., we restrict the discussion to the case where data points are gathered across multiple timepoints. That is, each sample $x_i$ is in fact, an $n$ by $p$ matrix, where each of the $p$ columns represents a feature and the rows correspond to different timepoints in which the samples were gathered.\n",
"\n",
"\n",
"The matrix input setting is not uncommon, for example, consider the task of gray-scale image classification, where each image is a set of $n \\times p$ pixels and each pixel holds a value from 0 to 1 determining its brightness.
\n",
"In the case of longitudinal sampling data, we can think of the rows of each sample $x_i$ as discrete samples of a \"curve\" in a $p$ dimensional space.
\n",
"In formal words, let $\\gamma_i : \\mathbb{R} \\to \\mathbb{R}^p$ a smooth function, and $\\{t_j\\}_{j=1}^n \\subset \\mathbb{R}$ be a set of timepoints, such that $\\gamma_i (t_j) \\in \\mathbb{R}^p$ is the $j^{th}$ row of the matrix $x_i$\\ . "
]
},
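{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a toy illustration of this construction (the curve and timepoints below are made up for the example and are not taken from the data discussed here), take $p=2$, $n=3$, the curve $\\gamma(t) = (t, t^2)$ and the timepoints $t_1 = 0$, $t_2 = 1$, $t_3 = 2$. Stacking the sampled points as rows gives\n",
"$$x = \\begin{bmatrix} \\gamma(t_1) \\\\ \\gamma(t_2) \\\\ \\gamma(t_3) \\end{bmatrix} = \\begin{bmatrix} 0 & 0 \\\\ 1 & 1 \\\\ 2 & 4 \\end{bmatrix} \\in \\mathbb{R}^{3 \\times 2}$$\n",
"\n",
"so the first column records the first feature over time and the second column records the second feature."
]
},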
{
"cell_type": "raw",
"metadata": {
"raw_mimetype": "text/restructuredtext"
},
"source": [
".. figure:: ./data/single_curve_demo.png\n",
" :alt: curve\n",
" :class: with-shadow\n",
" :width: 50%\n",
" :align: center\n",
"\n",
" An illustration of the :math:`p` dimensional curve :math:`\\gamma_i` (here :math:`p=3`\\ ), were the blue line is the curve itself, points illustrate discrete sampling process, each point correspond to a timepoint :math:`t_j` for :math:`j=1,\\dots ,n`\\ ."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In our context, we are given with collection of such labeled curves, all of which are sampled in corresponding timepoints, and our goal is to fit a function that takes a sampled curve $x =[\\gamma (t_1), \\gamma (t_2) , \\dots , \\gamma (t_n)]$ and outputs the label associated with the curve."
]
},
{
"cell_type": "raw",
"metadata": {
"raw_mimetype": "text/restructuredtext"
},
"source": [
".. figure:: ./data/multiple_curves_demo.png\n",
" :alt: curves\n",
" :class: with-shadow\n",
" :width: 50%\n",
" :align: center\n",
"\n",
" Illustration of the supervised curve classification task, curves are colored according to their class label. "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"\n",
"## ML workflows\n",
"\n",
"The function $f: \\mathbb{R}^{n \\times p} \\to \\mathcal{Y}$ usually consists of a composition of \"simpler\" functions \n",
"$$f(x) = f_1 \\circ f_2 \\circ ... \\circ f_d (x)$$ \n",
"\n",
"i.e., a *pipeline* of functions, where each function $f_j$ \"captures\" a certain property of the data that is key for making an accurate prediction, and such that the mapping $f_j$ is vector valued, starting some $j \\leq d$\\ . Meaning that $f_k (x) \\in \\mathbb{R}^{\\ell_k}$ for all $k\\geq j$\\ . \n",
"Examples for common functions of use in supevised ML are \\\n"
]
},
{
"cell_type": "raw",
"metadata": {
"raw_mimetype": "text/restructuredtext"
},
"source": [
"* \\ In neural networks, :math:`f_j` are of the form :math:`f_j (x) = \\alpha_j ({\\bf W}_j x + b_j)` where \n",
"\n",
" * \\ :math:`{\\bf W}_j` is a matrix representing a :math:`\\mathbb{R}^{j_1} \\to \\mathbb{R}^{j_2}` linear mapping \n",
" * \\ :math:`b_j \\in \\mathbb{R}^{j_2}` is a bias term \n",
" * \\ :math:`\\alpha_j` is a non-linear activation function such as, :math:`\\tanh, \\operatorname{SoftMax} , \\operatorname{sigmoid}` etc... \n",
" \n",
"* \\ Normalization steps, such as standard-scaling \n",
"* \\ Dimensionality reduction steps, like PCA, CCA, UMAM, etc..."
]
},
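{
"cell_type": "markdown",
"metadata": {},
"source": [
"To make the idea of a pipeline concrete, here is a hypothetical three-step example; the steps and dimensions are chosen only for illustration and are not a recommendation. With $n=4$ timepoints and $p=1000$ features, one could first flatten each sample into a vector, then reduce its dimension with PCA, and finally apply a classifier:\n",
"$$\\mathbb{R}^{4 \\times 1000} \\xrightarrow{\\ \\text{flatten}\\ } \\mathbb{R}^{4000} \\xrightarrow{\\ \\text{PCA}\\ } \\mathbb{R}^{10} \\xrightarrow{\\ \\text{classifier}\\ } \\mathcal{Y}$$\n",
"\n",
"Here every intermediate output after the first step is a vector, of length $4 \\cdot 1000 = 4000$ and then of length $10$."
]
},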
{
"cell_type": "markdown",
"metadata": {},
"source": [
"An essential requirement we impose on the functions $f_j$ is that after the training process is done, and the parameters are learned, each $f_j$ must be applicable to samples outside of the training set $\\mathcal{D}$\\ ."
]
},
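{
"cell_type": "markdown",
"metadata": {},
"source": [
"Standard scaling is a simple instance of this requirement. As a sketch with made-up numbers, suppose a single feature takes the values $1$ and $3$ on the training set. Training stores the mean $\\mu = 2$ and the (population) standard deviation $\\sigma = 1$, and the fitted step is\n",
"$$f_j (x) = \\frac{x - \\mu}{\\sigma}$$\n",
"\n",
"For a new sample $\\bar{x} = 5$ we get $f_j (\\bar{x}) = (5 - 2)/1 = 3$, computed from the stored parameters alone, without refitting on $\\bar{x}$."
]
},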
{
"cell_type": "raw",
"metadata": {
"raw_mimetype": "text/restructuredtext"
},
"source": [
"\n",
".. admonition:: Example \\\n",
"\n",
" ✔ Consider PCA as a dimensionality reduction step, then :math:`f_j (x) = x {\\bf W}_j` where entries of the matrix :math:`{\\bf W}_j` are determined\n",
" by the dataset :math:`{\\bf X} = [x_1 ; x_2 ; \\dots ; x_m]`\\ . Given a new data point :math:`\\bar{x}`\\ , the application of :math:`f_j` \n",
" to :math:`\\bar{x}` is done by simple matrix multiplication.\n",
" \\n\n",
" \n",
" ✘ Alternatively, consider the mapping :math:`f_j` obtained by fitting a PCoA (or `multidimensional scaling

| SubjectID | Week | k__Bacteria-p__Firmicutes-c__Clostridia-o__Clostridiales-f__Ruminococcaceae-g__Oscillospira-s__-OTU_310886 | k__Bacteria-p__Firmicutes-c__Clostridia-o__Clostridiales-f__Peptostreptococcaceae-g__-s__-OTU_531374 | k__Bacteria-p__Firmicutes-c__Clostridia-o__Clostridiales-f__-g__-s__-OTU_366584 | k__Bacteria-p__Proteobacteria-c__Betaproteobacteria-o__Burkholderiales-f__Alcaligenaceae-g__Sutterella-s__-OTU_173726 | k__Bacteria-p__Firmicutes-c__Clostridia-o__Clostridiales-f__Ruminococcaceae-g__-s__-OTU_366794 | k__Bacteria-p__Firmicutes-c__Clostridia-o__Clostridiales-f__Lachnospiraceae-g__-s__-OTU_975306 | k__Bacteria-p__Bacteroidetes-c__Bacteroidia-o__Bacteroidales-f__Porphyromonadaceae-g__Parabacteroides-s__distasonis-OTU_585914 | k__Bacteria-p__Firmicutes-c__Clostridia-o__Clostridiales-f__-g__-s__-OTU_352304 | k__Bacteria-p__Firmicutes-c__Clostridia-o__Clostridiales-f__Lachnospiraceae-g__Blautia-s__-OTU_193061 | k__Bacteria-p__Bacteroidetes-c__Bacteroidia-o__Bacteroidales-f__Prevotellaceae-g__Prevotella-s__-OTU_4302472 | ... | k__Bacteria-p__Firmicutes-c__Clostridia-o__Clostridiales-f__Lachnospiraceae-g__Roseburia-s__-OTU_313593 | k__Bacteria-p__Synergistetes-c__Synergistia-o__Synergistales-f__-g__-s__-OTU_505549 | k__Bacteria-p__Actinobacteria-c__Coriobacteriia-o__Coriobacteriales-f__Coriobacteriaceae-g__-s__-OTU_4403259 | k__Bacteria-p__Firmicutes-c__Clostridia-o__Clostridiales-f__Lachnospiraceae-g__Epulopiscium-s__-OTU_4371562 | k__Bacteria-p__Bacteroidetes-c__Bacteroidia-o__Bacteroidales-f__S24.7-g__-s__-OTU_1115121 | k__Bacteria-p__Firmicutes-c__Clostridia-o__Clostridiales-f__Ruminococcaceae-g__-s__-OTU_300870 | k__Bacteria-p__Bacteroidetes-c__Bacteroidia-o__Bacteroidales-f__Porphyromonadaceae-g__Parabacteroides-s__distasonis-OTU_186233 | k__Bacteria-p__Firmicutes-c__Clostridia-o__Clostridiales-f__Lachnospiraceae-g__.Ruminococcus-.s__-OTU_300877 | k__Bacteria-p__Firmicutes-c__Clostridia-o__Clostridiales-f__.Tissierellaceae-.g__-s__-OTU_869796 | k__Bacteria-p__Firmicutes-c__Clostridia-o__Clostridiales-f__Ruminococcaceae-g__Ruminococcus-s__-OTU_196061 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| P_10343 | 0 | 0.000000 | 0.000000 | 0.000000 | 0.0 | 0.0 | 0.000869 | 0.022900 | 0.000033 | 0.000000 | 0.0 | ... | 0.000000 | 0.000000 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.0 |
|  | 4 | 0.000183 | 0.000000 | 0.008966 | 0.0 | 0.0 | 0.020421 | 0.000000 | 0.000823 | 0.000000 | 0.0 | ... | 0.000073 | 0.000000 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.0 |
|  | 12 | 0.000000 | 0.000000 | 0.000000 | 0.0 | 0.0 | 0.006230 | 0.000049 | 0.000997 | 0.000000 | 0.0 | ... | 0.000049 | 0.000000 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.0 |
|  | 52 | 0.000000 | 0.000097 | 0.000000 | 0.0 | 0.0 | 0.002211 | 0.000000 | 0.000291 | 0.000000 | 0.0 | ... | 0.000000 | 0.000000 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.000048 | 0.0 | 0.0 |
| P_10897 | 0 | 0.000423 | 0.000000 | 0.000059 | 0.0 | 0.0 | 0.077260 | 0.000863 | 0.017448 | 0.000051 | 0.0 | ... | 0.000008 | 0.000000 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.0 |
|  | 4 | 0.000023 | 0.000000 | 0.000000 | 0.0 | 0.0 | 0.006095 | 0.000411 | 0.000000 | 0.000000 | 0.0 | ... | 0.000000 | 0.000000 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.0 |
|  | 12 | 0.000080 | 0.000013 | 0.010809 | 0.0 | 0.0 | 0.005411 | 0.000372 | 0.000359 | 0.000027 | 0.0 | ... | 0.000027 | 0.000000 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.0 |
|  | 52 | 0.000000 | 0.000080 | 0.000664 | 0.0 | 0.0 | 0.013934 | 0.000261 | 0.000905 | 0.000000 | 0.0 | ... | 0.000040 | 0.000000 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.0 |
| P_1108 | 0 | 0.000000 | 0.000000 | 0.000000 | 0.0 | 0.0 | 0.002086 | 0.000000 | 0.000066 | 0.000016 | 0.0 | ... | 0.000049 | 0.000000 | 0.0 | 0.0 | 0.000049 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.0 |
|  | 4 | 0.000000 | 0.000211 | 0.000000 | 0.0 | 0.0 | 0.011007 | 0.000000 | 0.000409 | 0.000000 | 0.0 | ... | 0.000384 | 0.000025 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.0 |
|  | 12 | 0.000007 | 0.000000 | 0.000028 | 0.0 | 0.0 | 0.038760 | 0.000528 | 0.003105 | 0.000000 | 0.0 | ... | 0.000428 | 0.000000 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.0 |
|  | 52 | 0.000000 | 0.000000 | 0.000000 | 0.0 | 0.0 | 0.115325 | 0.002798 | 0.018651 | 0.000000 | 0.0 | ... | 0.000311 | 0.000000 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.0 |

12 rows × 1015 columns

|  | SubjectID | label |
|---|---|---|
| 0 | P_10343 | 0 |
| 1 | P_10343 | 0 |
| 2 | P_10343 | 0 |
| 3 | P_10343 | 0 |