Syllabus, Logistics, and Introduction
Supervised Learning
Unsupervised Learning
Statistical Learning Overview
This lecture is based on, and closely follows, material from An Introduction to Statistical Learning.
It is your turn! - 10 minutes
Introduce yourself to the colleague on your left/right and ask:
Collect their answer and submit your first Participation Assignment!
Word | Spam | Email |
---|---|---|
george | 0.00 | 1.27 |
you | 2.26 | 1.27 |
hp | 0.02 | 0.90 |
free | 0.52 | 0.07 |
! | 0.51 | 0.11 |
edu | 0.01 | 0.29 |
remove | 0.28 | 0.01 |
Average percentage of words or characters in an email message equal to the indicated word or character. We have chosen the words and characters showing the largest difference between spam and email.
Identify the numbers in a handwritten zip code.
Video: Winning the Netflix Prize
Shown are Sales vs TV, Radio, and Newspaper, with a blue linear-regression line fit separately to each.
Can we predict Sales using these three?
Perhaps we can do better using a model:
\[ \text{Sales} \approx f(\text{TV}, \text{Radio}, \text{Newspaper}) \]
Sales is a response or target that we wish to predict. We generically refer to the response as \(Y\).
TV is a feature, or input, or predictor; we name it \(X_1\).
Likewise, name Radio as \(X_2\), and so on.
Collectively, the inputs are referred to as the input vector:
\[ X = \begin{pmatrix} X_1 \\ X_2 \\ X_3 \end{pmatrix} \]
We write our model as:
\[ Y = f(X) + \epsilon \]
where \(\epsilon\) captures measurement errors and other discrepancies.
With a good \(f\), we can make predictions of \(Y\) at new points \(X = x\).
Understand which components of \(X = (X_1, X_2, \ldots, X_p)\) are important in explaining \(Y\), and which are irrelevant.
Depending on the complexity of \(f\), understand how each component \(X_j\) affects \(Y\).
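As a concrete sketch of the prediction task, the snippet below fits a multiple linear regression of Sales on TV, Radio, and Newspaper. The file name Advertising.csv, its column names, and the new budget values are assumptions about how the data are stored locally; treat this as an illustration, not the canonical analysis.

```python
# Minimal sketch (assumes a hypothetical local Advertising.csv with
# columns TV, Radio, Newspaper, Sales).
import pandas as pd
from sklearn.linear_model import LinearRegression

ads = pd.read_csv("Advertising.csv")           # hypothetical local copy of the data
X = ads[["TV", "Radio", "Newspaper"]]          # predictors X1, X2, X3
y = ads["Sales"]                               # response Y

fhat = LinearRegression().fit(X, y)            # estimate f with a linear model
new_x = pd.DataFrame({"TV": [100.0], "Radio": [20.0], "Newspaper": [10.0]})
print(fhat.predict(new_x))                     # prediction of Y at a new point X = x
```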
In particular, what is a good value for \(f(X)\) at a selected value of \(X\), say \(X = 4\)?
There can be many \(Y\) values at \(X=4\). A good value is:
\[ f(4) = E(Y|X=4) \]
where \(E(Y|X=4)\) means the expected value (average) of \(Y\) given \(X=4\).
This ideal \(f(x) = E(Y|X=x)\) is called the regression function.
\[ f(\mathbf{x}) = f(x_1, x_2, x_3) = \mathbb{E}[\,Y \mid X_1 = x_1,\, X_2 = x_2,\, X_3 = x_3\,]. \]
\[ f(x) = \mathbb{E}[Y \mid X = x] \quad\text{is the function that minimizes}\quad \mathbb{E}[(Y - g(X))^2 \mid X = x] \text{ over all } g \text{ and for all points } X = x. \]
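A one-line justification of this optimality: writing \(\mu = \mathbb{E}[Y \mid X = x]\) for the conditional mean and \(c = g(x)\) for any competing value,
\[
\mathbb{E}\bigl[(Y - c)^2 \mid X = x\bigr]
= \mathrm{Var}(Y \mid X = x) + \bigl(\mu - c\bigr)^2,
\]
which is minimized exactly when \(c = \mu\), i.e. when \(g(x) = \mathbb{E}[Y \mid X = x]\).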
\(\varepsilon = Y - f(x)\) is the irreducible error.
\[ \mathbb{E}\bigl[(Y - \hat{f}(X))^2 \mid X = x\bigr] = \underbrace{[\,f(x) - \hat{f}(x)\,]^2}_{\text{Reducible}} \;+\; \underbrace{\mathrm{Var}(\varepsilon)}_{\text{Irreducible}}. \]
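This decomposition follows by substituting \(Y = f(X) + \varepsilon\) and using \(\mathbb{E}[\varepsilon] = 0\) with \(\varepsilon\) independent of \(X\), treating \(\hat{f}\) as fixed:
\[
\mathbb{E}\bigl[(f(x) + \varepsilon - \hat{f}(x))^2\bigr]
= [f(x) - \hat{f}(x)]^2 + 2\,[f(x) - \hat{f}(x)]\,\mathbb{E}[\varepsilon] + \mathbb{E}[\varepsilon^2]
= [f(x) - \hat{f}(x)]^2 + \mathrm{Var}(\varepsilon).
\]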
Definition:
Learning a mapping from inputs \(X\) to an output \(Y\) using labeled data.
The algorithm is supervised because the correct answers are known during training.
Goal:
Predict the response accurately for new observations and understand how the predictors relate to it.
Definition:
Learning the structure or patterns in data without labeled outputs.
The algorithm is unsupervised because no outcome variable guides the learning.
Goal:
Discover structure in the data, such as clusters of similar observations, without a response variable to guide the search.
Typically we have few (if any) data points with exactly \(X = x\), so we cannot compute \(E(Y|X=x)\) exactly.
So, we relax the definition:
\[ \hat{f}(x) = \text{Ave}(Y|X \in \mathcal{N}(x)) \]
where \(\mathcal{N}(x)\) is a neighborhood of \(x\).
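A minimal one-dimensional sketch of this relaxed estimator, using a fixed-width window as \(\mathcal{N}(x)\); the simulated data, the true function \(\sin(x)\), and the window width are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
x_train = rng.uniform(-3, 3, size=200)                     # training inputs
y_train = np.sin(x_train) + rng.normal(0, 0.3, size=200)   # Y = f(X) + eps, with f = sin

def nn_average(x0, width=0.5):
    """Estimate E[Y | X = x0] by averaging the y_i whose x_i fall in N(x0)."""
    in_neighborhood = np.abs(x_train - x0) <= width
    if not in_neighborhood.any():          # no training points nearby
        return np.nan
    return y_train[in_neighborhood].mean()

print(nn_average(1.0))   # rough estimate of E[Y | X = 1]; the truth sin(1) is about 0.84
```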
Nearest neighbor averaging can be pretty good for small \(p\) — i.e., \(p \le 4\) — and large-ish \(N\).
We will discuss smoother versions, such as kernel and spline smoothing, later in the course.
Nearest neighbor methods can be lousy when \(p\) is large.
Reason: the curse of dimensionality. Nearest neighbors tend to be far away in high dimensions.
We need to average a reasonable fraction (e.g., 10%) of the \(N\) values of \(y_i\) to bring the variance down.
A 10% neighborhood in high dimensions is no longer truly local, so we lose the spirit of estimating \(\mathbb{E}[Y \mid X = x]\) via local averaging.
Top panel: \(X_1\) and \(X_2\) are uniformly distributed on \([-1, 1]\).
1-Dimensional Neighborhood
2-Dimensional Neighborhood
Bottom panel: We see how far we have to go out in one, two, three, five, and ten dimensions in order to capture a certain fraction of the points.
Key Takeaway: As dimensionality increases, neighborhoods must expand significantly to capture the same fraction of data points, illustrating the curse of dimensionality.
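This effect is easy to quantify for uniformly distributed data on a hypercube: a sub-cube containing a fraction \(r\) of the points must have edge length \(r^{1/p}\). A quick sketch of that calculation (the 10% fraction matches the example above):

```python
# Edge length of a hypercube neighborhood that captures a fraction r of
# uniformly distributed points in p dimensions: r ** (1 / p).
r = 0.10
for p in (1, 2, 3, 5, 10):
    print(f"p = {p:2d}: required edge length = {r ** (1 / p):.2f}")
# p = 1 needs 10% of each axis, but p = 10 already needs roughly 80% of each axis,
# so the "neighborhood" is no longer local.
```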
The linear model is an important example of a parametric model, and one way to sidestep the curse of dimensionality:
\[ f_L(X) = \beta_0 + \beta_1X_1 + \beta_2X_2 + \ldots + \beta_pX_p \]
A linear model is specified in terms of \(p+1\) parameters (\(\beta_0, \beta_1, \ldots, \beta_p\)).
We estimate the parameters by fitting the model to training data.
Although it is almost never correct, it serves as a good and interpretable approximation to the unknown true function \(f(X)\).
\[ \hat{f}_L(X) = \hat{\beta}_0 + \hat{\beta}_1X \]
The linear model gives a reasonable fit here.
\[ \hat{f}_Q(X) = \hat{\beta}_0 + \hat{\beta}_1X + \hat{\beta}_2X^2 \]
Quadratic models may fit slightly better than linear models in some cases.
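A minimal sketch comparing the two fits on simulated data; the data-generating function and noise level are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0, 10, 100)
y = 2.0 + 0.5 * x + 0.05 * x**2 + rng.normal(0, 1.0, 100)  # mildly curved truth plus noise

linear = np.polyfit(x, y, deg=1)      # estimates for f_L(X) = b0 + b1 X
quadratic = np.polyfit(x, y, deg=2)   # estimates for f_Q(X) = b0 + b1 X + b2 X^2

for name, coefs in (("linear", linear), ("quadratic", quadratic)):
    mse = np.mean((y - np.polyval(coefs, x))**2)
    print(f"{name:9s} training MSE: {mse:.3f}")
```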
Red points are simulated values for income from the model:
\[ \text{income} = f(\text{education}, \text{seniority}) + \epsilon \]
\(f\) is the blue surface.
Linear regression model fit to the simulated data:
\[ \hat{f}_L(\text{education}, \text{seniority}) = \hat{\beta}_0 + \hat{\beta}_1 \times \text{education} + \hat{\beta}_2 \times \text{seniority} \]
More flexible regression model \(\hat{f}_S(\text{education}, \text{seniority})\) fit to the simulated data.
Here we use a technique called a thin-plate spline to fit a flexible surface. We control the roughness of the fit.
Even more flexible spline regression model \(\hat{f}_S(\text{education}, \text{seniority})\) fit to the simulated data. We tuned the roughness parameter all the way down to zero, so this surface passes through every single data point.
The fitted model makes no errors on the training data! This is known as overfitting.
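A one-dimensional analogue of this roughness control, sketched with scipy's smoothing spline: the smoothing parameter \(s\) plays the role of the roughness penalty, and setting \(s = 0\) forces the fit through every training point, mirroring the overfit surface above. The simulated data are an assumption.

```python
import numpy as np
from scipy.interpolate import UnivariateSpline

rng = np.random.default_rng(2)
x = np.sort(rng.uniform(0, 10, 60))       # UnivariateSpline requires increasing x
y = np.sin(x) + rng.normal(0, 0.3, 60)

smooth = UnivariateSpline(x, y, s=10.0)   # moderate roughness penalty
rough = UnivariateSpline(x, y, s=0.0)     # s = 0: interpolates every training point

print("smooth fit, training MSE:", np.mean((y - smooth(x))**2))   # clearly > 0
print("rough fit,  training MSE:", np.mean((y - rough(x))**2))    # essentially 0
```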
Prediction accuracy versus interpretability: linear models are easy to interpret, while thin-plate splines are not.
Good fit versus over-fit or under-fit: how do we know when the fit is just right?
Parsimony versus black-box: we often prefer a simpler model involving fewer variables over a black-box predictor that uses them all.
Trade-offs between flexibility and interpretability:
Suppose we fit a model \(\hat{f}(x)\) to some training data \(Tr = \{x_i, y_i\}_{i=1}^N\), and we wish to evaluate its performance:
\[ \text{MSE}_{Tr} = \text{Ave}_{i \in Tr}[(y_i - \hat{f}(x_i))^2] \]
However, training MSE may be biased toward more overfit models. Instead, if possible, we should compute the error on fresh test data \(Te\):
\[ \text{MSE}_{Te} = \text{Ave}_{i \in Te}[(y_i - \hat{f}(x_i))^2] \]
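A minimal sketch contrasting \(\text{MSE}_{Tr}\) and \(\text{MSE}_{Te}\) as flexibility grows, using polynomial degree as the flexibility knob; the simulated data and the set of degrees are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)

def simulate(n):
    """Draw (x, y) from Y = f(X) + eps with f(x) = sin(4x)."""
    x = rng.uniform(-1, 1, n)
    return x, np.sin(4 * x) + rng.normal(0, 0.3, n)

x_tr, y_tr = simulate(30)     # training set Tr
x_te, y_te = simulate(200)    # fresh test set Te

for degree in (1, 3, 5, 10, 15):
    coefs = np.polyfit(x_tr, y_tr, degree)
    mse_tr = np.mean((y_tr - np.polyval(coefs, x_tr))**2)
    mse_te = np.mean((y_te - np.polyval(coefs, x_te))**2)
    print(f"degree {degree:2d}: MSE_Tr = {mse_tr:.3f}, MSE_Te = {mse_te:.3f}")
# MSE_Tr keeps falling as flexibility grows; MSE_Te typically levels off and then rises.
```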
The trade-off we manage
Decomposition of expected test MSE
\[
\mathbb{E}\big[(Y-\hat f(X))^2\big]
\;=\; \underbrace{\big(\mathrm{Bias}[\hat f(X)]\big)^2}_{\text{misspecification}}
\;+\; \underbrace{\mathrm{Var}[\hat f(X)]}_{\text{sensitivity}}
\;+\; \underbrace{\sigma^2}_{\text{irreducible noise}}
\]
Top Panel: Model Fits
Black Curve: The true generating function, representing the underlying relationship we want to estimate.
Data Points: Observations generated from the black curve, with added noise (error).
Fitted Models: three estimates of increasing flexibility, shown in orange (least flexible), blue, and green (most flexible).
Key Insight:
The green model captures the data points well but risks overfitting, while the orange model is too rigid and misses the underlying structure. The blue model strikes a balance.
Bottom Panel: Mean Squared Error (MSE)
Gray Curve: Training data MSE.
Red Curve: Test data MSE across models of increasing flexibility.
Key Takeaway:
There is an optimal model complexity (the “magic point”) where test data MSE is minimized. Beyond this point, models become overly complex and generalization performance deteriorates.
Here, the truth is smoother, so smoother fits and linear models perform well.
Here, the truth is wiggly and the noise is low. More flexible fits perform the best.
Suppose we have fit a model \(\hat{f}(x)\) to some training data \(\text{Tr}\), and let \((x_0, y_0)\) be a test observation drawn from the population.
If the true model is
\[ Y = f(X) + \varepsilon \quad \text{(with } f(x) = \mathbb{E}[Y \mid X = x]\text{)}, \]
then
\[ \mathbb{E}\Bigl[\bigl(y_0 - \hat{f}(x_0)\bigr)^2\Bigr] = \mathrm{Var}\bigl(\hat{f}(x_0)\bigr) + \bigl[\mathrm{Bias}\bigl(\hat{f}(x_0)\bigr)\bigr]^2 + \mathrm{Var}(\varepsilon). \]
The expectation averages over the variability of \(y_0\) as well as the variability in \(\text{Tr}\). Note that
\[ \mathrm{Bias}\bigl(\hat{f}(x_0)\bigr) = \mathbb{E}[\hat{f}(x_0)] - f(x_0). \]
Typically, as the flexibility of \(\hat{f}\) increases, its variance increases and its bias decreases. Hence, choosing the flexibility based on average test error amounts to a bias-variance trade-off.
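A simulation sketch of this decomposition at a single test point \(x_0\): repeatedly draw training sets, refit \(\hat{f}\), and estimate the squared bias and variance of \(\hat{f}(x_0)\). The data-generating model and the two polynomial degrees are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(4)

def f(x):
    return np.sin(4 * x)        # true regression function

sigma = 0.3                     # standard deviation of the irreducible error
x0 = 0.5                        # the test point

for degree in (1, 10):          # an inflexible and a flexible fit
    preds = []
    for _ in range(500):        # many independent training sets Tr
        x_tr = rng.uniform(-1, 1, 30)
        y_tr = f(x_tr) + rng.normal(0, sigma, 30)
        coefs = np.polyfit(x_tr, y_tr, degree)
        preds.append(np.polyval(coefs, x0))
    preds = np.array(preds)
    bias_sq = (preds.mean() - f(x0))**2
    var = preds.var()
    print(f"degree {degree:2d}: Bias^2 = {bias_sq:.4f}, Var = {var:.4f}, "
          f"expected test error ~= {bias_sq + var + sigma**2:.4f}")
```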
Below is a schematic illustration of the mean squared error (MSE), bias, and variance curves as a function of the model’s flexibility.
MSE (red curve) goes down initially (as the model becomes more flexible) but eventually goes up (as overfitting sets in).
Bias (blue/teal curve) decreases with increasing flexibility.
Variance (orange curve) increases with increasing flexibility.
The vertical dotted line in each panel suggests a model flexibility that balances both bias and variance in an “optimal” region for minimizing MSE.
Statistical Learning and Predictive Analytics
Goal: Build models to predict outcomes and understand relationships between inputs (predictors) and responses.
Supervised Learning: Focuses on predicting \(Y\) (response) using \(X\) (predictors) via models like regression and classification.
Unsupervised Learning: Focuses on finding patterns in data without predefined responses (e.g., clustering).
Bias-Variance Trade-off
Key Trade-off: Model flexibility affects bias and variance: more flexible models have lower bias but higher variance.
Goal: Find the optimal flexibility that minimizes test error.
Techniques and Applications
Parametric Models: e.g., the linear model; specified by a small number of parameters and easy to interpret, but only an approximation to the true \(f\).
Flexible Models: e.g., thin-plate splines; can capture complex structure but risk overfitting if their roughness is not controlled.
Practical Considerations
Assessing Model Accuracy: evaluate models with test-set MSE, since training MSE is biased toward overfit models.
Key Challenges
Curse of Dimensionality: in high dimensions, neighborhoods large enough to contain a reasonable fraction of the data are no longer local, so nearest-neighbor methods break down.