Syllabus, Logistics, and Introduction
Supervised Learning
Unsupervised Learning
Statistical Learning Overview
This lecture content is inspired by and replicates the material from An Introduction to Statistical Learning.
Materials:
Brightspace
Word | Spam | Email |
---|---|---|
george | 0.00 | 1.27 |
you | 2.26 | 1.27 |
hp | 0.02 | 0.90 |
free | 0.52 | 0.07 |
! | 0.51 | 0.11 |
edu | 0.01 | 0.29 |
remove | 0.28 | 0.01 |
Average percentage of words or characters in an email message equal to the indicated word or character. We have chosen the words and characters showing the largest difference between spam and email.
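To make these features concrete, here is a minimal Python sketch of how such word percentages might be computed; the function name and the naive whitespace tokenization are our own simplifications, and character features such as "!" would need per-character counting:

```python
def word_percentages(message, vocab):
    """Percentage of tokens in `message` equal to each word in `vocab`."""
    tokens = message.lower().split()
    if not tokens:
        return {word: 0.0 for word in vocab}
    return {word: 100.0 * tokens.count(word) / len(tokens) for word in vocab}

# One toy message, using words from the table above:
print(word_percentages("Remove me from this FREE list", ["george", "free", "remove"]))
# {'george': 0.0, 'free': 16.66..., 'remove': 16.66...}
```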
Identify the numbers in a handwritten zip code.
Video: Winning the Netflix Prize
Outcome measurement \(Y\) (also called dependent variable, response, target).
Vector of \(p\) predictor measurements \(X\) (also called inputs, regressors, covariates, features, independent variables).
In the regression problem, \(Y\) is quantitative (e.g., price, blood pressure).
In the classification problem, \(Y\) takes values in a finite, unordered set (e.g., survived/died, digit 0–9, cancer class of tissue sample).
We have training data \((x_1, y_1), \ldots, (x_N, y_N)\). These are observations (examples, instances) of these measurements.
On the basis of the training data, we would like to:
Accurately predict unseen test cases.
Understand which inputs affect the outcome, and how.
Assess the quality of our predictions and inferences.
It is important to understand the ideas behind the various techniques, in order to know how and when to use them.
We will work through the simpler methods first, in order to grasp the more sophisticated ones later.
It is important to accurately assess the performance of a method, to know how well or how badly it is working.
No outcome variable, just a set of predictors (features) measured on a set of samples.
The objective is fuzzier:
Find groups of samples that behave similarly.
Find features that behave similarly.
Find linear combinations of features with the most variation (principal components; see the sketch after this list).
Difficult to know how well we are doing.
Different from supervised learning, but can be useful as a pre-processing step for supervised learning.
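As a sketch of the third objective, finding the linear combination of features with the largest variance is classical principal components analysis; here is a minimal NumPy version on simulated data of our own choosing:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
X[:, 1] = X[:, 0] + 0.1 * rng.normal(size=100)   # make two features move together

Xc = X - X.mean(axis=0)                  # center each feature
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
pc1 = Vt[0]                              # loadings of the first principal component
var_explained = s**2 / np.sum(s**2)      # fraction of variance per component
print(pc1, var_explained[0])             # pc1 weights features 0 and 1 heavily
```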
Shown are Sales vs TV, Radio, and Newspaper, with a blue linear-regression line fit separately to each.
Can we predict Sales using these three?
Perhaps we can do better using a model:
\[ \text{Sales} \approx f(\text{TV}, \text{Radio}, \text{Newspaper}) \]
Sales is a response or target that we wish to predict. We generically refer to the response as \(Y\).
TV is a feature, or input, or predictor; we name it \(X_1\).
Likewise, name Radio as \(X_2\), and so on.
The input vector collectively is referred to as:
\[ X = \begin{pmatrix} X_1 \\ X_2 \\ X_3 \end{pmatrix} \]
We write our model as:
\[ Y = f(X) + \epsilon \]
where \(\epsilon\) captures measurement errors and other discrepancies.
With a good \(f\), we can make predictions of \(Y\) at new points \(X = x\).
Understand which components of \(X = (X_1, X_2, \ldots, X_p)\) are important in explaining \(Y\), and which are irrelevant.
Depending on the complexity of \(f\), understand how each component \(X_j\) affects \(Y\).
In particular, what is a good value for \(f(X)\) at a selected value of \(X\), say \(X = 4\)?
There can be many \(Y\) values at \(X=4\). A good value is:
\[ f(4) = E(Y|X=4) \]
where \(E(Y|X=4)\) means the expected value (average) of \(Y\) given \(X=4\).
This ideal \(f(x) = E(Y|X=x)\) is called the regression function.
\[ f(\mathbf{x}) = f(x_1, x_2, x_3) = \mathbb{E}[\,Y \mid X_1 = x_1,\, X_2 = x_2,\, X_3 = x_3\,]. \]
\[ f(x) = \mathbb{E}[Y \mid X = x] \quad\text{is the function that minimizes}\quad \mathbb{E}[(Y - g(X))^2 \mid X = x] \text{ over all } g \text{ and for all points } X = x. \]
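Why the conditional mean? For any candidate function \(g\), add and subtract \(f(x)\) inside the square; the cross term vanishes because \(\mathbb{E}[Y - f(x) \mid X = x] = 0\), leaving
\[ \mathbb{E}\bigl[(Y - g(x))^2 \mid X = x\bigr] = \mathbb{E}\bigl[(Y - f(x))^2 \mid X = x\bigr] + \bigl(f(x) - g(x)\bigr)^2 \;\ge\; \mathbb{E}\bigl[(Y - f(x))^2 \mid X = x\bigr], \]
with equality exactly when \(g(x) = f(x)\).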
\(\varepsilon = Y - f(x)\) is the irreducible error.
\[ \mathbb{E}\bigl[(Y - \hat{f}(X))^2 \mid X = x\bigr] = \underbrace{[\,f(x) - \hat{f}(x)\,]^2}_{\text{Reducible}} \;+\; \underbrace{\mathrm{Var}(\varepsilon)}_{\text{Irreducible}}. \]
Often, we lack sufficient data points for exact computation of \(E(Y|X=x)\).
So, we relax the definition:
\[ \hat{f}(x) = \text{Ave}(Y|X \in \mathcal{N}(x)) \]
where \(\mathcal{N}(x)\) is a neighborhood of \(x\).
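A minimal Python sketch of this local-averaging idea, with a fixed-radius neighborhood and simulated data of our own choosing:

```python
import numpy as np

def local_average(x0, X, y, radius=0.5):
    """Estimate E[Y | X = x0] by averaging y_i over the neighborhood
    N(x0) = {i : |x_i - x0| <= radius}."""
    mask = np.abs(X - x0) <= radius
    if not mask.any():
        return np.nan                # no training points near x0
    return y[mask].mean()

# Toy 1-D example: Y = X^2 plus noise, so E[Y | X = 1] is about 1.
rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, size=200)
y = X**2 + rng.normal(scale=0.3, size=200)
print(local_average(1.0, X, y))      # roughly 1, up to noise and smoothing bias
```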
Nearest neighbor averaging can be pretty good for small \(p\) — i.e., \(p \le 4\) — and large-ish \(N\).
We will discuss smoother versions, such as kernel and spline smoothing, later in the course.
Nearest neighbor methods can be lousy when \(p\) is large.
Reason: the curse of dimensionality. Nearest neighbors tend to be far away in high dimensions.
We need to get a reasonable fraction of the \(N\) values of \(y_i\) to average in order to bring the variance down (e.g., 10%).
A 10% neighborhood in high dimensions is no longer truly local, so we lose the spirit of estimating \(\mathbb{E}[Y \mid X = x]\) via local averaging.
Top panel: \(X_1\) and \(X_2\) are uniformly distributed on \([-1, 1]\); the insets show a 1-dimensional and a 2-dimensional neighborhood of a target point.
Bottom panel: how far we have to go out in one, two, three, five, and ten dimensions in order to capture a certain fraction of the points.
Key Takeaway: As dimensionality increases, neighborhoods must expand significantly to capture the same fraction of data points, illustrating the curse of dimensionality.
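The numbers behind the figure follow from simple geometry: a sub-cube that captures a fraction \(f\) of points uniformly distributed on the unit hypercube in \(p\) dimensions must have edge length \(f^{1/p}\). A quick sketch of the computation:

```python
frac = 0.10                          # fraction of points we want in the neighborhood
for p in [1, 2, 3, 5, 10]:
    edge = frac ** (1 / p)           # required edge length of the sub-cube
    print(f"p={p:2d}: edge length for a 10% neighborhood = {edge:.2f}")
# p= 1: 0.10, p= 2: 0.32, p= 3: 0.46, p= 5: 0.63, p=10: 0.79
```

In ten dimensions, a "10% neighborhood" spans nearly 80% of the range of every coordinate: it is not local at all.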
The linear model is a key example of a parametric model to deal with the curse of dimensionality:
\[ f_L(X) = \beta_0 + \beta_1X_1 + \beta_2X_2 + \ldots + \beta_pX_p \]
A linear model is specified in terms of \(p+1\) parameters (\(\beta_0, \beta_1, \ldots, \beta_p\)).
We estimate the parameters by fitting the model to training data.
Although it is almost never correct, it serves as a good and interpretable approximation to the unknown true function \(f(X)\).
\[ \hat{f}_L(X) = \hat{\beta}_0 + \hat{\beta}_1X \]
The linear model gives a reasonable fit here.
\[ \hat{f}_Q(X) = \hat{\beta}_0 + \hat{\beta}_1X + \hat{\beta}_2X^2 \]
Quadratic models may fit slightly better than linear models in some cases.
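A minimal sketch of fitting both models by least squares with NumPy, on simulated data of our own choosing:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.uniform(0, 10, size=100)
y = 2.0 + 0.5 * X + 0.05 * X**2 + rng.normal(scale=1.0, size=100)

# np.polyfit returns coefficients from highest degree to lowest.
b1, b0 = np.polyfit(X, y, 1)         # linear:    f_L(x) = b0 + b1*x
c2, c1, c0 = np.polyfit(X, y, 2)     # quadratic: f_Q(x) = c0 + c1*x + c2*x^2
print(f"linear:    {b0:.2f} + {b1:.2f} x")
print(f"quadratic: {c0:.2f} + {c1:.2f} x + {c2:.2f} x^2")
```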
Red points are simulated values for income from the model:
\[ \text{income} = f(\text{education}, \text{seniority}) + \epsilon \]
\(f\) is the blue surface.
Linear regression model fit to the simulated data:
\[ \hat{f}_L(\text{education}, \text{seniority}) = \hat{\beta}_0 + \hat{\beta}_1 \times \text{education} + \hat{\beta}_2 \times \text{seniority} \]
More flexible regression model \(\hat{f}_S(\text{education}, \text{seniority})\) fit to the simulated data.
Here we use a technique called a thin-plate spline to fit a flexible surface. We control the roughness of the fit.
Even more flexible spline regression model \(\hat{f}_S(\text{education}, \text{seniority})\) fit to the simulated data. We tuned the roughness parameter all the way down to zero, so this surface goes through every single data point.
The fitted model makes no errors on the training data! This is known as overfitting.
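This behavior can be reproduced with SciPy's radial-basis-function interpolator using a thin-plate-spline kernel, whose `smoothing` argument controls the roughness; at `smoothing=0` the surface interpolates every training point. A sketch on made-up data (the lecture's figures were produced elsewhere):

```python
import numpy as np
from scipy.interpolate import RBFInterpolator

rng = np.random.default_rng(2)
X = rng.uniform(0, 1, size=(50, 2))                    # (education, seniority), rescaled
y = X @ np.array([3.0, 2.0]) + rng.normal(scale=0.2, size=50)   # "income" + noise

rough = RBFInterpolator(X, y, kernel='thin_plate_spline', smoothing=0.0)
smooth = RBFInterpolator(X, y, kernel='thin_plate_spline', smoothing=1.0)

print(np.abs(rough(X) - y).max())    # ~0: zero training error, i.e. overfit
print(np.abs(smooth(X) - y).max())   # > 0: smoother surface, some training error
```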
Prediction accuracy versus interpretability: linear models are easy to interpret; thin-plate splines are not.
Good fit versus over-fit or under-fit: how do we know when the fit is just right?
Parsimony versus black box: we often prefer a simpler model involving fewer variables over a black-box predictor involving them all.
These are all trade-offs between flexibility and interpretability.
Suppose we fit a model \(\hat{f}(x)\) to some training data \(Tr = \{x_i, y_i\}_{i=1}^N\), and we wish to evaluate its performance:
\[ \text{MSE}_{Tr} = \text{Ave}_{i \in Tr}[(y_i - \hat{f}(x_i))^2] \]
However, the training MSE is biased toward more flexible, overfit models. A better measure uses fresh test data \(Te\):
\[ \text{MSE}_{Te} = \text{Ave}_{i \in Te}[(y_i - \hat{f}(x_i))^2] \]
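A sketch showing both quantities on simulated data, with polynomial degree standing in for flexibility (the data-generating process is our own choice):

```python
import numpy as np

def mse(y, y_hat):
    """Average squared error, as in the formulas above."""
    return np.mean((y - y_hat) ** 2)

rng = np.random.default_rng(3)
f = lambda x: np.sin(x)                                  # true regression function
X_tr = rng.uniform(0, 6, 50);  y_tr = f(X_tr) + rng.normal(scale=0.3, size=50)
X_te = rng.uniform(0, 6, 500); y_te = f(X_te) + rng.normal(scale=0.3, size=500)

for deg in [1, 2, 5, 10]:                                # increasing flexibility
    coef = np.polyfit(X_tr, y_tr, deg)
    print(deg,
          round(mse(y_tr, np.polyval(coef, X_tr)), 3),   # keeps shrinking
          round(mse(y_te, np.polyval(coef, X_te)), 3))   # typically U-shaped
```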
Top Panel: Model Fits
Black Curve: The true generating function, representing the underlying relationship we want to estimate.
Data Points: Observations generated from the black curve, with added noise (error).
Fitted Models: an orange linear regression fit and two smoothing-spline fits of increasing flexibility (blue, then green).
Key Insight:
The green model captures the data points well but risks overfitting, while the orange model is too rigid and misses the underlying structure. The blue model strikes a balance.
Bottom Panel: Mean Squared Error (MSE)
Gray Curve: Training data MSE.
Red Curve: Test data MSE across models of increasing flexibility.
Key Takeaway:
There is an optimal model complexity (the “magic point”) where test data MSE is minimized. Beyond this point, models become overly complex and generalization performance deteriorates.
Here, the truth is smoother, so smoother fits and linear models perform well.
Here, the truth is wiggly and the noise is low. More flexible fits perform the best.
Suppose we have fit a model \(\hat{f}(x)\) to some training data \(\text{Tr}\), and let \((x_0, y_0)\) be a test observation drawn from the population.
If the true model is
\[ Y = f(X) + \varepsilon \quad \text{(with } f(x) = \mathbb{E}[Y \mid X = x]\text{)}, \]
then
\[ \mathbb{E}\Bigl[\bigl(y_0 - \hat{f}(x_0)\bigr)^2\Bigr] = \mathrm{Var}\bigl(\hat{f}(x_0)\bigr) + \bigl[\mathrm{Bias}\bigl(\hat{f}(x_0)\bigr)\bigr]^2 + \mathrm{Var}(\varepsilon). \]
The expectation averages over the variability of \(y_0\) as well as the variability in \(\text{Tr}\). Note that
\[ \mathrm{Bias}\bigl(\hat{f}(x_0)\bigr) = \mathbb{E}[\hat{f}(x_0)] - f(x_0). \]
Typically, as the flexibility of \(\hat{f}\) increases, its variance increases and its bias decreases. Hence, choosing the flexibility based on average test error amounts to a bias-variance trade-off.
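The decomposition can be checked by simulation: refit the model on many independent training sets and look at the spread of \(\hat{f}(x_0)\). A minimal sketch with a deliberately rigid (linear) fit to a nonlinear truth:

```python
import numpy as np

rng = np.random.default_rng(4)
f = lambda x: np.sin(x)            # true regression function
x0, sigma = 2.0, 0.3               # test point and noise standard deviation

fits = []
for _ in range(2000):              # many independent training sets
    X = rng.uniform(0, 6, 30)
    y = f(X) + rng.normal(scale=sigma, size=30)
    b1, b0 = np.polyfit(X, y, 1)   # a rigid, hence biased, model
    fits.append(b0 + b1 * x0)
fits = np.array(fits)

bias = fits.mean() - f(x0)         # E[f_hat(x0)] - f(x0)
var = fits.var()                   # Var(f_hat(x0))
print(bias**2 + var + sigma**2)    # approximately the expected test MSE at x0
```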
Below is a schematic illustration of the mean squared error (MSE), bias, and variance curves as a function of the model’s flexibility.
MSE (red curve) goes down initially (as the model becomes more flexible) but eventually goes up (as overfitting sets in).
Bias (blue/teal curve) decreases with increasing flexibility.
Variance (orange curve) increases with increasing flexibility.
The vertical dotted line in each panel suggests a model flexibility that balances both bias and variance in an “optimal” region for minimizing MSE.
Here the response variable \(Y\) is qualitative. For example:
Email could be classified as spam or ham (good email).
Digit classification could be one of \(\{0, 1, 2, \dots, 9\}\).
Our goals are to:
Build a classifier \(C(X)\) that assigns a class label from the set \(\mathcal{C}\) of possible classes to a future unlabeled observation \(X\).
Assess the uncertainty in each classification.
Understand the roles of the different predictors among \(X = (X_1, X_2, \dots, X_p)\).
Consider a classification problem with \(K\) possible classes, numbered \(1, 2, \ldots, K\). Define
\[ p_k(x) = \Pr(Y = k \mid X = x), \quad k = 1, 2, \ldots, K. \]
These are the conditional class probabilities at \(x\); e.g., see the little barplot at \(x = 5\).
The Bayes optimal classifier at \(x\) is
\[ C(x) \;=\; j \quad \text{if} \quad p_j(x) = \max \{\,p_1(x),\, p_2(x),\, \dots,\, p_K(x)\}. \]
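In code, the Bayes rule is just an argmax over the class probabilities; a sketch where `p_hat` stands for any estimate of the \(p_k(x)\):

```python
import numpy as np

def bayes_classify(p_hat):
    """Assign each point to the class with the largest probability.
    p_hat has shape (n_points, K), each row summing to 1."""
    return np.argmax(p_hat, axis=1) + 1    # classes numbered 1..K

p_hat = np.array([[0.2, 0.7, 0.1],
                  [0.5, 0.3, 0.2]])
print(bayes_classify(p_hat))               # [2 1]
```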
Nearest-neighbor averaging can be used as before.
Nearest-neighbor averaging also breaks down as the dimension grows. However, the impact on \(\hat{C}(x)\) is less than on \(\hat{p}_k(x)\), for \(k = 1, \ldots, K\).
Typically we measure the performance of \(\hat{C}(x)\) using the misclassification error rate:
\[ \mathrm{Err}_{\mathrm{Te}} = \mathrm{Ave}_{i\in \mathrm{Te}} \bigl[I(y_i \neq \hat{C}(x_i))\bigr]. \]
The Bayes classifier (using the true \(p_k(x)\)) has the smallest error in the population.
Support-vector machines build structured models for \(\hat{C}(x)\).
We also build structured models for representing \(p_k(x)\). For example, logistic regression or generalized additive models.
Below is an example data set in two dimensions \((X_1, X_2)\). Points shown in blue might represent one class, and points in orange the other. The dashed boundary suggests a decision boundary formed by a classifier.
Here is the same data set classified by k-nearest neighbors with \(k = 10\). The black boundary line encloses the region of the feature space predicted as orange vs. blue, showing how the decision boundary has become smoother.
Comparisons of a very low value of \(k\) (left, \(k=1\)) versus a very high value (right, \(k=100\)).
\(k=1\): Overly flexible boundary that can overfit.
\(k=100\): Very smooth boundary that can underfit.
The figure illustrates how training errors (blue curve) and test errors (orange curve) change for a K-nearest neighbors (KNN) classifier as \(\frac{1}{K}\) varies.
For small \(K\) (i.e., large \(\frac{1}{K}\)), the model can become very flexible, often driving down training error but increasing overfitting and thus test error.
For large \(K\) (i.e., small \(\frac{1}{K}\)), the model becomes smoother, which can help avoid overfitting but sometimes leads to underfitting.
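A minimal scikit-learn sketch that reproduces this pattern on simulated data of our own choosing:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(5)

def make_data(n):
    """Two-class problem: the label depends on X1 + X2 plus noise."""
    X = rng.normal(size=(n, 2))
    y = (X[:, 0] + X[:, 1] + rng.normal(scale=0.5, size=n) > 0).astype(int)
    return X, y

X_tr, y_tr = make_data(200)
X_te, y_te = make_data(2000)

for k in [1, 10, 100]:
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_tr, y_tr)
    err_tr = np.mean(knn.predict(X_tr) != y_tr)   # misclassification rate, training
    err_te = np.mean(knn.predict(X_te) != y_te)   # misclassification rate, test
    print(f"k={k:3d}: train error {err_tr:.3f}, test error {err_te:.3f}")
```

With \(k = 1\) the training error is exactly zero, while the test error reveals the overfitting.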
The dashed horizontal line is the Bayes error rate, used as a reference for comparison.
Statistical Learning and Predictive Analytics
Goal: Build models to predict outcomes and understand relationships between inputs (predictors) and responses.
Supervised Learning: Focuses on predicting \(Y\) (response) using \(X\) (predictors) via models like regression and classification.
Unsupervised Learning: Focuses on finding patterns in data without predefined responses (e.g., clustering).
Bias-Variance Trade-off
Key Trade-off: model flexibility affects bias and variance; more flexible models have lower bias but higher variance, and more rigid models the reverse.
Goal: Find the optimal flexibility that minimizes test error.
Techniques and Applications
Parametric Models: e.g., the linear model, specified by a small number of parameters; interpretable, but biased when the true \(f\) is far from linear.
Flexible Models: e.g., splines and nearest-neighbor methods; they can capture complex relationships but risk overfitting and are harder to interpret.
Practical Considerations
Assessing Model Accuracy: judge performance by test error (MSE for regression, misclassification rate for classification), not training error.
Key Challenges
Curse of Dimensionality: local methods such as nearest neighbors deteriorate as the number of predictors \(p\) grows, because neighborhoods stop being local.