Predictive Modeling in R
August 01, 2024
Lesson Exercise Review
Lesson Question!
Course Learning Milestones
The 8 Key Steps of a Data Mining Project
Correlation is a unit-free measure of linear association and not necessarily causation.
The coefficient can take on values between −1 and +1.
The closer the correlation is to zero, the weaker the linear relationship.
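For example, the unit-free correlation between horsepower and fuel efficiency in the built-in mtcars data (used throughout this lecture) can be computed with cor(); the value shown in the comment is consistent with the regression results later in the lecture:
# Pearson correlation between horsepower and miles per gallon
cor(mtcars$hp, mtcars$mpg)
# about -0.78: a fairly strong negative linear association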
All models are wrong, but some are useful.
George Box
Let’s start with a very simple premise:
Let’s begin by giving specific names to these variables:
Dependent Variable (DV): This is our phenomenon of interest, usually denoted as \(Y\).
Independent Variable (IV): This is the phenomenon that explains/describes our dependent variable, generally denoted as \(X\).
Mathematically, we model \(Y\) as a function of \(X\). Statistically, modeling can serve two main purposes:
Prediction: The possibility of using the values of \(X\) to predict the value of \(Y\). There must be a substantive connection between these two variables for one to generate reliable predictions about the values of the other.
Explanation: Used to understand the connection and significance (both substantive and statistical) of the relationship between two variables. In this case, we aim to accurately estimate the impact of one variable on the other, preferably excluding any potential omitted variables.
Analyze the relationship between horsepower (hp) and miles per gallon (mpg).
Build a simple linear regression model to predict mpg based on hp.
Built-in dataset in R.
Contains data on fuel consumption and 10 aspects of automobile design and performance for 32 cars (1973–74 models).
Variables Used
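A minimal sketch of the R code that produces the summary output below; the object name model_simple is an assumed name, but the formula matches the Call shown in the output:
# Fit a simple linear regression of mpg on hp (mtcars is built into R)
model_simple <- lm(mpg ~ hp, data = mtcars)
# Display coefficient estimates, R-squared, and the overall F-test
summary(model_simple)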
Call:
lm(formula = mpg ~ hp, data = mtcars)
Residuals:
Min 1Q Median 3Q Max
-5.7121 -2.1122 -0.8854 1.5819 8.2360
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 30.09886 1.63392 18.421 < 2e-16 ***
hp -0.06823 0.01012 -6.742 1.79e-07 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 3.863 on 30 degrees of freedom
Multiple R-squared: 0.6024, Adjusted R-squared: 0.5892
F-statistic: 45.46 on 1 and 30 DF, p-value: 1.788e-07
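Rounding the estimates, the fitted model can be written as an estimated regression equation:
\[ \widehat{mpg} = 30.10 - 0.068 \, hp \]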
Coefficients:
Intercept: Expected mpg when hp is zero.
hp (slope): Change in mpg for each additional horsepower.
R-squared: Proportion of variance in mpg explained by hp.
p-value: Significance of the relationship between hp and mpg.
The model considers only one predictor (hp).
Potential influence of outliers or high-leverage points.
Other variables may affect mpg (e.g., weight, number of cylinders).
There is a significant negative linear relationship between horsepower and miles per gallon.
The model can be used to predict fuel efficiency based on horsepower.
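As an illustration, a prediction for a hypothetical car can be obtained with predict(), assuming the fitted object model_simple from the sketch above:
# Predicted mpg for a car with 150 horsepower
predict(model_simple, newdata = data.frame(hp = 150))
# roughly 30.10 - 0.068 * 150, i.e. about 19.9 mpg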
Precision: Refers to the consistency or reliability of the model’s predictions.
Accuracy: Refers to how close the model’s predictions are to the true values.
In the context of regression:
To achieve high precision and high accuracy, we need to meet the model assumptions.
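One common way to check these assumptions is to inspect the standard diagnostic plots R produces for a fitted lm object (a sketch, again assuming model_simple):
# Residuals vs fitted, normal Q-Q, scale-location, and residuals vs leverage
par(mfrow = c(2, 2))  # arrange the four diagnostic plots in a 2 x 2 grid
plot(model_simple)
par(mfrow = c(1, 1))  # reset the plotting layout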
Source: Causal Inference Animated Plots
Source: Causal Inference Animated Plots
One of the most common errors in observational studies (besides selection bias and information bias — classification or measurement error);
It occurs when the effect we attribute to one explanatory variable is actually mixed up ("confounded") with the effect of another variable;
For example, “the sun rose because the rooster crowed,” and not because of Earth’s rotation.
Be well-versed in the literature;
Select good control variables for your model;
That is, fit a multiple regression model.
Regression analysis involving two or more independent variables (x’s).
This subject area, called multiple regression analysis, enables us to consider more independent variables (factors) and thus obtain better estimates of the relationship than are possible with simple linear regression.
The equation that describes how the dependent variable \(y\) is related to the independent variables \(x_1, x_2, \ldots x_p\) and an error term \(\epsilon\) is:
\[ y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_p x_p + \epsilon \]
Where:
\(\beta_0, \beta_1, \beta_2, \dots, \beta_p\) are the unknown parameters.
\(\epsilon\) is a random variable called the error term with the same assumptions as in simple regression (Normality, zero mean, constant variance, independence).
\(p\) is the number of independent variables (dimension or complexity of the model).
The equation that describes how the mean value of \(y\) is related to \(x_1, x_2, \ldots x_p\) is:
\[ E(y) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_p x_p \]
\(\beta_1, \ldots, \beta_p\) measure the marginal effects of the respective independent variables.
For example, \(\beta_1\) is the change in \(E(y)\) corresponding to a 1-unit increase in \(x_1\), when all other independent variables are held constant or when we control for all other independent variables.
\[ \hat{y} = b_0 + b_1 x_1 + b_2 x_2 + \dots + b_p x_p \]
A simple random sample is used to compute sample coefficients \(b_0, b_1, b_2, \dots, b_p\) that are used as the point estimators of the population parameters \(\beta_0, \beta_1, \beta_2, \dots, \beta_p\).
Hence, \(\hat{y}\) estimates \(E(Y)\).
Analyze the factors that affect fuel efficiency (mpg).
Build a multiple regression model to predict mpg based on:
Horsepower (hp)
Weight (wt)
Transmission type (am)
Engine shape (vs)
Number of cylinders (cyl)
'data.frame': 32 obs. of 6 variables:
$ mpg: num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
$ hp : num 110 110 93 110 175 105 245 62 95 123 ...
$ wt : num 2.62 2.88 2.32 3.21 3.44 ...
$ am : num 1 1 1 0 0 0 0 0 0 0 ...
$ vs : num 0 0 1 1 0 1 0 1 1 1 ...
$ cyl: num 6 6 4 6 8 6 8 4 4 6 ...
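The slides do not show the preprocessing or fitting code, but a sketch consistent with the output below (and with the factor levels used in the prediction code later) is:
# Recode the categorical variables as labeled factors
# (level labels inferred from the coefficient names amManual, vsStraight, cyl6, cyl8)
mtcars$am  <- factor(mtcars$am,  levels = c(0, 1), labels = c("Automatic", "Manual"))
mtcars$vs  <- factor(mtcars$vs,  levels = c(0, 1), labels = c("V-shaped", "Straight"))
mtcars$cyl <- factor(mtcars$cyl)  # levels "4", "6", "8"; "4" is the reference level
# Fit the multiple regression model and display its summary
model_mult <- lm(mpg ~ hp + wt + am + vs + cyl, data = mtcars)
summary(model_mult)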
Call:
lm(formula = mpg ~ hp + wt + am + vs + cyl, data = mtcars)
Residuals:
Min 1Q Median 3Q Max
-4.3405 -1.2158 0.0046 0.9389 4.6354
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 31.18461 3.42002 9.118 2e-09 ***
hp -0.03475 0.01382 -2.515 0.0187 *
wt -2.37337 0.88763 -2.674 0.0130 *
amManual 2.70384 1.59850 1.691 0.1032
vsStraight 1.99000 1.76018 1.131 0.2690
cyl6 -2.09011 1.62868 -1.283 0.2112
cyl8 0.29098 3.14270 0.093 0.9270
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 2.397 on 25 degrees of freedom
Multiple R-squared: 0.8724, Adjusted R-squared: 0.8418
F-statistic: 28.49 on 6 and 25 DF, p-value: 5.064e-10
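Rounding the estimates, the fitted model can be written as:
\[ \widehat{mpg} = 31.18 - 0.035\,hp - 2.37\,wt + 2.70\,am_{Manual} + 1.99\,vs_{Straight} - 2.09\,cyl_{6} + 0.29\,cyl_{8} \]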
Intercept: Expected mpg when all predictors are at reference levels or zero.
hp: Change in mpg per unit increase in horsepower, holding other variables constant.
wt: Change in mpg per 1000 lbs increase in weight, holding other variables constant.
amManual: Difference in mpg between manual and automatic transmission.
vsStraight: Difference in mpg between straight and V-shaped engines.
cyl6, cyl8: Difference in mpg for 6- and 8-cylinder engines relative to the 4-cylinder reference.
Significant Predictors
hp and wt are significant predictors of mpg.
Adjusted R-squared: Measures the proportion of variance in mpg explained by the model, adjusted for the number of predictors.
F-statistic: Tests the overall significance of the model.
p-values: Assess the significance of individual predictors.
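As a complement to the p-values, confidence intervals for the coefficients give a range of plausible effect sizes (a sketch assuming the fitted object model_mult):
# 95% confidence intervals for the estimated coefficients
confint(model_mult, level = 0.95)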
# New data frame with predictor values
new_data <- data.frame(
  hp  = c(110, 150),
  wt  = c(2.5, 3.0),
  am  = factor(c("Manual", "Automatic"), levels = levels(mtcars$am)),
  vs  = factor(c("Straight", "V-shaped"), levels = levels(mtcars$vs)),
  cyl = factor(c("4", "6"), levels = levels(mtcars$cyl))
)
# Predicted mpg
predicted_mpg <- predict(model_mult, newdata = new_data)
cbind(new_data, Predicted_mpg = predicted_mpg)
hp wt am vs cyl Predicted_mpg
1 110 2.5 Manual Straight 4 26.12252
2 150 3.0 Automatic V-shaped 6 16.76187
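The first prediction can be verified by hand from the estimated coefficients (with cyl = 4 as the reference level, the cyl terms drop out):
\[ \widehat{mpg} = 31.185 - 0.0348(110) - 2.373(2.5) + 2.704 + 1.990 \approx 26.12 \]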
Main Takeaways from this lecture:
Regression models can be used to predict response variable outcomes from independent variables.
Compared to the Simple Regression Model, Multiple Regression Models allow for the inclusion of several predictors to improve model accuracy.
Data Mining Lab