Predictive Model Interpretations and Predictions
August 01, 2024
Lesson Exercise Review
Lesson Question!
Course Learning Milestones
The 8 Key Steps of a Data Mining Project
Goal Setting
Data Understanding
Insights
Precision: Refers to the consistency or reliability of the model’s predictions.
Accuracy: Refers to how close the model’s predictions are to the true values.
In the context of regression:
To achieve high precision and high accuracy, we need to meet the model assumptions.
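As a quick illustration (a minimal sketch using base R's built-in diagnostics; the two-predictor formula is just an example), the usual way to check these assumptions is with residual plots:

```r
# Fit an example model on the built-in mtcars data
fit <- lm(mpg ~ hp + wt, data = mtcars)

# Standard diagnostic plots: residuals vs fitted (linearity, constant
# variance), normal Q-Q (normality of errors), scale-location, and
# residuals vs leverage (influential observations)
par(mfrow = c(2, 2))
plot(fit)
```

If these plots show strong patterns (curvature, funnel shapes, heavy tails), one or more assumptions are violated and the model's precision and accuracy suffer.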
Source: Causal Inference Animated Plots
Confounding is one of the most common errors in observational studies (alongside selection bias and information bias, i.e., classification or measurement error);
It occurs when the effect we attribute to one variable is mixed up ("confounded") with the effect of another variable;
For example, claiming "the sun rose because the rooster crowed," rather than because of Earth's rotation.
Be well-versed in the literature;
Select good control variables for your model;
That is, fit a multiple regression model.
Regression analysis involving two or more independent variables (x’s).
This subject area, called multiple regression analysis, enables us to consider more independent variables (factors) and thus obtain better estimates of the relationship than are possible with simple linear regression.
The equation that describes how the dependent variable \(y\) is related to the independent variables \(x_1, x_2, \ldots x_p\) and an error term \(\epsilon\) is:
\[ y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_p x_p + \epsilon \]
Where:
\(\beta_0, \beta_1, \beta_2, \dots, \beta_p\) are the unknown parameters.
\(\epsilon\) is a random variable called the error term with the same assumptions as in simple regression (Normality, zero mean, constant variance, independence).
\(p\) is the number of independent variables (dimension or complexity of the model).
The equation that describes how the mean value of \(y\) is related to \(x_1, x_2, \ldots x_p\) is:
\[ E(y) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_p x_p \]
\(\beta_1, \ldots, \beta_p\) measure the marginal effects of the respective independent variables.
For example, \(\beta_1\) is the change in \(E(y)\) corresponding to a 1-unit increase in \(x_1\), when all other independent variables are held constant or when we control for all other independent variables.
\[ \hat{y} = b_0 + b_1 x_1 + b_2 x_2 + \dots + b_p x_p \]
A simple random sample is used to compute the sample coefficients \(b_0, b_1, b_2, \dots, b_p\), which serve as point estimators of the population parameters \(\beta_0, \beta_1, \beta_2, \dots, \beta_p\).
Hence, \(\hat{y}\) estimates \(E(y)\).
We want to predict mpg (miles per gallon) based on:
hp (gross horsepower)
wt (weight, in 1000 lbs)
am (transmission: automatic or manual)
vs (engine shape: V-shaped or straight)
cyl (number of cylinders)
'data.frame': 32 obs. of 6 variables:
$ mpg: num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
$ hp : num 110 110 93 110 175 105 245 62 95 123 ...
$ wt : num 2.62 2.88 2.32 3.21 3.44 ...
$ am : num 1 1 1 0 0 0 0 0 0 0 ...
$ vs : num 0 0 1 1 0 1 0 1 1 1 ...
$ cyl: num 6 6 4 6 8 6 8 4 4 6 ...
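Note that str() shows am, vs, and cyl stored as numeric, while the model summary below reports terms like amManual, vsStraight, and cyl6, which implies they were recoded as labelled factors before fitting. A sketch of that recoding (the label names are an assumption inferred from the summary output):

```r
# Recode the numeric dummies as labelled factors so lm() treats them
# as categorical predictors with a reference level
mtcars$am  <- factor(mtcars$am,  levels = c(0, 1), labels = c("Automatic", "Manual"))
mtcars$vs  <- factor(mtcars$vs,  levels = c(0, 1), labels = c("V-shaped", "Straight"))
mtcars$cyl <- factor(mtcars$cyl)  # levels 4, 6, 8; 4 becomes the reference
```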
Call:
lm(formula = mpg ~ hp + wt + am + vs + cyl, data = mtcars)
Residuals:
Min 1Q Median 3Q Max
-4.3405 -1.2158 0.0046 0.9389 4.6354
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 31.18461 3.42002 9.118 2e-09 ***
hp -0.03475 0.01382 -2.515 0.0187 *
wt -2.37337 0.88763 -2.674 0.0130 *
amManual 2.70384 1.59850 1.691 0.1032
vsStraight 1.99000 1.76018 1.131 0.2690
cyl6 -2.09011 1.62868 -1.283 0.2112
cyl8 0.29098 3.14270 0.093 0.9270
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 2.397 on 25 degrees of freedom
Multiple R-squared: 0.8724, Adjusted R-squared: 0.8418
F-statistic: 28.49 on 6 and 25 DF, p-value: 5.064e-10
Intercept: the expected mpg when all predictors are at their reference levels or zero.
hp: the change in mpg per unit increase in horsepower, holding other variables constant.
wt: the change in mpg per 1000 lbs increase in weight, holding other variables constant.
am: the difference in mpg between manual and automatic transmission.
vs: the difference in mpg between straight and V-shaped engines.
cyl: the difference in mpg between 6- or 8-cylinder engines and the 4-cylinder reference.
Significant Predictors: hp and wt are statistically significant predictors of mpg.
Adjusted R-squared: measures the proportion of variance in mpg explained by the model, adjusted for the number of predictors.
F-statistic: tests the overall significance of the model.
p-values: assess the significance of individual predictors.
Method: Use the regsubsets()
function from the leaps
package to evaluate all possible combinations of predictors and identify the best model. This method guarantees that the best subset of predictors is selected according to a chosen criterion (e.g., adjusted \(R^2\), AIC, BIC).
Selection: This method ensures an exhaustive search of all possible combinations, providing the best model for each subset size.
library(leaps)
# Fit the best subset model
best_model <- regsubsets(mpg ~ ., data = mtcars, nbest = 1)
# Extract the summary of the model
best_model_summary <- summary(best_model)
# Extract metrics
bic_values <- best_model_summary$bic
# Find the best model indices based on each criterion
best_bic_index <- which.min(bic_values)
# Display the best models based on the chosen criteria
cat("\nBest model based on BIC includes:\n")
print(coef(best_model, best_bic_index))
Result: The regsubsets()
function outputs the best subset of predictors for each model size, allowing you to compare and choose the optimal model based on adjusted \(R^2\), BIC, or other criteria.
In our case, we are concerned with the optimal model for prediction, so we are using BIC as our criterion.
Method: Use k-fold cross-validation to assess the predictive performance of the model. This method helps evaluate how the model generalizes to unseen data.
Selection: Choose the model with better cross-validation metrics (e.g., lower mean squared error).
library(caret)
# Define the cross-validation method
trainControl <- trainControl(method = "cv", number = 10)
# Train the model based on adjusted R-squared criteria
original_model <- train(mpg ~ hp + wt + am + vs + cyl, data = mtcars, method = "lm", trControl = trainControl)
# print(original_model)
# Train the model based on BIC criteria
model_bic <- train(mpg ~ wt + qsec + am, data = mtcars, method = "lm", trControl = trainControl)
# print(model_bic)
# Compare RMSE, R-squared, and MAE (Mean Absolute Error) for both models
#cat("\nComparison of Prediction Performance:\n")
performance_comparison <- rbind(
"original_model" = original_model$results[, c("RMSE", "Rsquared", "MAE")],
"Model_bic" = model_bic$results[, c("RMSE", "Rsquared", "MAE")]
)
print(performance_comparison)
original_model:
Better at explaining variability (higher R-squared).
Slightly higher prediction error (RMSE and MAE).
model_bic:
More accurate predictions (lower RMSE and MAE).
Explains less variability (lower R-squared).
Recommendation
Choose original_model: if the goal is to maximize explanation of variability in mpg.
Choose model_bic: if the goal is to minimize prediction error for better accuracy.
Final Choice: Depends on whether the analysis objective prioritizes explanatory power or prediction accuracy.
Now that we have our prediction model, let’s see an example of how we can use it for prediction.
To predict mpg
with our model we need to have data regarding our independent variables. To do so, let’s split our original dataset:
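A minimal sketch of that split (the 80/20 proportion and the seed value are assumptions, not part of the original lecture):

```r
set.seed(123)  # assumed seed, for reproducibility

# Randomly assign ~80% of the rows to training, the rest to testing
train_idx <- sample(seq_len(nrow(mtcars)), size = floor(0.8 * nrow(mtcars)))
training_data <- mtcars[train_idx, ]
testing_data  <- mtcars[-train_idx, ]
```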
We fit the model with our training_data and predict the mpg values using our testing_data.
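That step might look like the following (a sketch assuming the BIC-selected formula mpg ~ wt + qsec + am and the training/testing split described above):

```r
# Refit the chosen prediction model on the training set only
pred_model <- lm(mpg ~ wt + qsec + am, data = training_data)

# Predict mpg for the held-out cars in the test set
predicted_mpg <- predict(pred_model, newdata = testing_data)
```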
By combining our predicted results with our original dataset, we can check to what extent we were able to predict the actual mpg values:
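For instance (a sketch assuming predicted_mpg holds the test-set predictions from the fitted model):

```r
# Attach the predictions to the test set for a side-by-side comparison
results <- testing_data
results$predicted_mpg <- predicted_mpg

# Inspect actual vs predicted values, and summarize the prediction
# error with the root mean squared error (RMSE)
print(results[, c("mpg", "predicted_mpg")])
rmse <- sqrt(mean((results$mpg - results$predicted_mpg)^2))
print(rmse)
```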
We can plot a scatter plot with the actual mpg
values on the x-axis and predicted mpg
values on the y-axis.
The red line represents a linear regression line that helps us see how well our predictions align with the actual data.
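A sketch of that plot with ggplot2 (assuming a results data frame with mpg and predicted_mpg columns, as in the comparison step above):

```r
library(ggplot2)

# Actual mpg on the x-axis, predicted mpg on the y-axis,
# with a red regression line showing how well they align
ggplot(results, aes(x = mpg, y = predicted_mpg)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE, color = "red") +
  labs(x = "Actual mpg", y = "Predicted mpg")
```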
Main Takeaways from this lecture:
Use regsubsets() to find the best combination of predictors.