# Uncomment and install if needed:
# install.packages("caret")
# install.packages("dplyr")
# install.packages("pROC")
library(caret)
library(dplyr)
library(pROC)
set.seed(123) # ensures reproducibility
n <- 1000
# 1) Generate basic customer attributes
Age <- round(rnorm(n, mean = 40, sd = 10))
Gender <- sample(c("Male", "Female"), n, replace = TRUE)
Income <- round(rnorm(n, mean = 60000, sd = 10000))
Engagement <- runif(n, min = 0, max = 1)
# 2) Make some feature shifts to mimic demographic differences
# (Optional step to create more distinct sub-groups)
== "Female"] <- Engagement[Gender == "Female"] + 0.1
Engagement[Gender == "Female"] <- Income[Gender == "Female"] + 3000
Income[Gender == "Female"] <- Age[Gender == "Female"] + 1
Age[Gender
# 3) Define business rules / assumptions for purchase probability
# We'll combine a baseline with increments/decrements based on thresholds.
baseline_prob <- 0.2
purchase_prob <- rep(baseline_prob, n)
# a) Customers with higher engagement more likely to buy
purchase_prob[Engagement >= 0.5] <- purchase_prob[Engagement >= 0.5] + 0.3
# b) Higher income is a strong indicator of potential purchase
purchase_prob[Income >= 55000] <- purchase_prob[Income >= 55000] + 0.25
# c) Suppose 'Female' customers have a slight bump in probability
== "Female"] <- purchase_prob[Gender == "Female"] + 0.1
purchase_prob[Gender
# d) Age effect: perhaps older customers (>=45) have slightly higher probability
purchase_prob[Age >= 45] <- purchase_prob[Age >= 45] + 0.05
# 4) Add random noise to reflect real-world unpredictability
# (some random variation that might increase or decrease probabilities)
noise <- rnorm(n, mean = 0, sd = 0.05)
purchase_prob <- purchase_prob + noise
# 5) Clamp probabilities to valid [0, 1] range
purchase_prob <- pmin(pmax(purchase_prob, 0.01), 0.99)
# 6) Generate the Purchase variable via Bernoulli trials
Purchase <- rbinom(n, size = 1, prob = purchase_prob)
# 7) Combine into a data frame
df <- data.frame(
  Age = Age,
  Gender = factor(Gender),
  Income = Income,
  Engagement = Engagement,
  Purchase = factor(Purchase, labels = c("No", "Yes"))
)
# Quick checks
table(df$Purchase)
head(df)
MGMT 17300: Data Mining Lab
Model Evaluation with Logistic Regression
Overview
- Introduction to Logistic Regression
- Train-Test Split
- Model Training: Fitting a Logistic Regression Model
- Predictions on the Test Set
- Model Evaluation
  - Confusion Matrix
  - Sensitivity
  - Specificity
  - Precision
  - F1 Score
  - Balanced Accuracy
  - ROC Curve & AUC
- K-Fold Cross-Validation
Motivation
What is a classification problem?
Classification involves categorizing data into predefined classes or groups based on their features.
Introduction to Logistic Regression
What is Logistic Regression?
A statistical method used for binary classification tasks.
Predicts the probability of an event occurring (e.g., Yes or No).
Useful for tasks like determining if a customer will make a purchase.
Why Logistic Regression?
Handles binary outcomes effectively.
Provides interpretable results (log-odds, odds ratios, and probability).
Can handle continuous and categorical predictors.
Logistic Regression Framework
Assumptions
Binary Dependent Variable: Logistic regression predicts probabilities for two categories.
Independent Observations: Data points should be independent of each other.
Linearity in the Log-Odds: The log-odds of the dependent variable should be linearly related to the independent variables.
No Multicollinearity: Independent variables should not be highly correlated.
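As a minimal sketch of how the multicollinearity assumption can be screened, the correlation matrix of the numeric predictors in the df simulated earlier gives a first check; values near ±1 would be a warning sign:
# Sketch: screen the numeric predictors for strong pairwise correlation
cor(df[, c("Age", "Income", "Engagement")])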
The Logistic Regression Model
Log-Odds Equation
\[ \text{logit}(P) = \log\left(\frac{P}{1-P}\right) = \beta_0 + \beta_1X_1 + \beta_2X_2 + \dots + \beta_nX_n \]
- \(P\): Probability of the outcome (e.g., Purchase = Yes).
- \(X_1, X_2, \dots, X_n\): Predictor variables.
- \(\beta_0, \beta_1, \dots, \beta_n\): Coefficients.
Converting Log-Odds to Probability
\[ P = \frac{1}{1 + e^{-\text{logit}(P)}} \]
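As a minimal sketch, this conversion can be checked numerically in R; the log-odds value below is hypothetical:
log_odds <- 0.85              # hypothetical logit(P) value
1 / (1 + exp(-log_odds))      # manual logistic transformation
plogis(log_odds)              # base R's logistic function gives the same result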
Generating a Synthetic Business Dataset
In many business scenarios, we aim to predict whether a customer will make a purchase based on collected information, typically demographic and engagement attributes. Our synthetic dataset simulates a simplified version of such real-world data, capturing:
- Age: Customer’s age, distributed around 40 years old with some variation
- Gender: Categorical variable indicating "Male" or "Female"
- Income: Approximate annual income in US dollars
- Engagement: A normalized score (0 to 1) reflecting how actively a customer interacts with the business (e.g., website visits, email clicks, etc.)
- Purchase: A binary outcome ("No" or "Yes") indicating whether the customer decided to buy a product
Exploratory Data Analysis (EDA)
Summary Statistics
summary(df)
EDA plots:
- Age vs. Purchase Probability
- Income vs. Purchase Probability
- Purchase Rate by Gender
- Purchase Probability by Engagement
- Correlation Matrix
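As one example, the "Purchase Rate by Gender" view can be reproduced as a simple proportion table; a minimal base-R sketch:
# Sketch: purchase rate by gender as row proportions
prop.table(table(df$Gender, df$Purchase), margin = 1)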
Train-Test Split
Randomly split the data into train (80%) and test (20%) for unbiased evaluation. This process ensures that the model is trained on one subset of the data and evaluated on another subset, mimicking real-world scenarios where predictions are made on unseen data. It helps to assess how well the model generalizes to new information.
set.seed(123)
train_index <- createDataPartition(df$Purchase, p = 0.8, list = FALSE)
train_data <- df[train_index, ]
test_data <- df[-train_index, ]
table(train_data$Purchase)
Baseline Prediction
The baseline prediction is the proportion of customers who made a purchase:
baseline_prob <- mean(train_data$Purchase == "Yes")
baseline_prob
Interpretation:
- This baseline represents the probability of a purchase if no predictors are used.
- It serves as a reference point to evaluate the added value of the logistic regression model.
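As a minimal sketch of this reference point, the accuracy of a naive classifier that always predicts the training set's majority class can be computed on the test set:
# Sketch: accuracy of always predicting the majority class from the training data
majority_class <- names(which.max(table(train_data$Purchase)))
mean(test_data$Purchase == majority_class)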
Model Training
Fitting a Logistic Regression Model
We fit a logistic regression model on the training data to model the relationship between a binary dependent variable and one or more independent variables. Here, we predict the probability of purchase (Purchase) based on customer attributes (Age, Gender, Income, and Engagement).
This step includes specifying:
- The formula (Purchase ~ Age + Gender + Income + Engagement), which defines the relationship between the dependent and independent variables.
- The method (glm), indicating that we are using a generalized linear model.
- The link function (binomial(link = "logit")), ensuring the model estimates probabilities using the logistic function.
set.seed(123)
train_control <- trainControl(
  classProbs = TRUE,
  summaryFunction = prSummary # <--- uses Precision, Recall, F, AUC (requires the MLmetrics package)
)
model_fit <- train(
  Purchase ~ Age + Gender + Income + Engagement,
  data = train_data,
  method = "glm",
  family = binomial(link = "logit"),
  trControl = train_control,
  metric = "F" # or "AUC", "Precision", "Recall"
)
summary(model_fit$finalModel)
The summary provides details about the model’s coefficients, which indicate the strength and direction of each predictor’s influence on the purchase probability.
Predictions on the Test Set
To evaluate the model, we obtain predictions for the test set. This involves two steps:
- Class Predictions: Predicting the most likely class (e.g., Yes or No) for each observation.
- Probability Predictions: Predicting the probability of each class, which provides more nuanced insights.
This allows us to assess the model’s ability to generalize to unseen data and evaluate its performance using metrics such as accuracy, ROC-AUC, and confusion matrix.
pred_classes <- predict(model_fit, newdata = test_data)
pred_probs <- predict(model_fit, newdata = test_data, type = "prob")
head(pred_classes) # Displays the predicted classes for the first few observations
head(pred_probs) # Displays the predicted probabilities for the first few observations
By analyzing these predictions, we can determine how well the model distinguishes between Yes and No cases and identify areas for improvement.
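For example, class labels can be recovered from the probabilities with an explicit cutoff; the sketch below assumes the default 0.5 cutoff, which is what caret's class predictions correspond to for a two-class model:
# Sketch: thresholding the predicted probabilities at 0.5
manual_classes <- factor(ifelse(pred_probs$Yes >= 0.5, "Yes", "No"),
                         levels = c("No", "Yes"))
table(Manual = manual_classes, Caret = pred_classes)  # should agree at the default cutoff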
Model Evaluation
Confusion Matrix
The confusion matrix provides a summary of prediction results. It compares the predicted and actual classes, offering insights into the model’s classification performance.
|           | Predicted: 0        | Predicted: 1        |
|-----------|---------------------|---------------------|
| Actual: 0 | True Negative (TN)  | False Positive (FP) |
| Actual: 1 | False Negative (FN) | True Positive (TP)  |
Confusion Matrix
Key metrics derived from the confusion matrix include:
| Metric               | Value  |
|----------------------|--------|
| Accuracy             | 0.6950 |
| Sensitivity (Recall) | 0.8492 |
| Specificity          | 0.4324 |
| Precision            | 0.7181 |
| F1 Score             | 0.7782 |
- Accuracy: The proportion of correct predictions.
- Sensitivity (Recall): The model’s ability to identify positive cases.
- Specificity: The model’s ability to identify negative cases.
- Precision: Among predicted positive cases, the proportion that are truly positive.
- F1 Score: The harmonic mean of Precision and Sensitivity, balancing false positives and false negatives.
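One way to obtain such a table and these metrics is caret's confusionMatrix(); a minimal sketch using the pred_classes and test_data objects created earlier:
# Sketch: confusion matrix and derived metrics on the test set
conf_mat <- confusionMatrix(data = pred_classes,
                            reference = test_data$Purchase,
                            positive = "Yes")
conf_mat$table                   # the 2 x 2 table of counts
conf_mat$overall[["Accuracy"]]   # overall accuracy
conf_mat$byClass[c("Sensitivity", "Specificity", "Precision", "F1")]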
Accuracy
|           | Predicted: 0        | Predicted: 1        |
|-----------|---------------------|---------------------|
| Actual: 0 | True Negative (TN)  | False Positive (FP) |
| Actual: 1 | False Negative (FN) | True Positive (TP)  |
\[ \text{Accuracy} = \frac{\text{TP} + \text{TN}}{\text{TP} + \text{TN} + \text{FP} + \text{FN}} \]
Overall effectiveness of the model.
In the context of weather forecasting, for example, accuracy reflects how well a model predicts weather events correctly (e.g., rainy or sunny days).
High accuracy: lots of correct predictions!
Recall (Sensitivity)
|           | Predicted: 0        | Predicted: 1        |
|-----------|---------------------|---------------------|
| Actual: 0 | True Negative (TN)  | False Positive (FP) |
| Actual: 1 | False Negative (FN) | True Positive (TP)  |
\[ \text{Recall} = \frac{\text{TP}}{\text{TP} + \text{FN}} \]
Recall is the fraction of actual positives that are correctly identified.
In criminal justice, it would assess how well a predictive policing model identifies all potential criminal activities (True Positives) without missing any (thus minimizing False Negatives).
High recall: low false-negative rates.
Specificity
|           | Predicted: 0        | Predicted: 1        |
|-----------|---------------------|---------------------|
| Actual: 0 | True Negative (TN)  | False Positive (FP) |
| Actual: 1 | False Negative (FN) | True Positive (TP)  |
\[ \text{Specificity} = \frac{\text{TN}}{\text{TN} + \text{FP}} \]
It is the true negative rate, measuring a model’s ability to correctly identify actual negatives.
Crucial in fields where incorrectly identifying a negative case as positive could have serious implications (e.g., criminal justice).
High specificity: the model is very effective at identifying true negatives.
Precision
|           | Predicted: 0        | Predicted: 1        |
|-----------|---------------------|---------------------|
| Actual: 0 | True Negative (TN)  | False Positive (FP) |
| Actual: 1 | False Negative (FN) | True Positive (TP)  |
\[ \text{Precision} = \frac{\text{TP}}{\text{TP} + \text{FP}} \]
Accuracy of positive predictions.
In email spam detection, it would indicate the percentage of emails correctly identified as spam (True Positives) out of all emails flagged as spam, aiming to reduce the number of legitimate emails incorrectly marked as spam (False Positives).
High precision: low false-positive rates.
F1-Score
|           | Predicted: 0        | Predicted: 1        |
|-----------|---------------------|---------------------|
| Actual: 0 | True Negative (TN)  | False Positive (FP) |
| Actual: 1 | False Negative (FN) | True Positive (TP)  |
\[ \text{F1} = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} \]
Harmonic mean of Precision and Recall.
In a medical diagnosis scenario, it would help in evaluating a test’s effectiveness in correctly identifying patients with a disease (True Positives) while minimizing the misclassification of healthy individuals as diseased (False Positives and False Negatives).
High F1 score: a better balance between precision and recall.
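Putting these formulas together, the sketch below recomputes all five metrics from one set of confusion-matrix counts; the counts are assumed for illustration and were chosen to be consistent with the values reported earlier:
# Assumed counts, consistent with the metrics reported above
TP <- 107; FN <- 19; TN <- 32; FP <- 42
accuracy    <- (TP + TN) / (TP + TN + FP + FN)                # 0.6950
recall      <- TP / (TP + FN)                                 # 0.8492
specificity <- TN / (TN + FP)                                 # 0.4324
precision   <- TP / (TP + FP)                                 # 0.7181
f1          <- 2 * precision * recall / (precision + recall)  # 0.7782
c(Accuracy = accuracy, Recall = recall, Specificity = specificity,
  Precision = precision, F1 = f1)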
ROC Curve & AUC
The ROC (Receiver Operating Characteristic) curve is a graphical representation that illustrates the trade-off between the true positive rate (Sensitivity) and the false positive rate (1 - Specificity) across various threshold values. The AUC (Area Under the Curve) summarizes the ROC curve into a single value, indicating the model’s overall performance.
AUC Interpretation:
- AUC \(\approx\) 1: The model has near-perfect predictive capability.
- AUC \(\approx\) 0.5: The model performs no better than random guessing.
ROC Curve & AUC
The ROC curve and AUC provide a comprehensive evaluation of the model’s discriminatory power.
AUC Value: Indicates the likelihood that the model will correctly distinguish between a positive and a negative instance. Higher values reflect better model performance.
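A minimal sketch of how the ROC curve and AUC can be computed on the test set with pROC, using the pred_probs and test_data objects created earlier:
# Sketch: ROC curve and AUC for the test-set probability predictions
roc_obj <- roc(response = test_data$Purchase,
               predictor = pred_probs$Yes,
               levels = c("No", "Yes"))
plot(roc_obj)   # ROC curve
auc(roc_obj)    # area under the curve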
K-Fold Cross-Validation
K-Fold Cross-Validation is a resampling technique that evaluates model performance by dividing the dataset into \(k\) subsets (folds). The model is trained on \(k-1\) folds and validated on the remaining fold, and the process is repeated \(k\) times so that each fold serves as the validation set exactly once.
Advantages of K-Fold Cross-Validation:
- Reduces overfitting by training and validating on different subsets.
- Provides a more reliable estimate of model performance compared to a single train-test split.
- Ensures the model generalizes well to unseen data.
K-Fold Cross-Validation
# install.packages("MLmetrics")
library(MLmetrics)
set.seed(123)
train_control <- trainControl(
  method = "cv",   # k-fold cross-validation
  number = 10,     # k = 10 folds
  p = 0.8,         # note: p applies to leave-group-out CV and is ignored when method = "cv"
  classProbs = TRUE,
  summaryFunction = prSummary # <--- uses Precision, Recall, F, AUC
)
model_cv <- train(
  Purchase ~ .,
  data = train_data, # or your entire df, depending on your design
  method = "glm",
  family = binomial(link = "logit"),
  trControl = train_control,
  metric = "F" # or "AUC", "Precision", "Recall"
)
# Inspect the cross-validation results:
# model_cv # Displays cross-validated model performance
model_cv$results
Comparing Train-Test Performance and K-Fold Cross-Validation Performance
Assessing Model Generalization
- Train-Test Performance evaluates how well a model generalizes to unseen data from a single split.
- K-Fold Cross-Validation provides a more robust performance estimate across multiple splits.
- Comparing the two ensures consistent performance on unseen data.
Detecting Overfitting or Underfitting
- Overfitting: Train-test performance is much better than K-Fold performance.
- Underfitting: Both metrics show poor performance.
- The comparison highlights issues with model complexity.
Mitigating Sampling Bias
- Single splits (train-test) can introduce sampling bias, especially with small datasets.
- Cross-validation averages performance across folds, reducing bias.
- Ensures reliability of performance estimates.
Understanding Variance in Model Evaluation
- Train-test performance can vary based on split choice.
- K-Fold Cross-Validation offers insights into the variance of the performance estimate.
- Aids in strategies to reduce variance, such as adding more data.
Model Selection and Hyperparameter Tuning
- Cross-validation is used for stable performance estimates during training.
- Train-Test Performance validates the chosen model’s final performance.
- Discrepancies indicate potential overfitting during tuning.
Real-World Applicability
- Train-Test: Simpler, faster, checks practical performance on unseen data.
- K-Fold Cross-Validation: Provides theoretical robustness.
- Bridging evaluation theory and deployment performance.
Comparing Train-Test Performance and K-Fold Cross-Validation Performance
# =========================
# Load required packages
# =========================
library(caret)
library(pROC)
# =========================
# Single Train-Test Performance
# =========================
# 1. Compute confusion matrix to get Precision, Recall, F1
train_test_metrics <- confusionMatrix(
  data = pred_classes,
  reference = test_data$Purchase,
  positive = "Yes"
)
# 2. Calculate AUC on the test set
roc_obj_test <- roc(test_data$Purchase, pred_probs$Yes, levels = c("No", "Yes"))
auc_test <- as.numeric(auc(roc_obj_test))
auc_test
# 3. Obtain 95% CI for the AUC
auc_test_ci <- ci(roc_obj_test, of = "auc") # typically returns c(lower, median, upper)
# 4. Extract the relevant metrics
train_test_results <- list(
  AUC = auc_test,
  Precision = train_test_metrics$byClass[["Pos Pred Value"]],
  Recall = train_test_metrics$byClass[["Sensitivity"]],
  F1 = train_test_metrics$byClass[["F1"]]
)
# =========================
# K-Fold Cross-Validation Metrics
# =========================
# Identify the row with the highest F in model_cv
best_idx <- which.max(model_cv$results$F)
cv_metrics <- model_cv$results[best_idx, ]
# Assuming model_cv$results contains columns like:
# AUC, AUCSD, Precision, PrecisionSD, Recall, RecallSD, F, FSD
# (from a custom prSummary or similar)
cv_results <- list(
  AUC = cv_metrics[["AUC"]],
  AUC_lower = cv_metrics[["AUC"]] - 1.96 * cv_metrics[["AUCSD"]],
  AUC_upper = cv_metrics[["AUC"]] + 1.96 * cv_metrics[["AUCSD"]],
  Precision = cv_metrics[["Precision"]],
  Precision_lower = cv_metrics[["Precision"]] - 1.96 * cv_metrics[["PrecisionSD"]],
  Precision_upper = cv_metrics[["Precision"]] + 1.96 * cv_metrics[["PrecisionSD"]],
  Recall = cv_metrics[["Recall"]],
  Recall_lower = cv_metrics[["Recall"]] - 1.96 * cv_metrics[["RecallSD"]],
  Recall_upper = cv_metrics[["Recall"]] + 1.96 * cv_metrics[["RecallSD"]],
  F1 = cv_metrics[["F"]],
  F1_lower = cv_metrics[["F"]] - 1.96 * cv_metrics[["FSD"]],
  F1_upper = cv_metrics[["F"]] + 1.96 * cv_metrics[["FSD"]]
)
# =========================
# Combine Results
# =========================
comparison <- list(
  Train_Test = train_test_results,
  K_Fold_CV = cv_results
)
comparison
Visualizing Single Train-Test vs K-Fold CV
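One way to build such a visualization is sketched below; it assumes the ggplot2 package is installed and reuses the train_test_results and cv_results lists created above:
# Sketch: side-by-side bar chart of the two sets of metrics (assumes ggplot2)
library(ggplot2)
plot_df <- data.frame(
  Metric   = rep(c("AUC", "Precision", "Recall", "F1"), times = 2),
  Approach = rep(c("Train-Test", "K-Fold CV"), each = 4),
  Value    = c(train_test_results$AUC, train_test_results$Precision,
               train_test_results$Recall, train_test_results$F1,
               cv_results$AUC, cv_results$Precision,
               cv_results$Recall, cv_results$F1)
)
ggplot(plot_df, aes(x = Metric, y = Value, fill = Approach)) +
  geom_col(position = "dodge") +
  labs(title = "Single Train-Test vs. K-Fold CV", y = "Estimate")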
Summary
Main Takeaways from this lecture:
Logistic Regression Framework
- Logistic regression is a powerful method for binary classification.
- Key assumptions:
  - Binary dependent variable.
  - Linearity in the log-odds.
  - No multicollinearity among predictors.
Evaluation Metrics
Confusion Matrix provides:
- Sensitivity (Recall)
- Specificity
- Precision
- F1 Score
ROC Curve and AUC evaluate model discrimination across thresholds.
Train-Test Split:
- Ensures the model is trained and tested on separate datasets.
- Baseline accuracy serves as a benchmark for improvement.
K-Fold Cross-Validation:
- Reduces overfitting and sampling bias.
- Provides a reliable estimate of model performance.
Performance Comparison
- Cross-validation metrics give a robust view of generalization.
- Train-test split results provide practical insights for deployment.
- Use both approaches for a comprehensive evaluation.