Key Steps of Performing Data Mining in R
August 01, 2024
Lesson 06 Exercise Review
Lesson Question!
The 8 Key Steps of a Data Mining Project
Step 1: Define the project’s goal based on available data
The project objective serves as the guiding light.
Car Data Mining Project Example: predict a car’s fuel efficiency (measured as miles per gallon, mpg) based on key factors or attributes of the car.
Example Project: Car’s fuel efficiency (mpg) influenced by:
The dataset becomes the mine where insights are excavated.
Step 2: Acquire Analysis Tools
Use R for data analysis
R includes an extensive collection of packages and functions
Example R packages:
dplyr
tidyr
Data preparation: transformation, cleaning, and preprocessing
Filtering out missing or invalid values
Transforming and restructuring data
Cleaning and Filtering: is.na(), complete.cases(), filter()
Data Transformation: scale(), normalize(), log()
Data Restructuring: pivot_longer(), pivot_wider()
Data Type Conversion: as.numeric(), as.Date(), as.factor()
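The preparation steps above can be sketched in a few lines. This is a minimal illustration, not the lecture's own code: it assumes R's built-in mtcars dataset stands in for the car data, uses only base R functions from the list above, and injects one missing value artificially so that the cleaning step has something to remove.

```r
# Assumption: the built-in mtcars dataset stands in for the car data.
cars <- mtcars[, c("mpg", "wt", "hp", "cyl")]

# Inject one missing value purely for illustration (mtcars has none).
cars$hp[3] <- NA

# Cleaning and filtering: count missing values, then keep complete rows.
n_missing <- sum(is.na(cars))
cars_clean <- cars[complete.cases(cars), ]

# Data transformation: standardize weight and log-transform horsepower.
cars_clean$wt_scaled <- as.numeric(scale(cars_clean$wt))
cars_clean$log_hp <- log(cars_clean$hp)

# Data type conversion: treat cylinder count as a categorical factor.
cars_clean$cyl <- as.factor(cars_clean$cyl)

# Inspect the structure of the prepared data.
str(cars_clean)
```

With dplyr loaded, the row filtering could equally be written as filter(cars, !is.na(hp)); the base R form is used here only to keep the sketch free of package dependencies.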
Summarize key insights from the core of the data to identify trends, correlations, and high-level patterns
Techniques:
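As one possible illustration of summarizing the data, the sketch below computes descriptive statistics, a correlation matrix, and a grouped mean. It assumes the built-in mtcars dataset as the car data; the choice of columns and the grouping by cyl are illustrative assumptions, not taken from the lecture.

```r
# Assumption: mtcars stands in for the car data; columns chosen for illustration.
cars <- mtcars[, c("mpg", "wt", "hp")]

# Descriptive statistics (min, quartiles, mean, max) for each variable.
summary(cars)

# Pairwise correlations to spot linear relationships between variables.
cor_matrix <- cor(cars)
round(cor_matrix, 2)

# High-level pattern: average mpg for each cylinder count.
avg_mpg_by_cyl <- aggregate(mpg ~ cyl, data = mtcars, FUN = mean)
avg_mpg_by_cyl
```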
Visual representation of data is crucial for interpretation. It is used to communicate insights visually, identify relationships, and detect outliers.
Techniques:
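A minimal sketch of three common plot types using base R graphics, again assuming the built-in mtcars dataset. The specific plots and labels are illustrative choices rather than the lecture's own examples.

```r
# Assumption: mtcars stands in for the car data.

# Scatter plot: relationship between weight and fuel efficiency.
plot(mtcars$wt, mtcars$mpg,
     xlab = "Weight (wt)", ylab = "Miles per gallon (mpg)",
     main = "mpg vs. weight")

# Histogram: distribution of fuel efficiency.
hist(mtcars$mpg, xlab = "Miles per gallon (mpg)",
     main = "Distribution of mpg")

# Box plot: points beyond the whiskers are flagged as potential outliers.
hp_box <- boxplot(mtcars$hp, ylab = "Horsepower (hp)",
                  main = "Horsepower")

# Values that the box plot drew as outlier points.
hp_box$out
```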
Build models to uncover hidden patterns and relationships
Common techniques:
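The linear model discussed here can be sketched as follows. The source does not name the dataset; this sketch assumes R's built-in mtcars data, on which the fit reproduces the coefficients quoted in the model output (about 37.23, -3.88, and -0.03).

```r
# Assumption: the car data is R's built-in mtcars dataset.
# Fit a linear model predicting mpg from weight and horsepower.
model <- lm(mpg ~ wt + hp, data = mtcars)

# Full report: coefficients, standard errors, p-values, R-squared.
summary(model)

# Extract individual pieces of the fit.
coefs <- coef(model)                       # intercept and slopes
r_squared <- summary(model)$r.squared      # proportion of variance explained
p_values <- summary(model)$coefficients[, "Pr(>|t|)"]

round(coefs, 2)
```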
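Linear Model (lm): predicts mpg using wt (weight) and hp (horsepower).
summary(model): reports the fitted coefficients, R-squared, and p-values.
Model Output:
Intercept: 37.23 (baseline mpg when wt and hp are zero).
Coefficients:
For each one-unit increase in wt, mpg decreases by 3.88 units, holding hp constant.
For each one-unit increase in hp, mpg decreases by 0.03 units, holding wt constant.
R-squared: the proportion of variance in mpg explained by the model.
Significance: wt and hp are significant predictors (p-values < 0.05).
Validate model accuracy and generalizability to ensure that the model performs well on unseen data and avoids overfitting.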
Techniques:
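Cross-validation: trainControl(method = "cv", number = 10) sets up 10-fold cross-validation.
train() function: fits a linear model (lm) using wt (weight) and hp (horsepower) as independent variables to predict mpg.
Model Output: includes R-squared (the proportion of variance in mpg explained by the model).
Purpose: estimate how well the model performs on unseen data and guard against overfitting.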
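The cross-validation setup described above might look like the following. This is a sketch under assumptions: it uses the caret package (which provides trainControl() and train()), the built-in mtcars dataset as the car data, and an arbitrary seed so the random fold assignment is repeatable.

```r
# Assumption: caret is installed; mtcars stands in for the car data.
library(caret)

# Arbitrary seed so the random fold assignment is repeatable.
set.seed(123)

# Set up 10-fold cross-validation.
ctrl <- trainControl(method = "cv", number = 10)

# Fit a linear model on each training split and evaluate on the held-out fold.
cv_model <- train(mpg ~ wt + hp,
                  data = mtcars,
                  method = "lm",
                  trControl = ctrl)

# Performance averaged across the 10 folds.
print(cv_model)
cv_model$results[, c("RMSE", "Rsquared", "MAE")]
```

Because the metrics are computed on held-out folds, they give a more honest estimate of performance on unseen data than the R-squared from fitting on the full dataset.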
Interpret results and apply findings in decision-making
Key considerations:
Implementation:
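One way to apply the fitted model in practice is to predict fuel efficiency for a car that is not in the data. The sketch below assumes the built-in mtcars dataset (where wt is recorded in units of 1000 lbs); the new car's weight and horsepower are hypothetical values chosen for illustration.

```r
# Assumption: mtcars stands in for the car data.
model <- lm(mpg ~ wt + hp, data = mtcars)

# Hypothetical new car: wt = 3.0 (i.e., 3000 lbs in mtcars units), hp = 150.
new_car <- data.frame(wt = 3.0, hp = 150)

# Point prediction of mpg for the new car.
predicted_mpg <- predict(model, newdata = new_car)
predicted_mpg

# 95% prediction interval, to convey uncertainty when making decisions.
predict(model, newdata = new_car, interval = "prediction", level = 0.95)
```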
Main Takeaways from this lecture:
8 Key Steps of Data Mining:
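Data Preparation: cleaning, transforming, and restructuring data; dplyr and tidyr streamline these tasks.
Modeling: build models to uncover patterns and relationships (e.g., predicting mpg from wt and hp).
Model Validation: use cross-validation to check that the model performs well on unseen data and avoids overfitting.
Interpretation & Implementation: interpret results and apply findings in decision-making.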
Data Mining Lab