Data Wrangling
“Happy families are all alike; every unhappy family is unhappy in its own way.”
— Leo Tolstoy
“Tidy datasets are all alike; every messy dataset is messy in its own way.”
— Hadley Wickham
In the real world, most data will not be in the format we desire. In this class, we will learn how to process data so that it becomes tidy according to our objectives.
All the tools we will see are part of the tidyverse, a set of packages for data manipulation.
This set of packages follows principles that are becoming the standard for data analysis in R
.
The %>%
or |>
makes the order of operations clearer, making code easier to read and understand.
Data Wrangling (Data Cleaning / Data Munging):
The process of transforming raw, messy data into a clean and structured format that is ready for analysis.
It often involves:
Improved Data Quality:
Reliable analyses start with clean, accurate, and consistent data.
Efficiency in Analysis:
Data prepared in a structured and “tidy” format reduces the time spent fixing errors and allows focus on actual insights.
Better Decision Making:
High-quality, well-structured data leads to more meaningful interpretations, stronger conclusions, and actionable insights.
It is advisable to adopt a project-based workflow in your data analysis (Workflow: projects).
Using a consistent project structure optimizes collaboration and sharing of your analysis.
For this lecture, we will use the nycflights13
package, which comes with several related datasets about flights departing from NYC in 2013.
The flights
dataset includes information on all US domestic flights departing NYC in 2013. Some key variables include:
year
, month
, day
: date of departuredep_delay
, arr_delay
: departure and arrival delays in minutescarrier
: airline carrierorigin
and dest
: origin and destination airportsIn the tidyverse
, we use verbs (functions) that describe step-by-step data manipulation tasks:
filter()
: Select rowsarrange()
: Sort observationsselect()
and rename()
: Select or rename columnsmutate()
: Create new variablescase_when()
: Recode variablessummarise()
and group_by()
: Summarize informationpivot_wider()
and pivot_longer()
: Reshape dataleft_join()
: Merge datasets%>%
or |>
select()
The flights
dataset has many columns. We can choose only the ones we need:
select()
rename()
If we want to rename a variable:
mutate()
mutate()
We can create new variables based on existing ones. For example, create a total delay variable:
transmute()
transmute()
works similarly, but keeps only the newly created variables:
case_when()
We can recode continuous variables into categories. For example, classify flights as “On Time” if arr_delay <= 0
and “Delayed” otherwise:
case_when()
We can also have more complex conditions:
filter()
filter()
We can select rows based on conditions. For example, flights in January:
Or flights in January operated by United Airlines (carrier “UA”):
&
means AND|
means OR!
means NOTfilter()
We can store results in a new object using logical operators:
arrange()
arrange()
We can sort rows. For example, sort by departure delay ascending:
arrange()
For descending order:
summarise()
and group_by()
summarise()
computes summary statistics. Paired with group_by()
, it creates summaries within groups. For example, find the average departure delay by airline carrier:
Useful functions include mean()
, sum()
, median()
, sd()
, n()
, and n_distinct()
.
group_by()
can be combined with other operations:
After grouping and summarizing, always ungroup()
to return to the normal structure.
pivot_wider()
and pivot_longer()
Let’s create a summary and then reshape it. For example, find the average arrival delay by carrier and month, then pivot to a wide format.
Our data is currently in long format.
pivot_wider()
: from long to wide# Create a summary of mean arrival delay by carrier and month
delay_summary <- flights %>%
group_by(carrier, month) %>%
summarise(mean_arr_delay = mean(arr_delay, na.rm = TRUE)) %>%
ungroup()
# Pivot wider: carriers in rows, months in columns
delay_wide <- delay_summary %>%
pivot_wider(names_from = month, values_from = mean_arr_delay)
delay_wide
pivot_longer()
: from wide to longIf we want to go back to the long format:
“There are two certainties in life: death and bad merges”
The flights
dataset can be connected to others:
planes
by: tailnum
.airlines
by: carrier
.airports
by: origin
and dest
.weather
by: origin
, year
, month
, day
, hour
.full_join()
inner_join()
right_join()
left_join()
For instance, to attach airline names to flights
:
Some interesting resources:
%>%
or |>
to streamline codeData Mining Lab