QTM 350: Data Science Computing

Topic 04: Version Control

Professor: Davi Moreira

May 15, 2024

Topic Overview

  • Version Control

    • Data Science Workflow
    • Reproducibility
    • Git and GitHub


Project management

Taming chaos

In the data science workflow, there are two sorts of surprises and cognitive stress:

  1. Analytical (often good)
  2. Infrastructural (almost always bad)

Analytical surprise is when you learn something from or about the data.

Infrastructural surprise is when you discover that:

  • You can’t find what you did before.
  • The analysis code breaks.
  • The report doesn’t compile.
  • The collaborator can’t run your code.

Good project management lets you focus on the right kind of stress.

Keeping Future-you happy

  • It’s often tempting to set up a project assuming that you will be the only person working on it, e.g. as homework.
  • That’s almost never true.
  • Coauthors and collaborators happen to the best of us.
  • Even if not, there’s someone else who you always have to keep happy: Future-you.
  • Future-you is really the one you organize your projects for.
  • Most importantly, Future-you is the one who will enjoy the fruits of your data science labor, or have to fight through your chaos.
  • So, be kind to Future-you. Establish a good workflow. You’ll thank yourself later.

Project setup

You should always think in terms of projects.

A project is a self-contained unit of data science work that can be

  • Shared
  • Recreated by others
  • Packaged
  • Dumped

A project contains

  • Content, e.g., raw data, processed data, scripts, functions, documents and other output

  • Metadata, e.g., information about tools for running it (required libraries, compilers), version history

For R projects, for example:

  • Projects are folders/directories.
  • Metadata is the RStudio project (.Rproj) files (perhaps augmented with the output of renv for dependency management) and .git.

Setup: the folder structure

Structuring your working directory

  • One folder contains everything inside it.
  • Directories keep things separate that should be separated.
  • You decide on the fundamental structure. The project decides on the details.

Further thoughts

  • Ideally, your project folder can be relocated without problem.
  • Keep input separate from output. Definitely separate raw from processed data!
  • Structure should be capable of evolution. More data, cases, models, output formats shouldn’t be a problem.
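One possible layout, sketched in the shell. The folder names (data/raw, data/processed, code, figures, report) are illustrative assumptions, not a prescribed standard; adapt them to your project.

```shell
# Create a throwaway project skeleton (all folder names are illustrative assumptions).
project="$(mktemp -d)/my_project"
mkdir -p "$project"/data/raw "$project"/data/processed \
         "$project"/code "$project"/figures "$project"/report

# Inputs and outputs live in separate folders; raw data is kept apart from processed data.
find "$project" -type d | sort
```

Because everything sits under one top-level folder, the whole project can be moved or shared as a single unit.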

Setup: the paths

Good paths

  • All internal paths are relative.
  • They are invariant to moving/sharing the project.
  • Examples:
    • "preprocessing.py"
    • "figures/model-1.png"
    • "../data/survey.csv"

Bad paths

  • Absolute paths are bad paths. Don’t feed functions with paths like "/Users/me/data/thing.sav".
  • Those paths will not work outside your computer (or maybe not even there, some days/weeks/months ahead).
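A small sketch of why relative paths survive relocation. The project and file names are hypothetical, and everything happens in a temporary folder.

```shell
# Build a tiny project in a temporary location (all names are hypothetical).
base="$(mktemp -d)"
mkdir -p "$base/my_project/data"
echo "id,answer" > "$base/my_project/data/survey.csv"

# From the project root, a relative path finds the file.
cd "$base/my_project"
head -n 1 data/survey.csv

# Move the whole project; the same relative path still works.
cd "$base"
mv my_project relocated_project
cd relocated_project
head -n 1 data/survey.csv
```

An absolute path pointing into the old my_project location would have stopped working after the move.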

The working directory

  • If you use VS, open the folder/directory where your project is.
  • If you use RStudio, open it by clicking on the script you want to work with. This will set the location of the script as working directory (which should be your working assumption, too).
  • Better yet, have the metadata set it for you:
    • Open your session by opening (choosing, clicking on) myproject.Rproj
    • Then you’ll get the path set for you.

Setup: the code structure

Naming scripts

  • Files should have short, descriptive names that indicate their purpose.
  • I recommend the use of telling verbs.
  • Names should only include letters and numbers with dashes - or underscores _ to separate words.
  • Use numbering to indicate the order in which files should be run:
    • 00-setup.py
    • 01-import-data.py
    • 02-preprocess-data.py
    • 03-describe-uptake.py
    • 04-analyze-uptake.py
    • 05-analyze-experiment.py
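A quick sketch of what the numeric prefix buys you, assuming the scripts sit together in one folder. The loop only prints the names; it does not run anything.

```shell
# Create empty placeholder scripts in a scratch folder (names taken from the list above).
scratch="$(mktemp -d)"
cd "$scratch"
touch 02-preprocess-data.py 00-setup.py 01-import-data.py

# Shell globs expand in sorted order, so the numeric prefix fixes the run order.
for script in [0-9][0-9]-*.py; do
  echo "would run: $script"
done
```

The files were created out of order on purpose; the listing still comes back as 00, 01, 02.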

Modularizing scripts

  • Write short, modular scripts. Every script serves a purpose in your pipeline.
  • This makes things easier to debug.
  • At the beginning of a script you might want to document input and output.

Setup: the code structure

Talk to Future-you

  • Describe your code, e.g. by starting with a description of what it does. If you comment/describe a lot, consider using an R Markdown (.Rmd) file instead of a simple .R script.
  • Put the setup first (e.g., library() and source()).
  • You might want to outsource the loading of packages to a separate script that is sourced in the first step (source("functions.R")), or simply make it the first script in the pipeline.
  • Always comment more than you usually do.

Structuring your code

  • Even with modularized code, scripts can become long. Structure helps to keep an overview.
  • Use commented lines as section/subsection heads. Many IDEs have features that help with this.
  • RStudio, for example, creates a “table of contents” when you mark code sections as follows (# followed by a title and at least four dashes):
# Import data --------------

dat <- read_csv("dat.csv")

Setup: the rest

More things to consider

  • There’d be more to say on how to establish a good project workflow, including how to

    • store/organize raw and derived data,
    • deal with output in form of graphs and tables,
    • link everything together from start (project setup) to finish (knitting the report)
    • separate coding for the record and experimental coding.
  • There’s limited value in teaching you all that upfront.

  • The truth is: You’ll likely refine your own workflow over time. Hopefully, I just saved you some initial pain.

  • Do check out other people’s experiences and opinions.

Managing your project in two simple steps

Version Control

What is Version Control?

  • Version Control is a way to track your files

  • The history is usually saved as a series of snapshots and branches, which you can move back and forth between

  • Version Control allows you to view how the project has progressed over time

  • It allows you to:

    • Distribute your file changes over time
    • Protect against data loss/damage by creating backup snapshots
    • Manage complex project structures (e.g. Linux)

Why version control?

phdcomics.com

More reasons to do version control

Have you ever…

  • Changed your code, realized it was a mistake and wanted to revert back?
  • Lost code or had a backup that was too old?
  • Wanted to see the difference between different versions of your code?
  • Wanted to review the history of some code?
  • Wanted to submit a change to someone else’s code?
  • Wanted to share your code, or let other people work on your code?
  • Wanted to see how much work is being done, when, and by whom?
  • Wanted to experiment without interfering with working code?

Git and GitHub

Git(Hub) solves this problem


  • Git is a distributed version control system.
  • Imagine if your Dropbox (or Google Drive, or MS OneDrive for that matter) and the “Track changes” feature in MS Word had a baby.
  • In fact, it’s even better than that because Git is optimized for the things that data scientists spend a lot of time working on - code!
  • There is a learning curve, but it’s worth it.
  • Being familiar with Git is taken for granted when you interact with other data scientists.
  • It is far from the only version control software, but certainly the most popular one.
  • According to StackOverflow’s 2021 Developer Survey, more than 93% of respondents report using Git, more than any other tool.


  • It’s important to realize that Git and GitHub are distinct things.
  • GitHub is an online hosting platform that allows you to host your code online.
  • It relies on Git and makes some of its functionality more accessible.
  • Also, it provides many more useful features to collaborate with others. (Similar platforms include Bitbucket and GitLab.)
  • Just like we don’t need RStudio to run R code, we don’t need GitHub to use Git… But it will make our lives easier.

Git: some background

Where does Git come from?

  • Git was created in 2005 by Linux creator Linus Torvalds.
  • The initial motivation was to have a non-proprietary version control system to manage Linux kernel development.
  • Check out this (quite opinionated) talk by Linus Torvalds on Git two years after its creation.

What’s the meaning of Git?

  • Anything, apparently.
  • Also, it’s pronounced [ɡɪt], not [d͡ʒɪt].

How to interact with Git?

GitHub: some background

Where does GitHub come from?

What’s the business model?

  • GitHub offers various subscription plans and has expanded its services beyond hosting Git-based version control.

Some interesting facts

  • GitHub’s mascot is “Octocat”, a human-cat-octopus hybrid with five arms.

  • There are 56m+ developers on GitHub, with 60m+ new repositories created in 2020 alone.

  • GitHub’s history includes controversies around issues like harassment allegations and incidents of censorship.

Git(Hub) for scientific research

From software development…

  • Git and GitHub’s role in global software development is not in question.
  • There’s a high probability that your favourite app, program or package is built using Git-based tools. (RStudio is a case in point.)

… to scientific research

  • Data science involves product building, collaboration, transparency. GH helps with all that.
  • Journals have increasingly strict requirements regarding reproducibility and access. GH makes this easy (DOI integration, off-the-shelf licenses, etc.).
  • My website lives there. And this course does, too.
  • “Democratic databases: science on GitHub” (Perkel, 2016, Nature).

Getting started with Git and GitHub

First step: register a GitHub account

Good news: It’s free!

Simply go to https://github.com to sign up.

Some things to consider:

  • As a student, you qualify for a free GitHub Pro account.
  • The Pro account comes with a couple of additional features.
  • Register for a free account first, then pursue the special offers.
  • Choose your username wisely. This isn’t Instagram, so maybe avoid puns and “funny” nicknames.

Second step: install Git

Again, Git is an independent piece of software. You need to have it installed on your machine to call it from the Command Line or RStudio.

Chances are that that’s already the case. Here’s how you can check using the command line:

which git
/usr/bin/git

And here’s how you can check the version:

git --version
git version 2.39.3 (Apple Git-146)
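If you prefer a check that also tells you when Git is missing, the following sketch uses the shell built-in command -v. The variable name git_state is just an illustration.

```shell
# Portable check: `command -v` succeeds only if the shell can find the program.
if command -v git >/dev/null 2>&1; then
  git_state="installed: $(git --version)"
else
  git_state="not installed"
fi
echo "Git is $git_state"
```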

If you want to install (or update) Git on your Mac/Linux machine, I recommend using Homebrew, “the missing package manager for macOS (or Linux)”:

brew install git

To install/update Git for Windows, check out happygitwithr.com.

Third step: introduce yourself to Git

This is particularly important when you work with Git but without the GitHub overhead. The idea is to define how your commits are labelled. Others should easily identify your commits as coming from you.

Have you already introduced yourself to Git? Find out:

git config --list

Still have to introduce yourself? To that end, we set our user name and email address like this:

git config --global user.name 'davimoreira'
git config --global user.email 'davi.moreira@example.com'

The user name can be (but does not have to be) your GitHub user name. The email address should definitely be the one associated with your GitHub account.

Check out these setup instructions from Software Carpentry to learn about more configuration options.
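The commands above set your identity for every repository on your machine. As a sketch of a related option, leaving out --global stores the setting for a single repository only. The name and address below are placeholders, and the repository is a throwaway one in a temporary folder.

```shell
# Throwaway repository so the global configuration is left untouched.
repo="$(mktemp -d)"
cd "$repo"
git init -q

# Without --global, the setting is stored in this repository's .git/config only.
# The name and address are placeholders.
git config user.name "Demo User"
git config user.email "demo.user@example.com"

# Read back single values instead of scanning the full --list output.
git config user.name
git config user.email
```

A per-repository identity can be handy when one project should use a different email address than your default.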

Git from the shell

Why bother with the shell?

Some benefits of the shell:

  • The shell is powerful and flexible. It lets you do things that the RStudio Git GUI can’t (we will see it later).

  • Working in the shell is potentially more appropriate for projects that aren’t primarily based in R.

  • Knowing the basic Git commands in the shell is a good thing for a data scientist.

The Git Workflow

  • Git goes through a long chain of operations and tasks before tracking a change.

  • Many of these tasks are user controlled, and are required for changes to be tracked correctly.

Repositories

  • Repositories, usually called ‘repos’, store the full history and source control of a project.

  • They can either be hosted locally, or on a shared server, such as GitHub.

  • Many repositories are hosted on GitHub; core contributors keep copies of the repository on their own machines and keep them in sync using the push/pull system.

  • Any repository stored somewhere other than locally is called a ‘remote repository’.

Repos vs Directories

  • Repositories are timelines of the entire project, including all previous changes.

  • Directories, or ‘working directories’ are projects at their current state in time.

  • Any local directory interacting with a repository is technically a repository itself. However, it is better to call these directories ‘local repositories’, as they are instances of a remote repository.

Workflow Diagram


  • This diagram shows how the basic Git workflow works.

  • The staging area is the bundle of all the modifications to the project that are going to be committed.

  • A ‘commit’ is similar to taking a snapshot of the current state of the project, then storing it on a timeline.
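The three stages (working directory, staging area, commit) can be sketched in a few commands. This runs in a throwaway repository; the file name and the identity are placeholders.

```shell
# Throwaway repository with a local placeholder identity so the commit succeeds.
repo="$(mktemp -d)"
cd "$repo"
git init -q
git config user.name "Demo User"
git config user.email "demo.user@example.com"

echo "first draft" > notes.txt   # 1. change something in the working directory
git status --short               #    "?? notes.txt": untracked, not yet staged
git add notes.txt                # 2. put the change into the staging area
git status --short               #    "A  notes.txt": staged for the next commit
git commit -q -m "add notes"     # 3. store the staged snapshot in the history
git log --oneline                #    one line per commit on the timeline
```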

Hands on! Creating a New Repository

  1. I will create a new folder/directory on my computer: my_project

  2. Open the bash Terminal and move to the my_project directory

  3. I will copy the my_project directory path into a text document

  4. I will try to add this folder to the staging area.

git add .
  5. Error!

  6. We need to initialize the repository. Do not do that in your root directory!

git init

Hands on! Adding/Removing files from the repo

  1. Open the bash Terminal and move to the my_project directory

  2. Check the status of the staging area:

git status
  3. Let’s add the path file to the staging area. Then, check its status.
git add path.txt
git status

Hands on! Adding/Removing files from the repo

  1. Let’s create new files in the project and check staging area status:
touch file.txt
touch script.py
touch report.html
touch style.css
git status

Hands on! Adding/Removing files from the repo

  1. Instead of adding files to the staging area one at a time, we can add them all together:
git add .
git status

Hands on! Adding/Removing files from the repo

  1. In the directory, let’s delete the .py and the .css files. Check the status:
rm -f script.py
rm -f style.css
git status

Hands on! First Commit

  1. (cont.) In the directory, let’s do our “initial commit” :
git commit -m "initial commit"

Hands on! First Commit

  1. To check all commits in your repo:
git log
  • Most important things here are:

    • commit id;
    • date/time;
    • branch;
    • commit message;
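To see just those four pieces of information, one option is a custom log format. The sketch below builds a throwaway repository first; the placeholders %h, %ad, %d and %s are standard git log format codes.

```shell
# Throwaway repository with one commit (file name and identity are placeholders).
repo="$(mktemp -d)"
cd "$repo"
git init -q
git config user.name "Demo User"
git config user.email "demo.user@example.com"
echo "hello" > file.txt
git add file.txt
git commit -q -m "initial commit"

# %h = short commit id, %ad = author date, %d = branch/ref names, %s = commit message
git log --date=short --pretty=format:'%h | %ad |%d | %s'
```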

Hands on! Git Checkout

  1. First, let’s make new commits in our repo:
touch script.py                      # new file
touch webpage.html                   # new file
touch style.css                      # new file
git add .                            # add files to the staging area
git commit -m "adding files"         # new commit
echo "Hello you" >> file.txt         # edit the .txt file
git add .                            # add changes to the staging area
git commit -m "editing file.txt"     # new commit
rm -f file.txt                       # remove the file
git commit -a -m "delete file.txt"   # new commit; -a stages the deletion
git log                              # let's check

Hands on! Git Checkout

  1. We use checkout to go back in time to a given commit. Let’s go back to the “initial commit” (replace the id below with the one shown by your own git log):
git checkout 2b543ff2b3423a6d01727a11603792783315680d
git log
  • Check your folder/directory!

  • Important: doing this does not delete our commits. We just move back in time!

Hands on! Git Checkout

  1. To “move back to the future”, the most recent commit, we just need to go back to the main branch (called master in some Git setups):
git checkout main
git log
  • Check your folder/directory!

  • Important: doing this does not delete our commits. We just move back and forth in time!
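A self-contained sketch of the round trip, in a throwaway repository. Two shortcuts here are not used on the slides: git rev-list --max-parents=0 looks up the first commit so no id has to be copied by hand, and git checkout - returns to the branch you came from, whatever it is called.

```shell
repo="$(mktemp -d)"
cd "$repo"
git init -q
git config user.name "Demo User"              # placeholder identity
git config user.email "demo.user@example.com"

echo "v1" > file.txt && git add file.txt && git commit -q -m "initial commit"
echo "v2" > file.txt && git commit -q -a -m "second commit"

# Look up the id of the first commit instead of copying it by hand.
first_commit="$(git rev-list --max-parents=0 HEAD)"

git checkout -q "$first_commit"   # go back in time (detached HEAD)
cat file.txt                      # prints v1
git checkout -q -                 # "-" returns to the branch we came from
cat file.txt                      # prints v2
```

Both commits are still in the history afterwards; only the working directory moved back and forth.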

Hands on! .gitignore

  1. Let’s create a new text file, notes.txt, and make some edits.
touch notes.txt   
echo "Welcome to Data Science Computing" >> notes.txt
git status
git add .
git status
echo "I hope you enjoy" >> notes.txt
git status                          

Hands on! .gitignore

  1. Let’s say we do not want to track this file (can be a folder or many files).

  2. To do so, we tell Git to ignore those files or folders. We create a .gitignore file.

touch .gitignore
  3. In the .gitignore we list everything we want Git to ignore.
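A sketch of what such a list can look like, in a throwaway repository. The file and folder names are examples only; git check-ignore is a standard command that reports which pattern matches a path.

```shell
repo="$(mktemp -d)"
cd "$repo"
git init -q
echo "scratch" > notes.txt
mkdir output && echo "x" > output/table.csv
echo "print('hi')" > script.py

# One pattern per line; these entries are examples only.
cat > .gitignore <<'EOF'
# a single file
notes.txt
# a whole folder
output/
# every file with this extension
*.log
EOF

git status --short             # lists .gitignore and script.py, but not notes.txt or output/
git check-ignore -v notes.txt  # shows which .gitignore line matches the file
```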

Hands on! .gitignore

  1. Let’s add them to the staging area and do a new commit.
git status
git add .
git status
git commit -m "added gitignore"

Hands on! .gitignore

  1. Now, let’s edit the notes.txt file.
  2. Check git status
  3. Because notes.txt was already tracked before it was listed in .gitignore, Git keeps tracking it. We need to update the cached files list, commit these updates and…
  4. Let’s edit the notes.txt file one more time.
  5. Finally, let’s check if the notes.txt file is being tracked.
echo "New line in the file" >> notes.txt
git status
git rm -r --cached .
git add .
git commit -m "fixed file tracking"
echo "The previous edition was not the last one" >> notes.txt
git status
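When only one file needs to stop being tracked, a narrower variant is to name that file instead of clearing the whole cache. This is a sketch in a throwaway repository with placeholder names, not a replacement for the steps above.

```shell
repo="$(mktemp -d)"
cd "$repo"
git init -q
git config user.name "Demo User"              # placeholder identity
git config user.email "demo.user@example.com"

echo "private" > notes.txt
git add notes.txt && git commit -q -m "track notes by mistake"

echo "notes.txt" > .gitignore
git rm -q --cached notes.txt          # remove from the index only; the file stays on disk
git add .gitignore
git commit -q -m "stop tracking notes.txt"

echo "another edit" >> notes.txt
git status --short                    # prints nothing: notes.txt is now ignored
```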

Git branches

Git branches


  1. Git branches are a way to create separate development paths without overriding or creating copies of your project.

  2. Branches can be added, deleted, and merged, just like regular commits.

Git branches

Branches can be used to:

  • Create separate development paths without overriding progress
  • Separate different end goals of your project
  • Create separate branches for each stage of development (release, development, fixes, master)
  • Create separate branches for project collaboration!
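A compact sketch of the first point: work on a branch stays out of the default branch until it is merged. The branch and file names are hypothetical, and the default branch name is looked up because it may be main or master depending on the Git setup.

```shell
repo="$(mktemp -d)"
cd "$repo"
git init -q
git config user.name "Demo User"              # placeholder identity
git config user.email "demo.user@example.com"
echo "stable" > app.txt && git add app.txt && git commit -q -m "initial commit"

# The default branch may be called main or master, so look it up.
default_branch="$(git symbolic-ref --short HEAD)"

git checkout -q -b experiment            # create a branch and switch to it
echo "risky idea" > idea.txt
git add idea.txt && git commit -q -m "try an idea"

git checkout -q "$default_branch"        # back on the default branch
ls                                       # idea.txt is absent here
git merge -q experiment                  # bring the work over once it is ready
ls                                       # idea.txt is now present
git branch -d experiment                 # the merged branch can be deleted
```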

Git branches: Hands on!

  1. Let’s create a new_project directory.
mkdir new_project
cd new_project   # move into the new directory before initializing
git init
mkdir src
mkdir lib
cd src/
touch file.txt
touch script.js
cd ..
cd lib/
touch lib.py
touch file2.txt
cd ..
git status

Git branches: Hands on!

  1. Let’s do our initial commit, create a new branch dev, and switch to the new branch.
  2. Then, let’s go back to the main branch, create a new one bugs, and list them
git add .
git commit -m "initial commit"
git checkout -b dev    # create the dev branch and switch to it
git checkout main      # go back to the main branch
git branch bugs        # create the bugs branch without switching to it
git branch -a          # list all branches
git branch -d bugs     # to delete

Git branches: Hands on!

  1. Let’s work on our dev branch.
  2. Add a version.txt file
  3. Then, let’s merge our branches.
  4. To move on in our workflow, we can move back to the dev branch and do a new commit and keep working in the dev branch.
git checkout dev
echo "new text" >> file.txt
git add .
git commit -m "new file.txt content"
echo "new text" >> myfile.txt
git add .
git commit -m "new myfile.txt content"
git log
touch version.txt
echo "1.0" >> version.txt
git status
git add .
git commit -m "release v1.0"
git checkout main
git merge dev
git log
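git checkout dev                       # back to the dev branch
# edit or add files here first; without changes, Git has nothing to commit
git add .
git commit -m "starting new version"
git log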

Check this material for more on Merge & Branches.

Git and GitHub

GitHub: Creating a New Repository

  1. First, we create a new repo on GitHub

GitHub: Creating a New Repository

  1. We copy the remote repo address.

GitHub: Creating a New Repository

  1. We connect our local Git repo with the remote GitHub repo
cd Desktop/qtm_350_24S_02
git init
git remote add origin https://github.com/davi-moreira/qtm_350_24S_02.git
git remote -v

The Push and Pull System

  • The Git/GitHub push and pull system is central to collaborating on coding projects. It allows multiple developers to work on the same project without overwriting each other’s work.

  • The Push Operation: Sends your local commits to the remote repository.

    • git push origin main - pushes commits from your local main branch to the remote main branch.
  • The Pull Operation: Fetches the latest changes from the remote repository and merges them into your local repository.

    • git pull origin main - pulls changes from the remote main branch and merges them into your local main branch.
  • Best Practices:

    • Pull often to keep your local repository up-to-date.
    • Push regularly to share your contributions with the team.
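The push/pull cycle can be tried without a network connection. In this sketch a local bare repository (remote.git) stands in for GitHub, and alice and bob are two hypothetical collaborators; all of these names are assumptions made for the illustration.

```shell
work="$(mktemp -d)"
cd "$work"
git init -q --bare remote.git               # plays the role of the GitHub repository

git clone -q remote.git alice 2>/dev/null   # first collaborator (empty-repo warning hidden)
cd alice
git config user.name "Alice Example"
git config user.email "alice@example.com"
echo "# Project" > README.md
git add README.md && git commit -q -m "add README"
git push -q -u origin HEAD                  # push the current branch and remember its upstream
cd ..

git clone -q remote.git bob                 # second collaborator starts from the pushed state
cd alice
echo "More details" >> README.md
git commit -q -a -m "extend README" && git push -q
cd ../bob
git pull -q --ff-only                       # fetch Alice's new commit and fast-forward to it
cat README.md
```

With a real GitHub repository the only difference is the remote address; the push and pull commands stay the same.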

The Push and Pull System

  1. Let’s start with the pull operation.
  2. Let’s create a README.md file in the remote repo and commit:

The Push and Pull System

  1. Let’s create LICENSE and .gitignore files in the remote repo and commit:

The Push and Pull System

  1. Now we can pull the changes:
git pull origin main
git status
git log

The Push and Pull System

  1. Now let’s see the push operation.
  2. Let’s delete the LICENSE file locally and commit.
  3. Now we can push. Depending on your settings, Git will ask for your user name and password (over HTTPS, GitHub expects a personal access token in place of the account password).
  4. Let’s check the remote repo on GitHub.
rm -rf LICENSE
ls
git status
git add .
git commit -m "remove LICENSE file"
git log
git push -u origin main

The Push and Pull System

  1. Now let’s create a branch to correct the mistake in the README.md file.
git pull
git checkout -b err01
# open the file and edit
git status
git add .
git commit -m "fixed mistake in readme file"
git checkout main
git merge err01
git log
git push
# let's check the remote repo
git push origin err01
# let's check the remote repo and check the commits
# git push origin --delete err01 # to delete the branch in the remote repo

The Push and Pull System

Can anybody push to my repository?

No, all repositories are read-only for anonymous users. By default only the owner of the repository has write access. If you can push to your own repo, it’s because you are using one of the supported authentication methods (HTTPS, SSH, …).

If you want to grant someone else privileges to push to your repo, you would need to configure that access in the project settings.

To contribute to projects in which you don’t have push access, you push to your own copy of the repo, then open a pull request.

source: https://stackoverflow.com/questions/17442930/can-anybody-push-to-my-project-on-github

RStudio + GitHub + Git

RStudio + GitHub + Git


Happy Git and GitHub for the useR, Heaven King video

RStudio + GitHub + Git

  1. Create a remote repo
  2. Create an RStudio Project and connect with the remote repo
  3. Commit locally
  4. Push and Pull
  5. Create a new branch
  6. Push and Pull
  7. Have Fun!

Collaboration with Git and GitHub

Collaborating Workflows

  • There is no one-size-fits-all Git workflow.
  • It is important to develop a Git workflow that is a productivity enhancement for the team.
  • A workflow should also complement business culture.

Centralized Workflow

  • Uses a central repository to serve as the single point-of-entry for all changes to the project. The default development branch is the main branch and all changes are committed into this branch. This workflow doesn’t require any other branches besides main.

  • Also check:

    • Inviting collaborators to a personal repository

Feature Branch Workflow

Forking Workflow

Summary

Summary

  • Project Management is a central practice in Data Science Projects

  • Keep Future-you Happy!

  • The Folder Structure must be respected.

  • Follow guidelines for naming scripts, modularizing code, and structuring your code for future readability and usability.

  • Version Control helps to track changes in files and coordinate work among multiple people.

  • Git and GitHub are the current main tools for Version Control and Data Science project collaboration

  • There is no unique Collaborating Workflow, but one of them can fit your team!

Thank you!