Topic 04: Version Control
May 15, 2024
Version Control
In the data science workflow, there are two sorts of surprises and cognitive stress:
Analytical surprise is when you learn something from or about the data.
Infrastructural surprise is when you discover that:
Good project management lets you focus on the right kind of stress.
You should always think in terms of projects.
A project is a self-contained unit of data science work that can be
A project contains
Content, e.g., raw data, processed data, scripts, functions, documents and other output
Metadata, e.g., information about tools for running it (required libraries, compilers), version history
For R projects, for example: RStudio project (.Rproj) files (perhaps augmented with the output of renv for dependency management) and .git.
Structuring your working directory
Further thoughts
Good paths
"preprocessing.py"
"figures/model-1.png"
"../data/survey.csv"
Bad paths
"/Users/me/data/thing.sav"
The working directory
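Relative paths like the good examples above are resolved against the working directory. A minimal sketch of why that makes a project portable; the project name myproject, the scripts folder, and the scratch location are placeholders, not part of the course material:

```shell
# Build a throwaway project in a temporary location (hypothetical layout).
scratch="$(mktemp -d)"
mkdir -p "$scratch/myproject/data" "$scratch/myproject/scripts"
touch "$scratch/myproject/data/survey.csv"

# Work from the scripts folder, as a script stored there would.
cd "$scratch/myproject/scripts"

# The relative path resolves against the current working directory.
ls ../data/survey.csv

# Moving or renaming the whole project does not break the relative path...
cd "$scratch"
mv myproject renamed-project
cd renamed-project/scripts
ls ../data/survey.csv

# ...whereas an absolute path to the old location no longer exists.
old_path="$scratch/myproject/data/survey.csv"
[ -e "$old_path" ] || echo "absolute path is broken: $old_path"
```

This is also why absolute paths like the bad example fail on anyone else's machine.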
In VS Code, open the folder/directory where your project is.
In RStudio, open the project by clicking on the script you want to work with. This sets the location of the script as the working directory (which should be your working assumption, too).
myproject.Rproj
Naming scripts
Use hyphens - or underscores _ to separate words.
00-setup.py
01-import-data.py
02-preprocess-data.py
03-describe-uptake.py
04-analyze-uptake.py
05-analyze-experiment.py
Modularizing scripts
Talk to Future-you
(.Rmd) file instead of a simple .R script.
(library() and source()).
(source("functions.R")) or just declared the first script in the pipeline.
Structuring your code
RStudio, for example, creates a “table of contents” when you name your code chunks as follows (# followed by title and ---):
More things to consider
There’d be more to say on how to establish a good project workflow, including how to
There’s limited value in teaching you all that upfront.
The truth is: You’ll likely refine your own workflow over time. Hopefully, I just saved you some initial pain.
Do check out other people’s experiences and opinions, e.g., here or here or here.
Version Control is a way to track your files
It is usually saved in a series of snapshots and branches, which you can move back and forth between
Version Control allows you to view how a project has progressed over time
It allows you to:
Have you ever…
Where does Git come from?
What’s the meaning of Git?
How to interact with Git?
Where does GitHub come from?
GitHub.com was launched in April 2008 by Tom Preston-Werner, Chris Wanstrath, P.J. Hyett and Scott Chacon.
In 2018, Microsoft acquired the company for more than US$7 billion.
What’s the business model?
Some interesting facts
GitHub’s mascot is “Octocat”, a human-cat-octopus hybrid with five arms.
There are 56m+ developers on GitHub, with 60m+ new repositories created in 2020 alone.
Part of GitHub’s history are controversies around issues like harassment allegations or incidents of censorship.
From software development…
… to scientific research
Good news: It’s free!
Simply go to https://github.com to sign up.
Some things to consider:
Again, Git is an independent piece of software. You need to have it installed on your machine to call it from the command line or RStudio.
Chances are that that’s already the case. Here’s how you can check using the command line:
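The check mentioned above can be sketched as follows. These are standard commands, but the exact ones used in the course are not shown here, so treat this as one possible way to do it:

```shell
# Print the installed Git version; this fails if Git is not on the PATH.
git --version

# Show where the git executable lives.
command -v git
```

If the first command reports "command not found", Git still needs to be installed.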
If you want to install (or update) Git on your Mac/Linux machine, I recommend using Homebrew, “the missing package manager for macOS (or Linux)”:
To install/update Git for Windows, check out happygitwithr.com.
This is particularly important when you work with Git but without the GitHub overhead. The idea is to define how your commits are labelled. Others should easily identify your commits as coming from you.
Still have to introduce yourself? To that end, we set our user name and email address like this:
The user name can be (but does not have to be) your GitHub user name. The email address should definitely be the one associated with your GitHub account.
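A minimal sketch of the introduction step described above. The name and email are placeholders, and the example writes to a scratch repository so it does not touch your real settings; adding --global after "config" would apply the same values to every repository on your machine:

```shell
# Scratch repository so the example leaves your own configuration alone.
scratch="$(mktemp -d)"
cd "$scratch"
git init -q

# Placeholder identity: replace with your own name and GitHub email.
# Use "git config --global ..." to set this once for all repositories.
git config user.name "Jane Doe"
git config user.email "jane.doe@example.com"

# Read the values back to confirm they were stored.
git config user.name
git config user.email
```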
Check out these setup instructions from Software Carpentry to learn about more configuration options.
Some benefits of the shell:
The shell is powerful and flexible. It lets you do things that the RStudio Git GUI can’t (we will see it later).
Working in the shell is potentially more appropriate for projects that aren’t primarily based in R.
Knowing the basic Git commands in the shell is a good thing for a data scientist.
Git goes through a long chain of operations and tasks before tracking a change.
Many of these tasks are user controlled, and are required for changes to be tracked correctly.
Repositories, usually called ‘repos’, store the full history and source control of a project.
They can be hosted either locally or on a shared server, such as GitHub.
Many repositories are hosted on GitHub, while core contributors keep copies of the repository on their own machines and update the shared copy using the push/pull system.
Any repository stored somewhere other than locally is called a ‘remote repository’.
Repositories are timelines of the entire project, including all
Directories, or ‘working directories’ are projects at their current state in time.
Any local directory interacting with a repository is technically a repository itself. However, it is better to call these directories ‘local repositories’, as they are instances of a remote repository.
This diagram shows a little bit about how the basic Git workflow process works
The staging area is the bundle of all the modifications to the project that are going to be committed.
A ‘commit’ is similar to taking a snapshot of the current state of the project, then storing it on a timeline.
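To make the path from working directory to staging area to commit concrete, here is a small sketch. It runs in a scratch repository with a placeholder identity; the file name script.py is only an example:

```shell
# Scratch repository with a placeholder identity for the commit.
scratch="$(mktemp -d)"
cd "$scratch"
git init -q
git config user.name "Jane Doe"
git config user.email "jane.doe@example.com"

touch script.py
git status --short        # "?? script.py": untracked, not yet staged

git add script.py
git status --short        # "A  script.py": staged for the next commit

git commit -q -m "add script.py"
git status --short        # no output: nothing left to stage or commit
git log --oneline         # one line per snapshot on the timeline
```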
I will create a new folder/directory on my computer: my_project
Open the bash Terminal and move to the my_project directory
I will copy the my_project directory path into a text document
I will try to add this folder to the staging area.
Error!
We need to initialize the repository. Do not do that in your root directory!
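The error and its fix can be reproduced with a short sketch. It uses a temporary location instead of a real project folder, which is an assumption made only to keep the example harmless:

```shell
# Create the project folder in a temporary location.
scratch="$(mktemp -d)"
mkdir "$scratch/my_project"
cd "$scratch/my_project"

# Outside a repository, staging fails with "fatal: not a git repository".
git add . || echo "git add failed: no repository yet"

# Initialize a repository inside the project folder (not in / or your home folder).
git init -q

# The hidden .git folder now holds the repository, and staging works.
ls -d .git
git add .
git status
```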
touch script.py # new file
touch webpage.html # new file
touch style.css # new file
git add . # add files to the staging area
git commit -m "adding files" # new commit
echo "Hello you" >> file.txt # create file.txt and append a line
git add . # add files to the staging area
git commit -m "editing file.txt" # new commit
rm -f file.txt # remove file
git commit -a -m "delete file.txt" # -a stages the deletion, then commits
git log # let's check the history
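The next steps deal with .gitignore. As a minimal sketch of the idea, assuming a scratch repository and a notes.txt file that should stay out of version control:

```shell
scratch="$(mktemp -d)"
cd "$scratch"
git init -q

# A personal notes file that should not be tracked.
echo "private notes" > notes.txt

# Before ignoring, Git reports the file as untracked.
git status --short            # "?? notes.txt"

# List the file in .gitignore, one pattern per line.
echo "notes.txt" > .gitignore

# Now only .gitignore itself shows up as untracked.
git status --short            # "?? .gitignore"

# check-ignore exits with status 0 when a path is ignored.
git check-ignore -q notes.txt && echo "notes.txt is ignored"
```

Note that .gitignore only affects untracked files: a file that has already been committed keeps being tracked until it is removed from the index.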
.gitignore
notes.txt file.
git status
notes.txt file one more time.
notes.txt file is being tracked.
Git branches are a way to create separate development paths without overwriting or creating copies of your project.
Branches can be added, deleted, and merged, just like regular commits.
Branches can be used to:
dev, and switch to the new branch.
main branch, create a new one bugs, and list them.
dev branch.
version.txt file
dev branch and do a new commit and keep working in the dev branch.
git checkout dev
echo "new text" >> file.txt
git add .
git commit -m "new file.txt content"
echo "new text" >> myfile.txt
git add .
git commit -m "new myfile.txt content"
git log
touch version.txt
echo "1.0" >> version.txt
git status
git add .
git commit -m "release v1.0"
git checkout main
git merge dev
git log
git checkout dev
# edit a file here first; without a change there is nothing to commit
git add .
git commit -m "starting new version"
git log
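The commands above start from an existing dev branch. Creating, listing, and deleting branches could look like the sketch below; the scratch repository, the placeholder identity, and the initial commit are assumptions added so the example runs on its own:

```shell
scratch="$(mktemp -d)"
cd "$scratch"
git init -q
git symbolic-ref HEAD refs/heads/main      # make sure the first branch is called main
git config user.name "Jane Doe"            # placeholder identity
git config user.email "jane.doe@example.com"
echo "first line" > file.txt
git add .
git commit -q -m "initial commit"

git checkout -q -b dev       # create dev and switch to it
git checkout -q main         # go back to main
git branch bugs              # create bugs without switching to it
git branch                   # list branches; * marks the current one

git branch -d bugs           # delete a branch that is no longer needed
```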
Check this material for more on Merge & Branches.
repo address.
The Git/GitHub push and pull system is central to collaborating on coding projects. It allows multiple developers to work on the same project while keeping their copies in sync.
The Push Operation: Sends your local commits to the remote repository.
git push origin main - pushes commits from your local main branch to the remote main branch.
The Pull Operation: Fetches the latest changes from the remote repository and merges them into your local repository.
git pull origin main - pulls changes from the remote main branch and merges them into your local main branch.
Best Practices:
README.md file in the remote repo and commit:
README.md file in the remote repo and commit:
LICENSE and .gitignore files in the remote repo and commit:
LICENSE.md file locally and commit.
Push request. Depending on your settings, it will require your user name and password (for HTTPS, GitHub now expects a personal access token in place of the account password).
repo on GitHub.
README.md file.
git pull
git checkout -b err01
# open the file and edit
git status
git add .
git commit -m "fixed mistake in readme file"
git checkout main
git merge err01
git log
git push
# let's check the remote repo
git push origin err01
# let's check the remote repo and check the commits
# git push origin --delete err01 # to delete the branch in the remote repo
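The commands above need a repository on GitHub. To try the push/pull cycle offline, a local bare repository can stand in for the remote. This is only a sketch: the collaborator names alice and bob, their emails, and the folder names are made up for illustration:

```shell
scratch="$(mktemp -d)"
cd "$scratch"

# A bare repository plays the role of the remote.
git init -q --bare remote.git
git --git-dir=remote.git symbolic-ref HEAD refs/heads/main

# Collaborator one clones, commits, and pushes.
git clone -q remote.git alice
cd alice
git symbolic-ref HEAD refs/heads/main
git config user.name "Alice"
git config user.email "alice@example.com"
echo "# My project" > README.md
git add .
git commit -q -m "add README"
git push -q origin main

# Collaborator two clones the remote and receives the README.
cd "$scratch"
git clone -q remote.git bob

# Collaborator one pushes another change...
cd "$scratch/alice"
echo "More details" >> README.md
git commit -q -a -m "extend README"
git push -q origin main

# ...and collaborator two pulls it into the local main branch.
cd "$scratch/bob"
git pull -q origin main
git log --oneline
```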
Can anybody push to my repository?
No, all repositories are read-only for anonymous users. By default only the owner of the repository has write access. If you can push to your own repo, it’s because you are using one of the supported authentication methods (HTTPS, SSH, …).
If you want to grant someone else privileges to push to your repo, you would need to configure that access in the project settings.
To contribute to projects in which you don’t have push access, you push to your own copy of the repo, then open a pull request.
source: https://stackoverflow.com/questions/17442930/can-anybody-push-to-my-project-on-github
Centralized workflow: uses a central repository to serve as the single point of entry for all changes to the project. The default development branch is the main branch, and all changes are committed into this branch. This workflow doesn’t require any other branches besides main.
Also check:
Feature branch workflow: all feature development should take place in a dedicated branch instead of the main branch. This encapsulation makes it easy for multiple developers to work on a particular feature without disturbing the main codebase. It also means the main branch should never contain broken code, which is a huge advantage for continuous integration environments.
Also check:
Forking workflow: instead of using a single server-side repository to act as the “central” codebase, it gives every developer a server-side repository. This means that each contributor has not one, but two Git repositories: a private local one and a public server-side one.
Also check:
Project Management is a central practice in Data Science Projects
Keep Future-you Happy!
The Folder Structure must be respected.
Guidelines for naming scripts, modularizing code, and structuring your code effectively for future readability and usability.
Version Control helps to track changes in files and coordinate work among multiple people.
Git and GitHub are the current main tools for Version Control and Data Science project collaboration
There is no single collaboration workflow that suits everyone, but one of them can fit your team!
Data Science Computing