Topic 02: Computational Literacy, Command Line, and Version Control
January 23, 2024
Computational Literacy
Command Line
Version Control
In the data science workflow, there are two sorts of surprises and cognitive stress:
Analytical surprise is when you learn something from or about the data.
Infrastructural surprise is when you discover that:
Good project management lets you focus on the right kind of stress.

You should always think in terms of projects.
A project is a self-contained unit of data science work that can be
A project contains
Content, e.g., raw data, processed data, scripts, functions, documents and other output
Metadata, e.g., information about tools for running it (required libraries, compilers), version history
For R projects for example:
RStudio project (.Rproj) files (perhaps augmented with the output of renv for dependency management) and .git.
Structuring your working directory
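One possible layout, sketched in the shell. The folder names (data/raw, data/processed, code, figures) and the README.md file are illustrative assumptions, not a prescribed standard; adapt them to your own project.

```shell
# Hypothetical project skeleton; every name below is an assumption for illustration.
mkdir -p my_project/data/raw        # untouched input data
mkdir -p my_project/data/processed  # data derived by scripts
mkdir -p my_project/code            # scripts and functions
mkdir -p my_project/figures         # generated output
touch my_project/README.md          # short description of the project

# List the resulting directories
find my_project -type d | sort
```

Keeping raw and processed data apart makes it harder to overwrite the original inputs by accident.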
Further thoughts
Good paths
"preprocessing.py"
"figures/model-1.png"
"../data/survey.csv"
Bad paths
"/Users/me/data/thing.sav"
The working directory
In VS, open the folder/directory where your project is.
In RStudio, open the project by clicking on the script you want to work with. This sets the location of the script as the working directory (which should be your working assumption, too).
myproject.Rproj
Naming scripts
Use hyphens - or underscores _ to separate words.
00-setup.py
01-import-data.py
02-preprocess-data.py
03-describe-uptake.py
04-analyze-uptake.py
05-analyze-experiment.py
Modularizing scripts
Talk to Future-you
.Rmd) file instead of a simple .R script.
library() and source()).
source("functions.R")) or just declared the first script in the pipeline.
Structuring your code
RStudio, for example, creates a “table of contents” when you name your code chunks with a comment line: # followed by a title and at least four dashes, e.g., # Import data ----
More things to consider
There’d be more to say on how to establish a good project workflow, including how to
There’s limited value in teaching you all that upfront.
The truth is: You’ll likely refine your own workflow over time. Hopefully, I just saved you some initial pain.
Do check out other people’s experiences and opinions, e.g., here or here or here.
Version Control is a way to track changes to your files.
The history is usually saved as a series of snapshots and branches, which you can move back and forth between.
Version Control allows you to view how a project has progressed over time.
It allows you to:
Have you ever…
Where does Git come from?
What’s the meaning of Git?
How to interact with Git?
Where does GitHub come from?
GitHub.com was launched in April 2008 by Tom Preston-Werner, Chris Wanstrath, P.J. Hyett and Scott Chacon.
In 2018, Microsoft acquired the company for more than US$7 billion.
What’s the business model?
Some interesting facts
GitHub’s mascot is “Octocat”, a human-cat-octopus hybrid with five arms.
There are 56m+ developers on GitHub, with 60m+ new repositories created in 2020 alone.
GitHub’s history also includes controversies around issues like harassment allegations and incidents of censorship.
From software development…
… to scientific research
Good news: It’s free!
Simply go to https://github.com to sign up.
Some things to consider:
Again, Git is an independent piece of software. You need to have it installed on your machine to call it from the Command Line or RStudio.
Chances are that that’s already the case. Here’s how you can check using the command line:
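A quick check, assuming a bash-like shell: git --version prints the installed version, and command -v git shows which executable the shell would run. If Git is missing, the first command fails with a "command not found" message and the second prints nothing.

```shell
# Print the installed Git version (output looks like "git version 2.x.y")
git --version

# Show the path of the git executable that the shell would run
command -v git
```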
If you want to install (or update) Git on your Mac/Linux machine, I recommend using Homebrew, “the missing package manager for macOS (or Linux)”: run brew install git to install it, or brew upgrade git to update it.
To install/update Git for Windows, check out happygitwithr.com.
This is particularly important when you work with Git but without the GitHub overhead. The idea is to define how your commits are labelled. Others should easily identify your commits as coming from you.
Still have to introduce yourself? To that end, we set our user name and email address like this:
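A minimal sketch of that configuration. The name and email below are placeholders (assumptions for illustration); replace them with your own details.

```shell
# Placeholder identity: replace the name and email with your own.
git config --global user.name "Jane Doe"
git config --global user.email "jane.doe@example.com"

# Verify what Git has stored
git config --global user.name
git config --global user.email
```

The --global flag stores the values for every repository of the current user; without it, the setting applies only to the repository you are in.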
The user name can be (but does not have to be) your GitHub user name. The email address should definitely be the one associated with your GitHub account.
Check out these setup instructions from Software Carpentry to learn about more configuration options.
Some benefits of the shell:
The shell is powerful and flexible. It lets you do things that the RStudio Git GUI can’t (we will see it later).
Working in the shell is potentially more appropriate for projects that aren’t primarily based in R.
Knowing the basic Git commands in the shell is a good thing for a data scientist.
Git goes through a long chain of operations and tasks before tracking a change.
Many of these tasks are user controlled, and are required for changes to be tracked correctly.
Repositories, usually called ‘repos’, store the full history and source control of a project.
They can either be hosted locally, or on a shared server, such as GitHub.
Many repositories are hosted on GitHub; contributors clone a copy of the repository onto their machine and synchronize it with the hosted repository using the push/pull system.
Any repository stored somewhere other than locally is called a ‘remote repository’.
Repositories are timelines of the entire project, including all
Directories, or ‘working directories’ are projects at their current state in time.
Any local directory interacting with a repository is technically a repository itself; however, it is better to call these directories ‘local repositories’, as they are instances of a remote repository.
This diagram shows a little bit about how the basic Git workflow process works
The staging area is the bundle of all the modifications to the project that are going to be committed.
A ‘commit’ is similar to taking a snapshot of the current state of the project, then storing it on a timeline.
I will create a new folder/directory on my computer: my_project
Open the bash Terminal and move to the my_project directory
I will copy the my_project directory path into a text document
I will try to add this folder to the staging area.
Error!
We need to initialize the repository. Do not do that in your root directory!
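The steps above can be sketched as follows. The temporary parent folder from mktemp is an assumption made only so that the sketch runs anywhere without touching existing folders; on your machine you would simply cd into your own my_project.

```shell
# Work in a throwaway location (assumption for this sketch only)
cd "$(mktemp -d)"
mkdir my_project
cd my_project

# Staging before initializing fails: Git reports that this is not a repository
git add . || echo "git add failed: no repository here yet"

# Initialize the repository inside the project folder (never in your root directory)
git init

# Git now recognizes this folder; the hidden .git directory holds the history
git status
ls -a
```

Everything Git knows about the project lives in .git, which is why initializing in the root directory would make Git treat every file below it as part of one repository.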
touch script.py # new file
touch webpage.html # new file
touch style.css # new file
git add . # add files to the staging area
git commit -m "adding files" # new commit
echo "Hello you" >> file.txt # create file.txt and append a line
git add . # add files to the staging area
git commit -m "editing file.txt" # new commit
rm -f file.txt # remove file
git commit -a -m "delete file.txt" # new commit (-a stages the deletion)
git log # let's check the history
A .gitignore file lists the files that Git should not track. Using a notes.txt file as the example: run git status before and after listing it in .gitignore to check whether the notes.txt file is being tracked.
Git branches are a way to create separate development paths without overwriting or creating copies of your project.
Branches can be added, deleted, and merged, just like regular commits.
Branches can be used to:
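Create a new branch, dev, and switch to the new branch.
Switch back to the main branch, create a new one, bugs, and list them.
Switch to the dev branch.
Create the version.txt file.
Return to the dev branch, make a new commit, and keep working in the dev branch.
git checkout -b dev # create dev and switch to it
git checkout main # back to main
git branch bugs # create bugs without switching to it
git branch # list all branches
git checkout dev # switch to the dev branch
echo "new text" >> file.txt
git add .
git commit -m "new file.txt content"
echo "new text" >> myfile.txt
git add .
git commit -m "new myfile.txt content"
git log
touch version.txt
echo "1.0" >> version.txt
git status
git add .
git commit -m "release v1.0"
git checkout main # switch to main
git merge dev # bring the dev commits into main
git log
git checkout dev
# edit a file here so that there is something to commit
git add .
git commit -m "starting new version"
git log
Check this material for more on Merge & Branches.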
The Git/GitHub push and pull system is central to collaborating on coding projects. It allows multiple developers to work on the same project without overwriting each other's work.
The Push Operation: Sends your local commits to the remote repository.
git push origin main - pushes commits from your local main branch to the remote main branch.
The Pull Operation: Fetches the latest changes from the remote repository and merges them into your local repository.
git pull origin main - pulls changes from the remote main branch and merges them into your local main branch.
Best Practices:
README.md file in the remote repo and commit:
README.md file in the remote repo and commit:
LICENSE and .gitignore files in the remote repo and commit:
LICENSE.md file locally and commit.
Push request. Depending on your settings, it will require your user name and password.
repo on GitHub.
README.md file.
git pull
git checkout -b err01
# open the file and edit
git status
git add .
git commit -m "fixed mistake in readme file"
git checkout main
git merge err01
git log
git push
# let's check the remote repo
git push origin err01
# let's check the remote repo and check the commits
# git push origin --delete err01 # to delete the branch in the remote repo
Can anybody push to my repository?
No, all repositories are read-only for anonymous users. By default, only the owner of the repository has write access. If you can push to your own repo, it’s because you are using one of the supported authentication methods (HTTPS, SSH, …).
If you want to grant someone else privileges to push to your repo, you would need to configure that access in the project settings.
To contribute to projects in which you don’t have push access, you push to your own copy (fork) of the repo, then open a pull request.
source: https://stackoverflow.com/questions/17442930/can-anybody-push-to-my-project-on-github
The centralized workflow uses a central repository to serve as the single point-of-entry for all changes to the project. The default development branch is the main branch, and all changes are committed into this branch. This workflow doesn’t require any other branches besides main.
Also check:
In the feature branch workflow, all feature development should take place in a dedicated branch instead of the main branch. This encapsulation makes it easy for multiple developers to work on a particular feature without disturbing the main codebase. It also means the main branch should never contain broken code, which is a huge advantage for continuous integration environments.
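A minimal local sketch of the feature-branch idea, assuming a throwaway repository created with mktemp. The branch name feature-plot, the file names, and the placeholder identity are hypothetical and exist only so that the sketch runs on a fresh machine. With a shared remote, you would typically push the branch and open a pull request instead of merging locally.

```shell
# Throwaway repository (assumption for this sketch only)
cd "$(mktemp -d)"
git init
git config user.name "Jane Doe"               # placeholder identity, local to this repo
git config user.email "jane.doe@example.com"

# First commit on the main branch
echo "# my project" > README.md
git add README.md
git commit -m "initial commit"
git branch -M main                            # make sure the branch is called main

# Develop the feature on its own branch; main stays untouched meanwhile
git checkout -b feature-plot
echo "print('plot')" > plot.py
git add plot.py
git commit -m "add plotting script"

# Once the feature works, bring it into main
git checkout main
git merge feature-plot
git log --oneline
```

Until the merge, main contains only the initial commit, so anything that builds or tests main is unaffected by work in progress.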
Also check:
Instead of using a single server-side repository to act as the “central” codebase, the forking workflow gives every developer a server-side repository. This means that each contributor has not one, but two Git repositories: a private local one and a public server-side one.
Also check:
Project Management is a central practice in Data Science Projects
Keep Future-you Happy!
The Folder Structure must be respected.
Guidelines for naming scripts, modularizing code, and structuring your code effectively for future readability and usability.
Version Control helps to track changes in files and coordinate work among multiple people.
Git and GitHub are the current main tools for Version Control and Data Science project collaboration
There is no single collaboration workflow that suits everyone, but one of them can fit your team!
Data Science Computing