All prerequisites, links to material and slides for this course can be found on GitHub.
They can also be downloaded as a zip archive from here.
Once the zip file is unarchived, all presentations (as HTML slides and pages), their R code, and the HTML practical sheets will be available in the directories underneath.
Before running any of the code in the practicals or slides we need to set the working directory to the folder we unarchived.
You may navigate to the unarchived Reproducible_R folder in the RStudio menu.
Session -> Set Working Directory -> Choose Directory
or call setwd() with the path to that folder in the console.
As you collect code for a project, you are always making updates. Which version did you use when? How do you organize all of these? How do you stay in lockstep with your collaborators? How do you efficiently share code with them?
In 2005 Linus Torvalds [the main developer of Linux] was having issues with the version control system the Linux developers used, so they designed a new one from the ground up with a specific set of principles.
Making packages.
Alternative systems.
Other version control systems e.g. CVS and SVN.
Other repository hosting services e.g. Bitbucket or GitLab.
You may already have Git installed, as it is installed by certain tools e.g. Xcode Command Line Tools. To interact with Git at the most basic level we will use a command prompt. We can run a command there to check.
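The check is most likely the standard version query. A minimal sketch, assuming git is on your PATH (the version number printed will differ by machine):

```shell
# Ask Git to report its version; an error here means Git is not installed
git --version
```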
## git version 2.26.2
If you do not have it, instructions for installation on each system are here: https://git-scm.com/book/en/v2/Getting-Started-Installing-Git
We initialize repositories to store files related to a project. You can make many repositories, and each one is just a regular folder with some extra properties created by Git.
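Initializing can be sketched as below. The scratch folder made with mktemp is an assumption so the example is safe to run anywhere; in the course you would run git init inside your own project folder.

```shell
# Scratch folder standing in for a project folder (hypothetical location)
repo="$(mktemp -d)"
cd "$repo" || exit 1

# Turn the folder into a Git repository; this creates the hidden .git folder
git init
```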
## Initialized empty Git repository in /__w/RU_reproducibleR/RU_reproducibleR/extdata/.git/
Mac/Linux
## Descriptions
## Test.csv
## _course.yml
## customCSS
## data
## imgs
## presRaw
## scripts
## .
## ..
## .git
## Descriptions
## Test.csv
## _course.yml
## customCSS
## data
## imgs
## presRaw
## scripts
Windows
The .git folder is a hidden folder. This is where all the business of Git happens.
You can explore it. It is full of simple plain text files.
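A rough sketch of how to look inside on Mac/Linux, again using a scratch repository as an assumption:

```shell
# Scratch repository (hypothetical location)
repo="$(mktemp -d)"
cd "$repo" || exit 1
git init -q

# -a includes hidden entries, so .git shows up in the listing
ls -a

# List the contents of the hidden folder itself
ls -a .git

# HEAD is a one-line plain text file naming the current branch
cat .git/HEAD
```

On the Windows Command Prompt, dir /a plays the role of ls -a.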
Mac/Linux
## .
## ..
## HEAD
## branches
## config
## description
## hooks
## info
## objects
## refs
Windows
When first working with Git, you need to attach your information (a name and an email address). This means every change you make is attributed to you, and it will be important for connecting to GitHub later.
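Attaching your information is done with git config. This sketch sets the values for one scratch repository only, which is an assumption made to keep it side-effect free; adding --global after config would set them once for every repository on your machine.

```shell
# Scratch repository (hypothetical location)
repo="$(mktemp -d)"
cd "$repo" || exit 1
git init -q

# Attach a name and email to the commits made in this repository
git config user.name "BRC-RU"
git config user.email "brc@rockefeller.edu"

# Read the values back to confirm they were stored
git config user.name
git config user.email
```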
## BRC-RU
## brc@rockefeller.edu
I have added a new file called README. We can check the status of our repository once we have added this.
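A minimal sketch of this step in a scratch repository (an assumption, so only README.md appears as untracked rather than the full course folder):

```shell
# Scratch repository (hypothetical location)
repo="$(mktemp -d)"
cd "$repo" || exit 1
git init -q

# Create an empty README, then ask Git what it can see
touch README.md
git status
```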
## On branch master
##
## No commits yet
##
## Untracked files:
## (use "git add <file>..." to include in what will be committed)
## Descriptions/
## README.md
## Test.csv
## _course.yml
## customCSS/
## data/
## imgs/
## presRaw/
## scripts/
##
## nothing added to commit but untracked files present (use "git add" to track)
Working directory: Make and save changes to files and git will be aware, but not necessarily involved. New files would be considered untracked.
Staging: Untracked changes or files can be added to the staging area. Staged files are not yet part of your Git repository.
Local Repository: Once all the edits are finished and the files are staged, they can then be committed. A commit puts the staged changes into the repository.
There is a culture to making edits:
* Do not commit every time anything is new.
* Each commit should be a nice neat little story.
  * Example: if you include a new function in a pipeline, save locally while developing, update the README, stage both, then commit them together.
  * The aim is that you and others can look back and understand what you were thinking and doing.
* A good rule: there shouldn’t be multiple clauses in your commit message. If there are, you are likely doing multiple things in one commit.
It is all about striking a balance: a good log of your changes, without every single thing you do being logged. Ideally each step is deliberate and thoughtful. When you are looking back through changes you want the commit of interest to be easy to find, coherent and succinct. Your future self will thank you.
Add the README file to your staging area.
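Staging is done with git add. A sketch in a scratch repository (an assumption):

```shell
# Scratch repository with an untracked README (hypothetical location)
repo="$(mktemp -d)"
cd "$repo" || exit 1
git init -q
touch README.md

# Move README.md from untracked to staged
git add README.md

# Status now lists it under "Changes to be committed"
git status
```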
## On branch master
##
## No commits yet
##
## Changes to be committed:
## (use "git rm --cached <file>..." to unstage)
## new file: README.md
##
## Untracked files:
## (use "git add <file>..." to include in what will be committed)
## Descriptions/
## Test.csv
## _course.yml
## customCSS/
## data/
## imgs/
## presRaw/
## scripts/
Commit the README file into your repository with a message. If you do not add a message, you will get prompted.
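The commit can be sketched as follows, with the message passed using -m. The scratch repository and its local name and email settings are assumptions so the example runs on its own.

```shell
# Scratch repository with a staged README (hypothetical location)
repo="$(mktemp -d)"
cd "$repo" || exit 1
git init -q
git config user.name "BRC-RU"
git config user.email "brc@rockefeller.edu"
touch README.md
git add README.md

# Record the staged file in the repository, with a message
git commit -m "Made a README"

# README.md no longer appears in the status report
git status
```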
## [master (root-commit) 4c6925c] Made a README
## 1 file changed, 0 insertions(+), 0 deletions(-)
## create mode 100644 README.md
## On branch master
## Untracked files:
## (use "git add <file>..." to include in what will be committed)
## Descriptions/
## Test.csv
## _course.yml
## customCSS/
## data/
## imgs/
## presRaw/
## scripts/
##
## nothing added to commit but untracked files present (use "git add" to track)
You might have three different versions of a file: the committed, staged, and working directory versions. Diff can tell you what has changed between them.
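With no arguments, git diff compares the working directory with what is staged (or committed, if nothing is staged). A sketch, assuming a scratch repository where an empty README has already been committed:

```shell
# Scratch repository with one committed, empty README (hypothetical location)
repo="$(mktemp -d)"
cd "$repo" || exit 1
git init -q
git config user.name "BRC-RU"
git config user.email "brc@rockefeller.edu"
touch README.md
git add README.md
git commit -q -m "Made a README"

# Edit the file without staging the change
echo "Hello Friends" >> README.md

# Status reports the file as modified; diff shows the added line
git status
git diff
```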
## On branch master
## Changes not staged for commit:
## (use "git add <file>..." to update what will be committed)
## (use "git restore <file>..." to discard changes in working directory)
## modified: README.md
##
## Untracked files:
## (use "git add <file>..." to include in what will be committed)
## Descriptions/
## Test.csv
## _course.yml
## customCSS/
## data/
## imgs/
## presRaw/
## scripts/
##
## no changes added to commit (use "git add" and/or "git commit -a")
## diff --git a/README.md b/README.md
## index e69de29..ba06163 100644
## --- a/README.md
## +++ b/README.md
## @@ -0,0 +1 @@
## +Hello Friends
It is always good to check what has changed in your staged files before you commit.
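Once a change is staged, plain git diff no longer shows it; git diff --staged compares the staging area with the last commit instead. A sketch in a scratch repository (an assumption):

```shell
# Scratch repository with one committed, empty README (hypothetical location)
repo="$(mktemp -d)"
cd "$repo" || exit 1
git init -q
git config user.name "BRC-RU"
git config user.email "brc@rockefeller.edu"
touch README.md
git add README.md
git commit -q -m "Made a README"

# Edit the file and stage the edit
echo "Hello Friends" >> README.md
git add README.md
git status

# Show what is about to be committed
git diff --staged
```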
## On branch master
## Changes to be committed:
## (use "git restore --staged <file>..." to unstage)
## modified: README.md
##
## Untracked files:
## (use "git add <file>..." to include in what will be committed)
## Descriptions/
## Test.csv
## _course.yml
## customCSS/
## data/
## imgs/
## presRaw/
## scripts/
## diff --git a/README.md b/README.md
## index e69de29..ba06163 100644
## --- a/README.md
## +++ b/README.md
## @@ -0,0 +1 @@
## +Hello Friends
## [master 64a9828] Added welcome message
## 1 file changed, 1 insertion(+)
## commit 64a98284f3165d1c2b267cf5f2376b8cbff62315
## Author: BRC-RU <brc@rockefeller.edu>
## Date: Wed Mar 12 14:18:50 2025 +0000
##
## Added welcome message
##
## commit 4c6925c5ac4cd5b96dfb9b5f377c1581a02ca700
## Author: BRC-RU <brc@rockefeller.edu>
## Date: Wed Mar 12 14:18:50 2025 +0000
##
## Made a README
You can tailor the log output to show only the information you need. This is important if you want to comb through hundreds of commits.
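A sketch of the full log and two trimmed-down views, assuming a scratch repository holding the same two commits as above (the hashes printed will differ from those on the slides):

```shell
# Scratch repository with two commits (hypothetical location)
repo="$(mktemp -d)"
cd "$repo" || exit 1
git init -q
git config user.name "BRC-RU"
git config user.email "brc@rockefeller.edu"
touch README.md
git add README.md
git commit -q -m "Made a README"
echo "Hello Friends" >> README.md
git add README.md
git commit -q -m "Added welcome message"

# Full log: hash, author, date and message for every commit
git log

# One line per commit: short hash and message only
git log --oneline

# One line per commit plus a summary of which files changed
git log --oneline --stat
```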
## 64a9828 Added welcome message
## 4c6925c Made a README
## 64a9828 Added welcome message
## README.md | 1 +
## 1 file changed, 1 insertion(+)
## 4c6925c Made a README
## README.md | 0
## 1 file changed, 0 insertions(+), 0 deletions(-)
While Git is local, GitHub is online. It is a repository hosting service.
Easy to back up and share code.
Easy collaboration.
Can make repositories public, so the code supporting a project will be available to others.
Drawbacks
GitHub almost has a monopoly, which can lead to problems.
Everything about Git is local and offline until you tell it to go to a remote. You then push your work up to wherever you’re hosting your code, e.g. GitHub.
The address is shown when you first set up the repo.
If you want to get the address later, you can press the clone button when you open up the repository.
Add the address to Git as a remote. If you do not want your code online on GitHub you can pick somewhere else, such as a server or someone else’s laptop.
Broadly you will want to make edits through Git as it has more capabilities. You can also make commits to single files directly on GitHub, e.g. a typo can be a quick fix there.
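Adding a remote and pushing to it can be sketched as below. A bare repository in a scratch folder stands in for the remote here, which is an assumption so the example runs offline; with GitHub, the address would be the URL shown on the repository page.

```shell
# A bare repository on disk plays the part of the remote (hypothetical location)
remote="$(mktemp -d)/project.git"
git init -q --bare "$remote"

# Scratch local repository with one commit (hypothetical location)
repo="$(mktemp -d)"
cd "$repo" || exit 1
git init -q
git config user.name "BRC-RU"
git config user.email "brc@rockefeller.edu"
touch README.md
git add README.md
git commit -q -m "Made a README"

# Tell Git where the remote lives and give it the conventional name origin
git remote add origin "$remote"
git remote -v

# Push the current branch and remember origin as its upstream
git push -u origin HEAD
```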
These are two tools, of many, that make it easier to manage Git repositories on your computer without having to use the command line.
They work by accessing the same Git repository as the command line Git tools, so the same principles apply. They just add a point-and-click interface, which is easier to work with for your day-to-day Git wants and needs.
RStudio has Git integration through its project management system. Simply start a project and click the version control option.
We can then just pick the Git option and enter the repository information. This is the same information we used when we added a remote on the command line.
Once you have a new project set up this way there will be a new Git tab in the Environment pane (top right in the standard layout).
Committing works much as it does in GitHub Desktop. Use the checkboxes to stage files, add a commit message, then press Commit.
Best practice workflow for collaborating with GitHub (or working solo)
Issue raised
Create a branch to address issue
Add commits
Pull request
Review changes and get feedback
Merge changes
This can be done by you, a collaborator, or if the repository is public, anyone! The workflow can be used to report a bug, suggest an idea for a feature, or just fix simple typos in your documentation.
If you are making modifications to a repository that is actively being used, you might still want to maintain a working version until the updates you make are finished and tested.
Creating a branch gives you what behaves like an additional copy of your repository, whose history can then diverge. Later on, when you finish working on the branch, you can integrate it back into the master branch.
Master should reflect what is ‘published’. This is your core repository. Branching helps protect the master.
Git
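A sketch of creating and listing branches, assuming a scratch repository with one commit (a branch needs a commit to point at):

```shell
# Scratch repository with one commit (hypothetical location)
repo="$(mktemp -d)"
cd "$repo" || exit 1
git init -q
git config user.name "BRC-RU"
git config user.email "brc@rockefeller.edu"
touch README.md
git add README.md
git commit -q -m "Made a README"

# Create a branch called newbranch; we stay on the current branch
git branch newbranch

# List branches; the asterisk marks the active one
git branch
```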
## * master
## newbranch
RStudio
Once you have made a branch and want to work on it, you need to make it the active branch. The checkout command switches which branch is active.
Git
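Switching can be sketched as follows, again in a scratch repository (an assumption):

```shell
# Scratch repository with one commit and a second branch (hypothetical location)
repo="$(mktemp -d)"
cd "$repo" || exit 1
git init -q
git config user.name "BRC-RU"
git config user.email "brc@rockefeller.edu"
touch README.md
git add README.md
git commit -q -m "Made a README"
git branch newbranch

# Make newbranch the active branch
git checkout newbranch

# The asterisk has moved to newbranch
git branch
```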
## Switched to branch 'newbranch'
## master
## * newbranch
RStudio
Git has the merge function to allow branches to be merged together. Pull requests are the GitHub equivalent, but have built-in steps for review. This can be for you, or for others you are collaborating with. Pull requests are the cornerstone of a collaborative GitHub environment.
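The local merge command can be sketched as below; the pull request itself happens on GitHub and is not shown. The scratch repository is an assumption, and the starting branch is explicitly named master so the example behaves the same on newer Git versions that default to another name.

```shell
# Scratch repository whose first branch is named master (hypothetical location)
repo="$(mktemp -d)"
cd "$repo" || exit 1
git init -q
git symbolic-ref HEAD refs/heads/master
git config user.name "BRC-RU"
git config user.email "brc@rockefeller.edu"
touch README.md
git add README.md
git commit -q -m "Made a README"

# Do some work on a new branch
git checkout -q -b newbranch
echo "Hello Friends" >> README.md
git commit -q -am "Added welcome message"

# Switch back to master and bring the branch's commits in
git checkout -q master
git merge newbranch
```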
When you start a new pull request, you first specify which branches you are bringing together, and what the directionality is.
Next you add a comment to describe what this merge is doing. At this point you can also add reviewers. If you are working collaboratively this is asking a specific person to review the code and approve the pull request.
Once the pull request is added other people can look at the request, including the reviewer, and add comments and feedback. If this addresses an issue you can use the # to tag it in the pull request. At this point GitHub will check that there are no conflicts, i.e. that the branches can be merged successfully.
Once everyone is happy, you can merge the pull request. You will then get the option to delete the old branch.
Generally it is good practice to do a pull request as opposed to a merge. Pull requests are preserved conversations that include code. Even if something is not accepted, the rationale for why is saved. This means you have a record of the ‘culture’ surrounding the repository.
Even when working with your own private repository, where you are a core or the only contributor, pull requests are good to use. You can still follow the workflow: make an issue, take a branch, address the issue, then open a pull request. If nothing else it is good record keeping and good practice for when you do work on a collaborative project.
Occasionally you may get a conflict error, typically from a pull or a merge. This is because Git is unsure how to merge two files together. Maybe two collaborators have edited the same line of code and tried to merge their changes back into the master branch.
There will be an error message, and you can also see it displayed in status. To resolve the conflict you will have to open the problem file/s.
The conflicting code, with information from both sides of the merge, will be present in the problem file/s. Once you have opened the file you will find this structure somewhere:
<<<<<<< HEAD
master code i.e.
y=1
=======
branch code i.e.
y=2
>>>>>>> branch name
To fix this, you just have to pick which code is appropriate. Delete anything superfluous, including the marker lines that start with <<<<<<<, ======= and >>>>>>>.
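The whole cycle of hitting and resolving a conflict can be sketched as below. The file name script.R, the y values and the scratch repository are assumptions chosen to mirror the example above.

```shell
# Scratch repository whose first branch is named master (hypothetical location)
repo="$(mktemp -d)"
cd "$repo" || exit 1
git init -q
git symbolic-ref HEAD refs/heads/master
git config user.name "BRC-RU"
git config user.email "brc@rockefeller.edu"
echo "y=0" > script.R
git add script.R
git commit -q -m "Start with y=0"

# Change the line on a branch
git checkout -q -b newbranch
echo "y=2" > script.R
git commit -q -am "Set y to 2 on the branch"

# Change the same line differently on master
git checkout -q master
echo "y=1" > script.R
git commit -q -am "Set y to 1 on master"

# The merge stops with a conflict; "|| true" lets the script carry on
git merge newbranch || true

# The file now holds both versions between the conflict markers
cat script.R

# Resolve by keeping the branch version, then stage and commit
echo "y=2" > script.R
git add script.R
git commit -q -m "Resolve conflict, keep y=2"
```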
If the conflict is found on GitHub, it will walk you through this.
How do I work on a project that belongs to someone else on GitHub?
Fork it first. This is like the GitHub equivalent of branching. This will create a whole new copy in your GitHub. This gives you the opportunity to edit/add/remove files without any risk.
Forking is a cornerstone of the open source nature of GitHub. You can fork any public repository. Adapt it to your needs and then deploy it in whatever analysis or tool you are working on. You should just check for a license first.
Alternatively, when collaborating you can take a fork. This is often how bugs are fixed in public tools, as a user may find a way to fix it before the developer. The user can take a fork, fix the bug, create a pull request, then the developer can approve it for integration into the repository.
Sometimes you have large or private files you do not want to upload to GitHub. A .gitignore file lists any files or directories you want Git to ignore.
You can use wild cards to exclude whole file types e.g. *.bam, or *.bw
Typically a .gitignore will be in the top level of the directory. You can also have additional .gitignore files in sub-directories. These take precedence. You can use ! to allow something to be seen (e.g. if all log files are blocked in the parent directory, you can still allow a specific log file in a daughter directory).
If you move a file Git can get confused and treat it as though you have deleted one file and created another. This means the new file will not have the history. RStudio’s Git interface figures it out if you stage the “deletion” and “creation” at the same time.
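A sketch of a top-level .gitignore with wild cards plus a sub-directory .gitignore that re-allows one file. All file and folder names here are hypothetical, as is the scratch repository.

```shell
# Scratch repository with a mix of files (hypothetical names and location)
repo="$(mktemp -d)"
cd "$repo" || exit 1
git init -q
mkdir logs
touch analysis.R sample.bam track.bw run.log logs/debug.log logs/keep.log

# Top level: ignore whole file types with wild cards
printf '%s\n' '*.bam' '*.bw' '*.log' > .gitignore

# Sub-directory: re-allow one specific log file with !
printf '%s\n' '!keep.log' > logs/.gitignore

# Stage everything Git is willing to see, then list what was staged
git add .
git ls-files
```

Only analysis.R, logs/keep.log and the two .gitignore files should be staged; the bam, bw and other log files are skipped.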
If you move a file on the command line, there is a special Git move function:
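The move function is git mv. A sketch, assuming a scratch repository and a hypothetical file analysis.R being moved into a scripts folder:

```shell
# Scratch repository with one committed file (hypothetical names and location)
repo="$(mktemp -d)"
cd "$repo" || exit 1
git init -q
git config user.name "BRC-RU"
git config user.email "brc@rockefeller.edu"
echo "x <- 1" > analysis.R
git add analysis.R
git commit -q -m "Added analysis script"

# Move the file and stage the move in one step
mkdir scripts
git mv analysis.R scripts/analysis.R
git status
git commit -q -m "Moved analysis script into scripts"

# --follow traces the file's history back through the rename
git log --follow --oneline -- scripts/analysis.R
```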
The above helps with moving a single file. If you have a bunch of files to move, you can just move them in Finder/Explorer, then fix the staging afterwards so Git records the move, using:
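The original command is not shown here. One common way to do this, offered as an assumption, is git add -A, which stages the deletions and the new files together so Git can pair them up as renames. File names and the scratch repository are hypothetical.

```shell
# Scratch repository with two committed files (hypothetical names and location)
repo="$(mktemp -d)"
cd "$repo" || exit 1
git init -q
git config user.name "BRC-RU"
git config user.email "brc@rockefeller.edu"
echo "a <- 1" > a.R
echo "b <- 2" > b.R
git add a.R b.R
git commit -q -m "Added two scripts"

# Plain mv stands in for dragging files around in Finder/Explorer
mkdir scripts
mv a.R b.R scripts/

# Stage every deletion and every new file in one go
git add -A

# Status should now report renames rather than deletions plus new files
git status
```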
rm - A simple command to remove files, with options (such as --cached) that stop Git tracking a file while keeping it on disk. Removing a file's entire Git history needs separate history-rewriting tools.
rebase - A powerful alternative to merge. Perfect for when the master has updates that are not present in a branch. You can use rebase to update the branch with the new master commits.
reset - Rewind staging and commits.
checkout - Though we covered checking out a branch, this can also be used to travel back to a specific commit in history or roll back unstaged changes to a specific file.
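A short sketch of three of these commands in their simplest forms; rebase and the commit-rewinding uses of reset are left to the Git book. The file names and scratch repository are assumptions.

```shell
# Scratch repository with two committed files (hypothetical names and location)
repo="$(mktemp -d)"
cd "$repo" || exit 1
git init -q
git config user.name "BRC-RU"
git config user.email "brc@rockefeller.edu"
echo "Hello Friends" > README.md
echo "private notes" > notes.txt
git add README.md notes.txt
git commit -q -m "Added README and notes"

# rm --cached: stop tracking notes.txt but leave the file on disk
git rm -q --cached notes.txt
git commit -q -m "Stop tracking notes"

# reset: unstage a change that was staged by mistake
echo "draft text" >> README.md
git add README.md
git reset -q README.md

# checkout: discard the unstaged edit and restore the committed version
git checkout -- README.md
```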
Follow up with the Git book for more detail on implementing these commands.
GitHub is a great resource, but there are risks in storing all your resources with a private enterprise.
Zenodo is operated by CERN and was developed with funding from the European Commission (a key driver of the EU agenda), for the purpose of promoting and maintaining open science.
As most folks are using GitHub for collating code, Zenodo and GitHub have worked together to make adding a persistent identifier to your repository very easy.
Let’s have a look
Simple series of steps:
After you have created a new release, Zenodo will automatically create a DOI. It also gives you a nice little badge, which you can add to your GitHub repository README.
When this is done Zenodo will have stored a static version of the repository based on the release you created. This will be stored for the foreseeable future.