All prerequisites, links to material, and slides for this course can be found on GitHub.
They can also be downloaded as a zip archive from here.
Once the zip file is unarchived, all presentations as HTML slides and pages, their R code, and HTML practical sheets will be available in the directories underneath.
Before running any of the code in the practicals or slides we need to set the working directory to the folder we unarchived.
You may navigate to the unarchived Reproducible_R folder in the RStudio menu.
Session -> Set Working Directory -> Choose Directory
or in the console.
setwd("/PathToMyDownload/Reproducible_R-master/r_course")
# e.g. setwd('~/Downloads/Reproducible_R-master/r_course')
As you collect code for a project, you are always making updates. What version did you use when? How do you organize all of these? How do you stay in lockstep with your collaborators? How do you efficiently share code with them?
In 2005 Linus Torvalds (the main developer of Linux) was having issues with the version control system the Linux developers used, so they designed a new one from the ground up with a specific set of principles:
Making packages.
Alternative systems.
Other version control systems e.g. CVS and SVN.
Other repository hosting services e.g. Bitbucket or GitLab.
You may already have Git installed, as it is installed by certain tools e.g. Xcode Command Line Tools. To interact with Git at the most basic level we will use a command prompt. We can run this command to check.
git --version
## git version 2.26.2
If you do not have it, instructions for installation on each system are here: https://git-scm.com/book/en/v2/Getting-Started-Installing-Git
We initialize repositories to store files related to a project. You can make many repositories; each one is just a regular folder with some extra properties created by Git.
mkdir My_Project_Folder
cd My_Project_Folder
git init
## Initialized empty Git repository in /__w/RU_reproducibleR/RU_reproducibleR/extdata/.git/
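As a minimal sketch of the step above, the commands below create a throwaway repository and confirm that the only thing `git init` adds to a folder is the hidden .git directory. The temporary folder from `mktemp` and the variable names are illustrative assumptions, and a POSIX shell (Mac/Linux, or Git Bash on Windows) is assumed.

```shell
# Create a scratch folder so nothing in your real projects is touched.
project=$(mktemp -d)
cd "$project"

# Turn the folder into a Git repository (-q suppresses the status message).
git init -q

# The folder is still a normal folder; the only addition is .git.
ls -a

# Ask Git to confirm that we are inside a working tree.
inside=$(git rev-parse --is-inside-work-tree)
echo "$inside"
```

Deleting the .git folder would turn the project back into an ordinary folder with no history.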
Mac/Linux
ls .
## _course.yml
## customCSS
## data
## Descriptions
## imgs
## presRaw
## reproducibleR.html
## reproducibleR.Rmd
## scripts
ls -a .
## .
## ..
## _course.yml
## customCSS
## data
## Descriptions
## .git
## imgs
## presRaw
## reproducibleR.html
## reproducibleR.Rmd
## scripts
Windows
dir .
dir . /ah
The .git folder is a hidden folder. This is where all the business of Git happens.
You can explore it. It is full of simple plain text files.
Mac/Linux
ls -a .git
## .
## ..
## branches
## config
## description
## HEAD
## hooks
## info
## objects
## refs
Windows
dir /ah .git
When first working with Git, you need to set your name and email. Git records these with every commit, so each change is attributed to you. They will also be important for connecting to GitHub later.
git config --global user.name 'BRC-RU'
git config --global user.name
## BRC-RU
git config --global user.email 'brc@rockefeller.edu'
git config --global user.email
## brc@rockefeller.edu
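The `--global` flag stores your identity for every repository on the machine. As a hedged sketch, assuming you want a different identity for a single project, leaving out `--global` writes the setting to that repository only. The name and email below are placeholders.

```shell
# Scratch repository so your real configuration is left alone.
repo=$(mktemp -d)
cd "$repo"
git init -q

# Without --global the setting is written to this repository's .git/config only.
git config user.name 'Example User'
git config user.email 'user@example.com'

# Read the values back; a repository-level value takes priority over a global one.
name=$(git config user.name)
email=$(git config user.email)
echo "$name <$email>"

# Show the settings stored at repository level.
git config --local --list
```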
We add a new file called README.md, then check the status of the repository.
touch README.md
git status .
## On branch master
##
## No commits yet
##
## Untracked files:
## (use "git add <file>..." to include in what will be committed)
## Descriptions/
## README.md
## _course.yml
## customCSS/
## data/
## imgs/
## presRaw/
## reproducibleR.Rmd
## reproducibleR.html
## scripts/
##
## nothing added to commit but untracked files present (use "git add" to track)
Working directory: Make and save changes to files and Git will be aware of them, but will not record them. New files are considered untracked.
Staging: Untracked files or changes can be added to the staging area. Staged files are queued for the next commit but are not yet stored in your Git repository.
Local Repository: Once all the edits are finished and the files are staged, they can be committed. A commit stores the staged changes in the repository.
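The three areas above can be traced with one file. This is a sketch in a scratch repository; the file name and identity are hypothetical, and `git status --porcelain` is used as a compact, script-friendly form of `git status`.

```shell
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name 'Example User'
git config user.email 'user@example.com'

# 1. Working directory: a new file is untracked ("??").
echo 'first draft' > notes.txt
untracked=$(git status --porcelain)
echo "$untracked"

# 2. Staging: the file is queued for the next commit ("A" = added).
git add notes.txt
staged=$(git status --porcelain)
echo "$staged"

# 3. Local repository: the commit stores the staged snapshot.
git commit -q -m 'Add notes'
committed=$(git status --porcelain)
echo "status after commit: '$committed'"
```

After the commit the status is empty, because the working directory, staging area and repository all hold the same version.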
There is a culture to making edits:
* Do not commit every time anything is new.
* Each commit should be a nice neat little story:
    * Example: If you include a new function in a pipeline, save locally while developing, update the README, stage both, and commit them together.
* The aim is that you and others will look back and understand what you were thinking and doing.
* A good rule: there should not be multiple clauses in your commit message. If there are, you are likely doing multiple things.
It is all about balance between having a good log of your changes, without having every single thing you do logged. Ideally each step is deliberate and thoughtful. When you are looking back through changes you want the commit of interest to be easy to find, coherent and succinct. Your future self will thank you.
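The "new function plus README" story above might look like this on the command line. This is only a sketch: the function, file names and messages are made up for illustration.

```shell
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name 'Example User'
git config user.email 'user@example.com'
echo '# My pipeline' > README.md
git add README.md
git commit -q -m 'Add README'

# Develop locally: a new (hypothetical) function plus the matching documentation.
echo 'normalise <- function(x) x / sum(x)' > normalise.R
echo 'normalise(): scales a vector to sum to 1' >> README.md

# Stage both files, then commit them together under one single-clause message.
git add normalise.R README.md
git commit -q -m 'Add normalise function'

# List the files recorded in that one commit.
files=$(git diff --name-only HEAD~1 HEAD)
echo "$files"
```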
Add the README file to your staging area.
git add README.md
git status .
## On branch master
##
## No commits yet
##
## Changes to be committed:
## (use "git rm --cached <file>..." to unstage)
## new file: README.md
##
## Untracked files:
## (use "git add <file>..." to include in what will be committed)
## Descriptions/
## _course.yml
## customCSS/
## data/
## imgs/
## presRaw/
## reproducibleR.Rmd
## reproducibleR.html
## scripts/
Commit the README file into your repository with a message. If you do not supply a message, Git will open a text editor and prompt you for one.
git commit -m 'Made a README'
git status .
## [master (root-commit) 5b2b645] Made a README
## 1 file changed, 0 insertions(+), 0 deletions(-)
## create mode 100644 README.md
## On branch master
## Untracked files:
## (use "git add <file>..." to include in what will be committed)
## Descriptions/
## _course.yml
## customCSS/
## data/
## imgs/
## presRaw/
## reproducibleR.Rmd
## reproducibleR.html
## scripts/
##
## nothing added to commit but untracked files present (use "git add" to track)
A file can exist in three different versions at once: the committed version, the staged version, and the working directory version. git diff shows the changes between these versions.
echo 'Hello Friends' >> README.md
git status .
git diff
## On branch master
## Changes not staged for commit:
## (use "git add <file>..." to update what will be committed)
## (use "git restore <file>..." to discard changes in working directory)
## modified: README.md
##
## Untracked files:
## (use "git add <file>..." to include in what will be committed)
## Descriptions/
## _course.yml
## customCSS/
## data/
## imgs/
## presRaw/
## reproducibleR.Rmd
## reproducibleR.html
## scripts/
##
## no changes added to commit (use "git add" and/or "git commit -a")
## diff --git a/README.md b/README.md
## index e69de29..ba06163 100644
## --- a/README.md
## +++ b/README.md
## @@ -0,0 +1 @@
## +Hello Friends
It is always good to check what has changed in your staged files before you commit.
git add README.md
git status .
## On branch master
## Changes to be committed:
## (use "git restore --staged <file>..." to unstage)
## modified: README.md
##
## Untracked files:
## (use "git add <file>..." to include in what will be committed)
## Descriptions/
## _course.yml
## customCSS/
## data/
## imgs/
## presRaw/
## reproducibleR.Rmd
## reproducibleR.html
## scripts/
git diff
Once the change is staged, plain git diff reports nothing. Use the --staged flag to compare the staged version against the last commit.
git diff --staged
git commit -m 'Added welcome message'
## diff --git a/README.md b/README.md
## index e69de29..ba06163 100644
## --- a/README.md
## +++ b/README.md
## @@ -0,0 +1 @@
## +Hello Friends
## [master a4fed6b] Added welcome message
## 1 file changed, 1 insertion(+)
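To see all three versions side by side, here is a small sketch in a scratch repository where the committed, staged and working directory copies of one file all differ. The file contents are arbitrary.

```shell
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name 'Example User'
git config user.email 'user@example.com'

# Committed version: one line.
echo 'one' > README.md
git add README.md
git commit -q -m 'Add README'

# Staged version: two lines.
echo 'two' >> README.md
git add README.md

# Working directory version: three lines (the last one is not staged).
echo 'three' >> README.md

# Working directory vs staged: only "+three".
unstaged=$(git diff)
# Staged vs last commit: only "+two".
staged=$(git diff --staged)
# Working directory vs last commit: both "+two" and "+three".
all=$(git diff HEAD)

echo "$unstaged"
echo "$staged"
echo "$all"
```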
git log
## commit a4fed6ba3be0e2180bf3b6d09b0d00f8336f33cb
## Author: BRC-RU <brc@rockefeller.edu>
## Date: Tue Jul 18 21:19:00 2023 +0000
##
## Added welcome message
##
## commit 5b2b645e916765409dc1fd15993ac4434ba6c42a
## Author: BRC-RU <brc@rockefeller.edu>
## Date: Tue Jul 18 21:19:00 2023 +0000
##
## Made a README
You can tailor the log output to show only the information you need. This is important if you want to comb through hundreds of commits.
git log --oneline
## a4fed6b Added welcome message
## 5b2b645 Made a README
git log --oneline --stat
## a4fed6b Added welcome message
## README.md | 1 +
## 1 file changed, 1 insertion(+)
## 5b2b645 Made a README
## README.md | 0
## 1 file changed, 0 insertions(+), 0 deletions(-)
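A few more ways to narrow the log, sketched on a small made-up history: limiting the number of commits, restricting to one file, and searching commit messages. The file names and messages are illustrative.

```shell
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name 'Example User'
git config user.email 'user@example.com'

# Build a small history touching two files.
echo 'a' > README.md;  git add README.md;  git commit -q -m 'Add README'
echo 'b' > analysis.R; git add analysis.R; git commit -q -m 'Add analysis script'
echo 'c' >> README.md; git add README.md;  git commit -q -m 'Fix typo in README'

# Only the most recent commit.
latest=$(git log --oneline -n 1)
# Only commits that touched one file.
readme_log=$(git log --oneline -- README.md)
# Only commits whose message matches a pattern.
typo_log=$(git log --oneline --grep='typo')

echo "$latest"
echo "$readme_log"
echo "$typo_log"
```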
While Git is local, GitHub is online. It is a repository hosting service.
Easy to back up and share code.
Easy collaboration.
You can make repositories public, so the code supporting a project is available to others.
Drawbacks
GitHub almost has a monopoly, which can lead to problems.
Everything about Git is local and offline until you tell it to go to a remote, i.e. push it up to wherever you are hosting your code, e.g. GitHub.
The address is shown when you first set up the repo.
If you want to get the address later, press the clone button when you open up the repository.
Add the address to your local Git repository as a remote. If you do not want your code on GitHub you can pick somewhere else, such as a server or someone else's laptop.
git remote add origin https://github.com/BRC-RU/My_GitHub_Project.git
git push -u origin master
After pushing, your commits will appear on GitHub.
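A remote does not have to be GitHub. The sketch below practises the same two commands offline, using a bare repository in a temporary folder as a stand-in for the hosted one. The paths and names are assumptions made for illustration.

```shell
work=$(mktemp -d)
cd "$work"

# A bare repository stands in for the remote (on GitHub this is created for you).
git init -q --bare hub.git

# A local repository with one commit on master.
git init -q project
cd project
git symbolic-ref HEAD refs/heads/master   # make sure the first branch is called master
git config user.name 'Example User'
git config user.email 'user@example.com'
echo 'Hello Friends' > README.md
git add README.md
git commit -q -m 'Made a README'

# Register the remote under the conventional name "origin", then push.
git remote add origin "$work/hub.git"
git push -q -u origin master

# The remote now holds the same commit as the local branch.
local_commit=$(git rev-parse master)
remote_commit=$(git --git-dir="$work/hub.git" rev-parse master)
git remote -v
```

The `-u` flag on the push records origin/master as the upstream of the local branch, so later pushes from this branch can be a plain `git push`.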
Broadly, you will want to make edits through Git as it has more capabilities. You can also commit changes to single files directly on GitHub, e.g. a typo can be a quick fix there.
Whenever your local Git is behind the remote GitHub, you can grab the updates with Pull.
git pull origin master
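To watch a pull happen without a GitHub account, here is a sketch with two local working copies sharing one bare remote. The collaborator names "alice" and "bob" and all paths are hypothetical.

```shell
work=$(mktemp -d)
cd "$work"

# Shared remote, with HEAD pointed at master so that clones check that branch out.
git init -q --bare hub.git
git --git-dir=hub.git symbolic-ref HEAD refs/heads/master

# Collaborator A creates the history and pushes it.
git init -q alice
cd alice
git symbolic-ref HEAD refs/heads/master
git config user.name 'Alice Example'
git config user.email 'alice@example.com'
echo 'Hello Friends' > README.md
git add README.md
git commit -q -m 'Made a README'
git remote add origin "$work/hub.git"
git push -q -u origin master

# Collaborator B clones the remote.
cd "$work"
git clone -q hub.git bob

# A pushes a new commit, so B's copy is now behind the remote.
cd "$work/alice"
echo 'x <- 1' > analysis.R
git add analysis.R
git commit -q -m 'Add analysis script'
git push -q origin master

# B pulls to catch up.
cd "$work/bob"
git pull -q origin master
ls
```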
RStudio and GitHub Desktop are two of the many tools that make it easier to manage Git repositories on your computer without having to use the command line.
They work by accessing the same Git repository as the command line Git tools, so the same principles apply. They just add a point-and-click interface, which is easier to work with for your day-to-day Git wants and needs.
RStudio has Git integration through its project management system. Simply start a project and click the version control option.
We can then just pick the Git option and enter the repository information. This is the same information we used when we added a remote on the command line.
Once you have a new project set up this way there will be a new Git tab in the Environment pane (top right in the standard layout).
Committing works much as it does in GitHub Desktop. Use the checkboxes to stage files, add a commit message, then press Commit.
Best practice workflow for collaborating with GitHub (or working solo)
Issue raised
Create a branch to address issue
Add commits
Pull request
Review changes and get feedback
Merge changes
This can be done by you, a collaborator, or if the repository is public, anyone! The workflow can be used to report a bug, suggest an idea for a feature, or just fix simple typos in your documentation.
If you are making modifications to a repository that is actively being used, you might still want to maintain a working version until the updates you make are finished and tested.
Creating a branch creates an additional copy of your repository, whose history can then diverge. Later on when you finish working on the branch, you can then integrate it back into the master branch.
Master should reflect what is ‘published’. This is your core repository. Branching helps protect the master.
Git
git branch 'newbranch' # to build new one.
git branch # tells you what branches exist
## * master
## newbranch
Rstudio
Once you have made a branch and want to work on it, you will need to make this branch the one that is active in your Git. The checkout command allows you to switch what is active in your Git.
Git
git checkout newbranch
## Switched to branch 'newbranch'
git branch
## master
## * newbranch
Rstudio
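The sketch below shows why branching protects master: a commit made on the branch leaves master untouched. It runs in a scratch repository, and the file contents are arbitrary.

```shell
repo=$(mktemp -d)
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/master   # make sure the first branch is called master
git config user.name 'Example User'
git config user.email 'user@example.com'
echo 'stable' > README.md
git add README.md
git commit -q -m 'Made a README'

# Create the branch and switch to it.
git branch newbranch
git checkout -q newbranch

# Commit on the branch; master does not move.
echo 'experimental' >> README.md
git add README.md
git commit -q -m 'Try an experimental change'

# Switching back restores the master version of the file.
git checkout -q master
on_master=$(cat README.md)
current=$(git rev-parse --abbrev-ref HEAD)
git log --oneline --all
```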
Git has the merge function to allow branches to be merged together. Pull requests are the GitHub equivalent, but have built in steps for review. This can be for you, or for others you are collaborating with. Pull requests are the cornerstone of a collaborative GitHub Environment.
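For reference, a plain local merge might look like the sketch below. You check out the branch that should receive the changes, then merge the other branch into it. The branch and file names are illustrative.

```shell
repo=$(mktemp -d)
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/master   # make sure the first branch is called master
git config user.name 'Example User'
git config user.email 'user@example.com'
echo 'stable' > README.md
git add README.md
git commit -q -m 'Made a README'

# Do some work on a branch (-b creates the branch and switches to it).
git checkout -q -b newbranch
echo 'x <- 1' > analysis.R
git add analysis.R
git commit -q -m 'Add analysis script'

# Direction matters: check out the branch that should receive the changes.
git checkout -q master
git merge -q newbranch

# The branch is fully merged, so it can be deleted safely.
git branch -d newbranch
remaining=$(git branch --list newbranch)
```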
When you start a new pull request, you first specify which branches you are bringing together, and what the directionality is.
Next you add a comment to describe what this merge is doing. At this point you can also add reviewers. If you are working collaboratively this is asking a specific person to review the code and approve the pull request.
Once the pull request is added, other people, including the reviewer, can look at the request and add comments and feedback. If it addresses an issue you can use # followed by the issue number to tag it in the pull request. At this point GitHub will check that there are no conflicts, i.e. that the branches can be merged successfully.
Once everyone is happy, you can merge the pull request. You will then get the option to delete the old branch.
Generally it is good practice to do a pull request as opposed to a merge. Pull requests are preserved conversations, that include code. Even if something is not accepted, the rationale for why is saved. This means you have a record of the ‘culture’ surrounding the repository.
Even when working with your own private repository, where you are the core or only contributor, pull requests are good to use. You can still follow the workflow: make an issue, create a branch, address the issue, then open a pull request. If nothing else it is good record keeping and good practice for when you do work on a collaborative project.
Occasionally you may get a conflict error, typically from a pull or a merge. This is because Git is unsure how to merge two files together. Maybe two collaborators have edited the same line of code and tried to merge their changes back into the master branch.
There will be an error message, and you can also see it displayed in status. To resolve the conflict you will have to open the problem file/s.
The conflicting code, with information from both sides of the merge, will be present in the problem file/s. Once you have opened the file you will find this structure somewhere:
<<<<<<< HEAD
master code i.e.
y=1
=======
branch code i.e.
y=2
>>>>>>> branch name
To fix this, you just have to pick which code is appropriate. Delete anything superfluous, including the marker lines that start with <<<<<<<, ======= and >>>>>>>.
If the conflict is found on GitHub, it will walk you through this.
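On the command line, a conflict and its resolution can be sketched as follows. The example deliberately makes master and a branch change the same line, then resolves the conflict by keeping the master version. File and branch names are hypothetical.

```shell
repo=$(mktemp -d)
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/master   # make sure the first branch is called master
git config user.name 'Example User'
git config user.email 'user@example.com'
echo 'y=0' > script.R
git add script.R
git commit -q -m 'Add script'

# The branch and master each change the same line in a different way.
git checkout -q -b newbranch
echo 'y=2' > script.R
git commit -q -a -m 'Set y to 2'
git checkout -q master
echo 'y=1' > script.R
git commit -q -a -m 'Set y to 1'

# The merge stops with a conflict (non-zero exit), leaving markers in the file.
git merge newbranch || echo 'merge stopped: conflict'
markers=$(grep -c '^<<<<<<<' script.R)
cat script.R

# Resolve by keeping the master line, then stage and commit to finish the merge.
echo 'y=1' > script.R
git add script.R
git commit -q -m 'Merge newbranch, keeping y=1'
resolved=$(cat script.R)
```

Here the file is simply overwritten with the chosen line. In a real file you would edit around the markers by hand so that the rest of the content is kept.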
How do I work on a project that belongs to someone else on GitHub?
Fork it first. This is like the GitHub equivalent of branching. This will create a whole new copy in your GitHub. This gives you the opportunity to edit/add/remove files without any risk.
Forking is a cornerstone of the open source nature of GitHub. You can fork any public repository. Adapt it to your needs and then deploy it in whatever analysis or tool you are working on. You should just check for a license first.
Alternatively, when collaborating you can take a fork. This is often how bugs are fixed in public tools, as a user may find a way to fix it before the developer. The user can take a fork, fix the bug, create a pull request, then the developer can approve it for integration into the repository.
Sometimes you have large or private files you do not want to upload to GitHub. A .gitignore file lists any files or directories you want Git to ignore.
You can use wildcards to exclude whole file types e.g. *.bam or *.bw
Typically a .gitignore will be in the top level of the directory. You can also have additional .gitignore files in sub-directories, and these take precedence. You can use ! to allow something to be seen (e.g. if all log files are blocked in the parent directory, you can still allow a specific log file to be tracked in a daughter directory).
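A hedged sketch of these rules in action: a top-level .gitignore blocks two file types, and a sub-directory .gitignore re-allows one specific log file. All file and folder names are made up, and `git ls-files --others --exclude-standard` is used to list the untracked files that Git can still see.

```shell
repo=$(mktemp -d)
cd "$repo"
git init -q

# Top-level rules: ignore alignment files and all logs (patterns are illustrative).
printf '%s\n' '*.bam' '*.log' > .gitignore

# A sub-directory rule re-includes one specific log file.
mkdir results
printf '%s\n' '!summary.log' > results/.gitignore

# Some files to test the rules against.
touch sample.bam run.log results/debug.log results/summary.log analysis.R

# List untracked files that Git would still see (ignored files are left out).
visible=$(git ls-files --others --exclude-standard)
echo "$visible"
```

The listing should contain analysis.R, both .gitignore files and results/summary.log, but none of the other log files or the BAM file.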
If you move a file, Git can get confused and treat it as though you have deleted one file and created another. This means the new file may not show the old file's history. RStudio's Git interface figures it out if you stage the “deletion” and “creation” at the same time.
If you move a file on the command line, there is a special Git move function:
git mv FilePath NewFilePath
The above helps with moving a single file. If you have a bunch of files to move, you can just move them in Finder/Explorer, then fix the staging afterwards using:
git add -A .
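Both routes can be compared in a scratch repository. The sketch below moves a file once with `git mv` and once with plain `mv` followed by `git add -A .`, then uses `git log --follow` to trace the history across both moves. The file names are hypothetical.

```shell
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name 'Example User'
git config user.email 'user@example.com'
echo 'x <- 1' > analysis.R
git add analysis.R
git commit -q -m 'Add analysis script'

# Move with Git: the rename is staged in one step.
mkdir scripts
git mv analysis.R scripts/analysis.R
moved=$(git status --porcelain)
echo "$moved"
git commit -q -m 'Move analysis script into scripts/'

# Move outside Git (as Finder/Explorer would), then fix the staging afterwards.
mv scripts/analysis.R scripts/model.R
git add -A .
fixed=$(git status --porcelain)
echo "$fixed"
git commit -q -m 'Rename analysis script'

# --follow traces the file's history back through both moves.
history=$(git log --oneline --follow -- scripts/model.R)
echo "$history"
```

In both cases the status line starts with R (renamed), because the deletion and the creation were staged together.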
rm - A simple command to remove files, with options that stop a file being tracked by Git while keeping it on disk. Note that it does not erase the file from earlier commits.
rebase - A powerful alternative to merge. Perfect for when the master has updates that are not present in a branch. You can use rebase to update the branch with the new master commits.
reset - Rewind staging and commits.
checkout - Though we covered checking out a branch, this can also be used to travel back to a specific commit in history or roll back unstaged changes to a specific file.
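Three of these commands are sketched below in a scratch repository: `git rm --cached` to stop tracking a file, `git checkout -- <file>` to discard unstaged edits, and `git reset --soft` to rewind the last commit. The file names are illustrative, and this covers only one option for each command.

```shell
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name 'Example User'
git config user.email 'user@example.com'
echo 'x <- 1' > analysis.R
echo 'secret' > passwords.txt
git add analysis.R passwords.txt
git commit -q -m 'Add files'

# rm --cached: stop tracking a file but leave it on disk.
git rm -q --cached passwords.txt
git commit -q -m 'Stop tracking passwords.txt'
tracked=$(git ls-files)

# checkout -- <file>: throw away unstaged edits to one file.
echo 'broken edit' >> analysis.R
git checkout -- analysis.R
restored=$(cat analysis.R)

# reset --soft: rewind the last commit but keep its changes staged.
git reset -q --soft HEAD~1
after_reset=$(git status --porcelain)
echo "$after_reset"
```

Even after `git rm --cached`, the first commit still contains passwords.txt, so this is not a way to scrub a secret from history.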
Follow up with the Git book for more detail on implementing these commands.
The Git book for all things git.
Understanding branch topology can be confusing. Git School have a nice tool to help you understand how the different branching, checkouts, commits, and merges map out with a graphic.
The GitHub help page is put together really well and can help with both Git and GitHub questions.
Getting confident with the collaborative side of GitHub is tough. There are less serious repositories which you can take part in to practice e.g. Dad Jokes.
In most of our training we talk about how great R is for bioinformatics. Git and GitHub are no exception. I have shown you RStudio's user-friendly interface for Git. There are also R packages that allow you to do much of what we have talked about through R itself in a programmatic way:
Exercise on Git and GitHub in R can be found here
For any suggestions, comments, edits or questions (about content or the slides themselves), please reach out on our GitHub and raise an issue.
It is easy to add remote repositories from GitHub or local repositories from your computer.
We can check the log of the repository with the history tab.
Staging and committing is simplified:
If there is an associated remote on GitHub you can then just push it up.
For help setting up: https://docs.github.com/en/desktop
For help using the software: https://www.softwaretestinghelp.com/github-desktop-tutorial/