In this first module we focus on the skills needed to get started working on the computing cluster. Even though a lot of bioinformatics can be done on a powerful local PC or Mac, learning to use High Performance Computing (HPC) facilities is part of our training. This means you log onto the computing cluster from your own computer, and do all the work in this course on this facility. We get started by establishing the connection to the HPC facility in two different ways, and very briefly start to use it.
R is a ‘workhorse’ in the courses given at KBM. We assume all are familiar with the use of R and RStudio. This is not a course where you learn new programming concepts, but we will make use of RStudio as our ‘workbench’, and we need to write some (smallish) pieces of R code frequently. Note that in this course we will run R and RStudio on the computing cluster, not on your local computer, but it will look and behave (almost) the same.
We will also devote some time to learning some basic coding in the UNIX shell. This is required to ‘survive’ in modern computational biology. Previous knowledge of the UNIX operating system is not required, and you will not need many UNIX skills to complete the tasks of this course. However, if you foresee a future for yourself in computational biology, you should invest some time in learning more UNIX.
Summarized, these are the learning goals for this module:
I have recorded two short lectures with an introduction to this course:
I will talk about this on our opening day (September 7.), but here
you have a recorded version as well.
In this course we will spend the time working on a High
Performance Computing (HPC) cluster. This is a standard working
environment for most computational sciences, including computational
biology. Our very first step is therefore to get access to this. In the
next module we will dig more into this.
A High Performance Computing cluster can be seen as a collection of (more or less) powerful computers linked together. Each of these computers in the cluster we refer to as a node. Each node may in itself not be particularly powerful, at least compared to a powerful gaming PC you may have at home, but together they make a computing facility with resources exceeding what we typically have locally. We immediately divide the nodes into two distinct categories:
Compared to a local PC, the advantages of a computing cluster are typically:
In Norway we have a national HPC facility called sigma2. It is quite likely that several of you will be using this in the future, and you may read about sigma2 here.
In addition to this, we have at NMBU a local computing cluster called orion. This is the HPC we will be using in BIN310. It is small compared to the national facilities but is a nice platform for learning the necessary skills. It has the same operating system and queuing system as sigma2, and is similar in all ways. Thus, what you learn by using orion is very much transferable to sigma2 and other similar HPC facilities.
We will now take some necessary steps to access the orion HPC.
The documentation website for orion is https://orion.nmbu.no/. NB! You may need a VPN connection (see below) to actually reach this site.
You may start looking into this site now, but it will probably be more useful once you have started to make use of the facilities. Bookmark this site.
In order to access orion, you need to be inside the firewall, for security reasons. This means you need a Virtual Private Network (VPN) connection to NMBU. This may be required even if you are on campus. How to do this?
Note that this requires a two-factor login. Here are some links to some websites with help for setting this up:
These are in Norwegian, but I think it should be possible to follow the procedure anyway.
Setting this up is something you do once, but each time you re-start your computer, you need to log in at the VPN website (https://na.nmbu.no/) before you can access orion. Experience shows that some computers are more problematic than others when it comes to establishing a VPN connection to NMBU.
Please make certain you have a VPN connection, since you will not be able to follow BIN310 without it! Contact the proper IT help services if you run into problems. We (BIN310 teachers) are probably of little help, but let me know if you have some problems. Then I can give feedback to the IT department to improve things for later.
To be able to log into orion, you must first get an account, i.e. you must be registered with a username and a password. For this course we have created some student accounts on orion. To get your username and dummy password, either
The dummy password is for first time login only, and you must immediately change this, see below how to do this.
Let us connect to orion by using software that I will refer to as a Terminal. This opens a single Terminal window with access to orion. This is the basic way of connecting to orion, or any other HPC. A Terminal is simply a window on your local computer that typically may look something like this:
Mac users: The Mac operating system is very similar to the operating system of orion (both are variants of UNIX). Thus, a Mac already has a Terminal app that you make use of now! Start the Terminal app on your Mac, and you have a window for logging into orion. In this window always type
ssh <username>@login.orion.nmbu.no
where you replace <username> by your proper username. After this you press the Return-key, and assuming your VPN
connection is fine, you will be prompted for the password. Type it in,
and hit Return. NOTE: Nothing is displayed while typing
the password (for security reasons), i.e. it may look like it is not
being typed. But it is! Thus, you need to be able to type your password
blindfolded! If this was successful you should get some text scrolling
over the screen, see my short video for Windows users below.
Windows users: You first need to install software
that gives you a Terminal window; unlike on the Mac, this is not standard on
Windows computers. The orion website above suggests using either MobaXterm
or PuTTY. Both are free software tools, and may be installed from the NMBU Software
Center.
When you log onto orion for the first time, using your dummy password, you will immediately be asked to change it (see the MobaXterm video above).
Just do as the text in the Terminal window tells you. Again, when typing the old and new passwords, nothing will appear on the screen, but it is working!
After completion the new password is the one you use from now on. Please do not forget this! You should now end the session, and start a new one, and log in to see if the new password is now working. It may take some time for the new password to be registered.
In the small black text box above you can see the first example of my
writing a UNIX command line in this course
(ssh <username>@login.orion.nmbu.no). All
commands/codes written with such a black background are written directly
in the Terminal window. This is what we refer to as the command
line. This is the basic access to a UNIX system; there is no
graphical interface (Desktop). We will type a lot of commands on command
lines like this in BIN310.
On our local computers we are used to a graphical interface, i.e. some Desktop on which we can see files as small images, and where we can click and drag stuff. UNIX systems also have this, but on an HPC facility it is not used. Here our only ‘window’ into the system is a Terminal window, and in order to communicate with the system we need to type in commands on the command line. You may recognize this from the Console window in RStudio, where you can type R-commands directly, and get output text. There are several reasons why the command line has survived, and is still very much in use. It requires little computing resources, and also gives us more direct access to the operating system. It also reflects that this is actually a computer in the original meaning of the word, not like our Mac or Windows ‘computers’ that are only used for communication, administration and gaming, and hardly ever for actual computing.
Above we used a Terminal window to connect to orion. This is the standard way to access a computing cluster. However, on orion we may also connect by running RStudio directly on orion. This gives us some benefits we will make use of later. In BIN310 we typically have both a Terminal window and an RStudio window open to orion at the same time. Some tasks are best done in the Terminal, some best in RStudio. Let us connect to RStudio on orion.
In BIN310 we have been granted some special servers for running RStudio this year. The reason is to avoid using the standard RStudio access, competing for resources with the rest of the orion users.
To log in to RStudio you use one of these two server addresses:
These should behave the same way. In order to spread our load on both servers we now recommend:
This is not a restriction, just an attempt to spread the load between the two to some degree. You may log into either of the two.
Bookmark these websites! You will be asked for username and password, type in as you did for Terminal login.
If you get an ordinary user account (not BIN310 student) you log into RStudio in a slightly different way. I prefer to also mention this here, since we may also use this option on rare occasions. Note that in BIN310 we will by default use the procedure with the two servers outlined above!
Ordinary users log in from the JupyterHub website https://jupyterhub.orion.nmbu.no/hub/login. Type in your orion username and password as before. You should be taken to a page named Server Options. Here we choose how to run RStudio on orion.
Notice we typically choose:
You then press the big Start button at the lower end of the page, and you are taken to the JupyterHub Launcher site. There you choose the RStudio-4.3.1 button, to start RStudio using R version 4.3.1.
Note again that in BIN310 we should log into RStudio using the two servers mentioned above, this is just a backup solution we may (or may not) need on some rare occasions.
You can configure RStudio more or less in the same way as you can when running RStudio locally on your own computer. Some settings are actually important to be aware of, see the following video:
Here is a short video with some practical hints on configuring RStudio
You may notice that in RStudio we also have a Terminal window (locate
the Terminal tab). Could we not just forget about the Terminal login
that we did above, and only use RStudio? Well, I wish we could,
but there are some problems. The Terminal window in RStudio will not
behave 100% as a standard Terminal window, and we would probably run into some
problems relying only on this. Also, this solution with running RStudio
on the cluster is not in general available (e.g. on sigma2) and we
should not be dependent on it.
Even if UNIX systems in general also have a graphical user interface
(GUI), the command line is always available. This is what we should take
some time learning. All bioinformaticians are expected to be familiar
with UNIX command lines!
The internet is full of help for those who want to learn the basics of UNIX commands. Let us make use of some of this. Open a new tab in your web-browser, and visit
Bookmark this! Step through the first part, named Learning the Shell (see left margin). Have a Terminal window to orion open, and try out the commands from this web-course on orion. The shell we use on orion is the bash (Bourne Again Shell), which is the same as in the web-course.
Here is a short guide to the Learning the Shell part:
login node). If everyone started to run heavy computations here, it would crash!
Open the Terminal (not RStudio) on orion, and try to run the codes listed in the course above.
Note that a folder and a directory are the same thing, I may use both in my texts.
Once you are logged in to orion with a Terminal, you are in your home directory. Based on the reading above, do the following from the command line:
- Create a directory named module1 under the home directory. Notice we never use spaces inside names of files or folders (or any other item). Navigate into it.
- Create a directory named data inside it, and navigate into it.
- Copy the file with extension .faa from /mnt/courses/BIN310/module1/ into this data directory.
- List the content of the file Eubacterium_rectale.faa. Hint: Use cat.
- List only the first lines of the file. Hint: Use head.
- List only the last lines of the file. Hint: Use tail.
- List only the lines having the symbol >. Hint: Use first cat to list all lines, as above. Then add the pipe operator and grep '>' to pick up only the lines having the symbol >.
- Count the number of such lines. Hint: Use wc. What does wc do?
As we go along in this course we will make use of UNIX shell commands. As always, the only way of learning this is to repeat it over and over again. Thus, you will most likely need to revisit the LinuxCommand.org web-site, or similar sources for help later.
### Suggested solutions to above exercise
cd $HOME # cd means change directory. Alternative: cd ~ or just cd
ls # ls means list
pwd # pwd means print working directory
mkdir module1 # mkdir means make directory. You specify its name
cd module1
mkdir data
cd data
cp /mnt/courses/BIN310/module1/*.faa . # cp needs 2 inputs:
# what to copy (from)
# where to copy (to) Here we use only . (a dot)
# A dot means 'this folder'
# You may also give the copied file a new name
cat *.faa # prints the content of all files ending with .faa in this folder,
# which is only one file
head Eubacterium_rectale.faa # head lists the first 10 lines
tail -n 20 Eubacterium_rectale.faa # tail lists the last 10 lines by default, the option -n
# sets the number of lines, here 20
cat Eubacterium_rectale.faa # This lists all lines in the file
cat Eubacterium_rectale.faa | grep '>' # All lines are piped into grep, listing only the lines
# with a >
cat Eubacterium_rectale.faa | grep '>' | wc -l # All lines with > are then piped into wc, which is
# counting the number of lines when we use the option -l
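The same pipeline can be tried on a tiny file you make yourself, where the answer is easy to check by eye. This is only a sketch: the file toy.faa and its three made-up sequences are not part of the course data, and the option -c to grep is shown as one alternative way of counting matching lines.

```shell
# Make a scratch folder and a toy fasta file with three (made-up) sequences
workdir=$(mktemp -d)
printf '>seq1 first protein\nMKVLA\n>seq2 second protein\nMTTQR\nLLKA\n>seq3 third protein\nMA\n' > "$workdir/toy.faa"

# Count the header lines with the pipeline from the solution above
n_pipe=$(cat "$workdir/toy.faa" | grep '>' | wc -l)

# grep may also count the matching lines itself, using the option -c
n_grep=$(grep -c '>' "$workdir/toy.faa")

echo "Headers counted with wc -l: $n_pipe"
echo "Headers counted with grep -c: $n_grep"
```

Both counts should agree with the number of sequences you typed into the file.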
When we are logged in to orion, we are in a shell, which is
an environment not unlike the GlobalEnvironment
in RStudio.
In this shell we may create shell variables just like the
objects we create in R. Let us create a new shell variable and fill it
with some text:
my_name="Lars Snipen"
We create the variable my_name and immediately fill it with content, the text "Lars Snipen". Notice:
- We use the underscore _ to splice together words in the name (never space!)
- The assignment operator is =, and there must be no spaces around it (unlike R or Python)
Just like R you may use both single or double quotes around a text, e.g. we could have written
my_name='Lars Snipen'
To see the content of a shell variable, we need to refer to
it using the $
operator:
echo $my_name
Lars Snipen
The echo
command will simply display whatever it gets as
input, here the content of the variable.
If you forget the $
it will simply display the name of
the variable:
echo my_name
my_name
Learn this difference! When we create or assign something to a
variable, we name it in the usual way, but if we want to retrieve its
content we need to add the $
before its name.
We will create a lot of shell variables in the scripts we make in BIN310.
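One detail is worth knowing before we start using variables inside texts: in bash the two types of quotes are not fully interchangeable. Inside double quotes a $ followed by a variable name is replaced by the content of the variable, while inside single quotes the text is kept exactly as typed. A minimal sketch, where the variable names with_double and with_single are made up for this illustration:

```shell
my_name="Lars Snipen"

# Double quotes: $my_name is replaced by its content
with_double="Hello $my_name"

# Single quotes: the text is kept exactly as typed
with_single='Hello $my_name'

echo "$with_double"   # Hello Lars Snipen
echo "$with_single"   # Hello $my_name
```

As a rule of thumb, use double quotes whenever the text contains a variable you want to retrieve the content of.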
Assume you are in some directory in the UNIX file-tree. Which one of these commands will take you back to your home directory?
cd
cd $HOME
cd ~
Try it out in a Terminal window. You should find that all three ways will take you to your home directory.
Notice the shell variable named HOME. We never created
this, so where does it come from? When you log in to orion, you will find
there are from the start a number of environment variables that
already have content. These are shell variables that are ‘always
there’.
In the Terminal window, inspect the content of this variable by
echo $HOME
In this course I will from now on use $HOME
to symbolize
the home directory. Note that when I type $HOME
it refers
to my home, but when you type $HOME
it refers to your
home.
There are some other environment variables we will also meet:
- COURSES. This is the path to a folder where courses like BIN310 may upload data to share among the course participants.
- SCRATCH. This is the path to a folder you have, much like your HOME, and where you may store temporary stuff.
- TMPDIR. This is a folder where we may tell software to store temporary files.
We will meet these as we go along. By convention, all environment variables are written in uppercase letters. Thus, when we create shell variables we should not use uppercase letters (even if it is legal), just to avoid a potential confusion.
You may also create your own environment variables, but we will probably not spend time on this in BIN310.
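For the curious, here is a minimal sketch of the difference. An ordinary shell variable lives only in the current shell, while a variable marked with export is also passed on to programs started from that shell. The names my_project and MY_COURSE are made up for this illustration.

```shell
# An ordinary shell variable, visible only in this shell
my_project="module1"

# An environment variable, created with export (uppercase by convention)
export MY_COURSE="BIN310"

# Start a new (child) shell and ask it to display both variables
child_sees=$(bash -c 'echo "project=[$my_project] course=[$MY_COURSE]"')
echo "$child_sees"   # project=[] course=[BIN310]
```

The child shell finds nothing in my_project, since only the exported variable was passed on to it.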
Experience tells us that many people are not used to caring about folders and the paths to where their files are. On Windows or a Mac everything is visible on a desktop, and users are in general little aware of how the files and folders are organized.
When working in a UNIX system, you have to know where your files are!
When we start making scripts (program-files) and produce result
files, it is a good habit to organize everything in directories
(folders). You typically create a new directory for each module or
assignment, we already created the module1
directory above.
Once we start working with many input and output files, it is important
to be organized properly. It will become a hopeless mess if you store
all files together in one directory.
When you are working inside a folder, your script (R, shell or
whatever) will not know about files in other folders unless you
specifically address them. An example is from the exercise above, where
we copied the file Eubacterium_rectale.faa from the folder
/mnt/courses/BIN310/module1/. We may, in any
folder, access this file directly as long as we use the full path to it.
Let us use the wc command to see how many lines it has:
wc -l /mnt/courses/BIN310/module1/Eubacterium_rectale.faa
Note that this will always work regardless of which folder I execute it from, since I specify exactly which file I refer to (the path to where it is located and its name).
But, this will not work if we just type:
wc -l Eubacterium_rectale.faa
unless there is in fact a file with that name in the folder you are now working inside. If so, this is not the same file, just another file having the same name!
Learn that the shell will not, magically, know which file you are referring to unless you specify completely which file and where exactly it is located.
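The point can be tried out safely in a scratch folder. The sketch below makes a small file (small.txt, a made-up name) inside a temporary folder, and then refers to it both with its full path and with the file name only. It assumes there is no file named small.txt directly under the root folder /.

```shell
# Make a scratch folder with a file inside a sub-folder
workdir=$(mktemp -d)
mkdir "$workdir/data"
printf 'line one\nline two\n' > "$workdir/data/small.txt"

# Full path: works no matter which folder we are in
cd /
lines_full=$(wc -l < "$workdir/data/small.txt")

# File name only: works only when we are in the folder holding the file
if wc -l small.txt > /dev/null 2>&1; then
  found_from_root="yes"
else
  found_from_root="no"
fi

cd "$workdir/data"
lines_local=$(wc -l < small.txt)

echo "Full path: $lines_full lines"
echo "File name only, from /: found = $found_from_root"
echo "File name only, from the data folder: $lines_local lines"
```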
Let us return to the RStudio connection to orion again. Before we
start using R properly, we need to install some R packages.
Every user installs their own R packages on orion; only the base packages
are always available to everyone. But, before we start installing
R-packages, we like to create a specified R library, which is
simply a directory in which we install all R-packages.
There is a reason for making a specified R library. If you start to install R-packages without having done this, RStudio will create a local library somewhere. This works fine as long as you only use R through RStudio. But, sooner or later we may run bigger jobs, and we want to start R from the SLURM queuing system (more on this later). Then the local library created by RStudio is no longer visible to R! In fact, specifying the directory (folder) where you want to store your R packages is always a good idea, even on your local computer. For instance, many problems can be avoided by never storing R packages in the cloud…
Here is how we do this:
First, create the library directory. In my $HOME directory there is a directory named R
(if you do not have one, create it now!) and inside this I have created another directory named
myLib, thus I want $HOME/R/myLib to be my
R-package library. You should create a directory under your $HOME
in a similar way (you may of course choose different
names). We must now make R aware of this, and make R always look into
this folder when installing or using packages.
The .Rprofile file
Each time you start R on orion, either through RStudio or directly in
the shell, it will always look for a file named .Rprofile
in your $HOME directory, and execute the code in this file.
Thus, in this file we put commands we always want to be executed at
startup. This file does not exist by default, but we will create it and
fill it with some content now:
In RStudio, create a new text file (File - New File - Text File). In this file you write one line of R-code:
.libPaths("~/R/myLib")
Add one newline (return) such that the file ends with an empty line
(this is important!). Save this file in your $HOME
directory and under the exact name .Rprofile. Notice
the file name starts with a dot.
This should now tell R to use the folder ~/R/myLib as its
library. Note that you may of course choose other directories as your
library, just edit the .Rprofile file accordingly.
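The same set-up may also be done from the Terminal. The sketch below uses a scratch folder (demo_home, a made-up name) as a stand-in for your home directory, such that it can be tried without touching any real files. On orion you would use $HOME in its place.

```shell
# A scratch folder standing in for $HOME in this illustration
demo_home=$(mktemp -d)

# Create the library folder; -p also creates the parent folder R if it is missing
mkdir -p "$demo_home/R/myLib"

# Write the single line of R code to .Rprofile, ending with a newline
printf '.libPaths("~/R/myLib")\n' > "$demo_home/.Rprofile"

# Files starting with a dot are hidden, use ls -a to list them
ls -a "$demo_home"
cat "$demo_home/.Rprofile"
```

Using printf with \n at the end makes sure the file ends with a newline, as required above.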
To test if this has been successful, first quit RStudio, and re-start it. This means you must use the menu File - Quit Session… (closing the web browser and re-opening it does not re-start RStudio!). Then, from the menu choose Tools - Install Packages… and a small window should appear. In the bottom pop-up menu of this window (Install to Library:) you should now see your chosen library. If not, something went wrong. Make certain you did exactly as specified above!
We can install R-packages from various sources. Please make certain you run the newest version of R on orion when you install R packages. In most cases you can then go back and use older versions of R (if you for some reason like to), but the opposite is not necessarily possible.
Here are some exercises that you need to do, we will need these packages (and more) later:
Install the microseq package from CRAN. The
Comprehensive R Archive Network (CRAN) is the main repository for R
packages. Use the Tools - Install Packages... in RStudio.
Search for microseq. This is a small package we have made,
and we use it mostly for reading/writing sequence files to/from tables.
There are other packages for doing this, but they tend to not store data
in tables. R loves tables!
We will need the tidyverse
packages from CRAN in BIN310.
However, this should already be available to us, along with a number of
other R packages. Inspect the Packages tab in RStudio to verify
this.
Install the package ggtree
from Bioconductor. Try to
google and follow the instructions.
A script is simply a text file containing some code, in our case either R code or UNIX shell code. Since we can run RStudio on orion, we will use RStudio as our editor for making all kinds of scripts. There are editors we may run inside the Terminal, and we may briefly see this too, but in this course we will predominately make use of RStudio.
You should have some familiarity with R coding, as this is a prerequisite for this course. In this course we will not do very much statistics, but use R for data wrangling. In most cases this means reading text files into R, doing some filtering, selection, sorting, simple calculations etc, and then doing some plotting of this.
We will also make scripts with shell code. A shell script is
simply a text file listing UNIX commands. You need to make such scripts
in order to start a computing job on an HPC, and in these scripts we will
from time to time need some simple coding in addition to straightforward
commands. I use the term ‘shell code’ for all commands we can put into
such shell scripts. The shell is not formally a programming language,
more like a set of commands, but it contains some elements we recognize
from programming languages, like if-else statements and for-loops. We
will also see some use of awk, which is a programming
language, in some of this shell code, but this is not a course in
awk coding.
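To give a first impression of what such shell code may look like, here is a minimal sketch combining a for-loop and an if-else statement. The three file names are made up and no files are actually read; the loop only inspects the text of each name.

```shell
# Loop over three made-up file names and sort them by extension
n_fasta=0
for f in genome1.faa genome2.faa notes.txt; do
  # ${f##*.} is the text after the last dot in the name, i.e. the extension
  if [ "${f##*.}" = "faa" ]; then
    echo "$f is a fasta file"
    n_fasta=$((n_fasta + 1))
  else
    echo "$f is something else"
  fi
done
echo "Number of fasta files: $n_fasta"
```

Notice that the loop ends with done and the if-else statement ends with fi, where R would use curly brackets.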
Let us make an R script and refresh some data wrangling in R.
First, make a new R script (File - New File - R Script). Add this code
library(tidyverse) # we will need functions from these packages
library(microseq) # we need the function readFasta() from this package
### Reading data from fasta-file
faa.tbl <- readFasta("/mnt/courses/BIN310/module1/Eubacterium_rectale.faa")
and save it in your module1 directory under a proper name.
R scripts should always have the extension .R. RStudio will
in fact add this if you forget. Remember: Never use
spaces inside file names, and never ever use the Scandinavian letters
when coding!
Notice the looks of the code chunk above, with a rather light grey background. This is how I will display R code in the texts I produce for this course. Each time you see a box like this, it contains R code.
Run (Source) the script from above to verify it works. This code
reads the data into the table faa.tbl in R. This table has
two columns, one named Header and one named Sequence. Extend the script to fulfill these tasks:
- Add a new column named Length containing the length of each sequence. Hint: Use mutate() and str_length().
- Sort the table by Length, such that the longest sequence is at the top. Hint: arrange() and desc().
- Make a new table with the 10 longest sequences. Hint: slice().
- Make a new table with the sequences longer than 300. Hint: filter().
- Make a new table with the sequences where butyrate is mentioned in the Header. Hint: str_detect().
These are small examples of typical data wrangling, i.e. handling or processing of data (tables) in R. We will see later that the output we get from various software tools may be read into R and wrangled like this, in order to either summarise the results or give them as input to the next software tool.
library(tidyverse) # we will need functions from these packages
library(microseq) # we need the function readFasta() from this package
### Reading data from fasta-file, adding sequence lengths and sorting
faa.tbl <- readFasta("/mnt/courses/BIN310/module1/Eubacterium_rectale.faa") %>%
  mutate(Length = str_length(Sequence)) %>%
  arrange(desc(Length))
### The 10 longest sequences (the table is already sorted)
longest.tbl <- faa.tbl %>%
  slice(1:10)
### Sequences longer than 300
above300.tbl <- faa.tbl %>%
  filter(Length > 300)
### Sequences with butyrate mentioned in the Header
butyrate.tbl <- faa.tbl %>%
  filter(str_detect(Header, "butyrate"))
In addition to some R scripts, we will also make shell scripts in this course.
Create a new shell script file (File - New File - Shell Script). Add to it the following shell code:
#!/bin/bash
# My first script
echo "Hello World!"
echo "My home is $HOME"
and save it in a proper folder under the name hello.sh.
Shell scripts should always have the extension .sh, and the
newest versions of RStudio will add this in case you forget. You may
notice that when you pasted the text into RStudio it was colored.
RStudio also understands shell code!
Notice also the darker grey background of the code chunk above. This is how I will display shell scripts in the texts I produce for this course. Thus, whenever you see code chunks in my texts then
We typically run shell scripts from the Terminal. Open a Terminal window and navigate to the same folder where you saved the shell script above. Run the script by
./hello.sh
Is it executed? Probably not… Why not?
List the content of the folder in long format by
ls -l
This reveals the permissions to every file. In my case it looks like this:
Notice the permissions for the shell script file. It is the string
-rw-r--r-- at the start of the listing line. The first
- indicates it is a file, it would say d if it
was a directory. The next 9 characters indicate the permissions for this
file:
The first three indicate the Reading (r), Writing
(w) and Execution (x) permissions for you, the
User that owns this file. A - indicates the corresponding
permission has not been given, i.e. rw- indicates reading
and writing permissions, but not execution. The next three are similar
permissions for the Group you as a user belong to. You may open files to
people in the same group only. Finally, the last three characters are
similar for all Other users. Thus, if it reads
-rwxrwxrwx the file will be completely open to all on
orion!
In order to run a shell script, you as a user must have executable permissions to it. Is this the case here? In the web-course at LinuxCommand.org there was a session about permissions. Use this to
If you try to execute a shell script without permissions you always
get the Permission denied
error message. Now you know how
you may fix this!
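A minimal sketch of one way to do this, using the chmod command on a throw-away copy of the script in a scratch folder. The variable names are made up for the illustration, and the permissions are first set explicitly to reading/writing only, to start from the same situation as in the listing above.

```shell
# Write a small script to a scratch folder
workdir=$(mktemp -d)
script="$workdir/hello.sh"
printf '#!/bin/bash\necho "Hello World!"\n' > "$script"

# Start from read/write for the user and read only for everyone else
chmod 644 "$script"
perm_before=$(ls -l "$script" | cut -c1-10)

# Give the user (u) the execute (x) permission
chmod u+x "$script"
perm_after=$(ls -l "$script" | cut -c1-10)

# Now the script can be run
greeting=$("$script")

echo "$perm_before -> $perm_after"   # -rw-r--r-- -> -rwxr--r--
echo "$greeting"                     # Hello World!
```

Here only the user gets the execute permission, while the Group and Other permissions are left unchanged.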
Epilog: In fact, this little shell script you could also
execute directly inside RStudio, using the Run Script
button (upper right corner of file). If you do, RStudio will
automatically change the script's permissions to executable! However, we
will in general not run shell scripts in RStudio. By far most of the
shell scripts we will make in BIN310 are for submitting jobs to the
SLURM queuing system, and this is not something RStudio will do for you!
We will see more of this in the coming weeks.