You will get access to the Orion computing cluster that we will use in BIN310. However, for the next 3 weeks we will use our own local computers; from week 5 we will use Orion for all heavy computations. Since some students experience more hassle than others getting this to work, it is good that we have some time to fix it before we start using Orion every day.
R is a ‘workhorse’ in the courses given at KBM. We assume all are familiar with the use of R and RStudio. This is not a course where we teach new coding concepts, but we will make use of RStudio as our ‘workbench’, and we will frequently need to write small pieces of R code. We take some time to refresh R coding this week, but be prepared to constantly update yourself on this.
We also have a look at some background for the sequencing data we will meet in BIN310. Summarized, these are the learning goals for this module:
I have recorded two short lectures with an introduction to this course:
I will talk about this on our opening day (September 4), but here you have a recorded version as well.
In this course we will spend time working on a High Performance
Computing (HPC) cluster called Orion here at NMBU. This is a
standard working environment for most computational sciences, including
computational biology. Our very first step is therefore to get access to
this.
A High Performance Computing cluster can be seen as a collection of (more or less) powerful computers linked together. We refer to each computer in the cluster as a node. Each node may in itself not be particularly powerful, at least compared to a powerful gaming PC you may have at home, but together they make up a computing facility with resources exceeding what we typically have locally. We immediately divide the nodes into two distinct categories:
Compared to a local PC, the advantages of a computing cluster are typically:
In Norway we have a national HPC facility called sigma2. It is quite likely that several of you will be using this in the future, and you may read about sigma2 here.
In addition to this, we have at NMBU a local computing cluster called Orion. This is the HPC we will be using in BIN310. It is small compared to the national facilities, but is a nice platform for learning the necessary skills. It has the same operating system and queuing system as sigma2, and works in much the same way. Thus, what you learn by using Orion is very much transferable to sigma2 and other similar HPC facilities.
We will now take some necessary steps to access the Orion HPC.
The documentation website for Orion is https://orion.nmbu.no/. NB! You may need a VPN connection (see below) to actually reach this site.
You may start looking into this site now, but it will probably be more useful once you have started to make use of the facilities. Bookmark this site.
In order to access Orion, you need to be inside the NMBU firewall, for security reasons. This means you need a Virtual Private Network (VPN) connection to NMBU. This may be required even if you are at campus.
How to establish a VPN connection depends a little on your PC. Here are some links to help you:
Please consult the IT service resources at NMBU for help on this; it is not something we (BIN310 teachers) can help you with, since our NMBU-employee computers are set up to not require VPN.
Setting this up is something you do once, but each time you restart your computer, you need to establish a VPN connection before you can access Orion. Previous experience is that some computers are more problematic than others when it comes to establishing a VPN connection to NMBU.
Please make certain you have a VPN connection! Contact the proper IT help services if you run into problems. We (BIN310 teachers) are probably of little help, but let me know if you have some problems. Then I can give feedback to the IT department to improve things for later.
To be able to log into Orion, you must first get an account, i.e. you must be registered with a username and a password.
For this course we have created some student accounts on Orion. These are all based on the students registered for the course in Fagpersonweb on Monday, September 1. You should get an email about this from the IT department on your NMBU email some time after September 1. You use the same username on Orion as you use anywhere else at NMBU, but you need a separate password.
In the email from the IT you get a dummy password. The dummy password is for first time login only, and you must immediately change this, see below how to do this.
In case you did not get any email from the IT department about this before Friday, September 5, email me at lars.snipen@nmbu.no from your NMBU email, using the subject orion student. Since we will not use Orion right away, there is no need for panic, but we also want this to be settled as quickly as possible.
Let us connect to Orion using software that we will refer to as a Terminal. This gives us a single Terminal-window access to Orion, and is the basic way of connecting to Orion, or any other HPC. A Terminal is simply a window on your local computer that typically may look something like this:
Mac users: The Mac operating system is very similar to the operating system of Orion (both are variants of UNIX). Thus, a Mac already has a Terminal app that you can make use of now! Start the Terminal app on your Mac, and you have a window for logging into Orion. In this window you always type
ssh <username>@login.orion.nmbu.no
where you replace <username> by your proper Orion username. After this you press the Return key, and assuming your VPN connection is fine, you will be prompted for the password. Type it in, and hit Return.
NOTE: Nothing is displayed while typing the password (for security reasons), i.e. it may look like it is not being typed. But it is! Thus, you need to be able to type your password blindfolded! If this was successful you should get some text scrolling over the screen, see my short video for Windows users below.
Windows users: You first need to install software that gives you a Terminal window; unlike on a Mac, this is not standard on Windows computers. The Orion website above suggests using either MobaXterm or PuTTY. Both are free software tools, and may be installed from the NMBU Software Center.
This video is from 2023, and the usernames were a little different then, but the procedure is the same.
When you log onto Orion for the first time, using your dummy password, you will immediately be asked to change it (see the MobaXterm video above).
Just do as the text in the Terminal window tells you. Again, when typing the old and new passwords, nothing will appear on the screen, but it is working!
After completion the new password is the one you use from now on. Please do not forget this! You should now end the session, and start a new one, and log in to see if the new password is now working. It may take some time for the new password to be registered.
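Should you want to change your password again at some later time, the standard UNIX command for this is passwd, typed on the Orion command line. This is the generic UNIX way; check the Orion documentation if it behaves differently there.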
In the small black text box above you can see the first example of my writing a UNIX command line in this course (ssh <username>@login.orion.nmbu.no). All commands/code written with such a black background are meant to be written directly in the Terminal window. This is what we refer to as the command line. This is the basic access to a UNIX system; there is no graphical interface (Desktop). We will type a lot of commands on command lines like this in BIN310.
On our local computers we are used to a graphical interface, i.e. some Desktop on which we can see files as small images, and where we can click and drag stuff. UNIX systems also have this, but on an HPC facility it is not used. Here our only ‘window’ into the system is a Terminal window, and in order to communicate with the system we need to type in commands on the command line. You may recognize this from the Console window in RStudio, where you can type R commands directly, and get output as text. There are several reasons this command line facility has survived and is still very much in use. It requires little computing resources, and it gives us more direct access to the operating system. It also reflects that this is actually a computer in the original meaning of the word, unlike our Mac or Windows ‘computers’, which are mostly used for communication, administration and gaming, and hardly ever for actual computing.
We want you all to get an account and everything fixed to access Orion now. However, we will not start using Orion until week/module 5. The reason is that modules 2, 3 and 4 are all about metabarcoding data processing and analysis, and such data are usually small enough to be handled well by our local computers. We could have used Orion for this as well, but since it is not really necessary we have chosen not to this year. Only when we start working with whole genome and metagenome data do we need the power of Orion. Thus, if you experience problems with VPN or accessing Orion, you now have some time to fix this before we start using it in week 5.
You should have some familiarity with R coding, as this is a prerequisite for this course. In this course we will not do very much statistics, but rather use R for data wrangling. In most cases this means reading text files into R, doing some filtering, selection, sorting and simple calculations, and then plotting the results.
You will also need to make some reports in RMarkdown and hand in the HTML output from this.
We will later in BIN310 use the Orion computing cluster at NMBU, but we start out by using our local computers the first weeks. We assume you have R and RStudio on your local computer. Update R and RStudio to the newest versions now at the start of the semester.
You can configure RStudio more or less as you like, but some settings are actually important to be aware of, see the following short video:
Here is a short video with some practical hints on configuring RStudio
This was made using RStudio on Orion (we come back to that later), but the same applies to RStudio on your local computer as well.
We will, as always, need some extra R packages during BIN310. Before you start installing R-packages, it is a good idea to create a specified R library, which is simply a folder in which you install all your R-packages.
RStudio will always suggest a folder the first time you try to install a package. This is often OK to accept on a local computer. However, even on a local computer RStudio may choose a sub-optimal folder; many problems could have been avoided by never storing R packages in the cloud! The main reason we focus on this here now is that when we later start using RStudio on Orion, life becomes simpler if we, not RStudio, decide where the R packages should be stored. The procedure will then be the same as the one we sketch here now.
You may omit this if you have done a proper job along these lines already.
First, create a folder somewhere on your local disk (not in some cloud). As an example, I have on my laptop a folder named Rlib directly under the C: disk (Windows), i.e. the folder is C:\Rlib in Windows language. You may call the folder whatever you like, and put it wherever you like as long as it is local on your laptop. This is where you want RStudio to install all R packages.
Next, we need to tell RStudio to use this folder. There are several ways to do this. We now focus on using the .Rprofile file, since this is what we will do on Orion later.
Whether you have a Windows, Mac or Linux computer, you always have a HOME folder somewhere. To locate this, run the following code in the Console window of RStudio:

path.expand("~")
## [1] "C:/Users/larssn/OneDrive - Norwegian University of Life Sciences/Documents"

The folder listed is what RStudio considers to be your HOME folder.
In this folder you must put a file named .Rprofile (note that it starts with a dot). In RStudio, just open a new R script (File - New File - R Script). Fill it with the line

.libPaths("C:/Rlib")

and replace the text with the full path to the R library folder you created above. Note: use forward slash / even in Windows, where we otherwise use backslash \. Now save this file in the HOME folder you revealed above, and call the file exactly .Rprofile.
Every time R starts, it will look through your HOME folder for a file named .Rprofile. If such a file exists, the R commands in it will be executed. The command .libPaths() is used to set the path to the folder in which to install R packages. Thus, if you did this correctly, you should now restart R (so that the new .Rprofile is read) and run .libPaths() with no arguments (empty parentheses) in the Console. It will typically list more than one path, but the first one should be the one you created and specified in your .Rprofile file.
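For example, with the folder above, the output could look something like this (the exact paths, and the R version in the second path, are just an illustration and will differ on your computer):

.libPaths()
## [1] "C:/Rlib"                             "C:/Program Files/R/R-4.4.1/library"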
Here is a short video illustrating the procedure
Again, this video was made running RStudio on Orion. But the procedure is the same on your local computer; you just choose/create the folder you want to use there. We will later do the same on Orion, and then you may want to take a look at this video again.
We install R packages from various sources. Here are some exercises that you need to do; we will need these packages (and more) later:

- Install the microseq package from CRAN. The Comprehensive R Archive Network (CRAN) is the main repository for R packages. Use Tools - Install Packages... in RStudio and search for microseq. This is a small package we have made, and we use it mostly for reading/writing sequence files to/from tables. There are other packages for doing this, but they tend to not store data in tables. R loves tables!
- To be able to read compressed data files, you also need to install the package R.utils from CRAN. Do this right away!
- Below you will need the package devtools. Install this unless you already have it.
- We will also need the tidyverse packages from CRAN in BIN310. You may have these already; if not, install tidyverse now. This takes quite some time!
- It is also a good idea to update all packages at the start of the semester. Use the Update button at the top of your Packages tab in RStudio!
- Install the packages phyloseq and dada2 from Bioconductor. Try to google and follow the instructions. The Bioconductor is an alternative to the CRAN repository, and has a lot of useful stuff for bioinformatics. All Bioconductor packages are installed in a similar way, using the BiocManager package.
- Next week we will need the R package Rsearch. This also relies on you installing the software tool vsearch independently of R. Go to the GitHub site https://github.com/CassandraHjo/Rsearch, read the instructions and install Rsearch and also the extra software vsearch.

A sketch of how these installations may look as R code is shown below.
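If you prefer doing the installations from the Console instead of clicking in the menus, a minimal sketch could look like this. The CRAN and Bioconductor commands are standard; the last line is only my assumption that Rsearch is installed from GitHub via devtools, so always follow the instructions on the Rsearch GitHub page (and remember that vsearch itself must be installed separately, outside R).

# Packages from CRAN
install.packages(c("microseq", "R.utils", "devtools", "tidyverse"))

# Packages from Bioconductor, installed via the BiocManager package
install.packages("BiocManager")
BiocManager::install(c("phyloseq", "dada2"))

# Rsearch from GitHub (assumed; see the GitHub page for the authoritative instructions)
devtools::install_github("CassandraHjo/Rsearch")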
Later in BIN310 we will use RStudio running directly on Orion (not your local PC). It looks and behaves quite similar to what you are familiar with on your local PC. We will then use the same procedure for telling R where to install the R packages as we did above. We will come back to this in week 5.
By sequencing we refer to the technology that allows us to actually read the sequence of DNA (or RNA) molecules. In this course we will focus on DNA sequencing data, but the technologies are the same for RNAseq.
Considering microbial genomics in general we can divide sequencing efforts into the following two categories:
They differ mainly in what is done to the DNA before the actual sequencing takes place. With WGS we want to sequence entire genomes, and no particular region is targeted. With metabarcoding or amplicon sequencing, we first copy (amplify) some region from the genomes, usually by PCR (Polymerase Chain Reaction), and then sequence only these amplified pieces.
Anyway, the data we get from sequencing are DNA sequences that we call reads. A read is a copy of the actual DNA sequence. Reads may be short or long, depending on the technology, and we typically get a lot of them. Our raw data are these reads.
This year we will focus on metabarcoding data in the first part of
the course, and then turn the attention to WGS data in the remaining
weeks.
In this course we do not devote much time to this question. But it seems rather obvious that in order to understand, and be able to predict, what happens in biology, knowledge of the DNA sequence is a basic piece of information. It is not, however, the case that in order to understand any biology you need to do some sequencing. One could sometimes get this impression, since ‘everyone’ is sequencing these days.
Sequencing has become rather cheap and easily available, and for this
reason a lot of such data are being produced. In this course we focus on
how to process such data, and how to make some rather basic analyses
from them. The more sophisticated analyses will, however, require a more
focused attention to one particular application (why you sequenced). In
order to make this course useful for as many as possible, we try to
avoid too specific applications, and rather spend time learning the
skills and theory that everyone may benefit from. The downside of this
is of course that the more exciting findings are outside our scope. When
you start on your own project (Master/PhD) you will have the time to dig
into your data looking for answers to your questions!
Instead of me talking about sequencing, it is better to make use of the endless sources of teaching material on the internet.
Below is listed some links to an online course from San Diego State University, US. The topics overlap quite a lot with our BIN310, focusing on genome sequencing and bacteria (even the lecturer looks somewhat similar!). I suggest we listen to what Rob Edwards has to say about sequencing. You should look through these videos, and learn:
In the last video above Rob Edwards lists a number of steps on our left hand side of the board. We will also have a look at this, except for the metabolic modelling, which is outside the scope of BIN310.
Also, all this is about genome sequencing. In the first part of
BIN310 we will instead focus on metabarcoding data, which means we
sequence targeted amplicons instead of random pieces of the genomes. But
in most of BIN310 modules we will also work with whole (meta)genome
sequencing data.
There are a multitude of different technologies available for us to do DNA sequencing. We will mainly focus on Illumina sequencing in this course, since this is the most widely used today. Illumina data is what you meet everywhere. The Illumina technology, and some others, are often referred to as Next Generation Sequencing (NGS) or second generation sequencing (or HTS=High Throughput Sequencing by Rob Edwards).
In addition, we will also briefly see some data from newer technologies for producing long reads, often referred to as third generation technologies. These are usually produced by machines from either Pacific Biosciences or Oxford Nanopore, and we refer to these as PacBio or Nanopore data.
Again, let us listen to Rob Edwards:
There are a couple of other sequencing technologies mentioned in some other Rob Edwards videos. You may have a look at them as well, but we will not dig into this here.
There are also a number of nice videos that animate the Illumina
sequencing process, have a look at this: https://www.youtube.com/watch?v=fCd6B5HRaZ8
Here is a brief summary of the sequencing types and technologies, seen from a BIN310 perspective.
|  | Used with WGS | Used with amplicon | Paired-end | Read lengths (bases) | Precision | Technology |
|---|---|---|---|---|---|---|
| Short-reads | yes | yes | usually | 150-300 | high | Illumina |
| Long-reads | yes | yes | no | 1000-100 000 | medium-low | Nanopore, PacBio |
We notice that both short- and long-read technologies are used for both WGS and metabarcoding types of sequencing, but we could say that for metabarcoding the more precise short reads are preferred, while for WGS the long reads are beneficial if you want to assemble genomes. We will see more of this as we go along.
How many reads do we get from a sequencing run? This varies a lot. With metabarcoding sequencing, we need only a few thousand reads, since we have copied a small piece of the genome and only sequence this piece. For full genome sequencing we need more, and for metagenomes even more. From deep sequencing of a highly diverse metagenome we may have 100 million reads or more. Remember also that with e.g. Nanopore sequencing, the reads are long (many thousand bases) and the number of reads may not be that large. It may still be a large data set; it is the number of bases we sequenced that counts.
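As a small illustration of this last point: 1 million Nanopore reads with an average length of 10 000 bases amount to 10 gigabases of sequence, roughly the same as almost 70 million Illumina reads of 150 bases each.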
Regardless of sequencing technologies, the actual output from
sequencing comes as text-files following the fastq format.
Well, in case of Nanopore technology there is actually a
base-calling step first, but if we consider this as part of the
sequencing itself (it is in the other technologies), we can say that our
job starts by having some fastq files with sequencing data.
For a bioinformatician, the fastq files are usually the raw data we start out with. These fastq files include the reads (sequences) as well as a header (identifier) for each read. But they also contain a quality string for each read, indicating how reliable each base in the read is.
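To make this concrete, here is a small made-up example of a single read record from a fastq file. Each read occupies four lines: a header line starting with @, the sequence itself, a separator line starting with +, and finally the quality string with one symbol per base:

@read_1
ACGTACGTTTGA
+
IIIIHHHGGF#!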
The quality scores are also sometimes referred to as phred
scores. This is due to the software phred
that was
developed during the human genome project, and from which these scores
originate. Some of the analyses we will do make use of this quality
information, but in many cases it is also ignored. We should be aware
that sequencing technologies may be poorly calibrated, such that quality
scores do not always reflect the absolute error probability as mentioned
in the video above. A higher score means a better sequencing result, but
exactly how good or how much better varies a lot.
The fastq files we get from the sequencing machines are sometimes
big, many gigabytes per file. This is why we should always try to keep
them compressed when storing them, to save disk space. You will
typically see the file extension .fastq.gz
on such files,
where the .gz
indicates a compressed (gzipped) file. Most
software (but not all) working with fastq files will be able to read the
compressed file directly, and we need not decompress it. If you want to
inspect the actual text file you have to decompress it first, but this
is rarely something we do.
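As an example, the readFastq() function from the microseq package (with R.utils installed) reads a gzipped fastq file directly, which is also what we will do in the exercise below; the file path here is just an illustration:

library(microseq)
# Reading a compressed fastq file directly, no decompression needed
reads.tbl <- readFastq("data/metabarcoding_R1.fastq.gz")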
In the short video above we saw how the quality symbols in fastq files can be converted to quality scores, and then again how these can be converted to probability of error.
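In short, if a base has a quality symbol with ASCII value \(a\), its quality score is \(Q = a - 33\), and the corresponding probability of a sequencing error at that base is \[p = 10^{-Q/10}\] Thus, a score of 30 corresponds to an error probability of 0.001, a score of 20 to 0.01, and so on. These are exactly the conversions we will code up in the exercise below.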
How many errors can we expect to see in a read? Let us now answer this question by recapitulating binomial variables from introductory statistics.
Assume the quality scores reflect the probabilities of error exactly. If we have a read with \(L\) bases then each base has an error probability. Let us denote them \(p_1, p_2,...,p_L\) for each of the bases along the read. Consider now a single base and its error probability \(p_i\). What is the expected number of errors here? Well, the actual number of errors can be either 0 or 1 of course. The expected value is the average number of errors we would get if we had \(n\) such bases, all having the same error probability. This is the same as coin flipping! You have a coin where \(p_i\) is the probability of Heads. You flip the coin \(n\) times and count how many Heads you ended with. This number of Heads is a binomial variable. The expected value of this binomial variable is defined as \(n\cdot p_i\). Thus, the expected number of errors at the single base is \(1\cdot p_i=p_i\) since there is only this single base.
But the read has several bases, in fact \(L\) of them. The expected number of errors at each position is then \(p_1, p_2,...,p_L\), and the total for the whole read is \[EE = p_1 + p_2 + \cdots + p_L = \sum_{i=1}^{L}p_i\] We use \(EE\) as short for the Expected number of Errors in a read. It is simply the sum of the error probabilities! We will see this again next week when we start processing metabarcoding data.
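As a tiny worked example, a made-up read with only three bases and quality scores 30, 20 and 10 has \(EE = 0.001 + 0.01 + 0.1 = 0.111\), or in R:

q <- c(30, 20, 10)   # quality scores for a toy read
p <- 10^(-q / 10)    # error probabilities: 0.001, 0.010, 0.100
EE <- sum(p)         # expected number of errors: 0.111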
In this exercise we will make a visual display of the expected number of errors for some reads. The exercise also gives you some practice with the kind of R coding listed at the end of this section.
Download (right-click, save as) these two compressed fastq files:
These are examples of Illumina paired-end sequencing data. Here we have amplified a specific piece of DNA from several bacteria in some environment, resulting in many smallish pieces or fragments of double-stranded DNA, typically a few hundred bases long. Then, these double-stranded fragments have been sequenced from both sides, resulting in a read pair from each fragment.

Illumina is often (but not always) paired-end sequencing, i.e. both ends of the genome fragments have been sequenced. This typically results in a pair of fastq files as above. Note that the only difference between the file names is the R1 and R2. The R1 file is often referred to as the R1, the left or the forward reads, and the R2 as the R2, the right or the reverse reads.
Personally, I dislike the use of ‘forward’ and ‘reverse’, because it gives you the impression that all reads in the R1 file are from the forward strand of the genome. They are not! Roughly half the reads are from the forward and half from the reverse strand, in both files. After cutting/amplifying the fragments, we lose all control of forward and reverse strand, and I prefer to simply use the terms R1 and R2 reads. The R1 reads are all sequenced in round 1 of the Illumina sequencing machine, while the R2 reads are all sequenced in round 2. This has nothing to do with strands on the genomes!
Here is what you should do:

- Read the two fastq files into R using readFastq() from the microseq package. You should get 2 tables.
- Rename the columns by adding the suffixes _1 and _2 for the two files, respectively. This means the Header column is named Header_1 in the R1 table and Header_2 in the R2 table.
- Split each Quality text into single characters using str_split(). Store this as new columns Cvec_1 (from Quality_1) and Cvec_2 (from Quality_2). Since str_split() returns a list, these columns become list columns.
- Use the following function to convert the quality characters to phred scores:

# Converts ASCII characters to phred scores
c2s <- function(Cvec){ # Cvec is a vector of characters
  Svec <- numeric(length(Cvec))
  for(i in 1:length(Cvec)){
    Svec[i] <- strtoi(charToRaw(Cvec[i]), 16L) - 33
  }
  return(Svec) # Svec is a vector of integers
}

- Apply this function to the Cvec columns to get new list columns Svec_1 and Svec_2 containing vectors of quality score values (these are called Qvec_1 and Qvec_2 in the solution below).
- Convert the quality scores to error probabilities, stored as new list columns Pvec_1 and Pvec_2.
- Compute the expected number of errors for each read, stored as new columns EE_1 and EE_2.
- Plot the EE values using the ggplot2 package. You choose the proper way to display them, but we should see how the R1 and R2 reads differ. Also, remember these are pairwise data; we should perhaps be able to compare R1 and R2 within each pair?

This exercise will require R coding a little bit beyond the basics from STIN100 or similar courses. It typically makes use of

- lapply() and sapply() for looping, even if a standard for-loop may also be used here
- inline functions, typically used inside the apply-functions, but also in other cases

All these are elements we find useful in bioinformatics. You may consult an AI to learn more about these topics, but make certain you actually learn something. It is of no value to let the AI solve it all without you picking up anything.

One possible solution is shown below.
library(tidyverse)
library(microseq)

# Reading data and renaming
R1.tbl <- readFastq("data/metabarcoding_R1.fastq.gz") |>
  rename(Header_1 = Header,
         Sequence_1 = Sequence,
         Quality_1 = Quality)
R2.tbl <- readFastq("data/metabarcoding_R2.fastq.gz") |>
  rename(Header_2 = Header,
         Sequence_2 = Sequence,
         Quality_2 = Quality)

# Binding data into one table
# and sampling 1000 read pairs at random
pairs.tbl <- bind_cols(R1.tbl, R2.tbl) |>
  slice(sample(1:n(), 1000))

# Splitting Quality texts into single characters
pairs.tbl <- pairs.tbl |>
  mutate(Cvec_1 = str_split(Quality_1, "")) |>
  mutate(Cvec_2 = str_split(Quality_2, ""))

# Including custom function here now
c2s <- function(Cvec){ # Cvec is a vector of characters
  Svec <- numeric(length(Cvec))
  for(i in 1:length(Cvec)){
    Svec[i] <- strtoi(charToRaw(Cvec[i]), 16L) - 33
  }
  return(Svec) # Svec is a vector of integers
}

# Converting single characters to phred scores
pairs.tbl <- pairs.tbl |>
  mutate(Qvec_1 = lapply(Cvec_1, c2s)) |>
  mutate(Qvec_2 = lapply(Cvec_2, c2s))

# Converting phred scores to error probabilities
# using an inline function in lapply
pairs.tbl <- pairs.tbl |>
  mutate(Pvec_1 = lapply(Qvec_1, function(q){return(10^(-q/10))})) |>
  mutate(Pvec_2 = lapply(Qvec_2, function(q){return(10^(-q/10))}))

# Compute expected number of errors
pairs.tbl <- pairs.tbl |>
  mutate(EE_1 = sapply(Pvec_1, sum)) |>
  mutate(EE_2 = sapply(Pvec_2, sum))

# Plotting as points to preserve the paired information
# Sort by EE_1 value to get less noisy output
pairs.tbl |>
  arrange(EE_1) |>
  mutate(pair = 1:n()) |>
  pivot_longer(cols = c(EE_1, EE_2), names_to = "type", values_to = "EE") |>
  ggplot(aes(x = pair, y = EE, color = type)) +
  geom_point(size = 2, alpha = 0.5)
We note that R2 reads in general have higher EE values than R1 reads, as expected from what we know about Illumina data. How big are these values? The largest values seem to be around 10-20, i.e. we must expect 10-20 errors in these reads. If you compute the read lengths you will find they are in most cases around 300 bases. This means that up to around 5% of the bases in a read may be errors, in the worst cases.