In this module we will focus on sequencing and the data we get from sequencing. First we give a short overview of the types of sequencing, and take some time to look through some videos. We will then inspect some raw data files, i.e. text files in the fastq format. We will also see how we can filter our data based on quality scores. To do this we will also start using the orion cluster, making our first shell script to submit a job to the queuing system. Finally, we take a look at some theory concerning sequencing depth, also known as read coverage.
In most modules in BIN310 we will use some software tools in addition to R and RStudio. I will list these software tools at the beginning of each module, with links to some websites where you can find more details about them, usually their GitHub site. On such GitHub sites (or similar) you will usually find user manuals (or links to such) as well as links to publications describing the tool in more detail. When you use such tools in scientific work, you cite these publications, not the GitHub site. In many cases the GitHub site describes how to cite the tool correctly.
By sequencing we refer to the technology that allows us to actually read the sequence of DNA (or RNA) molecules. In this course we will focus on DNA sequencing data, but the technologies are the same for RNAseq.
Considering microbial genomics in general we can divide sequencing efforts into the following two categories:
- Whole genome sequencing (WGS)
- Amplicon sequencing
They differ mainly in what is done to the DNA before the actual sequencing takes place. With WGS we want to sequence entire genomes, and no particular region is targeted. With amplicon sequencing, we first copy some region(s) from the genome(s), usually by PCR (Polymerase Chain Reaction), and then sequence only these targeted pieces.
Anyway, the data we get from sequencing are DNA sequences that we call reads. A read is a copy of the actual DNA sequence. Reads may be short or long, depending on the technology, and we typically get a lot of them. Our raw data are these reads.
This year we will focus on WGS data in the first part of the course,
and then turn the attention to amplicon data in the last weeks.
In this course we do not devote much time to this question. But, it seems rather obvious that in order to understand, and to be able to predict, what happens in biology, knowledge of the DNA sequence is a basic piece of information. It is not, however, the case that in order to understand any biology, you need to do some sequencing. One could sometimes get this impression, since ‘everyone’ is sequencing these days.
Sequencing has become rather cheap and easily available, and for this
reason a lot of such data are being produced. In this course we focus on
how to process such data, and how to perform some rather basic analyses of them. The more sophisticated analyses will, however, require more
focused attention on one particular application (why you sequenced). In
order to make this course useful for as many as possible, we try to
avoid too specific applications, and rather spend time learning the
skills and theory that everyone may benefit from. The downside of this
is of course that the more exciting findings are outside our scope. When
you start on your own project (Master/PhD) you will have the time to dig
into your data looking for answers to your questions!
Instead of me talking about sequencing, it is better to make use of the endless sources of teaching material on the internet.
Below are listed some links to an online course from San Diego State University, USA. The topics overlap quite a lot with our BIN310, focusing on genome sequencing and bacteria (even the lecturer looks somewhat similar!). I suggest we listen to what Rob Edwards has to say about sequencing. You should look through these videos, and learn:
In the last video above Rob Edwards lists a number of steps on the left-hand side of the board. We will also have a look at this, except
for the metabolic modelling, which is outside the scope of BIN310.
There is a multitude of different technologies available for us to do DNA sequencing. We will mainly focus on Illumina sequencing in this course, since this is the most widely used today. Illumina data is what you meet everywhere. The Illumina technology, and some others, are often referred to as Next Generation Sequencing (NGS) or second generation sequencing (or HTS=High Throughput Sequencing by Rob Edwards). In addition, we will also briefly see some data from newer technologies for producing long reads, often referred to as third generation technologies. These are usually produced by machines from either Pacific Biosciences or Oxford Nanopore, and we refer to these as PacBio or Nanopore data.
Again, let us listen to Rob Edwards:
There are a couple of other sequencing technologies mentioned in some other Rob Edwards videos. You may have a look at them as well, but we will not dig into this here.
There are also a number of nice videos that animate the Illumina sequencing process. Have a look at this one: https://www.youtube.com/watch?v=fCd6B5HRaZ8
Here is a brief summary of the sequencing types and technologies, seen from a BIN310 perspective.
| | Used with WGS | Used with amplicon | Paired-end | Read lengths (bases) | Precision | Technology |
|---|---|---|---|---|---|---|
| Short-reads | yes | yes | usually | 150-300 | high | Illumina |
| Long-reads | yes | yes | no | 1 000-100 000 | medium-low | Nanopore, PacBio |
We notice that both short- and long-read technologies are used for both WGS and amplicon type of sequencing, but we could say that for amplicons the more precise short reads are preferred, while for WGS the long reads are beneficial if you want to assemble genomes. We will see more of this as we go along.
How many reads do we get from a sequencing run? This varies a lot. For amplicon sequencing we need only a few thousand reads, since we have copied a small piece of the genome, and only sequence this piece. For full genome sequencing we need more, and for metagenomes even more. From deep sequencing of a highly diverse metagenome we may have 100 million reads or more. Remember also that with e.g. Nanopore sequencing, the reads are long (many thousand bases) and the number of reads may not be that large. It may still be a large data set; it is the number of bases we sequenced that counts.
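Just to illustrate the last point, here is a small back-of-the-envelope calculation in R. The read numbers are made up, but the read lengths are within the ranges listed above:

# Total number of sequenced bases = number of reads x read length (made-up numbers)
illumina_bases <- 20e6 * 150     # 20 million short reads of 150 bases
nanopore_bases <- 500e3 * 10000  # 500 thousand long reads of 10 000 bases on average
illumina_bases                   # 3e+09 bases
nanopore_bases                   # 5e+09 bases, the larger data set despite far fewer reads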
Regardless of sequencing technology, the actual output from sequencing comes as text files following the fastq format. Well, in the case of Nanopore technology there is actually a base-calling step first, but if we consider this as part of the sequencing itself (it is in the other technologies), we can say that our job starts with some fastq files of sequencing data.
For a bioinformatician, the fastq files are usually the raw data we start out with. These fastq files include the reads (sequences) as well as a header (identifier) for each read. In addition, they contain a quality string for each read, indicating the quality of each base in the read.
The quality scores are also sometimes referred to as phred
scores. This is due to the software phred
that was
developed during the human genome project, and from which these scores
originate. Some of the analyses we will do make use of this quality
information, but in many cases it is also ignored. We should be aware
that sequencing technologies may be poorly calibrated, such that quality
scores do not always reflect the absolute error probability as mentioned
in the video above. A higher score means a better sequencing result, but
exactly how good or how much better varies a lot.
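As a small illustration, here is some R code decoding a made-up quality string into phred scores and their nominal error probabilities. It assumes the Phred+33 encoding used by modern Illumina fastq files (older files may use another offset), where the score is the ASCII value of the quality character minus 33, and a score \(Q\) nominally corresponds to an error probability of \(10^{-Q/10}\):

library(tidyverse)
quality_string <- "IIIHHGG#"                  # a made-up quality string
phred <- utf8ToInt(quality_string) - 33       # Phred+33: ASCII value minus 33
error_prob <- 10^(-phred / 10)                # nominal error probability per base
tibble(symbol = strsplit(quality_string, "")[[1]], phred, error_prob)

Notice how the character 'I' corresponds to a score of 40, i.e. a nominal error probability of 1 in 10 000, while '#' corresponds to a score of 2, i.e. a base we should hardly trust at all.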
The fastq files we get from the sequencing machines are sometimes
big, many gigabytes per file. This is why we should always try to keep
them compressed when storing them, to save disk space. You will
typically see the file extension .fastq.gz
on such files,
where the .gz
indicates a compressed (gzipped) file. Most
software (but not all) working with fastq files will be able to read the
compressed file directly, and we need not decompress it. If you want to
inspect the actual text file you have to decompress it first. On orion you decompress a file named filename.fastq.gz
like
this:
pigz -d filename.fastq.gz
and the resulting file will have the same name, but without the
.gz
extension. Remember the black background above means
this is something you typically do directly in the Terminal window. If
you want to compress the file again, you write
pigz filename.fastq
The pigz
command is a piece of software installed on orion. You
may read some help about it by typing
man pigz
This takes you to the manual pages for the software. All ‘basic’ UNIX
commands have such manual pages, e.g. try man ls
. To quit
the manual, just type q
for quit.
All the data you will use in BIN310 will be found under the directory
$COURSES/BIN310/
. The COURSES
is an
environment variable, just like HOME
, and is simply an
alias for the folder /mnt/courses/
on orion.
In the folder $COURSES/BIN310/module2/
you find two
compressed fastq files named
Illumina_MiSeq_R1.fastq.gz
Illumina_MiSeq_R2.fastq.gz
These are fastq files output from an Illumina sequencing, a WGS
sequencing of a single bacterial genome using the MiSeq type of
technology. This means this bacterium has been isolated and grown, all cells have been killed and opened, and the DNA has been extracted. Then, following some Illumina protocol, the DNA has been cut into smallish pieces or fragments, typically a few hundred base pairs long.
Then, these double-stranded fragments have been sequenced from both ends, resulting in a read-pair from each fragment. Illumina is
often (but not always) paired-end sequencing, i.e. both ends of
the genome fragments have been sequenced. This typically results in a
pair of fastq files as above. Note the only difference between
the file names is the R1
and R2
. The
R1
file is often referred to as the R1, the left or the
forward reads, and the R2
are the R2, the right or the
reverse reads.
Personally, I dislike the use of ‘forward’ and ‘reverse’, because it
gives you the impression all reads in the R1
file are from
the forward strand of the genome. They are not! Roughly half the reads
are from the forward and half from the reverse, in both files. After
cutting the fragments, we lose all control of forward and reverse strand, and I prefer to simply use the terms R1 and R2 reads. The R1
reads are all sequenced in the first round of the Illumina
sequencing machine, while the R2 are all sequenced in the second
round, and as we will see, this results in poorer quality for the latter. This has nothing to do with strands on the genome!
Just to have done this once, we will now copy and decompress the fastq files for inspection. Open a Terminal window on orion and

- make a new folder in your $HOME named module2,
- copy the two compressed fastq files to this folder,
- decompress them with pigz, and
- inspect the decompressed text files before you delete them from the module2 folder.

This is not something we typically do, but it may be that you once in a while get fastq files that are corrupted, and an inspection like this may be useful.
Here I list all the commands in one chunk, but you should run them one at a time in your Terminal.
cd $HOME # make certain your are in your HOME folder
mkdir module2 # create module2 directory
cp $COURSES/BIN310/module2/Illumina_MiSeq_R*.fastq.gz $HOME/module2 # copy both fastq files to $HOME/module2
ls -l module2 # list files and their sizes. We note the R1 file is smaller than the R2 file
pigz -d module2/*.gz # decompress all files in module2 ending with .gz
ls -l module2 # list again, and now the files are bigger, and of exactly the same size!
head -n 12 module2/Illumina_MiSeq_R1.fastq # list first 12 lines
head -n 12 module2/Illumina_MiSeq_R2.fastq # list first 12 lines
wc -l module2/Illumina_MiSeq_R1.fastq # number of lines in file
wc -l module2/Illumina_MiSeq_R2.fastq # number of lines in file
rm module2/*.fastq # delete all files in module2 ending with .fastq
You may find it strange that the R1 file is smaller than the R2 file when compressed, but of exactly the same size without compression. The compression of files depends on the ‘entropy’ in the file, i.e. how much ‘noise’ or ‘disorder’ we have in the file. If reads are without error, they will be more similar to each other, and the file may be compressed more. The more sequencing errors we have, the more different the reads become, and we cannot compress the file to the same small size. Thus, the difference in compressed file size between the R1 and R2 files in a pair is only due to more sequencing errors (more entropy) in one of them!
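If you want to check the compressed file sizes from R instead of the Terminal, you can do something like this (the paths are as on orion, where the COURSES environment variable is set):

# Compressed file sizes in bytes (paths as on orion)
r1_file <- file.path(Sys.getenv("COURSES"), "BIN310/module2/Illumina_MiSeq_R1.fastq.gz")
r2_file <- file.path(Sys.getenv("COURSES"), "BIN310/module2/Illumina_MiSeq_R2.fastq.gz")
file.size(c(r1_file, r2_file))   # the R2 file should be the larger of the two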
In order to run the sequencing technology, some short artificial sequences are added (ligated) to the actual genomic fragments. We refer to these as technical sequences. They have nothing to do with the biological sequence itself, and we should remove such artifacts from our raw reads.
The adapters are short technical sequences used to start the sequencing process. These are typically found at the ends of the raw reads; which end depends on how the sequencing was done. There are many such adapters, and different technologies use different adapters. A collection of commonly used adapters is often available. Adapters are in most cases removed directly after sequencing, and the raw fastq files we get should in principle not contain these technical sequences. However, this varies, and we should at least be prepared to remove more of these ourselves.
When we later in BIN310 focus more on amplicon sequencing, we also will see some primer sequences as part of our reads. We will talk more about those in the corresponding modules. For now we focus on genomic data, and then there are no such primers.
When we do shotgun sequencing, the insert size is a concept of some importance. This is simply the length of the fragments we sequence, and is only relevant for paired-end sequencing of fragmented DNA, i.e. Illumina sequencing. How long are these fragments? This depends on how the DNA was processed in the wet lab, but also on the quality of the DNA prior to this. If we collect ‘dead’ DNA, i.e. DNA from dead organisms, it may be quite fragmented even before we take it to the lab. In studies of ancient DNA, i.e. DNA from organisms that have been dead for a long time, this problem occurs. But also in environmental studies, where we collect all DNA, we may get a large fraction of such fragmented DNA (dead microbes).
When we do paired-end sequencing we want the insert size to be at least twice the read length. Even if we only sequence, say, 150 bases at each end of the fragment, if the fragment is more than 300 bases long the two reads together cover 300 bases in the genome. If the fragment is only 200 bases long, the two reads will overlap each other, and only give us information about these 200 bases. If the fragment is as short as 100 bases, then both reads will also be 100 bases (they cannot be longer than the fragment!) and the R1 read will be a reverse-complement copy of the R2 read. Then it is a waste to do paired-end sequencing, since both reads contain the same (small) piece of the genome. The insert size is of special importance when we come to the assembly of genomes (module 4).
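To make this reasoning concrete, here is a small R sketch, using a read length of 150 and some made-up insert sizes, computing how many distinct genome bases a read pair covers. The pair can never cover more than the fragment itself, and never more than twice the read length:

library(tidyverse)
read_length <- 150
insert_size <- c(100, 200, 300, 500)                 # made-up fragment lengths
bases_covered <- pmin(insert_size, 2 * read_length)  # reads cannot extend beyond the fragment
tibble(insert_size, bases_covered)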
Below we will have a look at how we may use a software tool to filter out low-quality reads and trim away bad or technical sequence from the raw reads.
But before we can do this, we need to say more about how we run software on the computing cluster.
We now have some sequencing data, in the shape of fastq files, and we will start to use the orion HPC for processing such data. Before we actually start computing anything, we need to understand some things in general about using cluster software. This will apply to all other tools we will use later in BIN310 as well. The same system is also used on the national sigma2 servers, in case you will be using this later.
We briefly mentioned the HPC in the module 1 document, but let us repeat some of this here now:
Software typically comes in many versions, and for various reasons we are not always interested in using the very latest version. It could be that one software tool depends on another tool of a specific version in order to work. It may also be that we want to re-compute something we did earlier, and then we should always use the exact same (older) version of the software, in order to make it reproducible. For this reason, software installed on orion (and sigma2) uses the module system. This means multiple versions of a software tool are often available. To get an overview of the module software available we use the command
module avail
This lists all software tools installed as modules on this system. If you
do this, and scroll up/down the list (use the Return key to scroll down,
use the q
key to quit listing), you should find
fastp/0.23.2-GCC-11.3.0
somewhere in there. This is the
module for loading the fastp
software that we will use
shortly. Notice the module includes both the software name
(fastp
) and the version number
(0.23.2-GCC-11.3.0
). The latter is important, because
whenever you report what you have done to your data, you must list all
software tools, and their versions, in order to make it reproducible. We
will talk more about this.
Software tools installed as modules on the cluster are not immediately available for use. Up to now we have used UNIX commands, and they are used directly, but external software installed on the cluster must be loaded once in each session before we start using it. Let us load the fastp software:
module load fastp/0.23.2-GCC-11.3.0
Now the software should be loaded and ready for us to use. Next, we need to know how to use it. The command for running it is
fastp
. Note the lowercase letters only in this command.
UNIX is case-sensitive! You have to type all commands exactly correct.
If we have no idea on what more to specify on the command line, we
usually try
fastp --help
The option --help
, or just -h
, will often
produce some help text for most software tools, but some may have a
slightly different help option. Often you can also just type the command
(here fastp
) without any option, and you get the help text
listed.
From this help text we can start to learn how to use this software. Another, and often better, way is to google! Use the software name, and perhaps add GitHub as well, as search phrases in google. Often you can find some online manuals with a richer help text than what is displayed from the command line.
The command
module list
will list all currently loaded software. Notice that other modules
were also loaded as an effect of our loading of the fastp
module. To unload all modules we do a
module purge
This is actually a good idea to do prior to starting a new
session and loading new software, since previously loaded modules may
interfere with whatever you load. Now, try to run the
fastp --help
and see what happens. Now you know what this
message means!
If you want to dig more into this, you can read more about the
module
system at the sigma2 Help pages.
In bioinformatics we are often faced with tasks that require a lot of computation time and memory, and it takes too long to sit and wait for them to complete. Also, since a computing cluster is a shared environment with many users, we need some kind of queuing system to make certain all users get reasonable access to the resources. On orion, as well as sigma2, the queuing system is named SLURM (Simple Linux Utility for Resource Management).
Above we ran the fastp
directly from the command line.
This is possible, but in reality not how we run most jobs. Whenever you
log into orion, you are always working on a node called the login
node. This node is not meant for heavy computations; in fact, if you try to start bigger jobs directly on the command line they will be killed after some time (currently 5 minutes). The login node is only meant for small and simple operations, like listing the help text as we did above, copying or moving small files, etc. For anything bigger than this, we should make use of the queuing system.
Before we start a job, using some software, we should always establish

- how many threads the software can make use of, and
- how much memory the job will need.
The first is the easiest to resolve. By looking into the manual or help text for the software, you will discover if there is an option for specifying the number of threads. If not, it will use 1 thread only.
What is a thread? Well, think of the number of threads as the number of processors working in parallel on the computations. The more threads working in parallel, the faster the computations will be done. However, the speedup is not a linear function of the number of threads! Quite often the effect levels off, i.e. doubling the number of threads will not double the speed as we use more and more threads. Since the number of threads is always limited (we share this with others), we should not ask for more threads than we can use efficiently. For the smallish jobs we will run in BIN310, we never need more than 10 threads. Even for the biggest jobs I hardly ever go beyond 20 threads, since the software I use cannot utilize this efficiently. See my HPC video lecture linked to above!
The amount of memory refers to how much internal memory we reserve
for our job. The different nodes in the cluster have different
capacities. The largest nodes in orion have up to 1 terabyte (1000
gigabytes) memory available, which is the maximum we can reserve (as
long as we use 1 node only). How do we know how much memory we need?
Well, this is not straightforward, but it usually depends on the data
set size. In microbial genomics, the data sets are in general smallish,
and we manage well with a few gigabytes in most cases. However, if your
job requires more memory than you have reserved, it will crash. Then,
why don’t we just reserve ‘a lot of memory’? Well, we share this with
each other, and if we all are greedy on memory it means longer queues.
Also, if you request a lot of memory, your job will not start until this
is available, and you may spend a loooong time in the queue! Thus, try
to reserve just the memory needed, for your own and others' sake.
Instead of typing commands directly on the command line, we put them into a shell script. Here is a template for a shell script that we will use in BIN310:
#!/bin/bash
#SBATCH --nodes=1 # We always use 1 node
#SBATCH --reservation=BIN310 # For BIN310 course only
#SBATCH --account=bin310 # For BIN310 course only
#SBATCH --ntasks=1 # The number of threads reserved
#SBATCH --mem=10G # The amount of memory reserved
#SBATCH --time=01:00:00 # Runs for maximum this time
#SBATCH --job-name=my_job_name # Sensible name for the job
#SBATCH --output=my_job_name_%j.log # Logfile output here
######################
### Loading software
###
##############
### Settings
###
######################
### Running software
###
#######################################
### Post-processing, cleaning up etc.
###
First, since this is a shell script (using bash) it always starts
with #!/bin/bash
.
Then follows a section of lines all starting with
#SBATCH
. These are instructions for the SLURM queuing
system, and something we always include, but will change slightly from
script to script. Note that if you run this script directly from the
command line (which we in general do not), these lines are ignored. They
are only interpreted when we submit to SLURM.
The first three #SBATCH-lines we never change in BIN310 scripts! Here
we specify that we will use 1 node only. The two lines specifying
--reservation
and --account
allow us to make
use of a special partition set up for BIN310 specifically. This
is to prevent too much interference with other users on orion. The
computing nodes in a cluster are grouped into partitions, typically
decided by their memory capacity. In real life (after BIN310) you will
instead replace these two lines with the line
#SBATCH --partition=hugemem
or some other named
partition.
We then ask for 1 thread and 10 gigabytes of memory. The
--time
specifies an upper time limit, or wall
time, for the job. Sometimes we start jobs with errors in them that
make them run ‘forever’, and in case we forget to cancel them, we set a
limit here to make certain they will eventually be killed. Here we set
the limit to 1 hour, but this should be adjusted by how long we expect
the job to take. You may also specify days, but in BIN310 we will hardly
need more than hours and minutes.
The job name, and the output log file name, you decide yourself. This
should reflect the job you are doing, and will be listed in the queue
for you to recognize. All jobs, once started, will get a job
number or job id. This is then added to the log file name
where we put the %j
. This job number is
important in case you want to kill a job or inspect the memory usage
later, as we will see.
The remaining sections are just comments (starting with a
#
) and we will fill in some valid commands there as we go
along.
fastp
The fastp
software we saw above is something we may use
to filter fastq files with sequencing data. It takes as input
one or a pair of fastq files, removes some ‘bad’ sequence from the raw reads, and outputs the results as two new fastq files, plus an extra fastq file with reads whose mate was discarded due to low quality. It also produces a report and stores this in an HTML file. We can then open this HTML file in our web browser and get
some overview of the quality of our data before and after filtering. We
will now make our first SLURM shell script using fastp
,
and then submit this to the SLURM system. This is exactly the same
procedure we will use over and over again many times in BIN310.
Below is a link to a video, but go through this text first and try to do it on your own.
Start RStudio on orion, create a new shell script and copy the shell
script template from above and fill in the necessary commands to run
fastp
on the fastq files. It should look like this:
#!/bin/bash
#SBATCH --nodes=1 # We always use 1 node
#SBATCH --reservation=BIN310 # For BIN310 course only
#SBATCH --account=bin310 # For BIN310 course only
#SBATCH --ntasks=10 # The number of threads reserved
#SBATCH --mem=10G # The amount of memory reserved
#SBATCH --time=01:00:00 # Runs for maximum 1 hour
#SBATCH --job-name=fastp # Sensible name for the job
#SBATCH --output=fastp_%j.log # Logfile output here
######################
### Loading software
###
module purge
module load fastp/0.23.2-GCC-11.3.0
##############
### Settings
###
threads=10
r1_in=$COURSES/BIN310/module2/Illumina_MiSeq_R1.fastq.gz
r2_in=$COURSES/BIN310/module2/Illumina_MiSeq_R2.fastq.gz
r1_out=filtered_R1.fastq.gz
r2_out=filtered_R2.fastq.gz
unpaired_out=unpaired.fastq.gz
######################
### Running software
###
fastp --in1 $r1_in --in2 $r2_in --out1 $r1_out --out2 $r2_out --unpaired1 $unpaired_out --thread $threads
#######################################
### Post-processing, cleaning up etc.
###
Notice we made some changes in the #SBATCH
part of the
script, and added some commands. Save this under the file name
fastp.sh
in your $HOME/module2/
folder.
NB! Use the extension .sh
in all shell
script file names!
Under Settings we now created several new shell variables. We saw shell variables already in module 1. Let us explain these variables in detail:
- The threads variable simply contains how many threads we want to use when running the software. Its value must always correspond to the value in #SBATCH --ntasks=10 in the heading! Since the software fastp can use multi-threading, we speed up the computation by using 10 threads instead of 1.
- The variables r1_in and r2_in contain the full names of the two input files we want to filter. Notice we include the full path. We cannot just use the file name here (e.g. Illumina_MiSeq_R1.fastq.gz), since the files are in some other folder. We must of course tell the system where these files are by including the path to where they are located. It is a common mistake to forget this. Always know exactly where your files are!
- The variables r1_out, r2_out and unpaired_out contain the file names for the output files we want to generate. Since we give them no path, they will end up in the same folder as your script. We could easily have added a path to some other folder, in case we wanted them to end up there.

What is the unpaired_out? All reads come in pairs, but imagine one of the two reads in a pair is discarded due to too bad quality. The other one, surviving the filtering, no longer has a mate, and such unpaired reads are then output to this file. They could be both R1 and R2 reads, but must from now on be treated as single reads.
Later in the script we use these variables. When using the
$
in front of a shell variable, we refer to its content, as
we saw in module 1. We do not have to create these variables to run the script, but both the number of threads and the file names may be changed later, and then it is nice to have one single
place in the script where we change all this (under Settings).
The line where we actually run fastp
makes use of these
variables. As an example we have --in1 $r1_in
. The
--in1
is an option (optional argument) to
fastp
and after this you should name the input R1 file.
Notice how the command line contains a series of such options, each followed by the corresponding information. We will not discuss how we may learn all the options to fastp here now; we will talk more about this later.
Time to submit our first job to the SLURM queuing system! In the
Terminal window, navigate to the $HOME/module2
folder, and
verify the file fastp.sh
is actually there. Submit to SLURM
by
sbatch fastp.sh
Thus, the command sbatch
is used to submit to SLURM, and
it should be followed by the name of the shell script where we have
listed all the commands we want to execute. If the script has the proper
content, a job is started, and a number is output in the terminal
window. This is the JOBID
number. You may now inspect the
queue to see if your job is there:
squeue -u <username>
where you replace <username>
by your proper user
name. The output is listed under the headers
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
Under JOBID
you find the large integer, the job number.
This is what you have to refer to in case you want to do something to
the job. Under NAME
you should recognize the name you gave
your job in the shell script. Under ST
you find the job
status. If it says R
it means the job is running. If it is
waiting in the queue it should say PD
for pending. Once the
job is done it disappears from the queue. This job should take some
seconds to complete, and it may be it is done before you are able to see
it in the queue! Then, just sbatch
again, and run the
squeue -u <username>
immediately.
If you have submitted a job, but then realize your script is not
doing what you wanted, you may kill your job right away. Then you need
the JOBID
number we saw above. The command for killing a
job is scancel
. If the JOBID
is
12345678
, then you simply write
scancel 12345678
on the command line. You are (of course) only allowed to kill your own jobs.
Once your job is done, list the content of the folder and see if the log file and some output have been generated. In addition to the filtered fastq files, there should be an .html report and a .json file produced by fastp.
Each started job will produce a log file; the statement
#SBATCH --output=fastp_%j.log
we added to the script
indicates its name. These files are nice to inspect if something goes
wrong. Once we are done, and happy with the results, we just delete
them. Make a habit of deleting log files!
Now we have seen the basic procedure for making shell scripts and submitting to SLURM. This is the template for later, and we will make many such shell scripts in the coming weeks.
Here is a short video on how we make our first shell script and submit it to SLURM
In RStudio, navigate to your $HOME/module2
folder, click
on the .html
file and choose View in Web Browser
to open it. Try to answer the following question by looking through the
report:
Make an R script where you read in the fastq file with unpaired
reads, and find how many of these came from the R1 and from the R2 file.
Hint: The last token in the Header contains the information you need.
Use word()
.
A short video inspecting the HTML report
library(tidyverse)
library(microseq)   # readFastq() comes from the microseq package
unpaired.tbl <- readFastq("unpaired.fastq.gz") %>%
  mutate(source = word(Header, -1)) %>%
  mutate(source = str_c("R", word(source, 1, sep = ":"))) %>%
  group_by(source) %>%
  summarise(n_reads = n())
Let us extend the fastp
script by also trimming away the first and last 10 bases of every read. Trimming makes reads shorter, but
if the ends are full of errors, this may be a good idea.
How would we proceed to achieve this? First, we need to know more
about if this kind of trimming is possible with fastp
, and
if so, how to do it. We would typically then google! Try to google
“fastp github”. The first hit should be the GitHub site for this tool,
at https://github.com/OpenGene/fastp .
Visit this website, and if you scroll down, there is a manual
describing how to install and how to use this tool. Hint: Look for the
options trim_front1
and trim_tail1
and similar
for R2. Try to learn how to use these, and add code to the shell script,
and sbatch
again.
Compare the results you get now to what you saw in the first round. Especially the quality profiles seem different, right?
The fastp
software also outputs a file in JSON format. This is a
text format we may read into R, like this:
library(tidyverse)
library(jsonlite)
fastp.lst <- read_json("fastp.json")
This results in a list
with all the information (and
more!) that you saw in the HTML report.
From this, make R code that plots the mean quality scores
over the positions, and makes one curve for quality before and after the
filtering, and two panels, one for R1 and one for R2 reads. It should
look something like this:
library(tidyverse)
library(jsonlite)
fastp.lst <- read_json("fastp.json")
read1_before_quality <- unlist(fastp.lst$read1_before_filtering$quality_curves$mean)
read2_before_quality <- unlist(fastp.lst$read2_before_filtering$quality_curves$mean)
read1_after_quality <- unlist(fastp.lst$read1_after_filtering$quality_curves$mean)
read2_after_quality <- unlist(fastp.lst$read2_after_filtering$quality_curves$mean)
quality.tbl <- tibble(quality = c(read1_before_quality,
read2_before_quality,
read1_after_quality,
read2_after_quality)) %>%
mutate(time = c(rep("before filtering", length(read1_before_quality)+length(read2_before_quality)),
rep("after filtering", length(read1_after_quality)+length(read2_after_quality)))) %>%
mutate(type = c(rep("R1", length(read1_before_quality)),
rep("R2", length(read2_before_quality)),
rep("R1", length(read1_after_quality)),
rep("R2", length(read2_after_quality)))) %>%
mutate(position = c(1:length(read1_before_quality),
1:length(read2_before_quality),
1:length(read1_after_quality),
1:length(read2_after_quality))) %>%
mutate(time = factor(time, levels = c("before filtering", "after filtering")))
fig <- ggplot(quality.tbl) +
geom_line(aes(x = position, y = quality, color = time, group = time)) +
facet_wrap(~type) +
labs(x = "Position along reads", y = "Quality scores", color = "")
print(fig)
squeue Terminal output

The default output we get from squeue uses too few
uses too few
letters, i.e. each ‘column’ is too narrow. We typically do not see the
full usernames! To fix this, create an alias for the
squeue
command, and use this alias instead. You can
actually make such aliases for all UNIX commands.
You may create this directly in the Terminal, but then you need to
type this each time you log in. To create it automatically on each log
in, we add such commands to the file named .bashrc
in our
$HOME
. NB! Be careful when editing this file, this is like
brain surgery! If you mess up your .bashrc
you may no
longer be able to log in.
Here is a small video on how you can add an alias to your
.bashrc
file in order to see your full username in the
SLURM queue. Please be careful and do exactly as specified:
Here is the exact code line you add to your .bashrc
file:
alias squ="squeue -o '%.18i %.15j %.15u %.2t %.10M %R'"
Please, be very careful when editing your
.bashrc
file, and do exactly as in the
video. This is a big difference to working in Windows or Mac. These
operating systems are designed for ‘sleepy’ users that must be looked
after all the time, and prevented from doing something bad! In UNIX we
are given much more freedom, assuming users are awake and fully
concentrated…
As the last part of this module, let us dig a little into some basic theory concerning sequencing data. We need to be familiar with the Poisson distribution of read coverage before we proceed to the next module.
When working with the theory behind methods for analyzing sequence data we often stumble across the binomial and the Poisson distributions. Since these are no longer prioritized topics in statistics courses, we need to say a few words about these here.
The binomial variable is a random variable that counts how many Heads (or Tails) you get after flipping a coin \(n\) times, with the probability \(p\) of getting Head in each coin flip. Notice the following:
Whenever we have a situation where some process is repeated over and over again, \(n\) times, and there are two possible outcomes each time, we think of binomial variables. Are the probabilities of the outcomes the same each time, and are the outcomes independent? If yes, then the count of each outcome follows the binomial distribution. Even if the probability varies slightly, or there is some weak dependence, we still often use this as a model, with some success.
Example: The number of A
‘s in reads of length 150 bases
is a binomial variable. We have sequenced a genome, and in this genome
sequence there is a fixed fraction of A
(and
C
, G
and T
). Let us for
simplicity say this fraction is 0.25. Each read is 150 bases long, and
comes from some random position in the genome sequence. We can think of each read as \(n=150\) ‘coin flips’ where the probability of an
A
is \(p=0.25\) each time.
The number of A
’s in a read, \(y\), is then an integer from 0 up to 150.
This \(y\) is a binomial random
variable. It may take on different values for each read we inspect, but
we will soon discover that not all values are equally likely to occur.
What is the probability of observing 0 A
’s? Or 1
A
? Or 2,3…. and so on. The answer to this is given to us by
the binomial density which we often (but not entirely
correctly) call the binomial distribution. Here is some R code to
display this:
library(tidyverse)
n <- 150 # number of 'coin flips'
p <- 0.25 # probability of 'Head' each time
y <- 0:150 # possible outcomes
f.bin <- dbinom(x = y, size = n, prob = p) # the binomial density in R
fig <- ggplot() +
geom_col(aes(x = y, y = f.bin)) +
labs(x = "Number of A's in reads of length 150 bases", y = "Probability",
title = "Binomial density")
print(fig)
The density shows one column for each possible outcome (some so small you do not see them), and the column heights are all probabilities. If you sum all column heights, they sum to 1.0. Notice how the most likely values for \(y\) are around 37-38, right where we find \(n\cdot p = 0.25\cdot 150=37.5\). This is the expected value for \(y\), the ‘average we would get if we averaged over infinitely many \(y\) values’. Note that if \(y\) is a binomial variable then the expected value is always \(E(y) = np\).
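We can verify this numerically, since the expected value is the probability-weighted sum over all possible outcomes, using the y and f.bin we computed above:

sum(y * f.bin)   # the probability-weighted sum of outcomes, should be (very close to) n * p = 37.5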
What is the probability of having no A
in a read? Well,
it is \(P(y=0)\), and this is the
height of the leftmost column, at 0. It is too small to read from the
plot, but we have the values stored in f.bin
, and can inspect
this first value:
print(f.bin[1])
## [1] 1.816308e-19
This is a value around \(10^{-19}\), which is zero for all practical purposes.
The Poisson variable we can think of as an approximation of the binomial. In some cases we have a binomial with a very large \(n\), but a very small \(p\). In such cases we can claim the outcome \(y\) is a Poisson variable with parameter \(\lambda=np\). Notice that the Poisson has only one parameter \(\lambda\), while the binomial has two (\(n\) and \(p\)), and is thus simpler. Note also that the parameter \(\lambda\) is simply the expected value \(np\) that we saw above. We can again compute the density values for all possible outcomes:
f.pois <- dpois(x = y, lambda = n*p)
fig <- ggplot() +
geom_col(aes(x = y, y = f.pois)) +
labs(x = "Number of A's in reads of length 150 bases", y = "Probability",
title = "Poisson density")
print(fig)
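Since we now have both densities stored in f.bin and f.pois, we can also compare them numerically. Note that in this example \(p=0.25\) is not particularly small, so we should not expect the Poisson approximation to be perfect:

max(abs(f.bin - f.pois))   # largest difference in probability over the outcomes 0,...,150
f.bin[38]                  # binomial probability of exactly 37 A's (the most likely outcome)
f.pois[38]                 # the Poisson approximation of the same probability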
Why bother with the Poisson? Why not always stick to the binomial? Well, in most cases the binomial is the “correct” machinery for understanding how such data are generated, and the Poisson is just an approximation. In this context, we should stick to the binomial. Still, the Poisson simplification pops up quite often when we read scientific papers describing bioinformatic methods, and we should be familiar with this as well.
We will now see how these random variables may be used for describing real data.
Consider a WGS sequencing of a single genome, producing reads in fastq files like those we saw above. The coverage concept is not specifically linked to genome sequencing, but let us use this setting anyway.
Genome sequencing means we have grown many copies of the exact same genome (an isolate), extracted the DNA from these and sequenced it in the random way we call shotgun sequencing. This simply means we cut the DNA into random pieces, and sequence these pieces. Thus, each read we get should be the copy of some random region of the genome. The read coverage for such a data set is simply the total number of bases in our read data divided by the total number of base-pairs in one copy of the genome we are sequencing. Synonyms for this are fold coverage or sequencing depth.
Example: We have \(N=1000\) reads, and the reads are all \(L=100\) bases long, i.e. we have in total \(1000\cdot 100 = 100 000\) read bases. Assume they are all sampled randomly from some small molecule (plasmid?) of length \(G=50 000\) base-pairs. Then we have a read coverage of \(D = 100 000/ 50 000 = 2\).
We may also interpret read coverage as the expected number of reads covering a given position. Let us dig into this some more:
We have a genome of double-stranded DNA that contains \(G\) base pairs. Even if the DNA molecule has 2 strands, the sequence of the genome is only \(G\) bases long, since the negative strand is known from the positive. Thus, we have \(G\) positions along the genome. From this genome we sequence \(N\) reads, and for simplicity we assume all reads are exactly \(L\) bases long (if they vary in length, we just replace \(L\) by their average length \(\bar{L}\)). The read coverage or depth \(D\) is then \[D = \frac{NL}{G}\]
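For the small plasmid example above, this computation in R is simply:

# Read coverage for the plasmid example above
N <- 1000    # number of reads
L <- 100     # read length, in bases
G <- 50000   # length of the molecule, in base pairs
D <- N * L / G
D            # 2, i.e. each position is covered by 2 reads on average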
Now we assume that each read comes from a random position on the genome, with all positions being equally likely, and that the reads are sampled independently of each other.
Visualize the circular genome with the \(G\) different positions. Think of each read as sampling at random some position on this circle, and the read covers that position and the following \(L-1\) positions, in total \(L\) positions. What is the probability that a read covers any particular position? Well, if all positions are equally likely to be sampled, it should be \(L/G\). The idea of ‘shotgun’ sequencing is exactly that all positions are equally likely to be sequenced.
If we keep on sampling independent reads like this \(N\) times, the number of reads that cover any given position is binomial with \(n=N\) and \(p=L/G\). We saw above that the expected value, i.e. the expected number of reads covering a given position, is then \(N \cdot L/G\), which is \(D\).
This does not mean that each position is covered exactly \(D\) times! Remember the binomial density plots we made above. Some positions are covered more and some less, due to the random sampling, but if you average this over all \(G\) positions, you should get \(D\).
In reality \(N\) is often a huge number, while \(L/G\) is very small. Thus, we here have a case where the binomial is often approximated by the Poisson variable. We saw above that the Poisson has only 1 parameter, \(\lambda = np\) where \(n\) and \(p\) are the 2 parameters of the binomial. This means in our case \(\lambda = N\frac{L}{G} = D\).
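Here \(N\) is large and \(L/G\) is small, which is exactly the situation where the Poisson approximation works well. We can check this with the plasmid example numbers from above:

# Probability that a given position is covered by exactly k reads
N <- 1000
L <- 100
G <- 50000
k <- 0:10
binom_prob <- dbinom(k, size = N, prob = L / G)
pois_prob <- dpois(k, lambda = N * L / G)
round(cbind(k, binom_prob, pois_prob), 5)   # the two columns are almost identical
# The first row (k = 0) is the probability that the position is not covered by any read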
We have WGS sequenced a genome with \(5 427 083\) basepairs. We used paired-end sequencing, resulting in \(595 925\) read-pairs, and each read is exactly \(301\) bases long. Make a barplot in R showing the probabilities of the read coverages \(0, 1,...,100\) at one particular position along the genome, assuming we have a purely random shotgun sequencing.
Hints: In R, the function dpois() gives you the Poisson probabilities. It requires as input

- x, which is a vector of the read coverage values you want probabilities for, and
- lambda, which is the \(\lambda\) from above.

Here is a start on some R code to do this; fill in the missing parts
library(tidyverse)
G <- ___
N <- ___
L <- ___
lambda <- ___
pois.tbl <- tibble(Read_coverage = 0:100) %>%
mutate(Poisson_prob = dpois(___, ___))
fig <- ggplot(___) +
geom_col(aes(x = ___, y = ___)) +
labs(x = "___", y = "___", title = "___")
print(fig)
The numbers used here are exactly those from the fastq files we
looked at earlier (Illumina_MiSeq_R1.fastq.gz
and
corresponding R2 file). Can you find the probability that a position is
not covered by any read? The latter is important, because we
would ideally like all positions to be sequenced by some read. Having
coverage 0 at some position means no read contains the base at that
position, i.e. a ‘hole’ in the genome that we never see.
Before we start sequencing, we can decide the number of reads \(N\) to sequence. Try to vary the number of
reads \(N\), and re-plot. You may then
need to change the outcome values (0:100
). Notice the shape
of the distribution. Compute the read coverage and compare this to the
peak of the distribution. Is there a relation? Does this make sense?
How small \(N\) can we tolerate, and still have a probability below \(10^{-3}\) of seeing coverage 0 somewhere?
library(tidyverse)
G <- 5427083
N <- 595925
L <- 301
lambda <- N*L*2/G # we multiply with 2 since we have paired-end reads, and each of the 2 reads are 301 bases
pois.tbl <- tibble(Read_coverage = 0:100) %>%
mutate(Poisson_prob = dpois(0:100, lambda))
fig <- ggplot(pois.tbl) +
geom_col(aes(x = Read_coverage, y = Poisson_prob))
print(fig)
The probability of a position not being covered is the first
Poisson_prob
in the pois.tbl
, i.e. the one for
which Read_coverage
equals 0. This is super-small here! It
is very unlikely that we missed some position given that we sequence to
this depth and with reads spread out over the genome in a random way.
The depth, or read coverage, is the value of lambda
.
If we vary the number of reads N
, and keep everything
else constant, it corresponds to sequencing more or less deep. Each time
the read coverage is the value of lambda
, and this seems to
match the peak of the distribution in the plot. Remember the read
coverage is by definition the expected number of reads covering a
position. The barplot displays the probability of a read covering
0,1,… positions. Thus, the read coverage is the expected value of the
distribution you plot, and this should be pretty close to the peak.
By simply adjusting N and running the code over again several times,
I found that the first Poisson probability in the pois.tbl
,
describing the probability of 0 coverage, rises above \(10^{-3}\) if I sequence around \(N=60000\) reads or fewer. We need to sequence more than 60000 read pairs to avoid this. There is no point in finding the exact
limit, since the Poisson model is, as any model, an approximation.
Epilog: You may think that having the small probability of \(10^{-3}=0.001\) (one in a thousand) of having 0 coverage means we are ‘safe’, and we will never see this problem. But, remember this is the probability for one given position. There are many positions! In fact, we have \(G\) of them, more than 5 million here. This is again like flipping a coin: You have the small probability of 0.001 of seeing a Head in each coin flip, but you flip the coin more than 5 million times! Surely, you would expect to see some Heads after all…