1 Learning goals

In this module we will focus on sequencing and the data we get from it. First we give a short overview of the types of sequencing, and take some time looking through some videos. We then inspect some raw data files, i.e. text files in the fastq format, and see how we can filter our data based on quality scores. To do this we also start using the orion cluster, writing our first shell script and submitting it as a job to the queuing system. Finally, we take a look at some theory concerning sequencing depth, also known as read coverage.

  • Get an overview of the basic sequencing types
  • Understand the raw data format - fastq files
  • Understand the basics of HPC
  • Start using the orion HPC
  • Understand the statistics concerning sequencing depth.



1.1 Software used in this module

In most modules in BIN310 we will use some software tools in addition to R and RStudio. I will list these software tools at the beginning of each module, with links to some web sites where you find more details about them, usually their GitHub site. From such GitHub sites (or similar) you usually find user manuals (or links to such) as well as links to publications describing the tool in more detail. When you use such tools in scientific work, you cite these publications, not the GitHub site. In many cases the GitHub site describes how to cite the tool correctly.





2 Sequencing

By sequencing we refer to the technology that allows us to actually read the sequence of DNA (or RNA) molecules. In this course we will focus on DNA sequencing data, but the technologies are the same for RNAseq.

Considering microbial genomics in general we can divide sequencing efforts into the following two categories:

  • Whole genome sequencing (also known as WGS or shotgun sequencing)
  • Amplicon sequencing (also known as targeted sequencing or barcoding)

They differ mainly in what is done to the DNA before the actual sequencing takes place. With WGS we want to sequence entire genomes, and no particular region is targeted. With amplicon sequencing, we first copy some region(s) from the genome(s), usually by PCR (Polymerase Chain Reaction), and then sequence only these targeted pieces.

Anyway, the data we get from sequencing are DNA sequences that we call reads. A read is a copy of the actual DNA sequence. Reads may be short or long, depending on the technology, and we typically get a lot of them. Our raw data are these reads.

This year we will focus on WGS data in the first part of the course, and then turn the attention to amplicon data in the last weeks.

2.1 Why sequencing?

In this course we do not devote much time to this question. But it seems rather obvious that, in order to understand and be able to predict what happens in biology, knowledge of the DNA sequence is a basic piece of information. It is not, however, the case that you need to do some sequencing in order to understand any biology. One could sometimes get this impression, since ‘everyone’ is sequencing these days.

Sequencing has become rather cheap and easily available, and for this reason a lot of such data are being produced. In this course we focus on how to process such data, and how to make some rather basic analyses from them. The more sophisticated analyses will, however, require more focused attention on one particular application (why you sequenced). In order to make this course useful for as many as possible, we try to avoid too specific applications, and rather spend time learning the skills and theory that everyone may benefit from. The downside of this is of course that the more exciting findings are outside our scope. When you start on your own project (Master/PhD) you will have the time to dig into your data looking for answers to your questions!

2.2 Online course

Instead of me talking about sequencing, it is better to make use of the endless sources of teaching material on the internet.

Below are listed some links to an online course from San Diego State University, US. The topics overlap quite a lot with our BIN310, focusing on genome sequencing and bacteria (even the lecturer looks somewhat similar!). I suggest we listen to what Rob Edwards has to say about sequencing. You should look through these videos, and learn:

In the last video above, Rob Edwards lists a number of steps on the left-hand side of the board. We will also have a look at these, except for the metabolic modelling, which is outside the scope of BIN310.

2.3 Sequencing technologies

There are a multitude of different technologies available for us to do DNA sequencing. We will mainly focus on Illumina sequencing in this course, since this is the most widely used today. Illumina data is what you meet everywhere. The Illumina technology, and some others, are often referred to as Next Generation Sequencing (NGS) or second generation sequencing (or HTS=High Throughput Sequencing by Rob Edwards). In addition, we will also briefly see some data from newer technologies for producing long reads, often referred to as third generation technologies. These are usually produced by machines from either Pacific Biosciences or Oxford Nanopore, and we refer to these as PacBio or Nanopore data.

Again, let us listen to Rob Edwards:

There are a couple of other sequencing technologies mentioned in some other Rob Edwards videos. You may have a look at them as well, but we will not dig into this here.

There are also a number of nice videos that animate the Illumina sequencing process; have a look at this one: https://www.youtube.com/watch?v=fCd6B5HRaZ8

2.4 Brief summary

Here is a brief summary of the sequencing types and technologies, seen from a BIN310 perspective.

               Used with WGS   Used with amplicon   Paired-end   Read lengths (bases)   Precision    Technology
Short-reads    yes             yes                  usually      150-300                high         Illumina
Long-reads     yes             yes                  no           1000-100 000           medium-low   Nanopore, PacBio

We notice that both short- and long-read technologies are used for both WGS and amplicon sequencing, but we could say that for amplicons the more precise short reads are preferred, while for WGS the long reads are beneficial if you want to assemble genomes. We will see more of this as we go along.

How many reads do we get from a sequencing run? This varies a lot. With amplicon sequencing we need only a few thousand reads, since we have copied a small piece of the genome, and only sequence this piece. For full genome sequencing we need more, and for metagenomes even more. From deep sequencing of a highly diverse metagenome we may have 100 million reads or more. Remember also that with e.g. Nanopore sequencing the reads are long (many thousand bases) and the number of reads may not be that large. It may still be a large data set; it is the number of bases we sequenced that counts.





3 Sequencing data

Regardless of sequencing technologies, the actual output from sequencing comes as text-files following the fastq format. Well, in case of Nanopore technology there is actually a base-calling step first, but if we consider this as part of the sequencing itself (it is in the other technologies), we can say that our job starts by having some fastq files with sequencing data.

3.1 The fastq files

As a bioinformatician, the fastq files are usually the raw data we start out with. These files include the reads (sequences) as well as a header (identifier) for each read. But they also contain a quality sequence for each read, indicating its quality.

The quality scores are also sometimes referred to as phred scores. This is due to the software phred that was developed during the human genome project, and from which these scores originate. Some of the analyses we will do make use of this quality information, but in many cases it is also ignored. We should be aware that sequencing technologies may be poorly calibrated, such that quality scores do not always reflect the absolute error probability as mentioned in the video above. A higher score means a better sequencing result, but exactly how good or how much better varies a lot.
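
To make this concrete, here is a minimal R sketch of how quality characters map to phred scores and (nominal) error probabilities, assuming the standard Phred+33 encoding used by modern Illumina fastq files. The quality string below is a made-up example, not from the data files:

# Phred+33 encoding: score Q = ASCII value - 33, and the nominal error probability is 10^(-Q/10)
quality <- "IIIIHHG#"                # hypothetical quality string (line 4 of a fastq record)
phred <- utf8ToInt(quality) - 33     # numeric phred scores
error_prob <- 10^(-phred / 10)       # error probabilities, assuming perfect calibration
print(phred)
print(error_prob)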

The fastq files we get from the sequencing machines are sometimes big, many gigabytes per file. This is why we should always try to keep them compressed when storing them, to save disk space. You will typically see the file extension .fastq.gz on such files, where the .gz indicates a compressed (gzipped) file. Most software (but not all) working with fastq files can read the compressed file directly, and we need not decompress it. If you want to inspect the actual text file you have to decompress it first. On orion you decompress a file named filename.fastq.gz like this:

pigz -d filename.fastq.gz

and the resulting file will have the same name, but without the .gz extension. Remember the black background above means this is something you typically do directly in the Terminal window. If you want to compress the file again, you write

pigz filename.fastq

The pigz command is a software tool installed on orion. You may read some help about it by typing

man pigz

This takes you to the manual pages for the software. All ‘basic’ UNIX commands have such manual pages, e.g. try man ls. To quit the manual, just type q for quit.



3.2 Data from a sequenced genome

All the data you will use in BIN310 will be found under the directory $COURSES/BIN310/. The COURSES is an environment variable, just like HOME, and is simply an alias for the folder /mnt/courses/ on orion.

In the folder $COURSES/BIN310/module2/ you find two compressed fastq files named

  • Illumina_MiSeq_R1.fastq.gz
  • Illumina_MiSeq_R2.fastq.gz

These are fastq files output from an Illumina sequencing run, a WGS sequencing of a single bacterial genome using the MiSeq type of technology. This means the bacterium has been isolated and grown, all cells have been killed and opened, and the DNA has been extracted. Then, following some Illumina protocol, the DNA has been cut into smallish pieces or fragments, typically a few hundred bases long. These double-stranded fragments have then been sequenced from both ends, resulting in a read-pair from each fragment. Illumina is often (but not always) paired-end sequencing, i.e. both ends of the genome fragments have been sequenced. This typically results in a pair of fastq files as above. Note the only difference between the file names is the R1 and R2. The reads in the R1 file are often referred to as the R1, left or forward reads, and those in the R2 file as the R2, right or reverse reads.

Personally, I dislike the use of ‘forward’ and ‘reverse’, because it gives you the impression that all reads in the R1 file are from the forward strand of the genome. They are not! Roughly half the reads are from the forward and half from the reverse strand, in both files. After cutting the fragments, we lose all control of forward and reverse strand, and I prefer to simply use the terms R1 and R2 reads. The R1 reads are all sequenced in the first round of the Illumina sequencing machine, while the R2 reads are all sequenced in the second round, and, as we will see, this results in poorer quality for the latter. This has nothing to do with strands on the genome!

3.2.1 Exercise - inspect fastq files directly in the Terminal

Just to have done this once, we will now copy and decompress the fastq files for inspection. Open a Terminal window on orion and

  • Make a folder under your $HOME named module2.
  • Copy the two compressed fastq-files from above into this folder.
  • Note the file size for both files. Are they identical?
  • Decompress both files using pigz.
  • Note the file sizes after decompression. What happened?
  • List the first 12 lines in each file.
  • Inspect the Header-line of the first 3 reads in both files. How can you see that the reads in a pair actually belong together as a pair?
  • Count the number of lines in both files. They should be identical, are they?
  • Delete the fastq files from your module2 folder.

This is not something we typically do, but it may be that you once in a while get fastq files that are corrupted, and an inspection like this may be useful.

3.2.2 Exercise solution

Here I list all the commands in one chunk, but you have to exercise these one at a time in your Terminal.

cd $HOME              # make certain you are in your HOME folder
mkdir module2         # create module2 directory
cp $COURSES/BIN310/module2/Illumina_MiSeq_R*.fastq.gz $HOME/module2    # copy both fastq files to $HOME/module2
ls -l module2         # list files and their sizes. We note the R1 file is smaller than the R2 file
pigz -d module2/*.gz  # decompress all files in module2 ending with .gz
ls -l module2         # list again, and now the files are bigger, and of exactly same size!
head -n 12 module2/Illumina_MiSeq_R1.fastq   # list first 12 lines
head -n 12 module2/Illumina_MiSeq_R2.fastq   # list first 12 lines
wc -l module2/Illumina_MiSeq_R1.fastq        # number of lines in file
wc -l module2/Illumina_MiSeq_R2.fastq        # number of lines in file
rm module2/*.fastq                           # delete all files in module2 ending with .fastq

You may find it strange that the R1 file is smaller than the R2 file when compressed, but exactly the same size without compression. The compression of files depends on the ‘entropy’ in the file, i.e. how much ‘noise’ or ‘disorder’ we have in the file. If reads are without error, they will be more similar to each other, and the file may be compressed more. The more sequencing errors we have, the more different the reads become, and we cannot compress the file to the same small size. Thus, the difference in compressed file size between the R1 and R2 file in a pair is only due to more sequencing errors (more entropy) in one of them!



3.3 More on raw sequence data

In order to run the sequencing technology, some short sequences are added (ligated) to the actual genomic fragments. We refer to these as technical sequences. They have nothing to do with the biological sequence itself, and we should remove such artifacts from our raw sequences.

The adapters are short technical sequences used to start the sequencing process. These are typically found at the ends of the raw reads; which end depends on how the sequencing was done. There are many such adapters, and different technologies use different adapters. A collection of commonly used adapters is often available. Adapters are in most cases removed directly after sequencing, and the raw fastq files we get should in principle not contain these technical sequences. However, this varies, and we should at least be prepared to remove more of these ourselves.

When we later in BIN310 focus more on amplicon sequencing, we will also see some primer sequences as part of our reads. We will talk more about those in the corresponding modules. For now we focus on genomic data, where there are no such primers.

When we do shotgun sequencing, the insert size is a concept of some importance. This is simply the length of the fragments we sequence, and is only relevant for paired-end sequencing of fragmented DNA, i.e. Illumina sequencing. How long are these fragments? This depends on how the DNA was processed in the wet lab, but also on the quality of the DNA prior to this. If we collect ‘dead’ DNA, i.e. DNA from dead organisms, it may be quite fragmented even before we take it to the lab. In studies of ancient DNA, i.e. DNA from organisms who have been dead for a long time, this problem occurs. But also in environmental studies, where we collect all DNA, we may get a large fraction of such fragmented DNA (dead microbes).

When we do paired-end sequencing we want the insert size to be at least twice the read length. Even if we only sequence, say, 150 bases at each end of the fragment, if the fragment is more than 300 bases long the two reads together cover 300 bases in the genome. If the fragment is only 200 bases long, the two reads will overlap each other, and together only give us information about these 200 bases. If the fragment is as short as 100 bases, then both reads will also be 100 bases (they cannot be longer than the fragment!) and the R1-read will be a reverse-complement copy of the R2-read. Then it is a waste to do paired-end sequencing; both reads contain the same (small) piece of the genome. The insert size is of special importance when we come to the assembly of genomes (module 4).
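
To make this arithmetic concrete, here is a tiny R sketch (not from the original text) computing how many distinct genome bases a read-pair covers for a given insert size and read length:

read_length <- 150
insert_size <- 200                              # try e.g. 100, 200 and 400
eff_length <- min(read_length, insert_size)     # a read cannot be longer than the fragment
covered <- min(2 * eff_length, insert_size)     # distinct genome bases covered by the pair
overlap <- 2 * eff_length - covered             # bases covered by both reads (wasted sequencing)
cat("covered:", covered, " overlap:", overlap, "\n")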

Below we will have a look at how we may use a software to

  • Trim away very low quality raw sequence, based on the fastq quality scores
  • Trim away adapters
  • Estimate the insert size

But before we can do this, we need to say more about how we run software on the computing cluster.





4 Using software on the cluster

We now have some sequencing data, in the shape of fastq files, and we will start to use the orion HPC for processing such data. Before we actually start computing anything, we need to understand some things in general about using cluster software. This will apply to all other tools we will use later in BIN310 as well. The same system is also used on the national sigma2 servers, in case you will be using this later.

We briefly mentioned the HPC in the module 1 document, but let us repeat some of this here now:



4.1 The module system

Software typically comes in many versions, and for various reasons we are not always interested in using the very latest version. It could be that one software tool depends on a specific version of another tool in order to work. It may also be that we want to re-compute something we did earlier, and then we should always use the exact same (older) version of the software, in order to make it reproducible. For this reason, software installed on orion (and sigma2) uses the module system. This means multiple versions of a software tool are often available. To get an overview of the module software available we use the command

module avail

This lists all software tools installed as modules on this system. If you do this, and scroll up/down the list (use the Return key to scroll down, use the q key to quit listing), you should find fastp/0.23.2-GCC-11.3.0 somewhere in there. This is the module for loading the fastp software that we will use shortly. Notice the module name includes both the software name (fastp) and the version number (0.23.2-GCC-11.3.0). The latter is important, because whenever you report what you have done to your data, you must list all software tools, and their versions, in order to make it reproducible. We will talk more about this.

Software tools installed as modules on the cluster are not immediately available for use. Up to now we have used UNIX commands, and they are used directly, but external software installed on the cluster must be loaded once in each session, before we start using it. Let us load the fastp software:

module load fastp/0.23.2-GCC-11.3.0

Now the software should be loaded and ready for us to use. Next, we need to know how to use it. The command for running it is fastp. Note the lowercase letters only in this command. UNIX is case-sensitive! You have to type all commands exactly correctly. If we have no idea what more to specify on the command line, we usually try

fastp --help

The option --help, or just -h, will often produce some help text for most software tools, but some may have a slightly different help option. Often you can also just type the command (here fastp) without any option, and you get the help text listed.

From this help text we can start to learn how to use this software. Another, and often better, way is to google! Use the software name, and perhaps add GitHub as well, as search phrases in google. Often you can find some online manuals with a richer help text than what is displayed from the command line.

The command

module list

will list all currently loaded software. Notice that other modules were also loaded as an effect of our loading of the fastp module. To unload all modules we do a

module purge

This is actually a good idea to do prior to starting a new session and loading new software, since previously loaded modules may interfere with whatever you load. Now, try to run fastp --help again and see what happens. Now you know what this message means!

If you want to dig more into this, you can read more about the module system at the sigma2 Help pages.

4.2 The SLURM queuing system

In bioinformatics we are often faced with tasks that require a lot of computation time and memory, and it takes too long to sit and wait for them to complete. Also, since a computing cluster is a shared environment with many users, we need some kind of queuing system to make certain all users get reasonable access to the resources. On orion, as well as sigma2, the queuing system is named SLURM (Simple Linux Utility for Resource Management).

Above we ran fastp directly from the command line. This is possible, but in reality not how we run most jobs. Whenever you log into orion, you are always working on a node called the login node. This node is not meant for heavy computations; in fact, if you try to start bigger jobs directly on the command line they will be killed after some time (currently 5 minutes). The login node is only meant for small and simple operations, like listing a help text as we did above, copying or moving small files, etc. Anything bigger than this, and we should make use of the queuing system.

Before we start a job, using some software, we should always establish

  • Does the software have an option for using multiple threads?
  • How much memory do we need?

The first is the easiest to resolve. By looking into the manual or help text for the software, you will discover if there is an option for specifying the number of threads. If not, it will use 1 thread only.

What is a thread? Well, think of the number of threads as the number of processors used to do the computations. The more threads working in parallel, the faster the computations will be done. However, this is not a linear function! Quite often the effect levels off, i.e. doubling the number of threads will not double the speed as we use more and more threads. Since the number of threads is always limited (we share them with others), we should not ask for more threads than we can use efficiently. For the smallish jobs we will run in BIN310, we never need more than 10 threads. Even for the biggest jobs I hardly ever go beyond 20 threads, since the software I use cannot utilize more than this efficiently. See my HPC video lecture linked to above!

The amount of memory refers to how much internal memory we reserve for our job. The different nodes in the cluster have different capacities. The largest nodes on orion have up to 1 terabyte (1000 gigabytes) of memory available, which is the maximum we can reserve (as long as we use 1 node only). How do we know how much memory we need? Well, this is not straightforward, but it usually depends on the data set size. In microbial genomics, the data sets are in general smallish, and we manage well with a few gigabytes in most cases. However, if your job requires more memory than you have reserved, it will crash. Then, why don’t we just reserve ‘a lot of memory’? Well, we share this with each other, and if we are all greedy on memory it means longer queues. Also, if you request a lot of memory, your job will not start until this is available, and you may spend a loooong time in the queue! Thus, try to reserve just the memory needed, for your own and others’ sake.

4.3 A SLURM script template

Instead of typing commands directly on the command line, we put them into a shell script. Here is a template for a shell script that we will use in BIN310:

#!/bin/bash

#SBATCH --nodes=1                         # We always use 1 node
#SBATCH --reservation=BIN310              # For BIN310 course only
#SBATCH --account=bin310                  # For BIN310 course only
#SBATCH --ntasks=1                        # The number of threads reserved
#SBATCH --mem=10G                         # The amount of memory reserved
#SBATCH --time=01:00:00                   # Runs for maximum this time
#SBATCH --job-name=my_job_name            # Sensible name for the job
#SBATCH --output=my_job_name_%j.log       # Logfile output here

######################
### Loading software
###

##############
### Settings
###

######################
### Running software
###

#######################################
### Post-processing, cleaning up etc.
###

First, since this is a shell script (using bash) it always starts with #!/bin/bash.

Then follows a section of lines all starting with #SBATCH. These are instructions for the SLURM queuing system, and something we always include, but will change slightly from script to script. Note that if you run this script directly from the command line (which we in general do not), these lines are ignored. They are only interpreted when we submit to SLURM.

The first three #SBATCH-lines we never change in BIN310 scripts! Here we specify that we will use 1 node only. The two lines specifying --reservation and --account allow us to make use of a special partition set up for BIN310 specifically. This is to prevent too much interference with other users on orion. The computing nodes in a cluster are grouped into partitions, typically decided by their memory capacity. In real life (after BIN310) you will instead replace these two lines with the line #SBATCH --partition=hugemem or some other named partition.

We then ask for 1 thread and 10 gigabytes of memory. The --time specifies an upper time limit, or wall time, for the job. Sometimes we start jobs with errors in them that make them run ‘forever’, and in case we forget to cancel them, we set a limit here to make certain they will eventually be killed. Here we set the limit to 1 hour, but this should be adjusted to how long we expect the job to take. You may also specify days, but in BIN310 we will hardly need more than hours and minutes.

The job name, and the output log file name, you decide yourself. These should reflect the job you are doing, and the name will be listed in the queue for you to recognize. All jobs, once started, get a job number or job id. This is added to the log file name where we put the %j. The job number is important in case you want to kill a job or inspect its memory usage later, as we will see.

The remaining sections are just comments (starting with a #) and we will fill in some valid commands there as we go along.

4.4 Filtering fastq files with fastp

The fastp software we saw above is something we may use to filter fastq files with sequencing data. It takes as input one or a pair of fastq files, removes some ‘bad’ sequence from the raw reads, and outputs the results as two new fastq files, plus an extra fastq file with reads whose corresponding mate was discarded due to low quality. It also produces a report and stores this in an HTML file. We can then open this HTML file in our web browser and get an overview of the quality of our data before and after filtering. We will now make our first SLURM shell script using fastp, and then submit this to the SLURM system. This is exactly the same procedure we will use over and over again many times in BIN310.

Below is a link to a video, but go through this text first and try to do it on your own.

Start RStudio on orion, create a new shell script and copy the shell script template from above and fill in the necessary commands to run fastp on the fastq files. It should look like this:

#!/bin/bash

#SBATCH --nodes=1                    # We always use 1 node
#SBATCH --reservation=BIN310         # For BIN310 course only
#SBATCH --account=bin310             # For BIN310 course only
#SBATCH --ntasks=10                  # The number of threads reserved
#SBATCH --mem=10G                    # The amount of memory reserved
#SBATCH --time=01:00:00              # Runs for maximum 1 hour
#SBATCH --job-name=fastp             # Sensible name for the job
#SBATCH --output=fastp_%j.log        # Logfile output here

######################
### Loading software
###
module purge
module load fastp/0.23.2-GCC-11.3.0

##############
### Settings
###
threads=10
r1_in=$COURSES/BIN310/module2/Illumina_MiSeq_R1.fastq.gz
r2_in=$COURSES/BIN310/module2/Illumina_MiSeq_R2.fastq.gz
r1_out=filtered_R1.fastq.gz
r2_out=filtered_R2.fastq.gz
unpaired_out=unpaired.fastq.gz

######################
### Running software
###
fastp --in1 $r1_in --in2 $r2_in --out1 $r1_out --out2 $r2_out --unpaired1 $unpaired_out --thread $threads


#######################################
### Post-processing, cleaning up etc.
###

Notice we made some changes in the #SBATCH part of the script, and added some commands. Save this under the file name fastp.sh in your $HOME/module2/ folder. NB! Use the extension .sh in all shell script file names!

Under Settings we now created several new shell variables. We saw shell variables already in module 1. Let us explain these variables in detail:

  • The threads variable simply contains how many threads we want to use when running the software. Its value must always correspond to the value in #SBATCH --ntasks=10 in the heading! Since the software fastp can use multi-threading, we speed up the computation by using 10 threads instead of 1.
  • The r1_in and r2_in contain the full names of the two input files we want to filter. Notice we include the full path. We cannot just use the file name here (e.g. Illumina_MiSeq_R1.fastq.gz), since the files are in some other folder. We must of course tell the system where these files are by including the path to where they are located. It is a common mistake to forget this. Always know exactly where your files are!
  • The r1_out, r2_out and unpaired_out contain the file names for the output files we want to generate. Since we give them no path, they will end up in the same folder as your script. We could easily have added a path to some other folder, in case we wanted them to end up there.

What is the unpaired_out? All reads come in pairs, but imagine one of the two reads in a pair is discarded due to low quality. The other one, surviving the filtering, no longer has a mate, and such unpaired reads are then output to this file. They could be either R1 or R2 reads, but must from now on be treated as single reads.

Later in the script we use these variables. When we put a $ in front of a shell variable, we refer to its content, as we saw in module 1. We do not have to create these variables to run the script, but both the number of threads and the file names may be changed later, and then it is nice to have one single place in the script (under Settings) where we change all this.

The line where we actually run fastp makes use of these variables. As an example we have --in1 $r1_in. The --in1 is an option (optional argument) to fastp and after this you should name the input R1 file. Notice how the command line contains a series of such options, and the corresponding information related to it. We will not discuss how we may know all options to fastp here now, we will talk more about this later.



4.5 Submitting to SLURM

Time to submit our first job to the SLURM queuing system! In the Terminal window, navigate to the $HOME/module2 folder, and verify the file fastp.sh is actually there. Submit to SLURM by

sbatch fastp.sh

Thus, the command sbatch is used to submit to SLURM, and it should be followed by the name of the shell script where we have listed all the commands we want to execute. If the script has the proper content, a job is started, and a number is output in the terminal window. This is the JOBID number. You may now inspect the queue to see if your job is there:

squeue -u <username>

where you replace <username> by your proper user name. The output is listed under the headers

JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)

Under JOBID you find the large integer, the job number. This is what you have to refer to in case you want to do something to the job. Under NAME you should recognize the name you gave your job in the shell script. Under ST you find the job status. If it says R it means the job is running. If it is waiting in the queue it should say PD for pending. Once the job is done it disappears from the queue. This job should take some seconds to complete, and it may be it is done before you are able to see it in the queue! Then, just sbatch again, and run the squeue -u <username> immediately.

If you have submitted a job, but then realize your script is not doing what you wanted, you may kill your job right away. Then you need the JOBID number we saw above. The command for killing a job is scancel. If the JOBID is 12345678, then you simply write

scancel 12345678

on the command line. You are (of course) only allowed to kill your own jobs.

Once your job is done, list the content of the folder and see if the log file and some output have been generated. There should also be an .html report and a .json file, in addition to the filtered fastq files.

Each started job will produce a log file, the statement #SBATCH --output=fastp_%j.log we added to the script indicates its name. These files are nice to inspect if something goes wrong. Once we are done, and happy with the results, we just delete them. Make a habit of deleting log files!

Now we have seen the basic procedure for making shell scripts and submitting to SLURM. This is the template for later, and we will make many such shell scripts in the coming weeks.

Here is a short video on how we make our first shell script and submit it to SLURM

4.5.1 Exercise - Sequence data quality

In RStudio, navigate to your $HOME/module2 folder, click on the .html file and choose View in Web Browser to open it. Try to answer the following questions by looking through the report:

  • How many read-pairs are there in this data set?
  • How long are the reads typically?
  • How about number of reads and lengths after filtering?
  • What is the estimated insert size here?
  • What percentage of the reads survived the filtering?
  • What is the main reason for discarding reads here?
  • Were any adapters found and removed in these reads?
  • Compare the read quality profiles of the R1 and R2 reads before filtering. What are the differences?
  • How about after filtering?

Make an R script where you read in the fastq file with unpaired reads, and find how many of these came from the R1 and from the R2 file. Hint: The last token in the Header contains the information you need. Use word().

4.5.2 Exercise solution

A short video inspecting the HTML report

library(tidyverse)
library(microseq)   # readFastq() is assumed to come from the microseq package used in BIN310

unpaired.tbl <- readFastq("unpaired.fastq.gz") %>%
  mutate(source = word(Header, -1)) %>%                         # last token of the Header line
  mutate(source = str_c("R", word(source, 1, sep = ":"))) %>%   # first field is 1 or 2, i.e. R1 or R2
  group_by(source) %>%
  summarise(n_reads = n())

4.5.3 Exercise - trimming

Let us extend the fastp script by also trimming away the first and last 10 bases of every read. Trimming makes reads shorter, but if the ends are full of errors, this may be a good idea.

How would we proceed to achieve this? First, we need to know more about if this kind of trimming is possible with fastp, and if so, how to do it. We would typically then google! Try to google “fastp github”. The first hit should be the GitHub site for this tool, at https://github.com/OpenGene/fastp .

Visit this website, and if you scroll down, there is a manual describing how to install and how to use this tool. Hint: Look for the options trim_front1 and trim_tail1 and similar for R2. Try to learn how to use these, and add code to the shell script, and sbatch again.

Compare the results you get now to what you saw in the first round. Especially the quality profiles seem different, right?

4.5.4 Exercise solution

A short video with solution code

4.5.5 Exercise - JSON

fastp also outputs a file in JSON format. This is a text format we may read into R, like this:

library(tidyverse)
library(jsonlite)

fastp.lst <- read_json("fastp.json")

This results in a list with all the information (and more!) that you saw in the HTML report.

From this, make R code that plots the mean quality scores over the positions, with one curve for quality before and one for after the filtering, and two panels, one for R1 and one for R2 reads. It should look something like this:

4.5.6 Exercise solution

library(tidyverse)
library(jsonlite)
fastp.lst <- read_json("fastp.json")
read1_before_quality <- unlist(fastp.lst$read1_before_filtering$quality_curves$mean)
read2_before_quality <- unlist(fastp.lst$read2_before_filtering$quality_curves$mean)
read1_after_quality <- unlist(fastp.lst$read1_after_filtering$quality_curves$mean)
read2_after_quality <- unlist(fastp.lst$read2_after_filtering$quality_curves$mean)
quality.tbl <- tibble(quality = c(read1_before_quality,
                                  read2_before_quality,
                                  read1_after_quality,
                                  read2_after_quality)) %>% 
  mutate(time = c(rep("before filtering", length(read1_before_quality)+length(read2_before_quality)),
                  rep("after filtering", length(read1_after_quality)+length(read2_after_quality)))) %>% 
  mutate(type = c(rep("R1", length(read1_before_quality)),
                  rep("R2", length(read2_before_quality)),
                  rep("R1", length(read1_after_quality)),
                  rep("R2", length(read2_after_quality)))) %>% 
  mutate(position = c(1:length(read1_before_quality),
                      1:length(read2_before_quality),
                      1:length(read1_after_quality),
                      1:length(read2_after_quality))) %>% 
  mutate(time = factor(time, levels = c("before filtering", "after filtering")))
fig <- ggplot(quality.tbl) +
  geom_line(aes(x = position, y = quality, color = time, group = time)) +
  facet_wrap(~type) +
  labs(x = "Position along reads", y = "Quality scores", color = "")
print(fig)



4.6 Improve the squeue Terminal output

The default output we get from squeue uses too few letters, i.e. each ‘column’ is too narrow. We typically do not see the full usernames! To fix this, create an alias to the squeue command, and use this alias instead. You can actually make such aliases for all UNIX commands.

You may create this alias directly in the Terminal, but then you need to type it each time you log in. To create it automatically at each login, we add such commands to the file named .bashrc in our $HOME. NB! Be careful when editing this file, this is like brain surgery! If you mess up your .bashrc you may no longer be able to log in.

Here is a small video on how you can add an alias to your .bashrc file in order to see your full username in the SLURM queue. Please be careful and do exactly as specified:

Here is the exact code line you add to your .bashrc file:

alias squ="squeue -o '%.18i %.15j %.15u %.2t %.10M %R'"

Please, be very careful when editing your .bashrc file, and do exactly as in the video. This is a big difference to working in Windows or Mac. These operating systems are designed for ‘sleepy’ users that must be looked after all the time, and prevented from doing something bad! In UNIX we are given much more freedom, assuming users are awake and fully concentrated…





5 Sequencing depth

As the last part of this module, let us dig a little into some basic theory concerning sequencing data. We need to be familiar with the Poisson distribution of read coverage before we proceed to the next module.



5.1 The binomial and Poisson variables

When working with the theory behind methods for analyzing sequence data we often stumble across the binomial and the Poisson distributions. Since these are no longer prioritized topics in statistics courses, we need to say a few words about these here.

The binomial variable is a random variable that stores the counts of how many Heads (or Tails) you get after flipping a coin \(n\) times, with the probability \(p\) of getting Heads in each coin flip. Notice the following:

  • The random variable contains counts, i.e. it can take values from 0 up to \(n\).
  • The probability \(p\) may not be 0.5, as in a fair coin, but it must be the same for each ‘coin flip’.
  • The ‘coin flips’ must be independent, i.e. the outcome of the next coin flip does not depend on the outcome of the previous

Whenever we have a situation where some process is repeated over and over again, \(n\) times, and there are two possible outcomes of the process each time, we think of binomial variables. Are the probabilities for each outcome the same each time, and are the outcomes independent? If yes, then the number of each outcome follows the binomial distribution. Even if the probability varies slightly, or there is some weak dependence, we still often use this as a model, with some success.

Example: The number of A’s in reads of length 150 bases is a binomial variable. We have sequenced a genome, and in this genome sequence there is a fixed fraction of A (and C, G and T). Let us for simplicity say this fraction is 0.25. Each read is 150 bases long, and comes from some random position in the genome sequence. We can think of each read as \(n=150\) ‘coin flips’ where the probability of an A is \(p=0.25\) each time. The number of A’s in a read, \(y\), is then an integer from 0 up to 150. This \(y\) is a binomial random variable. It may take on different values for each read we inspect, but we will soon discover that not all values are equally likely to occur. What is the probability of observing 0 A’s? Or 1 A? Or 2, 3, … and so on? The answer to this is given to us by the binomial density, which we often (but not entirely correctly) call the binomial distribution. Here is some R code to display this:

library(tidyverse)

n <- 150                                # number of 'coin flips'
p <- 0.25                               # probability of 'Head' each time
y <- 0:150                              # possible outcomes
f.bin <- dbinom(x = y, size = n, prob = p)  # the binomial density in R
fig <- ggplot() +
  geom_col(aes(x = y, y = f.bin)) +
  labs(x = "Number of A's in reads of length 150 bases", y = "Probability",
       title = "Binomial density")
print(fig)

The density shows one column for each possible outcome (some so small you do not see them), and the column heights are all probabilities. If you sum all column heights, they sum to 1.0. Notice how the most likely values for \(y\) are around 37-38, which is exactly \(n\cdot p = 0.25\cdot 150=37.5\). This is the expected value for \(y\), the ‘average if we took the average over infinitely many \(y\) values’. Note that if \(y\) is a binomial variable then the expected value is always \(E(y) = np\).
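
As a quick numerical check (not part of the original text), we can verify that the expectation of this density is indeed \(np\):

n <- 150
p <- 0.25
y <- 0:150
f.bin <- dbinom(x = y, size = n, prob = p)
expected_value <- sum(y * f.bin)   # sum of outcome times probability
print(expected_value)              # 37.5, i.e. exactly n * p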

What is the probability of having no A in a read? Well, it is \(P(y=0)\), and this is the height of the leftmost column, at 0. It is too small to read from the plot, but we have the values stored in f.bin, and can inspect this first value:

print(f.bin[1])
## [1] 1.816308e-19

This is a value around \(10^{-19}\), which is zero for all practical purposes.

The Poisson variable we can think of as an approximation of the binomial. In some cases we have a binomial with a very large \(n\), but a very small \(p\). In such cases we can claim the outcome \(y\) is a Poisson variable with parameter \(\lambda=np\). Notice that the Poisson has only one parameter \(\lambda\), while the binomial has two (\(n\) and \(p\)), and is thus simpler. Note also that the parameter \(\lambda\) is simply the expected value \(np\) that we saw above. We can again compute the density values for all possible outcomes:

f.pois <- dpois(x = y, lambda = n*p)
fig <- ggplot() +
  geom_col(aes(x = y, y = f.pois)) +
  labs(x = "Number of A's in reads of length 150 bases", y = "Probability",
       title = "Poisson density")
print(fig)

Why bother with the Poisson? Why not always stick to the binomial? Well, in most cases the binomial is the “correct” machinery for understanding how such data are generated, and the Poisson is just an approximation. In this context, we should stick to the binomial. Still, the Poisson simplification pops up quite often when we read scientific papers describing bioinformatic methods, and we should be familiar with it as well.

We will now see how these random variables may be used for describing real data.



5.2 Read coverage

Consider a WGS sequencing of a single genome, producing reads in fastq files like those we saw above. The coverage concept is not specifically linked to genome sequencing, but let us use this setting anyway.

Genome sequencing means we have grown many copies of the exact same genome (an isolate), extracted the DNA from these and sequenced it in the random way we call shotgun sequencing. This simply means we cut the DNA into random pieces, and sequence these pieces. Thus, each read we get should be the copy of some random region of the genome. The read coverage for such a data set is simply the total number of bases in our read data divided by the total number of base-pairs in one copy of the genome we are sequencing. Synonyms for this are fold coverage or sequencing depth.

Example: We have \(N=1000\) reads, and the reads are all \(L=100\) bases long, i.e. we have in total \(1000\cdot 100 = 100 000\) read bases. Assume they are all sampled randomly from some small molecule (plasmid?) of length \(G=50 000\) base-pairs. Then we have a read coverage of \(D = 100 000/ 50 000 = 2\).

We may also interpret read coverage as the expected number of reads covering a given position. Let us dig into this some more:

We have a genome of double-stranded DNA that contains \(G\) base pairs. Even if the DNA molecule has 2 strands, the sequence of the genome is only \(G\) bases long, since the negative strand is known from the positive. Thus, we have \(G\) positions along the genome. From this genome we sequence \(N\) reads, and for simplicity we assume all reads are exactly \(L\) bases long (if they vary in length, we just replace \(L\) by their average length \(\bar{L}\)). The read coverage or depth \(D\) is then \[D = \frac{NL}{G}\]
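
As a tiny sanity check, here is the plasmid example from above expressed in R:

N <- 1000        # number of reads
L <- 100         # read length in bases
G <- 50000       # length of the molecule in base pairs
D <- N * L / G   # read coverage (depth)
print(D)         # 2, as in the example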

Now we assume

  • The genome is circular, which is usually the case for bacteria. This means the genome sequence has no start or end, even if we store it as a linear sequence.
  • Positions (base pairs) are enumerated from \(1\) up to \(G\), starting somewhere on the circle.
  • Reads can come from both DNA strands, but in the cases where a read originates from the negative strand, we can just reverse-complement it to match the positive strand at the exact same positions.

Visualize the circular genome with the \(G\) different positions. Think of each read as sampling at random some position on this circle, and the read covers that position and the following \(L-1\) positions, in total \(L\) positions. What is the probability that a read covers any particular position? Well, if all positions are equally likely to be sampled, it should be \(L/G\). The idea of ‘shotgun’ sequencing is exactly that all positions are equally likely to be sequenced.

If we keep on sampling independent reads like this \(N\) times, the number of reads that cover any given position is binomial with \(n=N\) and \(p=L/G\). We saw above that the expected value, i.e. the expected number of reads covering a given position, is then \(N \cdot L/G\), which is \(D\).

This does not mean that each position is covered exactly \(D\) times! Remember the binomial density plots we made above. Some positions are covered more and some less, due to the random sampling, but if you average this over all \(G\) positions, you should get \(D\).

In reality \(N\) is often a huge number, while \(L/G\) is very small. Thus, we here have a case where the binomial is often approximated by the Poisson variable. We saw above that the Poisson has only 1 parameter, \(\lambda = np\) where \(n\) and \(p\) are the 2 parameters of the binomial. This means in our case \(\lambda = N\frac{L}{G} = D\).
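
To see how close this approximation is, here is a small R sketch comparing the two densities at one position. The numbers are made-up but realistic, not taken from the data files above:

library(tidyverse)

G <- 5e6           # genome size in base pairs (illustrative)
L <- 150           # read length
N <- 300000        # number of reads
p <- L / G         # probability that one read covers a given position
D <- N * p         # read coverage, here 9

coverage.tbl <- tibble(coverage = 0:30) %>% 
  mutate(binomial = dbinom(coverage, size = N, prob = p)) %>% 
  mutate(poisson = dpois(coverage, lambda = D))
print(coverage.tbl)   # the binomial and Poisson columns are almost identical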



5.2.1 Exercise - Poisson probabilities

We have WGS sequenced a genome with \(5 427 083\) basepairs. We used paired-end sequencing, resulting in \(595 925\) read-pairs, and each read is exactly \(301\) bases long. Make a barplot in R showing the probabilities of the read coverages \(0, 1,...,100\) at one particular position along the genome, assuming we have a purely random shotgun sequencing.

Hints: In R, the function dpois() gives you the Poisson probabilities. It requires as input

  • x which is a vector of read coverage values you want probabilities for.
  • lambda which is the \(\lambda\) from above.

Here is a start on some R code to do this, fill in the missing parts

library(tidyverse)
G <- ___
N <- ___
L <- ___
lambda <- ___
pois.tbl <- tibble(Read_coverage = 0:100) %>% 
  mutate(Poisson_prob = dpois(___, ___))
fig <- ggplot(___) +
  geom_col(aes(x = ___, y = ___)) +
  labs(x = "___", y = "___", title = "___")
print(fig)

The numbers used here are exactly those from the fastq files we looked at earlier (Illumina_MiSeq_R1.fastq.gz and corresponding R2 file). Can you find the probability that a position is not covered by any read? The latter is important, because we would ideally like all positions to be sequenced by some read. Having coverage 0 at some position means no read contains the base at that position, i.e. a ‘hole’ in the genome that we never see.

Before we start sequencing, we can decide the number of reads \(N\) to sequence. Try to vary the number of reads \(N\), and re-plot. You may then need to change the outcome values (0:100). Notice the shape of the distribution. Compute the read coverage and compare this to the peak of the distribution. Is there a relation? Does this make sense?

How small \(N\) can we tolerate, and still have a probability below \(10^{-3}\) of seeing coverage 0 somewhere?

5.2.2 Exercise solution

library(tidyverse)
G <- 5427083
N <- 595925
L <- 301
lambda <- N*L*2/G   # we multiply with 2 since we have paired-end reads, and each of the 2 reads are 301 bases
pois.tbl <- tibble(Read_coverage = 0:100) %>%
  mutate(Poisson_prob = dpois(0:100, lambda))
fig <- ggplot(pois.tbl) +
  geom_col(aes(x = Read_coverage, y = Poisson_prob))
print(fig)

The probability of a position not being covered is the first Poisson_prob in the pois.tbl, i.e. the one for which Read_coverage equals 0. This is super-small here! It is very unlikely that we missed some position given that we sequence to this depth and with reads spread out over the genome in a random way. The depth, or read coverage, is the value of lambda.

If we vary the number of reads N, and keep everything else constant, it corresponds to sequencing more or less deeply. Each time, the read coverage is the value of lambda, and this seems to match the peak of the distribution in the plot. Remember the read coverage is by definition the expected number of reads covering a position. The barplot displays the probability of a position being covered by 0, 1, … reads. Thus, the read coverage is the expected value of the distribution you plot, and this should be pretty close to the peak.

By simply adjusting N and running the code over again several times, I found that the first Poisson probability in pois.tbl, describing the probability of 0 coverage, rises to about \(10^{-3}\) when I sequence around \(N=60000\) read-pairs. We need to sequence more than roughly 60000 read-pairs to avoid this. There is no point in finding the exact limit, since the Poisson model is, as any model, an approximation.
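
Instead of trial and error, we could also solve this directly: the probability of zero coverage is \(e^{-\lambda}\), so we need \(\lambda > \ln(1000) \approx 6.9\). A small sketch using the same numbers as above:

G <- 5427083
L <- 301
lambda.min <- log(1000)                      # smallest lambda giving P(coverage = 0) below 1e-3
N.min <- ceiling(lambda.min * G / (2 * L))   # smallest number of read-pairs (2 reads of length L per pair)
print(N.min)                                 # roughly 62000, in line with the trial-and-error answer above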

Epilog: You may think that having the small probability of \(10^{-3}=0.001\) (one in a thousand) of having 0 coverage means we are ‘safe’, and will never see this problem. But remember, this is the probability for a given position. There are many positions! In fact, we have \(G\), more than 5 million, positions here. This is again like flipping a coin: You have the small probability of 0.001 of seeing Heads in each coin flip, but you flip the coin more than 5 million times! Surely, you would expect to see some Heads after all…
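
To make this concrete, the expected number of zero-coverage positions is simply the number of positions times the zero-coverage probability. A quick sketch, assuming that probability is \(10^{-3}\):

G <- 5427083              # number of positions in the genome
p.zero <- 1e-3            # probability that a given position has zero coverage
expected.holes <- G * p.zero
print(expected.holes)     # more than 5000 positions expected to be uncovered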