When dealing with DNA sequences, often in huge amounts, we very often cut the sequences into what we call K-mers. We need to understand the basics of K-mers, since this is something we will meet again and again. We will also do some K-mer counting and see how it may be used for detecting errors in DNA sequences.
We will also meet the more technical concept of software containers. The use of software containers, here singularity (apptainer) containers, is something we will emphasize a lot in BIN310.
In introductory bioinformatics we learn about sequence alignments, and how we compare biological sequences by aligning them. In microbial genomics we also align sequences from time to time, but when it comes to comparing DNA, especially long sequences, there is another concept we see more and more often. This is the representation of DNA as K-mers.
A K-mer is simply a DNA subsequence of length K. Any DNA sequence can be chopped up into K-mers. Example: The sequence ACGTTTACGT can be chopped up into the following 3-mers (K=3): ACG, CGT, GTT, TTT, TTA, TAC, ACG, CGT. Notice that the K-mers overlap. We start at the first symbol of the sequence, then move one symbol at a time, and collect the 3-mers starting at each position.
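To make this concrete, here is a minimal sketch in base R (the kmers() function is our own, just for illustration) that chops a sequence into its overlapping K-mers:

# Collect all overlapping K-mers from a sequence (base R)
kmers <- function(seq, K) {
  L <- nchar(seq)
  sapply(1:(L - K + 1), function(i) substr(seq, i, i + K - 1))
}
kmers("ACGTTTACGT", K = 3)
# "ACG" "CGT" "GTT" "TTT" "TTA" "TAC" "ACG" "CGT"

Notice the function returns \(L-K+1\) K-mers, which we make use of below.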
Why collect or focus on K-mers? We will see its use in many ways in the coming weeks. In this module we will see how K-mers are used for error correction of the sequenced reads, and in the next module how they are used for assembly of the reads into longer pieces called contigs, in an attempt to reconstruct parts of a sequenced genome. We will later also see how they are used in more general applications of comparing sequences, short or long, and even entire metagenomes.
Let us first establish some truths about K-mers.
How many K-mers can we collect from a sequence of length \(L\)? Well, we start at position 1, then position 2, and so on, and the last position a K-mer can start at is position \(L-K+1\). Thus, we can collect \(L-K+1\) K-mers from a sequence (read) of length \(L\). Notice here that if the sequence we refer to is a full microbial genome, these are often circular, i.e. the DNA is a ring, and then the number of K-mers is actually \(L\), the full length of the sequence, since there is no ‘start’ or ‘end’. In practice, it makes little difference if we say a genome of length \(G\) has \(G\) or \(G-K+1\) K-mers, since \(G\) is super big compared to \(K\).
How many distinct K-mers are possible? This is simple math, because at each position in the K-mer there are \(4\) possible bases. The K-mer has \(K\) positions, and for each of the 4 possible bases in the first position, we can have 4 in the second, and so on. This means there are \(4^K\) distinct possible K-mers.
NOTE: This does not mean we observe this many distinct K-mers! \(4^K\) quickly becomes a very large integer as \(K\) increases! There are approximately \(10^{80}\) elementary particles in the known universe. For which value of \(K\) do you get a similar number of distinct K-mers? Remember this: the total number of possible K-mers quickly becomes so huge that we will never observe more than a tiny fraction of them, regardless of how large our data sets may be.
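As a quick sanity check of how fast \(4^K\) grows, we can compute some of these numbers in R (a small sketch; the last line hints at the answer to the question above):

# The number of distinct K-mers grows exponentially with K
4^c(3, 11, 21, 31)     # 64, ~4.2e+06, ~4.4e+12, ~4.6e+18
log(1e80, base = 4)    # ~132.9, i.e. around K = 133 we pass 10^80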
Genomic DNA is double-stranded, i.e. the negative strand is the reverse-complement of the positive strand. This means that if we on the positive strand see GCTTA, the negative strand will read TAAGC in the exact same region. These are two different 5-mers, but they represent the exact same part of the DNA. In many cases we like to ‘merge’ such reverse-complementary K-mers. For instance, when we sequence a genome, the reads can come from both strands. Imagine the exact same region of the genome has been sequenced twice, but the two reads are from the two different strands. They are completely different, but are from the exact same place on the genome. This is why we often make use of canonical K-mers.
By using canonical K-mers, we recognize that two K-mers may contain the exact same information about what a genome looks like in a specific region. Note that the use of canonical K-mers only makes sense if we are dealing with double-stranded DNA. When sequencing RNA or cDNA (single-stranded) we should consider all K-mers, never use canonical K-mers.
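The canonical K-mer is typically defined as the lexicographically smaller of the K-mer itself and its reverse-complement. Here is a small base R sketch of this idea (revcomp() and canonical() are our own helper functions, not from any package):

# Reverse-complement and canonical form of K-mers (base R sketch)
revcomp <- function(x) {
  sapply(x, function(s) {
    rc <- chartr("ACGT", "TGCA", s)                    # complement each base...
    paste(rev(strsplit(rc, "")[[1]]), collapse = "")   # ...then reverse
  }, USE.NAMES = FALSE)
}
canonical <- function(x) ifelse(x <= revcomp(x), x, revcomp(x))
canonical(c("GCTTA", "TAAGC"))
# both return "GCTTA": the two strands give the same canonical 5-mer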
How many canonical K-mers are there? This is not as straightforward as one might think. If \(K\) is an odd number (3, 5, 7, etc.) it is simply \(4^K/2\), i.e. half the number of all K-mers. For odd \(K\), no K-mer can be identical to its own reverse-complement (the middle base would have to be its own complement, which is impossible), so the K-mers pair up, and exactly half of them are canonical. For \(K\) being even (2, 4, 6, etc.) it becomes more complicated, because then some K-mers are palindromes, i.e. they are identical to their own reverse-complement. Example: AATT is AATT after reverse-complementing. Thus, we have more than \(4^K/2\) canonical K-mers if \(K\) is an even number!
For this reason, we typically always choose an odd value for \(K\), making life simpler.
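For small \(K\) we can verify this by brute force in R, enumerating all possible K-mers and counting the distinct canonical ones (re-using the canonical() function sketched above):

# Enumerate all K-mers, then count distinct canonical K-mers
all_kmers <- function(K) {
  apply(expand.grid(rep(list(c("A", "C", "G", "T")), K),
                    stringsAsFactors = FALSE), 1, paste, collapse = "")
}
length(unique(canonical(all_kmers(5))))   # 512 = 4^5 / 2 (odd K)
length(unique(canonical(all_kmers(4))))   # 136 > 4^4 / 2 = 128 (16 palindromes)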
We will now see how the counting of K-mers in our reads can tell us something about how deep we have sequenced, without knowing anything about the genomes we have sequenced.
Recall from the previous module how we defined read coverage as the expected number of reads covering a given position in a genome. We compute it as \[D = \frac{NL}{G}\] where \(N\) is the number of reads, \(L\) is the read length and \(G\) is the size of the genome (number of base-pairs or positions). We also saw that the actual coverage per position in theory should follow a Poisson distribution with \(D\) as the expected value. Notice that in this formula we need to know \(G\), the size of the genome we are sequencing.
This is about how many reads cover a position. If this is position \(i\) in the genome, then all reads that start at positions \(i-L+1\), \(i-L+2\),…,\(i\) cover this position, i.e. there are \(L\) possible start positions. Note that these \(L\) reads are most likely different from each other, even if they all cover position \(i\). When we are faced with a data set consisting of a huge number of reads, all from random regions around a genome, how can we tell if two distinct reads are actually from the same region of the genome? We can look at their canonical K-mers! (or we could have aligned them)
A tiny example: Assume we have sequenced reads of length \(L=15\), and two reads look like this: CACCATAGGACATAA and TAGGACATAAGTGGC. We then collect all canonical K-mers of length \(K=5\) from these:
CACCATAGGACATAA | TAGGACATAAGTGGC
----------------|----------------
CACCA           | TAGGA
ACCAT           | AGGAC
CCATA           | GGACA
CATAG           | ATGTC
ATAGG           | ACATA
TAGGA           | CATAA
AGGAC           | ATAAG
GGACA           | ACTTA
ATGTC           | AAGTG
ACATA           | AGTGG
CATAA           | GCCAC
Note that some of the K-mers here are the reverse-complement of the one seen directly in the read, since we only collect the canonical K-mers. Notice how the last 6 K-mers from the first read are also found as the first 6 in the second read. It is then natural to suspect these reads come from the same region, covering some of the same positions of the genome. Of course, with both \(L\) and \(K\) being this small it could be they are just randomly this similar, and actually come from widely different positions, but with larger values for both \(L\) and \(K\) this is no longer very likely.
If we collect all canonical K-mers from each read, and we have sequenced with read coverage \(D\), how many times should we expect to observe each of the K-mers? It may be tempting to answer \(D\), but this is not entirely correct.
Remember how we outlined the sequencing depth as coin flipping: We sequence \(N\) reads (number of coin flips). For each read, the probability of overlapping position \(i\) is \(L/G\), i.e. there are \(L\) different positions a read can start at and still cover position \(i\). Then, the expected number of random reads overlapping position \(i\) is \(N\cdot L/G\). We now collect (canonical) K-mers from these reads, and ask how many times we expect to see each K-mer. For K-mers to be identical they must start in position \(i\); it is not enough to just cover it. So, could not each of the \(L\) reads that cover position \(i\) also contain the K-mer that starts in position \(i\)? No, not all of them! K-mers inside any read can start at positions \(1, 2, \ldots, L-K+1\) in the read, but not in its last \(K-1\) positions! Thus, of the \(L\) reads covering position \(i\), only \(L-K+1\) of them cover position \(i\) such that a K-mer inside the read can start in position \(i\). The probability reduces from \(L/G\) to \[\frac{L-K+1}{G}\] From this we get that the K-mer coverage \(D_K\) is computed as \[D_K = N\cdot \frac{L-K+1}{G} = N\cdot \frac{L}{G}\cdot \frac{L-K+1}{L} = D\cdot \frac{L-K+1}{L}\] where we just re-arranged, multiplying both numerator and denominator by \(L\), to get an expression relating \(D_K\) to \(D\). Since the fraction \((L-K+1)/L\) is always (slightly) less than 1, the K-mer coverage \(D_K\) is always (slightly) less than the corresponding read coverage \(D\). We now have an exact relation between sequencing depth \(D\) and K-mer coverage \(D_K\), and since we always know the read length \(L\) and the K-mer length \(K\), we can compute one from the other.
Here comes the big point now: K-mer coverage is something we can indirectly observe, simply by counting canonical K-mers from our reads. By counting canonical K-mers, we can estimate \(D_K\) pretty well, it is simply the average number of times we observe each distinct K-mer in our data. From this we can find the corresponding estimate of \(D\) using the formula above. From this we can also estimate the size of the genome \(G\)! Thus, by simply counting K-mers we can say how big the genome is that we have sequenced.
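Here is a small numeric sketch in R of how these quantities relate, using made-up numbers for \(N\), \(L\), \(K\) and \(G\):

# Made-up example: read coverage D versus K-mer coverage D_K
N <- 100000; L <- 300; K <- 21; G <- 5e6
D   <- N * L / G                  # read coverage: 6
D_K <- N * (L - K + 1) / G        # K-mer coverage: 5.6
D_K / D                           # = (L - K + 1) / L = 280/300

# The other way around: if K-mer counting gives an observed D_K of 5.6,
# we can estimate the (unknown) genome size
G_hat <- N * (L - K + 1) / 5.6    # = 5e6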
We will see K-mer coverages in the output from the sequence assembly later, and now we know exactly what this is!
What if we collected all K-mers instead of the canonical ones? Then all counts become roughly half of what we get with canonical K-mers, and the corresponding read coverage is also halved.
Short lecture on K-mer coverage
Let us have a look at how a clever use of canonical K-mers counting can help us reveal sequencing errors when we have sequenced a genome, and to discard or even correct reads with errors.
Assume we have sequenced a genome of size \(G\) with \(N\) reads of length \(L\), i.e. we have the read coverage \(D=NL/G\). If reads contain no errors, we can collect canonical K-mers from these reads, and expect to see a K-mer coverage \(D_K\) as explained above. This means that, as long as \(K\) is large enough to avoid random identities to other regions, all K-mers should occur on average around \(D_K\) times. As always, K-mers will not all occur exactly \(D_K\) times, due to the randomness of sequencing. If we have a rather large K-value, and the genome is random, we expect the count for each K-mer to follow a Poisson distribution with a peak around \(D_K\).
What happens if one read has a sequencing error in one position? Then the K-mers we collect from this read that overlap this position will differ from those we collect from all the other reads covering the same region, but which have no errors. Note that a single error affects up to \(K\) of the K-mers from that read. As long as we use a large \(K\)-value, few other K-mers are likely to be identical to these error K-mers by chance. Thus, they come out with very low counts, usually 1.
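We can sketch what this should look like with a small simulation in R, assuming (hypothetically) a K-mer coverage of \(D_K=30\) for the correct K-mers, plus a bunch of error K-mers with very low counts:

# Hypothetical simulation of K-mer counts in the presence of errors
set.seed(42)
correct <- rpois(100000, lambda = 30)      # correct K-mers: Poisson around D_K
errors  <- rpois(20000, lambda = 0.2) + 1  # error K-mers: counts of mostly 1
counts  <- c(correct, errors)
hist(counts, breaks = 0:max(counts),
     main = "Simulated K-mer counts", xlab = "Times a K-mer is observed")
# Two groups appear: a spike at very low counts (errors)
# and a peak around D_K = 30 (correct K-mers)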
To sum up:
- K-mers from correctly sequenced regions should occur around \(D_K\) times in our data.
- K-mers overlapping a sequencing error should occur very rarely, usually only once.
A method like BayesHammer (built into the software spades), which we will see below, uses this type of information to filter and correct reads.
Counting K-mers with jellyfish
Let us do some K-mer counting in practice, just to see how the idea above is clearly revealed in real data. Counting K-mers in large data sets is a computer intensive job. We could do this in R, but it would take a long time. We will instead use the software jellyfish for this, and then read the result into R and plot.
Let us use the fastq files we saw in the previous module as input to jellyfish. Here is the shell script we use:
#!/bin/bash
#SBATCH --nodes=1 # We always use 1 node
#SBATCH --reservation=BIN310 # For BIN310 course only
#SBATCH --account=bin310 # For BIN310 course only
#SBATCH --ntasks=10 # The number of threads reserved
#SBATCH --mem=10G # The amount of memory reserved
#SBATCH --time=01:00:00 # Runs for maximum this time
#SBATCH --job-name=jellyfish # Sensible name for the job
#SBATCH --output=jellyfish_%j.log # Logfile output here
######################
### Loading software
###
module purge
module load Jellyfish/2.3.0-GCC-8.3.0
##############
### Settings
###
threads=10 # number of threads to speed up
r1=$COURSES/BIN310/module2/Illumina_MiSeq_R1.fastq.gz
r2=$COURSES/BIN310/module2/Illumina_MiSeq_R2.fastq.gz
K=21 # count K-mers of this length
hash_size="16M" # memory for jellyfish to reserve
count_file=counts.jf # raw output file
out_file=countcount.txt # count file with results we are interested in
############################################
### Copying and de-compressing fastq files
###
cp $r1 $SCRATCH
cp $r2 $SCRATCH
r1=$(basename $r1)
r2=$(basename $r2)
pigz -d $SCRATCH/$r1
pigz -d $SCRATCH/$r2
r1=$(basename $SCRATCH/$r1 .gz)
r2=$(basename $SCRATCH/$r2 .gz)
########################
### Running software
### First we count canonical K-mers
### Then we make histogram, count of counts
###
jellyfish count -m $K -s $hash_size -t $threads -C -o $count_file $SCRATCH/$r1 $SCRATCH/$r2
jellyfish histo -o $out_file $count_file
#######################################
### Post-processing, cleaning up etc.
###
rm $SCRATCH/$r1
rm $SCRATCH/$r2
rm $count_file
Let us first have a look at some of the code in this script, because it shows some use of UNIX commands as well as the $SCRATCH disk.
Sadly, the jellyfish software is incapable of reading compressed fastq files as input (most software manage this). This means we have to copy the fastq files and de-compress the copies before running jellyfish.
In cp $r1 $SCRATCH we simply copy the R1 fastq file to the $SCRATCH. All users have a $HOME folder at /mnt/users/<username>, but we also have a $SCRATCH folder at /mnt/SCRATCH/<username>. Here we may store things without using our ordinary disk quota. However, there is no backup of $SCRATCH, and we should only use it for temporary storage. We typically direct the output from software to $SCRATCH and then just copy what we need to our own $HOME and delete the rest.
In the script above we also use the basename command in UNIX to change the file names stored in the shell variables $r1 and $r2. Used like this, basename will simply strip off all folder names in the full file name. Then, we de-compress the files with pigz -d, and finally use basename to also strip off the .gz extension of the stored file names. Now $r1 and $r2 should contain the proper file names. Notice how
basename /mnt/users/larssn/BIN310/coursedata/module2/Illumina_MiSeq_R1.fastq.gz
will result in the text Illumina_MiSeq_R1.fastq.gz, i.e. peeling off the folders before the actual file name, while
basename /mnt/users/larssn/BIN310/coursedata/module2/Illumina_MiSeq_R1.fastq.gz .gz
will result in the text Illumina_MiSeq_R1.fastq, also peeling off the extension .gz. Re-visit LinuxCommand.org to see why we use the $(...) notation above (Expansions).
Notice also how we delete the copied files in the end. This is good
practice, otherwise the $SCRATCH
will be filled with
rubbish.
Exercise: running jellyfish
Copy the code above into a shell script, and save it in a module3 folder under your $HOME. Maneuver into this module3 folder and sbatch the script to SLURM.
It should produce a text file countcount.txt in your module3 folder. This file has 2 columns, separated by a blank. The first column lists the frequencies (counts) observed when counting K-mers. The second is the ‘frequency of frequencies’. Thus, jellyfish first counts how often each canonical K-mer occurs. This means a lot of counts, one for each K-mer. Then, jellyfish counts how often we see each count, e.g. if 1000 K-mers occur 50 times, then the count 50 has been observed 1000 times. The latter is the second column.
Make an R script to read this file and make a barplot of the second column versus the first (use geom_col()). Here is a start that you can copy and fill in:
library(tidyverse)
jelly.tbl <- read_delim("___", delim = " ", col_names = c("Kmer_counts", "Count_counts"))
fig <- ggplot(___) +
geom_col(aes(x = ___, y = ___)) +
xlim(0, 100)
print(___)
# HINT: To make the plot look nicer, you may replace the largest value by some maximum, e.g. 5e+5
# Add this before plotting:
# jelly.tbl <- jelly.tbl %>%
# mutate(Count_counts = pmin(5e+5, Count_counts))
Inspect the plot. What is the most striking observation we make from this? Hint: Two groups of K-mers. What would you say is the K-mer coverage \(D_K\) here? The reads in this data set are 301 bases long. What (approximately) is the read coverage \(D\)?
How would the K-mer histogram look if there were no sequencing errors in the data?
We have simulated such reads, and you find them in the files Illumina_MiSeq_errfree_R1.fastq.gz and Illumina_MiSeq_errfree_R2.fastq.gz under $COURSES/BIN310/module3/.
Edit the script from the previous exercise to use these files as input instead. Also, change the name of the output file to countcount_errfree.txt to avoid overwriting the existing result file. Make the same plot in R from the results you now get. What is the main difference from what you saw for the real (noisy) data?
We have seen that software is installed on orion as modules and that we must load a module before we can run the application. This allows us to have many different versions of applications, compilers and libraries available for different users at the same time without conflicting with each other.
What if we want to run some application that is not installed as a
module? How can we make this software available? Well, you may ask the
orion support people, in a polite way, to install it as a module. But,
we also have another option that we will use from now on.
If you have tried to install software on your own local computer, you will probably recognize the dependency hell. In order for software A to work properly, you need certain versions of software B, C and D installed as well. Sometimes, though, another software E also depends on software B, but a different version of it. Then, you cannot have both A and E installed at the same time. This is what we avoid by the module system. Another way of avoiding it is by using containers.
We can think of a container for software A as a package, including software A together with all the other applications, libraries etc. that are needed to run software A. This is all wrapped up in one big file, and when we ‘run’ this it is (almost) independent of what is actually installed on your computer! This means we may have such containers for all kinds of conflicting applications, and run them without interfering with each other.
To use containers you need some basic software that allows you to run them. The two dominating systems are docker and singularity, where the latter recently changed its name to apptainer. So, to use containers we need either docker or apptainer installed on orion. The docker system has some problems in a multi-user environment like an HPC, so on HPCs you generally find only apptainer. It is, however, in general possible to run docker containers under apptainer, so this is really no limitation.
In BIN310 we will only use containers, not build them. This may change in the (near) future, and if you foresee a future in bioinformatics, you should probably invest time to learn how to build containers. Have a look at https://apptainer.org/docs/user/latest/quick_start.html.
In recent years it has become more and more common that modern biology software is distributed as containers, and we hardly install software the old way anymore. We simply use the available containers, as we will see shortly. A major benefit of this system is that it makes computations much more reproducible. In science this is of vital importance: other scientists should be able to reproduce your results, otherwise they are of little value. The lack of reproducibility has been recognized as a major problem in modern biology. In a scientific publication we have to list all software tools used, and their versions, but it could still be a hassle for someone else to reproduce this on another computer, with different libraries and operating system. You would need to install the same software, and the exact same versions of it. Today we simply specify and share the containers we used, and anyone else can just use the exact same containers to repeat it!
This is the reason comparatively few software tools are available in the module system. Only some very basic tools are there, and for the rest you use containers.
Another very popular approach, quite similar to containers, is the use of conda, where you install your own version of a software locally. We will come back to this later in BIN310, but we will focus mostly on the use of apptainer containers in this course.
One of the platforms for running containers is apptainer, which is the same as we earlier called singularity. Thus, we need apptainer installed on our system, and this is available on orion (and sigma2). Open a Terminal and run
apptainer --help
Notice there are many Available Commands listed. We will focus on exec for now, which is used to execute an apptainer container. But where do we find the containers?
One big source of containers is found at the galaxy project servers. All the galaxy containers are listed on the website https://depot.galaxyproject.org/singularity/ (bookmark this site!). To download a container to orion, do as follows:
- Find the container you want in the listing, and copy its link.
- In a Terminal on orion, type wget and then paste in (right click) the copied link, and hit return.
This should start the download. Notice, the command wget is used for downloading files to orion. For security reasons you are not allowed to download from just any website, but the galaxy servers are on the ‘white list’.
After downloading we always do the following:
- Add the file extension .sif to the downloaded file (it has no extension originally). This is not strictly necessary, but by convention all such Singularity Image Files should have the extension .sif (this is still used despite the change in name from singularity to apptainer).
- Make it executable for all users: chmod ugo+rx *.sif
The nice thing about having the software as a local file is that you can put this into the same project folder where you have your data and other code, and when your study is finished, you can store it all in some archive for later. Should it be necessary to reproduce the results at some later time, you just open the archive, and there you find all you need, including the software you used.
Running an apptainer container

Let us find a container for the jellyfish software in the listing from above, and see how we can run it. The newest version at the time of writing seems to be jellyfish:2.2.10--h6bb024c_1 (version 2.2.10, build h6bb024c_1). Download this to your module3 folder, give it the file extension and make it executable as shown above.
To run a container we use the command
apptainer exec <container> <command>
where we need some proper command for running the software. This will vary from software to software, and we need to find out, e.g. by reading its manual, how to run it. In this case we use jellyfish -h to see if we can get some help text:
apptainer exec jellyfish:2.2.10--h6bb024c_1.sif jellyfish -h
This should open the container and run its software with the command jellyfish -h, which should hopefully produce some help text.
In the above command line we assume the container file is in the same folder as the shell script we are invoking it from. This is of course not required. The container file could be in any folder, and then you just specify the full path to it (e.g. $HOME/module3/jellyfish:2.2.10--h6bb024c_1.sif) in the command line. However, since we like to keep the containers together with the rest of the code, it is perhaps natural to expect the .sif file is (always) in the same folder as the shell script. The exception is when several people work together, using the same containers. Instead of each having their own copy, it is better to have one file stored in an open folder, for everyone to use. We will do this in BIN310, but for now you should all download and use your own containers.
To summarize:
- We download the software we need as a container, stored as a .sif file.
- We no longer need module load in our shell scripts!
At galaxy there is a very long list of software, and in most cases you will find what you seek there. This means we are now independent of what is installed on orion!
We will come back to containers again later in BIN310, but for now we stick to the use of apptainer and the galaxy containers.
Here is a short video demonstrating how to download a container and run it on orion
The /cvmfs connection (that we do not use!)

On orion we have the luxury of having a /cvmfs connection to the galaxy containers. From the Terminal, list the content of the folder /cvmfs/singularity.galaxyproject.org/:
ls -l /cvmfs/singularity.galaxyproject.org/
Notice that this lists the content of the ‘folder’ as if it was a folder in the orion system, even if it is on a remote server somewhere. Read more about cvmfs (the CERN Virtual Machine File System) in order to understand how this is possible.
Here the same containers you saw at https://depot.galaxyproject.org/singularity/ are listed in subfolders specified by the first two letters of the container name. This means the containers for jellyfish are found under:
ls -l /cvmfs/singularity.galaxyproject.org/j/e/
We could now run the container without downloading it, by
apptainer exec /cvmfs/singularity.galaxyproject.org/j/e/jellyfish:2.2.10--h6bb024c_1 jellyfish -h
This makes use of the /cvmfs connection. We do not recommend it. First, it is unstable, and may at times not work. Second, there is no guarantee this software version is still on the galaxy server in the (near) future, hampering your reproducibility. Third, this /cvmfs connection is not something you find on every HPC, and we should not depend upon it.
By downloading the container to a local file, you can take it with you to another HPC system, like sigma2, at any other time and use it in the exact same way as you did on orion!
Exercise: jellyfish by apptainer

Edit the jellyfish shell script from above such that it makes use of the newest container for this software at galaxy, instead of the module installed on orion. You may now delete the module statements from the script, and add apptainer exec <container> (replace <container> with the proper name) in front of the command lines. Note that each time you run jellyfish (both jellyfish count and jellyfish histo) you need to add this. Instead of loading the software once, you invoke the container each time you use it.
Here is a suggested solution. Notice the command lines for running the software are broken over two lines, using the \ at the end of the first line. This is required; otherwise it is considered as two different command lines:
#!/bin/bash
#SBATCH --nodes=1 # We always use 1 node
#SBATCH --reservation=BIN310 # For BIN310 course only
#SBATCH --account=bin310 # For BIN310 course only
#SBATCH --ntasks=10 # The number of threads reserved
#SBATCH --mem=10G # The amount of memory reserved
#SBATCH --time=01:00:00 # Runs for maximum this time
#SBATCH --job-name=jellyfish # Sensible name for the job
#SBATCH --output=jellyfish_%j.log # Logfile output here
##############
### Settings
###
threads=10
r1=$COURSES/BIN310/module2/Illumina_MiSeq_R1.fastq.gz
r2=$COURSES/BIN310/module2/Illumina_MiSeq_R2.fastq.gz
K=21
hash_size="16M"
count_file=counts.jf
out_file=countcount.txt
###############################
### Copying and de-compressing
### fastq files
###
cp $r1 $SCRATCH
cp $r2 $SCRATCH
r1=$(basename $r1)
r2=$(basename $r2)
pigz -d $SCRATCH/$r1
pigz -d $SCRATCH/$r2
r1=$(basename $SCRATCH/$r1 .gz)
r2=$(basename $SCRATCH/$r2 .gz)
###############################################
### Running software
### First we count canonical K-mers
### Then we make histogram, count of counts
###
apptainer exec jellyfish:2.2.10--h2d50403_0.sif jellyfish count \
-m $K -s $hash_size -t $threads -C -o $count_file $SCRATCH/$r1 $SCRATCH/$r2
apptainer exec jellyfish:2.2.10--h2d50403_0.sif jellyfish histo \
-o $out_file $count_file
#######################################
### Post-processing, cleaning up etc.
###
rm $SCRATCH/$r1
rm $SCRATCH/$r2
rm $count_file
Here is a short video demonstrating how we changed the
original script
Let us now return to the concept of canonical K-mer counting, and use this for correcting reads with errors or filtering them out of the data set.
In module 2 we used the software fastp for filtering reads. This was based on the quality scores of the reads that are supplied in the fastq files. These are probabilistic in nature, i.e. bad quality scores indicate the reads probably have some errors, but they do not look into the reads themselves. Also, these quality scores are not always very well calibrated, and may sometimes give us a wrong picture of the actual read quality.
The filtering we do below does not use the quality scores at all. Instead, we count K-mers in the actual reads, and try to filter or repair reads that have ‘strange’ K-mers in them. Thus, we go more directly into the data, instead of making use of some metadata (data about data) that may sometimes be misleading.
The idea of using simple K-mer frequencies to filter or correct reads has been around for some time. We will now study a software solution called BayesHammer that uses this approach. It is included in the assembler software spades that we will use more later.
Before we start using it, we should try to understand the basic idea behind it. The original paper is: Nikolenko, Korobeynikov & Alekseyev (2013). BayesHammer: Bayesian clustering for error correction in single-cell sequencing. BMC Genomics 14(Suppl 1):S7.
Read this paper, and discuss with each other. As a bioinformatician you are expected to read (and write?) such papers! Here are some guidelines:
- Focus on the Abstract, Background and Method sections. We are interested in the method!
- Do not dig into all the technical details of the algorithm; try to get the more general ideas first. A lot of what they write is about making high-speed solutions, but first we need to understand how it works, then we can worry about speed.
- Understand the figures, especially Figures 3 and 5.
Here is a video trying to highlight what I consider most important for understanding BayesHammer.
Let us use BayesHammer on the fastq files from module 2:
$COURSES/BIN310/module2/Illumina_MiSeq_R1.fastq.gz
$COURSES/BIN310/module2/Illumina_MiSeq_R2.fastq.gz
BayesHammer is part of the spades application. This is a software used to assemble reads into genomes, but BayesHammer is included as a first filtering step. We will look at assembly in the next module; for now we will run spades with the option for only doing this first filtering.
Download the newest apptainer container for spades from galaxy, and put it into your module3 folder:
- Maneuver into your module3 folder in a Terminal, and download using wget.
- Add the extension .sif to the downloaded file, and make it executable.
Make a new shell script that makes use of the spades container you downloaded. It is a good idea to copy the script we used for jellyfish above, save it under a new name (bayeshammer.sh), and edit it to run spades instead.
Note: You no longer need the code where we copied the files to $SCRATCH and de-compressed them, since spades can read the compressed files directly!
The command line for running spades should look like this:
spades.py --pe-1 1 $r1 --pe-2 1 $r2 --only-error-correction -o $out_dir --memory 99 --threads $threads
Notice this command line makes use of some shell variables (look for the $s). These must be created before the command line, under the Settings:
- $r1 and $r2 are two shell variables containing the full path to the data files (see the jellyfish script).
- Create $out_dir with the path to the folder where you want the output. It is sensible to refer to some folder under $SCRATCH here, typically out_dir=$SCRATCH/bayeshammer. If the folder does not exist, it will be created automatically by spades.
- Create $threads as before (see the jellyfish script).
- Note the --memory 99 option, and reserve the corresponding memory (99G) in the SLURM heading; the spades software requires a lot of memory!
Delete the code under ### Post-processing from the jellyfish script. We will add new code there, but first sbatch the script as it is now. The computations should take some minutes.
The software spades always outputs a lot of stuff, but we focus only on the compressed fastq files you should find in $out_dir/corrected/ once the computations are done. You will recognize the files with R1 and R2 in their names, but also one with unpaired. The latter contains reads that lost their mates during filtering, i.e. in a read-pair one of the reads is so full of errors it has been discarded by BayesHammer. Then the pair is broken, and such unpaired reads are written to a separate file. Typically the unpaired file is much smaller than the paired files, since most pairs survive this step. We also saw such unpaired reads when we filtered reads in module 2.
The only files we need from this output are these three files. Add code at the end of the script where you
- copy the three compressed fastq files to your module3 folder
- delete the output folder with rm -r $out_dir
Finally, sbatch the script again, and verify that the three compressed fastq files show up in your module3 folder. Here is a suggested solution:
#!/bin/bash
#SBATCH --nodes=1 # We always use 1 node
#SBATCH --reservation=BIN310 # For BIN310 course only
#SBATCH --account=bin310 # For BIN310 course only
#SBATCH --ntasks=10 # The number of threads reserved
#SBATCH --mem=99G # The amount of memory reserved
#SBATCH --time=01:00:00 # Runs for maximum this time
#SBATCH --job-name=bayeshammer # Sensible name for the job
#SBATCH --output=bayeshammer_%j.log # Logfile output here
##############
### Settings
###
threads=10
r1=$COURSES/BIN310/module2/Illumina_MiSeq_R1.fastq.gz
r2=$COURSES/BIN310/module2/Illumina_MiSeq_R2.fastq.gz
out_dir=$SCRATCH/bayeshammer
######################
### Running software
###
apptainer exec spades:3.15.5--h95f258a_1.sif spades.py \
--pe-1 1 $r1 --pe-2 1 $r2 --only-error-correction -o $out_dir --memory 99 --threads $threads
#######################################
### Post-processing, cleaning up etc.
###
cp $out_dir/corrected/*.fastq.gz .
rm -r $out_dir
To delete a folder you need the -r option to rm, as seen in the code above. The -r means recursive, i.e. you delete a folder and all its subfolders. Be careful when using this command, and make certain you delete the exact folder you want to delete, since there are no regrets here. It would be a dramatic error to delete a folder where you have something important in some of its subfolders! The folders we create under $SCRATCH typically contain output from some software, and these we can always re-create by re-running the script, in case we deleted a wrong folder.
Exercise: jellyfish on BayesHammer output

Finally, edit your jellyfish script again, to use the output R1 and R2 files from BayesHammer as input to jellyfish (we ignore the unpaired file for now). We want to see if the K-mer counts have changed as a result of running BayesHammer on our data.
Again, read the results into R and plot the counts of counts as before. Compare the jellyfish results to those of the raw data. Discuss the results. What is your conclusion? Here is a suggested solution:
#!/bin/bash
#SBATCH --nodes=1 # We always use 1 node
#SBATCH --reservation=BIN310 # For BIN310 course only
#SBATCH --account=bin310 # For BIN310 course only
#SBATCH --ntasks=10 # The number of threads reserved
#SBATCH --mem=10G # The amount of memory reserved
#SBATCH --time=01:00:00 # Runs for maximum this time
#SBATCH --job-name=jellyfish # Sensible name for the job
#SBATCH --output=jellyfish_%j.log # Logfile output here
##############
### Settings
###
threads=10
r1=Illumina_MiSeq_R1.fastq.00.0_0.cor.fastq.gz
r2=Illumina_MiSeq_R2.fastq.00.0_0.cor.fastq.gz
K=21
hash_size="16M"
count_file=counts.jf
out_file=countcount_bayeshammer.txt
###############################
### Copying and de-compressing
### fastq files
###
cp $r1 $SCRATCH
cp $r2 $SCRATCH
r1=$(basename $r1)
r2=$(basename $r2)
pigz -d $SCRATCH/$r1
pigz -d $SCRATCH/$r2
r1=$(basename $SCRATCH/$r1 .gz)
r2=$(basename $SCRATCH/$r2 .gz)
######################
### Running software
###
apptainer exec jellyfish:2.2.10--h2d50403_0.sif jellyfish count \
-m $K -s $hash_size -t $threads -C -o $count_file $SCRATCH/$r1 $SCRATCH/$r2
apptainer exec jellyfish:2.2.10--h2d50403_0.sif jellyfish histo \
-o $out_file $count_file
#######################################
### Post-processing, cleaning up etc.
###
rm $SCRATCH/$r1
rm $SCRATCH/$r2
rm $count_file