1 Learning goals

When dealing with DNA sequences, and huge amounts of them, we very often cut the sequences into what we call K-mers. We need to understand the basics of K-mers, since this is something we will meet again and again. We will also do some K-mer counting and see how it may be used for detecting errors in DNA sequences.

We will also meet the more technical concept of software containers. The use of software deployed as containers is something we will emphasize a lot in BIN310.

  • Understand phrases like K-mer coverage and canonical K-mers.
  • Learn how to use software deployed as singularity containers.
  • Learn how K-mer statistics allow us to filter or correct reads with errors.
  • Start to read and understand scientific papers on methods in bioinformatics.



1.1 Software used in this module

  • jellyfish for counting canonical K-mers
  • spades, whose BayesHammer module we use to filter and correct reads
  • apptainer for running software containers
  • R with the tidyverse packages for inspecting and plotting results

2 K-mers

In introductory bioinformatics we learn about sequence alignments, and how we compare biological sequences by aligning them. In microbial genomics we also align sequences from time to time, but when it comes to comparing DNA, especially long sequences, there is another concept we see more and more often. This is the representation of DNA as K-mers.

A K-mer is simply a DNA subsequence of length K. Any DNA sequence can be chopped up into K-mers. Example: The sequence ACGTTTACGT can be chopped up into the following 3-mers (K=3): ACG, CGT, GTT, TTT, TTA, TAC, ACG, CGT. Notice that the K-mers overlap. We start at the first symbol of the sequence, then move one symbol at a time, and collect the 3-mers starting at each position.
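
To make this concrete, here is a tiny sketch in R (an illustration only, not code we will need later) that chops a sequence into its K-mers:

seq <- "ACGTTTACGT"
K <- 3
# collect the K-mer starting at positions 1, 2, ..., L-K+1
sapply(1:(nchar(seq) - K + 1), function(i) substr(seq, i, i + K - 1))
# [1] "ACG" "CGT" "GTT" "TTT" "TTA" "TAC" "ACG" "CGT"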

Why collect or focus on K-mers? We will see its use in many ways in the coming weeks. In this module we will see how K-mers are used for error correction of the sequenced reads, and in the next module how they are used for assembly of the reads into longer pieces called contigs, in an attempt to reconstruct parts of a sequenced genome. We will later also see how they are used in more general applications of comparing sequences, short or long, and even entire metagenomes.



2.1 Canonical K-mers

Let us first establish some truths about K-mers.

  • If you have a sequence of length \(L\), how many K-mers can you collect from this sequence?

Well, we start at position 1, then position 2,…, and the last position where a K-mer can start is position \(L-K+1\). Thus, we can collect \(L-K+1\) K-mers from a sequence (read) of length \(L\). Notice here that if the sequence we refer to is a full microbial genome, these are often circular, i.e. the DNA is a ring, and then the number of K-mers is actually \(L\), the full length of the sequence, since there is no ‘start’ or ‘end’. In practice, it makes little difference if we say a genome of length \(G\) has \(G\) or \(G-K+1\) K-mers, since \(G\) is super big compared to \(K\).

  • How many distinct possible K-mers are there?

This is simple math, because at each position in the K-mer there are \(4\) possible bases. The K-mer has \(K\) positions, and for each of the 4 possible bases in the first position, we can have 4 in the second, and so on. This means there are \(4^K\) distinct possible K-mers.

NOTE: This does not mean we observe this many distinct K-mers! \(4^K\) quickly becomes a very large integer as \(K\) increases! There are approximately \(10^{80}\) elementary particles in the known universe. For which value of \(K\) do you get a similar number of distinct K-mers? Remember this: the total number of possible K-mers quickly becomes so huge that we will never observe more than a tiny fraction of them, regardless of how large our data sets are.
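
If you want to check your answer, one line of R does the job; we simply solve \(4^K = 10^{80}\) for \(K\):

80 / log10(4)   # approx 132.9, so already at K = 133 the number of distinct
                # K-mers rivals the number of elementary particles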

Genomic DNA is double-stranded, i.e. the negative strand is the reverse-complement of the positive strand. This means that if we see GCTTA on the positive strand, the negative strand will read TAAGC in the exact same region. These are two different 5-mers, but they represent the exact same part of the DNA. In many cases we like to ‘merge’ such reverse-complementary K-mers. For instance, when we sequence a genome, the reads can come from both strands. Imagine the exact same region of the genome has been sequenced twice, but the two reads are from the two different strands. They are completely different as strings, but are from the exact same place on the genome. This is why we often make use of canonical K-mers.

  • Consider any K-mer and its reverse-complement. Then the canonical one of them is the one with the lowest lexicographical rank, i.e. the one that comes first if you sort them alphabetically.

By using canonical K-mers, we recognize that two K-mers may contain the exact same information about what a genome looks like in a specific region. Note that the use of canonical K-mers only makes sense if we are dealing with double-stranded DNA. When sequencing RNA or cDNA (single-stranded) we should consider all K-mers, never use canonical K-mers.
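
Here is a small R sketch of how a canonical K-mer can be computed (an illustration only; we assume the K-mers contain only the bases ACGT):

library(tidyverse)

# the reverse-complement of each K-mer in a character vector
revcomp <- function(kmer) {
  map_chr(kmer, function(k) {
    chartr("ACGT", "TGCA", k) %>%   # complement each base
      strsplit("") %>% .[[1]] %>%   # split into single characters
      rev() %>%                     # reverse the order
      paste(collapse = "")
  })
}

# the canonical K-mer: the lexicographically smallest of the two
canonical <- function(kmer) {
  pmin(kmer, revcomp(kmer))
}

canonical(c("GCTTA", "TAAGC"))   # both give "GCTTA"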

  • How many distinct canonical K-mers are there?

This is not as straightforward as one might think. If \(K\) is an odd number (3, 5, 7, etc) it is simply \(4^K/2\), i.e. half the number of all K-mers. All K-mers have a reverse-complement different from themselves, and exactly half of these are canonical. For \(K\) being even (2, 4, 6, etc) it becomes complicated, because then some K-mers are palindromes, i.e. they are identical to their own reverse-complement. Example: AATT is AATT after reverse-complementing. Thus, we have more than \(4^K/2\) canonical K-mers if \(K\) is an even number!
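
For the curious, the exact count for even \(K\) is not hard to derive: an even-length K-mer is its own reverse-complement exactly when its second half is the reverse-complement of its first half, so there are \(4^{K/2}\) palindromes. Each palindrome is its own canonical form, while the remaining \(4^K - 4^{K/2}\) K-mers pair up two and two, giving \[\frac{4^K - 4^{K/2}}{2} + 4^{K/2} = \frac{4^K + 4^{K/2}}{2}\] distinct canonical K-mers for even \(K\).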

For this reason, we typically always choose an odd value for \(K\), making life simpler.



2.2 K-mer coverage

We will now see how the counting of K-mers in our reads can tell us something about how deep we have sequenced, without knowing anything about the genomes we have sequenced.

Recall from the previous module how we defined read coverage as the expected number of reads covering a given position in a genome. We compute it as \[D = \frac{NL}{G}\] where \(N\) is the number of reads, \(L\) is the read length and \(G\) is the size of the genome (number of base-pairs or positions). We also saw that the actual coverage per position in theory should follow a Poisson distribution with \(D\) as the expected value. Notice that in this formula we need to know \(G\), the size of the genome we are sequencing.

This is about how many reads cover a position. If this is position \(i\) in the genome, then any read starting at one of the \(L\) positions \(i-L+1\), \(i-L+2\),…,\(i\) covers it. Note that reads starting at these \(L\) positions are most likely different from each other, even if they all cover position \(i\). When we are faced with a data set consisting of a huge number of reads, all from random regions of a genome, how can we tell if two distinct reads are actually from the same region of the genome? We can look at their canonical K-mers! (Or we could have aligned them.)

A tiny example: Assume we have sequenced reads of length \(L=15\), and two reads look like this: CACCATAGGACATAA and TAGGACATAAGTGGC. We then collect all canonical K-mers of length \(K=5\) from these:

CACCATAGGACATAA TAGGACATAAGTGGC
CACCA TAGGA
ACCAT AGGAC
CCATA GGACA
CATAG ATGTC
ATAGG ACATA
TAGGA CATAA
AGGAC ATAAG
GGACA ACTTA
ATGTC AAGTG
ACATA AGTGG
CATAA GCCAC

Note that some of the K-mers here are the reverse-complement of the one seen directly in the read, since we only collect the canonical K-mers. Notice how the last 6 K-mers from the first read are also found as the first 6 in the second read. It is then natural to suspect these come from the same region, covering some of the same positions of the genome. Of course, with both \(L\) and \(K\) being this small it could be they are just randomly this similar, and actually come from widely different positions, but with larger values for both \(L\) and \(K\) this is no longer very likely.
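
Using the canonical() function sketched in section 2.1, we can reproduce this tiny example and verify the overlap (again, just an illustration):

# all canonical K-mers of a sequence, using canonical() from the sketch above
canonical_kmers <- function(seq, K) {
  kmers <- map_chr(1:(nchar(seq) - K + 1), function(i) substr(seq, i, i + K - 1))
  canonical(kmers)
}

k1 <- canonical_kmers("CACCATAGGACATAA", K = 5)
k2 <- canonical_kmers("TAGGACATAAGTGGC", K = 5)
intersect(k1, k2)   # the six shared K-mers: TAGGA AGGAC GGACA ATGTC ACATA CATAA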

If we collect all canonical K-mers from each read, and we have sequenced with read coverage \(D\), how many times should we expect to observe each of the K-mers? It may be tempting to answer \(D\), but this is not entirely correct.

Remember how we outlined the sequencing depth as coin flipping: We sequence \(N\) reads (the number of coin flips). For each read, the probability of overlapping position \(i\) is \(L/G\), i.e. there are \(L\) different positions where a read can start and still cover position \(i\). Then, the expected number of random reads overlapping position \(i\) is \(N\cdot L/G\). We now collect (canonical) K-mers from these reads, and ask how many times we expect to see each K-mer. For the K-mers to be identical they must start in position \(i\); it is not enough to just cover it. So, could not each of the \(L\) reads that cover position \(i\) also contain the K-mer that starts in position \(i\)? No, not all of them! K-mers inside any read can start at positions \(1, 2,...,L-K+1\) in the read, but not in its last \(K-1\) positions! Thus, of the \(L\) reads covering position \(i\), only \(L-K+1\) of them cover position \(i\) such that a K-mer inside the read can start in position \(i\). The probability reduces from \(L/G\) to \[\frac{L-K+1}{G}\] From this we get that the K-mer coverage \(D_K\) is \[D_K = N\cdot \frac{L-K+1}{G} = N\cdot \frac{L}{G}\cdot \frac{L-K+1}{L} = D\cdot \frac{L-K+1}{L}\] where we just multiplied both numerator and denominator by \(L\) and re-arranged to get an expression relating \(D_K\) to \(D\). Since the fraction \((L-K+1)/L\) is always (slightly) less than 1, the K-mer coverage \(D_K\) is always (slightly) less than the corresponding read coverage \(D\). We now have an exact relation between sequencing depth \(D\) and K-mer coverage \(D_K\), and since we always know the read length \(L\) and the K-mer length \(K\), we can compute one from the other.

Here comes the big point: K-mer coverage is something we can indirectly observe, simply by counting canonical K-mers in our reads. By counting canonical K-mers, we can estimate \(D_K\) pretty well: it is simply the average number of times we observe each distinct K-mer in our data. From this we can find the corresponding estimate of \(D\) using the formula above. From this we can also estimate the size of the genome \(G\)! Thus, simply by counting K-mers we can say how big the genome we have sequenced is.
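
Here is a small numeric sketch of these formulas in R (all numbers are made up for illustration):

N <- 1000000    # number of reads
L <- 301        # read length
K <- 21         # K-mer length
D_K <- 28       # say the average canonical K-mer count turned out to be 28
D <- D_K * L / (L - K + 1)    # estimated read coverage, approx 30
G <- N * (L - K + 1) / D_K    # estimated genome size, approx 10 million base-pairs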

We will see K-mer coverages in the output from the sequence assembly later, and now we know exactly what this is!

What if we collected all K-mers instead of the canonical ones? Then all counts become roughly half of what we get with canonical K-mers, and the corresponding read coverage is also halved.

Short lecture on K-mer coverage



2.3 Sequencing errors and K-mers

Let us have a look at how a clever use of canonical K-mer counting can help us reveal sequencing errors when we have sequenced a genome, and discard or even correct reads with errors.

Assume we have sequenced a genome of size \(G\) with \(N\) reads of length \(L\), i.e. we have the read coverage \(D=NL/G\). If reads contain no errors, we can collect canonical K-mers from these reads, and expect to see a K-mer coverage \(D_K\) as explained above. This means that, as long as \(K\) is large enough to avoid random matches to other regions, all K-mers should occur on average around \(D_K\) times. As always, the K-mers will not all occur exactly \(D_K\) times, due to the randomness of sequencing. If we have a rather large K-value, and the genome is ‘random’, we expect the count for each K-mer to follow a Poisson distribution with a peak around \(D_K\).

What happens if one read has a sequencing error in one position? Then the K-mers we collect from this read that overlap this position will differ from those we collect from all the other reads overlapping the same position but having no errors. As long as we use a large \(K\)-value, few other K-mers are likely to be identical to these error K-mers by chance. Thus, they come out with very low counts, usually 1.

To sum up:

  • From the reads, count all canonical K-mers for some choice of \(K\) being ‘large enough’ (e.g. 15 or more).
  • K-mers from error-free reads should on average appear around \(D_K\) times.
  • K-mers occurring very infrequently, compared to \(D_K\), are likely to be due to sequencing errors. Reads with such K-mers may either be discarded, or we may try to correct them (the simulation sketch below illustrates the resulting pattern).
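
To see what kind of ‘count of counts’ histogram this reasoning predicts, here is a small R simulation sketch (all numbers and proportions are made up): correct K-mers pile up in a Poisson-like peak around \(D_K\), while error K-mers pile up at very low counts. We will soon see this very pattern in real data.

library(tidyverse)

set.seed(42)
counts <- c(rpois(100000, lambda = 28),     # correct K-mers, Poisson around D_K = 28
            sample(1:3, size = 20000, replace = TRUE,
                   prob = c(0.80, 0.15, 0.05)))   # error K-mers, almost all rare
tibble(Kmer_counts = counts) %>% 
  count(Kmer_counts, name = "Count_counts") %>% 
  ggplot(aes(x = Kmer_counts, y = Count_counts)) +
  geom_col()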

A method like BayesHammer (built into the software spades), which we will see below, uses this type of information to filter and correct reads.

2.4 K-mer counting with jellyfish

Let us do some K-mer counting in practice, just to see how the idea above is clearly revealed in real data. Counting K-mers in large data sets is a compute-intensive job. We could do this in R, but it would take a long time. We will use the software jellyfish for this, and then we will read the result into R and plot it.

Let us use the fastq files we saw in the previous module as input to jellyfish. Here is the shell script we use:

#!/bin/bash

#SBATCH --nodes=1                    # We always use 1 node
#SBATCH --reservation=BIN310         # For BIN310 course only
#SBATCH --account=bin310             # For BIN310 course only
#SBATCH --ntasks=10                  # The number of threads reserved
#SBATCH --mem=10G                    # The amount of memory reserved
#SBATCH --time=01:00:00              # Runs for maximum this time
#SBATCH --job-name=jellyfish         # Sensible name for the job
#SBATCH --output=jellyfish_%j.log    # Logfile output here

######################
### Loading software
###
module purge
module load Jellyfish/2.3.0-GCC-8.3.0

##############
### Settings
###
threads=10    # number of threads to speed up
r1=$COURSES/BIN310/module2/Illumina_MiSeq_R1.fastq.gz
r2=$COURSES/BIN310/module2/Illumina_MiSeq_R2.fastq.gz
K=21                       # count K-mers of this length 
hash_size="16M"            # memory for jellyfish to reserve
count_file=counts.jf       # raw output file
out_file=countcount.txt    # count file with results we are interested in

############################################
### Copying and de-compressing fastq files
###
cp $r1 $SCRATCH
cp $r2 $SCRATCH
r1=$(basename $r1)
r2=$(basename $r2)
pigz -d $SCRATCH/$r1
pigz -d $SCRATCH/$r2
r1=$(basename $SCRATCH/$r1 .gz)
r2=$(basename $SCRATCH/$r2 .gz)

########################
### Running software
### First we count canonical K-mers
### Then we make histogram, count of counts
###
jellyfish count -m $K -s $hash_size -t $threads -C -o $count_file $SCRATCH/$r1 $SCRATCH/$r2
jellyfish histo -o $out_file $count_file

#######################################
### Post-processing, cleaning up etc.
###
rm $SCRATCH/$r1
rm $SCRATCH/$r2
rm $count_file

Let us first have a look at some of the code in this script, because it shows some use of UNIX code as well as the $SCRATCH disk.

Sadly, the jellyfish software is incapable of reading compressed fastq files as input (most software manages this). This means we have to

  • Copy the compressed fastq files to some other folder
  • De-compress these copies
  • Use these as input
  • Delete the copied and de-compressed files once we are done with them

In cp $r1 $SCRATCH we simply copy the R1 fastq file to the $SCRATCH. All users have a $HOME folder at /mnt/users/<username>, but we also have a $SCRATCH folder at /mnt/SCRATCH/<username>. Here we may store things without using our ordinary disk quota. However, there is no backup from $SCRATCH, and we should only use this for temporary storage. We typically direct the output from software to $SCRATCH and then just copy what we need to our own $HOME and delete the rest.

In the script above we also use the basename command in UNIX to change the file names stored in the shell variables $r1 and $r2. Used like this, basename will simply strip off all folder names in the full file name. Then, we de-compress the files with pigz -d, and finally use basename once more to also strip off the .gz extension from the stored file names. Now $r1 and $r2 should contain the proper file names. Notice how

  • The code basename /mnt/users/larssn/BIN310/coursedata/module2/Illumina_MiSeq_R1.fastq.gz will result in the text Illumina_MiSeq_R1.fastq.gz, i.e. peeling off the folders before the actual file name.
  • The code basename /mnt/users/larssn/BIN310/coursedata/module2/Illumina_MiSeq_R1.fastq.gz .gz will result in the text Illumina_MiSeq_R1.fastq, also peeling off the extension .gz.

Re-visit LinuxCommand.org to see why we use the $(...) notation above (Expansions).

Notice also how we delete the copied files in the end. This is good practice, otherwise the $SCRATCH will be filled with rubbish.

2.4.1 Exercise - running jellyfish

Copy the code above into a shell script, and save it in a module3 folder under your $HOME. Maneuver into this module3 folder and sbatch the script to SLURM.

It should produce a text file countcount.txt in your module3 folder. This file has 2 columns, separated by a space. The first column lists the frequencies (counts) observed when counting K-mers. The second is the ‘frequency of frequencies’. Thus, jellyfish first counts how often each canonical K-mer occurs. This gives a lot of counts, one for each distinct K-mer. Then, jellyfish counts how often we see each count, e.g. if 1000 K-mers each occur 50 times, then the count 50 has been observed 1000 times. The latter is the second column.

Make an R script to read this and make a barplot of the second column versus the first (use geom_col()). Here is a start that you can copy and fill in:

library(tidyverse)

jelly.tbl <- read_delim("___", delim = " ", col_names = c("Kmer_counts", "Count_counts"))
fig <- ggplot(___) +
  geom_col(aes(x = ___, y = ___)) +
  xlim(0, 100)
print(___)

# HINT: To make the plot look nicer, you may replace the largest value by some maximum, e.g. 5e+5
# Add this before plotting:
# jelly.tbl <- jelly.tbl %>% 
#   mutate(Count_counts = pmin(5e+5, Count_counts))

Inspect the plot. What is the most striking observation we make from this? Hint: Two groups of K-mers. What would you say is the K-mer coverage \(D_K\) here? The reads in this data set are 301 bases long. What (approximately) is the read coverage \(D\)?

2.4.2 Exercise - error free reads

What would the K-mer histogram look like if there were no sequencing errors in the data?

We have simulated such reads, and you find them in the files Illumina_MiSeq_errfree_R1.fastq.gz and Illumina_MiSeq_errfree_R2.fastq.gz under $COURSES/BIN310/module3/.

Edit the script from the previous exercise to use these files as input instead. Also, change the name of the output file to countcount_errfree.txt to avoid overwriting the existing result-file. Make the same plot in R from the results you now get. What is the main difference to what you saw for the real (noisy) data?

3 Containers

We have seen that software is installed on orion as modules and that we must load a module before we can run the application. This allows us to have many different versions of applications, compilers and libraries available for different users at the same time without conflicting with each other.

What if we want to run some application that is not installed as a module? How can we make this software available? Well, you may ask the orion support people, in a polite way, to install it as a module. But, we also have another option that we will use from now on.

3.1 Increased flexibility and reproducibility

If you have tried to install software on your own local computer, you have probably met ‘dependency hell’. In order for software A to work properly, you need certain versions of software B, C and D installed as well. Sometimes, though, another software E also depends on software B, but on a different version of it. Then, you cannot have both A and E installed at the same time. This is what we avoid with the module system. Another way of avoiding it is by using containers.

We can think of a container for software A as a package, including software A together with all the other applications, libraries etc. that are needed to run software A. This is all wrapped up in one big file, and when we ‘run’ this it is (almost) independent of what is actually installed on your computer! This means we may have such containers for all kinds of conflicting applications, and run them without them interfering with each other.

What you need in order to use containers is the basic platform software for running them. The two dominating ones are docker and singularity, the latter recently renamed to apptainer. So, for us to use containers we need either docker or apptainer installed on orion. Docker has some problems in a multi-user environment like an HPC, so on HPC systems you will in general find only apptainer. It is, however, in general possible to run docker containers with apptainer, so this is really no limitation.

In BIN310 we will only use containers, not build them. This may change in the (near) future, and if you foresee a future in bioinformatics, you should probably invest time to learn how to build containers. Have a look at https://apptainer.org/docs/user/latest/quick_start.html.

In recent years it has become more and more common for modern biology software to be distributed as containers, and we hardly install software the old way anymore. We simply use the available containers, as we will see shortly. A major benefit of this system is that it makes computations much more reproducible. In science this is of vital importance. Other scientists should be able to reproduce your results, otherwise they are of little value. The lack of reproducibility has been recognized as a major problem in modern biology. In a scientific publication we have to list all software tools used, and their versions, but it could still be a hassle for someone else to reproduce this on another computer, with different libraries and operating system. You would need to install the same software, and the exact same versions of it. Today we simply specify and share the containers we used, and anyone else can use the exact same containers to repeat it!

This is the reason comparatively few software tools are available in the module system. Only some very basic tools are there, and for the rest you use containers.

Another very popular approach, quite similar to containers, is the use of conda, where you install your own local version of a software tool. We will come back to this later in BIN310, but we will focus mostly on the use of apptainer containers in this course.

3.2 Apptainer containers

One of the platforms for running containers is apptainer, which is what we earlier called singularity. Thus, we need apptainer installed on our system, and this is available on orion (and sigma2). Open a Terminal and run

apptainer --help

Notice there are many Available Commands listed. We will focus on exec for now, which is used to execute a command inside an apptainer container. But where do we find the containers?

One big source of containers is found at the galaxy project servers. All the galaxy containers are listed on the website https://depot.galaxyproject.org/singularity/ (bookmark this site!). To download a container to orion, do as follows:

  • Open a web browser, and go to https://depot.galaxyproject.org/singularity/.
  • Scroll down to find the container you seek.
  • Right-click on it, and copy the link.
  • Log into orion with the Terminal.
  • Type in the command wget and then paste in (right click) the copied link, and return.

This should start the download. Notice, the command wget is used for downloading files to orion. For security reasons you are not allowed to download from just any website, but the galaxy servers are on the ‘white list’.

After downloading we always do the following:

  • Add the file extension .sif to the downloaded file (it has no extension originally). This is not strictly necessary, but by convention all such Singularity Image Files should have the extension .sif (this is still used despite the change in name from singularity to apptainer).
  • Make it executable by running chmod ugo+rx *.sif
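
Putting these steps together, the procedure could look something like this in the Terminal (a sketch using the jellyfish container we will meet below; we assume the URL follows the pattern from the galaxy listing, where the ‘:’ in the container name is URL-encoded as %3A):

wget -O jellyfish:2.2.10--h6bb024c_1.sif \
  "https://depot.galaxyproject.org/singularity/jellyfish%3A2.2.10--h6bb024c_1"
chmod ugo+rx jellyfish:2.2.10--h6bb024c_1.sif

Here the -O option names the downloaded file directly, so the .sif extension is added in the same step.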

The nice thing about having the software as a local file is that you can put this into the same project folder where you have your data and other code, and when your study is finished, you can store it all in some archive for later. Should it be necessary to reproduce the results at some later time, you just open the archive, and there you find all you need, including the software you used.



3.3 Running the apptainer container

Let us find a container for the jellyfish software in the listing from above, and see how we can run this. The newest version at the time of writing seems to be jellyfish:2.2.10--h6bb024c_1 (version 2.2.10, build h6bb024c_1). Download this to your module3 folder, give it the file extension and make it executable as shown above.

To run a container we use the command

apptainer exec <container> <command>

where we need some proper command for running the software. This will vary from software to software, and we need to find out, e.g. by reading its manual, how to run the software. In this case we use jellyfish -h to see if we can get some help text:

apptainer exec jellyfish:2.2.10--h6bb024c_1.sif jellyfish -h

This should now open the container and start running its software by the command jellyfish -h, which should hopefully produce some help text.

In the above command line we assume the container file is in the same folder as the shell script we are invoking it from. This is of course not required. The container file could be in any folder, and then you just specify the full path to it (e.g. $HOME/module3/jellyfish:2.2.10--h6bb024c_1.sif) in the command line. However, since we like to keep the containers together with the rest of the code, it is perhaps natural to expect the .sif file to (always) be in the same folder as the shell script. The exception is when several people work together, using the same containers. Instead of each having their own copy, it is better to have one file stored in an open folder, for everyone to use. We will do this in BIN310, but for now you should all download and use your own containers.

To summarize:

  • To run a software tool, we look it up at https://depot.galaxyproject.org/singularity/ to find an available container at galaxy.
  • Download the container to a local .sif file.
  • We no longer need any module load in our shell scripts!

At galaxy there is a very long list of software, and in most cases you will find what you seek there. This means we are now independent of what is installed on orion!

We will come back to containers again later in BIN310, but for now we stick to the use of apptainer and the galaxy containers.

Here is a short video demonstrating how to download a container and run it on orion



3.4 The /cvmfs connection (that we do not use!)

On orion we have the luxury of a /cvmfs connection to the galaxy containers. From the Terminal, list the content of the folder /cvmfs/singularity.galaxyproject.org/:

ls -l /cvmfs/singularity.galaxyproject.org/

Notice that this lists the content of the ‘folder’ as if it were a folder in the orion system, even if it is on a remote server somewhere. Read more about cvmfs (the CernVM File System) in order to understand how this is possible.

Here the same containers you saw at https://depot.galaxyproject.org/singularity/ are listed in subfolders specified by the first two letters in the container name. This means the containers for jellyfish are found under:

ls -l /cvmfs/singularity.galaxyproject.org/j/e/

We could now run the container without downloading it, by

apptainer exec /cvmfs/singularity.galaxyproject.org/j/e/jellyfish:2.2.10--h6bb024c_1 jellyfish -h

Then we make use of the /cvmfs connection. We do not recommend this. First, it is unstable, and may at times not work. Second, there is no guarantee this software version is still on the galaxy server in the (near) future, hampering your reproducibility. Third, this /cvmfs connection is not something you find on every HPC, and we should not depend upon it.

By downloading the container to a local file, you can take it with you to another HPC system, like sigma2, at any other time and use it in the exact same way as you did on orion!



3.4.1 Exercise - run jellyfish by apptainer

Edit the jellyfish shell script from above such that it makes use of the newest container for this software at galaxy instead of the module installed at orion. You may now delete the module statements from the script, and add the apptainer exec <container> (replace <container> with the proper name) in front of the command lines. Note that each time you run jellyfish (both jellyfish count and jellyfish histo) you need to add this. Instead of loading the software once, you invoke the container each time you use it.

3.4.2 Exercise solution

Here is a suggested solution. Notice the command lines for running the software are broken over two lines, using the \ at the end of the first line. This is required; otherwise they would be considered two different command lines:

#!/bin/bash

#SBATCH --nodes=1                    # We always use 1 node
#SBATCH --reservation=BIN310         # For BIN310 course only
#SBATCH --account=bin310             # For BIN310 course only
#SBATCH --ntasks=10                  # The number of threads reserved
#SBATCH --mem=10G                    # The amount of memory reserved
#SBATCH --time=01:00:00              # Runs for maximum this time
#SBATCH --job-name=jellyfish         # Sensible name for the job
#SBATCH --output=jellyfish_%j.log    # Logfile output here

##############
### Settings
###
threads=10
r1=$COURSES/BIN310/module2/Illumina_MiSeq_R1.fastq.gz
r2=$COURSES/BIN310/module2/Illumina_MiSeq_R2.fastq.gz
K=21
hash_size="16M"
count_file=counts.jf
out_file=countcount.txt

###############################
### Copying and de-compressing
### fastq files
###
cp $r1 $SCRATCH
cp $r2 $SCRATCH
r1=$(basename $r1)
r2=$(basename $r2)
pigz -d $SCRATCH/$r1
pigz -d $SCRATCH/$r2
r1=$(basename $SCRATCH/$r1 .gz)
r2=$(basename $SCRATCH/$r2 .gz)

###############################################
### Running software
### First we count canonical K-mers
### Then we make histogram, count of counts
###
apptainer exec jellyfish:2.2.10--h6bb024c_1.sif jellyfish count \
-m $K -s $hash_size -t $threads -C -o $count_file $SCRATCH/$r1 $SCRATCH/$r2
apptainer exec jellyfish:2.2.10--h6bb024c_1.sif jellyfish histo \
-o $out_file $count_file

#######################################
### Post-processing, cleaning up etc.
###
rm $SCRATCH/$r1
rm $SCRATCH/$r2
rm $count_file

Here is a short video demonstrating how we changed the original script



4 Read filtering and correction based on K-mers

Let us now return to the concept of canonical K-mer counting, and use this for correcting reads with errors or filtering them out of the data set.

In module 2 we used the software fastp for filtering reads. This was based on the quality scores of the reads that are supplied in the fastq files. These are probabilistic in nature, i.e. bad quality scores indicate the reads probably have some errors, but this approach does not look into the reads themselves. Also, these quality scores are not always very well calibrated, and may sometimes give us a wrong picture of the actual read quality.

The filtering we do below does not use the quality scores at all. Instead, we count K-mers in the actual reads, and try to filter or repair reads that have ‘strange’ K-mers in them. Thus, we go more directly into the data, instead of making use of some metadata (data about data) that may sometimes be misleading.

4.1 The BayesHammer

The idea of using simple K-mer frequencies to filter or correct reads has been around for some time. We will now study a software solution called BayesHammer that uses this approach. It is included in the assembler software spades, which we will use more later.

Before we start using it, we should try to understand the basic idea behind it. The original paper is found here:

Read this paper, and discuss with each other. As a bioinformatician you are expected to read (and write?) such papers! Here are some guidelines:

4.1.1 Exercise - Running BayesHammer

Let us use BayesHammer on the fastq files from module 2:

  • $COURSES/BIN310/module2/Illumina_MiSeq_R1.fastq.gz
  • $COURSES/BIN310/module2/Illumina_MiSeq_R2.fastq.gz

The BayesHammer is part of the spades application. This is a software used to assemble reads into genomes, but the BayesHammer is included as a first filtering step. We will look at assembly in the next module, and for now we will run spades with the option for only doing this first filtering.

Download the newest apptainer container for spades from galaxy, and put it into your module3 folder.

Make a new shell script, and make use of an apptainer container for spades that you download. It is a good idea to copy the script we used for jellyfish above, save it under a new name (bayeshammer.sh), and edit it to run spades instead.

Note: You no longer need the code where we copied the files to $SCRATCH and decompressed them. spades can read the compressed files directly!

The command line for running spades should look like this:

spades.py --pe-1 1 $r1 --pe-2 1 $r2 --only-error-correction -o $out_dir --memory 99 --threads $threads

Notice this command line makes use of some shell variables (look for the $s). These must be created before the command line, under the Settings:

  • $r1 and $r2 are two shell variables containing the full path to the data files (see the jellyfish script).
  • You have defined a shell variable $out_dir with the path to the folder where you want the output. It is sensible to refer to some folder under $SCRATCH here, typically out_dir=$SCRATCH/bayeshammer. If the folder does not exist, it will be created automatically by spades.
  • We also have the shell variable $threads (see the jellyfish script).
  • We need to reserve 99GB of memory in the header of the script. The spades software requires a lot of memory!

Delete the code under ### Post-processing from the jellyfish script. We will add new code there, but first sbatch the script as it is now. The computations should take some minutes.

The software spades always outputs a lot of stuff, but we focus only on the compressed fastq files you should find in $out_dir/corrected/ once the computations are done. You will recognize the files with R1 and R2 in their names, but also one with unpaired. The latter contains reads that lost their mates during filtering, i.e. in a read-pair one of the reads was so full of errors it was discarded by the BayesHammer. Then the pair is broken, and such unpaired reads are written to a separate file. Typically, the unpaired file is much smaller than the paired files, since most pairs survive this step. We also saw such unpaired reads when we filtered reads in module 2.

The only files we need from this output are these three files. Add code at the end of the script where you

  • Copy the three compressed fastq files to your module3 folder
  • Then delete the entire output folder, i.e. rm -r $out_dir

Finally, sbatch the script again, and verify the three compressed fastq-files show up in your module3 folder.

4.1.2 Exercise solution

#!/bin/bash

#SBATCH --nodes=1                        # We always use 1 node
#SBATCH --reservation=BIN310             # For BIN310 course only
#SBATCH --account=bin310                 # For BIN310 course only
#SBATCH --ntasks=10                      # The number of threads reserved
#SBATCH --mem=99G                        # The amount of memory reserved
#SBATCH --time=01:00:00                  # Runs for maximum this time
#SBATCH --job-name=bayeshammer           # Sensible name for the job
#SBATCH --output=bayeshammer_%j.log      # Logfile output here

##############
### Settings
###
threads=10
r1=$COURSES/BIN310/module2/Illumina_MiSeq_R1.fastq.gz
r2=$COURSES/BIN310/module2/Illumina_MiSeq_R2.fastq.gz
out_dir=$SCRATCH/bayeshammer

######################
### Running software
###
apptainer exec spades:3.15.5--h95f258a_1.sif spades.py \
--pe-1 1 $r1 --pe-2 1 $r2 --only-error-correction -o $out_dir --memory 99 --threads $threads

#######################################
### Post-processing, cleaning up etc.
###
cp $out_dir/corrected/*.fastq.gz .
rm -r $out_dir

To delete a folder you need the -r option to rm, as seen in the code above. The -r means recursive, i.e. you delete a folder and all its subfolders. Concentrate when using this command, and make certain you delete the exact folder you want to delete, since there is no way to undo it. It would be a dramatic error to delete a folder where you have something important in some of its subfolders! The folders we create under $SCRATCH typically contain output from some software, and these we can always re-create by re-running the script, in case we deleted a wrong folder.

4.1.3 Exercise - Run jellyfish on BayesHammer output

Finally, edit your jellyfish script again, to use the output R1 and R2 files from BayesHammer as input to jellyfish (we ignore the unpaired file for now). We want to see if the K-mer counts have changed as a result of running BayesHammer on our data.

Again, read the results into R and plot the counts of counts as before. Compare the jellyfish results to those of the raw data. Discuss the results. What is your conclusion?

4.1.4 Exercise solution

#!/bin/bash

#SBATCH --nodes=1                    # We always use 1 node
#SBATCH --reservation=BIN310         # For BIN310 course only
#SBATCH --account=bin310             # For BIN310 course only
#SBATCH --ntasks=10                  # The number of threads reserved
#SBATCH --mem=10G                    # The amount of memory reserved
#SBATCH --time=01:00:00              # Runs for maximum this time
#SBATCH --job-name=jellyfish         # Sensible name for the job
#SBATCH --output=jellyfish_%j.log    # Logfile output here

##############
### Settings
###
threads=10
r1=Illumina_MiSeq_R1.fastq.00.0_0.cor.fastq.gz
r2=Illumina_MiSeq_R2.fastq.00.0_0.cor.fastq.gz
K=21
hash_size="16M"
count_file=counts.jf
out_file=countcount_bayeshammer.txt

###############################
### Copying and de-compressing
### fastq files
###
cp $r1 $SCRATCH
cp $r2 $SCRATCH
r1=$(basename $r1)
r2=$(basename $r2)
pigz -d $SCRATCH/$r1
pigz -d $SCRATCH/$r2
r1=$(basename $SCRATCH/$r1 .gz)
r2=$(basename $SCRATCH/$r2 .gz)

######################
### Running software
###
apptainer exec jellyfish:2.2.10--h6bb024c_1.sif jellyfish count \
-m $K -s $hash_size -t $threads -C -o $count_file $SCRATCH/$r1 $SCRATCH/$r2
apptainer exec jellyfish:2.2.10--h6bb024c_1.sif jellyfish histo \
-o $out_file $count_file

#######################################
### Post-processing, cleaning up etc.
###
rm $SCRATCH/$r1
rm $SCRATCH/$r2
rm $count_file