In this module we take a look into the world of comparative genomics. This is a big topic, and we only have a brief visit this time. It is all about how to compare the genomes, once we have sequenced and possibly also annotated them. The comparisons are typically made by building phylogenetic trees, or something similar. Such comparison of genomes is relevant in many applications, typically related to outbreaks food safety or epidemiology. It is quite likely some of you will dig a lot more into this in a Master project or similar.
We aim to
- do a pan-genome analysis of a set of genomes, and build trees from the results
- compute mash distances and use these for building trees

Here we now start processing many samples/genomes. This means we also need to focus on how to automate and parallelize the work, e.g. by using array jobs and looping on the computing cluster.
In many applications we are interested in comparing genomes for understanding how they are related. In some cases we are interested in what functions the various genomes may have, but in this module we are more focused on how they are related.
I can think of no better example of an application like this than the pandemic we have just been through. We have been constantly sequencing and monitoring the SARS-CoV-2 virus genomes, looking for new variants and how these are related, to understand how the virus spreads. This is then typically displayed in some kind of phylogenetic tree.
A similar approach is also used a lot in tracing the outbreaks of certain pathogens, e.g. in hospitals, in farm animals or the food industry. Typical of these cases is: We have an isolated bacterium, and we know which species we are looking at, but we want to see the more detailed and recent evolution of this particular strain that causes problems. This may reveal where we have seen similar strains before, and then understand how it has been spreading. Thus, we are typically looking for methods that allow us to discriminate between strains within the same species. It is then natural to presume that whole genome data provide us with the best possible basis for digging up this information.
Before we proceed with some computations, let us briefly overview
some commonly used ideas and phrases related to whole genome data.
Historically, the first attempt to compare bacteria based on more than a single marker gene was Multi Locus Sequence Typing (MLST), originating in the late 1990s. In MLST analyses we sequence a small number (usually 7) of housekeeping genes that all genomes within a species contain. This is done by targeted sequencing, i.e. there is no need to sequence the entire genome, which is both faster and cheaper. Each of these genes has some small variation between the genomes, and each variant is given an arbitrary number, just like a name or label. Then, two genomes are compared by counting how many of these 7 genes have different variants (labels) in the two genomes. Here is an example to illustrate:
Genome A has the variants 1, 3, 2, 1, 11, 5, 1, i.e. each of the 7 gene sequences has been given a number (a label). Such a vector of numbers is referred to as a sequence type (ST). Then, genome B has the variants 2, 2, 2, 1, 3, 5, 1. We see that genes 3, 4, 6 and 7 are identical between these two genomes, while genes 1, 2 and 5 differ. Thus, the distance between them is 3 (3 labels are different).
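A minimal R sketch of this comparison, using the two toy sequence-type vectors from the example above:

# Toy sequence types from the example above (one label per MLST gene)
genome_A <- c(1, 3, 2, 1, 11, 5, 1)
genome_B <- c(2, 2, 2, 1, 3, 5, 1)
# The MLST distance is simply the number of genes with different labels
sum(genome_A != genome_B)   # 3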
Note that for the genes that differ, we never consider how different they are, only that they are not identical. This sounds like a potential loss of information, since the difference may be anything from a single base to large parts of the gene. However, one may also argue the other way, saying that any change of a single gene, large or small, is most likely due to a single evolutionary event, and should be counted as one. Anyway, having the sequence type vectors from many genomes, we can compute all pairwise differences, and from this we can cluster them or make a tree-like dendrogram indicating which genomes are more or less similar.
A problem with this approach is the limited resolution, since the difference between any two genomes must be either 0 or 1 or 2 or…7. We can imagine that two strains that are not really identical may still have identical sequence types when we only consider these genes. These genes have been chosen from the start because they are housekeeping genes, and this probably means they evolve very slowly. Essential parts of the genomic machinery are always highly conserved, since messing them up is devastating to the organism.
But, the MLST is still very much in use, simply because we have, worldwide, established huge collections of variants of the housekeeping genes for many relevant species. These are now available at the PubMLST. For routine surveillance it is also essential that methods are standardized, making it possible to compare against the past.
Instead of just collecting the sequence variant label for each MLST gene, we may of course collect the sequences themselves. In this way we can align all the variants seen in the different genomes, and compare the genomes by making phylogenetic trees in a more conventional way.
The concept of a pan-genome was launched around two decades ago, when we started to sequence many bacterial genomes. When sequencing several different strains within the same species, we realized that the genes we find in these genomes are not all the same. The pan-genome of the species is then the collection of all unique genes we find in all genomes of the species. Roughly, we can divide the genes of a pan-genome into
- core genes, found in (virtually) all genomes of the species
- shell genes, found in many, but not all, genomes
- cloud genes, found in only one or a few genomes
We sometimes denote both shell- and cloud-genes as accessory genes. The accessory genes are ‘nice to have’ type of genes for the specific organism. They may be very important for a strain living in a certain environment, but may also be just randomly acquired and not really be important at all. We must also remember, the cloud genes may very well be a product of mistakes we do in gene prediction! The false positives from gene prediction has a tendency of showing up once, and never seen again (i.e. ORFans). We should not build our analysis on false positive gene predictions! Thus, the shell genes are the most interesting accessory genes.
Note also that a pan-genome may not be restricted to a species. In general, we can say that any clade (e.g. a genus or subspecies) has a pan-genome, but in most practical cases such a clade is a species.
The MLST concept has been extended as a consequence of whole genome sequencing and the existence of pan-genomes. The main reason we only sequenced a few (7) genes in the past was the lack of efficient sequencing technology.
One such extension is the core gene MLST (cgMLST). Instead of 7 genes we consider all the core genes shared by all genomes within a species. It is easy to see this gives a higher resolution than classical MLST, since any difference in any of these genes will add to the differences between genomes. However, some of these differences may be due to errors we make in assembly and gene finding, and the resolution may be lower than it may seem at first. Another difficulty is that the set of core genes is actually unknown. Having sequenced 10 strains of a species, quite a few genes may appear to be core, since they are found in all (these 10) genomes. But, as we keep sequencing more and more strains, some of the ‘original’ core genes just happened to be found in the first 10 genomes we (randomly) chose for sequencing, but are lacking in several of the later genomes. Thus, as we keep sequencing more and more genomes, the set of core genes will in general shrink. The resolution shrinks, and we need to constantly re-analyse data from the past since the set of genes changes.
An alternative is then the whole-genome MLST (wgMLST) using in principle all genes from all genomes. As we discussed above, the cloud genes may be dubious, and in reality we would perhaps choose to include only core and shell genes in our analysis. We are not entirely free of the fact that the gene set changes over time, but we can make it more stable, and at a higher resolution than with core genes only.
Regardless of MLST extension, the comparison of the sequence
variants, and the strict identity criterion, is still very much used.
But since we now have ‘sequence type’ vectors with many labels (one for
each gene), the resolution is quite high. Thus, both cgMLST and wgMLST
are used a lot in comparative genomics. The most studied pathogens
typically have global databases of genes, listing all their known
sequence variants. Having the genes of a new genome we then get a vector
of sequence variants 1, 1, 5, 11, 4, 3, 7, 23, ...
and so
on, one number (=variant) for each gene. Again, two genomes are compared
by comparing these vectors of integers, counting how many differences
they have.
Instead of extracting genes from each genome, could we not simply align the entire genomes, and then build a phylogenetic tree from this?
In principle yes. This would then include all genomic data we have,
and this sounds like the ultimate source of information for comparing
prokaryotic genomes. There are software tools for making such alignment,
e.g. mugsy
(http://mugsy.sourceforge.net/). We should, however,
be aware that this is not a straightforward task. One problem is the
computational load. Aligning many huge sequences like this takes a lot of
time. But, perhaps more important are the problems that arise
from how prokaryotes evolve.
Prokaryotes also evolve by mutation and recombination, but here
recombination typically means horizontal transfer of genomic material,
i.e. entire genomic segments may be transferred from one genome to
another. Also, both processes run pretty fast along a time scale in
comparison to evolution in larger organisms, and we sometimes find big
differences between strains of the same species. Thus, aligning a set of
genomes means aligning rather different sequences, even if they are all
of the same species. When running a software tool like
mugsy
on a set of genomes, we typically see that the
genomes are broken into smaller pieces that are aligned. The reason for
this is that either these pieces occur at different locations, relative
to each other, in the various genomes, or that some pieces are lacking
entirely in some of the genomes. Thus, we rarely get one big alignment
that includes all genomes, but several shorter alignments covering
smaller regions of the genomes. And many (most) of these do not include
all genomes, but a subset of them.
Thus, the alignments we get may be a little difficult to trust, and
for a standard phylogenetic analysis we need alignments where all
genomes are included. Such regions may be few and small, and we end with
something which is far from ‘whole genome’ after all. Only for species
with very little diversity may this be a useful approach. Comparisons to the
MLST approaches above have been made, and whole-genome alignment
approaches are not as popular.
An alternative to whole-genome alignment is to map the reads from each sequenced genome to one common reference genome, and then collect the positions where the genomes differ, i.e. the single nucleotide variants (SNVs). From these positions we build an SNV alignment.
A small example to illustrate: Assume we have a reference genome of 100 positions (bases). The first genome we sequence has reads mapping except for in position 10, 50 and 76. The next genome we sequence also lacks mapping in positions 10, 45 and 80. And so on. We then make an alignment by collecting all of the 100 reference positions where we see some difference between the genomes. Positions where all genomes map are of no interest. Positions where no genomes map are of no interest. Only the positions where we see a difference between the genomes. This we can now directly collect as a multiple alignment, no need for further processing.
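As a small illustration of this bookkeeping, here is a toy R sketch using only the made-up positions from the example above (this is not part of any real SNV pipeline):

# Positions (out of the 100 reference positions) where each sequenced genome
# shows a difference in the toy example above
diff_positions <- list(genome_1 = c(10, 50, 76),
                       genome_2 = c(10, 45, 80))
# The columns of the SNV alignment are the positions where at least one
# genome shows a difference
snv_positions <- sort(unique(unlist(diff_positions)))
snv_positions   # 10 45 50 76 80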
It is easy to see some potential pitfalls of such an approach as
well: First, the choice of reference genome will be decisive. We never
really compare our sequenced genomes against each other, only against
the same reference, and then see how they differ here. If the genomes we
study contain pieces the reference is lacking, we will never include
this in our comparison. Reads will only map to whatever genomic features
the reference has. Second, by focusing on SNVs we may inflate the
differences between genomes severely. We never include the positions
where they are similar, and this will affect the evolutionary distance
computations, and potentially also the tree we make. Also, several SNVs
next to each other should probably not be counted as many evolutionary
events, but rather as one single (recombination) event. However, there
are ways to partly compensate for these potential problems, and
SNV-alignments are much in use, e.g. the CFSAN pipeline tool, see https://snp-pipeline.readthedocs.io/en/latest/.
The approaches we saw above are based on alignments in one way or another. Such methods have the potential to re-construct evolution quite accurately, but there are also some possible problems.
Building alignments is time-consuming and scales horribly. What if we have sequenced 1000 genomes? For any of the extended MLST approaches we would first need to make 1000 assemblies. Then find genes in 1000 assembled genomes, and if we find around 3000 genes in each, we have to compare all 3 million genes to each other to find which are ‘the same gene’ and which of these are core, shell and cloud genes. The SNV alignment approach is much faster, but also involves 1000 readmappings. The whole-genome alignment will never finish at all with 1000 genomes…
If we then add another 100 genomes to this, some of these steps need to be repeated for the entire 1100 genomes. We must also remember that making a multiple alignment of many sequences may also result in many errors in the alignment. The more sequences we align, the more likely it is that positions have been erroneously aligned.
A typical situation when monitoring the occurrences of new pathogens is that new genomes are sequenced all the time, and each new genome should quickly be compared to all those we have collected before, preferably without re-running the entire analysis.
Since the number of sequenced genomes grows rapidly, new ideas have emerged for how we may compare them quickly. One of these involves comparing the genomes by comparing their K-mers, known as min-hashing. We will look closer at this below. These methods also have their limitations, but are interesting in the sense that they represent a completely new approach compared to the classical alignment-based theory. They also scale extremely well, and may be used even for enormous data sets. In fact, some of the ideas behind them are inherited from really Big Data problems, like the internet search engines.
There is a lot of ongoing research along this path, and new ideas and software tools become available all the time. We will only try to get the basic ideas under our skin here. Let us now do some computations!
We will now do some basic computations for comparing a set of genomes based on their pan-genome. The basic steps of a pan-genome analysis may be sketched as:
- Assemble each genome, e.g. by spades.
- Find the genes in each assembled genome, e.g. by prodigal, or alternatively prokka/DRAM to also get some annotations.
- Cluster the predicted genes from all genomes into groups of ‘the same gene’.

We then end up with a set of gene groups or gene clusters or gene families (synonyms). Each group corresponds to a gene, and the members of the group are the variants for this gene, found in the genomes we investigate.
The example data we will use is a subset of a study conducted by
NOFIMA here at campus Ås, see the publication Fagerlund et al, 2020. In this subset we have
Illumina paired-end sequence data from isolates of Listeria
monocytogenes from 4 different food production plants.
Listeria is known to be very clonal, i.e. the genomes of the
various strains are very similar to each other. Let us see if we can
detect some differences between them.
Whenever we do an analysis like this, it is nice to have a table
listing the metadata, i.e. the data about each sample, in this
case about each genome. We have such a table in
/mnt/courses/BIN310/module7/genomes_table.txt
. Copy this
file to your own module7
folder, and read this
tab-separated text file into R to inspect:
library(tidyverse)
genomes.tbl <- read_delim("genomes_table.txt")
head(genomes.tbl)
## # A tibble: 6 × 12
## genome_id abcZ bg1A cat dapE dat ldh lhkA food_production_plant
## <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 SRR11262134 abcZ_6 bglA_5 cat_6 dapE_4 dat_1 ldh_4 lhkA… M1
## 2 SRR11262195 abcZ_6 bglA_5 cat_6 dapE_4 dat_1 ldh_4 lhkA… M1
## 3 SRR11262003 abcZ_6 bglA_5 cat_6 dapE_4 dat_1 ldh_4 lhkA… M1
## 4 SRR11262009 abcZ_6 bglA_5 cat_6 dapE_4 dat_1 ldh_4 lhkA… M1
## 5 SRR11262198 abcZ_6 bglA_5 cat_6 dapE_4 dat_1 ldh_4 lhkA… M1
## 6 SRR11262191 abcZ_6 bglA_5 cat_6 dapE_4 dat_1 ldh_4 lhkA… M4
## # ℹ 3 more variables: R1 <chr>, R2 <chr>, contigs_file <chr>
There are 17 rows, one row for each genome. Each genome has a
genome_id
, which is simply a unique text for each genome.
The last 3 columns are file names associated with each genome. First the
raw reads file (R1
and R2
) and then the file
with assembled contigs for each genome (contigs_file
).
Thus, we have already assembled these genomes, in the same way we did in
module 4. You find the raw reads files in the folder
/mnt/courses/BIN310/module7/fastq/
and the contig files in
/mnt/courses/BIN310/module7/contigs/
. Thus, the first
bullet point from above, the assembly of the genomes, has already been
done.
Before we proceed, let us just briefly mention the MLST results in
the genome table. In the table you also find 7 columns listing the MLST
genes for this species, with short cryptic names (abcZ
,
bg1A
, cat
, dapE
,
dat
, ldh
, lhkA
). Each row lists
the sequence type of the genome, just like you were asked to find in
assignment 2. We can see that all the 17 genomes have the exact same
sequence type! Thus, these are impossible to separate by using only
MLST. We could now compute a 17x17 distance matrix where we sum the
number of differing genes between all pairs of genomes, but this matrix
would then only contain zeros. This means these genomes are extremely
similar to each other.
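If you want to verify this yourself, here is a small sketch (assuming the genomes.tbl from above, with its 7 MLST columns, is still in memory); it mirrors the distance computation we will do later for the pan-matrix:

# Extract the 7 MLST columns and count differing labels for every pair of genomes
mlst.mat <- genomes.tbl %>%
  select(abcZ:lhkA) %>%
  as.matrix()
mlst.dist <- matrix(0, nrow = nrow(mlst.mat), ncol = nrow(mlst.mat))
for(i in 1:nrow(mlst.mat)){
  mlst.dist[i,] <- apply(mlst.mat, 1, function(x){sum(x != mlst.mat[i,])})
}
range(mlst.dist)   # all zeros for these 17 genomes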
Let us now turn the attention to the second bullet point from above,
to find the genes in the genomes.
awk
We will use the prokka
software that we saw in module 6
for the gene prediction/annotation. The main reason for this is that
prokka
outputs a GFF file that has the exact format
required for the next tool we will use below, i.e. purely
convenience!
The only difference to before is that now we must do this for each of the 17 genomes, i.e. each contigs file. We will therefore use this opportunity to see how we can use array jobs in a computing cluster, which is one of the great benefits of having access to such a cluster.
In principle, we could now make a for-loop where we loop
over these 17 genomes, and in each iteration of the loop we call upon
prokka
to get the annotations for each genome. We will come
back to loops, but this means we do all computations
sequentially, i.e. if each job takes 10 minutes to complete,
this would take us 170 minutes (for the 17 genomes). By using array
jobs, we can run many jobs in parallel. In principle we could
run all 17 jobs in parallel, and finish in the same 10 minutes it takes
to annotate a single genome. On a cluster we can spread the jobs out on
many nodes, and each node has many threads and may run several jobs at
the same time.
In order to automate the annotation of many genomes, it is very
convenient to have the genomes_table.txt
file that we saw
above. We will now read this into a shell script, using the
awk
programming language. We will only briefly use
awk
here in BIN310, but if you foresee a
future in bioinformatics, it could be a good investment to learn more
awk
coding.
Let us start out by the shell script for running prokka
using array jobs:
#!/bin/bash
#SBATCH --array=1-17%10 # This is where we specify using array jobs
#SBATCH --nodes=1
#SBATCH --reservation=BIN310
#SBATCH --account=bin310
#SBATCH --ntasks=10
#SBATCH --mem=30G
#SBATCH --time=01:00:00
#SBATCH --job-name=prokka
#SBATCH --output=prokka_%j_%a.log
#################
### Settings
###
threads=10
genomes_table_file=genomes_table.txt
data_folder=$COURSES/BIN310/module7/contigs
tmp_folder=$SCRATCH/multiprokka
gff_folder=gff # we want to have the GFF files here, in this folder
if [ ! -d $tmp_folder ]
then
mkdir $tmp_folder
fi
if [ ! -d $gff_folder ]
then
mkdir $gff_folder
fi
#############################################
### Using awk to read the genomes_table.txt
###
genome_id_column=$(awk -F"\t" 'NR==1{for(i=1;i<=NF;i++){f[$i] = i}}{print $(f["genome_id"])}' $genomes_table_file)
contigs_file_column=$(awk -F"\t" 'NR==1{for(i=1;i<=NF;i++){f[$i] = i}}{print $(f["contigs_file"])}' $genomes_table_file)
row=$(($SLURM_ARRAY_TASK_ID+1))
genome_id=$(echo $genome_id_column | awk -vi=$row '{print $i}')
contigs_file=$(echo $contigs_file_column | awk -vi=$row '{print $i}')
#######################
### Running prokka
###
apptainer exec $HOME/module6/prokka:1.14.6--pl5321hdfd78af_5.sif prokka \
--force \
--cpus $threads \
--outdir $tmp_folder \
--prefix $genome_id \
$data_folder/$contigs_file
cp $tmp_folder/$genome_id.gff $gff_folder
First, in the header section we added
#SBATCH --array=1-17%10
. This tells SLURM to run 17 array
jobs, enumerating them from 1 to 17. The %10
means we
specify never to run more than 10 jobs at a time. This is good
behavior on a cluster, otherwise we may consume too much of the
resources. Without this SLURM will start all 17 jobs if enough resources
are available. If our jobs are short, this is OK, but otherwise we may
block the system for other users. You may also notice we specify the
log-files to contain an extra number in
--output=prokka_%j_%a.log
. The %a
will
add the array job number to each log file name, since we will now get
one log-file for each array job.
Next, we output to a single directory
$SCRATCH/multiprokka
, but we also have a local directory
gff/
in which we want to store the .gff
files
for each genome in the end. We are interested in only this GFF file from
each genome this time, thus the folder on $SCRATCH
can be
deleted once this is done.
Next, we read the file named in $genomes_table_file
using awk
. The line
genome_id_column=$(awk -F"\t" 'NR==1{for(i=1;i<=NF;i++){f[$i] = i}}{print $(f["genome_id"])}' $genomes_table_file)
reads the file, and the -F"\t"
tells awk
to
split the text by each tab. Then we loop over the tokens in the first
line, to find the text "genome_id"
, and then print the
entire column where this is found. This is then stored in the shell
variable genome_id_column
since the entire expression is
inside a $(...)
. Thus, genome_id_column
now
contains all the texts in the column named genome_id
in the
file, including the column name itself.
The next code line is similar, extracting the content of the column
named "contigs_file"
. You may now use this recipe for
extracting any column from such tab-separated text files! Just replace
the text with the column name, and store into a new shell variable.
When running array jobs, a shell variable named
SLURM_ARRAY_TASK_ID
is always created, and this will
contain the number for each job. Since we specified
--array=1-17
this variable will have the value 1 when
running the first job, the value 2 when running the second…etc. We now
use this to extract each row from the table in
genomes_table.txt
. The reason we have the
row=$(($SLURM_ARRAY_TASK_ID+1))
is simply that in the file,
the first row are the column headers. Thus, when
$SLURM_ARRAY_TASK_ID
has the value 1, we want to read row
number 2 in the file, and so on. This is why we create a shell variable
row
having the value of $SLURM_ARRAY_TASK_ID
plus 1. Note that
- the $((...)) syntax is how we do integer arithmetic in the shell, e.g. $((1+1)).
- we would not have needed the extra variable row if we had specified --array=2-18 in the heading! Then we would have used $SLURM_ARRAY_TASK_ID in all places we now use $row.
The shell variable genome_id_column
contains the texts
in the genome_id
column. It is not a vector or
array, like we are used to from R (or Python)! We cannot refer
to $genome_id_column[$row]
or something like that to
extract text number $row
from this. Instead, we use
awk
again. In
genome_id=$(echo $genome_id_column | awk -vi=$row '{print $i}')
we echo
(print) the full content
$genome_id_column
and pipe this into awk
, and
using the $row
as the value of the input argument
i
we simply print token number $row
. Again,
since the expression is inside a $(...)
we capture this and
store it into the shell variable genome_id
. Thus,
genome_id
should now contain the text that identifies
genome number $SLURM_ARRAY_TASK_ID
in our
genomes_table.txt
. It may seem like a complicated procedure
to first read the column, and then pick the element in the column we
seek, but it runs super-fast in the shell.
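To see this token-picking idiom in isolation, here is a tiny toy example (made-up words, not part of the script) that you can paste directly into a Terminal:

# Print token number 2 from a whitespace-separated string, the same trick
# used to pick one element from $genome_id_column above
echo "alpha beta gamma" | awk -v i=2 '{print $i}'     # prints: beta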
The rest of the code is more or less as we did in module 6. Notice
how we use --prefix $genome_id
such that the GFF file
we produce for each genome will always have a name on the format
<genome_id>.gff
. This is actually super-important!
Notice that we use the same $tmp_folder
as output folder
for all array jobs. These jobs will now all write their files to the
same folder! Then it is extremely important they produce files
with unique names otherwise they will overwrite each other, and
it becomes a complete mess (it probably crashes). Beware of this when
using array jobs: Always ask yourself whether the different jobs could
overwrite each other! Remember they (may) run at the same time.
Finally, we copy only the $genome_id.gff
file from the
prokka
output to the $gff_folder
. Once all
array jobs are done (but not before all are done), you may
delete the $tmp_folder
.
The use of array jobs is a very powerful option, and allows us to
speed up work where the same procedure is used to process many files or
instances. It is then common to have a tab-separated file with some kind
of required input in each row, and then use awk
to read
this into the shell script, as we saw above. We typically extract the
names of input files, or texts used to ‘tag’ each output
file/folder.
The maximum number of jobs to run simultaneously (value after
%
in the heading) should depend on how long you expect the
job to take. If a job can be done very quickly (seconds, minute) you may
allow many jobs. We typically use %10
, but for heavy stuff
like assemblies we should stop at %5
. The reason for this
is that SLURM does not know how long a job will take, and if resources
are suddenly available (in the middle of the night) it may start many
array jobs. If these go on for days, the entire system is blocked. Note
that if resources are limited, only a few jobs, or none, are
started immediately, and you have to wait in the queue as usual.
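If you are new to array jobs, it can be instructive to submit the script and then watch the queue. A small sketch, assuming you saved the script above under the (hypothetical) file name multiprokka.sh:

# Submit the array job script (hypothetical file name)
sbatch multiprokka.sh
# List your own jobs; array jobs show up as <jobid>_<taskid>
squeue -u $USER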
Make the shell script for running prokka
on each of the
17 genomes, using array jobs. Copy the shell script code above, and make
certain the lines with awk
code are one single
line (they are too wide to be displayed like this above).
Then, make an R script where you read in the GFF files produced by
prokka
, one by one (use looping) and collect the number of
coding genes in each of the genomes. Store this in a column
(n_genes
) in the genomes table. What is the mean, minimum
and maximum number of coding genes here?
library(tidyverse)
library(microseq)
genomes.tbl <- read_delim("genomes_table.txt") %>%
mutate(n_genes = 0)
for(i in 1:nrow(genomes.tbl)){
gff.tbl <- readGFF(str_c("gff/", genomes.tbl$genome_id[i], ".gff")) %>%
filter(Type == "CDS")
genomes.tbl$n_genes[i] <- nrow(gff.tbl)
}
print(min(genomes.tbl$n_genes))
print(mean(genomes.tbl$n_genes))
print(max(genomes.tbl$n_genes))
roary
Let us turn to the final bullet point of the pan genome computations.
Once we have the predicted genes and annotations from each of the genomes, we need to cluster the genes into gene families or gene clusters. This means we look for the same gene in the various genomes. This is not at all straightforward. How similar must two sequences be in order to be the same gene? There is no clear-cut answer to this. Some cases are pretty clear, but there will always be some ‘twilight zone’ cases where the (subjective) choice of some threshold will decide the outcome.
roary is a pipeline tool, using several other
software tools to do a number of computations needed for a clustering of
the genes. Comparing many genes to many other genes will of course be a
rather big job, with running time \(O(n^2)\), where \(n\) is the number of genomes we consider.
In roary
the software cd-hit
is used for the
main part of this job. This is a K-mer based clustering of protein
sequences that is extremely fast, and is used in the roary
pipeline in a first pre-clustering. Then, blast
is used for
the final clustering, but due to the pre-clustering there is no longer a
need for comparing all-versus-all, only those inside the same
pre-cluster. This speeds up the computations considerably. However,
computing the pan-genome clustering for many genomes is still a rather
large computational job. You find containers for roary
at
galaxy in the usual way, download to your module7
folder.
Let us run roary
in a simplistic way. The input are the
GFF-files produced by prokka
above. Note that
roary
expects these files to contain both the GFF-table and
the fasta-formatted genome sequences, which is exactly how
prokka
outputs annotations. If you use an alternative to
prokka
, you must paste the fasta-file content into the
GFF-file yourself, in the proper way.
The code needed to run roary
is very simple:
##############
### Settings
###
threads=10
out_dir=roary
input_gff_files=gff/*.gff
###################
### Running roary
##
apptainer exec $HOME/module7/roary:3.13.0--pl526h516909a_0.sif roary \
-p $threads -f $out_dir -e -n -r -v $input_gff_files
Note that we expect the input GFF files to be inside the directory
gff/
, and that we can specify them all by using the UNIX
wildcard notation *.gff
.
A short description of the options we used: The -e -n
tells roary
to make a multiple alignment of the core genes
(-e
) and use the software mafft
to do this
(-n
). If you have many genomes, this could be an enormous
job. Do you really need this alignment? The -r
tells
roary
to make some plots with R, and also output some nice
tables we will read into R later. The -v
is to make some
log file output during computations, just to see what is happening.
One option we did not use above is the -i
, specifying
the blast
identity threshold for assigning two genes to the
same gene cluster. This is by default set to 95, i.e. two sequences must
be 95% identical to be the same gene. This is very strict, and may very
well be relaxed. By using a strict threshold here, we get many, but
rather small, clusters. It is impossible to give a universally valid
threshold for this, but it may very well be lowered, e.g. to 75%, or
even less.
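As a sketch of how that would look, here is the same roary command as above with the threshold lowered to 75% (the output folder name roary_i75 is just a hypothetical example; remember roary refuses to write into an existing folder):

apptainer exec $HOME/module7/roary:3.13.0--pl526h516909a_0.sif roary \
  -p 10 -f roary_i75 -i 75 -e -n -r -v gff/*.gff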
The $out_dir
must not be created before you run
roary
! It has been designed to not overwrite an existing
folder, and insists on creating the $out_dir
itself. This
is to prevent accidental destruction of previous results that you may
have used a lot of resources to obtain. Here we just specify a local
folder named roary/
, verify this is created once the job is
started.
Notice that there is no need for array jobs here! The
roary
takes all GFF files as input, not one by one. There
should be no --array=
in the heading of the shell script
for running roary
.
roary
Make the script, using the code above, and add the proper heading.
Note that you do not use array jobs here, there is nothing to
repeat here. You need only 10GB of memory. Download the container as
usual, and edit the script accordingly. Then sbatch
to run
roary
. With the 17 genomes we have, this should take some
minutes to complete.
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --reservation=BIN310
#SBATCH --account=bin310
#SBATCH --ntasks=10
#SBATCH --mem=10G
#SBATCH --time=01:00:00
#SBATCH --job-name=roary
#SBATCH --output=roary_%j.log
##############
### Settings
###
threads=10
out_dir=roary
input_gff_files=gff/*.gff
###################
### Running roary
##
apptainer exec $HOME/module7/roary:3.13.0--pl526h516909a_0.sif roary \
-p $threads -f $out_dir -e -n -r -v $input_gff_files
Let us make a phylogenetic tree from the core-gene alignment we got
from roary
. This you should find in the file
roary/core_gene_alignment.aln
. Despite the unusual file
extension .aln
this is a fasta file.
As you should know from introductory bioinformatics, there are several algorithms for re-constructing trees. We will not dig into this huge topic here now. Let us build a distance-based tree. This means we first compute all pairwise evolutionary distances between the genomes, and then build a tree from this distance matrix, here using the neighbour joining algorithm (nj()).
First, below we will use the R package ape
and
ggtree
, and these you must install. The ape
is
found on CRAN, but ggtree
is at Bioconductor. Search for it
there, and use the recipe for installing it. Actually, if you installed
this as an exercise in Module 1 you already have it…
Here is how we compute the tree, and plot it in R:
library(tidyverse)
library(ape)
library(ggtree)
genomes.tbl <- read_delim("genomes_table.txt")
msa <- read.FASTA("roary/core_gene_alignment.aln")
dist.mat <- dist.dna(msa, model = "raw")
tree <- nj(dist.mat)
fig <- ggtree(tree, layout = "rect") +
geom_tiplab()
## Warning: The tree contained negative edge length. If you want to ignore the edge, you
## can set 'options(ignore.negative.edge=TRUE)', then re-run ggtree.
print(fig)
Note that we read the multiple alignment using the
read.FASTA()
function from the ape
package
here, not the usual readFasta()
from microseq
.
This results in a completely different format (not a table) in R, but
this time we need this format for the next step. In this next step we
compute the evolutionary distances, and here we use the simple
"raw"
model. Repeat from BIN210 if you do not remember what
this is!
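As a reminder, the "raw" model is simply the p-distance, i.e. the fraction of aligned positions where two sequences differ. A toy illustration with made-up sequences (not taken from the real alignment):

# Two aligned toy sequences of length 5, differing in 2 positions
s1 <- c("A", "C", "G", "T", "T")
s2 <- c("A", "C", "G", "A", "C")
sum(s1 != s2) / length(s1)     # 0.4, the 'raw' (p-) distance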
This first tree was not displayed very well by the
ggtree()
function. But, ggtree
has a huge
array of possibilities, here we have added some of them:
fig <- ggtree(tree, layout = "rect") +
geom_tiplab(size = 4, hjust = -0.1) +
geom_treescale(width = 0.00001) +
ggplot2::xlim(0, 0.0002)
## Warning: The tree contained negative edge length. If you want to ignore the edge, you
## can set 'options(ignore.negative.edge=TRUE)', then re-run ggtree.
print(fig)
The geom_treescale()
adds a ‘ruler’ of the given length
to the plot. We can see how long a distance of 0.00001 is here! This
means the distances between the genomes are extremely small, as we
expected from the fact they all have the same MLST sequence type. Note
that the ggplot2::xlim(0, 0.0002)
we added here just
scales the x-axis. This must be manually tuned each time and is just for
fine-tuning a plot. You should probably omit it the first time you plot,
and then add it, with proper limits, before you make the final version.
Have a look at the size of the largest distances to get an idea about
choosing the limits.
The advantage with ggtree
is the endless possibilities
for creating trees of many types, making use of the Grammar of Graphics
inherent in the ggplot2
that we typically use for plotting
in R. One nice feature is the possibility of ‘attaching’ a table of
meta-data to the tree, and making use of this in the plotting. We have the
genomes.tbl
from above. In this table the first
column contains the exact same texts (genome_id
s) as
the labels in the tree we have made, this is important. Then, we can
attach this table, and use texts in the other columns to color texts or
parts of the tree, etc. Here is how we can use it:
fig <- ggtree(tree, layout = "rect") %<+% genomes.tbl +
geom_tiplab(aes(color = food_production_plant), size = 4, hjust = -0.1) +
geom_tippoint(aes(color = food_production_plant), size = 3) +
geom_treescale(width = 0.00001) +
ggplot2::xlim(0, 0.0002)
## Warning: The tree contained negative edge length. If you want to ignore the edge, you
## can set 'options(ignore.negative.edge=TRUE)', then re-run ggtree.
print(fig)
Note we use the operator %<+%
to attach a table to a
ggtree
, and then we may refer to the columns in this table
in the geoms we use in the following lines. This operator is defined in
the ggtree
package, and will only work in this setting (you
cannot ‘attach’ tables to plots like this in general).
If we inspect the last tree, we notice that even if the differences between genomes are super-small here, there is a tendency that samples from the same food production plant group together, making them even more similar.
Let us inspect some of the other results from roary
in
R. Make a new R script and add:
library(tidyverse)
summary.tbl <- suppressMessages(read_delim("roary/summary_statistics.txt", delim = "\t",
col_names = c("description", "thresholds", "N.clusters")))
head(summary.tbl)
## # A tibble: 5 × 3
## description thresholds N.clusters
## <chr> <chr> <dbl>
## 1 Core genes (99% <= strains <= 100%) 2640
## 2 Soft core genes (95% <= strains < 99%) 0
## 3 Shell genes (15% <= strains < 95%) 543
## 4 Cloud genes (0% <= strains < 15%) 87
## 5 Total genes (0% <= strains <= 100%) 3270
We notice that roary
categorizes the gene clusters into
4 types. The 2640 core genes are found in at least 99% of the genomes,
and since we have only 17 genomes it means in all of them. The ‘Soft
core genes’ are those found in 95%-99%, which is none here since we have
so few genomes. The 543 shell genes are defined as being present in
15%-95% of the genomes, while the 87 cloud genes are found in less
than 15% of the genomes (1 or 2 genomes). In total, the pan-genome
contains 3270 gene clusters. Compare this to the number of coding genes
we find in a single genome (see exercise above).
This is a rather non-typical result, in the sense that we often find many more cloud genes, and correspondingly fewer core genes, relative to the total number of genes. Again, this reflects that the genomes we now look into are extremely similar to each other.
An informative data structure output by roary
is the
pan-matrix. This is simply a matrix with one row for each
genome and one column for each gene cluster, and in cell
[i,j]
the value is 1 if gene cluster j
is
found in genome i
, or 0 if not. Let us read this into R,
add to the script above:
pan.tbl <- read_delim("roary/gene_presence_absence.Rtab", delim = "\t")
## Rows: 3270 Columns: 18
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: "\t"
## chr (1): Gene
## dbl (17): SRR11261970, SRR11261982, SRR11262003, SRR11262009, SRR11262015, S...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
pan.mat <- pan.tbl %>%
select(-Gene) %>%
as.matrix() %>%
t() %>%
magrittr::set_colnames(pan.tbl$Gene)
The output from roary
has the matrix transposed, with
genomes in the columns, which is why we transpose the matrix above
(t()
). Let us now count, for each gene cluster, in how many genomes it is present, and plot the distribution of these counts as a bar chart:
tibble(N.genomes = 1:nrow(pan.mat),
frequency = table(factor(colSums(pan.mat), levels = 1:nrow(pan.mat)))) %>%
ggplot() +
geom_col(aes(x = N.genomes, y = frequency))
This figure actually says a lot about the species we are analyzing. First, to the right, the bar at 17 genomes tells us the number of core genes. Then we have rather few gene clusters found in 16, 15… genomes, but there seems to be a small peak at around 7 genomes. Why is this? This must be due to some clade, consisting of 7 genomes! These 7 genomes share many extra gene clusters not shared by the rest, indicating a group of genomes that are very similar to each other.
Towards the left, we find the cloud genes. In this case there are some gene clusters found in 2 genomes, but none found in only 1. This is unusual; often the largest bar of them all is at 1 genome!
A bar not displayed here is the one at 0 genomes! What is this? Well, this is the number of genes we have not yet seen! We need to be aware of the difference between the population and sample pan-genomes:
In this data set we have sampled 17 genomes, but this is a rather small, and pretty random, sample of genomes from the huge population of all Listeria monocytogenes genomes out there. When stating something about this species, we implicitly also include those genomes, even if our current sample only covers 17 genomes. The bar at 0 genomes is then all the gene clusters found in those other strains, that we have not yet seen. If we continue to add further genomes to the data set, we expect to observe some new cloud genes. Eventually, when we have sampled a huge number of genomes, we may expect this to level off, because then we have seen all the genes that circulate among the strains of this species. Thus, the size of the population pan-genome, which is the only interesting one really, is the total number of gene clusters we would see if we collected all strains of this species. Of course, we cannot collect all strains, but we may estimate how many gene clusters we will see if we do. This is a problem very similar to estimating the number of whales in the ocean, or fish in a lake. This capture-recapture problem may be described as follows: we catch some fish, tag them, and release them again; on later fishing trips we record how many of the caught fish are already tagged. In our setting, each new genome is a ‘fishing trip’ and each gene cluster a ‘fish’.
If the population (of fish or genes) is small, we expect to rather
quickly start catching tagged ones, i.e. those we have already seen. If
however, the population is huge, the probability of re-capturing a
tagged fish/gene is small, and we get mostly new fish/genes each fishing
trip/genome. Thus, from the distribution in the bar chart above, we may
estimate how large the population pan-genome is, i.e. the size of the
bar at 0 genomes (which we must add to those we have observed). We will
not dig into this now, but our R package micropan
contains
some tools for finding such estimates, among other stuff.
It is, by the way, not obvious that the size of a pan-genome is
finite, i.e. that the number of ‘new’ genes will eventually level off…
This is something we may discuss!
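If you want to try such an estimate on the data above, a minimal sketch could look like the following. It assumes the micropan package is installed, and that its chao() and heaps() functions accept a pan-matrix with genomes in rows, like the pan.mat we made above; check the package documentation before relying on this.

library(micropan)
# Chao lower-bound estimate of the population pan-genome size (number of gene clusters)
chao(pan.mat)
# Heaps' law fit: an estimated alpha below 1 suggests an 'open' pan-genome,
# i.e. new gene clusters keep appearing as we sequence more genomes
heaps(pan.mat, n.perm = 100)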
Another way of making a tree here is to simply consider the pan-matrix as MLST data, and compute distances between the genomes based on how many gene clusters they do not share. Obviously, the core genes are shared by all, and contribute nothing to such a distance. The important ones here are the shell genes, those shared by many, but not all, genomes. This is where we can see what is typical/non-typical for various subsets of genomes. The cloud genes are less interesting, since they just add to all distances, like a scaling. Let us compute such distances:
library(tidyverse)
pan.tbl <- read_delim("roary/gene_presence_absence.Rtab", delim = "\t")
pan.mat <- pan.tbl %>%
select(-Gene) %>%
as.matrix() %>%
t() %>%
magrittr::set_colnames(pan.tbl$Gene)
dist.mat <- matrix(0, nrow = nrow(pan.mat), ncol = nrow(pan.mat))
rownames(dist.mat) <- colnames(dist.mat) <- rownames(pan.mat)
for(i in 1:nrow(pan.mat)){
dist.mat[i,] <- apply(pan.mat, 1, function(x){sum(x != pan.mat[i,])})
}
Note how we first create the dist.mat
matrix and fill it
with zeros. Since we want to compare 17 genomes, it must have 17 rows
and columns, and the genome identifiers are natural to have as both row
names and column names. Then we loop over all rows, and use the
apply()
function in R to repeatedly sum up the number of
differences between row number i
and each of the rows in
the matrix. This is exactly how we would have computed the distances
based on the MLST data in genomes_table.txt as well. Select
the columns needed and turn them into a matrix (as.matrix()
),
with one row for each genome, and use that instead of
pan.mat
above.
We can now make a tree as before:
library(ape)
library(ggtree)
tree <- nj(as.dist(dist.mat))
genomes.tbl <- read_delim("genomes_table.txt", delim = "\t")
pan.tree <- ggtree(tree, layout = "rect") %<+% genomes.tbl +
geom_tiplab(aes(color = food_production_plant), size = 3, hjust = -0.1) +
geom_tippoint(aes(color = food_production_plant), size = 3) +
geom_treescale(width = 100) +
ggplot2::xlim(0, 500)
print(pan.tree)
Here we computed manhattan distances between the genomes. This is simply the number of gene clusters that differ in presence/absence (1/0) in the pan-matrix, i.e. the distance between two genomes is how many genes they do not share. Thus, the scale is no longer an evolutionary distance, but a number of genes (gene clusters).
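A more compact way to get the same matrix, if you prefer, is R's built-in dist() function; manhattan distance on a 0/1 matrix is exactly the number of differing entries:

# Same distances as the loop above, computed with dist()
dist.mat2 <- as.matrix(dist(pan.mat, method = "manhattan"))
all(dist.mat2 == dist.mat)   # should be TRUE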
In the roary
output directory you find a tree in the
file accessory_binary_genes.fa.newick
that is very similar
to the one above, except for the scale (instead of absolute number of
genes, they use fraction of genes). You may read this into R using the
read.tree()
function in the ape
package, and
plot with ggtree
as above.
The approaches we saw above, and many more, are all based on alignments in one way or another. Such methods have the potential to re-construct evolution quite accurately, but there are also some possible problems.
Building alignments is time-consuming and scales horribly. What if we have sequenced 1000 genomes instead of 17 as we used above? Then we would need to
- assemble all 1000 genomes
- predict the genes in all 1000 assemblies
- cluster the roughly 3 million predicted genes into gene families, and align the core genes
This would take quite some time, especially the last step. If we then add another 100 genomes to this, the first two steps will only involve the new genomes, but the last step needs to be repeated for the entire 1100 genomes. Also, having so many genomes, we expect the strict core genes, i.e. those that are present in all genomes, to be rather few. This means making a core gene alignment is fast, but the question is if this will give us any useful resolution at all. We must also remember that making a multiple alignment of many sequences may also result in many errors in the alignment. The more sequences we align, the more likely it is that positions have been erroneously aligned.
A typical situation when monitoring the occurrences of new pathogens is that new genomes are sequenced all the time, and each new genome should quickly be compared to all those we have collected before, preferably without re-running the entire analysis.
Since the number of sequenced genomes grows rapidly, new ideas have
emerged for how we may compare them quickly. Let us take a look at
another approach based on a smart use of K-mers. These methods also have
their limitations, but are interesting in the sense that they represent
a completely new approach compared to the classical alignment-based
theory. They also scale extremely well, and may be used even for
enormous data sets. In fact, some of the ideas behind them are inherited
from really Big Data problems, like the internet search engines.
The mash software
Let us compute some MASH-distances, and start with the 17 genomes we used above. Download the newest mash container from galaxy.
The mash software has several sub-commands, but we will focus on sketch and dist.
- With mash sketch we extract K-mers from the genomes and store them as a sketch, i.e. a small file where the K-mers are stored in a clever way.
- With mash dist we compare the sketches, which is then a proxy for comparing the genomes. This is extremely fast compared to comparing the entire genomes.

First, let us build a sketch for each genome. Again we use array jobs since we must repeat the same job for each genome. Here is the central code you may use:
#############
### Settings
###
genomes_table_file=genomes_table.txt
data_dir=$COURSES/BIN310/module7/contigs
sketch_size=1000
K=21
out_dir=$SCRATCH/mash
if [ ! -d $out_dir ]
then
mkdir $out_dir
fi
###############################################
### Using awk to read the genomes_table.txt
###
genome_id_column=$(awk -F"\t" 'NR==1{for(i=1;i<=NF;i++){f[$i] = i}}{print $(f["genome_id"])}' $genomes_table_file)
contigs_file_column=$(awk -F"\t" 'NR==1{for(i=1;i<=NF;i++){f[$i] = i}}{print $(f["contigs_file"])}' $genomes_table_file)
row=$(($SLURM_ARRAY_TASK_ID+1))
genome_id=$(echo $genome_id_column | awk -vi=$row '{print $i}')
contigs_file=$(echo $contigs_file_column | awk -vi=$row '{print $i}')
###############
### Sketching
###
apptainer exec $HOME/module7/mash:2.3--he348c14_1.sif mash sketch \
-o $out_dir/$genome_id -s $sketch_size -k $K $data_dir/$contigs_file
Add the proper header to the script (using array jobs), and
sbatch
. You only need 10GB memory, and use 1 thread. There
are still 17 genomes…
Files with names <genome_id>.msh
should be
produced in your out_dir
, one for each genome. These are
small, binary, files with the hashes from the genome we used as input.
Note how we again use the genome_id
as file name
prefix.
The arguments we give to mash sketch
above are rather
self-explanatory. Both sketch_size
and K
are
explained in the lecture above.
Once we have some sketch files, we can compute distances between the
genomes by mash dist
. Since we now have one sketch file for
each genome, we will run mash dist
for all possible pairs.
Above we used array jobs to repeatedly do similar jobs, we actually created a new job for each genome. An alternative to repeating commands is of course by looping, like we are used to from R or other programming languages.
Could we not always use array jobs? Well, there are cases where looping is the more convenient solution. In this case we will even use a double loop, since we want to compute the distance between every pair of genomes, i.e. loop over all sketches both as reference and as query.
Note that since we want the result of each comparison to be added to one single file, it is safest to do this sequentially, i.e. we add one result at the time to this file. If we told many parallel processes to write to the same file, we may risk problems, since they must then open the same file, possibly over-writing each other. If we used array jobs, we would have to tell each job to write to its own file, and then merge these files afterwards.
Here is the central shell code for this looping and computing the mash distances:
###############
### Settings
###
threads=10
sketch_dir=$SCRATCH/mash
sketch_files=$(ls $sketch_dir | grep .msh)
dist_file=dist.txt
if [ -f $dist_file ]
then
rm $dist_file
fi
############
### Looping
###
for ref in $sketch_files;
do
for query in $sketch_files;
do
echo Distance $ref versus $query
apptainer exec $HOME/module7/mash:2.3--he348c14_1.sif mash dist \
-p $threads $sketch_dir/$ref $sketch_dir/$query >> $dist_file
done
done
Note:
- First we list all the sketch file names and store them in sketch_files. We want to loop over these.
- The results will be collected in the file $dist_file, and we delete it if it exists. This will be created when we start looping.
- The looping variables are named ref and query and take as content the various file names as the loop progresses.
- Each call to mash dist will output a single line of results, and this is appended to the $dist_file by the >> operator.

If we did not delete an existing $dist_file
at the
beginning, the results we produce here would just be added to the
existing content. Whenever you use the >>
operator,
make certain the file you add to is either empty or has the exact
content you assume. It is a common mistake to re-run without deleting,
and the output file contains all kinds of rubbish from previous runs as
well.
Make both shell scripts above (first sketching using array
jobs, then computing distances by looping and without using
array jobs) in your module7
folder, and run them. The
file dist.txt
should appear in this directory, and is a
tab-separated text file. Read this file into a table in R. The columns
you may name c("ref", "query", "distance", "p", "n")
, and
you only need to keep the first 3 columns.
NB! Both ref
and query
will be texts with
the file names that contain the genome_id
. Strip them down
such that they become pure genome_id
texts! Hint: use
basename()
and str_remove()
.
Turn the table into a distance matrix, i.e. where each genome occurs
in both rows and columns, and where the number in cell
[i,j]
contains the mash distance between genome number
i
and j
. Hint: pivot_wider()
.
Then use the distance matrix to compute a tree by nj()
as we have done before, and plot the tree. Is it different from the
previous trees for these genomes?
First the shell script for sketching:
#!/bin/bash
#SBATCH --array=1-17%10 # This is where we specify using array jobs
#SBATCH --nodes=1
#SBATCH --reservation=BIN310
#SBATCH --account=bin310
#SBATCH --ntasks=1
#SBATCH --mem=10G
#SBATCH --time=01:00:00
#SBATCH --job-name=mash_sketch
#SBATCH --output=mash_sketch_%j_%a.log
#############
### Settings
###
genomes_table_file=genomes_table.txt
data_dir=$COURSES/BIN310/module7/contigs
sketch_size=1000
K=21
out_dir=$SCRATCH/mash
if [ ! -d $out_dir ]
then
mkdir $out_dir
fi
###############################################
### Using awk to read the genomes_table.txt
###
genome_id_column=$(awk -F"\t" 'NR==1{for(i=1;i<=NF;i++){f[$i] = i}}{print $(f["genome_id"])}' $genomes_table_file)
contigs_file_column=$(awk -F"\t" 'NR==1{for(i=1;i<=NF;i++){f[$i] = i}}{print $(f["contigs_file"])}' $genomes_table_file)
row=$(($SLURM_ARRAY_TASK_ID+1))
genome_id=$(echo $genome_id_column | awk -vi=$row '{print $i}')
contigs_file=$(echo $contigs_file_column | awk -vi=$row '{print $i}')
###############
### Sketching
###
apptainer exec $HOME/module7/mash:2.3--he348c14_1.sif mash sketch \
-o $out_dir/$genome_id -s $sketch_size -k $K $data_dir/$contigs_file
Then the shell script for computing distances:
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --reservation=BIN310
#SBATCH --account=bin310
#SBATCH --ntasks=10
#SBATCH --mem=10G
#SBATCH --time=01:00:00
#SBATCH --job-name=mash_dist
#SBATCH --output=mash_dist_%j.log
###############
### Settings
###
threads=10
sketch_dir=$SCRATCH/mash
sketch_files=$(ls $sketch_dir | grep .msh)
dist_file=dist.txt
if [ -f $dist_file ]
then
rm $dist_file
fi
############
### Looping
###
for ref in $sketch_files;
do
for query in $sketch_files;
do
echo Distance $ref versus $query
apptainer exec $HOME/module7/mash:2.3--he348c14_1.sif mash dist \
-p $threads $sketch_dir/$ref $sketch_dir/$query >> $dist_file
done
done
Finally, the R script for reading the results and plotting:
library(tidyverse)
library(ape)
library(ggtree)
genomes.tbl <- read_delim("/mnt/courses/BIN310/module7/genomes_table.txt", delim = "\t")
mash.tbl <- read_delim("dist.txt", col_names = c("ref", "query", "distance", "p", "n")) %>%
select(ref, query, distance) %>%
mutate(ref = str_remove(basename(ref), "_contigs.fasta")) %>%
mutate(query = str_remove(basename(query), "_contigs.fasta")) %>%
pivot_wider(names_from = query, values_from = distance)
dist.mat <- mash.tbl %>%
select(-ref) %>%
as.matrix() %>%
magrittr::set_rownames(mash.tbl$ref)
tree <- nj(as.dist(dist.mat))
fig <- ggtree(tree) %<+% genomes.tbl +
geom_tiplab(aes(color = food_production_plant), hjust = -0.05) +
geom_tippoint(aes(color = food_production_plant), size = 3) +
theme(legend.position = "right") +
ggplot2::xlim(0, 0.003)
print(fig)
One potential advantage of mash
distances is that we may
now compare genomes directly from the reads! There is actually no need
for doing the assembly to produce the contigs, the K-mers we get from
the contigs should largely be the same as those we have in the
reads!
Edit the scripts above to use the reads as input. Actually, you only need to edit the sketching; once the sketches are there, the rest is identical.
Have a look at the Help text for mash sketch
to
understand how to change the command line.
There is one thing we must remember when using reads: If we have
sequenced with read coverage \(D\) we
expect K-mers to occur on average \(D_K\) times, see module 3 when we talked
about this. Some K-mers occur very rarely, and these are most likely due
to sequencing error. Thus, when collecting K-mers to our sketches, we
should not collect such K-mers. The option -m
should be
used to specify we only want K-mers occurring a minimum number of times
(at least 2, but higher values may be used).
Make the tree, and compare to what you got before. You could also compare the distances directly, to see if they change much.
Sketching reads instead of genomes:
#!/bin/bash
#SBATCH --array=1-17%10
#SBATCH --nodes=1
#SBATCH --reservation=BIN310
#SBATCH --account=bin310
#SBATCH --ntasks=1
#SBATCH --mem=10G
#SBATCH --time=01:00:00
#SBATCH --job-name=mash_sketch
#SBATCH --output=mash_sketch_%j_%a.log
#############
### Settings
###
genomes_table_file=genomes_table.txt
data_dir=$COURSES/BIN310/module7/fastq
sketch_size=1000
K=21
out_dir=$SCRATCH/mash_reads
if [ ! -d $out_dir ]
then
mkdir $out_dir
fi
##############################################
### Reading genome_id from genomes_table.txt
###
genome_id_column=$(awk -F"\t" 'NR==1{for(i=1;i<=NF;i++){f[$i] = i}}{print $(f["genome_id"])}' $genomes_table_file)
R1_column=$(awk -F"\t" 'NR==1{for(i=1;i<=NF;i++){f[$i] = i}}{print $(f["R1"])}' $genomes_table_file)
R2_column=$(awk -F"\t" 'NR==1{for(i=1;i<=NF;i++){f[$i] = i}}{print $(f["R2"])}' $genomes_table_file)
row=$(($SLURM_ARRAY_TASK_ID+1))
genome_id=$(echo $genome_id_column | awk -vi=$row '{print $i}')
R1=$(echo $R1_column | awk -vi=$row '{print $i}')
R2=$(echo $R2_column | awk -vi=$row '{print $i}')
###############
### Sketching
###
apptainer exec $HOME/module7/mash:2.3--he348c14_1.sif mash sketch \
-o $out_dir/$genome_id -s $sketch_size -k $K -m 5 -r $data_dir/$R1 $data_dir/$R2
The option -m 5
indicates we want K-mers occurring at
least 5 times. Could be set even higher…
Once the sketches are there, the computing of distances is identical to what we did above. The R code for making the tree must be modified slightly: In the code above we extract the genome_id information from the file names by (line 3 and 4):
mash.tbl <- read_delim("dist.txt", col_names = c("ref", "query", "distance", "p", "n")) %>%
select(ref, query, distance) %>%
mutate(ref = str_remove(basename(ref), "_contigs.fasta")) %>% # file name to genome_id
mutate(query = str_remove(basename(query), "_contigs.fasta")) %>% # file name to genome_id
pivot_wider(names_from = query, values_from = distance)
This must now be modified, since the file names have a different ending:
mash.tbl <- read_delim("dist_reads.txt", col_names = c("ref", "query", "distance", "p", "n")) %>%
select(ref, query, distance) %>%
mutate(ref = str_remove(basename(ref), "_1.fastq.gz")) %>% # file name to genome_id
mutate(query = str_remove(basename(query), "_1.fastq.gz")) %>% # file name to genome_id
pivot_wider(names_from = query, values_from = distance)
If you forget this, you will not be able to link the
genomes.tbl
to the ggtree
figure object,
giving a cryptic error message like
Error: Must request at least one colour from a hue palette.
You may also need to fine-tune the xlim
limits to make the
plot nice…
K and sketch_size
The K
in our shell code, or \(k\) in the original
paper, is the length of the K-mers we compare. If we use a too small
\(k\) value, the probability that
K-mers from wildly different genomes are identical by chance becomes
too high. We should choose \(k\) ‘large
enough’ to make the probability of random matches ‘small enough’ for our
purpose.
The probability of a given K-mer to occur in a random genome of size \(n\) is \[q = 1 - \left(1 - |\Sigma|^{-k}\right)^n\] where \(|\Sigma|\) is the size of the alphabet (i.e. 4 for DNA sequences). How is the reasoning here?
Solve the equation above for \(k\), and use \(|\Sigma| = 4\). This gives us a formula where we plug in genome size \(n\) and our selected probability \(q\), and compute a \(k\) value. This should be the smallest \(k\) we can use in order to get the probability \(q\) at least as small as chosen. Compute \(k\) for \(n = 3000000\) and for \(q = 0.01\) and \(q = 0.001\).
n <- 3000000 # genome size
q <- 0.01
sigma.size <- 4
k <- -log2(1-(1-q)^(1/n)) / log2(sigma.size)
cat("We need a k of at least", ceiling(k), "to get the probability of random overlap below", q, "\n")
## We need a k of at least 15 to get the probability of random overlap below 0.01
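For reference, the algebra behind the formula used in the code is just a rearrangement of the probability expression above: \[1-q = \left(1 - |\Sigma|^{-k}\right)^n \;\Rightarrow\; |\Sigma|^{-k} = 1 - (1-q)^{1/n} \;\Rightarrow\; k = \frac{-\log_2\left(1 - (1-q)^{1/n}\right)}{\log_2 |\Sigma|}\]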
The sketch_size
determines how many hashes we store,
i.e. the number of distinct K-mers. A larger value means the sketching,
as well as the distance computations later, will take some longer time.
But, they also become more precise, since we compare more K-mers. How
large sketch size \(s\) should we use?
This depends on the chosen \(k\). The
subcommand mash bounds
gives you some error bounds for a
given confidence level. Run this directly in the Terminal. You can
change the \(k\) value and the
confidence level. It is assumed the sketch size \(s\) is very small compared to the genome
size \(n\), which is ‘always’ the case.
Try to run this for various choices of \(k\) (directly in the Terminal). Notice how
mash distance error bounds change by sketch size \(s\) and K-mer length \(k\).
apptainer exec $HOME/module7/mash:2.3--he348c14_1.sif mash bounds
We notice that the mash
distances become very unreliable
as they get bigger. This means we should use this approach mainly when
comparing rather similar genomes. There is another software,
fastANI
, you may try out if you want more reliable larger
distances. It is in some ways an extension of mash
. Instead
of a distance \(D\) it computes Average
Nucleotide Identity (ANI), which is a common measure of similarity
between genomes. In principle \(ANI =
1-D\).