In this module we take a look into the world of comparative genomics. This is a big topic, and we only have a brief visit this time. It is all about how to compare the genomes, once we have sequenced and possibly also annotated them. The comparisons are typically made by building phylogenetic trees, or something similar. Such comparison of genomes is relevant in many applications, typically related to outbreaks food safety or epidemiology. It is quite likely some of you will dig a lot more into this in a Master project or similar.
We aim to
- do a pan-genome analysis of a set of genomes, and build trees from the results
- compute mash distances and use these for building trees

Here we now start processing many samples/genomes. This means we also need to focus on how to automate and parallelize the work, e.g. by using array jobs and looping on the computing cluster.
In many applications we are interested in comparing genomes for understanding how they are related. In some cases we are interested in what functions the various genomes may have, but in this module we are more focused on how they are related.
I can think of no better example of an application like this than the pandemic we have just been through. We have been constantly sequencing and monitoring the SARS-CoV-2 virus genomes, looking for new variants and how these are related, to understand how the virus spreads. This is then typically displayed in some kind of phylogenetic tree.
A similar approach is also used a lot in tracing the outbreaks of certain pathogens, e.g. in hospitals, in farm animals or the food industry. Typical of these cases is: We have an isolated bacterium, and we know which species we are looking at, but we want to see the more detailed and recent evolution of this particular strain that causes problems. This may reveal where we have seen similar strains before, and then understand how it has been spreading. Thus, we are typically looking for methods that allow us to discriminate between strains within the same species. It is then natural to presume that whole genome data provide us with the best possible basis for digging up this information.
Before we proceed with some computations, let us briefly overview
some commonly used ideas and phrases related to whole genome data.
Historically, the first attempt to compare bacteria based on more than a single marker gene was Multi Locus Sequence Typing (MLST), originating in the late 1990s. In MLST analyses we sequence a small number (usually 7) of housekeeping genes that all genomes within a species contain. This is done by targeted sequencing, i.e. there is no need to sequence the entire genome, which is both faster and cheaper. Each of these genes has some small variation between the genomes, and each variant is given an arbitrary number, just like a name or label. Then, two genomes are compared by counting how many of these 7 genes have different variants (labels) in the two genomes. Here is an example to illustrate:
Genome A has the variants 1, 3, 2, 1, 11, 5, 1, i.e. each of the 7 gene sequences has been given a number (a label). Such a vector of numbers is referred to as a sequence type (ST). Then, genome B has the variants 2, 2, 2, 1, 3, 5, 1. We see that genes 3, 4, 6 and 7 are identical between these two genomes, while genes 1, 2 and 5 differ. Thus, the distance between them is 3 (3 labels are different).
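A minimal R sketch of this comparison, using the two toy sequence-type vectors from the example above:

# Toy sequence types from the example above (one label per MLST gene)
genome_A <- c(1, 3, 2, 1, 11, 5, 1)
genome_B <- c(2, 2, 2, 1, 3, 5, 1)
# The MLST distance is simply the number of genes with different labels
sum(genome_A != genome_B)   # 3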
Note that for the genes that differ, we never consider how different they are, only that they are not identical. This sounds like a potential loss of information, since the difference may be anything from a single base to large parts of the gene. However, one may also argue the other way, saying that any change of a single gene, large or small, is most likely due to a single evolutionary event, and should be counted as one. Anyway, having the sequence type vectors from many genomes, we can compute all pairwise differences, and from this we can cluster them or make a tree-like dendrogram indicating which genomes are more or less similar.
A problem with this approach is the limited resolution, since the difference between any two genomes must be either 0 or 1 or 2 or…7. We can imagine that two strains that are not really identical may still have identical sequence types when we only consider these genes. These genes have been chosen from the start because they are housekeeping genes, and this probably means they evolve very slowly. Essential parts of the genomic machinery are always highly conserved, since messing them up is devastating to the organism.
But, the MLST is still very much in use, simply because we have, worldwide, established huge collections of variants of the housekeeping genes for many relevant species. These are now available at the PubMLST. For routine surveillance it is also essential that methods are standardized, making it possible to compare against the past.
Instead of just collecting the sequence variant label for each MLST gene, we may of course collect the sequences themselves. In this way we can align all the variants seen in the different genomes, and compare the genomes by making phylogenetic trees in a more conventional way.
The concept of a pan-genome was launched around two decades ago, when we started to sequence many bacterial genomes. When sequencing several different strains within the same species, we realized that the genes we find in these genomes are not all the same. The pan-genome of the species is then the collection of all unique genes we find in all genomes of the species. Roughly, we can divide the genes of a pan-genome into
- core genes, found in (virtually) all genomes of the species
- shell genes, found in many, but not all, genomes
- cloud genes, found in only one or a few genomes
We sometimes denote both shell- and cloud-genes as accessory genes. The accessory genes are ‘nice to have’ type of genes for the specific organism. They may be very important for a strain living in a certain environment, but may also be just randomly acquired and not really be important at all. We must also remember, the cloud genes may very well be a product of mistakes we do in gene prediction! The false positives from gene prediction has a tendency of showing up once, and never seen again (i.e. ORFans). We should not build our analysis on false positive gene predictions! Thus, the shell genes are the most interesting accessory genes.
Note also that a pan-genome may not be restricted to a species. In general, we can say that any clade (e.g. a genus or subspecies) has a pan-genome, but in most practical cases such a clade is a species.
The MLST concept has been extended as a consequence of whole genome sequencing and the existence of pan-genomes. The main reason we only sequenced a few (7) genes in the past was the lack of efficient sequencing technology.
One such extension is the core gene MLST (cgMLST). Instead of 7 genes we consider all the core genes shared by all genomes within a species. It is easy to see this gives a higher resolution than classical MLST, since any difference in any of these genes will add to the differences between genomes. However, some of these differences may be due to errors we make in assembly and gene finding, and the resolution may be lower than it may seem at first. Another difficulty is that the set of core genes is actually unknown. Having sequenced 10 strains of a species, quite a few genes may appear to be core, since they are found in all (these 10) genomes. But, as we keep sequencing more and more strains, some of the ‘original’ core genes just happened to be found in the first 10 genomes we (randomly) chose for sequencing, but are lacking in several of the later genomes. Thus, as we keep sequencing more and more genomes, the set of core genes will in general shrink. The resolution shrinks, and we need to constantly re-analyse data from the past since the set of genes changes.
An alternative is then the whole-genome MLST (wgMLST) using in principle all genes from all genomes. As we discussed above, the cloud genes may be dubious, and in reality we would perhaps choose to include only core and shell genes in our analysis. We are not entirely free of the fact that the gene set changes over time, but we can make it more stable, and at a higher resolution than with core genes only.
Regardless of MLST extension, the comparison of the sequence
variants, and the strict identity criterion, is still very much used.
But since we now have ‘sequence type’ vectors with many labels (one for
each gene), the resolution is quite high. Thus, both cgMLST and wgMLST
are used a lot in comparative genomics. The most studied pathogens
typically have global databases of genes, listing all their known
sequence variants. Having the genes of a new genome we then get a vector
of sequence variants 1, 1, 5, 11, 4, 3, 7, 23, ...
and so
on, one number (=variant) for each gene. Again, two genomes are compared
by comparing these vectors of integers, counting how many differences
they have.
Instead of extracting genes from each genome, could we not simply align the entire genomes, and then build a phylogenetic tree from this?
In principle yes. This would then include all genomic data we have,
and this sounds like the ultimate source of information for comparing
prokaryotic genomes. There are software tools for making such alignment,
e.g. mugsy
(http://mugsy.sourceforge.net/). We should, however,
be aware that this is not a straightforward task. One problem is the
computational load. Aligning many huge sequences like this takes a lot of
time. But, perhaps more important are the problems that arise
from how prokaryotes evolve.
Prokaryotes also evolve by mutation and recombination, but here
recombination typically means horizontal transfer of genomic material,
i.e. entire genomic segments may be transferred from one genome to
another. Also, both processes run pretty fast along a time scale in
comparison to evolution in larger organisms, and we sometimes find big
differences between strains of the same species. Thus, aligning a set of
genomes means aligning rather different sequences, even if they are all
of the same species. When running a software tool like
mugsy
on a set of genomes, we typically see that the
genomes are broken into smaller pieces that are aligned. The reason for
this is that either these pieces occur at different locations, relative
to each other, in the various genomes, or that some pieces are lacking
entirely in some of the genomes. Thus, we rarely get one big alignment
that includes all genomes, but several shorter alignments covering
smaller regions of the genomes. And many (most) of these do not include
all genomes, but a subset of them.
Thus, the alignments we get may be a little difficult to trust, and
for a standard phylogenetic analysis we need alignments where all
genomes are included. Such regions may be few and small, and we end with
something which is far from ‘whole genome’ after all. Only for species
with very little diversity may this be a useful approach. Comparisons to the
MLST approaches above have been made, and whole-genome alignment
approaches are not as popular.
An alternative to whole-genome alignment is to map the reads from each sequenced genome to one common reference genome, and then collect the positions where the genomes differ, i.e. the single nucleotide variants (SNVs). From these positions we build an SNV alignment.
A small example to illustrate: Assume we have a reference genome of 100 positions (bases). The first genome we sequence has reads mapping except for in position 10, 50 and 76. The next genome we sequence also lacks mapping in positions 10, 45 and 80. And so on. We then make an alignment by collecting all of the 100 reference positions where we see some difference between the genomes. Positions where all genomes map are of no interest. Positions where no genomes map are of no interest. Only the positions where we see a difference between the genomes. This we can now directly collect as a multiple alignment, no need for further processing.
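As a small illustration of this bookkeeping, here is a toy R sketch using only the made-up positions from the example above (this is not part of any real SNV pipeline):

# Positions (out of the 100 reference positions) where each sequenced genome
# shows a difference in the toy example above
diff_positions <- list(genome_1 = c(10, 50, 76),
                       genome_2 = c(10, 45, 80))
# The columns of the SNV alignment are the positions where at least one
# genome shows a difference
snv_positions <- sort(unique(unlist(diff_positions)))
snv_positions   # 10 45 50 76 80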
It is easy to see some potential pitfalls of such an approach as
well: First, the choice of reference genome will be decisive. We never
really compare our sequenced genomes against each other, only against
the same reference, and then see how they differ here. If the genomes we
study contain pieces the reference is lacking, we will never include
this in our comparison. Reads will only map to whatever genomic features
the reference has. Second, by focusing on SNVs we may inflate the
differences between genomes severely. We never include the positions
where they are similar, and this will affect the evolutionary distance
computations, and potentially also the tree we make. Also, several SNVs
next to each other should probably not be counted as many evolutionary
events, but rather as one single (recombination) event. However, there
are ways to partly compensate for these potential problems, and
SNV-alignments are much in use, e.g. the CFSAN pipeline tool, see https://snp-pipeline.readthedocs.io/en/latest/.
The approaches we saw above are based on alignments in one way or another. Such methods have the potential to re-construct evolution quite accurately, but there are also some possible problems.
Building alignments is time-consuming and scales horribly. What if we have sequenced 1000 genomes? For any of the extended MLST approaches we would first need to make 1000 assemblies. Then find genes in 1000 assembled genomes, and if we find around 3000 genes in each, we have to compare all 3 million genes to each other to find which are ‘the same gene’ and which of these are core, shell and cloud genes. The SNV alignment approach is much faster, but also involves 1000 readmappings. The whole-genome alignment will never finish at all with 1000 genomes…
If we then add another 100 genomes to this, some of these steps need to be repeated for the entire 1100 genomes. We must also remember that making a multiple alignment of many sequences may also result in many errors in the alignment. The more sequences we align, the more likely it is that positions have been erroneously aligned.
A typical situation when monitoring the occurrences of new pathogens is that new genomes are sequenced all the time, and each new genome should quickly be compared to all those we have collected before, preferably without re-running the entire analysis.
Since the number of sequenced genomes grows rapidly, new ideas have emerged for how we may compare them quickly. One of these involves comparing the genomes by comparing their K-mers, known as min-hashing. We will look closer at this below. These methods also have their limitations, but are interesting in the sense that they represent a completely new approach compared to the classical alignment-based theory. They also scale extremely well, and may be used even for enormous data sets. In fact, some of the ideas behind them are inherited from really Big Data problems, like the internet search engines.
There is a lot of ongoing research along this path, and new ideas and software tools become available all the time. We will only try to get the basic ideas under our skin here. Let us now do some computations!
We will now do some basic computations for comparing a set of genomes based on their pan-genome. The basic steps of a pan-genome analysis may be sketched as:
- Assemble each genome, e.g. by spades.
- Find the genes in each assembled genome, e.g. by prodigal, or alternatively prokka/DRAM to also get some annotations.
- Cluster the predicted genes from all genomes into groups of ‘the same gene’.

We then end up with a set of gene groups or gene clusters or gene families (synonyms). Each group corresponds to a gene, and the members of the group are the variants for this gene, found in the genomes we investigate.
The example data we will use is a subset of a study conducted by
NOFIMA here at campus Ås, see the publication Fagerlund et al, 2020. In this subset we have
Illumina paired-end sequence data from isolates of Listeria
monocytogenes from 4 different food production plants.
Listeria is known to be very clonal, i.e. the genomes of the
various strains are very similar to each other. Let us see if we can
detect some differences between them.
Whenever we do an analysis like this, it is nice to have a table
listing the metadata, i.e. the data about each sample, in this
case about each genome. We have such a table in
/mnt/courses/BIN310/module7/genomes_table.txt
. Copy this
file to your own module7
folder, and read this
tab-separated text file into R to inspect:
library(tidyverse)
genomes.tbl <- read_delim("genomes_table.txt")
head(genomes.tbl)
## # A tibble: 6 × 12
## genome_id abcZ bg1A cat dapE dat ldh lhkA food_production_plant
## <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 SRR11262134 abcZ_6 bglA_5 cat_6 dapE_4 dat_1 ldh_4 lhkA… M1
## 2 SRR11262195 abcZ_6 bglA_5 cat_6 dapE_4 dat_1 ldh_4 lhkA… M1
## 3 SRR11262003 abcZ_6 bglA_5 cat_6 dapE_4 dat_1 ldh_4 lhkA… M1
## 4 SRR11262009 abcZ_6 bglA_5 cat_6 dapE_4 dat_1 ldh_4 lhkA… M1
## 5 SRR11262198 abcZ_6 bglA_5 cat_6 dapE_4 dat_1 ldh_4 lhkA… M1
## 6 SRR11262191 abcZ_6 bglA_5 cat_6 dapE_4 dat_1 ldh_4 lhkA… M4
## # ℹ 3 more variables: R1 <chr>, R2 <chr>, contigs_file <chr>
There are 17 rows, one row for each genome. Each genome has a
genome_id
, which is simply a unique text for each genome.
The last 3 columns are file names associated with each genome. First the
raw reads file (R1
and R2
) and then the file
with assembled contigs for each genome (contigs_file
).
Thus, we have already assembled these genomes, in the same way we did in
module 4. You find the raw reads files in the folder
/mnt/courses/BIN310/module7/fastq/
and the contig files in
/mnt/courses/BIN310/module7/contigs/
. Thus, the first
bullet point from above, the assembly of the genomes, has already been
done.
Before we proceed, let us just briefly mention the MLST results in
the genome table. In the table you also find 7 columns listing the MLST
genes for this species, with short cryptic names (abcZ
,
bg1A
, cat
, dapE
,
dat
, ldh
, lhkA
). Each row lists
the sequence type of the genome, just like you were asked to find in
assignment 2. We can see that all the 17 genomes have the exact same
sequence type! Thus, these are impossible to separate by using only
MLST. We could now compute a 17x17 distance matrix where we sum the
number of differing genes between all pairs of genomes, but this matrix
would then only contain zeros. This means these genomes are extremely
similar to each other.
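If you want to verify this yourself, here is a small sketch (assuming the genomes.tbl from above, with its 7 MLST columns, is still in memory); it mirrors the distance computation we will do later for the pan-matrix:

# Extract the 7 MLST columns and count differing labels for every pair of genomes
mlst.mat <- genomes.tbl %>%
  select(abcZ:lhkA) %>%
  as.matrix()
mlst.dist <- matrix(0, nrow = nrow(mlst.mat), ncol = nrow(mlst.mat))
for(i in 1:nrow(mlst.mat)){
  mlst.dist[i,] <- apply(mlst.mat, 1, function(x){sum(x != mlst.mat[i,])})
}
range(mlst.dist)   # all zeros for these 17 genomes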
Let us now turn the attention to the second bullet point from above,
to find the genes in the genomes.
awk
We will use the prokka
software that we saw in module 6
for the gene prediction/annotation. The main reason for this is that
prokka
outputs a GFF file that has the exact format
required for the next tool we will use below, i.e. purely
convenience!
The only difference to before is that now we must do this for each of the 17 genomes, i.e. each contigs file. We will therefore use this opportunity to see how we can use array jobs in a computing cluster, which is one of the great benefits of having access to such a cluster.
In principle, we could now make a for-loop where we loop
over these 17 genomes, and in each iteration of the loop we call upon
prokka
to get the annotations for each genome. We will come
back to loops, but this means we do all computations
sequentially, i.e. if each job takes 10 minutes to complete,
this would take us 170 minutes (for the 17 genomes). By using array
jobs, we can run many jobs in parallel. In principle we could
run all 17 jobs in parallel, and finish in the same 10 minutes it takes
to annotate a single genome. On a cluster we can spread the jobs out on
many nodes, and each node has many threads and may run several jobs at
the same time.
In order to automate the annotation of many genomes, it is very
convenient to have the genomes_table.txt
file that we saw
above. We will now read this into a shell script, using the
awk
programming language. We will only briefly use
awk
here in BIN310, but if you foresee a
future in bioinformatics, it could be a good investment to learn more
awk
coding.
Let us start out by the shell script for running prokka
using array jobs:
#!/bin/bash
#SBATCH --array=1-17%10 # This is where we specify using array jobs
#SBATCH --nodes=1
#SBATCH --reservation=BIN310
#SBATCH --account=bin310
#SBATCH --ntasks=10
#SBATCH --mem=30G
#SBATCH --time=01:00:00
#SBATCH --job-name=prokka
#SBATCH --output=prokka_%j_%a.log
#################
### Settings
###
threads=10
genomes_table_file=genomes_table.txt
data_folder=$COURSES/BIN310/module7/contigs
tmp_folder=$SCRATCH/multiprokka
gff_folder=gff # we want to have the GFF files here, in this folder
if [ ! -d $tmp_folder ]
then
mkdir $tmp_folder
fi
if [ ! -d $gff_folder ]
then
mkdir $gff_folder
fi
#############################################
### Using awk to read the genomes_table.txt
###
genome_id_column=$(awk -F"\t" 'NR==1{for(i=1;i<=NF;i++){f[$i] = i}}{print $(f["genome_id"])}' $genomes_table_file)
contigs_file_column=$(awk -F"\t" 'NR==1{for(i=1;i<=NF;i++){f[$i] = i}}{print $(f["contigs_file"])}' $genomes_table_file)
row=$(($SLURM_ARRAY_TASK_ID+1))
genome_id=$(echo $genome_id_column | awk -vi=$row '{print $i}')
contigs_file=$(echo $contigs_file_column | awk -vi=$row '{print $i}')
#######################
### Running prokka
###
apptainer exec $HOME/module6/prokka:1.14.6--pl5321hdfd78af_5.sif prokka \
--force \
--cpus $threads \
--outdir $tmp_folder \
--prefix $genome_id \
$data_folder/$contigs_file
cp $tmp_folder/$genome_id.gff $gff_folder
First, in the header section we added
#SBATCH --array=1-17%10
. This tells SLURM to run 17 array
jobs, enumerating them from 1 to 17. The %10
means we
specify never to run more than 10 jobs at a time. This is good
behavior on a cluster, otherwise we may consume too much of the
resources. Without this SLURM will start all 17 jobs if enough resources
are available. If our jobs are short, this is OK, but otherwise we may
block the system for other users. You may also notice we specify the
log-files to contain an extra number in
--output=prokka_%j_%a.log
. The %a
will
add the array job number to each log file name, since we will now get
one log-file for each array job.
Next, we output to a single directory
$SCRATCH/multiprokka
, but we also have a local directory
gff/
in which we want to store the .gff
files
for each genome in the end. We are interested in only this GFF file from
each genome this time, thus the folder on $SCRATCH
can be
deleted once this is done.
Next, we read the file named in $genomes_table_file
using awk
. The line
genome_id_column=$(awk -F"\t" 'NR==1{for(i=1;i<=NF;i++){f[$i] = i}}{print $(f["genome_id"])}' $genomes_table_file)
reads the file, and the -F"\t"
tells awk
to
split the text by each tab. Then we loop over the tokens in the first
line, to find the text "genome_id"
, and then print the
entire column where this is found. This is then stored in the shell
variable genome_id_column
since the entire expression is
inside a $(...)
. Thus, genome_id_column
now
contains all the texts in the column named genome_id
in the
file, including the column name itself.
The next code line is similar, extracting the content of the column
named "contigs_file"
. You may now use this recipe for
extracting any column from such tab-separated text files! Just replace
the text with the column name, and store into a new shell variable.
When running array jobs, a shell variable named
SLURM_ARRAY_TASK_ID
is always created, and this will
contain the number for each job. Since we specified
--array=1-17
this variable will have the value 1 when
running the first job, the value 2 when running the second…etc. We now
use this to extract each row from the table in
genomes_table.txt
. The reason we have the
row=$(($SLURM_ARRAY_TASK_ID+1))
is simply that in the file,
the first row are the column headers. Thus, when
$SLURM_ARRAY_TASK_ID
has the value 1, we want to read row
number 2 in the file, and so on. This is why we create a shell variable
row
having the value of $SLURM_ARRAY_TASK_ID
plus 1. Note that
- the $((...)) syntax is how we do integer arithmetic in the shell, e.g. $((1+1)).
- we would not have needed the extra variable row if we had specified --array=2-18 in the heading! Then we would have used $SLURM_ARRAY_TASK_ID in all places we now use $row.
The shell variable genome_id_column
contains the texts
in the genome_id
column. It is not a vector or
array, like we are used to from R (or Python)! We cannot refer
to $genome_id_column[$row]
or something like that to
extract text number $row
from this. Instead, we use
awk
again. In
genome_id=$(echo $genome_id_column | awk -vi=$row '{print $i}')
we echo
(print) the full content
$genome_id_column
and pipe this into awk
, and
using the $row
as the value of the input argument
i
we simply print token number $row
. Again,
since the expression is inside a $(...)
we capture this and
store it into the shell variable genome_id
. Thus,
genome_id
should now contain the text that identifies
genome number $SLURM_ARRAY_TASK_ID
in our
genomes_table.txt
. It may seem like a complicated procedure
to first read the column, and then pick the element in the column we
seek, but it runs super-fast in the shell.
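To see this token-picking idiom in isolation, here is a tiny toy example (made-up words, not part of the script) that you can paste directly into a Terminal:

# Print token number 2 from a whitespace-separated string, the same trick
# used to pick one element from $genome_id_column above
echo "alpha beta gamma" | awk -v i=2 '{print $i}'     # prints: beta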
The rest of the code is more or less as we did in module 6. Notice
how we use --prefix $genome_id
such that the GFF file
we produce for each genome will always have a name on the format
<genome_id>.gff
. This is actually super-important!
Notice that we use the same $tmp_folder
as output folder
for all array jobs. These jobs will now all write their files to the
same folder! Then it is extremely important they produce files
with unique names otherwise they will overwrite each other, and
it becomes a complete mess (it probably crashes). Beware of this when
using array jobs: Always ask yourself whether the different jobs could
overwrite each other! Remember they (may) run at the same time.
Finally, we copy only the $genome_id.gff
file from the
prokka
output to the $gff_folder
. Once all
array jobs are done (but not before all are done), you may
delete the $tmp_folder
.
The use of array jobs is a very powerful option, and allows us to
speed up work where the same procedure is used to process many files or
instances. It is then common to have a tab-separated file with some kind
of required input in each row, and then use awk
to read
this into the shell script, as we saw above. We typically extract the
names of input files, or texts used to ‘tag’ each output
file/folder.
The maximum number of jobs to run simultaneously (value after
%
in the heading) should depend on how long you expect the
job to take. If a job can be done very quickly (seconds, minute) you may
allow many jobs. We typically use %10
, but for heavy stuff
like assemblies we should stop at %5
. The reason for this
is that SLURM does not know how long a job will take, and if resources
are suddenly available (in the middle of the night) it may start many
array jobs. If these go on for days, the entire system is blocked. Note
that if resources are limited, only a few jobs, or none, are
started immediately, and you have to wait in the queue as usual.
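If you are new to array jobs, it can be instructive to submit the script and then watch the queue. A small sketch, assuming you saved the script above under the (hypothetical) file name multiprokka.sh:

# Submit the array job script (hypothetical file name)
sbatch multiprokka.sh
# List your own jobs; array jobs show up as <jobid>_<taskid>
squeue -u $USER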
Make the shell script for running prokka
on each of the
17 genomes, using array jobs. Copy the shell script code above, and make
certain the lines with awk
code are one single
line (they are too wide to be displayed like this above).
Then, make an R script where you read in the GFF files produced by
prokka
, one by one (use looping) and collect the number of
coding genes in each of the genomes. Store this in a column
(n_genes
) in the genomes table. What is the mean, minimum
and maximum number of coding genes here?
library(tidyverse)
library(microseq)
genomes.tbl <- read_delim("genomes_table.txt") %>%
mutate(n_genes = 0)
for(i in 1:nrow(genomes.tbl)){
gff.tbl <- readGFF(str_c("gff/", genomes.tbl$genome_id[i], ".gff")) %>%
filter(Type == "CDS")
genomes.tbl$n_genes[i] <- nrow(gff.tbl)
}
print(min(genomes.tbl$n_genes))
print(mean(genomes.tbl$n_genes))
print(max(genomes.tbl$n_genes))
roary
Let us turn to the final bullet point of the pan genome computations.
Once we have the predicted genes and annotations from each of the genomes, we need to cluster the genes into gene families or gene clusters. This means we look for the same gene in the various genomes. This is not at all straightforward. How similar must two sequences be in order to be the same gene? There is no clear-cut answer to this. Some cases are pretty clear, but there will always be some ‘twilight zone’ cases where the (subjective) choice of some threshold will decide the outcome.
roary is a pipeline tool, using several other
software tools to do a number of computations needed for a clustering of
the genes. Comparing many genes to many other genes will of course be a
rather big job, with running time \(O(n^2)\), where \(n\) is the number of genomes we consider.
In roary
the software cd-hit
is used for the
main part of this job. This is a K-mer based clustering of protein
sequences that is extremely fast, and is used in the roary
pipeline in a first pre-clustering. Then, blast
is used for
the final clustering, but due to the pre-clustering there is no longer a
need for comparing all-versus-all, only those inside the same
pre-cluster. This speeds up the computations considerably. However,
computing the pan-genome clustering for many genomes is still a rather
large computational job. You find containers for roary
at
galaxy in the usual way, download to your module7
folder.
Let us run roary
in a simplistic way. The input are the
GFF-files produced by prokka
above. Note that
roary
expects these files to contain both the GFF-table and
the fasta-formatted genome sequences, which is exactly how
prokka
outputs annotations. If you use an alternative to
prokka
, you must paste the fasta-file content into the
GFF-file yourself, in the proper way.
The code needed to run roary
is very simple:
##############
### Settings
###
threads=10
out_dir=roary
input_gff_files=gff/*.gff
###################
### Running roary
##
apptainer exec $HOME/module7/roary:3.13.0--pl526h516909a_0.sif roary \
-p $threads -f $out_dir -e -n -r -v $input_gff_files
Note that we expect the input GFF files to be inside the directory
gff/
, and that we can specify them all by using the UNIX
wildcard notation *.gff
.
A short description of the options we used: The -e -n
tells roary
to make a multiple alignment of the core genes
(-e
) and use the software mafft
to do this
(-n
). If you have many genomes, this could be an enormous
job. Do you really need this alignment? The -r
tells
roary
to make some plots with R, and also output some nice
tables we will read into R later. The -v
is to make some
log file output during computations, just to see what is happening.
One option we did not use above is the -i
, specifying
the blast
identity threshold for assigning two genes to the
same gene cluster. This is by default set to 95, i.e. two sequences must
be 95% identical to be the same gene. This is very strict, and may very
well be relaxed. By using a strict threshold here, we get many, but
rather small, clusters. It is impossible to give a universally valid
threshold for this, but it may very well be lowered, e.g. to 75%, or
even less.
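As a sketch of how that would look, here is the same roary command as above with the threshold lowered to 75% (the output folder name roary_i75 is just a hypothetical example; remember roary refuses to write into an existing folder):

apptainer exec $HOME/module7/roary:3.13.0--pl526h516909a_0.sif roary \
  -p 10 -f roary_i75 -i 75 -e -n -r -v gff/*.gff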
The $out_dir
must not be created before you run
roary
! It has been designed to not overwrite an existing
folder, and insists on creating the $out_dir
itself. This
is to prevent accidental destruction of previous results that you may
have used a lot of resources to obtain. Here we just specify a local
folder named roary/
, verify this is created once the job is
started.
Notice that there is no need for array jobs here! The
roary
takes all GFF files as input, not one by one. There
should be no --array=
in the heading of the shell script
for running roary
.
roary
Make the script, using the code above, and add the proper heading.
Note that you do not use array jobs here, there is nothing to
repeat here. You need only 10GB of memory. Download the container as
usual, and edit the script accordingly. Then sbatch
to run
roary
. With the 17 genomes we have, this should take some
minutes to complete.
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --reservation=BIN310
#SBATCH --account=bin310
#SBATCH --ntasks=10
#SBATCH --mem=10G
#SBATCH --time=01:00:00
#SBATCH --job-name=roary
#SBATCH --output=roary_%j.log
##############
### Settings
###
threads=10
out_dir=roary
input_gff_files=gff/*.gff
###################
### Running roary
##
apptainer exec $HOME/module7/roary:3.13.0--pl526h516909a_0.sif roary \
-p $threads -f $out_dir -e -n -r -v $input_gff_files
Let us make a phylogenetic tree from the core-gene alignment we got
from roary
. This you should find in the file
roary/core_gene_alignment.aln
. Despite the unusual file
extension .aln
this is a fasta file.
As you should know from introductory bioinformatics, there are several algorithms for re-constructing trees. We will not dig into this huge topic here now. Let us build a distance-based tree. This means we first compute all pairwise evolutionary distances between the genomes, and then build a tree from this distance matrix, here using the neighbour joining algorithm (nj()).
First, below we will use the R package ape
and
ggtree
, and these you must install. The ape
is
found on CRAN, but ggtree
is at Bioconductor. Search for it
there, and use the recipe for installing it. Actually, if you installed
this as an exercise in Module 1 you already have it…
Here is how we compute the tree, and plot it in R:
library(tidyverse)
library(ape)
library(ggtree)
genomes.tbl <- read_delim("genomes_table.txt")
msa <- read.FASTA("roary/core_gene_alignment.aln")
dist.mat <- dist.dna(msa, model = "raw")
tree <- nj(dist.mat)
fig <- ggtree(tree, layout = "rect") +
geom_tiplab()
## Warning: The tree contained negative edge length. If you want to ignore the edge, you
## can set 'options(ignore.negative.edge=TRUE)', then re-run ggtree.
print(fig)
Note that we read the multiple alignment using the
read.FASTA()
function from the ape
package
here, not the usual readFasta()
from microseq
.
This results in a completely different format (not a table) in R, but
this time we need this format for the next step. In this next step we
compute the evolutionary distances, and here we use the simple
"raw"
model. Repeat from BIN210 if you do not remember what
this is!
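As a reminder, the "raw" model is simply the p-distance, i.e. the fraction of aligned positions where two sequences differ. A toy illustration with made-up sequences (not taken from the real alignment):

# Two aligned toy sequences of length 5, differing in 2 positions
s1 <- c("A", "C", "G", "T", "T")
s2 <- c("A", "C", "G", "A", "C")
sum(s1 != s2) / length(s1)     # 0.4, the 'raw' (p-) distance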
This first tree was not displayed very well by the
ggtree()
function. But, ggtree
has a huge
array of possibilities, here we have added some of them:
fig <- ggtree(tree, layout = "rect") +
geom_tiplab(size = 4, hjust = -0.1) +
geom_treescale(width = 0.00001) +
ggplot2::xlim(0, 0.0002)
## Warning: The tree contained negative edge length. If you want to ignore the edge, you
## can set 'options(ignore.negative.edge=TRUE)', then re-run ggtree.
print(fig)
The geom_treescale()
adds a ‘ruler’ of the given length
to the plot. We can see how long a distance of 0.00001 is here! This
means the distances between the genomes are extremely small, as we
expected from the fact they all have the same MLST sequence type. Note
that the ggplot2::xlim(0, 0.0002)
we added here just
scales the x-axis. This must be manually tuned each time and is just for
fine-tuning a plot. You should probably omit it the first time you plot,
and then add it, with proper limits, before you make the final version.
Have a look at the size of the largest distances to get an idea about
choosing the limits.
The advantage with ggtree
is the endless possibilities
for creating trees of many types, making use of the Grammar of Graphics
inherent in the ggplot2
that we typically use for plotting
in R. One nice feature is the possibility of ‘attaching’ a table of
meta-data to the tree, and making use of this in the plotting. We have the
genomes.tbl
from above. In this table the first
column contains the exact same texts (genome_id
s) as
the labels in the tree we have made, this is important. Then, we can
attach this table, and use texts in the other columns to color texts or
parts of the tree, etc. Here is how we can use it:
fig <- ggtree(tree, layout = "rect") %<+% genomes.tbl +
geom_tiplab(aes(color = food_production_plant), size = 4, hjust = -0.1) +
geom_tippoint(aes(color = food_production_plant), size = 3) +
geom_treescale(width = 0.00001) +
ggplot2::xlim(0, 0.0002)
## Warning: The tree contained negative edge length. If you want to ignore the edge, you
## can set 'options(ignore.negative.edge=TRUE)', then re-run ggtree.
print(fig)
Note we use the operator %<+%
to attach a table to a
ggtree
, and then we may refer to the columns in this table
in the geoms we use in the following lines. This operator is defined in
the ggtree
package, and will only work in this setting (you
cannot ‘attach’ tables to plots like this in general).
If we inspect the last tree, we notice that even if the differences between genomes are super-small here, there is a tendency that samples from the same food production plant group together, making them even more similar.
Let us inspect some of the other results from roary
in
R. Make a new R script and add:
library(tidyverse)
summary.tbl <- suppressMessages(read_delim("roary/summary_statistics.txt", delim = "\t",
col_names = c("description", "thresholds", "N.clusters")))
head(summary.tbl)
## # A tibble: 5 × 3
## description thresholds N.clusters
## <chr> <chr> <dbl>
## 1 Core genes (99% <= strains <= 100%) 2640
## 2 Soft core genes (95% <= strains < 99%) 0
## 3 Shell genes (15% <= strains < 95%) 543
## 4 Cloud genes (0% <= strains < 15%) 87
## 5 Total genes (0% <= strains <= 100%) 3270
We notice that roary
categorizes the gene clusters into
4 types. The 2640 core genes are found in at least 99% of the genomes,
and since we have only 17 genomes it means in all of them. The ‘Soft
core genes’ are those found in 95%-99%, which is none here since we have
so few genomes. The 543 shell genes are defined as being present in
15%-95% of the genomes, while the 87 cloud genes are found in less
than 15% of the genomes (1 or 2 genomes). In total, the pan-genome
contains 3270 gene clusters. Compare this to the number of coding genes
we find in a single genome (see exercise above).
This is a rather non-typical result, in the sense that we often find many more cloud genes, and correspondingly fewer core genes, relative to the total number of genes. Again, this reflects that the genomes we now look into are extremely similar to each other.
An informative data structure output by roary
is the
pan-matrix. This is simply a matrix with one row for each
genome and one column for each gene cluster, and in cell
[i,j]
the value is 1 if gene cluster j
is
found in genome i
, or 0 if not. Let us read this into R,
add to the script above:
pan.tbl <- read_delim("roary/gene_presence_absence.Rtab", delim = "\t")
## Rows: 3270 Columns: 18
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: "\t"
## chr (1): Gene
## dbl (17): SRR11261970, SRR11261982, SRR11262003, SRR11262009, SRR11262015, S...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
pan.mat <- pan.tbl %>%
select(-Gene) %>%
as.matrix() %>%
t() %>%
magrittr::set_colnames(pan.tbl$Gene)
The output from roary
has the matrix transposed, with
genomes in the columns, which is why we transpose the matrix above
(t()
). Let us now count, for each gene cluster, in how many genomes it is present, and plot the distribution of these counts as a bar chart:
tibble(N.genomes = 1:nrow(pan.mat),
frequency = table(factor(colSums(pan.mat), levels = 1:nrow(pan.mat)))) %>%
ggplot() +
geom_col(aes(x = N.genomes, y = frequency))
This figure actually says a lot about the species we are analyzing. First, to the right, the bar at 17 genomes tells us the number of core genes. Then we have rather few gene clusters found in 16, 15… genomes, but there seems to be a small peak at around 7 genomes. Why is this? This must be due to some clade, consisting of 7 genomes! These 7 genomes share many extra gene clusters not shared by the rest, indicating a group of genomes that are very similar to each other.
Towards the left, we find the cloud genes. In this case there are some gene clusters found in 2 genomes, but none found in only 1. This is unusual; often the largest bar of them all is at 1 genome!
A bar not displayed here is the one at 0 genomes! What is this? Well, this is the number of genes we have not yet seen! We need to be aware of the difference between the population and sample pan-genomes:
In this data set we have sampled 17 genomes, but this is a rather small, and pretty random, sample of genomes from the huge population of all Listeria monocytogenes genomes out there. When stating something about this species, we implicitly also include those genomes, even if our current sample only covers 17 genomes. The bar at 0 genomes is then all the gene clusters found in those other strains, that we have not yet seen. If we continue to add further genomes to the data set, we expect to observe some new cloud genes. Eventually, when we have sampled a huge number of genomes, we may expect this to level off, because then we have seen all the genes that circulate among the strains of this species. Thus, the size of the population pan-genome, which is the only interesting one really, is the total number of gene clusters we would see if we collected all strains of this species. Of course, we cannot collect all strains, but we may estimate how many gene clusters we will see if we do. This is a problem very similar to estimating the number of whales in the ocean, or fish in a lake. This capture-recapture problem may be described as follows: we catch some fish, tag them, and release them again; on later fishing trips we record how many of the caught fish are already tagged. In our setting, each new genome is a ‘fishing trip’ and each gene cluster a ‘fish’.
If the population (of fish or genes) is small, we expect to rather
quickly start catching tagged ones, i.e. those we have already seen. If
however, the population is huge, the probability of re-capturing a
tagged fish/gene is small, and we get mostly new fish/genes each fishing
trip/genome. Thus, from the distribution in the bar chart above, we may
estimate how large the population pan-genome is, i.e. the size of the
bar at 0 genomes (which we must add to those we have observed). We will
not dig into this now, but our R package micropan
contains
some tools for finding such estimates, among other stuff.
It is, by the way, not obvious that the size of a pan-genome is
finite, i.e. that the number of ‘new’ genes will eventually level off…
This is something we may discuss!
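If you want to try such an estimate on the data above, a minimal sketch could look like the following. It assumes the micropan package is installed, and that its chao() and heaps() functions accept a pan-matrix with genomes in rows, like the pan.mat we made above; check the package documentation before relying on this.

library(micropan)
# Chao lower-bound estimate of the population pan-genome size (number of gene clusters)
chao(pan.mat)
# Heaps' law fit: an estimated alpha below 1 suggests an 'open' pan-genome,
# i.e. new gene clusters keep appearing as we sequence more genomes
heaps(pan.mat, n.perm = 100)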
Another way of making a tree here is to simply consider the pan-matrix as MLST data, and compute distances between the genomes based on how many gene clusters they do not share. Obviously, the core genes are shared by all, and contribute nothing to such a distance. The important ones here are the shell genes, those shared by many, but not all, genomes. This is where we can see what is typical/non-typical for various subsets of genomes. The cloud genes are less interesting, since they just add to all distances, like a scaling. Let us compute such distances:
library(tidyverse)
pan.tbl <- read_delim("roary/gene_presence_absence.Rtab", delim = "\t")
pan.mat <- pan.tbl %>%
select(-Gene) %>%
as.matrix() %>%
t() %>%
magrittr::set_colnames(pan.tbl$Gene)
dist.mat <- matrix(0, nrow = nrow(pan.mat), ncol = nrow(pan.mat))
rownames(dist.mat) <- colnames(dist.mat) <- rownames(pan.mat)
for(i in 1:nrow(pan.mat)){
dist.mat[i,] <- apply(pan.mat, 1, function(x){sum(x != pan.mat[i,])})
}
Note how we first create the dist.mat
matrix and fill it
with zeros. Since we want to compare 17 genomes, it must have 17 rows
and columns, and the genome identifiers are natural to have as both row
names and column names. Then we loop over all rows, and use the
apply()
function in R to repeatedly sum up the number of
differences between row number i
and each of the rows in
the matrix. This is exactly how we would have computed the distances
based on the MLST data in genomes_table.txt as well. Select
the columns needed and turn them into a matrix (as.matrix()
),
with one row for each genome, and use that instead of
pan.mat
above.
We can now make a tree as before:
library(ape)
library(ggtree)
tree <- nj(as.dist(dist.mat))
genomes.tbl <- read_delim("genomes_table.txt", delim = "\t")
pan.tree <- ggtree(tree, layout = "rect") %<+% genomes.tbl +
geom_tiplab(aes(color = food_production_plant), size = 3, hjust = -0.1) +
geom_tippoint(aes(color = food_production_plant), size = 3) +
geom_treescale(width = 100) +
ggplot2::xlim(0, 500)
print(pan.tree)
Here we computed manhattan distances between the genomes. This is simply the number of gene clusters that differ in presence/absence (1/0) in the pan-matrix, i.e. the distance between two genomes is how many genes they do not share. Thus, the scale is no longer an evolutionary distance, but a number of genes (gene clusters).
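A more compact way to get the same matrix, if you prefer, is R's built-in dist() function; manhattan distance on a 0/1 matrix is exactly the number of differing entries:

# Same distances as the loop above, computed with dist()
dist.mat2 <- as.matrix(dist(pan.mat, method = "manhattan"))
all(dist.mat2 == dist.mat)   # should be TRUE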
In the roary
output directory you find a tree in the
file accessory_binary_genes.fa.newick
that is very similar
to the one above, except for the scale (instead of absolute number of
genes, they use fraction of genes). You may read this into R using the
read.tree()
function in the ape
package, and
plot with ggtree
as above.
The approaches we saw above, and many more, are all based on alignments in one way or another. Such methods have the potential to re-construct evolution quite accurately, but there are also some possible problems.
Building alignments is time-consuming and scales horribly. What if we have sequenced 1000 genomes instead of 17 as we used above? Then we would need to
- assemble all 1000 genomes
- predict the genes in all 1000 assemblies
- cluster the roughly 3 million predicted genes into gene families, and align the core genes
This would take quite some time, especially the last step. If we then add another 100 genomes to this, the first two steps will only involve the new genomes, but the last step needs to be repeated for the entire 1100 genomes. Also, having so many genomes, we expect the strict core genes, i.e. those that are present in all genomes, to be rather few. This means making a core gene alignment is fast, but the question is if this will give us any useful resolution at all. We must also remember that making a multiple alignment of many sequences may also result in many errors in the alignment. The more sequences we align, the more likely it is that positions have been erroneously aligned.
A typical situation when monitoring the occurrences of new pathogens is that new genomes are sequenced all the time, and each new genome should quickly be compared to all those we have collected before, preferably without re-running the entire analysis.
Since the number of sequenced genomes grows rapidly, new ideas have
emerged for how we may compare them quickly. Let us take a look at
another approach based on a smart use of K-mers. These methods also have
their limitations, but are interesting in the sense that they represent
a completely new approach compared to the classical alignment-based
theory. They also scale extremely well, and may be used even for
enormous data sets. In fact, some of the ideas behind them are inherited
from really Big Data problems, like the internet search engines.
The mash software
Let us compute some MASH-distances, and start with the 17 genomes we used above. Download the newest mash container from galaxy.
The mash software has several sub-commands, but we will focus on sketch and dist.
- With mash sketch we extract K-mers from the genomes and store them as a sketch, i.e. a small file where the K-mers are stored in a clever way.
- With mash dist we compare the sketches, which is then a proxy for comparing the genomes. This is extremely fast compared to comparing the entire genomes.

First, let us build a sketch for each genome. Again we use array jobs since we must repeat the same job for each genome. Here is the central code you may use:
#############
### Settings
###
genomes_table_file=genomes_table.txt
data_dir=$COURSES/BIN310/module7/contigs
sketch_size=1000
K=21
out_dir=$SCRATCH/mash
if [ ! -d $out_dir ]
then
mkdir $out_dir
fi
###############################################
### Using awk to read the genomes_table.txt
###
genome_id_column=$(awk -F"\t" 'NR==1{for(i=1;i<=NF;i++){f[$i] = i}}{print $(f["genome_id"])}' $genomes_table_file)
contigs_file_column=$(awk -F"\t" 'NR==1{for(i=1;i<=NF;i++){f[$i] = i}}{print $(f["contigs_file"])}' $genomes_table_file)
row=$(($SLURM_ARRAY_TASK_ID+1))
genome_id=$(echo $genome_id_column | awk -vi=$row '{print $i}')
contigs_file=$(echo $contigs_file_column | awk -vi=$row '{print $i}')
###############
### Sketching
###
apptainer exec $HOME/module7/mash:2.3--he348c14_1.sif mash sketch \
-o $out_dir/$genome_id -s $sketch_size -k $K $data_dir/$contigs_file
Add the proper header to the script (using array jobs), and
sbatch
. You only need 10GB memory, and use 1 thread. There
are still 17 genomes…
Files with names <genome_id>.msh
should be
produced in your out_dir
, one for each genome. These are
small, binary, files with the hashes from the genome we used as input.
Note how we again use the genome_id
as file name
prefix.
The arguments we give to mash sketch
above are rather
self-explanatory. Both sketch_size
and K
are
explained in the lecture above.
Once we have some sketch files, we can compute distances between the
genomes by mash dist
. Since we now have one sketch file for
each genome, we will run mash dist
for all possible pairs.
Above we used array jobs to repeatedly do similar jobs, we actually created a new job for each genome. An alternative to repeating commands is of course by looping, like we are used to from R or other programming languages.
Could we not always use array jobs? Well, there are cases where looping is the more convenient solution. In this case we will even use a double loop, since we want to compute the distance between every pair of genomes, i.e. loop over all sketches both as reference and as query.
Note that since we want the result of each comparison to be added to one single file, it is safest to do this sequentially, i.e. we add one result at the time to this file. If we told many parallel processes to write to the same file, we may risk problems, since they must then open the same file, possibly over-writing each other. If we used array jobs, we would have to tell each job to write to its own file, and then merge these files afterwards.
Here is the central shell code for this looping and computing the mash distances:
###############
### Settings
###
threads=10
sketch_dir=$SCRATCH/mash
sketch_files=$(ls $sketch_dir | grep .msh)
dist_file=dist.txt
if [ -f $dist_file ]
then
rm $dist_file
fi
############
### Looping
###
for ref in $sketch_files;
do
for query in $sketch_files;
do
echo Distance $ref versus $query
apptainer exec $HOME/module7/mash:2.3--he348c14_1.sif mash dist \
-p $threads $sketch_dir/$ref $sketch_dir/$query >> $dist_file
done
done
Note:
- First we list all the sketch file names and store them in sketch_files. We want to loop over these.
- The results will be collected in the file $dist_file, and we delete it if it exists. This will be created when we start looping.
- The looping variables are named ref and query and take as content the various file names as the loop progresses.
- Each call to mash dist will output a single line of results, and this is appended to the $dist_file by the >> operator.

If we did not delete an existing $dist_file
at the
beginning, the results we produce here would just be added to the
existing content. Whenever you use the >>
operator,
make certain the file you add to is either empty or has the exact
content you assume. It is a common mistake to re-run without deleting,
and the output file contains all kinds of rubbish from previous runs as
well.
Make both shell scripts above (first sketching using array
jobs, then computing distances by looping and without using
array jobs) in your module7
folder, and run them. The
file dist.txt
should appear in this directory, and is a
tab-separated text file. Read this file into a table in R. The columns
you may name c("ref", "query", "distance", "p", "n")
, and
you only need to keep the first 3 columns.
NB! Both ref
and query
will be texts with
the file names that contain the genome_id
. Strip them down
such that they become pure genome_id
texts! Hint: use
basename()
and str_remove()
.
Turn the table into a distance matrix, i.e. where each genome occurs
in both rows and columns, and where the number in cell
[i,j]
contains the mash distance between genome number
i
and j
. Hint: pivot_wider()
.
Then use the distance matrix to compute a tree by nj()
as we have done before, and plot the tree. Is it different from the
previous trees for these genomes?
First the shell script for sketching:
#!/bin/bash
#SBATCH --array=1-17%10 # This is where we specify using array jobs
#SBATCH --nodes=1
#SBATCH --reservation=BIN310
#SBATCH --account=bin310
#SBATCH --ntasks=1
#SBATCH --mem=10G
#SBATCH --time=01:00:00
#SBATCH --job-name=mash_sketch
#SBATCH --output=mash_sketch_%j_%a.log
#############
### Settings
###
genomes_table_file=genomes_table.txt
data_dir=$COURSES/BIN310/module7/contigs
sketch_size=1000
K=21
out_dir=$SCRATCH/mash
if [ ! -d $out_dir ]
then
mkdir $out_dir
fi
###############################################
### Using awk to read the genomes_table.txt
###
genome_id_column=$(awk -F"\t" 'NR==1{for(i=1;i<=NF;i++){f[$i] = i}}{print $(f["genome_id"])}' $genomes_table_file)
contigs_file_column=$(awk -F"\t" 'NR==1{for(i=1;i<=NF;i++){f[$i] = i}}{print $(f["contigs_file"])}' $genomes_table_file)
row=$(($SLURM_ARRAY_TASK_ID+1))
genome_id=$(echo $genome_id_column | awk -vi=$row '{print $i}')
contigs_file=$(echo $contigs_file_column | awk -vi=$row '{print $i}')
###############
### Sketching
###
apptainer exec $HOME/module7/mash:2.3--he348c14_1.sif mash sketch \
-o $out_dir/$genome_id -s $sketch_size -k $K $data_dir/$contigs_file
Then the shell script for computing distances:
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --reservation=BIN310
#SBATCH --account=bin310
#SBATCH --ntasks=10
#SBATCH --mem=10G
#SBATCH --time=01:00:00
#SBATCH --job-name=mash_dist
#SBATCH --output=mash_dist_%j.log
###############
### Settings
###
threads=10
sketch_dir=$SCRATCH/mash
sketch_files=$(ls $sketch_dir | grep .msh)
dist_file=dist.txt
if [ -f $dist_file ]
then
rm $dist_file
fi
############
### Looping
###
for ref in $sketch_files;
do
for query in $sketch_files;
do
echo Distance $ref versus $query
apptainer exec $HOME/module7/mash:2.3--he348c14_1.sif mash dist \
-p $threads $sketch_dir/$ref $sketch_dir/$query >> $dist_file
done
done
Finally, the R script for reading the results and plotting:
library(tidyverse)
library(ape)
library(ggtree)
genomes.tbl <- read_delim("/mnt/courses/BIN310/module7/genomes_table.txt", delim = "\t")
mash.tbl <- read_delim("dist.txt", col_names = c("ref", "query", "distance", "p", "n")) %>%
select(ref, query, distance) %>%
mutate(ref = str_remove(basename(ref), "_contigs.fasta")) %>%
mutate(query = str_remove(basename(query), "_contigs.fasta")) %>%
pivot_wider(names_from = query, values_from = distance)
dist.mat <- mash.tbl %>%
select(-ref) %>%
as.matrix() %>%
magrittr::set_rownames(mash.tbl$ref)
tree <- nj(as.dist(dist.mat))
fig <- ggtree(tree) %<+% genomes.tbl +
geom_tiplab(aes(color = food_production_plant), hjust = -0.05) +
geom_tippoint(aes(color = food_production_plant), size = 3) +
theme(legend.position = "right") +
ggplot2::xlim(0, 0.003)
print(fig)
One potential advantage of mash
distances is that we may
now compare genomes directly from the reads! There is actually no need
for doing the assembly to produce the contigs, the K-mers we get from
the contigs should largely be the same as those we have in the
reads!
Edit the scripts above to use the reads as input. Actually, you only need to edit the sketching; once the sketches are there, the rest is identical.
Have a look at the Help text for mash sketch
to
understand how to change the command line.
There is one thing we must remember when using reads: If we have
sequenced with read coverage \(D\) we
expect K-mers to occur on average \(D_K\) times, see module 3 when we talked
about this. Some K-mers occur very rarely, and these are most likely due
to sequencing error. Thus, when collecting K-mers to our sketches, we
should not collect such K-mers. The option -m
should be
used to specify we only want K-mers occurring a minimum number of times
(at least 2, but higher values may be used).
Make the tree, and compare to what you got before. You could also compare the distances directly, to see if they change much.
Sketching reads instead of genomes:
#!/bin/bash
#SBATCH --array=1-17%10
#SBATCH --nodes=1
#SBATCH --reservation=BIN310
#SBATCH --account=bin310
#SBATCH --ntasks=1
#SBATCH --mem=10G
#SBATCH --time=01:00:00
#SBATCH --job-name=mash_sketch
#SBATCH --output=mash_sketch_%j_%a.log
#############
### Settings
###
genomes_table_file=genomes_table.txt
data_dir=$COURSES/BIN310/module7/fastq
sketch_size=1000
K=21
out_dir=$SCRATCH/mash_reads
if [ ! -d $out_dir ]
then
mkdir $out_dir
fi
##############################################
### Reading genome_id from genomes_table.txt
###
genome_id_column=$(awk -F"\t" 'NR==1{for(i=1;i<=NF;i++){f[$i] = i}}{print $(f["genome_id"])}' $genomes_table_file)
R1_column=$(awk -F"\t" 'NR==1{for(i=1;i<=NF;i++){f[$i] = i}}{print $(f["R1"])}' $genomes_table_file)
R2_column=$(awk -F"\t" 'NR==1{for(i=1;i<=NF;i++){f[$i] = i}}{print $(f["R2"])}' $genomes_table_file)
row=$(($SLURM_ARRAY_TASK_ID+1))
genome_id=$(echo $genome_id_column | awk -vi=$row '{print $i}')
R1=$(echo $R1_column | awk -vi=$row '{print $i}')
R2=$(echo $R2_column | awk -vi=$row '{print $i}')
###############
### Sketching
###
apptainer exec $HOME/module7/mash:2.3--he348c14_1.sif mash sketch \
-o $out_dir/$genome_id -s $sketch_size -k $K -m 5 -r $data_dir/$R1 $data_dir/$R2
The option -m 5
indicates we want K-mers occurring at
least 5 times. Could be set even higher…
Once the sketches are there, the computing of distances is identical to what we did above. The R code for making the tree must be modified slightly: In the code above we extract the genome_id information from the file names by (line 3 and 4):
mash.tbl <- read_delim("dist.txt", col_names = c("ref", "query", "distance", "p", "n")) %>%
select(ref, query, distance) %>%
mutate(ref = str_remove(basename(ref), "_contigs.fasta")) %>% # file name to genome_id
mutate(query = str_remove(basename(query), "_contigs.fasta")) %>% # file name to genome_id
pivot_wider(names_from = query, values_from = distance)
This must now be modified, since the file names have a different ending:
mash.tbl <- read_delim("dist_reads.txt", col_names = c("ref", "query", "distance", "p", "n")) %>%
select(ref, query, distance) %>%
mutate(ref = str_remove(basename(ref), "_1.fastq.gz")) %>% # file name to genome_id
mutate(query = str_remove(basename(query), "_1.fastq.gz")) %>% # file name to genome_id
pivot_wider(names_from = query, values_from = distance)
If you forget this, you will not be able to link the
genomes.tbl
to the ggtree
figure object,
giving a cryptic error message like
Error: Must request at least one colour from a hue palette.
You may also need to fine-tune the xlim
limits to make the
plot nice…
K and sketch_size
The K
in our shell code, or \(k\) in the original
paper, is the length of the K-mers we compare. If we use a too small
\(k\) value, the probability that
K-mers from wildly different genomes are identical by chance becomes
too high. We should choose \(k\) ‘large
enough’ to make the probability of random matches ‘small enough’ for our
purpose.
The probability of a given K-mer to occur in a random genome of size \(n\) is \[q = 1 - \left(1 - |\Sigma|^{-k}\right)^n\] where \(|\Sigma|\) is the size of the alphabet (i.e. 4 for DNA sequences). How is the reasoning here?
Solve the equation above for \(k\), and use \(|\Sigma| = 4\). This gives us a formula where we plug in genome size \(n\) and our selected probability \(q\), and compute a \(k\) value. This should be the smallest \(k\) we can use in order to get the probability \(q\) at least as small as chosen. Compute \(k\) for \(n = 3000000\) and for \(q = 0.01\) and \(q = 0.001\).
n <- 3000000 # genome size
q <- 0.01
sigma.size <- 4
k <- -log2(1-(1-q)^(1/n)) / log2(sigma.size)
cat("We need a k of at least", ceiling(k), "to get the probability of random overlap below", q, "\n")
## We need a k of at least 15 to get the probability of random overlap below 0.01
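For reference, the algebra behind the formula used in the code is just a rearrangement of the probability expression above: \[1-q = \left(1 - |\Sigma|^{-k}\right)^n \;\Rightarrow\; |\Sigma|^{-k} = 1 - (1-q)^{1/n} \;\Rightarrow\; k = \frac{-\log_2\left(1 - (1-q)^{1/n}\right)}{\log_2 |\Sigma|}\]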
The sketch_size
determines how many hashes we store,
i.e. the number of distinct K-mers. A larger value means the sketching,
as well as the distance computations later, will take some longer time.
But, they also become more precise, since we compare more K-mers. How
large sketch size \(s\) should we use?
This depends on the chosen \(k\). The
subcommand mash bounds
gives you some error bounds for a
given confidence level. Run this directly in the Terminal. You can
change the \(k\) value and the
confidence level. It is assumed the sketch size \(s\) is very small compared to the genome
size \(n\), which is ‘always’ the case.
Try to run this for various choices of \(k\) (directly in the Terminal). Notice how
mash distance error bounds change by sketch size \(s\) and K-mer length \(k\).
apptainer exec $HOME/module7/mash:2.3--he348c14_1.sif mash bounds
We notice that the mash
distances become very unreliable
as they get bigger. This means we should use this approach mainly when
comparing rather similar genomes. There is another software,
fastANI
, you may try out if you want more reliable larger
distances. It is in some ways an extension of mash
. Instead
of a distance \(D\) it computes Average
Nucleotide Identity (ANI), which is a common measure of similarity
between genomes. In principle \(ANI =
1-D\).