This time we focus on gene prediction and annotation of predicted genes. We aim to
conda
environments as an
alternative to using container software
The first step in genome annotation is often gene prediction. This is also sometimes referred to as gene finding or gene calling. The latter sounds like genes are easy to find, you can even call out, and they will come running…(like trained dogs?). Dream on…
There are different types of genes in a prokaryotic genome
First, gene prediction is all about the coding genes. Finding the non-coding RNA-genes is almost exclusively done by searching for something that is similar to what we have seen before. Coding genes are more difficult, even if some of these are also quite conserved. They are many, and at least some of them, will vary a lot from what we have seen in other organisms.
There is a tendency to think that gene prediction in prokaryotes is
simple. Indeed, it is simpler than in eukaryotes, in the sense that a
prokaryotic coding gene (in general) has no introns. But, this is by no
means a ‘solved problem’. Given the rate of newly assembled genomes from
NGS sequencing, and that millions of genes are being predicted from
these, even rather small error-rates in the gene prediction step means
we are filling our databases with many rubbish ‘genes’ who are not
really genes, as well as missing many real genes.
prodigal
softwareThere are several software tools for doing prokaryotic gene
prediction. When we used quast
, we could also ask for gene
predictions, and the software GeneMarkS
would have been
used by the quast
software. Perhaps the most popular gene
prediction software these days is prodigal
. This is very
easy to use, and also runs very fast. In addition, the results are in
most cases at least as good as other tools.
Let us make a shell script where we use prodigal
to
predict the coding genes in a complete genome. Download a container from
galaxy in the usual way. Note that you want prodigal
and
not prodigal-gv
, the latter being a special version meant
for viruses.
The genome file we use in this example we will talk more about below, but any file with assembled contigs or completed chromosomes/plasmids may be used:
###################
### Settings
###
genome_file=$COURSES/BIN310/module6/GCF_000006945.2_ASM694v2_genomic.fna
out_dir=$SCRATCH/prodigal
gff_file=$out_dir/prodigal.gff
prot_file=$out_dir/prodigal.ffn
nuc_file=$out_dir/prodigal.faa
if [ ! -d $out_dir ]
then
mkdir $out_dir
fi
#####################
### Running prodigal
###
apptainer exec prodigal:2.6.3--hec16e2b_5.sif prodigal \
-p single -f gff -a $prot_file -d $nuc_file -o $gff_file -i $genome_file
Make a shell script in your module6
folder, add the
proper heading, and sbatch
. This require no more than 10GB
of memory, and only 1 thread, it is still super fast.
Notice the shell code we added above, to test if
$out_dir
is not existing. If this is TRUE, then we create
it, since otherwise prodigal
will crash, just like we saw
for bowtie2
and minimap2
in the previous
module.
The options specify a single genome (-p single
) and
output format should be GFF (-f gff
). We also name a file
to put protein sequences (-a $prot_file
) and their DNA
versions (-d $nuc_file
). Both will be fasta files. If we do
not include these in the command line, they will not be produced, only
the GFF-file in -o $gff_file
.
There are a few more options, look up the online help text for details, but in general this is a very small and simple software.
prodigal
scriptMake the full shell script based on the code above, and save in your
module6
folder. Then sbatch
and inspect the
$out_dir
and verify there are two files with the predicted
genes (DNA and protein versions) and a text file with a GFF-table. We
will need these files below.
#!/bin/bash
#SBATCH --nodes=1 # We always use 1 node
#SBATCH --reservation=BIN310 # For BIN310 course only
#SBATCH --account=bin310 # For BIN310 course only
#SBATCH --ntasks=1 # The number of threads reserved
#SBATCH --mem=10G # The amount of memory reserved
#SBATCH --time=01:00:00 # Runs for maximum this time
#SBATCH --job-name=prodigal # Sensible name for the job
#SBATCH --output=prodigal_%j.log # Logfile output here
#############
### Settings
###
genome_file=$COURSES/BIN310/module6/GCF_000006945.2_ASM694v2_genomic.fna
out_dir=$SCRATCH/prodigal
gff_file=$out_dir/prodigal.gff
prot_file=$out_dir/prodigal.ffn
nuc_file=$out_dir/prodigal.faa
if [ ! -d $out_dir ]
then
mkdir $out_dir
fi
#####################
### Running prodigal
###
apptainer exec prodigal:2.6.3--hec16e2b_5.sif prodigal \
-p single -f gff -a $prot_file -d $nuc_file -o $gff_file -i $genome_file
From prodigal
we got a file with a .gff
file extension. This format, sometimes called General Feature Format,
(GFF) is used a lot to list genes and other genomic features in a
genome. A GFF-file is a tab-delimited text file, with 9 compulsory
columns. See https://m.ensembl.org/info/website/upload/gff3.html
for some description of the format.
In the R package microseq
we have a function
readGFF()
for reading such files:
library(tidyverse)
library(microseq)
prodigal.gff <- readGFF("/mnt/SCRATCH/larssn/prodigal/prodigal.gff") # edit path!
Notice you need to edit the path to read the file from your
$out_dir
. If we inspect the first rows of this table:
head(prodigal.gff)
## # A tibble: 6 × 9
## Seqid Source Type Start End Score Strand Phase Attributes
## <chr> <chr> <chr> <dbl> <dbl> <dbl> <chr> <dbl> <chr>
## 1 NC_003197.2 Prodigal_v2.6.3 CDS 337 2799 365. + 0 ID=1_1;parti…
## 2 NC_003197.2 Prodigal_v2.6.3 CDS 2801 3730 131. + 0 ID=1_2;parti…
## 3 NC_003197.2 Prodigal_v2.6.3 CDS 3734 5020 193. + 0 ID=1_3;parti…
## 4 NC_003197.2 Prodigal_v2.6.3 CDS 5114 5887 129. - 0 ID=1_4;parti…
## 5 NC_003197.2 Prodigal_v2.6.3 CDS 5966 7396 156. - 0 ID=1_5;parti…
## 6 NC_003197.2 Prodigal_v2.6.3 CDS 7665 8618 158 + 0 ID=1_6;parti…
…we see the nine columns, and one row for each predicted gene. The
Seqid
column indicates in which genome sequence the gene is
found. In this case there are 2 genome sequences, the chromosome and a
plasmid. If the genome was a set of contigs, we would have seen many
more. The Start
, End
and Strand
indicates where in the genome sequence the gene is found. Note that if
Strand
is positive, the Start
is the start of
the gene and End
is where it ends, but if
Strand
is negative it is opposite! (for this reason I would
have replaced Start and End with Left and Right, but no one asked me
when this format was decided!)
You can look up in the GFF description (see link above) for
explanations of the Source
, Type
and
Phase
columns. The last column, Attributes
,
contains a lot of potentially useful information, and each token is
separated by a semicolon ;
. Different rows may have
different information in this column. In a GFF file produced by
prodigal
we typically find
partial=00
, or 01
or 10
. With
00
the gene is complete, but if it is 01
the
end is lacking, or 10
the start is lacking. Partial genes
occur when we have many contigs, where genes are near the contig ends.
Naturally, such predicted genes are dubious.conf=99
is a confidence score, i.e. in this case
prodigal
has confidence 99% this is a real gene. We should,
however, not interpret this as a probability…score=...
is a summary of them all. The higher
score
the more it indicates a real gene, according to the
prodigal
logic.prodigal
scoresExtend the R script above by extracting, for each predicted gene, the
This means you add 3 new columns to the gff.tbl
. The
length you find from the Start
and End
columns, while the confidence and score must be extracted from the
Attributes
column texts. Extracting some information from
within some texts is something we need quite often in data wrangling.
This is a skill that separates bioinformaticians from other
biologists…;) Here are some hints:
Inspect the Attributes
column. What is the
pattern we need to look for in order to extract the confidence
information? It looks to me like something along this line:
conf=x.y;
where x.y
is some decimal number.
With the function str_extract()
we can extract a pattern
from texts. The regular expression "conf=.+?;"
should match
what we seek here. This regular expression may be read like this:
"conf="
followed by any symbol
("."
) matched any number of times ("+"
) up
until the first match ("?"
) of the symbol
";"
.To learn more about regular expressions, type ?regexp
in
the RStudio Console window and read the Help file. Regular expressions
is a very powerful tool! The code should now look something like
this:
prodigal.gff <- readGFF("/mnt/SCRATCH/larssn/prodigal/prodigal.gff") %>%
mutate(confidence = str_extract(___, pattern = ___))
Copy this and fill in the missing parts.
Once you have extracted the text matching the pattern above, you need
to remove the parts which is not the numerical value, e.g. if the
extracted text is "conf=81.23;"
you want to strip it down
to just "81.23"
. To achieve this, use
str_remove_all()
and remove everything matching the regular
expression "conf=|;"
. This pattern says:
"conf="
or
";"
. The vertical bar "|"
means
or.The code grows to:
prodigal.gff <- readGFF("/mnt/SCRATCH/larssn/prodigal/prodigal.gff") %>%
mutate(confidence = str_extract(___, pattern = ___)) %>%
mutate(confidence = str_remove_all(___, pattern = ___))
Finally, convert the text to a number by as.numeric()
.
The code grows to:
prodigal.gff <- readGFF("/mnt/SCRATCH/larssn/prodigal/prodigal.gff") %>%
mutate(confidence = str_extract(___, pattern = ___)) %>%
mutate(confidence = str_remove_all(___, pattern = ___)) %>%
mutate(confidence = as.numeric(___))
The score information is found in a similar way, by adding a similar series of mutate statements.
Make a plot where each gene is a point in the coordinate system of length (x-axis) and score (y-axis), and color by its confidence. How are these three quantities (length, score and confidence) related?
library(tidyverse)
library(microseq)
prodigal.gff <- readGFF("/mnt/SCRATCH/larssn/prodigal/prodigal.gff") %>%
mutate(confidence = str_extract(Attributes, pattern = "conf=.+?;")) %>%
mutate(confidence = str_remove_all(confidence, pattern = "conf=|;")) %>%
mutate(confidence = as.numeric(confidence)) %>%
mutate(score = str_extract(Attributes, pattern = ";score=.+?;")) %>%
mutate(score = str_remove_all(score, ";score=|;")) %>%
mutate(score = as.numeric(score)) %>%
mutate(length = abs(Start - End) + 1)
fig <- ggplot(prodigal.gff) +
geom_point(aes(x = length, y = score, color = confidence)) +
scale_x_log10() + scale_y_log10()
print(fig)
## Warning in self$trans$transform(x): NaNs produced
## Warning: Transformation introduced infinite values in continuous y-axis
## Warning: Removed 1 rows containing missing values (`geom_point()`).
It looks like all genes above a certain length (just under 1000) all get high score and confidence. For genes much shorter than 1000, the score varies by something else than length. The score and confidence are highly correlated.
prodigal
predict?The genome that we used as input to prodigal
above is
one of the prokaryotic reference genomes at NCBI, see https://www.ncbi.nlm.nih.gov/genome/browse#!/prokaryotes/refseq_category:reference.
We cite from the description:
“Reference genomes are genome assemblies that are annotated and updated by the assembly submitters and chosen by the RefSeq curatorial staff based on their quality and importance to the community as anchors for the analysis of other genomes in their taxonomic group. Some reference genomes are selected based on a long history of collaboration and wide recognition as a community standard…”
Even if we cannot state these genomes are perfectly annotated, this is probably the closest we get to a gold standard or ‘truth’. In
$COURSES/BIN310/module6/GCF_000006945.2_ASM694v2_genomic.gff
we find the GFF file for the genes in this genome, and now we treat these as the actual genes. Let us read this into R in the same way as above:
reference.gff <- readGFF("/mnt/courses/BIN310/module6/GCF_000006945.2_ASM694v2_genomic.gff") %>%
filter(Type == "CDS")
Notice we filtered the rows, keeping only the coding sequences. This GFF file contains a lot more, but we are only interested in the coding genes now.
In order to compare genes listed in the two table, we
mutate
in a new column named Signature
, like
this:
prodigal.gff <- prodigal.gff %>%
mutate(Signature = str_c(Seqid, Start, End, Strand))
…and do similar for the reference.gff
. If a gene has
identical signature in the two tables, it means prodigal
has predicted the exact correct gene (same genome sequence, same start,
same end and on the same strand). If not, it is incorrect in some
way.
In the lecture above I mentioned the concepts of Precision
and Recall. In our case the total number of actual Positives
(\(P\)) are all the reference genes. We
define as all True Positives (\(TP\))
the cases where prodigal
signatures are identical to the
reference signatures. The False Positives (\(FP\)) are genes predicted by
prodigal
whose signatures does not match any of the
reference signatures. Let us compute the Precision and Recall.
The Recall is \(Recall = TP/P\). The Precision is \(Precision = TP/(TP+FP)\). We compute this as:
P <- nrow(reference.gff)
TP <- sum(prodigal.gff$Signature %in% reference.gff$Signature)
FP <- sum(!(prodigal.gff$Signature %in% reference.gff$Signature))
cat("Precision = ", TP/(TP+FP), "\n")
## Precision = 0.8442428
cat("Recall = ", TP/P, "\n")
## Recall = 0.8590899
This recall means prodigal
finds around 85% of the
actual genes. This precision means that around 85% of the genes that
prodigal
report are correct genes. Thus, there are around
15% errors both ways. We see that the reference has around 4500 genes,
and 15% of this is 675 genes!
Make the R script reading both the prodigal and the refrecne GFF-file, and compute Precision and Recall, use the code snippets above.
Perhaps we were too strict in our analysis above? As I mentioned in
the lecture, the most difficult part of gene prediction is finding the
exact start of the gene. But in some applications this is not critically
important as long as we have found ‘most of the gene’. If we modify the
Signature
to ignore the start, how do the results
improve?
Hint: You cannot simply remove the Start
from the
computing of the Signature
, since the actual start of the
gene is the End
coordinate when the gene is on the negative
Strand
! Thus, you must remove the Start
for
genes on the positive and End
for genes on the negative
strand. In this case you need something like
prodigal.gff <- prodigal.gff %>%
mutate(Signature_modified = if_else(___, str_c(Seqid, End, Strand), str_c(Seqid, Start, Strand))
and similar for the reference.
Make these modifications to the script from above, and re-compute Precision and Recall. Does it change much?
library(tidyverse)
library(microseq)
### Read prodigal GFF-file
prodigal.gff <- readGFF("/mnt/SCRATCH/larssn/prodigal/prodigal.gff") %>%
mutate(confidence = str_extract(Attributes, pattern = "conf=.+?;")) %>%
mutate(confidence = str_remove_all(confidence, pattern = "conf=|;")) %>%
mutate(confidence = as.numeric(confidence)) %>%
mutate(score = str_extract(Attributes, pattern = ";score=.+?;")) %>%
mutate(score = str_remove_all(score, ";score=|;")) %>%
mutate(score = as.numeric(score)) %>%
mutate(length = abs(Start - End) + 1)
### Read reference GFF-file
reference.gff <- readGFF("/mnt/courses/BIN310/module6/GCF_000006945.2_ASM694v2_genomic.gff") %>%
filter(Type == "CDS")
### Add signatures
reference.gff <- reference.gff %>%
mutate(Signature = str_c(Seqid, Start, End, Strand)) %>%
mutate(Signature_modified = if_else(Strand == "+", str_c(Seqid, End, Strand), str_c(Seqid, Start, Strand)))
prodigal.gff <- prodigal.gff %>%
mutate(Signature = str_c(Seqid, Start, End, Strand)) %>%
mutate(Signature_modified = if_else(Strand == "+", str_c(Seqid, End, Strand), str_c(Seqid, Start, Strand)))
### Precision and recall
cat("With full signature:\n")
P <- nrow(reference.gff)
TP <- sum(prodigal.gff$Signature %in% reference.gff$Signature)
FP <- sum(!(prodigal.gff$Signature %in% reference.gff$Signature))
cat("Precision = ", TP/(TP+FP), "\n")
cat("Recall = ", TP/P, "\n")
cat("With modified signatures:\n")
P <- nrow(reference.gff)
TP <- sum(prodigal.gff$Signature_modified %in% reference.gff$Signature_modified)
FP <- sum(!(prodigal.gff$Signature_modified %in% reference.gff$Signature_modified))
cat("Precision = ", TP/(TP+FP), "\n")
cat("Recall = ", TP/P, "\n")
If the Strand
is positive (equals "+"
) then
the Signature should be made up of Seqid
, End
and Strand
, but otherwise it should be made up of
Seqid
, Start
and Strand
. Note the
double equal signs ==
when comparing if something is equal
to something else.
Both Precision and Recall are now quite high. This illustrates the fact that finding the exact start is a bottleneck in prokaryotic gene finding.
Keep using the modified signature from the previous exercise.
Let us then filter out predicted genes from the prodigal
predictions having score
below some threshold, e.g. 10, and
re-compute. Does it improve Precision and/or Recall? Experiment with
different thresholds (20, 30, 40), and see if you can improve the
results by being stricter on the score threshold.
library(tidyverse)
library(microseq)
score.threshold <- 10 # edit this to other values
prodigal.gff <- readGFF("/mnt/SCRATCH/larssn/prodigal/prodigal.gff") %>%
mutate(score = str_extract(Attributes, pattern = ";score=.+?;")) %>%
mutate(score = as.numeric(str_remove_all(score, ";score=|;"))) %>%
filter(score >= score.threshold) %>%
mutate(Signature_modified = if_else(Strand == "+", str_c(Seqid, End, Strand), str_c(Seqid, Start, Strand)))
reference.gff <- readGFF("/mnt/courses/BIN310/module6/GCF_000006945.2_ASM694v2_genomic.gff") %>%
filter(Type == "CDS") %>%
mutate(Signature_modified = if_else(Strand == "+", str_c(Seqid, End, Strand), str_c(Seqid, Start, Strand)))
cat("Modified signature, score threshold", score.threshold, ":\n")
P <- nrow(reference.gff)
TP <- sum(prodigal.gff$Signature_modified %in% reference.gff$Signature_modified)
FP <- sum(!(prodigal.gff$Signature_modified %in% reference.gff$Signature_modified))
cat("Precision = ", TP/(TP+FP), "\n")
cat("Recall = ", TP/P, "\n")
By increasing the score threshold we can make the Precision very high, but we soon start to lose Recall, i.e. we miss some real genes by being too strict on the score.
Finding the gene candidates (e.g. with prodigal
) is the
first step in an annotation of a genome, but perhaps the most
important part is the following recognition of function. This
is usually what people mean when they talk about annotation.
First, we must be aware that function is here used in a very wide an perhaps inexact way. The actual function of some gene is in general very hard to nail down to some short description. It depends on the context, and we also know that in some cases a destruction of a gene may not alter the function at all, because there may be ‘backup solutions’ in the genome. Thus, when we talk about function here now, we refer to categorizing genes into more or less meaningful groups we have constructed based on current knowledge. Such categories may be
Often these pieces of information are linked, i.e. if you have the Pfam accession number, you also have a related GO accession, KO-accession etc.
The entire process means we
We will not discover new functions this way, just recognize what we have seen before. It is quite obvious that the results of this depends on how good the databases are! For this reason, there are several online annotation solutions (e.g. RAST), i.e. you simply upload your genome or gene predictions to some website, and your receive the results at some later time. This may be a good solution in many cases, we simply trust this service to have up-to-date databases and good search tools.
Since the uploading to a website requires no skills, we will not
pursue this here. Instead, we have a look at how we could do annotations
locally. This is perhaps most relevant in cases we have sequenced many
genomes, and want to automate the annotation process in some
way.
prokka
The prokka
software is a quick and simple-to-use
software for annotating a genome. It is not a comprehensive scan for all
information, but can still be useful in many cases. The
prokka
is implemented as a pipeline using several other
tools in an automated way, e.g. it uses prodigal
for gene
prediction.
The input to prokka
is a fasta-file with genomic DNA.
This can be a genome, as contigs or a whole genomic sequence. We will
run prokka
on the same genome we used as input to
prodigal
above, to see how it turns out. The
prokka
will then first use prodigal
to predict
the coding genes, and some other tools for finding rRNA, tRNA and other
non-coding genes. Then, it will use other tools again to start searching
with the coding genes against some small databases of known genes
commonly found in prokaryotes. Thus, the annotation is rather shallow,
and we do not expect to find good functional description for all coding
genes, but it is a nice first start. And prokka
runs
extremely fast compared to more thorough searches. If you only need a
“shallow” annotation of many genomes, this is a nice tool.
Before we start using prokka
, let us first take some
minutes to talk about the Help texts we get displayed from the command
line. We looked into this in an earlier module, here we go through this
again for prokka
. Download the newest
apptainer
container from galaxy, and store in your
module6
folder. Open you Terminal and run
apptainer exec <prokka-container> prokka -h
Replace <prokka-container>
with your container
name. Here is the first lines of the Help text you should see
displayed:
Option h is ambiguous (help, hmms)
Name:
Prokka 1.14.6 by Torsten Seemann <torsten.seemann@gmail.com>
Synopsis:
rapid bacterial genome annotation
Usage:
prokka [options] <contigs.fasta>
General:
--help This help
--version Print version and exit
--citation Print citation for referencing Prokka
--quiet No screen output (default OFF)
--debug Debug mode: keep all temporary files (default OFF)
We notice that
prokka
actually invoked
prokka
! This is not obvious. We have seen earlier that to
start spades
you type spades.py
,
quast
starts by quast.py
etc . The name of the
software is not always the exact name of the command we use! We usually
try the name as a command first, and look up the software at GitHub for
more help if this fails. Or just google for the manual…-h
was ambiguous! It could mean both
help
and hmms
here, both are valid options to
prokka
. But, the Help text was displayed anyway… (the
proper Help-option is --help
as we can see above).Usage:
. Always look for this first! This
describes exactly how the command line should look like. It should here
contain
prokka
.[options]
).<contigs.fasta>
by your filename (genome, contigs, or
any fasta file with DNA).Always stick to the pattern described in Usage
. The
prokka
is actually quite liberal in terms of input order,
but in general command line tools require inputs in the exact
order specified under Usage
!
The options are typically of two types:
--force
option we will use (see
below). Such options require no further input, you simply add the option
or you don’t.--prefix
option we will use below. Such options need
some additional input to be meaningful, i.e. it must be followed by a
text or a number. In the Help-text you see that the options followed by
[X]
or [N]
need some additional input.The prokka
is a little bit special regarding the output.
Many tools require us to create the output folder before we
run, and some tools will create it for us if it is lacking. The
prokka
is opposite! It will not write to an
existing folder, in fear of overwriting something. However, by turning
on the --force
it will overwrite. If you do not use
--force
, the output folder must be non-existing, and
prokka
will create it.
May the --force
be with you…
prokka
You should now try to make use of prokka
on your own,
based on Help text. Make a shell script in your module6
folder and
prodigal
above as
input.$SCRATCH/prokka
.--prefix
option to name the output files.--force
option.It should only take some minutes to complete, but this varies a lot
between nodes. You can inspect the .log
file to see how
things are going. The prokka
produces an extensive log
during its work.
It is a good idea to copy a script from previous, and try to change it in order to run this software. But, never assume options have the same names, or that input should be specified in the same way as for some other software! You have to read the Help for this software!
Here is how the script could look like:
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --reservation=BIN310
#SBATCH --account=bin310
#SBATCH --ntasks=10
#SBATCH --mem=30G
#SBATCH --time=01:00:00
#SBATCH --job-name=prokka
#SBATCH --output=prokka_%j.log
##############
### Settings
###
threads=10
genome_file=$COURSES/BIN310/module6/GCF_000006945.2_ASM694v2_genomic.fna
out_dir=$SCRATCH/prokka
prefix=prokka
###################
### Running prokka
###
apptainer exec $HOME/module6/prokka:1.14.6--pl5321hdfd78af_5.sif prokka \
--force \
--cpus $threads \
--outdir $out_dir \
--prefix $prefix \
$genome_file
prokka
outputOnce prokka
is completed, there are several files in the
$out_dir
folder, all with the prefix you specified. We will
only look at some of these.
The .fna
file is simply a copy of the input, the genome.
The predicted genes are in the .ffn
file, while the
.faa
file contains the proteins we get from translating the
predicted coding genes. All these are fasta-files.
The files .tsv
and .gff
are tables with the
annotation information for all genes. The .tsv
is in
principle a simple tab-separated table, but the number of tabs on each
row varies, making it a mess if you try to read it! The
.gff
should adhere to the General Feature Format (GFF),
that we saw was an output from prodigal
as well. This
.gff
file contains more than what we saw for
prodigal
, since it includes more than just coding genes.
Note also that the GFF-file from prokka
has some additional
content! After the GFF-table itself, there is a ##FASTA
line, followed by lines of fasta-formatted sequence. This is actually a
copy of the genome that was annotated, and that you find in the
.fna
file.
Since the .gff
file is the most comprehensive one, we
focus on this when we read the results into R.
Make an R-script that reads the .gff
file from
prokka
and put its content into a table. Use the
readGFF()
function from our microseq
package
again. Then, the ##FASTA
content of the file will be added
as an attribute (named FASTA
) to the object we get
in R.
If you inspect the resulting .gff
table in R, the
predicted coding genes should be the same we got from
prodigal
earlier. In addition, we now also get listed some
RNA-genes and possibly some other genomic features.
The Attributes
column contains a number of
semicolon-separated information. We would like to collect the
product information for each gene, which is a text description
of what kind of gene it is. This should appear as
product=<some text>
as the last piece of information
in most rows. Make a new column named Product
and put this
text (minus "product="
) into this column. Hint: Use
str_extract()
to extract the product-information from each
row of Attributes
, then str_remove()
to remove
the product=
part from this extracted text. We did
something similar when handling prodigal
results above.
Note that the product information does not end with a semicolon…(since
it is the last information in the Attributes
texts)
prokka
list?Type
is
CDS
), what are the most common product descriptions?Genes described as "hypothetical protein"
are those
without any annotation, i.e. they have not been recognized as some known
gene. These are the best candidates for being false positive
predictions. What is the fraction of "hypothetical protein"
genes among the coding genes?
library(tidyverse)
library(microseq)
prokka.tbl <- readGFF("/mnt/SCRATCH/larssn/prokka/prokka.gff") %>% # edit path!
mutate(Product = str_extract(Attributes, "product=.+")) %>%
mutate(Product = str_remove(Product, "product="))
cat("prokka lists", nrow(prokka.tbl), "genes in this genome\n")
cat("prokka found", length(unique(prokka.tbl$Product)), "unique genes\n")
cds.tbl <- prokka.tbl %>%
filter(Type == "CDS") %>%
group_by(Product) %>%
summarise(N = n()) %>%
arrange(desc(N))
head(cds.tbl)
cat("Fraction of un-annotated:", cds.tbl$N[1] / sum(cds.tbl$N), "\n")
As we can see, quite many genes are left un-annotated. This may be
because they are false positive gene predictions, but more likely
because prokka
do not search wide enough to recognize all
functions.
conda
instead of containersBefore we proceed, we will stop and see an alternative way of installing software on an HPC. We will need this for our next software tool.
We have seen how we may use apptainer
containers to
avoid installing software and to be able to use all kinds of versions of
software. But, there are cases where this solution is not the best
choice, or even possible. We now want to use a software
DRAM
in the next section. This is a tool for
annotation of genes in a genome, much like prokka
above, but more extensive. This software comes with a huge collection of
databases. You will find containers for dram
at galaxy, or
other places, but these containers do not contain the massive databases.
These must be downloaded separately. In principle, this should not be a
problem, we could still use container software in such cases, but the
DRAM
software is a little special as we will see.
It turns out that using DRAM
is easier if we install it
as a conda environment on orion. This is always an option, and
most of (all?) the other software tools we have used in BIN310 could
also have been installed in this way. We must therefore be familiar with
this way of working, and take this opportunity to look closer at how we
can do this on orion or any other HPC.
You may be familiar with conda
from before, and relate
this to python coding. But, conda
may be used to install
all kinds of software as long as it has a conda
installation option. When we install a software with conda
it is installed inside an environment, which is in a folder
under your $HOME
. This is similar to a container in the
sense that all dependencies required by the software are also installed
inside the same environment. When we use the software we first
activate this environment, then use it, and then
deactivate it. This means you may have many different versions
of the same software, but in different environments without conflicting
with each other. Instead of having a container file, you now get an
environment folder under you $HOME
.
conda
Open a Terminal window and maneuver to your $HOME
folder. First, we need to load the Miniconda3
software:
module load Miniconda3/4.9.2
This contains, among other things, the command conda
.
Verify it has been loaded by the command
conda -h
and you should get some help text on the screen.
In most cases we will install a software package using a variant of this command
conda create -c <channel> --name <environment-name> <package>
First, we use conda create
to create a new
environment. You may also install a (new) package into an
existing environment, that you have created some time in the past. You
would then use conda install
instead. You may install
several packages into the same environment as long as they do not
conflict, but in general we make a new environment for each tool.
The first time you run conda create
a folder named
.conda
wil be created under your $HOME
. All
new environments will now be found in the subfolder
.conda/envs/
. This folder will get the name you specify in
<environment-name>
. In most cases we use the same
name as the package (software) we install, but you may actually decide
the environment name.
Next, the -c <channel>
specifies the
channel we should use for downloading. The conda
can install from various channels/repositories on the internet. The
channel bioconda
is very relevant for our use, see the bioconda webpage for
more information information about this. There are also a number of
other channels you may collect software from
Finally, we specify the name of the package we want to install.
prodigal
as a conda
environmentFor practice, let us install the software prodigal
, that
we used above, as a conda
environment.
First, we need to know of a channel from which we can
download the software. There are many such conda channels around. One of
them is bioconda, containing a lot of software used in biology.
verify there is a package named prodigal
at the bioconda
website. There is a small search-box in the left margin. Use
this.
Then, use the template shown above, and install this to an
environment you name genefinders
. We most typically name
the environment the same as the software, but here I make an exception,
just to highlight that the environment is not the same as the
software itself.
Ignore any warning about a newer version of conda
.
Answer yes (y
) when asked if you want to proceed with the
installation.
After completed installation inspect the folder
.conda/envs/
under you $HOME
and verify the
new environment has been created.
Next, we need to activate the new environment by
conda activate genefinders
. But, this will typically give
an error (CommandNotFoundError) the first time you try. In order to make
this work, you must first run the following code in your Terminal:
eval "$(conda shell.bash hook)"
This will link conda
to your current shell. Then we
activate the environment by
conda activate genefinders
Notice how the command prompt changes when you activate the
environment! We can now run prodigal -h
to see if this
starts prodigal
as it should.
Finally, deactivate it again with conda deactivate
.
A short video showing how to do the steps above
DRAM
Install DRAM
by conda
. Instead of using the
standard command line seen above, use the procedure outlined at the GitHub
site of the software.
Miniconda3
in the Terminal as shown
above..yaml
file using wget
…conda
as described in the
DRAM
website. Name the environment DRAM
as
suggested.It is quite typical that once we have the name of some software, we search for its GitHub site. Most tools have this. Then, look for descriptions on how to install.
DRAM
databaseAs mentioned above the DRAM
requires a collection of
databases. This has already been installed on orion at
/mnt/databases/DRAM_1.4.6/
, and we should make use of this.
To make the DRAM
software aware of this we need to supply
DRAM
with a configuration file with this
information. This is not a common solution, and the reason this tool
works best from a conda
installation. Other tools making
use of an external database typically have an option
(e.g. -db
) where we the name the path to the database.
Read the text file /mnt/databases/DRAM_1.4.6/README.md
(use cat
). Here you find the description of how you
import the configuration file into your environment. Follow
this.
Then deactivate the environment and your DRAM
environment should now be ready for use!
DRAM
We will now use the software DRAM
that we installed as a
conda
environment above, to do an automated annotation of a
genome. This tool was originally made for an automated annotation of
several genomes re-constructed from metagenomes, what we denote MAGs
(Metagenome Assembled Genomes). We will, however, use it here for a
single genome, just to get to know it.
Here is a suggested code for using DRAM
on orion:
######################
### Settings
###
genome_file=$COURSES/BIN310/module6/GCF_000006945.2_ASM694v2_genomic.fna
out_folder=$SCRATCH/dram
threads=10
#######################
### Loading Miniconda
###
module load Miniconda3/4.9.2
eval "$(conda shell.bash hook)"
######################
### Running DRAM
###
conda activate DRAM
DRAM.py annotate \
--input_fasta $genome_file \
--output_dir $out_folder \
--threads 10
DRAM.py distill \
--input_file $out_folder/annotations.tsv \
--trna_path $out_folder/trnas.tsv \
--rrna_path $out_folder/rrnas.tsv \
--output_dir $out_folder/distilled
conda deactivate
Notice how we first have to load Miniconda3
. Then
follows the command eval "$(conda shell.bash hook)"
we saw
above, to configure the shell for activating conda environments. This
applies to all conda environments on orion, not just DRAM
.
Try to run just conda shell.bash hook
directly in the
Terminal (after loading Miniconda3
) and you will see this
creates a file. Then eval
of this will execute it. We will
not dig more into this, just accept it as a necessary command for using
conda
in our shell scripts. Remember that each time you
sbatch
and start a new job in SLURM, it will work inside a
new shell. Thus, this statement must always be included in
shell scripts on orion where you need to activate some
conda
environment.
Next we activate the DRAM
environment, and then we can
run the DRAM
software.
First we use DRAM.py annotate
with some required
input/output, to get the basic annotations for the genome. This will
result in a number of files in the folder we specify in
$out_folder
.
Next, we use the DRAM.py distill
to summarise what we
found in the previous step. This is mostly a benefit when you annotate
many genomes together, but we still use it here, and will have a look at
the output.
Note that we specify as input to distill
some files
produced in the annotate
step above. There is no guarantee
all these files were actually created! For instance, the file
rrna.tsv
is a small table listing the ribosomal RNA genes
found in the genome. Should not all genomes have these, you may ask?
Yes, they should, but experience tells us these may be badly assembled
(why?) and may sometimes not be detected at all in the contigs we
produce. Thus, you should actually run only the annotate
step first, inspect the $out_folder
and then run the
distill
step. If the rrna.tsv
file is missing,
you may still run the distill
, just omit the
--rrna_path
option altogether.
Finally, we deactivate the conda
environment again.
Add the proper heading to this, and sbatch
. Use 50GB
memory, which should be more than enough even if you annotate many
genomes.
annotate
outputLet us have a look at the content of the folder specified in
$out_folder
above.
Also, look up the scientific paper describing DRAM
. You find it here. Focus on the section called
DRAM annotation overview.
Try to answer these questions, by reading the paper and/or inspect the output (read output files into R):
DRAM
uses a gene finding tool. Which
tool?DRAM
? Do you know
anything about what these databases contain?MMseqs2
. What is this used
for in DRAM
?--use_uniref
. What would be the effect of adding this to
our command line?annotations.tsv
has one row for each gene. Read this into
R, extension .tsv
means tab-separated text file.annotations.tsv
has a column named
rank
. What is this? Hint: See paper.rank
column to a factor with levels A to
E, and then use table()
on it. Based on what the paper
says, what may we infer from this?DRAM
is using prodigal
, but
insists on using the anonymous/metagenome mode. This may result in
slightly different results than we got when we used standard
prodigal
above. If you want to check it out, read the file
genes.gff
in the $out_folder
and compare to
what we got before…DRAM
uses KOfam,
which is a HMM-based database built from KEGG.MMseqs2
is used for various sequence searches.
Think of it as an alternative to BLAST, that we saw in an earlier
module. It is a more general tool than BLAST, and can do many different
kinds of searches.--use_uniref
option, i.e. this is not by default
included. The reason is that is slows the search considerably. When we
used DRAM
above, it took some minutes to annotate this
single genome. With --use_uniref
option included, expect it
to take many hours (even days)…rank
column ranks the genes by how “well” they are
annotated. Rank A requires reciprocal best hit (RBH) to a KEGG protein,
and since we have no KEGG installed on orion we will never see this. The
B category requires RBH to a UniRef90 protein, and this of course
requires the use of --use_uniref
option that we saw above.
Categories C and D are OK annotations. Category E means a gene is
predicted, but has no hits to any of the current databases. Such genes
may not be genes at all, or something entirely new we have no function
for yet.library(tidyverse)
### The raw annotations
annotations.tbl <- suppressMessages(read_delim("/mnt/SCRATCH/larssn/dram/annotations.tsv"))
cat("DRAM annotated:", nrow(annotations.tbl), "genes\n")
annotations.tbl <- annotations.tbl %>%
mutate(rank = factor(rank, levels = c("A", "B", "C", "D", "E")))
print(table(annotations.tbl$rank))
Since we have no genes of rank A and B, we can conclude the KEGG and UniRef90 databases were not really searched. Either that or this genome is really poorly assembled, or is from some other planet(!). No real earthly genome will have 0 genes found in either KEGG or UniRef90!
We happen to know the DRAM
database on orion lacks KEGG,
and that we did not search UniRef90 since we omitted the
--use_uniref
option. The latter you may add, but this will
slow down computations a lot, and require much more memory. We will not
do this in BIN310…
distilled
outputIn the folder $out_folder/distilled/
you should also
have some output from DRAM
. Let us focus on the excel-file
named metabolism_summary.xlsx
.
To read an excel-file into R, install the R package
readxl
. You can then read the file by
library(readxl)
tbl <- read_excel("/mnt/SCRATCH/larssn/dram/distilled/metabolism_summary.xlsx") # edit path
Excel-files may have multiple sheets, and this is also the
case here. The code above will read the first sheet only. By adding
sheet = "sheet_name"
, where "sheet_name"
is a
text with the name of a sheet, you specify which sheet to read.
To see the sheet names included in an excel-file, use
library(readxl)
sheets <- excel_sheets("/mnt/SCRATCH/larssn/dram/distilled/metabolism_summary.xlsx") # edit path
The object sheets
will now contain the names of the
various sheets.
Read the sheet for carbon utilization. You should get a table, and the last column lists the number of genes from each category found in the current genome. How many carbon utilization genes were found in this genome?
Perhaps you could also find where on the genome these genes are located? More specifically, we would be interested in operons, i.e. regions where several such genes are located next to each other. Do you find any such cases? Hint: The annotations table from above has the information about position and strand, you just need to filter out the carbon utilization genes in this table.
library(tidyverse)
library(readxl)
### The carbon utilization genes
sheets <- excel_sheets("/mnt/SCRATCH/larssn/dram/distills/metabolism_summary.xlsx")
carb_util.tbl <- read_excel("/mnt/SCRATCH/larssn/dram/distills/metabolism_summary.xlsx", sheet = sheets[2])
cat("DRAM found", sum(as.numeric(carb_util.tbl$GCF_000006945.2_ASM694v2_genomic)), "carbon utilization genes\n")
### The location of the carb_util genes
### First, we filter the table, keeping only the genes with 1 copy or more
carb_util.tbl <- carb_util.tbl %>%
filter(GCF_000006945.2_ASM694v2_genomic != "0")
### Next, we need to filter the annotations.tbl from the previous exercise,
### having the location information,
### keeping only the carb_util genes.
### The gene_id in the carb_util.tbl may match
### ko_id, peptidase_id or cazy_ids in the annotations.tbl
annot_carb_util.tbl <- annotations.tbl %>%
filter(ko_id %in% carb_util.tbl$gene_id |
peptidase_id %in% carb_util.tbl$gene_id |
cazy_ids %in% carb_util.tbl$gene_id)
### To find operons we can now inspect this table, and use the columns gene_position and strandedness.
### We immediately find that genes with gene_position 149-150-151 are all on the positive strand (standeness 1), a potential operon
View(annot_carb_util.tbl)