1 Learning goals

This time we focus on gene prediction and annotation of predicted genes. We aim to

Get the basic insight into how we may find coding genes in prokaryotes
Predict genes in genomes using some standard software
Make automated annotations by using tools searching with gene sequences in databases
Learn how to install conda environments as an alternative to using container software

1.1 Software used in this module

2 Gene prediction

The first step in genome annotation is often gene prediction. This is also sometimes referred to as gene finding or gene calling. The latter sounds like genes are easy to find, you can even call out, and they will come running…(like trained dogs?). Dream on…

There are different types of genes in a prokaryotic genome

Coding genes. These are the majority, and usually what we think of when we talk about genes. These are transcribed and then translated into proteins.
Non-coding genes. These are typically RNA-genes (rRNA and tRNA) used during the translation of the coding genes. These are fewer, and also quite conserved across all species, and therefore usually possible to recognize based on what we have seen before in other genomes.

First, gene prediction is all about the coding genes. Finding the non-coding RNA-genes is almost exclusively done by searching for something that is similar to what we have seen before. Coding genes are more difficult, even if some of these are also quite conserved. They are many, and at least some of them, will vary a lot from what we have seen in other organisms.

There is a tendency to think that gene prediction in prokaryotes is simple. Indeed, it is simpler than in eukaryotes, in the sense that a prokaryotic coding gene (in general) has no introns. But, this is by no means a ‘solved problem’. Given the rate of newly assembled genomes from NGS sequencing, and that millions of genes are being predicted from these, even rather small error-rates in the gene prediction step means we are filling our databases with many rubbish ‘genes’ who are not really genes, as well as missing many real genes.

2.1 Lecture

Here is a video lecture on gene prediction.

2.2 The `prodigal` software

There are several software tools for doing prokaryotic gene prediction. When we used quast, we could also ask for gene predictions, and the software GeneMarkS would have been used by the quast software. Perhaps the most popular gene prediction software these days is prodigal. This is very easy to use, and also runs very fast. In addition, the results are in most cases at least as good as other tools.

Let us make a shell script where we use prodigal to predict the coding genes in a complete genome. Download a container from galaxy in the usual way. Note that you want prodigal and not prodigal-gv, the latter being a special version meant for viruses.

The genome file we use in this example we will talk more about below, but any file with assembled contigs or completed chromosomes/plasmids may be used:

###################
### Settings
###
genome_file=$COURSES/BIN310/module6/GCF_000006945.2_ASM694v2_genomic.fna
out_dir=$SCRATCH/prodigal
gff_file=$out_dir/prodigal.gff
prot_file=$out_dir/prodigal.ffn
nuc_file=$out_dir/prodigal.faa
if [ ! -d $out_dir ]
then
  mkdir $out_dir
fi

#####################
### Running prodigal
###
apptainer exec prodigal:2.6.3--hec16e2b_5.sif prodigal \
-p single -f gff -a $prot_file -d $nuc_file -o $gff_file -i $genome_file

Make a shell script in your module6 folder, add the proper heading, and sbatch. This require no more than 10GB of memory, and only 1 thread, it is still super fast.

Notice the shell code we added above, to test if $out_dir is not existing. If this is TRUE, then we create it, since otherwise prodigal will crash, just like we saw for bowtie2 and minimap2 in the previous module.

The options specify a single genome (-p single) and output format should be GFF (-f gff). We also name a file to put protein sequences (-a $prot_file) and their DNA versions (-d $nuc_file). Both will be fasta files. If we do not include these in the command line, they will not be produced, only the GFF-file in -o $gff_file.

There are a few more options, look up the online help text for details, but in general this is a very small and simple software.

2.2.1 Exercise - make the `prodigal` script

Make the full shell script based on the code above, and save in your module6 folder. Then sbatch and inspect the $out_dir and verify there are two files with the predicted genes (DNA and protein versions) and a text file with a GFF-table. We will need these files below.

2.2.2 Exercise solution

#!/bin/bash

#SBATCH --nodes=1                         # We always use 1 node
#SBATCH --reservation=BIN310              # For BIN310 course only
#SBATCH --account=bin310                  # For BIN310 course only
#SBATCH --ntasks=1                        # The number of threads reserved
#SBATCH --mem=10G                         # The amount of memory reserved
#SBATCH --time=01:00:00                   # Runs for maximum this time
#SBATCH --job-name=prodigal               # Sensible name for the job
#SBATCH --output=prodigal_%j.log          # Logfile output here

#############
### Settings
###
genome_file=$COURSES/BIN310/module6/GCF_000006945.2_ASM694v2_genomic.fna
out_dir=$SCRATCH/prodigal
gff_file=$out_dir/prodigal.gff
prot_file=$out_dir/prodigal.ffn
nuc_file=$out_dir/prodigal.faa
if [ ! -d $out_dir ]
then
  mkdir $out_dir
fi

#####################
### Running prodigal
###
apptainer exec prodigal:2.6.3--hec16e2b_5.sif prodigal \
-p single -f gff -a $prot_file -d $nuc_file -o $gff_file -i $genome_file

2.3 The Generic Feature Format

From prodigal we got a file with a .gff file extension. This format, sometimes called General Feature Format, (GFF) is used a lot to list genes and other genomic features in a genome. A GFF-file is a tab-delimited text file, with 9 compulsory columns. See https://m.ensembl.org/info/website/upload/gff3.html for some description of the format.

In the R package microseq we have a function readGFF() for reading such files:

library(tidyverse)
library(microseq)
 
prodigal.gff <- readGFF("/mnt/SCRATCH/larssn/prodigal/prodigal.gff")  # edit path!

Notice you need to edit the path to read the file from your $out_dir. If we inspect the first rows of this table:

head(prodigal.gff)

## # A tibble: 6 × 9
##   Seqid       Source          Type  Start   End Score Strand Phase Attributes   
##   <chr>       <chr>           <chr> <dbl> <dbl> <dbl> <chr>  <dbl> <chr>        
## 1 NC_003197.2 Prodigal_v2.6.3 CDS     337  2799  365. +          0 ID=1_1;parti…
## 2 NC_003197.2 Prodigal_v2.6.3 CDS    2801  3730  131. +          0 ID=1_2;parti…
## 3 NC_003197.2 Prodigal_v2.6.3 CDS    3734  5020  193. +          0 ID=1_3;parti…
## 4 NC_003197.2 Prodigal_v2.6.3 CDS    5114  5887  129. -          0 ID=1_4;parti…
## 5 NC_003197.2 Prodigal_v2.6.3 CDS    5966  7396  156. -          0 ID=1_5;parti…
## 6 NC_003197.2 Prodigal_v2.6.3 CDS    7665  8618  158  +          0 ID=1_6;parti…

…we see the nine columns, and one row for each predicted gene. The Seqid column indicates in which genome sequence the gene is found. In this case there are 2 genome sequences, the chromosome and a plasmid. If the genome was a set of contigs, we would have seen many more. The Start, End and Strand indicates where in the genome sequence the gene is found. Note that if Strand is positive, the Start is the start of the gene and End is where it ends, but if Strand is negative it is opposite! (for this reason I would have replaced Start and End with Left and Right, but no one asked me when this format was decided!)

You can look up in the GFF description (see link above) for explanations of the Source, Type and Phase columns. The last column, Attributes, contains a lot of potentially useful information, and each token is separated by a semicolon ;. Different rows may have different information in this column. In a GFF file produced by prodigal we typically find

partial=00, or 01 or 10. With 00 the gene is complete, but if it is 01 the end is lacking, or 10 the start is lacking. Partial genes occur when we have many contigs, where genes are near the contig ends. Naturally, such predicted genes are dubious.
conf=99 is a confidence score, i.e. in this case prodigal has confidence 99% this is a real gene. We should, however, not interpret this as a probability…
There are several scores listed for each gene, and the score=... is a summary of them all. The higher score the more it indicates a real gene, according to the prodigal logic.

2.3.1 Exercise - `prodigal` scores

Extend the R script above by extracting, for each predicted gene, the

length
the confidence
the score

This means you add 3 new columns to the gff.tbl. The length you find from the Start and End columns, while the confidence and score must be extracted from the Attributes column texts. Extracting some information from within some texts is something we need quite often in data wrangling. This is a skill that separates bioinformaticians from other biologists…;) Here are some hints:

Inspect the Attributes column. What is the pattern we need to look for in order to extract the confidence information? It looks to me like something along this line: conf=x.y; where x.y is some decimal number. With the function str_extract() we can extract a pattern from texts. The regular expression "conf=.+?;" should match what we seek here. This regular expression may be read like this:

Match the occurrence of "conf=" followed by any symbol (".") matched any number of times ("+") up until the first match ("?") of the symbol ";".

To learn more about regular expressions, type ?regexp in the RStudio Console window and read the Help file. Regular expressions is a very powerful tool! The code should now look something like this:

prodigal.gff  <- readGFF("/mnt/SCRATCH/larssn/prodigal/prodigal.gff") %>% 
  mutate(confidence = str_extract(___, pattern = ___))

Copy this and fill in the missing parts.

Once you have extracted the text matching the pattern above, you need to remove the parts which is not the numerical value, e.g. if the extracted text is "conf=81.23;" you want to strip it down to just "81.23". To achieve this, use str_remove_all() and remove everything matching the regular expression "conf=|;". This pattern says:

Match everything that is either "conf=" or ";". The vertical bar "|" means or.

The code grows to:

prodigal.gff  <- readGFF("/mnt/SCRATCH/larssn/prodigal/prodigal.gff") %>% 
  mutate(confidence = str_extract(___, pattern = ___)) %>% 
  mutate(confidence = str_remove_all(___, pattern = ___))

Finally, convert the text to a number by as.numeric(). The code grows to:

prodigal.gff <- readGFF("/mnt/SCRATCH/larssn/prodigal/prodigal.gff") %>% 
  mutate(confidence = str_extract(___, pattern = ___)) %>% 
  mutate(confidence = str_remove_all(___, pattern = ___)) %>% 
  mutate(confidence = as.numeric(___))

The score information is found in a similar way, by adding a similar series of mutate statements.

Make a plot where each gene is a point in the coordinate system of length (x-axis) and score (y-axis), and color by its confidence. How are these three quantities (length, score and confidence) related?

2.3.2 Exercise solution

library(tidyverse)
library(microseq)

prodigal.gff <- readGFF("/mnt/SCRATCH/larssn/prodigal/prodigal.gff") %>%
  mutate(confidence = str_extract(Attributes, pattern = "conf=.+?;")) %>%
  mutate(confidence = str_remove_all(confidence, pattern = "conf=|;")) %>%
  mutate(confidence = as.numeric(confidence)) %>%
  mutate(score = str_extract(Attributes, pattern = ";score=.+?;")) %>%
  mutate(score = str_remove_all(score, ";score=|;")) %>%
  mutate(score = as.numeric(score)) %>%
  mutate(length = abs(Start - End) + 1)

fig <- ggplot(prodigal.gff) +
  geom_point(aes(x = length, y = score, color = confidence)) +
  scale_x_log10() + scale_y_log10()
print(fig)

## Warning in self$trans$transform(x): NaNs produced

## Warning: Transformation introduced infinite values in continuous y-axis

## Warning: Removed 1 rows containing missing values (`geom_point()`).

It looks like all genes above a certain length (just under 1000) all get high score and confidence. For genes much shorter than 1000, the score varies by something else than length. The score and confidence are highly correlated.

2.4 How well did `prodigal` predict?

The genome that we used as input to prodigal above is one of the prokaryotic reference genomes at NCBI, see https://www.ncbi.nlm.nih.gov/genome/browse#!/prokaryotes/refseq_category:reference. We cite from the description:

“Reference genomes are genome assemblies that are annotated and updated by the assembly submitters and chosen by the RefSeq curatorial staff based on their quality and importance to the community as anchors for the analysis of other genomes in their taxonomic group. Some reference genomes are selected based on a long history of collaboration and wide recognition as a community standard…”

Even if we cannot state these genomes are perfectly annotated, this is probably the closest we get to a gold standard or ‘truth’. In

$COURSES/BIN310/module6/GCF_000006945.2_ASM694v2_genomic.gff

we find the GFF file for the genes in this genome, and now we treat these as the actual genes. Let us read this into R in the same way as above:

reference.gff <- readGFF("/mnt/courses/BIN310/module6/GCF_000006945.2_ASM694v2_genomic.gff") %>% 
  filter(Type == "CDS")

Notice we filtered the rows, keeping only the coding sequences. This GFF file contains a lot more, but we are only interested in the coding genes now.

In order to compare genes listed in the two table, we mutate in a new column named Signature, like this:

prodigal.gff <- prodigal.gff %>% 
  mutate(Signature = str_c(Seqid, Start, End, Strand))

…and do similar for the reference.gff. If a gene has identical signature in the two tables, it means prodigal has predicted the exact correct gene (same genome sequence, same start, same end and on the same strand). If not, it is incorrect in some way.

In the lecture above I mentioned the concepts of Precision and Recall. In our case the total number of actual Positives ($P$) are all the reference genes. We define as all True Positives ($TP$) the cases where prodigal signatures are identical to the reference signatures. The False Positives ($FP$) are genes predicted by prodigal whose signatures does not match any of the reference signatures. Let us compute the Precision and Recall.

The Recall is $Recall = TP/P$. The Precision is $Precision = TP/(TP+FP)$. We compute this as:

P <- nrow(reference.gff)
TP <- sum(prodigal.gff$Signature %in% reference.gff$Signature)
FP <- sum(!(prodigal.gff$Signature %in% reference.gff$Signature))
cat("Precision = ", TP/(TP+FP), "\n")

## Precision =  0.8442428

cat("Recall = ", TP/P, "\n")

## Recall =  0.8590899

This recall means prodigal finds around 85% of the actual genes. This precision means that around 85% of the genes that prodigal report are correct genes. Thus, there are around 15% errors both ways. We see that the reference has around 4500 genes, and 15% of this is 675 genes!

2.4.1 Exercise - modified signature

Make the R script reading both the prodigal and the refrecne GFF-file, and compute Precision and Recall, use the code snippets above.

Perhaps we were too strict in our analysis above? As I mentioned in the lecture, the most difficult part of gene prediction is finding the exact start of the gene. But in some applications this is not critically important as long as we have found ‘most of the gene’. If we modify the Signature to ignore the start, how do the results improve?

Hint: You cannot simply remove the Start from the computing of the Signature, since the actual start of the gene is the End coordinate when the gene is on the negative Strand! Thus, you must remove the Start for genes on the positive and End for genes on the negative strand. In this case you need something like

prodigal.gff <- prodigal.gff %>% 
  mutate(Signature_modified = if_else(___, str_c(Seqid, End, Strand), str_c(Seqid, Start, Strand))

and similar for the reference.

Make these modifications to the script from above, and re-compute Precision and Recall. Does it change much?

2.4.2 Exercise solution

library(tidyverse)
library(microseq)

### Read prodigal GFF-file
prodigal.gff <- readGFF("/mnt/SCRATCH/larssn/prodigal/prodigal.gff") %>%
  mutate(confidence = str_extract(Attributes, pattern = "conf=.+?;")) %>%
  mutate(confidence = str_remove_all(confidence, pattern = "conf=|;")) %>%
  mutate(confidence = as.numeric(confidence)) %>%
  mutate(score = str_extract(Attributes, pattern = ";score=.+?;")) %>%
  mutate(score = str_remove_all(score, ";score=|;")) %>%
  mutate(score = as.numeric(score)) %>%
  mutate(length = abs(Start - End) + 1)

### Read reference GFF-file
reference.gff <- readGFF("/mnt/courses/BIN310/module6/GCF_000006945.2_ASM694v2_genomic.gff") %>%
  filter(Type == "CDS")

### Add signatures
reference.gff <- reference.gff %>%
  mutate(Signature = str_c(Seqid, Start, End, Strand)) %>%
  mutate(Signature_modified = if_else(Strand == "+", str_c(Seqid, End, Strand), str_c(Seqid, Start, Strand)))
prodigal.gff <- prodigal.gff %>%
  mutate(Signature = str_c(Seqid, Start, End, Strand)) %>%
  mutate(Signature_modified = if_else(Strand == "+", str_c(Seqid, End, Strand), str_c(Seqid, Start, Strand)))

### Precision and recall
cat("With full signature:\n")
P <- nrow(reference.gff)
TP <- sum(prodigal.gff$Signature %in% reference.gff$Signature)
FP <- sum(!(prodigal.gff$Signature %in% reference.gff$Signature))
cat("Precision = ", TP/(TP+FP), "\n")
cat("Recall = ", TP/P, "\n")

cat("With modified signatures:\n")
P <- nrow(reference.gff)
TP <- sum(prodigal.gff$Signature_modified %in% reference.gff$Signature_modified)
FP <- sum(!(prodigal.gff$Signature_modified %in% reference.gff$Signature_modified))
cat("Precision = ", TP/(TP+FP), "\n")
cat("Recall = ", TP/P, "\n")

If the Strand is positive (equals "+") then the Signature should be made up of Seqid, End and Strand, but otherwise it should be made up of Seqid, Start and Strand. Note the double equal signs == when comparing if something is equal to something else.

Both Precision and Recall are now quite high. This illustrates the fact that finding the exact start is a bottleneck in prokaryotic gene finding.

2.4.3 Exercise - remove low-scoring predicted genes

Keep using the modified signature from the previous exercise.

Let us then filter out predicted genes from the prodigal predictions having score below some threshold, e.g. 10, and re-compute. Does it improve Precision and/or Recall? Experiment with different thresholds (20, 30, 40), and see if you can improve the results by being stricter on the score threshold.

2.4.4 Exercise solution

library(tidyverse)
library(microseq)

score.threshold <- 10   # edit this to other values

prodigal.gff <- readGFF("/mnt/SCRATCH/larssn/prodigal/prodigal.gff") %>%
  mutate(score = str_extract(Attributes, pattern = ";score=.+?;")) %>%
  mutate(score = as.numeric(str_remove_all(score, ";score=|;"))) %>%
  filter(score >= score.threshold) %>%
  mutate(Signature_modified = if_else(Strand == "+", str_c(Seqid, End, Strand), str_c(Seqid, Start, Strand)))
reference.gff <- readGFF("/mnt/courses/BIN310/module6/GCF_000006945.2_ASM694v2_genomic.gff") %>%
  filter(Type == "CDS") %>%
  mutate(Signature_modified = if_else(Strand == "+", str_c(Seqid, End, Strand), str_c(Seqid, Start, Strand)))

cat("Modified signature, score threshold", score.threshold, ":\n")
P <- nrow(reference.gff)
TP <- sum(prodigal.gff$Signature_modified %in% reference.gff$Signature_modified)
FP <- sum(!(prodigal.gff$Signature_modified %in% reference.gff$Signature_modified))
cat("Precision = ", TP/(TP+FP), "\n")
cat("Recall = ", TP/P, "\n")

By increasing the score threshold we can make the Precision very high, but we soon start to lose Recall, i.e. we miss some real genes by being too strict on the score.

3 Annotation - recognizing gene function

Finding the gene candidates (e.g. with prodigal) is the first step in an annotation of a genome, but perhaps the most important part is the following recognition of function. This is usually what people mean when they talk about annotation.

First, we must be aware that function is here used in a very wide an perhaps inexact way. The actual function of some gene is in general very hard to nail down to some short description. It depends on the context, and we also know that in some cases a destruction of a gene may not alter the function at all, because there may be ‘backup solutions’ in the genome. Thus, when we talk about function here now, we refer to categorizing genes into more or less meaningful groups we have constructed based on current knowledge. Such categories may be

Some standard texts describing a family of genes, e.g. ‘ABC transporter’.
Some rather cryptic accessions number referring to some system, e.g. the Gene Onthology (GO) hierarchy or the Enzyme Commission (EC) hierarchy.
Some accessions identifying the involvement in some known biochemical reaction pathways, e.g the Kyoto Encyclopedia of Gene Genomes (KEGG) or KEGG Orthologs (KO).
Some accession numbers describing conserved patterns known to found in proteins, and to be looked up in databases like Pfam, [TIGRFAM]http://tigrfams.jcvi.org/cgi-bin/index.cgi), CDD etc.

Often these pieces of information are linked, i.e. if you have the Pfam accession number, you also have a related GO accession, KO-accession etc.

The entire process means we

Have some predicted coding gene, usually translated into protein.
Search one or several databases, to see if something like this has been seen before.
If we find something similar enough, we predict the function of our gene is the same as that in the database.

We will not discover new functions this way, just recognize what we have seen before. It is quite obvious that the results of this depends on how good the databases are! For this reason, there are several online annotation solutions (e.g. RAST), i.e. you simply upload your genome or gene predictions to some website, and your receive the results at some later time. This may be a good solution in many cases, we simply trust this service to have up-to-date databases and good search tools.

Since the uploading to a website requires no skills, we will not pursue this here. Instead, we have a look at how we could do annotations locally. This is perhaps most relevant in cases we have sequenced many genomes, and want to automate the annotation process in some way.

3.1 Simple annotations by `prokka`

The prokka software is a quick and simple-to-use software for annotating a genome. It is not a comprehensive scan for all information, but can still be useful in many cases. The prokka is implemented as a pipeline using several other tools in an automated way, e.g. it uses prodigal for gene prediction.

The input to prokka is a fasta-file with genomic DNA. This can be a genome, as contigs or a whole genomic sequence. We will run prokka on the same genome we used as input to prodigal above, to see how it turns out. The prokka will then first use prodigal to predict the coding genes, and some other tools for finding rRNA, tRNA and other non-coding genes. Then, it will use other tools again to start searching with the coding genes against some small databases of known genes commonly found in prokaryotes. Thus, the annotation is rather shallow, and we do not expect to find good functional description for all coding genes, but it is a nice first start. And prokka runs extremely fast compared to more thorough searches. If you only need a “shallow” annotation of many genomes, this is a nice tool.

Before we start using prokka, let us first take some minutes to talk about the Help texts we get displayed from the command line. We looked into this in an earlier module, here we go through this again for prokka. Download the newest apptainer container from galaxy, and store in your module6 folder. Open you Terminal and run

apptainer exec <prokka-container> prokka -h

Replace <prokka-container> with your container name. Here is the first lines of the Help text you should see displayed:

Option h is ambiguous (help, hmms)
Name:
  Prokka 1.14.6 by Torsten Seemann <torsten.seemann@gmail.com>
Synopsis:
  rapid bacterial genome annotation
Usage:
  prokka [options] <contigs.fasta>
General:
  --help             This help
  --version          Print version and exit
  --citation         Print citation for referencing Prokka
  --quiet            No screen output (default OFF)
  --debug            Debug mode: keep all temporary files (default OFF)

We notice that

The command prokka actually invoked prokka! This is not obvious. We have seen earlier that to start spades you type spades.py, quast starts by quast.py etc . The name of the software is not always the exact name of the command we use! We usually try the name as a command first, and look up the software at GitHub for more help if this fails. Or just google for the manual…
The option -h was ambiguous! It could mean both help and hmms here, both are valid options to prokka. But, the Help text was displayed anyway… (the proper Help-option is --help as we can see above).
Notice the Usage:. Always look for this first! This describes exactly how the command line should look like. It should here contain
- First the command prokka.
- Then options, as many or few as you like, and in any order you like (the [options]).
- Finally the input fasta file to annotate. Replace <contigs.fasta> by your filename (genome, contigs, or any fasta file with DNA).

Always stick to the pattern described in Usage. The prokka is actually quite liberal in terms of input order, but in general command line tools require inputs in the exact order specified under Usage!

The options are typically of two types:

Options that turn on or turn off some behavior. An example of this is the --force option we will use (see below). Such options require no further input, you simply add the option or you don’t.
Options that require some additional argument. An example of this is the --prefix option we will use below. Such options need some additional input to be meaningful, i.e. it must be followed by a text or a number. In the Help-text you see that the options followed by [X] or [N] need some additional input.

The prokka is a little bit special regarding the output. Many tools require us to create the output folder before we run, and some tools will create it for us if it is lacking. The prokka is opposite! It will not write to an existing folder, in fear of overwriting something. However, by turning on the --force it will overwrite. If you do not use --force, the output folder must be non-existing, and prokka will create it.

May the --force be with you…

3.1.1 Exercise - run `prokka`

You should now try to make use of prokka on your own, based on Help text. Make a shell script in your module6 folder and

Use the same genome file we used with prodigal above as input.
Output to the folder $SCRATCH/prokka.
Use the --prefix option to name the output files.
Use the --force option.
Use 10 threads or CPUs.
Use 30GB of memory.
Use 1 hour wall time.

It should only take some minutes to complete, but this varies a lot between nodes. You can inspect the .log file to see how things are going. The prokka produces an extensive log during its work.

It is a good idea to copy a script from previous, and try to change it in order to run this software. But, never assume options have the same names, or that input should be specified in the same way as for some other software! You have to read the Help for this software!

3.1.2 Exercise solution

Here is how the script could look like:

#!/bin/bash

#SBATCH --nodes=1
#SBATCH --reservation=BIN310
#SBATCH --account=bin310
#SBATCH --ntasks=10
#SBATCH --mem=30G
#SBATCH --time=01:00:00
#SBATCH --job-name=prokka
#SBATCH --output=prokka_%j.log

##############
### Settings
###
threads=10
genome_file=$COURSES/BIN310/module6/GCF_000006945.2_ASM694v2_genomic.fna
out_dir=$SCRATCH/prokka
prefix=prokka

###################
### Running prokka
###
apptainer exec $HOME/module6/prokka:1.14.6--pl5321hdfd78af_5.sif prokka \
--force \
--cpus $threads \
--outdir $out_dir \
--prefix $prefix \
$genome_file

3.2 The `prokka` output

Once prokka is completed, there are several files in the $out_dir folder, all with the prefix you specified. We will only look at some of these.

The .fna file is simply a copy of the input, the genome. The predicted genes are in the .ffn file, while the .faa file contains the proteins we get from translating the predicted coding genes. All these are fasta-files.

The files .tsv and .gff are tables with the annotation information for all genes. The .tsv is in principle a simple tab-separated table, but the number of tabs on each row varies, making it a mess if you try to read it! The .gff should adhere to the General Feature Format (GFF), that we saw was an output from prodigal as well. This .gff file contains more than what we saw for prodigal, since it includes more than just coding genes. Note also that the GFF-file from prokka has some additional content! After the GFF-table itself, there is a ##FASTA line, followed by lines of fasta-formatted sequence. This is actually a copy of the genome that was annotated, and that you find in the .fna file.

Since the .gff file is the most comprehensive one, we focus on this when we read the results into R.

3.2.1 Exercise - overview of annotations

Make an R-script that reads the .gff file from prokka and put its content into a table. Use the readGFF() function from our microseq package again. Then, the ##FASTA content of the file will be added as an attribute (named FASTA) to the object we get in R.

If you inspect the resulting .gff table in R, the predicted coding genes should be the same we got from prodigal earlier. In addition, we now also get listed some RNA-genes and possibly some other genomic features.

The Attributes column contains a number of semicolon-separated information. We would like to collect the product information for each gene, which is a text description of what kind of gene it is. This should appear as product=<some text> as the last piece of information in most rows. Make a new column named Product and put this text (minus "product=") into this column. Hint: Use str_extract() to extract the product-information from each row of Attributes, then str_remove() to remove the product= part from this extracted text. We did something similar when handling prodigal results above. Note that the product information does not end with a semicolon…(since it is the last information in the Attributes texts)

How many genes (both coding and non-coding) does prokka list?
How many different product texts are there, i.e. number of unique gene products?
If we consider only the coding genes (Type is CDS), what are the most common product descriptions?

Genes described as "hypothetical protein" are those without any annotation, i.e. they have not been recognized as some known gene. These are the best candidates for being false positive predictions. What is the fraction of "hypothetical protein" genes among the coding genes?

3.2.2 Exercise solution

library(tidyverse)
library(microseq)

prokka.tbl <- readGFF("/mnt/SCRATCH/larssn/prokka/prokka.gff") %>%           # edit path!
  mutate(Product = str_extract(Attributes, "product=.+")) %>%
  mutate(Product = str_remove(Product, "product="))
cat("prokka lists", nrow(prokka.tbl), "genes in this genome\n")

cat("prokka found", length(unique(prokka.tbl$Product)), "unique genes\n")
cds.tbl <- prokka.tbl %>%
  filter(Type == "CDS") %>%
  group_by(Product) %>%
  summarise(N = n()) %>%
  arrange(desc(N))
head(cds.tbl)
cat("Fraction of un-annotated:", cds.tbl$N[1] / sum(cds.tbl$N), "\n")

As we can see, quite many genes are left un-annotated. This may be because they are false positive gene predictions, but more likely because prokka do not search wide enough to recognize all functions.

4 Using `conda` instead of containers

Before we proceed, we will stop and see an alternative way of installing software on an HPC. We will need this for our next software tool.

We have seen how we may use apptainer containers to avoid installing software and to be able to use all kinds of versions of software. But, there are cases where this solution is not the best choice, or even possible. We now want to use a software DRAM in the next section. This is a tool for annotation of genes in a genome, much like prokka above, but more extensive. This software comes with a huge collection of databases. You will find containers for dram at galaxy, or other places, but these containers do not contain the massive databases. These must be downloaded separately. In principle, this should not be a problem, we could still use container software in such cases, but the DRAM software is a little special as we will see.

It turns out that using DRAM is easier if we install it as a conda environment on orion. This is always an option, and most of (all?) the other software tools we have used in BIN310 could also have been installed in this way. We must therefore be familiar with this way of working, and take this opportunity to look closer at how we can do this on orion or any other HPC.

You may be familiar with conda from before, and relate this to python coding. But, conda may be used to install all kinds of software as long as it has a conda installation option. When we install a software with conda it is installed inside an environment, which is in a folder under your $HOME. This is similar to a container in the sense that all dependencies required by the software are also installed inside the same environment. When we use the software we first activate this environment, then use it, and then deactivate it. This means you may have many different versions of the same software, but in different environments without conflicting with each other. Instead of having a container file, you now get an environment folder under you $HOME.

4.1 Installing software with `conda`

Open a Terminal window and maneuver to your $HOME folder. First, we need to load the Miniconda3 software:

module load Miniconda3/4.9.2

This contains, among other things, the command conda. Verify it has been loaded by the command

conda -h

and you should get some help text on the screen.

In most cases we will install a software package using a variant of this command

conda create -c <channel> --name <environment-name> <package>

First, we use conda create to create a new environment. You may also install a (new) package into an existing environment, that you have created some time in the past. You would then use conda install instead. You may install several packages into the same environment as long as they do not conflict, but in general we make a new environment for each tool.

The first time you run conda create a folder named .conda wil be created under your $HOME. All new environments will now be found in the subfolder .conda/envs/. This folder will get the name you specify in <environment-name>. In most cases we use the same name as the package (software) we install, but you may actually decide the environment name.

Next, the -c <channel> specifies the channel we should use for downloading. The conda can install from various channels/repositories on the internet. The channel bioconda is very relevant for our use, see the bioconda webpage for more information information about this. There are also a number of other channels you may collect software from

Finally, we specify the name of the package we want to install.

4.2 Installing `prodigal` as a `conda` environment

For practice, let us install the software prodigal, that we used above, as a conda environment.

First, we need to know of a channel from which we can download the software. There are many such conda channels around. One of them is bioconda, containing a lot of software used in biology. verify there is a package named prodigal at the bioconda website. There is a small search-box in the left margin. Use this.

Then, use the template shown above, and install this to an environment you name genefinders. We most typically name the environment the same as the software, but here I make an exception, just to highlight that the environment is not the same as the software itself.

Ignore any warning about a newer version of conda. Answer yes (y) when asked if you want to proceed with the installation.

After completed installation inspect the folder .conda/envs/ under you $HOME and verify the new environment has been created.

Next, we need to activate the new environment by conda activate genefinders. But, this will typically give an error (CommandNotFoundError) the first time you try. In order to make this work, you must first run the following code in your Terminal:

eval "$(conda shell.bash hook)"

This will link conda to your current shell. Then we activate the environment by

conda activate genefinders

Notice how the command prompt changes when you activate the environment! We can now run prodigal -h to see if this starts prodigal as it should.

Finally, deactivate it again with conda deactivate.

A short video showing how to do the steps above

4.2.1 Exercise - installing `DRAM`

Install DRAM by conda. Instead of using the standard command line seen above, use the procedure outlined at the GitHub site of the software.

First, load Miniconda3 in the Terminal as shown above.
Then follow the suggested procedure, by first downloading a .yaml file using wget…
…and then install with conda as described in the DRAM website. Name the environment DRAM as suggested.

It is quite typical that once we have the name of some software, we search for its GitHub site. Most tools have this. Then, look for descriptions on how to install.

4.2.2 Exercise solution

A short video where we install DRAM

4.2.3 Exercise - access the `DRAM` database

As mentioned above the DRAM requires a collection of databases. This has already been installed on orion at /mnt/databases/DRAM_1.4.6/, and we should make use of this. To make the DRAM software aware of this we need to supply DRAM with a configuration file with this information. This is not a common solution, and the reason this tool works best from a conda installation. Other tools making use of an external database typically have an option (e.g. -db) where we the name the path to the database.

Read the text file /mnt/databases/DRAM_1.4.6/README.md (use cat). Here you find the description of how you import the configuration file into your environment. Follow this.

Then deactivate the environment and your DRAM environment should now be ready for use!

4.2.4 Exercise solution

A short video importing the configuration file

5 Annotations with `DRAM`

We will now use the software DRAM that we installed as a conda environment above, to do an automated annotation of a genome. This tool was originally made for an automated annotation of several genomes re-constructed from metagenomes, what we denote MAGs (Metagenome Assembled Genomes). We will, however, use it here for a single genome, just to get to know it.

5.1 The shell script

Here is a suggested code for using DRAM on orion:

######################
### Settings
###
genome_file=$COURSES/BIN310/module6/GCF_000006945.2_ASM694v2_genomic.fna
out_folder=$SCRATCH/dram
threads=10

#######################
### Loading Miniconda
###
module load Miniconda3/4.9.2
eval "$(conda shell.bash hook)"

######################
### Running DRAM
###
conda activate DRAM

DRAM.py annotate \
 --input_fasta $genome_file \
 --output_dir $out_folder \
 --threads 10
 
DRAM.py distill \
 --input_file $out_folder/annotations.tsv \
 --trna_path $out_folder/trnas.tsv \
 --rrna_path $out_folder/rrnas.tsv \
 --output_dir $out_folder/distilled

conda deactivate

Notice how we first have to load Miniconda3. Then follows the command eval "$(conda shell.bash hook)" we saw above, to configure the shell for activating conda environments. This applies to all conda environments on orion, not just DRAM. Try to run just conda shell.bash hook directly in the Terminal (after loading Miniconda3) and you will see this creates a file. Then eval of this will execute it. We will not dig more into this, just accept it as a necessary command for using conda in our shell scripts. Remember that each time you sbatch and start a new job in SLURM, it will work inside a new shell. Thus, this statement must always be included in shell scripts on orion where you need to activate some conda environment.

Next we activate the DRAM environment, and then we can run the DRAM software.

First we use DRAM.py annotate with some required input/output, to get the basic annotations for the genome. This will result in a number of files in the folder we specify in $out_folder.

Next, we use the DRAM.py distill to summarise what we found in the previous step. This is mostly a benefit when you annotate many genomes together, but we still use it here, and will have a look at the output.

Note that we specify as input to distill some files produced in the annotate step above. There is no guarantee all these files were actually created! For instance, the file rrna.tsv is a small table listing the ribosomal RNA genes found in the genome. Should not all genomes have these, you may ask? Yes, they should, but experience tells us these may be badly assembled (why?) and may sometimes not be detected at all in the contigs we produce. Thus, you should actually run only the annotate step first, inspect the $out_folder and then run the distill step. If the rrna.tsv file is missing, you may still run the distill, just omit the --rrna_path option altogether.

Finally, we deactivate the conda environment again.

Add the proper heading to this, and sbatch. Use 50GB memory, which should be more than enough even if you annotate many genomes.

5.1.1 Exercise - the `annotate` output

Let us have a look at the content of the folder specified in $out_folder above.

Also, look up the scientific paper describing DRAM. You find it here. Focus on the section called DRAM annotation overview.

Try to answer these questions, by reading the paper and/or inspect the output (read output files into R):

To predict genes, DRAM uses a gene finding tool. Which tool?
What are the databases scanned by DRAM? Do you know anything about what these databases contain?
The paper mentions the tool MMseqs2. What is this used for in DRAM?
In our code above we never used the option --use_uniref. What would be the effect of adding this to our command line?
How many genes were annotated from our input genome? Hint: The file annotations.tsv has one row for each gene. Read this into R, extension .tsv means tab-separated text file.
The table in annotations.tsv has a column named rank. What is this? Hint: See paper.
What is a ‘reciprocal best hit’?
Count how many genes were given the various ranks, from A to E. Hint: Convert the rank column to a factor with levels A to E, and then use table() on it. Based on what the paper says, what may we infer from this?

5.1.2 Exercise solutions

It looks like DRAM is using prodigal, but insists on using the anonymous/metagenome mode. This may result in slightly different results than we got when we used standard prodigal above. If you want to check it out, read the file genes.gff in the $out_folder and compare to what we got before…
The databases seems to be: Pfam, KEGG, UniProt, CAZY, MEROPS, VOGDB. Note that Pfam is now hosted by InterPro (from this year). Note also that the full KEGG database is not included on orion. This is not an open (free) database, you need to get personal access. As a substitute, DRAM uses KOfam, which is a HMM-based database built from KEGG.
The MMseqs2 is used for various sequence searches. Think of it as an alternative to BLAST, that we saw in an earlier module. It is a more general tool than BLAST, and can do many different kinds of searches.
The UniRef90 is a (huge) collection of proteins from the UniProt database. To search against it, you need to use the --use_uniref option, i.e. this is not by default included. The reason is that is slows the search considerably. When we used DRAM above, it took some minutes to annotate this single genome. With --use_uniref option included, expect it to take many hours (even days)…
R code, see below.
The rank column ranks the genes by how “well” they are annotated. Rank A requires reciprocal best hit (RBH) to a KEGG protein, and since we have no KEGG installed on orion we will never see this. The B category requires RBH to a UniRef90 protein, and this of course requires the use of --use_uniref option that we saw above. Categories C and D are OK annotations. Category E means a gene is predicted, but has no hits to any of the current databases. Such genes may not be genes at all, or something entirely new we have no function for yet.
A reciprocal best hit between sequence A and B means: You search with sequence A against a sequence collection where B is found, and B gives the best hit to A. Then you do the opposite, search with B, and A gives the best hit. It means these are the closest relatives we have seen, involving any of these two.

library(tidyverse)

### The raw annotations
annotations.tbl <- suppressMessages(read_delim("/mnt/SCRATCH/larssn/dram/annotations.tsv"))
cat("DRAM annotated:", nrow(annotations.tbl), "genes\n")
annotations.tbl <- annotations.tbl %>%
  mutate(rank = factor(rank, levels = c("A", "B", "C", "D", "E")))
print(table(annotations.tbl$rank))

Since we have no genes of rank A and B, we can conclude the KEGG and UniRef90 databases were not really searched. Either that or this genome is really poorly assembled, or is from some other planet(!). No real earthly genome will have 0 genes found in either KEGG or UniRef90!

We happen to know the DRAM database on orion lacks KEGG, and that we did not search UniRef90 since we omitted the --use_uniref option. The latter you may add, but this will slow down computations a lot, and require much more memory. We will not do this in BIN310…

5.1.3 Exercise - the `distilled` output

In the folder $out_folder/distilled/ you should also have some output from DRAM. Let us focus on the excel-file named metabolism_summary.xlsx.

To read an excel-file into R, install the R package readxl. You can then read the file by

library(readxl)
tbl <- read_excel("/mnt/SCRATCH/larssn/dram/distilled/metabolism_summary.xlsx")  # edit path

Excel-files may have multiple sheets, and this is also the case here. The code above will read the first sheet only. By adding sheet = "sheet_name", where "sheet_name" is a text with the name of a sheet, you specify which sheet to read.

To see the sheet names included in an excel-file, use

library(readxl)
sheets <- excel_sheets("/mnt/SCRATCH/larssn/dram/distilled/metabolism_summary.xlsx")  # edit path

The object sheets will now contain the names of the various sheets.

Read the sheet for carbon utilization. You should get a table, and the last column lists the number of genes from each category found in the current genome. How many carbon utilization genes were found in this genome?

Perhaps you could also find where on the genome these genes are located? More specifically, we would be interested in operons, i.e. regions where several such genes are located next to each other. Do you find any such cases? Hint: The annotations table from above has the information about position and strand, you just need to filter out the carbon utilization genes in this table.

5.1.4 Exercise solution

library(tidyverse)
library(readxl)

### The carbon utilization genes
sheets <- excel_sheets("/mnt/SCRATCH/larssn/dram/distills/metabolism_summary.xlsx")
carb_util.tbl <- read_excel("/mnt/SCRATCH/larssn/dram/distills/metabolism_summary.xlsx", sheet = sheets[2])
cat("DRAM found", sum(as.numeric(carb_util.tbl$GCF_000006945.2_ASM694v2_genomic)), "carbon utilization genes\n")

### The location of the carb_util genes
### First, we filter the table, keeping only the genes with 1 copy or more
carb_util.tbl <- carb_util.tbl %>%
  filter(GCF_000006945.2_ASM694v2_genomic != "0")
### Next, we need to filter the annotations.tbl from the previous exercise,
### having the location information,
### keeping only the carb_util genes.
### The gene_id in the carb_util.tbl may match
### ko_id, peptidase_id or cazy_ids in the annotations.tbl
annot_carb_util.tbl <- annotations.tbl %>%
  filter(ko_id %in% carb_util.tbl$gene_id |
           peptidase_id %in% carb_util.tbl$gene_id |
           cazy_ids %in% carb_util.tbl$gene_id)
### To find operons we can now inspect this table, and use the columns gene_position and strandedness.
### We immediately find that genes with gene_position 149-150-151 are all on the positive strand (standeness 1), a potential operon
View(annot_carb_util.tbl)

BIN310 module 6 - Annotations

Lars Snipen

2023

1 Learning goals

1.1 Software used in this module

2 Gene prediction

2.1 Lecture

2.2 The `prodigal` software

2.2.1 Exercise - make the `prodigal` script

2.2.2 Exercise solution

2.3 The Generic Feature Format

2.3.1 Exercise - `prodigal` scores

2.3.2 Exercise solution

2.4 How well did `prodigal` predict?

2.4.1 Exercise - modified signature

2.4.2 Exercise solution

2.4.3 Exercise - remove low-scoring predicted genes

2.4.4 Exercise solution

3 Annotation - recognizing gene function

3.1 Simple annotations by `prokka`

3.1.1 Exercise - run `prokka`

3.1.2 Exercise solution

3.2 The `prokka` output

3.2.1 Exercise - overview of annotations

3.2.2 Exercise solution

4 Using `conda` instead of containers

4.1 Installing software with `conda`

4.2 Installing `prodigal` as a `conda` environment

4.2.1 Exercise - installing `DRAM`

4.2.2 Exercise solution

4.2.3 Exercise - access the `DRAM` database

4.2.4 Exercise solution

5 Annotations with `DRAM`

5.1 The shell script

5.1.1 Exercise - the `annotate` output

5.1.2 Exercise solutions

5.1.3 Exercise - the `distilled` output

5.1.4 Exercise solution

BIN310 module 6 - Annotations

Lars Snipen

2023

1 Learning goals

1.1 Software used in this module

2 Gene prediction

2.1 Lecture

2.2 The prodigal software

2.2.1 Exercise - make the prodigal script

2.2.2 Exercise solution

2.3 The Generic Feature Format

2.3.1 Exercise - prodigal scores

2.3.2 Exercise solution

2.4 How well did prodigal predict?

2.4.1 Exercise - modified signature

2.4.2 Exercise solution

2.4.3 Exercise - remove low-scoring predicted genes

2.4.4 Exercise solution

3 Annotation - recognizing gene function

3.1 Simple annotations by prokka

3.1.1 Exercise - run prokka

3.1.2 Exercise solution

3.2 The prokka output

3.2.1 Exercise - overview of annotations

3.2.2 Exercise solution

4 Using conda instead of containers

4.1 Installing software with conda

4.2 Installing prodigal as a conda environment

4.2.1 Exercise - installing DRAM

4.2.2 Exercise solution

4.2.3 Exercise - access the DRAM database

4.2.4 Exercise solution

5 Annotations with DRAM

5.1 The shell script

5.1.1 Exercise - the annotate output

5.1.2 Exercise solutions

5.1.3 Exercise - the distilled output

5.1.4 Exercise solution

2.2 The `prodigal` software

2.2.1 Exercise - make the `prodigal` script

2.3.1 Exercise - `prodigal` scores

2.4 How well did `prodigal` predict?

3.1 Simple annotations by `prokka`

3.1.1 Exercise - run `prokka`

3.2 The `prokka` output

4 Using `conda` instead of containers

4.1 Installing software with `conda`

4.2 Installing `prodigal` as a `conda` environment

4.2.1 Exercise - installing `DRAM`

4.2.3 Exercise - access the `DRAM` database

5 Annotations with `DRAM`

5.1.1 Exercise - the `annotate` output

5.1.3 Exercise - the `distilled` output