Here are some ‘pipelines’ and recommendations concerning the taxonomic classification of 16S sequences, typically obtained from amplicon sequencing. Regardless of how you processed your amplicon data, you end up with a set of sequences, the centroids, each representing a group of reads. We often then want to assign these centroids to some taxonomy, if possible. This is what we refer to as taxonomic classification here.
The scripts and data we refer to here can be copied from the folder
/mnt/users/larssn/MIDIV/Amplicon_sequencing/Taxonomy/
when you are on orion.
You must always copy the code to your own folder first, and then you may edit it. Think of the code here as a template that you give your personal touch once you have your own copy.
It is also recommended that you copy all the apptainer containers you use (in
/mnt/users/larssn/MIDIV/Amplicon_sequencing/apptainer/
) to your own project folder, and edit all paths to use these local copies. In this way your code is reproducible forever after. Note that in the folder mentioned above, the container versions will change over time.
In the shell script TAXONOMY_SINTAX.sh we use the SINTAX method implemented in the VSEARCH software. It runs very fast, and the results are fairly good, giving a confidence score for each assignment that is not totally crap. The drawback of this method is that it has never been properly published, and is only available at bioRxiv (https://www.biorxiv.org/content/10.1101/074161v1).
The Settings section looks like this:
centroids_file=/mnt/users/larssn/MIDIV/Amplicon_sequencing/Taxonomy/centroids.fasta
out_file=sintax_GTDB.tsv
database_file=/mnt/users/larssn/MIDIV/Amplicon_sequencing/Taxonomy/DBFILE_SINTAX_GTDB_reps_r220.fasta
# database_file=/mnt/users/larssn/MIDIV/Amplicon_sequencing/Taxonomy/DBFILE_SINTAX_GSR-DB_full-16S.fasta
# database_file=/mnt/users/larssn/MIDIV/Amplicon_sequencing/Taxonomy/DBFILE_SINTAX_EzBioCloud.fasta
# database_file=/mnt/users/larssn/MIDIV/Amplicon_sequencing/Taxonomy/DBFILE_SINTAX_RDP_16s_v18.fasta
vsearch_app="apptainer exec /mnt/users/larssn/MIDIV/Amplicon_sequencing/apptainer/vsearch:2.28.1--h6a68c12_0.sif"
You need to edit the file name assigned to centroids_file; this is your input fasta file. Also, out_file must be given a relevant file name. This is where the SINTAX output will end up, as a simple text file.
SINTAX also needs a database_file, which is also a fasta file, but where the header has the required format containing the taxonomy information. There are several files to choose from; they all start with DBFILE_SINTAX_. We recommend using the GTDB taxonomy these days, mainly because gtdbtk, the tool typically used for assigning taxonomy to MAGs, also uses the GTDB taxonomy. The GSR-DB database is also a nice initiative (https://journals.asm.org/doi/10.1128/msystems.00950-23), but uses the NCBI Taxonomy. The RDP file we supply is just for old times' sake, and will be discontinued.
You run this on orion in the usual way:
sbatch TAXONOMY_SINTAX.sh
The job is named sintax. There should be no need to edit the SLURM header of the script. Note that this only uses 1 thread, but it should still take only a short time (a few minutes) as long as you use one of the supplied database files. A larger database means a longer running time.
As mentioned above, SINTAX outputs a ‘confidence’ score for each classification. From the output file you can filter on this score in R, trying out various thresholds without re-running SINTAX. Here is how you may do this.
We have run the code displayed above, and got a file named sintax_GTDB.tsv. From this we want to set to "unclassified" all OTUs with a genus_score below, say, 0.8:
library(tidyverse)
taxonomy.tbl <- read_tsv("sintax_GTDB.tsv") %>%
  mutate(genus = if_else(genus_score < 0.8, "unclassified", genus))
By adding similar mutate()
statements you can do the
same for the other ranks as well. In this way you can very quickly use
different confidence thresholds without re-running the classification
(which may take some time if the database file is large).
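For example, the same threshold could be applied at several ranks at once. A minimal sketch, where the exact rank and *_score column names are assumptions that depend on the database file you used:

```r
library(tidyverse)

# Assumed column names (order, family, genus and their *_score columns);
# check the header of your own output file and adjust accordingly
taxonomy.tbl <- read_tsv("sintax_GTDB.tsv") %>%
  mutate(order  = if_else(order_score  < 0.8, "unclassified", order),
         family = if_else(family_score < 0.8, "unclassified", family),
         genus  = if_else(genus_score  < 0.8, "unclassified", genus))
```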
In the R package dada2 there is an implementation of the classical RDP-classifier method, published ages ago (https://journals.asm.org/doi/full/10.1128/aem.00062-07). The R function for this is named assignTaxonomy(), but here we refer to this method as DADATAX. It is (much) slower than SINTAX (as usual for dada2), but the results are usually of a similar quality. It also outputs some confidence scores, but these are based on a bootstrap idea that is strange, to say the least. You are hereby warned!
The shell script TAXONOMY_DADATAX.sh is all you need. It is actually all R code, but we put it into a shell script just to make it similar to TAXONOMY_SINTAX.sh.
You run this on orion in the usual way:
sbatch TAXONOMY_DADATAX.sh
The job is named dadatax.
As with SINTAX above, you need to specify the same input and output files. We have made DADATAX-versions of the same database files as for SINTAX (not the RDP though), and you may therefore use both methods on the same centroids, using the same database, and compare the results you get.
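One way to compare is to join the two result tables and count agreements. A minimal sketch, assuming both output files have an OTU column identifying the centroids and a genus column, and where the file name dadatax_GTDB.tsv is a hypothetical example; use the out_file names you chose in the two scripts and the actual column names in your files:

```r
library(tidyverse)

# Hypothetical file names; replace with your own out_file names
sintax.tbl  <- read_tsv("sintax_GTDB.tsv")
dadatax.tbl <- read_tsv("dadatax_GTDB.tsv")

# Join on the OTU column (an assumed column name) and compare genus assignments
comparison.tbl <- sintax.tbl %>%
  select(OTU, sintax_genus = genus) %>%
  left_join(select(dadatax.tbl, OTU, dadatax_genus = genus), by = "OTU")

# Fraction of centroids where the two methods agree at genus rank
mean(comparison.tbl$sintax_genus == comparison.tbl$dadatax_genus, na.rm = TRUE)
```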
As mentioned, a confidence (bootstrap) score is also output here. We convert it to values between 0.0 and 1.0 instead of percentages, and you may filter the results in R in exactly the same way as described for SINTAX above.
We may implement other solutions to the taxonomy assignment in the future.
The choice of algorithm is often not crucial. All else equal, different methods tend to give quite similar results.
What is much more important is the database of sequences used by these methods. Classification is all about recognizing something. You cannot recognize something you have never seen, and if the database used is full of errors, the algorithm is misled into believing it recognizes something. Thus, in order to improve classification, the emphasis should be put on having the best possible database or training data. We are working on improving the databases for classification, and the taxonomic classification we recommend may change in the future based on this.