Here are some ‘pipelines’ and recommendations concerning the taxonomic classification of 16S sequences, typically obtained from amplicon sequencing. Regardless of how you processed your amplicon data, you end up with a set of sequences, the centroids, each representing a group of reads. We often then want to assign these centroids to some taxonomy, if possible. This is what we refer to as taxonomic classification here.
The scripts and data we refer to here can be copied from the folder
/mnt/users/larssn/MIDIV/Amplicon_sequencing/Taxonomy/
when you are on orion.
You must always copy the code to your own folder first, and then you may edit it. Think of the code here as a template that you give your personal touch once you have your own copy.
It is also recommended that you copy all the apptainer containers you use (in
/mnt/users/larssn/MIDIV/Amplicon_sequencing/apptainer/
) to your own project folder, and edit all paths to use these local copies. In this way your code is reproducible forever after. Note that in the folder mentioned above, the container versions will change over time.
In the shell script TAXONOMY_SINTAX.sh we use the SINTAX method implemented in the VSEARCH software. It runs very fast, and the results are fairly good, giving a confidence score for each assignment that is not totally crap. The drawback of this method is that it has never been properly published, and is only available at bioRxiv (https://www.biorxiv.org/content/10.1101/074161v1).
The Settings section looks like this:
centroids_file=/mnt/users/larssn/MIDIV/Amplicon_sequencing/Taxonomy/centroids.fasta
out_file=sintax_GTDB.tsv
database_file=/mnt/users/larssn/MIDIV/Amplicon_sequencing/Taxonomy/DBFILE_SINTAX_GTDB_reps_r220.fasta
# database_file=/mnt/users/larssn/MIDIV/Amplicon_sequencing/Taxonomy/DBFILE_SINTAX_GSR-DB_full-16S.fasta
# database_file=/mnt/users/larssn/MIDIV/Amplicon_sequencing/Taxonomy/DBFILE_SINTAX_EzBioCloud.fasta
# database_file=/mnt/users/larssn/MIDIV/Amplicon_sequencing/Taxonomy/DBFILE_SINTAX_RDP_16s_v18.fasta
vsearch_app="apptainer exec /mnt/users/larssn/MIDIV/Amplicon_sequencing/apptainer/vsearch:2.28.1--h6a68c12_0.sif"
You need to edit the file name assigned to centroids_file; this is your input fasta file. Also, out_file must be given a relevant file name. This is where the SINTAX output will end up, as a simple text file.
SINTAX also needs a database_file, which is also a fasta file, but where the header has the required format containing the taxonomy information. There are several files to choose from; they all start with DBFILE_SINTAX_. We recommend using the GTDB taxonomy these days, mainly because gtdbtk, the tool typically used for assigning taxonomy to MAGs, also uses the GTDB taxonomy. The GSR-DB database is also a nice initiative (https://journals.asm.org/doi/10.1128/msystems.00950-23), but uses the NCBI Taxonomy. The RDP file we supply is just for old times' sake, and will be discontinued.
You run this on orion in the usual way:
sbatch TAXONOMY_SINTAX.sh
The job is named sintax. There should be no need to edit the SLURM header of the script. Note that this only uses 1 thread, but it should still take only a short time (a few minutes) as long as you use one of the supplied database files. A larger database means a longer running time.
As mentioned above, SINTAX outputs a ‘confidence’ score for each classification. From the output file you can filter on this score in R, trying out various thresholds without re-running SINTAX. Here is how you may do this.
We have run the code displayed above, and got a file named sintax_GTDB.tsv. From this we want to set to "unclassified" all OTUs with a genus_score below, say, 0.8:
library(tidyverse)
taxonomy.tbl <- read_tsv("sintax_GTDB.tsv") %>%
  mutate(genus = if_else(genus_score < 0.8, "unclassified", genus))
By adding similar mutate()
statements you can do the
same for the other ranks as well. In this way you can very quickly use
different confidence thresholds without re-running the classification
(which may take some time if the database file is large).
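For example, the same threshold could be applied at several ranks at once. A minimal sketch, where the exact rank and *_score column names are assumptions that depend on the database file you used:

```r
library(tidyverse)

# Assumed column names (order, family, genus and their *_score columns);
# check the header of your own output file and adjust accordingly
taxonomy.tbl <- read_tsv("sintax_GTDB.tsv") %>%
  mutate(order  = if_else(order_score  < 0.8, "unclassified", order),
         family = if_else(family_score < 0.8, "unclassified", family),
         genus  = if_else(genus_score  < 0.8, "unclassified", genus))
```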
In the R package dada2 there is an implementation of the classical RDP-classifier method, published ages ago (https://journals.asm.org/doi/full/10.1128/aem.00062-07). The R function for this is named assignTaxonomy(), but here we refer to this method as DADATAX. It is (much) slower than SINTAX (as usual for dada2), but the results are usually of a similar quality. It also outputs some confidence scores, but these are based on a bootstrap idea that is strange, to say the least. You are hereby warned!
The shell script TAXONOMY_DADATAX.sh is all you need. It is actually all R code, but we put it into a shell script just to make it similar to TAXONOMY_SINTAX.sh.
You run this on orion in the usual way:
sbatch TAXONOMY_DADATAX.sh
The job is named dadatax.
As with SINTAX above, you need to specify the same input and output files. We have made DADATAX-versions of the same database files as for SINTAX (not the RDP though), and you may therefore use both methods on the same centroids, using the same database, and compare the results you get.
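One way to compare is to join the two result tables and count agreements. A minimal sketch, assuming both output files have an OTU column identifying the centroids and a genus column, and where the file name dadatax_GTDB.tsv is a hypothetical example; use the out_file names you chose in the two scripts and the actual column names in your files:

```r
library(tidyverse)

# Hypothetical file names; replace with your own out_file names
sintax.tbl  <- read_tsv("sintax_GTDB.tsv")
dadatax.tbl <- read_tsv("dadatax_GTDB.tsv")

# Join on the OTU column (an assumed column name) and compare genus assignments
comparison.tbl <- sintax.tbl %>%
  select(OTU, sintax_genus = genus) %>%
  left_join(select(dadatax.tbl, OTU, dadatax_genus = genus), by = "OTU")

# Fraction of centroids where the two methods agree at genus rank
mean(comparison.tbl$sintax_genus == comparison.tbl$dadatax_genus, na.rm = TRUE)
```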
As mentioned, a confidence (bootstrap) score is also output here. We convert it to values between 0.0 and 1.0 instead of percentages, and you may filter the results in R in exactly the same way as described for SINTAX above.
We may implement other solutions to the taxonomy assignment in the future.
The choice of algorithm is often not crucial. All else equal, different methods tend to give quite similar results.
What is much more important is the database of sequences used by these methods. Classification is all about recognizing something. You cannot recognize something you have never seen, and if the database used is full of errors, the algorithm is misled into believing it recognizes something. Thus, in order to improve classification, the emphasis should be put on having the best possible database or training data. We are working on improving the databases for classification, and the taxonomic classification we recommend may change in the future based on this.