HumGut: A comprehensive Human Gut prokaryotic genomes collection filtered by metagenome data

 

Pre-print at bioRxiv:

 

From this site you can download the HumGut genome collection and accompanying metadata. As described in our paper, genomes encountered in healthy human guts worldwide were clustered at 97.5% whole-genome sequence identity. The genomes representing the clusters, 30 691 in total, were retained as HumGut.

 

HumGut.tar.gz

This compressed archive contains the (compressed) FASTA files for all 30 691 HumGut genomes. It is roughly 18 GB.
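The archive can be inspected without unpacking all of it to disk. The Python sketch below walks through a tar.gz archive and, for each member, counts FASTA records and bases. It assumes that each member is a gzip-compressed FASTA file, which is our reading of "(compressed)" above; the function name genome_stats and the tiny demo archive are made up for illustration.

```python
import gzip
import io
import os
import tarfile
import tempfile


def genome_stats(archive_path):
    """Yield (member name, number of sequences, total bases) for each file.

    Assumption: every regular file in the archive is a gzip-compressed FASTA.
    """
    with tarfile.open(archive_path, "r:gz") as tar:
        for member in tar:
            if not member.isfile():
                continue  # skip directories and other non-file entries
            raw = tar.extractfile(member)
            n_seqs = 0
            n_bases = 0
            with gzip.open(raw, "rt") as fasta:
                for line in fasta:
                    if line.startswith(">"):
                        n_seqs += 1  # header line starts a new record
                    else:
                        n_bases += len(line.strip())
            yield member.name, n_seqs, n_bases


# Build a tiny stand-in archive so the sketch runs without the real download.
payload = gzip.compress(b">contig1\nACGT\nAC\n>contig2\nGG\n")
info = tarfile.TarInfo(name="demo_genome.fna.gz")  # made-up file name
info.size = len(payload)
demo_path = os.path.join(tempfile.mkdtemp(), "demo.tar.gz")
with tarfile.open(demo_path, "w:gz") as tar:
    tar.addfile(info, io.BytesIO(payload))

demo_stats = list(genome_stats(demo_path))
print(demo_stats)
```

To try this on the real download, pass the path of HumGut.tar.gz to genome_stats instead of the demo archive. Expect a full pass over roughly 18 GB to take a while.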

 

HumGut.tsv

This tab-separated text file lists metadata for each HumGut genome. Because each HumGut genome is the representative of a cluster of genomes, the descriptions below refer to clusters. All genomes in all clusters are described in the table All_genomes.tsv further down. The column names are mostly self-explanatory, but here is a short description of each:

HumGut_name

Unique HumGut name for each cluster (97.5%).

HumGut_tax_id

Unique HumGut ID for each cluster (97.5%). These are integers from 3 000 000 and up, chosen so that they do not collide with the integers used by the NCBI Taxonomy database.

cluster975

The highest resolution cluster this genome belongs to (97.5% sequence identity)

cluster95

The coarser resolution cluster this genome belongs to (95% sequence identity)

gtdbtk_tax_id

GTDB-tk genome taxonomy ID. Note that the GTDB database (https://gtdb.ecogenomic.org/) has no such integer identifiers; we created them artificially for this collection. They are required for building kraken2/bracken/krakenUniq databases using this taxonomy. These integers are from 4 000 000 and up, chosen so that they do not collide with the NCBI Taxonomy integers or the HumGut_tax_id mentioned above.

gtdbtk_organism_name

Genome name as given by GTDB-tk

gtdbtk_taxonomy

The full GTDB-tk taxonomy, from domain and down

ncbi_tax_id

The taxonomy ID from the NCBI Taxonomy database (https://www.ncbi.nlm.nih.gov/taxonomy/)

ncbi_organism_name

Genome name at the NCBI Taxonomy database

ncbi_rank

The rank at the NCBI Taxonomy database

prevalence_score

The average sequence identity with 3 534 healthy human gut metagenomes.

metagenomes_present

The number of metagenomes where the genome was found by screening, using ≥ 95% sequence identity as a threshold.

completeness

The estimated completeness (%) of the genome.

contamination

The estimated contamination (%) of the genome.

GC

Genome GC content.

genome_size

Number of base pairs in the genome.

source

Either RefSeq (https://ftp.ncbi.nlm.nih.gov/genomes/refseq/) or UHGG (https://www.ebi.ac.uk/metagenomics/)

genome_type

The completion level as listed in RefSeq, or MAG for all UHGG genomes.

cluster975_size

Number of genomes in the same cluster of the highest resolution (97.5%).

cluster95_size

Number of genomes in the same coarse cluster (95%).

genome_file

The name of the FASTA file in the archive HumGut.tar.gz described above.

ftp_download

The FTP address from which we downloaded the genome.
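As a rough illustration of how this table might be used, the Python sketch below reads a tab-separated table with the column names described above, keeps genomes that pass a completeness and a contamination threshold, and orders them by prevalence_score. The thresholds (90 and 5), the function names, and the demo rows are assumptions made for illustration; they are not values or names taken from HumGut.

```python
import csv
import io


def read_table(handle):
    """Read a tab-separated table with a header row into a list of dicts."""
    return list(csv.DictReader(handle, delimiter="\t"))


def select_genomes(rows, min_completeness=90.0, max_contamination=5.0):
    """Keep rows passing both thresholds, most prevalent first.

    The default thresholds are illustrative assumptions only.
    """
    kept = [
        row for row in rows
        if float(row["completeness"]) >= min_completeness
        and float(row["contamination"]) <= max_contamination
    ]
    return sorted(kept, key=lambda row: float(row["prevalence_score"]), reverse=True)


# Made-up rows using a few of the column names listed above.
demo_table = io.StringIO(
    "HumGut_name\tcompleteness\tcontamination\tprevalence_score\n"
    "demo_A\t99.1\t0.5\t88.2\n"
    "demo_B\t75.0\t1.0\t95.0\n"
    "demo_C\t95.0\t2.0\t91.5\n"
    "demo_D\t98.0\t9.0\t97.0\n"
)
demo_selected = select_genomes(read_table(demo_table))
print([row["HumGut_name"] for row in demo_selected])
```

With the real file, open HumGut.tsv in text mode and pass the file handle to read_table. The genome_file column of the selected rows then gives the names of the FASTA files to look for in HumGut.tar.gz.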

 

 

All_genomes.tsv

This tab-separated text file lists metadata for all 381 779 genomes that were clustered to obtain the 30 691 clusters that make up the HumGut collection. Its columns are a subset of those in the table above; see the column descriptions there. Note that we do not provide the FASTA files for all these genomes at this website. Use the FTP addresses in the column ftp_download if you want to download them.
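One possible way to prepare such downloads is to collect the FTP addresses per cluster first. The Python sketch below groups the ftp_download entries by cluster. It assumes that All_genomes.tsv contains the cluster975 column in addition to ftp_download; check the header of the file, since only ftp_download is confirmed above. The function name and the demo addresses (on the reserved example.org domain) are made up.

```python
import csv
import io
from collections import defaultdict


def ftp_by_cluster(handle, cluster_column="cluster975"):
    """Map each cluster ID to the list of FTP addresses of its genomes.

    Assumption: the table has both `cluster_column` and `ftp_download`.
    """
    groups = defaultdict(list)
    for row in csv.DictReader(handle, delimiter="\t"):
        groups[row[cluster_column]].append(row["ftp_download"])
    return dict(groups)


# Made-up rows; the addresses are placeholders, not real genome locations.
demo_all = io.StringIO(
    "cluster975\tftp_download\n"
    "1\tftp://example.org/genome_1.fna.gz\n"
    "2\tftp://example.org/genome_2.fna.gz\n"
    "1\tftp://example.org/genome_3.fna.gz\n"
)
demo_groups = ftp_by_cluster(demo_all)
print(demo_groups["1"])
```

The resulting lists can be written to a text file and handed to whichever download tool you prefer.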

 

Databases for taxonomic classification

If you want to build a custom kraken2/bracken/krakenUniq database from our HumGut genomes, please visit our GitHub site at https://github.com/larssnip/HumGut.

Here we supply the files needed to describe the taxonomy-tree, using either the GTDB or the NCBI taxonomy:

gtdb_names.dmp

gtdb_nodes.dmp

Together, these two files describe the taxonomy-tree for the HumGut genomes when using the GTDB taxonomy. They also include the human genome and its branch from the NCBI Taxonomy. Each HumGut genome is a node with the HumGut_tax_id listed in the table HumGut.tsv above, and its parent is the node with the corresponding gtdbtk_tax_id.

ncbi_names.dmp

ncbi_nodes.dmp

Together, these two files describe the taxonomy-tree for the HumGut genomes when using the NCBI taxonomy. They also include the human genome and its branch from the NCBI Taxonomy. Each HumGut genome is a node with the HumGut_tax_id listed in the table HumGut.tsv above, and its parent is the node with the corresponding ncbi_tax_id.
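To illustrate the parent relationship described above, the Python sketch below reads a nodes/names pair and walks from a HumGut_tax_id up to the root. It assumes the files follow the usual NCBI taxonomy dump layout: fields separated by tab, pipe, tab; nodes lines starting with tax ID, parent tax ID and rank; names lines holding tax ID, name, unique name and name class. We have not verified this against the HumGut files. The function names, the demo IDs, names and ranks are made up.

```python
import io


def parse_dmp_line(line):
    """Split one NCBI-style .dmp line into its fields."""
    line = line.rstrip("\n")
    if line.endswith("\t|"):
        line = line[:-2]  # drop the trailing field terminator
    return line.split("\t|\t")


def read_nodes(handle):
    """Return {tax_id: (parent_tax_id, rank)}."""
    nodes = {}
    for line in handle:
        if line.strip():
            fields = parse_dmp_line(line)
            nodes[int(fields[0])] = (int(fields[1]), fields[2])
    return nodes


def read_names(handle):
    """Return {tax_id: name}, preferring the 'scientific name' entry."""
    names = {}
    for line in handle:
        if not line.strip():
            continue
        fields = parse_dmp_line(line)
        tax_id, name = int(fields[0]), fields[1]
        name_class = fields[3] if len(fields) > 3 else ""
        if name_class == "scientific name" or tax_id not in names:
            names[tax_id] = name
    return names


def lineage(tax_id, nodes, names):
    """List (tax_id, rank, name) from the given node up to the root."""
    path, seen = [], set()
    while tax_id in nodes and tax_id not in seen:  # root points to itself
        seen.add(tax_id)
        parent, rank = nodes[tax_id]
        path.append((tax_id, rank, names.get(tax_id, "")))
        tax_id = parent
    return path


# Made-up miniature tree: root -> 4000001 (taxonomy node) -> 3000001 (genome).
demo_nodes = read_nodes(io.StringIO(
    "1\t|\t1\t|\tno rank\t|\n"
    "4000001\t|\t1\t|\tspecies\t|\n"
    "3000001\t|\t4000001\t|\tno rank\t|\n"
))
demo_names = read_names(io.StringIO(
    "1\t|\troot\t|\t\t|\tscientific name\t|\n"
    "4000001\t|\tdemo species\t|\t\t|\tscientific name\t|\n"
    "3000001\t|\tdemo genome\t|\t\t|\tscientific name\t|\n"
))
demo_lineage = lineage(3000001, demo_nodes, demo_names)
print(demo_lineage)
```

The same functions should work for either pair of files, gtdb_* or ncbi_*, as long as the layout assumption holds.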