Pre-print at bioRxiv:
From this site you can download the HumGut genome collection and accompanying metadata. As described in our paper, genomes encountered in healthy human guts worldwide were clustered by whole-genome identity (97.5%). The genomes representing the clusters, 30 691 in total, were retained as HumGut.
This compressed archive (HumGut.tar.gz) contains the (compressed) FASTA files for all 30 691 HumGut genomes. It is roughly 18 GB.
Here is the table (tab-separated text file) listing metadata about each HumGut genome. The descriptions below refer to clusters, since each HumGut genome is the representative of a cluster of genomes. All genomes in these clusters are described in the next table, listed further below. The column names are mostly self-explanatory, but here is a short description anyway:
HumGut_name | Unique HumGut name for each cluster (97.5%). |
HumGut_tax_id | Unique HumGut ID for each cluster (97.5%). These are integers from 3 000 000 and up, a choice made to not interfere with the NCBI Taxonomy database integers. |
cluster975 | The highest resolution cluster this genome belongs to (97.5% sequence identity). |
cluster95 | The coarser resolution cluster this genome belongs to (95% sequence identity). |
gtdbtk_tax_id | GTDB-tk genome taxonomy ID. Note that the GTDB database (https://gtdb.ecogenomic.org/) has no such integer identifiers, and we have just artificially created these here. This is required for building kraken2/bracken/krakenUniq databases using this taxonomy. These integers are from 4 000 000 and up, a choice made to not interfere with the NCBI Taxonomy or the HumGut_tax_id mentioned above. |
gtdbtk_organism_name | Genome name as given by GTDB-tk. |
gtdbtk_taxonomy | The full GTDB-tk taxonomy, from domain and down. |
ncbi_tax_id | The taxonomy ID from the NCBI Taxonomy database (https://www.ncbi.nlm.nih.gov/taxonomy/). |
ncbi_organism_name | Genome name at the NCBI Taxonomy database. |
ncbi_rank | The rank at the NCBI Taxonomy database. |
prevalence_score | The average sequence identity with 3,534 healthy human gut metagenomes. |
metagenomes_present | The number of metagenomes where the genome was found by screening, using ≥ 95% sequence identity as a threshold. |
completeness | The estimated completeness (%) of the genome. |
contamination | The estimated contamination (%) of the genome. |
GC | Genome GC content. |
genome_size | Number of base pairs in the genome. |
source | Either RefSeq (https://ftp.ncbi.nlm.nih.gov/genomes/refseq/) or UHGG (https://www.ebi.ac.uk/metagenomics/). |
genome_type | The completion level as listed in RefSeq, or MAG (for all UHGG genomes). |
cluster975_size | Number of genomes in the same cluster of the highest resolution (97.5%). |
cluster95_size | Number of genomes in the same coarse cluster (95%). |
genome_file | The name of the FASTA file in the archive HumGut.tar.gz described above. |
ftp_download | The ftp address from where we downloaded the genome. |
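The columns described above can be read with any tab-separated parser. The following is a minimal sketch in Python, assuming the table has been saved locally as HumGut.tsv with a header row matching the column names listed above, and that completeness and contamination hold a number in every row. The function names and the default thresholds are illustrative assumptions, not part of the HumGut distribution.

```python
import csv


def load_humgut_table(path):
    """Read a tab-separated HumGut metadata table into a list of dicts."""
    with open(path, newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle, delimiter="\t"))


def filter_by_quality(rows, min_completeness=90.0, max_contamination=5.0):
    """Keep rows whose completeness and contamination (both in %) pass the thresholds.

    The default thresholds are illustrative assumptions, not HumGut recommendations.
    """
    return [
        row
        for row in rows
        if float(row["completeness"]) >= min_completeness
        and float(row["contamination"]) <= max_contamination
    ]
```

For example, `filter_by_quality(load_humgut_table("HumGut.tsv"))` would return only the rows passing both thresholds; each row is a dict keyed by column name, so `row["genome_file"]` gives the matching FASTA file name in the archive.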
This is the table (tab-separated text file) listing metadata about all genomes used for obtaining the 30 691 clusters that make up the HumGut collection. This huge table lists metadata about 381 779 genomes. The columns are a subset of those in the above table; see the above description of column names. Note that we do not provide the FASTA files for all these genomes at this website. Use the FTP addresses in the column ftp_download if you want to download them.
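Collecting the download addresses from this table could look like the sketch below. It assumes only that the table is tab-separated with a header row containing the ftp_download column; the function names and the idea of writing the addresses to a plain list file (one per line, for use with a download tool of your choice) are illustrative assumptions.

```python
import csv


def ftp_addresses(table_path, column="ftp_download"):
    """Yield the non-empty download addresses from a tab-separated genome table."""
    with open(table_path, newline="", encoding="utf-8") as handle:
        for row in csv.DictReader(handle, delimiter="\t"):
            address = (row.get(column) or "").strip()
            if address:
                yield address


def write_url_list(table_path, out_path):
    """Write one address per line to out_path and return how many were written."""
    count = 0
    with open(out_path, "w", encoding="utf-8") as out:
        for address in ftp_addresses(table_path):
            out.write(address + "\n")
            count += 1
    return count
```

Because the table is read row by row, the full set of 381 779 rows never has to be held in memory at once.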
If you want to build a custom kraken2/bracken/krakenUniq database from our HumGut genomes, please visit our GitHub site at https://github.com/larssnip/HumGut.
Here we supply the files needed to describe the taxonomy-tree, using either the GTDB or the NCBI taxonomy:
Together, these two files describe the taxonomy tree for the HumGut genomes when using the GTDB taxonomy. They also include the human genome, and its branch, in the NCBI Taxonomy. Each HumGut genome has the HumGut_tax_id listed in the table HumGut.tsv above, with the corresponding gtdbtk_tax_id as its parent. |
|
Together, these two files describe the taxonomy tree for the HumGut genomes when using the NCBI taxonomy. They also include the human genome, and its branch, in the NCBI Taxonomy. Each HumGut genome has the HumGut_tax_id listed in the table HumGut.tsv above, with the corresponding ncbi_tax_id as its parent. |
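The leaf-to-parent relationship described above can be recovered directly from HumGut.tsv. The sketch below is only an illustration of that relationship, assuming a local copy of the table with integer values in the ID columns; it does not reproduce the supplied taxonomy files, which also contain the higher levels of the tree. The function name is hypothetical.

```python
import csv


def leaf_parent_edges(table_path, parent_column="gtdbtk_tax_id"):
    """Map each HumGut_tax_id (a leaf of the tree) to its parent taxonomy ID.

    Use parent_column="gtdbtk_tax_id" for the GTDB tree and
    parent_column="ncbi_tax_id" for the NCBI tree.
    """
    if parent_column not in ("gtdbtk_tax_id", "ncbi_tax_id"):
        raise ValueError("parent_column must be 'gtdbtk_tax_id' or 'ncbi_tax_id'")
    edges = {}
    with open(table_path, newline="", encoding="utf-8") as handle:
        for row in csv.DictReader(handle, delimiter="\t"):
            edges[int(row["HumGut_tax_id"])] = int(row[parent_column])
    return edges
```

Several HumGut genomes may share the same parent, so the returned dict maps many leaves to comparatively few parent IDs.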