bio taxon: show taxonomies

The bio package provides utilities to visualize NCBI taxonomies. See the help with:

bio taxon -h

Download or build the taxonomy

Before using the taxonomy, the related database needs to be downloaded with:

bio --download 

Check database

bio taxon

prints:

Content: 2,359,690 taxon names; 209,096 parent ranks

There are a total of 2,353,384 nodes (taxonomical entries), of which 207,447 are non-terminal (non-leaf) nodes. From these numbers we see that the vast majority of the taxonomy annotations are for terminal (leaf) nodes.
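The proportion mentioned above can be worked out directly. The lines below are only a small arithmetic sketch using the two node counts quoted in the paragraph; the variable names are illustrative and not part of bio.

```python
# Node counts quoted in the paragraph above.
total_nodes = 2_353_384
internal_nodes = 207_447  # non-terminal (non-leaf) nodes

# Everything that is not an internal node is a leaf.
leaf_nodes = total_nodes - internal_nodes
leaf_fraction = leaf_nodes / total_nodes

print(f"leaf nodes: {leaf_nodes:,}")
print(f"leaf fraction: {leaf_fraction:.1%}")
```

With these numbers roughly nine out of ten entries are leaf nodes.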

Searching for taxids

Search the taxonomy for a word:

bio taxon jawed 

prints:

# searching taxonomy for: jawed
clade, Gnathostomata (jawed vertebrates), 7776
species, Gillichthys mirabilis (long-jawed mudsucker), 8222
species, Pseudamia amblyuroptera (white-jawed cardinalfish), 1431476
species, Myctophum brachygnathum (short-jawed lanternfish), 1519985
species, Oryzias orthognathus (sharp-jawed buntingi), 1645897
species, Longjawed orbweaver circular virus 1, 2293294
species, Longjawed orbweaver circular virus 2, 2293295

The search words may use regular expression control characters. For example, to select only names that start with jawed:

bio taxon '^jawed'

produces:

# searching taxonomy for: ^jawed
clade, Gnathostomata (jawed vertebrates), 7776
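The difference between the two searches can be illustrated with Python's re module. This is a minimal sketch, not how bio implements the search: the records are copied from the output above, and storing the scientific and common name as separate fields that are matched independently is an assumption made here so that `^jawed` can match "jawed vertebrates".

```python
import re

# Toy records taken from the search output above:
# (rank, scientific name, common name, taxid).
# Keeping the two names in separate fields is an assumption of this sketch.
RECORDS = [
    ("clade", "Gnathostomata", "jawed vertebrates", 7776),
    ("species", "Gillichthys mirabilis", "long-jawed mudsucker", 8222),
    ("species", "Longjawed orbweaver circular virus 1", "", 2293294),
]


def search(pattern, records=RECORDS):
    """Return records whose scientific or common name matches the pattern."""
    regex = re.compile(pattern)
    return [
        rec for rec in records
        if regex.search(rec[1]) or regex.search(rec[2])
    ]


for word in ("jawed", "^jawed"):
    hits = search(word)
    print(word, "->", [taxid for *_, taxid in hits])
```

The plain word matches anywhere inside a name, while the `^` anchor restricts the match to the start of a name, which is why only taxid 7776 remains.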

View taxonomy for data

Once you fetch the data

bio fetch NC_045512 > genome.gb
    

you can view the lineage of the taxonomic id stored in the GenBank file:

bio taxon genome.gb --lineage

that prints:

superkingdom, 10239, Viruses
  clade, 2559587, Riboviria
    kingdom, 2732396, Orthornavirae
      phylum, 2732408, Pisuviricota
        class, 2732506, Pisoniviricetes
          order, 76804, Nidovirales
            suborder, 2499399, Cornidovirineae
              family, 11118, Coronaviridae
                subfamily, 2501931, Orthocoronavirinae
                  genus, 694002, Betacoronavirus
                    subgenus, 2509511, Sarbecovirus
                      species, 694009, Severe acute respiratory syndrome-related coronavirus
                        no rank, 2697049, Severe acute respiratory syndrome coronavirus 2
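Conceptually, a lineage is the chain of parents from a node up to the root, printed from the top down. The sketch below shows the idea on the last six entries of the output above. The `NODES` dictionary and the helper names are hypothetical; the tree is cut off at the family level to keep the example short.

```python
# taxid -> (rank, name, parent taxid); entries copied from the lineage above.
# The tree is truncated at the family level for this sketch.
NODES = {
    11118: ("family", "Coronaviridae", None),
    2501931: ("subfamily", "Orthocoronavirinae", 11118),
    694002: ("genus", "Betacoronavirus", 2501931),
    2509511: ("subgenus", "Sarbecovirus", 694002),
    694009: ("species", "Severe acute respiratory syndrome-related coronavirus", 2509511),
    2697049: ("no rank", "Severe acute respiratory syndrome coronavirus 2", 694009),
}


def lineage(taxid, nodes=NODES):
    """Follow parent links upward, then return the path from the top down."""
    path = []
    while taxid is not None:
        rank, name, parent = nodes[taxid]
        path.append((rank, taxid, name))
        taxid = parent
    return path[::-1]


def format_lineage(path):
    """Indent each level by two spaces, mirroring the output shown above."""
    return [
        f"{'  ' * depth}{rank}, {taxid}, {name}"
        for depth, (rank, taxid, name) in enumerate(path)
    ]


print("\n".join(format_lineage(lineage(2697049))))
```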

View taxonomy by tax id

Pass an NCBI taxonomic id to see all of its descendants:

bio taxon 117565 | head

prints:

class, 117565, Myxini
  order, 7761, Myxiniformes
    family, 7762, Myxinidae
      subfamily, 30309, Eptatretinae
        genus, 7763, Eptatretus
          species, 7764, Eptatretus burgeri
          species, 7765, Eptatretus stoutii
          species, 7767, Eptatretus okinoseanus
          species, 50612, Eptatretus atami
          species, 78394, Eptatretus cirrhatus

We can limit the depth (how many levels to show):

bio taxon 117565 -d 3

prints:

class, 117565, Myxini
  order, 7761, Myxiniformes
    family, 7762, Myxinidae
    no rank, 727487, unclassified Myxiniformes
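The depth limit can be thought of as a depth-first traversal that stops descending after a fixed number of levels. The following is a minimal sketch on a few nodes from the output above; `CHILDREN`, `INFO` and `render` are hypothetical names, and the real tool may traverse the tree differently.

```python
# Parent -> children links and node labels, copied from the outputs above.
CHILDREN = {
    117565: [7761],
    7761: [7762, 727487],
    7762: [30309],
    30309: [7763],
}
INFO = {
    117565: ("class", "Myxini"),
    7761: ("order", "Myxiniformes"),
    7762: ("family", "Myxinidae"),
    727487: ("no rank", "unclassified Myxiniformes"),
    30309: ("subfamily", "Eptatretinae"),
    7763: ("genus", "Eptatretus"),
}


def render(taxid, max_depth, depth=0):
    """Return indented lines for the subtree, showing at most max_depth levels."""
    if depth >= max_depth:
        return []
    rank, name = INFO[taxid]
    lines = [f"{'  ' * depth}{rank}, {taxid}, {name}"]
    for child in CHILDREN.get(taxid, []):
        lines.extend(render(child, max_depth, depth + 1))
    return lines


print("\n".join(render(117565, max_depth=3)))
```

With `max_depth=3` the subfamily and genus are cut off, which reproduces the four lines printed above.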

Filter results by taxonomic clades

In bioinformatics we frequently have files where a column contains taxonomic ids, and we wish to filter rows by clade. For example, we may want to keep all rows where the taxid belongs to a given subtree of the taxonomy. Here is such a file, data.txt:

11138,AB008939.1
11138,AB008940.1
215681,AB242262.1
215681,AB242263.1
215681,AB242264.1
229992,AB257344.1
391355,AB263618.1
11128,AB277098.1
422141,AB277099.1
11133,AB277100.1

We can summarize the taxonomies in the file with:

cat data.txt | bio taxon -d 1

to print:

no rank,11138,Murine hepatitis virus
no rank,215681,Canine respiratory coronavirus
no rank,229992,SARS coronavirus Frankfurt 1
no rank,391355,SARS coronavirus Frankfurt1-v01
no rank,11128,Bovine coronavirus
no rank,422141,Giraffe coronavirus US/OH3-TC/2006
no rank,11133,Bovine coronavirus strain Quebec

Which data are a kind of “Bovine Coronavirus”?

cat data.txt | bio taxon -d 1 -keep 11128

prints:

11128,AB277098.1
422141,AB277099.1
11133,AB277100.1

Which data are not a kind of “Bovine Coronavirus”?

cat data.txt | bio taxon -d 1 -remove 11128

prints:

11138,AB008939.1
11138,AB008940.1
215681,AB242262.1
215681,AB242263.1
215681,AB242264.1
229992,AB257344.1
391355,AB263618.1
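Both filters come down to one question: is a given taxid inside the subtree rooted at the clade? A minimal sketch of that test follows. The parent links in `PARENT` are assumptions made for illustration only: the two strains are attached directly under 11128, and every other taxid points at a placeholder parent 0, which is not how the real taxonomy is laid out.

```python
# ASSUMED parent links, for illustration only (0 is a placeholder parent).
PARENT = {
    11128: 0,
    422141: 11128,
    11133: 11128,
    11138: 0,
    215681: 0,
    229992: 0,
    391355: 0,
}

ROWS = [
    "11138,AB008939.1", "11138,AB008940.1", "215681,AB242262.1",
    "215681,AB242263.1", "215681,AB242264.1", "229992,AB257344.1",
    "391355,AB263618.1", "11128,AB277098.1", "422141,AB277099.1",
    "11133,AB277100.1",
]


def in_clade(taxid, clade, parent=PARENT):
    """Walk up the parent links; True if the clade is the node or an ancestor."""
    while taxid in parent:
        if taxid == clade:
            return True
        taxid = parent[taxid]
    return taxid == clade


def filter_rows(rows, clade, keep=True):
    """Keep (or remove, when keep=False) rows whose taxid lies in the clade."""
    selected = []
    for row in rows:
        taxid = int(row.split(",")[0])
        if in_clade(taxid, clade) == keep:
            selected.append(row)
    return selected


print(filter_rows(ROWS, 11128))
print(filter_rows(ROWS, 11128, keep=False))
```

Keeping and removing are complements of each other, so together the two results always cover every input row exactly once.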

Chain the actions

cat data.txt | bio taxon -d 1 -remove 11128 | bio taxon -d 1

prints:

no rank,11138,Murine hepatitis virus
no rank,215681,Canine respiratory coronavirus
no rank,229992,SARS coronavirus Frankfurt 1
no rank,391355,SARS coronavirus Frankfurt1-v01

List the content of the database:

bio taxon --list | head

prints:

no rank, root, 1
superkingdom, Bacteria (eubacteria), 2
genus, Azorhizobium, 6
species, Azorhizobium caulinodans, 7
species, Buchnera aphidicola, 9
genus, Cellvibrio, 10
species, Cellulomonas gilvus, 11
genus, Dictyoglomus, 13
species, Dictyoglomus thermophilum, 14
genus, Methylophilus, 16

Note: this command benefits greatly from using --preload. Because the entire database is visited, it finishes in a reasonable time (about 10 seconds) only when the data is first loaded into memory.

bio taxon --list --preload | wc -l

Update the taxonomy

You may build the newest version locally:

bio taxon --build

The command downloads the latest NCBI taxonomy data and builds a new taxonomy from it. The run time depends on the speed of the hard drive; expect around 20 minutes.

Preloading data

For many use cases, the default behavior is plenty fast and can produce family, genus and species level information in a fraction of a second.

Internally, during operation, the software will query the database for each child node. When selecting a rank where the number of descendant nodes is large (over 10,000 nodes) the run time of the independent queries adds up to a substantial overhead.

For example the command below attempts to render the complete NCBI taxonomic tree with over 2.2 million (!) descendant nodes. When run like so it will take a very long time to produce the output (more than two hours):

bio taxon 1  | head

The data starts scrolling immediately, but the individual queries add up to potentially hours of runtime.

The software can operate in a different mode that speeds up the process by preloading all the data into memory, at the cost of an 8 second preloading penalty.

bio taxon 1 --preload | wc -l

When run with the --preload flag the command takes a total of just 11 seconds to generate the complete NCBI taxonomic tree. We don’t apply this mode by default because every query would then pay the preloading penalty, even those that currently finish very quickly.

For queries that take more than 10 seconds to complete (have more than 10,000 descendant nodes) we recommend applying the --preload flag.
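The trade-off between the two modes can be sketched with a toy database. The example below is not bio's implementation and says nothing about its actual storage format: it uses an in-memory sqlite3 table as a stand-in, with hypothetical function names, to contrast one query per visited node with a single pass that loads every row into a dictionary.

```python
import sqlite3


def build_db(edges):
    """Create a stand-in database of (taxid, parent) rows."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE nodes (taxid INTEGER PRIMARY KEY, parent INTEGER)")
    db.executemany("INSERT INTO nodes VALUES (?, ?)", edges)
    db.execute("CREATE INDEX idx_parent ON nodes (parent)")
    return db


def descendants_by_query(db, root):
    """Default-style mode: one database query per visited node."""
    found, stack = [], [root]
    while stack:
        node = stack.pop()
        rows = db.execute(
            "SELECT taxid FROM nodes WHERE parent = ?", (node,)
        ).fetchall()
        for (child,) in rows:
            found.append(child)
            stack.append(child)
    return found


def descendants_preloaded(db, root):
    """Preload-style mode: read the table once, then use in-memory lookups."""
    children = {}
    for taxid, parent in db.execute("SELECT taxid, parent FROM nodes"):
        children.setdefault(parent, []).append(taxid)
    found, stack = [], [root]
    while stack:
        node = stack.pop()
        for child in children.get(node, []):
            found.append(child)
            stack.append(child)
    return found


# Toy tree: node i has parent i // 2, so node 1 is the root.
EDGES = [(i, i // 2) for i in range(2, 1000)]
db = build_db(EDGES)
by_query = descendants_by_query(db, 1)
preloaded = descendants_preloaded(db, 1)
print(len(by_query), len(preloaded), sorted(by_query) == sorted(preloaded))
```

Both functions return the same descendants. The first pays a small cost at every node, which adds up on large subtrees, while the second pays a fixed up-front cost that only makes sense when many nodes are visited.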

Build the newest version locally

You may build the newest version locally. Depending on the speed of the hard drive, this may take up to about an hour:

bio taxon --build