TAMER: taxonomic assignment of metagenomic sequence reads

 

Here we describe the details and provide the blast and R codes for TAMER algorithm as described in the following paper:

 

Jiang H, An L, Lin SM, Feng G, Qiu Y (2012) A Statistical Framework for Accurate Taxonomic Assignment of Metagenomic Sequencing Reads. PLoS ONE 7(10): e46450. doi:10.1371/journal.pone.0046450

 

 

0.      Prepare the data.

The proposed method identifies multiple genomes based on the hits obtained by aligning sequence reads to reference sequences catalogued in databases. The input file for the R codes is a text file containing the following columns:

(1) read name,

(2) NCBI sequence identifier to which the read find a match,

(3) taxon ID,

(4) scientific name of the taxon,

(5) number of identical matches,

(6) read length, and

(7) alignment length.

 

The test data, i.e., the input file for the R codes of our method is the output from running blastn on the reads. The test data “online-454.d6c4961d.blast.outputcan be downloaded by clicking on the file name.

 

Note:

1)      The raw reads data in fasta format can be downloaded here.

2)      To get the test data from the raw reads, here are our instructions. To speed up the mapping gid and taxon ID and scientific name, we also provide c codes count.c here.

 

 

There are two R functions which should be run sequentially for the TAMER algorithm:

 

1.   Process the data.

In the paper, we store the matching score (number of identical matches) in the M matrix, where the rows represent the reads and the columns represent the genomes. As M is a sparse matrix and the number of reads is often large, it takes too much memory when M is used. In practice, we only need to store the non-zero matching scores in a matrix, where the first column is a read ID, the second column is the taxon ID, and the third column is the non-zero matching score. For the convenience of computing, the read ID and the taxon ID are presented using integer numbers, and we need to record the mapping between read ID and read name, and between taxon ID and taxon name.

 

R function: Preprocess.R

 

2.   Identify the multiple genomes and estimate number of reads generated by each genome.

 

R function: TAMER.R

 

3.   Demonstration using an example.

In the following, we show how to use the above R codes for a test data which contains 10,000 reads with average read length of 100bp for the high complexity (simHC) simulated using Metasim. The following are the summary of the true numbers of reads generated from each genome:

 

       632 Reads: `Agrobacterium tumefaciens str. C58 chromosome circular'

      1422 Reads: `Anabaena variabilis ATCC 29413 chromosome'

       513 Reads: `Archaeoglobus fulgidus DSM 4304'

       828 Reads: `Bdellovibrio bacteriovorus HD100'

       372 Reads: `Campylobacter jejuni subsp. jejuni 81-176'

       967 Reads: `Clostridium acetobutylicum ATCC 824'

       579 Reads: `Lactococcus lactis subsp. cremoris SK11'

       629 Reads: `Nitrosomonas europaea ATCC 19718'

      1545 Reads: `Pseudomonas aeruginosa PA7'

      1937 Reads: `Streptomyces coelicolor A3(2) chromosome'

       576 Reads: `Sulfolobus tokodaii str. 7 chromosome'

 

 

Example codes:

 

>Example.blast.output <-read.delim("online-454.d6c4961d.blast.output", header=F)

 

> source("Preprocess.R")

> source("TAMER.R")

 

>step1.output<-Preprocess(Example.blast.output)

>step2.output<-RAMS(step1.output)

 

>step2.output$R.est  ## check the estimation of the proportions for all species

>step2.output$p.est  ## check the estimation of the error

 

> write.csv(step2.output$Final, file=myresult.csv, header=TRUE)

> step2.output$Final

## Number of reads assigned to each taxon is listed in the column Mixture

      TaxID Mixture BLAST                                    Description

120    1358       1    63                             Lactococcus lactis

599   77133       6    38                           uncultured bacterium

668  100226    1873  1874                  Streptomyces coelicolor A3(2)

864  176299     617   617             Agrobacterium tumefaciens str. C58

982  224325     498   498                Archaeoglobus fulgidus DSM 4304

1001 228410     607   608               Nitrosomonas europaea ATCC 19718

1020 240292    1396  1398                 Anabaena variabilis ATCC 29413

1080 264462     806   806               Bdellovibrio bacteriovorus HD100

1141 272562     937   937            Clostridium acetobutylicum ATCC 824

1145 272622     566   566        Lactococcus lactis subsp. cremoris SK11

1155 273063     565   565                     Sulfolobus tokodaii str. 7

1375 354242     346   347      Campylobacter jejuni subsp. jejuni 81-176

1443 381754    1507  1511                     Pseudomonas aeruginosa PA7

1931 557722       1   805                  Pseudomonas aeruginosa LESB58

2365 757425       1   304 Campylobacter jejuni subsp. jejuni ICDCCJ07001 

    

 

4.   Simulation datasets used in the paper can be downloaded here.

            simLC 100bp

            simMC 100bp

            simHC 100bp

            simMC 400bp