TAMER: taxonomic assignment of metagenomic sequence reads

 

Here we describe the details and provide the blast and R codes for TAMER algorithm as described in the following paper:

 

Jiang H, An L, Lin SM, Feng G, Qiu Y (2012) A Statistical Framework for Accurate Taxonomic Assignment of Metagenomic Sequencing Reads. PLoS ONE 7(10): e46450. doi:10.1371/journal.pone.0046450

 

 

0.      Prepare the data.

The proposed method identifies multiple genomes based on the hits obtained by aligning sequence reads to reference sequences catalogued in databases. The input file for the R codes is a text file containing the following columns:

(1) read name,

(2) NCBI sequence identifier to which the read find a match,

(3) taxon ID,

(4) scientific name of the taxon,

(5) number of identical matches,

(6) read length, and

(7) alignment length.

 

The test data, i.e., the input file for the R codes of our method is the output from running blastn on the reads. The test data online-454.d6c4961d.blast.outputcan be downloaded by clicking on the file name.

 

Note:

1)      The raw reads data in fasta format can be downloaded here.

2)      To get the test data from the raw reads, here are our instructions. To speed up the mapping gid and taxon ID and scientific name, we also provide c codes count.c here.

 

 

There are two R functions which should be run sequentially for the TAMER algorithm:

 

1.   Process the data.

In the paper, we store the matching score (number of identical matches) in the M matrix, where the rows represent the reads and the columns represent the genomes. As M is a sparse matrix and the number of reads is often large, it takes too much memory when M is used. In practice, we only need to store the non-zero matching scores in a matrix, where the first column is a read ID, the second column is the taxon ID, and the third column is the non-zero matching score. For the convenience of computing, the read ID and the taxon ID are presented using integer numbers, and we need to record the mapping between read ID and read name, and between taxon ID and taxon name.

 

R function: Preprocess.R

 

2.   Identify the multiple genomes and estimate number of reads generated by each genome.

 

R function: TAMER.R

 

3.   Demonstration using an example.

In the following, we show how to use the above R codes for a test data which contains 10,000 reads with average read length of 100bp for the high complexity (simHC) simulated using Metasim. The following are the summary of the true numbers of reads generated from each genome:

 

632 Reads: `Agrobacterium tumefaciens str. C58 chromosome circular'

1422 Reads: `Anabaena variabilis ATCC 29413 chromosome'

513 Reads: `Archaeoglobus fulgidus DSM 4304'

828 Reads: `Bdellovibrio bacteriovorus HD100'

372 Reads: `Campylobacter jejuni subsp. jejuni 81-176'

967 Reads: `Clostridium acetobutylicum ATCC 824'

579 Reads: `Lactococcus lactis subsp. cremoris SK11'

629 Reads: `Nitrosomonas europaea ATCC 19718'

1545 Reads: `Pseudomonas aeruginosa PA7'

1937 Reads: `Streptomyces coelicolor A3(2) chromosome'

576 Reads: `Sulfolobus tokodaii str. 7 chromosome'

 

 

Example codes:

 

>Example.blast.output <-read.delim("online-454.d6c4961d.blast.output", header=F)

> source("Preprocess.R")

> source("TAMER.R")

 

>step1.output<-Preprocess(Example.blast.output)

>step2.output<-RAMS(step1.output)

 

>step2.output$R.est ## check the estimation of the proportions for all species

>step2.output$p.est ## check the estimation of the error

 

> write.csv(step2.output$Final, file=myresult.csv, header=TRUE)

> step2.output$Final

## Number of reads assigned to each taxon is listed in the column Mixture

TaxID Mixture BLAST Description

120 1358 1 63 Lactococcus lactis

599 77133 6 38 uncultured bacterium

668 100226 1873 1874 Streptomyces coelicolor A3(2)

864 176299 617 617 Agrobacterium tumefaciens str. C58

982 224325 498 498 Archaeoglobus fulgidus DSM 4304

1001 228410 607 608 Nitrosomonas europaea ATCC 19718

1020 240292 1396 1398 Anabaena variabilis ATCC 29413

1080 264462 806 806 Bdellovibrio bacteriovorus HD100

1141 272562 937 937 Clostridium acetobutylicum ATCC 824

1145 272622 566 566 Lactococcus lactis subsp. cremoris SK11

1155 273063 565 565 Sulfolobus tokodaii str. 7

1375 354242 346 347 Campylobacter jejuni subsp. jejuni 81-176

1443 381754 1507 1511 Pseudomonas aeruginosa PA7

1931 557722 1 805 Pseudomonas aeruginosa LESB58

2365 757425 1 304 Campylobacter jejuni subsp. jejuni ICDCCJ07001 

    

4.   Simulation datasets used in the paper can be downloaded here.

simLC 100bp

simMC 100bp

simHC 100bp

simMC 400bp