TAMER: taxonomic assignment of metagenomic sequence reads
Here we describe the details and provide the blast and R codes for TAMER algorithm as described in the following paper:
Jiang H, An L, Lin SM, Feng G, Qiu Y (2012) A Statistical Framework for Accurate Taxonomic Assignment of Metagenomic Sequencing Reads. PLoS ONE 7(10): e46450. doi:10.1371/journal.pone.0046450
0. Prepare the data.
The proposed method identifies multiple genomes based on the hits obtained by aligning sequence reads to reference sequences catalogued in databases. The input file for the R codes is a text file containing the following columns:
(1) read name,
(2) NCBI sequence identifier to which the read find a match,
(3) taxon ID,
(4) scientific name of the taxon,
(5) number of identical matches,
(6) read length, and
(7) alignment length.
The test data, i.e., the input file for the R codes of our method is the output from running blastn on the reads. The test data “online-454.d6c4961d.blast.output”can be downloaded by clicking on the file name.
1) The raw reads data in fasta format can be downloaded here.
There are two R functions which should be run sequentially for the TAMER algorithm:
1. Process the data.
In the paper, we store the matching score (number of identical matches) in the M matrix, where the rows represent the reads and the columns represent the genomes. As M is a sparse matrix and the number of reads is often large, it takes too much memory when M is used. In practice, we only need to store the non-zero matching scores in a matrix, where the first column is a read ID, the second column is the taxon ID, and the third column is the non-zero matching score. For the convenience of computing, the read ID and the taxon ID are presented using integer numbers, and we need to record the mapping between read ID and read name, and between taxon ID and taxon name.
2. Identify the multiple genomes and estimate number of reads generated by each genome.
3. Demonstration using an example.
In the following, we show how to use the above R codes for a test data which contains 10,000 reads with average read length of 100bp for the high complexity (simHC) simulated using Metasim. The following are the summary of the true numbers of reads generated from each genome:
632 Reads: `Agrobacterium tumefaciens str. C58 chromosome circular'
1422 Reads: `Anabaena variabilis ATCC 29413 chromosome'
513 Reads: `Archaeoglobus fulgidus DSM 4304'
828 Reads: `Bdellovibrio bacteriovorus HD100'
372 Reads: `Campylobacter jejuni subsp. jejuni 81-176'
967 Reads: `Clostridium acetobutylicum ATCC 824'
579 Reads: `Lactococcus lactis subsp. cremoris SK11'
629 Reads: `Nitrosomonas europaea ATCC 19718'
1545 Reads: `Pseudomonas aeruginosa PA7'
1937 Reads: `Streptomyces coelicolor A3(2) chromosome'
576 Reads: `Sulfolobus tokodaii str. 7 chromosome'
>Example.blast.output <-read.delim("online-454.d6c4961d.blast.output", header=F)
>step2.output$R.est ## check the estimation of the proportions for all species
>step2.output$p.est ## check the estimation of the error
> write.csv(step2.output$Final, file=”myresult.csv”, header=TRUE)
## Number of reads assigned to each taxon is listed in the column “Mixture”
TaxID Mixture BLAST Description
120 1358 1 63 Lactococcus lactis
599 77133 6 38 uncultured bacterium
668 100226 1873 1874 Streptomyces coelicolor A3(2)
864 176299 617 617 Agrobacterium tumefaciens str. C58
982 224325 498 498 Archaeoglobus fulgidus DSM 4304
1001 228410 607 608 Nitrosomonas europaea ATCC 19718
1020 240292 1396 1398 Anabaena variabilis ATCC 29413
1080 264462 806 806 Bdellovibrio bacteriovorus HD100
1141 272562 937 937 Clostridium acetobutylicum ATCC 824
1145 272622 566 566 Lactococcus lactis subsp. cremoris SK11
1155 273063 565 565 Sulfolobus tokodaii str. 7
1375 354242 346 347 Campylobacter jejuni subsp. jejuni 81-176
1443 381754 1507 1511 Pseudomonas aeruginosa PA7
1931 557722 1 805 Pseudomonas aeruginosa LESB58
2365 757425 1 304 Campylobacter jejuni subsp. jejuni ICDCCJ07001
4. Simulation datasets used in the paper can be downloaded here.