TAMER: taxonomic assignment of metagenomic sequence reads
Here we describe the details and provide the blast and R codes for
TAMER algorithm as described in the following paper:
Jiang H, An L, Lin SM, Feng G, Qiu Y (2012) A
Statistical Framework for Accurate Taxonomic Assignment of Metagenomic
Sequencing Reads. PLoS ONE 7(10): e46450.
doi:10.1371/journal.pone.0046450
0.
Prepare
the data.
The
proposed method identifies multiple genomes based on the hits obtained by
aligning sequence reads to reference sequences catalogued in databases. The
input file for the R codes is a text file containing the following columns:
(1)
read name,
(2)
NCBI sequence identifier to which the read find a match,
(3) taxon ID,
(4) scientific name of the taxon,
(5) number of identical matches,
(6) read length, and
(7) alignment length.
The test data, i.e.,
the input file for the R codes of our method is the output from running blastn on the reads. The test
data “online-454.d6c4961d.blast.output”can be
downloaded by clicking on the file name.
Note:
1)
The raw reads data in fasta format can be downloaded here.
2)
To get the test data from the raw reads,
here are our instructions.
To speed up the mapping gid and taxon ID and
scientific name, we also provide c codes count.c here.
There are two R functions which should be run sequentially for the
TAMER algorithm:
1. Process the data.
In the paper, we
store the matching score (number of identical matches) in the M matrix, where the
rows represent the reads and the columns represent the genomes. As M is a
sparse matrix and the number of reads is often large, it takes too much memory
when M is used. In practice, we only need to store the non-zero matching scores
in a matrix, where the first column is a read ID, the second column is the
taxon ID, and the third column is the non-zero matching score. For the
convenience of computing, the read ID and the taxon ID are presented using
integer numbers, and we need to record the mapping between read ID and read
name, and between taxon ID and taxon name.
2. Identify the multiple genomes and estimate number of
reads generated by each genome.
3. Demonstration using
an example.
In
the following, we show how to use the above R codes for a test
data which contains 10,000 reads with average read length of 100bp for the
high complexity (simHC) simulated using Metasim. The following are the summary of the true numbers
of reads generated from each genome:
632 Reads: `Agrobacterium tumefaciens str. C58 chromosome circular'
1422 Reads: `Anabaena variabilis
ATCC 29413 chromosome'
513 Reads: `Archaeoglobus
fulgidus DSM 4304'
828 Reads: `Bdellovibrio
bacteriovorus HD100'
372 Reads: `Campylobacter jejuni subsp. jejuni 81-176'
967 Reads: `Clostridium acetobutylicum ATCC 824'
579 Reads: `Lactococcus
lactis subsp. cremoris
SK11'
629 Reads: `Nitrosomonas
europaea ATCC 19718'
1545 Reads: `Pseudomonas aeruginosa PA7'
1937 Reads: `Streptomyces coelicolor A3(2) chromosome'
576 Reads: `Sulfolobus
tokodaii str. 7 chromosome'
Example
codes:
>Example.blast.output
<-read.delim("online-454.d6c4961d.blast.output", header=F)
>
source("Preprocess.R")
>
source("TAMER.R")
>step1.output<-Preprocess(Example.blast.output)
>step2.output<-RAMS(step1.output)
>step2.output$R.est ## check the estimation of the
proportions for all species
>step2.output$p.est ## check the estimation of the error
> write.csv(step2.output$Final, file=”myresult.csv”, header=TRUE)
>
step2.output$Final
##
Number of reads assigned to each taxon is listed in the column “Mixture”
TaxID Mixture
BLAST
Description
120 1358
1 63 Lactococcus lactis
599 77133
6 38 uncultured bacterium
668 100226 1873 1874 Streptomyces coelicolor A3(2)
864 176299 617 617 Agrobacterium tumefaciens
str. C58
982 224325 498 498 Archaeoglobus
fulgidus DSM 4304
1001
228410 607 608 Nitrosomonas
europaea ATCC 19718
1020
240292 1396 1398 Anabaena variabilis
ATCC 29413
1080
264462 806 806 Bdellovibrio
bacteriovorus HD100
1141
272562 937 937
Clostridium acetobutylicum ATCC 824
1145
272622 566 566
Lactococcus lactis
subsp. cremoris SK11
1155
273063 565 565 Sulfolobus
tokodaii str. 7
1375
354242 346 347
Campylobacter jejuni subsp. jejuni 81-176
1443
381754 1507 1511 Pseudomonas aeruginosa PA7
1931
557722 1 805 Pseudomonas aeruginosa LESB58
2365
757425 1 304 Campylobacter jejuni
subsp. jejuni ICDCCJ07001
4. Simulation datasets
used in the paper can be downloaded here.