Rapid advancements in sequencing technologies along with falling costs present widespread opportunities for microbiome studies across a vast and diverse array of environments. These impressive technological developments have been accompanied by a considerable growth in the number of methodological variables, including sampling, storage, DNA extraction, primer pairs, sequencing technology, chemistry version, read length, insert size, and analysis pipelines, amongst others. This increase in variability threatens to compromise both the reproducibility and the comparability of studies conducted. Here we perform the first reported study comparing both amplicon and shotgun sequencing for the three leading next-generation sequencing technologies. These were applied to six human stool samples using Illumina HiSeq, MiSeq and Ion PGM shotgun sequencing, as well as amplicon sequencing across two variable 16S rRNA gene regions. Notably, we found that the factor responsible for the greatest variance in microbiota composition was the chosen methodology rather than the natural inter-individual variance, which is commonly one of the most significant drivers in microbiome studies. Amplicon sequencing suffered from this to a large extent, and this issue was particularly apparent when the 16S rRNA V1-V2 region amplicons were sequenced with MiSeq. Somewhat surprisingly, the choice of taxonomic binning software for shotgun sequences proved to be of crucial importance with even greater discriminatory power than sequencing technology and choice of amplicon. Optimal N50 assembly values for the HiSeq were obtained for 10 million reads per sample, whereas the applied MiSeq and PGM sequencing depths proved less sufficient for shotgun sequencing of stool samples. The latter technologies, on the other hand, provide a better basis for functional gene categorisation, possibly due to their longer read lengths.
Hence, in addition to highlighting methodological biases, this study demonstrates the risks associated with comparing data generated using different strategies. We also recommend that laboratories with particular interests in certain microbes should optimise their protocols to accurately detect these taxa using different techniques.
Citation: Clooney AG, Fouhy F, Sleator RD, O’ Driscoll A, Stanton C, Cotter PD, et al. (2016) Comparing Apples and Oranges?: Next Generation Sequencing and Its Impact on Microbiome Analysis. PLoS ONE 11(2): e0148028. https://doi.org/10.1371/journal.pone.0148028
Editor: Bryan A. White, University of Illinois, UNITED STATES
Received: September 8, 2015; Accepted: January 12, 2016; Published: February 5, 2016
This is an open access article, free of all copyright, and may be freely reproduced, distributed, transmitted, modified, built upon, or otherwise used by anyone for any lawful purpose. The work is made available under the Creative Commons CC0 public domain dedication.
Data Availability: Sequence data are available from the NCBI Short Read Archive. The accession number is SRP068612.
Funding: This publication has emanated from research supported in part by a research grant from Science Foundation Ireland (SFI) under Grant Number SFI/12/RC/2273 and 11/PI/1137 and by FP7 funded CFMATTERS (Cystic Fibrosis Microbiome-determined Antibiotic Therapy Trial in Exacerbations: Results Stratified, Grant Agreement no. 603038). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
Competing interests: The authors have declared that no competing interests exist.
The use of Next Generation Sequencing (NGS) for the analysis of complex microbial communities has increased dramatically in recent years. Reasons for this include a continual decrease in cost and an ever greater appreciation of the ability of NGS to characterise microbial communities more comprehensively than traditional culture-based methods. NGS has been advantageous in determining the role of the microbiome in disorders like Inflammatory Bowel Disease, diabetes, and obesity, and in characterising environmental communities like wetland soils and oceans.
There are many methodological choices to be made when conducting a sequence-based microbiome study. These decisions have led to the introduction of a variety of technical variables that affect the compositional signal to various degrees, potentially limiting the ability to investigate the main hypothesis or to compare results relating to communities that are similar but which have been investigated using different methods. Factors such as sampling methods, DNA extraction protocol, amplification, purification and quantification, along with sequencing depth, can significantly impact results. For instance, using different purification and quantification methods can lead to a five-fold difference in sequence counts, while a one-step versus two-step PCR method can lead to significant differences in alpha and beta diversity between replicates.
The majority of microbiome studies have relied on 16S rRNA gene amplicon sequencing. There are nine different variable regions (V1-V9) within the 16S rRNA gene, which is ubiquitous in prokaryotes, each flanked by highly conserved stretches of DNA suitable for primer binding. Depending on sequencing technology and chemistry it is possible to sequence a number of adjacent variable 16S rRNA gene regions. However, none of the currently available technologies offers full-length gene sequencing at sufficient depth to allow for multiplexing larger numbers of samples on the same run. Unfortunately, no standard approach exists for selecting the most appropriate primer pair suitable for all taxa and sample types, and the decision is often made based on anecdotal evidence and/or advice from the published literature.
One of the first considerations before embarking on a microbiota project is to select a sequencing technology. Traditionally, the most common options are Roche 454 GS-FLX, the Illumina MiSeq (lower output, longer reads) and HiSeq (higher output, shorter reads) and the Ion PGM, each offering a series of advantages and disadvantages (see http://www.molecularecologist.com/next-gen-fieldguide-2014/ for a guide). Both the Illumina and Ion instruments utilise a sequencing by synthesis approach: Illumina uses DNA templates immobilised on glass slides and optical detection of fluorescently-labelled nucleotides, whereas templates for the Ion platforms are immobilised in wells on a semi-conductor chip, followed by electrical detection of released hydrogen ions. The Illumina and Ion technologies have been compared for amplicon sequencing using various sampling environments, variable regions of the 16S rRNA gene and analysis pipelines. In one case, when stringent quality filtering and a lower sequence similarity cut-off for clustering operational taxonomic units (OTUs) were applied to sequenced V4 reads, negligible differences in alpha and beta diversities were observed within and between soil samples when comparing the MiSeq and the PGM. This concordance was further supported when comparing MiSeq and PGM derived microbiota composition as determined by sequencing V1-V2 amplicons generated using a 20-species mock community and human-derived samples. In the latter case it should be noted that some significant differences were attributed to the PGM failing to produce full-length reads for certain organisms. Furthermore, while not comparing amplicon sequencing and using relatively early versions of sequencing chemistry on an E. coli isolate, Loman and colleagues found MiSeq to have lower error rates and longer reads than the PGM, which on the other hand had the fastest turn-around-time.
Comparative studies were also conducted to assess the initial potential of the MiSeq to replace the Roche 454 GS-FLX, while also evaluating the effect of the variable region studied. Kozich and co-authors established a dual-index barcoding approach suitable for variable MiSeq read lengths and amplicon regions, in particular the V3-V4, V4 and V4-V5 regions. In terms of read quality, MiSeq was either comparable to or better than the GS-FLX Titanium, and the V3-V4 region was better than the V4-V5 region. Another study compared amplicon sequences of seven tandem variable regions produced by the GS-FLX Titanium and Illumina GAII (predecessor of HiSeq) and showed that the V3-V4 and V4-V5 primer combinations performed worst and best, respectively, in terms of classification accuracy, irrespective of the technology used. It is clear that the choice of primers can have a major effect on the outcome. This was further substantiated by Tremblay and co-authors, who found that the V6-V8 and V7-V8 regions returned a taxonomic composition from a synthetic community that deviated to a higher degree than that returned by the V4 region.
With the ever increasing number of technological variables that have the potential to have non-trivial effects on microbiota composition analysis, it is critically important to maintain a consistent methodology within studies and when comparing studies, or to have evidence that any inconsistencies that exist do not bias results. A more expensive alternative to 16S rRNA gene amplicon sequencing is shotgun metagenomic sequencing, which bypasses gene-specific amplification and potentially sequences all fragmented DNA, including that from other microorganisms and viruses, in a community. While providing much more information, including encoded functions of the microbiota, the vast amount of sequence data obtained leads, however, to a new set of challenges in terms of data processing, storage and analysis. For instance, the Illumina HiSeq 2500 platform can yield over 1,000,000,000,000 bp (1 Tbp) of raw sequence data, which may increase several-fold during downstream processing and analysis. Shotgun sequencing is also possible using both the Illumina MiSeq and Ion PGM, albeit with less throughput compared to HiSeq. Some non-metagenomic studies have evaluated these platforms and demonstrated comparable results when used to detect blood pathogens, diagnose dementia, and detect gene variants across four microbial genomes.
In the current study we investigated the impact of various amplicon primer combinations and sequencing technologies on the analysis of complex microbial communities. More specifically, we compared amplicon and shotgun data generated by Illumina MiSeq, HiSeq and Ion PGM through the use of six human stool samples, using two primer sets covering two different 16S rRNA gene regions (V1-V2 and V4-V5). We also assessed the depth requirements for analysing stool shotgun datasets, and thus whether the MiSeq and/or PGM represent suitable alternatives to the HiSeq.
Materials and Methods
16S rRNA gene amplicon sequencing
Stool samples were collected from six elderly individuals and stored at -80°C during the ELDERMET project, approved by the Cork Clinical Research Ethics Committee of the Cork Teaching Hospitals (CREC), which granted full approval on the 19th February 2008 (Ref: ECM 3 (a) 01/04/08). Formal written consent was obtained at the time of recruitment, on the basis of an Information Sheet/Safety Statement, following an ethics protocol that was approved by CREC in compliance with pertaining local, national and European ethics legislation and guidelines to best practice. DNA was extracted from stool samples using previously described methods, together with a modified Qiagen DNA extraction procedure. Briefly, DNA was extracted using a QIAamp DNA stool Kit with the addition of an initial bead beating step. Microbial DNA from stool samples was used as template for PCR, which contained 25 μl Biomix Red (MyBio, Kilkenny, Ireland), 1 μl forward primer (Sigma Aldrich, Dublin, Ireland) (10 pmol), 1 μl reverse primer (Sigma Aldrich) (10 pmol), template DNA and PCR grade water (MyBio), to a final reaction volume of 50 μl. Conditions were optimised so that only a single band of the correct size was obtained, and all PCRs were completed in triplicate (see S1 Table for primers and further details). Triplicate PCR products were pooled and cleaned using the AMPure magnetic bead purification system (1:1.8 DNA:AMPure ratio) (Beckman Coulter, UK). Cleaned samples were quantified using Picogreen Quant-iT quantification and the Nanodrop 3300 (Fisher Scientific, Dublin, Ireland). Samples were subsequently pooled in an equimolar concentration of 10 pM and prepared for MiSeq sequencing using standard Illumina protocols. Libraries were mixed with Illumina generated PhiX (20% of 12.5 pM) control libraries, denatured using freshly prepared NaOH and sequenced using a V3 600-cycle kit. For the PGM, libraries were pooled at a concentration of 10 pM and sequenced according to Ion PGM protocols.
Metagenomic shotgun sequencing
For Illumina MiSeq shotgun sequencing, samples were initially tagmented using the Nextera XT kit from Illumina, whereby the Nextera Transposome, carrying sequencing adaptors, combines with template DNA, resulting in fragmentation of the DNA and the addition of adaptors. A limited 12-cycle PCR was completed, during which sequencing adaptors and indexing primers were added to the DNA. Amplicon samples were then normalized and pooled, followed by sequencing on the MiSeq platform using Illumina protocols for a 2 x 300 cycle run, with an insert size of 400 bases.
Shotgun libraries for Ion PGM were generated according to instructions from the ‘Ion Xpress™ Plus gDNA Fragment Library Preparation’ User guide (Publication number MAN0007044). Libraries were sheared, size selected and individually barcoded using the Ion Xpress Barcode Adapters. Following library quantification and equimolar pooling, the Ion OneTouch™ 2 system was used to prepare template positive ion sphere particles containing the clonally amplified DNA libraries using the ION PGM™ Template OT2 400 Kit, allowing up to 400 bp single-end reads. Enrichment of the template positive ISPs was performed using the Ion OneTouch™ ES and an enrichment percentage of 18% was obtained, which was within the range recommended in the ION PGM™ Template OT2 400 Kit guide (Publication number MAN0007218). Sequencing was performed on the Ion PGM using an Ion 318v2 chip and the Ion PGM Sequencing 400 kit (guide number MAN0007242).
Shotgun Illumina HiSeq sequencing reads were obtained from the published ELDERMET dataset. The paired-end read lengths were 2 x 90 bp with an insert size of 300 bases. DNA was extracted from samples using the same method as used above.
MiSeq reads were merged and filtered using join_paired_ends.py in QIIME version 1.8 using the fastq-join.py tool, whereas the single-end PGM reads were not. Demultiplexing of both MiSeq and PGM reads was carried out using split_libraries.py, also in QIIME, with default parameters, retaining only reads matching the main length distribution (S1 Table) per primer and with an average quality score of Q25 or above. The differences in quality filtering lengths are due to reverse primers being present in the MiSeq reads. Chimeric sequences were removed via USEARCH version 7.0.1090 using the uchime_ref.py command along with the ChimeraSlayer GOLD database. OTUs were clustered using the QIIME script pick_closed_reference_otus.py and the RDP database version 11.4. The Mothur implementation of the RDP classifier was used to assign taxonomy from phylum to genus with a bootstrap cut-off of 80%. Any sequences with bootstrap values below 80% were assigned as unclassified at that particular rank. Species counts for amplicon data were generated using SPINGO with default parameters.
Reads from all three shotgun datasets were aligned to the human genome version 20 (hg20) using Bowtie2 version 2.2.3 to filter out human-derived sequences. Illumina HiSeq and MiSeq reads were subsequently quality filtered and trimmed using Trimmomatic version 0.32, requiring a PHRED quality score of at least Q22 across a sliding window of 20 bp. Reads shorter than 30 bp after trimming were also removed. Only PGM reads with a quality score of greater than Q15 and longer than 30 bp were retained for downstream analysis.
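The sliding-window filtering described above can be illustrated with a minimal Python sketch. This is not Trimmomatic's implementation; it is a simplified illustration that assumes Phred+33 encoded qualities, cuts a read at the start of the first 20 bp window whose mean quality falls below Q22, and then applies the 30 bp minimum length. The function names (`phred33`, `sliding_window_keep`, `passes_filter`) and the handling of reads shorter than one window are assumptions made for this example.

```python
def phred33(quality_string):
    """Decode a FASTQ quality string (assumed Phred+33) into integer scores."""
    return [ord(ch) - 33 for ch in quality_string]


def sliding_window_keep(qualities, window=20, min_mean_q=22):
    """Return how many bases to keep from the 5' end of a read.

    Windows are scanned left to right; the read is cut at the start of the
    first window whose mean quality is below min_mean_q.
    """
    n = len(qualities)
    if n == 0:
        return 0
    if n < window:
        # Assumption: a read shorter than one window is judged on its overall mean.
        return n if sum(qualities) / n >= min_mean_q else 0
    for start in range(n - window + 1):
        if sum(qualities[start:start + window]) / window < min_mean_q:
            return start
    return n


def passes_filter(qualities, min_length=30, window=20, min_mean_q=22):
    """True if the trimmed read is at least min_length bases long."""
    return sliding_window_keep(qualities, window, min_mean_q) >= min_length
```

For example, a 70 bp read with 50 bases at Q30 followed by 20 bases at Q10 would be cut after 39 bases under these assumptions and would therefore still pass the 30 bp length filter.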
All metagenome assemblies were performed using IDBA_UD version 4.1.2 and MetaVelvet version 1.2.02. Phylogenetic binning was achieved using MetaPhlAn version 2, Kraken version 0.10.5-beta and GOTTCHA version 0.7.5. MetaPhlAn2 classifies sequences via clade-specific marker genes, Kraken uses exact alignment of k-mers and a lowest common ancestor approach, while GOTTCHA maps reads to non-redundant signature databases to classify at multiple taxonomic levels. Genes were predicted using MetaGeneMark version 3.26. Metaphor was used to predict core and unique genes with thresholds set to 30% amino acid identity across an alignment covering 50% of both sequence lengths. The core and unique genes were then mapped against the EGGNOG database version 4 using BLAST to create functional profiles for each of the samples and datasets, retrieving the top hit with an E-value of 1e-5.
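To make the lowest common ancestor idea concrete, the following Python sketch computes the deepest taxon shared by a set of taxa in a toy taxonomy. It illustrates the general concept only and is not Kraken's algorithm or database format; the `parent` dictionary, its simplified lineages and the function names are hypothetical and chosen purely for the example.

```python
def lineage(taxon, parent):
    """Return the path from a taxon up to the root (taxon first, root last)."""
    path = [taxon]
    while parent.get(taxon) is not None:
        taxon = parent[taxon]
        path.append(taxon)
    return path


def lowest_common_ancestor(taxa, parent):
    """Return the deepest node shared by the lineages of all given taxa."""
    taxa = list(taxa)
    if not taxa:
        return None
    common = lineage(taxa[0], parent)
    for taxon in taxa[1:]:
        seen = set(lineage(taxon, parent))
        # Keep leaf-to-root order so the first element stays the deepest node.
        common = [node for node in common if node in seen]
    return common[0] if common else None


# Hypothetical, heavily simplified taxonomy (child -> parent).
parent = {
    "root": None,
    "Bacteria": "root",
    "Firmicutes": "Bacteria",
    "Proteobacteria": "Bacteria",
    "Ruminococcus": "Firmicutes",
    "Blautia": "Firmicutes",
    "Campylobacter": "Proteobacteria",
}
```

In this toy tree, a k-mer shared by Ruminococcus and Blautia would be assigned to Firmicutes, whereas one shared by Ruminococcus and Campylobacter could only be assigned to Bacteria, which shows how sequence shared across distant taxa reduces taxonomic resolution.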
All statistical analysis was performed in R version 3.1.3. In each of the heatplots, Spearman correlations, along with Ward D2 clustering, were performed on the relative abundance at genus level of each sample. As the data were largely non-parametric, Spearman correlations were chosen to avoid violating the statistical assumptions of Pearson correlations. A Mann-Whitney test was used to analyse differences in the taxa between clusters. Where necessary, the P-values were corrected for multiple testing using the Benjamini and Hochberg method. A P-value of <0.05 was considered significant.
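The Benjamini and Hochberg correction mentioned above can be sketched in a few lines of Python. The study itself used R, so this is only an illustration of the standard step-up adjustment, not the code that was run; the function name `benjamini_hochberg` is an assumption for this example.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted P-values in the input order."""
    m = len(p_values)
    # Indices of the P-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P-value down, scaling by m / rank and
    # enforcing that adjusted values never increase as rank decreases.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted
```

Each genus-level test would contribute one raw P-value to the input list, and a genus would then be called significant if its adjusted value is below 0.05.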
The data generated reflected the different outputs of the three platforms. For the amplicon datasets the PGM produced 57,720 (mean) ± 9,841 (SD) V1-V2 and 33,454 ± 10,488 V4-V5 reads per sample, while the MiSeq produced 181,758 ± 108,343 V1-V2 and 102,824 ± 22,154 V4-V5 reads per sample. For the shotgun datasets there was also a marked difference between the three sequencing technologies, with 26,590,475 ± 51,650 HiSeq, 1,352,748 ± 458,483 MiSeq and 962,226 ± 170,251 PGM reads generated per sample.
We performed hierarchical clustering analysis on the microbiota composition of all six stool samples in order to assess the effect of the amplification primer combination (where relevant), sequencing strategy (16S rRNA gene or shotgun), sequencing technology and type, along with metagenomic read classifier. Fig 1 shows a heat-plot with hierarchical clustering of the proportional taxonomic abundances at the genus level, with only genera present in a minimum of 20% of the datasets included. All shotgun datasets fell into one large cluster with three distinct sub-clusters, labelled 2, 3 and 4. It is worth highlighting that although the shotgun samples clustered together, there were major discrepancies between the taxonomic profiles (sub-clusters), which were dictated by the metagenomic classifier used. There was one exception: sample 6 sequenced on the PGM and classified by GOTTCHA clustered with the MetaPhlAn2 sample 6 datasets. In the MetaPhlAn2 cluster (cluster 4), the datasets grouped by sample in each case, which is preferable as it suggests the technical variation is less than the inter-individual variation. For all six samples, the HiSeq and MiSeq datasets clustered together while the PGM sample was located to the side of the sub-cluster. For the GOTTCHA classifier, datasets grouped by sequencer more than by sample. Here there was no case where all three shotgun technologies clustered together by sample. For the third shotgun classifier, Kraken (cluster 2), five of the six samples clustered by sample, the exception being the MiSeq dataset for sample 2. Unlike with MetaPhlAn2, the PGM formed sample-wise sub-clusters with HiSeq or MiSeq, with the two Illumina technologies not forming any sub-clusters.
Out of a total of 163 genera, 23 differed significantly between cluster 3 (GOTTCHA) and cluster 4 (MetaPhlAn2) in Fig 1, where the most significant genera included Ruminococcus (increased in cluster 3; P-value = 9.88 x 10^-5), Blautia (increased in cluster 3; P-value = 1.30 x 10^-5) and Campylobacter (increased in cluster 3; P-value = 9.30 x 10^-6). When comparing Kraken (cluster 2) to the other two shotgun classifiers (clusters 3 and 4), there were 52 significantly different genera. These included Buchnera, Cellulomonas and Cellvibrio, all increased in the Kraken dataset, each with an adjusted P-value of 1.82 x 10^-11. Of the 15 most significantly different genera, all but one were absent from the GOTTCHA and MetaPhlAn2 clusters, thereby indicating possible false positives detected by Kraken. The three aforementioned taxa are also not predominant colonisers of the human gut, thus reinforcing the possibility of inaccuracies in Kraken assignments. See S2 Table for a full list of taxonomy comparisons.
Fig 1. Heat-plot representing the taxonomic composition of the samples at genus level.
The heat-plot includes amplicon data along with shotgun datasets from three classifiers, namely MetaPhlAn2, Kraken and GOTTCHA. Only genera present in a minimum of 20% of datasets were retained. The method of correlation used was Spearman along with Ward D2 clustering (PGM = Ion Personal Genome Machine).
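The preparation of the heat-plot input (relative abundances, with genera retained only if present in at least 20% of datasets) can be sketched as follows. This is an illustrative Python sketch rather than the R code used in the study; it assumes each dataset is a dictionary of genus counts, and the function names and the choice to treat a count above zero as "present" are assumptions.

```python
def relative_abundance(counts):
    """Convert raw genus counts into proportions that sum to 1."""
    total = sum(counts.values())
    if total == 0:
        return {}
    return {genus: count / total for genus, count in counts.items()}


def filter_by_prevalence(datasets, min_fraction=0.20):
    """Keep only genera detected (count > 0) in at least min_fraction of datasets.

    datasets maps a dataset name to a {genus: count} dictionary.
    """
    n = len(datasets)
    presence = {}
    for counts in datasets.values():
        for genus, count in counts.items():
            if count > 0:
                presence[genus] = presence.get(genus, 0) + 1
    kept = {genus for genus, k in presence.items() if k >= min_fraction * n}
    return {
        name: {genus: count for genus, count in counts.items() if genus in kept}
        for name, counts in datasets.items()
    }
```

The filtered profiles would then be passed to the correlation and clustering step; that step is not reproduced here.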
For the amplicon datasets, sample-wise clustering was less prevalent than for the metagenomic datasets. MiSeq V1-V2 amplicons formed a distinctive sub-cluster within the cluster labelled 1 in Fig 1, clearly separated from the rest of the amplicon datasets. A second sub-cluster contained all the sample 3 and 6 amplicon datasets, with the exception of the V4-V5 MiSeq dataset and the aforementioned V1-V2 MiSeq dataset. The third sub-cluster contained the majority of the V4-V5 MiSeq samples (4 of 6) along with two V4-V5 PGM samples. In this case the amplicons clustered by 16S rRNA gene primer combination, as opposed to by sample or by technology. The final sub-cluster contained the majority of the V1-V2 PGM datasets (4 of 6) along with 3 of the 4 sample 5 datasets (V1-V2 MiSeq being the missing dataset). Investigating the differences between cluster 1 (amplicon data) and clusters 2–4 (shotgun data) revealed 91 genera that differed significantly, showing the large differences between amplicon and shotgun read classification methods. The full list of taxonomy comparisons is found in S2 Table.
As for bacterial taxa that were the most abundant across all of the datasets, there were some families that differentiate the six subjects regardless of methodology used (Fig 2): For example, Porphyromonadaceae genera were consistently high in Sample 6 datasets compared to the other samples, and so were genera belonging to the Prevotellaceae family in Sample 3, irrespective of primer combination or sequencing technology. For samples 1 and 5 the shotgun-based methods appeared more sensitive with respect to detecting Enterobacteriaceae genera within the Proteobacteria phylum compared to the amplicon-based approaches, which could be attributed to the difficulty of discriminating such taxa at 16S rRNA gene level.
Fig 2. Bar-charts of taxonomic composition at family level.
The families are first organised by phylum abundance (highest to lowest) followed by family abundance (highest to lowest) in each of the phyla. The numbers of observed species are located at the top of each bar.
Fig 2 also highlights the number of unique species in each dataset, as identified by MetaPhlAn2 for shotgun data and SPINGO for amplicon data. Note that these were species that could be confidently classified as such, and should not be mistaken for the number of unique OTUs. The highest numbers of unique species among all shotgun methods were detected in the HiSeq datasets, comparable to those resulting from the analysis of amplicons. The success of the HiSeq with respect to shotgun sequencing is not surprising given the greater sequencing depth it can provide, resulting in detection of rarer species. The lowest number of unique species overall was detected in the MiSeq shotgun datasets, which is not due to the total number of reads, as PGM had fewer of these. For the amplicon datasets, the highest number of unique species was detected with the PGM datasets for five of the six datasets. Although the species counts for the pooled PGM amplicons were higher when compared to the MiSeq amplicons, the difference was not statistically significant (P-value = 0.24). However, when comparing particular primer combinations, the difference in the V1-V2 species counts between the two technologies was significant at the 10% level (P-value = 0.093). We further analysed the effect of varying sequencing depth on the number of unique species detected for each amplicon run (Fig 3). The highest numbers of species were detected at each read depth by the V1-V2 amplicon on the PGM, while the lowest were detected by the V1-V2 amplicon on the Illumina MiSeq. All primer datasets reached saturation in the number of new species detected, other than the V4-V5 primer on the PGM, which was limited by the number of reads for some samples. Despite this, more unique species were detected with this primer/technology combination than with either MiSeq dataset, both of which had vastly more reads.
Fig 3. Observed Species at various sequencing depths for the amplicon data using SPINGO.
The data points represent the median values across the 6 samples and the error bars are the 25% and 75% quartile ranges.
Shotgun sequencing depth
To investigate which technology was most suitable for shotgun sequencing, we performed random subsampling of reads to determine occurrences at even sequencing depths, in recognition of the fact that the HiSeq coverage was substantially higher than the coverage for MiSeq and PGM. Fig 4 shows the median N50 values across each of the six samples per technology, including three replicates (random sub-samplings) for each sample. At the lowest sequencing depth selected (150,000 reads) the assembly using the MiSeq data had the highest N50 (the contig length such that contigs of this length or longer contain at least 50% of the assembled bases), possibly due to the longer read lengths. However, as more reads were added, the HiSeq data began to outperform the assembly from both the MiSeq and the PGM technologies. The MiSeq and PGM datasets became limited by read number and their N50 values plateaued at 1.7 million and 950,000 reads, respectively. Due to the large number of HiSeq reads, the N50 peaked at 10 million reads after a large increase at 1.7 million reads. Two of the six HiSeq datasets (Samples 1 and 5) had a very large N50 at 600,000 reads. In order to ensure that the results were not affected by the assembler selected, the datasets were also assembled using both Velvet (S1 Fig) and MetaVelvet (S2 Fig). Interestingly, the same two samples for the HiSeq datasets had an elevated N50 for both Velvet and MetaVelvet, although at 1.3 million and 950,000 reads, respectively (S3 Table).
Fig 4. N50 values representing randomly subsampled reads at various sequencing depths after assembly by IDBA_UD.
Each point represents the median value across each of the 6 samples per technology (including 3 replicates per sample). Error bars are the 25% and 75% quartile ranges.
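As a brief illustration of the N50 metric plotted in Fig 4, the sketch below computes N50 from a list of contig lengths using the conventional definition (the length of the shortest contig in the smallest set of longest contigs that together cover at least half of the assembled bases). The function name `n50` is an assumption for this example, and the assemblers used in the study report this value themselves.

```python
def n50(contig_lengths):
    """Return the N50 of a list of contig lengths (0 for an empty list)."""
    lengths = sorted(contig_lengths, reverse=True)
    half_total = sum(lengths) / 2
    running = 0
    # Accumulate contigs from longest to shortest until half the
    # assembled bases are covered; that contig's length is the N50.
    for length in lengths:
        running += length
        if running >= half_total:
            return length
    return 0
```

For contigs of lengths 2, 2, 2, 3, 3, 4, 8 and 8 (32 bases in total), the two longest contigs already cover 16 bases, so the N50 is 8.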
Furthermore, unique species detection was also performed on the sub-sampled shotgun sequencing-derived reads (Fig 5). At low sample depths the HiSeq, MiSeq and PGM datasets were comparable, with few differences in the number of species detected. At 950,000 reads, the PGM data reached the read limit, but was still similar to the other technologies in terms of number of species. However, at 1.7 million reads, the HiSeq species counts continued to increase while the MiSeq counts levelled off. This could possibly be due to the fact that the longer MiSeq read lengths result in more accurate species assignments relative to HiSeq, leading to earlier plateauing. In the overall graph (Fig 5 inset) the HiSeq counts continued to increase without levelling off completely, even at the 25 million read point.
Fig 5. Number of species observed from randomly subsampled reads using MetaPhlAn2.
Each point represents the median value across each of the 6 samples per technology (including 3 replicates per sample). Error bars are the 25% and 75% quartile ranges.
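The random subsampling underlying Figs 4 and 5 can be sketched in Python as below. This is a simplified illustration and not the procedure used in the study: it assumes each read has already been given a species label (or `None` if unassigned), whereas in the study the subsampled reads were classified with MetaPhlAn2. The function names and the use of a fixed seed per replicate are assumptions.

```python
import random


def subsample_reads(reads, depth, seed=0):
    """Draw `depth` reads without replacement; return None if too few reads."""
    if depth > len(reads):
        return None
    rng = random.Random(seed)  # fixed seed makes each replicate reproducible
    return rng.sample(reads, depth)


def observed_species(read_labels):
    """Count distinct species labels, ignoring unassigned reads (None)."""
    return len({label for label in read_labels if label is not None})


def species_at_depths(read_labels, depths, replicates=3):
    """Return {depth: [species count per replicate]} for depths that can be reached."""
    results = {}
    for depth in depths:
        counts = []
        for rep in range(replicates):
            subset = subsample_reads(read_labels, depth, seed=rep)
            if subset is not None:
                counts.append(observed_species(subset))
        if counts:
            results[depth] = counts
    return results
```

Depths that exceed the number of available reads are simply skipped, which mirrors the way the PGM and MiSeq curves end earlier than the HiSeq curve.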
Within each category of shotgun dataset, the core and unique genes were predicted using Metaphor (Fig 6). This was carried out on 600,000 reads per dataset in order to allow for comparative results at equal sequencing depth. For the core genes all three technologies gave broadly the same results; however, the HiSeq data had the most poorly characterised genes out of the three datasets, along with the lowest number of genes with a “Metabolism” function and the highest with no function. Surprisingly, this technology did not predict any core genes for the categories “Energy Production and Conversion” or “Inorganic Transport and Metabolism”, whereas both of these categories were present in the core gene profiles of the MiSeq and PGM datasets. The MiSeq datasets predicted the highest number of genes within the “Metabolism” category, while the PGM data predicted the highest for “Information Storage and Processing”, whilst also being the only technology to predict core genes in the category “Cell Motility”. The numbers of genes predicted by MetaGeneMark are listed in Fig 6. At a read depth of 600,000 sequences, the MiSeq datasets predicted the most genes for each of the 6 samples, while the HiSeq datasets gave the lowest gene number for 5 of the 6 samples. This is a possible reason why this technology gives the most detailed core and unique gene profile.
Fig 6. Core and unique genes acquired by Metaphor with 600,000 randomly selected sequences per dataset for each of the samples.
The numbers represent the total number of predicted complete or incomplete genes for each metagenome.
The NGS technologies Illumina MiSeq, HiSeq and Ion PGM have shown significant promise in delivering cost-effective, high-resolution insights into microbiomes from various environments. However, due to a multitude of technical variables, careful comparisons are required to provide recommendations for suitable methodological approaches. In response to this, we compared the taxonomic composition of six stool samples using two different primer combinations covering two 16S rRNA gene variable regions. We then compared these results with those of shotgun sequencing using Illumina and Ion technologies.
Following either OTU clustering of amplicon reads or taxonomic classification by binning of shotgun reads, all at genus level, we compared microbiota composition of the different datasets. Even though the gut microbiota is generally regarded as individual specific, it was apparent that some amplicon datasets clustered according to technology and/or primer set, rather than by subject. In particular, microbiota composition from all V1-V2 MiSeq and four of the six V4-V5 MiSeq datasets grouped together in separate sub-clusters. The V1-V2 and V4-V5 PGM datasets clustered by sample, as opposed to technology, in 3 of the 6 samples (samples 1, 3 and 6), while the V4-V5 MiSeq data clustered with V4-V5 PGM data per sample in 2 of the 6 samples (samples 5 and 6).
To ensure that the differences in classifications between shotgun and amplicon sequencing were not simply due to a particular shotgun classification method, we compared the compositional clustering with three classifiers of shotgun reads: MetaPhlAn2, GOTTCHA and Kraken. The shotgun datasets grouped together in a sub-cluster separated from the amplicon datasets, which might be expected as these methods are independent of amplification bias and 16S rRNA gene copy number differences. With MetaPhlAn2, all Illumina HiSeq and MiSeq datasets were consistently closer to each other than to the PGM shotgun sequences. This is seen to a smaller degree with GOTTCHA, where three of the six samples sub-clustered the Illumina technologies, but not at all with Kraken. In terms of clustering by sample over method, MetaPhlAn2 gave the best results, with all datasets clustered by sample groups, closely followed by Kraken, where this occurred for 5 of the 6 samples in separate sub-clusters. GOTTCHA failed to cluster any dataset by sample, indicating its higher sensitivity to technological artefacts between sequencing methods. However, it must be noted that measuring accuracy based on individual sample clustering is not always a reflection of performance: GOTTCHA datasets clustered more closely to MetaPhlAn2 and, although sample clustering is observed when using Kraken, many of the taxonomic assignments may be false positives, as previously mentioned.
Unsurprisingly, Illumina HiSeq shotgun sequences translated to the highest number of species compared to the other two shotgun datasets, which were more than an order of magnitude smaller. Sub-sampling that simulated lower HiSeq coverage revealed, however, that even an equal number of reads could result in more observed species for HiSeq. As this technology produces shorter reads than MiSeq and PGM, it is possible that the number of species is artificially inflated as a result of higher sequence variation created by incorrect alignment to the reference marker genes. While not directly comparable with species observed through shotgun sequencing, V1-V2 amplicons sequenced by PGM, which are expected to be more variable than V4-V5 amplicons, resulted in the highest species counts.
Despite having the largest number of reads per sample, the V1-V2 region on the MiSeq had the lowest number of unique species identified at each subsampling point. This could be due to the questionable reliability of this primer combination, as reflected in its unexpected clustering and failure to detect expected genera. Curiously, Salipante et al. found that sequencing using the same V1-V2 primers on the PGM led to higher error rates when compared to the MiSeq, particularly for a mock community of 20 organisms, where deviating abundances of single strains have a much greater effect on the overall community composition than in a high-diversity sample. Other reasons for the different results in the Salipante et al. study could be discrepancies in amplification (one-step PCR reaction) and taxonomic assignment (older RDP-classifier version and BLAST).
The benefits of using metagenomic shotgun over amplicon sequencing are clear in terms of increased information content and reduced biases related to amplification and gene copy numbers. However, it is currently not established what sequencing depth is required for the different technologies; this is a more pertinent issue for shotgun than for amplicon sequencing, due to its much higher cost per sample. We therefore assembled the randomly sub-sampled shotgun datasets and compared the common N50 metric across the three sequencing technologies. As expected, the MiSeq technology, with its non-overlapping 300 bp paired-end reads, had marginally higher N50 values than HiSeq and PGM. An N50 peak occurred at 10 million reads for the HiSeq data, suggesting that this was the optimal sequencing depth for stool samples with 100 bp paired-end reads and a 300 bp insert size. No peak was observed for the PGM or the MiSeq in the available coverage range, which suggests that their coverage may not have been sufficient to reach an optimal level of assembly. Somewhat surprisingly, for two of the six samples there were drastically elevated N50 values at 600,000 HiSeq reads, irrespective of the random sub-sampling set used. Such early N50 peaks were also observed using two other assemblers, albeit for a different number of reads, and have previously been reported when assembling sub-samples of an isolated bacterium. In that case, the authors reasoned that this could be due to chimeric reads, duplications or sequencing errors, and recommended that contigs should be incrementally assembled in sub-sections before a final merge. We also suggest that, for our data, this read depth may be where the majority of highly abundant species are assembled, and that the assembly becomes less efficient as more rare taxa are added.
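For readers less familiar with the N50 metric compared above, it is the contig length L such that contigs of length L or longer together contain at least half of all assembled bases. A minimal Python sketch follows; the function name and the contig lengths are hypothetical and are not values from this study.

```python
def n50(contig_lengths):
    """Return the N50 of a collection of contig lengths (0 if empty)."""
    lengths = sorted(contig_lengths, reverse=True)
    half_total = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        # The first contig that pushes the running sum to half the
        # assembly size defines the N50.
        if running >= half_total:
            return length
    return 0


# Hypothetical contig lengths (bp) from two assemblies of one sample.
fragmented = [900, 700, 500, 400, 300, 200]
contiguous = [5000, 3000, 1200, 400, 300, 100]
print(n50(fragmented), n50(contiguous))
```

Because N50 depends only on the distribution of contig lengths, a few long contigs from highly abundant species can raise it sharply, which is one way the early peaks described above could arise.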
In terms of functional categorisation of assembled shotgun sequences, we found the MiSeq and PGM datasets to contain largely equal proportions of predicted core genes from the assembled contigs. For the HiSeq assemblies there were, however, substantially fewer core genes involved in “Metabolism” and more genes with unknown function. This may be attributable to the smaller number of predicted complete genes, which is plausible for this shorter-read technology.
To summarise, this is, to our knowledge, the first reported study comparing both amplicon and shotgun sequencing for Illumina and Ion technologies. Although shotgun sequencing did not suffer from the same degree of technology-dependent bias seen with amplicon sequencing, there were some major differences between phylogenetic binning software, with MetaPhlAn2 producing the most favourable results. GOTTCHA failed to cluster any datasets by sample but sub-clustered with MetaPhlAn2, while Kraken clustered separately from the other two binners and also appeared to produce a high number of false positive taxonomic assignments. The variation in microbiota composition between the majority of gut samples proved to be smaller than that between the compared sequencing technologies and variable 16S rRNA gene regions. In particular, the V1-V2 MiSeq showed poor performance, while the V4-V5 region was marginally more reliable on both platforms. There is evidence that the MiSeq and PGM offer valuable information when used for shotgun sequencing; however, in order to detect the majority of species in samples and to perform a high-quality assembly, deeper sequencing is required. Species assignment is also dependent on read length, which is shorter for the HiSeq. We subsequently showed that there may be no assembly-related benefit in sequencing more than 10 million HiSeq reads per stool sample. Nevertheless, as the cost of shotgun sequencing is lower on the HiSeq instrument than on the MiSeq or PGM, this platform may still be preferable even though MiSeq produces longer reads and somewhat better assemblies at low sequencing depth. Caution should, however, be applied with regard to taxonomic binning, and comparisons such as those described in this study must be carried out to prevent methodological biases eclipsing the true biological picture.
Hence, we advise laboratories with particular interests in certain microbes to optimise their protocols to accurately detect these taxa using different techniques.
S1 Fig. N50 values representing randomly subsampled reads at various sequencing depths after assembly by Velvet.
Each point represents the median value across the 6 samples per technology (including 3 replicates per sample). Error bars show the 25th and 75th percentiles.
S2 Fig. N50 values representing randomly subsampled reads at various sequencing depths after assembly by MetaVelvet.
S1 Table. PCR primer, linker and adaptor sequences used for sequencing samples on the PGM Ion Torrent and Illumina MiSeq.
The table also contains the PCR conditions for 16S rRNA gene amplification and sequence length for quality filtering during read processing.
The authors wish to thank Dr. Fiona Crispie and Ms. Vicki Murray for their extensive help with the sequencing in this study. This publication has emanated from research supported in part by a research grant from Science Foundation Ireland (SFI) under Grant Number SFI/12/RC/2273 and 11/PI/1137 and by FP7 funded CFMATTERS (Cystic Fibrosis Microbiome-determined Antibiotic Therapy Trial in Exacerbations: Results Stratified, Grant Agreement no. 603038).
Conceived and designed the experiments: PC MC AC FF. Performed the experiments: AC FF. Analyzed the data: AC FF. Contributed reagents/materials/analysis tools: PC MC. Wrote the paper: AC FF AOD CS RS PC MC.
NGS Sequencing Department, Beijing Genomics Institute (BGI), 4th Floor, Building 11, Beishan Industrial Zone, Yantian District, Guangdong, Shenzhen 518083, China
Academic Editor: P. J. Oefner
Copyright © 2012 Lin Liu et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
With the fast development and wide application of next-generation sequencing (NGS) technologies, genomic sequence information is within reach to help decode life mysteries, make better crops, detect pathogens, and improve quality of life. NGS systems are typically represented by SOLiD/Ion Torrent PGM from Life Sciences, Genome Analyzer/HiSeq 2000/MiSeq from Illumina, and GS FLX Titanium/GS Junior from Roche. Beijing Genomics Institute (BGI), which possesses the world’s biggest sequencing capacity, has multiple NGS systems including 137 HiSeq 2000, 27 SOLiD, one Ion Torrent PGM, one MiSeq, and one 454 sequencer. We have accumulated extensive experience in sample handling, sequencing, and bioinformatics analysis. In this paper, the technologies of these systems are reviewed, and first-hand data from extensive experience are summarized and analyzed to discuss the advantages and specifics associated with each sequencing system. Finally, applications of NGS are summarized.
Deoxyribonucleic acid (DNA) was demonstrated to be the genetic material by Oswald Theodore Avery in 1944. Its double-helical structure composed of four bases was determined by James D. Watson and Francis Crick in 1953, leading to the central dogma of molecular biology. In most cases, genomic DNA defines the species and the individual, which makes the DNA sequence fundamental to research on the structures and functions of cells and to the decoding of life mysteries. DNA sequencing technologies can help biologists and health care providers in a broad range of applications such as molecular cloning, breeding, finding pathogenic genes, and comparative and evolutionary studies. Ideally, DNA sequencing technologies should be fast, accurate, easy to operate, and cheap. In the past thirty years, DNA sequencing technologies and applications have undergone tremendous development and have acted as the engine of the genome era, which is characterized by vast amounts of genome data and, consequently, a broad range of research areas and applications. It is necessary to look back on the history of sequencing technology development in order to review the NGS systems (454, GA/HiSeq, and SOLiD), compare their advantages and disadvantages, discuss their various applications, and evaluate the recently introduced PGM (Personal Genome Machine) and third-generation sequencing technologies and applications. All of these aspects are described in this paper. Most data and conclusions come from independent users at BGI (Beijing Genomics Institute) who have extensive first-hand experience with these typical NGS systems.
Before discussing the NGS systems, we briefly review the history of DNA sequencing. In 1977, Frederick Sanger developed a DNA sequencing technology based on the chain-termination method (also known as Sanger sequencing), and Walter Gilbert developed another sequencing technology based on chemical modification of DNA and subsequent cleavage at specific bases. Because of its high efficiency and low radioactivity, Sanger sequencing was adopted as the primary technology in the “first generation” of laboratory and commercial sequencing applications. At that time, DNA sequencing was laborious and required radioactive materials. After years of improvement, Applied Biosystems introduced the first automatic sequencing machine (the AB370) in 1987, adopting capillary electrophoresis, which made sequencing faster and more accurate. The AB370 could detect 96 bases at a time and 500 K bases a day, and its read length could reach 600 bases. The current model, the AB3730xl, can output 2.88 M bases per day, and its read length has been able to reach 900 bases since 1995. Having emerged in 1998, automatic sequencing instruments and associated software based on capillary sequencing machines and Sanger sequencing technology became the main tools for the completion of the Human Genome Project in 2001. This project greatly stimulated the development of powerful novel sequencing instruments to increase speed and accuracy while simultaneously reducing cost and manpower. In addition, the X-prize accelerated the development of next-generation sequencing (NGS). The NGS technologies differ from the Sanger method in their massively parallel analysis, high throughput, and reduced cost. Although NGS makes genome sequences readily available, the subsequent data analysis and biological interpretation are still the bottleneck in understanding genomes.
Following the Human Genome Project, the 454 system was launched by 454 in 2005, and Solexa released the Genome Analyzer the next year, followed by SOLiD (Sequencing by Oligo Ligation Detection) from Agencourt. These are the three most typical massively parallel sequencing systems in next-generation sequencing (NGS), and they share good performance in throughput, accuracy, and cost compared with Sanger sequencing (shown in Table 1(a)). The founder companies were then purchased by other companies: in 2006 Agencourt was purchased by Applied Biosystems, and in 2007 454 was purchased by Roche, while Solexa was purchased by Illumina. After years of evolution, these three systems exhibit better performance and have their own advantages in terms of read length, accuracy, applications, consumables, manpower requirements, informatics infrastructure, and so forth. The comparison of these three systems is the focus of the later part of this paper (also see Tables 1(a), 1(b), and 1(c)).
Table 1: (a) Advantage and mechanism of sequencers. (b) Components and cost of sequencers. (c) Application of sequencers.
2. Roche 454 System
Roche 454 was the first commercially successful next-generation system. This sequencer uses pyrosequencing technology. Instead of using dideoxynucleotides to terminate chain elongation, pyrosequencing relies on the detection of pyrophosphate released during nucleotide incorporation. The library DNAs with 454-specific adaptors are denatured into single strands and captured by amplification beads, followed by emulsion PCR. Then, on a picotiter plate, one dNTP (dATP, dGTP, dCTP, or dTTP) is added at a time; if it is complementary to the next base of the template strand, DNA polymerase incorporates it and releases pyrophosphate (PPi) in an amount that equals the amount of incorporated nucleotide. ATP sulfurylase converts the PPi into ATP in the presence of adenosine 5′ phosphosulfate (APS), and this ATP drives the luciferase-mediated conversion of luciferin into oxyluciferin, which generates visible light. At the same time, the unincorporated nucleotides are degraded by apyrase. Then another dNTP is added to the reaction system and the pyrosequencing reaction is repeated.
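The flow-by-flow logic described above can be sketched in Python as an idealized, noise-free flowgram: for each nucleotide flow, the light signal is taken to equal the number of bases incorporated, so a homopolymer run gives a proportionally larger signal. The function names and the cyclic flow order "TACG" are assumptions for illustration only, and real signals are noisy, which is why long homopolymers are hard to call.

```python
def flowgram(read, flow_order="TACG"):
    """Ideal (nucleotide, signal) pairs produced while synthesizing `read`."""
    unknown = set(read) - set(flow_order)
    if unknown:
        raise ValueError(f"bases not covered by the flow order: {sorted(unknown)}")
    signals = []
    pos = 0
    while pos < len(read):
        for base in flow_order:
            run = 0
            # Every consecutive matching base is incorporated in one flow.
            while pos < len(read) and read[pos] == base:
                run += 1
                pos += 1
            signals.append((base, run))
            if pos == len(read):
                break
    return signals


def call_bases(signals):
    """Rebuild the read by repeating each nucleotide `signal` times."""
    return "".join(base * count for base, count in signals)


signals = flowgram("TTACGG")
print(signals)
print(call_bases(signals))
```

A flow with signal 0 corresponds to a nucleotide that was not complementary and was washed away or degraded before the next flow.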
Initially, in 2005, the Roche 454 produced reads of 100–150 bp, more than 200,000 reads per run, and an output of 20 Mb per run [9, 10]. In 2008 the 454 GS FLX Titanium system was launched; through upgrading, its read length could reach 700 bp with an accuracy of 99.9% after filtering, and it could output 0.7 G of data per run within 24 hours. In late 2009 Roche added the GS Junior, a bench-top system, to the 454 sequencing line, which simplified library preparation and data processing, and output was also upgraded to 14 G per run [11, 12]. The most outstanding advantage of Roche is its speed: it takes only 10 hours from the start of sequencing to completion. Its read length is also a distinguishing characteristic compared with other NGS systems (described in the later part of this paper). However, the high cost of reagents remains a challenge for Roche 454. It is about $ per base (counting reagent use only). Another shortcoming is a relatively high error rate in homopolymer runs longer than 6 bp. On the other hand, its library construction can be automated and the emulsion PCR can be semiautomated, which can reduce manpower to a great extent. Other informatics infrastructure and sequencing advantages are listed and compared with the HiSeq 2000 and SOLiD systems in Tables 1(a), 1(b), and 1(c).
2.1. 454 GS FLX Titanium Software
GS RunProcessor is the main part of the GS FLX Titanium system software. It is in charge of image background normalization, signal location correction, cross-talk correction, signal conversion, and sequencing data generation. After each run, GS RunProcessor produces a series of files including SFF (standard flowgram format) files. SFF files contain the basecalled sequences and corresponding quality scores for all individual, high-quality reads (filtered reads), and they can be viewed directly on the screen of the GS FLX Titanium system. Using the GS De Novo Assembler, GS Reference Mapper, and GS Amplicon Variant Analyzer provided with the GS FLX Titanium system, SFF files can be used in many ways and converted into fastq format for further data analysis.
3. AB SOLiD System
SOLiD (Sequencing by Oligo Ligation Detection) was purchased by Applied Biosystems in 2006. The sequencer adopts the technology of two-base sequencing based on ligation sequencing. On a SOLiD flowcell, the libraries are sequenced by ligation of 8-base probes, each of which contains a ligation site (the first base), a cleavage site (the fifth base), and one of 4 different fluorescent dyes (linked to the last base). The fluorescent signal is recorded while the probe is hybridized to the template strand and is removed by cleavage of the probe’s last 3 bases. The sequence of the fragment can be deduced after 5 rounds of sequencing using ladder primer sets.
The read length of SOLiD was initially 35 bp and the output was 3 G of data per run. Owing to the two-base sequencing method, SOLiD could reach a high accuracy of 99.85% after filtering. At the end of 2007, ABI released the first SOLiD system. In late 2010, the SOLiD 5500xl sequencing system was released. From SOLiD to SOLiD 5500xl, five upgrades were released by ABI in just three years. The SOLiD 5500xl achieved an improved read length, accuracy, and data output of 85 bp, 99.99%, and 30 G per run, respectively. A complete run can be finished within 7 days. The sequencing cost is about per base estimated from reagent use only by BGI users. However, its short read length and its restriction to resequencing applications remain its major shortcomings. Applications of SOLiD include whole genome resequencing, targeted resequencing, transcriptome research (including gene expression profiling, small RNA analysis, and whole transcriptome analysis), and epigenome research (such as ChIP-Seq and methylation). Like other NGS systems, SOLiD’s computational infrastructure is expensive and not trivial to use; it requires an air-conditioned data center, a computing cluster, personnel skilled in computing, a distributed memory cluster, fast networks, and a batch queue system. The operating system used by most researchers is GNU/Linux. Each SOLiD sequencer run takes 7 days and generates around 4 TB of raw data. More data are generated after bioinformatics analysis. This information is listed and compared with other NGS systems in Tables 1(a), 1(b), and 1(c). Automation can be used in library preparation, for example, the Tecan system, which integrates a Covaris A and Roche 454 REM e system.
3.1. SOLiD Software
After sequencing with SOLiD, the raw data accumulate as a color-coded sequence. According to the double-base coding matrix, the original color sequence can be decoded into the base sequence if the base type at any one position in the sequence is known. Because each color corresponds to four possible base pairs, the color call for one base directly influences the decoding of the following base; that is, a single wrong color call causes a chain of decoding mistakes. BioScope is the SOLiD data analysis package, which provides a validated, single framework for resequencing, ChIP-Seq, and whole transcriptome analysis. It depends on a reference for the follow-up data analysis. First, the software converts the base sequence of the reference into a color-coded sequence. Second, the color-coded reference sequence is compared with the original color-coded reads to obtain mapping information using the newly developed mapping algorithm MaxMapper.
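The decoding chain described above can be illustrated with a short Python sketch. It assumes the commonly used representation of the double-base coding matrix in which bases are numbered A=0, C=1, G=2, T=3 and the color of a dinucleotide is the XOR of its two base codes; the function names are hypothetical, and the mapping of color numbers to actual dyes is not given in the text.

```python
BASES = "ACGT"  # assumed numbering: A=0, C=1, G=2, T=3


def encode(seq):
    """Return (first base, colors) where each color encodes one dinucleotide."""
    colors = [BASES.index(a) ^ BASES.index(b) for a, b in zip(seq, seq[1:])]
    return seq[0], colors


def decode(first_base, colors):
    """Rebuild the base sequence from a known first base and the colors."""
    seq = [first_base]
    for color in colors:
        # Each base depends on the previously decoded base.
        seq.append(BASES[BASES.index(seq[-1]) ^ color])
    return "".join(seq)


first, colors = encode("ATGGCA")
print(colors, decode(first, colors))

# A single miscalled color (second position) corrupts every later base.
miscalled = colors[:]
miscalled[1] = 2
print(decode(first, miscalled))
```

This propagation is why SOLiD reads are usually compared with a reference in color space, as BioScope does, rather than being decoded to bases first.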
4. Illumina GA/HiSeq System
In 2006, Solexa released the Genome Analyzer (GA), and in 2007 the company was purchased by Illumina. The sequencer adopts the technology of sequencing by synthesis (SBS). The library with fixed adaptors is denatured to single strands and grafted to the flowcell, followed by bridge amplification to form clusters which contain clonal DNA fragments. Before sequencing, the clusters are converted into single strands with the help of a linearization enzyme, and then four kinds of nucleotides (modified dATP, dGTP, dCTP, and dTTP), each carrying a different cleavable fluorescent dye and a removable blocking group, complement the template one base at a time, and the signal is captured by a charge-coupled device (CCD).
At first, the Solexa GA output was 1 G/run. Through improvements in polymerase, buffer, flowcell, and software, the output of the GA increased in 2009 to 20 G/run in August (75PE), 30 G/run in October (100PE), and 50 G/run in December (Truseq V3, 150PE), and the latest GAIIx series can attain 85 G/run. In early 2010, Illumina launched the HiSeq 2000, which adopts the same sequencing strategy as the GA, and BGI was among the first globally to adopt the HiSeq system. Its output was initially 200 G per run and has since improved to 600 G per run, which can be finished in 8 days. In the foreseeable future, it could reach 1 T/run, at which point the cost of a personal genome could drop below $1 K. The error rate of 100PE can be below 2% on average after filtering (BGI’s data). Compared with 454 and SOLiD, the HiSeq 2000 is the cheapest in sequencing, at $0.02/million bases (reagents only, as counted by BGI). With multiplexing incorporated in the P5/P7 primers and adapters, it can handle thousands of samples simultaneously. The HiSeq 2000 needs HCS (HiSeq control software) for program control, RTA (real-time analyzer software) for on-instrument base-calling, and CASAVA for secondary analysis. There is a 3 TB hard disk in the HiSeq 2000. With the aid of Truseq v3 reagents and associated software, the HiSeq 2000 has improved considerably in sequencing high-GC regions. MiSeq, a bench-top sequencer launched in 2011 that shares most technologies with HiSeq, is especially convenient for amplicon and bacterial sample sequencing. It can sequence 150PE and generate 1.5 G/run in about 10 hrs, including sample and library preparation time. Library preparation and concentration measurement can both be automated with compatible systems such as Agilent Bravo, Hamilton Banadu, Tecan, and Apricot Designs.
4.1. HiSeq Software
HiSeq control software (HCS) and real-time analyzer (RTA) are adopted by the HiSeq 2000. These two programs calculate the number and position of clusters based on the first 20 bases, so the first 20 bases of each read determine the output and quality of each sequencing run. The HiSeq 2000 uses two lasers and four filters to detect the four types of nucleotide (A, T, G, and C). The emission spectra of these four nucleotides show cross-talk, so the images of the four nucleotides are not independent, and the distribution of bases affects the quality of sequencing. The standard sequencing output files of the HiSeq 2000 are *bcl files, which contain the base calls and quality scores for each cycle. These are then converted into *_qseq.txt files by BCL Converter. The ELAND program of CASAVA (offline software provided by Illumina) is used to match a large number of reads against a genome.
In conclusion, of the three NGS systems described above, the Illumina HiSeq 2000 features the biggest output and lowest reagent cost, the SOLiD system has the highest accuracy, and the Roche 454 system has the longest read length. Details of the three sequencing systems are listed in Tables 1(a), 1(b), and 1(c).
5. Compact PGM Sequencers
The Ion Personal Genome Machine (PGM) and MiSeq were launched by Ion Torrent and Illumina, respectively. Both are small in size and feature fast turnaround but limited data throughput. They are targeted at clinical applications and small labs.
5.1. Ion PGM from Ion Torrent
Ion PGM was released by Ion Torrent at the end of 2010. PGM uses semiconductor sequencing technology. When a nucleotide is incorporated into the DNA molecule by the polymerase, a proton is released. By detecting the change in pH, the PGM recognizes whether a nucleotide has been added or not. The chip is flooded with one nucleotide after another; if the nucleotide is not the correct one, no voltage change is detected, and if 2 nucleotides are added, a double voltage is detected. The PGM is the first commercial sequencing machine that does not require fluorescence and camera scanning, resulting in higher speed, lower cost, and smaller instrument size. Currently, it enables 200 bp reads in 2 hours, and the sample preparation time is less than 6 hours for 8 samples in parallel.
An exemplary application of the Ion Torrent PGM sequencer is the identification of microbial pathogens. In May and June of 2011, an outbreak of exceptionally virulent Shiga-toxin- (Stx) producing Escherichia coli O104:H4, centered in Germany [16, 17], infected more than 3000 people. Whole genome sequencing on the Ion Torrent PGM sequencer and the HiSeq 2000 helped scientists to identify the type of E. coli, which directly provided clues to its antibiotic resistance. The strain appeared to be a hybrid of two E. coli strains, enteroaggregative E. coli and enterohemorrhagic E. coli, which may help explain why it has been particularly pathogenic. The sequencing results for E. coli TY2482 show the potential of a fast, though limited-throughput, sequencer such as the PGM when there is an outbreak of a new disease.
In order to study the sequencing quality, mapping rate, and GC depth distribution of Ion Torrent and compare them with HiSeq 2000, a Rhodobacter sample with high GC content (66%) and a 4.2 Mb genome was sequenced on these two sequencers (Table 2). In another experiment, E. coli K12 DH10B (NC_010473.1), with a GC content of 50.78%, was sequenced by Ion Torrent for analysis of quality value, read length, position accuracies, and GC distribution (Figure 1).
Table 2: Comparison in alignment between Ion Torrent and HiSeq 2000.
Figure 1: Ion Torrent sequencing quality. E. coli K12 DH10B (NC_010473.1) with GC 50.78% was used for this experiment. (a) is 314–200 bp from Ion Torrent. The left figure shows quality values: the pink range represents the minimum and maximum quality values at each position, the green area represents the top and bottom quarter (1/4) of reads by quality, and the red line represents the average quality value at each position. The right figure is the read length analysis: the colored histogram represents the real read length, and the black line represents the mapped length, which differs from the real read length because 3′ soft clipping is allowed. (b) is the accuracy analysis. At each position, the accuracy types, including mismatch, insertion, and deletion, are shown on the left y-axis, and the average accuracy is shown on the right y-axis. The accuracy of 200 bp sequencing could reach 99%. (c) shows base composition along reads (left) and GC distribution analysis (right). The left figure shows the base composition at each position of the reads; the base lines split after about 95 cycles, indicating inaccurate sequencing. The right figure uses a 500 bp window, and the GC distribution is quite even. Data from high-GC samples also indicate good performance for Ion Torrent (data not shown).
5.1.1. Sequencing Quality
The quality of Ion Torrent is more stable, while the quality of HiSeq 2000 decreases noticeably after 50 cycles, which may be caused by decay of the fluorescent signal with increasing read length (shown in Figure 1).
5.1.2. Mapping
The insert size of the Rhodobacter library was 350 bp, and 0.5 Gb of data was obtained from HiSeq. The sequencing depth was over 100x, and the contig and scaffold N50 were 39530 bp and 194344 bp, respectively. Based on this assembly, we used 33 Mb of data obtained from Ion Torrent with a 314 chip to analyze the map rate. The alignment comparison is shown in Table 2.
The map rate of Ion Torrent is higher than that of HiSeq 2000, but the two are not directly comparable because different alignment methods were used for the different sequencers. Besides the significant differences in mismatch rate, insertion rate, and deletion rate, HiSeq 2000 and Ion Torrent are also not directly comparable because of their different sequencing principles. For example, homopolymer sites cannot be identified easily in Ion Torrent. Nevertheless, Ion Torrent shows stable quality along sequencing reads and good performance on mismatch accuracy, but a bias in the detection of indels. The different types of accuracy are analyzed and shown in Figure 1.
5.1.3. GC Depth Distribution
As shown in Figure 1, the GC depth distribution is better in Ion Torrent. In Ion Torrent, the sequencing depth remains similar while the GC content ranges from 63% to 73%. In HiSeq 2000, however, the average sequencing depth is 4x when the GC content is 60%, while it is 3x at 70% GC content.
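A GC depth distribution of this kind pairs the GC fraction of each reference window with the mean sequencing depth over that window. The Python sketch below shows one way to compute it; the function names, the use of non-overlapping windows, and the toy data are assumptions for illustration, with the 500 bp default taken from the window size mentioned in the Figure 1 caption.

```python
def gc_fraction(seq):
    """Fraction of G and C bases in a sequence (0.0 for an empty one)."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq) if seq else 0.0


def gc_depth_by_window(reference, depth, window=500):
    """Return (GC fraction, mean depth) for consecutive non-overlapping windows.

    `depth` holds the per-base sequencing depth along `reference`;
    a trailing partial window is ignored.
    """
    if len(reference) != len(depth):
        raise ValueError("reference and depth must have the same length")
    rows = []
    for start in range(0, len(reference) - window + 1, window):
        chunk = reference[start:start + window]
        mean_depth = sum(depth[start:start + window]) / window
        rows.append((gc_fraction(chunk), mean_depth))
    return rows


# Toy example with a 4 bp window instead of 500 bp.
reference = "GGCCATATGCGC"
depth = [4, 4, 4, 4, 2, 2, 2, 2, 3, 3, 3, 3]
print(gc_depth_by_window(reference, depth, window=4))
```

A platform with little GC bias would show roughly the same mean depth across windows regardless of their GC fraction.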
Ion Torrent has already released the Ion 314 and 316 chips and planned to launch the Ion 318 chip in late 2011. The chips differ in the number of wells, with more wells giving higher output within the same sequencing time. The Ion 318 chip enables the production of >1 Gb of data in 2 hours. Read length is expected to increase to >400 bp in 2012.
5.2. MiSeq from Illumina
MiSeq, which still uses SBS technology, was launched by Illumina. It integrates cluster generation, SBS, and data analysis in a single instrument and can go from sample to answer (analyzed data) within a single day (in as few as 8 hours). Nextera, TruSeq, and Illumina’s reversible terminator-based sequencing-by-synthesis chemistry are used in this innovative design. High data integrity and a broad range of applications, including amplicon sequencing, clone checking, ChIP-Seq, and small genome sequencing, are the outstanding features of MiSeq. It is also flexible, performing anything from single 36 bp reads (120 MB output) up to 2 × 150 bp paired-end reads (1–1.5 GB output). Owing to its significant improvement in read length, the resulting data perform better in contig assembly than HiSeq data (data not shown). The sequencing results for MiSeq are shown in Table 3, and PGM is compared with MiSeq in Table 4.
Table 3: MiSeq 150PE data.
Table 4: The comparison between PGM and MiSeq.
5.3. Complete Genomics
Complete Genomics has its own sequencer based on the Polonator G.007, which is a ligation-based sequencer. The owner of the Polonator G.007, Dover, collaborated with the Church Laboratory of Harvard Medical School, the same team behind the SOLiD system, to introduce this cheap open system. The Polonator combines a high-performance instrument at a very low price with freely downloadable, open-source software and protocols. The Polonator G.007 uses ligation detection sequencing, which decodes each base with a single-base probe in nonanucleotides (nonamers) rather than by dual-base coding. With the help of T4 DNA ligase, the fluorophore-tagged degenerate nonamers are selectively ligated onto a series of anchor primers; each of the four nonamer components is labeled with one of four fluorophores, which corresponds to the base type at the query position. During ligation, T4 DNA ligase is particularly sensitive to mismatches on the 3′ side of the gap, which helps improve sequencing accuracy. After imaging, the Polonator chemically strips the annealed primer-fluorescent probe complex from the array; the anchor primer is replaced, and a new mixture of fluorescently tagged nonamers is introduced to sequence the adjacent base.
Complete Genomics adds two updates to the Polonator G.007 approach: DNA nanoball (DNB) arrays and combinatorial probe-anchor ligation (cPAL). Compared with DNA clusters or microspheres, DNA nanoball arrays achieve a higher density of DNA on the surface of a silicon chip. Because the seven 5-base segments are discontinuous, the hybridization-ligation-detection cycle has a higher fault tolerance than SOLiD. Complete Genomics claims 99.999% accuracy at 40x depth and can analyze SNPs, indels, and CNVs at a price of $5500–$9500. However, Illumina reported better performance for the HiSeq 2000 using only 30x data (Illumina Genome Network).
Recently, some researchers compared CG’s human genome sequencing data with data from the Illumina system and found notable differences in the detection of SNVs and indels, as well as system-specific variant detections.
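Accuracy figures such as the 99.999% claimed above, or the 99% quoted for 200 bp Ion Torrent reads in Figure 1, are easier to compare on a logarithmic scale. The Python sketch below uses the standard Phred definition, Q = −10·log10(error probability). Expressing vendor accuracy claims as Phred scores is a convenience chosen for this illustration, not something the vendors are stated to do, and the function names are hypothetical.

```python
import math


def accuracy_to_phred(accuracy: float) -> float:
    """Convert an accuracy in [0, 1) to a Phred-scaled quality score."""
    if not 0.0 <= accuracy < 1.0:
        raise ValueError("accuracy must be in [0, 1)")
    error_probability = 1.0 - accuracy
    return -10.0 * math.log10(error_probability)


def phred_to_accuracy(quality: float) -> float:
    """Convert a Phred-scaled quality score back to an accuracy."""
    return 1.0 - 10.0 ** (-quality / 10.0)


q_consensus = accuracy_to_phred(0.99999)  # about Q50: 1 error in 100,000 bases
q_read = accuracy_to_phred(0.99)          # about Q20: 1 error in 100 bases
```

On this scale each additional 10 points means a tenfold lower error rate, so the gap between a 99% and a 99.999% accuracy is three orders of magnitude in expected errors.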
5.4. The Third Generation Sequencer
While the use of next-generation sequencing keeps growing and the technology continues to be modified, third-generation sequencing is emerging with new insights into sequencing. Third-generation sequencing has two main characteristics. First, PCR is not needed before sequencing, which shortens the DNA preparation time. Second, the signal is captured in real time, which means that the signal, whether it is fluorescent (PacBio) or an electric current (Nanopore), is monitored during the enzymatic reaction that adds nucleotides to the complementary strand.
Single-molecule real-time (SMRT) sequencing is the third-generation sequencing method developed by Pacific Biosciences (Menlo Park, CA, USA), which makes use of a modified enzyme and direct observation of the enzymatic reaction in real time. An SMRT cell consists of millions of zero-mode waveguides (ZMWs), each embedded with only one set of enzyme and DNA template that can be detected during the whole process. During the reaction, the enzyme incorporates a nucleotide into the complementary strand and cleaves off the fluorescent dye previously linked to the nucleotide. The camera inside the machine then captures the signal in a movie format in real time. This yields not only the fluorescent signal but also the signal difference over time, which may be useful for predicting structural variance in the sequence and is especially useful in epigenetic studies such as DNA methylation.
Compared with second-generation sequencers, the PacBio RS (the first sequencer launched by PacBio) has several advantages. First, sample preparation is very fast; it takes 4 to 6 hours instead of days. It also needs no PCR step during preparation, which reduces the bias and error caused by PCR. Second, the turnaround is quite fast; runs are finished within a day. Third, the average read length is 1300 bp, which is longer than that of any second-generation sequencing technology. Although the throughput of the PacBio RS is lower than that of second-generation sequencers, this technology is quite useful for clinical laboratories, especially for microbiology research. A paper has been published using the PacBio RS on the Haitian cholera outbreak.
We have run a de novo assembly of a DNA fosmid sample from oyster with the PacBio RS in standard sequencing mode (using LPR chemistry and SMRT cells instead of the new-version FCR chemistry and SMRT cells). A SMRTbell template with a mean insert size of 7500 kb was made and run in one SMRT cell, and a 120-minute movie was taken. After the post-QC filter, 22,373,400 bp in 6754 reads (average 2,566 bp) were sequenced, with an average read score of 0.819. The coverage is 324x, with a mean read score of 0.861 and high accuracy (~99.95%). The result is shown in Figure 2.
Figure 2: Sequencing of a fosmid DNA using the Pacific Biosciences sequencer. With coverage, the accuracy could be above 97%. The figure was constructed from BGI’s own data.
Nanopore sequencing is another third-generation sequencing method. A nanopore is a tiny biopore with a nanoscale diameter, found in protein channels that are embedded in a lipid bilayer and facilitate ion exchange. Because of this biological role, any particle movement through the pore can disrupt the voltage across the channel. The core concept of nanopore sequencing involves threading single-stranded DNA through an α-haemolysin (αHL) pore. αHL, a 33 kD protein isolated from Staphylococcus aureus, undergoes self-assembly to form a heptameric transmembrane channel. It can tolerate an extraordinary voltage of up to 100 mV with a current of 100 pA. This unique property supports its role as the building block of the nanopore. In nanopore sequencing, an ionic flow is applied continuously, and current disruption is detected by standard electrophysiological techniques. Readout relies on the size difference between the deoxyribonucleoside monophosphates (dNMPs): each dNMP produces a characteristic current modulation that allows it to be discriminated. The ionic current resumes once the trapped nucleotide has been entirely squeezed out.
Nanopore sequencing possesses a number of advantages over existing commercialized next-generation sequencing technologies. Firstly, it can potentially reach long read lengths of >5 kbp at a speed of 1 bp/ns. Secondly, the detection of bases is free of fluorescent tags. Thirdly, except for the use of an exonuclease for holding up ssDNA and cleaving nucleotides, the involvement of enzymes is largely avoided in nanopore sequencing. This implies that nanopore sequencing is less sensitive to temperature throughout the sequencing reaction, so a reliable outcome can be maintained. Fourthly, instead of sequencing DNA during polymerization, single DNA strands are sequenced through the nanopore by means of DNA strand depolymerization. Hence, the hands-on time for sample preparation steps such as cloning and amplification can be shortened significantly.
6. Discussion of NGS Applications
Fast progress in DNA sequencing technology has brought a substantial reduction in costs and a substantial increase in throughput and accuracy. With more and more organisms being sequenced, a flood of genetic data is inundating the world every day. Progress in genomics has moved steadily forward owing to a revolution in sequencing technology. Additionally, other types of large-scale studies in exomics, metagenomics, epigenomics, and transcriptomics have all become reality. These studies not only provide knowledge for basic research but also afford immediate application benefits. Scientists across many fields are using these data to develop better-thriving crops and livestock, higher crop yields, and improved diagnostics, prognostics, and therapies for cancer and other complex diseases.
BGI is on the cutting edge of translating genomics research into molecular breeding and disease association studies, in the belief that agriculture, medicine, drug development, and clinical treatment will eventually enter a new stage built on a more detailed understanding of the genetic components of all organisms. BGI is primarily focused on three projects. (1) The Million Species/Varieties Genomes Project aims to sequence a million economically and scientifically important plants, animals, and model organisms, including different breeds, varieties, and strains. This project is best represented by our sequencing of the genomes of the giant panda, potato, macaca, and others, along with multiple resequencing projects. (2) The Million Human Genomes Project focuses on large-scale population and association studies that use whole-genome or whole-exome sequencing strategies. (3) The Million Eco-System Genomes Project has the objective of sequencing the metagenome and cultured microbiome of several different environments, including microenvironments within the human body. Together they are called the 3M project.
In the following part, each of these applications, including de novo sequencing, mate-pair sequencing, whole-genome or target-region resequencing, small RNA, transcriptome, RNA-seq, epigenomics, and metagenomics, is briefly summarized.
In DNA de novo sequencing, a library with an insert size below 800 bp is defined as a DNA short-fragment library, and it is usually applied in de novo sequencing and resequencing research. Skovgaard et al. applied a combined method of WGS (whole-genome sequencing) and genome copy number analysis to identify mutations that suppress the growth deficiency imposed by excessive initiations from the E. coli origin of replication, oriC.
Mate-pair library sequencing is significantly beneficial for de novo sequencing because it can reduce gap regions and extend scaffold length. Reinhardt et al. developed a novel method for de novo genome assembly by analyzing data from high-throughput short-read sequencing technology. They assembled genomes into large scaffolds at a fraction of the traditional cost and without using a reference sequence. The assembly of one sample yielded an N50 scaffold size of 531,821 bp, with >75% of the predicted genome covered by scaffolds over 100,000 bp.
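N50 values such as the scaffold N50 reported above follow from a simple rule: sort the contig or scaffold lengths from longest to shortest and report the length at which the running sum first reaches half of the total assembly size. A minimal Python sketch of this standard definition follows; the function name and the toy lengths are illustrative, not data from any of the cited studies.

```python
def n50(lengths: list) -> int:
    """Return the N50 of a list of contig or scaffold lengths.

    N50 is the length L such that sequences of length >= L together
    contain at least half of the total assembled bases.
    """
    if not lengths:
        raise ValueError("no lengths given")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:   # integer form of running >= total / 2
            return length
    raise AssertionError("unreachable for non-empty input")


# Toy assembly of 290 bp: 80 + 70 = 150 bp already covers half of it.
toy_n50 = n50([30, 80, 20, 70, 50, 40])
```

Because N50 depends only on the length distribution, it rewards long scaffolds without saying anything about their correctness, which is why it is usually reported together with genome coverage, as in the study above.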
Whole genome resequencing determines the complete DNA sequence of an organism’s genome, including all chromosomal DNA, at a single time and aligns it with the reference sequence. Mills et al. constructed a map of unbalanced SVs (genomic structural variants) based on whole genome DNA sequencing data from 185 human genomes with the SOLiD platform; the map encompassed 22,025 deletions and 6,000 additional SVs, including insertions and tandem duplications. Most SVs (53%) were mapped to nucleotide resolution, which facilitated the analysis of their origin and functional impact.
Whole genome resequencing is an effective way to study functional genes, but its high cost and massive data volume are the main problems for most researchers. Target region sequencing is one solution. Microarray capture is a popular form of target region sequencing, which uses hybridization to arrays containing synthetic oligonucleotides that match the target DNA sequence. Gnirke et al. developed a capture method that uses RNA “baits” to capture target DNA fragments from the “pond” and then uses the Illumina platform to read out the sequence. About 90% of uniquely aligning bases fell on or near a bait sequence; up to 50% lay on exons proper.
Fehniger et al. used two platforms, Illumina GA and ABI SOLiD, to define the miRNA transcriptomes of resting and cytokine-activated primary murine NK (natural killer) cells. From small RNA libraries of NK cells, 302 known and 21 novel mature miRNAs were identified and analyzed with a unique bioinformatics pipeline. These miRNAs are expressed over a broad range and exhibit isomiR complexity, and a subset is differentially expressed following cytokine activation; these features served as clues for the identification of miRNAs with the Illumina GA and SOLiD instruments.
The transcriptome is the set of all RNA molecules, including mRNA, rRNA, tRNA, and other noncoding RNA, produced in one cell or a population of cells. In recent years, next-generation sequencing technology has been used to study the transcriptome, in contrast to the DNA microarray technology used in the past. The S. mediterranea transcriptome was sequenced with an efficient sequencing strategy designed by Adamidi et al. The catalog of assembled transcripts and the identified peptides in this study dramatically expand and refine planarian gene annotation, as demonstrated by the validation of several previously unknown transcripts with stem cell-dependent expression patterns.
RNA-seq is a new RNA sequencing method for studying mRNA expression. Its sample preparation is similar to that of transcriptome sequencing, except for the enzyme used. To estimate the technical variance, Marioni et al. analyzed a kidney RNA sample on both the Illumina platform and Affymetrix arrays. Additional findings, such as low-expressed genes, alternative splice variants, and novel transcripts, were obtained on the Illumina platform. Bradford et al. compared data from an RNA-seq library on the SOLiD platform with Affymetrix Exon 1.0ST arrays and found a high degree of correspondence between the two platforms in terms of exon-level fold changes and detection. The greatest detection correspondence was seen when the background error rate in RNA-seq was extremely low. The difference between RNA-seq and transcriptome sequencing is less obvious on SOLiD than on Illumina.
There are two kinds of epigenetic applications: chromatin immunoprecipitation and methylation analysis. Chromatin immunoprecipitation (ChIP) is an immunoprecipitation technique used to study the interaction between proteins and DNA in a cell; histone modifications can be located at specific positions in the genome. Based on next-generation sequencing technology, Johnson et al. developed a large-scale chromatin immunoprecipitation assay to identify motifs, especially noncanonical NRSF-binding motifs. The data display sharp resolution of binding position (±50 bp), which is important for inferring new candidate interactions, together with high sensitivity and specificity (ROC (receiver operator characteristic) area ≥0.96) and statistical confidence (P < 10^-4). Another important epigenetic application is DNA methylation analysis. In vertebrates, DNA methylation typically occurs at CpG sites, where methylation converts cytosine to 5-methylcytosine. Chung presented a whole-methylome sequencing study on the SOLiD platform that examined the difference between two bisulfite conversion methods (in solution versus in gel).
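The principle behind bisulfite-based methylation analysis can be shown with a small simulation. Bisulfite treatment converts unmethylated cytosine to uracil, which is read as T after amplification, while 5-methylcytosine is protected and still reads as C. The Python sketch below assumes complete conversion, considers only the top strand, and ignores sequencing errors; the function names and the toy sequence are hypothetical and do not reproduce the method of the cited study.

```python
def bisulfite_convert(seq: str, methylated_positions: set) -> str:
    """Simulate bisulfite conversion of the top strand.

    Unmethylated C reads as T; C at a methylated position stays C.
    Assumes 100% conversion efficiency (an idealization).
    """
    return "".join(
        "T" if base == "C" and i not in methylated_positions else base
        for i, base in enumerate(seq)
    )


def methylated_cpg_sites(reference: str, converted_read: str) -> list:
    """Return positions of CpG cytosines that survived conversion."""
    if len(reference) != len(converted_read):
        raise ValueError("reference and read must have equal length")
    return [
        i
        for i in range(len(reference) - 1)
        if reference[i:i + 2] == "CG" and converted_read[i] == "C"
    ]


reference = "ACGTCCGA"                       # CpG cytosines at positions 1 and 5
read = bisulfite_convert(reference, {1})     # only position 1 is methylated
calls = methylated_cpg_sites(reference, read)
```

Comparing the converted read with the untreated reference is what turns an epigenetic mark into an ordinary C-to-T sequence difference that any of the platforms discussed here can detect.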
The world-class genome projects include the 1000 Genomes Project, the human ENCODE project, and the Human Microbiome Project (HMP), to name a few. BGI takes an active role in these and many more ongoing projects, such as the 1000 Animal and Plant Genome project, the MetaHIT project, the Yanhuang project, LUCAMP (Diabetes-associated Genes and Variations Study), ICGC (international cancer genome project), the Ancient human genome, the 1000 Mendelian Disorders Project, the Genome 10K Project, and so forth. These international collaborative genome projects have greatly enhanced genomics research and its applications in healthcare and other fields.
To manage multiple projects, including large and complex ones with up to tens of thousands of samples, a superior and sophisticated project management system is required to handle information processing from the very beginning of sample labeling and storage through library construction, multiplexing, sequencing, and informatics analysis. Research-oriented bioinformatics analysis and follow-up experiments are not included. Although the adoption of automation techniques has greatly reduced human intervention in bioexperiments, all other procedures carried out by people still have to be managed. BGI has developed the BMS system and a Cloud service for efficient information exchange and project management. Behavior management mainly follows the Japanese 5S on-site model. Additionally, BGI has passed the ISO9001 and CSPro (authorized by Illumina) QC systems and is currently taking the CLIA (Clinical Laboratory Improvement Amendments) and ASHI (American Society for Histocompatibility and Immunogenetics) tests. A quick, standard, and open reflection system guarantees an efficient troubleshooting pathway and high performance. One example is the instrument design failure of the TruSeq v3 flowcell, which resulted in bubble appearance (defined as the “bottom-middle-swatch” phenomenon by Illumina) and random in reads. This potentially harms sequencing quality and GC composition as well as throughput. It affects not only the small area where the bubble is located, resulting in reading, but also the focus of nearby areas, including the whole swatch and the adjacent swatch. Filtering parameters have to be determined to ensure quality raw data for bioinformatics processing. Led by the NGS tech group, joint meetings were called to analyze and troubleshoot this problem, to discuss strategies that best minimize its effect in terms of cost and project time, to construct a communication channel, and to statistically summarize compensation, in order to provide the best project management strategies at that time.
Some reagent QC examples are summarized in Liu et al.
BGI is establishing its cloud services. Combined with advanced NGS technologies offering multiple choices, a plug-and-run informatics service is handy and affordable. A range of software is available, including BLAST, SOAP, and SOAP SNP for sequence alignment, as well as pipelines for RNA-seq data. SNP calling programs such as Hecate and Gaea are also about to be released. Big-data studies from the whole spectrum of life and biomedical sciences can now be shared and published in a new journal, GigaScience, cofounded by BGI and BioMed Central. It has a novel publication format: each piece of data links to a standard manuscript publication, with an extensive database that hosts all associated data, data analysis tools, and cloud-computing resources. The scope covers not just omic-type data and the fields of high-throughput biology currently served by large public repositories but also the growing range of more difficult-to-access data, such as imaging, neuroscience, ecology, cohort data, systems biology, and other new types of large-scale sharable data.