Hybrid-sequencing assembles gene isoforms accurately

With the rapid development of sequencing technology and the sharp reduction in sequencing costs, big data analysis and data mining have become one of the major challenges in fully understanding health and disease for potential clinical applications. Given the claims of “high accuracy” from emerging sequencing companies, are things really that simple and perfect inside the black box of genome analysis?

 

On March 9, 2015, Dr. Kin Fai Au from the University of Iowa visited the Beijing Institute of Genomics (BIG), CAS, and gave a talk titled “Hybrid-Seq, Gene Isoform Identification in hESCs, Gene Fusion Identification in Breast Cancer”. Dr. Au began by briefly introducing current developments in second- and third-generation sequencing technologies, focusing on their pros and cons: short reads from second-generation sequencing offer high sequence accuracy, but their limited length makes assembly more difficult; the long reads generated by third-generation sequencing, on the other hand, carry a rather high sequencing error rate. This creates a strong need for hybrid sequencing, which integrates the strengths of both technologies.

 

Dr. Au aligns short reads from second-generation sequencing to long reads from third-generation sequencing and replaces the long-read sequence with the aligned short reads, which reduces the error rate of the third-generation data. Aligning short and long reads generated by different sequencing technologies is difficult, however, because their error rates are so unequal.

 

Dr. Au’s group applied homopolymer compression to condense both the short and the long reads, which simplified the alignment process and greatly improved computational efficiency (software: LSC). They also developed a pipeline to identify gene isoforms, which integrates the core concept of LSC with a series of steps including isoform candidate library construction, abundance calculation, and isoform identification (software: IDP).
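
As a rough illustration of the homopolymer compression idea mentioned above, the sketch below collapses every run of identical bases into a single base and records the run lengths so that the original sequence can be restored. It is a minimal example only and is not taken from LSC; the function names `compress_homopolymers` and `decompress` and the example reads are hypothetical.

```python
from itertools import groupby


def compress_homopolymers(seq):
    """Collapse each run of identical bases into a single base.

    Returns the compressed sequence together with the length of every
    run, so the original sequence can be rebuilt later.
    (Hypothetical helper for illustration; not the LSC implementation.)
    """
    bases = []
    run_lengths = []
    for base, run in groupby(seq):
        bases.append(base)
        run_lengths.append(sum(1 for _ in run))
    return "".join(bases), run_lengths


def decompress(compressed, run_lengths):
    """Rebuild the original sequence from its compressed form."""
    return "".join(base * n for base, n in zip(compressed, run_lengths))


# Two made-up reads that differ only in homopolymer run lengths
long_read = "AAACCGTTTT"
short_read = "AACCGTTT"

long_compressed, long_runs = compress_homopolymers(long_read)
short_compressed, short_runs = compress_homopolymers(short_read)

# After compression the two reads are identical, so they align trivially
print(long_compressed, short_compressed)
```

In this toy case the two reads disagree only in the lengths of their homopolymer runs, so both compress to the same string. Reads that differ in other ways (substitutions, or insertions that start a new run) would still differ after compression.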

 

Applying this pipeline to human embryonic stem cells (hESCs) successfully identified gene isoforms, and the group experimentally verified 23 novel genes that warrant further investigation. Notably, the false positive rate of IDP is dramatically lower than that of other popular software such as Cufflinks, which is valuable for designing subsequent biological experiments. Finally, Dr. Au discussed the newly developed IDP-fusion package for fusion gene identification in breast cancer, which also shows promise for a more comprehensive understanding of the disease.

 

After the talk, Dr. Au answered questions from the audience, then met individually with several BIG faculty members for detailed discussion and visited the sequencing platform. Dr. Au is an assistant professor in the Department of Internal Medicine and the Department of Biostatistics at the University of Iowa, USA. A rising young star in bioinformatics, he has published numerous studies in prestigious journals such as PNAS and Science.
 
 

Dr. Au was invited to give the talk by the Outstanding Scientists Forum program of the journal Genomics, Proteomics and Bioinformatics (GPB).

  

GPB is a peer-reviewed open access journal that publishes papers from around the world in the fields of omics and bioinformatics. GPB aims to serve as a premier platform for communication among scientists in the omics and bioinformatics fields and to enhance collaboration across scientific communities.