KASAHARA Masahiro
(Associate Professor/Division of Biosciences)
Department of Computational Biology and Medical Sciences/Computational biology, bioinformatics using massive amounts of data
Career Summary
Mar. 2002: B.S., Department of Information Science, Faculty of Science, The University of Tokyo.
Mar. 2004: Master of Information Science and Technology, Department of Computer Science, Graduate School of Information Science and Technology, The University of Tokyo.
Apr. 2004: Project Assistant Professor, Department of Computational Biology, Graduate School of Frontier Sciences, The University of Tokyo.
Mar. 2009: Ph.D., The University of Tokyo.
Jun. 2009-present: Assistant Professor (PI), The University of Tokyo.
Educational Activities
Graduate School: Basic Bioinformatics I
Graduate School: Basic Bioinformatics Programming
Faculty of Science: Basics of Bioinformatics and Systems Biology I
Faculty of Science: Basic Laboratory Work in Information Science
Faculty of Science: Database in Biology
Research Activities
The throughput of DNA sequencers is now about four orders of magnitude higher than it was a few years ago. The international human genome consortium spent roughly 13 years and 3 billion dollars determining an accurate sequence of the human genome. With the advent of second-generation DNA sequencers such as HiSeq2000, which was announced by Illumina Inc. in Jan. 2010, we can obtain more than 200 Gbp in a single run, although the nucleotide accuracy is lower than that of previous sequencers. We need only a week to obtain 60 times as many nucleotides as the human genome contains.
Given this rapid advance in sequencing technology, algorithms and systems that can analyze such huge amounts of data are needed to extract biological knowledge from them.
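As a rough check of the figures above, the sketch below divides the single-run yield by the genome size. The genome size of about 3.1 Gbp is an assumption that the text does not state, and all names are illustrative.

```python
# Assumed approximate size of the haploid human genome (not given in the text).
HUMAN_GENOME_BP = 3.1e9
# Single-run yield quoted in the text ("more than 200 Gbp").
RUN_YIELD_BP = 200e9


def fold_coverage(yield_bp: float, genome_bp: float) -> float:
    """Return how many times the sequenced nucleotides cover the genome."""
    return yield_bp / genome_bp


if __name__ == "__main__":
    # Prints a value a little above 60, in line with the "60 times" figure.
    print(f"{fold_coverage(RUN_YIELD_BP, HUMAN_GENOME_BP):.1f}x")
```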
(1) Genome Assembly Algorithms
DNA sequencers can "read" only 30 to 1000 consecutive nucleotides. However, the chromosomes of higher organisms are much longer. To "read" entire chromosomes, genomic DNA is first randomly fragmented, and then millions of the fragments are "read" by sequencers. This process is called the whole-genome shotgun method, and it is widely used to determine genome sequences.
Each obtained nucleotide sequence is a random part of the target genome, and its position on the genome is not known. Therefore, we need to reconstruct the original genome sequence by stitching the pieces together using computers. This processing is called genome assembly, and it is one of my primary research interests.
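The idea of shotgun sequencing followed by assembly can be illustrated with a toy example. The sketch below samples reads from random positions and then stitches reads together by greedily merging the pair with the longest suffix-prefix overlap. This is a textbook simplification under strong assumptions (error-free reads, a single strand, no repeats), not the parallel assembly algorithm described in this section, and all function names are hypothetical.

```python
import random


def shotgun_reads(genome, read_len, n_reads, rng):
    """Sample reads from random positions; the positions are thrown away."""
    starts = [rng.randrange(len(genome) - read_len + 1) for _ in range(n_reads)]
    return [genome[s:s + read_len] for s in starts]


def overlap(a, b, min_len):
    """Length of the longest suffix of a equal to a prefix of b, or 0."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the two contigs with the largest overlap."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    # Drop reads that are fully contained in another read.
    contigs = [c for c in contigs if not any(c != d and c in d for d in contigs)]
    while True:
        best_k, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        if best_k == 0:
            return contigs  # nothing left to merge
        merged = contigs[best_i] + contigs[best_j][best_k:]
        contigs = [c for n, c in enumerate(contigs) if n not in (best_i, best_j)]
        contigs.append(merged)


if __name__ == "__main__":
    # Hand-picked overlapping reads from the toy genome ATGGCGTGCAATCC.
    print(greedy_assemble(["ATGGCG", "GCGTGC", "TGCAAT", "AATCC"]))
    # Randomly sampled reads may leave gaps, so several contigs can result.
    rng = random.Random(0)
    print(greedy_assemble(shotgun_reads("ATGGCGTGCAATCC", 6, 20, rng)))
```

With real data, sequencing errors, both DNA strands, and repeated regions make this greedy strategy insufficient, which is part of why assembly remains a research problem.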
Genome assembly algorithms have been studied for decades, but the nature of the problem has changed completely because of the rapid advances in sequencing technology. We are working on a parallel assembly algorithm that integrates the output from various sequencers, utilizing parallel distributed computing.
(2) Development of a distributed parallel computation framework for genome informatics
By taking advantage of massive parallelism, second-generation DNA sequencers have a wide range of applications other than genome sequencing. For example, we can determine the expression level and expressed regions on a genome by shotgun sequencing reverse-transcribed mRNAs. Other applications include tag-based methods that collect the 5' ends of mRNAs to determine expression levels; resequencing that identifies SNPs, copy-number polymorphisms, structural variations, etc.; detection of protein-genome interactions by immunoprecipitation; and identification of the three-dimensional structure of chromosomes by the Hi-C method.
Computational analysis methods therefore have to keep up with the growing range of experimental methods. To this end, faster algorithms should be developed. At the same time, utilizing data parallelism is another option for some kinds of data.
Research on parallel computation also has a long history. Numerous paradigms have been proposed so far, but none of them is particularly suitable for recent molecular biology and computational biology. There are several problems, including the following. (a) The data size is relatively huge and the computation is I/O-bound, but conventional parallel computing targets numerical algorithms, which are often CPU-bound. (b) The algorithms used in analysis are complex and full of heuristics. We wish to take advantage of parallel computation without touching existing huge and complex software, especially software written by others, but current technologies often require us to modify source code to make it parallel. (c) Implementation speed matters more than computation speed. The time spent in programming often dominates in genome informatics, and therefore easy implementation is favored over efficient parallel computation.
We propose a new parallel skeleton programming framework and pipeline computation platform for genome informatics.
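To make requirement (b) concrete, here is a minimal sketch of a "map" skeleton that runs an unmodified external command over chunks of input records in parallel and concatenates the results in order. It is not the framework proposed here; it only assumes a tool that reads records from standard input and writes one result per line to standard output. The function names and the stand-in command are hypothetical.

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor


def split_records(records, n_chunks):
    """Split records into at most n_chunks contiguous chunks, preserving order."""
    size = max(1, -(-len(records) // n_chunks))  # ceiling division
    return [records[i:i + size] for i in range(0, len(records), size)]


def run_tool(cmd, chunk):
    """Feed one chunk to the external command via stdin; return its stdout lines."""
    result = subprocess.run(
        cmd,
        input="\n".join(chunk) + "\n",
        capture_output=True,
        text=True,
        check=True,  # raise if the tool exits with an error
    )
    return result.stdout.splitlines()


def parallel_map(cmd, records, n_workers=4):
    """Apply an unmodified command to all records, one chunk per worker."""
    chunks = split_records(records, n_workers)
    # Threads suffice here because the real work happens in child processes.
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        outputs = list(pool.map(lambda c: run_tool(cmd, c), chunks))
    return [line for out in outputs for line in out]


if __name__ == "__main__":
    # Stand-in for an existing analysis tool: upper-cases each input line.
    tool = [sys.executable, "-c",
            "import sys\nfor l in sys.stdin: print(l.strip().upper())"]
    print(parallel_map(tool, ["acgt", "ttga", "ccat"], n_workers=2))
```

The caller writes only the command line and the choice of skeleton, which reflects point (c): programming effort stays small even if the parallelization is not the most efficient one possible.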
(3) RNA sequencing algorithms
RNA sequencing is a method for identifying mRNA molecules expressed in a sample using massively parallel sequencing. Gene prediction methods using only genome sequences are unreliable, so RNA sequencing is often used for annotating newly sequenced genomes. However, the genome sequences of newly sequenced species are often highly fragmented and unreliable, largely due to the technological limitations of new sequencing technologies. We are developing algorithms to analyze data from RNA sequencing for newly sequenced species.
Literature
1) Kasahara M, et al., Nature, 447(7145):714-9, 2007
2) Kasahara M and Morishita S, Large-Scale Genome Sequence Processing, Imperial College Press, London, 2006
3) Kasahara M, Genome assembler and its future with next-generation sequencers (review in Japanese), Jikken Igaku, Special Issue "Frontier of Applications and Development of Biodatabase and Software for Life Science", 26(7):1021-1032, 2008
Other Activities
Member of the Association for Computing Machinery and the Molecular Biology Society of Japan.
Future Plan
Combining bioinformatics for next-generation sequencers with synthetic biology might be an interesting direction in the future.
Messages to Students
Aim to be number one in your chosen field.