Ph.D. Defense: Weiwei Li
Ph.D. Thesis Defense
Thursday, March 19th, 2020
Meeting ID: 604 122 248
Data Science Methods with Applications to Genetic Sequencing
Data science methods are of increasing importance in modern genetic sequencing analysis. In this dissertation, we focus mainly on applying statistical modeling to the structural variant detection problem and on a new framework for scalable and provable subspace clustering.
In the first project, we discuss the optimal sampling strategy for structural variant detection using optical mapping. We develop an optimization approach based on a simple yet realistic model of the genomic mapping process, built on a hypergeometric distribution and probabilistic concentration inequalities.
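As an illustration only, the following sketch shows how a hypergeometric model can drive a sampling decision. The abstract does not specify the model, so the setup is an assumption: a pool of N molecules, K of which span a variant, from which n are sampled without replacement, and a call that needs at least k_min spanning molecules. All function names and the numbers in the demo are hypothetical. Hoeffding's inequality, which also holds for sampling without replacement, stands in for the concentration inequalities mentioned above.

```python
from math import comb, exp

def hypergeom_pmf(k, N, K, n):
    """P(X = k) when n items are drawn without replacement from N, K of them 'successes'."""
    lo, hi = max(0, n - (N - K)), min(n, K)
    if k < lo or k > hi:
        return 0.0
    return comb(K, k) * comb(N - K, n - k) / comb(N, n)

def prob_at_least(k_min, N, K, n):
    """Exact probability of observing at least k_min successes in the sample."""
    return sum(hypergeom_pmf(k, N, K, n) for k in range(k_min, min(n, K) + 1))

def hoeffding_miss_bound(k_min, N, K, n):
    """Hoeffding upper bound on P(X < k_min); returns the trivial bound 1.0 when it does not apply."""
    gap = n * K / N - (k_min - 1)  # distance between the mean and the largest 'miss' count
    return exp(-2 * gap * gap / n) if gap > 0 and n > 0 else 1.0

def min_sample_size(k_min, N, K, target):
    """Smallest n with P(X >= k_min) >= target, found by linear search (None if unreachable)."""
    for n in range(k_min, N + 1):
        if prob_at_least(k_min, N, K, n) >= target:
            return n
    return None

# Hypothetical numbers, chosen only to exercise the functions.
N, K, k_min, target = 1000, 50, 3, 0.95
n_star = min_sample_size(k_min, N, K, target)
print("smallest sample size:", n_star)
print("exact miss probability:", 1 - prob_at_least(k_min, N, K, n_star))
print("Hoeffding bound on miss probability:", hoeffding_miss_bound(k_min, N, K, n_star))
```

The exact tail is cheap to evaluate at this scale. A concentration bound is looser, but it gives a closed-form expression that is easier to optimize over.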
In the second project, we introduce a formal probabilistic model for assessing how well an optical read maps to a reference genome. We use this approach to infer the most likely location within that reference for any given read, as well as the likelihood of mapping to all other possible locations. Using data produced by BioNano Saphyr to parameterize a simulation, we show that our approach accurately identifies the likeliest locations of the observed optical read data.
If time permits, in the third part we introduce a new algorithm for subspace clustering. We model a collection of points in a high-dimensional space as a union of low-dimensional subspaces. In particular, we propose a highly scalable sampling-based algorithm that clusters the entire dataset by first applying spectral clustering to a small random sample and then classifying the remaining out-of-sample points. Numerical results indicate that our algorithm outperforms other state-of-the-art subspace clustering algorithms in both accuracy and speed.
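The sample-then-classify idea can be sketched as follows. This is a minimal illustration under stated assumptions, not the algorithm from the dissertation: the squared-cosine affinity, the SVD-based subspace fit, the residual-based classifier, and every function name are choices made here for the example. The synthetic data uses two orthogonal 2-dimensional subspaces in a 20-dimensional space so that the demo is easy to check.

```python
import numpy as np

def spectral_labels(A, k, n_iter=50):
    """Normalized spectral clustering of a symmetric affinity matrix A into k groups."""
    deg = A.sum(axis=1) + 1e-12
    M = A / np.sqrt(np.outer(deg, deg))           # D^{-1/2} A D^{-1/2}
    _, vecs = np.linalg.eigh(M)                   # eigenvalues come back in ascending order
    U = vecs[:, -k:]                              # eigenvectors of the k largest eigenvalues
    U = U / (np.linalg.norm(U, axis=1, keepdims=True) + 1e-12)
    # k-means on the embedding, with deterministic farthest-point initialization.
    centers = [U[0]]
    for _ in range(1, k):
        dist = np.min([np.linalg.norm(U - c, axis=1) for c in centers], axis=0)
        centers.append(U[np.argmax(dist)])
    centers = np.array(centers)
    for _ in range(n_iter):
        labels = np.argmin(np.linalg.norm(U[:, None, :] - centers[None], axis=2), axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = U[labels == j].mean(axis=0)
    return labels

def sampled_subspace_clustering(X, k, dim, n_sample, seed=0):
    """Cluster a random sample spectrally, then label every point by nearest fitted subspace."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(X.shape[0], size=n_sample, replace=False)
    S = X[idx] / np.linalg.norm(X[idx], axis=1, keepdims=True)
    A = (S @ S.T) ** 2                            # squared cosine affinity (an assumption)
    np.fill_diagonal(A, 0.0)
    sample_labels = spectral_labels(A, k)
    bases = []
    for j in range(k):                            # assumes every cluster is non-empty in the sample
        _, _, Vt = np.linalg.svd(X[idx][sample_labels == j], full_matrices=False)
        bases.append(Vt[:dim].T)                  # columns span the estimated subspace
    # Projection residual of every point (sampled or not) onto each fitted subspace.
    resid = np.stack([np.linalg.norm(X - (X @ B) @ B.T, axis=1) for B in bases], axis=1)
    return np.argmin(resid, axis=1)

# Synthetic demo: unit-norm points on two orthogonal 2-D subspaces of R^20, plus small noise.
rng = np.random.default_rng(1)
Q, _ = np.linalg.qr(rng.standard_normal((20, 4)))
truth = np.repeat([0, 1], 200)
X = np.empty((400, 20))
for j in range(2):
    coef = rng.standard_normal((200, 2))
    coef /= np.linalg.norm(coef, axis=1, keepdims=True)
    X[truth == j] = coef @ Q[:, 2 * j:2 * j + 2].T
X += 0.01 * rng.standard_normal(X.shape)

labels = sampled_subspace_clustering(X, k=2, dim=2, n_sample=60)
agree = np.mean(labels == truth)
accuracy = max(agree, 1 - agree)                  # cluster labels are defined only up to permutation
print("clustering accuracy:", accuracy)
```

Only the 60-by-60 sample affinity matrix is ever decomposed. The cost of labeling the remaining points grows linearly with the dataset size, which is the source of scalability in a scheme of this kind.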