Ph.D. Defense: Siliang Gong
Ph.D. Thesis Defense
Thursday, May 10th, 2018
107 Hanes Hall
Study on Correlations in High Dimensional Data
With the prevalence of high-dimensional data, variable selection is crucial in many real applications. Although various methods have been investigated over the past decades, challenges remain when tens of thousands of predictor variables are available for modeling. One difficulty arises from spurious correlation: the sample correlation between two variables can be large when the dimension is relatively high, even if the variables are independent. Because many classical variable selection methods choose a variable based on its marginal correlation with the response, spurious correlation may result in a high false discovery rate. On the other hand, when important variables are highly correlated, it is desirable to include all of them in the model, yet many existing methods offer no such guarantee. Another challenge is that most variable selection approaches require model selection to control model complexity. While cross-validation is commonly used for this purpose, it is computationally expensive and lacks statistical interpretation. In this proposal, we introduce some novel variable selection approaches to address the challenges mentioned above. Our proposed methods are based on an investigation of the limiting distribution of the spurious correlation. In the first project, we study the maximal absolute sample partial correlation between the covariates and the response, and introduce a testing-based variable selection procedure. In the second project, we take advantage of asymptotic results for the maximal absolute sample correlation among covariates and incorporate them into a penalized variable selection approach. The third project considers applications of the asymptotic results in multiple-response regression. Numerical studies demonstrate the effectiveness of our proposed methods.
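The spurious correlation phenomenon described in the abstract can be illustrated with a small simulation. The sketch below is not taken from the thesis; the function names, the sample size of 50, the Gaussian data, and the seed are all assumptions chosen for illustration. It draws a response and p predictors that are mutually independent, then reports the largest absolute sample correlation between the response and any predictor.

```python
import math
import random


def sample_correlation(x, y):
    """Pearson sample correlation between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def max_abs_spurious_correlation(n, p, seed=0):
    """Largest |sample correlation| between a response and p predictors,
    all drawn independently from N(0, 1) (hypothetical setup), so every
    population correlation is exactly zero."""
    rng = random.Random(seed)
    y = [rng.gauss(0.0, 1.0) for _ in range(n)]
    best = 0.0
    for _ in range(p):
        x = [rng.gauss(0.0, 1.0) for _ in range(n)]
        best = max(best, abs(sample_correlation(x, y)))
    return best


if __name__ == "__main__":
    # Same seed for every p, so the predictors for a smaller p are a
    # subset of those for a larger p and the maximum cannot decrease.
    for p in (10, 100, 1000):
        print(p, round(max_abs_spurious_correlation(50, p), 3))
```

With the sample size held fixed, the reported maximum tends to grow as p increases even though no predictor is related to the response, which is why selection rules based on marginal correlation can admit false discoveries.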