STOR Colloquium: Jacob Bien, USC
University of Southern California
Tree-Based Aggregation of Rare Features for Prediction
It is common in modern prediction problems for many features to be counts of rarely occurring events. The challenge posed by such “rare features” has received little attention despite its prevalence in diverse areas, ranging from biology (e.g., rare species within a microbiome) to natural language processing (e.g., rare words within an online hotel review). We show, both theoretically and empirically, that not explicitly accounting for the rareness of features can greatly reduce the effectiveness of an analysis. We next propose a framework for aggregating rare features into denser features in a flexible manner that creates better predictors of the response. Applications to the microbiome and to online hotel reviews show how our methodology is useful in a wide range of contexts.