318 Hanes Hall, CB #3260 Chapel Hill, NC 27599-3260
919-962-1329

Old Colloquia


04 Dec

Mean field limits for stochastic differential games
120 Hanes Hall

Mean field game (MFG) theory generalizes classical models of interacting particle systems by replacing the particles with decision-makers, making the theory applicable in economics and other social sciences. Most research so far has focused on the existence and uniqueness of Nash equilibria in a model which arises intuitively as a continuum limit (i.e., an infinite-agent version) of a given large-population stochastic differential game of a certain symmetric type. This talk discusses some recent results in this direction, particularly for MFGs with common noise, but more attention is paid to recent progress on a less well-understood problem: Given for each n a Nash equilibrium for the n-player game, in what sense if any do these equilibria converge as n tends to infinity? The answer is somewhat unexpected, and certain forms of randomness can prevail in the limit which are well beyond the scope of the usual notion of MFG solution. A new notion of weak MFG solutions is shown to precisely characterize the set of possible limits of approximate Nash equilibria of n-player games, for a large class of models.

02 Dec

Structure of Inhomogeneous Random Graphs
120 Hanes Hall

In this talk we describe statistical properties of Inhomogeneous Random Graphs. In this model edges are present independently but with different probabilities depending on types associated to the vertices. The focus is on structural, metric, and spectral properties. In particular, we describe the connectivity threshold and the diameter of the graph. The talk is based on joint works with Luc Devroye and Dieter Mitsche.

30 Nov

Nonparametric Graphical Model: Foundation and Trends
120 Hanes Hall

We consider the problem of learning the structure of a non-Gaussian graphical model. We introduce two strategies for constructing tractable nonparametric graphical model families. One approach is through semiparametric extension of the Gaussian or exponential family graphical models that allows arbitrary graphs. Another approach is to restrict the family of allowed graphs to be acyclic, enabling the use of fully nonparametric density estimation in high dimensions. These two approaches can both be viewed as adding structural regularization to a general pairwise nonparametric Markov random field and reflect an interesting tradeoff of model flexibility with structural complexity. In terms of graph estimation, these methods achieve the optimal parametric rates of convergence. In terms of computation, these methods are as scalable as the best implemented parametric methods. Such a "free lunch phenomenon" makes them extremely attractive for large-scale applications. We will also introduce several new research directions along this line of work, including latent-variable extension, model-based nonconvex optimization, graph uncertainty assessment, and nonparametric graph property testing.

18 Nov

Data Analysis, machine learning in industry: Theory and Practice
120 Hanes Hall

Statistics, data analysis, and machine learning play a major role in many internet companies. However, it is not always clear for statistics students (and prospective employees) how their knowledge can be concretely applied and what skills are needed to be successful at these companies. It is also unclear how research work fits into the job of industry statisticians. This talk consists of two parts. First, I present some ways in which statistics plays a surprising role in solving some research and practical problems of interest in industry. In particular, I will talk about a computer science problem of estimating the number of distinct elements in a data stream in a memory efficient manner as well as estimating the standard deviation of the estimate (as any good statistician should do). A wide range of statistical concepts including sufficiency, completeness, martingales, composite-likelihood, and conjugate priors are used to solve the problem and demonstrate asymptotic efficiency of the methods. Second, I discuss roles statisticians can play at FB and the skills needed to be successful in the company. I will share my experiences in developing tools for doing statistics at FB.

09 Nov

Organizing individuals based on temporal processes
120 Hanes Hall

Individuals often differ in their brain processes across time, giving rise to the need for individual-level models. This heterogeneity in brain processes occurs even within predefined, arbitrary categories (such as those based on diagnoses or gender), indicating that these groups may be further classified or perhaps are better classified along a different dimension than the one chosen by the researcher. These results suggest a need for approaches which can accommodate individual-level heterogeneity as well as classify individuals based on similarities in brain processes. In this presentation I'll introduce a method for clustering individuals based on their dynamic processes. The approach builds from unified structural equation modeling (also referred to as structural vector autoregression), which is a statistical method for estimating effects among variables (e.g., brain regions) across time. Here, unsupervised classification of individuals occurs using community detection during the data-driven model selection procedure for arriving at individual-level models. Our approach has several advantages over existing methods for classifying individuals. In particular, it places no assumption on homogeneity of predefined subgroups within the sample and utilizes individual-level parameters that have been shown to be more reliable than some competing approaches. This flexible analytic technique will be illustrated with a simulation study and empirical functional MRI data.

Ryan Thorpe, Senior Director of Advanced Analytics at Liberty Mutual and a UNC alumnus, will be speaking on campus on Nov. 3rd at 3:30pm in Hanes 112. He will discuss the current state of data science at Liberty Mutual and general industry trends in analytics, and will host an open discussion on data science and business. Please join us for this exciting opportunity to speak with a UNC alumnus who is developing analytics for a Fortune 100 company and leading the integration of analytics into industry.

02 Nov

Robust Spatial Varying Coefficient Model in Neuroimaging Data Analysis
120 Hanes Hall

Neuroimaging studies aim to analyze imaging data with complex spatial patterns in a large number of locations (called voxels) on a two dimensional (2D) surface or in a 3D volume. We propose three methods to spatially model the varying association between imaging measures and a set of covariates, namely, the spatially varying coefficient model (SVCM), the multiscale adaptive composite quantile regression model (MACQRM), and the spatially statistical parametric mapping model (SSPM). For each method, we develop a three-stage estimation procedure to simultaneously estimate the effect images and the complex spatial correlation of imaging data. Theoretically, we establish consistency and asymptotic normality of the adaptive estimates and the asymptotic distribution of the adaptive test statistics. Our Monte Carlo simulation and real data analysis have confirmed the excellent performance of our methods. This is joint work with Hongtu Zhu, Jianqing Fan, Xingcai Zhou, Partha Sarathi Mukherjee, Baiguo An and Chao Huang.

Linglong Kong is an assistant professor in the Department of Mathematical and Statistical Sciences at the University of Alberta. He received his BSc in probability and statistics in 1999 at Beijing Normal University and his MSc in statistics at Peking University in 2002. Dr. Linglong Kong obtained his PhD in statistics at the University of Alberta in 2009. He then did one-year postdoctoral training at Michigan State University in 2010 and another two-year postdoctoral training at the University of North Carolina at Chapel Hill from 2010 to 2012; after that he joined the University of Alberta as an assistant professor. His research interests include high-dimensional data analysis, neuroimaging data analysis, robust statistics, and statistical machine learning.

28 Oct

ATM replenishment scheduling
120 Hanes Hall

We develop an ATM replenishment policy for a bank that operates multiple ATMs with an aim to minimize the cost of stock-outs and replenishments, taking into account the economies of scale involved in replenishing multiple ATMs simultaneously. We present the structure of the optimal strategy that minimizes the long run cost per unit time and study a heuristic policy which is easy to implement.

Huiyin Ouyang, UNC Chapel Hill

Allocation of Intensive Care Unit Beds in Periods of High Demand

We consider a stylized, discrete-time model for the admission and discharge decisions in an Intensive Care Unit (ICU) in which patients' health conditions change over time with Markovian probabilities. We find that the optimal decision can depend on the mix of patients in the ICU and provide an analytical characterization of the optimal policy. We also identify conditions under which the optimal policy is state-independent.

21 Oct

Spherical Cap Packing Asymptotics and Rank-Extreme Detection
120 Hanes Hall

We study the spherical cap packing problem with a probabilistic approach. Such probabilistic considerations result in an asymptotic universal uniform sharp bound on the maximal inner product between any set of unit vectors and a stochastically independent uniformly distributed unit vector. When the unit vectors in the set are themselves independently uniformly distributed, we further develop the extreme value distribution limit of the maximal inner product, which characterizes its stochastic uncertainty around the bound. As applications of the above asymptotic results, we derive (1) an asymptotic universal uniform sharp bound on the maximal spurious correlation, as well as its uniform convergence in distribution when the explanatory variables are independently Gaussian; and (2) a sharp universal bound on the maximum norm of a low-rank elliptically distributed vector, as well as related limiting distributions. With these results, we develop a fast detection method for a low-rank structure in high-dimensional Gaussian data without using the spectrum information.

19 Oct

Gaussian Process Emulation of Computer Models with Massive Output
120 Hanes Hall

Often computer models yield massive output; e.g., a weather model will yield the predicted temperature over a huge grid of points in space and time. Emulation of a computer model is the process of finding an approximation to the computer model that is much faster to run than the computer model itself (which can often take hours or days for a single run). Many successful emulation approaches are statistical in nature, but these have only rarely attempted to deal with massive computer model output; some approaches that have been tried include utilization of multivariate emulators, modeling of the output (e.g., through some basis representation, including PCA), and construction of parallel emulators at each grid point, with the methodology typically based on use of Gaussian processes to construct the approximations. These approaches will be reviewed, with the startling computational simplicity with which the last approach can be implemented being highlighted and its remarkable success being illustrated and explained; in particular the surprising fact that one can ignore spatial structure in the massive output is explained. All results will be illustrated with a computer model of volcanic pyroclastic flow, the goal being the prediction of hazard probabilities near active volcanoes.

12 Oct

Detecting Weak Signals in GWAS and Sequencing Studies
120 Hanes Hall

Genome-wide association studies (GWASs), as the pre-eminent tool for genetic discovery, have enjoyed much success over the last decade by identifying thousands of genetic loci associated with complex traits. Nonetheless, these variants have only explained a small fraction of the inheritable variability for diseases. GWASs are often characterized by a very large number of candidate variants and generally small effect sizes. Recent studies show that there remain many causal variants that have not been identified due to a lack of statistical significance. The main goal of this paper is to develop powerful and efficient statistical tools to detect sparse and weak signals under high-dimensionality, and to more fully explore the potential of current high-throughput data. Results from this paper will facilitate the identification of missing heritability due to lack of statistical significance.

07 Oct

Graduate Student - Faculty Forum
120 Hanes Hall

Open agenda discussion between faculty and graduate students.

05 Oct

Real stable polynomials, determinants, and combinatorics
120 Hanes Hall

Real stable polynomials define real hypersurfaces with special topological structure. These polynomials bound the feasible regions of SDPs and appear in many areas of mathematics, including optimization, combinatorics and differential equations. Recently, tight connections have been developed between these polynomials and combinatorial objects called matroids. This led to a counterexample to the generalized Lax conjecture, which concerned high-dimensional feasible regions of semidefinite programs. I will give an introduction to some of these objects and the fascinating connections between them.

23 Sep

Regulatory Statistics for Public Health
120 Hanes Hall

This presentation is for students and faculty interested in learning about working or collaborating with statisticians at the Food and Drug Administration (FDA). The speaker will first give an overview of FDA and the Office of Biostatistics at the Center for Drug Evaluation and Research (CDER). Then, she will describe and give examples of regulatory science work at the Office of Biostatistics. Finally, she will discuss some opportunities for work and collaboration with the Office of Biostatistics. The Office of Biostatistics in the Center for Drug Evaluation and Research (CDER) at FDA has over 190 statisticians working in statistical science for drug regulation and development. The office’s missions are to provide CDER and other internal and external stakeholders with statistical leadership, expertise, and advice to foster the expeditious development of safe and effective drugs and therapeutic biologics for the American people AND to protect the public health by applying statistical approaches for monitoring the effectiveness and safety of marketed drugs and therapeutic biologic products.

14 Sep

Perspectives on Solving Complex Problems in Industrial R&D
120 Hanes Hall

This is my 35th year working as a statistical consultant in industry. I get paid to help scientists and engineers solve complex problems. The complexity of the problems and the sophistication and expense of the solutions have followed a steep upward trend over the years. In this presentation, I'm going to address four basic themes that have remained important and useful over time. (1) Visualizing data is fundamental to understanding the problem, finding a solution, and conveying the results. Visualization, for me, started with the first PC-based graphics software and has continued on to highly automated, web-based graphics applications. (2) Solving real, complex problems that matter to people is a passion for me; this means experimental design (DOE). This path has also been highly driven by computing, starting with some of the first PC-based software for DOE, continuing through space-filling designs, and finally to highly automated robotics platforms for experimental design. No matter the approach, success depends on systematically varying inputs to gain a deeper understanding of how the system works. (3) In the modern world, you can't solve problems without organizing data, LOTS of data. For the last 15 years or so I have led multi-disciplinary teams that develop software, databases, and large scale informatics systems. My current focus is on developing and managing statistics driven informatics systems for the development of an automated, high-throughput library prep system for Next Generation Sequencing. (4) Interestingly, every new problem seems to require fresh thinking about analysis methods. Real problems seldom fit into neat boxes defined by standard methods. During the last several years, I have been working with Prof. Marron on Object Oriented Data Analysis, which I think captures the spirit of how I think about data.
Throughout my career, I've maintained strong relationships with academics. I'll also discuss how interactions with professors and graduate students have enriched my personal satisfaction and contributed to successful problem solving.

02 Sep

Elementary certificates in semidefinite programming
201 Chapman Hall

Semidefinite Programming (SDP) is the problem of optimizing a linear objective function of a symmetric positive semidefinite matrix variable. SDP is much more general than linear programming, with applications from engineering to combinatorial optimization. Some call it "linear programming for the 2000s". Proving infeasibility is a crucial problem in SDP, as well as in any branch of optimization. Though SDP has been around in some form for almost 50 years, the known infeasibility proofs either may fail or are much more involved than the ones in linear programming. The main contribution of our work is a surprisingly simple method to prove infeasibility in SDP: we reformulate semidefinite systems using only elementary operations, mostly inherited from Gaussian elimination. When a system is infeasible, the reformulated system is trivially infeasible. As a corollary, we obtain algorithms to generate the data of all infeasible SDPs. As a practical application of our method, we generate a library of infeasible SDPs. While the status of our instances can be verified by inspection, they turn out to be challenging for the best commercial and research solvers. If time permits, I will also discuss the use of reformulations to obtain a simple certificate of when a feasible semidefinite system is badly behaved, i.e., when duality fails for some objective function. Students interested in a challenging and rewarding research topic are encouraged to attend. The first part of the talk is joint work with Minghui Liu.

31 Aug

Handling Heterogeneity in Big Data
201 Chapman Hall

A major challenge in the world of Big Data is heterogeneity. This often results from the aggregation of smaller data sets into larger ones. Such aggregation creates heterogeneity because different experimenters typically make different design choices. Even when attempts are made at common designs, environmental or operator effects still often create heterogeneity. This motivates moving away from the classical conceptual model of Gaussian distributed data, in the direction of Gaussian mixtures. But classical mixture estimation methods are usually useless in Big Data contexts, because there are far too many parameters to efficiently estimate. Thus there is a strong need for statistical procedures which are robust against mixture distributions without the need for explicit estimation. Some early ideas in this important new direction are discussed.

08 May

Dynamics of Distal Group Actions
125 Hanes Hall

An automorphism T of a locally compact group is said to be distal if the closure of the T-orbit of any nontrivial element stays away from the identity. We discuss some properties of distal actions on groups. We will also relate distal groups to the behaviour of powers of probability measures on them.

22 Apr

Bayesian threshold selection for extremal models
120 Hanes Hall

Statistical extreme value theory is concerned with the use of asymptotically motivated models to describe the extreme values of a process. A number of commonly used models are valid for observed data that exceed some high threshold. However, in practice a suitable threshold is unknown and must be determined for each analysis. In this talk, I will demonstrate how to use Bayesian measures of surprise to determine suitable thresholds for extreme value models. This approach is easily implemented for both univariate and multivariate extremes.

Sarah Marshall, Auckland University of Technology

Modelling a product recovery system using Markov Decision Processes

Increasing legislative and societal pressures are requiring manufacturers to operate more sustainably and to take responsibility for the fate of their goods after they have been used by consumers. This research models a product recovery system in which newly produced and remanufactured used goods are sold on separate markets but can also act as substitutes for each other. A Markov decision process is used to model this system. Optimal policies under four substitution strategies are compared and results from a computational study are discussed.

13 Apr

Penalized adaptive weighted least square regression
120 Hanes Hall

To conduct regression analysis with contaminated data, two approaches are usually applied. One is to detect the contamination and then run ordinary regression on the data with the contaminated observations excluded. The other is to run a robust regression that is insensitive to the data contamination. In this talk, I will present a novel approach, penalized adaptive weighted least squares (PAWLS), for simultaneous robust estimation, contaminated observation detection, and variable selection in high-dimensional settings. The proposed PAWLS estimator is justified from a Bayesian point of view. Corresponding oracle inequalities are also established. The performance of the proposed estimator is evaluated in both simulation studies and real data analysis.

05 Mar

Bootstrapping for Learning Statistics
Genome Sciences Building-G100

(This talk is intended for introductory statistics students on up.) Statistical concepts such as sampling distributions, standard errors, and P-values are difficult for many students. It is hard to get hands-on experience with these abstract concepts. I think a good way to get that experience is using bootstrapping and permutation tests. I'll demonstrate using a variety of examples. They're not just for students. I didn't realize just how inaccurate the classical methods are until I started checking them using these methods. Remember that old rule of n >= 30? Try n >= 5000 instead. The methods are also useful in their own right. We use them all the time at Google -- they are easier to use than standard methods (far less chance of screwing up), besides being more accurate.
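As a rough illustration of the bootstrap idea mentioned in this abstract (not taken from the talk itself), the following Python sketch estimates the standard error of a sample mean by resampling with replacement. The function name `bootstrap_standard_error`, the toy data, and the number of resamples are all assumptions chosen for the example.

```python
import random
import statistics


def bootstrap_standard_error(sample, statistic=statistics.mean,
                             n_resamples=2000, seed=0):
    """Estimate the standard error of `statistic` by the nonparametric bootstrap.

    Hypothetical helper for illustration: draw `n_resamples` resamples of the
    same size as `sample` (with replacement), compute the statistic on each,
    and return the standard deviation of those replicates.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(sample)
    replicates = []
    for _ in range(n_resamples):
        resample = [sample[rng.randrange(n)] for _ in range(n)]
        replicates.append(statistic(resample))
    return statistics.stdev(replicates)


# Toy data (made up for the example).
data = [2.1, 3.4, 1.9, 5.6, 4.2, 3.8, 2.7, 6.1, 3.3, 4.9]

se_boot = bootstrap_standard_error(data)
# Classical formula for comparison: s / sqrt(n).
se_formula = statistics.stdev(data) / len(data) ** 0.5

print(f"bootstrap SE: {se_boot:.3f}, formula SE: {se_formula:.3f}")
```

For the mean, the two numbers should be close; the appeal of the bootstrap is that the same resampling loop works for statistics (a median, a trimmed mean) that have no simple standard-error formula.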

04 Mar

Many Flavors of Statistics and Their Applications
120 Hanes Hall

In this talk I will give an overview of the variety of statistical problems I have encountered and worked on over the years in academia, in industry, and currently in government. I hope to touch upon theory, methodology, and applications. Industrial application areas include Condition Monitoring, Supply Chain Modeling, Experimental Design, and Regulatory Conformance. At NIST my areas of involvement include Uncertainty Quantification in Phase Diagrams, Structural Variants Detection in the Human Genome, Forensic Statistics, Calibration of Mass Standards, and Inter-laboratory Trials. I will briefly explain each of these problem areas, the approach we took towards solving them and what the outcome was. I will also mention some problems that may be good candidates for MS or PhD research.

13 Feb

Structured smoothness in modern convex optimization
120 Hanes Hall

The importance of convex optimization techniques has dramatically increased in the last decade due to the rise of new theory for structured sparsity and low-rankness, and successful statistical learning models such as support vector machines. Convex optimization formulations are now employed with great success in various subfields of data sciences, including machine learning, compressive sensing, medical imaging, geophysics, and bioinformatics. However, the renewed popularity of convex optimization places convex algorithms under tremendous pressure to accommodate increasingly difficult nonlinear models and non-smooth cost functions with ever increasing data sizes. Overcoming these emerging challenges requires nonconventional ways of exploiting useful yet hidden structures within the underlying convex optimization models. To this end, I will demonstrate how to exploit the classical notion of smoothness in novel ways to develop fully rigorous methods for fundamental convex optimization settings, from primal-dual framework to composite convex minimization, and from proximal-path following scheme to barrier smoothing technique. Some of these results play key roles in convex optimization, such as unification and uncertainty principles for augmented Lagrangian and decomposition methods, and have important computational implications such as solving convex programs on the positive semidefinite cone without any matrix decompositions.

11 Feb

Large-scale Optimization via Block Coordinate Update
120 Hanes Hall

Large-scale problems arise in many areas including statistics, machine learning, medical imaging, and signal processing, to name a few. The datasets and/or the number of variables involved in these problems can be extremely large. This talk will show you how to handle these problems using the block coordinate update method. The basic strategy of the method is to break variables and datasets into small pieces, so that each iteration updates only one piece of the variables using a small set of the data. Depending on the specific problem, I will present different update schemes including the classic alternating minimization, the recently popular block prox-linear scheme, and a new scheme incorporating the stochastic gradient method into the block coordinate update. I will also talk about its applications in tensor (multi-dimensional array data) problems to show that the method is simple and also very efficient.
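To make the idea of updating one piece of the variables at a time concrete, here is a minimal Python sketch of cyclic coordinate descent for a small least-squares problem. It is an assumption-laden toy, not the speaker's algorithm: each "block" is a single coordinate, the update is an exact minimization over that coordinate, and the function name `coordinate_descent_least_squares` and the example data are made up.

```python
def coordinate_descent_least_squares(A, b, n_sweeps=200):
    """Minimize 0.5 * ||A x - b||^2 by cyclic exact coordinate updates.

    Illustrative sketch only: every block is one coordinate of x. A is a list
    of rows, b a list of targets. The residual b - A x is kept up to date so
    each coordinate update touches only one column of A.
    """
    n_rows, n_cols = len(A), len(A[0])
    x = [0.0] * n_cols
    residual = list(b)  # equals b - A x for the starting point x = 0
    columns = [[A[i][j] for i in range(n_rows)] for j in range(n_cols)]

    for _ in range(n_sweeps):
        for j, col in enumerate(columns):
            norm_sq = sum(v * v for v in col)
            if norm_sq == 0.0:
                continue  # an all-zero column cannot change the objective
            # Exact minimizer of the objective along coordinate j.
            step = sum(c * r for c, r in zip(col, residual)) / norm_sq
            x[j] += step
            residual = [r - step * c for r, c in zip(residual, col)]
    return x


# Toy problem: fit an intercept and slope to three points.
A = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]]
b = [1.0, 2.0, 2.0]
x_hat = coordinate_descent_least_squares(A, b)
print(x_hat)
```

On this toy problem the iterates approach the ordinary least-squares solution (intercept 7/6, slope 1/2). Larger blocks, prox-linear steps, or stochastic gradients would replace the exact one-coordinate update in the schemes the abstract lists.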

09 Feb

A Unified Theory of Elicitation via Convex Analysis
120 Hanes Hall

Elicitation is the study of mechanisms which incentivize the truthful reporting of private information from self-minded agents. In this talk I will present a general theory of elicitation grounded in convex analysis. Beyond recovering the basic results for several existing models, such as scoring rules and mechanism design, we will see new results and insights in statistics and economics which are made possible by this theoretical unification. To conclude, I will discuss connections to mathematical finance and prediction markets. Based on joint works with Ian Kash (Microsoft UK) and Mark Reid (ANU & NICTA).

04 Feb

Data Analysis through Polyhedral Theory
120 Hanes Hall

With geometric modeling techniques, one can represent the feasible solutions of problems in operations research as objects in high-dimensional space. The properties of these objects reveal information about the underlying problems and lead to algorithms. We model an application in the consolidation of farmland and private forests as a clustering problem where the clusters have to adhere to prescribed cluster sizes. In this approach, we connect least-squares assignments, cell complexes, and the studies of polyhedra. The devised methods lead to generalizations of the classical k-means algorithm and algorithms for soft-margin separation in general data analysis tasks. Further, we report on how these results were implemented in practice.

02 Feb

Improved Decomposition Algorithms for Two-Stage Stochastic Integer Programs
120 Hanes Hall

Many practical planning, design and operational problems involve making decisions under uncertainty. Also, most of them include some integer decisions. Stochastic programming is a useful tool for dealing with uncertainty and integrality requirements in optimization problems. We consider two-stage stochastic integer programs, where the decision maker must take some (integer) decisions before the uncertainty is revealed, then can observe the realizations and take recourse actions. These problems lead to large-scale mixed integer programs, which are computationally very challenging, thus decomposition methods are used. The most common solution approach is Benders decomposition. However, the standard Benders decomposition algorithm usually fails due to the weakness of the linear programming relaxations. We propose two methods to strengthen the Benders decomposition algorithm with integrality-based cuts. We present numerical results on integrated service system staffing and scheduling, capacitated facility location and network interdiction problems that demonstrate the computational efficiency of the proposed approaches.

23 Jan

Decentralized Mixed Integer Programming: Theory and Application
120 Hanes Hall

Mixed Integer Programming (MIP) is a very strong tool for modeling and solving many real-world optimization problems. In practice, there are a lot of large-scale MIPs with specific structure of loosely coupled blocks of constraints. Fast solution time, scalability, distributed databases and data privacy motivate exploiting decentralized methods in these problems. One possible decentralized approach is relaxing the joint constraints and then solving the relaxed problem in parallel. In this research, we investigate the augmented Lagrangian relaxation and its dual for MIP problems. We show that under some mild assumptions, using any norm as an augmenting function with a sufficiently large penalty coefficient closes the duality gap for MIPs. Then, we propose a decentralized MIP approach based on adding primal cuts and restricting the Lagrangian relaxation of the original MIP. We also propose a decentralized heuristic approach to solve large-scale unit commitment problems in rapidly growing electric power systems. We present and discuss the promising results from testing the method on large-scale power systems.

01 Dec

Characterizing optimal experimental designs for rational function regression using semidefinite optimization
120 Hanes Hall

We consider the problem of finding optimal experimental designs for a large class of regression models involving polynomials and rational functions with heteroscedastic noise also given by a polynomial or rational weight function. The design weights can be found quite easily by existing methods once the support is known, therefore we concentrate on determining the support of the optimal design. The approach we shall present treats all commonly used optimality criteria in a unified manner, and gives a polynomial whose zeros are the support points of the optimal design, generalizing a number of previously known results of the same flavor. As a corollary, a new upper bound on the size of the support set of the minimally supported optimal designs is also found.

24 Nov

Combining Statistical Alignment with Annotation
120 Hanes Hall

Although bioinformatics is perceived as a new discipline, certain parts have a long history and could be viewed as classical bioinformatics. For example, application of string comparison algorithms to sequence alignment has a history spanning the last three decades, beginning with the pioneering paper by Needleman and Wunsch, 1970. They used dynamic programming to maximize a similarity score based on a cost of insertion-deletions and a score function on matched amino acids. The principle of choosing solutions by minimizing the amount of evolution is also called parsimony and has been widespread in phylogenetic analysis even if there is no alignment problem. This situation is likely to change significantly in the coming years. After a pioneering paper by Bishop and Thompson (Bishop and Thompson, 1986) that introduced and approximated likelihood calculation, Thorne, Kishino and Felsenstein in 1991 proposed a well defined time reversible Markov model for insertions and deletions (denoted more briefly as the TKF91-model) that allowed a proper statistical analysis for two sequences. Such an analysis can be used to provide maximum likelihood (pairwise) sequence alignments, or to estimate the evolutionary distance between two sequences. This was subsequently generalized further to any phylogeny, and more practical methods based on MCMC have been developed. Despite much work these models are still very simple and need to be extended to longer insertion-deletions, position heterogeneity and annotation of protein genes/structure, regulatory signals and RNA structure. This talk will present some progress on combining statistical alignment with annotation. Key Words: Phylogenetic approximation, statistical alignment, annotation, evolution, regulatory signals, transcription factor
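The dynamic-programming alignment attributed above to Needleman and Wunsch can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions: a flat score of +1 for a match, -1 for a mismatch, and -1 per inserted or deleted position stands in for the amino-acid score function and insertion-deletion cost the abstract mentions, and the function name `needleman_wunsch_score` is made up for the example.

```python
def needleman_wunsch_score(seq_a, seq_b, match=1, mismatch=-1, gap=-1):
    """Return the best global alignment score of two sequences.

    Illustrative sketch: score[i][j] holds the best score for aligning the
    first i symbols of seq_a with the first j symbols of seq_b. The scoring
    parameters are placeholder assumptions, not a biological scoring matrix.
    """
    rows, cols = len(seq_a) + 1, len(seq_b) + 1
    score = [[0] * cols for _ in range(rows)]

    # Aligning a prefix against the empty sequence costs one gap per symbol.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap

    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if seq_a[i - 1] == seq_b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + pair,  # align the two symbols
                score[i - 1][j] + gap,       # gap in seq_b
                score[i][j - 1] + gap,       # gap in seq_a
            )
    return score[-1][-1]


print(needleman_wunsch_score("ACGT", "AGT"))
```

This scores by maximizing similarity, as in the parsimony-style methods the abstract describes; the statistical alignment models discussed in the talk (such as TKF91) replace the fixed scores with probabilities from an evolutionary model.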

20 Nov

Weak existence of a solution to a differential equation driven by a very rough fBm
130 Hanes Hall

19 Nov

Local linear regression on manifolds and its geometric implications
120 Hanes Hall

High-dimensional data are increasingly common in many of today's scientific fields. Recently there has been a trend toward finding the low-dimensional structure of high-dimensional data, so that statistical models can be built and inference carried out based on the identified low-dimensional structure. This approach is appealing in many senses. For example, theoretically speaking the estimation and inference enjoy better efficiency, and numerically the computational time is dramatically reduced and the performance is stable. Of particular interest is to assume that the data have some manifold structure and to learn it from the data themselves. In this talk, I will present a method for nonparametric regression on manifolds, built on local PCA. The proposed method can be used to further produce a manifold learning tool which possesses nice properties. Both theoretical and numerical properties of the proposed models will be reported. An application to registration with CT scan data will be discussed as well.

12 Nov

Confidence regions and intervals of stochastic variational inequalities: the normal map approach
120 Hanes Hall

Variational inequalities model a general class of equilibrium problems, and also arise as first order conditions of nonlinear programs. This talk considers a stochastic variational inequality (SVI) defined over a polyhedron, with the function defining the variational inequality being an expectation function. A basic approach for solving such an SVI is the sample average approximation (SAA) method, that replaces the expectation function by a sample average function, and uses a solution of the SAA problem to estimate the true solution. It is well known that under appropriate conditions the SAA solutions provide asymptotically consistent point estimators for the true solution. In this talk, we present methods to compute confidence regions and confidence intervals for the true solution, given an SAA solution. Standard statistical techniques are not applicable here due to the nonsmooth structure of variational inequalities. We will discuss how to overcome such difficulties caused by the nonsmoothness.

05 Nov

Internship Student Panel
120 Hanes Hall

Internships are a great opportunity for graduate students to gain experience working in a non-academic setting while tackling real world problems. A panel of current students will share their experiences working as summer interns in various sectors and roles. Panelists have interned at companies including: SAS, Johnson & Johnson, CNA, Bank of America, among others. The colloquium will be structured as an informal Q&A, and should be viewed as an opportunity for students to learn from their peers about the complete "internship experience" - from the application process to the final presentation and beyond.

03 Nov

Three Easy Pieces --- Modes, Bumps, and Mixtures
120 Hanes Hall

In this talk, we consider three different but complementary approaches for finding structure in data via density estimation. The first examines good histogram bandwidths for finding modes with massive data sets. The second is an update of I.J. Good's "bump surgery" approach. And thirdly, we focus on the weights of a normal mixture as surrogates for structure.

30 Oct

Probabilistic rearrangement inequalities and applications
130 Hanes Hall

The classical entropy power inequality, due to Shannon and Stam, provides a lower bound on the Boltzmann-Shannon entropy of a sum of independent random vectors, and is a key ingredient of the information-theoretic approach to the central limit theorem. We present a refinement of this inequality using the notion of spherically symmetric rearrangements, and simultaneously also obtain a lower bound on the (more general) Rényi entropy of a sum of independent random vectors. We will also discuss several applications and related results, including a new proof of the classical entropy power inequality, and a discrete analogue that is related to questions in additive combinatorics. Includes joint work with Liyao Wang and Jaeoh Woo (Yale University).

29 Oct

Generalized Fiducial Inference
120 Hanes Hall

R. A. Fisher's fiducial inference has been the subject of many discussions and controversies ever since he introduced the idea during the 1930's. The idea experienced a bumpy ride, to say the least, during its early years and one can safely say that it eventually fell into disfavor among mainstream statisticians. However, it appears to have made a resurgence recently under various names and modifications. For example, under the new name generalized inference, fiducial inference has proved to be a useful tool for deriving statistical procedures for problems where frequentist methods with good properties were previously unavailable. Therefore we believe that the fiducial argument of R.A. Fisher deserves a fresh look from a new angle. In this talk we first generalize Fisher's fiducial argument and obtain a fiducial recipe applicable in virtually any situation. We demonstrate this fiducial recipe on examples of varying complexity. We also investigate, by simulation and by theoretical considerations, some properties of the statistical procedures derived by the fiducial recipe, showing they often possess good repeated-sampling, frequentist properties. Portions of this talk are based on joint work with Hari Iyer, Thomas C.M. Lee, Randy Lai, Dimitris Katsoridasi.

23 Oct

Hydrodynamic limits for directed traps and systems of independent RWRE
130 Hanes Hall

We study the evolution of a system of independent random walks in a common random environment (RWRE). Previously a hydrodynamic limit was proved in the case where the environment is such that the random walks are ballistic (i.e., transient with non-zero speed v_0). In this case it was shown that the asymptotic particle density is simply translated deterministically by the speed v_0. In this talk we will consider the more difficult case of RWRE that are transient but with v_0=0. Under the appropriate spacetime scaling, we prove a hydrodynamic limit for the system of random walks. The statement of the hydrodynamic limit that we prove is non-standard in that the evolution of the asymptotic particle density is given by the solution of a random rather than a deterministic PDE. The randomness in the PDE comes from the fact that under the hydrodynamic scaling the effect of the environment does not "average out" and so the specific instance of the environment chosen actually matters. The proof of the hydrodynamic limit for the system of RWRE will be accomplished by coupling the system of RWRE with a simpler model of a system of particles in an environment of "directed traps." This talk is based on joint work with Milton Jara.

22 Oct

Tips and Tools for Creating a Personal Webpage
120 Hanes Hall

Personal webpages are an important tool for sharing information with applications to practically everything. Despite the clear advantages of having a personal webpage, graduate students have traditionally avoided making one. In this talk, we introduce a fast and simple approach using the web.unc.edu system to create and maintain a personal webpage under weak assumptions on prior programming experience. Several empirically derived rules for maintaining a professional looking webpage will also be presented. Throughout the talk, real webpage examples will be used to motivate and illustrate key points.

James Wilson, UNC Chapel Hill

20 Oct

Optimization in High-Dimensions
120 Hanes Hall

Many modern estimation settings result in problems that are high-dimensional, meaning that the number of parameters to estimate can be far greater than the number of examples. In this talk I will discuss a general methodology for understanding the statistical behavior of high-dimensional estimation procedures under a broad class of problems. I will then discuss some of the algorithms and optimization procedures used to actually perform the estimation. We will demonstrate that the same set of tools that lend themselves to establishing good statistical properties also lend themselves to understanding efficient computational methods.

15 Oct

Prioritization in service systems with customers changing types
120 Hanes Hall

Motivated by the admission and discharge decisions in Intensive Care Units (ICUs), we analyze a stylized formulation to gain insights into prioritization decisions for access to a limited service resource. Specifically, we consider a service system with multiple servers, no queueing capacity and customers whose types may change during service. Under the objective of maximizing the total benefit for the system, which can be interpreted as the expected number of survivors or the expected number of readmissions in an ICU setting, we prove that there exists an optimal policy that is of threshold type and we give a set of sufficient conditions under which one type is always preferred over the other regardless of the system state. Joint work with Nilay Tanık Argon and Serhan Ziya.

13 Oct

Functional Nuclear Norm and Low Rank Function Estimation
120 Hanes Hall

The problem of low rank estimation naturally arises in a number of functional or relational data analysis settings, for example when dealing with spatio temporal data or link prediction with attributes. We consider a unified framework for these problems and devise a novel penalty function to exploit the low rank structure in such contexts. The resulting empirical risk minimization estimator can be shown to be optimal under fairly general conditions.

09 Oct

Transition between averaging and homogenization regimes for periodic flows and averaging for flows with ergodic components.
130 Hanes Hall

In this talk we'll discuss two asymptotic problems that are related by common techniques. First, we'll consider elliptic PDEs with a small diffusion term in a large domain. The coefficients are assumed to be periodic. Depending on the relation between the parameters, either averaging or homogenization needs to be applied in order to describe the behavior of solutions. We'll discuss the transition regime. The second problem concerns equations with a small diffusion term, where the first-order term corresponds to an incompressible flow, possibly with a complicated structure of flow lines. Here we prove an extension of the classical averaging principle of Freidlin and Wentzell. Different parts of the talk are based on joint results with M. Hairer, Z. Pajor-Gyulai, D. Dolgopyat, and M. Freidlin.

08 Oct

Higher order asymptotics of Generalized Fiducial Inference
130 Hanes Hall

R. A. Fisher's fiducial inference has been the subject of many discussions and controversies ever since he introduced the idea during the 1930's. The idea experienced a bumpy ride, to say the least, during its early years and one can safely say that it eventually fell into disfavor among mainstream statisticians. However, fiducial inference appears to have made a resurgence recently under various names and modifications that proved to be useful for deriving statistical procedures for problems where frequentist methods with good properties were previously unavailable. Therefore we believe that the fiducial argument of R.A. Fisher deserves a fresh look from a new angle. Generalized fiducial inference was motivated by Fisher's philosophy and provides a distribution on the parameter space derived in closed form using the theory of increasing precision asymptotics (Hannig, 2013) and not Bayes' theorem. In this work we analyze the regularity conditions under which the fiducial distribution is first and second order exact in the frequentist sense. To get the expansion of the frequentist coverage of the fiducial quantile, we use an ingenious approach called the "shrinkage method" (Ghosh, Bickel 1990), which is extensively used in the context of probability matching priors. The ideas will be demonstrated on several simple examples.

06 Oct

A Biased View of Topological Data Analysis (TDA)
120 Hanes Hall

I will review some of the recent work in TDA from the perspective of a statistician. I will begin by motivating TDA from the perspective of some applications in biology and anthropology. I will then discuss homology and persistent homology (PH); PH is the dominant topological summary statistic used in TDA. I will also describe another summary statistic called the Euler characteristic curve and relate it to extrema of Gaussian random fields. Given these summaries I will proceed to 1) state that PH forms a probability space and has Fréchet means and variances, and that one can condition on these summaries, an example of Jeffreys' substitution likelihoods; 2) show that topological summaries can be used as sufficient statistics for distributions on shapes or surfaces that are given as meshes, and that specific summaries admit an exponential family model; 3) state some results on limiting distributions of Betti numbers of the random set model generated by a point process on a manifold (the purpose of this topic is to make the probabilists happy).

01 Oct

Sparse Regression Incorporating Graphical Structure among Predictors
120 Hanes Hall

With the abundance of high dimensional data in various disciplines, sparse regularized techniques are very popular these days. In this paper, we make use of the structure information among predictors to improve sparse regression models. Typically, such structure information can be modeled by the connectivity of an undirected graph using all predictors as nodes of the graph. Most existing methods use this undirected graph information edge-by-edge to encourage the regression coefficients of corresponding connected predictors to be similar. However, such methods may require expensive computation when the predictor graph has many more edges than nodes. Furthermore, they do not directly utilize the neighborhood information of the graph. In this paper, we incorporate the graph information node-by-node, instead of edge-by-edge as used in most existing methods. To that end, we decompose the true p-dimensional regression coefficient vector as the sum of p latent parts and incorporate convex group penalty functions to utilize predictor graph neighborhood structure information. Our proposed method is very general and it includes adaptive Lasso and group Lasso as special cases. Both theoretical and numerical studies demonstrate the effectiveness of the proposed method for simultaneous estimation, prediction and model selection.

25 Sep

The growth model: Busemann functions, shape, geodesics, and other stories.
130 Hanes Hall

We consider the directed last-passage percolation model on the planar integer lattice with nearest-neighbor steps and general i.i.d. weights on the vertices, outside the class of exactly solvable models. Stationary cocycles are constructed for this percolation model from queueing fixed points. These cocycles define solutions to variational formulas that characterize limit shapes and yield new results for Busemann functions, geodesics and the competition interface. This is joint work with Nicos Georgiou and Timo Seppalainen.

24 Sep

Quadratic programming in synthesis of stationary Gaussian fields
120 Hanes Hall

Stationary Gaussian random fields are used as models in a range of applications such as image analysis or geostatistics. For example, the spatial distribution of fMRI signals is often modeled as a stationary Gaussian random field with a pre-specified covariance structure. One of the most effective and exact methods to synthesize such fields is based on the so-called circulant matrix embedding. But the standard version of the method works only under suitable assumptions, which are well-known to fail for many practical covariance structures of stationary fields of interest. In this talk, I will present a novel methodology which adaptively constructs feasible circulant embeddings based on constrained quadratic optimization. In the first part of my talk, I will review two circulant embedding methods, namely, the standard and smoothing windows circulant embeddings. Then, I will give several examples of covariance functions for which these methods fail and motivate the formulation of a quadratic problem with linear inequality constraints. In the second part of my talk I will explain how a well-known interior point optimization strategy called primal log barrier method can be suitably adapted to solve the quadratic problem faster than commercial solvers. Time permitting, future work will also be discussed.

22 Sep

L1-Norm Principal Component Analysis
120 Hanes Hall

Principal component analysis (PCA) may be viewed in terms of optimization as finding a series of best-fit subspaces. Traditional PCA is based on using the L2 norm to measure distances of points to the fitted subspaces and can be sensitive to outlier observations. Several robust approaches based on the L1 norm have been proposed, including methods that estimate best-fitting L1-norm subspaces. In this talk, we review progress on the L1-norm best-fit hyperplane problem and the L1-norm best-fit line problem. Both problems are naturally written as nonlinear nonconvex optimization problems. However, the best-fit hyperplane can be found by solving a small number of linear programs. Whether the best-fit line can be found in polynomial time remains an open problem. We introduce methods for deriving solutions to these problems and show how they can be used for robust PCA.

17 Sep

Influence of Climate Change on Extreme Weather Events
120 Hanes Hall

The increasing frequency of extreme weather events raises the question to what extent such events are attributable to human causes. The human influence may be characterized through the fraction of attributable risk (FAR) or equivalently the risk ratio (RR), which is the ratio of the probability of a given extreme event under an anthropogenic forcing scenario to the same probability when only natural forcings are taken into account. However, there is no generally accepted method of calculating these quantities. We propose a method based on extreme value theory, incorporated into a Bayesian hierarchical model for combining climate model runs and the observational record. The same method solves the closely related question of projecting changes in the probability of extreme events over some future period. These methods are applied to three extreme events: (a) the extreme heatwave that occurred in parts of Europe in 2003, (b) the heatwave of Russia in 2010, and (c) the central USA heatwave of 2011. In each case we find posterior median risk ratios (from anthropogenic versus control-run models) of between 2 and 3, with wide posterior intervals, but still at least a two-thirds probability that the true risk ratio is greater than 1.5, implying substantial anthropogenic influence. Projections of future probabilities for the same three extreme events show substantial differences: an event equivalent to Europe in 2003 has nearly a 90% probability of reoccurrence in any given year by 2040; corresponding results for the other two events also show an increase, but much more slowly.

15 Sep

MCMC: How would you propose?
120 Hanes Hall

Markov chain Monte Carlo (MCMC) is a simulation algorithm that has made modern Bayesian inference possible. It is extremely flexible and can use nearly arbitrary proposal kernels to generate the correct target distribution. However, some proposals may be more efficient than others, in the sense that they lead to smaller variances in the estimates based on the resulting MCMC sample. While it is well-known that one should use medium-sized steps in any given proposal, with a moderate acceptance rate of the proposed moves, it is not well-appreciated that different proposal densities can lead to very different algorithm efficiency. We compared a number of proposal densities applied to various targets, and found that the uniform kernel is more efficient than the Gaussian kernel, while a two-humped Bactrian kernel is even better. With optimal scales used for both, the Bactrian kernel is at least 50% more efficient than the Gaussian. We suggest that further research is needed in this area given the popularity of MCMC algorithms and the general applicability of such simple proposal densities.
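
To make the comparison of proposal kernels concrete, here is a minimal random-walk Metropolis sketch with three interchangeable symmetric proposals. It is an illustration under stated assumptions, not the speaker's code: the function names are hypothetical, each kernel is standardized to unit variance before scaling, and the "Bactrian" kernel is assumed here to be an equal mixture of two normals centred at -m and +m with variance 1 - m^2 (m = 0.95 is an illustrative choice).

```python
import math
import random


def gaussian_step(rng, scale):
    # Standard normal increment, scaled.
    return scale * rng.gauss(0.0, 1.0)


def uniform_step(rng, scale):
    # Uniform on (-sqrt(3), sqrt(3)) has unit variance.
    half_width = math.sqrt(3.0)
    return scale * rng.uniform(-half_width, half_width)


def bactrian_step(rng, scale, m=0.95):
    # Assumed form: equal mixture of N(-m, 1 - m^2) and N(m, 1 - m^2),
    # which is symmetric about zero and has unit variance.
    center = m if rng.random() < 0.5 else -m
    return scale * (center + math.sqrt(1.0 - m * m) * rng.gauss(0.0, 1.0))


def metropolis(log_target, step, scale, n, x0=0.0, seed=0):
    """Random-walk Metropolis; returns (samples, acceptance rate)."""
    rng = random.Random(seed)
    x, log_p = x0, log_target(x0)
    samples, accepted = [], 0
    for _ in range(n):
        y = x + step(rng, scale)
        log_q = log_target(y)
        # Symmetric proposal, so the acceptance ratio is target(y) / target(x).
        if log_q >= log_p or rng.random() < math.exp(log_q - log_p):
            x, log_p = y, log_q
            accepted += 1
        samples.append(x)
    return samples, accepted / n
```

A call such as `metropolis(lambda x: -0.5 * x * x, bactrian_step, 2.4, 20000)` samples from a standard normal target; comparing the autocorrelation of the chains produced by the three kernels is one way to explore the efficiency differences the abstract describes.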

10 Sep

A short proof of infeasibility and generating all infeasible semidefinite programs
120 Hanes Hall

In optimization problems it is of fundamental importance to have a short and easy to verify proof of infeasibility. In linear programming (LP) the celebrated Farkas' lemma easily verifies infeasibility: the linear system (1) x ≥ 0, Ax = b is infeasible if and only if the alternative system (2) y^T A ≥ 0, y^T b = -1 is feasible, and the "if" statement is trivial to prove. That is, the system (2) is a short certificate of infeasibility of (1). Semidefinite programming (SDP) is the problem of optimizing a linear objective subject to linear constraints, with the matrix variable being positive semidefinite. SDP is a vast generalization of LP, with applications in combinatorial optimization, engineering and statistics. The straightforward generalization of Farkas' lemma, however, may fail. We present a short certificate of infeasibility of SDPs using a standard form reformulation. The "if" direction is almost as easy to prove as in the LP Farkas' lemma. Obviously, there are infinitely many infeasible SDPs. Still, the reformulation allows us to systematically generate all of them; and we prove that, somewhat surprisingly, there are only finitely many "representative" SDPs in every dimension. The talk will not assume any knowledge of semidefinite programming; knowledge of linear algebra is sufficient to follow it. There are many challenging fundamental open questions in SDP; the corresponding ones have long been resolved in LP. I will outline some of these questions. Students interested in a challenging research topic are encouraged to attend. The URL of the paper is http://arxiv.org/abs/1406.7274
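
The LP certificate in the abstract is easy to check mechanically: if some x ≥ 0 satisfied Ax = b, then y^T A x would be nonnegative while also equal to y^T b = -1, a contradiction. The sketch below verifies a proposed y for the LP case only; the function name, the list-of-rows matrix layout and the tolerance are assumptions made for illustration, and it does not cover the SDP certificate discussed in the talk.

```python
def is_farkas_certificate(A, b, y, tol=1e-9):
    """Check whether y certifies that {x >= 0, Ax = b} is infeasible.

    A is a list of m rows of length n, b has length m, y has length m.
    Returns True when y^T A >= 0 componentwise and y^T b = -1.
    """
    m, n = len(A), len(A[0])
    # j-th entry of the row vector y^T A
    y_t_a = [sum(y[i] * A[i][j] for i in range(m)) for j in range(n)]
    y_t_b = sum(y[i] * b[i] for i in range(m))
    return all(v >= -tol for v in y_t_a) and abs(y_t_b + 1.0) <= tol
```

For example, the system x1 + x2 = -1 with x ≥ 0 is infeasible, and `is_farkas_certificate([[1, 1]], [-1], [1])` confirms that y = 1 is a certificate.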

08 Sep

The fascinating origins of Statistics in the state of NC and the development of our UNC department - some personal recollections of earlier days.
120 Hanes Hall

The early history of Statistics in NC and in particular the formation of the departments at NCSU and UNC is well documented in a 1978 Int. Stat. Review paper by a team of authors including Gertrude M Cox, David D Mason, John Monroe of NCSU, and Bernard G Greenberg, Norman L Johnson, Lyle V Jones, Gordon D Simons, of UNC. Further, a recent volume (“Strength in numbers…”) contains interesting historical accounts of the development of many individual departments through 2010. This includes a chapter for the NCSU Statistics Dept by T. Gerig, and one for our department (now Statistics and Operations Research) by D.G. Kelly. A prime motivation for my talk is the recognition that we keep our graduate students very busy with courses and research, but largely unaware of the rich traditions which they inherit in spending significant time here. I also admit the temptation to add my personal perspectives (biases) to the written accounts, from associations with the early players in those exciting times. In this talk I plan to mainly focus on the early period from the remarkable beginnings of NC academic Statistics in 1940, emanating from a chance encounter of university President Frank Porter Graham, on a train ride, with WF Callander of the US Dept of Agriculture. This led to the immediate formation of the Dept. of Experimental Statistics at NCSU, the hiring of Gertrude M Cox as its head, and her joint activity with Frank Graham leading to the formation of the “Institute of Statistics” in 1944 and the UNC “Department of Mathematical Statistics” in 1946 with its “Dream Team” faculty, led by Harold Hotelling. My plan then is to describe these early events – largely as recounted to me by Gertrude Cox herself, for whom I worked for 5 years before joining the UNC department as attrition depleted its awesome first faculty, in the early 1960’s.
I will briefly sketch the later department development and indicate some of the subsequent highlights, especially in its trends of emphasis following the golden era of the “prima donna” theoretical department complementing the splendid Experimental Statistics department in Raleigh (to quote Gertrude Cox’s own vision for its creation). Finally in so doing I shall indicate the department’s early interest and capability in Operations Research, its regular teaching of courses in both stochastic and deterministic aspects in its graduate program, and its role in the formation of a Curriculum (later a separate department) and the ultimate reunification by merger as the current Dept of Statistics and Operations Research. References: “Statistical training and research: The University of North Carolina System”, ISI Review, Aug 1978. Authored by a committee, B. Greenberg (UNC Biostat.), Chair, GM Cox, DD Mason, John Monroe, from NCSU, and NL Johnson, JE Grizzle, LV Jones, GD Simons, UNC. Collated by E. Shepley Nourse, Publications Consultant. “Strength in numbers, the rising of academic statistics departments in the US”, Eds. A. Agresti, X-L Meng, Springer NY, 2013.

03 Sep

Community extraction in multilayer networks
120 Hanes Hall

Community detection is an important problem in network analysis that has been successfully applied in various biological, communication, and social interacting systems. Informally, community detection seeks to divide the vertices of a given network into one or more groups, called communities, in such a way that vertices of the same community are more interconnected than vertices belonging to different ones. Until recently, detection methods have assumed that every vertex belongs to a well-defined community; however, in many applications, networks contain a significant number of non-preferentially attached “background” vertices not belonging to a distinct community. In these applications, contemporary detection methods may provide misleading results due to false discovery. Community extraction – the algorithmic search for dense communities one at a time – is one promising avenue for addressing the issue of background vertices. In this talk I will first discuss community extraction in static networks and recent developments in the area. Then, I will introduce the notion of community extraction in multilayer networks, wherein the data object is a sample of networks each with possibly different relational structure. I will describe an extraction procedure that searches for vertex-layer sets containing a statistically surprising number of edges and will investigate the performance of the multilayer extraction method through various simulations as well as an application to an ADHD-200 fMRI brain imaging data set.

01 May

Filtering with noisy Lagrangian tracers
125 Hanes Hall

An important practical problem is the recovery of a turbulent velocity field from Lagrangian tracers that move with the fluid flow. Despite the inherent nonlinearity in measuring noisy Lagrangian tracers, it is shown that there are exact closed analytic formulas for the optimal filter. When the underlying velocity field is incompressible, the tracers’ distribution converges to the uniform distribution geometrically fast; concrete asymptotic features, such as information barriers, are obtained for the optimal filter when the number of tracers goes to infinity. On the other hand, the filtering of a compressible flow that consists of both geostrophically balanced (GB) modes and gravity waves is also considered. Its performance can be closely approximated by the filter performance of an idealized GB truncation of this model when the Rossby number is small, i.e. the rotation is fast. This phenomenon is caused by fast-wave averaging and inspires a simplified filtering scheme.

21 Apr

Chasing Demand: Learning and Earning in a Changing Environment
120 Hanes Hall

We consider a dynamic pricing problem in which a seller faces an unknown demand model that can change over time. We measure the amount of change over a time horizon of T periods using a quadratic variation metric, and allow a finite “budget” for such changes. We first derive a lower bound on the expected performance gap between any pricing policy and a clairvoyant who knows a priori the temporal evolution of the underlying demand model, and then design families of near-optimal pricing policies, the revenue performance of which asymptotically matches said lower bound. We also show that the seller can achieve a substantially better revenue performance in demand environments that change in “bursts” than it would in a demand environment that changes “smoothly.” Finally, we extend our analysis to the case of rapidly changing demand settings, and obtain a range of results that quantify the net effect of the volatility in the demand environment on the seller’s revenue performance.

14 Apr

Real-Time Control of Ambulance Fleets, and Simulation Optimization using High-Performance Computing
120 Hanes Hall

In the first part of the talk I will discuss ambulance redeployment, in which an ambulance fleet is controlled in real-time to attempt to ensure short response times to calls. I'll focus on the use of simulation optimization to tune approximate-dynamic programs that yield highly effective policies, along with a coupling approach to compute a bound on the optimality gap. This work has motivated us to develop simulation-optimization algorithms that exploit parallel computing capabilities. In the second part of the talk, I'll discuss our work in developing "ranking and selection" algorithms for high-performance computing environments, and show results for runs using up to 1000 cores. Bio: Shane G. Henderson is a professor in the School of Operations Research and Information Engineering at Cornell University. He received his PhD from Stanford University in 1997, and has held academic positions in the Department of Industrial and Operations Engineering at the University of Michigan and the Department of Engineering Science at the University of Auckland. His research interests include discrete-event simulation, simulation optimization, and emergency-services planning.

10 Apr

Convergence of the Maximum of 2-Dimensional Gaussian Free Field and Connections with Branching Brownian Motion
130 Hanes Hall

The 2-dimensional discrete Gaussian free field (GFF) is a Gaussian process on the NxN square in Z^2, with zero boundary data. The behavior of its maximum, as N goes to infinity, has in recent years been the object of considerable interest. Here, we discuss results in two recent papers, B-Zeitouni (2012) and B-Ding-Zeitouni (2014), that address the convergence in distribution of this maximum under appropriate translation. The techniques employed here have a strong connection with those for analyzing the maximum of branching Brownian motion and branching random walk, which we also discuss.

07 Apr

False Discovery Control in Large-Scale Spatial Multiple Testing
120 Hanes Hall

This talk considers both point-wise and cluster-wise spatial multiple testing problems. We derive oracle procedures which optimally control the false discovery rate, false discovery exceedance and false cluster rate, respectively. A data-driven finite approximation strategy is developed to mimic the oracle procedures on a continuous spatial domain. Our multiple testing procedures are asymptotically valid and can be effectively implemented using Bayesian computational algorithms for analysis of large spatial data sets. Numerical results show that the proposed procedures lead to more accurate error control and better power performance than conventional methods. We demonstrate our methods by analyzing the time trends in tropospheric ozone in the eastern US. This is joint work with Brian Reich, Tony Cai, Michele Guindani and Armin Schwartzman.

03 Apr

Brownian crossings via regeneration and the past and future given the present.
130 Hanes Hall

Let B be standard Brownian motion. Suppose B(t) is in (a,b). We derive the limit distributions of the last entrance to and the next exit from (a,b), as well as of the path up to t and the path from t, as t goes to infinity. (Joint work with B. Rajeev of ISI Bangalore.)

26 Mar

Big n, Big p: Eigenvalues for Cov Matrices of Heavy-Tailed Multivariate Time Series
120 Hanes Hall

In this paper we give an asymptotic theory for the eigenvalues of the sample covariance matrix of a multivariate time series when the number of components p goes to infinity with the sample size. The time series constitutes a linear process across time and between components. The input noise of the linear process has regularly varying tails with index between 0 and 4; in particular, the time series has infinite fourth moment. We derive the limiting behavior for the largest eigenvalues of the sample covariance matrix and show point process convergence of the normalized eigenvalues as n and p go to infinity. The limiting process has an explicit form involving points of a Poisson process and eigenvalues of a non-negative definite matrix. Based on this convergence we derive limit theory for a host of other continuous functionals of the eigenvalues, including the joint convergence of the largest eigenvalues, the joint convergence of the largest eigenvalue and the trace of the sample covariance matrix, and the ratio of the largest eigenvalue to their sum. (This is joint work with Thomas Mikosch and Oliver Pfaffel.)

24 Mar

Structural Breaks, Outliers, MDL, Some Theory and Google Trends
120 Hanes Hall

In this lecture, we will take another look at modeling time series that exhibit certain types of nonstationarity. Often one encounters time series for which segments look stationary, but the whole ensemble is nonstationary. On top of this, each segment of the data may be further contaminated by an unknown number of innovational and/or additive outliers, a situation that presents interesting modeling challenges. We will seek to find the best fitting model in terms of the minimum description length principle. As this procedure is computationally intense, strategies for accelerating the computations are required. Numerical results from simulation experiments and real data analyses, some of which come from Google Trends, show that our proposed procedure enjoys excellent empirical properties. In the case of no outliers, there is an underlying theory that establishes consistency of our method. The theory is based on an interesting application of the functional law of the iterated logarithm. (This is joint work with Thomas Lee and Gabriel Rodriguez-Yam.)

05 Mar

The Job Market: 2014
120 Hanes Hall

Recent Ph.D. graduates from the STOR program will discuss their experience on the job market this year. This will be a question and answer session to help current students in our program prepare for the future job market. In particular, Susan Wei and Sean Skwerer will contribute their own experiences. Other graduating students are TBA.

03 Mar

A Dynamic Directional Model for Effective Brain Connectivity using Electrocorticographic (ECoG) Time Series

We introduce a dynamic directional model (DDM) for studying brain effective connectivity based on intracranial electrocorticographic (ECoG) time series. The DDM consists of two parts: a set of differential equations describing neuronal activity of brain components (state equations), and observation equations linking the underlying neuronal states to observed data. When applied to functional MRI or EEG data, DDMs usually have complex formulations and thus can accommodate only a few regions, due to limitations in spatial and/or temporal resolution of these imaging modalities. In contrast, we formulate our model in the context of ECoG data. The combined high temporal and spatial resolution of ECoG data results in a much simpler DDM, allowing investigation of complex connections between many regions. To identify functionally-segregated sub-networks, a form of biologically economical brain networks, we propose the Potts model for the DDM parameters. The neuronal states of brain components are represented by cubic spline bases and the parameters are estimated by minimizing a log-likelihood criterion that combines the state and observation equations. The Potts model is converted to the Potts penalty in the penalized regression approach to achieve sparsity in parameter estimation, for which a fast iterative algorithm is developed. An L_1 penalty is also considered for comparison. The methods are applied to an auditory ECoG data set.

27 Feb

The Dynamics of Retweeting on Twitter
130 Hanes Hall

In this work we analyze the dynamics of retweeting on the micro-blogging site Twitter. We propose a model for Twitter users' retweeting behavior which incorporates two key elements: the arrival patterns of users to Twitter and the design of the Twitter user interface. Using only these elements, our model predicts a distribution of user retweet times which agrees with observations on over 2.4 million retweets in Twitter. Our model allows us to predict the probability of a tweet being viewed by a specific user and the impact of promotional activity on the visibility of a tweet. This suggests that our model can serve as a tool to optimize advertising services provided by Twitter which promote tweets. Bio: Tauhid Zaman is an Assistant Professor of Operations Management at the MIT Sloan School of Management. His research focuses on utilizing large-scale data from online social networks such as Facebook and Twitter to develop predictive models for user behavior and enhance business operations. He received his BS, MEng, and PhD degrees in electrical engineering and computer science from MIT. Before returning to MIT he spent one year as a postdoctoral researcher in the Wharton Statistics Department at the University of Pennsylvania. His work has been featured in Wired, Mashable, the LA Times, and Time Magazine.

26 Feb

Inference on Covariance Structure & Sparse Discriminant Analysis
120 Hanes Hall

Covariance structure is of fundamental importance in many areas of statistical inference and a wide range of applications, including genomics, fMRI analysis, risk management, and web search problems. In the high dimensional setting where the dimension p can be much larger than the sample size n, classical methods and results based on fixed p and large n are no longer applicable. In this talk, I will discuss some recent results on optimal estimation of large covariance and precision matrices. The results and technical analysis reveal new features that are quite different from the conventional low-dimensional problems. I will also discuss sparse linear discriminant analysis with high-dimensional data.

24 Feb

Fairness, Efficiency and Flexibility in Organ Allocation for Kidney Transplantation

We propose a scalable, data-driven method for designing national policies for the allocation of deceased donor kidneys to patients on a waiting list, in a fair and efficient way. We focus on policies that have the same form as the one currently used in the U.S. In particular, we consider policies that are based on a point system, which ranks patients according to some priority criteria, e.g., waiting time, medical urgency, etc., or a combination thereof. Rather than making specific assumptions about fairness principles or priority criteria, our method offers the designer the flexibility to select his desired criteria and fairness constraints from a broad class of allowable constraints. The method then designs a point system that is based on the selected priority criteria, and approximately maximizes medical efficiency, i.e., life year gains from transplant, while simultaneously enforcing selected fairness constraints. Among the several case studies we present employing our method, one case study designs a point system that has the same form, uses the same criteria and satisfies the same fairness constraints as the point system that was recently proposed by U.S. policymakers. In addition, the point system we design delivers an 8% increase in extra life year gains. We evaluate the performance of all policies under consideration using the same statistical and simulation tools and data as the U.S. policymakers use.

20 Feb

Dynamical random graph processes with bounded-size rules and the augmented multiplicative coalescent

The last few years have seen significant interest in random graph models with limited choice. One of the standard models is the Achlioptas random graph process on a fixed set of n vertices. Here at each step, one chooses two edges uniformly at random and then decides which one to add to the existing configuration according to some criterion. An important class of rules are the bounded-size rules (BSR) wherein for a fixed value K, all components of size greater than K are treated equally. We prove that, through the critical window, the component sizes and surplus edges of BSR random graph processes exhibit the same merging dynamic as the classic Erdos-Renyi process. The key observation in the proof is that BSR random graph models can be related to a family of inhomogeneous random graph models. We also introduce the augmented multiplicative coalescent, which captures the evolution of both component sizes and surpluses of BSR random graph processes in the critical window. This is joint work with Shankar Bhamidi and Amarjit Budhiraja.

19 Feb

Managing Capacity for a Disruptive Innovation
120 Hanes Hall

Disruptive innovations are often associated with high demand and limited supply capacity. The two factors conspire to generate a decreasing price pattern that is influenced by the innovator's capacity policy, which endogenizes the price evolution. This paper studies the innovator's dynamic capacity management policies with such endogenous price evolution. We reveal several managerially relevant insights: 1) the innovator should never reduce capacity before exiting the market; 2) at any time, the innovator's capacity building policy can be described with a single capacity target; 3) the innovator should build higher capacity for a more uncertain market; and 4) when estimating the market's prospect, a conservative estimator is better than an unbiased one. These insights can potentially inform innovators' capacity management decisions. Joint work with Jianfeng Lu, Department of Mathematics, Duke University.

05 Feb

Finding optimal treatment dose using outcome weighted learning
120 Hanes Hall

Finding an optimal treatment dose is an important issue in clinical trials. Recently, there has been an increasing need to consider individual heterogeneity in finding the optimal treatment dose. In particular, instead of determining a fixed dose for all patients, it is desirable to find a decision rule as a function of patient characteristics such that the expected clinical outcome at the population level is maximized. We propose a randomized trial design for optimal dose finding and provide a corresponding analysis method. We show that our proposed dose finding method using randomized trial data can be regarded as an inverse probability weighting estimator of expected clinical outcome. Further, we show the estimation problem is equivalent to solving a weighted regression problem with a truncated L1 loss function. An efficient difference convex algorithm is proposed to solve the associated non-convex optimization problem. We also derive the asymptotic consistency of the estimated decision rule. In addition, the performance of the proposed method and competitive methods is illustrated through both simulation examples and a real dosage identification example for Warfarin (an anti-thrombosis drug).

30 Jan

The Gaussian Kinematic Formula, with some applications
130 Hanes Hall

The Gaussian kinematic formula is a rather amazing result in the theory of smooth Gaussian processes on stratified manifolds, due mainly to Jonathan Taylor. In its simplest version, the GKF reduces to the 60-year-old Rice formula, which for decades has been at the core of applications of time dependent Gaussian processes. A more general version applies to random fields. However, in its full glory, the GKF is a result that generalises both to random settings and to infinite dimensions the classical Kinematic Fundamental Formula of Integral Geometry, while also giving a probabilistic extension of the classical Gauss-Bonnet theorem for manifolds. It also goes considerably beyond the purely Gaussian scenario. In addition to its elegant theory, the GKF has a myriad of applications, ranging from Astrostatistics and fMRI Imaging to very recent applications in Topological Data Analysis and to adaptive estimation in the lasso and related techniques of sparse regression. In this (blackboard) talk I want to describe the GKF and some of its applications. Proofs - which are typically long and hard - will be avoided, so that the material will be accessible to a broad audience.

13 Jan

Nonparametric kernel regression with multiple predictors and multiple shape constraints
120 Hanes Hall

Nonparametric smoothing under shape constraints has recently received much well-deserved attention. Powerful methods have been proposed for imposing a single shape constraint such as monotonicity and concavity on univariate functions. In this paper, we extend the monotone kernel regression method in Hall and Huang (2001) to the multivariate and multi-constraint setting. We impose equality and/or inequality constraints on a nonparametric kernel regression model and its derivatives. A bootstrap procedure is also proposed for testing the validity of the constraints. Consistency of our constrained kernel estimator is provided through an asymptotic analysis of its relationship with the unconstrained estimator. Theoretical underpinnings for the bootstrap procedure are also provided. Illustrative Monte Carlo results are presented and an application is considered.

08 Jan

Multicategory Angle-based Large-margin Classification
120 Hanes Hall

Large-margin classifiers are popular classification methods in both machine learning and statistics. These techniques have been successfully applied in many scientific disciplines such as bioinformatics. Despite the success of binary large-margin classifiers, extensions to multicategory problems are quite challenging. Among existing simultaneous multicategory large-margin classifiers, a common approach is to learn k different classification functions for a k-class problem with a sum-to-zero constraint. Such a formulation can be inefficient. In this talk, I will present a new Multicategory Angle-based large-margin Classification (MAC) framework. The proposed MAC structure considers a simplex based prediction rule without the sum-to-zero constraint, and consequently enjoys more efficient computation. Many binary large-margin classifiers can be naturally generalized for multicategory problems through the MAC framework. Both theoretical and numerical studies will be discussed to demonstrate the usefulness of the proposed MAC classifiers.

05 Dec

Critical Care in Hospitals: When to Introduce a Step Down Unit
120 Hanes Hall

Step Down Units (SDUs) provide an intermediate level of care between the Intensive Care Units (ICUs) and the general medical-surgical wards. Because SDUs are less richly staffed than ICUs, they are less costly to operate; however, they also are unable to provide the level of care required by the sickest patients. There is an ongoing debate in the medical community as to whether and how SDUs should be used. On one hand, an SDU alleviates ICU congestion by providing a safe environment for post-ICU patients before they are stable enough to be transferred to the general wards. On the other hand, an SDU can take capacity away from the already over-congested ICU. In this work, we propose a queueing model to capture the dynamics of patient flows through the ICU and SDU in order to determine how to size the ICU and SDU.

We account for the fact that patients may abandon if they have to wait too long for a bed, while others may get bumped out of a bed if a new patient is more critical. Using fluid and diffusion analysis, we examine the tradeoff between the flexibility of ICUs to treat patients of varying severity versus the additional capacity achieved by allocating nurses to the SDUs due to the lower staffing requirement. Despite the complex patient flow dynamics, we leverage a state-space collapse result in our diffusion analysis to establish the optimal allocation of nurses to units. We find that under some circumstances the optimal size of the SDU is zero, while in other cases, having a sizable SDU may be beneficial. The insights from our work will be useful for hospital managers determining how to allocate nurses to the hospital units, which subsequently determines the size of each unit.

02 Dec

Markov Decision Problems where Means Bound Variances
120 Hanes Hall

We identify a rich class of finite-horizon Markov decision problems (MDPs) for which the variance of the optimal total reward can be bounded by a simple affine function of its expected value. The class is characterized by three natural properties: reward boundedness, existence of a do-nothing action, and optimal action monotonicity. These properties are commonly present and typically easy to check. Implications of the class properties and of the variance bound are illustrated by examples of MDPs from operations research, operations management, financial engineering, and combinatorial optimization. Joint work with Noah Gans and J. Michael Steele.

25 Nov

Optimal Design of the Annual Influenza Vaccine with Manufacturing Autonomy
120 Hanes Hall

Seasonal influenza is a major public health concern, and the first line of defense is the flu shot. However, antigenic drifts and the high rate of influenza transmission require annual updates to the flu shot composition. The World Health Organization recommends which flu strains to include in the annual vaccine, based on surveillance and epidemiological analysis. There are two critical decisions regarding the flu shot design. One is its composition; currently, three strains constitute the flu shot, and they influence vaccine effectiveness. Another critical decision is the timing of the composition decisions, which affects the flu shot availability. Both of these decisions take place at least six months before the influenza season, as flu shot production has many time-sensitive steps.

We propose a bilevel multi-stage stochastic mixed-integer program that maximizes societal benefit of the flu shot under autonomous profit maximizing manufacturers. Calibrated over publicly available data, our model returns the optimal flu shot composition and timing in a stochastic and dynamic environment. We derive analytical results, and perform numerical experiments to analyze the effect of yield uncertainty on the vaccine supply. We also study the impact of supply- and demand-side interventions.

14 Nov

Belief Propagation for Optimal Edge-Cover in the Random Complete Graph
120 Hanes Hall

We apply the objective method of Aldous to the problem of finding the minimum cost edge-cover of the complete graph with random independent and identically distributed edge-costs. The limit, as the number of vertices goes to infinity, of the expected minimum cost for this problem is known via a combinatorial approach of Hessler and Wastlund. We provide a proof of this result using the machinery of the objective method and local weak convergence, which was used to prove the zeta(2) limit of the random assignment problem. A proof via the objective method is useful because it provides more information on the nature of the edges incident on a typical root in the minimum cost edge-cover. We further show that a belief propagation algorithm converges asymptotically to the optimal solution. The belief propagation algorithm yields a near optimal solution with lower complexity than the best known algorithms designed for optimality in worst-case settings.

13 Nov

Size of the Largest Component in Dynamical Random Graph Processes
120 Hanes Hall

A family of random graphs can be constructed dynamically as follows: The graph starts with n isolated vertices, and edges are added to the graph step by step. At each step, two pairs of vertices are picked uniformly at random and only one pair of vertices is linked with an edge. The decision is based on certain rules which depend only on the sizes of the components associated with the four chosen vertices. Time is scaled so that at time t, [nt/2] edges have been added. This random graph model has a phase transition in the sense that there exists a critical time t_critical > 0 such that when t < t_critical, all components in the graph are small, and when t > t_critical, the largest component contains a positive proportion of all the vertices. More precisely, in the large n limit, the size of the largest component increases from O(log n) to Θ(n) as t increases across t_critical. In our work, we show that the size of the largest component is of the order n^(2/3) at time t ≈ t_critical. This is joint work with Shankar Bhamidi and Amarjit Budhiraja.
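
The construction described in this abstract can be simulated in a few lines. The following is a minimal sketch, not the speaker's code: it assumes the Bohman-Frieze rule (keep the first pair if both of its vertices are isolated, otherwise keep the second pair) as one illustrative example of a rule that depends only on component sizes. The names `UnionFind`, `bohman_frieze_rule` and `largest_component` are hypothetical, and vertices are drawn with replacement, so self-loops and repeated edges are allowed for simplicity.

```python
import random


class UnionFind:
    """Disjoint-set forest that also tracks component sizes."""

    def __init__(self, n):
        self.parent = list(range(n))
        self.size = [1] * n

    def find(self, v):
        # Path halving keeps the trees shallow.
        while self.parent[v] != v:
            self.parent[v] = self.parent[self.parent[v]]
            v = self.parent[v]
        return v

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra == rb:
            return  # edge inside one component: sizes do not change
        if self.size[ra] < self.size[rb]:
            ra, rb = rb, ra
        self.parent[rb] = ra
        self.size[ra] += self.size[rb]

    def component_size(self, v):
        return self.size[self.find(v)]


def bohman_frieze_rule(sizes):
    """Return 0 to keep the first pair, 1 to keep the second.

    sizes holds the component sizes of the four chosen vertices.
    Assumed example rule: keep the first pair only if both of its
    endpoints are isolated vertices.
    """
    return 0 if sizes[0] == 1 and sizes[1] == 1 else 1


def largest_component(n, t, rule=bohman_frieze_rule, seed=0):
    """Add floor(n*t/2) edges and return the largest component size."""
    rng = random.Random(seed)
    uf = UnionFind(n)
    for _ in range(int(n * t / 2)):
        v = [rng.randrange(n) for _ in range(4)]  # two candidate pairs
        sizes = tuple(uf.component_size(x) for x in v)
        if rule(sizes) == 0:
            uf.union(v[0], v[1])
        else:
            uf.union(v[2], v[3])
    return max(uf.component_size(x) for x in range(n))
```

Calling `largest_component(n, t)` over a grid of `t` values for a large `n` gives a rough empirical picture of the phase transition; any other size-based rule can be passed through the `rule` argument.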

Qing Feng, UNC Chapel Hill

Integrated Analysis of Multi-Block Data

Much scientific research now involves the analysis of multiple disparate high-dimensional datasets on the same set of individuals. Multi-block data analysis is therefore a prevalent and challenging problem for data scientists. I’ll briefly discuss my recent research and learning on the integrated analysis of multi-block data. The discussion will mainly focus on previous work in this field and on the recently developed Joint and Individual Variation Explained (JIVE) method.

11 Nov

Statistical Methods for Ambulance Fleet Management
120 Hanes Hall

We introduce statistical methods to address two estimation problems arising in the management of ambulance fleets: (1) predicting the distribution of ambulance travel time between arbitrary start and end locations in a road network; and (2) space-time forecasting of ambulance demand. These predictions are critical for deciding how many ambulances should be deployed at a given time and where they should be stationed, which ambulance should be dispatched to an emergency, and whether and how to schedule ambulances for non-urgent patient transfers. We demonstrate the accuracy and operational impact of our methods using ambulance data from Toronto Emergency Medical Services.

For travel time estimation the relevant data are Global Positioning System (GPS) recordings from historical lights-and-sirens ambulance trips. Challenges include the typically large size of the road network and dataset (70,000 network links and 160,000 historical trips for Toronto), the lack of trips in the historical data that follow precisely the route of interest, and uncertainty regarding the route taken in the historical trips (due to sparsity of the GPS recordings). We introduce a model of the travel time at the network link level, assuming independence across links, as well as a model at the trip level, and compare them. We also introduce methods for both joint estimation of the travel time parameters and unknown historical routes, and more computationally efficient two-stage estimation. For space-time forecasting of demand we develop integer time-series factor models and spatio-temporal mixture models, which capture the complex weekly and daily patterns in demand as well as changes in the spatial demand density over time.

07 Nov

Stationary Densities and Disintegration Kernels
130 Hanes Hall

04 Nov

Error Bounds for LASSO: Functional Linear Regression with Gaussian Design
120 Hanes Hall

We discuss a functional regression problem with Gaussian random design. It is assumed that the regression function can be well approximated by a "sparse" functional linear model in which the slope function can be represented as a sum of a small number of well separated "spikes". We study an estimator of the regression function based on penalized empirical risk minimization with quadratic loss, the complexity penalty being defined in terms of the L_1-norm (a functional version of LASSO). The goal is to introduce important parameters characterizing sparsity in such problems and to prove sharp oracle inequalities showing how the L_2-error of the functional LASSO estimator depends on the underlying sparsity of the problem. This is joint work with Stas Minsker.

28 Oct

Distribution of the discrete scan statistic for multi-state higher-order Markovian sequences
120 Hanes Hall

The discrete scan statistic is used in many areas of applied probability and statistics to study local clumping of patterns. Testing based on the statistic requires tail probabilities. Although the distribution has been studied extensively, most of the results are approximations, due to the difficulties associated with the computation. Results for exact p-values for the statistic have been given for a binary sequence that is independent or first-order Markovian. We give an algorithm to obtain probabilities for the statistic over multi-state trials that are Markovian of a general order of dependence, and explore the algorithm’s usefulness.
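
For readers unfamiliar with the statistic, here is a minimal sketch of its simplest form, assuming a binary (0/1) sequence and a fixed window length `w`: the scan statistic is taken to be the largest number of successes in any `w` consecutive trials. This only illustrates the quantity being studied; it is not the exact-distribution algorithm from the talk, and the function name `scan_statistic` is hypothetical.

```python
def scan_statistic(seq, w):
    """Largest number of ones in any window of w consecutive trials.

    seq is a sequence of 0/1 outcomes; w is the window length.
    """
    if not 1 <= w <= len(seq):
        raise ValueError("window length must be between 1 and len(seq)")
    window = sum(seq[:w])  # count in the first window
    best = window
    for i in range(w, len(seq)):
        # Slide the window one step: add the new trial, drop the oldest.
        window += seq[i] - seq[i - w]
        best = max(best, window)
    return best
```

A test for clumping compares the observed value with its tail probability under the null model for the sequence, which is the quantity the abstract refers to.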

24 Oct

Homogenisation for Multi-Dimensional Maps and Flows
130 Hanes Hall

We will discuss diffusion limits for multi-dimensional slow-fast systems, in both discrete and continuous time. In particular, we focus on the homogenisation of "deterministic" slow-fast systems, where the role of the "fast, noisy" process is played by a chaotic signal. A significant obstacle in this area is the case in which the fast "noisy" process appears multiplicatively in the slow process. We will show how to overcome this difficulty using rough-path-like machinery. In the case of continuous time dynamics, one heuristically expects to see Stratonovich integrals appearing in the homogenised equations. This heuristic has been thoroughly verified in one dimensional systems. However, we show that in higher dimensions this is almost always wrong; instead one encounters a perturbed version of the Stratonovich integral. This is joint work with Ian Melbourne.

23 Oct

Empirical Analysis of Sequential Trade Models for Market Microstructure
120 Hanes Hall

Market microstructure concerns how different trading mechanisms affect asset price formation. It generalizes the classical asset pricing theory under frictionless perfect market conditions in various directions. Most market microstructure models focus on two important aspects: (a) asymmetric information shared by different market participants (informed traders, market makers, liquidity traders, etc.) and (b) transaction costs reflected in bid-ask spreads. The complexity of those models presents significant challenges to empirical studies in this research area. In this work, we consider some extensions of the seminal sequential trade model in Glosten and Milgrom (Journal of Financial Economics, 1985) and perform Bayesian MCMC inference based on the TAQ (trade and quote) database in Wharton Research Data Services. Issues in both (a) and (b) are addressed in our study. In particular, the latent process of fundamental asset value is modeled with GARCH volatilities; the observed and predicted bid-ask price sequences are related by incorporating parameters for pricing errors and for informed traders’ impact.

Minghui Liu, UNC Chapel Hill

Some Topics About Network Problems

I will talk about some network problems which include centrality, clique, biconnected components, core decomposition and transitive closure. If time permits, something about DAG (directed acyclic graph) will be mentioned as well.

03 Oct

Risk-Sensitive Control of Continuous-Time Markov Chains
130 Hanes Hall

We study risk-sensitive control of continuous time Markov chains taking values in a discrete state space. We study both finite and infinite horizon problems. In the finite horizon problem we characterize the value function via the HJB equation and obtain an optimal Markov control. We do the same for the infinite horizon discounted cost case. In the infinite horizon average cost case we establish the existence of an optimal stationary control under a certain Lyapunov condition. We also develop a policy iteration algorithm for finding an optimal control.

02 Oct

Appointment Scheduling under Customer Preferences
120 Hanes Hall

We consider an appointment system where the patients have preferences about the appointment days. A patient may be scheduled on one of the days that is acceptable to her, or be denied an appointment. The patient may or may not show up at the appointed time. The cost is a convex function of the actual number of patients served on a given day. We present structural properties of the appointment policy, and study a heuristic policy that is easy to implement and performs well.

Zhankun Sun, UNC Chapel Hill

Prioritization in Services when Customer Types are Unknown

We consider a service system with two types of customers. The type of each customer is unknown but the server has the option of performing triage and making an imperfect classification. Each customer incurs a linear waiting cost depending on its type. The server has to weigh the benefit of triage and prioritization under imperfect information against the cost of delay by triage. We characterize the optimal dynamic and static policies that minimize the expected cost.

30 Sep

Nonlinear Evolution Equations Driven by Levy Processes and their Queuing Interpretations
120 Hanes Hall

One of the motivations of our program was to develop understanding of the interplay between the nonlinear and nonlocal components in evolution equations driven by the infinitesimal generators of processes with jumps, such as Levy processes and flights. In the process we also studied the probabilistic approximations (propagation of chaos) for several extensions of the classical quasilinear and strongly linear PDEs, including the conservation laws, porous medium equation, and reaction-diffusion type equations for Darwinian evolutionary population models where the hydrodynamic limits may still preserve some "background" random noise. Some queuing theory interpretations will be included.

26 Sep

Maximum Independent Sets in Random d-regular Graphs
130 Hanes Hall

Satisfaction and optimization problems subject to random constraints are a well-studied area in the theory of computation. These problems also arise naturally in combinatorics, in the study of sparse random graphs. While the values of limiting thresholds have been conjectured for many such models, few have been rigorously established. In this context we study the size of maximum independent sets in random d-regular graphs. We show that for d exceeding a constant d(0), there exist explicit constants A, C depending on d such that the maximum size has constant fluctuations around A*n - C*(log n), establishing the one-step replica symmetry breaking heuristics developed by statistical physicists. As an application of our method we also prove an explicit satisfiability threshold in random regular k-NAE-SAT. This is joint work with Allan Sly and Nike Sun.

25 Sep

Outside the Academic Setting
120 Hanes Hall

A panel of current students will share their experiences having worked or interned in a variety of non-academic settings. The panel's members will begin by discussing topics ranging from:

  • Finding and interviewing for positions
  • Impressions of different work environments and
  • Comparisons to the work they have done in this department.

What time remains will be left open for questions. Panel members have interned at Quintiles, Bell Labs, Shutterfly, SAS, Carillon Assisted Living, and Boehringer Ingelheim and previously worked at RTI.

18 Sep

Clustering Considerations: Biclustering and Clustering on Networks
120 Hanes Hall

Gen will give a brief talk about the motivation and background of biclustering, followed by two recently developed biclustering methods: Large Average Submatrix, LAS, and Sparse Singular Value Decomposition, SSVD. Open areas for future research will be addressed at the end of the talk.

James Wilson, UNC Chapel Hill

Clustering of Networks

James will give a motivation behind clustering of networks, also known as community detection. An overview of community detection methods will be given first, followed by a discussion of how one may assess the statistical significance of communities. There will be several examples to demonstrate the application of community detection methods, including an example using James' own Facebook data set. Friend him now for a larger data set!

16 Sep

A Dynamic Mechanism for Achieving Sustainable Quality Supply
120 Hanes Hall

Several leading companies have realized the importance of sustainable quality supply and initiated programs to achieve it. One example is Starbucks' C.A.F.E. Practices. This paper investigates whether the guidelines in such programs provide the right incentives and information structures for all parties to participate, including supplier development, to achieve the intended long-term goals. To that end, we present a stylized multi-period, two-party model with double-sided asymmetric information and a dynamically changing environment.

We construct a supply agreement between the two parties which leads to the desired equilibrium. We show that the mechanism induces efficient supplier development investments by the retailer. We then compare the key elements of the agreement with existing industry guidelines. Some of the elements in our theoretical mechanism are consistent with current industry guidelines, but some are not. We expect these results to be helpful in guiding the design and administration of sustainability programs. Joint work with Tracy Lewis and Fang Liu.

11 Sep

An Introduction to Cluster Analysis
120 Hanes Hall

With larger and more complex datasets becoming available in areas such as genomics and the Web, exploratory analysis tools, such as clustering and dimensionality reduction, have become an essential part of modern data analysis. In this talk, we will focus on clustering - the task of partitioning unlabeled data into subsets, called clusters, of similar objects. We first provide a brief introduction to standard clustering approaches and devote most of our time to the following important questions in cluster analysis:

  1. how do we determine the optimal number of clusters, and
  2. how do we assess the significance of these clusters?

While numerous approaches have been proposed for clustering, less work has been done to address these questions. Existing methods, including the gap statistic, SigClust, a recently proposed bootstrapping approach, and others, will be introduced.

The talk will be prefaced by a short discussion of the revised format and aims of the graduate student colloquia.

Patrick Kimes, UNC Chapel Hill

10 Sep

Integrating Academic and Family Life
120 Hanes Hall

There is often a perception that having a family and an academic career at a Research I university are incompatible, especially for women. In this talk I will discuss the challenges, share my personal experiences, and present some tips for integrating academic and family life.

09 Sep

The Relative Size of Big Data: Perspectives from an Interdisciplinary Statistician
120 Hanes Hall

Big data problems occur when available computing resources (CPU, communication bandwidth, and memory) cannot accommodate the computing demand on the data at hand. In this introductory overview talk, we provide perspectives on big data motivated by a collaborative project on coded aperture imaging with the advanced light source (ALS) group at the Lawrence Berkeley National Lab. In particular, we emphasize the key role in big data computing played by memory and communication bandwidth. Moreover, we briefly review available resources to monitor memory and time efficiency of algorithms in R and discuss active big data research topics. We conclude that the bottleneck between statisticians and big data is human resources, including interpersonal, leadership, and programming skills.

15 Jul

Effective Methodologies for High-Dimensional Data and Its Applications
120 Hanes Hall

A common feature of high-dimensional data is that the dimension is high while the sample size is relatively low. We call such data HDLSS data. Aoshima and Yata (2011a,b) developed a variety of inference methods for HDLSS data, such as given-bandwidth confidence regions, two-sample tests, tests of the equality of covariance matrices, correlation tests, classification, regression, and variable selection. The keys are non-Gaussian HDLSS asymptotics, geometric representations, the cross-data-matrix methodology, and sample size determination to ensure prespecified accuracy for the inference. Hall et al. (2005) and Jung and Marron (2009) gave geometric representations of HDLSS data under a Gaussian-type assumption. As for non-Gaussian HDLSS data, Yata and Aoshima (2012) found a completely different geometric representation. Jung and Marron (2009) considered the conventional PCA for HDLSS data under a Gaussian-type assumption.

As for non-Gaussian HDLSS data, Yata and Aoshima (2009) showed that the conventional PCA cannot give consistent estimates for eigenvalues, eigenvectors and PC scores. Yata and Aoshima (2010) created a new PCA called the cross-data-matrix (CDM) methodology that offers consistent estimates of those quantities for non-Gaussian HDLSS data. The CDM methodology is also an effective tool to construct a statistic of inference for HDLSS data at a reasonable computational cost. Yata and Aoshima (2013) created the extended cross-data-matrix (ECDM) methodology that offers an optimal unbiased estimate in inference for HDLSS data and they applied the ECDM methodology to the test of correlations given by Aoshima and Yata (2011a).

A review of research on high-dimensional data analysis is given by Aoshima and Yata (2013a). In this talk, I would like to give an overview of recent developments on high-dimensional data analysis. The Gaussian assumption and the equality of covariance matrices are not assumed. I would like to show the possibility of a variety of inference that can ensure prespecified accuracy by utilizing geometric characteristics of HDLSS data. I will introduce a new classification procedure called the misclassification rate adjusted classifier, developed by Aoshima and Yata (2013b), that can ensure accuracy in misclassification rates for multiclass classification. Finally, I will give some examples of discriminant analysis and cluster analysis for HDLSS data by using microarray data sets.

18 Apr

Recent results on inviscid limits for the stochastic Navier-Stokes equations and related systems.
130 Hanes Hall

One of the original motivations for the development of stochastic partial differential equations was the study of turbulence. In particular, invariant measures provide a canonical mathematical object connecting the basic equations of fluid dynamics to the statistical properties of turbulent flows. In this talk we discuss some recent results concerning inviscid limits in this class of measures for the stochastic Navier-Stokes equations and other related systems arising in geophysical and numerical settings.

15 Apr

Meta-Analysis Based Variable Selection for Gene Expression Data
120 Hanes Hall

Recent advances in biotechnology and their wide applications have led to the generation of many high-dimensional gene expression data sets that can be used to address similar biological questions. Meta-analysis plays an important role in summarizing and synthesizing scientific evidence from multiple studies. When the dimensions of datasets are high, it is desirable to incorporate variable selection into meta-analysis to improve model interpretation and prediction. In this talk, we propose a novel method called meta-lasso for variable selection with high dimensional meta-data.

Through a hierarchical decomposition on regression coefficients, our method not only borrows strength across multiple data sets to boost the power to identify important genes, but also keeps the selection flexibility among data sets to take into account data heterogeneity. We show that our method possesses gene selection consistency with NP-dimensionality. Simulation studies demonstrate the good performance of our method. We applied our meta-lasso method to a meta-analysis of five cardiovascular studies. The analysis results are clinically meaningful.

08 Apr

Of Cells and People: Multiscale Modeling of HPV Dynamics
120 Hanes Hall

Infection with the Human Papilloma Virus (HPV) is a prerequisite for the development of cervical cancer, the second most common cancer in women in the developing world. In addition, HPV-related male cancers are on the rise worldwide. While the lifetime risk of HPV infection is about 80%, most individuals clear the virus within 2 years. However, if the infection persists, further cellular events can eventually lead to invasive carcinoma. Motivated by the fact that various aspects of HPV infection, transmission and carcinogenesis remain poorly understood to date, we develop two models, one at the cellular level and one at the population level.

In the first part of the talk we develop a stochastic model of the cervical epithelium coupled to the infection dynamics of HPV, present insights gained from the model, and discuss the connection to the multistage process of carcinogenesis. In the second part of the talk, we focus on questions related to HPV transmission and optimal vaccination strategies. Acknowledging that the temporal ordering of relationships plays a crucial role for sexually transmitted diseases, we develop a dynamic random graph model of adolescent sexual networks. Based on the random graph model for the network, we then study HPV transmission and different vaccination strategies.

03 Apr

Statistical Inference for Parameters: P-values, Intervals, Distributions
120 Hanes Hall

Statistics has the wealth of multiple ways of doing statistical inference and different ways give different answers. Logic would suggest: Two ways, different answers: one, other or both is wrong. Physics with more than one theory got billions of taxpayer money to test theirs. Statistics can't ignore L'Aquila, Vioxx and Challenger. We examine some very simple examples and consider the role of continuity with regular statistical models: the theories agree but Statistics may need more for L'Aquila, Vioxx or Challenger and more than algorithms and mining and exploring.

25 Mar

Optional Randomized Response Models: Efficiency vs. Privacy Protection
120 Hanes Hall

Randomized response models, introduced by Warner (1965, Journal of the American Statistical Association), are important data acquisition tools in social and behavioral sciences where researchers are often faced with sensitive questions. These models allow respondents to provide a scrambled response offering them complete privacy. The researcher is able to unscramble the responses at an aggregate level but not at an individual level. These models are very useful in social sciences research but have also been used in many other fields such as business, criminology, medicine and public health.
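As a rough illustration of "unscrambling at an aggregate level", the sketch below uses the original binary design from the 1965 paper, not the optional or two-stage models discussed in this talk. It assumes each respondent answers the direct question ("are you in the sensitive group?") with known probability p and the complementary question otherwise, so that P(yes) = p*pi + (1-p)*(1-pi). The function names and the simulated data are hypothetical.

```python
import random


def warner_estimate(responses, p):
    """Estimate the sensitive proportion pi from scrambled yes/no answers.

    Assumes the binary design: P(yes) = p*pi + (1 - p)*(1 - pi),
    which is solved for pi. Requires p != 0.5.
    """
    if p == 0.5:
        raise ValueError("p = 0.5 makes pi unidentifiable")
    yes_rate = sum(responses) / len(responses)
    return (yes_rate - (1 - p)) / (2 * p - 1)


def simulate_responses(n, pi, p, rng):
    """Simulate n scrambled answers (hypothetical data, for illustration)."""
    answers = []
    for _ in range(n):
        member = rng.random() < pi   # true, unobserved status
        direct = rng.random() < p    # which question the device selects
        answers.append(member if direct else not member)
    return answers


if __name__ == "__main__":
    rng = random.Random(0)
    data = simulate_responses(200_000, pi=0.3, p=0.7, rng=rng)
    print(warner_estimate(data, p=0.7))
```

No individual answer reveals a respondent's status, yet the aggregate yes-rate identifies the population proportion whenever p differs from 1/2.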

Among the newer RRT models are the one-stage and two-stage optional randomized response models. An Optional RRT model, introduced by Gupta, Gupta and Singh (2002, Journal of Statistical Planning and Inference), is a variation of the usual randomized response model and is based on the premise that a question may be sensitive for one respondent but may not be sensitive for another, and hence the choice to provide a truthful response or a scrambled response should be left to the respondent. Gupta, Shabbir and Sehra (2010, Journal of Statistical Planning and Inference) have recently introduced a two-stage quantitative response optional RRT model where a randomly selected pre-determined proportion (T) of the subjects is asked to provide a truthful response and the rest of the respondents are asked to provide a response using the optional RRT model, although the researcher would not know if a respondent provided a truthful response or a scrambled response. One would expect a two-stage model to always perform better than the one-stage model regardless of the value of T but we observe that this is not true in general. We will discuss how to choose an optimal value of T.

Protection of respondent privacy is a major issue when dealing with RRT models. Recently Mehta et al. (2012, Journal of Statistical Theory and Practice) have introduced a different two-stage model which is not only more efficient than the Gupta, Shabbir and Sehra (2010, Journal of Statistical Planning and Inference) model, but also offers greater privacy protection. We will discuss how the efficiency of a RRT model can be artificially.

18 Mar

Consistent Cross-Validation for Tuning Parameter Selection in High-Dimensional Variable Selection
120 Hanes Hall

Asymptotic behavior of tuning parameter selection in the standard cross-validation methods is investigated for the high-dimensional variable selection problem. It is shown that the shrinkage problem with LASSO penalty is not always the true reason for the over-selection phenomenon in cross-validation based tuning parameter selection. After identifying the potential problems with the standard cross-validation methods, we propose a new procedure, Consistent Cross Validation (CCV), for selecting the optimal tuning parameter. CCV is shown to enjoy model selection consistency. Extensive simulations and real data analysis support the theoretical results and demonstrate that CCV also works well in terms of prediction.

28 Feb

Eigenvalues of Sparse Random Regular Graphs
130 Hanes Hall

Adjacency matrices of sparse random regular graphs are long conjectured to lie within the universality class of random matrices. However, there are few rigorously known results. We focus on fluctuations of linear eigenvalue statistics of a stochastic process of such adjacency matrices growing in dimension. The idea is to compare with eigenvalues of minors of Wigner matrices whose fluctuation converges to the Gaussian Free Field. We show that linear eigenvalue statistics can be described by a family of Yule processes with immigration. Certain key features of the Free Field emerge as the degree tends to infinity. Based on joint work with Tobias Johnson.

27 Feb

Joint Statistical Modeling of Multiple High Dimensional Datasets
120 Hanes Hall

With the abundance of high-dimensional data, sparse techniques have become popular in recent years for simultaneous variable selection and estimation. In this talk, I will present some new sparse techniques for joint analysis of multiple high-dimensional data under two different settings. In the first setting, the task of estimating multiple graphical models with some common structure is considered. In this case, the goal is to estimate the common structure more efficiently, as well as to identify some differences among these graphical models.

In the second setting, in order to explore the relationship between two high-dimensional datasets, modeling multiple response variables jointly with a common set of predictor variables is explored. New penalized methods which incorporate the joint information among response variables are proposed. The new methods estimate the regression coefficient matrix, as well as the conditional graphical models of response variables. Application of these techniques to cancer gene expression and micro-RNA expression datasets will be presented.

28 Jan

Minimax Estimation of High-Dimensional Predictive Densities
120 Hanes Hall

Over the last decade, operational analytics in the fields of weather forecasting, financial investments, sports betting, etc. have been undergoing a gradual evolution from point prediction towards probabilistic forecasting. Reliable predictive systems for these occurrences can be built on efficient predictive density estimates of the associated high-dimensional parametric models. Recently, new directions have opened up in statistical probability forecasting as decision theoretic parallels have been established between predictive density estimation in high-dimensional Gaussian models and the comparatively well-studied problem of point estimation of the multivariate normal mean.

Building on these parallels we present a frequentist perspective on roles of shrinkage and sparsity in predictive density estimation under Kullback-Leibler loss. Studying the problem of minimax estimation of sparse predictive densities we find new phenomena which contrast with results in point estimation theory, and are explained by the new notion of risk diversification. The uncertainty sharing idea is also generalized to provide a unified predictive outlook by relating the nature of optimal shrinkage in unrestricted parameter spaces to ideally diversified predictive schemes. Motivational stories and toy examples from the world of sports, stock markets and wind speed profiles will be used to illustrate the implications of our results.

25 Jan

Association Studies with Functional Phenotypes
120 Hanes Hall

In this talk I will discuss the application of functional data methods, FDA, to genome wide association studies with longitudinal response variables. Such data can be difficult to analyze due to the heterogeneity of the observations; subjects may evolve in intricate ways as they age or are administered various treatments. An FDA framework allows for very flexible models, while still exploiting the temporal structure of the data in powerful ways. However, such methods must be applied with care as subjects are often observed at a relatively small number of common time points, while most FDA methods are intended for high frequency data or sparse data whose pooled time points are dense in the time domain. After introducing the FDA perspective and some basic methodology, we will present an association test which differs from established FDA methods in that it does not directly depend on principal component analysis. We illustrate these ideas via simulations and by exploring data coming from the childhood asthma management program, CAMP.

23 Jan

Regularized Learning of High-dimensional Sparse Graphical Models
120 Hanes Hall

In this talk, I will talk about our recent efforts on exploring the large-scale networks of binary data and non-Gaussian data respectively. In the first part of this talk, I will present the nonconcave penalized composite conditional likelihood estimator for learning sparse Ising models. To handle the computational challenge, we design an efficient coordinate-minorization-ascent algorithm by taking advantage of coordinate-ascent and minorization maximization principles. Strong oracle optimality and explicit convergence rate of the computed local solution are established in the high dimensional setting. Our method is applied to study the HIV-1 protease structure, and we obtain scientifically sound discoveries.

In the second part, I will present a unified regularized rank estimation scheme for efficiently estimating the inverse correlation matrix of the Gaussian copula model, which is used to build a graphical model with the non-Gaussian data. The Gaussian copula graphical model is more robust than the Gaussian graphical model while still retaining the nice graphical interpretability of the latter. Proposed rank-based estimators achieve the optimal convergence rate as well as graphical model selection consistency, and behave like their oracle counterparts.

18 Jan

Classification Under Noise: Not-Completely Sparse Linear Discriminants
120 Hanes Hall

In high dimensional statistics, we often assume that the true coefficient vector is sparse. Unfortunately, in practical applications, this assumption hardly holds. In this talk, I will consider classification in the high dimensional setting where data are assumed to have multivariate Gaussian distributions with different means. Almost all of the literature on this framework analyzes the case where the Bayes classifier, given by the multiplication of difference of the means and the inverse of the covariance matrix, is sparse. However, additional noise to the observations can invalidate the sparsity assumption. This case is commonly observed in gene expression and fMRI studies where the spatial correlation structure often induces extra noise.

17 Jan

Quantum and Classical Annealing
130 Hanes Hall

I'll give a self-contained introduction to a simple version of the quantum adiabatic algorithm as a technique for solving optimization problems on a quantum computer. No previous knowledge of quantum mechanics will be required for this part of the talk, and the analysis of the algorithm will lead to an interesting question in equilibration time of Markov chains for certain classical Ising models. I'll then present some results on equilibration of those chains for these particular models.

16 Jan

Semiparametric Sparse Discriminant Analysis in High Dimensions
120 Hanes Hall

In recent years, a considerable amount of work has been devoted to generalizing linear discriminant analysis to overcome its incompetence for high-dimensional classification (Tibshirani et al. (2002), Fan & Fan (2008), Wu et al. (2009), Clemmensen et al. (2011), Cai & Liu (2011), Witten & Tibshirani (2011), Fan et al. (2012) and Mai et al. (2012)). These research efforts are rejuvenating discriminant analysis. However, the normality assumption, which rarely holds in real applications, is still required by all of these recent methods. We develop high-dimensional semiparametric sparse discriminant analysis (SeSDA) that generalizes the normality-based discriminant analysis by relaxing the Gaussian assumption. If the underlying Bayes rule is sparse, SeSDA can estimate the Bayes rule and select the true features simultaneously with overwhelming probability, as long as the logarithm of dimension grows slower than the cube root of sample size. At the core of the theory is a new exponential concentration bound for semiparametric Gaussian copulas, which is of independent interest. Further, the analysis of a malaria data set (Ockenhouse et al. (2006)) by SeSDA confirms the superior performance of SeSDA to normality-based methods in both classification and feature selection.

14 Jan

Exploring Dynamic Complex Systems Using Time-Varying Networks
120 Hanes Hall

Extracting knowledge and providing insights into the complex mechanisms underlying noisy high-dimensional data sets is of utmost importance in many scientific domains. Networks are an example of simple, yet powerful tools for capturing relationships among entities over time. For example, in social media, networks represent connections between different individuals and the type of interaction that two individuals have. In systems biology, networks can represent the complex regulatory circuitry that controls cell behavior. Unfortunately the relationships between entities are not always observable and need to be inferred from nodal measurements.

I will present a line of work that deals with the estimation of high-dimensional dynamic networks from limited amounts of data. The framework of probabilistic graphical models is used to develop semiparametric models that are flexible enough to capture the dynamics of network changes while, at the same time, are as interpretable as parametric models. In this framework, estimating the structure of the graphical model results in a deep understanding of the underlying network as it evolves over time. I will present a few computationally efficient estimation procedures tailored to different situations and provide statistical guarantees about the procedures. Finally, I will demonstrate how dynamic networks can be used to explore real world systems.

11 Jan

Testing of Large Covariance Matrices
120 Hanes Hall

This talk considers in the high-dimensional setting two inter-related problems: (a) testing the equality of two covariance matrices; (b) recovering the support of the difference of two covariance matrices. We propose a new test for testing the equality of two covariance matrices and investigate its theoretical and numerical properties. The limiting null distribution of the test statistic is derived and the power of the test is studied. The test is shown to enjoy certain optimality and to be especially powerful against sparse alternatives.

The simulation results show that the test significantly outperforms the existing methods both in terms of size and power. Analysis of a p53 dataset is carried out to demonstrate the application of the testing procedures. When the null hypothesis of equal covariance matrices is rejected, it is often of significant interest to further investigate how they differ from each other. Motivated by applications in genomics, we also consider recovering the support of the difference of two covariance matrices. New procedures are introduced and their properties are studied. Applications to gene selection are also discussed.

05 Dec

A multi-bin policy for inventory systems with differentiated demand classes
120 Hanes Hall

We consider an inventory system under continuous review serving two demand classes. The two demand classes are different in terms of the penalty cost incurred for backordering of demand. Current literature recommends a rationing policy based on critical inventory level. The inventory rationing policy, while working best for the higher priority class, does so at the expense of the service level of the lower priority class. In this research, we propose a new type of two-bin policy to provide differentiated service to the two demand classes. The proposed two-bin policy assigns separate bins of inventory for the two demand classes. However, when the bin intended for the higher priority class is empty, that demand can still be fulfilled with the inventory from the lower class’ bin. Results of our computational study show that the proposed policy is able to provide a much higher service level for the lower priority demand (while maintaining the high service level for the higher priority demand) without much increase in cost.

14 Nov

Large deviations for empirical measures arising in importance sampling
120 Hanes Hall

Consider applying Monte Carlo simulation to approximate a probability distribution in a certain region. Various properties of the distribution might be of interest, for instance, expectations, quantiles, or L-statistics. To improve the approximation it may be useful to apply importance sampling. The output of an importance sampling algorithm can be represented as a weighted empirical measure, where the weights are given by the likelihood ratio between the original distribution and the sampling distribution. In this talk we study efficiency of an importance sampling algorithm by means of large deviations for the weighted empirical measure. The main result, stated as a Laplace principle for the weighted empirical measures, can be viewed as a weighted version of Sanov's theorem. We show how the associated rate function can be used to quantify the performance of an algorithm. This is joint work with Henrik Hult.
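To make the weighted empirical measure concrete, here is a minimal sketch under assumptions that are not from the talk: the original distribution is standard normal, the sampling distribution is N(shift, 1), and the quantity of interest is the tail probability P(X > 3). Each sample carries the likelihood-ratio weight divided by n; the function names are hypothetical.

```python
import math
import random


def weighted_empirical_measure(n, shift, rng):
    """Draw n points from N(shift, 1) and attach weights L(x)/n.

    L(x) is the likelihood ratio between the original N(0, 1) density
    and the sampling N(shift, 1) density: exp(-shift*x + shift**2 / 2).
    """
    xs = [rng.gauss(shift, 1.0) for _ in range(n)]
    ws = [math.exp(-shift * x + 0.5 * shift ** 2) / n for x in xs]
    return xs, ws


def integrate(f, xs, ws):
    """Integrate f against the weighted empirical measure."""
    return sum(w * f(x) for x, w in zip(xs, ws))


if __name__ == "__main__":
    rng = random.Random(1)
    xs, ws = weighted_empirical_measure(100_000, shift=3.0, rng=rng)
    estimate = integrate(lambda x: 1.0 if x > 3.0 else 0.0, xs, ws)
    exact = 0.5 * math.erfc(3.0 / math.sqrt(2.0))
    print(estimate, exact)
```

The same pair (xs, ws) can be reused for expectations or quantiles, which is why the measure itself, rather than a single estimator, is the natural object for a large deviations analysis.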

05 Nov

Samurai Sudoku, Parallel Computing and Stochastic Optimization
120 Hanes Hall

Samurai Sudoku is a popular puzzle. The game-board consists of five overlapping Sudoku grids, for each of which several entries are provided and the remaining entries must be filled subject to no row, column and three-by-three subsquare containing duplicate numbers. By exploiting these three uniformity properties, we construct a new type of statistical design, called a Samurai Sudoku-based space-filling design. Such a design has an appealing slicing structure and is useful for a variety of statistical applications including meta-analysis, cross-validation and experiments with mixed factors. I will also discuss other designs with related slicing structures intended for efficient parallel computing and stochastic optimization.

22 Oct

Subsampling and Numerical Formal Concept Analysis

In this talk, we shall present two on-going projects. The first is subsampling for feature selection in large data. The popular approaches to feature selection have used penalties and shrinkage estimation. We advocate a subsampling approach. The idea is simple yet has worked wonderfully in many important situations. The second is on numerical formal concept analysis (nFCA) which borrows strength from both computer science and statistical techniques to provide a powerful machine learning technique. Some theory, methodology, computational tools as well as data applications will be provided. (Parts of the work are joint with J. Ma, Y. R. Fan and B. Hu)

11 Oct

Genesis of Gamma Bursts in Neural Local Field Potentials
130 Hanes Hall

Stochastic process modeling and analysis is beginning to play a role in the understanding of neural processing. This talk addresses local field potential (LFP) data from a recording electrode in the visual cortex of the brain. An LFP time series shows various oscillations including periods of "gamma burst" in a frequency band around 40 Hz. Study of the stochastic dynamics of a 2-dimensional stochastic differential equation model allows us to interpret gamma bursts in terms of excursions of an Ornstein-Uhlenbeck stochastic process.

08 Oct

Bayesian Hierarchical Multi-subject Multiscale Analysis of Functional MRI Data
120 Hanes Hall

We develop methodology for Bayesian hierarchical multi-subject multiscale analysis of functional Magnetic Resonance Imaging (fMRI) data. We begin by modeling the brain images temporally with a standard general linear model. After that, we transform the resulting estimated standardized regression coefficient maps through a discrete wavelet transformation to obtain a sparse representation in the wavelet space. Subsequently, we assign to the wavelet coefficients a prior that is a mixture of a point mass at zero and a Gaussian white noise. In this mixture prior for the wavelet coefficients, the mixture probabilities are related to the pattern of brain activity across different resolutions. To incorporate this information, we assume that the mixture probabilities for wavelet coefficients at the same location and level are common across subjects. Furthermore, we assign for the mixture probabilities a prior that depends on a few hyperparameters. We develop empirical Bayes methodology to estimate the hyperparameters and, as these hyperparameters are shared by all subjects, we obtain precise estimated values. Then we carry out inference in the wavelet space and obtain smoothed images of the regression coefficients by applying the inverse wavelet transform to the posterior means of the wavelet coefficients. An application to computer simulated synthetic data has shown that, when compared to single-subject analysis, our multi-subject methodology performs better in terms of mean squared error. Finally, we illustrate the utility and flexibility of our multi-subject methodology with an application to an event-related fMRI dataset generated by Postle (2005) through a multi-subject fMRI study of working memory related brain activation.

03 Oct

Optimization Problems and Algorithms for Tree Statistics
120 Hanes Hall

Medical imaging technology has made it possible to extract anatomical structures from images. Datasets of arteries and lungs are collected to study these anatomical structures. These structures have forms much like a graph theoretic tree. This talk focuses on the optimization problems associated with some of the basic elements required to build a statistical toolkit for analyzing samples of trees. The distance between two trees is an essential building block. Distance is useful for defining averages and variability. The first part of the talk focuses on a summary statistic called the Fréchet mean. The Fréchet mean of a sample of trees can be defined as a point that minimizes the squared distances to every tree in the sample summed together. The problem of calculating the Fréchet mean for phylogenetic trees is discussed. An algorithm and results from a simulation study are presented. The second part of the talk focuses on the distance between unlabeled trees. This problem is NP-complete. This distance is expressed naturally as the solution to a complicated nonlinear optimization problem. An integer programming reformulation is presented. This formulation makes it possible to use branch and bound or cutting planes to solve the problem.
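The definition of the Fréchet mean can be written down for any distance function. The sketch below is only an illustration of that definition, not the tree-space algorithm presented in the talk: it evaluates the sum of squared distances and picks the best point from a finite, user-supplied candidate set. The names and the toy one-dimensional example are hypothetical.

```python
def frechet_objective(point, sample, dist):
    """Sum of squared distances from `point` to every sample element."""
    return sum(dist(point, s) ** 2 for s in sample)


def frechet_mean_over_candidates(candidates, sample, dist):
    """Return the candidate that minimizes the Fréchet objective.

    This brute-force search only approximates the true Fréchet mean,
    which minimizes over the whole space rather than a finite set.
    """
    return min(candidates, key=lambda c: frechet_objective(c, sample, dist))


if __name__ == "__main__":
    # Toy check on the real line, where the Fréchet mean under the
    # absolute-value distance is the ordinary arithmetic mean.
    sample = [1.0, 2.0, 6.0]
    grid = [i * 0.5 for i in range(13)]  # 0.0, 0.5, ..., 6.0
    print(frechet_mean_over_candidates(grid, sample, lambda a, b: abs(a - b)))
```

Replacing the distance with a metric between trees changes nothing in the definition; the difficulty the talk addresses is carrying out the minimization efficiently in that setting.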

01 Oct

Optimal Sequential Exploration: Bandits, Clairvoyants, and Wildcats
120 Hanes Hall

This paper was motivated by the problem of developing an optimal policy for exploring an oil and gas field in the North Sea. Where should we drill first? Where do we drill next? This sequential exploration problem resembles a multiarmed bandit problem, but probabilistic dependence plays a key role: outcomes at drilled sites reveal information about neighboring targets. Good exploration policies will take advantage of this information as it is revealed. We develop heuristic policies for sequential exploration problems and complement these heuristics with upper bounds on the performance of an optimal policy. We begin by grouping the targets into clusters of manageable size. The heuristics are derived from a model that treats these clusters as independent. The upper bounds are given by assuming each cluster has perfect information about the results from all other clusters. The analysis relies heavily on results for bandit superprocesses, a generalization of the multiarmed bandit problem. We evaluate the heuristics and bounds using Monte Carlo simulation and, in the North Sea example, we find that the heuristic policies are nearly optimal. [Joint work with David Brown, Fuqua School of Business, Duke University.] paper: http://faculty.fuqua.duke.edu/~jes9/bio/Optimal_Sequ ntial_ExplorationBCW.pdf

17 Sep

Queueing Output Processes Revisited
120 Hanes Hall

The talk surveys work on departure processes from queueing systems as a context in which a range of problems for point processes can be viewed: transformations, permutations, 'anomalous' rate behaviour, regenerative systems, conservation laws, and heavy tail properties.

06 Sep

Probability Community Introductory Meeting
130 Hanes Hall

Postdocs from Duke and Chapel Hill will present 5 short talks on their research. The probability community will get to know the new postdocs in the area.

29 Aug

An Overview of the NCHS Surveys, Micro-data Files, and Research opportunities.
120 Hanes Hall

The National Center for Health Statistics (NCHS) conducts several household and establishment based health surveys. NCHS also collects vital statistics from each of the states in the U.S. This seminar will provide an overview of the NCHS data collection systems, data dissemination procedures, and availability of micro-data files to researchers for analyses. This presentation will also provide information on various opportunities for students to conduct research and/or work at the NCHS as fellows, interns, or full time employees. www.cdc.gov/nchs

25 Apr

Periodic Count Time Series via Stationary Renewal Processes
120 Hanes Hall

Discrete renewal processes are ubiquitous in stochastic phenomena. In this talk, constructing a discrete process where renewals are more (or less) likely during specified seasons is of specific interest. For example, thunderstorms in the Southern United States can take place at any time in the year, but are most likely during the summer. Hurricanes, tornadoes, and snowstorms are other meteorological count processes obeying periodic dynamics. Rare disease occurrences, accidental deaths, and animal sightings are non-meteorological examples of count phenomena following a periodic structure. In this talk a periodic version of classical discrete-time renewal sequences is developed. Given that a renewal occurs at a given time t, the time until the next renewal is allowed to depend on the season corresponding to time t. In this manner, one can build processes where renewals behave periodically. By superimposing or mixing versions of periodic renewal processes, one can construct models for periodic sequences of counts. The advantage of this method (over many time series of counts techniques like binomial thinning) is that negative autocorrelations between counts can be achieved. The methods are used to develop an autocorrelated periodic count model, fitting a stationary count model to a weekly rainfall data set that has binomial marginal distributions.
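
The construction in which the waiting time to the next renewal depends on the season of the current renewal can be simulated with a short Python sketch. The geometric waiting-time distribution, the function name, and the parameter values are assumptions made for illustration; the sketch also starts with a renewal at time 0, so it does not produce the stationary version discussed in the talk.

```python
import random

def simulate_periodic_renewals(horizon, period, success_prob_by_season, seed=0):
    """Renewal times in [0, horizon).

    Given a renewal at time t, the waiting time to the next renewal is
    geometric on {1, 2, ...} with success probability
    success_prob_by_season[t % period]; each probability must lie in (0, 1].
    """
    rng = random.Random(seed)
    times = [0]          # assumed: a renewal occurs at time 0
    t = 0
    while True:
        p = success_prob_by_season[t % period]
        wait = 1
        while rng.random() >= p:   # keep waiting until the first success
            wait += 1
        t += wait
        if t >= horizon:
            break
        times.append(t)
    return times

# Hypothetical weekly seasons: renewals in the second half of the year
# are followed by short gaps, renewals in the first half by long gaps.
probs = [0.05] * 26 + [0.4] * 26
times = simulate_periodic_renewals(horizon=52 * 20, period=52, success_prob_by_season=probs)
print(len(times), "renewals; first few:", times[:8])
```

Superimposing several independent copies of such a sequence and counting renewals at each time point would give a periodic count series in the spirit of the abstract.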

23 Apr

Challenges of Statistics of Large Roll Motions of a Ship in Waves
120 Hanes Hall

This presentation reviews some problems and solutions related to the statistics of large amplitude roll motions and capsizing of a ship in irregular (random) waves. Roll motion of a ship in irregular waves is described by a system of integro‐differential equations and can be characterized by significant nonlinearity. The probabilistic properties of roll response are quite complex; its statistical characteristics cannot be estimated based on a single record of practical length (this is known as “practical non‐ergodicity”). This process is non‐Gaussian and there is a significant dependence between the process and its first derivative. The reasonable estimation of the probability of large roll angles or capsizing represents a significant challenge. Advanced numerical simulations and/or model experiments are the only methods to obtain reliable information on roll motions. Large roll angles are too rare and the sample volume is too small to directly estimate the probability. Therefore, extrapolation is the only practical way to conduct the analysis. One of the most promising methods of extrapolation is the split‐time method. The split‐time method is capable of estimating the probability of large roll angles, including capsize. The idea is to separate a difficult problem into two related problems that are individually more manageable to solve. The first problem is an upcrossing of a specified intermediate level (typically associated with a physical threshold). This level may be also too high to get sufficient sample volume for direct counting of upcrossings. A Peak‐over‐Threshold method is applied to estimate the upcrossing rate. The second problem is in finding critical conditions at the instant of upcrossing that will lead to a given large roll angle or capsizing. This problem can be formulated as an estimation of a distribution of a dependent process at the instant of upcrossing. 
The derivation of a theoretical solution and a practical approximate method is considered in the presentation.

11 Apr

Linearly Constrained Lasso with Application in Glioblastoma Data
120 Hanes Hall

Knowledge about cancer accumulates continuously as cancer research advances. How to incorporate this knowledge appropriately in data analysis to obtain more meaningful results presents a challenge to the statistical community. In this talk, we concentrate on glioblastoma, one of the most common and aggressive brain cancers. The objective is to identify genes that are related to glioblastoma while incorporating information on genetic pathways. The problem is formulated as a linearly constrained lasso problem. In general we have a lasso-type problem with linear equality and inequality constraints. We develop a solution path algorithm to fit this model efficiently, and also work out some asymptotic properties to understand its advantages. The method is shown to be efficient and flexible in simulation studies and real data analysis.

09 Apr

Regularized Higher-Order Principal Components Analysis
120 Hanes Hall

High-dimensional tensors or multi-way data are becoming prevalent in areas such as biomedical imaging, chemometrics, networking and bibliometrics. Traditional approaches to finding lower dimensional representations of tensor data include flattening the data and applying matrix factorizations such as principal components analysis (PCA) or employing tensor decompositions such as the CANDECOMP / PARAFAC (CP) and Tucker decompositions. The former can lose important structure in the data, while the latter Higher-Order PCA (HOPCA) methods can be problematic in high-dimensions with many irrelevant features. I introduce frameworks for sparse tensor factorizations or Sparse HOPCA based on heuristic algorithmic approaches and by solving penalized optimization problems related to the CP decomposition. Extensions of these approaches lead to methods for general regularized tensor factorizations, multi-way Functional HOPCA and generalizations of HOPCA for structured data. I illustrate the utility of my methods for dimension reduction, feature selection, and signal recovery on simulated data and multi-dimensional microarrays and functional MRIs.

02 Apr

Approximating Performance for Service Systems with Time-Varying Arrivals
120 Hanes Hall

Unlike most textbook queueing models, real service systems (such as call centers and hospitals) typically have time-varying arrival rates, usually with significant variation over the day. As a result, the system can experience periods of overloading and underloading due to the fact (i) that the arrival process and service times have significant stochastic fluctuations, and (ii) that system managers are unwilling or unable to change the number of servers dynamically in real time because of constraints on the shifts or because frequent changes may be quite costly in most service systems. When the system is overloaded, customer delay increases, which directly causes customer dissatisfaction and the loss of revenue due to the abandonment of impatient customers; when the system is underloaded, excessive staffing can be reduced to save costs since idling servers are not needed. To better understand systems experiencing periods of overloading, we propose a deterministic fluid model and a stochastic (diffusion) refinement to approximate the dynamics of the stochastic G_t/GI/s_t + GI queueing model, which has time-varying arrivals (the G_t), time-varying staffing (the s_t), and non-exponential service (the first GI) and patience (the last GI) times. These fluid and diffusion approximations are based on the many-server heavy-traffic (MSHT) functional law of large numbers (FLLN) and functional central limit theorem (FCLT). These models are mathematically simple and tractable; they offer strong intuition into the real stochastic systems, and they provide accurate time-dependent approximating performance measures during a transient time period (for instance a 24-hour day or a 7-day week). Simulation experiments also verify the effectiveness of the approximation.

28 Mar

Joint Estimation of Multiple Graphical Models
120 Hanes Hall

Gaussian graphical models explore dependence relationships between random variables, through estimation of the corresponding inverse covariance matrices. In this paper we develop an estimator for such models appropriate for data from several graphical models that share the same variables and some of the dependence structure. In this setting, estimating a single graphical model would mask the underlying heterogeneity, while estimating separate models for each category does not take advantage of the common structure. We propose a method which jointly estimates the graphical models corresponding to the different categories present in the data, aiming to preserve the common structure, while allowing for differences between the categories. This is achieved through a hierarchical penalty that targets the removal of common zeros in the inverse covariance matrices across categories. We establish the asymptotic consistency and sparsity of the proposed estimator in the high-dimensional case, and illustrate its superior performance on a number of simulated networks. An application to learning semantic connections between terms from webpages collected from computer science departments is also included. This is joint work with Jian Guo, Elizaveta Levina, and George Michailidis.

19 Mar

Statistical Inference for Linear Models with Functional Responses
120 Hanes Hall

With the development of modern science and technology, functional responses are observed frequently in many scientific fields such as biology, meteorology, and ergonomics, among others. Consider statistical inference for functional linear models in which the response functions depend on a few time-independent covariates, but the covariate effects are functions of time. Of interest is a general linear hypothesis testing (GLHT) problem about the covariate effects. In this talk, an F-type test for this GLHT problem is introduced and its asymptotic power is derived. It is shown that the F-type test is root-n consistent. Applications of the F-type test in one-sample, two-sample and k-sample problems, and in variable selection and outlier detection in functional linear models are discussed. The F-type test is illustrated via applications to a real functional data set collected in ergonomics.

15 Mar

Clark-Ocone formula and Central Limit Theorem for the Brownian local time increments
130 Hanes Hall

The purpose of this talk is to discuss some applications of the Clark-Ocone representation formula. This formula provides an explicit expression for the stochastic integral representation of functionals of the Brownian motion in terms of the derivative in the sense of Malliavin calculus. We will compare this formula with the classical Ito formula and we will discuss its application to derive a central limit theorem for the modulus of continuity in the space variable of the Brownian local time increments.

12 Mar

Parametric Stability of Solutions in Models of Economic Equilibrium
120 Hanes Hall

Recent results about strong metric regularity of solution mappings are applied to a model of market equilibrium in the exchange of goods. The solution mapping goes from the initial endowments of the agents to the goods they end up with and the supporting prices, and the issue is whether, relative to a particular equilibrium, it has a single-valued, Lipschitz continuous localization. A positive answer is obtained when the chosen goods are not too distant from the endowments. A counterexample demonstrates that, when the distance is too great, such strong metric regularity can fail, with the equilibrium then being unstable with respect to tiny shifts in the endowment parameters, even bifurcating or, on the other hand, vanishing abruptly.

16 Feb

Exact Simulation Algorithms for Reflected Brownian Motion and Other Multidimensional Queueing Models.
130 Hanes Hall

Reflected Brownian Motion (RBM) is a multidimensional stochastic process that is defined in terms of a constrained map, known as the Skorokhod map. The map involves a multidimensional local-time like process that is implicitly defined in terms of the solution to a certain stochastic differential equation (SDE). RBM plays a central role in Operations Research (OR) as it arises as the diffusion limit of a large class of queueing systems. So, designing numerical methods for computing expectations of RBM is of great interest in OR. In this talk we explain how to construct unbiased estimators of expectations of multidimensional RBM (both transient and steady-state). Some of the basic ideas and techniques are actually useful even beyond RBM. For instance, we shall see how key ideas behind the RBM algorithms can be used to simulate the state description of an infinite server queue in steady state. Based on joint work with Xinyun Chen, Jing Dong, and Aya Wallwater.

03 Feb

Fast Iterative Methods for Structurally Constrained High Dimensional Problems
120 Hanes Hall

It is widely recognized that in order to deal with modern complex and high dimensional data sets, we need to exploit structure. But "structure" can mean different things in different contexts. Notions of structure for vectors and matrices include sparsity, low rank, and group sparsity. For graphs or networks, notions of structure include small world phenomenon, transitivity (the tendency of a friend of friend to become a friend), power laws, and community structure. In all these cases, the presence of structure acts as a constraint in high dimensional learning and estimation. In this talk, I will address three key issues. First, how does structure impact the statistical efficiency of learning methods? Second, can we also hope to exploit structure to design computationally efficient learning methods? Third, to what extent can we hope to generalize our techniques meant for one structure, say sparsity, to a wide variety of structures? The focus of the talk will be on iterative methods like coordinate descent and online mirror descent. In such methods, each iteration is typically quite fast unlike, say, interior point methods and hence we can expect them to scale to millions of dimensions. (Talk is based on several papers written jointly with many co-authors to whom I am indebted for providing ideas and inspiration.)

25 Jan

Some applications of shape-restricted estimation and bootstrap techniques to semiparametric and nonparametric regression models.
120 Hanes Hall

We will start by considering some nonparametric regression models from the point of view of shape restricted statistical inference. The first problem to be tackled will be that of multivariate convex regression. We will define the (nonparametric) least squares estimator of a multivariate convex regression function, describe its finite-sample properties and state its asymptotic consistency theorem. The behavior of the estimator under model uncertainty will also be discussed. In addition, we will show how similar techniques can be applied to regression models combining convexity with component-wise monotonicity restrictions. We will then state some open problems and ongoing research projects in this area. In the second part of the talk we will be concerned with applications of the bootstrap to semiparametric models. We will focus on two cases: change-point regression and Manski’s maximum score estimator. We will exhibit the inconsistency of the classical bootstrap in these scenarios and provide model-based bootstrap procedures that produce asymptotically valid confidence intervals. We will finish with a discussion of some open questions.

23 Jan

To Adapt or Not To Adapt: The Power and Limits of Adaptivity for Sparse Estimation
120 Hanes Hall

In recent years, the fields of signal processing, statistical inference, and machine learning have come under mounting pressure to accommodate massive amounts of increasingly high-dimensional data. Despite extraordinary advances in computational power, the data produced in application areas such as imaging, remote surveillance, meteorology, genomics, and large scale network analysis continues to pose a number of challenges. Fortunately, in many cases these high-dimensional signals contain relatively little information compared to their ambient dimensionality. For example, signals can often be well-approximated as sparse in a known basis, as a matrix having low rank, or using a low-dimensional manifold or parametric model. Exploiting this structure is critical to any effort to extract information from such data. In this talk I will overview some of my recent research on how to exploit such models to recover high-dimensional signals from as few observations as possible. Specifically, I will primarily focus on the problem of estimating a sparse vector from a small number of noisy measurements. To begin, I will consider the case where the measurements are acquired in a nonadaptive fashion. I will establish a lower bound on the minimax mean-squared error of the recovered vector which very nearly matches the performance of ℓ1-minimization techniques, and hence shows that these techniques are essentially optimal. I will then consider the case where the measurements are acquired sequentially in an adaptive manner. I will prove a lower bound that shows that, surprisingly, adaptivity does not allow for substantial improvement over standard nonadaptive techniques in terms of the minimax MSE. Nonetheless, I will also show that there are important regimes where the benefits of adaptivity are clear and overwhelming.

20 Jan

Application of high-dimensional regression to communication
120 Hanes Hall

A sparse linear model with specific coefficient structure provides a framework for a problem in communication. Apart from the maximum-likelihood estimator, we also provide theoretical analysis of estimates obtained from an iterative algorithm that is similar in spirit to forward stepwise regression. We show that the algorithm has near optimal performance when compared to information-theoretic limits. This provides theoretically provable, low computational complexity communication systems based on our statistical framework. In another direction, it contributes to the understanding of thresholds for variable selection, as a function of the quantities sparsity, sample size, dimension, and signal-to-noise ratio, in high-dimensional regression.

19 Jan

Large Deviations, Metastability, Monte Carlo Methods for Multiscale Problems, & Applications
130 Hanes Hall

We discuss large deviations, metastability and Monte Carlo methods for multiscale dynamical systems that are stochastically perturbed by small noise. Depending on the type of interaction of the fast scales with the strength of the noise we get different behavior, both for the large deviations and for the corresponding Monte Carlo methods. Using stochastic control arguments we identify the large deviations principle for each regime of interaction. The large deviations principle can then be used to study metastability for such problems, as well as asymptotic problems for related PDEs. Furthermore, we derive a control (equivalently a change of measure) that allows us to design asymptotically efficient importance sampling schemes for the estimation of associated rare event probabilities and expectations of functionals of interest. Standard Monte Carlo methods perform poorly in these kinds of problems in the small noise limit. In the presence of multiple scales one faces additional difficulties and straightforward adaptation of importance sampling schemes for standard small noise diffusions will not produce efficient schemes. We resolve this issue and demonstrate the theoretical results by examples and simulation studies.

18 Jan

Statistical inference for non-parametric models in high dimensions: Computational and statistical aspects
120 Hanes Hall

High-dimensional statistical inference problems, where the number of features exceeds the number of samples, have recently received a significant amount of attention and research. Most of the past research on the theory and methodology for high-dimensional inference problems has involved assuming the data follows a parametric model. However, for many applications, the assumption that the response has a known parametric form may be too restrictive. On the other hand, non-parametric models are known to suffer severely from the curse of dimensionality, meaning there are a number of computational and statistical issues associated with their implementation for high-dimensional problems. In this talk, I present two problems that address these statistical and computational issues. Firstly, I provide analysis of sparse additive models, a non-parametric generalization of sparse linear models. In particular, I present a polynomial-time algorithm for estimating sparse additive models that achieves minimax optimal rates in mean-squared error. The second problem provides analysis of the early stopping strategy for gradient descent applied to non-parametric regression models. My analysis yields an optimal data-dependent stopping rule that has minimax optimal statistical mean-squared error rate. Both projects I present are based on joint work with Professors Martin Wainwright and Bin Yu.

13 Jan

Singular Value Decomposition for High-Dimensional Data
120 Hanes Hall

Singular value decomposition (SVD) is a widely used tool for dimension reduction in multivariate analysis. However, when used for statistical estimation in high-dimensional low rank matrix models, singular vectors of the noise-corrupted matrix are inconsistent for their counterparts of the true mean matrix. In this talk, we suppose the true singular vectors have sparse representations in a certain basis. We propose an iterative thresholding algorithm that can estimate the subspaces spanned by leading left and right singular vectors and also the true mean matrix optimally under a Gaussian assumption. We further turn the algorithm into a practical methodology that is fast, data-driven and robust to heavy-tailed noise. Simulations and a real data example further show its competitive performance. This is joint work with Andreas Buja and Zongming Ma.

09 Jan

A Double Backward SDE approach for KPZ equation.
130 Hanes Hall

We propose a different way of defining the solution of KPZ. We show that this solution is consistent with the schemes already known. Our method is based on a Doubly Backward Stochastic Differential Equation driven by infinite dimensional noises. We motivate the introduction of DBSDE by posing a stochastic version of the viscosity solution for the heat equation.

28 Nov

Adaptive Bayesian Multivariate Density Estimation with Dirichlet Mixtures
120 Hanes Hall

The kernel method has been an extremely important component in the nonparametric toolbox. It has undergone tremendous development since its introduction over fifty years ago. Bayesian methods for density estimation using kernel-smoothed priors were first introduced in the mid-eighties, where a random probability measure, typically following a Dirichlet process, is convolved with a kernel to induce a prior on smooth densities. The resulting prior distribution is commonly known as a Dirichlet mixture process. Such priors gained popularity in the Bayesian nonparametric literature after the development of Markov chain Monte Carlo methods for posterior computation. Posterior consistency of a Dirichlet mixture prior with a normal kernel was established by Ghosal et al. (1999, Annals of Statistics). Subsequent papers relaxed conditions for consistency, generalized to other kernels and studied rates of convergence, especially in the univariate case. More recently, it has been found that Bayesian kernel mixtures of finitely supported random distributions have an automatic rate adaptation property, something a classical kernel estimator lacks. We consider Bayesian multivariate density estimation using a Dirichlet mixture of normal kernels as the prior distribution. By representing a Dirichlet process as a stick-breaking process, we are able to extend convergence results beyond finitely supported mixture priors to Dirichlet mixtures. Thus our results have new implications in the univariate situation as well. Assuming that the true density satisfies Hölder smoothness and exponential tail conditions, we show that the rates of posterior convergence are minimax-optimal up to a logarithmic factor. This procedure is fully adaptive since the priors are constructed without using knowledge of the smoothness level. This is joint work with Weining Shen, a graduate student in the Department of Statistics at North Carolina State University.
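
The stick-breaking representation of a Dirichlet process mentioned above can be sketched in a few lines of Python. This is the standard construction truncated after a fixed number of atoms, not the speaker's code; the truncation level, the concentration parameter value, and the function name are assumptions chosen for illustration.

```python
import random

def stick_breaking_weights(alpha, n_atoms, seed=0):
    """First `n_atoms` stick-breaking weights of a Dirichlet process with
    concentration `alpha`, plus the stick length left over after truncation."""
    rng = random.Random(seed)
    weights = []
    remaining = 1.0                      # length of the stick not yet broken off
    for _ in range(n_atoms):
        v = rng.betavariate(1.0, alpha)  # fraction of the remaining stick to break off
        weights.append(remaining * v)
        remaining *= 1.0 - v
    return weights, remaining

weights, leftover = stick_breaking_weights(alpha=2.0, n_atoms=20)
print([round(w, 3) for w in weights[:5]], "leftover mass:", leftover)
```

Pairing each weight with an atom drawn independently from the base measure, and then mixing a normal kernel over those atoms, would give a truncated draw from a Dirichlet mixture prior of the kind described in the abstract.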

18 Nov

Measuring Forecast Accuracy under Incomplete Data
120 Hanes Hall

We focus on a crucial success element in Collaborative Inventory Management (CIM) processes: accuracy of information shared between parties. One of the core concepts in CIM is the management of inventory levels according to the forecasted demand and supply of both parties. The success of inventory management critically depends on the accuracy of the forecasts issued by each side to the other. The task of measuring this accuracy can get complicated when some of the necessary data is missing. In this work we investigate a CIM model where the issued forecasts and inventory levels (supply) are available, but the actual demand information is missing. To estimate the demand we define a novel version of the Switching Model where backlogging demand from one period to the next is allowed. We propose and compare methods to solve this model. BIO: Burcu Aydin completed her PhD degree at the Statistics and Operations Research Department of UNC-Chapel Hill in 2009. Her PhD thesis focused on object oriented data analysis for non-Euclidean spaces. Specifically, she worked on developing principal component analysis tools for tree shaped objects with advisors Gabor Pataki and J.S. Marron. Since graduation, she has been working at Hewlett Packard Laboratories as a Research Scientist. She is part of the procurement analytics group where she focuses on research subjects such as collaborative inventory management, forecast analytics, and procurement award allocation.

16 Nov

Approximate Bayesian Computing for Spatial Extremes
120 Hanes Hall

Statistical analysis of max-stable processes used to model spatial extremes has been limited by the difficulty in calculating the joint likelihood function. This precludes all standard likelihood-based approaches, including Bayesian approaches. In this paper we present a Bayesian approach through the use of approximate Bayesian computing. This circumvents the need for a joint likelihood function by instead relying on simulations from the (unavailable) likelihood. This method is compared with an alternative approach based on the composite likelihood. We demonstrate that approximate Bayesian computing often results in a lower mean square error than the composite likelihood approach when estimating the spatial dependence of extremes, though at an appreciably higher computational cost. We also illustrate the performance of the method with an application to US temperature data to estimate the risk of crop loss due to an unlikely freeze event. This is joint work with Richard Smith.
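
The idea of replacing likelihood evaluation with simulation can be illustrated by a basic rejection sampler for approximate Bayesian computing. The sketch below is a generic textbook version applied to a toy Gaussian model, not the max-stable setting or the algorithm of the talk; the prior, simulator, summary statistic, and tolerance are all hypothetical choices.

```python
import random

def abc_rejection(observed_summary, prior_sampler, simulator, summary, tol, n_draws, seed=0):
    """Keep prior draws whose simulated summary lands within `tol`
    of the observed summary; no likelihood is ever evaluated."""
    rng = random.Random(seed)
    accepted = []
    for _ in range(n_draws):
        theta = prior_sampler(rng)               # draw a parameter from the prior
        s = summary(simulator(theta, rng))       # simulate data and summarize it
        if abs(s - observed_summary) <= tol:
            accepted.append(theta)
    return accepted

# Toy model (assumed): data are 50 draws from N(theta, 1), summarized by their mean.
def prior_sampler(rng):
    return rng.uniform(0.0, 10.0)

def simulator(theta, rng):
    return [rng.gauss(theta, 1.0) for _ in range(50)]

def summary(data):
    return sum(data) / len(data)

accepted = abc_rejection(3.0, prior_sampler, simulator, summary, tol=0.2, n_draws=2000)
print(len(accepted), "accepted draws")
```

The accepted draws approximate the posterior; the need to simulate a full data set for every prior draw is one reason such methods carry a high computational cost.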

09 Nov

A Tandem Queueing Model of Direct & Indirect Waiting in an Appointment System
120 Hanes Hall

Jianzhe Luo (The University of North Carolina at Chapel Hill): We develop a queueing model for an appointment system which consists of an appointment queue and a service queue. We propose a Decoupled-Two-Queue model in which the appointment delay customers encounter in the appointment queue affects their show-up probability at the service station. We obtain some important performance measures of interest and solve the optimization problem that aims to minimize the server's long-run average idle time while keeping customer long-run average waiting times in both queues below given levels. This is joint work with Vidyadhar Kulkarni and Serhan Ziya.

Alex Mills, UNC Chapel Hill

Mass-Casualty Triage Strategies

Alex Mills (The University of North Carolina at Chapel Hill): A mass-casualty incident occurs when the number of injured patients overwhelms the resources available to take care of them. In such incidents, the process of sorting and prioritizing patients, known as triage, is especially important. The most widely used standard for mass-casualty triage, START, relies on a fixed priority ordering among the different classes of patients, and does not explicitly consider resource limitations or the patients' probabilities of survival. We construct a fluid model of patient triage in a mass-casualty incident that incorporates these factors and characterize its optimal policy. We use this characterization to obtain useful insights about the type of simple policies that have a good chance to perform well in practice, and we demonstrate how one could develop such a policy. Using a realistic simulation model and data from emergency medicine literature, we show that the policy we developed based on our fluid formulation outperforms START in all scenarios considered, sometimes substantially. This is joint work with Nilay Tanik Argon and Serhan Ziya.

07 Nov

Sparse PCA Asymptotics & Analysis of Tree Data
120 Hanes Hall

A general asymptotic framework is developed for studying consistency properties of principal component analysis (PCA). Our framework includes several previously studied domains of asymptotics as special cases and allows one to investigate interesting connections and transitions among the various domains. We are really excited about the unification power and additional theoretical insights offered by our general framework for PCA. After seeing the benefit of Sparse PCA when the true model is indeed sparse, we are intrigued to develop a similar general framework for Sparse PCA. In addition to the sample size, the dimension, and the spike information, a fourth factor now also plays an important role: the degree of sparsity. The second part of my topic is about developing statistical methods for analyzing tree-structured data objects. This work is motivated by the statistical challenges of analyzing a set of blood artery trees, which is from a study of Magnetic Resonance Angiography (MRA) brain images of a set of 98 human subjects. The non-Euclidean property of tree space makes the application of conventional statistical analysis, including PCA, to tree data very challenging. We develop an entirely new approach that uses the Dyck path representation, a tool for asymptotic analysis of point processes. This builds a bridge between the tree space (a non-Euclidean space) and curve space (standard Euclidean space). That bridge enables the exploitation of the power of functional data analysis to explore statistical properties of tree data sets. This is joint work with Dr. J.S. Marron and Dr. Haipeng Shen.
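
How a tree becomes a curve can be seen in a small Python sketch of a Dyck path (contour walk). It assumes a rooted ordered tree with unit edge lengths, represented as nested lists in which each node is the list of its children; the artery trees in the talk may carry additional information such as branch lengths, which this hypothetical encoding ignores.

```python
def dyck_path(tree):
    """Heights visited by a depth-first walk around a rooted ordered tree.

    A node is the list of its children; a leaf is the empty list [].
    The walk starts and ends at height 0 (the root).
    """
    path = [0]

    def walk(node, depth):
        for child in node:
            path.append(depth + 1)   # step down an edge to the child
            walk(child, depth + 1)
            path.append(depth)       # step back up the same edge

    walk(tree, 0)
    return path

# Root with two children; the first child has two leaf children.
tree = [[[], []], []]
print(dyck_path(tree))
```

Each edge is traversed once downward and once upward, so a tree with m edges gives a path of 2m + 1 heights; treating these height sequences as curves is what lets functional data analysis tools be applied.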

31 Oct

A Model for Extremes on a Regular Spatial Lattice
120 Hanes Hall

We propose a model which can be used to characterize extremes on a regular spatial lattice. Analogous to lattice models from spatial statistics, the proposed model creates an overall model composed of many smaller models. The spatial domain is covered by a number of small and overlapping subregions, and data on these subregions are modeled with parametric multivariate extreme value models of low dimension. We show that these subregion models can be combined in such a way that the angular measure of the overall model meets the requirements of an extreme value distribution. The model is designed specifically for threshold exceedance data and is used to model data, such as extreme precipitation, that at any given time are likely to be extreme only in limited areas of the study region. The talk will begin with an overview of spatial extremes and applications.

27 Oct

Convergence in Density of Multiple Integrals and Applications
130 Hanes Hall

In this talk, we shall give conditions under which densities of suitably centered and normalized multiple Wiener-Itô integrals exist and converge to the density of a standard normal random variable. The tool that we use is the Malliavin calculus.

24 Oct

Stochastic properties of optical black holes
120 Hanes Hall

An optical black hole, or rather an optical vortex, is a point in an optical field, e.g. in a laser field, where the light intensity is exactly zero. "Light can be twisted like a corkscrew around its axis of travel. Because of the twisting, the light waves at the axis itself cancel each other out. When projected onto a flat surface, an optical vortex looks like a ring of light, with a dark hole in the center. This corkscrew of light, with darkness at the center, is called an optical vortex." (Wikipedia) When the light falls on a plane it can be modelled as a complex Gaussian field with independent zero mean real and imaginary parts. The black points are the points where the zero level-curves of the real and imaginary parts cross each other; there the light intensity is exactly zero. By using a multivariate form of the Rice formula for the expected number of level crossings in a stochastic process one can build a "Slepian model" for the conditional distribution of the light field near the black points. It is "well known" that near a vortex the curves of constant light intensity are ellipses with random orientation and random eccentricity. This can be motivated by a Taylor expansion of the fields, and is also experimentally verified (with observational error). The new Slepian model modifies this result and shows that the intensity level curves are not exactly elliptical, but are slightly expanded at the extreme ends. It also gives an exact description of the stochastic variability of the ellipses.

10 Oct

Probabilistic hashing for similarity searching and statistical learning on massive high-dimensional data
120 Hanes Hall

This talk will present our most recent work on probabilistic hashing algorithms for two important tasks on massive high-dimensional (e.g., 2^64) data: (A) highly efficient similarity search using small storage; and (B) highly efficient statistical learning such as logistic regression and support vector machines (SVM). Efficient (approximate) computation of set similarity in very large datasets is a common task with many applications in information retrieval and data management. One common approach for this task is the minwise hashing algorithm. This talk presents b-bit minwise hashing, which can provide order-of-magnitude improvements in storage requirements and computational overhead over the original scheme in practice. We give both theoretical characterizations of the performance of the new algorithm as well as a practical evaluation on large real-life datasets and show that these match very closely. Our technique yields a very simple algorithm and can be realized with only minor modifications to the original minwise hashing scheme. Most recently, we discovered that (b-bit) minwise hashing can be seamlessly integrated with statistical learning algorithms such as logistic regression and SVM. Learning on massive data faces numerous challenges: (1) data may not fit in memory; (2) data loading (and transmission over network) may dominate the cost; (3) training may be time consuming; (4) testing may be too slow to meet the demand in some scenarios (e.g., search engines); (5) the model itself may be too large to store (e.g., logistic regression with 2^64 coefficients); (6) exploring high-order (e.g., pairwise, 3-way, etc.) interactions becomes extremely difficult. Our method of b-bit minwise hashing provides an extremely simple and efficient solution. This talk will describe fitting logistic regression and SVM on two text datasets: a small dataset of 24 GB (in 16 million dimensions) and a larger dataset of 200 GB (in 1 billion dimensions). 
Using our method on a single desktop, fitting (regularized) SVM takes only about 3 seconds on the small dataset and about 30 seconds on the larger dataset. Our technique is purely statistical/probabilistic and is orthogonal to the underlying optimization procedure. Over the past decade or so there has been considerable interest in random projections for dimension reduction and compressed sensing. However, this talk will demonstrate that b-bit minwise hashing can be substantially more accurate than random projection and its variants. References: [1] Li and Konig: Theory and Applications of b-bit Minwise Hashing, Research Highlights in Communications of the ACM, August 2011. http://www.stat.cornell.edu/~li/CACM_hashing.pdf [2] Li, Konig, and Gui, b-bit minwise hashing for estimating three-way similarities, NIPS 2010. http://books.nips.cc/papers/files/nips23/NIPS2010_1143.pdf [3] Li, Shrivastava, Moore, and Konig, Hashing algorithms for large-scale learning, NIPS 2011. See a more recent report at http://www.stat.cornell.edu/~li/reports/HashLearningMoreExp.pdf
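
The core idea of b-bit minwise hashing can be sketched in a few lines. This is an illustrative sketch, not the authors' implementation. It assumes that a keyed 64-bit hash stands in for a random permutation, and it uses the sparse-data approximation in which two unrelated b-bit values agree with probability 2^-b. All function names are hypothetical.

```python
import hashlib


def _hash64(seed, item):
    """Deterministic 64-bit hash of (seed, item); stands in for a random permutation."""
    digest = hashlib.blake2b(f"{seed}:{item}".encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def b_bit_signature(items, k, b):
    """For each of k hash functions, keep only the lowest b bits of the minimum hash."""
    mask = (1 << b) - 1
    return [min(_hash64(seed, x) for x in items) & mask for seed in range(k)]


def estimate_resemblance(sig1, sig2, b):
    """Estimate Jaccard resemblance from two b-bit signatures (sparse-data approximation)."""
    match_rate = sum(u == v for u, v in zip(sig1, sig2)) / len(sig1)
    chance = 1.0 / (1 << b)  # probability that b unrelated bits agree by accident
    return (match_rate - chance) / (1.0 - chance)


set_a = set(range(0, 150))
set_b = set(range(50, 200))  # true resemblance = 100 / 200 = 0.5
k, b = 500, 2
estimate = estimate_resemblance(b_bit_signature(set_a, k, b), b_bit_signature(set_b, k, b), b)
print(round(estimate, 3))
```

Each signature entry takes b bits instead of 64, which is where the storage saving comes from. The price is a chance-agreement term that the estimator subtracts out, so more hash functions are needed for the same accuracy.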

06 Oct

On the conditional ergodic theory of Markov processes
130 Hanes Hall

Consider a bivariate ergodic Markov process. One component is observed and the other component is hidden. What can we say about the ergodic properties of the hidden process conditionally on the observed process? Such questions are of direct relevance to the performance of nonlinear filtering algorithms that are widely used by engineers. Mathematically, the question hinges on an insidious measure-theoretic problem that appears in many probabilistic settings and remains far from well understood. When the underlying Markov process is ergodic in total variation, the conditional ergodic properties are inherited under mild assumptions (this resolves a long-standing problem in this area). Under weaker notions of ergodicity, however, new phenomena can appear. Things get even worse when one considers random fields rather than Markov processes. I will discuss these results, and describe some ongoing work aimed at understanding conditional ergodic properties in high dimensional settings.

03 Oct

Spectral Approximation of Infinite Dimensional Black-Scholes Equation with Memory
120 Hanes Hall

In this talk, we consider the pricing of a European option in a market in which the stock price and the asset in the riskless bank account both have hereditary price structures. An infinite dimensional Black-Scholes equation with memory will be derived. Under a smoothness assumption on the payoff function, it is shown that the infinite dimensional Black-Scholes equation possesses a unique classical solution. A spectral approximation scheme is developed using the Fourier series expansion in a space of continuous functions for the Black-Scholes equation. It is also shown that the nth approximant resembles the classical Black-Scholes equation in finite dimensions.

29 Sep

An application of the Wiener chaos expansion and Malliavin calculus to the numerical error estimates for a stochastic finite element method.
130 Hanes Hall

We consider the numerical solution of a class of elliptic and parabolic SPDEs driven by a Gaussian white noise using a stochastic finite element method. To derive a priori error estimates and quantify the optimal rate of convergence of the numerical method, we draw on ideas from deterministic finite element theory, where the extension of numerical error estimates from elliptic PDEs to parabolic PDEs is a standard technique. By expressing the stochastic perturbation in terms of the Malliavin divergence operator, we are able to use tools from the Malliavin calculus to derive the error estimates for the parabolic SPDE, closely mirroring the techniques from the deterministic finite element theory. In particular, the analysis employs a formal stochastic adjoint problem arising from the adjoint relationship between the Malliavin derivative and the Malliavin divergence operator, to obtain the optimal order of spatial convergence. Some numerical simulations will be shown.

The National Security Agency's (NSA) Office of Operations Research, Modeling and Simulation applies scientific and quantitative methods to help NSA and other Intelligence Community organizations better understand and find solutions to their operational problems. The Summer Program for Operations Research Technology (SPORT) offers students the opportunity to apply their academic knowledge in a stimulating professional environment. As a SPORT intern you will work closely with full-time analysts applying the technical skills you've learned in college to challenging real-world problems of your choice. Learn more about SPORT as we show a brief video on NSA followed by a discussion of program details.

22 Sep

On diffusions interacting through their ranks
130 Hanes Hall

We will discuss systems of diffusion processes on the real line, in which the dynamics of every single process is determined by its rank in the entire particle system. Such systems arise in mathematical finance and statistical physics, and are related to heavy-traffic approximations of queueing networks. Motivated by the applications, we address questions about invariant distributions, convergence to equilibrium and concentration of measure for certain statistics, as well as hydrodynamic limits and large deviations for these particle systems. Parts of the talk are joint work with Amir Dembo, Tomoyuki Ichiba, Soumik Pal and Ofer Zeitouni.

19 Sep

Simultaneous supervised clustering and feature selection
120 Hanes Hall

In network analysis, genes are known to work in groups according to their biological functionality, where distinct groups reveal different gene functionalities. In such a situation, identifying grouping structures as well as informative genes becomes critical in understanding the progression of a disease. Motivated by gene network analysis, we investigate, in a regression context, simultaneous supervised clustering and feature selection over an arbitrary undirected graph, where each predictor corresponds to one node in the graph and the existence of a connecting path between two nodes indicates possible grouping between the two predictors. In this talk, I will review recent developments and discuss computational methods for simultaneous supervised clustering and feature selection over a graph. Numerical examples will be given, in addition to some theoretical aspects of supervised clustering and feature selection. This is joint work with Hsin-Cheng Huang and Wei Pan.

12 Sep

Active Sequential Decision-Making in an Uncertain World: Fundamental Limits and Optimal Strategies
120 Hanes Hall

Many problems at the forefront of statistics, machine learning, and network science involve adaptively collecting observations to make a decision about the unobserved state of some complex "black box" system. The increasing complexity and scale of such problems raise two questions: (1) What is the minimum number of observations needed to make a good decision? (2) Can we develop an efficient strategy that would reach this lower bound? My recent work addresses both of these questions for a broad class of problems that can be cast as a "game of Twenty Questions with a Black Box." I will present a unifying framework for analyzing active sequential decision-making that combines techniques from feedback information theory, approximation theory, and optimization. This framework yields important new insights into the dynamics of adaptive decision strategies. For instance, it reveals a law of diminishing returns for the information gain of each subsequent observation, while also suggesting possible ways of mitigating this effect. Potential applications include designing better statistical experiments for more "informative" data acquisition in applications where economical use of costly observations is important. Brief Biography: Maxim Raginsky received the B.S. and the M.S. degrees in Electrical Engineering in 2000 and the Ph.D. degree in Electrical Engineering in 2002 from Northwestern University. His doctoral research was in the area of quantum information theory. From 2002 to 2004 he was a postdoctoral scholar with the Center for Photonic Communication and Computing at Northwestern. Between 2004 and 2007, he was a Beckman Foundation Fellow at the University of Illinois, Urbana-Champaign, working on problems related to information-theoretic limits of learning systems and to computational neuroscience. He has been with Duke University since 2007, where he is now an Assistant Research Professor of Electrical and Computer Engineering. 
He is interested in theoretical and practical aspects of information processing and decision-making in uncertain environments under resource and complexity constraints. His research combines techniques from machine learning, optimization and control, and information theory.

25 Apr

A Cluster Identification Framework Illustrated by a Filtering Model for Earthquake Occurrences
120 Hanes Hall

A general dynamical cluster identification framework including both modeling and computation is developed. The earthquake declustering problem is studied to demonstrate how this framework applies. A stochastic model is proposed for earthquake occurrences that considers the sequence of occurrences as composed of two parts: earthquake clusters and single earthquakes. We suggest that earthquake clusters contain a "mother quake" and her "offspring". Applying the filtering techniques, we use the solution of filtering equations as criteria for declustering. A procedure for calculating MLEs and the most likely cluster sequence is also presented.

11 Apr

Diagnostic Accuracy Under Congestion

In diagnostic services, agents typically need to weigh the benefit of running an additional test and improving the accuracy of diagnosis against the cost of congestion, i.e., delaying the provision of services to others. Our paper analyzes how to dynamically manage this accuracy/congestion trade-off. To that end, we study an elementary congested system facing an arriving stream of customers. The diagnostic process consists of a search problem in which the agent providing the service conducts a sequence of imperfect tests to determine whether a customer is of a given type. We find that the agent should continue to perform the diagnosis as long as her current belief that the customer is of the searched-for type falls in a congestion-dependent interval. As congestion intensifies, the search interval should shrink. Due to congestion effects, the agent should sometimes diagnose the customer as being of a given type, even when all performed tests are indicating otherwise. The optimal structure also implies that, when false negatives are negligible, the agent should first let the maximum number of customers allowed in the system increase with the number of performed tests. Finally, we show numerically that improving the validity of tests can sometimes decrease accuracy while faster tests can increase congestion.

04 Apr

Approximating the Nondominated Frontier for Multi-criteria Problems
120 Hanes Hall

It is difficult to find the set of nondominated solutions in many multi-criteria problems. In multi-criteria combinatorial optimization problems, even finding several nondominated points may be hard. In such problems, an approximate representation of the set of nondominated points may be useful. We develop an approach that approximates the set of nondominated points by fitting a simple hypersurface. The hypersurface passes through a small number of available nondominated solutions. We demonstrate its performance on several combinatorial optimization and continuous solution-space problems.

30 Mar

Sparse Regression with Incentive
120 Hanes Hall

Sparse regularization methods for high dimensional regression have received much attention recently as an alternative to subset selection methods. Examples are lasso (Tibshirani 1996), bridge regression (1993), and scad (2001), to name just a few. An advantage of sparse regularization methods is that they give a stable estimator with automatic variable selection, and hence the resulting estimator performs well in prediction. Also, sparse regularization methods have many desirable properties when the true model is sparse. However, there are several disadvantages of sparse regularization methods as discussed by Zou and Hastie (2005). First, when p > n, the solution has at most n nonzero coefficients. Second, if there is a group of covariates whose correlations are very high, the solution usually takes one covariate from the group and does not care which one is selected. Third, empirically, when there are high correlations between covariates, the prediction performance of sparse solutions is dominated by that of non-sparse solutions such as ridge regression (Tibshirani 1996), where all of the highly correlated covariates are used. Hence, there is a need for solutions that are less sparse than those of the aforementioned sparse regularization methods. To make a less sparse solution, Zou and Hastie (2005) proposed the elastic net penalty, which is a linear combination of the lasso and ridge penalties. Friedman and Popescu (2004) proposed the gradient directed regularization, which is a modified gradient descent method where more than one predictor variable is updated at each iteration. There are limitations in the elastic net and modified gradient directed regularization. The elastic net requires the rescaling of the solution to avoid overshrinkage. Even though Zou and Hastie (2005) proposed the rescaling factor heuristically for linear regression, it is not clear how to rescale the solution for other problems such as logistic regression. 
The estimator of the gradient directed regularization is not defined by a minimizer of a penalized empirical risk and hence it is difficult to study properties of the estimator. In this talk, we propose a new regularization method called the sparse regression with incentive (SRI) to overcome deficiencies of the elastic net and gradient directed regularization. The advantages of the SRI over the elastic net and gradient directed regularization are that the estimator is defined by a minimizer of a penalized empirical risk and that it does not require an ad hoc post-rescaling as the elastic net does. Also, we can incorporate prior information on the group structure of covariates.
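
For reference, the elastic net criterion and its rescaling step can be written as follows. This is the form given by Zou and Hastie (2005) for linear regression as we recall it; the symbols (response y, design matrix X, tuning parameters \lambda_1 and \lambda_2) are our notation, and the SRI penalty itself is not specified in the abstract, so it is not shown.

```latex
\hat{\beta}^{\text{naive}}
  = \arg\min_{\beta}\; \|y - X\beta\|_2^2
    + \lambda_1 \|\beta\|_1 + \lambda_2 \|\beta\|_2^2,
\qquad
\hat{\beta}^{\text{enet}} = (1 + \lambda_2)\,\hat{\beta}^{\text{naive}}.
```

The factor (1 + \lambda_2) is the heuristic rescaling mentioned above. It is tied to squared-error loss, which is why its analogue for logistic regression is unclear.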

21 Mar

Taking advantage of Degeneracy in Cone Optimization with Applications to Sensor Network Localization and Molecular Conformation.
120 Hanes Hall

The elegant theoretical results for strong duality and strict complementarity for linear programming (LP) lie behind the success of current algorithms. However, the theory and preprocessing techniques that are successful for LP can fail for cone programming over nonpolyhedral cones. Surprisingly, many instances of semidefinite programming (SDP) problems that arise from relaxations of hard combinatorial problems are degenerate (Slater's constraint qualification fails). Rather than being a disadvantage, we show that this degeneracy can be exploited. In particular, several huge instances of SDP completion problems can be solved quickly and to extremely high accuracy. We illustrate this on the sensor network localization and molecular conformation problems.

23 Feb

Bad semidefinite programs: they all look the same
120 Hanes Hall

Semidefinite Programming (SDP) is the problem of optimizing a linear objective function of a symmetric matrix variable, with the requirement that the variable also be positive semidefinite. SDP has been called "linear programming for the twenty-first century": it is vastly more general than LP, with applications ranging from engineering to combinatorial optimization, while it is still efficiently solvable. Duality theory is a central concept in SDP, just like it is in linear programming, since in optimization algorithms a dual solution serves as a certificate of optimality. However, in SDP, unlike in LP, rather fascinating "pathological" phenomena occur: nonattainment of the optimal value, and positive duality gaps between the primal and dual problems. This research was motivated by the curious similarity of pathological SDP instances appearing in the literature. We prove an exact characterization of badly behaved semidefinite systems, i.e., show that, surprisingly, "all bad SDPs look the same". We also prove that all badly behaved semidefinite systems can be reduced to a minimal such system with just one variable, and two by two matrices in a well defined sense. This result is analogous to "excluded minor" type results in graph theory (a quite unrelated field), like Kuratowski's theorem. Our characterizations imply that recognizing badly behaved semidefinite systems is in NP and co-NP in the real number model of computing. While the main tool we use is convex analysis, the results have a combinatorial flavor. The talk will be self-contained, and not assume any previous knowledge of semidefinite programming, or duality theory. I will also give some motivation on why SDP is interesting and show some applications.
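
A small instance illustrates the nonattainment phenomenon. It is a standard one-variable, two-by-two example of the kind the abstract alludes to, written out here as an illustration; it is not claimed to be the exact system used in the talk.

```latex
\text{(P)}\quad \sup\; x_1
\quad\text{s.t.}\quad
x_1 \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix}
\preceq
\begin{pmatrix} 1 & 0 \\ 0 & 0 \end{pmatrix}
\qquad\qquad
\text{(D)}\quad \inf\; y_{11}
\quad\text{s.t.}\quad
Y = \begin{pmatrix} y_{11} & \tfrac12 \\ \tfrac12 & y_{22} \end{pmatrix} \succeq 0
```

In (P), the slack matrix has determinant -x_1^2, so x_1 = 0 is the only feasible point and the optimal value 0 is attained. In (D), positive semidefiniteness forces y_{11} y_{22} \ge 1/4, so y_{11} can be made arbitrarily small but never zero. The dual value is therefore also 0, but no dual solution attains it.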

21 Feb

Minimax Estimation for Mixtures of Wishart Distributions

The space of positive definite symmetric matrices has been studied extensively as a means of understanding dependence in multivariate data along with the accompanying problems in statistical inference. Many books and papers have been written on this subject, and more recently there has been considerable interest in high-dimensional random matrices with particular emphasis on the distribution of certain eigenvalues. Our present paper is motivated by modern data acquisition technology, particularly by the availability of diffusion tensor-magnetic resonance data. With the availability of such data acquisition capabilities, smoothing or nonparametric techniques are required that go beyond those applicable only to data arising in Euclidean spaces. Accordingly, we present a Fourier method of minimax Wishart mixture density estimation on the space of positive definite symmetric matrices. *Joint with L.R. Haff, P.T. Kim and D. Richards

14 Feb

A Compact Functional Estimate of a Functional Variance-Covariance or Correlation Kernel

In functional data analysis, as in its multivariate counterpart, estimates of the bivariate covariance kernel σ(s,t) and its inverse are useful for many things. However, the dimensionality of functional observations often exceeds the sample size available to estimate σ(s,t). Then the analogue of the multivariate sample estimate is singular and non-invertible. Even when this is not the case, the high dimensionality of the usual estimate often implies unacceptable sample variability and loss of degrees of freedom for model fitting. The common practice of employing low-dimensional principal component approximations to σ(s,t) to achieve invertibility also raises serious issues. This talk describes a functional and nonsingular estimate of σ(s,t) defined by an expansion in terms of finite element basis functions that permits the user to control the resolution of the estimate as well as the time lag over which covariance may be nonzero. This estimate also permits the estimation of covariances and correlations at observed pairs of sampling points, and therefore has applications to many classical statistical problems, such as discrete but unequally spaced time and spatial series.

24 Jan

Bayesian Control of Multiplicity
120 Hanes Hall

Issues of multiplicity in testing are increasingly being encountered in a wide range of disciplines, as the growing complexity of data allows for consideration of a multitude of possible hypotheses (e.g., does gene xyz affect condition abc); failure to properly adjust for multiplicities is possibly to blame for the apparently increasing lack of reproducibility in science. Bayesian adjustment for multiplicity is straightforward conceptually, in that it occurs through the prior probabilities assigned to models/hypotheses. It is, hence, independent of the error structure of the data, the main obstacle to adjustment for multiplicity in classical statistics. Not all assignments of prior probabilities adjust for multiplicity, and assignments in huge model spaces typically require a mix of subjective assignment and appropriate hierarchical modeling. These issues will be reviewed through a variety of examples, including vaccine trials.

12 Jan

High Dimensional Statistical Learning
120 Hanes Hall

In this talk, I will present some new contributions to the area of high dimensional statistical learning. The focus will be on both classification and clustering. Classification is one of the central research topics in the field of statistical learning. For binary classification, we proposed the Bi-Directional Discrimination (BDD) method, which generalizes linear classifiers from one hyperplane to two or more hyperplanes. For multiclass classification, we have generalized the DWD method from the binary case to the multiclass case. Clustering is another important topic in statistical learning. SigClust is a recently developed powerful tool to assess the statistical significance of clusters. SigClust is under continuous development. We are working in several directions to improve the performance of the SigClust method.

29 Nov

Disjunctive conic cuts for second-order cone programming
120 Hanes Hall

We develop the basics of the theory of disjunctive programming for mixed integer second-order cone programming (MISOCP) problems. We consider the intersection between the relaxed feasible set, assumed to be an ellipsoid, and a disjunction. Under mild assumptions, we show the existence of a second-order cone yielding the convex hull of that intersection. One can use the conic cut in branch-and-cut algorithms for MISOCP problems. We compare our cut against the conic cut of Atamtürk and Narayanan and show that our disjunctive cut is always stronger. Along the way we also prove an interesting result about the family of quadratic surfaces whose intersection with two given hyperplanes is fixed. This is joint work with Pietro Belotti, Julio C Góez, Ted Ralphs and Tamás Terlaky.

18 Nov

Matching Statistics of an Ito Process by a Process of Diffusion Type
130 Hanes Hall

Suppose we are given a multi-dimensional Ito process, which can be regarded as a model for an underlying asset price together with related stochastic processes, e.g., volatility. The drift and diffusion terms for this Ito process are permitted to be arbitrary adapted processes. We construct a weak solution to a diffusion-type equation that matches the distribution of the Ito process at each fixed time. Moreover, we show how to also match the distribution at each fixed time of statistics of the Ito process, including the running maximum and running average of one of the components of the process. A consequence of this result is that a wide variety of exotic derivative securities have the same prices when written on the original Ito process as when written on the mimicking process. This is joint work with Gerard Brunick.

17 Nov

Diffusion Approximations for Multiscale Stochastic Networks in Heavy traffic
120 Hanes Hall

We study a sequence of nearly critically loaded queueing networks, for which the arrival and service rates, as well as the routing structure change over time. The nth network is described in terms of two independent finite state Markov processes {Xn(t): t ≥ 0} and {Yn(t) : t ≥ 0} which can be interpreted as the random environment in which the system is operating. The process Xn changes states at a much higher rate than the typical arrival and service times in the system, while the reverse is true for Yn. The variations in the routing mechanism of the network are governed by Xn, whereas the arrival and service rates at various stations depend on the state process (i.e. queue length process) and both Xn and Yn. Denoting by Zn the suitably normalized queue length process, it is shown that, under appropriate heavy traffic conditions, the pair Markov process (Zn, Yn) converges weakly to the Markov process (Z, Y), where Y is a finite state continuous time Markov process and the process Z is a reflected diffusion with drift and diffusion coefficients depending on (Z, Y) and the stationary distribution of Xn. The study of stability properties of such queueing systems is of central engineering importance. The second part of my talk focuses on the stability properties of the limit process (Z, Y). Under an appropriate stability condition, which is formulated in terms of the "averaged" drift (with respect to the stationary distribution of Y), we show that the Markov process (Z, Y) is positive recurrent and has a unique invariant measure. In fact, we establish a significantly stronger result, namely the process (Z, Y) is geometrically ergodic and its invariant distribution has a finite moment generating function in a neighborhood of zero. We also obtain uniform time estimates for polynomial moments (of all orders) of the process and functional central limit results for long time fluctuations of the empirical means around their stationary averages.

10 Nov

Synthesis of Gaussian and non-Gaussian stationary time series using circulant matrix embedding
120 Hanes Hall

What is the best method to generate Gaussian stationary time series? What is an extension of the method to the multivariate context? How to generate non-Gaussian stationary time series with prescribed marginal distribution and covariance structure? Can this be done for any marginal distribution and covariance structure? These are some of the questions I will discuss in the talk.
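
As background for the first question, here is a minimal sketch of the basic univariate circulant embedding recipe, not the speaker's code. It assumes NumPy is available and uses the smallest embedding, of size 2n - 2; the function name and the AR(1)-type autocovariance in the example are illustrative choices.

```python
import numpy as np


def circulant_embedding_sample(autocov, rng):
    """Draw one Gaussian stationary series of length len(autocov) with that autocovariance."""
    r = np.asarray(autocov, dtype=float)
    n = len(r)
    # First row of the symmetric circulant matrix of size m = 2n - 2 that embeds the covariance.
    first_row = np.concatenate([r, r[-2:0:-1]])
    m = len(first_row)
    # The eigenvalues of a circulant matrix are the discrete Fourier transform of its first row.
    eigenvalues = np.fft.fft(first_row).real
    if eigenvalues.min() < -1e-10:
        raise ValueError("embedding is not nonnegative definite; a larger embedding is needed")
    eigenvalues = np.clip(eigenvalues, 0.0, None)
    # Complex white noise, scaled by the square-rooted spectrum and transformed back.
    noise = rng.standard_normal(m) + 1j * rng.standard_normal(m)
    series = np.fft.fft(np.sqrt(eigenvalues / m) * noise)
    return series.real[:n]


rng = np.random.default_rng(0)
autocov = 0.5 ** np.arange(64)  # illustrative autocovariance r(k) = 0.5^k
x = circulant_embedding_sample(autocov, rng)
print(x.shape)  # (64,)
```

When the embedding eigenvalues are nonnegative the sample has exactly the prescribed covariance, and the cost is that of two fast Fourier transforms. The imaginary part of `series` gives a second, independent sample for free.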

08 Nov

Second-Order Comparison of Functional Data and DNA Geometry
120 Hanes Hall

Given two samples of continuous zero-mean iid Gaussian random functions on [0,1], we consider the problem of testing whether they share the same covariance operator. Our study is motivated by the problem of determining whether the mechanical properties of short strands of DNA are significantly affected by their base-pair sequence; such an effect, though expected, had so far not been observed in three-dimensional electron microscopy data. The testing problem is seen to involve aspects of ill-posed inverse problems, and a test based on a Karhunen–Loève approximation of the Hilbert–Schmidt distance of the empirical covariance operators is proposed and investigated. When applied to a dataset of DNA minicircles obtained through the electron microscope, our test seems to suggest potential sequence effects on DNA shape.

03 Nov

Algebraic Statistics
120 Hanes Hall

Algebraic statistics advocates polynomial algebra as a tool for addressing problems in statistics and its applications. This connection is based on the fact that most statistical models are defined either parametrically or implicitly via polynomial equations. The idea is summarized by the phrase "Statistical models are semialgebraic sets". I will try to illustrate this idea with two examples, the first coming from the analysis of contingency tables, and the second arising in computational biology. I will try to keep the algebraic prerequisites to an absolute minimum and keep the talk accessible to a broad audience.
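
A familiar small case shows what the phrase means. The example below, the independence model for a 2 x 2 contingency table, is our illustration and not necessarily one of the two examples in the talk; p_{ij} denotes the probability of cell (i, j).

```latex
\mathcal{M}_{\text{indep}} = \Bigl\{ (p_{11}, p_{12}, p_{21}, p_{22}) \in \mathbb{R}^4 :
  p_{ij} \ge 0,\;\;
  p_{11} + p_{12} + p_{21} + p_{22} = 1,\;\;
  p_{11}\,p_{22} - p_{12}\,p_{21} = 0 \Bigr\}
```

The model is cut out by one polynomial equation (the vanishing 2 x 2 determinant) together with the linear and inequality constraints of the probability simplex, which makes it a semialgebraic set.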

01 Nov

Understanding Sensitivities in Paleoclimatic Reconstructions
120 Hanes Hall

Recent publications have seen the introduction of a number of new statistical methods for paleoclimatic reconstructions. When applied to archived multiproxy datasets, most of these methods produce reconstructed curves very similar to the "hockey stick" shape that was first observed by Mann, Bradley and Hughes. However, one recent reconstruction, by McShane and Wyner, produced a sharply different shape. Trying to understand the reasons for this leads to important insights for both statistical methodology and paleoclimatic datasets. The "divergence" phenomenon - that the relationship between temperature and some of the proxies may not be constant over time - has been extensively discussed in the paleoclimate literature, but mostly in the context of certain classes of northern hemisphere tree rings, which are not included among the proxies examined here. Closer scrutiny of the data suggests a new divergence phenomenon, associated with lake sediments. When these are removed from the data, the resulting reconstruction is much closer to the familiar hockey stick shape. This highlights the need both for careful scrutiny of the data, and for statistical methods that are robust against the divergence phenomenon.

27 Oct

Ancestor problem for branching trees
120 Hanes Hall

In a branching tree that is not extinct by generation n, pick two individuals from the nth generation and trace their lines back until they meet. Call that generation number Xn. In this talk we will discuss the distribution of Xn and its weak limits as n gets large for the case of a Galton-Watson tree for the four cases: m <1,m=1,1
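
The sampling scheme described in this abstract can be illustrated with a small simulation. The following is a minimal sketch, not the speaker's method: it assumes a finite offspring distribution passed as a list `pmf`, conditions on generation n containing at least two individuals, and picks the two individuals without replacement. The function names (`grow_tree`, `coalescence_generation`, `sample_x_n`) are hypothetical and chosen only for this illustration.

```python
import random


def sample_offspring(pmf, rng):
    """Draw an offspring count from a finite pmf (list of probabilities)."""
    u = rng.random()
    cumulative = 0.0
    for k, p in enumerate(pmf):
        cumulative += p
        if u < cumulative:
            return k
    return len(pmf) - 1  # guard against floating-point round-off


def grow_tree(pmf, n, rng):
    """Grow a Galton-Watson tree for n >= 1 generations from a single root.

    Returns a list `parents` where parents[g][i] is the index, within
    generation g, of the parent of individual i in generation g + 1.
    Returns None if the tree is extinct by generation n.
    """
    parents = []
    size = 1  # generation 0 holds only the root
    for _ in range(n):
        next_parents = []
        for i in range(size):
            next_parents.extend([i] * sample_offspring(pmf, rng))
        if not next_parents:
            return None
        parents.append(next_parents)
        size = len(next_parents)
    return parents


def coalescence_generation(parents, a, b):
    """Trace individuals a and b of the last generation back until their
    ancestral lines meet, and return the generation where that happens."""
    g = len(parents)
    while a != b:
        a, b = parents[g - 1][a], parents[g - 1][b]
        g -= 1
    return g


def sample_x_n(pmf, n, rng):
    """Draw one value of X_n by rejection sampling.

    Assumption of this sketch: trees with fewer than two individuals
    in generation n are discarded and regrown.
    """
    while True:
        parents = grow_tree(pmf, n, rng)
        if parents is not None and len(parents[-1]) >= 2:
            a, b = rng.sample(range(len(parents[-1])), 2)
            return coalescence_generation(parents, a, b)


if __name__ == "__main__":
    rng = random.Random(2024)
    pmf = [0.25, 0.25, 0.5]  # illustrative offspring law, mean 1.25
    draws = [sample_x_n(pmf, 8, rng) for _ in range(2000)]
    for g in range(8):
        print(g, draws.count(g) / len(draws))
```

Running the script prints an empirical distribution of Xn for one illustrative offspring law. Rejection sampling can be slow when survival to generation n is unlikely, and the population can grow quickly when the offspring mean exceeds 1, so n should be kept modest in this sketch.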

25 Oct

Markov Decision Processes with Censored Observations: Analysis and Applications in Dynamic Pricing
120 Hanes Hall

Censored (or truncated) observations are quite prevalent in practice. How does the existence of imperfect observations affect optimal decisions? We consider this problem in the setting of a general finite-horizon Markov decision process with two-sided censoring. We first show a general value function result based on a generalized envelope theorem, which quantifies the loss of informational value due to censoring. This result is subsequently used to characterize how the existence of imperfect observations affects optimal decisions. We apply these results to two dynamic pricing problems with or without inventory replenishment and a Name-Your-Own-Price problem.

13 Oct

Small noise limits and Feller selection
120 Hanes Hall

This talk will describe two results involving small noise limits of diffusions. The first is a proposal to use small noise limits for selection of a Feller family of solutions when uniqueness of weak solutions is lacking (joint work with K. Sureshkumar). The second is an extension of a result of Sheu concerning small noise limits for invariant measures of diffusions (joint work with Anup Biswas).

11 Oct

Covariate adjusted functional principal component analysis for PET time-course data
120 Hanes Hall

Classical principal component analysis (PCA) has been extended to functional data and termed functional PCA. In this talk, we explore applications of functional PCA when covariate information is available. We present two approaches and apply them to the analysis of PET time course data by borrowing information across space. The new approaches reduce the noise in the data in the preprocessing step and lead to substantial improvement in subsequent analysis by methods such as Spectral Analysis. This talk is based on joint work with Ciren Jiang (SAMSI) and John Aston (University of Warwick).

04 Oct

Corrected-loss Estimation for Quantile Regression with Covariate Measurement Error

We study estimation in quantile regression when covariates are measured with error. Existing work in the literature often requires stringent assumptions, such as spherically symmetric joint distribution of the regression and measurement error variables, or linearity of all quantile functions, which restrict model flexibility and complicate computation. In this paper, we develop a new estimation approach based on corrected scores to account for a class of covariate measurement errors in quantile regression. The proposed method is simple to implement, and its validity only requires linearity of the particular quantile function of interest. In addition, the proposed method does not require any parametric assumptions on the regression error distributions. We demonstrate with a simulation study that the proposed estimators are more efficient than existing methods in various models considered. Finally, we illustrate the proposed method through the analysis of a dietary data set. This is joint work with Leonard A. Stefanski and Zhongyi Zhu.

23 Sep

Stable Processes with Drifts
130 Hanes Hall

A rotationally symmetric stable process in Euclidean space with a drift is a strong Markov process X whose infinitesimal generator L is a fractional Laplacian perturbed by a gradient operator. In this talk, I will present recent results on sharp estimates of the transition density p_D(t, x, y) of the sub-process of X killed upon leaving a bounded open set D. This transition density function p_D(t, x, y) is also the fundamental solution (or heat kernel) of the non-local operator L on D with zero exterior condition. Based on joint work with P. Kim and R. Song.

20 Sep

Non-concave Penalized Composite Likelihood Estimation of Sparse Ising Models
120 Hanes Hall

The Ising model is a useful tool for studying complex interactions within a system. The estimation of such a model, however, is rather challenging, especially in the presence of high-dimensional parameters. In this work, we propose efficient procedures for learning a sparse Ising model based on a penalized composite likelihood with non-concave penalties. Non-concave penalized likelihood estimation has received a lot of attention in recent years. However, such an approach is computationally prohibitive under high-dimensional Ising models. To overcome such difficulties, we extend the methodology and theory of non-concave penalized likelihood to penalized composite likelihood estimation. An efficient solution path algorithm is devised by using a new coordinate-minorization-ascent algorithm. Asymptotic oracle properties of the proposed estimator are established with NP-dimensionality. We demonstrate its finite sample performance via simulation studies and further illustrate our proposal by studying the Human Immunodeficiency Virus type 1 (HIV-1) protease structure based on data from the Stanford HIV Drug Resistance Database. This talk is based on a joint paper with Lingzhou Xue and Tianxi Cai.

02 Sep

On the Positive Recurrence of Semimartingale Reflecting Brownian Motions in Three Dimensions
222 Greenlaw Hall

Let Z be an n-dimensional Brownian motion confined to the non-negative orthant by oblique reflection at the boundary. Such processes arise in applied probability as diffusion approximations for multi-station stochastic processing networks. For dimension n = 2, a simple condition is known to be necessary and sufficient for positive recurrence of Z. The obvious analog of that condition is necessary but not sufficient in three and higher dimensions, where fundamentally new phenomena arise. Building on prior work by Bernard and El Kharroubi (1991) and El Kharroubi et al. (2000, 2002), we provide necessary and sufficient conditions for positive recurrence in dimension n = 3. In this context we find that the fluid-stability criterion of Dupuis and Williams (1994) is not only necessary for positive recurrence but also sufficient; that is, in three dimensions Z is positive recurrent if and only if every path of the associated fluid model is attracted to the origin. I will also discuss recent developments for problems in four and higher dimensions. Joint work with Maury Bramson and Michael Harrison.

Luxi Zheng, UNC Chapel Hill

Chao Deng, UNC Chapel Hill

Shemra Rizzo, UNC Chapel Hill