Research
Selected research topics and publications
Modularity is a popular metric for quantifying the degree of community structure within a network. The distribution of the largest eigenvalue of a network’s edge weight or adjacency matrix is well studied and is frequently used as a substitute for modularity when performing statistical inference. However, we show that the largest eigenvalue and modularity are asymptotically uncorrelated, which suggests the need for inference directly on modularity itself when the network size is large. To this end, we derive the asymptotic distributions of modularity in the case where the network’s edge weight matrix belongs to the Gaussian orthogonal ensemble, and study the statistical power of the corresponding test for community structure under some alternative models. We empirically explore universality extensions of the limiting distribution and demonstrate the accuracy of these asymptotic distributions through Type I error simulations. We also compare the empirical powers of the modularity-based tests with some existing methods. Our method is then used to test for the presence of community structure in two real data applications.
Citation: Ma, R., & Barnett, I. (2021). The asymptotic distribution of modularity in weighted signed networks. Biometrika, 108(1), 1-16.
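As a rough illustration of what modularity measures, the sketch below evaluates the standard Newman-Girvan modularity of a fixed partition for an undirected network with nonnegative edge weights. This is an assumption made for illustration only: it is not the signed-network definition or the asymptotic test developed in the paper, and the function name `newman_modularity` is hypothetical.

```python
def newman_modularity(weights, labels):
    """Newman-Girvan modularity of a given partition.

    weights: symmetric matrix (list of lists) of nonnegative edge weights.
    labels:  community label for each node, in the same order as the rows.
    """
    n = len(weights)
    strength = [sum(row) for row in weights]  # weighted degree of each node
    two_m = sum(strength)                     # twice the total edge weight
    if two_m == 0:
        raise ValueError("network has no edge weight")
    q = 0.0
    for i in range(n):
        for j in range(n):
            if labels[i] == labels[j]:
                # observed weight minus the weight expected under random rewiring
                q += weights[i][j] - strength[i] * strength[j] / two_m
    return q / two_m


# Two disjoint triangles: nodes 0-2 form one triangle, nodes 3-5 the other.
triangles = [
    [0, 1, 1, 0, 0, 0],
    [1, 0, 1, 0, 0, 0],
    [1, 1, 0, 0, 0, 0],
    [0, 0, 0, 0, 1, 1],
    [0, 0, 0, 1, 0, 1],
    [0, 0, 0, 1, 1, 0],
]
q_split = newman_modularity(triangles, [0, 0, 0, 1, 1, 1])
q_single = newman_modularity(triangles, [0, 0, 0, 0, 0, 0])
```

Placing each triangle in its own community gives a clearly positive score, while lumping all nodes into one community gives zero, which is the sense in which modularity quantifies community structure.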
Principal component analysis (PCA) is a popular method for dimension reduction in unsupervised multivariate analysis. However, existing ad hoc uses of PCA in both multivariate regression (multiple outcomes) and multiple regression (multiple predictors) lack theoretical justification. The differences in the statistical properties of PCAs in these two regression settings are not well understood. In this paper we provide theoretical results on the power of PCA in genetic association testing in both multiple phenotype and SNP-set settings. The multiple phenotype setting refers to the case when one is interested in studying the association between a single SNP and multiple phenotypes as outcomes. The SNP-set setting refers to the case when one is interested in studying the association between multiple SNPs in a SNP set and a single phenotype as the outcome. We demonstrate analytically that the properties of the PC-based analysis in these two regression settings are substantially different. We show that the lower-order PCs, that is, PCs with large eigenvalues, are generally preferred and lead to a higher power in the SNP-set setting, while the higher-order PCs, that is, PCs with small eigenvalues, are generally preferred in the multiple phenotype setting. We also investigate the power of three other popular statistical methods, the Wald test, the variance component test and the minimum p-value test, in both multiple phenotype and SNP-set settings. We use theoretical power, simulation studies, and two real data analyses to validate our findings.
Citation: Liu, Z., Barnett, I., & Lin, X. (2020). A comparison of principal component methods between multiple phenotype regression and multiple SNP regression in genetic association studies. The Annals of Applied Statistics, 14(1), 433-451.
Objective: Studies that use patient smartphones to collect ecological momentary assessment and sensor data, an approach frequently referred to as digital phenotyping, have increased in popularity in recent years. There is a lack of formal guidelines for the design of new digital phenotyping studies so that they are powered to detect both population-level longitudinal associations and individual-level change points in multivariate time series. In particular, determining the appropriate balance of sample size relative to the targeted duration of follow-up is a challenge. Materials and Methods: We used data from 2 prior smartphone-based digital phenotyping studies to provide reasonable ranges of effect size and parameters. We considered likelihood ratio tests for generalized linear mixed models as well as for change point detection of individual-level multivariate time series. Results: We propose a joint procedure for sequentially calculating first an appropriate length of follow-up and then a necessary minimum sample size required to provide adequate power. In addition, we developed an accompanying accessible sample size and power calculator. Discussion: The 2-parameter problem of identifying both an appropriate sample size and duration of follow-up for a longitudinal study requires the simultaneous consideration of 2 analysis methods during study design. Conclusion: The temporally dense longitudinal data collected by digital phenotyping studies may warrant a variety of applicable analysis choices. Our use of generalized linear mixed models as well as change point detection to guide sample size and study duration calculations provides a tool to effectively power new digital phenotyping studies.
Citation: Barnett, I., Torous, J., Reeder, H. T., Baker, J., & Onnela, J. P. (2020). Determining sample size and length of follow-up for smartphone-based digital phenotyping studies. Journal of the American Medical Informatics Association, 27(12), 1844-1849.
Objective: To identify trends in mobility and daily pain levels among a cohort of patients with clinically diagnosed spine disease. Methods: Participants with spine disease were enrolled from a general neurosurgical clinic and installed a smartphone application (Beiwe) designed for digital phenotyping on their personal smartphones. This application collected passive metadata on a minute-to-minute basis, including global positioning system (GPS), WiFi, accelerometer, text and telephone logs, and screen on and off time. The application also administered daily visual analog scale pain surveys. A linear mixed model framework was used to test for associations between self-reported pain and mobility and sociability from the passively collected data. Results: A total of 105 patients were enrolled, with a median follow-up time of 94.5 days; 55 patients underwent a surgical intervention during the follow-up period. The weekly pain survey response rate was 73.2%. By the end of follow-up, the mean change in pain for all patients was −1.3 points (4.96 at the start of follow-up to 3.66 by the end of follow-up). Increased pain was significantly associated with reduced patient mobility as measured using 3 daily GPS summary statistics (i.e., average flight length, maximum diameter travelled, total distance travelled). Conclusions: Patients with spine disease who reported greater pain had reduced mobility, as measured by the passively collected smartphone GPS data. Smartphone-based digital phenotyping appears to be a promising and scalable approach to assess mobility and quality of life of patients with spine disease.
Citation: Cote, D. J., Barnett, I., Onnela, J. P., & Smith, T. R. (2019). Digital phenotyping in patients with spine disease: a novel approach to quantifying mobility and quality of life. World Neurosurgery, 126, e241-e249.
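To make two of the daily GPS summary statistics named above concrete, here is a minimal sketch that computes total distance travelled and maximum diameter travelled from one day's sequence of (latitude, longitude) points. The definitions used are assumptions for illustration (great-circle distance between consecutive points, and the largest pairwise distance), and the function names are hypothetical. The study's own processing of raw Beiwe GPS data, including how flights are identified for the average flight length, is not reproduced here.

```python
import math
from itertools import combinations

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, an approximation


def haversine_km(p, q):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    lat1, lon1 = map(math.radians, p)
    lat2, lon2 = map(math.radians, q)
    dlat, dlon = lat2 - lat1, lon2 - lon1
    a = math.sin(dlat / 2) ** 2 + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2
    # min() guards against rounding pushing the argument slightly above 1
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(a)))


def total_distance_km(points):
    """Sum of distances between consecutive points (assumed definition)."""
    return sum(haversine_km(a, b) for a, b in zip(points, points[1:]))


def max_diameter_km(points):
    """Largest distance between any two points of the day (assumed definition)."""
    return max((haversine_km(a, b) for a, b in combinations(points, 2)), default=0.0)


# Toy day: one degree east along the equator and back again.
day = [(0.0, 0.0), (0.0, 1.0), (0.0, 0.0)]
total = total_distance_km(day)
diameter = max_diameter_km(day)
```

In this toy day the total distance is twice the maximum diameter, since the out-and-back trip covers the same stretch twice. Daily summaries of this kind could then serve as covariates or outcomes in a linear mixed model alongside the pain scores.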
Feature-based classification of networks
Network representations of systems from various scientific and societal domains are neither completely random nor fully regular, but instead appear to contain recurring structural features. These features tend to be shared by networks belonging to the same broad class, such as the class of social networks or the class of biological networks. Within each such class, networks describing similar systems tend to have similar features. This occurs presumably because networks representing similar systems would be expected to be generated by a shared set of domain-specific mechanisms, and it should therefore be possible to classify networks based on their features at various structural levels. Here we describe and demonstrate a new hybrid approach that combines manual selection of network features of potential interest with existing automated classification methods. In particular, selecting well-known network features that have been studied extensively in social network analysis and network science literature, and then classifying networks on the basis of these features using methods such as random forest, which is known to handle the type of feature collinearity that arises in this setting, we find that our approach is able to achieve both higher accuracy and greater interpretability in shorter computation time than other methods.
Citation: Barnett, I., Malik, N., Kuijjer, M. L., Mucha, P. J., & Onnela, J. P. (2019). Feature-based classification of networks. Network Science, 7(3), 438-444.
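The feature-selection step described in the abstract can be sketched as follows: each network is reduced to a small vector of well-known summary features, which could then be passed to an off-the-shelf classifier such as a random forest. The three features chosen here (density, mean degree, transitivity) and the function name `network_features` are assumptions for illustration, not the feature set used in the paper, and the classification step itself is omitted.

```python
from itertools import combinations


def network_features(n_nodes, edges):
    """Summary features of a simple undirected network.

    n_nodes: number of nodes, labelled 0 .. n_nodes - 1.
    edges:   iterable of (u, v) pairs; self-loops and duplicates are ignored.
    """
    adj = {v: set() for v in range(n_nodes)}
    for u, v in edges:
        if u != v:
            adj[u].add(v)
            adj[v].add(u)

    degrees = [len(adj[v]) for v in adj]
    n_edges = sum(degrees) // 2
    possible = n_nodes * (n_nodes - 1) / 2
    density = n_edges / possible if possible else 0.0
    mean_degree = sum(degrees) / n_nodes if n_nodes else 0.0

    # Transitivity: fraction of connected triples (paths of length two,
    # counted at their centre node) whose two end nodes are also linked.
    triples = sum(d * (d - 1) // 2 for d in degrees)
    closed = sum(
        1
        for v in adj
        for a, b in combinations(sorted(adj[v]), 2)
        if b in adj[a]
    )
    transitivity = closed / triples if triples else 0.0

    return {"density": density, "mean_degree": mean_degree, "transitivity": transitivity}


triangle = network_features(3, [(0, 1), (1, 2), (0, 2)])
path = network_features(3, [(0, 1), (1, 2)])
```

A triangle and a three-node path differ sharply in transitivity even though they have the same number of nodes, which is the kind of structural contrast a feature-based classifier can exploit.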