Big Data in Healthcare

Health data are increasingly being generated at massive scale, at various levels of phenotyping and from different types of sources. We use nationwide electronic health record data on millions of individuals from several countries to develop machine learning algorithms for predicting future onset of disease, identifying causal drivers of disease, and unraveling personalized responses to drugs. We aim to understand the health trajectories of different people, how they unfold along different pathways, how past health affects present and future health, and how different determinants of health interact over time.

Analyses of such large-scale medical data have the potential to identify new associations, patterns and trends that may pave the way to scientific discoveries in the pathogenesis, classification, diagnosis, treatment and progression of disease. Such work includes constructing computational models that accurately predict clinical outcomes and disease progression. These models have the potential to identify people at high risk and prioritize them for early intervention strategies, and to evaluate the influence of public health policies using ‘real-world’ data. Our prediction of gestational diabetes is one such example.

We use state-of-the-art data science methods to analyze these large datasets, including:

Descriptive analysis. Such approaches are useful for unbiased, exploratory study of the data and for finding interesting patterns that may lead to testable hypotheses.
Prediction analysis. Prediction analysis aims to learn a mapping from a set of inputs to some outcome of interest, such that the mapping can later be used to predict the outcome from the inputs in a different, previously unseen dataset. Prediction analysis holds the potential for improving disease diagnosis and prognosis.
Counterfactual prediction. A major limitation of observational studies is that they cannot directly answer causal questions, because observational data may be heavily confounded and subject to other limitations. Counterfactual prediction therefore aims to construct models that account for these limitations, so that causal effects can be inferred from observational data.
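
To make the confounding problem concrete, the following is a minimal sketch, not the method used in our studies. It assumes a small, entirely hypothetical table of aggregated counts with a single measured confounder (an age group), and compares a naive risk difference with one standardized over the confounder. All names (RECORDS, naive_risk_difference, standardized_risk_difference) and all numbers are illustrative. The adjustment is only meaningful if every relevant confounder is measured and every stratum contains both treated and untreated people.

```python
from collections import defaultdict

# Hypothetical aggregated records (illustrative numbers, not real data):
# (confounder_stratum, treated, n_people, n_events)
RECORDS = [
    ("young", 1, 100, 10),
    ("young", 0, 400, 40),
    ("old", 1, 400, 200),
    ("old", 0, 100, 50),
]


def naive_risk_difference(records):
    """Risk in treated minus risk in untreated, ignoring the confounder."""
    people = defaultdict(int)
    events = defaultdict(int)
    for _, treated, n_people, n_events in records:
        people[treated] += n_people
        events[treated] += n_events
    return events[1] / people[1] - events[0] / people[0]


def standardized_risk_difference(records):
    """Risk difference averaged over strata, weighted by stratum size.

    Assumes the stratum variable captures all confounding and that each
    stratum contains both treated and untreated people.
    """
    stratum_size = defaultdict(int)
    risk = {}
    for stratum, treated, n_people, n_events in records:
        stratum_size[stratum] += n_people
        risk[(stratum, treated)] = n_events / n_people
    total = sum(stratum_size.values())
    difference = 0.0
    for stratum, size in stratum_size.items():
        weight = size / total  # share of the population in this stratum
        difference += weight * (risk[(stratum, 1)] - risk[(stratum, 0)])
    return difference


if __name__ == "__main__":
    print(f"naive risk difference:        {naive_risk_difference(RECORDS):.2f}")
    print(f"standardized risk difference: {standardized_risk_difference(RECORDS):.2f}")
```

In this constructed example the treated group is mostly older and older people have more events, so the naive comparison suggests a risk increase of about 0.24 even though the risk within each age group is identical for treated and untreated people. Standardizing over age removes that spurious difference.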