©2018 by Imminenth Inc.

Preservation of Statistical Accuracy

Can we learn the same insights with synthetic data?

In this section, we take a closer look at how we demonstrate the statistical validity of our synthetic datasets. We want to show that the properties of a real dataset are preserved in its synthetic counterpart (i.e., "statistical equivalency"). Note that some of these metrics are also used to guide the synthesis process: our algorithm's iterative process tries to minimize the statistical differences between the real and synthetic datasets.

We compare a real dataset to its synthetic counterpart on three different levels:

  1. Univariate Comparison

  2. Multivariate Comparison

  3. Neural Network Performance

Cohort Description:

In this example cohort, we selected patients who underwent any cardiac procedure and stayed in critical care units in the Greater Boston area. This subset contains 8,000 patients with multiple static and time-varying features, including demographic features (such as gender, insurance, religion, age, and ethnicity), clinical features (such as diagnoses, procedures, and medications), and physiological features (such as blood tests, fluids entering and leaving the body, transfusions, and nutrients).

 

Univariate Comparison

At a fundamental level, our algorithm learns and encodes the distribution of each individual variable (feature). It then samples from these encoded characteristics and rebuilds the distribution. For example, if a specific feature has a normal distribution with mean "x" and standard deviation "s", the algorithm rebuilds the data with those constraints in mind.
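To make the normal-distribution example concrete, here is a minimal sketch in Python with NumPy. It is an illustration only, not our production algorithm: the function names fit_normal and sample_normal are hypothetical, and the "real" feature is simulated rather than taken from the cohort.

```python
import numpy as np


def fit_normal(values):
    """Encode a feature as the mean and sample standard deviation of its values."""
    values = np.asarray(values, dtype=float)
    return values.mean(), values.std(ddof=1)


def sample_normal(mean, std, size, rng):
    """Rebuild the feature by sampling from the encoded normal distribution."""
    return rng.normal(loc=mean, scale=std, size=size)


rng = np.random.default_rng(0)

# Simulated stand-in for a real feature (assumption: not data from any patient).
real = rng.normal(loc=70.0, scale=12.0, size=8000)

# Encode the feature, then rebuild it from the encoded parameters.
mean, std = fit_normal(real)
synthetic = sample_normal(mean, std, size=real.size, rng=rng)

print(f"real:      mean={real.mean():.2f}  std={real.std(ddof=1):.2f}")
print(f"synthetic: mean={synthetic.mean():.2f}  std={synthetic.std(ddof=1):.2f}")
```

The two printed lines should show closely matching means and standard deviations. Features that are not normally distributed would need a different encoding, which this sketch does not cover.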

 

Histograms

Our algorithm compares the distributions of individual variables (histograms) as well as bivariate relationships (scatter plots) between the real and synthetic datasets.
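One simple way to put a number on a histogram comparison is sketched below. The choice of metric (total variation distance over shared bins) and the function name histogram_distance are assumptions made for illustration; they are not necessarily the metrics our algorithm uses.

```python
import numpy as np


def histogram_distance(real, synthetic, bins=20):
    """Total variation distance between two histograms built on shared bin edges.

    Returns a value in [0, 1]: 0 means identical histograms, 1 means no overlap.
    """
    real = np.asarray(real, dtype=float)
    synthetic = np.asarray(synthetic, dtype=float)

    # Use the same bin edges for both datasets so the bars are comparable.
    edges = np.histogram_bin_edges(np.concatenate([real, synthetic]), bins=bins)
    p, _ = np.histogram(real, bins=edges)
    q, _ = np.histogram(synthetic, bins=edges)

    # Convert counts to proportions before comparing.
    p = p / p.sum()
    q = q / q.sum()
    return 0.5 * np.abs(p - q).sum()


rng = np.random.default_rng(1)

# Simulated values only (assumption): two samples from the same distribution
# and one sample from a clearly shifted distribution.
real = rng.normal(0.0, 1.0, size=5000)
similar = rng.normal(0.0, 1.0, size=5000)
shifted = rng.normal(5.0, 1.0, size=5000)

print("same distribution:   ", histogram_distance(real, similar))
print("shifted distribution:", histogram_distance(real, shifted))
```

A small distance suggests the synthetic histogram tracks the real one closely, while a value near 1 indicates that the two barely overlap.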

Scatter Plots


Survival


Multivariate Comparison

Preserving the distribution of one or two variables is a relatively easy task. However, with every additional variable, the difficulty of preserving the overall joint distribution increases exponentially.
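A rough way to see this growth: if every variable is discretized into a fixed number of bins, the number of cells in the joint histogram is the bin count raised to the number of variables. The sketch below assumes 10 bins per variable (an arbitrary choice for illustration) and uses the cohort size of 8000 patients; the function name joint_cells is hypothetical.

```python
def joint_cells(bins_per_variable, n_variables):
    """Number of cells in a joint histogram with equal bins per variable."""
    return bins_per_variable ** n_variables


n_patients = 8000        # cohort size described above
bins_per_variable = 10   # assumption: arbitrary discretization for illustration

for n_variables in (1, 2, 3, 5, 10):
    cells = joint_cells(bins_per_variable, n_variables)
    # Average number of patients available to estimate each cell.
    per_cell = n_patients / cells
    print(f"{n_variables:>2} variables: {cells:>14,} cells, "
          f"{per_cell:.6f} patients per cell on average")
```

With three variables there are already 1,000 cells for 8000 patients, and with five variables there are more cells than patients, so most cells of the joint distribution cannot be estimated directly from counts.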

 

Dimensionality Reduction


Logistic Regression


Neural Network Performance

 

Coming Soon

Data Source

MIMIC-III, a freely accessible critical care database. Johnson AEW, Pollard TJ, Shen L, Lehman L, Feng M, Ghassemi M, Moody B, Szolovits P, Celi LA, and Mark RG. Scientific Data (2016). DOI: 10.1038/sdata.2016.35. Available from: http://www.nature.com/articles/sdata201635