Synthetic Data in Data Science

By: Matt Eckman

Introduction

Synthetic data is a relatively new approach to obtaining data for analysis, and it is used in a variety of applications, including product development, research, and software development. It is created by analyzing an original data set and using the patterns in its numeric and text values to generate a new data set. The new data set is statistically similar to the original but different enough that the original data cannot be recreated from it.
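As a rough illustration of the idea, the sketch below fits a very simple model to a small table and samples new rows from it. It is a minimal sketch using only the Python standard library, not a description of how any particular synthetic data tool works. The column names, the sample records, and the fit_model and sample_rows helpers are all hypothetical, and each column is modeled independently, so relationships between columns are not preserved.

```python
import random
import statistics

# Hypothetical original records; the column names are illustrative only.
original = [
    {"age": 34, "plan": "basic", "monthly_spend": 42.0},
    {"age": 51, "plan": "premium", "monthly_spend": 88.5},
    {"age": 27, "plan": "basic", "monthly_spend": 39.9},
    {"age": 45, "plan": "premium", "monthly_spend": 91.2},
    {"age": 38, "plan": "basic", "monthly_spend": 47.3},
]


def fit_model(rows):
    """Summarize each column: mean and standard deviation for numeric
    columns, observed value counts for text columns."""
    model = {}
    for column in rows[0]:
        values = [row[column] for row in rows]
        if all(isinstance(v, (int, float)) for v in values):
            model[column] = ("numeric", statistics.mean(values), statistics.stdev(values))
        else:
            categories = sorted(set(values))
            weights = [values.count(c) for c in categories]
            model[column] = ("category", categories, weights)
    return model


def sample_rows(model, n, seed=None):
    """Draw n new rows from the fitted per-column summaries."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        row = {}
        for column, (kind, a, b) in model.items():
            if kind == "numeric":
                # a is the mean, b is the standard deviation
                row[column] = round(rng.gauss(a, b), 1)
            else:
                # a is the list of categories, b is their observed counts
                row[column] = rng.choices(a, weights=b, k=1)[0]
        rows.append(row)
    return rows


model = fit_model(original)
synthetic = sample_rows(model, n=1000, seed=42)
print(synthetic[:3])
```

Because the fitted model can be sampled as often as needed, five original rows yield 1,000 synthetic ones here. Real tools go further and also model how columns relate to one another.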

Applications of Synthetic Data

We are seeing many use cases in which synthetic data adds value and helps accomplish specific business goals. Below are some examples.

Build Viable Volumes of Data

Many data science projects suffer from an insufficient volume of data to draw meaningful conclusions. Synthetic data makes it possible to generate more records than would otherwise be available for these projects.

Product Development

Designing and building a software product requires many rounds of feedback, and when the application needs data in order to work, this creates a chicken-and-egg problem. Synthetic data allows for early prototypes that can be shown to clients for feedback before production data exists or when production data needs to be kept private.

Conceal Confidential Data For Research

The dream use case for synthetic data is to liberate data sets for research in highly regulated industries such as healthcare or finance. If data sets can be generated that are useful for research while retaining none of the personally identifiable information in the original data set, then companies with important data sets could publish synthetic versions that would help advance research.

Retaining Historic Data in Regulated Environments

Many industries are limited in how long they can retain client data. By modeling that data while it is still available, a synthetic data model can stand in for the historical data in long-term research.

Digitally Twin an Organization’s Data

Synthetic data could be used to digitally twin a company's data, allowing for exciting and otherwise inaccessible opportunities to study and evaluate a potential acquisition partner, test new products, or run data science experiments.

Data for Development and Testing Environments

Synthetic data is well suited to development and testing environments. A synthetic data model can be used to generate data on the fly, which can eliminate the need for database access in your pipelines and local development environments.
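One way to picture this is a small, seeded generator that a test calls instead of querying a database. The sketch below uses only the Python standard library. The generate_orders helper, the field names, and the value ranges are hypothetical stand-ins for a fitted synthetic data model.

```python
import random
from datetime import date, timedelta


def generate_orders(n, seed=0):
    """Return n fake order records. A fixed seed keeps test runs repeatable."""
    rng = random.Random(seed)
    start = date(2023, 1, 1)
    orders = []
    for order_id in range(1, n + 1):
        orders.append({
            "order_id": order_id,
            "customer_id": rng.randint(1, 50),
            "order_date": start + timedelta(days=rng.randrange(365)),
            "amount": round(rng.uniform(5.0, 500.0), 2),
        })
    return orders


def total_by_customer(orders):
    """Example function under test: sum order amounts per customer."""
    totals = {}
    for order in orders:
        key = order["customer_id"]
        totals[key] = totals.get(key, 0.0) + order["amount"]
    return totals


# A test can build its input in memory, with no database connection.
orders = generate_orders(200, seed=123)
totals = total_by_customer(orders)
assert abs(sum(totals.values()) - sum(o["amount"] for o in orders)) < 1e-6
```

Fixing the seed means a failing test can be reproduced with exactly the same records.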

Clinical Trials

Clinical trials, at least in the initial phases where actual data has not arrived yet, could benefit from synthetic data. It would allow trial investigators to begin their research early.

CompassRed’s Uses of Synthetic Data

At CompassRed, synthetic data is enabling rapid development cycles as we solve specific problems for our clients. One of our more cutting-edge clients is working with us to create a prototype analytics solution for their end clients. We set up a demo of the proposed solution using synthetic data to protect the privacy of their end clients, which allows us to gather feedback quickly and allows our client to do business development using the prototype.

Another use case comes from our own product development efforts. We are currently building an in-house product to offer clients: a cost-effective data warehouse. Because we use a Lean approach, we were able to leverage synthetic data to begin development with realistic data. Having a source of data allowed us to create prototypes in-house that we are confident will resonate with customers when we begin our sales efforts with new and existing clients.

Where do I start?

If you want to start working with synthetic data, you can either find a managed tool or start building internal expertise with synthetic data libraries. We chose to build our internal capabilities, so we focused on generating synthetic data with the Python library SDV (the Synthetic Data Vault). This library was created by a team at MIT and continues to grow in functionality. To supplement SDV, we used custom R libraries and Faker, another useful Python library.

New synthetic data companies are being started all the time. At the moment, the ones that look most promising to us are Gretel, Tonic, and Mostly.AI.