Synthetic Data

Synthetic data is a recent, evolving practice in the IT industry, mostly related to testing a product or project.

For most of our testing, we may use a copy of production data or de-identified data for that product.

So what is de-identified data?
It is nothing but masking or stripping off the sensitive parts of the data that actually reveal its source.

More information here:
http://en.wikipedia.org/wiki/De-identification
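
As a rough illustration of the masking idea, here is a minimal Python sketch. The field names, the salt, and the `de_identify` helper are all made up for this example, and replacing values with a salted hash is only one simple way to mask data. It is not a statement of what any regulation requires.

```python
import hashlib

# Hypothetical list of fields treated as sensitive in this example.
SENSITIVE_FIELDS = {"name", "ssn"}


def de_identify(record, salt="test-env-salt"):
    """Return a copy of the record with sensitive fields replaced by tokens."""
    masked = {}
    for key, value in record.items():
        if key in SENSITIVE_FIELDS:
            # Replace the real value with a short token derived from a salted hash.
            digest = hashlib.sha256((salt + str(value)).encode("utf-8")).hexdigest()
            masked[key] = "MASKED_" + digest[:8]
        else:
            # Non-sensitive fields are copied unchanged.
            masked[key] = value
    return masked


# Made-up record, not real data.
record = {"name": "Jane Doe", "ssn": "123-45-6789", "plan": "Gold"}
masked = de_identify(record)
print(masked)
```

The same input always maps to the same token, so relationships between tables survive the masking. That is also why the salt must stay secret and why this alone may not be enough for strongly regulated data.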

So is synthetic data a better alternative to production/de-identified data? Yes.

Using production data for testing might reveal sensitive/secured information to the outside world.
This is not a good practice.

Also, de-identifying data should be done properly. It should not expose any form of real information outside the member's/client's environment.

For example, people working in the healthcare domain should know about HIPAA, the Health Insurance Portability and Accountability Act.
It is about protecting patient/member information. More details here: http://www.hhs.gov/ocr/privacy/

So the de-identifying process should be done carefully, so that it does not expose any form of sensitive data to the real world.
Instead of testing the application with production data or de-identified data, synthetic (artificial) test data is a good alternative.

Also, we can easily attach an expected result to synthetic data.

For example, consider this scenario: an employee appraisal review process.

This is a process with business logic that does some back-end calculations and arrives at a rating for each employee.

I will create my own employee called Employee_1_Test_John_Rating-3.5,
and in the subsequent tables I will enter the records that affect that employee's performance so that the rating is derived as 3.5.

Now I will run the back-end job created by my developer and see whether the employee that I created gets 3.5 as the rating in the
actual result.

By attaching the expected result to all the data that we create, we can easily compare it against the application's actuals.
Otherwise, we have to write a back-end query with the same logic that the developer wrote in order to verify the application results.
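
The comparison can be sketched in Python roughly as follows. This is only an illustration under assumptions: `run_appraisal_job` is a hypothetical stand-in for the developer's back-end job (here it simply averages the review scores, which is not the real business logic), and the second employee and all score values are made up so that the expected ratings work out.

```python
import re


def expected_rating_from_name(name):
    """Read the expected rating embedded in a synthetic employee name."""
    match = re.search(r"Rating-(\d+(?:\.\d+)?)$", name)
    if match is None:
        raise ValueError("no expected rating in name: " + name)
    return float(match.group(1))


def run_appraisal_job(scores_by_employee):
    """Hypothetical stand-in for the real back-end job: averages the scores."""
    return {
        name: sum(scores) / len(scores)
        for name, scores in scores_by_employee.items()
    }


def find_mismatches(actuals, tolerance=1e-9):
    """Return (name, expected, actual) for every rating that differs."""
    mismatches = []
    for name, actual in actuals.items():
        expected = expected_rating_from_name(name)
        if abs(actual - expected) > tolerance:
            mismatches.append((name, expected, actual))
    return mismatches


# Synthetic records chosen so that each employee should be rated as named.
synthetic_scores = {
    "Employee_1_Test_John_Rating-3.5": [3.0, 4.0, 3.5, 3.5],
    "Employee_2_Test_Mary_Rating-4.0": [4.0, 4.0, 4.0],
}

actuals = run_appraisal_job(synthetic_scores)
mismatches = find_mismatches(actuals)
print(mismatches)
```

Because the expected value travels with the data, the check itself never has to reimplement the developer's logic. An empty list of mismatches means every synthetic employee received the rating named in its record.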

This is all about my usage of synthetic data to test my products. More generic details here: http://en.wikipedia.org/wiki/Synthetic_data
