An Introduction to Synthetic Data
Imagine a computer 'teaching' self-driving car software how to drive.
By creating arenas like dense urban environments with lots of turns, curbs (RIP my old Mazda CX-9), and cars, the software can be 'taught' without ever leaving a laboratory setting. At this point, over 10 billion simulated/synthetic miles have been driven!
Flight simulators are another great example, as they're commonly used to train pilots before putting anything or anyone at risk in the real world. In fact, synthetic training of jet fighter pilots has reached a level of sophistication where pilots describe real-world flights as feeling 'exactly like the simulator.'
Synthetic data has been around for a while. In the early '90s, Donald Rubin created the first framework for synthetic data by generating anonymous U.S. Census responses from real Census data, producing a synthetic dataset that matched the high-level population statistics of the original.
Use of synthetic data is gaining steam. Accenture notes it as one of the top trends to watch in the life sciences + medtech space, while Gartner predicts that 60% of the data used for AI and analytics projects will be synthetically generated by 2024.
It's a compelling complement to real patient data for a few different reasons.
It alleviates most privacy concerns. Because the patients are generated and entirely fictional, most privacy concerns simply fall away.
- Firms using synthetic data worry much less about HIPAA (or GDPR, for that matter) compliance. Is this what freedom feels like?
It's flexible and expandable. Healthcare data engineers can design datasets for their specific use case and balance demographics to avoid algorithmic bias in the dataset (see the sketch after this list). Data teams can also expand datasets to increase their size and overcome a lack of data (e.g., more miles for self-driving cars or more records for rare patient populations).
- Once they've received the desired dataset, teams can build out models for testing and iteration. Without synthetic datasets, data teams would have to rely on rigid real-world data and spend significant time and resources just to access it. Even then, firms often can't source enough data for patient populations that are scarce to begin with.
It's accessible and economical. Since synthetic data is procedurally generated, the only major cost is training the generative model – a compute-intensive process – which means the cost of accessing synthetic data is orders of magnitude lower than that of actual data.
- Sourcing real data, in contrast, is expensive. For instance, getting data from a single patient in a clinical trial can cost upwards of $20,000, while licensing real-world datasets can run from $100,000 into the millions of dollars.
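As a toy illustration of what 'balancing demographics' can look like in practice, here's a minimal Python sketch that oversamples an underrepresented group in a synthetic cohort. The column names and group labels are made up, and real tooling would be far more nuanced.

```python
# A minimal sketch (not any vendor's actual tooling) of rebalancing a synthetic
# cohort so an underrepresented group isn't drowned out during model training.
# The column names and group labels here are hypothetical.
import pandas as pd

def rebalance(cohort: pd.DataFrame, group_col: str, seed: int = 0) -> pd.DataFrame:
    """Oversample each group up to the size of the largest one."""
    target = cohort[group_col].value_counts().max()
    parts = []
    for _, group in cohort.groupby(group_col):
        # Sample with replacement so small groups can be expanded.
        parts.append(group.sample(n=target, replace=True, random_state=seed))
    return pd.concat(parts, ignore_index=True)

# Toy synthetic cohort: 90 "common" patients, 10 "rare" patients.
cohort = pd.DataFrame({
    "patient_id": range(100),
    "condition": ["common"] * 90 + ["rare"] * 10,
})
balanced = rebalance(cohort, "condition")
print(balanced["condition"].value_counts())  # 90 of each group
```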
Synthetic data is generated entirely by machine learning and natural language processing algorithms (buzzwords, I know), but again, it's grounded in real-world data. Both approaches are used to create synthetic data, though over time firms are leaning toward natural language processing for this purpose.
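For a rough sense of the mechanics, here's a minimal sketch that fits a simple generative model (a Gaussian mixture, standing in for the far more sophisticated models vendors actually use) to some stand-in numeric patient data and samples brand-new records that match its high-level statistics.

```python
# A minimal sketch, assuming tabular numeric data: fit a simple generative model
# to real-looking records and sample brand-new synthetic ones. The column choices
# and values are made up for illustration.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(42)
# Stand-in for a real dataset: [age, systolic_bp, cholesterol] for 500 patients.
real = np.column_stack([
    rng.normal(55, 12, 500),   # age
    rng.normal(125, 15, 500),  # systolic blood pressure
    rng.normal(200, 30, 500),  # total cholesterol
])

# "Train the generator" on the real records...
model = GaussianMixture(n_components=3, random_state=0).fit(real)
# ...then sample synthetic patients that follow the same statistics
# without corresponding to any actual person.
synthetic, _ = model.sample(1000)

print(real.mean(axis=0))       # high-level stats of the real data
print(synthetic.mean(axis=0))  # should be close, but no row is a real patient
```

The synthetic rows preserve the aggregate statistics while no individual row corresponds to a real patient, which is exactly the property Rubin was after with his census work.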
The Current State of Data in Healthcare
Healthcare has always experienced a conflict between the benefits of data and the importance of patient privacy.
Accessing healthcare data for commercial and even academic use is notoriously difficult and expensive. For instance, getting approved to access and use health system data can take up to 24 months, and patient-level data can cost hundreds of thousands of dollars or more depending on scope. These roadblocks stifle progress.
We're still in the second inning when it comes to the sophistication and application of data in healthcare. Privacy issues, a lack of data standards, and data silos have all hamstrung progress and innovation, but this dynamic is rapidly evolving, especially given the emergence of generative AI products.
Despite these challenges, more data is being used than ever before, with platforms like Snowflake and AWS racing to provide tools and capture the potential of this information. The rise of cloud computing capabilities is unlocking the pent-up demand to enable more sophisticated data analytics and quicker product development.
Synthetic data has the potential to solve a lot of problems associated with data access in healthcare.
Synthetic data is data created by a computer program, derived from real data or generated algorithmically. It imitates real data and can stand in for real datasets across a range of applications.
Although synthetic data cannot completely replace real data in every situation, privacy-preserving synthetic data is a valuable addition that lets researchers, engineers, and developers work more effectively across many stages: early feasibility and exploration, product development, scenario planning, and model training. That iteration allows for better fine-tuning of final products before they're tested or implemented against harder-to-secure, more expensive real data.
Synthetic data is much more flexible than real patient data for product development and research purposes. It solves many of the traditional complexities associated with using healthcare data while minimizing privacy concerns.
Challenges with Real-World Data in Healthcare
Healthcare is notoriously slow-moving and a lot of this lack of progress stems from data practices (and fax machines, of course). While policy and access are progressing, there are still a number of issues hampering innovation:
Privacy issues: Healthcare data breaches hit an all-time high in 2021, affecting 45 million people. With the recent Supreme Court abortion decision and its expected fallout, patients and consumers are more wary than ever about privacy protections, even more so after Meta and several hospitals were sued for collecting sensitive healthcare data and using it for targeted advertising.
The hyper-sensitivity around patient data privacy is abundantly clear.
Compliance / HIPAA: Using healthcare data requires stringent measures and lots of red tape. To be HIPAA compliant, healthcare data must be de-identified through one of the following methods under the Privacy Rule (a rough sketch of the Safe Harbor approach follows below):
- Safe Harbor, a complete redaction of the 18 categories of identifiers that make up protected health information ("PHI") – AKA, all of the useful information like ages, dates, locations, etc.; or
- Expert Determination ("ED"), which requires a partial redaction of the data, after which an expert determines (get it lol) whether the data is appropriate to share. ED is problematic since no universal, explicit standard exists for healthcare data sharing.
Any use of healthcare data, whether for commercial or research purposes, has to be extremely secure, which limits an organization's ability to test products or accelerate research projects and collaborations.
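To make the Safe Harbor idea concrete, here's a rough sketch of identifier stripping. The field names are hypothetical, cover only a handful of the 18 identifier categories, and this is an illustration rather than a compliance tool.

```python
# A rough sketch of Safe Harbor-style redaction: strip identifier fields from a
# patient record before sharing. The field names below are hypothetical and cover
# only a few of the 18 HIPAA identifier categories; this is not a compliance tool.
SAFE_HARBOR_FIELDS = {
    "name", "street_address", "zip_code", "birth_date", "admission_date",
    "phone", "email", "ssn", "mrn",  # mrn = medical record number
}

def redact(record: dict) -> dict:
    """Return a copy of the record with identifier fields removed."""
    return {k: v for k, v in record.items() if k not in SAFE_HARBOR_FIELDS}

record = {
    "name": "Jane Doe",
    "birth_date": "1984-03-02",
    "zip_code": "02139",
    "diagnosis_code": "E11.9",   # type 2 diabetes
    "hba1c": 7.2,
}
print(redact(record))  # {'diagnosis_code': 'E11.9', 'hba1c': 7.2}
```

Notice how much analytically useful signal (ages, dates, locations) leaves with the identifiers, which is exactly why synthetic patients, who never existed in the first place, are so appealing.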
Data Complexity: Different data standards and formats exist. Databases are inconsistent and lack normalized structures. Valuable engineering time is wasted on the idiosyncrasies involved with traditional healthcare data. As the Tuva Project puts it, "Compared to other disciplines, doing healthcare data engineering and data science requires a tremendous amount of domain knowledge."
Incumbent Status Quo Issues: Data infrastructures at provider organizations are closed and by default do not communicate with one another. Silos of data exist across organizations. Pricing is prohibitive for new entrants and favors incumbents in its current form.
Lack of Data Representation: Not only is data access broadly difficult, but healthcare also suffers from a lack of data representing diverse populations. Especially in the current health tech boom, many groups are underrepresented in the data used to train AI/ML models. For example, currently available datasets often lack enough rare disease patients to support effective predictive modeling, meaning the model's real-world impact will be subpar and insufficient for the very patients it is meant to help.
Use Cases for Synthetic Data in Healthcare
In the same way that a fighter pilot trains in a simulated environment, healthcare organizations can harness synthetic data to validate and iterate on clinical workflows or to set baselines for drug development in clinical trials (or even to discover which treatments work better than others).
Some specific use cases:
Digital Health and Interoperability: A digital health company building interoperability infrastructure is leveraging synthetic data to build and test its offerings first in a non-HIPAA environment. Using synthetic data here reduces development costs and the risks that come with building products on real data.
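As a concrete (and entirely hypothetical) example of what that test data might look like, here's a minimal sketch that generates throwaway patient records loosely shaped like FHIR Patient resources; the helper, names, and field choices are assumptions for illustration.

```python
# A minimal sketch of generating throwaway, FHIR-style Patient resources for
# testing an interoperability pipeline outside a HIPAA environment. The shape
# loosely follows a FHIR Patient resource; the helper and values are hypothetical.
import json
import random
import uuid

FIRST_NAMES = ["Alex", "Sam", "Jordan", "Taylor"]
LAST_NAMES = ["Rivera", "Chen", "Okafor", "Nguyen"]

def synthetic_patient() -> dict:
    """Build one fake Patient resource with no relationship to any real person."""
    return {
        "resourceType": "Patient",
        "id": str(uuid.uuid4()),
        "name": [{"family": random.choice(LAST_NAMES),
                  "given": [random.choice(FIRST_NAMES)]}],
        "gender": random.choice(["male", "female", "other", "unknown"]),
        "birthDate": f"{random.randint(1940, 2015)}-{random.randint(1, 12):02d}-01",
    }

# Generate a small test bundle and feed it to whatever pipeline is under test.
bundle = [synthetic_patient() for _ in range(3)]
print(json.dumps(bundle, indent=2))
```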
Life Sciences, Real-World Evidence and Clinical Trial Design: A global pharma company used synthetic data to access EU partner datasets in order to improve and accelerate real-world evidence (RWE) research and health economics and outcomes research. Beyond RWE, there are several use cases for synthetic data in clinical trial design, particularly in designing trial eligibility criteria and intervention/control arms. There's also an appetite to use synthetic patient-level data for commercial use cases, such as post-launch market surveillance for label expansion and diagnostic/risk-stratification algorithms to identify under-treated patients. As we all know from those annoying cookie notifications in browsers, GDPR is notoriously strict, which makes synthetic data even more valuable in EU nations.
Academic Medical Centers and Research and Education: A top academic university created a synthetic version of its EHR dataset to enable more secure research and mitigate privacy risks with less IRB oversight. The data is also being used in the academic setting to teach machine learning classes.
Public Health & Predictive Analytics: Synthetic data can be used to create scenarios to predict outbreaks, patient inflows, and other healthcare trends. This can be especially useful in planning and resource allocation.
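A toy example of that kind of scenario planning: simulate daily emergency department arrivals under a baseline and a hypothetical outbreak surge, then compare the worst days. The arrival rates, the 30-day horizon, and the 40% surge factor are made-up assumptions.

```python
# A toy sketch of scenario planning with simulated data: compare expected daily
# ED arrivals under a baseline vs. a hypothetical outbreak scenario. The rates
# and the surge factor are made-up assumptions for illustration.
import numpy as np

rng = np.random.default_rng(7)
DAYS = 30

baseline = rng.poisson(lam=120, size=DAYS)        # ~120 arrivals/day
surge = rng.poisson(lam=120 * 1.4, size=DAYS)     # hypothetical 40% outbreak surge

print("baseline peak day:", baseline.max())
print("outbreak peak day:", surge.max())
print("extra arrivals on the worst day:", surge.max() - baseline.max())
```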
Medical Imaging: In medical imaging, synthetic data can be used to augment datasets, especially when certain conditions or anomalies are rare. This helps in training better diagnostic models.
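As a simple illustration, the sketch below expands a scarce imaging class with basic flips and rotations; real pipelines often go further and generate fully synthetic images (e.g., with generative models), and the random array here just stands in for an actual scan.

```python
# A minimal sketch of augmenting a scarce imaging class with simple transforms
# (flips and rotations). The random array below stands in for a real scan.
import numpy as np

def augment(image: np.ndarray) -> list[np.ndarray]:
    """Return simple flipped/rotated variants of a 2D image array."""
    return [
        np.fliplr(image),       # horizontal flip
        np.flipud(image),       # vertical flip
        np.rot90(image, k=1),   # 90-degree rotation
        np.rot90(image, k=2),   # 180-degree rotation
    ]

# Stand-in for one rare-condition scan: a random 64x64 grayscale image.
rare_scan = np.random.default_rng(0).random((64, 64))
augmented = augment(rare_scan)
print(f"1 original scan -> {len(augmented)} additional training examples")
```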