Hi Hospitalogists, I'm incredibly excited to share this newsletter with you guys. Over the next 2 weeks, I'll be talking about the innovation side of healthcare with a 2-part deep dive on synthetic data and how it has a ton of potential to change the game in the healthcare data space.
- Part 1 (today) covers the background of the healthcare data problem and what synthetic data actually is.
- Part 2 will be next Thursday, September 15th and covers synthetic data in healthcare, including a startup leading the way in the space - Syntegra - and what sets the firm apart in how it's helping to solve the long-standing data access problem.
When Part 2 is published, I'll also publish both parts online as one nice, tidy 3,000+ word essay that you can share with all of your friends, lovers, and colleagues. Let's dive in!

SYNTHETIC DATA IN HEALTHCARE

Key Takeaways and Core Thesis

- In healthcare, there has historically existed a fundamental conflict between the potential impact and utility of data and the need to preserve patient privacy.
- Accessing healthcare data for commercial and even academic use is notoriously difficult and expensive. For instance, accessing and getting approved to use health system data could take up to 24 months. Patient-level data can cost hundreds of thousands of dollars or more depending on scope. These roadblocks stifle progress.
- We're in the second inning when it comes to the sophistication and application of data in healthcare. Privacy issues, lack of data standards, and data silos have all hamstrung progress and innovation.
- Despite these challenges, more data is being used than ever before, with platforms like Snowflake and AWS racing to provide tools and capture the potential of this information. The rise of cloud computing capabilities is unlocking the pent-up demand to enable more sophisticated data analytics and quicker product development.
- While synthetic data won't replace real data for all cases, privacy-preserving synthetic data is an excellent complement that allows researchers and builders to work more efficiently through early-stage feasibility and exploration, product development, scenario planning, and model training prior to fine-tuning the final product with more sensitive real data.

Current Healthcare Data Challenges

Healthcare is notoriously slow-moving and a lot of this lack of progress stems from data practices (and fax machines, of course).
While policy and access are progressing, there are still a number of issues hampering innovation:

Privacy issues: Healthcare data breaches hit an all-time high in 2021, affecting 45 million people. With the recent Supreme Court abortion decision and expected fallout, patients and consumers are more wary than ever about privacy protections, even more so after Meta and hospitals were sued for collecting sensitive healthcare data and targeting advertising based on that protected data. The hyper-sensitivity around patient data privacy is abundantly clear:
- Google has gotten into several patient data privacy scuffles over the years with both Project Nightingale and its work with UChicago. Although the firm worked with both Ascension and UChicago under a legal agreement, the public damage was already done.
- HIPAA violations are rampant among provider organizations as fines stack up.

Compliance / HIPAA: Data requires stringent measures involving lots of red tape. To be HIPAA compliant, healthcare data specifically requires patient de-identification through one of the following methods under the Privacy Rule:
- Safe Harbor, a complete redaction of the 18 data fields containing protected health information ("PHI") - AKA, all of the useful information like age, dates, locations, etc.; or
- Expert Determination ("ED"), which requires a partial redaction of data, then an expert determines (get it lol) whether the data is now appropriate to share. ED is problematic since no universal explicit standard exists for healthcare data sharing.
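
To make the Safe Harbor idea a little more concrete, here's a rough Python sketch. The record layout and field names are hypothetical, and it only touches a handful of the 18 identifier categories (dropping direct identifiers, then coarsening dates, very old ages, and ZIP codes), so treat it as an illustration of why the leftover data is less useful and not as a compliance tool.

```python
# Illustrative only: field names are hypothetical and this covers just a few
# of the 18 Safe Harbor identifier categories. Not a compliance tool.

# Direct identifiers that get dropped outright.
DIRECT_IDENTIFIERS = {"name", "ssn", "phone", "email", "street_address", "mrn"}


def safe_harbor_sketch(record: dict) -> dict:
    """Return a copy of `record` with identifying detail stripped or coarsened."""
    out = {k: v for k, v in record.items() if k not in DIRECT_IDENTIFIERS}

    # Dates tied to the patient are reduced to the year.
    if "admit_date" in out:  # expects an ISO string like "2021-03-14"
        out["admit_year"] = out.pop("admit_date")[:4]

    # Ages above 89 are lumped into a single "90+" bucket.
    if "age" in out and out["age"] > 89:
        out["age"] = "90+"

    # ZIP codes keep only the first three digits (the real rule also zeroes
    # out three-digit areas with small populations; skipped here).
    if "zip" in out:
        out["zip3"] = out.pop("zip")[:3]

    return out


patient = {
    "name": "Jane Example",
    "ssn": "000-00-0000",
    "age": 93,
    "zip": "78722",
    "admit_date": "2021-03-14",
    "diagnosis": "I10",
}
cleaned = safe_harbor_sketch(patient)
print(cleaned)
```

Notice how much analytic detail disappears along the way: exact dates, exact ages for older patients, and precise location are all gone.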
Any use of healthcare data, whether for commercial or research purposes, has to be extremely secure, which limits an organization's ability to test products or accelerate research projects and collaborations.

Data Complexity: Different data standards and formats exist. Databases are inconsistent and lack normalized structures. Valuable engineering time is wasted on the idiosyncrasies involved with traditional healthcare data. As the Tuva Project puts it, "Compared to other disciplines, doing healthcare data engineering and data science requires a tremendous amount of domain knowledge."

Incumbent Status Quo Issues: Data infrastructures at provider organizations are closed and by default do not communicate with one another. Silos of data exist across organizations. Pricing is prohibitive for new entrants and favors incumbents in its current form.

Lack of Data Representation: Not only is data access broadly difficult, but healthcare also suffers from a lack of data representing diverse populations. Especially in the current health tech boom, many groups are underrepresented in the data used to train AI/ML models. For example, currently available datasets often do not have enough representation of rare disease patients to allow for effective predictive modeling, meaning the model's impact in a real-world setting will be subpar and insufficient for the patients it is meant to help.

I'll be discussing how synthetic data addresses these healthcare-specific problems in Part 2. But first, let's talk about what synthetic data even is to set the stage.

Realistic but not real: Synthetic data looks and acts like real data. It reflects the statistical properties of an underlying real dataset or multiple datasets. At the same time, synthetic data is completely fictional and does not contain any actual patient information.
- It's 100% developed by machine learning and natural language processing algorithms (buzzwords, I know) but again, entirely based on real-world data. Although both are used to create synthetic data, firms are over time leaning toward natural language processing for this purpose (Syntegra uses natural language processing).
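
If "reflects the statistical properties of a real dataset without containing any real rows" sounds abstract, here's a toy Python sketch of the idea. To be clear about the assumptions: the "real" data is made up, the two columns (age and systolic blood pressure) are hypothetical, and the generator is a plain bivariate normal fit, which is far simpler than the machine learning and NLP approaches described above. It only shows the principle of learning the statistics and then sampling brand-new records.

```python
import math
import random
import statistics

random.seed(7)  # fixed seed so the sketch is repeatable

# Toy stand-in for a "real" dataset: (age, systolic blood pressure) pairs.
# These numbers are made up for illustration.
real = []
for _ in range(500):
    age = random.gauss(55, 12)
    bp = 90 + 0.6 * age + random.gauss(0, 8)
    real.append((age, bp))


def fit(rows):
    """Learn the means, spreads, and correlation of two columns."""
    xs, ys = zip(*rows)
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sx, sy = statistics.stdev(xs), statistics.stdev(ys)
    cov = sum((x - mx) * (y - my) for x, y in rows) / (len(rows) - 1)
    return mx, my, sx, sy, cov / (sx * sy)


def sample(params, n):
    """Draw brand-new rows from a bivariate normal with the learned stats."""
    mx, my, sx, sy, rho = params
    rows = []
    for _ in range(n):
        z1, z2 = random.gauss(0, 1), random.gauss(0, 1)
        x = mx + sx * z1
        y = my + sy * (rho * z1 + math.sqrt(1 - rho**2) * z2)
        rows.append((x, y))
    return rows


params = fit(real)
synthetic = sample(params, 500)
synthetic_params = fit(synthetic)

print("mean age, real vs synthetic:", round(params[0], 1), round(synthetic_params[0], 1))
print("correlation, real vs synthetic:", round(params[4], 2), round(synthetic_params[4], 2))
```

The synthetic rows track the averages and the age-to-blood-pressure relationship of the original, yet none of them is copied from a "real" patient.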
For you movie fans out there, I'd compare synthetic data to living in a simulation like the Matrix or an Inception-like dream. AKA, it's so realistic that you can't tell the difference between the simulation/dream and actual reality. In the same way, the data generated doesn't actually exist in our world, but it still produces actionable insights that apply to the real world.

Here's another great synthetic data analogy: imagine a computer 'teaching' self-driving car software how to drive. By creating arenas like dense, urban environments with lots of turns, curbs (RIP my old Mazda CX-9) and cars, the software can be 'taught' without ever actually leaving a laboratory setting. At this point, over 10 billion simulated/synthetic miles have been driven!

Flight simulators are another great example, as they're commonly used to train pilots prior to putting anything or anyone at risk in the real world. In fact, synthetic training of jet fighter pilots reached the point of sophistication where pilots describe real world environments as flying 'exactly like the simulator.'

Synthetic data has been around for a while. In the early '90s, Donald Rubin created the first framework for synthetic data by generating a dataset of anonymous U.S. census responses based on real Census data. In doing so, he successfully created a new synthetic dataset that matched high-level population statistics of the real census data.

Use of synthetic data is gaining steam. Accenture notes it as one of the top trends to watch in the life sciences + medtech space, while Gartner predicts that 60% of the data used to develop AI and analytics projects will be synthetically generated by 2024. It's a compelling complement to real patient data for a few different reasons.

It alleviates most privacy concerns. Because the patients in the dataset are generated and fictional, privacy concerns are largely obsolete. Firms using synthetic data worry much less about HIPAA (or GDPR, for that matter) compliance. Is this what freedom feels like?

It's flexible and expandable.
Healthcare data engineers can design datasets for their specific use case and balance demographics to avoid algorithmic bias in the dataset. Data teams can expand the datasets to increase their size and overcome a lack of data (e.g., more miles for self-driving cars or more types of rare patients).
- Once they've received the desired dataset, teams can build out models for testing and iteration. Without synthetic datasets, data teams would have to rely on rigid real-world data and expend lots of time and resources to access it. Further, firms still wouldn't be able to access enough data for specific patient populations with limited available data.
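
Here's a small Python sketch of what "balance and expand" can look like. Everything in it is an assumption for illustration: the lab values are invented, the group labels are hypothetical, and the per-group generator is a simple normal fit standing in for a real synthetic data engine. One caveat worth keeping in mind is that sampling more rows from a small group doesn't create information that wasn't in the original 50 records.

```python
import random
import statistics

random.seed(11)  # fixed seed so the sketch is repeatable

# Made-up "real" lab values, heavily skewed toward the common condition.
real = {
    "common_condition": [random.gauss(100, 15) for _ in range(950)],
    "rare_condition": [random.gauss(140, 10) for _ in range(50)],
}


def expand_balanced(groups, rows_per_group):
    """Fit a normal to each group, then sample the same number of rows from each."""
    balanced = {}
    for label, values in groups.items():
        mu, sigma = statistics.fmean(values), statistics.stdev(values)
        balanced[label] = [random.gauss(mu, sigma) for _ in range(rows_per_group)]
    return balanced


synthetic = expand_balanced(real, rows_per_group=1000)

for label in real:
    print(label, "real rows:", len(real[label]), "synthetic rows:", len(synthetic[label]))
```

The real data is split 950 to 50, while the synthetic version gives a downstream model an even 1,000 rows per group to train on.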

It's accessible and economical. Since synthetic data is procedurally generated, the only big cost involved is training the generative model on the underlying dataset, a compute-intensive process, which means that the cost to access synthetic data compared to actual data is orders of magnitude lower.
- Like I mentioned before, sourcing real data in contrast is hella expensive. For instance, getting data from a single patient in a clinical trial can cost upward of $20,000, while licensing real-world data can cost anywhere from $100,000 into the millions of dollars.

Conclusion: Synthetic Data is Just Getting Started.

Hopefully this gives you a solid framework and understanding of what synthetic data is and why life sciences firms, payors, digital health orgs, and providers need a better, more practical solution to current data practices & challenges in the space.

Next week I'll be diving into how Syntegra solves these pain points as one of the first movers and pioneers in the synthetic data space, and why they're positioned to win big. Along with diving into Syntegra's business model, Part 2 will discuss current use-cases for synthetic data in healthcare, and, in true Blake Madden, Hospitalogy fashion and for the sake of a good argument, the bear case & challenges facing synthetic data.

Thanks for reading!

Here are some jobs that I'm curating for the healthcare industry. Use this link to submit your role to be featured if you're looking to hire.

Exclusive Hospitalogy Role: Founding Head of Product at Stealth Startup
Frederik Mueller is looking for a Head of Product to help take the lead on the startup's technology platform build, develop a product vision and roadmap, and build a technology organization. This is an incredibly unique opportunity to work with a new MSO care platform model that already has publicly traded company backing.

Senior Software Engineer, Syntegra
Syntegra is looking for a senior Software Engineer / Architect to help scale its synthetic data engine and platform. The role will be responsible for all parts of the stack including scaling the data prep for the deep learning algorithm, parallelizing and optimizing code across training and inference time, and building APIs to run the system and access synthetic data.

Machine Learning Engineer, Syntegra
Syntegra is ALSO actively looking for a senior MLE responsible for building, training, testing, and debugging advanced Deep Learning Models.

Thanks for the read! Let me know what you thought by replying back to this email.
- Blake