From collection to activation

Much of the most valuable clinical detail lives in formats a data lake struggles to reach: physician notes, imaging reports, and discharge summaries. A 2025 study in the Journal of Medical Internet Research that analyzed 1.8 million primary care records found that only 13% of the clinical concepts captured in free-text notes had matching counterparts in the structured data of those same records.

Even structured records have meaningful gaps. For example, a 2022 study in the Journal of the American Medical Informatics Association found that only 59.4% of chronic conditions were consistently captured across encounter diagnoses and problem lists within a network of more than 500 community health centers.

Activation of data, rather than aggregation alone, is the new differentiator. Yet many of these data repositories have quietly turned into what could be accurately described as “data swamps.” They are vast, deep, and nearly impossible to navigate without the right clinical lens applied on top. The real challenge facing health data platform companies and health system data teams is what to do with all of it.

Fragmentation is the deeper problem

Even when individual data points are accurate, they often arrive disconnected from the clinical context that gives them meaning. A laboratory result without its associated problem, a medication without its indication, or a diagnosis without its supporting evidence cannot answer the questions clinicians actually ask. Data quality work alone produces tidier data, which falls short of the ultimate goal of creating clinical understanding.

Three capabilities missing from most data lakes

The first is data extraction. A significant share of clinically relevant information lives in PDFs, scanned documents, free-text notes, and discharge summaries. Without natural language processing (NLP) tools that can convert that narrative content into reliable, coded, structured data, the most clinically rich material in any patient record remains inaccessible to downstream analytics, reporting, and AI systems.

The second is a clinical lens. Clinicians do not think in data tables. Rather, they think by problem, such as the status of any given patient’s heart failure, diabetes, or recent surgical recovery. Activating a data lake requires the ability to filter, organize, and present information by problem, by specialty, and by relevance to the decision at hand.

The third is scrubbing. Tools that resolve duplicates, reconcile conflicting records, normalize terminology, replace invalid or retired codes, and validate diagnoses against supporting evidence allow the underlying data foundation to be trusted.

By MedCity Influencer David Lareau