By Jonathan D. Gold, MD MHA MSc FAMIA FHIMSS
May 13, 2024
Challenge yourself:
What considerations are essential when comparing health timelines from multiple patients?
What roles can artificial intelligence play in enhancing healthcare data interpretation and analysis?
What factors must be addressed in constructing predictive models for patient outcomes and clinical trajectories?
Current utilization of large healthcare databases focuses mainly on shared access to patient medical data, billing, and such critical strategic business concerns as quality assurance, regulatory compliance and population health. Clearly, robust stores of medical data provide an exceptional opportunity for advanced clinical decision support at the point of care and real-world clinical research. Matching multiple patient characteristics enables patient-specific decision support and customized, precision medicine—medical decisions tailored to an individual.
Up to this point we have focused on a single patient. Our initial presentation considered a single event from a single note about one patient.[1] Then we discussed reconciliation of accounts from multiple sources regarding a single event for one patient.[2] A patient’s health timeline was constructed to connect multiple events captured from one or more sources for a single patient. In due course, we should reflect on how to compare and utilize health timelines from multiple patients. The value in doing this could be immense, providing highly specific comparisons and guidance for exceptionally similar patients—enabling predictive modeling.[3]
Edward Tufte[4] has pointed out that highly detailed health data becomes fragile when we attempt to compare patients; that is, interpretable results become increasingly elusive as more detail is added to the comparison. Temporality adds specificity to a patient’s record that is crucial, but it also greatly increases the volume of data to be matched. Acknowledging Tufte’s point, we must weigh extremely specific data, which identify only a few patients comparable to our index patient, against a more “normalized” patient who can be compared to a larger number of similar “simplified” patients.
Predictive Modeling
Although how predictive model construction and data analytics will be performed cannot yet be fully defined, we can detail many of the important questions and considerations that will need to be covered:
Inclusion Criteria and Data Concordance
What inclusion and exclusion criteria should be considered?
How many variables should be allowed?
Is the number of variables dependent on a minimum cohort size available for an index patient?
What level of concordance will be needed between patients to include them as “doppelgangers” for an index patient? There is a need to build an algorithm to determine concordance and discordance between the index patient and each “doppelganger”.
Which data included in one record but not found in a second record invalidate the comparison of the two records?
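One way to make the concordance question concrete is a set-overlap score. The sketch below is only an illustration, not a method proposed in this series. It assumes each patient has already been reduced to a set of normalized feature labels, and the names PatientSummary, concordance, discordant_features and is_comparable, as well as the label strings, are hypothetical. It uses the Jaccard index (shared features divided by all features found in either record) as one simple measure among many possible ones.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PatientSummary:
    """Hypothetical, already-normalized view of one patient's record."""
    patient_id: str
    features: frozenset  # e.g. {"dx:type2_diabetes", "rx:metformin"}


def concordance(index: PatientSummary, candidate: PatientSummary) -> float:
    """Jaccard index: shared features / all features in either record."""
    union = index.features | candidate.features
    if not union:
        return 0.0  # two empty records give no evidence of similarity
    return len(index.features & candidate.features) / len(union)


def discordant_features(index: PatientSummary, candidate: PatientSummary) -> frozenset:
    """Features found in exactly one of the two records."""
    return index.features ^ candidate.features


def is_comparable(index: PatientSummary, candidate: PatientSummary,
                  disqualifying) -> bool:
    """False if any one-sided feature is on an (assumed) disqualifying list."""
    return not (discordant_features(index, candidate) & frozenset(disqualifying))


index_patient = PatientSummary(
    "index", frozenset({"dx:type2_diabetes", "rx:metformin", "dx:hypertension"}))
doppelganger = PatientSummary(
    "p2", frozenset({"dx:type2_diabetes", "rx:metformin", "dx:asthma"}))

print(concordance(index_patient, doppelganger))   # 2 shared of 4 total -> 0.5
print(is_comparable(index_patient, doppelganger, {"dx:asthma"}))  # False
```

Which features belong on the disqualifying list, and what score is high enough to call a patient a “doppelganger”, are exactly the open questions listed above; the code only shows where those decisions would plug in.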
Normalization and Data Bucketing
What degree of event granularity will be needed for comparing patients?
What type of information and how much should be grouped rather than using specific data points for a patient?[5]
How should we bucket the data to create usable groups and still retain high definition? Event type, normalized diagnoses, time ranges, age ranges?
What are reasonable-sized clumps of information to generate meaningful comparisons?
What time ranges (not just points in time) for events should be considered—including both the interpreted date ranges and sequences of when events may have occurred[6] and actual groupings of time ranges?[7]
How will sequencing be affected when dates include upper and lower limits?[8]
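The time-range questions above can be sketched in a few lines. This is a minimal illustration under stated assumptions: an event is stored as a point in time with a symmetric margin, the four groups are the uneven ones used as an example in note [7], and the names FuzzyEvent, TIME_GROUPS, groups_for_range and gap_bounds are hypothetical. A half-year margin is taken as 182.5 days.

```python
import math
from dataclasses import dataclass

# Example groups from note [7]; each covers lower < years <= upper.
# The first lower bound is -inf so that "0 years ago" is included.
TIME_GROUPS = [
    ("0-1 years", -math.inf, 1.0),
    (">1-5 years", 1.0, 5.0),
    (">5-10 years", 5.0, 10.0),
    (">10 years", 10.0, math.inf),
]


def groups_for_range(lower_years: float, upper_years: float) -> list:
    """Return every group that the range [lower_years, upper_years] touches."""
    return [label for label, lo, hi in TIME_GROUPS
            if lower_years <= hi and upper_years > lo]


@dataclass(frozen=True)
class FuzzyEvent:
    """An event as a point in time (in days) plus or minus a margin (in days)."""
    name: str
    point: float
    margin: float


def gap_bounds(a: FuzzyEvent, b: FuzzyEvent) -> tuple:
    """Shortest and longest possible interval from a to b, in days.

    A negative shortest interval means b may have occurred before a.
    """
    shortest = (b.point - b.margin) - (a.point + a.margin)
    longest = (b.point + b.margin) - (a.point - a.margin)
    return shortest, longest


# An event reported "five years ago" with limits of 4.5 and 5.5 years.
print(groups_for_range(4.5, 5.5))  # ['>1-5 years', '>5-10 years']

# Two events 364 days apart, each known only to the year.
event_a = FuzzyEvent("A", point=0.0, margin=182.5)
event_b = FuzzyEvent("B", point=364.0, margin=182.5)
print(gap_bounds(event_a, event_b))  # (-1.0, 729.0)
```

The second result reproduces the situation described in note [8]: the events may be almost two years apart, or Event B may precede Event A by one day, so any sequencing rule has to decide how to treat an interval that spans zero.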
Query Construction
What is the question to be queried? Using a literature search or other reviews of research may help formulate the question.
What should be the order of priority to include information in a query? Demographics, genomics, events, key functions (renal, hepatic), location?
Reviewing the results of the initial query may help modify it to yield more general or more specific results.
Query Limitations
What are the requirements to construct cohorts dynamically around an index patient?
What are the limitations regarding demographics for predictive modeling and age requirements for finding “similar” patients?
How do we identify patients who have had roughly the same sets of events up until now and are now older than the index patient?
In predictive modeling, the cohort requires patients who experienced similar events at an age similar to the index patient’s and who are now older than the index patient, so that their subsequent course can help anticipate factors that may affect the index patient’s clinical trajectory.
Are there showstoppers that would remove a patient from a cohort?
Are there pool sizes for a cohort that would be below an acceptable threshold?[9]
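The limitations above can be read as a sequence of filters applied around the index patient. The following is a rough sketch under loud assumptions: the Candidate fields, the build_cohort function and every threshold (minimum score, minimum years of follow-up, minimum cohort size) are hypothetical placeholders, and the concordance score is assumed to have been computed elsewhere.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Candidate:
    """Hypothetical candidate record for cohort membership."""
    patient_id: str
    current_age: float          # years
    score: float                # concordance with the index patient, 0..1
    showstopper: bool = False   # any finding that rules the patient out


def build_cohort(index_age: float,
                 candidates: List[Candidate],
                 min_score: float = 0.8,
                 min_follow_up_years: float = 1.0,
                 min_cohort_size: int = 20) -> Optional[List[Candidate]]:
    """Return the cohort, or None if it falls below the size threshold.

    All default thresholds are illustrative assumptions, not recommendations.
    """
    cohort = [
        c for c in candidates
        if not c.showstopper
        and c.score >= min_score
        # The candidate must now be older than the index patient so that
        # their later course can inform the index patient's trajectory.
        and c.current_age >= index_age + min_follow_up_years
    ]
    if len(cohort) < min_cohort_size:
        return None  # too small to be meaningful, or risks re-identification
    return cohort


pool = [
    Candidate("p1", current_age=63.0, score=0.91),
    Candidate("p2", current_age=58.0, score=0.95),   # not yet older: dropped
    Candidate("p3", current_age=70.0, score=0.60),   # too dissimilar: dropped
    Candidate("p4", current_age=66.0, score=0.88, showstopper=True),
    Candidate("p5", current_age=61.5, score=0.84),
]
result = build_cohort(index_age=58.25, candidates=pool, min_cohort_size=2)
print([c.patient_id for c in result])  # ['p1', 'p5']
```

Returning nothing when the pool is too small, rather than a tiny cohort, reflects the concern in note [9] that a very small group could make individuals identifiable.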
Figure 1 presents one possible process for constructing a predictive model around an index patient to provide clinical guidance when determining a plan of action.

Figure 1: Predictive Modeling Construction
Determining a reasoned and practical methodology for indexing patient histories, defining goodness-of-fit between patients and acceptable characteristic limits for patient histories, and ensuring safeguards to protect patient privacy will all play roles in the secondary use of data for predictive modeling.
Conclusion
In harnessing the vast potential of healthcare databases for precision medicine, we confront challenges of data interpretation, artificial intelligence integration, and predictive modeling construction. Addressing these complexities requires meticulous attention to inclusion criteria, data normalization, and query construction. By surmounting these obstacles, we can unlock the transformative power of data-driven healthcare, tailoring medical decisions to individual patients with unprecedented precision.
[1] See the blog post: “Temporal Perspectives in Healthcare Records: Understanding Events through Time”.
[2] See the blog post: “Unifying Events: Event Linking and Temporal Normalization in Healthcare Data”.
[3] Gold JD. Medical Researcher, Meet Mrs. Roseburg. Journal of Healthcare Information Management. Summer 2013, 27(3): 76-8.
[4] Statistician and professor emeritus of political science, statistics, and computer science at Yale University.
[5] For example, the use of “normalization” to less granular diagnoses, normalization of labs to a 5-point scale, medications to generic or group equivalents, demographics and genomic similarities, etc.
[6] For example, if the record provides an account that an event occurred five years ago for a patient who is now 58 years 3 months old, on the health timeline we capture the point in time as exactly five years prior to the date stamp of entry (at age 53 years 3 months) with upper and lower limits of five and a half and four and a half years prior, respectively. In biographic time, this event occurred when the patient was between the ages of 52 years 9 months and 53 years 9 months.
[7] Grouping time ranges differs from the range used for a particular event on a patient’s record. For instance, when constructing predictive models, the builder may set time values in uneven groups of 0-1 years, >1-5 years, >5-10 years, and >10 years. A range for a particular event may overlap more than one group and therefore fit into one or more categories; in the previous example that would include both the >1-5 years and the >5-10 years groups.
[8] By convention, when sequencing Events A and B, it might be reasonable to compare the distance between their points in time (for example, 364 days between Event A and Event B). In actuality, the distance between the two events can be much blurrier. If both events have ranges measured in years and their points in time fall 364 days apart, the two events could have occurred almost two years apart, or the periods in which they may have occurred could overlap. The overlap might even reverse the sequence, with Event B occurring before Event A by one day. The shortest possible interval places the events closest to each other (the point in time plus the measure delimiter for Event A and the point in time minus the measure delimiter for Event B); the longest possible interval places them furthest apart (the point in time minus the measure delimiter for Event A and the point in time plus the measure delimiter for Event B). The difference between the shortest and longest intervals can lead to confusion around the actual sequence.
[9] Possibly because the cohort group would now contain individuals who could be identifiable (or re-identified).
