Which hospitalized patients are most likely to die? In our new paper, we show that the patients at highest risk of death are often not the sickest patients overall, but rather the sickest of the patients who don't receive aggressive treatment.
For example, serum creatinine is an indicator of kidney dysfunction, so we would expect mortality risk to increase with the level of serum creatinine (below, left). But when we perform a data-driven analysis of real-world medical records, we get a very different picture (below, right). The highest-risk patients are those with serum creatinine at 5 mg/dL, while patients with serum creatinine at 7 mg/dL are the least likely to die.
Why do we see this strange risk curve? Its shape gives us some clues. The curve is concave between 3 mg/dL and 5 mg/dL, with sharp changes in direction at each of these round numbers. This suggests that changes in clinician behavior are triggered by thresholds of 3 and 5 mg/dL, and the remarkably low risk of patients with severely elevated creatinine above 7 mg/dL indicates that effective treatments exist for these patients.
So why does this happen? In short, the observed risk combines the patients' underlying risk with their interaction with an intervention system that has all the quirks and biases of human judgment and action. Clinicians are humans making difficult decisions based on heuristics and thresholds. As it turns out, these heuristics and thresholds affect the quality of care patients receive: patients who are just below the threshold for treatment often have the worst outcomes. The effects are strong enough that we can pick up these patterns with retrospective data-driven analyses.
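A minimal sketch of this mechanism, with made-up numbers for the underlying risk curve, the treatment trigger (here a hypothetical 5 mg/dL protocol threshold), and the treatment's effectiveness:

```python
TREAT_THRESHOLD = 5.0  # hypothetical protocol: treat patients at/above 5 mg/dL
TREAT_EFFECT = 0.5     # hypothetical: treatment cuts mortality risk in half

def underlying_risk(creatinine):
    """Monotone underlying risk: worse kidney function, higher mortality."""
    return min(0.05 + 0.06 * creatinine, 0.9)

def observed_risk(creatinine):
    """Observed risk after the threshold-based treatment protocol is applied."""
    risk = underlying_risk(creatinine)
    if creatinine >= TREAT_THRESHOLD:
        risk *= TREAT_EFFECT  # treated patients do better
    return risk

# Observed risk rises up to the threshold, then drops for treated patients,
# so the highest observed risk sits just below 5 mg/dL, even though the
# underlying risk keeps increasing with creatinine.
```

Even in this toy version, the observed risk curve peaks just below the treatment threshold and is lower at 7 mg/dL than at 4.9 mg/dL, mirroring the paradoxical curve above.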
How to Find These Treatment Effects
So how can we find out whether our data-driven model has learned patterns of treatment protocols rather than patterns of underlying risk?
It’s challenging because these treatment effects are not an artifact of limited samples or a subpar model. Rather, the confounding arises from real-world behavior, so any model that accurately predicts observed outcomes must recapitulate these effects. Nor is it a problem of insufficient features: even if we observed every characteristic of treatment for every patient, standardized treatment without randomization does not provide the statistical identifiability needed to separate the impact of treatments from the impact of the features that drove the treatment decisions.
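To make the identifiability point concrete, here is a toy sketch (all numbers hypothetical): when treatment assignment is a deterministic function of a feature, two different decompositions into "feature effect" and "treatment effect" produce identical predictions on every observable patient, so no amount of observed data can distinguish them:

```python
def treated(creatinine):
    """Hypothetical deterministic protocol: treat everyone at/above 5 mg/dL."""
    return 1.0 if creatinine >= 5.0 else 0.0

def model(creatinine, shape, tau):
    """Additive risk model: a feature shape function plus a treatment effect."""
    return shape(creatinine) + tau * treated(creatinine)

# Decomposition A: the treatment is very effective (tau_a = -0.15).
def shape_a(x):
    return 0.05 * x
tau_a = -0.15

# Decomposition B: the shape function absorbs part of the treatment effect,
# leaving a much weaker apparent treatment effect (tau_b = -0.05).
def shape_b(x):
    return 0.05 * x - 0.10 * treated(x)
tau_b = -0.05

# The two decompositions agree on every observable creatinine value, so the
# observed outcomes alone cannot identify the true treatment effect.
assert all(
    abs(model(x / 10, shape_a, tau_a) - model(x / 10, shape_b, tau_b)) < 1e-12
    for x in range(10, 90)
)
```

Only randomization (treating some patients on each side of the threshold) would break this tie, because treatment would no longer be a deterministic function of the feature.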
Luckily, we have new interpretable machine learning tools!
By learning a high-resolution model that can be understood and audited, we can determine which patterns in risk curves come from underlying patient risk and which come from treatment protocols. We use the GAMs implemented in Interpret.ML; these work well here because they are tree-based models, so their splits can accurately capture treatment decisions driven by medical protocols and thresholds.
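To see why tree-based shape functions are a good fit, consider this toy example (a simplification, not the actual Interpret.ML implementation): a piecewise-constant fit can place a split exactly at a protocol threshold, where a smooth curve would blur the jump. Here a single greedy split on a synthetic risk curve recovers a hypothetical 5 mg/dL treatment threshold:

```python
def best_split(xs, ys):
    """Return the split point minimizing squared error of a one-split,
    piecewise-constant fit (the core operation of a tree-based GAM)."""
    best_t, best_err = None, float("inf")
    for i in range(1, len(xs)):
        t = (xs[i - 1] + xs[i]) / 2  # candidate split between adjacent points
        left = [y for x, y in zip(xs, ys) if x < t]
        right = [y for x, y in zip(xs, ys) if x >= t]
        mean_l = sum(left) / len(left)
        mean_r = sum(right) / len(right)
        err = (sum((y - mean_l) ** 2 for y in left)
               + sum((y - mean_r) ** 2 for y in right))
        if err < best_err:
            best_t, best_err = t, err
    return best_t

# Synthetic risk curve with a sharp drop at 5 mg/dL (treated patients do better).
xs = [x / 2 for x in range(2, 17)]           # creatinine from 1.0 to 8.0
ys = [0.3 if x < 5.0 else 0.1 for x in xs]   # observed mortality risk

# The best single split lands right at the protocol threshold.
threshold = best_split(xs, ys)
```

A smooth GAM (e.g. spline-based) would instead spread the drop across neighboring creatinine values, hiding the exact threshold that the protocol uses.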
Problem -> Opportunity
Having a tool which allows us to find these strange risk patterns actually turns the confounding problem into an opportunity:
While these counter-causal risk patterns might trip up predictive models, they also highlight opportunities for improvements in medical practice. In our paper, we categorize these risk-curve shapes into recognizable patterns and break down what each shape implies about the underlying treatment protocol and its possible misalignment with an optimal protocol.
Implications for ML for Health
By applying glass-box machine learning to healthcare data, we find that these patterns are ubiquitous in healthcare datasets. Standard datasets that have been used by thousands of researchers are full of these “paradoxical” treatment effects. For example, here are the MIMIC datasets, high-quality collections of thousands of ICU visits recorded over decades:
In MIMIC-IV, treatments are recorded, so we can connect the mortality rates (top row) to the treatment decisions (bottom two rows). The strange patterns in mortality rate appear to be strongly influenced by treatment decisions, and the treatment decisions follow recognizable clinical guidance and protocols. As a result, we should expect any ML system trained on the MIMIC datasets to recapitulate these treatment effects: its predicted risk is strongly confounded by treatment decisions. This means we should not consider an ML model of healthcare risk meaningful just because it is predictive on a large, canonical dataset; we must inspect what it learned to recapitulate from the data.
Implications for Medicine
This perspective also has implications for medical practitioners. We can see that mortality rates are strongly influenced by the same risk factors that guide treatment decisions. This means that our treatment decisions are not perfect: if they were, we would see smooth, flat risk curves. Instead, real-world risk curves are jumpy and often highest for patients who don't have the highest underlying risk.
We should continue to improve our treatment protocols so that patient risk is flattened and reduced overall. This is difficult because most treatment protocols must use thresholds to define risk categories, yet strict adherence to thresholds can harm the individuals just below the threshold. As we move toward precision medicine, we can use more of these tools to generate a holistic description of health that informs treatment decisions better than any individual measurement can.