How Machine Learning Can Advance Disease Predictability
This article is co-authored by Chao Li, PhD, Data Scientist at AbbVie, and Mike Munsell, Director of Research at Panalgo
As the healthcare landscape evolves and becomes increasingly digital, investments in real-world data (RWD) are skyrocketing, intensifying the need to make sense of large, often unwieldy, data sets. With more complex data available at our fingertips, it has become imperative that researchers implement the most effective processes to learn from the data and provide predictive insights.
To remain competitive in the industry and keep up with this rapidly growing warehouse of information, more analytics teams have started to implement machine learning (ML) techniques to glean real-world evidence (RWE) from this data. In a recent survey, 95% of life sciences executives said they expect to use ML in the next few years to generate RWE. Despite this increase in implementation, uncertainty remains about how best to use ML to generate actionable insights.
The Current State of Machine Learning in Healthcare
In contrast to traditional statistical models, ML is a predictive tool that uses general-purpose learning algorithms to find patterns in complex data sets with minimal assumptions. By uncovering these patterns, ML technology can enhance clinical decision-making, enabling stakeholders to make informed decisions that lead to better patient outcomes. Today, ML is used across healthcare to optimize processes, streamline workflows, predict treatment outcomes, predict clinical events such as re-admissions or relapses, identify meaningful patient subgroups, and more.
One evolving application of ML is its use in advancing disease identification and predictability. In fact, a recent benchmark survey found that 46% of analytics leaders believe implementing ML techniques in their research efforts would have a significant impact on disease identification. To determine how impactful ML can truly be, we looked at how it can be used to help predict disease, replicate prior studies, and externally validate model performance.
Building an Effective Machine Learning Model to Predict Diagnosis
To gain a better understanding of how ML can be leveraged for disease predictability, we set out to replicate a previous study using a machine learning approach to predict hidradenitis suppurativa (HS) diagnosis. HS is a chronic inflammatory condition that causes small, painful lumps to form under the skin, usually in areas where the skin rubs together. Through this study, we learned that ML methods could be used not only to predict the probability of diagnosis, but also to distinguish HS from other dermatologic diagnoses, such as cutaneous abscess and cellulitis.
Using a data analytics platform suited to the task, our team was able to construct, train, and compare more than 50 algorithms to identify the ML model that best predicts the probability of diagnosis. The platform also enabled our team to efficiently compare our results with those of the previous study; the two sets of results matched up well in precision, recall, and accuracy.
The key takeaways from this study indicate that ML has a promising future in improving how we approach and treat disease within our healthcare system. If your team is aiming to incorporate ML methods into your analytics toolbox, here are a few components of an effective ML pipeline to keep in mind:
- Diverse Data Sources: Machine learning model performance is only as good as the data you leverage. A lack of diverse data when training your model can result in overfitting and prevent the model from learning from a variety of data points. For this study, we leveraged four specific databases and took extra precautions to ensure their integrity when defining the study sample, such as looking at a three-year baseline period prior to the index date and requiring at least two HS diagnoses during the follow-up to confirm that the patients in the sample did indeed have HS.
- Balanced Data Sets: Imbalanced data sets can cause your model to be biased towards certain outcomes. By leveraging several databases in our study, we were able to work with a sample of 17,000 HS patients in our case cohort. We then randomly sampled from our control cohort at a 1:1 ratio of cases to controls to prepare a balanced training data set.
- Appropriate Metrics: Accuracy alone is not the best indicator of machine learning model performance; a model’s performance should be tested across several metrics. Because we chose an analytics platform that fit our needs, we were able to easily evaluate and test our model and confirm that it scored high on area under the curve (AUC), average precision, accuracy, recall, and specificity, all of which are important metrics for evaluating model performance.
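To make the last two points concrete, here is a minimal Python sketch of 1:1 control sampling and of why accuracy alone can mislead. It is an illustration only, not the platform or code used in the study: the function names (`balance_one_to_one`, `classification_metrics`, `roc_auc`) and the toy data are hypothetical, and treating "AUC" as the area under the ROC curve is our assumption.

```python
import random


def balance_one_to_one(cases, controls, seed=0):
    """Randomly downsample controls so cases and controls are 1:1."""
    if len(controls) < len(cases):
        raise ValueError("need at least as many controls as cases")
    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    return cases + rng.sample(controls, len(cases))


def classification_metrics(y_true, y_pred):
    """Accuracy, recall, specificity, and precision from 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    def ratio(num, den):
        return num / den if den else 0.0  # avoid division by zero

    return {
        "accuracy": ratio(tp + tn, len(pairs)),
        "recall": ratio(tp, tp + fn),        # share of true cases found
        "specificity": ratio(tn, tn + fp),   # share of controls cleared
        "precision": ratio(tp, tp + fp),
    }


def roc_auc(y_true, scores):
    """Chance that a random case outscores a random control (ties = 0.5)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("need at least one case and one control")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy, made-up data: 5 cases and 95 controls (heavily imbalanced).
cases = [("case", i) for i in range(5)]
controls = [("control", i) for i in range(95)]
balanced = balance_one_to_one(cases, controls)

# A "model" that always predicts control looks accurate on imbalanced data.
y_true = [1] * 5 + [0] * 95
naive = classification_metrics(y_true, [0] * 100)

# Hypothetical predicted probabilities for two cases and two controls.
auc = roc_auc([1, 1, 0, 0], [0.9, 0.4, 0.6, 0.1])
```

On the toy labels, the always-negative model is right for every control, so its accuracy is high, yet its recall is zero because it never identifies a single case. That gap is the reason to report recall, specificity, and a ranking metric such as AUC alongside accuracy, and to train on a balanced sample.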
Applications of ML are evolving, and this study is just one example that illustrates its dynamic capabilities. As we uncover more data-fueled insights in the future, we can expect to leverage ML in even more innovative ways to unlock its full potential in improving patient outcomes.