Using Natural Language Processing and Machine Learning To Identify Potentially Preventable Hospital Admissions Among Outpatients With Chronic Lung Diseases


Hospital admissions are common, costly, and frequently undesired among patients with chronic lung diseases (CLDs), such as chronic obstructive pulmonary disease (COPD) and interstitial lung disease. Each year in the United States, there are over 4 million hospital admissions for CLDs. Early palliative care, home care, and advance care planning interventions are known to improve outcomes in patients with serious illness. Despite these potential benefits, patients with CLDs receive these interventions at lower rates than other patient populations.

Clinical prediction models, which estimate the likelihood of an outcome in a specific patient, hold particular promise for identifying patients who could benefit from these interventions. Previously developed prediction models have used data that overlook the social, behavioral, economic, and geographic environments in which patients are embedded. These models have relied on “structured” data from patients’ electronic health records, such as patient age, vital signs, and administrative data. They have also relied exclusively on linear regression approaches and used too few predictor variables.

For this project, novel clinical prediction models will be developed to predict future risk of hospitalizations among patients with CLDs. The novel models will utilize “unstructured” data from patients, such as notes written by clinicians throughout the encounter. They will also use machine learning approaches, which are well suited to identifying non-linear relationships involving many predictor variables.

To achieve this, first, we will conduct a mixed-methods study to identify mechanisms of potentially preventable hospitalizations. We will conduct interviews with hospitalized patients with CLDs and their caregivers and clinicians to gain their perspectives. We will conduct surveys to further assess specific patient contexts.

Second, we will build a patient classification model for each preventable mechanism based on both “structured” and “unstructured” data. For this step, we will use natural language processing to identify phrases in clinical notes related to each mechanism. Natural language processing is a technique in computer science for analyzing human text and speech.
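As a rough illustration of what identifying mechanism-related phrases in a note can look like, the sketch below uses simple keyword matching. It is not the project's method: the mechanism names, the phrase lists, and the example note are all hypothetical, and real clinical NLP pipelines typically go well beyond exact phrase matching.

```python
import re

# Hypothetical phrase lists, one per mechanism. These are illustrative
# assumptions, not lexicons taken from the project.
MECHANISM_PHRASES = {
    "medication_access": ["ran out of inhaler", "cannot afford", "missed refill"],
    "frailty": ["unsteady gait", "recent fall", "needs help with bathing"],
}


def compile_patterns(phrases_by_mechanism):
    """Build one case-insensitive, whole-phrase regex per mechanism."""
    patterns = {}
    for mechanism, phrases in phrases_by_mechanism.items():
        alternation = "|".join(re.escape(p) for p in phrases)
        patterns[mechanism] = re.compile(
            r"\b(?:" + alternation + r")\b", re.IGNORECASE
        )
    return patterns


def find_mechanism_phrases(note, patterns):
    """Return {mechanism: [matched phrases, lowercased]} for one note."""
    hits = {}
    for mechanism, pattern in patterns.items():
        found = [m.group(0).lower() for m in pattern.finditer(note)]
        if found:
            hits[mechanism] = found
    return hits


PATTERNS = compile_patterns(MECHANISM_PHRASES)

# Made-up note text, written only for this example.
example_note = (
    "Pt reports she ran out of inhaler last week. "
    "Recent fall at home; unsteady gait on exam."
)
example_hits = find_mechanism_phrases(example_note, PATTERNS)
```

The per-mechanism matches could then serve as inputs to a classification model alongside structured fields.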

Finally, we will use machine learning methods to predict future risk of hospital admission among patients with CLDs with each mechanism of interest. We will build a set of predictive models for each mechanism, compare their performance, and choose the best-performing model in each case.
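The compare-and-select step can be sketched as follows. This is a minimal illustration under stated assumptions: the two candidate scoring functions, the feature names, and the six toy patients are invented for the example and are not project data or results. The area under the ROC curve (AUC) is used here as one common performance measure; the project's actual metric is not stated.

```python
def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg
    )
    return wins / (len(pos) * len(neg))


def select_best_model(candidates, features, labels):
    """Score every candidate on the same evaluation data and pick the top AUC."""
    results = {
        name: auc(labels, [model(x) for x in features])
        for name, model in candidates.items()
    }
    best = max(results, key=results.get)
    return best, results


# Hypothetical candidates: each maps a patient record to a risk score.
candidates = {
    "prior_admissions_only": lambda x: x["prior_admissions"],
    "prior_admissions_plus_note_flag": lambda x: x["prior_admissions"]
    + 2 * x["note_flag"],
}

# Toy evaluation set (made-up values). label 1 = admitted, 0 = not admitted.
features = [
    {"prior_admissions": 0, "note_flag": 0},
    {"prior_admissions": 1, "note_flag": 0},
    {"prior_admissions": 2, "note_flag": 0},
    {"prior_admissions": 1, "note_flag": 1},
    {"prior_admissions": 0, "note_flag": 1},
    {"prior_admissions": 3, "note_flag": 0},
]
labels = [0, 0, 0, 1, 1, 1]

best_name, auc_by_model = select_best_model(candidates, features, labels)
```

In practice the comparison would be run on data held out from model training, so that the selected model is not simply the one that best memorized its training set.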

Results & Impact

So far, we have interviewed patients with CLDs who were admitted to the hospital, along with their caregivers and clinicians. Analyzing over 50 interviews, we identified common clinical, social, and behavioral processes that led to hospitalizations.

We found that many of these factors were present weeks, if not longer, before a patient’s hospital admission. This provides an opportunity for action to help prevent hospitalizations due to CLDs.

We have also tested and compared several natural language processing strategies for analyzing text data. We found that the optimal strategy varied by situation; however, methods involving publicly available clinical data performed as well as those involving general data or private clinical data.

Additionally, we have developed a prediction model to identify text describing patient frailty in clinical notes. We trained several test models on a data set of clinical notes in which we had identified markers of frailty for the models to learn. Our models trained on clinical notes performed better than those not trained on notes.


National Institutes of Health; National Heart, Lung, and Blood Institute