Development of a machine learning model to identify colorectal cancer stage in Medicare claims

JCO Clinical Cancer Informatics May 31, 2023


PURPOSE: Staging information is essential for colorectal cancer research. Medicare claims are an important source of population-level data but currently lack oncologic stage. We aimed to develop a claims-based model to identify stage at diagnosis in patients with colorectal cancer.

METHODS: We included patients age 66 years or older with colorectal cancer in the SEER-Medicare registry. Using patients diagnosed from 2014 to 2016, we developed models (multinomial logistic regression, elastic net regression, and random forest) to classify patients into stage I-II, III, or IV on the basis of demographics, diagnoses, and treatment utilization identified in Medicare claims. Models developed in a training cohort (2014-2016) were applied to a testing cohort (2017), and performance was evaluated using cancer stage listed in the SEER registry as the reference standard.
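
The temporal split described above (train on 2014-2016 diagnoses, test on 2017) can be illustrated with a minimal sketch. The record layout, the field names `patient_id` and `dx_year`, and the helper `split_by_diagnosis_year` are hypothetical and chosen only for illustration; they are not taken from the article or from SEER-Medicare file specifications.

```python
# Toy records; "patient_id" and "dx_year" are assumed field names.
TOY_PATIENTS = [
    {"patient_id": 1, "dx_year": 2014},
    {"patient_id": 2, "dx_year": 2015},
    {"patient_id": 3, "dx_year": 2016},
    {"patient_id": 4, "dx_year": 2017},
    {"patient_id": 5, "dx_year": 2017},
]

TRAIN_YEARS = {2014, 2015, 2016}
TEST_YEARS = {2017}


def split_by_diagnosis_year(patients, train_years, test_years):
    """Assign each patient to the training or testing cohort by diagnosis year."""
    train = [p for p in patients if p["dx_year"] in train_years]
    test = [p for p in patients if p["dx_year"] in test_years]
    return train, test


train_cohort, test_cohort = split_by_diagnosis_year(
    TOY_PATIENTS, TRAIN_YEARS, TEST_YEARS
)
```

Splitting by calendar year rather than at random means the model is evaluated on patients diagnosed after every patient it was fitted on, which is closer to how a claims-based model would be applied to later data.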

RESULTS: The cohort of 30,543 patients with colorectal cancer included 14,935 (48.9%) patients with stage I-II, 9,203 (30.1%) with stage III, and 6,405 (21.0%) with stage IV disease. A claims-based model using elastic net regression had a scaled Brier score (SBS) of 0.45 (95% CI, 0.43 to 0.46). Performance was strongest for classifying stage IV (SBS, 0.62; 95% CI, 0.59 to 0.64; sensitivity, 93%; 95% CI, 91 to 94) followed by stage I-II (SBS, 0.45; 95% CI, 0.44 to 0.47; sensitivity, 86%; 95% CI, 85 to 87) and stage III (SBS, 0.32; 95% CI, 0.30 to 0.33; sensitivity, 62%; 95% CI, 61 to 64).
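
For readers unfamiliar with the metrics, the following is a minimal sketch of how a scaled Brier score and a per-stage sensitivity could be computed. It assumes one common definition, SBS = 1 - BS / BS_ref, where BS_ref is the Brier score of a non-informative model that predicts the observed stage prevalence for every patient; the abstract does not state the exact formulation the authors used. All function names and the toy data are hypothetical.

```python
STAGES = ["I-II", "III", "IV"]


def brier_score(probs, labels, classes):
    """Mean over patients of the squared error summed over classes."""
    total = 0.0
    for p, y in zip(probs, labels):
        total += sum((p[c] - (1.0 if y == c else 0.0)) ** 2 for c in classes)
    return total / len(labels)


def scaled_brier_score(probs, labels, classes):
    """1 - BS / BS_ref: 1 is perfect, 0 matches a prevalence-only model."""
    n = len(labels)
    prevalence = {c: sum(1 for y in labels if y == c) / n for c in classes}
    reference = [prevalence] * n  # same prediction for every patient
    bs = brier_score(probs, labels, classes)
    bs_ref = brier_score(reference, labels, classes)
    return 1.0 - bs / bs_ref


def sensitivity(predicted, labels, stage):
    """Fraction of patients truly in `stage` who were classified as `stage`."""
    n_true = sum(1 for y in labels if y == stage)
    n_hit = sum(1 for yhat, y in zip(predicted, labels) if y == stage and yhat == stage)
    return n_hit / n_true


# Toy data (not from the article): four patients with known stage.
toy_labels = ["I-II", "I-II", "III", "IV"]
toy_predicted = ["I-II", "III", "III", "IV"]

# Perfect probabilistic predictions put all mass on the true stage.
perfect_probs = [{c: 1.0 if c == y else 0.0 for c in STAGES} for y in toy_labels]
# Prevalence-only predictions ignore the individual patient.
prevalence_probs = [{"I-II": 0.5, "III": 0.25, "IV": 0.25}] * len(toy_labels)

sbs_perfect = scaled_brier_score(perfect_probs, toy_labels, STAGES)
sbs_prevalence = scaled_brier_score(prevalence_probs, toy_labels, STAGES)
sens_early = sensitivity(toy_predicted, toy_labels, "I-II")
```

Under this definition, a score of 0.62 for stage IV would mean the model removes roughly 62% of the squared error of a prevalence-only prediction, while sensitivity reports only how many true cases of a stage were recovered.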

CONCLUSION: Machine learning models effectively classified colorectal cancer stage using Medicare claims. These models extend the ability of claims-based research to risk-adjust and stratify by stage.


National Institutes of Health