More Predictors: Slightly Better Predictions but Increased Complexity


by Dr Jan Roth

In collaboration with the Swiss HIV Cohort Study and IBM Research Zurich, I took part in a large study that developed and validated several machine learning models and compared their performance with short clinical scores. We focused on predicting chronic kidney disease in people living with HIV. Across various prediction horizons and diverse machine learning models trained on hundreds of variables, we achieved state-of-the-art predictive performance, often slightly better than that of standard clinical scores.

Adding more predictor variables can improve predictions, but it also increases the complexity of your models. Predictive modelling strategies have to be context-specific: more predictors or more advanced methods may not always be necessary for your project, and it is good practice to start with less complex models (e.g. logistic regression fitted on a few key predictors).
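To make the "start simple" advice concrete, here is a minimal sketch that fits the same logistic regression twice, once on two key predictors and once on all twenty, and compares discrimination (AUC) on held-out data. Everything in it is an assumption for illustration: the data are synthetic rather than taken from the study, the helper names (`fit_logistic`, `predict`, `auc`, `keep_columns`) are hypothetical, and the hand-written gradient descent stands in for whatever library you would normally use.

```python
import math
import random

random.seed(0)  # fixed seed so the sketch is reproducible


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_logistic(X, y, epochs=100, lr=0.5):
    """Fit logistic regression by batch gradient descent; returns (weights, bias)."""
    n, p = len(X), len(X[0])
    w, b = [0.0] * p, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * p, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            grad_b += err
            for j in range(p):
                grad_w[j] += err * xi[j]
        w = [wj - lr * gj / n for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def predict(model, X):
    """Predicted probabilities for each row of X."""
    w, b = model
    return [sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) for xi in X]


def auc(y_true, scores):
    """AUC: probability that a random positive outranks a random negative."""
    pos = [s for s, t in zip(scores, y_true) if t == 1]
    neg = [s for s, t in zip(scores, y_true) if t == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


def keep_columns(rows, k):
    """Keep only the first k predictors of every row."""
    return [r[:k] for r in rows]


# Synthetic data (NOT study data): 20 predictors, of which the first two
# carry strong signal, the next three weak signal, and the rest pure noise.
N_ROWS, N_PREDICTORS = 600, 20
X, y = [], []
for _ in range(N_ROWS):
    row = [random.gauss(0, 1) for _ in range(N_PREDICTORS)]
    logit = 1.5 * row[0] - 1.0 * row[1] + 0.3 * (row[2] + row[3] + row[4])
    X.append(row)
    y.append(1 if random.random() < sigmoid(logit) else 0)

# Simple hold-out split: first 400 rows for training, last 200 for testing.
X_train, X_test = X[:400], X[400:]
y_train, y_test = y[:400], y[400:]

model_small = fit_logistic(keep_columns(X_train, 2), y_train)  # 2 key predictors
model_full = fit_logistic(X_train, y_train)                    # all 20 predictors

auc_small = auc(y_test, predict(model_small, keep_columns(X_test, 2)))
auc_full = auc(y_test, predict(model_full, X_test))
print(f"AUC,  2 predictors: {auc_small:.3f}")
print(f"AUC, 20 predictors: {auc_full:.3f}")
```

Which model comes out ahead, and by how much, depends on the data and the split. The point of the comparison is to check on held-out data whether the extra predictors buy enough performance to justify the added complexity in your setting.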


Roth JA, Radevski G, Marzolini C, Rauch A, Günthard HF, Kouyos RD, Fux CA, Scherrer AU, Calmy A, Cavassini M, Kahlert CR, Bernasconi E, Bogojeska J, Battegay M. Cohort-derived machine learning models for individual prediction of chronic kidney disease in people living with HIV: a prospective multicentre cohort study. J Infect Dis. 2020 May 9:jiaa236. doi: 10.1093/infdis/jiaa236.

Roth JA, Battegay M, Juchler F, Vogt JE, Widmer AF. Introduction to machine learning in digital healthcare epidemiology. Infect Control Hosp Epidemiol. 2018 Dec;39(12):1457-1462. doi: 10.1017/ice.2018.265. Epub 2018 Nov 5.
