Torsten Sauer, University of Rostock
Roland Rau, University of Rostock
Machine learning methods have become very popular in various scientific disciplines. Using Breiman’s random forests and data from the National Health Interview Survey (NHIS) and its mortality follow-up, we asked two questions: 1) Can these methods be used to predict the occurrence of death? 2) Which variables are important for these predictions? We assessed the accuracy of the forests by estimating the area under the ROC curve (AUC) on test data and found that they perform relatively well, with AUC values ranging from 0.83 to 0.87. To quantify the predictive power of each variable, we estimated its mean decrease in accuracy (MDA). Not surprisingly, “age” is by far the most predictive variable, followed by “mobility limitations” and “self-rated health”. Typical sociodemographic determinants of mortality such as “sex”, “education”, and “income” appear to have very weak predictive ability in each of the six selected intervals.
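The two measures named above, AUC and MDA, can be illustrated with a minimal sketch. It is not the authors' implementation and it does not train a random forest: a fixed threshold rule on a hypothetical "age" column stands in for a fitted model, the data are synthetic toy values (not NHIS records), and the importance is computed by permuting one column of a held-out set rather than on per-tree out-of-bag samples as in Breiman's original procedure. All function and variable names are assumptions made for illustration.

```python
import random


def auc(labels, scores):
    """Area under the ROC curve via the pairwise (Mann-Whitney) identity."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    # Share of (positive, negative) pairs ranked correctly; ties count half.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def accuracy(predict, rows, labels):
    """Fraction of rows whose predicted class equals the true label."""
    return sum(predict(r) == y for r, y in zip(rows, labels)) / len(labels)


def mean_decrease_in_accuracy(predict, rows, labels, column, n_repeats=20, seed=0):
    """Average drop in accuracy after randomly permuting one column."""
    rng = random.Random(seed)
    baseline = accuracy(predict, rows, labels)
    drops = []
    for _ in range(n_repeats):
        shuffled = [r[column] for r in rows]
        rng.shuffle(shuffled)
        # Rebuild each row with the permuted value in the chosen column.
        permuted = [r[:column] + (v,) + r[column + 1:] for r, v in zip(rows, shuffled)]
        drops.append(baseline - accuracy(predict, permuted, labels))
    return sum(drops) / n_repeats


# Synthetic toy data (assumption): column 0 is "age", column 1 is pure noise.
data_rng = random.Random(42)
rows = [(data_rng.randint(40, 95), data_rng.random()) for _ in range(200)]
labels = [1 if age > 70 else 0 for age, _ in rows]


def predict(row):
    """Hypothetical stand-in for a trained classifier: uses age only."""
    return 1 if row[0] > 70 else 0


test_auc = auc(labels, [row[0] for row in rows])
mda_age = mean_decrease_in_accuracy(predict, rows, labels, column=0)
mda_noise = mean_decrease_in_accuracy(predict, rows, labels, column=1)
```

In this toy setting the "age" column shows a positive MDA because shuffling it breaks the link between predictor and outcome, while the noise column shows none because the stand-in classifier never uses it. With a real forest, the same comparison would be made across all survey variables.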
Presented in Session 3. Population, Development, & the Environment; Data & Methods; Applied Demography