/*************************************************************************** This code available at: http://pauldickman.com/software/stata/prediction_new_data.do The tutorial based on this code is available at: http://pauldickman.com/software/stata/prediction_new_data/ Illustrates how to fit a model using patient data and then predict in a second dataset specifically constructed to contain only the covariates for which we wish to predict. Age is modelled using a restricted cubic spline. Paul Dickman, October 2019 ***************************************************************************/ use https://pauldickman.com/data/colon if age>=40&age<=90, clear sort age // Create an indicator variable for sex generate female=(sex==2) // All-cause death as outcome, censor at 5 years stset surv_mm, failure(status=1,2) scale(12) id(id) exit(time 60.5) // Generate splines for age; store knot locations and projection matrix rcsgen age, gen(rcsage) df(4) orthog matrix Rage = r(R) global knotsage `r(knots)' list age rcsage? in 1 // fit the model allowing non-proportional hazards for sex stpm2 female rcsage1-rcsage4, scale(hazard) df(5) tvc(female) dftvc(2) // Now construct a new data set in which to make predictions // We will predict 5-year survival for each integer age (20-100) and each sex // Note that the model is only fitted to ages 40-90 so some of the // predictions are out-of-sample clear // Start with 81 observations, 1 obs for each integer value of age range age 20 100 81 // Now expand the data set (create two copies) // We exploit the fact that the new variable generated by -expand- // takes the value 0 for the original data and 1 for the copy // After this step our data set has 81+81=162 observations expand 2, generate(female) // Generate the values of the spline basis variables for each observations // using the same projection matrix and knots as in the patient data rcsgen age, gen(rcsage) rmatrix(Rage) knots($knotsage) // check that the values of the spline variables for age 40 // are the same as in the patient data list if age==40 // In order to predict, the variable _t must exist in the data set // Even if we are predicting at timevar() gen _t=. // We will first predict 5-year survival, so create a variable that takes // the value 5 for every observation generate t5=5 // Now make the predictions // Stata will use the last fitted model (which was fitted to the patient data) predict s5, survival timevar(t5) // Plot 5-year survival as a function of age at diagnosis twoway (line s5 age if female==0, sort lpattern(shortdash) lwidth(medthick) lcolor(blue)) /// (line s5 age if female==1, sort lpattern(solid) lwidth(medthick) lcolor(black)) /// , legend(label(1 "Male") label(2 "Female") ring(0) pos(6) col(2)) scheme(sj) name(s5, replace) /// ytitle("5-year survival proportion", size(*1.0)) xtitle("Age at diagnosis", size(*1.0)) // Let's now further extend the dataset so as to make predictions at a range of values of time // We create 24 extra copies of the dataset using "expand 25" // We then create the variable _t that takes values between 0 and 6 // Note that we have only fitted the model using up to 5 years follow-up, so the extra year will be "out-of-sample" drop _t t5 expand 25 sort female age generate _t=(mod(_n-1,25))/4 // Now prediction the survivor function for each observation predict s, survival timevar(_t) // Plot survival as a function of time since diagnosis twoway (line s _t if female==0&age==50, sort lpattern(shortdash) lwidth(medthick) lcolor(blue)) /// (line s _t if female==1&age==50, sort lpattern(solid) lwidth(medthick) lcolor(black)) /// , legend(label(1 "Male, age 50") label(2 "Female, age 50") ring(0) pos(1) col(1)) scheme(sj) name(s, replace) /// ytitle("Survival proportion", size(*1.0)) xtitle("Time since diagnosis (years)", size(*1.0)) // A simpler approach to creating the final prediction data set // Thanks to Nurgul Batyrbekova for introducing me to the -fillin- command clear range age 20 100 81 range female 0 1 2 range _t 0 6 25 fillin age female _t drop if missing(age, female, _t) rcsgen age, gen(rcsage) rmatrix(Rage) knots($knotsage) predict s2, survival timevar(_t) twoway (line s2 _t if female==0&age==50, sort lpattern(shortdash) lwidth(medthick) lcolor(blue)) /// (line s2 _t if female==1&age==50, sort lpattern(solid) lwidth(medthick) lcolor(black)) /// , legend(label(1 "Male, age 50") label(2 "Female, age 50") ring(0) pos(1) col(1)) scheme(sj) name(s2, replace) /// ytitle("Survival proportion", size(*1.0)) xtitle("Time since diagnosis (years)", size(*1.0)) // To do: Repeat in Stata 16 using frames