Multiple imputation for missing covariates when modelling relative survival
The code used in this tutorial, along with links to the data, is available here.
This exercise illustrates an approach to modelling relative survival with missing covariate data using multiple imputation. Falcaro et al. (2015) published a nice overview of methods for modelling net survival when covariate data are missing. They have also published a paper on multiple imputation for non-parametric estimation of net survival, which is not covered here.
Mechanisms for missingness in survival analysis can be classifed as follows:
- Missing completely at random (MCAR)
- Covariate-dependent missing at random (CD-MAR)
- Missing at random (MAR)
- Missing not at random (MNAR)
Much of the early literature on missing data uses three classifcations (MCAR, MAR, and MNAR). Falcaro et al. (2015) distinguish between MAR (where the probability of missingness depends on known covariates, the event indicator, and the survival time) and CD-MAR (where the probability of missingness depends on known covariates, but not the the event indicator or survival time).
The analytic approach depends on the mechanism. Multiple imputation is appropriate for any of the patterns other than NMAR; if the data are MNAR then the problem is more difficult. We cannot use the observed data to identify the mechanism, we need substantive scientific knowledge of the processes that gave rise to the data. The validity of inference depends on the (untestable) assumptions inherent in the mechanism.
We will proceed with multiple imputation to illustrate the approach, even if the appropriateness could be questioned (missing stage could depend on factors not available to us).
. use http://pauldickman.com/data/colon.dta, clear
(Colon carcinoma, diagnosed 1975-94, follow-up to 1995)
. stset surv_mm, failure(status=1 2) scale(12) exit(time 10*12)
failure event: status == 1 2
obs. time interval: (0, surv_mm]
exit on or before: time 10*12
t for analysis: time/12
------------------------------------------------------------------------------
15,564 total observations
0 exclusions
------------------------------------------------------------------------------
15,564 observations remaining, representing
10,451 failures in single-record/single-failure data
51,613.792 total analysis time at risk and under observation
at risk from t = 0
earliest observed entry t = 0
last observed exit t = 10
. gen _age = min(int(age + _t),99)
. gen _year = int(yydx + mmdx/12 + _t)
. merge m:1 _year sex _age using http://pauldickman.com/data/popmort
Result # of obs.
-----------------------------------------
not matched 8,279
from master 0 (_merge==1)
from using 8,279 (_merge==2)
matched 15,564 (_merge==3)
-----------------------------------------
. keep if _merge==3
(8,279 observations deleted)
.
. tab stage
Clinical |
stage at |
diagnosis | Freq. Percent Cum.
------------+-----------------------------------
Unknown | 2,356 15.14 15.14
Localised | 6,274 40.31 55.45
Regional | 1,787 11.48 66.93
Distant | 5,147 33.07 100.00
------------+-----------------------------------
Total | 15,564 100.00
.
. /* Check stage distribution over age and gender */
. tab stage agegrp, column
+-------------------+
| Key |
|-------------------|
| frequency |
| column percentage |
+-------------------+
Clinical |
stage at | Age in 4 categories
diagnosis | 0-44 45-59 60-74 75+ | Total
-----------+--------------------------------------------+----------
Unknown | 83 262 858 1,153 | 2,356
| 11.29 11.06 13.01 19.65 | 15.14
-----------+--------------------------------------------+----------
Localised | 297 993 2,716 2,268 | 6,274
| 40.41 41.93 41.20 38.65 | 40.31
-----------+--------------------------------------------+----------
Regional | 114 329 772 572 | 1,787
| 15.51 13.89 11.71 9.75 | 11.48
-----------+--------------------------------------------+----------
Distant | 241 784 2,247 1,875 | 5,147
| 32.79 33.11 34.08 31.95 | 33.07
-----------+--------------------------------------------+----------
Total | 735 2,368 6,593 5,868 | 15,564
| 100.00 100.00 100.00 100.00 | 100.00
. tab stage sex, column
+-------------------+
| Key |
|-------------------|
| frequency |
| column percentage |
+-------------------+
Clinical |
stage at | Sex
diagnosis | Male Female | Total
-----------+----------------------+----------
Unknown | 885 1,471 | 2,356
| 13.96 15.95 | 15.14
-----------+----------------------+----------
Localised | 2,620 3,654 | 6,274
| 41.32 39.61 | 40.31
-----------+----------------------+----------
Regional | 715 1,072 | 1,787
| 11.28 11.62 | 11.48
-----------+----------------------+----------
Distant | 2,120 3,027 | 5,147
| 33.44 32.82 | 33.07
-----------+----------------------+----------
Total | 6,340 9,224 | 15,564
| 100.00 100.00 | 100.00
.
. /* Graphs of survival by age group and stage */
. stpm2 ib1.stage##i.agegrp , df(5) bhaz(rate) scale(hazard) eform nolog
Log likelihood = -18205.571 Number of obs = 15,564
---------------------------------------------------------------------------------
| exp(b) Std. Err. z P>|z| [95% Conf. Interval]
----------------+----------------------------------------------------------------
xb |
stage |
Unknown | 1.116447 .2864249 0.43 0.668 .675246 1.845925
Regional | 1.853897 .3673454 3.12 0.002 1.257251 2.73369
Distant | 7.849335 1.117766 14.47 0.000 5.937717 10.37639
|
agegrp |
45-59 | .8300786 .1236065 -1.25 0.211 .6199652 1.111402
60-74 | .9399807 .1281619 -0.45 0.650 .7195512 1.227937
75+ | 1.217131 .1711222 1.40 0.162 .9239804 1.603289
|
stage#agegrp |
Unknown#45-59 | 2.091891 .6053175 2.55 0.011 1.186403 3.688467
Unknown#60-74 | 2.183828 .5875785 2.90 0.004 1.288828 3.700342
Unknown#75+ | 4.335722 1.161605 5.48 0.000 2.564553 7.330122
Regional#45-59 | 1.492494 .3449121 1.73 0.083 .9488588 2.347596
Regional#60-74 | 1.56255 .3337045 2.09 0.037 1.02813 2.374762
Regional#75+ | 1.334842 .2959008 1.30 0.193 .8644497 2.061198
Distant#45-59 | 1.311706 .2211291 1.61 0.108 .9426265 1.825295
Distant#60-74 | 1.295841 .2001943 1.68 0.093 .9573039 1.754098
Distant#75+ | 1.310089 .207943 1.70 0.089 .95983 1.788163
|
_rcs1 | 3.050329 .036273 93.79 0.000 2.980057 3.122257
_rcs2 | 1.318658 .0119716 30.47 0.000 1.295402 1.342332
_rcs3 | .9920318 .0056634 -1.40 0.161 .9809937 1.003194
_rcs4 | 1.04735 .0038704 12.52 0.000 1.039791 1.054963
_rcs5 | 1.011087 .0029654 3.76 0.000 1.005292 1.016916
_cons | .1027228 .0128049 -18.26 0.000 .0804563 .1311516
---------------------------------------------------------------------------------
Note: Estimates are transformed only in the first equation.
. predict survival, surv
. line survival _t if stage==0, lpattern(dash) sort || ///
> line survival _t if stage==1, sort || ///
> line survival _t if stage==2, sort || ///
> line survival _t if stage==3, sort by(agegrp) ///
> legend(order(1 "Unknown" 2 "Localised" 3 "Regional" 4 "Distant")) ///
> name(s_by_stage, replace) ysize(8) xsize(11) ytitle("Relative survival") ///
> xtitle("years since diagnosis")
. /* Fit model using missing indicator approach */
. stpm2 ib1.stage i.agegrp , df(5) bhaz(rate) scale(hazard) eform nolog
Log likelihood = -18267.394 Number of obs = 15,564
------------------------------------------------------------------------------
| exp(b) Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
xb |
stage |
Unknown | 3.241262 .1571215 24.26 0.000 2.947487 3.564319
Regional | 2.660883 .1403817 18.55 0.000 2.399487 2.950755
Distant | 10.00967 .3928204 58.70 0.000 9.268619 10.80997
|
agegrp |
45-59 | 1.101743 .0692448 1.54 0.123 .9740518 1.246174
60-74 | 1.241194 .0720058 3.72 0.000 1.107793 1.390659
75+ | 1.780897 .1042263 9.86 0.000 1.587898 1.997354
|
_rcs1 | 3.044807 .0362527 93.52 0.000 2.974576 3.116697
_rcs2 | 1.320444 .0119872 30.62 0.000 1.297157 1.344149
_rcs3 | .99282 .0056625 -1.26 0.206 .9817836 1.00398
_rcs4 | 1.048117 .0038603 12.76 0.000 1.040579 1.055711
_rcs5 | 1.011472 .0029515 3.91 0.000 1.005704 1.017273
_cons | .0765708 .0049307 -39.90 0.000 .0674918 .0868711
------------------------------------------------------------------------------
Note: Estimates are transformed only in the first equation.
.
. /* Refit model using complete records approach */
. replace stage=. if stage==0
(2,356 real changes made, 2,356 to missing)
. stpm2 ib1.stage i.agegrp , df(5) bhaz(rate) scale(hazard) eform nolog
Log likelihood = -15353.605 Number of obs = 13,208
------------------------------------------------------------------------------
| exp(b) Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
xb |
stage |
Regional | 2.676154 .1410076 18.68 0.000 2.413576 2.967299
Distant | 10.3598 .4080125 59.36 0.000 9.590197 11.19117
|
agegrp |
45-59 | 1.061092 .0688816 0.91 0.361 .9343219 1.205062
60-74 | 1.204694 .0721444 3.11 0.002 1.071277 1.354727
75+ | 1.557469 .0950547 7.26 0.000 1.381876 1.755373
|
_rcs1 | 3.141848 .0410415 87.64 0.000 3.062429 3.223326
_rcs2 | 1.318276 .0133637 27.26 0.000 1.292342 1.34473
_rcs3 | 1.001148 .006439 0.18 0.858 .9886073 1.013848
_rcs4 | 1.050811 .0043694 11.92 0.000 1.042282 1.05941
_rcs5 | 1.010931 .00336 3.27 0.001 1.004367 1.017538
_cons | .0807468 .0053034 -38.31 0.000 .0709936 .0918398
------------------------------------------------------------------------------
Note: Estimates are transformed only in the first equation.
We now set up the data for multiple imputation. Theory dictates that the imputation model (specified later) should contain the outcome, which we will specify using the event indicator and estimated cumulative hazard. We therefore generate the Nelson-Aalen estimate of the cumulative hazard and store it in the variable H.
. sts gen H=na
. // Declare multiple-imputation data and register variables accordingly
. mi set flong
. mi register imputed stage
(2356 m=0 obs. now marked as incomplete)
. mi register regular subsite agegrp sex
. mi register passive _rcs* _d_rcs*
We now multiply impute missing values using chained equations. We set the seed so as to obtain reproducible results but that step is not necessary in practice.
Carpenter and Kenward (page 55) suggest 30 imputations but we will use only 10 to reduce computational time.
. // Perform imputation.
. set seed 29390
. mi impute chained (mlogit) stage = i.subsite sex i.agegrp H _d, add(10)
note: missing-value pattern is monotone; no iteration performed
Conditional models (monotone):
stage: mlogit stage i.subsite sex i.agegrp H _d
Performing chained iterations ...
Multivariate imputation Imputations = 10
Chained equations added = 10
Imputed: m=1 through m=10 updated = 0
Initialization: monotone Iterations = 0
burn-in = 0
stage: multinomial logistic regression
------------------------------------------------------------------
| Observations per m
|----------------------------------------------
Variable | Complete Incomplete Imputed | Total
-------------------+-----------------------------------+----------
stage | 13208 2356 2356 | 15564
------------------------------------------------------------------
(complete + incomplete = total; imputed is the minimum across m
of the number of filled-in observations.)
We’ll examine the imputed data for selected observation. The variable _mi_m
is zero for the observations in the
original data and a sequential integer (1 to 10) for the imputed observations. The first observation with missing stage (id==2287
) is a women aged 45-59 who died within 6 months of diagnosis. The distribution of imputed values is heavily weighted towards distant stage. The second observation we look at (id==3362
) is a woman aged 75+ years at diagnosis who dies after more than 6 years after diagnosis. The distribution of imputed values is heavily weighted towards localised.
. list id _mi_m agegrp sex stage _t _d if id==2287
+-------------------------------------------------------------+
| id _mi_m agegrp sex stage _t _d |
|-------------------------------------------------------------|
63. | 2287 0 45-59 Female . .04166667 1 |
15627. | 2287 1 45-59 Female Distant .04166667 1 |
31191. | 2287 2 45-59 Female Distant .04166667 1 |
46755. | 2287 3 45-59 Female Distant .04166667 1 |
62319. | 2287 4 45-59 Female Distant .04166667 1 |
|-------------------------------------------------------------|
77883. | 2287 5 45-59 Female Localised .04166667 1 |
93447. | 2287 6 45-59 Female Regional .04166667 1 |
109011. | 2287 7 45-59 Female Distant .04166667 1 |
124575. | 2287 8 45-59 Female Distant .04166667 1 |
140139. | 2287 9 45-59 Female Distant .04166667 1 |
|-------------------------------------------------------------|
155703. | 2287 10 45-59 Female Distant .04166667 1 |
+-------------------------------------------------------------+
. list id _mi_m agegrp sex stage _t _d if id==3362
+-------------------------------------------------------------+
| id _mi_m agegrp sex stage _t _d |
|-------------------------------------------------------------|
2270. | 3362 0 75+ Female . 6.2083333 1 |
17834. | 3362 1 75+ Female Localised 6.2083333 1 |
33398. | 3362 2 75+ Female Localised 6.2083333 1 |
48962. | 3362 3 75+ Female Localised 6.2083333 1 |
64526. | 3362 4 75+ Female Localised 6.2083333 1 |
|-------------------------------------------------------------|
80090. | 3362 5 75+ Female Localised 6.2083333 1 |
95654. | 3362 6 75+ Female Localised 6.2083333 1 |
111218. | 3362 7 75+ Female Localised 6.2083333 1 |
126782. | 3362 8 75+ Female Localised 6.2083333 1 |
142346. | 3362 9 75+ Female Localised 6.2083333 1 |
|-------------------------------------------------------------|
157910. | 3362 10 75+ Female Regional 6.2083333 1 |
+-------------------------------------------------------------+
. list id _mi_m agegrp sex stage _t _d if id==3501
+------------------------------------------------------+
| id _mi_m agegrp sex stage _t _d |
|------------------------------------------------------|
5080. | 3501 0 75+ Female . 10 0 |
20644. | 3501 1 75+ Female Localised 10 0 |
36208. | 3501 2 75+ Female Localised 10 0 |
51772. | 3501 3 75+ Female Localised 10 0 |
67336. | 3501 4 75+ Female Localised 10 0 |
|------------------------------------------------------|
82900. | 3501 5 75+ Female Localised 10 0 |
98464. | 3501 6 75+ Female Regional 10 0 |
114028. | 3501 7 75+ Female Localised 10 0 |
129592. | 3501 8 75+ Female Localised 10 0 |
145156. | 3501 9 75+ Female Localised 10 0 |
|------------------------------------------------------|
160720. | 3501 10 75+ Female Localised 10 0 |
+------------------------------------------------------+
We now refit the flexible parametric model, this time to the imputed data. The mi
estimate
command effectively fits the model to each of the 10 imputed data sets and
then combines the resulting estimates. Since stpm2
is not an official Stata command
we need to specify the cmdok
option to specify that the command is OK for use with
imputed data. We also save the resulting estimates in order to make predictions in a
later step.
. mi estimate, dots cmdok sav(mi_stpm2,replace): ///
> stpm2 ib1.stage i.agegrp, df(5) bhaz(rate) scale(hazard) nolog eform
Imputations (10):
.........10 done
Multiple-imputation estimates Imputations = 10
Number of obs = 15,564
Average RVI = 0.0678
Largest FMI = 0.2246
DF adjustment: Large sample DF: min = 191.94
avg = 161,766.44
Within VCE type: OIM max = 730,392.23
( 1) [xb]_rcs1 - [dxb]_d_rcs1 = 0
( 2) [xb]_rcs2 - [dxb]_d_rcs2 = 0
( 3) [xb]_rcs3 - [dxb]_d_rcs3 = 0
( 4) [xb]_rcs4 - [dxb]_d_rcs4 = 0
( 5) [xb]_rcs5 - [dxb]_d_rcs5 = 0
------------------------------------------------------------------------------
| Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
xb |
stage |
Regional | .9669048 .0554079 17.45 0.000 .8576181 1.076191
Distant | 2.32957 .0381676 61.04 0.000 2.254674 2.404465
|
agegrp |
45-59 | .0781396 .0638174 1.22 0.221 -.0469585 .2032376
60-74 | .2059373 .0602865 3.42 0.001 .0876842 .3241903
75+ | .5452103 .0593367 9.19 0.000 .428899 .6615217
|
_rcs1 | 1.144416 .012122 94.41 0.000 1.120655 1.168176
_rcs2 | .2693923 .0091673 29.39 0.000 .2514247 .2873598
_rcs3 | -.008823 .0058069 -1.52 0.129 -.0202044 .0025583
_rcs4 | .0471367 .0038664 12.19 0.000 .039558 .0547154
_rcs5 | .0118175 .0030844 3.83 0.000 .0057721 .0178629
_cons | -2.569309 .0651194 -39.46 0.000 -2.697028 -2.44159
-------------+----------------------------------------------------------------
dxb |
_d_rcs1 | 1.144416 .012122 94.41 0.000 1.120655 1.168176
_d_rcs2 | .2693923 .0091673 29.39 0.000 .2514247 .2873598
_d_rcs3 | -.008823 .0058069 -1.52 0.129 -.0202044 .0025583
_d_rcs4 | .0471367 .0038664 12.19 0.000 .039558 .0547154
_d_rcs5 | .0118175 .0030844 3.83 0.000 .0057721 .0178629
------------------------------------------------------------------------------
Now predict the survival function based on the fitted model. We specify timevar(_t)
(which is usually the default) to force recalculation of the spline variables.
Using the fact that the original data is still in the new dataset (with _mi_m==0
), we
can refit the model using the complete records approach and obtain predictions of
survival. We will graphically compare the predicted relative survival curves from the
imputation model and the complete records model for the 60-74 age-group.
. // predict survival using -mi predictnl-
. mi predictnl survimp2 = predict(survival at(agegrp 2) timevar(_t)) using mi_stpm2
. // compare predictions to complete case analysis
. stpm2 ib1.stage i.agegrp if _mi_m==0, df(5) scale(h) bhaz(rate)
Log likelihood = -15353.605 Number of obs = 13,208
------------------------------------------------------------------------------
| Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
xb |
stage |
Regional | .9843808 .0526904 18.68 0.000 .8811095 1.087652
Distant | 2.337933 .0393842 59.36 0.000 2.260741 2.415125
|
agegrp |
45-59 | .0592983 .0649158 0.91 0.361 -.0679343 .1865309
60-74 | .1862257 .0598861 3.11 0.002 .0688511 .3036003
75+ | .4430619 .0610316 7.26 0.000 .3234423 .5626816
|
_rcs1 | 1.144811 .0130629 87.64 0.000 1.119208 1.170414
_rcs2 | .2763246 .0101372 27.26 0.000 .256456 .2961932
_rcs3 | .0011476 .0064316 0.18 0.858 -.0114581 .0137533
_rcs4 | .0495627 .0041581 11.92 0.000 .0414129 .0577125
_rcs5 | .0108717 .0033237 3.27 0.001 .0043574 .0173861
_cons | -2.516437 .0656789 -38.31 0.000 -2.645166 -2.387709
------------------------------------------------------------------------------
. predict surv, survival at(agegrp 2)
. line surv survimp2 _t if stage==1 & _mi_m==0, sort || ///
> line surv survimp2 _t if stage==2 & _mi_m==0, sort || ///
> line surv survimp2 _t if stage==3 & _mi_m==0, sort ///
> title("Predicted survival for agegrp==2 (60-74)") ///
> legend(order(1 "Localised (Complete)" 2 "Localised (Imputed)" ///
> 3 "Regional (Complete)" 4 "Regional (Imputed)" ///
> 5 "Distant (Complete)" 6 "Distant (Imputed)")) ///
> name(imputed, replace) ysize(8) xsize(11) ytitle("Relative survival") ///
> title("Relative survival by stage") xtitle("years since diagnosis")