Model-based age-standardisation with stpm2
The code used in this tutorial, along with links to the data, is available here.
In this tutorial, we use a model-based approach to estimate all-cause survival that is age-standardised to the International Cancer Survival Standard (ICSS). In a separate tutorial, we accomplished this by predicting survival for each age group and then taking the weighted average of the age-specific estimates to get the age-standardised estimates.
In this tutorial we will apply the weights at an individual level, which precludes the need to explicitly estimate survival within each age group.
We begin by loading the colon cancer data (restricting to localised stage) and stset
.
. use https://pauldickman.com/data/colon.dta if stage==1, clear
(Colon carcinoma, diagnosed 1975-94, follow-up to 1995)
. keep surv_mm status yydx age
. stset surv_mm, fail(status==1,2) scale(12)
failure event: status == 1 2
obs. time interval: (0, surv_mm]
exit on or before: failure
t for analysis: time/12
------------------------------------------------------------------------------
6,274 total observations
0 exclusions
------------------------------------------------------------------------------
6,274 observations remaining, representing
3,291 failures in single-record/single-failure data
35,598.75 total analysis time at risk and under observation
at risk from t = 0
earliest observed entry t = 0
We now create the spline basis vectors to model year of diagnosis using restricted cubic splines and create new age groups corresponding to the International Cancer Survival Standard (Corazziari et al 2004). stpm2
supports Stata factor variable syntax (i.
) for main effects, but not time-varying effects so we will create dummy variables for agegrp.
. /*spline variable for year of diagnosis*/
. rcsgen yydx, df(3) gen(yearspl) orthog
Variables yearspl1 to yearspl3 were created
. // New age groups according to the International Cancer Survival Standard (ICSS)
. label drop agegrp
. egen agegrp=cut(age), at(0 15 45 55 65 75 200) icodes
. label variable agegrp "Age group"
. label define agegrp 1 "15-44" 2 "45-54" 3 "55-64" 4 "65-74" 5 "75+"
. label values agegrp agegrp
. // Generate dummy variables for agegrp
. quietly tab agegrp, gen(agegrp)
We will now create an individual weight for each observation, calculated as the proportion of patients in the given age group in the standard population (ICSS) divided by the proportion in that age group in our patient data.
In our data set, 4.73% of patients are in the youngest age group compared to 7% in the standard population. As such, each patients in the youngest age group in our population will be given a weight of 7/4.73=1.48
. At the other end of the age scale, 36.15% of the patients are in the oldest age group compared to 29% in the standard population. As such, each patients in the oldest age group in our population will be given a weight of 29/36.15=0.80
. Our population is older than the standard population, so when age-standardising we will up-weight the young and down-weight the elderly.
_N
is a Stata system variable that contains the total number of observations in the dataset. In the code below we write the total number of observations to a local macro variable (total
). When we reference _N
within bysort
its value will be the number of observations within the by group. In this way, we generate the variable a_age
to contain the proportion of observations in each age group.
. // Create weights to use for age-standardisation
. recode agegrp (1=0.07) (2=0.12) (3=0.23) (4=0.29) (5=0.29), gen(ICSSwt)
(6274 differences between agegrp and ICSSwt)
. local total= _N
. bysort agegrp:gen a_age = _N/`total'
. gen w = ICSSwt/a_age
. tabstat w, statistics( mean min max) by(agegrp)
Summary for variables: w
by categories of: agegrp (Age group)
agegrp | mean min max
-------+------------------------------
15-44 | 1.478721 1.478721 1.478721
45-54 | 1.447846 1.447846 1.447846
55-64 | 1.234406 1.234406 1.234406
65-74 | .9007227 .9007227 .9007227
75+ | .8022311 .8022311 .8022311
-------+------------------------------
Total | 1 .8022311 1.478721
--------------------------------------
The weights are constant within each age group, and decrease with increasing age.
We have created the weights, but will not use them just yet. We now fit a flexible parametric model adjusting for calendar year of diagnosis (as a restricted cubic spline) and age group.
. stpm2 yearspl? agegrp2 agegrp3 agegrp4 agegrp5, scale(h) df(4) eform ///
> tvc(yearspl? agegrp2 agegrp3 agegrp4 agegrp5) dftvc(1)
Iteration 0: log likelihood = -8253.8872
Iteration 1: log likelihood = -8182.2058
Iteration 2: log likelihood = -8179.6265
Iteration 3: log likelihood = -8179.6199
Iteration 4: log likelihood = -8179.6199
Log likelihood = -8179.6199 Number of obs = 6,274
----------------------------------------------------------------------------------
| exp(b) Std. Err. z P>|z| [95% Conf. Interval]
-----------------+----------------------------------------------------------------
xb |
yearspl1 | .8683429 .0207813 -5.90 0.000 .8285528 .9100439
yearspl2 | .9961899 .0220676 -0.17 0.863 .9538637 1.040394
yearspl3 | 1.057278 .0230945 2.55 0.011 1.012969 1.103525
agegrp2 | 1.187213 .1970927 1.03 0.301 .8574685 1.643763
agegrp3 | 1.385942 .2046868 2.21 0.027 1.037607 1.851216
agegrp4 | 2.308078 .3230204 5.98 0.000 1.754377 3.036532
agegrp5 | 4.454763 .6146422 10.83 0.000 3.399232 5.838058
_rcs1 | 2.566524 .2379096 10.17 0.000 2.140136 3.077863
_rcs2 | .9525124 .0121212 -3.82 0.000 .929049 .9765683
_rcs3 | .9850944 .0075191 -1.97 0.049 .9704669 .9999424
_rcs4 | 1.005493 .0059038 0.93 0.351 .9939882 1.017131
_rcs_yearspl11 | 1.036719 .0225516 1.66 0.097 .9934475 1.081875
_rcs_yearspl21 | 1.002712 .0191478 0.14 0.887 .9658764 1.040952
_rcs_yearspl31 | .9806256 .0176908 -1.08 0.278 .9465581 1.015919
_rcs_agegrp21 | 1.033909 .1184814 0.29 0.771 .8259205 1.294274
_rcs_agegrp31 | 1.075245 .1089989 0.72 0.474 .881496 1.31158
_rcs_agegrp41 | 1.232513 .1184007 2.18 0.030 1.020989 1.487859
_rcs_agegrp51 | 1.091913 .1029258 0.93 0.351 .9077206 1.313481
_cons | .1386212 .0187748 -14.59 0.000 .1063024 .1807657
----------------------------------------------------------------------------------
Note: Estimates are transformed only in the first equation.
We will now predict survival based on the fitted model. By default, the predict
command will predict survival at the value of _t
(time of end of follow-up) for each observation. For simplicity and computational efficiency, we will create a temporary time variable for which to make predictions. Paul Lambert has written a tutorial on the use of temporary time variables.
We will predict survival for each of 101 unique values of time (every 0.1 years from 0 to 10) rather than for each of the 6,274 observations in the data set.
We start with predicting the unweighted (unstandardised) marginal survival. At each of the values of time, we predict the marginal (population-averaged) survival using the meansurv option to the predict command. For time=0, the marginal survival is trivially 1. The next value of temptime is 0.1. What meansurv effectively does is it predicts the survival at time 0.1 for each of the 6,274 observations in the data set (conditional on covariates for each observation) and then takes the average of these 6,274 observations. This continues for each of the other values of temptime.
. range temptime 0 10 101
(6,173 missing values generated)
. // marginal (population-averaged) survival
. predict s_unweighted, meansurv timevar(temptime)
. // marginal (population-averaged) survival standardised to ICSS
. predict s_weighted, meansurv meansurvwt(w) timevar(temptime)
Another way of conceptualising this, is that for each of the 6,274 observations in the data set we predict the survivor function (from time 0 to 10) for the individual with given values of year and age. We then average the 6,274 individual survival curves to get the marginal survival curve. This is stored in the variable s_unweighted
.
To get the age-standardised marginal (population-averaged) survival, we apply weights when we take the average of the individual survival curves. Following is the code for plotting the two resulting curves.
. twoway (line s_weighted temptime , sort lpattern(shortdash) lwidth(medthick) lcolor(black)) ///
> (line s_unweighted temptime , sort lpattern(solid) lwidth(medthick) lcolor(black)) ///
> , legend(label(1 "Marginal survival (standardised)") label(2 "Marginal survival (unstandardised)") ring(0) pos(7) col(1)) ///
> scheme(sj) name(surv1, replace) ysize(8) xsize(11) ///
> subtitle("`text`i''", size(*1.0)) ytitle("All-cause survival", size(*1.0)) xtitle("Years since diagnosis", size(*1.0)) ///
> ylabel(0 0.2 0.4 0.6 0.8 1.0, labsize(*1.0) angle(0)) yscale(range(0 1)) xlabel(, labsize(*1.0))
Standardised survival is higher than the unstandardised, which is as expected because our patient population is older than the standard population. The standardised survival curve is what we would see if our population had the same age-specific survival but the age distribution of the (younger) standard population. Below you see the age distribution of our population.
. tab agegrp
Age group | Freq. Percent Cum.
------------+-----------------------------------
15-44 | 297 4.73 4.73
45-54 | 520 8.29 13.02
55-64 | 1,169 18.63 31.65
65-74 | 2,020 32.20 63.85
75+ | 2,268 36.15 100.00
------------+-----------------------------------
Total | 6,274 100.00
Now let’s just estimate 5-year survival, to illustrate what predict, meansurv
is doing. Rather than a range of temporary time variable we have just one variable, t5
, which takes the value 5 for all observations.
We start by predicting the 5-year survival for each observation for the given values of year and age and store it in the variable s5
. We call this the 5-year survival conditional on covariates.
We then predict the marginal (population-averaged) 5-year survival both with and without weighting.
. drop temptime
. generate t5=5
. // 5-year survival conditional on covariates
. predict s5, survival timevar(t5)
. // 5-year marginal (population-averaged) survival
. predict s5_unweighted, meansurv timevar(t5)
. // 5-year marginal (population-averaged) survival standardised to ICSS
. predict s5_weighted, meansurv meansurvwt(w) timevar(t5)
We can have a look at the first 5 rows in the data set. s5
will potentially differ for each covariate pattern (since it is the predicted 5-year survival conditional on covariates).
. list yydx agegrp s5 s5_unweighted s5_weighted in 1/5
+---------------------------------------------------+
| yydx agegrp s5 s5_unwe~d s5_weig~d |
|---------------------------------------------------|
1. | 1982 15-44 .82990125 .61272163 .64306652 |
2. | 1987 15-44 .83144508 .61272163 .64306652 |
3. | 1981 15-44 .82593014 .61272163 .64306652 |
4. | 1981 15-44 .82593014 .61272163 .64306652 |
5. | 1994 15-44 .87029606 .61272163 .64306652 |
+---------------------------------------------------+
The unweighted marginal 5-year survival is simply the arithmetic average of the 6,274 values of s5
and will be the same for all obervations in the data set. We see below that the average of s5
is .6127216, identical to the value of s5_unweighted
above.
. summarize s5
Variable | Obs Mean Std. Dev. Min Max
-------------+---------------------------------------------------------
s5 | 6,274 .6127216 .1494382 .3118262 .8702961
The standardised (weighted) marginal 5-year survival is simply the weighted arithmetic average of the 6,274 values of s5
and will be the same for all obervations in the data set. If we multiply s5
by the weight (w
) and average we see below that the result is identical to the value of s5_weighted
above.
. generate s5w=s5*w
. summarize s5w
Variable | Obs Mean Std. Dev. Min Max
-------------+---------------------------------------------------------
s5w | 6,274 .6430665 .2985234 .2501567 1.286925
I have used the terms marginal and conditional to link to the way these terms are used in the causal inference literature. The conditional survival here is conditional on covariates. One can also estimate survival conditional on already having already survived up to some value of time, which is further explained here.
Possible extensions and questions
Q: What if I want to estimate age-standardised relative survival?
A: The code is almost identical. Just add the bhazard
option to stpm2
.
Q: Why did you use all-cause survival (rather than relative survival) in this tutorial?
A: Because I also used all-cause survival in this tutorial. In the other tutorial I did age-standardisation by first estimating age-specific estimates. This was problematic for the youngest age group where there were few cases. This illustrates the advantage of the approach using weighting, since we do not have to estimate within age strata.