Age-standardised survival using standsurv

Paul Dickman, Paul Lambert

Table of Contents

Download Stata code (do files)

There are two separate do files associated with this tutorial.

We start by describing the procedure for using standsurv, a user-written Stata command authored by Paul Lambert, to estimate age-standardised survival using first a so-called internal standard population and then using an external standard population.

The do files provide fully worked examples. They include code to download the patient data (melanoma) and reproduce the results shown on this page.

Introduction and aims

In a previous tutorial, we estimated the hazard ratio for a binary exposure (sex) as a function of year of diagnosis. We produced the following plot.

Hazard ratio for sex (males/females) predicted from a flexible parametric model with an interaction between sex and year of diagnosis (modelled as a restricted cubic spline). The hazard ratio is assumed to be constant over follow-up time (i.e., proportional hazards). Patients diagnosed with melanoma 1975-1994 in an unspecified country with a 10 year follow-up.
Hazard ratio for sex (males/females) predicted from a flexible parametric model with an interaction between sex and year of diagnosis (modelled as a restricted cubic spline). The hazard ratio is assumed to be constant over follow-up time (i.e., proportional hazards). Patients diagnosed with melanoma 1975-1994 in an unspecified country with a 10 year follow-up.

We are now going to estimate the age-standardised 5-year survival for each sex as a function of year of diagnosis and the difference in age-standardised 5-year survival between men and women. The linked code will produce these two graphs.

Age-standardised 5-year cause-specific survival for men and women. Standard population is the age-distribution among all patients for all years. Patients diagnosed with melanoma 1975-1994 in an unspecified country with a 10 year follow-up.
Age-standardised 5-year cause-specific survival for men and women. Standard population is the age-distribution among all patients for all years. Patients diagnosed with melanoma 1975-1994 in an unspecified country with a 10 year follow-up.
Difference in age-standardised 5-year cause-specific survival between men and women (women minus men). Standard population is the age-distribution among all patients for all years. Patients diagnosed with melanoma 1975-1994 in an unspecified country with a 10 year follow-up.
Difference in age-standardised 5-year cause-specific survival between men and women (women minus men). Standard population is the age-distribution among all patients for all years. Patients diagnosed with melanoma 1975-1994 in an unspecified country with a 10 year follow-up.

We have modelled cause-specific survival, but the same approach works if we are interested in relative survival (just include the bhazard() option when modelling and stset with all deaths as the outcome).

The data set-up is the same as in the previous tutorial. We stset with time since diagnosis as the timescale and censor at 10 years (120 months). We will model year of diagnosis as a restricted cubic spline with 3 degrees of freedom. We use the rcsgen command to create the 3 spline basis variables and then create the interaction between sex and year of diagnosis.

Fitting the flexible parametric model (stpm2)

We now fit the following model.

. stpm2 yearspl* male maleyr1 maleyr2 maleyr3 agegrp2 agegrp3 agegrp4, ///
>       scale(h) df(5) eform ///
>       tvc(agegrp2 agegrp3 agegrp4) dftvc(2)

[output omitted]

male is an indicator variable for male sex (takes the value 1 for males and 0 for females) and yearspl1, yearspl2, and yearspl3, are the spline basis vectors for year of diagnosis.

Note that sex and year of diagnosis do not appear in the tvc() option, so are assumed to be constant over time since diagnosis.

Our model contains interaction effects between sex and year of diagnosis (without these the hazard ratio and the standardised survival would be constant over year of diagnosis). The interaction effects were constructed as follows.

generate maleyr1=male*yearspl1
generate maleyr2=male*yearspl2
generate maleyr3=male*yearspl3

In the next step we will use the fact that each of these interaction effects are zero for women and equal to the corresponding spline basis vector for men.

We are not suggesting this is the optimal model for these data. We are using a model that is sufficiently complex to illustrate the principles, but kept simple for pedagogic purposes. A more realistic model might also include a sex by age interaction and additional time-varying effects, but we have used a simpler model to keep the code as readible as possible.

Age-standardisation with standsurv using an internal standard

The user-written standsurv command (written by Paul Lambert) estimates standardized survival curves and related measures based on a fitted model. It also estimates various contrasts (e.g., differences) between the standardized functions. Details can be found in these pages on Paul Lambert’s website:

We will use standsurv to estimate the age-standardised 5-year survival for males and for females along with the difference between them.

If we had fitted a simpler model (without interactions) then we could make the following call to standsurv.

generate t5=5
standsurv, at1(male 0) at2(male 1) timevar(t5) ci contrast(difference)

We begin by creating a value of time at which we wish to predict. Usually we predict for a range of values of time so as to get a standardised survival function, but here we will only predict the 5-year survival.

Each of the atn() options creates a standardised 5-year survival; one for females (the variable male is set to 0) and one for males (the variable male is set to 1).

The standardised 5-year survival for females is created by predicting (based on the fitted model) the 5-year survival for every observation in the data set under the assumption they are female (even if they are actually male). We then average the individual predictions to get the standardised survival for females. We then repeat the process, but predicting under the assumption that everyone is male.

The key is that we predict two values of the 5-year survival for each individual, once if they were male and once if they were female. When we average these values to get the standardised estimates, the average for males and the average for females are both taken over the exact same observations (with the same distribution of age and year) so are therefore comparable. With this simple example we are standardising over both age and year of diagnosis.

For our actual research question, we want to estimate the age-standardised 5-year survival for men and women for each year of diagnosis. Each of these estimates will be averaged over the age distribution of all patients; this is known as internal age standardisation (we are using the age distribution of the entire cohort as the standard).

The code is as follows:

generate t5=5 in 1
forvalues y = 1975/1994 {
  display "Calculating age-standardised survival for year: `y' "
  rcsgen, scalar(`y') knots(${yearknots}) rmatrix(R) gen(c) 
  standsurv,  at1(male 0 maleyr1 0 maleyr2 0 maleyr3 0) ///
              at2(male 1 maleyr1 `=c1' maleyr2 `=c2' maleyr3 `=c3') ///
              timevar(t5) contrast(difference) ci ///
              atvar(S_male`y' S_female`y') contrastvars(S_diff`y')
}

We loop over each value of year of diagnosis (from 1975 to 1994).

Specifying the appropriate covariate values is more complicated because our model is more complicated. For females we have male=0 as well as all of the sex by year interactions set to zero. Because we want to predict for females at a specific value of year of diagnosis we need to specify the appropriate values of the spline basis vectors for each year. We use the rcsgen command to recreate the spline basis vectors (using the same knot locations and projection matrix as previously used) and then save the resulting spline basis vectors to local scalars c1, c2, and c3 that can be fed into standsurv. For males, the sex by year interactions will be equal to the values of the respective spline basis vectors.

This code will create a series of valiable with the standardised estimates for each year. For example, for 1975 we will have the following valiables.

. describe *1975*

              storage   display 
variable name   type    format  
--------------------------------
S_male1975      double  %10.0g                
S_male1975_lci  double  %10.0g                
S_male1975_uci  double  %10.0g                
S_female1975    double  %10.0g                
S_female~75_lci double  %10.0g                
S_female~75_uci double  %10.0g                
S_diff1975      double  %10.0g                
S_diff1975_lci  double  %10.0g                
S_diff1975_uci  double  %10.0g     

Similar variables are created for each of the other years. To make plotting easier we will convert from wide to long format.

keep in 1
keep id S_male* S_female* S_diff*
reshape long S_male S_male@_lci S_male@_uci ///
             S_female S_female@_lci S_female@_uci ///
			 S_diff S_diff@_lci S_diff@_uci, i(id) j(yydx) 

We can now easily plot, for example, the age-standardised 5-year cause-specific survival for men and women.

twoway (rarea S_male_lci S_male_uci yydx, sort color(blue%25)) ///
       (line S_male yydx, sort lcolor(blue) lpattern(dash_dot)) /// 
	   (rarea S_female_lci S_female_uci yydx, sort color(red%25)) ///
       (line S_female yydx, sort lcolor(red) lpattern(solid)) /// 
                 , ysize(8) xsize(11) ///
				 title("Age-standardised 5-year cause-specific survival") ///
 				 subtitle("Standardised to the age-distribution among all patients for all years") ///
                 ylabel(,angle(h) format(%3.2f)) ///
                 ytitle("5-year survival (age-standardised)") name("agestand", replace) ///
                 xtitle("Year of diagnosis") ///
				 legend(label(2 "men") label(4 "women") order(2 4) ring(0) position(6) col(1))		
Age-standardised 5-year cause-specific survival for men and women. Standard population is the age-distribution among all patients for all years. Patients diagnosed with melanoma 1975-1994 in an unspecified country with a 10 year follow-up.
Age-standardised 5-year cause-specific survival for men and women. Standard population is the age-distribution among all patients for all years. Patients diagnosed with melanoma 1975-1994 in an unspecified country with a 10 year follow-up.

Age-standardisation with standsurv using an external standard (ICSS)

The code for external (ICSS) age standardisation is available here.

We previously age-standardised using a so-called internal standard. That is, we used the age distribution of the patients (over all years) as the standard population. The age-standardised survival for each year of diagnosis is interpreted as the survival we would have observed if the age distribution of patients was the same at each year of diagnosis.

We will now age-standardise using an external standard population, namely the International Cancer Survival Standard (ICSS) (Corazziari et al 2004). ICSS consists of three standard populations, we will use number 2 which is intended for cancer sites with broadly constant incidence by age. See this page for an overview.

In traditional age-standardisation we calculate survival within each age group and then take a weighted average of the age-specific estimates. An advantage of the standsurv approach is that we apply weights at an individual level, which precludes the need to explicitly estimate survival within each age group. This makes it possible to apply age-standardisation with sparse data where the traditional approach can break down; see our paper Estimation of age-standardized net survival, even when age-specific data are sparse.

We first need to recreate the age groups in a manner consistent with ICSS.

. drop agegrp
. label drop agegrp
. drop if age < 15
(12 observations deleted)
. egen agegrp=cut(age), at(0 15 45 55 65 75 200) icodes

The following table shows the age distribution in the patient cohort for year of diagnosis 1975. The output has been modified to also show the percentage in the ICSS population and the ICSS weights (calculated as percent in ICSS divided by percent in patient cohort).

. tab agegrp if yydx==1975

 agegrp |  Freq.   Percent    ICSS %  weight 
--------+----------------------------------- 
  15-44 |     67     33.67     28      0.831    
  45-54 |     36     18.09     17      0.940    
  55-64 |     33     16.58     21      1.266    
  65-74 |     46     23.12     20      0.865    
    75+ |     17      8.54     14      1.639    
--------+----------------------------------- 
  Total |    199    100.00    100      

We can see that percentage of patients in each age group is roughly similar to the proportion in the ICSS population.

We will now create an individual weight for each observation, calculated as the proportion of patients in the given age group in the standard population (ICSS) divided by the proportion in that age group in our patient data. The idea is to upweight (assign a weight greater than 1) those age groups that are underrepresented in our patient cohort (relative to the standard) and downweight (assign a weight less than 1) those that are overrepresented.

In our data set, 33.67% of patients are in the youngest age group compared to 28% in the standard population. As such, each patients in the youngest age group in our population will be given a weight of 28/33.67=0.831. That is, our population contains more patients in the youngest age group than the standard population, so each patient receives a weight less than 1. At the other end of the age scale, 8.54% of the patients are in the oldest age group compared to 14% in the standard population. As such, each patient in the oldest age group in our population will be given a weight of 14/8.54=1.639. That is, for 1975 there are fewer patients in the highest age group than in the standard population so we upweight the patients in our cohort.

The age distribution differs by year of diagnosis. We see from the table below that the patients diagnosed in 1994 are, on average, older than those diagnosed 1975.

. tab agegrp if yydx==1994

 agegrp | Freq.   Percent   ICSS %  weight 
--------+--------------------------------- 
  15-44 |    70     19.23    28      1.456    
  45-54 |    77     21.15    17      0.804    
  55-64 |    79     21.70    21      0.968    
  65-74 |    70     19.23    20      1.040    
    75+ |    68     18.68    14      0.749    
--------+--------------------------------- 
  Total |   364    100.00   100      

We now calculate the individual weights and apply them when using standsurv. Note that we do not use the weights when fitting the model, only in standsurv. The only change to the call to standsurv compared to the code for internal standardisation is the addition of the indweights(w) option.

generate t5=5 in 1
forvalues y = 1975/1994 {
  display "Calculating age-standardised survival for year: `y' "
 
  // Create weights to use for external age-standardisation
  // Separate weights are required for each year (since age distribution varies by year)
  // total obs for each year
  count if yydx==`y'
  local total: display %3.0f r(N)
  
  // count of number of patients in each agegroup for each year
  by agegrp: egen n_age`y'=sum(yydx==`y')

  // generate weights
  gen w`y' = ICSSwt/(n_age`y'/`total') 
  // write the spline basis vectors to local macro variables
  rcsgen, scalar(`y') knots(${yearknots}) rmatrix(R) gen(c) 
  
   // estimate age-standardised survival for men and women (with difference) for year y
  standsurv if yydx==`y', at1(male 0 maleyr1 0 maleyr2 0 maleyr3 0) ///
              at2(male 1 maleyr1 `=c1' maleyr2 `=c2' maleyr3 `=c3') ///
              timevar(t5) contrast(difference) ci indweights(w`y') ///
              atvar(S_female`y' S_male`y') contrastvars(S_diff`y')

}

Comparison of approaches to external age standardisation

This is work in progress and the links are here primarily so I can easily find them.

On another page I illustrate an approach to external age-standardisation based on fitting separate models for each age strata. These files apply the two approaches to the same data (that used above) but I use a more complex model.

  1. fitting models for each age strata
  2. using standsurv as above

The results are not exactly the same because the approaches use different models. Model 2 would need interactions between age and all main effects (sex, year, sex*year) to be comparable to model 1 (which fits separate models for each age group).

Paul Dickman
Paul Dickman
Professor of Biostatistics

Biostatistician working with register-based cancer epidemiology.