Matching an exposed cohort with an unexposed cohort

The code used in this tutorial is available in matching.do. A more elegant version is available in matching2.do.

The code uses the user-written -rangejoin- command, which in turn requires -rangestat-. Install both from SSC using:

ssc install rangestat
ssc install rangejoin

Introduction

This page illustrates how to construct a matched cohort study from two datasets (exposed and unexposed). Note that we are matching on baseline characteristics (i.e., this is not the approach for a nested case-control study).

The page shows how to use the rangejoin command to randomly select up to 5 unexposed comparators for each exposed patient, matched exactly on sex and year (yydx) and on age to within 5 years. It assumes you have one dataset containing the exposed and a separate dataset containing the unexposed. In my example, these datasets are called exposed (156 observations) and unexposed (6054 observations).

Here is how we use the rangejoin command:

use exposed, clear
rangejoin age -5 5 using unexposed, by(sex yydx)

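The arguments age -5 5 request unexposed observations whose age is within 5 years of the exposed patient's age, and by(sex yydx) requires exact agreement on sex and year. For each observation in the exposed dataset, rangejoin creates one observation for every match; variables taken from the unexposed dataset carry the suffix _U. For the exposed patient with id 8353 we identified 53 matches. Here are the first 10.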

. list id id_U sex yydx dx dx_U age age_U if id==8353

      +-------------------------------------------------------------------+
      |   id   id_U      sex   yydx          dx        dx_U   age   age_U |
      |-------------------------------------------------------------------|
   1. | 8353   3349   Female   1987   16jul1987   15jun1987    73      68 |
   2. | 8353   3962   Female   1987   16jul1987   15jun1987    73      68 |
   3. | 8353   4050   Female   1987   16jul1987   14jan1987    73      68 |
   4. | 8353   4391   Female   1987   16jul1987   14feb1987    73      68 |
   5. | 8353   3741   Female   1987   16jul1987   16may1987    73      69 |
      |-------------------------------------------------------------------|
   6. | 8353   3939   Female   1987   16jul1987   16mar1987    73      69 |
   7. | 8353   3992   Female   1987   16jul1987   15dec1987    73      69 |
   8. | 8353   4195   Female   1987   16jul1987   16apr1987    73      69 |
   9. | 8353   4240   Female   1987   16jul1987   04jan1987    73      69 |
  10. | 8353   3623   Female   1987   16jul1987   15oct1987    73      70 |
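
Before selecting comparators, it can be useful to see how many candidate matches each exposed patient has. The following is a minimal sketch, not part of matching.do or matching2.do. It assumes the rangejoin output is in memory, that id_U is numeric, and that exposed patients without any match are retained with a missing id_U (my understanding of rangejoin's behaviour; check the help file). The variable names n_matches and one_per_id are hypothetical.

```stata
* Sketch: number of candidate matches per exposed patient after rangejoin.
* Assumes id (exposed) and id_U (unexposed) as shown in the listing above.
egen n_matches = count(id_U), by(id)   // nonmissing id_U within each id
egen byte one_per_id = tag(id)         // flags one row per exposed patient
tabulate n_matches if one_per_id
count if one_per_id & n_matches < 5    // exposed patients with fewer than 5 candidates
drop one_per_id
```

Exposed patients counted by the last command will end up with fewer than 5 comparators, which is why the design is described as "up to 5".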

The next step is to randomly select 5 unexposed comparators for each exposed patient who has more than 5 matches. We assign a random number to each observation and then, for each exposed patient, keep only the 5 observations with the lowest values of the random number.

set seed 8675309
gen double shuffle = runiform()
by id (shuffle), sort: keep if _n <= 5
drop shuffle
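
The random numbers are assigned in the current row order, so the same seed gives the same selection only if the rows are in the same order each time. A possible variation on the selection step, sketched here under the assumption that each pair of id and id_U appears only once, fixes the order explicitly before drawing. It is meant to be run in place of the commands above, not in addition to them.

```stata
* Variation on the selection step (an assumption, not the tutorial's own code).
* isid checks that id and id_U jointly identify the rows and sorts on them;
* missok allows the missing id_U of exposed patients without a match.
isid id id_U, sort missok
set seed 8675309
gen double shuffle = runiform()
by id (shuffle), sort: keep if _n <= 5
drop shuffle
```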

There are now 5 matches for the exposed patient with id 8353.

. count if id==8353
  5

. list id id_U sex yydx dx dx_U age age_U if id==8353

     +-------------------------------------------------------------------+
     |   id   id_U      sex   yydx          dx        dx_U   age   age_U |
     |-------------------------------------------------------------------|
396. | 8353   4052   Female   1987   16jul1987   17aug1987    73      75 |
397. | 8353   4196   Female   1987   16jul1987   28aug1987    73      70 |
398. | 8353   3962   Female   1987   16jul1987   15jun1987    73      68 |
399. | 8353   4050   Female   1987   16jul1987   14jan1987    73      68 |
400. | 8353   4085   Female   1987   16jul1987   16apr1987    73      72 |
     +-------------------------------------------------------------------+
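
Because the comparators are drawn independently for each exposed patient, the same unexposed individual can be selected into more than one matched set. The following sketch shows one way to check how often that happens. It assumes the selected pairs are in memory with one row per pair; n_sets is a hypothetical variable name.

```stata
* Sketch: how often is the same unexposed individual selected more than once?
duplicates report id_U if !missing(id_U)
* Number of matched sets in which each unexposed individual appears
bysort id_U: gen n_sets = _N if !missing(id_U)
list id id_U n_sets if n_sets > 1 & !missing(n_sets), sepby(id_U)
```

Whether such reuse is acceptable depends on the study design and the planned analysis.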

The final step is to reshape from wide format to long. Have a look at the code in matching.do or matching2.do. The code in matching.do is less elegant but may be easier to understand, whereas the code in matching2.do is more elegant and easier to generalise.

After reshaping and some data manipulation we have the following observations for id 8353. We created a variable, set_id, to index the matched sets and a binary variable, exposed. This matched set contains one exposed and five unexposed individuals.

+--------------------------------------------------------------------------------------+
| set_id     id   exposed      sex   yydx   age         status          dx        exit |
|--------------------------------------------------------------------------------------|
|   8353   8353         1   Female   1987    73          Alive   16jul1987   31dec1995 |
|--------------------------------------------------------------------------------------|
|   8353   3962         0   Female   1987    68   Dead: cancer   15jun1987   30jan1992 |
|   8353   4050         0   Female   1987    68          Alive   14jan1987   31dec1995 |
|   8353   4052         0   Female   1987    75   Dead: cancer   17aug1987   01feb1988 |
|   8353   4085         0   Female   1987    72          Alive   16apr1987   31dec1995 |
|   8353   4196         0   Female   1987    70    Dead: other   28aug1987   13may1989 |
+--------------------------------------------------------------------------------------+
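
For readers who want a feel for the reshaping step without opening the do-files, here is a rough sketch. It is not the code from matching.do or matching2.do. It assumes the selected pairs are in memory, that the exposed dataset contributed id, age, dx, status and exit, that the unexposed copies are named id_U, age_U, dx_U, status_U and exit_U, and that sex and yydx are shared because they were used in by(). The tempfile name exposed_rows is hypothetical. Dropping exposed patients without a comparator is a choice made for this sketch.

```stata
* Sketch: stack exposed and unexposed rows into one long dataset.
drop if missing(id_U)            // exposed patients with no comparator
gen long set_id = id             // matched set indexed by the exposed id

preserve
    * One row per exposed patient
    keep set_id id sex yydx age dx status exit
    duplicates drop
    gen byte exposed = 1
    tempfile exposed_rows
    save `exposed_rows'
restore

* One row per selected comparator: swap in the _U copies
drop id age dx status exit
rename (id_U age_U dx_U status_U exit_U) (id age dx status exit)
keep set_id id sex yydx age dx status exit
gen byte exposed = 0

append using `exposed_rows'
gsort set_id -exposed id         // exposed first within each matched set
```

Sorting with the exposed row first reproduces the layout of the listing above.
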
Paul Dickman
Professor of Biostatistics

Biostatistician working with register-based cancer epidemiology.