/****************************************************************
MATCHING.DO

This code available at:
http://pauldickman.com/software/stata/matching.do

Construct a matched cohort study from two data sets (exposed 
and unexposed). Note we are matching on baseline characteristics
(i.e., this is not for a nested case-control study). 

The code illustrates how to randomly select up to 5 unexposed 
comparators matched on sex, year, and age (plus or minus 5 years).

exposed.dta contains the data for the exposed (n=156)
unexposed.dta contains data for the unexposed (n=7775)

See http://pauldickman.com/software/stata/matching/ for details.

Need to install rangejoin (which requires rangestat):
ssc install rangejoin
ssc install rangestat

matching.do contains code written by Paul Dickman.
matching2.do contains more elegant and easier to generalise code
             thanks to Bjarte Aagnes.

Paul Dickman (paul.dickman@ki.se)
13 March 2023
*****************************************************************/
clear all

// no trailing slash here; the slash is added when the file names are appended
local base http://pauldickman.com/software/stata
tempfile exposed unexposed  
copy `base'/exposed.dta `exposed'   
copy `base'/unexposed.dta `unexposed' 


use `exposed', clear

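// For each observation in exposed, pair it with every unexposed individual
// of the same sex and year of diagnosis whose age is within +/- 5 years.
// Variables from the unexposed file whose names clash get the suffix _U.
// The data set goes from 156 to 6054 observations
// (one observation per exposed-unexposed pair).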
rangejoin age -5 5 using `unexposed', by(sex yydx)

// there are 53 matches for ID 8353 
count if id==8353
list id id_U sex yydx dx dx_U age age_U if id==8353
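
// Optional check, not part of the original code (a sketch): summarise the
// number of candidate comparators per exposed individual before sampling.
// n_candidates and first are hypothetical variable names. preserve/restore
// leaves the data and its sort order unchanged, so the random selection
// that follows is unaffected.
preserve
bysort id: egen n_candidates = count(id_U)
egen byte first = tag(id)
summarize n_candidates if first, detail
// exposed individuals with fewer than 5 candidates end up in smaller sets
count if first & n_candidates < 5
restore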

// Randomly keep at most 5 unexposed per exposed individual
// (exposed with 5 or fewer matches keep all of them)
set seed 8675309
gen double shuffle = runiform()
by id (shuffle), sort: keep if _n <= 5
drop shuffle

// There are now 5 matches for ID 8353 
count if id==8353
list id id_U sex yydx dx dx_U age age_U if id==8353
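
// Optional checks, not part of the original code (a sketch):
// each (id, id_U) pair should appear only once, and no exposed
// individual should retain more than 5 comparators.
isid id id_U, missok
bysort id: assert _N <= 5
// Comparators are sampled separately for each exposed individual, so the
// same unexposed individual can be selected for more than one matched set.
// This report shows how often that happens.
duplicates report id_U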

// reshape from wide format to long format
// we want one observation per individual
// we will create the variable exposed (1 if exposed, 0 if unexposed)
// we will create the variable set_id to index the matched sets
// we start by renaming variables in a way easily recognised by -reshape-
rename age age1
rename status status1
rename dx dx1
rename exit exit1

rename age_U age2
rename status_U status2
rename dx_U dx2
rename exit_U exit2

// reshape from wide to long
reshape long age status dx exit, i(id id_U) j(exposed) 

// Look at ID 8353 
count if id==8353
list if id==8353

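// After the reshape, each exposed individual appears once per comparator.
// We create the desired variables and then drop the duplicated
// observations for the exposed, keeping one per matched set.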
generate set_id=id
replace exposed=0 if exposed==2
replace id=id_U if exposed==0
drop id_U

duplicates drop set_id if exposed==1, force

// Another look at the observations for ID 8353
list set_id id exposed sex yydx age status dx exit if set_id==8353

// A look at the first 18 observations
// (3 matched sets, provided each has 5 comparators)
list set_id id exposed sex yydx age status dx exit in 1/18, sepby(set_id)
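
// Optional check, not part of the original code (a sketch): every matched
// set should contain exactly one exposed individual and at most 5 unexposed.
// n_exposed and n_unexposed are hypothetical variable names.
bysort set_id: egen n_exposed = total(exposed == 1)
bysort set_id: egen n_unexposed = total(exposed == 0)
assert n_exposed == 1
assert n_unexposed <= 5
// distribution of observations by exposure status
tabulate exposed
drop n_exposed n_unexposed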