/****************************************************************
MATCHING2.DO

This code available at:
http://pauldickman.com/software/stata/matching2.do

Construct a matched cohort study from two data sets (exposed 
and unexposed). Note we are matching on baseline characteristics
(i.e., this is not for a nested case-control study). 

The code illustrates how to randomly select up to 5 unexposed 
comparators matched on sex, year, and age (plus or minus 5 years).
exposed.dta contains the data for the exposed (n=156).
unexposed.dta contains the data for the unexposed (n=7775).

See http://pauldickman.com/software/stata/matching/ for details.

Requires rangejoin, which in turn requires rangestat:
ssc install rangejoin
ssc install rangestat

matching.do contains code written by Paul Dickman.
matching2.do contains more elegant and easier to generalise code
             thanks to Bjarte Aagnes.

Bjarte Aagnes & Paul Dickman
16 March 2023
*****************************************************************/
clear all
 
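// download the example data to temporary files
// (no trailing slash on base, so the URLs below contain a single slash)
local base http://pauldickman.com/software/stata
tempfile exposed unexposed
copy "`base'/exposed.dta" "`exposed'"
copy "`base'/unexposed.dta" "`unexposed'"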
 
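// pair each exposed individual with every unexposed individual of the same
// sex and year (yydx) whose age is within 5 years; variables from the
// unexposed data whose names clash with the exposed data get the suffix _U
use `exposed', clear
rangejoin age -5 5 using `unexposed', by(sex yydx)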

rename (id age status dx exit) (=1) // suffix 1 = exposed
rename (*_U) (*0)                   // suffix 0 = unexposed
 
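// randomly select 5 unexposed if there are more than 5 matches:
// put the candidates for each exposed individual in random order and keep
// the first 5 (sampling is done separately for each exposed individual)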
set seed 8675309
gen double shuffle = runiform()
by id1 (shuffle), sort: keep if _n <= 5
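drop shuffle

// rangejoin keeps exposed individuals with no eligible comparator, with the
// comparator variables missing; drop these rows so they do not become empty
// comparator records (the exposed individuals themselves are appended below)
drop if mi(id0)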
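
// Optional check (a sketch, not part of the original code): tabulate how
// many comparators were retained for each exposed individual. The variable
// names n_comp and pickone are made up for this illustration and are
// dropped again straight away.
egen n_comp = count(age0), by(id1)   // non-missing comparator ages per exposed
egen byte pickone = tag(id1)         // flags one row per exposed individual
tabulate n_comp if pickone
drop n_comp pickone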

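// reshape from wide format to long format:
// keep the comparator records, labelled with the id of the exposed
// individual they were matched to (set_id), then append the exposed
keep id1 sex yydx *0
rename (id1 *0) (set_id *)
append using `exposed'
replace set_id = id if mi(set_id)   // each exposed individual defines a set
gen byte exposed = (set_id == id), after(set_id)
order *id exposed
gsort set_id -exposed id            // exposed first within each matched set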

list set_id id exposed sex yydx age status dx exit in 1/18 , sepby(set_id)
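
// Optional checks on the matched data (a sketch, not part of the original
// code). The variable name age_index is made up for this illustration.
// 1. Every comparator should be within 5 years of age of the exposed
//    individual in the same matched set.
bysort set_id (exposed): gen double age_index = age[_N]   // exposed sorts last
assert abs(age - age_index) <= 5 if !mi(age, age_index)
drop age_index
// 2. Comparators are sampled separately for each exposed individual, so the
//    same unexposed individual may appear in more than one matched set.
duplicates report id if exposed == 0
// restore the sort order used for the listing above
gsort set_id -exposed id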