/****************************************************************
MATCHING.DO

This code is available at:
http://pauldickman.com/software/stata/matching.do

Construct a matched cohort study from two data sets (exposed and
unexposed). Note we are matching on baseline characteristics
(i.e., this is not for a nested case-control study).

The code illustrates how to randomly select up to 5 unexposed
comparators matched on sex, year of diagnosis, and age (plus or
minus 5 years).

exposed.dta contains the data for the exposed (n=156)
unexposed.dta contains the data for the unexposed (n=7775)

See http://pauldickman.com/software/stata/matching/ for details.

Need to install rangejoin, which in turn requires rangestat:
ssc install rangejoin
ssc install rangestat

matching.do contains code written by Paul Dickman.
matching2.do contains more elegant and easier to generalise code
thanks to Bjarte Aagnes.

Paul Dickman (paul.dickman@ki.se)
13 March 2023
*****************************************************************/

clear all

local base http://pauldickman.com/software/stata/

tempfile exposed unexposed
copy `base'/exposed.dta `exposed'
copy `base'/unexposed.dta `unexposed'

use `exposed', clear

// For each observation in exposed, select all unexposed
// with the same sex and year of diagnosis and with age +/- 5 years
// the data set goes from 156 to 6054 observations
rangejoin age -5 5 using `unexposed', by(sex yydx)

// there are 53 matches for ID 8353
count if id==8353
list id id_U sex yydx dx dx_U age age_U if id==8353

// randomly select 5 unexposed if there are more than 5 matches
set seed 8675309
gen double shuffle = runiform()
by id (shuffle), sort: keep if _n <= 5
drop shuffle

// There are now 5 matches for ID 8353
count if id==8353
list id id_U sex yydx dx dx_U age age_U if id==8353

// reshape from wide format to long format
// we want one observation per individual
// we will create the variable exposed (1 if exposed, 0 if unexposed)
// we will create the variable set_id to index the matched sets

// we start by renaming variables in a way easily recognised by -reshape-
rename age age1
rename status status1
rename dx dx1
rename exit exit1

rename age_U age2
rename status_U status2
rename dx_U dx2
rename exit_U exit2

// reshape from wide to long
reshape long age status dx exit, i(id id_U) j(exposed)

// Look at ID 8353
count if id==8353
list if id==8353

// After the reshape, each exposed individual appears once per matched
// comparator, so we need to drop the duplicates for the exposed
// and create the desired variables
generate set_id=id
replace exposed=0 if exposed==2
replace id=id_U if exposed==0
drop id_U
duplicates drop set_id if exposed==1, force

// Another look at the observations for ID 8353
list set_id id exposed sex yydx age status dx exit if set_id==8353

// A look at 3 matched sets
list set_id id exposed sex yydx age status dx exit in 1/18, sepby(set_id)
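
// ---------------------------------------------------------------
// Optional sanity checks on the matched sets.
// This is an illustrative sketch added to the original code, not part
// of the author's workflow. It assumes the data are in the final long
// format created above. The variable names n_exposed, n_unexposed and
// age_diff are hypothetical helpers introduced only for these checks.
// ---------------------------------------------------------------

// each matched set should contain exactly one exposed individual
// and at most 5 unexposed comparators
bysort set_id: egen n_exposed   = total(exposed==1)
bysort set_id: egen n_unexposed = total(exposed==0)
assert n_exposed == 1
assert n_unexposed <= 5

// distribution of the number of comparators per matched set
// (one row per set, taken from the exposed individual)
tabulate n_unexposed if exposed==1

// within each set, sort so the exposed individual (exposed==1) is last,
// then check that every member is within 5 years of the exposed age
bysort set_id (exposed): gen age_diff = abs(age - age[_N])
assert age_diff <= 5 if !missing(age)

// comparators are sampled separately for each exposed individual, so
// the same unexposed person may appear in more than one matched set;
// this reports how often that happens
duplicates report id if exposed==0

drop n_exposed n_unexposed age_diff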