log using heckman_log.txt, text replace

*** Heckman Selection Models ************************************

* Download the dataset from the web:
use http://www.stata-press.com/data/r8/womenwk, clear

* This is a dataset of women's wages,
* where women who do not work have wage = . (missing)

* Generate a dummy for those who work, i.e., have an observed (non-missing) wage:
generate d = 1
replace d = 0 if missing(wage)

** Wages depend on (X vars) education and age...
** But there is a prior decision to get a job,
** so the labor force participation decision determines which wages are observed.
** We need to estimate a "selection equation":
** the work decision (a dummy) depends on (Z vars) being married and number of children,
** plus education and age.
** Note that X is a subset of Z: Z should contain at least one variable excluded
** from X (here, married and children). Without such an exclusion, identification
** rests only on the nonlinearity of the inverse Mills ratio.

** 1. Using the Heckman (1979) "two-step consistent" procedure:
heckman wage educ age, select(married children educ age) twostep
* Note that Stata automatically treats cases with missing wage as unobserved
* and reports the estimates of both the structural (wage) and the selection equations.

** 2. Using a maximum-likelihood procedure (results are usually close to the two-step ones;
** ML estimates both equations jointly and is generally more efficient,
** but it relies more heavily on the joint-normality assumption):
heckman wage educ age, select(married children educ age) nolog

** 3. To avoid ambiguity, you can also name the selection variable explicitly:
heckman wage educ age, select(d = married children educ age) twostep nolog
* Same result as in model 1

* Recall that the selection equation is just a probit (because Heckman assumes
* normally distributed errors in the selection equation):
probit d married children educ age, nolog
* It's the same result as the "selection" equation in model 3

** Exploring Model 3
heckman wage educ age, select(d = married children educ age) twostep nolog

predict cndwage, ycond
* ycond calculates the expected value of the dependent variable conditional on the
* dependent variable being observed/selected; E(y | y was observed).
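
* --- Sketch (not part of the original script): the two-step procedure by hand ---
* This illustrates what the twostep option does under the hood. The names heck3,
* zg, and imr are hypothetical, chosen only for this illustration.
* The point estimates should match model 3; the standard errors will not,
* because plain OLS ignores that imr is itself an estimate.
estimates store heck3                          // save model 3 so predict still works afterwards
probit d married children educ age, nolog      // step 1: selection equation
predict double zg, xb                          // linear index from the probit
generate double imr = normalden(zg)/normal(zg) // inverse Mills ratio
regress wage educ age imr if d == 1            // step 2: wage equation plus imr
* The coefficient on imr corresponds to the "lambda" reported by heckman, twostep.
estimates restore heck3                        // make model 3 the active results again
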
predict expwage, yexpected
* yexpected calculates the expected value of the dependent variable (y*), where that
* value is taken to be 0 when it is expected to be unobserved;
* y* = P(y observed) * E(y | y was observed).

* Create an artificial variable (actually, a left-censored variable)
gen wage0 = wage
replace wage0 = 0 if missing(wage)
* wage0 contains positive wages, or zeros where wage was missing

summarize wage cndwage if wage < .
* The mean predicted wage (conditional on being observed) matches
* the mean of the observed wages!

summarize wage0 expwage
* The mean predicted wage (for the full sample) is close to
* the mean of the artificial wage0 variable!

* ...and now you know it...

log close
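
* --- Sketch (not part of the original script): checking the yexpected identity ---
* The comment above states y* = P(y observed) * E(y | y was observed).
* Assuming the heckman results from model 3 are still the active estimates,
* the psel option of predict returns P(y observed), so the product can be
* compared with expwage directly. The names pobs and expcheck are hypothetical.
* Note: this runs after "log close", so its output is shown on screen only;
* move it above "log close" to capture it in heckman_log.txt.
predict double pobs, psel
generate double expcheck = pobs * cndwage
summarize expwage expcheck
* The two variables should agree up to rounding (cndwage and expwage are stored as floats).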