log using heckman_log.txt, text replace
*** Heckman Selection Models
************************************
*Downloading dataset from the web:
use http://www.stata-press.com/data/r8/womenwk, clear
* This is a dataset of women's wages,
* where women who do not work have wage = . (missing)
* Generating a dummy for those who work, i.e., have an observed (non-missing) wage:
generate d = 1
replace d = 0 if missing(wage)
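* A quick check, as a sketch: d should be 0 exactly where wage is missing.
tabulate d
count if d == 0 & wage < .
* The count above should be 0 if the dummy was built as intended.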
** Wages depend on (X vars) education and age...
** But there is a prior decision to get a job,
** so labor force participation determines which wages we observe.
** We need to estimate a "selection equation":
** Work-decision (a dummy) depends on (Z vars): being married, # of children,
** plus education and age.
** Note that X is a subset of Z, and Z should include at least one variable
** (here married, children) that is excluded from X. Without such an exclusion,
** the model is identified only through the nonlinearity of the inverse Mills ratio.
** 1. Using Heckman (1979) "two-step consistent" procedure:
heckman wage educ age, select (married children educ age) twostep
* Note that Stata automatically treats observations with missing wage as unobserved
* and reports the estimates of both the wage (outcome) equation and the selection equation
** 2. Using a maximum-likelihood procedure (the default). Results are usually close
** to the two-step ones; under joint normality ML is consistent and more efficient:
heckman wage educ age, select (married children educ age) nolog
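* A sketch for comparing the two estimators side by side. The stored names
* hk2step and hkml are hypothetical labels chosen for this illustration.
quietly heckman wage educ age, select(married children educ age) twostep
estimates store hk2step
quietly heckman wage educ age, select(married children educ age)
estimates store hkml
* Coefficients and standard errors of both fits in one table:
estimates table hk2step hkml, b(%9.4f) se(%9.4f)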
** 3. To avoid ambiguity, you can also name the selection dummy explicitly:
heckman wage educ age, select (d = married children educ age) twostep nolog
* Same result as in model 1
* Recall that the selection equation is just a probit (because Heckman assumes
* a normally distributed error in the latent equation behind "d")
probit d married children educ age, nolog
* The coefficients are the same as in the selection equation of model 3
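* A sketch of the two-step procedure done by hand, to show what "twostep" does.
* Assumptions: Stata 9 or later (for normalden() and normal()); zhat and imr
* are hypothetical variable names introduced only for this illustration.
quietly probit d married children educ age
* Linear index from the probit (Z*gamma):
predict zhat, xb
* Inverse Mills ratio: normal density over normal CDF at the index
generate imr = normalden(zhat) / normal(zhat)
* Second step: OLS on the selected sample, adding the inverse Mills ratio
regress wage educ age imr
* The coefficients should match the wage equation of models 1 and 3, with the
* coefficient on imr matching "lambda". The standard errors from regress are
* not corrected for the estimated imr, so they will differ from heckman's.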
** Exploring Model 3
heckman wage educ age, select (d = married children educ age) twostep nolog
predict cndwage, ycond
* ycond calculates the expected value of the dependent variable conditional on the
* dependent variable being observed/selected; E(y | y was observed).
predict expwage, yexpected
* yexpected calculates the expected value of the dependent variable (y*), where that
* value is taken to be 0 when it is expected to be unobserved;
* y* = P(y observed) * E(y | y was observed).
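* A sketch verifying the formula above by hand. psel_chk, expwage_chk and
* expwage_dif are hypothetical names introduced only for this check.
* psel gives the predicted probability of being observed/selected:
predict psel_chk, psel
generate expwage_chk = psel_chk * cndwage
generate expwage_dif = expwage - expwage_chk
* The difference should be zero up to rounding error:
summarize expwage expwage_chk expwage_dif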
* Create an artifact variable (actually, a left-censored variable)
gen wage0 = wage
replace wage0 = 0 if wage >= .
* wage0 contains the observed wage, or zero when wage was missing
summarize wage cndwage if wage < .
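* The mean predicted wage (conditional on being observed) is the same
* as the mean observed wage! (The second step is an OLS regression with a
* constant, so its residuals average to zero over the selected sample.)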
summarize wage0 expwage
* The mean predicted wage (for the full sample) is very close to
* the mean of the wage0 artifact (close, but not necessarily identical)!
* ...and now you know it...
log close