Statistical Modeling of Community Level Outcomes

One approach that has been tried in the evaluation of the effects of programs measured at the level of communities or larger is statistical modeling of community level outcomes. In these cases, the procedure is to use past data on the outcome variable to estimate a statistical model and then use that model to generate a form of counterfactual - what would have happened to that outcome at the community level had the program not been instituted. The measured outcomes in the program community are then compared to the values predicted from the statistical model in order to assess the impact of the program on the outcome of interest.

1. Time-series Modeling

Time series models of community level outcomes have long been advocated as a means of assessing the effects of program innovations or reforms[13]. In the simplest form, the time-series on the past values of the outcome variable for the community is linearly extrapolated to provide a predicted value for the outcome during and after the period of the program intervention. In a sense, the pre-post designs discussed above are a simple form of this type of procedure. It has been recognized for a long time that the simple extrapolation design is quite vulnerable to error because even in the absence of any intervention community variables rarely evolve in a simple linear fashion. Some attempts have been made to improve on the simple linear form by introducing some of the more formal methods of time-series modeling[14]. Introducing non-linearities in the form can allow for more complex reactions to the program intervention (McCleary and Riggs 1982). Another attempt uses preprogram measures of cohorts as a time-series of comparison group values used with the in-program treatment measures for previous cohorts (McConnell, 1982).
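The simplest extrapolation design described above can be sketched in a few lines. This is only an illustration under stated assumptions: the yearly outcome values are invented, and the helper names (`fit_line`, `linear_counterfactual`) are hypothetical, not taken from any of the studies cited. The sketch fits a straight line to the pre-program series, projects it into the program period as the counterfactual, and takes the observed-minus-predicted gap as the impact estimate.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def linear_counterfactual(pre_periods, pre_values, program_periods):
    """Extrapolate the pre-program linear trend into the program periods."""
    slope, intercept = fit_line(pre_periods, pre_values)
    return [intercept + slope * t for t in program_periods]


# Hypothetical community outcome series (illustrative numbers only).
pre_periods = [1985, 1986, 1987, 1988, 1989]
pre_values = [10.0, 11.0, 12.0, 13.0, 14.0]
program_periods = [1990, 1991]
observed = [13.5, 14.0]

predicted = linear_counterfactual(pre_periods, pre_values, program_periods)
# Impact estimate = observed outcome minus extrapolated counterfactual.
estimated_impact = [obs - pred for obs, pred in zip(observed, predicted)]
```

The estimate is only as good as the assumption that the pre-program trend would have continued unchanged, which is exactly the vulnerability the text goes on to discuss.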
The problem with these methods is that they do not explicitly control for variables other than the program intervention which may have influenced the outcome variable.

2. Multivariate Statistical Modeling

Some attempts have been made to estimate multivariate models of the community level outcome variables in order to generate counterfactuals for program evaluation[15]. We have not been able to find examples of such attempts at the community level but there are several examples of attempts to estimate caseload models for programs (such as AFDC and Food Stamps) at the national and state level (Grossman 1985, Beebout and Grossman 1985, Garsky 1990, Garsky and Barnow 1992, Mathematica Policy Research ... Puerto Rico: Volume II 1985). Most analysts have considered the results of these models to be unreliable for program evaluation purposes. For example, effects of changes in the low wage labor market appeared to have swamped the effects controlled for in models of the AFDC caseload in New Jersey, leading to implausible estimates of the effects of an AFDC reform in New Jersey. Note that these models would have to attempt to separate out variables that are likely to affect the outcome variable but which would not themselves be affected by the program intervention and then to measure those variables in the intervention community during the course of the program and/or post-program period. For example, as noted in the New Jersey case, one would have to have obtained good measures of how the demand for low wage labor is affected at the level of the community in order to estimate the statistical model and then obtain measures of those variables during the period of the program for that community and use those measures in the statistical model to generate the counterfactual.
Recall in the examples discussed above how comparison communities in EOPP were affected by floods, hurricanes and volcanic eruptions or in YIEPP where court-ordered school desegregation occurred in the comparison community. Adequate statistical modeling would have to attempt to incorporate such factors. Statistical modeling at the community level also runs up against the problem of the limited availability of small area data, particularly provided on a consistent basis over several periods of time or across numerous communities. Such data are necessary both for the estimation of the statistical model of the community level outcome and for the projection of the counterfactual value of the outcome for the program period, e.g. if the model includes local employment levels as affecting the outcome then data on local employment during the program period must be available to use in the model. A general problem in using this approach, which we will return to in the concluding section, is that there has been so little quantitative study of community level data. The development of good statistical models at the community level will require more extensive efforts to bring together community level data and to understand what factors influence how communities evolve.

Types of Hypotheses Which Could Be Tested

In this section we outline the types of hypotheses with respect to community wide initiatives which could be tested in various types of evaluative situations. We will discuss hypotheses in broad generic terms before considering some more directly tied to possible community-wide initiative concerns. It is important to be clear at the outset of this section that for the most part we are discussing hypothesis testing where random assignment has been possible.
Thus we take as background the fundamental problems of developing community level counterfactuals discussed above, and here we discuss the added problems which arise according to the type of hypothesis to be tested. In a few places we remind the reader about the more fundamental community selection bias problems but we want to avoid continually repeating that this is an underlying problem.

a. Single outcome from a single treatment

The situation which most closely follows the classical experimental design is one in which there is a single outcome variable which is hypothesized to be affected by a single simple treatment. For example, the birth weight of children is hypothesized to be affected by the provision of a guaranteed minimum level of cash income to the pregnant mother. (We use this example because, in fact, that was one of the unforeseen outcomes in one of the Negative Income Tax experiments in the 1970s, see Keherer and Wohlin ...). The outcome, birthweight, is easy to measure and relatively well monitored. The treatment is about as straightforward as any we can think of, though even in this case there can be numerous complications of definition and implementation. If this is not a community-wide initiative, then one could use random assignment of individual women to getting the guarantee or not getting it and thereby create a treatment group and control group. If the guarantee were made community-wide then one would face all the problems of creating an adequate constructed comparison group which have been outlined above. Note that we have no complex theory here of the mechanisms through which the treatment affects the outcome; this is the "black box" approach. We could hypothesize some simple mechanisms through which the treatment might operate and then seek to monitor related processes, e.g., the mothers use the better income to improve their diet, but we do not test these mechanisms directly.
For example, an alternative mechanism might be the reduced stress on the mothers due to the removal of uncertainty about income. It is usually straightforward to extend this type of situation to the case where there are hypothesized to be multiple outcomes affected by the single treatment. For example, it could be hypothesized that the guaranteed income would improve not only birthweight of newborns but also the school performance of school age children in the household (another largely ignored outcome in several of the Negative Income Tax experiments). Most of the work-welfare experiments hypothesized, and measured, effects on both employment and receipt of welfare.

b. Single outcomes from multiple treatments

Single outcomes could be affected by different structures of treatments. Multiple treatments can be generated by systematically varying parameters of a single type of treatment. An example of this is the National Health Insurance experiment. Health insurance was the type of treatment and the parameters which were systematically varied were the levels of deductibles and co-payments (as well as assignment to an HMO). The major outcome of interest was expenditures on medical care. Random assignment of individuals to insurance plans with different values for these parameters permitted tests of the independent effects of co-payments and deductibles. Critical here is the systematic structuring and variation of the parameters of interest; sufficient independent variation of deductibles and co-payments among the groups of individuals was necessary in order to estimate the effects of varying one while holding the other constant. There was no "null treatment" in this case; everyone had health insurance, and the groups varied in the level of out-of-pocket cost per unit of utilization of medical services. Again, it was also possible to test hypotheses with respect to more than one outcome.
In this example, researchers could test the effects of the treatment parameter variations on the types of medical services utilized, e.g. hospital or outpatient, medication, tests, specific procedures. Interestingly, health status was not, originally, a principal outcome hypothesized to be affected, and, indeed, the original design of the experiment had no provision for attempting to measure health status. In the end, however, extensive and innovative research was done to develop better measures of health status and a few effects of the health insurance parameters on health status were detected (see Newhouse, et al. 1993).

c. Internal treatment differences: length of stay, participants and non-participants

One of the aspects of formal evaluations which adhere to rigorous standards of random assignment which has been most vexing to operators of programs being evaluated and to policy-makers who seek guidance from evaluations is the limitation on testing hypotheses about the effects of differences in exposure to the treatment within the group which was randomly assigned to the treatment category - as opposed to the control group. We call these "internal treatment differences". Of greatest common concern are differences in outcomes between those in the treatment group who actually participate in the program and receive services and those who do not participate. It appears sensible to most persons to evaluate a hypothesis about the effects of the treatment by seeing what happens to those who actually received services. The rigorous inference standards, however, require that the non-participants be included with the participants as part of the treatment group and compared to the control group in order to estimate the effects of the treatment on the outcome of interest. Naturally, to many it appears that the treatment is "diluted" by the inclusion of non-participants.
The reason that rigorous inference standards call for this procedure is the same problem we have been discussing throughout this paper: how does one find an appropriate counterfactual? The problem in this case is to isolate the appropriate subset of the control group who would have participated had they been offered the treatment. Once again, the difficulty is with selection bias caused by unmeasured variables which affect both participation and the outcome. To repeat from the earlier discussion: the unmeasured effects could go either way. Perhaps those who participated are better motivated or on the contrary perhaps they are those that had fewer alternative opportunities or less "gumption" to get out and "do it on their own". Or the selection may have been generated by aspects of the treatment such as the decisions of the program operators to make greater efforts to enroll the "better candidates" or on the contrary to assure receipt of services by those "most in need".

In certain limited circumstances, it is possible to estimate differences between participants and non-participants but it requires one very strong assumption: that the process that resulted in the offer of the opportunity to receive the treatment did not itself generate behavior that affects the outcome of interest. If that assumption holds then it can be assumed that the proportion of non-participants and their average outcome would be the same in both control and experimental group, and the average outcome for the non-participants in the treatment group can be assumed for the same proportion of the control group and subtracted from the average outcome for the whole control group (with variance of the estimates appropriately corrected)[16]. As an example of why this strong assumption might not hold, consider some of the work-welfare programs which have been based on mandatory entrance into the program process.
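For reference while reading the example that follows, the adjustment just described can be written out numerically. This is a minimal sketch under the strong assumption stated above (the offer itself does not change the behavior of non-participants); the function name and all figures are hypothetical, and the variance correction mentioned in the text is not attempted here.

```python
def participant_impact(mean_treat_participants,
                       mean_treat_nonparticipants,
                       mean_control,
                       participation_rate):
    """Estimate the impact on participants from group means.

    Assumes the control group contains the same share of would-be
    non-participants, with the same average outcome, as the treatment group.
    """
    p = participation_rate
    # Remove the assumed non-participant share from the control group mean,
    # leaving the implied mean for control members who would have participated.
    control_would_be_participants = (
        mean_control - (1 - p) * mean_treat_nonparticipants
    ) / p
    return mean_treat_participants - control_would_be_participants


# Hypothetical figures: half of the treatment group participates.
p = 0.5
treat_part, treat_nonpart, control = 120.0, 80.0, 95.0

# Impact with participants and non-participants pooled as the treatment group.
pooled_treatment_mean = p * treat_part + (1 - p) * treat_nonpart
pooled_impact = pooled_treatment_mean - control

# Impact on participants under the strong assumption.
impact_on_participants = participant_impact(treat_part, treat_nonpart, control, p)
```

Under this assumption the participant impact equals the pooled impact divided by the participation rate, so the adjustment rescales the estimate rather than adding information; if the assumption fails, as in the mandatory-program example, the rescaled figure is biased.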
After random assignment those assigned to the treatment group faced the threat of sanctions - reduction or elimination of welfare payments - if they did not participate in given program activities. Those who do not participate undoubtedly make greater effort to find alternatives to welfare than would their equivalents in the control group who do not face the threat of such sanctions.

In addition to the problems of rigorous statistical inference, another consideration leads to an argument that one should estimate impacts with both non-participants and participants included as the treatment group. Usually we are interested in how a given program, which is the treatment, will affect a given population and part of their response to the program is the decision to participate, to take up the services offered. Unless we can somehow force people to utilize the services, the estimates of the program's impact should take into account the likely nonparticipation.

However, as some have emphasized[17], it may be important to try to understand better the factors that affect program participation decisions. These decisions are undoubtedly affected to some degree by the way the program is presented, by the way it is implemented, by the "reputation" it develops. To the degree that these aspects can be controlled through policy, modifications of the program could induce different levels of participation in different groups. Better understanding of participation decisions could also lead to more effective quantitative modeling of participation decisions. Such models, if highly effective, might permit use of better "selection bias correction" methods similar to those outlined above.

The types of problems created by the participation phenomenon carry over into the issue of the effects of the length of stay or length of exposure.
Here also program operators and policy-makers argue for estimates of differences in impacts associated with a longer exposure to the treatment: "don't those that stay longer get more services and therefore do better?" Once again, standards of rigorous inference preclude sound estimates of the effects of length of stay; how do we separate out those in the control group who would have stayed longer so they may be compared to the long stayers in the treatment group? Why do some members of the treatment group stay longer? We can use their measured characteristics to "match" them with control group members but once again it is the unmeasured variables which may bias the estimates: motivation operating either for or against long staying, program operator actions to encourage particular individuals to stay or to leave early. Once again, one can try to model quantitatively the determinants of length of stay and correct for bias but the chances for success in this are quite small. Also, from the policy perspective, estimates of the effects of length of stay would only be useful to the extent that one had the policy instruments which would cause individuals to stay longer[18].

These same problems would arise with other forms of internal treatment differences such as choice of different training streams within a given training program. Since the choices occur after random assignment, unmeasured variables may influence how the choices are made and would bias any estimates derived from differences between treatments and controls[19].

d. Interaction Effects

In this section we will discuss several types of interaction effects: those between characteristics of participants and the treatment, those among various dimensions of one treatment or multiple types of treatments, those among different types of participants (or institutions).
It seems evident that arguments for community-wide interventions are based on assumptions about the importance of several of these types of interaction effects. Obtaining estimates of some of these interactions is relatively straightforward within a context in which random assignment is possible[20]. We will discuss these first.

i. Subgroups

The groups involved in the intervention study can be divided in a variety of ways into subgroups based on their preprogram characteristics, e.g., ethnicity, level of education, gender. Hypotheses regarding differences in the effects of the treatment can be tested by estimating separately for each group the difference in the outcome variable for those in the treatment group and the control group. The important differentiation between this and the just previously discussed situation is that the characteristics defining the subgroups were determined prior to entrance into the treatment or control group (they are exogenous to the determination of treatment status) so that there is no opportunity for selection on unmeasured variables. The only problem in this case is the reduction in sample size as the total study sample is broken down into smaller subgroups. The smaller the sample size, the bigger must be the impact on the outcome variable to pass the test for statistical significance; there is a greater chance that even though there is a sizeable difference in the outcome between treatment group members and control group members it will be judged to be statistically insignificant.

ii. Treatment interactions

Often there are multiple dimensions to a treatment or there is more than one type of treatment being administered. For example, a training program will also provide a sex education program and the outcome of interest is number of births and the interaction to be tested is whether the combination of training and sex education has a greater effect in reducing births than each program taken alone.
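The interaction in this example can be made concrete with group means. The sketch below is illustrative only: it assumes random assignment to each arm and, as an added assumption not stated in the example, a no-treatment control group to serve as the baseline. The function name and the birth counts are hypothetical.

```python
def interaction_contrast(mean_control, mean_a_only, mean_b_only, mean_both):
    """Return each arm's effect relative to control and the interaction term.

    The interaction is the combined effect minus the sum of the two
    separate effects; zero means the treatments are simply additive.
    """
    effect_a = mean_a_only - mean_control
    effect_b = mean_b_only - mean_control
    effect_both = mean_both - mean_control
    interaction = effect_both - (effect_a + effect_b)
    return effect_a, effect_b, effect_both, interaction


# Hypothetical births per 100 participants in each randomly assigned arm.
effect_training, effect_sex_ed, effect_combined, interaction = interaction_contrast(
    mean_control=20.0,   # assumed no-treatment baseline
    mean_a_only=18.0,    # training alone
    mean_b_only=17.0,    # sex education alone
    mean_both=12.0,      # training plus sex education
)
```

With these invented numbers the combined arm reduces births by 8 per 100, against 2 and 3 for the separate arms, so the interaction term is a further reduction of 3. Whether such a contrast is statistically distinguishable from zero depends on the sample size in each arm.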
What is required to estimate the effects of these types is random assignment to different treatment combinations, e.g., training alone, training plus sex education and sex education alone[21]. Estimates of the interaction effects can be made by comparing effects in each group separately (or equivalently in an estimating equation including both linear and interaction terms). Once again, the major problem is assuring that there is sufficient sample size in the various groups to have statistical power adequate to find statistically significant interaction effects of the size relevant for the investigator's interest (policy or scientific).

iii. Group interactions

Now we consider the type of interaction most relevant to community-wide initiatives, those involving interactions among individuals and between individuals and institutions which modify the impact of the treatment. Brown and Richmond illustrate this concern: "Too often in the past, narrowly defined interventions have not produced long-term change because they have failed to recognize the interaction among physical, economic and social factors that create the context in which the intervention may thrive or flounder"[22]. Commentators have classified such interactions in a variety of ways: contagion or epidemic effects, social capital, neighborhood effects, externalities and social comparison effects (some of these sometimes treated as sub-categories of others). We have not taken the time to carefully catalog and reorder these classifications (though such an analysis might help with an orderly development of evaluation research). We simply give examples of some broad categories of group interactions which might be of concern to evaluators of community initiatives.

iv. networks and group learning

The importance of associational networks has been increasingly emphasized in the literature on communities and families.
The general idea that response to an intervention can be conditioned by the associational networks stems most simply from the idea that information about the form of the intervention, how it treats individuals in various circumstances, is likely to be passed from individual to individual and as a result the group learning about the intervention is likely to be faster and greater than would be the learning of the isolated individual. This in turn may condition the individual response to the intervention. Different network structures would induce different degrees of group learning and therefore different responses.

Stronger forms of interaction from networks would fall under what some have called "norm formation"[23]. Here one might mean both the way preexisting norms either impede or facilitate response to the intervention or the way group learning in response to the intervention reshapes group norms. For example, the existence of "gang cultures" may impede interventions or some interventions may seek to reshape the norms of the "gang culture" to cause them to facilitate other aspects of the intervention. Finally, some interventions may seek to operate directly on networks, having social network change as either an intermediate or final outcome of interest.

The evaluation problems will differ depending on how these associational networks are considered. For example, suppose the objective is to test how different associational networks affect response to a given intervention. If networks are measured and classified prior to the intervention then individuals could be broken into different subgroups according to network type and subgroup effects could be analyzed in the usual manner just described above (the network category is exogenous to the intervention).
To the extent that the character of associational networks is an outcome variable (intermediate or final) of interest, it can be measured and the impact of the intervention upon it analyzed in the same fashion as for other outcome variables; measurements of the associational networks of those in the treatment group can be compared to the networks of those in the control group. Here the problems are primarily those associated with the reliability and consistency of measures of associational networks and what the properties of those measures are (what is their normal variance, how sensitive are they likely to be to impact from the intervention) as they relate to the adequacy of the sample design for evaluation.

Notice that the previous paragraphs take the network to be something that can be treated as a characteristic of the individual and the individual the unit of analysis. These analyses could be carried out even when there is not a community-wide intervention. Most would argue, however, that the group learning effects are really most important when groups of people, all subject to the intervention, interact. When this is the case, we are immediately thrown into the problems covered above under the discussion of using communities for constructed comparison groups; since random assignment of individuals to the treatment or control group is precluded when one wishes to have groups of individuals potentially in the same network treated, testing for this form of interaction effect will be subject to the same problems of selection bias outlined above.

v. Interactions through formal and informal institutions

Most interventions take the form of alteration of some type of formal institution that affects the individuals: a day care center, a welfare payment, an education course. A given broad type of intervention can be delivered through different types of formal structures - e.g. income support can come through a cash payment or an in-kind (food stamps) payment.
The interactions of formal institutions with broad treatment of this type have been evaluated - e.g. in studies of food stamps cash-outs. However, most of those concerned with community-wide initiatives appear to be more interested in either the way the formal institutional structure in a given community conditions the individuals' responses or with the access to or behavior of the formal institutions themselves as outcomes of the intervention. With respect to the former concern, some studies seek to have the formal institutional structure as one of the criterion variables by which communities are matched and thus seek to neutralize the impact of interactions of formal institutions and the treatment. Both the Healthy Start and the School Dropout studies have already been mentioned as examples in which matching formal institutional structures is a concern in selection of comparison sites and we have already mentioned the problems of measurement and the limits of statistical gains from such attempted matches.

With respect to access to or behavior of formal institutions as outcomes there are different problems. First, access as an outcome variable might be relatively straightforward to measure and it may be easy to estimate the effects of the intervention on it, e.g., the number of doctor visits of pregnant women, participation in bilingual education programs. With respect to behavior of institutions the question arises as to whether the institution itself is the primary unit of analysis.
When the institution itself is the unit of analysis then we must face anew all the aspects of sample design if we wish to use formal statistical inference concerning the behavior of an institution: what is the measure of behavior of interest; what is its normal variance; how many units can be subject to the intervention treatment; can we do random assignment or can we rely on constructed comparison groups; can sufficient sample size be attained; and, more deeply, is the underlying behavior of the institution generated by a common stable process? Informal institutions are also subjects of interest. The associational networks discussed above are surely examples, as are gangs. But there are informal economic structures which also fall in this category. The labor market is an informal institution whose operations interact with the intervention and condition its impact. This can be most concretely illustrated by reference to a problem sometimes discussed in the literature on employment and training programs: "displacement". The basic idea is that workers trained by a program may enter the labor market and become employed but if there is already involuntary unemployment in the relevant labor market, total employment may not be increased because that worker simply "displaces" a worker who would have been employed in that job had the newly trained worker not shown up[24]. An evaluation with a number of randomly assigned treatment and control group members which is small relative to the size of the relevant labor market would be unable to detect these "displacement" effects if they did occur because their numbers are too small relative to the size of the market; the trained treatment group member is not likely to show up at exactly the same employer as the control group member would have. 
It has been argued by some that use of community-wide interventions in employment and training would provide an opportunity to measure the extent of such "displacement effects" because the size of the intervention would be large relative to the size of the local labor market; indeed one of the hopes for the YIEPP was that it would provide such an opportunity. But as the experience with YIEPP, described above, illustrates, the use of comparison communities called for in this approach is subject to a number of serious pitfalls[25].

iv. interactions with external conditions

Some attempts have been made to see how changes in conditions external to an intervention which are experienced commonly by the treatment and control group members have conditioned the response to the treatment. For example, in the National Supported Work Demonstration attempts were made to see if the response to the treatment (supported work) varied systematically with the level of local unemployment. In this case there were no statistically significant differences in response but researchers felt this may well have been due to the weakness of statistics on the city by city unemployment rate.

e. Dynamics

Evaluations based on formal statistical methods have, to our knowledge, attempted to deal directly with issues of dynamics in very partial and limited ways. Usually we use the term dynamics to apply to the time dimension of either the treatment or the response. The classical experimental paradigm calls for a well identified treatment applied consistently to all the members of the treatment group. We recognize that there are dynamic aspects of most treatment implementations and often suggest that evaluations not begin their observational measurement during the initial period of program buildup because it is felt that the treatment regime is not yet stabilized. More realistically, however, we recognize that, for nearly all social interventions, treatment regimes are really not stable and consistent.
Perhaps some of the cash transfers (the negative income tax) approximated this condition since the rules for transfer amount determination remained constant over time, but most interventions are delivered through some administrative structures and these administrative structures evolve and change over time for a whole host of reasons of their own. Thus the best we can do in most cases is to say: where there is random assignment there is a control subject for each treatment subject and whatever was happening on average to the treatment subjects, under the broad general conceptual treatment, e.g. training, we can measure its effects relative to the controls. Discrete sequential experiments could be planned a priori where sequences are dependent on prior stage outcomes[26]. How to evaluate the dynamics of changes in treatment in response to learning about the implementation of the treatment, where the alterations in treatment are largely a matter of local response, remains, to our knowledge, a largely ignored problem[27].

There are a few attempts to measure dynamics in the response to treatments. Many employment and training programs have carried out post-program measurements at several points in time in order to attempt to measure the time path of treatment effects. These time paths are important for the overall cost-benefit analyses of these programs because the length of time over which benefits are in fact realized can greatly influence the balance of benefits and costs; the rate of decay of benefits has been an important issue in cost-benefit analyses of training and employment programs. Studies have shown both cases in which impacts appear in the early post-program period and then fade out quickly thereafter (as is often claimed about the effects of Head Start) and cases in which no impacts are found immediately post-program but emerge many months later (e.g. in the evaluation of the Job Corps).
At the other end, some of the attempts to improve upon evaluation of education, training and employment programs have tried to use estimates of the preprogram time path of the variable to be used as an outcome and to attempt to assure that comparison groups adequately "match" in terms of such time paths[28]. Of course, the interrupted time series design discussed above deals with dynamics in this sense. We do not know of any attempts to trace with rigorous statistical methods dynamic patterns of response and feedback effects operating over time through interactions of individuals with each other or institutions nor of any suggestions of what methods might be employed to do so.

Steps in Development of Better Methods

There are no strong recommendations which we can make concerning how best to approach the problem of evaluation of community-wide initiatives. In situations where random assignment of individuals to treatment and control groups is precluded there is no surefire method for assuring that the evaluation will not be subject to problems of selection bias when constructed comparison groups must be used - whether individuals or communities - to create the counterfactual, and pre-post designs remain vulnerable to exogenous shifts in the context which may affect outcome variables in unpredictable (and often undetectable) directions. As of now, we do not see clear indications of what second-best methods might be recommended, nor have we identified particular situations which make a given method particularly vulnerable.

It is important to stress, once again, that the vulnerability to bias in estimation of the impacts of interventions should not be taken lightly. First, the few existing studies of the problem show that the magnitude of errors in inference can be quite substantial even when the most sophisticated methods are used.
Second, the bias can be in either direction: we may not only be led to conclude that an intervention has had what we consider to be positive impacts when in fact it had none, we may also find ourselves confronted with impact estimates which indicate, due to bias, that the intervention was actually harmful; we may be misled either to promote policies which in fact use up resources and provide few benefits or we may be led to discard types of interventions as unsuccessful which actually have underlying merit. Once such biased quantitative findings are in the public domain, it is very hard to get them dismissed, to prevent them from influencing policy decisions, even when we have strong intuition that they are biased. Beyond these rather dismal conclusions and admonitions, the best we can suggest at this time are some steps which might be taken to improve our potential for understanding how communities evolve over time and hope that that better understanding will help us to create methods of evaluation which are less vulnerable to the types of bias we have pointed out. a. Improve Small Area Data We have stressed at several points that detailed small area demographic data are very hard to come by except at the time of the decennial census. Increasingly, however, records data are being developed by a wide variety of entities which can be tied to specific geographic areas (geo-coded data). One type of work which might be fruitfully pursued is to combine various types of records data with two or more Censuses to try to develop models in which the trends in the Census data can be related to the time-series of the records data[29]. Cross-section correlations of base period records data with Census variables could be combined with the time-series of the records data to see how well they could predict the end period Census demographics for given small geographic areas. 
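As a rough illustration of the prediction exercise just described, the following Python sketch fits a base-period cross-section relationship between one records variable and one Census variable across small areas, applies that relationship to end-period records values, and scores the predictions against the end-period Census. The function names, the single-predictor linear form, and the mean absolute error criterion are illustrative assumptions, not a procedure taken from the studies cited; a real application would likely combine several records series in a richer specification.

```python
from statistics import mean


def fit_line(x, y):
    """Least-squares fit of y = a + b * x across small areas (base period)."""
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    if sxx == 0:
        raise ValueError("records variable has no variation across areas")
    b = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx
    a = my - b * mx
    return a, b


def predict_end_census(base_records, base_census, end_records):
    """Apply the base-period cross-section relationship to end-period records."""
    a, b = fit_line(base_records, base_census)
    return [a + b * r for r in end_records]


def mean_abs_error(predicted, actual):
    """Average absolute gap between predicted and actual end-period Census values."""
    return mean(abs(p - q) for p, q in zip(predicted, actual))
```

A small prediction error across many areas would suggest that the records series can stand in for Census demographics between censuses; a large error would argue against relying on them for that purpose.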
Our experience with the availability of records data at the state level (when working on the design of the evaluation of Pew's Children's Initiative) convinced us that there are far more systems-wide records being collected - in many cases with individual and geographic area level information - than we would have thought. Much of the impetus for the development of these data systems comes from the Federal government in the form of program requirements (both for delivery of services and for accountability) and, more importantly, from the Federal financial support for systems development. Evaluations of employment and training programs have already made wide use of Unemployment Insurance records and these records have broad coverage of the working population. More limited use has been made of Social Security records. In a few cases, it has been possible to merge Social Security and Internal Revenue Service records. Birth records collection has been increasingly standardized and some investigators have been able to use time series of these records tied to geographic location. The systems records beyond these four cover much more restricted populations, e.g., welfare and food stamps, Medicaid and Medicare, WIC. More localized record systems, which present greater problems of developing comparability, are education records and criminal justice records. However, in some states statewide systems have been, or are being, developed to draw together the local records. We are currently investigating other types of geo-coded data that might be relevant to community-wide measures. Data from the banking systems have become increasingly available as a result of the Community Reinvestment Act (HMDA data). Local real estate transaction data can sometimes be obtained, but information from tax assessments seems harder to come by.
In all of these cases, whenever it is desired to obtain individualized data, problems of confidentiality present substantial barriers to general data acquisition by anyone other than public authorities. Even with the Census data, for many variables, one cannot get data at a level of aggregation below block group level. b. Enhance Community Capability to Do Systematic Data Collection We believe that it is possible to pull together records data of the types just outlined to create community databases which could be continuously maintained and updated. These data would provide communities with some means to keep monitoring, in a relatively comprehensive way, what is happening in their areas. This would make it possible to get better time-series data with which to look at the evolution of communities. To the degree that communities could be convinced to maintain their records within relatively common formats, an effort could be made to pull together many different communities to create a larger data base which would have time-series, cross-section structure and would provide a basis for understanding community processes. Going a step beyond this aggregation of records, attempts could be made to enhance the capability of communities to gather new data of their own. These could be anything from simple surveys of physical structures based on externally observed characteristics (type of structure, occupied, business or organization, public facility, etc.) carried out by volunteers within a framework provided by the community organization to full-scale household surveys on a sample or on a census basis. c. Create a Panel Study of Communities As already noted above, if many communities used common formats to put together local records data one would have a time-series, cross-section database potential. 
In the absence of that, admittedly unlikely, development, it might be possible to imitate the several nationally representative panel studies of individuals (the Panel Study of Income Dynamics, the National Longitudinal Survey of Youth, and High School and Beyond, to name the most prominent) which have been created and maintained, in some cases since the late 1960s. Here the unit of analysis would be communities - somehow defined. The objective would be to provide the means to study the dynamics of communities. Such a panel would provide us with important information on what the cross-section and time-series frequency distributions of community level variables look like, which are important ingredients, as we have argued above, for an evaluation sample design effort with communities as units of observation. It would also provide the best basis for our next suggestion, work on modeling community level variables. Short of creating such a panel study, some steps might be taken to at least get Federally funded research to try to pull together across projects the information developed on various community level measures. There are increasing numbers of studies where community level data are gathered for evaluating or monitoring programs or for comparison communities. We noted above several national studies which were using a comparison site methodology (Health Start, Youth Fair Chance, the School Dropout Study), and some gains might be made if some efforts of coordination resulted in pooling some of these data.

d. Modeling Community Level Variables

As we mentioned above, statistical modeling might provide the basis for generating more reliable counterfactuals for community initiatives; a good model would generate predicted values for endogenous outcome variables for a given community in the absence of the intervention by using historical time series for that community and such contemporaneous variables as are judged to be exogenous to the intervention.
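The idea of a model-based counterfactual can be sketched in a few lines of Python. The example below is only a minimal illustration under stated assumptions: the outcome is modeled as a linear function of its regressors (for instance a time index and one variable judged exogenous to the intervention), the coefficients are estimated on pre-intervention periods only, and the estimated impact is the gap between observed and predicted outcomes afterwards. The function names and the linear specification are hypothetical choices for illustration, not the specification of any study discussed here.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("regressors are collinear; cannot estimate model")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def fit_ols(rows, y):
    """Least-squares coefficients (intercept first) via the normal equations."""
    X = [[1.0] + list(r) for r in rows]
    k = len(X[0])
    xtx = [[sum(x[i] * x[j] for x in X) for j in range(k)] for i in range(k)]
    xty = [sum(x[i] * yi for x, yi in zip(X, y)) for i in range(k)]
    return solve(xtx, xty)


def counterfactual_gap(pre_rows, pre_y, post_rows, post_y):
    """Fit on pre-intervention data, predict post-period outcomes, return gaps.

    Each row holds the regressors for one period, e.g. [time_index, exogenous].
    A gap is observed minus predicted: the estimated impact for that period.
    """
    beta = fit_ols(pre_rows, pre_y)
    predicted = [beta[0] + sum(b * v for b, v in zip(beta[1:], r))
                 for r in post_rows]
    gaps = [obs - pred for obs, pred in zip(post_y, predicted)]
    return predicted, gaps
```

The gaps are only as credible as the model: any influence on the outcome that is omitted from the regressors and shifts after the intervention begins will be attributed to the program.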
At least such models would provide a better basis for attempting matching of communities if a comparison community strategy is attempted.

e. Develop Better Measures of Social Networks and Community Formal and Informal Institutions

We have not studied the literature on associational networks in any depth, so our characterization of the state of knowledge in this area may be incorrect. However, it seems to us that considerably more information on and experience with different measures of associational networks is needed, given their central role in most theories relating to community-wide processes. Measures of the density and character of formal institutions appear to us to have been little developed - though, again, we have not searched the literature in any depth. There are industrial censuses for some subsectors. We know of private sector sources that purport to provide reasonably comprehensive listings of employers. Some Child Care Resource and Referral Networks have tried to create and maintain comprehensive listings of child care facilities. There must be comprehensive listings of licensed health care providers. Public schools should be comprehensively listed. However, when for recent projects we have talked about how one would comprehensively survey formal institutions, what to draw on for a sampling frame was not at all clear. Informal institutions present even greater problems. Clubs, leagues, volunteer groups, etc. are what we have in mind. Strategies for measuring such phenomena on a basis which would provide consistent measures over time and across sites need to be developed.

f. Tighten Relationships between Short-term (intermediate) Outcome Measures and Long-term Outcome Measures

Inability or unwillingness to wait for the measurement of long term outcomes is a problem which many studies of children and youth, in particular, face. Increasingly we talk about "youth trajectories".
Perhaps, again, good comprehensive information of which we are not aware exists linking many short term, often softer, measures of outcomes to the long-term outcomes further along the trajectory. We find ourselves time and again asking what we know about how a short term measure, say participation in some activity such as Boy Scouts, correlates with a long term outcome, say employment and earnings. Even more rare is information on how program induced changes in the short term outcome are related to changes in long term outcomes. We may know that the level of a short term variable is highly correlated with a long term variable but not know to what degree a change in that short term variable correlates with a change in the long term variable. Thus we believe systematic compilations of information about short term and long term correlations for outcome variables would be very helpful and could set an agenda for more data gathering on these relationships where necessary.

g. More Studies to Determine the Reliability of Constructed Comparison Group Designs

We have stressed the importance of information provided by the two sets of studies (Fraker, Maynard and LaLonde, and Friedlander and Robins) which used random assignment data as a base and then constructed comparison groups to test the degree of error in the comparison group estimates. It should be possible to find more situations in which this type of study could be carried out. First, the replication of such studies should look at variables other than employment or earnings as outcomes, and at different types of intervention, to see if there is any difference in degrees of vulnerability according to the type of outcome variable or intervention. Second, more such studies would give us a far better sense of whether the vulnerability of the non-experimental methods is indeed persistent and widely found in a variety of data sets and settings.
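The logic of such a reliability test can be shown with a short Python sketch. It takes outcomes for the treatment group, the randomly assigned control group, and a constructed comparison group, and reports how far the comparison-group impact estimate departs from the experimental benchmark. The function names are hypothetical, and the simple difference in means is an assumption made for brevity; an actual replication would apply whatever adjusted estimator is being tested.

```python
from statistics import mean


def impact(treated, reference):
    """Impact estimate as a simple difference in mean outcomes."""
    return mean(treated) - mean(reference)


def comparison_group_bias(treated, randomized_controls, constructed_comparison):
    """Compare a constructed-comparison estimate with the experimental benchmark.

    The experimental estimate (treated vs. randomized controls) is taken as
    the unbiased benchmark; the bias is the non-experimental estimate minus it.
    """
    experimental = impact(treated, randomized_controls)
    nonexperimental = impact(treated, constructed_comparison)
    return {
        "experimental": experimental,
        "nonexperimental": nonexperimental,
        "bias": nonexperimental - experimental,
    }
```

With a difference in means, the bias reduces to the gap between the control group mean and the comparison group mean, which makes plain that the test is really asking how well the constructed group reproduces the randomized controls.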
Appendix: Annotated Examples of Studies Using Various Evaluation Strategies

Counterfactual from Statistical Modeling

Bhattacharyya, M.N., and Layton, Allan P. "Effectiveness of Seat Belt Legislation on the Queensland Road Toll -- An Australian Case Study in Intervention Analysis." Journal of the American Statistical Association 74 (1979):596-603. This is an evaluation of the effect of three separate laws enacted between 1969 and 1972 in Australia which made compulsory both the installation of seatbelts in new cars and the wearing of seatbelts. The variable used to measure the effectiveness of the legislation was the quarterly number of (relevant) road deaths from 1950 to 1976. Significance of the impact of each law was determined by comparing the forecasted levels of deaths in the post-fit period (which, for the first two laws, only included the time up until the next law) to the actual observations. The first two laws showed no significant lack of fit; however, the actual results after the 1972 law (mandatory seat belt wearing) differed significantly from those predicted by the model, so it was judged to be effective. The study makes assumptions about the constancy of accident related variables and the noise structure, the lack of other major interventions occurring simultaneously, and that the first two laws caused an exponential decline in the number of cars without seatbelts. The model (later augmented to account for the effect of the changing nature of the population of vehicles -- a growing percentage has seat belts) was judged to be inadequate because it predicted a permanent steady decline in the number of deaths, hypothetically until it reached zero, which is not feasible. The researchers then tried a causal model, in which they used the volume of driving as an independent variable. Gasoline consumption was used as a proxy for the volume of driving.
The model accounts for the transitional effect after the first law of the conversion of all cars to those with seat belts. The noise structure in this model accounts for the autocorrelations between the observations. The qualitative results were the same, but with greater significance. Grossman, Jean Baldwin. "The Technical Report for the AFDC Forecasting Project for the Social Security Administration/ Office of Family Assistance." Mimeographed. Princeton: Mathematica Policy Research, February 1985. Beebout, Harold, and Grossman, Jean Baldwin. "A Forecasting System for AFDC caseloads and Costs: Executive Summary." Mimeographed. Princeton: Mathematica Policy Research, February 1985. These studies used data from 1975 to 1984 to develop caseload forecasting models which the government uses to predict future AFDC caseloads and expenditures. Judgemental method and the analytical method were compared, differentiated by the amount of reliance on the knowledge and experience of the forecaster. Both national and state-by-state models were created, with the national model performing slightly better (less variance). Garasky, Steven. "Analyzing the Effect of Massachusetts' ET Choices Program on the State's AFDC-Basic Caseload." Evaluation Review 14 (1990):701-710. Evaluation of Massachusetts Employment and Training (ET) Choices program. Modeled what the caseload would have been in the absence of the program using data from the first quarter of 1976 through the third quarter of 1983 when ET Choices was implemented. Garasky, Steven, and Barnow, Burt S. "Demonstration Evaluations and Cost Neutrality: Using Caseload Models to Determine the Federal Cost Neutrality of New Jersey's REACH Demonstration." Journal of Policy Analysis and Management 11(1992):624-636. Evaluation of the change in costs incurred by a switch to the NJ REACH (Realizing Economic Achievement) program. Cost neutrality was required to keep the program in operation. 
Developed AFDC caseload projection models to ensure that, within a tolerance band, the new demonstration did not exceed the costs of the prior AFDC program. Pre-intervention data used to derive the model was from 1978-87. The program was studied from the time it was phased in in the first three counties in October of 1987 until the following October when the cost neutrality negotiations were opened up again. The models calculated caseload savings that the federal government considered to be too large to be plausible. Kaitz, Hyman B. "Potential Use of Markov Process Models to Determine Program Impact." In Research in Labor Economics, edited by Farrell E. Bloch, pp. 259-283. Greenwich: JAI Press, 1979. Attempts to measure impact on labor force participation. The study describes use of longitudinal data (pre- and post-program) on labor force mobility of participants to model equilibrium labor force patterns with Markov processes (which assume that labor force participation in one period depends only on behavior in the previous period). Adjusts for aging and economic conditions. To the extent that there aren't any labor force data for a particular subgroup (they were young, in school, in the military, in prison) or that it cannot be considered representative (volatility of youth behavior), this approach will not be useful. The study gives some empirical examples from the Public Employment Program, 1971-1973. The author shows effects of loosening some of the model's strict assumptions and offers alternative techniques and extensions of the basic model. Mathematica Policy Research. "Evaluation of the Nutrition Assistance Program in Puerto Rico: Volume II, Effects on Food Expenditures and Diet Quality." Mimeographed. Princeton: Mathematica Policy Research, 1985 (see also Fraker, Thomas; Devaney, Barbara;and Cavin, Edward. "An Evaluation of the Effect of Cashing Out Food Stamps on Food Expenditures." American Economic Review 76 (1986):230-239.) 
An evaluation assessing the effect of a food stamp cashout program (Nutrition Assistance Program) in Puerto Rico (which in 1982 replaced the food stamp program which had been in effect since 1974). Data are taken from two food intake surveys, 1977 and 1984. The authors modeled the food stamp caseload before the program was implemented. Then, after the program went into effect, impacts were measured by comparing the current estimates of food expenditures to the expenditures that would have occurred in the absence of the cashout, as predicted by the model. The authors also modeled the participation decision in order to adjust for selection bias.

McCleary, Richard, and Riggs, James E. "The 1975 Australian Family Law Act: A Model for Assessing Legal Impacts." In New Directions for Program Analysis: Applications of Time Series Analysis to Evaluation, edited by Garlie A. Forehand. New Directions for Program Evaluation, number 16, a publication of the Evaluation Research Society, Scarvia B. Anderson, Editor-in-Chief, San Francisco: Jossey-Bass, Inc., December 1982. An evaluation of the impact of the Australian Family Law Act (which provided for a form of no-fault divorce) on divorce rates. Annual data from 1946 to 1979 were used. The authors used a compound impact model of the form developed by Box and Tiao (JASA, 1975) which estimates both temporary and permanent shifts in the series. Statistically significant impacts on divorce rates were found for both temporary and permanent components.

McConnell, Beverly B. "Evaluating Bilingual Education Using a Time Series Design." In Applications of Time Series Analysis to Evaluation, edited by Garlie A. Forehand. New Directions for Program Evaluation, number 16, a publication of the Evaluation Research Society, Scarvia B. Anderson, Editor-in-Chief, San Francisco: Jossey-Bass, Inc., December 1982. Evaluation of the impact of Individualized Bilingual Instruction (IBI) on standardized test scores.
The program targeted preschool to 3rd grade children of migrant farm workers. Children were pre-tested upon entering the program and then re-tested after every 100 days of attendance. The test scores were standardized by the child's age. Accumulation of pre-test
scores of children entering at various ages was collected for 4 years
and then used as a representation of how children in this sample would
have fared without the bilingual program. The results indicate that
three years of this program should, on average, improve the skills of
these children to a level competitive with children whose first language
is English.

Comparison Group derived from survey data

Ashenfelter, Orley. "Estimating the Effect of Training Programs on Earnings." Review of Economics and Statistics 60 (1978):47-57. An evaluation of the impact of the Manpower Development and Training Act (MDTA) program on participant earnings. The author studied trainees who began participation in the first quarter of 1964, using data from 1961 to 1969 (data from before this period were available; 1961, 1962, and 1963 were tried as alternate base years for the model). Data for the treatment group were drawn from program records augmented with Social Security (SS) earnings data. Data for the comparison group came from the Continuous Work History Sample (CWHS). Preprogram earnings trends were different for treatment and comparison groups (p.51). Attempts were made to adjust for observed differences in earnings functions between groups using regression analysis. No attempt was made to adjust for selection bias. The author points out the problem of truncation of SS records at the maximum taxable amount (p.56).

Ashenfelter, Orley, and Card, David. "Using the Longitudinal Structure of Earnings to Estimate the Effect of Training Programs." Review of Economics and Statistics 67 (1985):648-60. An evaluation of the impact of the Comprehensive Employment and Training Act (CETA) on participant earnings. The authors studied the 1976 cohort of enrollees using data from 1970 to 1978. Data for the treatment group were from the Continuous Longitudinal Manpower Survey (CLMS). Data for the comparison group were from the Current Population Survey (CPS). A stratified random sample was taken from those screened for eligibility criteria (p.649). The comparison group was re-sampled to reflect the age distribution of CETA participants. Preprogram earnings trends were different for treatment and comparison groups (p.51).
The authors used a components of variance model with a random growth component and a selection rule for the participants in an attempt to control for observed differences in earnings functions as well as possible selection bias. Estimates are highly sensitive to changes in model specifications.

Bassi, Laurie J. "Estimating the Effect of Training Programs with Nonrandom Selection." Review of Economics and Statistics 66 (1984):36-43 (see also Bassi, Laurie. "The Effect of CETA on the Post-Program Earnings of Participants." The Journal of Human Resources 18 (1983):539-56.) An evaluation of the impact of CETA on trainee earnings using CLMS and CPS. Fiscal year 1976 enrollees were studied. The author uses 1973 and 1974 as base years and follows through 1978. Screening and stratified (or cell-)matching were used to construct a comparable comparison group. Attempts were made to control for both selection bias and the "creaming" problem through fixed effects models. The author discusses problems of meager specification of CLMS data, truncation of SS earnings, contamination of CPS with CETA participants and other data problems.

Barnow, Burt S. "The Impact of CETA Programs on Earnings." The Journal of Human Resources 22 (1987):157-193. Reviews 6 CETA studies: Westat (x2), Bassi, Bloom/McLaughlin, Dickinson/Johnson/West, and Geraci.

Bloom, Howard S. "What Works for Whom? CETA Impacts for Adult Participants." Evaluation Review 11 (1987):510-527. An evaluation of the impact of CETA on trainee earnings using CLMS and CPS. The author studied participants who entered the program between 1/75 and 6/76 and followed them through 1978 (it is unclear what was used as the base year, although there are some graphs using data as far back as 1964, and since the study uses the CLMS it should have SS data back to 1951). Random selection of the comparison group was done from CPS subject to certain eligibility criteria, and a time-varying fixed effects model was used.

Bryant, Edward C., and Rupp, Kalman. "Evaluating the Impact of CETA on Participant Earnings." Evaluation Review 11 (1987):473-92. An evaluation of the impact of CETA on trainee earnings using CLMS and CPS. The authors examined FY 1976 and FY 1977 cohorts using data through 1979 and base years of 1972 and 1973 respectively. They screened for eligibility and then used stratified matching to construct a comparison group. The authors claim the matching strategy is robust to model specification and used an autoregressive earnings function.

Dickinson, Katherine P.; Johnson, Terry R.; and West, Richard W. "An Analysis of the Sensitivity of Quasi-Experimental Net Impact Estimates of CETA Programs." Evaluation Review 11 (1987):452-472. An evaluation of the impact of CETA on trainee earnings using CLMS and CPS. The authors examined 1976 enrollees through 1978, considering use of 1972, 1973, and 1974 as base years. They screened for eligibility and then used statistical, or "nearest neighbor," matching. The authors used a basic OLS model to control for measurable differences and then compared the results with estimates from a model which uses a symmetric-difference estimator -- which should control for the selection decision and individual specific fixed and random effects. They found that impact estimates were sensitive to the employment status of sample members prior to the treatment period (p.469), and to the analytical model used, but robust to the matching procedure used. They admit that it is hard to determine which of the alternate estimates found through sensitivity analysis is the "correct" one.

Finifter, David H. "An Approach to Estimating Net Earnings Impact of Federally Subsidized Employment and Training Programs." Evaluation Review 11 (1987):528-47. An evaluation of the impact of CETA on trainee earnings using CLMS and CPS. The author studied the cohort from FY 1976. He used stratified matching to construct a comparison group.
The author pooled data for 9 years (p.530), from 1970-1978 (p.535). Separate regressions were run for comparison and treatment groups. He used a "pooled cross-section time-series model that controls for year-specific and individual-specific (fixed) effects" (p.536).

Within Site Comparison Groups

Burghardt, John; Gordon, Anne; Chapman, Nancy; Gleason, Philip; and Fraker, Thomas. "The School Nutrition Dietary Assessment Study: Dietary Intakes of Program Participants and Nonparticipants." Mimeographed. Princeton: Mathematica Policy Research, October 1993. (See also the other reports from this study, including: Data Collection and Sampling; School Food Service, Meals Offered, and Dietary Intakes; Summary of Findings.) An evaluation of the impact on dietary intakes of the National School Lunch Program (NSLP) and the School Breakfast Program (SBP), both of which are voluntary programs. Data were collected from surveys and interviews conducted during the period of February to May of 1992 (p.xi). Participants were contrasted with non-participants. The study used a multistage stratified sample weighted by probability of selection of schools and individuals within schools - 626 schools, 350 districts, 45 states; random selection of districts, 3 schools per district, 10 students per school. One day of interviews. The analytic model adjusts for measured differences in the characteristics of the individual, the individual's family, and the school and community. An attempt to adjust for selection bias was made by accounting for the participation decision (joint model) -- an instrumental variables approach. The authors tried alternative models with a separate equation for participants and nonparticipants, each of which was adjusted for selection bias (using a 2-stage approach), and got similar results. They tested the assumption that the identifying variable in the participation equation doesn't influence the outcome of interest.
They examined the variance of estimates with different combinations of identifying variables and compared estimates from selection bias adjusted models with those from non-adjusted models. The results showed that it is hard to get rid of this kind of selection bias: the models were sensitive to slight variations in assumptions, and consequently the authors caution against using the results as a true measure of impact.

Devaney, Barbara; Bilheimer, Linda; and Schore, Jennifer. "The Savings in Medicaid Costs for Newborns and their Mothers from Prenatal Participation in the WIC Program." Vols. 1 and 2, Mimeographed. Princeton: Mathematica Policy Research, April 1991. An evaluation of the Special Supplemental Food Program for Women, Infants, and Children (WIC), which offered prenatal services and food benefits to pregnant Medicaid beneficiaries. The study examined all Medicaid-covered births (during 1987 for four states, and the first half of 1988 for a fifth state) in terms of the costs of the prenatal program relative to the savings in postpartum care for 60 days after birth. The savings were measured as the regression-adjusted difference between postpartum costs for the voluntary participants in the program and these costs for nonparticipants. The evaluators attempted to account for selection bias through the use of maximum likelihood estimates of a joint model of costs and participation. They had difficulty isolating a variable which influenced the participation decision but did not affect Medicaid costs (partially because of the limited nature of their data). Consequently, the difference in participation propensity for each group was quite small, and the model was not robust to even minor specification changes. The authors wanted to examine the effect of length of time in the program on outcome measures; however, this effect was confounded with gestational age.

Jimenez, Emmanuel, and Kugler, Bernardo. "The Earnings Impact of Training Duration in a Developing Country: An Ordered Probit Selection Model of Colombia's Servicio Nacional de Aprendizaje (SENA)." The Journal of Human Resources 22 (1987):228-247. An evaluation of a Colombian training program, SENA. The authors first modeled the decision to participate in long courses, short courses, or not at all (a trichotomous variable). Then they used the results of that model in an OLS earnings model. They examined these results and found that impact estimates not accounting for these decisions would have overestimated program effects. Data were derived from a survey conducted between 1979 and 1981 which has information on both SENA trainees and nonparticipants who were similar to participants in terms of the types of firms in which they worked. The earnings model was robust to certain specification changes.

Kiefer, Nicholas M. "Federally Subsidized Occupational Training and the Employment and Earnings of Male Trainees." Journal of Econometrics 8 (1978):111-25. The study provides evaluations of the MDTA program for a two and a half year period beginning in 1969. The sample was taken from ten major Standard Metropolitan Statistical Areas (SMSAs) for both trainees and eligible non-participants. Groups were matched on age, race and gender. Separate estimates were derived for blacks and whites. The author modeled earnings as a function of estimated earnings plus a quadratic function of weeks of participation in the program. He used a Heckman (1976) technique to control for correlation between the probability of employment and earnings (to help correct for the problem of zero earnings for those unemployed). He tested for correlation between the selection into the program and the error term from the earnings equation and found it not to be significant. Results show small negative effects on earnings for blacks and insignificant effects for whites.

Kiefer, Nicholas. "Population Heterogeneity and Inference from Panel Data on the Effects of Vocational Training." Journal of Political Economy 87 (1979):213-26. Same sample as the Journal of Econometrics study. The author here used a model which includes both individual and time effects and doesn't assume that they are orthogonal to the regressors. He finds substantial cross-sectional bias.

Cooley, Thomas M.; McGuire, Timothy W.; and Prescott, Edward C. "Earnings and Employment Dynamics of Manpower Trainees: An Exploratory Econometric Analysis." In Research in Labor Economics, edited by Farrell E. Bloch, pp. 119-148. Greenwich: JAI Press, 1979. An evaluation of MDTA participants in the 1969, 1970, and 1971 cohorts. The authors claim the use of no-shows as a comparison group is superior to survey data: first, on theoretical grounds - for them to have enrolled in the program indicates substantial similarity to the treatment group; second, because the autocorrelation function of their earnings is very similar to that of trainees; and, third, because they can control for unobservable differences better with this comparison group. The authors argue for simpler, more robust models which require fewer assumptions.

Matched site comparison groups -- no modeling

Buckner, John C., and Chesney-Lind, Meda. "Dramatic Cures for Juvenile Crime: An Evaluation of a Prisoner-Run Delinquency Prevention Program." Criminal Justice and Behavior 10 (1983):227-247. An evaluation of a delinquency deterrent project based on Rahway's Scared Straight program. The sample is a one year follow-up of the first 100 male and 50 female participants (and comparison group members) in the program which began in August 1979. The authors used matched comparison groups: they manually went through prison records to match individual by individual on certain key characteristics (gender, age, arrest record, etc.). They found higher rates of post-program arrests leading to charges among males who participated in the program.

Duncan, Burris; Boyce, W.
Thomas; Itami, Robert; and Puffenbarger, Nancy. "A Controlled Trial of a Physical Fitness Program for Fifth Grade Students." Journal of School Health 53 (1983):467-471. An evaluation of a nine-month (lasting the school year) physical fitness program implemented in the fall of 1979. Tests were taken by subjects prior to the implementation, at the end of the school year, and at the beginning of the following school year (after the summer break with no treatment). Two fifth grade classes were studied, one from each of two neighboring schools: one received the program, one did not. The authors found no significant differences between distribution of some key characteristics (age, gender, height), but significant differences between others (ethnicity, weight, and weight for height). Significant differences in the pretest for only one measure. Significantly higher improvement was found for the treatment group on four of the nine tests. Differences had shrunk slightly by the time of the second post-test. Evans, Richard; Rozelle, Richard; Mittelmark, Maurice; Hansen, William; Bane, Alice; and Havis, Janet. "Deterring the Onset of Smoking in Children: Knowledge of Immediate Physiological Effects and Coping with Peer Pressure, Media Pressure, and Parent Modeling." Journal of Applied Social Psychology 8 (1978):126-135. An evaluation of a 10 week program to prevent youths from starting smoking (smokers excluded from study sample). Sample consisted of 759 students entering 7th grade from 10 schools. There were treatment levels: 1) Full treatment -- 4 educational videotapes, feedback (updates on classes smoking behaviors), testing (attitudes, behaviors); 2) Just feedback and testing; 3) Just testing; 4) Control -- only pre and post-testing (as opposed to 4 post-tests of other treatments). 
In two schools, the populations were combined and each student was randomly assigned to one of the four levels (possible spillover/contamination). In the other eight schools, 2 schools were assigned to each treatment level. There is no good discussion of differences in estimates using randomly assigned individuals as opposed to assigned schools.

Flay, Brian; Ryan, Katherine B.; Best, J. Allen; Brown, K. Stephen; Kersell, Mary W.; d'Avernas, Josie R.; and Zanna, Mark P. "Are Social-Psychological Smoking Prevention Programs Effective? The Waterloo Study." Journal of Behavioral Medicine 8 (1985):37-59.
An evaluation of a smoking prevention program. Children participated primarily over the first 3 months of their sixth grade school year (1979-80) and then received additional sessions in their 7th and 8th grade years. Twenty matched schools were randomly allocated to treatment or control status. Schools were matched on size, socioeconomic characteristics, and urban/rural designation. (The article discusses one matched pair, but it is unclear whether all were matched as pairs or some as groups with similar characteristics.) The pretest showed no significant differences in gender or individual, peer, parental, or sibling smoking behaviors between groups. Separate estimates were made for subgroups defined by smoking behavior at pretest. The greatest effects were found for those classified at the outset as "experimenting." The authors tried to model smoking behavior at the school level using a binomial regression model, but the fit was poor. Separate estimates for subgroups defined by risk level showed significant favorable impacts for high-risk treatment students. The article discusses the issue of unit of analysis (p.41).

Freda, Margaret Comerford; Damus, Karla; and Merkatz, Irwin R. "The Urban Community as the Client in Preterm Birth Prevention: Evaluation of a Program Component." Social Science & Medicine 27 (1988):1439-1446.
An evaluation of a pre-term birth videotape intervention aimed at increasing community awareness about the problem of pre-term births. The sample consisted of 10 Community Boards, randomly allocated 5 to treatment and 5 to control. Studied from June 1986 to August 1986.

Hurd, Peter D.; Johnson, C. Anderson; Pechacek, Terry; Bast, L. Peter; Jacobs, David R.; and Luepker, Russell V. "Prevention of Cigarette Smoking in Seventh Grade Students." Journal of Behavioral Medicine 3 (1980):15-28.
An evaluation of a smoking prevention program. (The article mentions three monitoring points of October, December, and May, and notes that October was the baseline, but doesn't specify a year.) The study sample consisted of the seventh grade classes of four schools in a district. Two schools were assigned to treatment and two to control status. Assignment was not random; the researchers picked a high-income and a low-income, and a high smoking rate and a low smoking rate, school for each group. There were statistically significant baseline differences in smoking behavior between treatment and comparison groups.

McAlister, Alfred; Perry, Cheryl; Killen, Joel; Slinkard, Lee Ann; and Maccoby, Nathan. "Pilot Study of Smoking, Alcohol and Drug Abuse Prevention." American Journal of Public Health 70 (1980):719-721.
An evaluation of a program to prevent drug and alcohol abuse in adolescents. Observations of the participants took place over 21 months, from 1977 to 1979. The treatment group was in a junior high school which was targeted as a problem school. The comparison school was defined as a demographic match; in addition, it was close by and the administrators were willing to cooperate with the program. There were similar rates of parental smoking and preprogram student smoking between groups. Favorable statistically significant differences in trends between groups were found for several outcomes.

Perry, Cheryl L.; Killen, Joel; and Slinkard, Lee Ann. "Peer Teaching and Smoking Prevention Among Junior High Students." Adolescence 15 (1980):277-281.
An evaluation of Project CLASP (Counseling Leadership About Smoking Pressures). The treatment group was an entire 7th grade class who received instruction through the end of their 8th grade year (the 77/78 and 78/79 school years). The study used self reports of smoking behavior at three points: 9/77, 6/78, and 12/78. The comparison group comprised 7th grade classes from two schools in a neighboring community. There is no discussion of matching strategy or preprogram comparisons between groups. Significant differences were found between the treatment and combined comparison groups. However, one of the schools, when examined separately, was not significantly different in terms of smoking behavior for the week prior to testing.

Perry, Cheryl L.; Telch, Michael J.; Killen, Joel; Burke, Adam; and Maccoby, Nathan. "High School Smoking Prevention: The Relative Efficacy of Varied Treatments and Instructors." Adolescence 18 (1983):561-566.
An evaluation of a program to prevent high school smoking. Five classes in each of four schools were randomly assigned to one of six treatment combinations (two kinds of instruction, three kinds of curricula). Treatment took place during the first two weeks of 3/80, with assessments in 2/80 and 5/80. There were no significant differences between instruction means or curricula means. There was a possible interaction between curricula and instructor. Smoking behaviors appear to have changed between pre and post tests; however, there is no control (no-treatment) group to measure this shift against.

Perry, Cheryl L.; Mullis, Rebecca M.; and Maile, Marla C. "Modifying the Eating Behavior of Young Children." Journal of School Health 55 (1985):399-402.
An evaluation of a nutritional education program conducted in the fall of 1982. Food recalls were taken in 9/82 and 12/82. The study sampled from four elementary schools: 8 third and fourth grade classrooms from two of the schools were assigned to treatment status, and 8 matched classrooms in the other two schools acted as comparisons. Groups were matched on school size, socioeconomic status, etc.
No significant differences were found between these preprogram characteristics for the two groups. Results showed significant differences between groups. (The article mentions adjusting for age and gender (p.401), but it is unclear how exactly this was done.)

Vincent, Murray L.; Clearie, Andrew F.; and Schluchter, Mark D. "Reducing Adolescent Pregnancy Through School and Community-Based Education." Journal of the American Medical Association 257 (1987):3382-3386.
An evaluation of a teen pregnancy prevention program in a rural site which took place between 9/82 and 9/87. The treatment site is one half of a county. The comparison sites are the other half of the same county as well as three other communities in the state matched on sociodemographic similarity. Spillover effects in the contiguous comparison community led to unintended dosage effects. The program appears to have been very successful. The study measures the change in average estimated pregnancy rates from the preprogram period (1981-1982) to two post-implementation periods (1983-1985, 1984-1985) and compares this against the same change in non-intervention sites. When using the change between the preprogram rate and the average 1984-1985 rate, there is a 35.5% drop (statistically significant) in estimated pregnancies. The other half of the county had a non-significant drop, one of the other three counties had a non-significant gain, and the other two comparison counties had statistically significant gains.

Zabin, Laurie S.; Hirsch, Marilyn; Smith, Edward A.; Streett, Rosalie; and Hardy, Janet B. "Evaluation of a Pregnancy Prevention Program for Urban Teenagers." Family Planning Perspectives 18 (1986):119-123.
An evaluation of a pregnancy prevention program for urban teens. The program began in 11/81 and the clinic opened in 1/82. Services were available through 6/84. The study sample consisted of two junior high schools and two high schools in the Baltimore school district.
Treatment schools served a more highly disadvantaged, all-black population. The study only examines the black students in the more racially diverse comparison schools. Substantial differences between groups at baseline were found (the article doesn't say if they were statistically significant). The authors estimate impacts at the school-wide level and find some favorable statistically significant results.

Matched site comparison groups -- with modeling

Casswell, Sally, and Gilmore, Lynnette. "An Evaluated Community Action Project on Alcohol." Journal of Studies on Alcohol 50 (1989):339-346.
An evaluation of an alcohol problems prevention program which was conducted between 1982 and 1985 in New Zealand. Six communities were divided into two groups of three based on socio-demographic characteristics. Each community in a group was allocated (not randomly, for fear of spillover effects from the media campaign) to a treatment status: control, media campaign, or media campaign and community organizer. The authors used principal components analysis and a three-way ANOVA to test for the effects of age and gender on the main outcome measures. They used contrasts to test for significant differences 1) in city characteristics at baseline, 2) in characteristics over the course of the evaluation, and 3) in the change over time between treatment pairs (p.342). Small, but statistically significant, favorable impacts were found for the intensive treatment level.

Guyer, Bernard; Gallagher, Susan S.; Chang, Bei-Hung; Azzara, Carey V.; Cupples, L. Adrienne; and Colton, Theodore. "Prevention of Childhood Injuries: Evaluation of the Statewide Childhood Injury Prevention Program (SCIPP)." American Journal of Public Health 79 (1989):1521-1527.
An evaluation of an injury prevention program implemented between 9/80 and 6/82. A hospital-based surveillance system monitoring the incidence of specific injuries was in place between 9/79 and 8/92. Telephone surveys were also conducted in 8/80 and 8/82.
Five treatment and five comparison groups were matched on sociodemographic characteristics. The study used 1970 Census data to match, but by 1980 a number of changes (in rates of minorities and low-income residents) had taken place in the communities, making them less comparable. Baseline surveys also indicate that high levels of injury prevention behaviors already existed, making effects harder to show. The design made it difficult to untangle the effects of different components. The authors used a 2-factor (community effect, time effect) ANCOVA model with socio-economic status (SES) as the covariate.

Farkas, George; Olsen, Randall; Stromsdorfer, Ernst W.; Sharpe, Linda C.; Skidmore, Felicity; Smith, D. Alton; and Merrill, Sally (ABT). "Post-Program Impacts of the Youth Incentive Entitlement Pilot Projects." New York, NY: Manpower Demonstration Research Corporation, June 1984 (see also Gueron, Judith. "Lessons from a Job Guarantee: The Youth Incentive Entitlement Pilot Projects." New York, NY: Manpower Demonstration Research Corporation, June 1984. And, Farkas, George; Smith, D. Alton; and Stromsdorfer, Ernst W. "The Youth Entitlement Demonstration: Subsidized Employment with a Schooling Requirement." The Journal of Human Resources 18 (1983):557-573.)
An evaluation of the Youth Incentive Entitlement Pilot Project (YIEPP), which operated between 1978 and 1980. It was an employment entitlement program with required high school attendance, targeting low-income youths (ages 16-19) in order to improve their long-term employment opportunities and earnings potential through education and guaranteed employment. Each of the four large-scale pilot sites chosen for evaluation was matched with a comparable comparison site. Pilot site selection attempted to create a representative sample (e.g., ethnic and geographic diversity) while maintaining an adequate balance between costs and sample size.
Matching was based on weighted variables which were thought to have potential influence on the outcomes, such as characteristics of the labor market, population, high school dropout rate, socio-economic conditions, and geographic proximity. Variables were weighted based on the strength of their predictive power. Regression analysis was used to control for the remaining differences between groups using three sets of variables: demographic characteristics, individual-specific characteristics (e.g., prior earnings history), and a constant and a treatment dummy variable. The authors attempted to control for selection bias through the collection of longitudinal data on the eligible population in both the demonstration and comparison sites. There were four waves of surveys, including surveys for nonrespondents and remote movers, and school records. This sampling scheme also allowed a closer examination of the participation decision. Unforeseen changes in labor market conditions, institutional structures (busing, teachers' strikes), and other political problems seriously hampered the strength of impact estimates. One of the pilot sites, Denver, was reduced to a limited-slot program due to implementation problems and was consequently dropped from the evaluation. The lack of long-term post-program data and the interaction between site and ethnic effects are noted as other major analytical problems. Data from the nonrespondent surveys (collected during waves three and four) were used to test for attrition bias; no effect was found on the substantive results of the study.

Dynarski, Mark, and Corson, Walter. "Technical Approach for the Evaluation of Youth Fair Chance." Proposal -- has been accepted by DOL. Princeton: Mathematica Policy Research, June 1994.
A proposed design for the evaluation of Youth Fair Chance (YFC), a collection of saturation programs broadly aimed at increasing the employment opportunities of youths in high-poverty communities.
The proposal is to examine preprogram trends in demonstration communities using 1980 and 1990 Census data for matching communities (p.39) -- cluster analysis (p.60). Communities are to be matched one to one on poverty level, geographic proximity, characteristics linked to outcome measures, baseline values of outcome measures, and representativeness (demographic/service environment) (p.55). The researchers will also examine the face validity of the match through discussions with site experts (p.39). Baseline data will be collected for all participants when they enter. Power analysis will be done to determine adequate sample size for reasonable statistical power (also for subgroups). For the outcomes analysis, the proposal is to control for measurable differences using multiple regression with OLS for continuous dependent variables and probit or logit with maximum likelihood estimation for dichotomous dependent variables. There is no discussion about how to deal with selection bias.

(Note: The following is not a community-wide study.)

Mallar, Charles; Kerachsky, Stuart; Thornton, Craig; and Long, David. "Evaluation of the Economic Impact of the Job Corps Program: Third Follow-Up Report." Mimeographed. Princeton: Mathematica Policy Research, September 1982 (see also Long, David A.; Mallar, Charles D.; and Thornton, Craig V. D. "Evaluating the Benefits and Costs of the Job Corps." Journal of Policy Analysis and Management 1 (1981):55-76.)
An evaluation of Job Corps, a voluntary residential job training program aimed at positively influencing outcomes of disadvantaged youth. The evaluation period was 1977-1981. The original study sample (participants during the spring of 1977) was followed for approximately 4 years. Comparison groups consisted of matched youths in areas with limited knowledge of the program. Multiple regression was used to attempt to control for both observed and unobserved differences.
The comparison group was chosen based on a sequential matching procedure: first sites were matched, then individuals within sites. The analysts found areas with minimal Job Corps participation and then assigned "selection probabilities" based on similarities to heavily saturated Job Corps areas in terms of socio-economic characteristics such as race and income level. Each of these variables was chosen based on its power in a multiple regression equation to predict a Corpsmember's home region. (Sites in close geographic proximity were eliminated.) Youths within these designated comparison sites were then assigned selection probabilities in a similar manner. Sampling units for comparison groups were zip code areas (3 digit for rural, 5 digit for urban). Samples were large enough to ensure a 90% chance of detecting statistically significant changes. Census data were used for choosing comparison sites; for individuals within comparison sites, data were obtained from dropout lists from the high schools and records from the local employment agencies. An error-components model was used in the analysis to account for the correlation of individual-specific error terms over time. The analysts used only variables that would not be affected by Job Corps participation (if the variable possibly could be affected, they used a lagged value). The two-stage model first estimates individual error components using OLS and then substitutes them into a generalized LS model; it controls for varying lengths of follow-up and missing data.
The analysts modeled participation as well to adjust for selection bias. Most of the variables in this equation are the same as in the main outcome equations; however, it has a slightly different functional form and includes two proxy variables for knowledge of the Job Corps program. Separate estimates are calculated for relevant subgroups. Since participants were sampled at a point in time (rather than following a baseline sample of enrollees), the sample overrepresents participants who stayed in the program longer. Results showed effects on employment and earnings and on criminal behavior. A new national evaluation of Job Corps using random assignment is currently underway.

*Ketron. "Final Report of the Second Set of Food Stamp Workfare Demonstration Projects." Mimeographed. Wayne, Penn.: Ketron, September 1987.
An evaluation of Food Stamp Workfare, a mandatory employment training program for food stamp recipients in demonstration areas. The mandatory nature of the program should reduce most selection bias. The analysts compared outcomes of participants to food stamp recipients in other sites who, while subject to work registration rules, didn't have the stringent requirements of Workfare. They sampled first-time referrals (to avoid overrepresenting long-term AFDC dependent individuals) who were referred during the period March to April 1981 (p.15). The maximum amount of follow-up time was 9 months, the minimum 3 months. Comparison group members in matched sites consisted of those referred to food stamp work registration during early 1981 (p.18). They were subject to work registration rules which were less stringently implemented (sanctions were less consistently applied) than in Workfare sites. Sites were matched on variables shown through regression analysis to be strong predictors of the outcome measures (p.14), including characteristics of the local areas, population, and food stamp caseload.
Individuals were chosen within these sites who were referred during the same period as Workfare participants and matched again on characteristics expected to influence outcome measures (p.18). The comparison sample was then weighted to make it representative of individuals in demonstration sites (p.19). The analysts calculated separate regression-adjusted estimates for each subgroup, and then formed a weighted average of these estimates (p.79). They used as independent variables only those characteristics which were used to stratify the sample (those used to match comparison sites and level of participation) (p.81, p.C.2). A variance (or error) components model with generalized LS estimation methods (p.C.2) was used for analysis, and alternative specifications yielded similar results. The analysts performed a variety of sensitivity tests (p.96-104) based on varying assumptions and definitions of measures. The range of estimates was narrow for some measures but not others, and generally narrower for men than for women. The study examines pre-implementation behavior of both groups in terms of outcome measures (p.24) (e.g., the path of food stamp receipt for a year prior, p.102). It examines time patterns of food stamp receipt and employment prior to referral (p.101): they were similar for males, not as similar for females. With respect to possible non-respondent bias, the analysts argue that 1) the difference in response rates by subgroup is generally fairly small, so potentially not a problem, and 2) the regression model should control for some potential biases (p.98); 3) however, the evaluators performed simulations based on different hypotheses as to the experience of non-respondents and found estimates to be "quite sensitive" (p.100). The authors performed cost/benefit analysis from both government and social perspectives and analyzed incentive vs. training effects.
The San Diego site seemed to be a problem: its inclusion/exclusion creates statistically significant changes in estimates for females. There were differences in program implementation there (a 10 day job search period vs. 30 days everywhere else), but it was also the "only site located in a very large urban center" (p.112) and it had participated in the first set of demonstrations, so the staff had experience.

Polit, Denise; Kahn, Janet; and Stevens, David. "Final Impacts from Project Redirection." Mimeographed. New York, NY: Manpower Demonstration Research Corporation, April 1985.
An evaluation of Project Redirection, a program targeted at low income adolescents who were either pregnant or had children. The original demonstration was implemented in 4 sites in 1980 and ended in 1983. The first sample enrolled between 8/80 and 3/81 and was followed for two years. The sample was later expanded to include those who enrolled between 3/81 and 1/82 (no baseline data on this sample were available, as the decision to include them in the evaluation was made after they had already enrolled). Sample II was also followed for 24 months. Comparison groups were drawn from cities matched on socio-economic and geographic characteristics. Teens within cities were matched based on eligibility, and these comparison group members were recruited in a similar fashion to the way that program participants were recruited. Stratified matching was used to balance age, ethnicity, baseline similarity, and receipt of services from teen parenting programs. The analysts used an ANCOVA model to adjust for measured differences between treatment and comparison groups. Separate estimates were made for various subgroups (however, site and ethnicity were confounded to a large degree). Problems arose due to an increase in the availability of competing services for comparison group members. Examinations of attrition bias were made.

Steinberg, Dan. "Induced Work Participation and the Returns to Experience for Welfare Women: Evidence from a Social Experiment." Journal of Econometrics 41 (1989):321-340.
An evaluation of Work Equity (which ran between 7/78 and 3/81), a mandatory training/job search program for new AFDC clients and current clients with changes in exemption status between 7/78 and 8/80. Work experience data cover the 4 years prior to baseline. Participants were followed to two years after baseline. Work Equity replaced WIN in St. Paul and 7 neighboring communities. The comparison group was Minneapolis and communities in close proximity which were still operating under WIN. Generalized analysis of covariance was used, accounting for attrition and endogenously missing data. Five simultaneous equations were estimated, modeling 1) attrition between periods, 2) 1st period selection, 3) 2nd period selection, 4) 1st period log wage, and 5) 2nd period log wage. Small sample size seriously hinders the power of significance tests (especially in terms of employment probabilities). The author discusses attrition bias.

Devaney, Barbara; McCormick, Marie; and Howell, Embry. Design Reports for Healthy Start Evaluation: Evaluation Design, Comparison Site Selection Criteria, Site Visit Protocol, Interview Guides. Mimeographed. Princeton: Mathematica Policy Research, 1994.
This is a design for an upcoming evaluation of Healthy Start. The proposal is to use comparison sites (two per treatment site), matched on infant mortality rates and trends, location, socio-demographic characteristics, and access to prenatal services (p.22). The study will also contrast participants and non-participants within sites. Program participation is voluntary.

Brown, Randall; Burghardt, John; Cavin, Edward; Long, David; Mallar, Charles; Maynard, Rebecca; Metcalf, Charles; Thornton, Craig; and Whitebread, Christine. "The Employment Opportunity Pilot Projects: Analysis of Program Impacts." Mimeographed. Princeton: Mathematica Policy Research, February 1983.
An evaluation of the Employment Opportunity Pilot Projects (EOPP), which operated from mid-1979 to mid-1981. This was a voluntary program for the most part (in some locations AFDC recipients who had been required to participate in WIN were now required to participate in EOPP, p.184). The study sampled all adults in low-income households in 1979 to avoid selection bias. It also sampled those who enrolled in EOPP between 2/1/80 and 2/28/81. Data covered the first quarter of 1979 (12/78-2/79) through the last quarter of 1981 (9/81-11/81). Matched comparison sites were developed for use in one of the three analysis models. The three models were (p.138): 1) percent change in outcomes for treatment vs. comparison sites; 2) percent change in outcomes for enrollees (participants and non-participants) vs. non-enrollees; 3) relative employment probabilities between unemployed EOPP enrollees (participants and non-participants) vs. the general unemployed low income population (non-enrollees). Pitfalls include selection bias in the last two models and difficulty in untangling who benefits or loses in the first model. There were general difficulties caused by small sample sizes, such as in the measurement of subgroup effects. This was aggravated by low enrollment (approx. 10%) and participation (approx. 6%) rates among eligibles. The study proposes alternative approaches which were not used because they are difficult and expensive and require better data.

Matched Pair Comparison Sites

Long, Sharon K., and Wissoker, Douglas A. "Final Impact Analysis Report: The Washington State Family Independence Program." Draft. Washington, D.C.: Urban Institute, April 1993.
An evaluation of the Washington State Family Independence Program (FIP), an initiative implemented in July 1988 which sought to decrease welfare dependence and improve employment potential.
The evaluation used a comparison group strategy: 1) the analysts created east/west and urban/rural stratifications within the state in order to obtain a geographically representative sample; 2) within five of these subgroups, pairs of welfare offices, matched on local labor market and welfare caseload characteristics, were chosen and randomly allocated to either treatment (FIP) or control (AFDC) status (p.3). This strategy could reduce some of the systematic difference between treatment and control groups if sample sizes are large enough.

Davis, Elizabeth. "The Impact of Food Stamp Cashout on Household Expenditures: The Alabama ASSETS Demonstration." In New Directions in Food Stamp Policy Research, edited by Nancy Fasciano, Daryl Hall, and Harold Beebout. Draft Copy. Princeton: Mathematica Policy Research, 1993.
An evaluation of the Alabama Avenues to Self-Sufficiency through Employment and Training Services (ASSETS) Demonstration. (This design was used for the entire demonstration, although the specific report we have is on the evaluation of the food stamp cashout component of the program. Also, this demonstration should not be confused with the Alabama Food Stamp Cash-Out Demonstration, which used random assignment and took place during 5/90-12/90.) Food stamp cashout was implemented in 1990; conducted from 8/91 to 11/91 (p.51). Selection of demonstration and comparison sites was done through identification of three key strata (rural/north, rural/south, and urban), choice of a pair of counties within each stratum, and random allocation of treatment or control status to each member of the pair. Counties were matched on caseload characteristics and population size. The unit of analysis was the household.

Institutional Comparison

Dynarski, Mark; Hershey, Alan; Maynard, Rebecca; and Adelman, Nancy. "The Evaluation of the School Dropout Demonstration Assistance Program -- Design Report: Volume I." Mimeographed. Princeton: Mathematica Policy Research, October 12, 1992.
An evaluation design for the School Dropout Demonstration Assistance Program (note: "targeted projects" had random assignment, "restructuring projects" had comparison institutions). The evaluation was to be conducted over the period 1992-1995. The design called for matched schools in "clusters" (elementary schools which fed into middle schools which fed into a high school). First, the analysts determined several choices for comparable school clusters based on characteristics correlated with dropout rates: "attendance rates, dropout rates, minority populations, limited English proficiency, free or reduced-price lunches, and standardized test scores" (p.61). They then assessed face validity by talking to local staff. Students within schools were sampled randomly (p.73).

NOTES
© 1997 2001 Telesis Corporation