Evaluation of Community-Wide Initiatives

Introduction

In this paper we outline the types of problems which can arise when an attempt is made to evaluate the effects of community-wide programs. We partially review experience with different methods where available. In general we find the problems are substantial, so in a concluding section we provide some suggestions for steps which might be taken to improve methods of evaluation which could be used in these situations[1].

We face several definitional problems at the outset. What do we mean by community-wide initiatives? What type of effects are we particularly interested in measuring? Finally, what are the major objectives of the evaluation? To help with definitions, we turn to papers produced by members of the Roundtable committee.

Community-wide Initiatives

P/PV has defined community as: "...the intersection of place and associational network. Community encompasses both where youth spend their time and whom they spend it with"[2]. Brown and Richman describe "urban change initiatives" as sharing "to some extent the following guiding principles or development assumptions:

Some brief examples may help. In the late 1970s and early 1980s the federal government funded a program called the Youth Incentive Entitlement Pilot Project (YIEPP). This program selected a number of school catchment areas in several states. All the low income persons between the ages of 16 and 19 within those areas were eligible to participate in the program. The program provided them with work opportunities during the summer time - guaranteed jobs essentially - and part time work during the school year. If they took the job in the summer, they were to continue in school during the school year. A major objective was to encourage school continuation by making employment possible for the low income population. The key feature is the inclusiveness that militates against random assignment; since all of the low income youth in a given school catchment area were eligible, random assignment was not possible.

A different type of example is that of community development corporations (CDCs), which confine their efforts for community change to geographically designated areas; at least in theory, all the residents of those areas are potentially eligible for services provided through the community development corporations' efforts.

Effects to be Measured

Turning to the definition of effects which the evaluations assess, we focus for the most part on longer term outcomes which are said to be the concern of the community-wide initiative. We want to separate the longer term outcomes from the more immediate short term changes that are often covered under what is called a process analysis. Thus in the YIEPP example the long term outcomes of interest were school continuation rates of the youth and their employment and earnings. The participation of the youth in the program, while it was of some interest, was not itself considered a long term outcome of central interest. Rather, it was a process effect.
In the case of community development, long term outcomes might be an improvement in the quality of the housing stock in the designated area or an increase in the number of jobs in the designated area held by people residing in that area, while a process outcome might be participation in community boards which make decisions about how to allocate the program resources. It should be recognized,
of course, that what is considered a "process variable" for some purposes
may be considered an outcome variable for others, e.g., participation
of community members in decision-making could be regarded as part of
a process leading to a program outcome of improved youth school performance
in one situation but could be an "empowerment" outcome valued in its
own right in another situation. A clear delineation of the theory of
the intervention process would specify which are "process" and which
are "outcome" effects.

The Counterfactual

The basic question an evaluation seeks to address is whether the activities consciously undertaken which constitute the community-wide initiative generated a change in the outcomes of interest. To address this central evaluation issue, the problem in this case, as in virtually all evaluation cases, is to establish what would have happened in the absence of the program initiative. This is often referred to as the counterfactual. Indeed most of our discussion will turn around a review of alternative methods that have been tried in order to establish a counterfactual for a given type of program intervention.

To those who have not steeped themselves in this type of evaluation, it often appears that this is a trivial problem. Simple solutions are proposed. For example, let's look at what the situation was before the initiative and what the situation is after the initiative in the given community. The counterfactual is the situation before the initiative. Or let's look at this community and find another community that initially was very much like it and then see how after the program initiative the two communities compare on the outcome measures. That will tell us the effects of the program. The comparison community will provide the counterfactual - what would have happened in the absence of the program.

As we shall see however, and as most of us know, these simple solutions are not adequate to the problem - primarily because individuals and communities are changing all the time with respect to the measured outcome even in the absence of any intentional intervention. Therefore, measures taken before the initiative or in comparison communities are not secure counterfactuals; they may not represent well what the community would have looked like in the absence of the program.

Let's return again to some concrete examples. YIEPP pursued a strategy of pairing communities in order to develop the counterfactual.
For example, the Baltimore school district was paired with Cleveland. The Cincinnati school district was paired with a school district in Louisville, etc. In making the pairs the researchers sought to choose communities that had labor market conditions similar to those of the treatment community. A similar procedure, with a great deal more detailed analysis, was adopted as part of an ongoing study of school dropout programs currently being conducted by Mathematica Policy Research. The school districts with the dropout program were matched in statistical detail with school districts in the near neighborhood, that is, within the same city or SMSA (Standard Metropolitan Statistical Area).

In both of these examples, even though the initial match seemed to be quite good, circumstances evolved in ways that made the comparison areas doubtful counterfactuals. In the case of YIEPP, for example, Cleveland had an unexpectedly favorable improvement in its labor market compared to Baltimore. Louisville had disruption of its school system because of court ordered school desegregation and busing. This led the investigators to discount some of the results from using these comparison cities. In the case of the school dropout study, though the districts matched well in terms of detailed school and population demographics at the initial point, a couple of years later, when surveys had been done of the students and teachers in the respective school districts, it was found that in terms of the actual processes of the schools the match was often very bad indeed. The schools simply were operating quite differently in the preprogram period and had different effects on students and teachers.

Random Assignment as the Standard for Judgement

For quantitative evaluators random assignment designs are a bit like the nectar of the gods: once you've had a taste of the pure stuff it is hard to settle for the flawed alternatives.
In what follows, we often use the random assignment design as our standard of comparison. In this design, individuals or units which are potential candidates for the intervention are randomly assigned either to the treatment group, which is subject to the intervention, or to the control group, which is not subject to any special intervention. (Of course, random assignment does not have to be to a null treatment for the controls; there can be random assignment to different levels of treatment or to alternative modes of treatment.) The key benefit of a random assignment design is that, as soon as the number of subjects gets reasonably large, there is a very low probability that any given characteristic of the subjects will be substantially more concentrated in the treatment group than in the control group. Most important, this holds for unmeasured characteristics as well as measured characteristics. Thus when we compare average outcomes for treatments and controls we can have a high degree of confidence that the difference is related to the treatment and not to some characteristic of the subjects. The control group provides a secure counterfactual because, aside from the treatment, the control group members are subject to the same forces which might affect the outcome as are those in the treatment group: they grow older just as treatment group members do, they face the same changes in the risks of unemployment or increase in returns to their skills, and they are subject to the same broad social forces that influence marriage and family practices.

We realize that this standard is very difficult, often impossible, for evaluations of community-wide initiatives to meet. But we use it in order to obtain reliable indications of the type and magnitude of errors which can occur when this best design is not feasible[4]. Unfortunately, there appear to be no clear guidelines for selecting second-best approaches, but a recognition of the character of the problems may help set us on a path to developing such guidelines.
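
The balancing property of random assignment can be illustrated with a small simulation. The sketch below is not taken from the paper: it assumes a hypothetical population in which every subject carries an unmeasured trait (called `motivation` here, an invented name, drawn from a standard normal distribution) and reports how large the treatment-control gap in that trait is, on average, for different numbers of subjects.

```python
import random
import statistics

def mean_gap_after_random_assignment(n_subjects, seed):
    """Randomly split subjects in half and return the absolute
    treatment-control gap in the mean of an unmeasured trait."""
    rng = random.Random(seed)
    # Hypothetical unmeasured characteristic (standard normal by assumption).
    motivation = [rng.gauss(0.0, 1.0) for _ in range(n_subjects)]
    # Random assignment: shuffle, then the first half is treatment, the rest control.
    rng.shuffle(motivation)
    half = n_subjects // 2
    treatment, control = motivation[:half], motivation[half:]
    return abs(statistics.fmean(treatment) - statistics.fmean(control))

def average_gap(n_subjects, replications=200):
    """Average the gap over many independent random assignments."""
    gaps = [mean_gap_after_random_assignment(n_subjects, seed)
            for seed in range(replications)]
    return statistics.fmean(gaps)

if __name__ == "__main__":
    for n in (20, 200, 2000):
        print(n, round(average_gap(n), 3))
```

Under these assumptions the average gap shrinks as the number of subjects grows, even though the trait is never used in the assignment. That is the sense in which the control group is balanced on unmeasured as well as measured characteristics.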

The Nature of the Unit of Analysis

For most of the programs that have been rigorously analyzed by quantitative methods to date, the principal subject of program intervention has been the individual. When we turn to community-wide initiatives, however, the target of the program and the unit of analysis usually shift away from just individuals to one of several possible alternatives. The first, with which we already have some experience, is where the target of the program is still individuals but it is individuals within geographically bounded areas. While the individuals are still the targets of the intervention, the fact that they are defined as being within a geographically bounded unit is intentional, because it is expected that interactions among individuals or changes in the general context will generate different responses to the program intervention than would treatment of isolated individuals.

Another possible unit of analysis is families. We have had some experience with programs in which families are the targets for intervention and where the proper unit of analysis remains the families rather than sets of individuals analyzed independently of their family unit. This would, of course, be the case with, for example, family support programs. These become community-wide initiatives when the set of families to be considered is defined as within geographically bounded areas and eligibility for the program intervention somehow relates to those geographical boundaries. Many of the recent community-wide interventions seem to have this type of focus, a focus on families within geographically bounded areas.

Another possibility for community initiative is where the target and unit of analysis are institutions rather than individuals.
Thus within a geographically bounded area an attempt might be made to have a program which targets particular sets of institutions (the schools, the police, the voluntary agencies, the health providers) and seeks to generate changes in the behavior of those institutions per se. Then the institution becomes the relevant unit of analysis.

The reason for stressing
the importance of being clear about the unit of analysis is that it
can make considerable differences in the basic requirements for the
statistical analysis used in the evaluation. Quantitative analyses focus
on the frequency distribution of the outcome and we use our statistical
theory in order to make probabilistic statements about the particular
outcomes that we observe. The theory is based on the idea that a particular
process has generated outcomes that have a random element in them. The
process does not generate the same result every time but rather a frequency
distribution of outcome values. When we are using these statistical
methods to evaluate the impact of programs we are asking whether the
frequency distribution of the outcome has shifted because of the effect
of the program. Thus a statistically significant difference in an outcome
associated with a program is a statement that the outcome we observe
from the units subject to the program intervention has a very low probability
of coming from a distribution which is the same as the distribution
of that outcome for the counterfactual group. So if the community, in
some sense, is the unit of analysis and we're looking at, for example,
the incidence of low birth weight children in the community, then we
need to have information about the frequency distribution across communities
of the percentage of low birth weight babies.

The unit of analysis becomes critical because the ability to make these probability statements about effects using statistical theory depends on the size of the samples. So if the community is the unit of analysis then the sample size will be the number of communities in our samples. If the court systems are the unit of analysis and we're asking about changes in incarceration rates generated by court systems and we're changing courts in one community in some way and not in the other, then we want to know about the frequency distribution across different court systems of incarceration rates, and the size of the sample would be the number of such systems that are observed.

The Problem of Boundaries

When we're talking about community-wide initiatives we're often talking about cases where geographical boundaries define the unit or units of analysis. Of course the term community need not imply specific geographic boundaries. Rather it might have to do with, for example, social networks. What constitutes the community may vary depending upon what type of program process or what type of outcome we are talking about. The community for commercial transactions may be quite different from the community for social transactions. The boundaries of impact for one set of institutions, let us say the police, may be quite different from the boundaries for impacts of another set of institutions, let us say schools or healthcare networks.

We will not attempt here a full discussion of how boundaries of communities or neighborhoods might be defined[5]. We quote some insights which illustrate the complexity of the issue of community or neighborhood boundaries: "...differentiated sub-areas of the city are recognized and recognizable...neighborhoods are perhaps best seen as open systems, connected with and subject to the influence of other systems...
individuals are members of several of these systems at once...delineation of boundaries is a product of individual cognition, collective perceptions, and organized attempts to codify boundaries to serve political or instrumental aims...local community may be seen as a set of (imperfectly) nested neighborhoods...recognition of a neighborhood identity and the presence of a 'sense of community' seems to have clear value for (1) supporting residents' acknowledgment of collective circumstances and (2) providing a basis and motivation for collective action...neighborhoods are experienced differently by different populations [and]...are used differently by different populations"[6].

It is suggested that a neighborhood or community might be defined for the purposes of a program by reference to some of the following principles: "[1] Match the place to the intervention. [2] Identify the relevant stake holders. [3] Determine the appropriate change agents. [4] Determine the necessary capacity to foster and sustain change."[7]

Note that many of the recent community-wide initiatives have as one of their principal concerns the "integration of services". These integration efforts run right up against the problems of boundary definitions, since the catchment areas for various types of service units intersect or fail to intersect in complicated ways in any given area.

For the purposes of evaluation, these boundary problems introduce a number of complex issues.

First, where the evaluation uses a before-and-after design (we discuss alternative types of designs for evaluations in detail below), i.e., a counterfactual based on measures of the outcome variables in the same area in a period before the intervention to be compared with such measures in a period after intervention, the problem of changes in boundaries may arise.
These could occur either because some major change in the physical landscape occurs, e.g., a new highway bisects the area or a large block of residences is torn down so that a trash-to-steam plant can be built, or because the data collection method is based on boundaries that are shifted, e.g., redistricting of schools or changing of police districts. Similar problems would arise where a comparison-community design is used for the evaluation and similar changes occur either in the treatment community or the comparison community.

Second, the inflow and outflow of people across the boundaries of the community have to be dealt with in the evaluation. Some of the people who have been exposed to the treatment migrate out of the community, and unless follow up data are collected on these migrants, some of the treatment effects may be misestimated. Similarly, in-migrants who enter the area during the treatment period have had less exposure to the treatment and may "dilute" the treatment effects measured (either negatively or positively).

Third, one of the most
serious problems which evaluations of community-wide initiatives face
is the limited availability of regularly collected small-area data.
The decennial Census is the only really complete data source which we
have which allows us to measure population characteristics at the level
of geographically defined small areas. In the inter-censal years, the
best we can do in most cases is to extrapolate or interpolate. For the
nation as a whole, regions, states and standard metropolitan statistical
areas we can get some regularly reported data series on population
and industry characteristics. For smaller areas, we cannot obtain reliable,
regularly reported measures of this sort. We will suggest below some
steps which might be taken to try to improve our measurements in small
geographic areas, but at present, this remains one of the most serious
handicaps faced in quantitative monitoring of the status of communities.
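
As a rough illustration of the interpolation and extrapolation mentioned above, the sketch below estimates a small-area count for a non-census year from two census observations. The function name and the tract figures are hypothetical, and straight-line change between censuses is itself an assumption that real neighborhoods may well violate.

```python
def interpolate_linear(year, year0, value0, year1, value1):
    """Straight-line estimate for `year` from two census observations.
    Years outside [year0, year1] give a linear extrapolation."""
    if year1 == year0:
        raise ValueError("census years must differ")
    slope = (value1 - value0) / (year1 - year0)
    return value0 + slope * (year - year0)

if __name__ == "__main__":
    # Hypothetical tract population: 4,000 in 1980 and 3,400 in 1990.
    print(interpolate_linear(1985, 1980, 4000, 1990, 3400))  # inter-censal: 3700.0
    print(interpolate_linear(1993, 1980, 4000, 1990, 3400))  # post-censal: 3220.0
```

An estimate of this kind carries no information about what actually happened between the two census dates, which is why it is a weak basis for detecting program effects in small areas.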

The Basic Requirements for Statistical Inference in Evaluations

With apologies to those who are well versed in the subject, we thought it would be useful to quickly review the basic elements that go into the design for the statistical analysis involved in a quantitative evaluation.

1. A Theoretical Model

First, there should be some theoretical model, however primitive, that links the intervention elements to a set of outcomes. Such a model can be as rudimentary as: this group got the treatment and that group did not, so the two groups should differ on this outcome measure; the treatment will increase this outcome and decrease that outcome. This type of model is often criticized as being simply a "black box" model, where the treatment is only crudely defined as a "black box" that some people are in and some people are not, and what happens inside that box, the process by which the treatment is transformed into the outcome, is not specified. More refined theoretical models will detail the treatment elements and the behavioral processes on which they impinge and how that impingement could change the behavioral outcomes of interest.

2. The Variance in the Outcome Variable

The second key element in the design and the statistical analysis is some estimate of the normal variance in the outcome measure or measures. As already emphasized, we know that the outcomes will have some frequency distribution. How spread out that frequency distribution typically is in the absence of any intervention is critical information since it tells us, in some sense, how much "noise" there will be in the outcome measures. The larger the variance in the outcome the harder it will be to detect the effect of the treatment. For example, the employment rate in the community tends to have relatively small variance whereas the income of individuals in a community tends to have a very high variance.
It is therefore much harder to detect changes in the average income than it is to detect changes in employment rates.

The adequacy of the theoretical model plays a role here as well. To the degree that the theoretical model specifies the whole set of variables which influence the outcome in the absence of the treatment, and those variables typically explain a good deal of the variance in the outcome variable, using these variables in an explanatory model reduces the "noise" in the background. What becomes relevant is how much variance is left in the outcome after we have taken into account the effects of other measured variables. For example, the variance in earnings might be $15,000 but when we include measures of the individual's education, age, gender, and ethnicity, the residual variance is $10,000. So the better the model in explaining the outcome, the lower the residual variance and the easier it will be to detect the effects of any treatment.

3. The Desired or Expected Size of Response

The next element in statistical design is the size of the response to the treatment which the evaluation would seek to detect. This could be the size of response that is the minimum desired size of response or it could be the size of response that, based on previous studies, is the expected size of the response. This may seem a bit peculiar because it asks for a prior specification of exactly the answer one is looking for through the evaluation analysis. This element is necessary, however, in order to ensure, as far as possible, that if the desired size of response does in fact occur, the statistical test will yield the conclusion that the response was statistically significantly different from zero.
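
The link between the size of response one wants to detect and the sample required can be sketched numerically. The code below uses the standard normal-approximation formula for comparing two proportions, which is a textbook approximation and not a method taken from the paper. The baseline rate of 10 percent, the two-sided 5 percent significance level, and the 80 percent chance of detection are all illustrative assumptions, and the formula treats individuals as independent units (with communities as the unit of analysis, the relevant count would be the number of communities).

```python
import math
from statistics import NormalDist

def n_per_group(p_control, p_treated, alpha=0.05, power=0.80):
    """Approximate subjects needed in each group to detect the difference
    between two proportions with a two-sided test (normal approximation)."""
    if p_control == p_treated:
        raise ValueError("the two proportions must differ")
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value of the test
    z_power = NormalDist().inv_cdf(power)          # desired chance of detection
    variance_sum = p_control * (1 - p_control) + p_treated * (1 - p_treated)
    n = (z_alpha + z_power) ** 2 * variance_sum / (p_control - p_treated) ** 2
    return math.ceil(n)

if __name__ == "__main__":
    baseline = 0.10  # hypothetical incidence without the program
    for reduction in (0.03, 0.005):
        print(reduction, n_per_group(baseline, baseline - reduction))
```

Because the required sample grows with the inverse square of the difference to be detected, a reduction one-sixth as large calls for a sample several dozen times larger under these assumptions.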
For example, suppose that the objective was to lower the incidence of low birth weight children in a given neighborhood and the treatment would be judged to be successful if it lowered the incidence of low birth weight children by three percentage points. Then one would want to be sure that, if the true effect of the treatment were on average to lower the incidence of low birth weight children by three percentage points, the sample size is big enough that there is a good chance the analysis of this sample would conclude that the effect of the treatment was greater than zero. It is an irony of this particular aspect of statistical design that the smaller the desired or expected size of response the bigger must be the sample in order to detect it, other things being equal: it takes more resources to detect a small effect than it does to detect a big effect. Policy relevance can enter into the determination of the desired or expected size of response since one could say, for example, that if the treatment lowers the incidence of low birth weight children by only half a percentage point, that would not be sufficient to be meaningful in the policy realm, and therefore we don't need to have a statistical design and sample size large enough to detect an effect that small.

4. Number and Size of Treatments

The next characteristic of statistical design is the number and the size of the treatments. While in the simplest cases we have one type of treatment, in many cases there are multiple dimensions to the potential treatment and it may be useful to vary those dimensions of the treatments. For example, in the early negative income tax experiments there were two characteristics of the treatment that were systematically varied: the level of the basic income guarantee and the rate of reduction in the negative income tax payments as income of the family increased. These two dimensions were varied in size to get a combination of different plans being tested within one experiment.
Not surprisingly, with multiple treatments the requirements for sample size and data collection for the evaluation become considerably larger, by an amount that depends on exactly how the treatments are set up.

The size of the individual treatments of course has an effect on the expected size of response. In some experiments one is varying the level of the treatment and looking for a "dosage" effect. So, for example, in the simple case one might be trying to lengthen the school year and see what the effects of that are, but added information could be obtained by lengthening it by different amounts for different treatment groups to try to see what the "dosage" effect of lengthening the school year would be.

5. Power Analysis

In developing a statistical design for an evaluation one carries out what is called a statistical power analysis. This provides an estimate of the probability that a true effect of a given size will be found to be statistically significant. The power analysis is a more conservative procedure than simply indicating the minimum detectable response. The power analysis recognizes that whatever the true average response, samples drawn to test the treatment will, if drawn repeatedly, generate a frequency distribution around a mean response, and one cannot be sure before carrying out the experiment where in that distribution of outcomes the particular sample taken will lie. So the power analysis takes this into account in a more conservative fashion.

We have reviewed these
basic requirements for statistical analysis design in order to provide
a framework against which we will discuss a number of characteristics
of alternative designs. We hope this discussion will aid understanding
of how particular problems in a given approach relate to this basic
design model.

Problems with Outcome Measures

Community-wide initiatives present particular problems in defining what major outcome measures the evaluation should focus on. In many past evaluations in the social policy area the major outcome variables have been relatively straightforward and agreed upon, for example, the level of employment, the rate of earnings, the test scores of children, the incidence of marriage and divorce, the incidence of low birth weight children, arrests and incarcerations, school continuation rates or dropout rates, birth outcomes. For community-wide initiatives, these traditional types of outcome measures may not be the primary outcome measures, or may be regarded as ultimate outcome measures but ones which may not show detectable effects in the short term. For example, in the famous Perry Preschool study the long term outcomes (employment, earnings, delinquency) are now often talked about, but obviously at the start of the evaluation these outcomes could not be directly measured. This may be true for some of the community initiatives as well, where it may be felt that during the period of the short term evaluation it is unlikely that traditional outcome measures will show much change even though it may be hypothesized that in the long run they will. For community initiatives, then, we need to distinguish intermediate outcomes and final outcomes.

In addition, in community initiatives there may be types of outcome measures that have not been used traditionally but are regarded as outcomes of sufficient interest in and of themselves, regardless of whether they link to more traditional outcome measures eventually. Particularly where the object of the community initiative is a change in institutional behavior it may be that some of the more traditional individual outcome measures are considered of secondary interest.
For example, if an institution is open longer hours or disburses more funds or reduces its personnel turnover, these might be outcomes of interest in their own right and not as intermediate outcomes.

Finally, we would want to make a careful distinction between input measures - or process measures - and outcome measures. For instance, an input measure might be the number of people enrolled in a GED program, whereas the outcome measure might be the number of people who passed their GED exam or, even further, their eventual employment and earnings. Process measures might be changes in the organizational structure, such as giving classroom teachers more authority in determining the curriculum content rather than having superintendents or school boards determine it. This might be considered a process measure whereas the effect on student achievement would be the ultimate outcome measure of interest.

One might try to use the set of principles outlined by Brown and Richman at the outset of this paper and translate the principles into the sets of categories we have just talked about. First, starting from the statement of principles and development assumptions, could we define a set of final outcome variables upon which the "success or failure" of a community-wide initiative might be judged? Second, could we derive from these a set of intermediate measures that we think would be related to the ultimate long term outcomes but which would be more measurable in the short term? Third, could we distinguish from these principles those measures which would be input and process measures rather than outcome measures?

As one seeks to address these questions it becomes clear that it is important to try to determine as best as possible the likely audience for the evaluation results. The criteria for what are important outcomes to be measured and evaluated are likely to vary with that audience.
Will the audience in mind, for example, be satisfied if it can be shown that a community-wide initiative did indeed involve the residents in a process of identifying and prioritizing problems through a series of planning meetings, but it could not be shown that this process led to changes in school outcomes or employment outcomes or changes in crime rates in the neighborhood? Academics, foundation staff, policy-makers and administrators are likely to differ greatly in their judgement of what outcomes provide the best indicators of success or failure.

Another dimension of this problem is the degree to which the audience is concerned with the outcomes for individuals vs. the outcomes for place. This of course is an old dilemma in neighborhood change going back to the time of urban renewal programs. In these programs the geographical place may have been transformed by removing the poor people and replacing them through a gentrification process with a different population; place was perhaps improved but people were not. Or, to take the other extreme, the Gautreaux process moved low income people from the center city to the suburban fringe and it is judged their lives were improved, but of course the places that they left were if anything in worse shape after they left.

Important Studies which Demonstrate the Problem of Selection Bias

There are two sets of studies now available which illustrate the seriousness of the problems which can arise when comparison groups are constructed by means other than random assignment. These studies both start with data generated through true random assignment experiments. The results of the random assignment experiments are used as the standard of what the "true" estimates of the effects of the program are.
In each of the studies alternative comparison groups are then constructed in a variety of ways, and another set of estimates of the effects of the program on the outcome variable is made using the constructed comparison group in place of the control group. A test is then made by comparing the estimated effect of the program obtained with the constructed comparison groups against the "true impact estimates" taken from the experiment. The first set of studies was based on the National Supported Work Demonstration, which ran between 1975 and 1979 in eleven cities across the United States. This was a subsidized employment program for four target groups: ex-addicts, ex-offenders, high school dropouts and women on AFDC. Two sets of investigators working independently used these experimental data and combined them with non-experimental data to construct comparison groups[8]. Both sets of investigators looked at various ways of matching the treatment subjects from the experiment with counterparts taken from other data sources. In constructing the comparison group from other data sources they followed the methods that had been used by earlier investigators to study employment and training programs such as those under the Comprehensive Employment and Training Act (CETA). The major conclusion from this set of important studies was that the constructed comparison groups provided unreliable estimates of the true program impacts on employment and earnings, and that none of the matching techniques used looked particularly superior to the others - there was no clear second-best. The second major set of studies was carried out more recently using data from MDRC's Work/Welfare studies in several states[9]. This work is even more closely relevant to the problems we address here, and we will discuss some of the results in more detail later in the paper. 
What the investigators did was to use the treatment group from the actual Work/Welfare experiments and then, for constructed comparison groups, use the control groups from other locations or other time periods. The Work/Welfare experiments were carried out in several states. This made it possible for the researchers to construct comparison groups by using the treatment group from one state and the control group from another state. In addition, for several of the programs the experiment was carried out in several different offices within the same state, so they could use a treatment group from one geographic location within the state or city and the control group from another geographic location within the same state or city. Finally, because the samples were large enough, they could take the treatment group from one time period at a given site and the control group from another time period at the same site, to get what they call "across-cohort studies". The investigators tested a variety of ways of trying to improve upon the construction of the comparison group. They then compared the estimates of the effect on employment rates from the true experimental results with the estimates obtained using the constructed comparison groups. They tried not only different combinations of constructed groups but also different ways of matching on measured characteristics, and they tried some sophisticated specification tests, suggested by Heckman[10] and others, to eliminate those constructed comparison groups which the tests suggest are less well matched. The results showed substantial differences in the magnitude of the estimated impact between the true experimental results and the results based on constructed comparison groups. 
In many cases not only the magnitude of the effect is different but the actual statistical inference is different, that is, whether the impact is statistically significant or whether its sign is positive or negative. Their results seem to indicate that, at least for these data, comparison groups constructed from different cohorts in the same site perform somewhat better than the other types of comparison groups. The importance of these
two sets of studies is that they indicate that the problem of bias arising
when comparison groups are constructed by any method other than random
assignment is likely to be quite serious. They show that statistical
controls using measured characteristics are in most cases inadequate
to overcome that bias. This, then, is not just a theoretical possibility,
but it can be empirically shown in these real life experiments to have
been a very real possibility. Investigators could have been seriously
misled in their conclusions about the effectiveness of these programs
had they used methods other than random assignment to construct their
comparison groups. Even more important, we should remember that these
studies are looking at the effectiveness of alternative methods of creating
comparison groups ex post but when one is developing a design for an
evaluation one must make a priori judgments about the extent of bias
that the results might in the end show. In most cases, one would not
have the luxury of using a specification test afterwards to eliminate
certain types of constructed comparison groups. You would have to guess
beforehand whether the type you have chosen is likely to fall in that
group that would be rejected by the test or not. Types of Comparison Groups and Experiences With Them To Date 1. Constructed Groups of Individuals: Constructing comparison groups of individuals was the most often used method of evaluation prior to the development of random assignment in large scale social policy studies in the 1970's and its much wider use in a range of programs in the 1980's. We will talk only briefly about these methods based on individuals, since they are of less relevance to the evaluation of community-wide initiatives. The earliest type of constructed group was a before and after, or "pre-post", design. Measurements were made on the individuals before they entered the treatment and then again after they entered the treatment and following its conclusion. Impacts were measured as the change from before the program to after the program. This design has long been recognized as highly vulnerable to changes in individuals that occur naturally through life processes. In some cases it is simply changes associated with aging. For example, with respect to criminal behavior there is a known decline with age, a phenomenon referred to as "Aging Out". With respect to employment and training, program eligibility is often based on a period of unemployment prior to program entry. In these cases, we know that mistaken inference can easily occur since, for any group of people currently unemployed, the usual processes of job search which go on in the absence of any program would result over time in some percentage of them, usually a very high percentage, becoming employed or re-employed. One cannot untangle the program effects from those of the usual job finding processes. Another type of constructed comparison group uses non-participants in the program as comparisons for those who participated. This was used in early evaluations of the Job Corps. 
A more recent example is the evaluation of the Special Supplemental Food Program for Women, Infants and Children, WIC (Devaney, 1990). In this study a comparison was made between welfare recipients who participated in the WIC Program and those who did not. Another case is the study of the National School Lunch Program and School Breakfast Program (Burghardt et al., 1993). An attempt was made to evaluate the dietary impact of the programs by using non-participants in the same sites as those in which the data on School Lunch Program participants were gathered. This type of design has
long been recognized as subject to serious bias due to selection on
unobserved variables. Usually there is a reason why the individuals
have participated in the program or not participated in the program.
In some cases this can be the individual's motivation, in some cases
it can be subtle selection procedures followed by the program administrators.
If that selection for either reason is on characteristics that would
affect the final outcome, and these characteristics are unmeasured,
then estimating the impacts by the difference between the participant
and the non-participant groups will be subject to bias which could run
in either direction. Several major studies have sought to use existing survey data as a source from which to draw individuals for the comparison group. The most commonly used source is the U.S. Census Bureau's Current Population Survey (CPS), which has large national samples of individuals. Comparison groups are usually constructed by matching the characteristics of the individuals in the treatment group to individuals in the CPS. This procedure was used in a number of evaluations of employment and training programs reported in the 1980's (Bloom 1987, Ashenfelter and Card 1985, Bassi 1983-84, Bryant & Rupp 1987, Dickinson, Johnson and West 1986), where program enrollment data were often used in combination with the CPS data or data from Social Security records. These data would generally be regarded as desirable, as they provide a long series of earnings observations on individuals prior to the time of program eligibility as well as during program eligibility. In the studies by Maynard and Fraker the CPS and Social Security data were used in combination. Different methods of matching were used in these various studies. In some cases cells were defined by characteristics, and samples were drawn from the CPS cell by cell so as to match the proportion of the treatment group falling in each cell of the characteristics matrix. In some of the studies, a "nearest neighbor" match based on the Mahalanobis distance measure was used to try to get a better match. Some other existing survey data files have also been used for constructing comparison groups. For example, the Panel Study of Income Dynamics was used (in addition to the CPS) by LaLonde in his study based on the Supported Work data. A good deal more could be said about the details of different methods of matching and attempts to use the sample selection corrections of the Heckman type. 
But we will not go into much of that here.[11] To try to get around the problem of the influence of unobserved variables, analysts have, since the late 1970s, relied on methods of statistically correcting for potential bias. The methods used are based primarily on those developed by James Heckman, currently at the University of Chicago. The basic approach is to try to model the selection process, that is, to develop a statistical equation which predicts the probability of being selected into the treatment group or the comparison group. While the approach can work in certain situations, experience has shown that it cannot generally be relied upon to deal with the problem of unobserved variables. Understanding the problem of unobserved variables, and the weakness of any methodology other than random assignment for dealing with it, is central to appreciating the difficulties to be faced in the evaluation of community-wide initiatives. We will touch on this repeatedly in what follows. 2. Constructed Comparison Institutions In a few cases where the primary unit of analysis has been an institution, attempts have been made to construct comparison groups on the basis of institutions. These procedures come closer to the problems of community-wide initiative evaluations. The major example we have is the school dropout studies which are currently under way (Dynarski et al. 1993). While in some cases in this study individuals are randomly assigned to a school dropout prevention program or to a control group, for part of the study it was deemed not possible to carry out random assignment within the school, so an attempt was made to find other schools that could be used as a comparison group in judging the effectiveness of the dropout program. As noted several times above, after the schools had been initially matched, survey data were collected from students, parents and school administrators. 
A comparison of these data showed that, in spite of being demographically similar, the schools were in fact quite different in their actual operations. Note that in this case, even though the student outcomes are the ultimate subject of the study, i.e., whether the students drop out or not, the institution had to be the unit of comparison because the context was felt to be important in determining school dropout outcomes. Therefore, to create a counterfactual, one wanted to find an environment similar to that in the treatment schools. In one study there was a large enough number of schools to at least attempt a quasi-random assignment of schools to treatment and control groups. Twenty-two schools were first matched on socio-economic characteristics and then randomly assigned within matched pairs to treatment and control (Flay et al. 1985). 3. Comparison Communities There are several examples of attempts to use communities as the unit for building the comparison group. The idea is superficially quite appealing: find a community that is much like the one in which the new treatment is being tested and then use this community to trace how the particular processes or outcomes of interest evolve compared to those in the "treatment community". As we will see, however, in practice this simple procedure has many pitfalls. a. Treatment Site Predetermined In most cases the treatment site has been predetermined before the constructed comparison site is selected. An example of this type of project is the Youth Incentive Entitlement Pilot Project (YIEPP) which was described at the outset of this paper. These four sites were matched with sites in other communities in other cities, matching on weighted characteristics such as the labor market, population characteristics, high school dropout rate, socio-economic conditions and geographic proximity to the treatment site. 
In this case, however, unforeseen changes in the comparison sites made their validity as counterfactuals for the treatment sites extremely doubtful. For example, Cleveland, which was paired with Baltimore, had an unexpected improvement in its labor market; events such as court-ordered school busing and teacher strikes made the usefulness of other comparison sites extremely questionable. The Employment Opportunity Pilot Project (EOPP) was a very large scale employment opportunity program, begun in the late 1970's and carried on a bit into the 80's, which focused on chronically unemployed adults and families with children. It also used constructed comparison sites as part of its evaluation strategy. Once again there were problems with unexpected changes in comparison sites. For example, Toledo, which had major automobile supplies manufacturers, was subject to a downturn in that industry. Further, out of 10 sites, one had a major hurricane, a second had a substantial flood and a third had a huge unanticipated volcanic eruption. A project currently getting under way, the Healthy Start Evaluation, will also use comparison sites. Two comparison sites are being selected for each treatment site (Devaney and Morano 1993). In developing comparison sites, investigators have tried to supplement the more formal statistical matching by asking local experts whether the comparison sites tentatively selected make sense in terms of population and service environment. The evaluation of Community Development Corporations, being carried out by the New School for Social Research under Mercer Sullivan's direction, has selected comparison neighborhoods within the same cities as the three CDC sites which it is evaluating. b. Treatment and Comparison Sites Randomly Assigned There are a couple of examples where the treatment sites were not predetermined but rather were selected simultaneously with the comparison sites. 
The largest such evaluation is that of the State of Washington's Family Independence Program (FIP). This is an evaluation of a major change in the welfare system of the state (Long and Wissoker 1993). The evaluators, having decided upon a comparison group strategy, created east/west and urban/rural stratifications within the state in order to obtain a geographically representative sample. Within five of these subgroups, pairs of welfare offices, matched on local labor market and welfare caseload characteristics, were chosen and randomly allocated to either treatment (FIP) or control (AFDC) status (p.3). This project produced apparent results that surprised the researchers: increased utilization of welfare and reduced employment, whereas the intent of the reform was to reduce welfare use and increase employment. The researchers themselves do not put much weight on the possibility that these results spring from a failure of the comparison site method, but that possibility certainly is there. The Alabama Avenues to Self-Sufficiency through Employment and Training Services (ASSETS) Demonstration uses a similar strategy for the selection of demonstration and comparison sites, except that only three pairs were chosen. The primary sampling unit was the county, and counties were matched on caseload characteristics and population size (Davis). Results from this study look questionable when compared to a similar study which was done with random assignment of individuals in San Diego. In San Diego the estimated reduction in food consumption following Food Stamp cashout was much less. c. Problems of Spillovers, Crossovers and In and Out Migration Where comparison communities are used there are potential problems which arise either because of structural features associated with proximity or because of the movement of individuals. Often investigators have chosen communities in close physical proximity to the treatment community. This is justified on the grounds of helping to equalize regional influences. 
However, this proximity can cause problems. First, economic, political and social structures often create specialization of function within a given region: one area provides most of the manufacturing activities and the other the services, one generates mostly single family dwellings while the other features multiunit structures, one is dominated by Republicans and the other by Democrats, one captures the State Employment Services office and the other gets the State Police barracks. These can be subtle differences which generate different patterns of evolution in the two communities. Second, spillover of services and people can occur from the treatment community to the comparison community, so the comparison community is "contaminated" - either positively, by obtaining some of the services or governance structure changes generated in the treatment community, or negatively, by the draining away of human and physical resources into the now more attractive treatment community. Two features of the New School's CDC study make it less susceptible to these types of problems. First, the services being examined relate to housing benefits, which are not easily transferable to nonresidents. Second, the CDCs in the study were not newly established, so to a large extent it can be assumed that people had already made their housing choices based on the available information (though even these prior choices could create a selection bias of unknown and unmeasured degree). An example where this spillover effect was more troublesome was the evaluation of The School/Community Program for Sexual Risk Reduction Among Teens (Vincent, 1987). This was an education-oriented initiative targeted at reducing unwanted teen pregnancies. The demonstration area was designated as the western portion of a county in South Carolina, using school districts as its boundaries. One of the four comparison sites was simply the eastern portion of the same county. 
The other three comparison sites used were matched on socio-demographic characteristics. The two halves of the county were matched extremely well on factors that might influence the outcome measures as the entire county was considered to be quite homogeneous (Vincent, p.3382). However, a good deal of the information in this initiative was to be disseminated through a media campaign, and the county shared one radio station and one newspaper. Moreover, some of the educational sites, such as certain churches and workplaces, served or employed individuals from both the western and eastern parts of the county (p.3386). Obviously, a comparison of the change in pregnancy rates between these two areas will not provide a pure estimate of program impact. In-migration and out-migration of individuals are a constant feature in communities. At the treatment site, these might be considered "dilutions of the treatment". In-migration could be due to the increased attraction of services provided or it could just be a natural process which will weaken the homogeneity of community values and experiences. Out-migration means loss of some of the persons subject to the treatment. If one looks only at the stayers in the community there is a selection bias arising from both migration processes. One cannot be sure whether the program treatment itself influenced the extent and character of in and out migration. d. Dose-Response. Evaluators might choose to analyze a system like the South Carolina Pregnancy Prevention project from the perspective of dose-response effects. In other words, these areas could be viewed as three different groups: the western part of the county, the eastern part of the county, and the three noncontiguous comparison counties. Each of these received a different level of treatment: full, moderate, and little to none. If one examines just the crude absolute changes in numbers, this theory seems to play out in a logical way. 
The comparison communities' estimated pregnancy rates (EPRs) stayed the same or increased, while the eastern portion reduced its rates slightly and the western portion was more than halved (p.3385). Of course these estimates should be accepted with caution given the general lack of statistical rigor (small sample size, failure to control statistically for even observed differences between communities). Another example of dose-response methodology is an evaluation of a demonstration targeted at the prevention of alcohol problems (Casswell, 1989). Six cities were chosen and then split into two groups based on socio-demographic similarity. Within these groups, each received a treatment of varying intensity. One was exposed to both a media campaign and the services of a community organizer, one had only the media campaign, and the third had no treatment. In this way researchers could examine the effect of varying levels of intervention intensity to determine, for instance, if there existed an added benefit to having a community organizer available (or if the real impact came from the media campaign). It should be noted, however, that random assignment of cities within groups had to be sacrificed in order to avoid possible spillover effects from the media campaign. Most important, this procedure does not get around the underlying problem of comparison communities - the questionable validity of the assumption that once matched on a set of characteristics the communities would have evolved over time in essentially the same fashion with respect to the outcome variables of interest. If this assumption does not hold then the "dose of treatment" will be confounded in unknown ways with underlying differences among the communities, once again a type of selection bias. e. 
Pre-Post, Using Communities As was noted with respect to individuals, contrasting measurements before exposure to the treatment with measurements after exposure to the treatment is a method which has often been advocated. This procedure can also be applied with communities as the unit of analysis. The attraction of this approach is that the structural and historical conditions unique to this location which might affect the outcome variables are directly controlled for. Often a pre-post design simply uses a single pre-period measurement of relevant variables as a baseline to compare with the post-treatment measure of the same variables. However, since it is recognized that communities as well as individuals change over time, it is usually argued that it is best to have multiple measures of the outcome variable in the pretreatment period so as to allow an estimate of the trajectory of change of the variable. This procedure is often referred to as an "interrupted time-series", with the treatment taken to be the cause of the interruption (see for example McCleary and Riggs 1982). The better the researcher's ability to model the process of change in a given community over time, the stronger this approach will be. We discuss the evidence on ability to model community change below. Note also that this approach depends on having time-series measures of variables of interest at the community level and therefore runs into the problem, noted above, of the limited availability of small area data measuring variables consistently over many time periods; we are often limited to the decennial censuses for small area measurements. The major problem with pre-post designs is that events other than the treatment - e.g. 
a plant closing, collapse of a transportation network, reorganization of health providers - impinge on the community during the post-treatment period and affect the outcome variable, and these effects would be attributed to the treatment unless there is a strong statistical model which can take these exogenous events into account. We have been unable to locate examples of community pre-post designs using time series - though we feel there must be some examples out there. The EOPP (Brown et al. 1983) used, as one of its analysis models, a mixture of time series and comparison communities to estimate program impacts. The model had pre and post measures for both sets of communities and estimated the impact as the difference in percentage change pre to post between the treatment sites and their comparison sites. For the Youth Fair Chance demonstration (Dynarski et al. 1994) the proposed design will use both pre and post measures and comparison sites. f. Methods of selecting comparison communities The most common method for selecting comparison communities is to attempt to match areas on the basis of selected characteristics which are believed, or have been shown, to affect the outcome variables of interest. Usually, a mixture of statistical weighting and judgmental elements enters into the selection. Often a first criterion is geographic proximity - same city, same metropolitan area, same state, same region - on the grounds that this will minimize differences in economic or social structures and changes in area wide exogenous forces. Sometimes an attempt is made to match on service structure components in the pretreatment period, e.g., similarities in health service provision. Most important, usually, is the statistical matching on demographic characteristics. In carrying out such matching the major data source is the decennial Census, since this provides characteristic information even down to the block group level (a subdivision of Census tracts). 
Of course, the further the time period of the initiation is from the year in which the Census was taken, the weaker this matching information will be. One study used 1970 Census data to match sites when the program implementation occurred at the very end of the decade, and later found that the match was quite flawed. Since there are many characteristics on which to match, some method must be found for weighting the various characteristics. If one had a strong statistical model of the process that generates the outcomes of interest, then this estimated model would provide the best way to weight together the various characteristics measures. We are not aware of any case in which this has been done. Different schemes for weighting various characteristics measures have been advocated and used. A currently popular one is the Mahalanobis distance measure, mentioned above for the case of comparison groups constructed for individuals[12]. In a few cases there are time trend data at the small area level on the outcome variable which cover the pre-intervention period. For example, in recent years birth record data have become more consistently recorded and made publicly available, at least to the zip-code level. In some areas, AFDC and Food Stamp receipt data aggregated to the Census tract level are available. The Healthy Start evaluation proposes to attempt to match sites on the basis of trends in birth data. g. A Reprise on Friedlander-Robins Findings Having reviewed a variety of methods for finding and using comparison communities, it may be worthwhile to look briefly at some results from the Friedlander-Robins studies which provide at least some idea of the possible relative magnitude of problems with several of the methods. Recall that this study used data from a group of work-welfare studies. 
In the base studies themselves random assignment of individuals was used to create control groups, but Friedlander and Robins drew the treatment group from one segment and the control group from a different segment (thereby "undoing" the random assignment). It was then possible to compare the effects estimated by the treatment-comparison group combination with the "true effects" estimated from the random assignment estimates of treatment-control group differences (the same treatment group outcome is used in each difference estimate). We reproduce here part of one table from their study:
Comparison of Experimental and Nonexperimental Estimates of the Effects of Employment and Training Programs on Employment Status
Comparison Group Specification
(Friedlander, Daniel and Philip K. Robins, "Estimating the Effect of Employment and Training: An Assessment of Some Nonexperimental Techniques," Table 3, p.13) The data are drawn from four experiments carried out in the 1980s (Arkansas, Baltimore, San Diego, Virginia). The outcome variable is whether employed (the employment rate ranged from a low of .265 in Arkansas to a high of .517 in Baltimore). Across the top of the table there is a brief description of how the comparison group was constructed, using four different schemes for construction. In the first two columns, the two across-site methods use the treatment group from one site, e.g. Baltimore, with the control group from another site, e.g. San Diego, used as the comparison group. In the second column the term "matched" indicates that each member of the treatment group was matched with a member of the comparison group using the Mahalanobis "nearest neighbor" method, and the estimates of the impact were then made as the difference between the treatment group and the matched comparison group. In the first column no such member by member match was done; however, the regression equation in which the impacts are estimated includes variables for characteristics, and this controls for measured differences in characteristics between the two groups. The within-site/across-cohort method in column three builds on the fact that the samples at each site were enrolled over a fairly long time period, and it was therefore possible to split the sample in two parts: those enrolled before a given date, called the "early cohort", and those enrolled after that date, the "late cohort". The treatment group from the "late cohort" is used with the control group from the "early cohort" as its comparison group. This approximates a pre-post design for a study. Finally, in column four, for two of the sites the work-welfare program was implemented through several local offices. 
It was possible, therefore, to use the treatment group from one office with the control group from the other office as a comparison group. This procedure approximates a matching of communities in near proximity to each other.

The first row of the table gives the number of pairs tested. This is determined by the number of sites, the number of outcomes (employment outcomes at two different post-enrollment dates were used), and the number of subgroups (AFDC applicants and current AFDC recipients). The number of pairs gets large because each site can be paired with each of the three other sites. The smaller number of pairs in the within-site/across-office column occurs because there were only two sites with multiple offices.

The next row gives the means of the experimental estimates, i.e., the "true impact" estimates from the randomly assigned treatment-control differentials of the original study. Thus, for example, the experimental estimate of the treatment-control difference in employment rates, averaged across all four sites, was .056, i.e., 5.6 percentage points.

The next row compares the estimates obtained with the constructed comparison groups to the "true impact" (experimental) estimates, averaged across all pairs. For example, the mean absolute difference between the "true impact" estimates and those obtained with the across-site/unmatched constructed comparison groups was .09; that is, the difference between the two sets of estimates was on average more than 1.5 times the size of the "true impact" (.09/.056 is about 1.6)!

The next row tells the percentage of the pairs in which the constructed comparison group estimates yielded a different statistical inference than the "true impact" estimates. A different statistical inference occurs when only one of the two impact estimates is statistically significant, or when both are statistically significant but with opposite signs. A 10 percent level of statistical significance was used.
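To make the two summary rows just described concrete, the following is a minimal sketch in Python of how a mean absolute difference and a share of pairs with a different statistical inference could be computed. It is not the procedure used by Friedlander and Robins; the `Estimate` class, the function names, and all of the numbers in `pairs` are hypothetical and invented purely for illustration. Only the decision rule (one estimate significant and the other not, or both significant with opposite signs, at the 10 percent level) follows the description above.

```python
from dataclasses import dataclass


@dataclass
class Estimate:
    impact: float   # estimated effect on the employment rate (e.g. 0.056)
    p_value: float  # two-sided p-value for the estimated impact


def different_inference(experimental, constructed, alpha=0.10):
    """True if the two estimates lead to different statistical inferences."""
    exp_sig = experimental.p_value < alpha
    con_sig = constructed.p_value < alpha
    if exp_sig != con_sig:
        # Only one of the two estimates is statistically significant.
        return True
    if exp_sig and con_sig:
        # Both significant: the inference differs only if the signs differ.
        return (experimental.impact > 0) != (constructed.impact > 0)
    # Neither is significant: same inference (no detectable impact).
    return False


def summarize(pairs, alpha=0.10):
    """Return (mean absolute difference, share with different inference)."""
    n = len(pairs)
    mad = sum(abs(e.impact - c.impact) for e, c in pairs) / n
    share = sum(different_inference(e, c, alpha) for e, c in pairs) / n
    return mad, share


# Hypothetical (experimental, constructed-comparison) pairs; NOT study data.
pairs = [
    (Estimate(0.056, 0.01), Estimate(0.146, 0.02)),   # both significant, same sign
    (Estimate(0.056, 0.01), Estimate(0.020, 0.40)),   # only one significant
    (Estimate(0.030, 0.04), Estimate(-0.040, 0.03)),  # both significant, opposite signs
    (Estimate(0.010, 0.50), Estimate(0.030, 0.30)),   # neither significant
]

mad, share = summarize(pairs)
print(f"mean absolute difference: {mad:.3f}")
print(f"share with different inference: {share:.0%}")
```

Note that the fourth hypothetical pair counts as agreement even though the two point estimates differ by a factor of three; under this rule, two statistically insignificant estimates always yield the same inference.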
The fifth row indicates the percent of the pairs in which the estimated impacts are statistically significantly different from each other.

For our purposes, we focus on rows three and four. Row three tells us that under every method of constructing comparison groups the constructed comparison group estimates (called non-experimental in the table) differ from the "true impact" estimates by more than 50 percent of the magnitude of the "true impact". Row four tells us that in a substantial number of cases the constructed comparison group results led to a different inference: either the "true impact" estimates indicated that the program had a statistically significant effect on the employment rate while the constructed comparison group estimates indicated no impact (or vice versa), or one set of estimates said the program increased the employment rate at a statistically significant level while the other said that it decreased the employment rate at a statistically significant level.

Now we focus more closely on columns three and four, because these are the types of comparisons that are likely to be more relevant for community-wide initiatives: as already noted, the within-site/across-cohort design approximates a pre-post design in a single community, and the within-site/across-office design approximates a design that uses a nearby neighborhood as the comparison group. These designs appear to be better than the across-site designs in that, as indicated in row three, the absolute difference between the "true impact" and the constructed comparison group estimates is much smaller, and is smaller than the true impact itself. However, the difference is still over 50 percent of the size of the "true impact". The magnitude of the difference is important if one is carrying out a benefit-cost analysis of the program.
A 4.5 percent difference in employment rates might not be sufficiently large to justify the costs of the program, but a 7.9 percent difference might make the benefit-cost ratio look very favorable. In that case a benefit-cost analysis using the average "true impact" would have led to the conclusion that the social benefits of the program do not justify the costs, whereas the average constructed comparison group impact (assuming that it was .034 greater) would have led to the erroneous conclusion that the program did provide social benefits which justify its costs.

When we move to row four we have to be a bit more careful in interpreting the results, because the sample sizes for the column three and four estimates are considerably smaller than those for the column one and two cases. For example, the entire treatment group is used in each pair in columns one and two, but only half the treatment group is used in columns three and four. Small sample size makes it more likely that both the random assignment estimates and the constructed comparison group estimates will be found to be statistically insignificant. Thus, the percent with a different statistical inference should inherently be smaller in columns three and four. Even so, for the within-site/across-cohort design, in nearly 30 percent of the pairs the constructed comparison group estimates would lead to a different - and therefore erroneous - inference about the impact of the program. For the within-site/across-office estimates, 13 percent led to a different statistical inference. Is this a tolerable risk of erroneous inference? I would not think so, but others may feel otherwise.

There are a couple of additional points about the data from this study which should be borne in mind. First, this is just one set of data analyzed for a single, relatively well understood outcome measure: whether employed or not.
There is no guarantee that the conclusions about the relative strength of alternative methods of constructing comparison groups found with these data would hold up for other outcome measures. Second, in the underlying work-welfare studies the populations from which treatment group members and control group members were drawn were very much the same, i.e., applicants for or recipients of AFDC. Therefore, even when constructing comparison groups across sites, one is assured of having already selected persons whose employment situation is so poor that they need to apply for welfare. In community-wide initiatives, the population being dealt with would be far more heterogeneous. There would be a far wider range of unmeasured characteristics which could affect the outcomes and, therefore, statistical controls (matching or modeling) could be much less adequate in assuring comparability of the treatment and constructed comparison groups.
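As an illustration of the matching form of statistical control referred to above, the following is a minimal Python sketch of Mahalanobis "nearest neighbor" matching on two measured characteristics. It is an assumption-laden sketch, not the procedure used in the underlying study: the function names are hypothetical, the covariates (age and number of children) and all data values are invented, the covariance matrix is estimated from the pooled sample, and matching is done with replacement. The point it makes is the one in the text: the match is only as good as the measured characteristics that enter the distance, and unmeasured characteristics play no part.

```python
def covariance_matrix(rows):
    """Sample covariance matrix (2 x 2) for rows of two covariates each."""
    n = len(rows)
    means = [sum(r[j] for r in rows) / n for j in range(2)]

    def cov(a, b):
        return sum((r[a] - means[a]) * (r[b] - means[b]) for r in rows) / (n - 1)

    return [[cov(0, 0), cov(0, 1)], [cov(1, 0), cov(1, 1)]]


def invert_2x2(m):
    """Inverse of a 2 x 2 matrix (assumes a non-zero determinant)."""
    det = m[0][0] * m[1][1] - m[0][1] * m[1][0]
    return [[m[1][1] / det, -m[0][1] / det],
            [-m[1][0] / det, m[0][0] / det]]


def mahalanobis_sq(x, y, inv):
    """Squared Mahalanobis distance between two covariate pairs."""
    d0, d1 = x[0] - y[0], x[1] - y[1]
    return (d0 * (inv[0][0] * d0 + inv[0][1] * d1)
            + d1 * (inv[1][0] * d0 + inv[1][1] * d1))


def nearest_neighbor_match(treated, comparison):
    """For each treated member, index of the closest comparison member."""
    # Assumption: covariance is estimated from the pooled sample.
    inv = invert_2x2(covariance_matrix(treated + comparison))
    return [
        min(range(len(comparison)),
            key=lambda j: mahalanobis_sq(t, comparison[j], inv))
        for t in treated
    ]


def matched_impact(y_treated, y_comparison, matches):
    """Treatment mean minus the mean of the matched comparison members."""
    matched = [y_comparison[j] for j in matches]
    return sum(y_treated) / len(y_treated) - sum(matched) / len(matched)


# Hypothetical data: (age, number of children) and employed (1) or not (0).
treated = [(25, 1), (40, 3)]
comparison = [(41, 3), (24, 1), (60, 0)]
y_treated = [1, 1]
y_comparison = [1, 0, 0]

matches = nearest_neighbor_match(treated, comparison)
impact = matched_impact(y_treated, y_comparison, matches)
print(matches, impact)
```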
© 1997
2001 Telesis Corporation |