Testing Trap
The single largest (and possibly most destructive) federal intrusion into
America's public schools

by Richard F. Elmore

Supporters of the reauthorization, last January, of the Elementary and
Secondary Education Act hail it for tightening school accountability
substantially, for granting more flexibility to states and school districts
in the use of federal funds, and for applying sanctions to and providing
aid for failing schools. Opponents argue that the bill doesn't go far
enough, because congressional supporters of school choice failed to
persuade their colleagues and the president's advisers to include vouchers
in the bill.

Sadly, from an educational perspective, both sides miss the major issues.
This is an "accountability bill" that utterly fails to understand the
institutional realities of accountability in states, districts, and
schools. And its provisions are considerably at odds with the technical
realities of test-based accountability. In the history of federal education
policy, the disconnect between policy and practice has never been so
evident, nor so dangerous. Ironically, the conservative Republicans who
control the White House and the House of Representatives are sponsoring the
single largest, and the single most damaging, expansion of federal power over
the nation's education system.

Under the new law, the federal government mandates a single test-based
accountability system for all states, a system currently operating in fewer
than half the states. It requires annual testing at every grade level, and
states must disaggregate their test scores by students' racial and
socioeconomic backgrounds, a system currently operating in only a handful of
states, and one fraught with technical difficulties. The federal government
further mandates a single definition of adequate yearly progress, the
amount by which schools must increase their test scores in order to avoid
some sort of sanction, an issue that in the past has been decided jointly by
states and Washington. Finally, the law sets a single target date by which
all students must exceed a state-defined proficiency level, an issue that in
the past has been left almost entirely to states and localities.

Thus the federal government is now accelerating the worst trend of the
current accountability movement: that performance-based accountability has
come to mean testing alone. In the early stages of the current movement,
reformers had an expansive view of performance that included, in addition
to tests, portfolios and formal exhibitions of students' work,
student-initiated projects, and teachers' evaluations of their students.
The comparative appeal of standardized tests is easy to see: they are
relatively inexpensive to administer; can be mandated simply; can be
rapidly implemented; and deliver clear, visible results. But relying only
on standardized tests dodges the complicated questions of what tests
actually measure and of how schools and students react when tests are the
sole yardstick of performance.

If this shift in federal policy were based on the accumulated wisdom gained
from experiences with accountability in states, districts, and schools, or
if it were based on clear design principles that had some basis in
practice, it might be worth the risk. In fact, however, it is based on
little more than talk among people who know hardly anything about the
institutional realities of accountability, and even less about the problems
of improving instruction in schools.

The idea of performance-based accountability was introduced in the
mid-1980s by the National Governors Association, led by Bill Clinton, then
governor of Arkansas. It took the form of what was then called the "horse
trade": states would grant schools and districts more flexibility in making
decisions about what and how to teach, in return for more accountability
for academic performance. This idea became the central theory of today's
accountability reforms. It was appealing in principle: governors and state
legislators could take credit for improving schools without committing
themselves to serious increases in funding. From the beginning,
performance-based accountability was an explicitly political idea, designed
to bring a broad coalition together behind a single vision of reform. As
with most such ideas, it was weak on practical details, most of which were
left to state and local policymakers and educators.

The movement got a major boost in 1994, when Title I (the flagship federal
compensatory education program) was amended to require states to create
performance-based accountability systems for schools. The vision behind the
1994 amendments was that Title I would complement and accelerate the trend
that began at the state level; the amendments required states to develop
academic standards, assessments based on the standards, and progress goals
for schools and school districts, all within ambitious timetables. The
merger of state and federal accountability policies ("alignment," as it was
called) was supposed to occur by 2000. By the end of the decade, it was
difficult to find more than one or two states lacking some form of testing
program and public release of the results. In all but a few states,
however, the basic architecture of accountability remained relatively crude
and underdeveloped. In those few states where the idea had been developed
most extensively (Texas and Kentucky, for example), the systems worked well
enough, according to the testimonials of their sponsors, to legitimate the
idea that they were successful in general. But even in these states, there
were legitimate criticisms of the accountability system's actual effect on
academic performance and drop-out rates.

By the late 1990s, it was abundantly clear that the states had fallen well
short of what the crafters of the 1994 Title I amendments had envisioned.
It was also clear that the federal government possessed very little
leverage with which to force them along. States varied vastly in their
administrative capacities to implement performance-based accountability
systems. More important, creating accountability systems at the state level
is essentially a political act, and Washington's harmless knuckle-rapping
was hardly going to overcome the intransigence of a state legislature or
governor. The U.S. Department of Education's ability to monitor and enforce
compliance was limited; budget cuts whittled away at the Department's Title
I staff just as their responsibilities were increasing; and its senior
political appointees were reluctant to make life too difficult for
governors and chief state school officers, who are among their key
political constituencies. So by the target date for full compliance, fewer
than half the states had met the requirements. It came as no surprise to
learn that by the year 2000, many schools with Title I-eligible students
were simply unaware of the program's major policy shift in 1994.

This experience should have signaled to the Bush administration and
Congress that complex issues of state and local capacity could not be
brushed aside just by tightening the existing law's requirements. If more
than half the states were unable or unwilling to comply with the
requirements of the previous, less stringent, more forgiving law, why would
one expect all the states to comply with a much more stringent and exacting
one?

Even though virtually all the states have joined the accountability
bandwagon, doing so was, for many, largely a symbolic act. The designs of
the systems are still primitive; state education officials' authority to
oversee school districts is still limited in many cases; and the political
consequences of imposing large-scale, statewide testing in areas with
strong traditions of local control are risky. Moreover, mounting a
statewide testing system is beyond the capacity of most state departments
of education. Those that have embarked on large-scale testing are stretched
to their limits just managing test-development work or monitoring testing
contractors. Finally, there are technical issues. Standardized tests
inevitably become highly politicized and, in the course of the debate, the
limits of testing are subjected to public scrutiny. Many policymakers enter
the accountability debate not knowing much about testing, and they often
discover, much to their chagrin, that off-the-shelf tests may not validly
measure the content specified in state-mandated standards and that
norm-referenced tests (tests that deliberately create a normal distribution
around a mean) may not be effective in measuring changes in performance.

The working theory behind test-based accountability seems simple, perhaps
fatally so. Students take tests that measure their academic performance in
various subject areas. The results trigger certain consequences for
students and schools: rewards, in the case of high performance, and
sanctions for poor performance. Attaching stakes to test scores is supposed
to create incentives for students and teachers to work harder and for
school and district administrators to do a better job of monitoring their
performance. If students, teachers, or schools are chronically low
performing, presumably something more must be done: students must be denied
diplomas or held back a grade; teachers or principals must be sanctioned or
dismissed; and failing schools must be fixed or simply closed. The threat
of such measures is supposed to motivate students and schools to
ever-higher levels of achievement.

In fact, this is a naïve view of what it takes to improve student learning.
Fundamentally, internal accountability must precede external
accountability. That is, school personnel must share a coherent, explicit
set of norms and expectations about what a good school looks like before
they can use signals from the outside to improve student learning. Giving
test results to an incoherent, atomized, badly run school doesn't
automatically make it a better school. A school's ability to make
improvements has to do with the beliefs and practices that people in the
organization share, not with the kind of information they receive about
their performance. Low-performing schools aren't coherent enough to respond
to external demands for accountability.

The work of turning a school around entails improving "capacity" (the
knowledge and skills of teachers), changing their command of content and how
to teach it, and helping them to understand where their students are in
their academic development. Low-performing schools, and the people who work
in them, don't know what to do. If they did, they would be doing it
already. You can't improve a school's performance, or that of any teacher
or student in it, without increasing the investment in teachers' knowledge,
pedagogical skills, and understanding of students. Test scores don't tell
us much of anything about these important domains; they provide a
composite, undifferentiated signal about students' responses to a problem.

Test-based accountability without substantial investments in internal
accountability and instructional improvement is unlikely to elicit better
performance from low-performing students and schools. Furthermore, the
increased pressure of test-based accountability alone is likely to
aggravate the existing inequalities between low-performing and
high-performing schools and students. Most high-performing schools simply
reflect the social capital of their students (they are primarily schools
with students of high socioeconomic status), rather than the internal
capacity of the schools themselves. Most low-performing schools cannot rely
on the social capital of students and families and instead must rely on
their organizational capacity. With little or no investment in capacity,
low-performing schools get worse relative to high-performing schools.

Some changes in the new law provide unrestricted money that states can use
to enhance capacity in schools, if they choose to. But neither state nor
federal policy addresses the capacity issue with anything like the
intensity applied to test-based accountability. The result is an enormous
distortion in the relationship between accountability and capacity, a
distortion that is being amplified rather than dampened by federal policy.

In today's environment, critics who suggest that there might be problems
with the ways tests are used for accountability purposes are branded
apologists for a broken system. That the performance of students and
schools can be accurately and reliably measured by test scores is almost an
article of faith. As a result, tests are being misused in ways that will
eventually undermine the credibility of performance-based accountability.

The most serious problem lies in the use of test scores to make decisions
about whether students can advance to the next grade or graduate from high
school. The American Psychological Association's guidelines for test use
(and the consensus of professional judgment in the field of educational
testing and measurement) specifically prohibit basing any consequential
judgment about an individual student on a single test score. Why? Because
test scores are associated with a significant margin of error. That margin
of error increases as the number of cases decreases; individual scores are
typically much less reliable than aggregates of many individual scores.
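
The arithmetic behind that last point can be sketched in a few lines of
Python. This is an illustration under stated assumptions, not a model of any
actual test: it treats each score as a true value plus independent random
error, and the 10-point error figure and the function name are hypothetical.

```python
import math

def standard_error(measurement_sd: float, n_scores: int) -> float:
    """Standard error of the average of n independent, equally noisy scores."""
    return measurement_sd / math.sqrt(n_scores)

# Hypothetical figure: a single score carries about 10 points of random error.
MEASUREMENT_SD = 10.0

# One student, one classroom, one grade cohort, one large school.
for n in (1, 25, 100, 400):
    se = standard_error(MEASUREMENT_SD, n)
    print(f"scores averaged: {n:4d}   standard error: {se:5.2f} points")
```

Under these assumptions the error in an average falls with the square root
of the number of scores, so quadrupling the number of cases only halves it,
and a single student's score carries the full error.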

The solution is to use multiple measures of a student's performance when
making consequential decisions. But this solution is more expensive and it
introduces a new level of complexity into the system. Were high-school
graduation to be contingent on a composite of grades, test scores, and
portfolios of students' work, developing such a composite would be a
challenging technical feat. It would also introduce a certain amount of
judgment into the system, and policymakers tend to distrust the
professionals who make such judgments.

A similar problem arises at the lower-school level. Under Title I, schools
are expected to meet their adequate yearly progress goals, measured by a
school's annual gain in test scores. Title I also requires disaggregating
these scores by students' ethnic and economic backgrounds. But such
measures are highly unreliable for populations the size of a typical
elementary school, and they are particularly unreliable for even smaller
sub-groups of students. Schools are often misclassified as low- or
high-performing purely because of random variation in their test scores,
unrelated to any educational factor.
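
A small simulation can make the misclassification risk concrete. It is a
sketch under loudly stated assumptions: every simulated school has identical,
unchanging quality, scores are drawn from a normal distribution with a mean
of 50 and a standard deviation of 15, and a 3-point change in the school
average is treated as "notable." All of those numbers, the group sizes, and
the function names are hypothetical, chosen only for illustration.

```python
import random
import statistics

def simulated_gains(n_students, n_schools=1000, score_sd=15.0, seed=0):
    """Year-over-year change in mean score for schools whose true quality is fixed."""
    rng = random.Random(seed)
    gains = []
    for _ in range(n_schools):
        # Two different cohorts drawn from the same distribution: any
        # difference between the two yearly means is pure chance.
        year1 = statistics.fmean([rng.gauss(50.0, score_sd) for _ in range(n_students)])
        year2 = statistics.fmean([rng.gauss(50.0, score_sd) for _ in range(n_students)])
        gains.append(year2 - year1)
    return gains

def share_beyond(gains, threshold):
    """Fraction of schools whose mean moved by more than `threshold` points."""
    return sum(abs(g) > threshold for g in gains) / len(gains)

# Hypothetical sizes: a small subgroup, one tested grade, a large school.
for n in (20, 60, 400):
    share = share_beyond(simulated_gains(n), threshold=3.0)
    print(f"students tested: {n:3d}   schools moving more than 3 points: {share:.0%}")
```

In this setup nothing about any school changes between the two years, so
every apparent gain or loss is noise. The share of schools showing a notable
swing is far larger for the small groups than for the large one.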

The standards and accountability movement is in danger of being transformed
into the testing and accountability movement. States without the human and
financial resources to select, administer, and monitor tests are now being
forced to begin testing at all grade levels. Instead of creating academic
standards that drive the design of an appropriate assessment, low-capacity
states will simply select a test based on its expense and ease of
administration, making charges of "teaching to the test" increasingly
accurate. A test with no external anchor in standards or expectations about
student learning becomes a curriculum in itself, trivializing the whole
idea of accountability.

The enthusiasm for performance-based accountability plays to the worst
weaknesses of the American education system. After World War II, most
industrialized countries nationalized their education systems, but not the
United States. Because decisions about content and performance were left to
states and localities for so long, they never developed the capacity to
monitor the quality of teaching and learning in schools, to support the
development of teachers' and administrators' knowledge and skill, or to
evolve measures of performance that are useful to educators and the public.

The difficult, uneven, and protracted slog toward clearer expectations and
supports for learning has barely begun in most states and localities. The
history of federal involvement in that long effort is mixed at best. The
current law repeats all of the strategic errors of the previous law, but
with greater federal intervention. The prognosis is not good.

The best we can hope for is that the capacity problems of states and
localities will become more visible as a political issue, triggering
responses that will help schools overcome the real obstacles they face in
improving the quality and intensity of teaching and learning. Similarly, we
can hope that the technical failures of testing will trigger a response
that focuses more on broad assessments of student learning.

The worst that can happen is that test-based accountability will widen the
gap between schools serving the well-off and those serving the poor, thus
confirming the public's suspicion that expecting high levels of learning
from all children is unrealistic. Performance-based accountability in
education is mutating into a caricature of itself.

Richard F. Elmore, Ed.D. '76, Anrig professor of educational leadership at
the Harvard Graduate School of Education, is completing a study of school
accountability. Recent publications include "Building a New Structure for
School Leadership" and "Bridging the Gap between Standards and
Achievement," both available from This article is
adapted with permission from an earlier version, titled "Unwarranted
Intrusion," which appeared in the Spring 2002 issue of Education Next
(, published by the Hoover Institution, Stanford
