In 2009, a now often-cited study of teacher evaluations in multiple states found that just 1 percent of teachers were labeled unsatisfactory. That implicitly glowing appraisal of teacher performance stood in contrast to alarming achievement gaps among students of different racial, ethnic and socioeconomic backgrounds, and to a more general slippage of U.S. students in international rankings of student achievement. The study, titled “The Widget Effect,” came at a critical moment.
Teacher evaluations tied to student achievement had become the centerpiece of the Obama administration’s education agenda and the favorite of a diverse coalition of school reformers. The argument appears commonsensical: If, as research shows, teachers are the most influential in-school factor affecting student learning, it would be reasonable to judge them, at least in part, on their students’ achievement.
Lured by the chance at a slice of billions of dollars from Obama administration reform initiatives, dozens of state legislatures passed laws that cleared the way for such evaluations. But even some supporters of tying teachers’ performance ratings to student test scores concede that the policy changes moved faster than the research undergirding them. With limited technical know-how at the state and district levels and a shortfall of experts nationally, some efforts to overhaul teacher evaluation systems have been rushed, while others have become enmeshed in legal disputes. The issue has significant political stakes, with accountability-minded school leaders often pitted against unions seeking to protect members against what they see as unfair threats to teachers’ livelihoods. The experiment is playing out in real time and promises to remain contentious for years to come.
Until quite recently, the evaluation of teachers consisted primarily of infrequent “walk-throughs” conducted by untrained administrators, the end result typically being that most teachers were rated as top performers. This led to seemingly incongruous statistics. In Chicago, for example, just 54 percent of the class of 2011 graduated from high school, in a year when 99 percent of teachers were rated as effective. At the same time that administrators argued that current practices made it too difficult to fire poor performers, teachers complained that evaluation systems provided them with little guidance on how to improve. The 2009 study that documented minuscule numbers of unsatisfactory evaluations also found that 73 percent of teachers surveyed said their most recent evaluations did not identify any areas for improvement, and only 45 percent of teachers who did have such areas identified said they received useful support to improve.
Boost from Obama
The drive to revamp teacher evaluations got a major boost when the Obama administration placed $4.35 billion from the 2009 economic stimulus law in a competitive grant program called Race to the Top, which rewarded states for tying teacher evaluations to student achievement. The administration embedded similar incentives in its School Improvement Grant program to turn around low-performing schools, as well as an initiative to grant states waivers to some of the more unpopular aspects of the federal No Child Left Behind Act’s accountability provisions. In his 2012 State of the Union Address, Obama said that schools needed flexibility to implement evaluation systems that “reward the best” teachers and “replace teachers who aren’t helping kids learn.”
The administration’s push had an enormous effect, even on states that did not win federal funds. In 2009, according to the National Council on Teacher Quality, only 15 states required annual evaluations of all teachers, and 35 states did not require evaluations to include measures of student learning. By the end of 2012, those numbers had shifted dramatically: Forty-three states required annual teacher evaluations, with 32 incorporating student achievement.
From the outset, it was clear that the new evaluation systems came with high stakes. Teachers with low ratings could lose their jobs. Salaries, promotions and reputations also hung in the balance. But there were concerns that in its zeal to tackle some long-standing problems, the administration was moving faster than the speed of research. One report, provocatively titled “The Hangover,” warned of “the unintended consequences of the nation’s teacher evaluation binge.”
A major element of the new approach to teacher evaluations is a statistical technique known as “value-added” modeling. Value-added models compare test scores students earn in any given year to the scores they were predicted to attain based on prior tests and a host of other variables; if students exceed their predicted scores, the difference is seen as the teacher’s “added value.” But efforts to rate teachers’ effectiveness solely via value-added scores have been assailed by critics as inaccurate.
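The mechanics of a value-added estimate can be sketched in a few lines. The following is a deliberately simplified toy model with synthetic data, not any state’s actual system: predict each student’s current score from the prior year’s score with an ordinary least-squares regression, then treat the average residual among a teacher’s students as that teacher’s “added value.” (Real models control for many more variables; the teacher count, score scale and noise levels here are invented for illustration.)

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (illustrative only): 200 students taught by 4 teachers.
n = 200
teacher = rng.integers(0, 4, size=n)            # teacher assignment
prior = rng.normal(50, 10, size=n)              # prior-year test score
true_effect = np.array([-2.0, 0.0, 1.0, 3.0])   # hypothetical teacher effects
current = prior + true_effect[teacher] + rng.normal(0, 5, size=n)

# Step 1: predict current scores from prior scores (OLS with intercept).
X = np.column_stack([np.ones(n), prior])
beta, *_ = np.linalg.lstsq(X, current, rcond=None)
predicted = X @ beta

# Step 2: a teacher's "value added" is the mean residual
# (actual minus predicted) across that teacher's students.
residual = current - predicted
value_added = np.array([residual[teacher == t].mean() for t in range(4)])
print(np.round(value_added, 2))
```

With enough students per teacher, the estimates recover the ordering of the invented effects; the criticisms in the text concern what happens when class sizes are small, assignments are not random, or important predictors are omitted.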
Fueling such criticisms have been fluctuations in teachers’ scores from one year to the next. For example, a 2010 study of five school districts found that of teachers who scored in the bottom 20 percent for value added one year, only 20 to 30 percent had similar rankings the next year, while 25 to 45 percent moved to the top of the rankings. A more fundamental problem for researchers is that the models assume teachers are assigned students at random, an artificial construct that seldom resembles how students are actually placed.
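Such churn is what statisticians would expect whenever the noise in a single year’s estimate is large relative to the true differences among teachers. A toy simulation (the numbers are chosen for illustration and are not drawn from the 2010 study) shows how teachers with perfectly stable underlying effectiveness can still cycle in and out of the bottom quintile:

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic setup: 500 teachers with stable "true" effects, observed
# each year with substantial estimation noise (values are illustrative).
n_teachers = 500
true_effect = rng.normal(0, 1, size=n_teachers)
year1 = true_effect + rng.normal(0, 1.5, size=n_teachers)
year2 = true_effect + rng.normal(0, 1.5, size=n_teachers)

# Teachers in the bottom 20 percent of year-1 estimates...
bottom1 = year1 <= np.quantile(year1, 0.2)

# ...and the share of them still in the bottom 20 percent in year 2.
still_bottom = (year2 <= np.quantile(year2, 0.2))[bottom1].mean()
print(f"share of year-1 bottom quintile still there in year 2: {still_bottom:.0%}")
```

Even though every teacher’s underlying effect is held constant across the two simulated years, only a minority of the bottom quintile stays there, which is the pattern the study observed.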
The actual placement of students — in both schools and classes — is far from random. Parents influence where their children go to school and often to what class and teacher they are assigned. Teachers, via seniority, often select the school and classroom where they are placed. Hence, students assigned to a particular teacher may not be representative of the general population. Student achievement can also be influenced by variables outside the teacher’s control, such as the physical condition of the school, school policies and parental support. Finally, No Child Left Behind requires assessments only in math and reading, and only in certain grades. This leaves out the majority of teachers, many of whom find themselves awkwardly incorporated into the new evaluation rubric.
Regardless of one’s view of the quality of the research, it is generally understood that the newness of the field translates into a paucity of reliable models states and districts can choose from, and a scarcity in the pool of experts who can help implement them. In 2011, the former head of Race to the Top’s technical assistance network estimated there were eight researchers in the country with expertise to implement such systems.
Paul Pastorek, former state superintendent of education in Louisiana, expressed concerns that the U.S. Department of Education and many states had oversold the ease of implementing such models. “I think some [states] may be underestimating the resources and energy that these kinds of initiatives require … state departments of education are not designed to implement these programs,” he said. A study by the Data Quality Campaign found that just 11 states (and only four of the 12 Race to the Top winners) had all of the components necessary to implement sophisticated teacher evaluation models. Race to the Top states and several districts that received School Improvement Grants had to scale back proposed reforms or push back timetables due to teacher evaluation headaches.
Evaluations in Practice
It didn’t help the cause of value-added models that an early high-profile example of their use (or misuse) erupted in national controversy. The shot across the bow came not from a school but a newspaper. In 2010, the Los Angeles Times published what it called value-added measurements for 11,500 teachers, in which improvement on student test scores was the sole criterion. The statistics were roundly criticized even by the supporters of such models, most of whom believe they should be used as one factor in evaluation along with other criteria like classroom observations.
The public nature of the project played into fears that the evaluations would be punitive rather than used to aid instruction, a point that was driven home when one teacher who was ranked “less effective than average” committed suicide. In Washington, D.C., the implementation of the Impact evaluation system under former Chancellor Michelle Rhee led to the firing of more than 400 teachers, and fears that the measurements were behind a cheating scandal in which teachers at one school were found to have changed students’ answers on standardized tests.
After Rhee’s departure, the system was softened somewhat, lowering the weight given to test score improvement from 50 percent to 35 percent. Even in Tennessee, one of the first states to receive a Race to the Top grant, there have been severe implementation headaches. The state’s assistant superintendent noted the “fundamental unfairness” of methods used to evaluate the more than 60 percent of the state’s teachers who taught grades or subjects with no standardized test. A report by the state department of education found that while the model successfully identified the best teachers, it “systematically failed to identify the lowest-performing teachers, leaving these teachers without access to meaningful professional development and their students and parents without a reasonable expectation of improved instruction in the future.”
So where do things go from here? With the nation’s teachers’ unions ambivalent at best, the issue of value-added modeling is frequently battled out at the local level. A seven-day teachers’ strike in Chicago in 2012, for example, was sparked in part by a new policy that would base 40 percent of teacher evaluations on student test scores. A similar dispute in New York City risked the loss of $450 million in Race to the Top funds when the city missed a deadline to reach an agreement on new teacher evaluation policies in early 2013.
In Florida, a case brought by a teachers’ union led a judge to strike down the state’s new merit-pay provision. Under the law, teachers are graded on math and reading tests, but for those who teach other subjects, evaluations are based largely on the performance of other teachers. In 2013, the judge ruled the provision to be “wholly invalid.”

One byproduct of the race to implement better teacher evaluations is a flurry of research that could bring aspects of the long-simmering debate closer to consensus. The most visible is a three-year, $45 million study spearheaded by the Bill & Melinda Gates Foundation of 3,000 teachers in seven school districts around the country. The study found that teacher evaluation systems based on multiple measures—including student test scores, classroom observations by multiple reviewers, and ratings by students themselves—are better than those based on a single measure.
There are those, like the National Council on Teacher Quality, who claim that in the search for perfect teacher evaluation measures, policymakers are forgoing those that might nonetheless be better than what came before. A report from the pro-reform research and advocacy group put it this way: “Are emerging teacher effectiveness measures perfect? No. But they are a marked improvement on evaluation systems that find 99 percent of teachers effective with little attention to a teacher’s impact on students.”