The New York Times’ editorial today about teacher evaluation was unusually odd. It sounded as though the writer knows there is no evidence to support using student test scores, but is trying to find a rationale for doing it anyway. There is literally not a single district one can point to and say, “It’s working here. Here is proof that using test scores to evaluate teachers produces excellence.”

The editorial claimed that Montgomery County’s much-admired Peer Assistance and Review program relies on test scores. It sounds like Cinderella’s ugly sister is trying to stuff her big foot into the glass slipper. Montgomery County turned down $12 million in Race to the Top funding to avoid using test scores to evaluate teachers.

Its peer assistance program works far better than the value-added, test-based evaluations now adopted in many states and districts, under which test scores count for as much as 50% of a teacher’s “grade.”

Carol Burris, who has been a leader in the fight against test-based evaluation in New York, shares her reaction to this odd editorial:

Today’s editorial in The New York Times [http://www.nytimes.com/2012/09/17/opinion/in-search-of-excellent-te...] on teacher evaluation is just one more beat on the same broken drum. The Times seeks to distance the Chicago plan from other evaluation plans, which, with the exception of Montgomery County’s, are more like Chicago’s than not. Montgomery County’s longstanding plan does not use test scores for evaluation, and it focuses on teacher improvement, not sorting and dismissal.
 
The column bases its arguments on the same false assumptions that folks like Michelle Rhee have sold to the public. The first is that teacher evaluation is universally broken.  This assumption comes from the report, The Widget Effect, produced by Rhee’s group, the New Teacher Project.  It drew its conclusions from a few selected districts. Evaluation is not broken in Montgomery County and it is certainly not broken at my high school.  Many districts have sound evaluation systems that help teachers become more effective—they are not teacher dismissal machines but rather supervision models designed to improve instruction.
 
The second false assumption is that excellent teachers leave districts because they are not rewarded (translation: they do not receive merit pay). Again, there is no factual evidence to support this. Merit pay is neither effective nor desired by teachers; it is a gift of public funds at a time when schools can ill afford it.
 
The third false assumption is that as long as we decrease the percentage of the evaluation number derived from VAM scores, we can make it all work. The editorial uses IMPACT as an example. It attributes the Washington, DC school district’s decision to decrease the percentage of VAM in evaluations to “teacher anxiety.” I find that remark, which reformers often use to describe teacher responses to these systems, to be both paternalistic and sexist. Teachers object to VAM because they know its limitations and flaws. It was never designed to evaluate individual teachers; it was designed by researchers to be a tool to assess systems and programs. Using VAM to evaluate teachers is akin to using Lysol as a mouthwash because it kills germs on your kitchen floor.
 
Here is an example of the limitations of the New York system. Teachers and principals of grades 4-8 were recently assigned “growth scores” by the State Education Department (SED). The model SED used was a hybrid of a growth model and a VAM. The American Institutes for Research (AIR), which created the model, also produced a technical manual to explain the resulting scores. You can find that manual here: http://usny.nysed.gov/rttt/docs/nysed-2011-beta-growth-tech-report.pdf.
 
It is well worth a careful read. AIR was remarkably candid in explaining the limitations. Here are some highlights:
 
• Although AIR preferred to use three years of prior scores as the baseline for growth, such data were available for Grades 6-8 only. Grades 4 and 5 had limited prior data, which was reflected in larger error, especially in Grade 4.
• There was no way to identify co-teachers or support teachers, and a little over half of all student scores in grades 4 and 5 were attributed to principals only, because they could not be correctly linked to teachers.
• The only covariates (predictor variables in the model) were ELL status, SWD status (with all disabilities, mild and severe, considered the same) and economic disadvantage.
• Race, ethnicity, class size, spending, attendance and a host of other variables that are known correlates of student performance were not included.
 
Perhaps the most important problems with the model are explained on pages 24-30. AIR clearly shows that as the percentage of students with disabilities and students of poverty in a class or school increases, the average teacher or principal growth score decreases. In short, the larger the share of such students, the more the teacher and principal are disadvantaged by the model. Regarding ELL students, the report indicates that some teachers are advantaged, while others are disadvantaged. This should come as no surprise: well-educated students from China and students from rural areas of El Salvador with interrupted education are both classified as ELL, but their growth, as measured by test scores, is quite different.
 
Likewise, in this model, teachers who have students whose prior test scores are higher are advantaged, while teachers whose students have lower prior achievement are disadvantaged. This phenomenon, known as peer effects, has been observed in the literature since the 1980s. It is a root cause of the widening of the test score gap among classes in tracked schools. It has also been found in school-to-school comparisons. In a study of Houston schools after Katrina, the schools that received a large share of high-performing students from New Orleans saw their original students’ scores rise, and those that received a large share of low-performing students from New Orleans saw their original students’ scores decrease.
 
Perhaps the best critique of the model comes from AIR itself. They conclude, “the model selected to estimate growth scores for New York State represents a first effort to produce fair and accurate estimates of individual teacher and principal effectiveness based on a limited set of data” (p. 35). Not “our best attempt,” not even a “good first attempt,” but rather a “first effort.” And yet, across the state, teachers and principals have received scores telling them that they are ineffective in producing student learning growth.
 
I can assure those who believe that teachers are simply anxious that this is not something a Xanax will cure. Teachers and principals are smart and savvy; you are mistaking outrage for anxiety.