What can you do with $45 million and three years? Well, if you’re the Bill & Melinda Gates Foundation, you can confirm, empirically, what educators have always known implicitly: great teaching matters, it can be measured, and it improves student learning.
That was one of the many findings released last week in the final report from the MET Project (Measures of Effective Teaching). MET has generated buzz in education and popular media alike, so I won’t provide a full synopsis here. For a basic summary, check out the Washington Post or Huffington Post rundown; for more thoughtful commentary, turn to posts from Chad Aldeman, Andy Smarick, Rick Hess, Marty West, and the National Journal Experts Blog. Instead, I want to call attention to two big takeaways from the MET Project.
What teacher evaluations measure is just as important as how they measure it.
Much has been made of the finding that classroom observations are a weaker predictor of student learning than state test scores and student surveys. Some have questioned whether observations are worth the significant time and personnel costs involved to do them well. Tim Daly of TNTP even claimed that MET shows “the way that most teachers have been evaluated forever is completely unreliable.”
It’s easy to jump to that conclusion: MET used proven, high-quality observation tools, and observers were trained and certified on their knowledge of those tools. This isn’t the case with many of the classroom observations used across the country. Still, observations are a critical component of teacher evaluations, particularly for teachers in the early grades and in untested subjects. Educators also typically support the use of observations more than the use of test scores. Finally, MET’s research found that although classroom observations didn’t improve the predictive power of the evaluation measure, they did improve its reliability – or stability – from year to year.
Test scores also don’t have the same diagnostic power as classroom observations: as Amanda Ripley put it, “test scores can reveal when kids are not learning; they can’t reveal why.” Observations can provide teachers with valuable, timely, and clear feedback on their practice. Given their complexity and the timing of state testing, value-added measures are far less teacher-friendly – not to mention, limited in scope. Surely, great teaching involves much more than improving student scores on multiple-choice tests in two subjects.
To this end, it’s laudable that MET’s researchers also used higher-order tests (the SAT 9 Open-Ended Reading Assessment and the Balanced Assessment in Mathematics) to measure student learning. In some states, these assessments resemble the Common Core assessments planned for 2014-15 more closely than the current state tests do. Presumably, states should want teacher evaluations that function well not only with today’s tests, but also with those of the future.
Still, the tests MET used only consider English Language Arts and math skills. If the ultimate goal of evaluations is to measure whether teachers create learning environments where students achieve a broader set of outcomes (say, the knowledge, skills, and attributes it takes to be college- and career-ready), then there is still a long way to go in developing these systems. In 2014, many states will be simultaneously implementing new teacher evaluations and the Common Core assessments. But the best evaluation systems today do a far better job of identifying teachers who improve student learning via state test scores than teachers who improve college and career readiness. MET’s findings suggest that states should carefully consider whether their evaluation systems are measuring the teacher attributes needed to meet the Common Core’s objectives.
How teacher evaluations are used is just as important as what they measure.
Part of the demand for research like the MET Project comes from the push to use teacher evaluation systems to make human resources decisions. Hiring, retention, placement, compensation, and tenure can all be affected. Some of the push can be attributed directly to the Obama administration: developing and using teacher evaluation systems like the ones in the MET study for HR decisions was a major component of both Race to the Top and the No Child Left Behind waivers.
But there is still uncertainty surrounding teacher evaluation systems; the MET Project doesn’t provide a definitive roadmap or specific policies for states and districts looking to measure effective teaching. Many of its findings are ambiguous (with the exception that value-added measures must account for students’ prior test scores). The MET report is inconclusive when it comes to:
- whether student demographics should be included as a control in value-added models;
- precisely how to weight each component within a composite effectiveness measure: value-added data, student-perception surveys, and classroom observations;
- whether measures like the Content Knowledge for Teaching (CKT) tests or subject-based classroom observation tools could be useful additions to composite measures of teacher quality; and
- who should observe teachers, how long these observations should last, and how many observations should occur each year.
The teacher quality measures MET suggests are “better on virtually every dimension than the measures in use now.” But does that mean similar teacher evaluation systems should be used as the deciding factor for whether a teacher is fired? Or promoted? Or receives a pay increase?
Thorny questions, indeed. Yes, the new measures of effective teaching are promising, compared to most old-school teacher evaluation systems where nearly every teacher was rated ‘satisfactory.’ But given MET’s lingering questions and the inevitable measurement error in these measures of effectiveness, wouldn’t it make more sense to continue developing and refining teacher evaluation systems without rushing to use them for high-stakes decisions? Especially since most schools lack the capacity and resources to implement evaluations of the rigor and quality that the MET study used? States and districts should consider using the results from teacher evaluations in a more diagnostic manner: why not make these measures of effective teaching the first step in the process of providing professional development, determining who receives pay increases or tenure, and making decisions about hiring or firing – rather than the final step?
 In full disclosure, the work of New America’s Education Policy Program is supported, in part, with funding from the Gates Foundation.
 However, the “data suggest that assigning 50 to 33 percent of the weight to state test results maintains considerable predictive power, increases reliability, and potentially avoids the unintended negative consequences from assigning too-heavy weights to a single measure.”
 MET’s results do show that more lessons and observers increase the reliability of observations, but there are “a range of scenarios for achieving reliable classroom observations.”