Lately I've been reading a lot about using student test scores to evaluate teachers. This has been a long-standing concern of mine: not that the idea is inappropriate, but that the skills to do it well don't exist in K12 education.
There is a belief that you can do something simple, like compare the fraction of students who pass a certain standard from one year to the next, and be done with it. That would be true if you were dealing with a well-defined experiment with random assignment to treatment groups, but we don't have that.
Analysis of treatment effects and program evaluation works on a continuum. When you have strong experimental controls, the statistics are easy: compare the mean of one group to the mean of the other with a t-test. The further you deviate from strong experimental controls, namely blind evaluation and random assignment, the harder the statistics get. K12 standardized tests are very far from strong experimental controls.
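When assignment really is random, the comparison is exactly that simple. Here is a minimal sketch in Python, with hypothetical scores and Welch's t statistic computed by hand from the standard library:

```python
# Minimal sketch: under true random assignment, comparing two groups
# reduces to a two-sample (Welch's) t-test. Scores are hypothetical.
import math
import statistics

treatment = [72, 68, 75, 80, 66, 74, 79, 71, 70, 77]
control   = [65, 70, 62, 68, 71, 64, 66, 69, 63, 67]

def welch_t(a, b):
    """Welch's t statistic: difference in means over its standard error."""
    va, vb = statistics.variance(a), statistics.variance(b)  # sample variances
    se = math.sqrt(va / len(a) + vb / len(b))
    return (statistics.mean(a) - statistics.mean(b)) / se

t = welch_t(treatment, control)
print(round(t, 2))  # → 3.83
```

A t statistic this large would be read as a clear difference between the groups, but the inference is only that clean because the groups were randomized; without random assignment the same arithmetic answers a different, confounded question.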
Last year, the Parkrose board of education sent the letter below opposing the adoption of OAR 581-022-1723, 1725 on teacher evaluation, not because we thought the idea was bad, but because we thought K12 was going to screw it up.
This is not a training issue. I can't teach someone who thinks a chi-squared test is pushing the envelope how to do this in a week. This is a job for professionals; not education professionals, but statistical and program evaluation professionals. Not an education professional who knows some stats, but a real statistician and program evaluator.
I spent years trying to explain to ODE that they were understating the uncertainty on OAKS test scores by at least a factor of four because they had violated an assumption of the estimator they were using, but it fell on deaf ears.
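To illustrate the kind of assumption at issue (a hypothetical sketch, not ODE's actual estimator): if scores are correlated within classrooms, the usual standard error, which treats every score as an independent observation, is too small by roughly the design-effect factor sqrt(1 + (m - 1)ρ). With classes of 30 and half the variance at the classroom level, that factor is about four.

```python
# Hypothetical sketch (not ODE's actual estimator): when scores share a
# classroom-level effect, the naive standard error of the school mean,
# which assumes independent observations, understates the uncertainty.
import math
import random
import statistics

random.seed(1)

def simulate_school(n_classrooms=20, class_size=30, class_sd=5.0, noise_sd=5.0):
    """Each classroom's scores share a random classroom effect."""
    classrooms = []
    for _ in range(n_classrooms):
        effect = random.gauss(0, class_sd)        # shared classroom effect
        classrooms.append([70 + effect + random.gauss(0, noise_sd)
                           for _ in range(class_size)])
    return classrooms

classrooms = simulate_school()
scores = [s for room in classrooms for s in room]

# Naive SE: treats all 600 scores as independent draws.
naive_se = statistics.stdev(scores) / math.sqrt(len(scores))

# Cluster-aware SE: treats each classroom mean as one independent observation.
means = [statistics.mean(room) for room in classrooms]
cluster_se = statistics.stdev(means) / math.sqrt(len(means))

print(naive_se < cluster_se)  # expect the naive estimate to be smaller
```

Under this parameterization the cluster-aware standard error comes out several times larger than the naive one, which is the shape of the problem: report the naive number and your confidence intervals are a fraction of their true width.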
Here is the letter.
The Parkrose Board of Education would like to comment on the proposed OAR 581-022-1723, 1725, Teacher and administrator evaluation and support. We believe that the change may help in understanding true student achievement, especially the view that state assessments make up some of the core measures of student achievement.
The proposed rules are almost entirely permissive, and the listed measures can be made part of teacher and administrator evaluation when properly performed. The list, however, includes many items that are extremely difficult to perform well. In particular, we suggest excluding the use of student assessments from 581-022-1723 (1) (2) (a).
Student data, like many tools, can be very useful in the hands of a trained expert. Too often, student performance data is treated like the end result of a randomized controlled trial, with each school or teacher having a randomly assigned collection of students. Students are not randomly assigned: parents choose neighborhoods, students are sorted into classrooms based on need, and students move from school to school over the course of a year. This confounding self-selection means that you cannot simply look at the raw data; you must adjust for those factors.
Those familiar with program evaluation and quasi-experimental design know that it is possible to use student data to evaluate the efficacy of teachers. The problem is that the statistical expertise does not exist in individual school districts, in Educational Service Districts, or at the Oregon Department of Education.
Until practitioners understand, at a minimum, when they should use propensity score matching, multinomial logit models of classroom selection, or hierarchical linear models that evaluate a school and its teachers simultaneously, we do not recommend using student data. Without these basics, teachers and administrators may be improperly identified as excellent or as failing.
Our hope would be that we could find the best possible measures, using the best statistical models available to us for the good of all students in the state of Oregon.
James Woods PhD
Parkrose Vice Chair