Here we are: the point in the semester where many a dedicated instructor quails before the giant stack of papers on his or her desk. If your students are handing in their papers only a day or two before they leave campus, writing carefully considered comments can feel like putting a message in a bottle. Will students read your feedback at all? And if they do, will it be only to seek an explanation for the grade they received?
Ideally, our students leave our courses armed with insights they will apply to future endeavors, but…what if machines could provide those same insights, at just a fraction of the coffee? Robograders have been around for a while, but a new study claims they are even more efficient and accurate than human graders. “Accurate” here means that they’re better at calibrating test essays to the control essays supplied to graders. The news has prompted several thoughtful reflections.
For The New York Times, Les Perelman produced a completely incoherent essay that earned a 6 from E.T.S.’s E-Rater.
As the Times article explains, “E.T.S. also acknowledges that truth is not e-Rater’s strong point. ‘E-Rater is not designed to be a fact checker,’ said Paul Deane, a principal research scientist.” Sound like Wikipedia to anyone?
The machine doesn’t actually read an essay, of course; it uses a set of fairly complex algorithms to analyze a string of words. And E.T.S. says that those algorithms are able to assess complex thinking, even in the case of Perelman’s nonsense essay. It takes a lot of higher-order thinking and linguistic ability to game the algorithms, so any student capable of producing a high-scoring piece of mellifluous nonsense deserves the high score the machine gives him or her. And that’s where this gets interesting. There’s a lot to be said for automating the assessment process, if it can be done well. (For example, community colleges would love a fast and automated way to accurately determine which incoming students require remedial coursework.) But does being able to write a robograde-able essay serve as an appropriate proxy for any skill that we do want to test for? What are we testing for, again?
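To see why mellifluous nonsense can outscore plain sense, consider a deliberately crude sketch of a surface-feature scorer. This is not e-Rater’s actual algorithm (E.T.S. does not publish it); it is a toy that scores only length, vocabulary variety, and word length, with weights chosen purely for demonstration:

```python
# Toy essay scorer: a hypothetical illustration, NOT e-Rater.
# It rewards surface features (length, vocabulary variety, long words)
# and never looks at meaning or truth.
def toy_essay_score(essay: str) -> float:
    words = essay.lower().split()
    if not words:
        return 0.0
    n = len(words)
    variety = len(set(words)) / n              # type-token ratio
    avg_len = sum(len(w) for w in words) / n   # mean word length
    # Arbitrary demonstration weights, capped at a 6-point scale.
    return min(6.0, 0.01 * n + 3 * variety + 0.3 * avg_len)

coherent = "The cat sat on the mat. The cat was happy."
nonsense = ("Multifarious paradigms engender quintessential "
            "epistemologies notwithstanding perspicacious "
            "heterogeneous contingencies.")
print(toy_essay_score(nonsense) > toy_essay_score(coherent))  # True
```

The polysyllabic word salad wins: every word is different and long, so the toy metrics max out, while the coherent sentences score low for repeating short, ordinary words. Real systems are far more sophisticated, but the underlying point stands: features that correlate with good writing can be gamed without producing any.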
Last week’s “pineapplegate” shows that, at least at the K-12 level, we are indeed testing for what you might call the Wikipedia Answer—not the truth, but the consensus. Getting facts wrong—or, more importantly, disagreeing with your grader about the facts—shouldn’t harm your score. But incoherence is another issue altogether. One of the study’s authors suggests that “the most exciting potential of automated essay graders […] is not their ability to replace test scorers (or possibly teachers) with a cheaper machine, but their ability to expand upon that software to give students feedback and suggestions for revision.” Watson got a lot of Jeopardy questions right, but I’m not sure I want his advice on revising my essay—or, as Alex Reid suggests, if that’s the kind of writing we’re aiming for, why not just let the machines produce it for us in the first place?
Lee Bessette’s post on “Sustainable Grading” reflects on her own experience with sending students to a computer program for feedback on their writing. As she points out, asking adjuncts to handle ever-larger classes (and for ever-decreasing pay) is basically asking them to grade like machines—instead of engaging with a student’s ideas, which takes time, they will be looking to assess how closely an essay matches up to the model, and a machine can do that for you just fine. In The Chronicle, David Jaffee argues that even the phrase “study for exams” betrays the kinds of values that lead to machinable assessments.
Ben Wildavsky’s essay on how we use tech reflects that while letting tech drive pedagogy can be exciting, it also has the potential to result in things like sleeveless pineapples and robograders. Perhaps using tech merely to make current pedagogy easier to access isn’t such a bad idea after all.
There’s hope yet—teachers working to come up with questions for K-12 classes to use with the basal readers they already have are moving away from machine-like questions about basic vocabulary, in favor of tougher questions that ask students to read the texts more carefully.
Ben Yagoda’s recent post about the errors that even the best-intentioned and most attentive writers sometimes commit is a healthy reminder not only that we, too, sometimes use faulty algorithms, but more importantly that we know the difference. Presumably, this is the kind of attentiveness to detail and general knowledge that college is supposed to teach: but it’s clearly not the kind that our standardized tests seek, nor is it the kind that most media outlets strive for anymore. As an example, take the recent coverage of Starbucks’ decision to discontinue cochineal-based coloring, in which most reporters breezily assumed that “insect” meant something like “beetle.” Those stories would probably have made it past the robograders with flying colors.
This post was written by Odile Harter.