Texas Will Use Computers To Grade Written Answers On This Year's STAAR Tests
Keaton Peters reports via the Texas Tribune: Students sitting for their STAAR exams this week will be part of a new method of evaluating Texas schools: Their written answers on the state's standardized tests will be graded automatically by computers. The Texas Education Agency is rolling out an "automated scoring engine" for open-ended questions on the State of Texas Assessment of Academic Readiness for reading, writing, science and social studies. The technology, which uses natural language processing similar to that behind artificial intelligence chatbots such as GPT-4, will save the state agency about $15-20 million per year that it would otherwise have spent on hiring human scorers through a third-party contractor.
The change comes after the STAAR test, which measures students' understanding of state-mandated core curriculum, was redesigned in 2023. The test now includes fewer multiple choice questions and more open-ended questions -- known as constructed response items. After the redesign, there are six to seven times more constructed response items. "We wanted to keep as many constructed open ended responses as we can, but they take an incredible amount of time to score," said Jose Rios, director of student assessment at the Texas Education Agency. In 2023, Rios said TEA hired about 6,000 temporary scorers, but this year, it will need under 2,000.
To develop the scoring system, the TEA gathered 3,000 responses that went through two rounds of human scoring. From this field sample, the automated scoring engine learns the characteristics of responses, and it is programmed to assign the same scores a human would have given. This spring, as students complete their tests, the computer will first grade all the constructed responses. Then, a quarter of the responses will be rescored by humans. When the computer has "low confidence" in the score it assigned, those responses will be automatically reassigned to a human. The same thing will happen when the computer encounters a type of response that its programming does not recognize, such as one using lots of slang or words in a language other than English. "In addition to 'low confidence' scores and responses that do not fit in the computer's programming, a random sample of responses will also be automatically handed off to humans to check the computer's work," notes Peters. Although the technology is similar to ChatGPT, TEA officials have resisted the suggestion that the scoring engine is artificial intelligence. They note that the process doesn't "learn" from the responses and always defers to its original programming set up by the state.
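The hand-off rules described in the summary can be sketched roughly as follows. This is only an illustration of the routing logic as reported, not TEA's actual system: the names `EngineResult` and `route`, the 0-to-1 confidence scale, the `confidence_floor` value, and treating the "quarter of the responses" as a 25% random audit of the remaining responses are all assumptions made for the example.

```python
import random
from dataclasses import dataclass


@dataclass
class EngineResult:
    """Hypothetical output of the scoring engine for one response."""
    score: int          # score the engine assigned
    confidence: float   # assumed 0..1 confidence in that score
    recognized: bool    # False for e.g. heavy slang or non-English text


def route(result, rng, confidence_floor=0.8, audit_rate=0.25):
    """Decide whether a response keeps its machine score or goes to a human.

    confidence_floor and audit_rate are illustrative values, not published ones.
    """
    # Responses the engine does not recognize always go to a human.
    if not result.recognized:
        return "human:unrecognized"
    # Low-confidence scores are reassigned to a human.
    if result.confidence < confidence_floor:
        return "human:low_confidence"
    # A random sample of the rest is rescored to check the computer's work.
    if rng.random() < audit_rate:
        return "human:random_audit"
    return "auto"


if __name__ == "__main__":
    rng = random.Random(0)  # seeded so repeated runs route the same way
    batch = [
        EngineResult(score=2, confidence=0.95, recognized=True),
        EngineResult(score=1, confidence=0.40, recognized=True),
        EngineResult(score=0, confidence=0.90, recognized=False),
    ]
    for item in batch:
        print(item, "->", route(item, rng))
```

Passing the random generator in as a parameter keeps the audit sample reproducible, which would matter if anyone wanted to verify after the fact which responses were selected for human review.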
LOL (Score:4, Funny)
Be prepared for some ridiculous essays to be graded A+. Reference: https://arstechnica.com/scienc... [arstechnica.com]
Maybe not as bad as it looks? (Score:5, Informative)
My understanding is that the STAAR test merely suggests areas a student may need improvement in. If it's a rough guide to see who may need help rather than a direct tool to hold kids back a grade, maybe this is no big deal, especially if it's re-reviewed by a human reader IF a hold-back is initially recommended.
As soon as the test results have teeth, parents will then wish to boot the bots.
Re:Maybe not as bad as it looks? (Score:4, Interesting)
If it's a rough guide to see who may need help rather than a direct tool to hold kids back a grade,
Does anybody hold kids back a grade anymore? Even when my kids were young it just wasn't done.
The (possibly correct) theory is that it makes more sense to start providing additional support and individualized instruction, rather than repeating a year of instruction that didn't work the first time around, along with all the personal upheaval and age mismatch that goes with it.
Re: (Score:2)
In my state they try to use summer school for that purpose. It's a hell of an incentive also: make the grades or have your summer ruined. When all the other kids are frolicking at pools and lakes, they are doing fractions in a smelly mobile classroom.
Re: (Score:2)
In my state they try to use summer school for that purpose. It's a hell of an incentive also: make the grades or have your summer ruined. When all the other kids are frolicking at pools and lakes, they are doing fractions in a smelly mobile classroom.
That would be an incentive!
Re: (Score:2)
They vastly expanded their retention policy a couple years ago, basing it almost entirely on standardized testing. One standardized test, to be specific. And despite widespread public backlash, they're seeking to expand the practice.
Re: (Score:2)
Tennessee. [msn.com] They vastly expanded their retention policy a couple years ago, basing it almost entirely on standardized testing. One standardized test, to be specific. And despite widespread public backlash, they're seeking to expand the practice.
Oh, wow!
Lack of details is a bad precedent (Score:2)
Regardless of whether this particular case will cause harm, it is worrisome that so little information about the AI's effectiveness is being provided. If they were confident in the system they would be providing metrics on how often the AI matches the grades of human reviewers. They have millions of tests to verify the AI's competence with. When the AI provides a high confidence grade, does it match the grade of a human reviewer 99% of the time? 90% of the time? Then compare that with how many times the two
Re: (Score:2)
The STAAR test is mostly a feel good joke. Texas is pathologically averse to any kind of national standardized testing, since they tend to show Texas schools in an off-narrative and not very positive light [ontocollege.com].
In theory, kids who don't take, or who flunk, the STAAR tests on certain subjects cannot graduate from high school. I don't know how often this is really done; the high schools here seem to be doing everything possible to pass the trouble makers from grade to grade, or where possible, to juvie.
These Idiots Are Already Too Stupid. (Score:5, Interesting)
I understand the fact that the current population likes to make money and keep others from making it. Controlling education is part of that. Eventually the smart educated folks are going to die off and no one will be smart enough to take their place.
Remember when family business didn't mean mafia?
Re: (Score:2)
Re: (Score:2)
Remember when family business didn't mean mafia?
I 'member!
So does Pepperidge Farm.
GOAT (Score:2)
You are approached by a frenzied Texan scientist, who yells, "I'm going to put my quantum harmonizer in your photonic resonation chamber!" What's your response?
Re:GOAT (Score:4, Funny)
I'd assume he's a member of the clergy
Re: GOAT (Score:2)
Re:STAAR? (Score:5, Informative)
That's what Texas Republicans want to be true, but targeted voter suppression efforts [texastribune.org] indicate that it's not nearly as red as it appears.
Re: (Score:3)
News Alert! Idiots come in red and blue.
Re: (Score:2)
News Alert! Idiots come in red and blue.
That would make them purple idiots.
Re: (Score:3)
We have those too.
in the future... (Score:3)
So, uh, we imprison children for their youngest years and teach them to write, all so they can vomit some shit that will be graded by a computer anyway because we're afraid to just sort people by IQ in the first fucking place, and then forget most of it because we don't really have much use for literacy beyond obedience anyway.
Why not just, like, stop doing it entirely instead of cutting corners?
Bright but not progressing? (Score:3)
A kid may have a stellar IQ, but that's no guarantee that they will do well at school. Being bored, bullied or generally abused may all cause such a poor outcome.
Re: (Score:2)
because we're afraid to just sort people by IQ in the first fucking place,
Pretty much.
Well, it's more that the political and educational system likes to pretend outwardly that all kids have ~115+ IQ and that if only we increase funding some more and find the right curriculum then everyone can grow up to be a doctor or engineer. That's been like the past ~30? years of American educational philosophy (or maybe educational political marketing and posturing).
I think; DROP TABLE grade_results; (Score:3)
What could go wrong?
Re: (Score:3)
Further; Drop Database Starr;.
oh the irony (Score:2, Insightful)
Re: (Score:1, Informative)
That's because getting their favorite politicians into seats of power is MUCH more important than silly things like education.
Education is all woke and make believe with things like theories of evolution, archeology (dinosaurs are made up), civil rights, and global warming (climates change all the time. No proof that Texas oil is hurting anybody.)
Re: (Score:3, Insightful)
Get a degree in education and eliminate jobs in that field at the very same time.
That's as ridiculous as outlawing abortion while refusing to increase funds to foster or educate unwanted children. Then, prioritizing the orphans' right to own automatic weapons once they age out of Juvie.
Only Texas messes with Texas..
u/Texas (Score:2, Insightful)
We worry about too many people to feed. (Score:3)
Stop treating our next generations worse than pets; would you trust your pet to be fed by an AI?
What is the alternative? Trust teachers? (Score:3)
'The city with a population of 155,000 along the Connecticut river has a median household income half the state average; violent crime is common. Yet graduation rates at the city’s high schools are surging. Between 2007 and 2022 the share of pupils at the Springfield High School of Science and Technology who earned a diploma in four years jumped from 50% to 94%; at neighbouring Roger Putnam Vocational Technical Academy it nearly doubled to 96%.
'Alas, such gains are not showing up in other academic ind
Re: (Score:2)
Re: (Score:2)
Re: What is the alternative? Trust teachers? (Score:2)
Many colleges have banned the SAT. Unsure how they judge who is worthy to enter now. Maybe if you have the credit worthiness to be tied to a college loan for life?
A reverse Turing test (Score:4, Insightful)
They seem to believe that the grading algorithm will be reliable enough & are motivated by the prospect of it saving money.
There's another way to grade essays & other constructed response items that is also faster & more reliable than typical human grading; adaptive comparative judgement (See: https://en.wikipedia.org/wiki/... [wikipedia.org]). I'd say, if they're really interested in improving grading in Texas, rather than making headlines by jumping on the AI bandwagon, they'd run limited side-by-side pilot trials comparing all 3; typical rubric based grading, adaptive comparative judgement, & letting the LLM grade them unsupervised. Let's see how the comparative reliability & money-saving results turn out.
Re: (Score:2)
Great idea!
Texas: (Score:2)
where thinking goes to die.
This has been going on for many years (Score:1)
The specifics of the tech involved are one thing, but many states have done this for a while now. The contracts (and the RFPs before them) involved in standardized testing will show examples all over the country.
Despite training against a significant body of student responses, at least one AI-driven evaluation maybe seven years ago (personal experience here) would give high marks (for example, 4/4 on use of evidence, 4/4 on structure of essay, 3/3 on writing conventions -- usually there are multiple criteri
Lawsuits coming in .... (Score:2)
no time at all.
But this is Texas, where they abuse teachers to play their "Christian" white supremacist narrative, and teachers are leaving the state.