Flawed Algorithms Are Grading Millions of Students' Essays (vice.com) 90
Fooled by gibberish and highly susceptible to human bias, automated essay-scoring systems are being increasingly adopted, a Motherboard investigation has found. From a report: Every year, millions of students sit down for standardized tests that carry weighty consequences. National tests like the Graduate Record Examinations (GRE) serve as gatekeepers to higher education, while state assessments can determine everything from whether a student will graduate to federal funding for schools and teacher pay. Traditional paper-and-pencil tests have given way to computerized versions. And increasingly, the grading process -- even for written essays -- has also been turned over to algorithms. Natural language processing (NLP) artificial intelligence systems -- often called automated essay scoring engines -- are now either the primary or secondary grader on standardized tests in at least 21 states, according to a survey conducted by Motherboard. Three states didn't respond to the questions.
Of those 21 states, three said every essay is also graded by a human. But in the remaining 18 states, only a small percentage of students' essays -- it varies from 5 to 20 percent -- will be randomly selected for a human grader to double-check the machine's work. But research from psychometricians -- professionals who study testing -- and AI experts, as well as documents obtained by Motherboard, show that these tools are susceptible to a flaw that has repeatedly sprung up in the AI world: bias against certain demographic groups. And as a Motherboard experiment demonstrated, some of the systems can be fooled by nonsense essays with sophisticated vocabulary. Essay-scoring engines don't actually analyze the quality of writing. They're trained on sets of hundreds of example essays to recognize patterns that correlate with higher or lower human-assigned grades. They then predict what score a human would assign an essay, based on those patterns.
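To make the mechanism concrete, here is a minimal sketch of that pattern-matching approach, assuming scikit-learn, with toy features and stand-in data invented for illustration; no vendor's actual engine is this crude. It regresses human-assigned grades onto crude surface features of the text, then "grades" a new essay by predicting what a human would likely have scored it.

    # Hedged sketch: toy features and data, NOT any vendor's real model.
    from sklearn.linear_model import Ridge

    def surface_features(essay: str) -> list[float]:
        words = essay.split()
        avg_word_len = sum(len(w) for w in words) / max(len(words), 1)
        vocab_diversity = len({w.lower() for w in words}) / max(len(words), 1)
        return [len(words), avg_word_len, vocab_diversity]

    # Real systems train on hundreds of graded essays; two stand-ins here.
    train_essays = ["The cat sat on the mat.",
                    "Photosynthesis converts radiant energy into chemical potential."]
    human_scores = [1.0, 5.0]

    model = Ridge().fit([surface_features(e) for e in train_essays], human_scores)

    # The engine never judges quality; it predicts the human's likely score.
    predicted = model.predict([surface_features("A new student essay goes here.")])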
Re: (Score:1)
Searching database* for observed correlations with "First Post"...
Most common match: Frost Pis (Score: -1)
*because of course they have a big log of people's writing
Re: (Score:1)
Or maybe (Score:4, Funny)
The algos are correctly grading by established criteria. It doesn't know or care what demographic a writer is from. It's just judging the writing. Now being able to trick it with sophisticated nonsense should be a tipoff that it doesn't actually do its job well, but that's separate different from the main point.
Re: (Score:2)
*but that's separate from the main point.
Re:Or maybe (Score:5, Insightful)
No, they are not grading by established criteria. Algorithms are not sophisticated enough to do that.
Instead they do the following:
1) Grade a subset of papers by hand
2) Feed the algorithms those papers and grades
3) Let the algorithms evolve till it's grades match those of humans, allowing them to pick ANY criteria it wants to make the grading decision.
4) Do spot corrections.
The algorithms do not have to grade on criteria; the humans never have any idea why the algorithms generate particular scores. It is perfectly acceptable for them to base their entire grade on the names of the students involved (see the sketch below).
Your ignorance is remarkable in thinking that the algorithms use the same criteria that the humans use. That simply is not how ANY artificial intelligence works any more. Hasn't been for decades.
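A minimal sketch of steps 1-4 above (synthetic data and invented features, assuming scikit-learn): nothing in the fitting step constrains which signals the model uses, so a column derived from student names earns real predictive weight whenever it correlates with the hand-assigned grades.

    # Synthetic illustration only -- features and data are made up.
    import numpy as np
    from sklearn.ensemble import GradientBoostingRegressor

    rng = np.random.default_rng(0)
    n = 300
    word_count = rng.integers(100, 800, n)   # plausibly legitimate signal
    rare_words = rng.integers(0, 40, n)      # plausibly legitimate signal
    name_token = rng.integers(0, 2, n)       # spurious: derived from student names

    # 1) A subset of papers graded by hand (simulated with a hidden formula).
    human_grade = (0.005 * word_count + 0.05 * rare_words
                   + 0.5 * name_token + rng.normal(0, 0.3, n))

    # 2) Feed the algorithm those papers and grades; 3) fit until it matches.
    X = np.column_stack([word_count, rare_words, name_token])
    model = GradientBoostingRegressor().fit(X, human_grade)

    # The spurious name column earns real predictive weight.
    print(dict(zip(["word_count", "rare_words", "name_token"],
                   model.feature_importances_.round(2))))
    # 4) "Spot corrections": re-grade a random 5-20% sample by hand and compare.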
Re: (Score:2)
" Let the algorithms evolve till it's grades"
Look, humans can't even tell its from it's. Besides, you wanted "their".
Re: (Score:2)
And you didn't jump on "till". You must be a shitty algorithm.
Re: (Score:2)
Re: (Score:1)
So maybe the solution is an algorithm that can actually pass the IELTS?
Wait, those essays are graded by humans anyway.
Re:Or maybe (Score:5, Informative)
I know it's like beating a dead horse, but this is not artificial intelligence. It's algorithms making choices based on subsets of data. Clearly if you spell correctly and use big words the shitty algorithm rates you higher. Toss in a few keywords needed to relate to the matter at hand and "voila" you have artificial stupidity, that can be fooled. Who would have thunk?
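A toy version of that failure mode, with an invented word list and made-up weights (this is not Motherboard's actual experiment): a scorer that rewards length and big words hands its top grade to pure nonsense.

    # Toy demonstration: invented weights, purely illustrative.
    import random

    FANCY = ["egregious", "juxtaposition", "epistemological", "paradigmatic",
             "hegemony", "dialectical", "quintessential", "obfuscation"]

    def gibberish(n_words: int = 300) -> str:
        return " ".join(random.choice(FANCY) for _ in range(n_words))

    def toy_score(essay: str) -> float:
        words = essay.split()
        avg_len = sum(len(w) for w in words) / len(words)
        return min(6.0, 0.01 * len(words) + 0.3 * avg_len)  # length + big words

    print(toy_score(gibberish()))  # 6.0, the maximum grade, for fluent nonsense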
Re: (Score:1)
This is definitely beating a dead horse, or mayhaps a dead philosopher, but... would you care to prove that the real-world manifestations of your intelligence are not just "algorithms making choices based on subsets of data" running on a biological computer?
Re: (Score:2)
But it's on a computer, and it's doing the same thing as at least some human graders.
The computer is better than human graders (Score:3)
That is not to say the computer is any good. But neither are the humans that mark these tests.
In reviews by experts, the computers do a better job on average than humans.
And at least the computer is consistent.
No, the computer does not just look for spelling and keywords. The sort of grammar checking in MS Word gives you an idea of what they look for. Then some common phrasing.
And no, you do not just feed essays into an artificial neural network and magic happens. There is more to it than that.
Re: (Score:1)
Re: (Score:2)
The algos use whatever criteria they are taught from data sets. Assuming they mimic the human scores well and use a large enough data set, then they should grade the same as a human.
I'm not saying this is a good idea, because it prob isn't. But trying to attribute any sort of demographic slant when that doesn't matter in the first place is stupid. Now, if you want a human to grade certain papers on an easier set of criteria because of background demographics of the writers, then you are asking the auto g
Re: (Score:3)
I'm not saying this is a good idea, because it prob isn't.
The alternative is to use human graders for everything. That will increase the cost, make results inconsistent, and in the end, a tired and overworked human likely won't grade any better.
Maybe we should just get rid of the essay portion of standardized tests. Many schools ignore the essay portion of the SAT and ACT, because the essays encourage formulaic writing and aren't good predictors of future academic success.
Re: (Score:2)
Assuming they mimic the human scores well and use a large enough data set, then they should grade the same as a human.
No, this is NOT how AI systems work. The AI cannot directly evaluate the quality of the paper, it just makes a prediction based on all available evidence.
In the US, blacks do not perform nearly as well as whites and Asians academically. So the AI system will soon use any data that correlates to race as a part of its scoring system. If the AI is told the name of the student, a black name will result in a lower score, if only slightly. This is not a bug! It results in the AI system scoring more accurately
Re: (Score:2)
If the AI system becomes smarter, and is able to more directly identify the quality of the work
This makes it sound like an incremental improvement could achieve it, but actually they would have to start over using a whole different class of AI, like an Expert System. That would cost actual money; it would require programmers. Lots of programmers. And a whole lot of topic specialists.
Here they don't hire programmers, they just hire one specialist computer operator to set up the parameters. And they compare results to human test-graders, so they don't even need topic specialists. There is no way to add
Re: (Score:2)
The algos use whatever criteria they are taught from data sets.
They are "trained from" data sets, but they are not taught any criteria at all. That is not how modern "AI" works. At all.
If you can't comprehend the huge semantic difference between training and teaching then just shut up, please.
You probably just don't imagine how lazy and do-nothing an application of this sort of "AI" actually is.
It would never be expected to learn to sort new data in the same way that the humans had sorted the training data. It just doesn't work that way. It will only match the human re
Re: (Score:1)
The obsession with AI and political correctness is already demonstrating the problem with trainers. Looks wrong? Fix it. Looks right? Ignore it.
I've already seen a perfect example of this with PC. At
Re: (Score:3)
Re: (Score:2, Insightful)
it's called proper English. me and lots of others who weren't born here somehow had to learn it. should be no problem with US born people learning how to write in proper English
Re: (Score:2)
That's a guess, but the summary makes it quite dubious. If a large vocabulary in-and-of-itself is going to raise the rating, then "proper English" is not what's going on.
For that matter, most of these programs can't really parse a complex sentence or track the proper use of pronouns, much less detect whether paragraphs have been properly separated. ("The paragraph is the emotional unit of writing."--G. Stein.)
Re: (Score:2)
Re:Or maybe (Score:5, Informative)
It's a shame that you didn't succeed.
Re: (Score:2)
It's called American. Learn it, you fah-reigner. You live in U.S., you gonna speak USian like us.
Re: (Score:2)
What English did you learn - it's a pronoun dependent on a co-ordinating conjunction. Obviously.
Re: (Score:2)
Go back to school.
Re: (Score:2)
Considering that he is correct and your pedanticism was false, you must be hoping he'll go study to be your social worker?
Re:Or maybe (Score:4, Informative)
1. The first word of a sentence should start with an upper case letter when the preceding sentence ends with either a period ('.'), an exclamation point ('!'), an interrogation point ('?') or '...'
2. In a compound subject or object, the pronoun Me must be on the right.
3. In the sentence starting with "should be no problem", the verbal group is missing a subject
Re:Or maybe (Score:4, Informative)
Also, "me" is the wrong pronoun in this case. It should be "I".
Re: (Score:2)
You only got one out of three of those right.
Re: (Score:2)
If you guys can't agree, how do you expect an algorithm to solve it? Or a group of humans to do it fairly?
Re: (Score:3)
Re: (Score:3)
I 100% agree. But that doesn't mean there wasn't an inherent bias in the essay grading that was used to "train" the algorithm. Grading art is subjective. A sentence may be technically correct, but if the grader doesn't like the tone, style, or word choice they are likely to grade it lower. If you get enough of these lower grades together and use them to train an algorithm, the end result is a computer that doesn't like a particular writing style.
A person from the inner city of Chicago is going to have a di
Re: (Score:2)
proper English. me and lots of others who weren't born here somehow had to learn it
I see what you did there :-)
Re: (Score:2)
The problem is that the "established criteria" had a bias, so the automated testing picked up and amplified that bias. From the article: "The problem is that bias is another kind of pattern, and so these machine learning systems are also going to pick it up"
According to the article, the problem is that the algorithm criteria don't match the hoped-for criteria (the purported "established criteria"), which is why the human-graded scores differ. In particular, the algorithm and the hoped-for criteria differ in that the algorithm overemphasizes (i.e., is biased toward) "essay length and sophisticated word choice" and de-emphasizes (i.e., is biased against) "grammar, style, and organization".
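One hypothetical way to picture that mismatch (all numbers invented for the example): put the rubric's hoped-for weights next to the weights a trained scorer actually learned and look at the drift.

    # Invented numbers, illustrative only -- not measured from any real engine.
    hoped_for = {"grammar": 0.30, "organization": 0.30, "style": 0.20,
                 "length": 0.10, "vocab_sophistication": 0.10}
    learned   = {"grammar": 0.05, "organization": 0.04, "style": 0.06,
                 "length": 0.45, "vocab_sophistication": 0.40}

    for feature in hoped_for:
        drift = learned[feature] - hoped_for[feature]
        print(f"{feature:22s} hoped {hoped_for[feature]:.2f} "
              f"learned {learned[feature]:.2f} drift {drift:+.2f}")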
Re: (Score:3)
The problem is that the "established criteria" had a bias,
No, there is nothing in the article to say that. The problem exists even if there is no bias in the data fed to the AI.
The problem is simple: blacks do not write as well as others, on average. The AI does not consciously identify anyone by race, but it does learn that people using a certain way of writing are likely to be not as good.
This learning enables the AI to more accurately grade the papers, but it can mean that the best black kids will have their papers marked down. The AI has no hate, just col
Re: (Score:2)
Sure, the algorithm doesn't "care" about someone's ethnicity or religion or socioeconomic background, because it's not a person; however, that does not mean it necessarily gives unbiased results. Excessive sensitivity to irrelevant differences in word choice is a long-standing problem in psychometrics.
What you want to measure when grading an essay is how well the student used the information given to him, addressed the question posed, organized and expressed his thoughts. Since you can't tell any of this by w
Re: (Score:2)
Re: (Score:2)
Now being able to trick it with sophisticated nonsense should be a tipoff that it doesn't actually do its job well
The exams are designed to test qualification for higher education, where writing sophisticated nonsense is an important skill.
Re: (Score:2)
It's worth reading the actual article to understand the problem better.
Well (Score:1)
Re: (Score:2)
No. There are *restricted* domains where the standardized test is appropriate. E.g. it's quite appropriate for arithmetic.
But there sure is a big area outside of those domains.
Are /. editors even tech nerds? (Score:2, Insightful)
How many nonsense articles about "racist" computers are we going to have, along with summaries of stories that could simply be described in a few words? E.g., the CenturyLink 37-hour outage that affected 911 service was caused by a malformed broadcast storm.
Re: (Score:2)
Forever, someone always has a chip on their shoulder and will blame it on something. Race has always been a prime target.
I didn't get picked, it's because I'm [select color here]
Everyone thinks I pick on [select color here] because I am [select color here].
If it wasn't race it would be geographical location, or whether you like the xbox or the playstation.
People are full of shit (like the xbox) and don't like to acce
Re: (Score:1)
Then it's "bias against certain demographic groups" when people who can't do the needed work get detected?
Works great 90% of the time (Score:3)
But 90% is not good enough.
The algorithms learn by example, not by learning the actual rules of grammar. Which means that work that conforms to the usual, i.e. derivative work, is rewarded and everything non-standard is punished.
If you are in the top 1%, this system punishes creativity. Your work stands out from the standard work, which is ALWAYS punished by the modern algorithm methodology.
It also fails to work well for about 9% of the population who are above average intelligence but have substandard teachers. They end up teaching themselves, rather than being taught by their teachers. As such, their work stands out from the standard work and again, this is ALWAYS punished by the current system.
Works better than human markers (Score:2)
What makes you think the computer does not know rules of grammar? Ever used MS Word?
And what makes you think that tired, poorly paid, unsupervised human markers do a better job than the computer?
And where did you get 9% from? About 51% of the population are above average intelligence, and about 60% have substandard teachers, which gives about 30%.
Re: (Score:2)
What makes you think the computer does not know rules of grammar? Ever used MS Word?
The rules of grammar include knowing when and how to break or disregard them. Something MS Word is fucking terrible at.
A sad state of education (Score:4, Insightful)
Some of the most influential writing in human history has been full of grammar and spelling errors, yet such writing managed to record and convey significant ideas.
Removing an often biased grammar Nazi from the grading process is probably a step in the right direction for many students, but having no human feedback is the epitome of dehumanizing, as are standardized tests - except for the privileged few for whom the tests are designed.
This is why students should feel justified in using an 'essay app' to generate 'perfect', unique essays.
Re: (Score:2)
The result is that then the students produce stuff that is barely legible and claim "it's the idea that is important, not the form!". Well, if someone wants to convey an idea, then he should not put an extra burden of deciphering the text on me. If he does not bother, then neither will his audience.
Re: (Score:2)
I understand your sentiment, and I mostly agree. Education is important, and I'm not advocating that we give lazy people a pass. However, the burden of 'deciphering a text', especially if the content is novel or complex, or if it comes from someone with a very different world view, will always fall on you as meaning and implications are rarely clear or direct. Deciphering babel is not what I'm advocating.
Racism is often thinly veiled by standardized tests and a 'pass-fail' mentality. Taken to its ultim
Hack It! (Score:2)
someone needs to try sql injection / divide by zero (Score:2)
someone needs to try sql injection / divide by zero stuff in the tests.
When bias.. isn't (Score:2)
Suppose I make a lot of spelling mistakes because English isn't my first language.
Suppose I don't spell well because my education was subpar and I was more concerned about being shot in school than learning proper spelling and grammar.
There are MANY correlating factors between poverty and certain minority groups that correlate A LOT more heavily with poverty or nation of origin than they do with skin color ( Colour, if you learned to spell in the UK or from a British English teacher ).
Objectively it is sp
Re: When bias.. isn't (Score:1)
Assuming your claims are correct, the repressed individual can alleviate their poor education by going back to school; there are many government supported routes to achieve this goal.
Re: (Score:1)
Is that person entering further "English" education where the level of "English" is going to have to be better and better every year?
That's a fault with the education system?
Re "learning proper spelling and grammar" that's going to be needed soon for the "English" questions, education, essays.
Re "poverty or nation" That nation can set up their own education system, tests and can pass/fail any % of the population.
Should US education results take i
Egregious old news (Score:2)
It's old news that these algorithms somehow like the word "egregious", and perhaps well known enough, yet it remains a favorite despite all the warning signs.
Perhaps it's yet another example of egregious worship of anything theoretical, without practical testing first. Even the simplest trick, using an algorithm designed to fool the algorithm, would have picked up on the egregious mistake, and demonstrated consistent gibberish that gets a perfect grade.
Then again, it would probably penalize fo
Another good reason not to let the profit motive (Score:3)
"bias against certain demographic groups" (Score:2, Insightful)
Ah it's the stupid old Google Algorithmic Unfairness thing yet again. Unless the AI system gets input about the demographic identity group(s) of the essay authors, this claim physically cannot be true. It cannot be biased against something it does not know.
Re: (Score:2)
Re: (Score:2)
Then the AI system is "biased" against identifiable features which a member of ANY demographic may choose to produce.
Re: (Score:2)
Sure, if they wish. But, if the algorithm is biased against a set of identifiable features that only a certain demographic regularly chooses to produce, and those features in and of themselves are neutral as far as their effect on the essay but the presence of those features causes lower grades to be assigned, then the AI is in practice biased against that demographic group.
Or, in other words
Re: (Score:2)
> identifiable features only a certain demographic regularly chooses to produce
Whoa, what do you have in mind? Oxford commas? Jive?
If you can't identify a high-correlation signal by hand, then it does not matter.
Re: (Score:1)
Was that due to certain demographic in the words? That the essay had been detected before?
The words, slang, jargon did not have anything to do with the set essay question on the day?
Re: (Score:2)
Re: (Score:2)
You can only make this argument AHEAD OF TIME. You don't get to pick your data, train the AI, peek at the output of the trained AI, and then say, golly jeepers, I don't like the political implication of the results, so let's do-over. At that point, YOU are introducing an overt, explicit bias.
I mean you can, but that's politics, not science. If you want to do science, you pick your protocol ahead of time, and let it lead you where it may.
Re: (Score:1)
Re: (Score:2)
After the fact, it's not a "bias", it's a "result".
Re: (Score:1)
Then the smart and correct "city" students get to pass and go further in education.
Did "goo" and "jar" have nothing to do with the topic? Not a part of the terms used all year? Then less students from the "state" will pass.
They did not do the work needed and did not study/use terms like "foo" and "bar".
The "city" students on average could all use the words as expected and could show their working, thinking, use of the terms.
Students Should Submit 2 Essays (Score:2)
In a transitional period...
The first essay would be carefully written by hand for a human teacher to review and score.
And the second essay would be automatically generated by a collective/shared "A.I." system for the automated essay-scoring systems to review and score.
The marks assigned to students would be based on the first essay. Meanwhile the collective essay-generating A.I. would keep improving until it gets a perfect score from the automated scoring systems.
Re: (Score:2)
Remember that Amazon makes billions and does not pay corporate tax.
So we should make Jeff Bezos grade all the papers? That seems fair.
Oh shut up (Score:3)
"these tools are susceptible to a flaw that has repeatedly sprung up in the AI world: bias against certain demographic groups"
Really? What I think you mean to say is that it shows bias against certain methods and forms of expression that aren't generally accepted as standard English, regardless of the ethnicity of the person using them. That is not a "demographic group", is it?
Because an "AI" (let's also remember that these are NO SUCH THING) cannot discern the skin color of the person writing an essay without, what, magic?
Re: (Score:2)
It could show bias against forms of expression that are technically correct English but which are more commonly used in some subcultures than are others. Depending on how the training is done, if students from the subculture on average score poorly on writing tests, those poor scores could be incorrectly correlated with a particular style of writing.
It's the same problem as an AI hiring system that through statistics has found that programmers at a company are mostly white males, so it ranks white male applicants higher.
Re: (Score:2)
"Its the same problem as an AI hiring system that through statistics has found that programmers at a company are mostly white males, so it ranks white male applicants higher."
So the error here is that the AI isn't "woke" in understanding somehow that vaginas and melanin are somehow intrinsically valuable for coding?
How very soulless of it to simply do things like look at actual performance, etc. Clearly, it needs to be "fixed".
Personally, I'd LOVE to see actual performance metrics from departments staffed
Re: (Score:2)
Nothing to do with "woke". It's about AI not really being intelligent, but just looking for patterns, e.g. correlations. Then there's the basic problem of correlation not implying causation. In 1950 there was a strong correlation between being an astrophysicist and being male. Today that is much less true. But an AI trained on 1950s data might easily conclude that women were less likely to be astrophysicists, and therefore rank being female as a negative for hires.
It depends in detail on exactly how the AI is tr
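A sketch of that 1950s trap (synthetic data, assuming scikit-learn; purely illustrative): when the historical hire/no-hire label tracks gender, a pattern-matcher learns to weight gender itself.

    # Synthetic data only -- illustrates correlation leaking into the model.
    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(1)
    n = 1000
    is_male = rng.integers(0, 2, n)
    skill = rng.normal(0, 1, n)
    # Historical labels: 1950s hiring tracked gender as strongly as skill.
    hired_1950s = (skill + 2.0 * is_male + rng.normal(0, 0.5, n)) > 1.5

    X = np.column_stack([is_male, skill])
    model = LogisticRegression().fit(X, hired_1950s)
    print(model.coef_)  # big positive weight on is_male: being female ranks lower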
Re: (Score:1)
Who totally lack the skills needed for more education....
People are full of crap too (Score:2)
I've said for decades that you can create a technical presentation that misses a key component preventing people from duplicating your results and nobody will notice. They often include phrases like "We implemented the Navier-Stokes equations..." or "We employ a 20-state Extended Kalman Filter...". This is the fancy-pants textbook version of "The solution is left as an exercise to the reader."
Re: (Score:2)
I've said for decades that you can create a technical presentation that misses a key component preventing people from duplicating your results and nobody will notice.
That's because the venue is the incorrect one for imparting sufficient detail to allow others to duplicate what you are doing. A technical presentation is intended to report your findings or results, not every step necessary to duplicate them. Nobody will notice that they can't duplicate what you've just presented because they didn't come to the talk to find out how to duplicate what you've done.
If you want to duplicate someone's results, you talk to him after the presentation and work it out.
Re: (Score:2)
Even the paper they are presenting is missing key steps. I'm sure that's by design because they're hoping to get paid a lot of money to tell you the recipe for the secret sauce.
Works as designed (Score:2)
And as a Motherboard experiment demonstrated, some of the systems can be fooled by nonsense essays with sophisticated vocabulary.
That's not a flaw in the algorithm.
The people who write such essays most likely have promising futures in management consulting, and they definitely should be admitted so that they can work towards their MBAs.
Find it on git hub (Score:2)
Bias against certain demographic groups (Score:1)
People who can't study?
People who lack the skills to study?
People who got non academic considerations until they actually had to show their own work? They could not write on the topic and got found out?
Work under a person's name that was related to other people's work? Do your own work?
Buying work and finding out a lot of other people had used the same work? Getting shared work detected?
Using words and terms that had nothing to do with the topic?
Using slang, words, terms
I get that grading essays is time intensive (Score:1)