The Curious Case of Increasing Misspelling Rates On Wikipedia
An anonymous reader writes "The crowd-sourced nature of Wikipedia might imply that its content should be more 'correct' than other sources. As the saying goes, the more eyes the better. One particular student who was curious about this conducted rudimentary text mining on a sampling of the Wikipedia corpus to discover how misspelling rates on Wikipedia change through time. The results appear to indicate an increasing rate of misspellings through time. The author proposes that this consistent increase is the result of Wikipedia contributors using more complex language, which the test is unable to cope with. How do the results of this test compare to your own observations on the detail accuracy of massively crowd-sourced applications?"
Spellink chekers. Duh! (Score:5, Insightful)
Every web browser as auto spell-check capabilities these days. Most of them correct as you type.
So why should there be any misspellings on something that is managed strictly from a web interface?
Is it part of the arrogance of those electing themselves to write and editing articles on wiki that they refuse to use a spell checker, or
is it that the words are simply unknown to the normal spell-check dictionaries?
I find occasional misspellings in mainstream news articles as well (and I am by no means a natural born speller).
But most maddening to me is the "they're their there" errors, and similar wrong word usage.
Spell checkers offer little help in catching these, but a 6th grade education usually suffices.
Maybe the same people who wont waist there time checking they're spelling also cant be bothered to use the write word. ;-)
Re: (Score:3, Funny)
What's really curious (Score:3)
Re:What's really curious (Score:5, Insightful)
It might also be that there are specialist words being used on Wikipedia that aren't in the dictionary.. unless this test is explicitly looking for common misspellings..
Re:Spellink chekers. Duh! (Score:5, Funny)
Re: (Score:2)
They're They're your probably just making a mountain out of a mole hill, but it's a mute point.
But really. I'm going to side with complex language: there are numerous technical words that aren't in any of my dictionaries, especially when you get into Latin-based names and medical terms.
Re: (Score:2)
While the increased usage of words that are not in the browser dictionaries is part of it, it is not just technical words. "Travois" (the ancient sleds pulled by horses or dogs) is a valid English word that my Ubuntu Firefox spell checker flagged this morning: it wanted me to use "travis" instead. We are running into an increasing number of such inappropriate substitutions by spell checkers as the range of breadth of Wikipedia expands, especially as it covers sports, hobbies, and other informal group activi
Re: (Score:3, Insightful)
The rule on places like Slashdot and other Internet forums is that so long as the text can be understood, variations in spelling and grammar are acceptable, should not be corrected, and usually should not even be mentioned.
Are you new here? :)
Re: (Score:2)
Whoosh!
Re: (Score:2)
LMFTFY:
Whushe!
Re: (Score:2)
Like the one around a castle, right?
Re: (Score:2)
"Moot" and "moat" aren't even homophones, damnit!
Re:Spellink chekers. Duh! (Score:5, Funny)
"Moot" and "moat" aren't even homophones, damnit!
"Not that there's anything wrong with that."
Re: (Score:2)
Re: (Score:3)
Every web browser as auto spell-check capabilities these days. Most of them correct as you type. So why should there be any misspellings on something that is managed strictly from a web interface?
Is it part of the arrogance of those electing themselves to write and editing articles on wiki that they refuse to use a spell checker, or is it that the words are simply unknown to the normal spell-check dictionaries?
I find occasional misspellings in mainstream news articles as well (and I am by no means a natural born speller).
But most maddening to me is the "they're their there" errors, and similar wrong word usage. Spell checkers offer little help in catching these, but a 6th grade education usually suffices.
Maybe the same people who wont waist there time checking they're spelling also cant be bothered to use the write word. ;-)
HAI! U R A Cleaver 1! BAI!
Muphry's Law (Score:5, Informative)
icebike is a victim of Muphry's Law [wikipedia.org].
Re: (Score:2)
Most of that had to be deliberate. I think that every single homophone in that entire last sentence was wrong. The "has" was an amusing typo. The "edit" versus "editing" thing was actually not wrong, though it was awkward. Reread it as "electing... or editing" instead.
Re: (Score:2)
Typo and tense change were unintentional. I'll cop to the Muphry's Law violation. It bites me all the time. But hey, this is Slashdot.
The last sentence, on the other hand, totally intentional, and watching all the whoosh posts has been fun.
Re: (Score:2)
The problem is that spell-checkers aren't grammar-checkers. There hasn't been a decent grammar checker since the days of Grammatik. The ones out there now are, frankly, pathetic.
Re: (Score:2)
Re:Spellink chekers. Duh! (Score:5, Interesting)
I have seen that articles on Wikipedia that stick around for any reasonable length of time (about six months to a year being typical) usually attract grammar nazis (or people who are annoyed by bad grammar in general) who do a copy edit and try to make the article read better. Longer articles tend to attract more people than stubs, particularly if they are well linked to other articles. The subject matter doesn't seem to make a difference, and there are a few bots on Wikipedia which try to scan articles for spelling errors and other minor issues.
The issue of British vs. American spellings was resolved long ago, and for the most part consistency is more the rule than anything else. Sometimes I've seen protracted edit wars over grammar usage between several editors, but even that tends to be rather harmless.
My point here is that the proofreading does happen, it just happens on a slower time scale and is something that usually only shows up for more mature articles, mature as in more well developed articles that seem to be trying to say something. Articles that are in a constant flux of revision will be less likely to see this kind of activity, or more accurately will tend to see such efforts wasted as the article content changes. Still, if you can get an article to "B quality" status or better, the grammar and quality of the article in terms of spelling and other aspects will be reviewed by at least somebody over time.
Re: (Score:2)
Re:Spellink chekers. Duh! (Score:4, Funny)
You missed:
Maybe the same people who wont waist there time checking they're spelling also cant be bothered to use the write word. ;-)
older IE's do not have spell check. (Score:2)
older IE's do not have spell check.
Re: (Score:2)
Maybe regional differences are being reported as spelling errors. Desktop systems at my work use the French dictionary by default. Not much use to me.
Re: (Score:2)
Every web browser as auto spell-check capabilities these days. So why should there be any misspellings on something that is managed strictly from a web interface?
Talent?
Re: (Score:3)
Is it part of the arrogance of those electing themselves to write and editing articles on wiki that they refuse to use a spell checker, or is it that the words are simply unknown to the normal spell-check dictionaries?
Maybe it's that aggressive as-you-type spell checkers seem to introduce more errors than they catch. Seriously. I've never seen one that doesn't try to replace rare but valid words with more common words that look vaguely similar (often just similar enough to be missed in proofreading) but have completely unrelated meanings. In general, as-you-type "correction" is an insult to anyone writing above a third-grade level.
Re: (Score:2)
In general, as-you-type "correction" is an insult to anyone writing above a third-grade level.
So.. I guess it's hear to stay then.
here to stay, dammit.
Re: (Score:3)
No, it's because spellcheckers are often WRONG.
They don't like foreign words, they don't like unusual words, they don't like domain-specific words; they don't like any words they haven't been programmed for.
Lately, when I write, I have to fight the spell-correction to make things properly correct more than it corrects me.
Re: (Score:2)
Hmm. Firefox's dictionary needs some work. It has an addiction to adding hyphens to common non-hyphenated words, and has some serious difficulties with pluralized nouns.
Re: (Score:2)
The performance difference is non-zero.
Re: (Score:2)
Not that anyone else does this, but one of the first changes I make in my browser settings is deactivating automatic spell checking. Call it a holdover from the days where leaving it on meant the top rate of entering text would be about one or two characters per second.
The performance difference is non-zero.
Well, true, it is non-zero, but nobody else runs an 8088 processor anymore either.
If this still bothers you, remember that there will be great deals in the After Christmas sales from most computer vendors.
You might be able to step up to a 486 or something.
Re: (Score:3)
If Joe SixPack (or for UK, "Joe Pint" (or equivalent slang)) doesn't know the difference, well, so what. When I see professional authors not knowing the difference, I am... disappointed, I guess.
And crap, I'm an engineer, I'm not even supposed to know how to spell.
Re: (Score:2)
Re:Spellink chekers. Duh! (Score:4, Insightful)
Is it part of the arrogance of those electing themselves to write and editing articles on wiki that they refuse to use a spell checker, or is it that the words are simply unknown to the normal spell-check dictionaries?
You might know the answer to this if you had read the linked article instead of immediately jumping in to editorialize (and no, I'm not new here).
While there are a number of serious methodological concerns I've discussed in another post [slashdot.org], the author's Table 4 ought to raise a screaming red flag. The algorithm the author used flagged about 5% of articles as having more than 25% of their words misspelled--and the author didn't discuss any sort of manual follow-up on those articles to determine where the problem lay. I'm sorry, but misspelling one word in four just isn't a plausible result.
I suspect that the parser is failing to properly handle tables of data, scientific terminology, some unusual formatting and template markup, and foreign words. All of these categories will have been expanded greatly since Wikipedia's early days, and their presence is a sign that the encyclopedia is increasing in quality and coverage, not being degraded.
Re:Spellink chekers. Duh! (Score:5, Insightful)
No, it's our language when it comes to international communication. We don't own the varieties spoken in Australia, Guyana, India and whatever other regions use English, but if you want to be understood you really ought to be sticking fairly close to either British English or American English.
Re: (Score:3)
fwiw, written Australian English really isn't any different to British English.
Re: (Score:2)
I wish the Canadians would make their mind up. Either American or British English - but not a screwy mix of the two. And as for date formats, don't get me started!
Re:Spellink chekers. Duh! (Score:5, Funny)
I wish the Canadians would make their mind up. Either American or British English
...or French.
Re: (Score:3)
Tabernac, eh? Get me a double double and tell me all about it. I'm sure we can get the Canadians to apologize for that.
Re: (Score:2)
As far as I can tell, we have made our mind up: we're taught British English. Just what the hell are you talking about, eh?
ISO 8601 (Score:2)
Is it so hard to remember smallest to biggest?
Yes, because it's 2011-12-23 where I am, not 32-21-1102. Days are smaller than tens of days, right? I prefer ISO 8601 because biggest to smallest is "lexicographically monotonic" on any date in the common era, meaning that sorting a set of strings representing any dates since Jesus M. Christ was potty trained gives the same result whether one treats them as dates or as generic strings.
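For what it's worth, the "lexicographically monotonic" claim is easy to check. Here is a minimal Python sketch, assuming a handful of made-up sample dates: it sorts the same dates as ISO 8601 strings and as day-first strings, then compares each result with a true chronological sort.

```python
from datetime import date

# Hypothetical sample dates, deliberately out of order.
dates = [date(2011, 12, 23), date(1999, 1, 31), date(2011, 2, 3), date(2001, 1, 15)]

iso = [d.isoformat() for d in dates]            # e.g. "2011-12-23"
dmy = [d.strftime("%d/%m/%Y") for d in dates]   # e.g. "23/12/2011"

# Sorting ISO 8601 strings as plain text matches sorting the real dates.
iso_sorted_matches = sorted(iso) == [d.isoformat() for d in sorted(dates)]

# Sorting day-first strings as plain text does not.
dmy_sorted_matches = sorted(dmy) == [d.strftime("%d/%m/%Y") for d in sorted(dates)]

print(iso_sorted_matches, dmy_sorted_matches)
```

The ISO strings sort correctly because the most significant field comes first and every field is zero-padded to a fixed width.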
Re:Spellink chekers. Duh! (Score:5, Informative)
But written Australian English is different from North American English.
In N.A. things are similar TO each other or they are different FROM each other.
We would no more say Different TO than we would say Similar FROM. Just seems wrong to our ears.
Re: (Score:2)
Its not that big of a deal.
Re: (Score:2)
Re: (Score:3)
The most interesting case I've seen of subtle differences in prepositions within the English language was when describing what one does when embarking or disembarking. In America, it's not uncommon to say either, "He got off the bus," or, "He got off of the bus," but the latter sounds as odd to someone from the UK as, "He got on of the bus," would sound to someone in America (or so I've been told). This fact came up in one of my graduate research seminars when we were studying a paper entitled Hey, You, Get [ucsd.edu]
Re: (Score:3, Informative)
Off of
On to
"On of" makes no sense, which is why it sounds wrong : because it is wrong.
"On to" (or onto) sounds fine. Because it is perfectly correct.
Your confusion is caused by your assumption that the same preposition structure would be used in dissimilar situations.
I have no clue what the technical name is for the OF following OFF. But what ever it is, it must match. Omitting it seems fine in either case, but if used it must be correct.
Re:Spellink chekers. Duh! (Score:4, Insightful)
The bigger problem is the differences in short date formats. dd/mm/yyyy vs mm/dd/yyyy can easily generate significant errors in calculation. Anyone who's integrated more than one Microsoft product together in both countries will have encountered the challenge.
Personally, I think our (AU) reverse polish date notation is ridiculous, but at least it's not inside out notation (US).
Can we just settle on yyyy/mm/dd and be done with it? Please?
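A small Python sketch of how the two short formats disagree, assuming a made-up input string that arrives with no format attached. Both parses succeed, so nothing warns you that they differ.

```python
from datetime import datetime

raw = "03/04/2011"  # hypothetical input with no format attached

as_day_first = datetime.strptime(raw, "%d/%m/%Y").date()    # 3 April 2011
as_month_first = datetime.strptime(raw, "%m/%d/%Y").date()  # 4 March 2011

# Number of days between the two readings of the same string.
gap_days = abs((as_day_first - as_month_first).days)
print(as_day_first.isoformat(), as_month_first.isoformat(), gap_days)
```

Any date whose day is 12 or lower is ambiguous in this way, which is why the error tends to show up only some of the time.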
Re: (Score:3)
Re: (Score:2)
Re: (Score:3)
It's a political statement. Omitting the u in labour is intended to show that they are just like the US versions of such groups. They represent labor (working hard), but they don't include "u".
Re:Spellink chekers. Duh! (Score:4, Funny)
Re: (Score:2)
Re: (Score:3)
Memo to AC:
English isn't *yours* either, it's now *everybody's*, except when they misspell two words in one sentence in an article about misspellings.
Re: (Score:3)
smilie = kidding.
Whoosh = you.
Re: (Score:2)
Whoosh, I mean.
Re: (Score:3)
Grammar checking for formal languages like computer programming languages is trivial compared to natural language processing. Another problem is that the grammar checker often straitjackets you into forming sentences in a fashion that pulls the feeling out of whatever it is that you are expressing.
Yes, there might be a role for an automated grammar checker, but like spell checkers they have a narrow application of usage. They are also not nearly as easy as you are suggesting in terms of how to write
Many of the smart people have been driven away? (Score:5, Insightful)
Whether it's open source software or online collaborative projects, the smart people always get driven away over the long term. Smarter people are usually more interested in creating high-quality content, whereas stupider people end up putting out crap purely for political reasons. Eventually these stupider people start trying to modify the work of the smarter people, but do a poor job at it. When they're called out on their shitty work by the smart people, the fools make a huge stink. This soon devolves into a political mess where the smarter contributor is severely inhibited from contributing by the constant moaning and bitching of the idiots. Not wanting to waste time with such shenanigans, the smarter person leaves for some other endeavor. After a while, many of the smarter people are driven away, and the end result is that the stupider people make up the bulk of the project's contributions.
We've seen this happen with many open source software projects, and I don't think that other kinds of online collaborative projects are any different.
Re: (Score:3)
Re:Many of the smart people have been driven away? (Score:5, Interesting)
I can't say I've seen that on all articles on Wikipedia, but certainly I have seen it on some. I've seen articles dumbed down to suit the majority of the readers, rather than split and refined to allow the majority a summary and those wanting more information access to that. This certainly discourages those who are subject matter experts - what's the point in being an expert in something if all that's wanted is pub quiz grade?
However, I emphasize that this is NOT what I've seen for the majority of articles. Some articles have been abandoned (occasionally in mid-edit, from the looks of it), some are constantly being updated with updates in conflict with each other, yet others are updated and are of extraordinarily high quality. It runs the full gamut.
I would far prefer a layered approach, so that you could get access to whatever level of detail you wanted, but the contributors just aren't there to get that. It's a pity, and the net result is uneven quality, but Wikipedia is a case where it's better to have an imperfect something than a perfect nothing.
The bad drives out the good (Score:5, Insightful)
I can offer my own opinion of this phenomenon: the bad is driving out the good. Fewer competent writers are bothering to edit Wikipedia articles nowadays. Not only do contributions get reverted / deleted by editors who think they "own" the article, but good writers simply get tired of fixing the semi-literate ramblings of people who cannot write a coherent sentence.
It's the old axiom that incompetent people cannot recognize their own incompetence, and so do not realize that their "contributions" are not improving the article, but instead are making it worse. Eventually the good contributors get tired of sweeping back the ocean with a broom, and just walk away from Wikipedia.
Re:The bad drives out the good (Score:5, Insightful)
Totally agree! I spent the best part of *three years* working on a relatively obscure corner of WP's biology department involving some 500 articles and over 20,000 edits before finally throwing in the towel. I learned a lot during my time there, but eventually the idea of putting more effort into it just didn't make any more sense. One of their main problems is that the only thing preventing good articles from deteriorating is constant policing by knowledgeable editors -- and preferably by the people who are responsible for all the important contributions. I like to think that my contributions to WP have not been a complete waste, but if enough time goes by before anyone fills my shoes, I fear they will be. After all, what good is an article that's now only 99% accurate? 98%, 97%, 96%...
Re: (Score:3)
Someone pro-mutilation comes in and edits the article how he wants it, to feel better about his situation, and reverts any edit made (edits that include citations) by opposing views. When someone decides to take it through the proper channels to expose the "owner" of the a
Re: (Score:3)
I don't really know anything about that article, and can believe it happens, but reading your comment I would guess that you're part of the partisan-editing problem. ;-)
Re: (Score:2)
someone pro-mutilation comes in and edits the article how he wants it, to feel better about his situation,
Not hard to guess which side you edit war for.
Worth Posting. (Score:5, Insightful)
Um... (Score:5, Funny)
The crowd-sourced nature of Wikipedia might imply that its content should be more 'correct' than other sources.
[citation needed]
Re: (Score:2)
right here [slashdot.org]
It's worse in the grammar department (Score:2)
Here is a typical example:
Person A and B are on the ground floor of some building.
Person A would like person B to have some parcel delivered to the 7th floor of the building.
Here's how person A delivers the request:
"Buddy, please bring this parcel up to the seventh floor, thanks".
I posit that this grammar is wrong. He should say:
"Buddy, please take this parcel to the seventh floor, thanks", because they are in the same area and buddy B, by doing the needful, will be leaving that place.
Worse still, you even
Re: (Score:2)
"please bring this parcel up to the seventh floor, thanks".
Except that in some forms of English, this is perfectly correct (Hiberno-English, for example).
"datum" and "medium"
That battle was lost many decades ago, if it even got fought.
Re: (Score:2)
People who say "the data is" are treating "data" as a synonym for "information". They are usually referring to more than one piece of information, so "datum" would not be correct.
Re: (Score:3)
I blame Star Trek for that one, though.
Re: (Score:2)
Since English itself is a dynamic language, you can argue both are correct. Remember that the grammatical rules between what's spoken and what's written are usually 110-150 years apart.
Re: (Score:2)
At least he said please.
Eye don't no (Score:2, Interesting)
Eye don't no how ewe can automate proof reading. You still knead a human in the loupe.
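To make the joke concrete, here is a minimal Python sketch of a plain dictionary lookup, assuming a tiny stand-in word list in place of a real dictionary. Every word in the comment above is a valid English word, so nothing gets flagged.

```python
import re

# Tiny stand-in word list; a real checker would load a full dictionary.
KNOWN_WORDS = {"eye", "i", "don't", "no", "know", "how", "ewe", "you", "can",
               "automate", "proof", "reading", "still", "knead", "need", "a",
               "human", "in", "the", "loupe", "loop"}

def unknown_words(text):
    """Return the words in text that are missing from KNOWN_WORDS."""
    words = re.findall(r"[a-z']+", text.lower())
    return [w for w in words if w not in KNOWN_WORDS]

joke = "Eye don't no how ewe can automate proof reading. You still knead a human in the loupe."
flagged = unknown_words(joke)
print(flagged)
```

A word-by-word lookup has no notion of context, so catching these would take something that looks at neighboring words, not just a bigger dictionary.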
Stop whining and fix-em (Score:2)
Unless the article is locked, just fix the spelling errors yourself instead of whining about them getting worse.
Re: (Score:2)
The ever growing number of articles... (Score:4, Insightful)
Literacy (Score:2)
I would put some of it on age (Score:2)
I think some of the issue here is that a new generation is showing up with poor literacy skills. The primary schools are under pressure from government-mandated competency requirements, budget cuts, and various other issues, and have cut back on some of the basic skills that were once taught.
I work at a tutoring center / assistance center at a college, and it is depressing how weak the basic literacy skills of students coming out of high school are. Writing skills are nonexistent, where some of them do
Lol (Score:5, Funny)
It's sad. Through all this web content, I am slowly unlearning how to spell or use proper grammar.
English teachers / professors (with a few exceptions) used to be my arch-enemies (as a math / science person) and wished them all a pleasant, if sudden, death for their batshit-insane insistence on making mountains out of molehills (i before e, except after c; can't end a sentence with a preposition; this {subject}) with regards to the language, and yet lately I finding myself wishing there were more of them.
It's not fair: I've nursed some of those grudges for years!
Higher Availability (Score:2)
Britannica is one factor (Score:2)
This is an artifact to his experiment (Score:2)
The increase in the percentage of spelling errors is an artifact of his experimental procedure. He randomly takes a Wikipedia article instead of analyzing the most popular ones. As Wikipedia has become larger, it has attracted more fringe topics, probably from authors in countries where English is not the first language. Wikipedia now probably has more articles that aren’t viewed and revised as much. Thus, random sampling now has a higher chance of selecting such articles and
Re: (Score:2)
It's reasonable to hit all the pages at random, but the results should be weighted by the popularity of the page. That would be a better measure of how good crowd-sourcing is at solving these types of problems. Including the low-popularity pages in the results is important to this because it covers another issue with crowd-sourcing--it takes at least three to make a crowd. If wikipedia's breadth gets too large, the crowd sourcing method breaks down since there's nobody looking at most pages. It brings t
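A rough Python sketch of the weighting idea, with entirely made-up page views and misspelling rates (none of these numbers come from the study). It compares a plain average over sampled pages with an average weighted by how often each page is viewed.

```python
# Hypothetical (page_views, misspelling_rate) pairs for sampled pages.
pages = [
    (90_000, 0.01),  # popular, heavily reviewed
    (9_000, 0.02),
    (900, 0.10),     # obscure, rarely edited
    (100, 0.30),
]

# Plain average: every sampled page counts the same.
unweighted = sum(rate for _, rate in pages) / len(pages)

# Weighted average: each page counts in proportion to its views.
total_views = sum(views for views, _ in pages)
weighted = sum(views * rate for views, rate in pages) / total_views

print(round(unweighted, 4), round(weighted, 4))
```

With numbers like these the unweighted figure is dominated by the obscure pages, while the weighted one reflects what a typical reader actually sees.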
Grammar editors like me got scared off Wikipedia (Score:4, Informative)
Re: (Score:2)
Re: (Score:3)
No one ever, ever, cites a diff when they are bitching about Wikipedia on Slashdot.
More != Better (Score:2)
That's a myth. If those eyes aren't attached to competent people, having more of them will do no good.
Spell however you want (Score:2)
Lazy writers and poor writing skills (Score:3, Informative)
This may sound like a "get off my lawn" type of post, but from what I've seen, the writing ability of younger people has severely declined. And it's not even that big a difference in age that I'm talking about here; I'm talking about people less than 10 years younger than me. I "abuse" the language a fair amount myself, but I'm talking about seeing people who think column has a b in it and despair doesn't have an e. There are fluctuations in the language that I'm used to, such as the color vs. colour thing, but basic spelling problems that would not be correct in any dialect seem to be pretty common. And of course we have the their vs. there problem.
Entropy (Score:2)
Wikipedia is not immune to entropy.
Badly flawed methodology (Score:3)
Part of the problem is the article selection methodology. By pulling random articles, the study author is going to be getting mostly articles that have received little attention, and mostly short articles. (Table 2 and Graph 2 show this very clearly--of the 2400 articles examined, only 14 existed in 2001. Half of them didn't exist until 2007. A quarter were created between 2009 and the present.) It's possible that what has been demonstrated is simply that relatively new articles on relatively unimportant topics tend to be less-well maintained.
The major issue is the corpus used for the study. While a half-million-word dictionary sounds impressive, it's still going to fall down in a couple of key areas. For one, foreign-language terms are likely to be nearly completely unrepresented. For another, a lot of proper nouns are going to be missing. If I write an article about Japanese manga or a Norwegian village, I'm going to be including all kinds of things that an English-language dictionary just isn't going to contain. (Worse, I'll get two misspellings for each Japanese term, since I'll have it in the article with both the original Japanese word plus the romanized transliteration). Another problem area will almost certainly be articles on highly technical topics (molecular biology is full of new and unusual abbreviations).
While certain classes of 'obvious' non-words aren't counted, many will be missed. For example, the article preprocessor filters out percentages, but will pass through numbers followed by the degree symbol (which will show up in scientific and geographic articles).
What is noticeably lacking from the report is any mention of manual checking performed by the author to evaluate the accuracy of the results generated by the spell checker. Table 4 reports that about five percent of articles contain more than 25% misspelled words(!); honestly, even people on Twitter don't (generally) show that level of illiteracy. Are there certain types of articles which are responsible for these grossly inflated counts?
In summary -- sloppy methods give useless results. No news.
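To illustrate the kind of false positives described above, here is a minimal Python sketch. It is not the study author's code: the toy dictionary, the filter rule, and the sample sentence are all assumptions made up for the example.

```python
import re

# Toy stand-in for the study's half-million-word dictionary (assumption).
DICTIONARY = {"the", "village", "of", "lies", "at", "and", "has", "about",
              "residents", "a", "north"}

def tokens(text):
    """Split on whitespace and strip surrounding punctuation."""
    return [t.strip(".,;:()\"") for t in text.split() if t.strip(".,;:()\"")]

def is_filtered(token):
    """Hypothetical preprocessor: drop plain numbers and percentages only."""
    return re.fullmatch(r"\d+(\.\d+)?%?", token) is not None

def misspelling_rate(text):
    """Return (rate, flagged) for the tokens that survive the filter."""
    kept = [t for t in tokens(text) if not is_filtered(t)]
    flagged = [t for t in kept if t.lower() not in DICTIONARY]
    return (len(flagged) / len(kept) if kept else 0.0), flagged

# Made-up sample sentence with one proper noun and one degree value.
sample = "The village of Nordvik lies at 62° north and has about 450 residents."
rate, flagged = misspelling_rate(sample)
print(flagged, round(rate, 2))
```

The sentence contains no spelling errors, yet the place name and the degree value are both counted as misspellings, which is why a manual spot check of the worst-scoring articles would matter.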
Re: (Score:2)
Or moar ppl frm teh txting gener8on.
Re: (Score:2)
TFA measured "Average misspellings as percentage of sampled content", up from 0 in 2001 to over 6% now.
Re: (Score:2)
0% in 2001? As in, less than a percent by enough that it didn't make sense to round up to 1%? That's a huge increase. I wonder if it's a real increase, or a result of sabotage by either independent malefactors, or by Britannica using an automated approach...
Re: (Score:2)
Actually, it's 0.00. But many of the 2400 articles they sampled were less than 10 years old. Notably, the rate jumps up to 2.58% in 2002, and then continues to climb at a pretty steady 0.365%/year after that, with a slightly higher uptick between 2006-07.
I'm not entirely sure what to take away from that, but it does seem that the more articles WP adds, the less people care about writing them properly.
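A quick sanity check of those figures in Python, assuming the climb is strictly linear from 2002 and that the last data point is 2011 (the end year is an assumption based on when this thread was posted):

```python
start_year, start_rate = 2002, 2.58   # percent, from the comment above
slope = 0.365                         # percentage points per year
end_year = 2011                       # assumed final year of the sample

# Straight-line projection from the 2002 figure.
projected = start_rate + slope * (end_year - start_year)
print(round(projected, 3))
```

A straight line lands a little under 6%, so the extra uptick around 2006-07 would have to account for the rest of the "over 6%" figure quoted earlier.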
Re: (Score:3)
which don't even always follow normal English phonetic conventions
Wait, English has normal phonetic conventions?
Re: (Score:2)
Very well put, If I may say so! (forgot to log in)
Re: (Score:2)
Actually I don't really believe that this will happen, but it is a scary thought.