Wikipedia

The Editors Protecting Wikipedia from AI Hoaxes (404media.co) 45

A group of Wikipedia editors have formed WikiProject AI Cleanup, "a collaboration to combat the increasing problem of unsourced, poorly-written AI-generated content on Wikipedia." From a report: The group's goal is to protect one of the world's largest repositories of information from the same kind of misleading AI-generated information that has plagued Google search results, books sold on Amazon, and academic journals. "A few of us had noticed the prevalence of unnatural writing that showed clear signs of being AI-generated, and we managed to replicate similar 'styles' using ChatGPT," Ilyas Lebleu, a founding member of WikiProject AI Cleanup, told me in an email. "Discovering some common AI catchphrases allowed us to quickly spot some of the most egregious examples of generated articles, which we quickly wanted to formalize into an organized project to compile our findings and techniques."

In many cases, WikiProject AI Cleanup finds AI-generated content on Wikipedia with the same methods others have used to find AI-generated content in scientific journals and Google Books, namely by searching for phrases commonly used by ChatGPT. One egregious example is this Wikipedia article about the Chester Mental Health Center, which in November of 2023 included the phrase "As of my last knowledge update in January 2022," referring to the last time the large language model was updated.
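The phrase-search approach described above can be sketched in a few lines. This is a minimal illustration, not the project's actual tooling; the catchphrase list is an assumption drawn from commonly reported chatbot boilerplate, and real cleanup still relies on human review:

```python
import re

# Illustrative list of chatbot boilerplate phrases; a real cleanup
# effort would maintain a curated, evolving list, not these three.
AI_CATCHPHRASES = [
    r"as of my last knowledge update",
    r"as an ai language model",
    r"it is important to note that",
]

PATTERN = re.compile("|".join(AI_CATCHPHRASES), re.IGNORECASE)

def flag_suspect_text(text: str) -> list[str]:
    """Return every catchphrase match found in the text."""
    return [m.group(0) for m in PATTERN.finditer(text)]

article = ("The facility serves the region. As of my last knowledge "
           "update in January 2022, it housed several hundred patients.")
print(flag_suspect_text(article))  # → ['As of my last knowledge update']
```

Matching like this only catches the most egregious cases (verbatim boilerplate); subtler AI prose still needs an editor's eye.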


Comments Filter:
  • One could use their change-sets to fuel an AI that spots AIs.

    Or for the more nefariously minded, train an LLM that rewrites LLM trash to sound less like LLM trash.

    • by allo ( 1728082 )

      "Or for the more nefariously minded, train an LLM that rewrites LLM trash to sound less like LLM trash."

      That is the same idea as "Why don't we train an AI to detect good AI images and throw away all the bad ones".
      If you could do this, the generator AIs would already have it integrated (in fact, you could integrate such a tool into the training so you do not even need to generate the bad images afterward).

      An LLM that can rewrite LLM trash to sound better can be used as an alternative to the LLMs that write trash.

      • That is the same idea as "Why don't we train an AI to detect good AI images and throw away all the bad ones".

        That's called a generative adversarial network and it's the foundation for a lot of the modern generative tech.

        But I also mean something more specific: "less detectable to people trying to do detection" at the cost of "less like real writing" is a thing a lot of people (spammers) want from LLMs.

        • by allo ( 1728082 )

          > That's called a generative adversarial network
          That's exactly the point. If you have the classifier, you use it to train the generator, so you don't need the classifier afterward.

          > But also I do mean something more specific. "Less detectable to people trying to do detection" at the cost of "Less like real writing" is a thing a lot of people(spammers) want from LLMs.
          Less detectable is more or less the same goal as having text that reads fluently. The things you can use to detect LLM content are exactly the things that keep it from reading like natural writing.
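The classifier-into-generator idea discussed in this thread can be illustrated with simple rejection sampling: generate several candidates and keep the one the detector flags least. The scorer here is a toy stand-in for a trained classifier, and the candidate strings are invented for the example:

```python
# Toy rejection-sampling sketch: a detector scores candidates and the
# generator keeps the least-detectable one. A real system would fold the
# classifier's signal directly into training instead of filtering afterward.
CATCHPHRASES = ["as an ai language model", "it is important to note"]

def detector_score(text: str) -> int:
    """Higher score = more obviously LLM-flavored (toy heuristic)."""
    lowered = text.lower()
    return sum(lowered.count(phrase) for phrase in CATCHPHRASES)

def pick_least_detectable(candidates: list[str]) -> str:
    """Keep the candidate the detector flags least."""
    return min(candidates, key=detector_score)

candidates = [
    "As an AI language model, I can summarize this topic.",
    "The topic spans three major developments since 1990.",
]
print(pick_least_detectable(candidates))  # → the second, unflagged candidate
```

This is exactly the adversarial dynamic the thread describes: any published detector doubles as a filter (or training signal) for the generator it was meant to catch.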

          • It's not, because if you look, the project in TFA uses metrics, keywords, and other triggers to flag its targets, not just "is it gibberish/wrong," which is a standard all of Wikipedia nominally tries to enforce on all edits, AI or not.

    • Remarkably close to this, Wikipedia was promoting a tool [mozilla.org] based on an LLM that allows users to add statements from new sources, although in this case the caveat is that the user has to find a citable source first.

  • Not AI (Score:4, Insightful)

    by Geoffrey.landis ( 926948 ) on Friday October 11, 2024 @12:06PM (#64856991) Homepage

    I wish people would stop referring to large language models as "Artificial Intelligence."

    It is sophisticated pattern-matching software. It doesn't in any way "know" what the text it produces means; it just makes text that looks like the patterns of text that are made by humans who do.

    • ^^^^THIS^^^^

      The misnamed "AI" output is a mashup of whatever else mentions the same salient terms. It has not, to me, shown any evidence of being thought out. It shows no evidence of original content or analysis of concepts. It ingests everything, including dross, and doesn't seem to be able to tell the difference in an analytical way.

      There do seem to be those who believe in the "turtles all the way down" concept of LLMs that analyze other LLM outputs, and perform useful functions like editing for clarity, or

        • The LLMs analyzing output from other LLMs tend to go off the rails with much stronger "hallucination" effects. Garbage in, garbage out. Even output that is intended to closely resemble natural language has enough garbage in it to screw up the learning. Take a mimeograph of a mimeograph of a mimeograph all the way down and you end up with something nasty.

        (for those who don't know what mimeographs are, they're a messy organic solution allowing documents to reproduce. Also get off my lawn!)

        However GPT4 ha

    • Stop. It's not going to change and you don't look smart by saying this.
      You're not the arbiter of this. You can't even properly define "Intelligence".

      Stop polluting Slashdot with this utterly uninteresting and unending pointless "insight".

    • by allo ( 1728082 )

      And stop saying "crypto" when you mean blockchain-based currencies.

      Don't fight it, we already lost. Neural networks are now "AI," because the term is catchier than all the alternatives. Try to get people to say ML ...

    • Re:Not AI (Score:4, Insightful)

      by MobyDisk ( 75490 ) on Friday October 11, 2024 @02:23PM (#64857313) Homepage

      Scary fact: that's how we operate as well. You and I don't know what we are going to say five words from now: it just flows out of a complicated statistical model in our brains. Sometimes we say things, then realize they are wrong only after we hear ourselves say them. And sometimes we still don't know.

      Decades ago, a program that played Chess was considered "Artificial Intelligence." Then we moved the goal posts to "Well, they can't beat a grand master." (Most of us can't!) Then we felt safe behind the Turing Test: Phew! We humans could still look down upon AI as not *really* intelligent. Today's AIs are based on neural networks, similar to how biological brains operate. They pass the Turing Test with flying colors, yet we still refuse to say they are "Artificial Intelligence." What test must they pass next?

      This really sounds like the No True Scotsman [wikipedia.org] fallacy, or moving the goal posts. Science fiction is rife with characters like in Star Trek: TNG who claimed that Commander Data wasn't really alive or wasn't really intelligent. No matter how smart these systems get, they are never worthy of the label "Artificial Intelligence."

      For all I know, there is some being looking at our conversation on a monitor telling his colleague "Nice program, but it *still* isn't artificial intelligence."

      • "What test must they pass next?"

        Critical Thinking Test.

        When "AI" can actually QUESTION the data fed to it, and thus the commands it has been given. Otherwise, it is merely the extension of a human's desires, providing us with what we programmed it to do. We humans possess critical thinking, so a human equivalent should as well.

    • Except that Artificial Intelligence has been the term since the 70s, despite there being no intelligence. It is a research field that investigates how to get something resembling intelligence without explicitly programming it all. Machine learning, or machine adaptation.

    • Artificial intelligence is a term that's been around for a long time. It's not going away anytime soon, but the goal posts keep moving. The A* traversal algorithm was considered artificially intelligent. The threshold for artificial intelligence used to be a program that could be the world champion at chess. At one time artificial intelligence was synonymous with computer vision. If you could make software that recognized and identified objects, you had artificial intelligence.

    • I wish people would stop referring to large language models as "Artificial Intelligence."

      Artificial Intelligence is a field of scientific endeavor. Like Mathematics.

      LLMs are a part of artificial intelligence. Like addition is a part of mathematics.

      It should not be confused with the entire field... but it is a part of it. Therefore the term is applicable -even if wildly misleading as commonly used.

    • The term you're looking for is "strong AI" or "artificial general intelligence" (AGI.) As others have pointed out, the term "artificial intelligence" has always referred to all forms of research into automated decision-making. Perhaps Hollywood has misled plebs on this point by bombarding the public with portrayals of human-level thinking by machines, but bringing that viewpoint into a community like Slashdot is only ever going to get you shouted down.

  • It is easier to destroy than to create, and with AI generating the poisonous content there is no long-term scenario where the legitimate content wins out that doesn't involve forcing contributors to register with verified government ID and paid admins policing them.

    • by Samare ( 2779329 )

      Current LLM content doesn't include valid sources, so it can be removed immediately per Wikipedia's guidelines. https://en.wikipedia.org/wiki/... [wikipedia.org]
      And pages can already be set to show the latest content validated by contributors with more experience. Maybe that will become more common. https://en.wikipedia.org/wiki/... [wikipedia.org]

      • by allo ( 1728082 ) on Friday October 11, 2024 @01:59PM (#64857237)

        An LLM cannot give valid sources on its own, but systems built around an LLM can. See for example Perplexity, which integrates a web search with an LLM so it can source its text. This may or may not result in useful text, but it provides valid sources for checking whether the text is correct.
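The retrieve-then-cite pattern the comment describes can be sketched as follows. Both helper functions are hypothetical stand-ins (canned search results, string-stitching instead of an LLM call); a real system like Perplexity would use live search and a language model:

```python
def web_search(query: str) -> list[dict]:
    """Stand-in for a real search API; returns canned results."""
    return [{"url": "https://example.org/source",
             "snippet": "The facility opened in 1873."}]

def compose_answer(query: str, sources: list[dict]) -> str:
    """Stand-in for an LLM call: stitch snippets together with
    numbered citations so every claim can be traced to a source."""
    body = " ".join(f"{s['snippet']} [{i}]" for i, s in enumerate(sources, 1))
    refs = "\n".join(f"[{i}] {s['url']}" for i, s in enumerate(sources, 1))
    return f"{body}\n{refs}"

print(compose_answer("facility history", web_search("facility history")))
```

The point is architectural: the sources are fetched before generation, so a reader can verify each claim even if the generated prose itself is unreliable.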

      • Generally those using chat AI outputs do so to summarize writing that is already there. So one could write up a bad wikipedia article, with valid citations, then have the LLM "clean it up". Especially useful for non-native speakers of the language.

    • I'm not sure I see *any* long-term scenario where it isn't all "AI" and the humans are grateful for it. How many times do you have to have your painstaking entry deleted for relevance, only to have some stub emerge and get worked on years later? How often do you want your observation that the sky is blue deleted because there's no "source"? Do you enjoy fighting battles you'll never win between reality and politics?

      Obviously, I have an opinion: Wikipedia sucks because of its crusading editors, and I could care less
  • I don't know the solution, but Wikipedia seems to have become an indispensable asset to the free world. Could an international consortium buy Wikipedia with a clear and un-editable mission statement? Something like the E.U., but hopefully with more countries, including the U.S. Not government specifically, but one that governments support and participate in?

    I can't imagine the consequences if Wikipedia just disappeared or became completely biased. They might be hard to quantify, but I have no doubt they would be

    • Russkies are working on their own wikipedia with blackjack and hookers. It will be completely unbiased. 100% true and 110% on the right side of history.
      • The vodka is strong but the meat is feeble.
      • Re: (Score:2, Informative)

        by Darinbob ( 1142669 )

        Ah, like Conservapedia, chock full of falsehoods that present a deliberately biased view. It starts with the assumption that the existing Wikipedia is liberally biased (or biased in other ways opposed to Conservapedia's authors), and that existing print encyclopedias are also highly biased, therefore it balances the scales by biasing in the opposite way. The result is a wiki encyclopedia firmly in favor of one, and only one, style of creationism forming a sizeable bulk of the content, along with conspiracy theories and diatribes

    • by Samare ( 2779329 )

      Wikipedia is a project of Wikimedia Foundation. And Wikimedia Foundation isn't a corporation, it's an American 501(c)(3) nonprofit organization. https://en.wikipedia.org/wiki/... [wikipedia.org]

      • I understand it's a non-profit, but that makes it vulnerable to its big donors just like our democracy is, doesn't it? Also, from the ads/requests, it doesn't seem to get enough funding as it is. Maybe just an international consortium to fund it, as long as it remains, basically, unbiased?
        • Ah, but "international" means un-American :-) Some people will assume it must be biased. Effectively there's no way to avoid bias or the appearance of bias. The best you can do is be open and clear with everything, and Wikipedia mostly does that already.

        • by Samare ( 2779329 )

          While American democracy is paid for by big donors who give money in exchange for favorable laws and decisions, Wikipedia is written by people like you and me who don't care about the big donors.

      • Wikimedia Foundation isn't a corporation, it's an American 501(c)(3) nonprofit organization.

        A 501(c)(3) is by definition a non-profit corporation. They are funded by contributions, which reminds me I need to send them some money.

        I am not sure it matters whether content is generated by AI or some other process. The issue is whether an entry is factually accurate, truthful, and not spun to someone's interest. It's pretty obvious that is not the case for a lot of entries now. Because Wikipedia is the first go-to for many of us, there are lots of people trying to influence the content.

        There was recently a

      I can't imagine the consequences if Wikipedia just disappeared or became completely biased

      Already is completely biased, and has been that way for years now.

  • by xack ( 5304745 ) on Friday October 11, 2024 @12:45PM (#64857047)
    Ever since the original RAMbot generated thousands of articles from US census data. Then lots of repetitive stuff like plant species are all derived from databases. They are even automating it further with Wikifunctions [wikifunctions.org], which is currently in testing.
  • No typos, no spelling errors, grammatically correct, multi-syllable words, very suspicious. :-)

    • This is actually a great way of detecting copyright violations on Wikipedia. I've been suspicious a few times: perfect text appearing in an article on a trail somewhere, or twenty episodes of some kid's show getting plot summaries. Most of the time a quick search shows it's been copied and pasted from some commercial site. For the plot summaries there's also a style of writing that makes you know it's from a TV company website: "There's trouble in the house when Danny brings back a xylophone."
