Have 100GB Free? Host Your Own Copy of Wikipedia, With Images

First time accepted submitter gnosygnu writes "Want your own copy of English Wikipedia with images? Got 100 GB of disk space? Then open-source app XOWA may be of interest to you. The project released torrents yesterday for the 2013-11-04 version of English Wikipedia. There's 100 GB of sqlite databases containing 13.9 million pages, and 3.7 million images — readable from any Windows, Linux, or Mac OS X system. Image downloads for other wikis are building, but you can still use XOWA to read the text-only version for other wikis like Wiktionary, Wikisource, Wikiquote and 660 more. Next time you find yourself stranded without the internet, you can pull out your own copy of Wikipedia for use."
  • by Russ1642 ( 1087959 ) on Tuesday November 26, 2013 @12:35PM (#45527247)

    It comes with software that automatically reverts your edits and insults you.

  • Finally! (Score:5, Funny)

    by lagomorpha2 ( 1376475 ) on Tuesday November 26, 2013 @12:36PM (#45527253)

    Finally I can have my own version of wikipedia so I can correct all those changes I haven't been allowed to enter into the official version!

    • I hear this from a lot of slashdotters. No one bothers to give examples. I find that suspicious. I believe it happens, I just am skeptical that the edits slashdotters are trying to put in shouldn't be rejected.
      • by lgw ( 121541 )

        Ah, a Wikieditor/fanboy. Admit it: you will be torrenting this 100GB copy just so you can delete every article, then do it all again.

      • by mcgrew ( 92797 ) *

        I hear this from a lot of slashdotters. No one bothers to give examples.

        Here's one for you. In 2006 I needed eye surgery, an artificial lens for my left eye. My surgeon suggested a new design that had been out since 2003. Before the new design there were two types: monofocal and multifocal. Multifocal was like having bifocals in your eyeballs, with monofocals you needed reading glasses.

        The new design is called an accommodating lens and sits on struts inside the lens capsule, so it will focus using the eye's

        • The trick (i use) to getting an edit in Wikipedia is to state my intent on the Talk page, something like "If no-one objects,..." wait a day, then post. Either someone will respond and a discussion will ensue or no one will respond at all. I've never had an edit reverted when i do this.

          I have no idea about the internal mechanisms behind this though, i only edit/add a few lines at a time - i might just be lucky.
        • I'm not doubting your story - especially since you're someone I generally trust well on slashdot (you're my "friend" here); however:

          Go on, try to edit something. It can't be done.

          A little while ago (back in 2008 looking at the article history), I found this article about MFPs [wikipedia.org] to be horribly weak and focused only on home devices with no mention of office or production devices at all.

          Working in the MFP industry, I was able to add a lot of information and give good citations for it; so I did so. Other than the occasional spammer trying to advertise their

          • by mcgrew ( 92797 ) *

            I've always suspected that it was one of Bausch&Lomb's competitors who removed my edits, since the new IOL was so superior (even if it was $1000 more expensive, being under patent). I can see where you would have been more successful with your attempt.

      • I didn't actually believe it either when I first saw it on Slashdot.

        And then I read the webcomic Namesake, and went to Wikipedia to check one thing about Alice Liddell, who was featured in the webcomic (which is very good by the way, and is quite famous now).
        I noticed that there was a section of the Wikipedia article called "Alice Liddell in fiction" that didn't feature Namesake, so I added Namesake.

        It was immediately reverted, citing "no link".

        So I reverted the revert and added a link.
        It was i
    • You say that and laugh, but wait until someone that manages their own DNS, and with an evil intention gets a good idea...
    • Finally I can have my own version of wikipedia so I can correct all those changes I haven't been allowed to enter into the official version!

      Or you could just switch to using Conservapedia.

    • If you are making changes, please make them in my version too. I don't want it to get outdated.
    • You've always been able to download every page and image. Am I missing something?

      http://dumps.wikimedia.org/ [wikimedia.org]

  • by Anonymous Coward

    Does it include the seasonal donation nag banners?

    Holidays are coming! Holidays are coming!

  • ...yet. But I guess most phones won't easily read sqlite databases yet, either. I suppose it won't kill me to lug around a full-sized SD card.

    Still looking forward to the library-of-Congress-on-a-card from Rainbows End.

    • by vux984 ( 928602 ) on Tuesday November 26, 2013 @01:02PM (#45527709)

      Rats. It won't QUITE fit on a microSD card...

      Just exclude the star trek / star wars related entries; that should pare it down. And besides we all have it all committed to memory anyway right? :p

    • Re: (Score:3, Informative)

      by Anonymous Coward

      ...yet. But I guess most phones won't easily read sqlite databases yet, either. I suppose it won't kill me to lug around a full-sized SD card.

      Still looking forward to the library-of-Congress-on-a-card from Rainbows End.

      Most phones _won't_? Four out of five smartphones today have sqlite preinstalled and ready for use: http://developer.android.com/reference/android/database/sqlite/package-summary.html

    • ...yet. But I guess most phones won't easily read sqlite databases yet, either.

      The structured storage for Android Apps is just SQLite databases. Of course Android doesn't include a database management tool for the end user, but in the background it can read SQLite just perfectly.
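To illustrate the point that SQLite files are just ordinary databases any SQLite library can open, here is a minimal Python sketch. It does not use XOWA's actual schema, which this thread does not describe: the table `demo_page` and its columns are hypothetical stand-ins, and an in-memory database replaces the real file.

```python
import sqlite3


def list_tables(conn):
    """Return the names of all tables in an open SQLite connection."""
    rows = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    ).fetchall()
    return [name for (name,) in rows]


# Stand-in database. A real file would be opened with sqlite3.connect(path).
conn = sqlite3.connect(":memory:")

# Hypothetical table, NOT the XOWA layout.
conn.execute("CREATE TABLE demo_page (id INTEGER PRIMARY KEY, title TEXT, body TEXT)")
conn.execute(
    "INSERT INTO demo_page (title, body) VALUES (?, ?)",
    ("Earth", "Mostly harmless."),
)
conn.commit()

tables = list_tables(conn)
row = conn.execute(
    "SELECT body FROM demo_page WHERE title = ?", ("Earth",)
).fetchone()
print(tables, row[0])
```

Listing `sqlite_master` first is a reasonable way to explore an unfamiliar database file before guessing at table names.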

  • When the supercold storm blasts through your town, your device will freeze. And I'll still be able to read the pages of my Universalis as I tear them to burn them for heat.

  • by caveat ( 26803 ) on Tuesday November 26, 2013 @12:56PM (#45527607)

    I'd have put en.wikipedia at at least a couple of terabytes. Not inconceivably large, but with some housecleaning I could actually get 100GB free.

    • I'm thinking this must be compressed data. Clicking through, it says that there are 20 GB of text data and 13.9 million articles. This only gives 1.4 KB per article, which seems extremely small, especially if you're getting all the formatting data. Also remember, I'm pretty sure this doesn't contain all the revision data, only the current version of each article, so the amount of data at Wikipedia would have to be quite a bit larger.
      • How many articles are shitty little 100 byte stubs? As far as revision data, you are probably correct.
      • Well, if they were pulling only text content, 1.4kB would actually be pretty close to correct. Using average characters/word, 1.4kB would be 350 words of text, which is not far off the estimated 400 words/article as calculated in 2005. I'd expect now it would be 450/article, but still not unreasonable depending on the types of articles added since 2005 (i.e., if every town has its own 1 sentence blurb).
    • yeah, 3.7 million images under 100gb? Do I even want to look at these? I can't imagine how compressed and low res those would have to be.

      • Fortunately, you don't have to imagine. The simplest of arithmetic will reveal that's an average of about 20kB per image. If we assume as near-worst-case an uncompressed 16-bit pixmap format, that means 100px x 100 px or so; realistically, most of them are probably jpegs, so search your hard drive...

        find / -name '*.jpg' -size -25k -size +15k

        And take a look at what you have in that range. Then keep in mind that that's an average -- there'll be some much better and some even smaller/compresseder.

        • that's an average of about 20kB per image. If we assume as near-worst-case an uncompressed 16-bit pixmap format, that means 100px x 100 px or so; realistically, most of them are probably jpegs

          Exactly my point. :-)

    • I downloaded all of the (current revision) text a few years back from some of their public data dumps. Stored in a handful of massive XML files, it ended up only being around 3GB. I'd guess it isn't much bigger now, and that the vast majority of the 100GB is simply due to images.
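The back-of-the-envelope figures in this thread can be checked with a short Python sketch. It uses only numbers quoted above (100 GB total, 20 GB of text, 13.9 million pages, 3.7 million images). Two things are assumptions rather than facts from the thread: gigabytes are taken as decimal, and everything that is not text is counted as image data.

```python
import math

TOTAL_BYTES = 100e9   # "100 GB" from the summary, assumed decimal gigabytes
TEXT_BYTES = 20e9     # "20 GB of text data" quoted in the thread
PAGES = 13.9e6        # pages in the 2013-11-04 dump
IMAGES = 3.7e6        # images in the dump

# Average text per page.
bytes_per_page = TEXT_BYTES / PAGES

# Average image size, assuming all non-text bytes are images.
bytes_per_image = (TOTAL_BYTES - TEXT_BYTES) / IMAGES

# Side of a square, uncompressed 16-bit (2 bytes per pixel) image of that size.
pixmap_side = math.sqrt(bytes_per_image / 2)

print(f"{bytes_per_page:.0f} bytes of text per page")
print(f"{bytes_per_image:.0f} bytes per image")
print(f"about {pixmap_side:.0f} x {pixmap_side:.0f} px as an uncompressed 16-bit pixmap")
```

Under these assumptions the results come out near the 1.4 KB per article and roughly 20 kB per image (about 100 x 100 px uncompressed) mentioned in the comments.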

  • by stenvar ( 2789879 ) on Tuesday November 26, 2013 @01:06PM (#45527749)

    That's a good thing. The more we use torrents for the distribution of legitimate content, the more such distribution methods will become legitimized.

    • Let this be heard by everyone in IT management that's trying to sync data between multiple national locations.
    • It's already legitimate and doesn't need legitimizing.

      Of course, that doesn't mean your favorite popular zero-day movie/series/albums/ebooks/software site of a rather unauthorized nature magically gains "but what about the copy of wikipedia!?" protection from the likes of MPAA/RIAA/Wiley?/BSA.. at least not in most courts of law.

      • You seem to be confusing "legal" and "legitimate". It's legal, but not necessarily considered legitimate. In particular, many ISPs seem to interfere with torrent traffic. The more people use it for non-copyright-infringing purposes, the more pressure there is on ISPs to back off on their interference.

        • While popularly torrents get messed about with in terms of available bandwidth, the same applies to several other P2P protocols. It's the painted nature of the beast - lots of potentially high-bandwidth connections established for essentially low-priority purposes - that hurts it in that respect. (Yet) An(other) archive of wikipedia isn't going to change that - unless you can think of a convincing reason to submit to ISP decisionmakers that would cause them to believe that throttling the download and/or u

  • > XOWA is a free, open-source application that lets you download Wikipedia to your computer. No internet connection required!

    This is supremely impressive; download Wikipedia without an internet connection!

    • Yeah, just run a crossover cable to Wikipedia's server room!
    • First, you tie your request to download Wikipedia to this pigeon's leg and let it fly off.

      Next, you wait for the reply.

      Finally, you load the reply into your computer.

      NOTE: Reply will come in printed format - one article per pigeon. A few million pigeons may be required, but don't worry. We send them all at once to keep you from having to wait.

      • by Kardos ( 1348077 )

        Sounds good to me, there's certainly no shortage of pigeons. It'll be good to put them to work doing something useful!

    • by tepples ( 727027 )

      This is supremely impressive; download Wikipedia without an internet connection!

      Someone's never heard of BD-R.

  • Be... without internet? *screams*
    • Be... without internet? *screams*

      Some of us have to do it. When the boat's connection goes down (e.g. because bad weather misaligns us with the satellite for days on end), that's it ; no internet. Also no emails, or phone calls except through the ship-to-shore radio set. It's bliss!

  • by hcs_$reboot ( 1536101 ) on Tuesday November 26, 2013 @01:36PM (#45528253)
    I hope it's when the previous pope (Ben #16) was pictured as Master Yoda in Wikipedia.. missed that :-)
  • This will be great for offline/remote/low speed situations. Imagine being on a merchant ship or even a cruise ship with a pricey connection package. Scientific expeditions etc.

    How about preloading it on OLPC?

    What if your high school kid can't do his homework without getting distracted online, but says he needs Wikipedia for research. Bam, here's your air-gapped PC son.

  • what I call backup on the cloud
  • Don't Panic (Score:5, Interesting)

    by Covalent ( 1001277 ) on Tuesday November 26, 2013 @01:59PM (#45528671)
    Next year or so 100GB phones will be commonplace...and you will have your Hitchhiker's Guide.

    Truly amazing times we live in.
    • Next year or so 100GB phones will be commonplace...and you will have your Hitchhiker's Guide.

      Pffth. I don't need that. I just need to remember that it's "mostly harmless".

  • Revisions? (Score:4, Interesting)

    by hendrikboom ( 1001110 ) on Tuesday November 26, 2013 @02:08PM (#45528775)

    Presumably the wikipedia is under revision control.
    Does this give you the whole thing so that you can forever after sync with the master?
    Or just the most recent versions of the articles?
    Should there be a bittorrent for syncing huge revision control data bases?

  • just pulled the most recent english-language wikipedia dump, and made elasticsearch ( via the wikipedia river plugin ) run over it. 13.9 million entries now on a small server, answering times ~ couple-of-millisecond order. elasticsearch rocks !
    • Actually, the two things an offline Wikipedia version would benefit from are semantic search and a better UI. Those haven't been tackled yet.
      • Second this. On the semantic search thing, we are generating ideas in-house right now. Contact me if you have an idea for better UI.
  • by Anonymous Coward

    I've been mirroring a local copy of Wikipedia for a long time, with images. What's new about this app compared to the dozens of others that already do this?

  • by Anonymous Coward

    I was wondering when I could replace my CD of Encarta 96.

  • But I thought SQL wasn't webscale wtf?
  • Wikipedia is only so entertaining if you are stranded somewhere with no other way to pass the time.

    Now, if they give us a torrent of the complete TVTropes site....

  • by Anonymous Coward

    That's ALL it takes up?? My goodness! Wikipedia can fit on my largest USB drive?? haha.. I expected it to be in the multi-TB range!

  • I am missing checksums to verify the download. It seems sourceforge has the tendency to change stuff.
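If a trusted digest for a download is available from somewhere, checking a file against it could look like the Python sketch below. The choice of SHA-256 and the function name `sha256_of_file` are assumptions for illustration; the thread does not say which checksums, if any, the project publishes.

```python
import hashlib


def sha256_of_file(path, chunk_size=1024 * 1024):
    """Hash a file in fixed-size chunks so large downloads need little memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # iter() with a sentinel keeps reading until read() returns b"".
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Usage would be something like comparing `sha256_of_file("download.bin")` with the published hex string, where `download.bin` is a placeholder file name. Reading in chunks matters here because the files in question run to many gigabytes.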
  • This is really a cool thing to have as an option. 100G isn't that much today when a TB might cost you 30 bucks (rather surprised it's that small...), and with how 'vulnerable' everything is on the net today it wouldn't hurt to have an archive before the next takedown notice or commercial buy-out (or shutdown due to loss of funding).

  • If all you want is an offline Wikipedia reader, just use Kiwix [kiwix.org]. It uses the ZIM format [openzim.org] which was created specifically for offline use and runs on Win/Mac/Linux/Android or anything else if you want to compile it yourself.

    While the full English Wikipedia ZIM sans pictures is a bit old (January 2012), it has the benefit of being only 10GB and split up into 2GB chunks so it will fit on a FAT32 device like your phone's SD card.
