United States Data Storage

National Archives' Digital Woes 190

Carl Bialik from the WSJ writes "The National Archives, entrusted to preserve America's official history, will have to handle roughly 100 million emails from the Bush White House, up from 32 million during the Clinton years, according to the Wall Street Journal. 'The rapid adoption of electronic communications technology in the last decade has created a major crisis for the Archives,' the Journal reports. 'For one thing, the amount of data to be preserved has exploded in recent years, thanks to the proliferation of high-tech tools such as personal computers and wireless email devices such as BlackBerries. At the same time, technology is becoming obsolete so fast that electronic documents created today may not be legible on tomorrow's devices, the equivalent of trying to play an eight-track tape on an iPod.' The director of the Electronic Records Archives Program tells the Journal, 'We don't want to turn into a Cyber-Williamsburg, a place that keeps old technologies alive.'"
This discussion has been archived. No new comments can be posted.

  • "The National Archives, entrusted to preserve America's official history, will have to handle roughly 100 million emails from the Bush White House,.." Thanks to the Patriot Act, this number will be reduced to roughly four, including one such email with a compelling advertisement for V14GR4!!!!!!11
  • some funny math (Score:5, Interesting)

    by Yonder Way ( 603108 ) on Thursday December 29, 2005 @10:09PM (#14362328)
    100 million emails
    let's be generous and say that the average email is 8192 bytes in size (8KB)

    100,000,000 * 8KB = ~800GB

    That's not much at all. And that's if you store it uncompressed.

    Use a well documented unencumbered compression algorithm and it's likely to all fit on a single tape.
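
The estimate above can be checked in a few lines of Python. This is only a sketch of the same arithmetic: the 8192-byte average is the poster's guess, not a measured figure, and the function name is made up for illustration.

```python
def archive_size_bytes(message_count: int, avg_message_bytes: int) -> int:
    """Raw, uncompressed size of a mail archive in bytes."""
    return message_count * avg_message_bytes


# 100 million messages at an assumed average of 8192 bytes each.
total = archive_size_bytes(100_000_000, 8192)

print(f"{total / 10**9:.1f} GB (decimal)")  # 819.2 GB
print(f"{total / 2**30:.1f} GiB (binary)")  # 762.9 GiB
```

Either way the order of magnitude holds: well under a terabyte before compression, and before any attachments are counted.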

    • by NonSequor ( 230139 ) on Thursday December 29, 2005 @10:13PM (#14362344) Journal
      This is the Bush administration we're talking about. They all use HTML mail with lots of attached graphics. On top of that, many messages get forwarded hundreds of times.
    • What's that when converted to the storage capacity unit du jour, the Library of Congress (or LoC)? How many LoCs is 100 million emails?

    • Re:some funny math (Score:5, Informative)

      by Wildfire Darkstar ( 208356 ) on Thursday December 29, 2005 @11:16PM (#14362588)
      Speaking as a trained archivist, I can say that the problem isn't finding storage space for the e-mails, per se. It's the duty and responsibility of the National Archives to preserve both content and context, and to ensure that these e-mails remain accessible for however long the retention schedules call for (which, in the case of executive communication, is not an insignificant length of time). Which means that the problem cannot be satisfactorily solved by dumping every e-mail onto a hard drive somewhere and forgetting about them. They all need to be indexed and cataloged, and provisions need to be made to ensure that the data can be migrated onto newer technology when it becomes necessary to do so without losing any of the information (or metadata) associated with it.

      The volume of material is staggering, and goes beyond what NARA (or almost anyone else, for that matter) has traditionally dealt with. While storage space itself is a concern, to some degree, given that this material will continue to accumulate, the larger problem is how to manage this material. Having 800GB of e-mail is pointless if you don't provide a means to get in and retrieve specific messages, and provide the appropriate context for that e-mail.
      • Too bad they already awarded the contract to lockheed martin (someone had their palm greased in that deal), as my company deals with document conversion and archiving (of this scale) on a regular basis. The NA concern was converting the documents to modern formats and yet retaining the original document... Peanuts....my systems do it on the fly.

        Oh well....$308 Million dollar contract goes bye bye.....

        When did lockheed martin get into the document management business?
        • When did lockheed martin get into the document management business?

          Sounds like they just did.

        • Too bad they already awarded the contract to lockheed martin (someone had their palm greased in that deal), as my company deals with document conversion and archiving

          So how much did you give to Jack Abramoff? Nothing? Maybe that explains it?
        • Well, now that you know that the federal government has email storage issues, perhaps your company needs to step up and learn about how to bid on federal contracts. State governments are in smaller versions of the same boat. Our governor may not be turning over 100 million emails to our State Archives, but it will be a bunch. Even the last geezer governor transferred a big chunk.

          If you've got the best mousetrap, you need to find out more about how to make your product available to the archives communit

          • It is a relatively new product (6 months)

            We're registered as gov't contractors, but I'd never seen anything like this come across the wire (it is a big wire)

            Thanks for the references, I'll check those out!
        • You may still be in luck. Lockheed Martin largely touts itself as a Systems Integrator. Depending on the contract, they make few software products themselves. Instead, they turn to Commercial Off The Shelf (COTS) software for most of their solutions. Their main role is to allow multiple vendors (archival systems, viewing systems, email systems, and other related technologies) to work together. They work to sell a complete solution, top to bottom, composed of products from companies like yours.
        • "When did lockheed martin get into the document management business?"

          Don't know. You should email the White House and ask.
      • If the NARA really wanted to be sharp about this, they could load all the emails into a running database instead of onto media, then back it up to tape periodically (this ensures the tapes will keep working, etc). They could go in all sorts of directions from this starting point, including cloning the database to produce WORKING backups, etc.

        Somehow, I find a running server more trustworthy than a bunch of CDs in a box. At least I can go ASK the thing whether it still works... :)
      • This doesn't make sense to me.

        First, you have the mail itself. RFC2822 is an international standard, so that seems like the right way to go to store the mail. For indexing, there are any number of mail archival systems, some better and some worse than others, but most handle 2822 just fine.

        Now you get into attachments. Here, you presumably want to convert everything down into one of PDF (semi-open format where there are at least several competing readers), Open Document (open format with an open source read
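
As a rough illustration of the point about RFC 2822 mail being easy to index, here is a minimal sketch using Python's standard `email` package. The sample message, the addresses, and the `index_entry` helper are all invented for the example; a real archival system would capture far more (every header, the transmission trail, attachments).

```python
from email import message_from_string
from email.utils import parsedate_to_datetime

# A made-up RFC 2822 style message used only for this example.
RAW = (
    "From: alice@example.gov\n"
    "To: bob@example.gov\n"
    "Subject: Budget meeting\n"
    "Date: Thu, 29 Dec 2005 22:09:00 -0500\n"
    "Message-ID: <1234@example.gov>\n"
    "\n"
    "See you at ten.\n"
)


def index_entry(raw: str) -> dict:
    """Pull a few header fields out of a raw message for a catalog record."""
    msg = message_from_string(raw)
    return {
        "message_id": msg["Message-ID"],
        "sender": msg["From"],
        "subject": msg["Subject"],
        # Normalize the RFC 2822 date to ISO 8601 so entries sort cleanly.
        "sent": parsedate_to_datetime(msg["Date"]).isoformat(),
    }


entry = index_entry(RAW)
print(entry)
```

The raw message stays untouched; the index record is derived from it and can be rebuilt at any time.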
      • It's called pdf and links and displayed header information.

        It's damned simple. Use a unix time stamp plus subject and sender for the name of the file. Then create directories for mailboxes and drop links into the directories that correlate to the sender and receiver. You name the link with the standard crap you see in a email client. You name it Date/Time|Subject|Sender.

        Now what you have is a list of emails (links to the actual PDFs) received by date and time in a time ordered fashion. Any coder worth
      • I know, I know, I know!! Why don't they use Google Desktop!!?
      • The volume of material is staggering, and goes beyond what NARA (or almost anyone else, for that matter) has traditionally dealt with.

        You're kidding, [google.com] right?
      • "It's the duty and responsibility of the National Archives to preserve both content and context, and to ensure that these e-mails remain accessible for however long the retention schedules call for (which, in the case of executive communication, is not an insignificant length of time)."

        Yes, but it'll still only fit on a single tape. ;D
    • let's be generous and say that the average email is 8192 bytes in size (8KB)

      Let's be honest and admit they use M$ junk. You know they are slinging around 70MB power point files, word docs, ad nauseam. Getting that all put into something legible is hard to do. Try opening your Excel 4 files, for example. Did you remember to install the right fonts and equation editor? If all non text were pumped to pdf or html, things would be a little easier but still larger. The challenge is automating the conversio

    • That's not much at all. And that's if you store it uncompressed.

      And any compression routine will immediately tokenize the long heavily repeated phrases: "September 11, 2001", "Global War on Terror", "aid and comfort to the enemy", "America's will is strong", "central front in the war on terror", "the American people are safer", "9/11", "we will prevail". There isn't a lot of entropy in this particular dataset.
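
The claim about low entropy is easy to try out. The sketch below uses zlib (DEFLATE) from the Python standard library, which replaces repeated substrings with back-references rather than literal "tokens", and compares a synthetic text built from a few of the phrases quoted above against random bytes of the same length. The test data is invented for the demonstration and says nothing about the real emails.

```python
import os
import zlib

# A few of the phrases quoted in the comment above.
PHRASES = [
    b"Global War on Terror",
    b"we will prevail",
    b"the American people are safer",
]

# Synthetic, highly repetitive "corpus" versus incompressible random bytes.
repetitive = b" ".join(PHRASES * 500)
random_bytes = os.urandom(len(repetitive))


def ratio(data: bytes) -> float:
    """Compressed size divided by original size (lower is better)."""
    return len(zlib.compress(data, 9)) / len(data)


print(f"repetitive text: {ratio(repetitive):.3f}")
print(f"random bytes:    {ratio(random_bytes):.3f}")
```

The repetitive text shrinks to a small fraction of its size, while the random bytes do not shrink at all.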
  • Plain Text (Score:5, Insightful)

    by CWRUisTakingMyMoney ( 939585 ) on Thursday December 29, 2005 @10:09PM (#14362330)
    What's to keep NARA from converting most electronic records to plain text? Surely most communications are only text themselves, so formats wouldn't be an issue there. For more complex files, OpenDocument is an option, or just any Open format. On the good side, this would make searching the archives fantastically efficient. NARA is already making some formerly paper records into electronic, searchable records. Imagine if everything were like that.
    • You have to recognize that not only the format is prone to become obsolete, but the media too (as in: you can't play audio tape music in your CD-ROM :).

      Digital is great, but preserving it in time is hard: you need media that can last long, a media reader that works with modern equipment, a file system format you can comprehend, and reader software to display the documents to you.
    • Re:Plain Text (Score:3, Informative)

      by elronxenu ( 117773 )
      Legally they're not allowed to convert the documents.

      IMHO, storing them on 8-track tape is a massive blunder. 8-track is already obsolete. What they should be doing is either keeping them all on spinning storage (with massive amounts of redundancy) or burn multiple redundant copies to DVD.

      Either way, they will have to deal with the problem of unreliable storage - it's easier to cope with if the problem can be automatically detected, and the data recovered from a backup and re-copied automatically. This
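
One common way to detect bad copies automatically is to record a checksum when a file is archived and re-verify it on a schedule. The following is a minimal sketch of that idea, not a description of what NARA does: the file names, the `verify_or_restore` helper, and the simulated corruption are all assumptions made for the example.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Hex SHA-256 digest of a file's contents."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def verify_or_restore(primary: Path, backup: Path, expected: str) -> str:
    """Return 'ok', 'restored', or 'lost' for one archived file."""
    if primary.exists() and sha256_of(primary) == expected:
        return "ok"
    if backup.exists() and sha256_of(backup) == expected:
        shutil.copyfile(backup, primary)  # re-copy the known-good version
        return "restored"
    return "lost"


with tempfile.TemporaryDirectory() as tmp:
    primary, backup = Path(tmp, "primary.eml"), Path(tmp, "backup.eml")
    primary.write_bytes(b"Subject: hello\n\nbody\n")
    shutil.copyfile(primary, backup)
    digest = sha256_of(primary)  # recorded at archive time

    primary.write_bytes(b"Subject: hellp\n\nbody\n")  # simulate bit rot
    status = verify_or_restore(primary, backup, digest)
    repaired = sha256_of(primary) == digest
    print(status, repaired)
```

The same check works regardless of whether the copies live on disk, tape, or optical media, as long as the recorded digests are themselves stored redundantly.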

      • I think you mean 9-track tape. 8-track tapes are used for car audio systems. 9-track tapes have been around for 40 years. They work, and if needed, it isn't that difficult to build new tape drives. How many other data formats have come and gone in that time period? Newer isn't always better.
        • Silly me. 9 tracks, including the parity bit.

          But I stand by my claim that they are obsolete. NASA is faced with a huge problem to recover the data off thousands of tapes written during the earliest space missions. After 40 years the oxide is flaking off the tapes and recovery is a delicate and dangerous process, often involving the destruction of the original tape.

          It remains to be seen how long current technologies like CD and DVD will last before degradation causes data loss. Some say hundreds of years

    • Re:Plain Text (Score:4, Informative)

      by Wildfire Darkstar ( 208356 ) on Thursday December 29, 2005 @11:27PM (#14362625)
      What's to keep NARA from converting most electronic records to plain text?

      Potentially Armstrong v. Executive Office of the President. Format shifting is a fantastically tricky minefield to navigate. The aforementioned court case dealt specifically with the practice of printing e-mail communication and storing it as a paper record, but it speaks to the standard problems of conversion: you need to be entirely certain that you're not losing any information in the conversion process. This includes transmission information, metadata, and so on. Which isn't to say that plain text conversion can't be done in a lot of cases, but rather that it's something that needs to be undertaken very carefully.

      And while NARA has been embarking on some wonderful digitization projects, no paper-born records have been replaced by electronic conversions as of yet, for precisely the same reason. The electronic conversion augments the original paper record, but NARA still needs to maintain and preserve the paper record for as long as they have always been legally required to do so.
  • OK (Score:2, Insightful)

    by pHatidic ( 163975 )
    So why don't they just use open source data formats? Is there something more complicated here that I'm not seeing?
    • Re:OK (Score:2, Funny)

      by AvitarX ( 172628 )
      I think the problem is they are trying to store the records on 8-tracks.
      • I think the problem is they are trying to store the records on 8-tracks.

        They are. Quoting the article is such fun:

        for now, at least, the Archives uses electronic storing methods similar to those adopted in the 1960s and 1970s, transferring data onto magnetic tapes because that is the only format the archivists know will work indefinitely.

        I hope they are using GNU tar to hold that mess together.

        The bigger problem is translating proprietary formats for the ages while maintaining the original format as re

    • Re:OK (Score:3, Funny)

      by Sinryc ( 834433 )
      Yea, there is something you're not seeing. The fact of the matter is they are talking about STORING the saved data. Not opening it.

      Good job on getting modded well. Anytime someone says "Open Source it" They get modded pretty well.

      Good job.
    • by Rimbo ( 139781 )
      "Is there something more complicated here that I'm not seeing?"

      Massachusetts vs. Microsoft, q.v.
  • One Word: Google (Score:5, Insightful)

    by Nova Express ( 100383 ) <[moc.liamg] [ta] [nosrepecnerwal]> on Thursday December 29, 2005 @10:13PM (#14362343) Homepage Journal
    Really, either Internal or External. Take out anything that might injure National Security, then turn the rest over for Google to index. Hell, send a copy of everything to Google, for that matter; they've got room. Keep a record of searches and visits to documents by codeword and frequency and build index that way. Create a datasea, index it, and let citizens swim in it. As long as the e-mail is in at least a remotely standard format, what's the problem?

    (Note: Asserting a simple solution to a complex problem is the best way to elicit information, as it creates a burning desire in readers to prove you're wrong...)

    • Stick it all on one box, then install p2p software. Name all the files to song titles and it'll spread even faster. (Of course, the RIAA might go after John W. Doe...)
    • Asserting a simple solution to a complex problem is the best way to elicit information, as it creates a burning desire in readers to prove you're wrong...

      Except when you're right

      The Google Search Appliance
      http://www.google.com/enterprise/gsa [google.com]

      What it does

      The Google Search Appliance makes the sea of lost data on your web servers, file systems and relational databases instantly available with one mouse click. Just point it toward your content, add a search box to your site, and in a matter of hours, your u

    • Lately I've been wondering how great Google really is, and whether its deserving of the love I give it. Sure, I think the company Google is full of geniuses coming up with some of the best ideas since bread & butter.

      But then I ask myself how much time I've spent trying to find things online. I've been finding Google to be increasingly less useful. When was the last time you googled, looking for information, and found nothing related? When was the last time you had to rephrase your search query not once,
      • Depends what you're doing. Using google with the exact wording of an error message often gives the solution in the first match. It's still great for me. Spam is an ongoing battle but my searches usually don't result in much--just subject matter differences, I guess. I'm sure if you're looking for digital camera info it's kind of hard. But when the revolution comes and all spammers are lined up along a wall and shot, that problem will go away.
    • It has been said previously, but metadata

      I don't think Google is indexing metadata, and wouldn't it be just sneaky to have a plain "Hello Jane" type email carry a secret message in the metadata? Everything has to be kept and indexed.
    • Take out anything that might injure National Security, then turn the rest over for Google to index.

      Dear Sir,

      I write to inform you of my desire to acquire [REDACTED] in your country on behalf of [REDACTED] of the [REDACTED] in Nigeria. Considering his very strategic and influential position, he would want the [REDACTED]. He further wants [REDACTED], until [REDACTED]. Hence our desire to have [REDACTED].

      [28 LINES REDACTED FOR SECURITY PURPOSES]

      Your quick response will be highly [REDACTED]. Thank y
    • (Note: Asserting a simple solution to a complex problem is the best way to elicit information, as it creates a burning desire in readers to prove you're wrong...)

      You have achieved true enlightenment. Go forth my friend and enjoy nirvana.
  • by Architect_sasyr ( 938685 ) on Thursday December 29, 2005 @10:14PM (#14362353)
    Well, if the technology that uses the emails is exploding, surely the software/systems that archive the software are too.

    A couple of BSD boxes with some Oracle or similar should do it.
    • Logistically it would make sense to feed everything into a single type of database (it's ok to have separate ones for different things to keep the size down and the performance up as long as they are all the same kind). Database software gets updated and makes it easy to update the database to the new version. Even if Oracle goes out of business, you can bet that every company who continues will have a function to convert from an oracle database to grab customers. As long as they keep the database fairly
  • by Alcimedes ( 398213 ) on Thursday December 29, 2005 @10:15PM (#14362363)
    Really, rather than talking about how horrid it is, why not be busy working on software and hardware solutions that will bring old document types up to today's standards, and devices that will pull data off of old drives?

    I'm sure a universal data conversion tool would be worth a pile of money.
    • Really, rather than talking about how horrid it is, why not be busy working on software and hardware solutions that will bring old document types up to today's standards, and devices that will pull data off of old drives?

      Sounds more like a governance opportunity to me. The National Archives could spearhead the push to develop sophisticated open standards (OpenDocument doesn't satisfy all archival purposes) that all of government, and the public, could use.

      Of course, we are living in Bush-World(tm) - so any

  • Sounds like a job for everyone's favorite do-everything markup language, XML [w3.org]! Seriously, why isn't it used to structure everything?
    • Re:XML? (Score:3, Insightful)

      by grcumb ( 781340 )

      "Sounds like a job for everyone's favorite do-everything markup language, XML! Seriously, why isn't it used to structure everything?"

      Because it's not the right tool for every job. XML is explicitly a data interchange format. I've worked with material like this in the past, and I can tell you from experience that processing large volumes of XML (or any text-based markup format, for that matter) is extremely expensive in terms of processor and memory resource usage.

      That said, I agree that in this case XML

    • XML! Seriously, why isn't it used to structure everything?

      <?xml version="1.0"?>
      <bitmap>
      <title>Pathological Example</title>
      <format colors="color" bpp="24" />
      <generator>
      <software:software>
      <software:title>Slashdot XML Paint</software:title>
      <software:version>1.0</software:version>
      </software

    • That's like saying "Sounds like a job for everyone's favorite medium, paper!" (Or I suppose one could even argue that XML is more like wood pulp than paper in this comparison.)

      XML allows for the quick creation of data formats, but it doesn't magically make these data formats popular or parsable by actual programs - that's still a real issue. And even when they settle on an internal format, there's the question of getting existing data into that format, or exporting back into popular formats. It's not as eas
  • iPod Mod... (Score:4, Funny)

    by __aaclcg7560 ( 824291 ) on Thursday December 29, 2005 @10:24PM (#14362405)
    The article mentions playing eight-track tapes on an iPod. Does anyone have the link to that ultimate retro mod? Does it come with a Saturday Night Live dance cover?
  • I think we deserve to be told how many Library of Congresses that takes up!


  • Lockheed officials have recommended using a handful of widely accepted formats such as the popular Internet software language HTML. . .

    Those responsible have been sacked.

  • by rampant mac ( 561036 ) on Thursday December 29, 2005 @10:48PM (#14362493)
    "The National Archives [...] will have to handle roughly 100 million emails from the Bush White House, up from 32 million during the Clinton years"

    I'd love to read those emails, seeing as how we've gone from:

    From: bclinton@whitehouse.gov
    To: hclinton@whitehouse.giv
    CC: agore@whitehouse.gov; tgore@whitehouse.gov; monica04329@yahoo.com; ltripp@weightwatchers.com;
    Subject: omglol, you got to get me some of these!

    I want these for Christmas! http://www.big-fat-cigars.com/ [big-fat-cigars.com]



    To something along the lines of:

    From: gbushjr@whitehouse.gov
    To: dickc@whitehouse.giv
    CC: crice@whitehouse.gov; jbush@whitehouse.gov; lbush@whitehouse.gov; urnotapuppet@gmail.com; osamab@msn.com; cpowell@hotmail.com;
    Subject: Are they for real? Can we attack them too?

    Subject sayz it all, any toughts Dick? I think we can git `em.

    > DYKE BOURDER OIL SERVIES
    > OFFER FOR SALE OF NIGERIAN CRUDE OIL
    >
    > Dear Sir,
    >
    > I am President of blah blah blah...

  • by the eric conspiracy ( 20178 ) on Thursday December 29, 2005 @10:52PM (#14362508)

    rm -rf /

  • We've all had our "I gotta keep everything I do, download, see or hear in my records" moments, and sometimes they may last for years before we realize we don't need 99% of it anyway and will never never use it.

    Information is infinite; there's no end to the amount of information any one of us can produce. Storing everything is old school; new school recognizes that fact and stores only important information.

    What the government needs is to prioritize and save only the important stuff. Official bills and memos
    • by YrWrstNtmr ( 564987 ) on Thursday December 29, 2005 @11:17PM (#14362591)
      What the government needs is to prioritize and save only the important stuff. Official bills and memos are worth saving, the president asking his secretary for a cup of coffee isn't.

      Often, you don't know what's important until long after the fact. Storage space is so cheap and easy, it doesn't make sense to try to filter as it's happening. Inevitably, something important/crucial/worldchanging would get lost, resulting in cries of government censorship.

      And I'd say for a presidency...ALL of it is crucial.

      Random conversations, recorded by the secretary, then 'erased', have already caused one president to resign. What was in that erased 18 minutes? The NARA may actually find out [about.com].

      • Speaking as a trained archaeologist (and I'm not just saying that for effect), it would definitely be wrong to filter out the "unimportant" who-got-coffee when, because it makes a false judgment about what sort of information will be of interest to scholars of the future. There are all kinds of weird correlations possible, too -- "Presidential Coffee Breaks and the History of Global Commerce in the Post-Lewinsky Era," etc. One might want to study what lower-level White House bureaucrats did, too -- who kn
    • The government doesn't save everything forever. All records created by the federal government have their own retention schedules, which can range from a few weeks to forever. There are dozens of potential reasons for needing to access any given record, though, and they aren't all as obvious as one would think. An e-mail from the president asking for a cup of coffee might well have some value to a historian or biographer. Personal communications might have potential legal repercussions down the line, for wha
    • What the government needs is to prioritize and save only the important stuff. Official bills and memos are worth saving, the president asking his secretary for a cup of coffee isn't.

      That is an absolutely insane idea for government policy. We shouldn't decide what's important for the future - the future history writers decide that for us. Who is it that decides what is important? The public owns the government, and has the right to retain everything it does. Not storing evidence would mean that today's crim

      • "Not storing evidence would mean that today's criminals in government will escape future punishment or disrepute, and current heroes of government will not receive their dues or recognition."

        With the few replies supporting the same point of view as yours, I tend to agree.

        HOWEVER, I ask: honestly, do you think corrupted politicians freely use logged medium to exchange idea for stealing taxes/money from corrupted businesses?
        • HOWEVER, I ask: honestly, do you think corrupted politicians freely use logged medium to exchange idea for stealing taxes/money from corrupted businesses?

          No I don't, I think they are careful, and usually maintain several layers of coverup. However, they usually slip up somewhere (or an underling does). And they WILL communicate over logged mediums, because they need to give some sense of legitimacy. It will look funny if they have no logged transcripts during their years in office. And what they might thin

  • Format obsolescence (Score:5, Insightful)

    by StikyPad ( 445176 ) on Thursday December 29, 2005 @11:06PM (#14362553) Homepage
    There's no reason to keep 286s around to read WordStar documents. Just because formats are updated and revised doesn't mean the data needs to be stored as such. Save the text as ASCII, and the images as png or another lossless format. In the unlikely event that png is updated in a way that isn't backward compatible, convert the old files over to the newer format. Every few years, copy the data from old media to newer media. If done regularly (rather than, say, waiting until there are 500,000 floppies to make the leap to DVD-R), it won't be much of a chore. Sure it's a headache, but that's why they call it work.
  • Internet Archive (Score:3, Interesting)

    by arrrrg ( 902404 ) on Thursday December 29, 2005 @11:09PM (#14362564)
    If the Internet Archive [archive.org] can back up the entire internet every few months, I would think the National Archive could handle a few hundred million emails.
    • Re:Internet Archive (Score:3, Informative)

      by fiji ( 4544 ) *
      For some value of entire.

      TIA is pretty damn impressive, but they certainly don't get all of it.

      1: There is more to the internet than the web
      2: They don't do a lot of dynamic pages... so a lot of forums will probably be ignored (not that that necessarily loses anything useful ;-)
      3: They only get images if you request it
      4: Sites can request that they not be spidered (robots.txt)
      etc.

      -ben
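
Point 4 can be illustrated with the robots.txt parser in the Python standard library. The rules and the crawler name `example-archiver` below are hypothetical, chosen only to show how a site can exclude one crawler entirely while allowing others into most of the site.

```python
from urllib.robotparser import RobotFileParser

# A made-up robots.txt: block one (hypothetical) archiving crawler
# completely, and keep every other crawler out of /private/ only.
ROBOTS_TXT = """\
User-agent: example-archiver
Disallow: /

User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

blocked = parser.can_fetch("example-archiver", "http://example.com/index.html")
allowed = parser.can_fetch("other-bot", "http://example.com/index.html")
private = parser.can_fetch("other-bot", "http://example.com/private/a.html")

print(blocked, allowed, private)
```

A well-behaved archiver would skip everything on such a site, which is one reason a crawl-based archive can never be complete.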
      • Exactly.

        The best "internet backup" is all the stuff that we rat-packers save and someday recall again... :)

        I recently reconstructed a vanquished web page thru TIA, local saved pages, and various googled caches. It was rather an enlightening experience. One application I can see for the future of the internet is distributed user archive programs such as the TIA is, but with many, many more machines. Google is really kind of a baby step towards the infrastructure needed to have a collective d
  • ASCII Text (Score:3, Insightful)

    by Spazmania ( 174582 ) on Thursday December 29, 2005 @11:11PM (#14362569) Homepage
    electronic documents created today may not be legible on tomorrow's devices

    ASCII text has been around for decades and oh by the way Internet-formatted email is 100% representable as ASCII text since that's how it's still transferred today.

    This supposed problem is a real problem only for those with Exchange, Domino or Groupwise, which create email in custom, internal formats.
    • I wonder who is stupid enough to have moderated the parent as insightful?

      What if the email contains an attachment which is in a format that you can't read?
      Sure it is encoded as ASCII but it doesn't help..

      • What if the email is encrypted? What if the author wrote using code-words that only the recipient could understand? Answer: Don't worry about it. You take the common word-processor formats and store a second copy as ASCII text. You take the common spreadsheet formats and store a copy as ASCII .csv files. Pick formats for graphics, audio and video where the primary criterion is its current ubiquity and store second copies of those too. Everything more obscure you simply store as is and don't sweat it. Tomorr
    • Internet-formatted email is 100% representable as ascii text

      I often get two paragraph messages from people as a Word doc attachment. And yes, it actually is sent as a Base-64 encoded (ASCII) segment of the message, but I don't think that's very helpful. Personally I filter most of my email to plain text, so messages that weighed in at 50k or more come down to 1k. But an archivist doesn't have the freedom to simplify the format.
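
The Base64 point can be seen directly with Python's standard `email` package. In this sketch the addresses and the "document" are stand-ins (the attachment is just arbitrary binary bytes labeled as a Word file); it shows that the message on the wire is pure ASCII even though the payload is not human-readable without decoding and the right application.

```python
from email.message import EmailMessage

msg = EmailMessage()
msg["From"] = "sender@example.com"
msg["To"] = "archivist@example.com"
msg["Subject"] = "Two paragraphs, as an attachment"
msg.set_content("See attached.")

# Stand-in for a binary word-processor file; not a real document.
fake_doc = bytes(range(256)) * 4
msg.add_attachment(
    fake_doc, maintype="application", subtype="msword", filename="note.doc"
)

wire = msg.as_bytes()                      # what actually gets transferred
attachment = next(msg.iter_attachments())  # the single attached part

wire_is_ascii = wire.isascii()
encoding = attachment["Content-Transfer-Encoding"]
round_trips = attachment.get_payload(decode=True) == fake_doc

print(wire_is_ascii, encoding, round_trips)
```

So "it's all ASCII" is true at the transport level, but the archivist still has to be able to interpret whatever the decoded bytes turn out to be.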

      • an archivist doesn't have the freedom to simplify the format.

        Sure he does. He shouldn't discard the original, but he has complete freedom to store an additional copy or copies in simplified and standardized formats. And given your comments on the relative size of the alternate versions, such additions would be relatively cheap.
        • Sure he does. He shouldn't discard the original, but he has complete freedom to store an additional copy

          Storing an additional version isn't "simplification". I was talking about "instead of", which is what I do for my own archives.

          • Storing an additional version isn't "simplification".

            Of course it is. You're giving the eventual retriever an easily read version he can get to conveniently as well as the original that he can dig in to if needed. Unless you can guarantee that the converted version is perfectly equivalent to the original (and you usually can't) you have to save the original anyway.

            What you do for your own personal archives is, of course, a very different question. You may well find that the space savings justifies the loss
    • Yep except what about other documents? Things like CAD drawings, spreadsheets, and documents stored in... WORD!.
      This is why the government shouldn't allow the use of proprietary formats for any internal documents. Of course a lot of things sent in email would have never been documented before because they would have just been conversations. Things like what do you want to do for lunch today and how about them Redskins... Man if political correctness keeps going the way that it has been in 100 years some poor
      • Yep except what about other documents?

        For easy stuff like word processor and spreadsheet documents, simply store a second copy that has been downconverted to ASCII. For more obscure stuff (like CAD drawings) don't worry about it. An archivist can't possibly deal with every obscure data format out there, and clever hackers that can reverse engineer a format are relatively easy to come by should the contents of the document ever be desired. All you have to do is get them the bits and bytes on a media that the
  • Official History? (Score:4, Insightful)

    by datafr0g ( 831498 ) <datafrog&gmail,com> on Thursday December 29, 2005 @11:26PM (#14362624) Homepage
    The National Archives, entrusted to preserve America's official history...

    The official history? As opposed to what - the unofficial history? Or should it be worded differently: The National Archives, entrusted to preserve America's official government records...

    Don't mean to sound nit-picky but when I first read that, a million conspiracy theories raced through my mind! :)
  • MySQL? (Score:2, Insightful)

    by jacklexbox ( 912121 )
    Please correct me if I am wrong, as I probably am, but would like to have this explained to me. Why couldn't all the emails be stored as plain text in a MySQL database with either a web interface (php?) or an application written in an interpreted language (Java or Ruby)? Does that make sense? Is there something I am missing?
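
    For what it's worth, the plain-text-in-a-database idea is easy to sketch. The snippet below is only an illustration: it uses Python's built-in sqlite3 as a stand-in for MySQL so it runs without a server, and the table name, column names, and sample addresses are all hypothetical.

```python
import sqlite3

# In-memory SQLite database as a stand-in for MySQL (assumption for the sketch).
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE emails (
        id        INTEGER PRIMARY KEY,
        sender    TEXT NOT NULL,
        recipient TEXT NOT NULL,
        sent_at   TEXT NOT NULL,   -- ISO 8601 timestamp kept as text
        subject   TEXT,
        body      TEXT
    )
    """
)


def store_email(conn, sender, recipient, sent_at, subject, body):
    """Insert one plain-text message and return its row id."""
    cur = conn.execute(
        "INSERT INTO emails (sender, recipient, sent_at, subject, body) "
        "VALUES (?, ?, ?, ?, ?)",
        (sender, recipient, sent_at, subject, body),
    )
    conn.commit()
    return cur.lastrowid


def search_subject(conn, term):
    """Return (sender, recipient, sent_at, subject) rows whose subject contains term."""
    rows = conn.execute(
        "SELECT sender, recipient, sent_at, subject FROM emails "
        "WHERE subject LIKE ? ORDER BY sent_at",
        (f"%{term}%",),
    )
    return rows.fetchall()


first_id = store_email(
    conn,
    "alice@example.gov",
    "bob@example.gov",
    "2005-12-29T22:09:00",
    "Lunch today?",
    "Noon works for me.",
)
matches = search_subject(conn, "Lunch")
print(first_id, matches)
```

    Note what the sketch leaves out: attachments, full headers, and any record of who was behind each address. A flat table of text bodies covers storage and search, nothing more.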
  • Old solutions could be helpful here... Not because acid-free paper will last for centuries (the volume of paper would be staggering, and you can't grep dead trees), but because they provide methods that could be applied.

    Take mercury delay lines. They kept data by continuously sending sound impulses inside a tank filled with mercury, and the impulses were recycled through to refresh the storage.

    Well, this could be done with a *HUGE* disk array, where you add drives to increase storage, and "retire" broken or obsol

  • by toupsie ( 88295 ) on Thursday December 29, 2005 @11:37PM (#14362661) Homepage
    Monks have done an amazing job preserving important documents over the years. In fact, Xerox worked with Brother Dominic [tvacres.com] in the field of document preservation. Print out all the e-mails on archive quality paper and store them underground. Be sure they are also translated into Spanish so future Americans will be able to read them.
  • Well, if they would just run their mail through SpamAssassin it should make the problem far more manageable...
  • by maggard ( 5579 ) <michael@michaelmaggard.com> on Friday December 30, 2005 @02:23AM (#14363261) Homepage Journal
    Turning emails into text files, all graphics attachments into PNGs, etc. isn't the issue.

    How all of this stuff is connected, who it came from, when it was sent, all of that is something Historians (or Special Prosecutors) will need to know. Email from "aa204@whitehouse.gov" to "mikhail@kremvax [wikipedia.org].su" subject "Plans for Wall" isn't particularly useful if we don't have any way of tracking who aa204 was or knowing it was composed on Nov. 9, 1989 but not actually sent until Nov. 10, 1989.

    Face it, most email systems are complex special-purpose systems made up of huge webs of interdependencies, from their hardware to their OS to their various applications. Imagine trying to pull emails, address books, mailing lists, undelivereds, calendars, attachments, cc's, bcc's, forwarded-forwarded-forwarded records etc. from a mass of DEC All-In-1 systems, IBM PROFS, MS Exchange v.anything, and the /.-popular mbox/maildir/postfix/cyrus/exim/sendmail/dovecot/ldap/etc. environments...

    Now figure out some reasonably stable format to save 'em all in where they can be referenced, cross-referenced, timelines produced, who-knew-what-when deduced, identities tracked, policy propagation studied, etc. That's not the territory of thousands of text files, or PNGs, it's a data-miner's nightmare and what the Nat'l Archives are facing.

    So please, stop being quick-to-the-keyboards "Well d'uh" /.-trollers and assume that some reasonably clever and knowledgeable folks have already considered the problem and are appalled at its complexity. Yes, there are possibly some even more clever & knowledgeable folks who read /. but the text-&-png crowd is just so much wasted bits.

    At least the big-database folks are probably closer to what is going to be required, and anyone who is starting to think that mebbe proprietary undocumented databases cost us all more in the long-term than they're worth are even more (IMHO) on the right track...
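
    To make the metadata point concrete, here is a rough sketch using Python's standard email package. The message is invented around the example above: the relay host names, Message-ID, and timestamps are hypothetical, and it assumes the Date header marks roughly when the author finished the message while each Received header marks when a relay handled it.

```python
from email import message_from_string
from email.utils import getaddresses, parsedate_to_datetime

# Hypothetical raw message built around the example in this comment.
RAW = """\
Received: from relay.whitehouse.gov by mx.kremvax.su; Fri, 10 Nov 1989 08:15:00 +0000
From: aa204@whitehouse.gov
To: mikhail@kremvax.su
Date: Thu, 09 Nov 1989 23:40:00 +0000
Message-ID: <0001@whitehouse.gov>
Subject: Plans for Wall

Body text goes here.
"""


def addresses(msg, header):
    """Return the bare addresses listed in one header (empty list if absent)."""
    return [addr for _, addr in getaddresses(msg.get_all(header, []))]


def extract_metadata(raw):
    """Pull out the fields a later researcher would need to cross-reference."""
    msg = message_from_string(raw)
    hops = msg.get_all("Received", [])
    # The timestamp of a Received header follows its last semicolon.
    last_hop = parsedate_to_datetime(hops[0].rsplit(";", 1)[1].strip()) if hops else None
    return {
        "message_id": msg["Message-ID"],
        "from": addresses(msg, "From"),
        "to": addresses(msg, "To"),
        "cc": addresses(msg, "Cc"),
        "subject": msg["Subject"],
        "composed": parsedate_to_datetime(msg["Date"]),
        "last_hop": last_hop,
    }


meta = extract_metadata(RAW)
print(meta["from"], meta["composed"].date(), meta["last_hop"].date())
```

    Even this only recovers what the message itself carries. Who aa204 actually was lives in some directory or personnel system, which would have to be archived and linked separately.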

    • A-freakin-men!

      Seems like there is about a 100:1 'clueless:understand' post ratio here.

      Converting the body of an email or document (word, pdf, excel, powerpoint, html, whatever) is trivial. Maintaining all of the metadata associated with the document/email is not. Maintaining the original context is not trivial. Let's not forget that something like highlighting, font color, underlining, bold face, or italics within a message may have meaning - if you convert to all ASCII, the formatting and the meaning tha
  • They should experience how the latest version of Microsoft Office can help them better manage documents, organize workload, and collaborate with coworkers--not just from their desk, but from almost anywhere! Why? So that their system will deliver the features, options, and performance they need to maximize their productivity and enjoyment, to insure that their software is authentic, properly licensed and supported by Microsoft or a trusted partner, so that they will get access to updates, enhancements, and
  • Print it all out using stable inks on acid-free paper.
    - This will give the librarians something to do, and will be immune to technology going obsolete ;-)
  • I have to say the biggest problem they face is the fact that the entire US Government is not on one standard for electronic documents. NARA uses GroupWise for its e-mail. Other agencies use Exchange/Outlook. Some agencies still use text mode e-mail on a mainframe or UNIX box. People I speak with in the Navy tell me that the whole Navy uses a bunch of different formats for everything from e-mail to word processing documents.

    The government is only recently adopting PDF files, because PDFs before version
  • I've never seen a more compelling argument for OpenDocument. (and/or a conversion requirement to OpenDocument.)
