
Ask Slashdot: What Is the Best Open Document Format?

Posted by timothy on Thursday May 14, 2015 @12:45PM from the when-plaintext-just-won't-do dept.

kramer2718 writes: I am working on a project that requires uploading and storing of documents. Although the application will need to allow uploading of .docx, doc, .pdf, etc, I'd like to store the documents in a standard open format that will allow easy search, compression, rendering, etc. Which open document format is the best? Since "best" can be highly driven by circumstances, please explain your reasoning, too.


This discussion has been archived. No new comments can be posted.


can't you search the current doc types? (Score:4, Informative)

by alen ( 225700 ) writes: on Thursday May 14, 2015 @12:57PM (#49690583)

if you use the APIs supplied by their creators?

- - Re: (Score:2)
    
    by Vitus Wagner ( 5911 ) writes:
    
    docx is just a ZIP archive containing XML files. And as far as I remember, the schemas are published somewhere (although the format description runs to several thousand pages).
    - Re: (Score:2)
      
      by sabbede ( 2678435 ) writes:
      
      This? http://www.ecma-international.... [ecma-international.org]
PDF/A (Score:5, Informative)

by thechemic ( 1329333 ) writes: on Thursday May 14, 2015 @01:00PM (#49690629)

http://www.pdfa.org/2011/08/pd... [pdfa.org]

- Re: (Score:2)
  
  by ray-auch ( 454705 ) writes:
  
  +2
  Hundreds of people who do this for a living (they're called records managers), and have done so for many years, have worked long and hard to come up with a standard format for exactly this. It doesn't do everything (but what does?), but it does ensure that it will still do it in 50 years, if not longer.
  Caveat 1: OP doesn't mention editing. If he needs the documents editable, then don't convert, or store the original plus a PDF rendition for preservation.
  Caveat 2: There is a trade off between doc size (OP mention compression) and digita
Don't convert needlessly (Score:5, Insightful)

by PSVMOrnot ( 885854 ) writes: on Thursday May 14, 2015 @01:01PM (#49690637)

I would suggest, unless you have a pressing need to convert them, that you should store the documents in the formats they are uploaded in.
Whenever you convert a document you run the risk of completely messing up the layout, style, etc.

- Re: (Score:1)
  
  by Anonymous Coward writes:
  
  There's storing for later download, and then there's storing for ongoing analysis, indexing, previews, etc. For the latter, it would help a lot to have one standard format. Probably plain text.
  To properly analyze .doc / .docx, for instance, you'll probably need a Windows machine with Word installed. It will likely be significantly cheaper to have Word installed on only one or two machines, convert to text (capturing any necessary metadata on the way), and then do further processing on other machines that do
  - Re: (Score:3, Interesting)
    
    by darkain ( 749283 ) writes:
    
    All of the "X" variants of MS Office documents stand for "XML" - that is, the documents are stored in a series of XML files inside of a ZIP file that is renamed to formatX (docx, xlsx, etc). There is no real need to even have Windows or Office installed to index these documents. Just write up a basic script to extract the ZIP file and parse out the related XML documents. Note: this isn't as trivial as it sounds at first, though. This would assume that Microsoft's XML structures (yes, plural), had an easy to
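To make the unzip-and-parse idea concrete, here is a minimal Python sketch using only the standard library. It assumes the main body text lives in `word/document.xml` and that text runs are `w:t` elements inside `w:p` paragraphs, which holds for ordinary WordprocessingML files. The function names `docx_paragraphs` and `make_sample_docx` are made up for illustration, and the sample archive is a stand-in, not a complete .docx. Headers, footnotes, tabs, and embedded objects are ignored, so treat this as a starting point rather than a full indexer.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# Namespace used by WordprocessingML elements in word/document.xml
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def docx_paragraphs(source):
    """Return the text of each paragraph in a .docx (path or file-like object)."""
    with zipfile.ZipFile(source) as archive:
        xml_bytes = archive.read("word/document.xml")
    root = ET.fromstring(xml_bytes)
    paragraphs = []
    for para in root.iter(f"{{{W_NS}}}p"):
        # Concatenate every text run (w:t) found inside this paragraph
        runs = [node.text or "" for node in para.iter(f"{{{W_NS}}}t")]
        paragraphs.append("".join(runs))
    return paragraphs


def make_sample_docx():
    """Build a tiny in-memory ZIP holding only word/document.xml (hypothetical test input)."""
    body = (
        f'<w:document xmlns:w="{W_NS}"><w:body>'
        "<w:p><w:r><w:t>Hello </w:t></w:r><w:r><w:t>world</w:t></w:r></w:p>"
        "<w:p><w:r><w:t>Second paragraph</w:t></w:r></w:p>"
        "</w:body></w:document>"
    )
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as archive:
        archive.writestr("word/document.xml", body)
    buf.seek(0)
    return buf


if __name__ == "__main__":
    for line in docx_paragraphs(make_sample_docx()):
        print(line)
```

The same pattern would apply to other parts of the archive, though each Office format uses its own element names and namespaces.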
    - Frequently not "doable" (Score:2)
      
      by dbIII ( 701233 ) writes:
      
      Some of those files are just an XML wrapper around a binary format for which the documentation is not available outside of Microsoft. The wrapper meets the legal obligations but the file format in such cases is ultimately useless in the long term.
      Meanwhile I can import seismic data from the early 1970s into current software without any conversion - simply because the file format is documented instead of Microsoft's later step backwards.
      - Re: (Score:2)
        
        by RockDoctor ( 15477 ) writes:
        
        Meanwhile I can import seismic data from the early 1970s into current software without any conversion -
        
        Strange, but that is exactly what I am doing at the moment. Or at least, I was doing until a few moments ago, when the task finished. Hi ho! back to the grindstone!
- Re:Don't convert needlessly (Score:5, Interesting)
  
  by Anonymous Coward writes: on Thursday May 14, 2015 @01:25PM (#49690933)
  
  Or store both the original, and a standardized format. The place I work stores everything from engineering drawings, meeting minutes, purchase records, to manuals of old equipment in a central document library. It retains the original file, and makes a pdf of every file, and a link to both is listed in each entry. We've already had some older CAD formats no longer supported by current software we have easy access to, but the old pdfs are still readable and it is cheap enough to find some intern to re-create the document from the pdf if need be.
  
  - Re:Don't convert needlessly (Score:5, Interesting)
    
    by AthanasiusKircher ( 1333179 ) writes: on Thursday May 14, 2015 @01:46PM (#49691251)
    
    Or store both the original, and a standardized format. The place I work stores everything from engineering drawings, meeting minutes, purchase records, to manuals of old equipment in a central document library. It retains the original file, and makes a pdf of every file, and a link to both is listed in each entry.
    THIS.
    PDFs (or some similar standard) will ensure that the original documents can be read by everyone and viewed with the original formatting intended by the person creating them. Any differences in the version of Word or whatever is going to tweak the formatting in unpredictable ways.
    But the originals should always be retained, since it may make future editing easier. And people also won't be stuck trying to undo whatever unpredictable reformatting or editing (e.g., loss of certain features moving between formats) might go on in your conversion process.
    
    - - Re: (Score:2)
        
        by ray-auch ( 454705 ) writes:
        
        Really? You should tell AIIM, the Library of Congress, etc. They've all been doing it wrong for years with PDF/A.
- Re: (Score:2)
  
  by mlts ( 1038732 ) writes:
  
  Programs that can import Word/Excel/etc. documents do a good job and get about 99% of it right. However, the one percent that is missed can do quite a number on a document.
  The answer for a document format... depends.
  For a document format that keeps formatting exactly, and isn't intended to be edited, PDF/A is the best thing going, since barring a major world-ending disaster, we will still have utilities that can read PDFs, and PDF/A ensures that the fonts and such are present and readable.
  For a document
.txt (Score:5, Interesting)

by Anonymous Coward writes: on Thursday May 14, 2015 @01:01PM (#49690639)

.txt. If you need pretty formatting, fill it with LaTeX tags.

- Re:.txt (Score:5, Insightful)
  
  by jythie ( 914043 ) writes: on Thursday May 14, 2015 @01:05PM (#49690709)
  
  And here I am without mod points...
  
  Generally when I have to worry about integration or longevity, it is still hard to compete with ASCII & LaTeX. While they do not have the everyday visibility of various office document types or PDFs, renderers and search tools always know exactly what to do with them. They can even interact with version control systems cleanly, since the underlying tools do not need to know anything about the formatting to manipulate it.
  
- Re: (Score:2)
  
  by captnjohnny1618 ( 3954863 ) writes:
  
  I love this answer... but sadly people aren't willing to learn things like latex. Even in academia (medical physics) many smart people refuse to learn technologies like latex.
  
  And, despite my agreement (I was actually going to post a similar answer if someone hadn't), there are times when I don't want to bother with the latex overhead for quick documents. Am I doing it wrong? ;-)
- Re: (Score:2)
  
  by account_deleted ( 4530225 ) writes:
  
  Comment removed based on user account deletion
- Re: (Score:2)
  
  by vtcodger ( 957785 ) writes:
  
  Pretty much my thought. Use the simplest format that will do the job. If it's just prose, use txt. Does anyone seriously believe that One Day in the Life of Ivan Denisovich is somehow enhanced by saving it as .doc or .pdf or .htm or god knows what else? If the text needs some bold and italics, use .txt with markdown. If it needs lots of markup, then something more elaborate -- preferably something with standards and a DTD or equivalent indicating what standard applies. If there are flat tables, use c
  - Re: .txt (Score:3)
    
    by billDCat ( 448249 ) writes:
    
    Perhaps fine for Roman characters, not so fine if the document contains Kanji, Hiragana, Katakana, Hebrew, or any of the other character sets that don't play nice with "plain text" formats. For something that you would think would be pretty straightforward, plain text character handling is surprisingly maddening to work with.
    - Re: (Score:2)
      
      by vtcodger ( 957785 ) writes:
      
      Yes, text handling for non-ASCII characters can be surprisingly maddening to work with. (Wasn't UTF-8 supposed to fix that?) The problem is that wrapping txt in some more elaborate format like HTML often doesn't make the problem go away. With apologies to Jamie Zawinski, it just means that now you have two problems.
- - Re:.txt (Score:4, Insightful)
    
    by TechyImmigrant ( 175943 ) writes: on Thursday May 14, 2015 @01:14PM (#49690799) Homepage Journal
    
    How is it impractical?
    
    - Re: (Score:3)
      
      by ShanghaiBill ( 739463 ) writes:
      
      How is it impractical?
      It is impractical because the average end user will have no idea what to do with a .txt file containing Latex markup. It will look like gibberish. Txt files also have no clickable table of contents, or index, or hyperlinks to other documents.
      - Re: (Score:2)
        
        by account_deleted ( 4530225 ) writes:
        
        Comment removed based on user account deletion
      - Stupid file extension tricks (Score:2)
        
        by tepples ( 727027 ) writes:
        
        Heck, you could save your LaTeX files with a .tex extension and associate that with a script that invokes a TeX to PDF renderer followed by your preferred PDF reader.
        
        Re: (Score:2)
        
        by TechyImmigrant ( 175943 ) writes:
        
        With latex stored, you can render to anything when the user requests it. odf, pdf, docx, bitmap, tex, GIF. Take your pick.
        Push the techy stuff on the developer to make the user tasks no brainers.
    - - Re: (Score:3)
        
        by TechyImmigrant ( 175943 ) writes:
        
        Does latex support MS Office? If not then it's very impractical for 95%+ of users.
        If I had this question, I would google before asking Slashdot and exposing my ignorance. There are lots of tools.
      - LaTeX CoNDoM (Score:4, Funny)
        
        by tepples ( 727027 ) writes: <tepples@gmail. c o m> on Thursday May 14, 2015 @02:41PM (#49691771) Homepage Journal
        
        LaTeX is the CoNDoM that protects you from Microsoft Office ViRuSeS.
        
      - Comment removed (Score:5, Insightful)
        
        by account_deleted ( 4530225 ) writes: on Thursday May 14, 2015 @04:51PM (#49693081)
        
        Comment removed based on user account deletion
        
    - - Re:.txt (Score:4, Insightful)
        
        by ClickOnThis ( 137803 ) writes: on Thursday May 14, 2015 @03:59PM (#49692519) Journal
        
        If you are in publishing - like in writing or editing books - you need MS Word.
        Well, use whatever you want to write the book. But if you are printing it, I'd definitely use something other than MS Word. It just doesn't produce publication-quality documents.
        And as far as Latex is concerned, it would be even more work.
        That depends on what you are writing. If your document contains lots of equations and you're using MS Word, then God help you.
        Latex and other formats are great if you are in complete control from start to finish of the publishing process. But working with other people that are scattered all over the World? Nope.
        Again, that depends on who the other people are. Many academics, particularly scientists, use LaTeX.
        
        
        Re: (Score:2)
        
        by TechyImmigrant ( 175943 ) writes:
        
        The Springer publications via the IACR prefer submissions to be in Latex.
      - Re: (Score:2)
        
        by account_deleted ( 4530225 ) writes:
        
        Comment removed based on user account deletion
        
        Re: (Score:2)
        
        by TechyImmigrant ( 175943 ) writes:
        
        MS Word is still a complete mess when it comes to numbering sections and lists. This makes it unusable for writing technical books.
        Framemaker had a simple and powerful format for describing numbering sequences. It worked well. I haven't used it for a few years.
        LaTeX obviously gets it right. Why wouldn't it?
        
        You don't want it to (Score:2)
        
        by dbIII ( 701233 ) writes:
        
        The desktop publishing software I used on an Atari ST back in the day is far better suited to the task than even the current MS Word. While Word has gone halfway to being DTP software, the real thing does a few things differently, and those differences avoid the massive time sink you get if you try to treat MS Word like DTP software.
    - - Re: (Score:2)
        
        by TechyImmigrant ( 175943 ) writes:
        
        If you want it nice and clean, pay a curator.
        People will upload crap. Your converters will introduce crap.
    - - Re: (Score:2)
        
        by TechyImmigrant ( 175943 ) writes:
        
        I only see a goat.
- - Re:.txt (Score:4, Informative)
    
    by Desler ( 1608317 ) writes: on Thursday May 14, 2015 @02:10PM (#49691453)
    
    Then you end up with Microsoft inserting garbage characters at the start of each text file to make their job easier, breaking scripts and confusing both users and other editors alike.
    It's not a garbage character. It's a BOM [wikipedia.org] and it's part of the Unicode standard. If your scripts and text editors can't read the BOM in 2015 then they are the things that are horribly broken.
    
    - Re:.txt (Score:5, Insightful)
      
      by Yaztromo ( 655250 ) writes: on Thursday May 14, 2015 @04:52PM (#49693087) Homepage Journal
      
      It's not a garbage character. It's a BOM [wikipedia.org] and it's part of the Unicode standard. If your scripts and text editors can't read the BOM in 2015 then they are the things that are horribly broken.
      This is one of those sticky situations. For UTF-8, the Unicode standard discourages the use of a BOM, unless you're converting from a different Unicode format that requires a BOM. The whole purpose of a BOM is to describe the byte order used to generate the file data, however UTF-8 data is broken up into 8-bit code units, and thus endianness doesn't play a role. You simply read the stream one byte at a time.
      Indeed, using a BOM is discouraged (by both the Unicode standard and the IETF) precisely because it breaks backward compatibility with ASCII text processors. Unfortunately, Microsoft seems intent on adding an unnecessary (and, in the case of UTF-8, badly named) BOM to virtually every UTF-8 file created on their platform. This is done to make it easier for them to detect the encoding; however there are reliable, published heuristics which do the same job without the need for the BOM. That's what every other platform in existence does to detect UTF-8 streams. Microsoft's BOM use is purely to make their processing easier, even if it means that it breaks backward compatibility with older tools.
      Thus, you are technically both correct. It's technically not a garbage character at the beginning of the stream, however it is unnecessary, and contrary to the way every other OS on the planet handles the situation.
      (I've run into this more than once in my professional life, dealing with people who are supposed to be technically minded who use Windows Notepad to try to figure out what encoding a file is using. I've had them come back claiming my files weren't UTF-8 because Notepad claimed they were 'ANSI' (never mind that there is no character encoding standard called 'ANSI' in the first place). I've had to explain to more than one person that standard ASCII is valid UTF-8, even going so far as to provide them chapter and verse of the Unicode specs to prove that what Notepad says shouldn't be treated as gospel.)
      Yaz
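For readers who want to see the BOM issue in bytes, here is a small Python sketch. It shows three things from the discussion above: the UTF-8 BOM is the three bytes EF BB BF, pure ASCII decodes as valid UTF-8, and a strict decode attempt is one simple way to guess whether a stream is UTF-8 without relying on a BOM. The helper names `strip_utf8_bom` and `is_valid_utf8` are invented for this example, and the strict-decode check is only a basic heuristic, not the specific method any particular platform uses.

```python
import codecs


def strip_utf8_bom(data: bytes) -> bytes:
    """Remove a leading UTF-8 BOM (bytes EF BB BF) if one is present."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data


def is_valid_utf8(data: bytes) -> bool:
    """Heuristic: treat data as UTF-8 if it decodes strictly without errors."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


with_bom = codecs.BOM_UTF8 + "plain text".encode("utf-8")
ascii_only = b"plain ASCII text"

if __name__ == "__main__":
    # The plain "utf-8" codec keeps the BOM as U+FEFF at the start of the string
    print(repr(with_bom.decode("utf-8")[0]))
    # The "utf-8-sig" codec drops a leading BOM if present
    print(with_bom.decode("utf-8-sig"))
    print(is_valid_utf8(ascii_only))
```

A tool that must accept files from Windows editors can decode with `utf-8-sig`, which reads input with or without the BOM.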
      
      - Re: (Score:2)
        
        by Lunix Nutcase ( 1092239 ) writes:
        
        Thus, you are technically both correct. It's technically not a garbage character at the beginning of the stream, however it is unnecessary, and contrary to the way every other OS on the planet handles the situation.
        And yet I use a multitude of text editors and have scripts that can handle UTF-8 text files with a BOM just fine. Your programs and scripts are broken if they can't.
        
        Re: (Score:3)
        
        by Yaztromo ( 655250 ) writes:
        
        And yet I use a multitude of text editors and have scripts that can handle UTF-8 text files with a BOM just fine. Your programs and scripts are broken if they can't.
        Or they're legacy tools. There are a large number of such tools out there that do various jobs, where having an unnecessary BOM is a liability.
        If you're compiling for some legacy embedded hardware, for example, I have little doubt that its compiler would choke on BOM characters, and you may not have access to the source to fix it. And just because YOU don't need or use such tools hardly means that nobody out there does.
        Yaz
The only open one - ODF (Score:1)

by Anonymous Coward writes:

Either you truly wish to use something "open", at which point the only choice is ODF, or you simply want something that can be widely used. If the latter is the case, ODF is still good, perhaps PDF.
- Re: The only open one - ODF (Score:2)
  
  by LostMyBeaver ( 1226054 ) writes:
  
  Uh... Did you miss the question altogether?
PDF retains the layout (Score:2)

by jones_supa ( 887896 ) writes:

I am working on a project that requires uploading and storing of documents. Although the application will need to allow uploading of .docx, doc, .pdf, etc, I'd like to store the documents in a standard open format that will allow easy search, compression, rendering, etc. Which open document format is the best?
PDF allows accurate rendering, so it's the best choice. It will be a hot mess if you use anything else. Converting between such complex formats is very prone to layout errors.
- Re: (Score:2)
  
  by ChunderDownunder ( 709234 ) writes:
  
  PDF is a print format, which is fine if your audience is going to print it out on a piece of A4 paper - though I think yanks have their own standard. :)
  But PDFs don't generally reflow [wikipedia.org]. Viewing a portrait-formatted document on a landscape monitor, reading journal articles with multiple columns, or reading on a 4" smartphone are all challenges onscreen.
  - Re: (Score:2)
    
    by ray-auch ( 454705 ) writes:
    
    PDF/A is the ISO standard format for document archiving, as well as printing.
Forget the Universal Format crap (Score:5, Informative)

by xxxJonBoyxxx ( 565205 ) writes: on Thursday May 14, 2015 @01:05PM (#49690705)

1) Forget the Universal Format approach - your users will kill you for messing up their formatting, and you'll never get complete feature parity
2) Store the docs in their original format
3) Get Apache Solr to search your content
4) You'll be spending a lot of time on #3, so leave time to tinker

- Re: (Score:2)
  
  by omnichad ( 1198475 ) writes:
  
  I'll second everything but step 3. I would get some standard libraries set up to extract the plain text from each format and make that searchable. Probably much simpler.
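As a rough idea of what "extract the plain text and make that searchable" could look like at its simplest, here is a toy inverted index in Python. It assumes the text has already been extracted from each stored file by some per-format library. The document names and the functions `tokenize`, `build_index`, and `search` are hypothetical, and a real deployment would also need stemming, ranking, and persistence.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def build_index(documents):
    """Map each token to the set of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for token in tokenize(text):
            index[token].add(doc_id)
    return index


def search(index, query):
    """Return the IDs of documents containing every token in the query."""
    tokens = tokenize(query)
    if not tokens:
        return set()
    results = set(index.get(tokens[0], set()))
    for token in tokens[1:]:
        results &= index.get(token, set())
    return results


# Hypothetical extracted text, keyed by the name of the original stored file
extracted = {
    "minutes.docx": "Meeting minutes for the engineering review",
    "manual.pdf": "Operating manual for old equipment",
    "order.doc": "Purchase order for engineering equipment",
}

if __name__ == "__main__":
    idx = build_index(extracted)
    print(sorted(search(idx, "engineering equipment")))
```

The originals stay untouched on disk; only the extracted text feeds the index, so a bad conversion costs search quality rather than the document itself.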
  - Re: (Score:2)
    
    by Nadir ( 805 ) writes:
    
    Why? Apache Solr understands various types of office documents via Apache POI. No need to get "standard" libraries, whatever you mean by that.
    - Re: (Score:2)
      
      by omnichad ( 1198475 ) writes:
      
      Because it's most likely overkill if all you want is indexing.
- Re: (Score:2, Informative)
  
  by Anonymous Coward writes:
  
  I work at a typography, and I get a lot of documents from a lot of different people. Those "documents" come as MSWord files with missing fonts, pdfs made with some shoddy software, strange ODTs, many more different types of doc files, the mysterious lnk files that work perfectly fine for them, but not for anyone else and my personal favorites, jpg files (not png or some other lossless format, because that would imply actual thinking).
  Strange enough, I've yet to receive any plain text files.
  To index everythi
Oldes are the bestes (Score:5, Funny)

by Anonymous Coward writes: on Thursday May 14, 2015 @01:06PM (#49690715)

WordPerfect Document, because it's been consistent for nearly 20 years, it has a simple underlying format, it's more finely granular than HTML, and because I just like obsolete things.

- Re: (Score:2)
  
  by jfdavis668 ( 1414919 ) writes:
  
  Or try WordStar. That will never go out of support.
  - Re:Oldes are the bestes (Score:4, Funny)
    
    by Megane ( 129182 ) writes: on Thursday May 14, 2015 @02:59PM (#49691967)
    
    I think WordStar has problems with Unicode support. But then again, so does Slashdot.
    
  - - Re: (Score:2)
      
      by jfdavis668 ( 1414919 ) writes:
      
      I'll have to invite you to the next wedding.
Coding approach (Score:2)

by Sigma 7 ( 266129 ) writes:

I'd like to store the documents in a standard open format that will allow easy search, compression, rendering, etc. Which open document format is the best?
Are you writing the search/compression/render capability from scratch, or are you using a library to handle that job for you?
If you're handling more than one document type, then go for a library. I don't have a recommendation myself, but I'm sure you can find them on a search.
Also, don't worry about compression, as modern .odf/.docx is already compressed
Need more information (Score:5, Insightful)

by nine-times ( 778537 ) writes: <nine.times@gmail.com> on Thursday May 14, 2015 @01:10PM (#49690745) Homepage

As an IT person, I hate questions like this. There's not enough information to give a solid answer. For example:
* What kinds of documents are you talking about? Text? Photos? Spreadsheets?
* What is the source of the documents? Are these currently printed out documents that need to be scanned back in? Are they currently digital, and in a particular file format?
* What will people need to do with them when these documents are retrieved? Do they need to be able to edit the documents?
* How much does formatting matter? If someone retrieves the document in 5 years, will it be important that all the line breaks and page breaks are in the same place? Does it need to have all of the correct fonts? Or are you more interested in being able to have access to the information itself?
* When you say that the application will need to allow ".docx, doc, .pdf, etc", what formats are in "etc"?
There may be many other relevant questions, my point is that there just isn't enough detail here. In general, if the most important thing is that you have a printable document that you want to be able to print out from any machine, maintaining the formatting as much as possible, then PDF is a pretty good choice (be sure to embed the fonts and include searchable text!). If you already have a bunch of Word documents and you want the formatting unchanged, and would like the capability to edit the document after it's retrieved, then I'd typically just recommend keeping it as a .docx. It keeps things simple, will be widely supported, and prevents the risk of something going wrong while you're converting to another format. If you like the idea of using .docx because of what I just said, but want something more "open", then ODF is probably worth looking into.
Really, there are only so many choices, and each has advantages depending on your specific needs.

- Re: (Score:2)
  
  by Archangel Michael ( 180766 ) writes:
  
  * What kinds of documents are you talking about? Text? Photos? Spreadsheets? Photos aren't documents. Spreadsheets tend to be proprietary.
  * What is the source of the documents? Are these currently printed out documents that need to be scanned back in? Are they currently digital, and in a particular file format? This! I tend to classify documents as "Primary Data" (Structured) and "freeform Data" (Human Readable)
  * What will people need to do with them when these documents are retrieved? Do they need to be ab
  - Re: (Score:2)
    
    by nine-times ( 778537 ) writes:
    
    Photos aren't documents. Spreadsheets tend to be proprietary.
    Nonsense.
    Data needs to be organized by purpose (Record keeping = Primary / structured data) and Executive Summary Type data (human readable).
    It depends on what the data is, and what and how it's being used. There is no "correct" organization, and no "one true way" to deal with data. I would not recommend going around cramming documents into some set organization without understanding where the data is coming from and what people hope to do with it.
    Your organization may work for your purposes, within the constraints of the company or organization you work in. I've supported a lot of different types of companies over the years, and pe
What Suits Your Needs? (Score:1)

by mckellar75238 ( 1218210 ) writes:

"Best" in this case depends on your needs and resources, not on standards or common practices. Flexibility and "getting the job done" are more important than what everyone else prefers. Do what works for you.
Depends... (Score:2)

by EmeraldBot ( 3513925 ) writes:

I'd highly recommend leaving them in their original format, or if anything, converting them all to .pdf. Conversion is always fraught with danger, and you will be spending an awful lot of time getting to know the intricacies of Microsoft Word if you go this route. PDFs display equally nicely on every operating system, they archive very well, and almost every tool out there can read them - but while converting documents to this format usually works better than others, I'd still be very careful to watch
Use a document manager (Score:2)

by LostMyBeaver ( 1226054 ) writes:

There are many premade document management systems. They generally will store their indices in a database format for quick searching. Why not store them in their native formats and leave it up to a document management system to handle it for you?
- - - Re: (Score:2)
      
      by ls671 ( 1122017 ) writes:
      
      worth a look too:
      https://tika.apache.org/ [apache.org]
For Two-Millennia Durability... (Score:5, Insightful)

by nightcats ( 1114677 ) writes: <[moc.liamg] [ta] [woemthgin]> on Thursday May 14, 2015 @01:20PM (#49690863) Homepage Journal

...you can't beat bamboo strips. The oldest original versions of Lao Tzu's Tao Te Ching are written on rolls of bamboo strips. Not sure how they scan electronically, and you will have to keep your pet pandas away from them, but for document durability, you can't beat that format...

- Re: (Score:3)
  
  by jfdavis668 ( 1414919 ) writes:
  
  We carve 16 bit Unicode into stone slabs for long term backup storage.
- Re:For Two-Millennia Durability... (Score:5, Informative)
  
  by gstoddart ( 321705 ) writes: on Thursday May 14, 2015 @01:32PM (#49691063) Homepage
  
  Nonsense, bamboo can't touch papyrus for longevity, and you don't need to worry about pandas.
  Damned bamboo shills.
  And don't anybody go suggesting cave paintings, it's a completely dead platform.
  
  - Re: (Score:2)
    
    by cyberchondriac ( 456626 ) writes:
    
    And don't anybody go suggesting cave paintings, it's a completely dead platform.
    But.. but.. they've lasted the longest! Granted, they're not very mobile.
  - Re: (Score:2)
    
    by Megane ( 129182 ) writes:
    
    Cave paintings certainly have some migration issues.
    - - Re: (Score:2)
        
        by ChunderDownunder ( 709234 ) writes:
        
        Petroglyphs are more susceptible to treehuggers from Greenpeace than the weather.
- Burning Bush (Score:5, Funny)
  
  by Etherwalk ( 681268 ) writes: on Thursday May 14, 2015 @02:55PM (#49691929)
  
  ...you can't beat bamboo strips. The oldest original versions of Lao Tzu's Tao Te Ching are written on rolls of bamboo strips. Not sure how they scan electronically, and you will have to keep your pet pandas away from them, but for document durability, you can't beat that format...
  Chisel it into stone tablets, then find an ignorant local. Set up a natural gas line to a nearby bush and hide behind a rock. Cup your hands to add a slight reverb effect and tell him to preach the chiselled word, then break the tablets and hide them in a box and trick Nazis into looking at them.
  
And a pony too? (Score:4, Insightful)

by gstoddart ( 321705 ) writes: on Thursday May 14, 2015 @01:21PM (#49690879) Homepage

Although the application will need to allow uploading of .docx, doc, .pdf, etc, I'd like to store the documents in a standard open format that will allow easy search, compression, rendering, etc. Which open document format is the best?
Let's see ... you want to allow uploading in a large number of formats ... you want to magically turn it into a universal format ... while retaining all aspects of the original ... and it will be easily manipulated ... and you want it in an open and documented format? And all for free?
I want one of those too. And a Red Ryder BB gun with a compass in the stock and this thing which tells time. And a new skateboard. And a pony.
Honestly, you're asking for the holy grail of document management systems ... the universal, lossless document format.
I'm not sure it exists. And I'm not sure companies like Microsoft or Adobe would allow it to exist.

- - Re:And a pony too? (Score:4, Informative)
    
    by gstoddart ( 321705 ) writes: on Thursday May 14, 2015 @02:26PM (#49691605) Homepage
    
    English idiom connoting yet another impossible thing in a child's unrealistic wishlist ... typically placed at the end of a series of outrageous demands: " ... and a pony".
    Now, please, don't make me pedantic you again to explain the cromulency of phrases. ;-)
    
"Best" depends on intent (Score:3)

by Phoenix Rising ( 28955 ) writes: on Thursday May 14, 2015 @01:25PM (#49690945) Homepage

On the conversion side... If you're taking in PDFs created by a layout/page design program, then you're not likely to get good satisfaction converting them and storing them as something other than PDFs. OTOH, if you're taking in a lot of documents created in an office suite, and they have collaborative notes, and you need to retain the documents for legal purposes, then converting them to PDF is going to lose data.
On the future use side: PDFs are slower to render and search than most formats; they're harder to alter, but they're more reliably rendered than any other format. Office documents offer richer content and easy editing; their layout may vary depending on the output device (good and bad), and office document formats seem to change a bit more than other document types. HTML with CSS is good, and probably now stable enough that future clients will render something similar - but it's not PDF for reliable formatting, nor office docs for feature richness; editing tools for HTML aren't all that intent on preserving what came before. LaTeX is a reliable formatter wrapped around text-centric documents, but it's not something most people will be able to use and edit.
Each document type has its reasons for being - you'll need to decide why you need to store your documents and what you need them for in the future. Retaining the original document along with a text conversion stored and indexed in a search engine may be your best bet - or not.
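
For what it's worth, the "keep the original, index a text conversion" idea might look roughly like this in Python. This is only a sketch under assumptions: `DocumentStore` and its methods are made-up names, the text extraction is assumed to have happened elsewhere, and the in-memory dictionaries stand in for real storage and a real search engine.

```python
import re
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class DocumentStore:
    """Keeps each upload untouched and indexes a plain-text rendition of it."""

    # doc_id -> original bytes, exactly as uploaded (hypothetical layout)
    originals: dict = field(default_factory=dict)
    # lowercase word -> set of doc_ids whose text contains it
    index: dict = field(default_factory=lambda: defaultdict(set))

    def add(self, doc_id: str, original: bytes, extracted_text: str) -> None:
        # The original is stored byte-for-byte; only the text copy is indexed.
        self.originals[doc_id] = original
        for word in re.findall(r"\w+", extracted_text.lower()):
            self.index[word].add(doc_id)

    def search(self, query: str) -> set:
        # Return the ids of documents that contain every word in the query.
        words = re.findall(r"\w+", query.lower())
        if not words:
            return set()
        hits = [self.index.get(word, set()) for word in words]
        return set.intersection(*hits)


store = DocumentStore()
store.add("contract-1", b"%PDF-1.4 ...", "Lease agreement for the office")
store.add("memo-7", b"PK\x03\x04 ...", "Office party memo")
matches = store.search("office lease")  # ids of documents containing both words
```

The point of the split is that the index can be thrown away and rebuilt with a better extractor later, while the uploaded bytes never change.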

- Re: (Score:2)
  
  by Vitriol+Angst ( 458300 ) writes:
  
  Sometimes you people make things WAY too complicated.
  In our 'best judgement' -- what's a very open standard for documents? Now, we can ask "what type of document" -- and we can also try and answer for whatever documents we know.
  So here goes;
  Documents; Try RTFD. Rich Text Formatted Document. It might not be perfect in layout -- but it's open, and accessible to a lot of apps and cross platform. If you get bad results, you might just need to switch to some other "open" app. OpenOffice on all platforms will lik
Docbook? (Score:4, Funny)

by Enry ( 630 ) writes: <enryNO@SPAMwayga.net> on Thursday May 14, 2015 @01:25PM (#49690949) Journal

Docbook allows you to separate out the content from the presentation. You write in XML and define paragraphs, chapters, images, etc. and then leave it to the various stylesheets to drive how it looks when it comes out the other end - PDF, HTML, Word, whatever, and the stylesheet makes sure that if some features are supported (hyperlinking from the table of contents to the chapter) they'll be included in there. Since the content is in plain ol' XML you can use any kind of XML processor to go through it.
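
To illustrate the "any XML processor" part, here is a rough Python sketch using only the standard library. The sample document is made up and uses DocBook-style element names without the namespace a real DocBook 5 file would declare, and `outline` and `to_html` are hypothetical helpers, not part of any DocBook toolchain.

```python
import html
import xml.etree.ElementTree as ET

# A tiny, made-up document with DocBook-style element names (no namespace).
DOCBOOK_SAMPLE = """\
<book>
  <title>Storage Handbook</title>
  <chapter>
    <title>Formats</title>
    <para>Keep the original file.</para>
  </chapter>
  <chapter>
    <title>Search</title>
    <para>Index a text copy.</para>
    <para>Weight titles higher.</para>
  </chapter>
</book>
"""


def outline(xml_text: str) -> list:
    """Return (chapter title, paragraph count) pairs, one per chapter."""
    root = ET.fromstring(xml_text)
    return [
        (chapter.findtext("title"), len(chapter.findall("para")))
        for chapter in root.iter("chapter")
    ]


def to_html(xml_text: str) -> str:
    """A stand-in for a stylesheet: render the same content as simple HTML."""
    root = ET.fromstring(xml_text)
    parts = [f"<h1>{html.escape(root.findtext('title'))}</h1>"]
    for chapter in root.iter("chapter"):
        parts.append(f"<h2>{html.escape(chapter.findtext('title'))}</h2>")
        for para in chapter.findall("para"):
            parts.append(f"<p>{html.escape(para.text or '')}</p>")
    return "\n".join(parts)
```

The same source feeds both functions: one reads structure for search or a table of contents, the other produces one possible presentation.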

- Re: (Score:2)
  
  by gstoddart ( 321705 ) writes:
  
  OK, grandpa, it's time for your meds again ... look, Matlock is about to start ... no, they're not on your lawn. ;-) [ Wow, an actual 3-digit ID ]
  Honestly, for those of us old enough to still have a copy of Goldfarb's [wikipedia.org] book, this has been the holy grail for a very long time.
  But in practice, there's still no tools to convert all those formats to it, and most anything you do is going to be custom code.
  As a system which takes other formats as input, docbook falls into the category of wishful thinking.
  Even us
  - Re: (Score:2)
    
    by Zontar The Mindless ( 9002 ) writes:
    
    But in practice, there's still no tools to convert all those formats to it, and most anything you do is going to be custom code.
    I love DocBook. But I've also spent a fair portion of my lifespan persuading other formats to turn into it. Not the most fun I've ever had, to say the least.
Impossible question to answer (Score:2)

by Saanvik ( 155780 ) writes:
There is no "best" document format, open or otherwise, for "easy search, compression, rendering, etc." because those words are too fuzzy.
- What does rendering mean (print, screen, mobile, or ...)?
- What is the search scope ("this" document or multiple documents)?
- How important is compression?
If your use case is a typical one, then you actually want, for maximum search functionality, text (perhaps with some form of markup so you can assign weights to segments, like higher weight for titles), a HTML5 based
EDI? (Score:2)

by ArhcAngel ( 247594 ) writes:

Will this be for sharing information with 3rd parties? If your documents consist of a set of data you will need access to, EDI [wikipedia.org] (electronic data interchange) was designed to store the data in a standard format and to be able to inject that data into any document format. It's not so helpful if you are just archiving Word documents or emails. There are a number of companies [covalentworks.com] that assist in converting your documents to EDI.
How about (Score:3)

by pahles ( 701275 ) writes: on Thursday May 14, 2015 @01:59PM (#49691385)

Markdown? It's easy to write, read, render, compress, search.

Bad idea (Score:2)

by Daniel Hoffmann ( 2902427 ) writes:

Your documents will lose formatting when the files are converted. If you want users to be able to download the files in any format, you should just store the files the way the user uploaded them and convert directly. Create a plain-text metadata version for search, and maybe a visualization version so that the user can see the files inside your application; for the visualization version you should just use the easiest method.
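
A minimal sketch of that layout in Python, assuming a plain filesystem as the store. The function name `store_upload`, the folder scheme, and the `meta.json` sidecar are all invented for illustration, and the search text is assumed to come from whatever extractor you already have.

```python
import hashlib
import json
from pathlib import Path


def store_upload(root: Path, filename: str, data: bytes, search_text: str) -> Path:
    """Save the upload unchanged, plus a sidecar JSON file used only for search."""
    digest = hashlib.sha256(data).hexdigest()
    folder = root / digest[:2] / digest
    folder.mkdir(parents=True, exist_ok=True)

    # The original bytes are written exactly as received.
    (folder / "original").write_bytes(data)

    # Derived data sits next to the original and can be regenerated at any time.
    sidecar = {"filename": filename, "sha256": digest, "text": search_text}
    (folder / "meta.json").write_text(json.dumps(sidecar), encoding="utf-8")
    return folder
```

Naming the folder after the content hash means the same file uploaded twice lands in the same place, and the stored hash lets you check later that the original has not been altered.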
Of course this depends heavily on your requirements.
That's easy (Score:2)

by ArcadeMan ( 2766669 ) writes:

Just convert all the documents into 1200 DPI, 32-bit PNG images.
Three letters (Score:2)

by plcurechax ( 247883 ) writes:

DNA
Millions of years of field testing, and it still mostly works, and DNA itself is not patented.
Use cases? (Score:2)

by PPH ( 736903 ) writes:

Who is going to use these stored documents? How will they be used (read-only, revise and check in, etc.)? What tools are authors generating these documents with? Answers to these questions will help determine the best storage format.
For documents intended to be downloaded and read or string searched, PDF is a good choice. There are a lot of PDF readers for different O/Ss available.
Save Space, Switch to ODT (Score:4)

by BrendaEM ( 871664 ) writes: on Thursday May 14, 2015 @04:07PM (#49692603) Homepage

I've written several books. Because ODTs have standard compression, they are usually much smaller. For a 109,683 word book, with styles and formatting:
ODT: 271,090 bytes
Docx: 300,057 bytes
Word 97: 1,379,328 bytes
PDF: 1,050,788 bytes
If bytes cost money to store, ODT rules.
Imagine yourself sticking with Word 97 because it's a reliable standard: imagine buying three times as much storage, as well as the backup for the storage,
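
The size comparison can be turned into ratios with a few lines of Python. The byte counts are the ones listed above for that single book; the variable and function names are just for illustration, and other documents will give different ratios.

```python
# File sizes in bytes for the same 109,683 word book, as listed above.
SIZES = {
    "ODT": 271_090,
    "Docx": 300_057,
    "Word 97": 1_379_328,
    "PDF": 1_050_788,
}


def ratios_to(baseline: str, sizes: dict) -> dict:
    """Return each format's size as a multiple of the baseline format's size."""
    base = sizes[baseline]
    return {name: size / base for name, size in sizes.items()}


ratios = ratios_to("ODT", SIZES)
for name, ratio in sorted(ratios.items(), key=lambda item: item[1]):
    print(f"{name:8s} {SIZES[name]:>10,d} bytes  {ratio:.2f}x ODT")
```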

Silly Rabbit, Trix are for kids.... (Score:4, Interesting)

by David_Hart ( 1184661 ) writes: on Thursday May 14, 2015 @05:15PM (#49693333)

No, just no....
Store the documents in their original format.
There are many possible reasons why you shouldn't mess with the originals such as formatting, legal implications, loss of content because one format supports stuff that the other doesn't, etc.
The only way that I could see this working is if you converted everything to an open format but kept copies of the originals and linked to them. But if the plan is to dump the original documents, then it just isn't worth it....

LANG=POSIX, 7-bit ASCII text (Score:2)

by Antique Geekmeister ( 740220 ) writes:

If you can't read it in flat text, it's not long-term reliable documentation.
my 1st and 2nd choice for document format (Score:2)

by Skapare ( 16644 ) writes:

i usually just go with the .TXT format [wikipedia.org] but have been considering the compatible .RST format [wikipedia.org].
The best Open Document Format? (Score:2)

by Trogre ( 513942 ) writes:

The answer's right there in the question...
- - Re: (Score:1)
    
    by RavenLrD20k ( 311488 ) writes:
    
    Ever since the OMG Ponies! [metlin.org] incident... Slashdot just hasn't been the same...
    - Re: (Score:1)
      
      by jfdavis668 ( 1414919 ) writes:
      
      Shh... Mods are asleep, post Ponies!
  - Re: (Score:3)
    
    by gstoddart ( 321705 ) writes:
    
    Dude ... can I point out to you that you got the reference and that many of us wouldn't know WTF it was?
    So maybe your question is how many other Slashdotters are Bronies [wikipedia.org] besides you? ;-)
    And, for the record, I included that link because I had to google it to find out what it meant.
    Now if you will excuse me I need to go apply brain bleach. The images which came up in that google search are terrifying.
    - - Re: (Score:2)
        
        by gstoddart ( 321705 ) writes:
        
        To be fair, doing a google image search for pretty much anything is dangerous.
        I didn't. But google was "helpful" enough to throw up related images along with the search results.
        And now I shall ever be traumatized that 'adults' are dressing up like that.
        I can't simply unsee that. I'm going to make it the new Rick rolling ... just randomly stick in links to bronies. Spread around the pain.
        
        Re: (Score:2)
        
        by ArcadeMan ( 2766669 ) writes:
        
        http://img3.wikia.nocookie.net... [nocookie.net]
        I do share your feelings about the cosplaying though... unless it's a cute girl dressed like Fluttershy. If that's okay with her, I mean.
- Re: (Score:1)
  
  by Anonymous Coward writes:
  
  And those .02 weren't worth much, considering that the most effective way to FUBAR a doc-file is to pass it around to a few users who use different versions or even patches of Word. Nobody is 100% compatible with MS-Word. Not even Microsoft, so that "format" goes right into the shitter, along with your lame attempt at being cool and suitably hipster "anti-opensource".
  Captcha: "Vanity". Indeed.
- And Depend (Score:2)
  
  by ArcadeMan ( 2766669 ) writes:
  
  Is a brand of adult diapers [depend.com].
- - XHTML5 exists (Score:2)
    
    by tepples ( 727027 ) writes:
    
    The HTML5 introduction [w3.org] states that XHTML is one of the two syntax forms of HTML5.
    - Re: (Score:2)
      
      by Dracos ( 107777 ) writes:
      
      But HTML5 treats XHTML syntax as a resented stepchild because WHATWG hates XML so much. All the sloppy markup that the rest of the spec advocates should be treated that way instead.
- - Re: (Score:2)
    
    by brausch ( 51013 ) writes:
    
    Vote this up. It was my first thought too. Basically, plain ASCII text with formatting instructions that are human readable.
- Re: (Score:2)
  
  by Zontar The Mindless ( 9002 ) writes:
  
  HTML is primarily a layout language. It does diddley-squat for semantics. You need DocBook XML, or something like it, if you're going to go that route... and good luck with converting to it from something like Word.
- Re: (Score:2)
  
  by vtcodger ( 957785 ) writes:
  
  So long as you remember that the M in HTML is "Markup", not "Layout". If it is important that page layout be "perfectly" preserved in the presentation, something else like pdf (Yechhh) might be a better choice.

