OSD Database Downloadable As XML 46

After taking some heat a few months ago for not having many products listed, the Open Source Directory has been plugging away. Steve Mallett writes: "We made the product database of Open-Source Directory downloadable in XML today. Announcement here at newsforge. We're hoping that people begin to use the data like google uses dmoz. More people see the data, which increases awareness of open-source which increases the database which gets more people to display the data etc, etc ... You get the point."

Providing a list of applications stable enough to recommend to non-gurus is a worthy endeavor, so it's great to see this project slowly becoming more useful. There are gaps to plug going forward, though. The default text strings can be ambiguous, and the information provided on individual projects doesn't always give much to go on. For instance, look at the Mosix page, where you'll find that "This product has no Latest version yet," "This product doesn't fix anything," and "This product is not like any other," but no email contact information for Mosix authors. Similarly ambiguous pages are provided for Gnucleus and OpenOffice.

I exchanged some email with Steve on the state of the entries in the database, and asked how the missing information could be filled in. He told me that while project maintainers (and site administrators) are the only ones who can update entries, users can contact the administrators of individual projects directly through the OSD site to suggest changes or clarification.

"We're trying to make things easier for the maintainers. ... I think there is a serious lack of product maintainers to help authors," he said. To that end, Mallett may soon provide example projects for software authors to emulate, and is in the early stages of a unified project-listing tool which would update listings on various web sites. Given the number of sites that offer downloads or simply track various software projects, that could be a boon to developers.

Hopefully, this will turn into the sort of tool that you can show a boss or teacher to answer the bugaboo of Free / Open Source being unready for prime time (or just overwhelming and undifferentiated).

This discussion has been archived. No new comments can be posted.

  • by Anonymous Coward
    The fonts are too small and they use grey instead of black fonts which makes them even less readable. Easy on the eyes, guys! Please use decent font sizes and ergonomically correct colors.
  • by Anonymous Coward
    I find it somewhat amusing that the Linux kernel itself isn't listed in the directory. (Yes, I know that they list just apps, but still, Linux is to most people the most prominent example of an open source project.)
  • Is the idea that a package won't get listed until someone out there submits it for inclusion? I was somewhat surprised to see that CVS, certainly a well-known and stable piece of software, is not in the database. -Karl
  • mod this way up please. very useful links.

  • Someone should create an HTTP interface to a dmoz XML database, which would allow users to place XPATH [w3.org] queries which would return XML nodesets to the requesting client.

    Someone could leverage an XML database like DBXML [dbxml.org], which is based on the "XML:DB" [xmldb.org] standard.

    If enough people are interested, I could try downloading dmoz myself, "massage" it into some dbxml store on my own system, and build a web-based interface to query it. I've just been really busy with other stuff lately, though.

    If you happen to read this and are interested, shoot me an email at valmont@wildstar.net and we can take it from there.
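
The query idea above can be sketched in a few lines. This is only an illustration, assuming a made-up miniature directory document (the `directory`, `topic` and `link` element names are hypothetical, not the real dmoz or OSD format) and using the XPath subset supported by Python's standard `xml.etree.ElementTree` rather than a full XPath engine or an XML:DB store. An HTTP front end would simply pass the path from the request to `query()` and return the strings.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature of a directory dump; element names are assumptions.
SAMPLE = """<directory>
  <topic name="open-source">
    <link url="http://www.opensourcedirectory.org/">OSD</link>
    <link url="http://freshmeat.net/">Freshmeat</link>
  </topic>
  <topic name="paragliding">
    <link url="http://example.org/paragliding">Paragliding club</link>
  </topic>
</directory>"""


def query(xml_text, path):
    """Run an XPath-subset query and return each matching node as an XML string."""
    root = ET.fromstring(xml_text)
    return [
        ET.tostring(node, encoding="unicode").strip()
        for node in root.findall(path)
    ]


# Select only the links filed under one topic.
results = query(SAMPLE, ".//topic[@name='paragliding']/link")
for fragment in results:
    print(fragment)
```

ElementTree understands only a limited set of path expressions, so a real service would probably need a fuller XPath implementation.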

  • It's worse than that. It can't be valid because it doesn't even have a DTD.

    But this document is not even well-formed XML. In other words, it is not XML at all. It's plain text with some tags.

    For details on what it means for an XML document to be well-formed or valid, see the spec at the W3C [w3.org]
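
A quick way to see the difference between "not well-formed" and merely "not valid" is to feed the text to a non-validating parser. The sketch below uses Python's standard `xml.etree.ElementTree`, which checks well-formedness only and never validates against a DTD; the sample strings are invented for illustration and are not taken from the OSD download.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text):
    """Return True if the text parses as XML; this does not check validity."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


# Invented samples: one correct, one with a bare ampersand, one mis-nested.
good = "<products><product><name>GLAME</name></product></products>"
bad_ampersand = "<products><product><name>Q & A tool</name></product></products>"
bad_nesting = "<products><product><name>GLAME</product></name></products>"

for label, text in [("good", good), ("ampersand", bad_ampersand), ("nesting", bad_nesting)]:
    print(label, is_well_formed(text))
```
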
  • by VValdo ( 10446 ) on Sunday July 22, 2001 @05:17PM (#68532)
    What is the consensus on the best way to include Open Directory Project (ODP/dmoz) content in your web site, say for a mini-portal (or in the case of OSD, for a mini software directory)? I don't want to simply download/display dmoz's RDF/XML file on a weekly basis because (1) I'm only interested in a tiny portion of the ODP which relates to my web site and (2) I'd like to encourage people to upload new content back to dmoz, so I'm looking for a way to pull "live" content from dmoz and let my visitors send links back to dmoz.

    Is there a PHP class or something that everyone's using for this? I saw a couple of offerings at freshmeat that relate to ODP, and some tools and code are here [dmoz.org], but I'm curious what most people are using.

    W
    -------------------

  • by Teferi ( 16171 )
    YHBT. This is a variation on the same 'foo is dying' framework that's been going around /. for months. Look for any kind of post matching 'BSD is dying' and compare it to this...same thing.
  • I was one of the original category editors of DMOZ way back when it started and long before the AOL bureaucrats took over. I was really dedicated to my two small categories - paragliding and paramotoring - and built them up from scratch. Shortly after AOL took over I suddenly found myself locked out as a category editor. After repeated inquiries as to why, it turned out that I had listed some non-English web sites in my "English only" categories and this was against AOL policy, hence an immediate boot.

    The real shame was watching the categories I created with TLC lie fallow for months and months without anyone to update them.

    With inane policies like these is it any wonder that this directory lacks up-to-date information and is in general disarray? Methinks not.

    I was thinking of the immortal words of Socrates, who said, "I drank what?"

  • by Arandir ( 19206 ) on Sunday July 22, 2001 @07:09PM (#68535) Homepage Journal
    Linux may be the most "prominent" example of Open Source, but the three pieces of Open Source work that are actually the most used are not listed either. Perl, Apache and XFree86.

    The OSD is not meant to be a definitive archive. Its mission is to provide a resource for users. I think it has done a good job in this regard.
  • It is pretty easy to use Perl for this. Look on CPAN for XML (XML::Parser in particular)
  • - XML doesn't replace databases.

    Sure. The point is that XML is only a file format. The data it represents is vaguely semi-structured. Of course, one needs a query/update language (and some other good stuff) on top of that to make a database---there have been many proposed. In the relational world there is no standard file format. One could represent relational data in XML pretty trivially, though.

    - XML Schema is also very poor on data modelling, because it has no separation between a structural schema (which element goes inside the other) and a semantic schema (what each element means, when placed inside another)

    DTDs are problematic because they just provide a grammar for the structure of an XML document. XML Schema tries hard to provide a strong notion of type. For example, I could define a type called Person and let several tags, say manager and employee both have that type.

    - How do you represent shared resource in XML; such as an author of several modules ?

    Sort-of. It's hard to represent graphs in XML. Unlike semi-structured data, which is a graph, XML is, at its core, a tree description language. One can define graphs with IDs and IDREFS, but it's a pain.

    - How do you distinguish such an author for another author with the same name ?

    This seems like a key problem---you'd have the same issue in the relational world. To separate two authors with the same name you need more information to make a key.
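
As a rough illustration of the two answers above (sharing through ID-style references, and needing a key to tell namesakes apart), here is a sketch in Python. The `catalog`, `author` and `module` elements are hypothetical, and because there is no DTD the `id` and `author` attributes are IDs and IDREFs by convention only; the application, not the parser, resolves them.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: two different authors share the same name,
# and one author is shared by two modules via an id reference.
DOC = """<catalog>
  <author id="a1"><name>J. Smith</name><email>jsmith@example.org</email></author>
  <author id="a2"><name>J. Smith</name><email>john@example.com</email></author>
  <module name="foo" author="a1"/>
  <module name="bar" author="a1"/>
  <module name="wibble" author="a2"/>
</catalog>"""

root = ET.fromstring(DOC)

# Index authors by their key, not by their (ambiguous) name.
authors = {a.get("id"): a for a in root.findall("author")}


def modules_by_author(author_id):
    """Return the sorted names of modules that reference the given author id."""
    return sorted(
        m.get("name") for m in root.findall("module") if m.get("author") == author_id
    )


print(modules_by_author("a1"))
print(authors["a1"].findtext("email"), authors["a2"].findtext("email"))
```

The extra information that makes the key here is simply the `id` attribute; the lookup code is the "pain" the comment refers to.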

    The biggest problem I see for XML as a data description language is that it's way too complicated for what it does. To represent semi-structured data, which is what we seem to want here, all you need is a simple graph description language. XML, however, does this with three types of edges (subelement, attribute, and IDREF) and has other features (eg., mixed content) that are hard to figure out what to do with from a database perspective. (From a document description point of view things like mixed content make a lot of sense.)

    The other problem, one induced partly by the inherent complexity of XML, is that the standards that are growing up are horrendously complicated. For example, the 300 page monster that is XML Schema [w3.org].

  • Timothy, that was a very informative news item, with great commentary. Thank you for posting this.

    --

  • It states on the download page that it's not validating XML. And did you take a look at the DTD for this? [opensourcedirectory.org] It's very simple, about as simple as you can get. Basically useless for rigorous validation.

    I agree your way of structuring the data is better, but I would add that many of the data items should be attributes. I mean you have elements and attributes available, why not use both? It would have made things much faster and cleaner to keep up to date and ensure all parsers can validate it quickly. Can you really see a SAX parser making use of that xml? And a DOM parser would consume an enormous amount of memory needlessly. Oh well, I'm sure they were in a big hurry to get this info available. And it'll get well cleaned up in the next few months.
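
To make the memory point concrete: a streaming pass over attribute-based records can be sketched as below. This uses `iterparse` from Python's standard `xml.etree.ElementTree` (an incremental parser, not SAX proper), and the `product` element with `name` and `license` attributes is an assumed layout, not the actual OSD format.

```python
import io
import xml.etree.ElementTree as ET

# Assumed attribute-based layout, invented for illustration.
DATA = b"""<products>
  <product name="GLAME" license="GPL"/>
  <product name="Mosix" license="GPL"/>
  <product name="OpenOffice" license="LGPL"/>
</products>"""


def count_by_license(stream):
    """Tally products per license while discarding each element after use."""
    counts = {}
    for _event, elem in ET.iterparse(stream, events=("end",)):
        if elem.tag == "product":
            lic = elem.get("license")
            counts[lic] = counts.get(lic, 0) + 1
            elem.clear()  # drop the element's contents once it has been counted
    return counts


counts = count_by_license(io.BytesIO(DATA))
print(counts)
```

A DOM-style `ET.parse()` of the same file would hold every record in memory at once, which is the cost the comment objects to.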

  • The OSD isn't the same as DMOZ. Similar idea though.

    I also share your concern over DMOZ. I was declined as an editor for an editor-less category that I know inside and out.

  • Look for the XML solutions at the Apache XML Project [apache.org] - Xerces and the like.

    Some are available for both Java and C++.

    Sorry I don't have a more detailed answer to your question but I'm sure something can be built from the Apache XML stuff.


    Mike [goingware.com]

  • This may be offtopic, but whoever modded this post down obviously did it only because he/she didn't agree with the AC's opinion.

    The post you recommend is being AC-posted to practically every /. thread these days, practically unmodified. There's a similar version dooming & glooming *BSD.

    IMHO, this is just tomorrow's goatse / body thetans / frist p0st, and the troll moderation it got is all that it deserves.

    --

  • but xml was not designed to replace databases

    XML doesn't replace databases. It _can't_ do, because it has no query mechanism. If you want to compare something to an RDBMS, then you have to look at the combination of XML + XPath. This is actually quite a good choice for some small systems (although it has no large-volume performance).

    What's a more important issue (and this is one of my personal hobby-horses) is to separate the data model from the serialisation. XML does serialisations, and it does them quite well. It's poor though on data modelling. XML Schema is also very poor on data modelling, because it has no separation between a structural schema (which element goes inside the other) and a semantic schema (what each element means, when placed inside another). As a result, it's possible to serialise XML documents to represent "One view of the data, for one context" but it's really not possible to build an XML representation of a large data modelling problem for anything beyond the trivial.

    • How do you represent shared resource in XML; such as an author of several modules ?
      How do you distinguish such an author for another author with the same name ?
    • How do you represent graphs in XML, such as "foo depends on bar, which depends on wibble" ?

    Now (obviously) people have built XML solutions that work around these problems, but XML itself doesn't support them. It doesn't have a portable solution to such commonplace problems that a generic parser (like SiRPAC) could understand, and it doesn't support the development of particularly good solutions to them.
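
A sketch of the kind of application-level workaround meant here, using the "foo depends on bar, which depends on wibble" case: the dependency edges are written as name references, and only custom code (not a generic parser) knows they form a graph. The `package` and `depends` element names are assumptions made for this example.

```python
import xml.etree.ElementTree as ET

# Hypothetical encoding of a dependency graph as name references.
DOC = """<packages>
  <package name="foo"><depends on="bar"/></package>
  <package name="bar"><depends on="wibble"/></package>
  <package name="wibble"/>
</packages>"""


def load_graph(xml_text):
    """Build {package: [direct dependencies]} from the tree-shaped document."""
    root = ET.fromstring(xml_text)
    return {
        p.get("name"): [d.get("on") for d in p.findall("depends")]
        for p in root.findall("package")
    }


def transitive_deps(graph, name):
    """Follow the references to collect direct and indirect dependencies."""
    seen = []
    stack = list(graph.get(name, []))
    while stack:
        dep = stack.pop()
        if dep not in seen:
            seen.append(dep)
            stack.extend(graph.get(dep, []))
    return seen


graph = load_graph(DOC)
print(transitive_deps(graph, "foo"))
```

Nothing in the XML itself says that `on="bar"` points at another element; that meaning lives entirely in `load_graph()`.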

    Teaching RDF, one of the hardest (and most important) lessons to communicate is that there's an underlying data model, and there's a serialisation, and that the serialisation is only one usage-dependent view onto what ought to be a much better structured and flexible internal model. For RDF it is, but for XML it isn't.

  • That's some of the worst XML schema design I've seen in years (OK, I know it wasn't yours).

    Secondly, which pair of moronic moderators moderated this down as a troll ?

  • XML Schema tries hard to provide a strong notion of type

    Although that's a valid point (and I haven't written DTDs in over 2 years, in favour of schema) it's not the issue I was talking about. Look at the Infoset [w3.org] draft or the recent Processing Model [w3.org] workshop. You can barely tell the difference between reading infoset and the syntax spec, because XML just doesn't put enough distance between semantics of the content and its representation in a document.

    XML doesn't "represent" anything. It never has done, it never will, and all attempts to pretend that it does will end in failure. XML (and XML Schema) is a low-level transport and manipulation platform, but it doesn't have the ability to do any form of abstract representation. Its structure and implied semantic meaning are so closely fastened together that it's impossible to squeeze a gap between them. "Representation" is the act of stretching this gap, between structure and implied meaning, so as to infer a higher level meaning.

    The problem is fundamental to XML, and won't be fixed by tools at this level. There's no abstraction in XML; any attempt to indicate semantics also drags along its structural baggage, because that's the only way XML-Schema allows you to work. No number of "sideways" solutions to this; namespacing to allow parallel co-existence, BizTalk to allow sharing of schemas, will fix this - XML just doesn't offer any "upwards" in a semantic direction.

    To separate two authors with the same name you need more information to make a key.

    Again, I agree with you in general, but that's not quite the issue I was thinking of. Clearly we need more structure to distinguish them, although in fact we don't need any more information (RDF can do this entirely within the document structure, with no need to start "allocating author indexes" or similar).

    The symptom of this problem, in the XML world, though is an over-dependence on flat text comparisons. It's like search engines that only compare at the text level and can't tell "goat sex" from animal husbandry or a Slashdot Troll. Because XML has nothing useful beyond the text node, that's what gets used. If it's easy to do it all just by comparing author names, then that's what lazy coders do. Disambiguation between resources like this needs a simple and lightweight mechanism, because if it isn't, no-one will use it. RDF manages it with rdf:resource and rdf:about attributes. In XML then you'd have to build some identifying system at the application level (so a generic parser can't understand it) and impose its use on your data. No wonder people stick with just using the names and ignoring truly identifying relationships with resources.
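
For readers who have not seen the attributes mentioned above, here is a small sketch of `rdf:about` and `rdf:resource` identifying two different creators by URI rather than by name. The URIs are invented, and the Dublin Core `dc:creator` property is used only as a plausible example. Note that the code reads the RDF/XML with a plain XML parser, so it handles just this one serialisation; a real RDF parser would accept the equivalent forms as well.

```python
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
DC = "http://purl.org/dc/elements/1.1/"

# Invented example: both creators might be called "J. Smith",
# but each is identified by a distinct resource URI.
DOC = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:dc="http://purl.org/dc/elements/1.1/">
  <rdf:Description rdf:about="http://example.org/modules/foo">
    <dc:creator rdf:resource="http://example.org/people/jsmith-1"/>
  </rdf:Description>
  <rdf:Description rdf:about="http://example.org/modules/bar">
    <dc:creator rdf:resource="http://example.org/people/jsmith-2"/>
  </rdf:Description>
</rdf:RDF>"""


def creators(xml_text):
    """Map each described resource URI to the URI of its creator."""
    root = ET.fromstring(xml_text)
    result = {}
    for desc in root.findall("{%s}Description" % RDF):
        creator = desc.find("{%s}creator" % DC)
        result[desc.get("{%s}about" % RDF)] = creator.get("{%s}resource" % RDF)
    return result


mapping = creators(DOC)
print(mapping)
```
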

    ID & IDREF are just broken. If you want to do it that way, build a proper architecture for doing it and join the RDF WG.

    ...the 300 page monster that is XML Schema.

    Tell me about it 8-(

    Compare the XML Schema spec, the SMIL spec, and the even more gargantuan MPEG-7 spec. Now take a look at DAML [daml.org] and see that complexity can be described, without needing a spec like a phone book.

  • XPath is a partial query language for XML - it can read, but it has no way of updating the document.

    There's also the issue that XPath is very much an XML tool, with a tight binding between semantics and structure (which is the whole thing that I'm saying about XML in the first place). If you have a graph represented in XML, then it's hard to write XPath expressions that can traverse it. If you have RDF stored in XML (which has several possible serialisations for the same semantic content) then it's possible to write XPath that expands these, but it's hard, error-prone, and generally unworkable.

    There's still a lot of thought out there that XSLT can translate magically between schemas. Some groups see XML Schema as improving this (Hunter & Lagoze, WWW10 [dstc.edu.au]), although Alison Cawsey's paper from WWW 9 [hw.ac.uk] shows just why this approach doesn't work. I've abandoned my own work in this field for similar reasons; even though I managed to build something workable, I just never trusted it to be reliable.

  • Someone should create an HTTP interface to a dmoz XML database, which would allow users to place XPATH queries which would return XML nodesets to the requesting client.

    That's an interesting idea, but it's not quite the same problem. You describe a good solution to a "pull" scenario, which is great for queries instigated by a client, but it's not as good as a "push" for providing a newsfeed from a site.

    I'd suggest RSS 1.0 as a good format to produce (possibly based on the same XPath-based pull that you describe). Once it's in RSS 1.0, then it's trivial to make it appear on any number of sites, or to aggregate it into other more generalised newsfeeds.

    For implementing the "pull" side, then XPath encapsulated in SOAP is an easy way to build clients, and not too hard for the endpoint server. I've been doing this recently, so that a UI component (DHTML in Javascript) could selectively retrieve pieces of a big taxonomy document that was >MB in total.

    My one concern (and my own personal bias) is that I see many of these items as running off the limits of what XML (and XMLDB) is good at, and being better handled in RDF. Certainly RSS 0.91 (which is XML) couldn't do this, but RSS 1.0 (which is RDF) could easily. Of course, that then makes XPath unworkable as a query language and there's not yet a stable "RDFPath" equivalent for RDF.

    I'm also interested in working on this. Anyone else, drop me a mail if you are too.

  • I definitely see a couple of immediate uses for that data. The less important (but more useful for the majority of the universe, yeah...) is just to look up useful applications, and frankly you probably don't need to download the database for that... might be a good idea for somebody to write a client so people could browse it offline, on second thoughts. It's fairly small when gzipped (130k or so) but could be a worthy addition to a Linux distro for those who disdain Freshmeat.

    The more important one would be for the licencing info- I was about to face the task of building up a database of (L)GPL'd applications manually. I'd say they've definitely saved me some effort... sure they're not all there but it's a start... thanks, guys.

    On the topic of the GPL, anybody notice they've licenced this XML document under the GNU Free Documentation License? [gnu.org] I can see the press release now: 'Argh! Viral pac-men documents!'

  • by antibryce ( 124264 ) on Sunday July 22, 2001 @04:56PM (#68549)

    I use my computer to write music, so I went to see their listing of stable audio software. The only things listed there are a crossfade plugin for xmms, GLAME and a soundfont editor. I've tried GLAME. Listing it as "stable" is a joke. And to top it all off, they have these things listed multiple times in categories they shouldn't be in. I'm fairly certain a soundfont editor doesn't qualify as "sound synthesis".

    I want to stress I'm not trying to discredit the GLAME team or any of these software packages. But what good is OSD if its categories are a mess? I might as well just use freshmeat.

    c.

  • Well, under the circumstances we are merely querying the entire database and using XML to transport that data back. It would be impractical to query the original database every time you wanted to get the data out, and you aren't just going to send out the database itself. By doing this the receiver can receive the database in an open, usable format and can do what they want with it at that point, whether that be putting it back in a database or parsing it out and using the data for something useful.
  • This is transport. You are downloading it, aren't you?

    Just write a script to import it into your choice of databases.
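
Such an import script can be quite short. The sketch below assumes a simplified, properly nested layout (`products` / `product` / `name` / `license`, all hypothetical and not the real OSD file) and loads it into an in-memory SQLite database using only Python's standard library.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Simplified stand-in for the downloaded file; the layout is an assumption.
DOC = """<products>
  <product><name>GLAME</name><license>GPL</license></product>
  <product><name>OpenOffice</name><license>LGPL</license></product>
</products>"""


def import_products(xml_text, conn):
    """Load every <product> into a table and return the number of rows read."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS product (name TEXT PRIMARY KEY, license TEXT)"
    )
    rows = [
        (p.findtext("name"), p.findtext("license"))
        for p in ET.fromstring(xml_text).findall("product")
    ]
    conn.executemany("INSERT OR REPLACE INTO product VALUES (?, ?)", rows)
    conn.commit()
    return len(rows)


conn = sqlite3.connect(":memory:")
imported = import_products(DOC, conn)
gpl = [row[0] for row in conn.execute("SELECT name FROM product WHERE license = 'GPL'")]
print(imported, gpl)
```
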
  • by pjbass ( 144318 ) on Sunday July 22, 2001 @05:47PM (#68552) Homepage
    I'm all for OSD, and for means of making it easier to search for, read about, and acquire apps. However, the people who would most likely find this searchable database are the people who already know how to search other OSD networks, like sourceforge and freshmeat. I can only see that this, by itself, is going to confuse people more, with multiple places to get the same software, but possibly at different version levels compared with freshmeat and sourceforge. Maybe if this list were synced somehow with the freshmeat lists, that might provide a very powerful tool for people new to and experienced in the Open Source world to get, play with, learn, upgrade, hack, and love Open Source. IMHO though...
  • "We're hoping that people begin to use the data like google uses dmoz. More people see the data, which increases awareness of open-source which increases the database which gets more people to display the data etc, etc ... You get the point."


    Sounds to me like the point of this project is a global infinite loop. I don't know much about this, but if that's what it is...count me out. I have it bad enough as it is. (I run windows ;-)


    chances are, this is a joke.

  • We contact products to do their own listing first; if they don't want to do it themselves they can ask us to do it. Of course we prefer that the maintainer for the product take ownership of their own listing since it will likely be kept up to date & in full.

    We haven't talked to the CVS folks yet.

    -Steve Mallett of OSD

  • We've released an updated XML doc [opensourcedirectory.org] (July 23, 4pm Atl. time) which should correct any validation problems. Thanks to everyone who emailed us to let us know how to actually fix those damn "&"s.

    -Steve Mallett of OSD
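
For anyone generating XML by string concatenation, the "&" problem mentioned above usually comes down to escaping text before it is written out. A minimal sketch using `escape` from Python's standard `xml.sax.saxutils`; the `product_xml` helper and its element names are hypothetical, and this is not a description of how OSD actually fixed its export.

```python
from xml.sax.saxutils import escape
import xml.etree.ElementTree as ET


def product_xml(name, description):
    """Build a small XML fragment, escaping &, < and > in the text content."""
    return "<product><name>%s</name><description>%s</description></product>" % (
        escape(name),
        escape(description),
    )


snippet = product_xml("Q&A Tracker", "Tracks bugs & feature requests <fast>")
print(snippet)

# The escaped fragment now parses, and the original text round-trips.
parsed = ET.fromstring(snippet)
print(parsed.findtext("name"))
```
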

  • It is stored in a database, then put into XML. -Steve Mallett of OSD
  • Actually it's this [sourceforge.net].

    Steve Mallett of OSD

  • by blab ( 214849 ) on Sunday July 22, 2001 @05:50PM (#68558)
    It is really up to a product's admin to be truthful. However, our Social Contract [opensourcedirectory.org] outlines that while it is their responsibility (we can't test them all), the interests of the directory are primary. If it sucks & there is no way it's stable, write to the admin & tell him so.

    Ultimately the info is open for catching bugs like this one. If it is a bug it will get weeded out.

    -Steve Mallett of OSD

  • by blab ( 214849 ) on Sunday July 22, 2001 @05:03PM (#68559)

    -Steve Mallett of OSD

  • It doesn't even list Miranda ICQ.

  • I was wondering how someone would sneak in the standard /. microsoft bash...

    Anyone else notice that when an article is posted on something related (even remotely) to Microsoft, or some other favorite whipping boy, /. gets 900 messages screaming "Bill Gates sucks!" but on an article like this, which actually can be productive and useful, you get ten messages...

    I'm sure there's a significant insight into the /. audience in there somewhere. I have my opinion, but it'd just get rated down as flamebait, so I'll keep quiet.

  • Not only is it not valid, the XML structure is not very logical either.

    The authors of the XML file have written it like this:

    <group_name></group_name>
    <!-- properties of group -->
    <group_name></group_name>
    <!-- properties of group -->

    whereas a more clever structure would have been:

    <group>
    <group_name></group_name>
    <!-- properties of group -->
    </group>

    This way the different groups would have been separated in a more logical manner, and it would be "easier" to parse the information in the XML file.
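
Until the file is restructured at the source, a consumer can regroup the flat list itself. The sketch below wraps each `group_name` and the siblings that follow it in a `group` element; the `osd` root and `product` children are placeholders standing in for the real "properties of group", which are not shown in the comment.

```python
import xml.etree.ElementTree as ET

# Flat layout as described above; osd and product are placeholder names.
FLAT = """<osd>
  <group_name>Audio</group_name>
  <product>GLAME</product>
  <product>xmms-crossfade</product>
  <group_name>Clustering</group_name>
  <product>Mosix</product>
</osd>"""


def nest_groups(flat_text):
    """Wrap each group_name and the siblings after it in a <group> element."""
    flat = ET.fromstring(flat_text)
    nested = ET.Element(flat.tag)
    current = None
    for child in flat:
        if child.tag == "group_name":
            current = ET.SubElement(nested, "group")  # start a new group
        if current is None:
            raise ValueError("property found before any group_name")
        current.append(child)
    return nested


nested = nest_groups(FLAT)
group_names = [g.findtext("group_name") for g in nested.findall("group")]
print(group_names)
```

With the nested form, "all products in the Audio group" becomes a single path query instead of a scan that has to track position.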

  • I've actually found that if you say, "I'll probably get modded down for saying this but I'll say it anyway", or something similar, the post gets modded up insightful.
    __________________
  • um, haven't you seen this article on any other threads? this is just an anti-linux troll
  • Ouch. That's the last time I ever try to help out an anti-Linux dumbass.
  • by sehryan ( 412731 ) on Sunday July 22, 2001 @05:45PM (#68566)
    but xml was not designed to replace databases. to store everything in an xml format is a bad idea. you store it in a database, and pass the variables into an xml document, then parse that with an xsl or xhtml. xml is a transport, not a storage.
    -
    sean
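
The "store in a database, pass it into an XML document" flow described above might look like the following sketch. It assumes a one-table SQLite database with invented rows and an invented `products`/`product` output layout, and it builds the document with Python's `xml.etree.ElementTree` so that special characters are escaped automatically.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Stand-in for the real database; table and rows are invented.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE product (name TEXT, license TEXT)")
conn.executemany(
    "INSERT INTO product VALUES (?, ?)",
    [("GLAME", "GPL"), ("Xerces & friends", "Apache")],
)


def export_products(conn):
    """Serialise every row as a <product> element and return the XML text."""
    root = ET.Element("products")
    query = "SELECT name, license FROM product ORDER BY name"
    for name, license_name in conn.execute(query):
        item = ET.SubElement(root, "product", {"license": license_name})
        item.text = name  # ElementTree escapes & and < on output
    return ET.tostring(root, encoding="unicode")


xml_out = export_products(conn)
print(xml_out)
```

The resulting text can then be handed to a stylesheet or any other consumer, with the database remaining the system of record.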
  • I've used dbxml some and find it very nice, but I haven't tested it on anything large. Have you any experience here you'd care to share?

    (I really like being able to do XPATH querying documents instead of insane SQL querying documents disguised as records. :-)

  • Great, XML. Now MS can use it in the next version of Office. :P
  • Have you seen this statement by RMS on fsf.org?

    http://www.fsf.org/philosophy/luispo-rms-interview.html

    About stallman himself,
    "A short list of his coding accomplishments would include Emacs as well as most of the components of the GNU/Linux system, which he either wrote or helped write. "

    :-)
  • Obviously for such things as flat databases XML is perhaps not the best solution, but XML can and should be used for storing data such as marked-up full text documents (TEI) or descriptions of archives or museum objects (EAD/Spectrum). XML is most definitely not only a transport. The XML in the database provided is awful, and demonstrates why XML needs to be thought out in advance rather than generated directly from a database. For example, there is no encapsulation of individual projects, just a single layer of tags from the beginning of the document to the end. -- Azaroth
  • by azaroth42 ( 458293 ) on Monday July 23, 2001 @02:09AM (#68571) Homepage
    Here's a valid XML file and DTD:
    http://www.o-r-g.org/~cheshire/osd/osd.tgz [o-r-g.org]

    Also, a search engine (Cheshire2 [berkeley.edu]) running over the XML with a Very simple interface/display is available at:
    http://www.o-r-g.org/~cheshire/osd/ [o-r-g.org]

    Enjoy =)

    -- Azaroth