Ask Slashdot: Open Source vs Proprietary GIS Solution? 316
New submitter rnmartinez writes "As the Project Manager for a non-profit looking to implement a tech project, I am running into a few dilemmas, and as a casual Slashdotter I could really use some help. I'll start with a brief explanation of the project. We research issues in Canadian Immigrants, and found that there was a lack of recent, unaggregated information. As we dug further, we found that some data was available, but there was no central repository. Therefore, we are building a web based service to collect this data, with the intent of having it display in Google Maps and then be downloadable as a CSV file that is readable in GIS software such as ESRI Arcsoft, so that data may be visualized."
The dilemma: "...It seems that MS SQL offers more functions with regards to geometry built in then MySQL, and my developers (good guys, but MS guys at heart) want me to switch to .net NUKE/MSSQL and ditch the open source stuff." Read on for further details.
"To date, we have relied on a LAMP box with Drupal as the frontend to help provide a more social experience. However, it seems that MS SQL offers more functions with regards to geometry built in then MySQL, and my developers (good guys, but MS guys at heart) want me to switch to .net NUKE/MSSQL and ditch the open source stuff. As a believer in open source and as a non-profit, I am having some moral issues with this (I try and run Linux and open source on everything I reasonably can)."
"So here is my dilemma: do I dump $20K into moving everything to an MS solution that in the short-medium term might make the geometry functions (i.e. show me all the hospitals within a 20km radius of this cluster of immigrants) or do I get him to spend the same amount writing something similar for MySQL? The only issue there, is that I am not too fond of Oracle having ownership of MySQL. Should I be directing $20K into replicating these functions into something like MariaDB? Might be a longshot but again, as a non-profit I'd like to see something go back to everyone, not just my group."
"Really, I am open to any flexible, creative open and reliable solutions. Sorry if my knowledge is limited or if I am grasping at straws, and if I am being terribly biased, but I trust Oracle with open source about as much as I trust MS."
MySQL might fare poorly against MSSQL's geometry support, but how does PostgreGIS stack up?
MySQL and Drupal are fine (Score:5, Interesting)
I can safely say that MySQL handles GIS data sufficiently for the type of scenario you've described on Slashdot. I've got a database I periodically play with that is all the cities / major towns of the world and can quickly query it with distance data. (as in: Give me everything within 100 miles of Lat,Long)
MySQL doesn't make all of it as easy as it should be, but with some careful design decisions it can work surprisingly well for it.
I previously had created a distance function that worked pretty good, but more recent versions of MySQL have better internal support and I was able to ditch some of my custom routines.
(Note: my routines performed efficiently, but gathered extraneous data that could later be filtered out to be more precise. The extra data was maybe 10% more than necessary, but meant the query was very fast.)
Perhaps you should be looking to upgrade your MySQL instance, or getting some procedures/functions written for MySQL to handle some of the calculations.
(It's pretty easy to calculate a rectangle for the query to run against, and then follow up with a more precise distance calculation to filter out a few stranglers that slip into the dataset. (better to include a few that are dropped in later stages than to exclude valid data up front)).
As for the moral issues: Stuff them. They aren't what you should be focusing on.
Re:MS SQL is better (Score:3, Interesting)
I say pay whatever it takes to switch everything to MS-only technologies, since your developers prefer that. Then, to make the budget work out, cut all their paychecks by that same amount. (Actually, you should cut their paychecks by quite a bit more, since you have to account for the continual license fees needed, and the fact that some of them may quit.)
Or offer them the choice between this and doing something lower cost that works as well and doesn't require continual license fees.
Re:PostgreSQL with PostGIS (Score:5, Interesting)
GIS improvements already are in MariaDB (Score:5, Interesting)
MariaDB 5.3.3 [askmonty.org] (currently in RC status) has improved GIS functionality [askmonty.org] versus MySQL. While sponsoring new features could improve it even further what we are releasing right now might be enough for you. Please give it a try, and talk to us about how we could make it fit your needs better.
Re:PostgreSQL with PostGIS (Score:4, Interesting)
Well put.
Plus, Postgresql has things like "text" columns that work, and can be proper Unicode (not MS SQL-Server's crummy UCS-2, or Unicode haxx blobs where the normal text functions won't work anymore). It also doesn't rape your wallet like Oracle.
In short, Postgresql does Internationalization right. It is easy for MS Sql-Server to be considered fast, when it doesn't have to do many things properly (like proper UTF or arbitrary length text columns, or spatial indexing, or having source available, or costing nothing :) etc etc).
Use Postgresql, it will make hot girls like you!
Re:Distance calculation is trivial... (Score:5, Interesting)
...which raises the question: What is the most efficient way to store points on the sphere for lookup? Computationally? And in terms of storage?
1.) You can store lat/long, and use the Haversine formula, as you suggested. This requires trig functions, and has O(n) complexity; you need to iterate through all the points. You also have varying resolution over the surface, which makes bounding and early-outs a bit harder.
2.) A great many other coordinate charts also exist, and it's hard to say why you should choose one over the other without looking in detail at how the distance calculations are performed, etc.
3.) By using multiple charts -- e.g., a cube projection -- you can avoid issues with singularities, at the cost of branching. The complexity of distance calculations depends on the projection, but, without looking too carefully, my bet is that, in terms just of raw speed, cubemap vs. lat/long is probably a wash.
4.) Why use a coordinate chart at all, when you can use an embedding? If you store points in 3d, proximity calculations (since the points are on the sphere) just become a dot product. Much faster! It also opens up the possibility of, e.g. (if you will be doing many lookups but few insertions), storing indexes sorted along the three axes (or more!) to speed bounding-box (or more generally, sweep-'n-prune) calculations. Bins, bounding volume hierarchies, and the other standard tricks of computational geometry come into play. On the other hand, you're wasting a lot of codewords on points that don't actually lie on the sphere.
5.) Is there a more efficient use of codewords? Perhaps a (nearly-)-constant-resolution encoding scheme? If you start with the idea that a node in an octtree can be thought of as an octal number, you can see how you can encode points as real numbers in the interval [0, 1] -- e.g. "octal: .017135270661201") Of course, this still wastes codewords on points not on the sphere, so let's consider a refinement of this idea: At each level of the octree, discard any cube that does not intersect the sphere, and use arithmetic encoding, with the base varying between 8 and 2 depending on the number of cubes that intersect the sphere. This now seems like a (memory)-efficient way to encode points on the sphere -- but it is surely not computationally efficient. On the plus side, this same idea works for any manifold embedded in any Euclidean space, so at least it generalizes.
6.) Since #5 is a mapping from [0,1] to the sphere, one wonders if there are space-filling curves on the sphere. Of course there are -- e.g., the Hilbert curve in 2d, composed with any inverse coordinate chart. Not that this helps much!
I think my favorite of these is #5, but, practically, #1 or #4 are probably better choices.
So how do the real GiS systems do it?