The Media The Internet

Data Mining Rescues Investigative Journalism

John Mecklin sends in word of initiatives through which the digital revolution that has been undermining in-depth reportage may be ready to give something back, through a new academic and professional discipline known as "computational journalism." "James Hamilton, director of the DeWitt Wallace Center for Media and Democracy at Duke University, is in the process of filling an endowed chair with a professor who will develop sophisticated computing tools that enhance the capabilities — and, perhaps more important in this economic climate, the efficiency — of journalists and other citizens who are trying to hold public officials and institutions accountable. The goal: Computer algorithms that can sort through the huge amounts of databased information available on the Internet, providing public-interest reporters with sets of potential story leads they otherwise might never have found. Or, in short, data mining in the public interest."
  • by Lumenary7204 ( 706407 ) on Sunday January 04, 2009 @05:58PM (#26323123)
    It doesn't matter how efficient journalistic gum-shoeing becomes, because the end product will still be subject to a certain amount of spin by the publisher.
    • Oh I'm sure some OSS will come out to put an anti-spin to their spin using some of the same data mining they're using.

      • Oh I'm sure some OSS will come out to put an anti-spin to their spin using some of the same data mining they're using.

        I only read articles with +1/2 and -1/2 spin.

    • The question is more along the lines of "what is a journalist".

      Right now, it seems that a transcription machine meets the criteria. The current "journalists" simply do not ask (and follow up on) meaningful questions. They ask crap questions and focus on non-issues. And then they accept non-answers to those questions.

      I'd be very surprised if the majority (51%+) of "political" "journalists" could even name their own Congress Critters.

      And tech "journalism" is even worse.

      About the only fields where they get it

    • Re: (Score:1, Insightful)

      by Anonymous Coward

      Great is truth, but still greater, from a practical point of view, is silence about truth. By simply not mentioning certain subjects... totalitarian propagandists have influenced opinion much more effectively than they could have by the most eloquent denunciations. -Aldous Huxley

    • It doesn't matter how efficient journalistic gum-shoeing becomes, because the end product will still be subject to a certain amount of spin by the publisher.

      The only thing that will save "journalistic integrity" is the journalism field adhering to openly stated ethical principle and practices. No amount of technology is going to fix that problem.

    • The reason spin dominates the media, is not because publishers give a damn. They just want a paycheck. But spin is cheaper than facts. Hard hitting journalism continues to be popular, but in light of crashing budgets, very few outlets are putting resources into it. And investigative journalism is the most expensive kind. If databases make that job cheaper, the quality of information will improve.

  • by spiffmastercow ( 1001386 ) on Sunday January 04, 2009 @05:59PM (#26323131)
    so does this mean maybe reporters will stop pulling statistics out of their asses once they have a tool to provide reliable statistics with a minimum of effort?
    • by mac1235 ( 962716 ) on Sunday January 04, 2009 @06:03PM (#26323165)
      No, most reporters will continue to copy PR releases into articles.
    • Re: (Score:3, Insightful)

      No, it just means they will shove the statistics with which they don't agree back up their asses where the sun don't shine.

      Out of sight, out of mind...
    • What it means is that state and local legislators will start making data illegal or expensive to obtain. This has already happened. In Kansas there is now a fee for motor vehicle records. That's because a local reporter ran the motor vehicle records of school bus drivers and revealed how many offenses they had. This was so embarrassing that they introduced a fee to make it cost-prohibitive to datamine AND they spun off the school bus business to a private company that is not subject to public review (legally).


  • by MikeRT ( 947531 ) on Sunday January 04, 2009 @06:03PM (#26323159)

    But as it is, we can't get local news media to perform their "watchdog" role in most cases. I can't even begin to count the number of times when I've seen a case that looked suspicious as hell based on the reporting of it, but the local media just parroted the police/prosecutor's story and moved on. Alternatively, when they do get involved, it's often in cases like the Jena 6 where you end up finding out that the media was spreading disinformation and building up a narrative to make more profit.

    Most news media have become a combination of an AP outlet and a source of editorials and classifieds. They're like a primitive RSS feed with some mashed up content thrown in there for local flair.

    • by EmbeddedJanitor ( 597831 ) on Sunday January 04, 2009 @06:50PM (#26323539)
      Journalism is not about reporting the truth, it is about contributing to and competing in an advertising and entertainment industry. In depth is not important, quickly generating good TV and print images to attract eyeballs and thus newspaper/advertising sales is everything. Getting access to the information and sources is an absolute must.

      The journalists groom their sources and need to stay in their sources' good books to keep up access. Play ball and you get embedded with a patrol so you can send back gripping combat footage. Piss off the brass and you get embedded with the guys washing trucks at the transport park.

      It is no wonder that editors and TV execs are quick to fire and distance themselves from any journalists that forget this and start snooping too deeply. Just look at []

      • "Journalism is not about reporting the truth, it is about contributing to and competing in an advertising and entertainment industry."

        Which by your definition bloggers will never be journalists.

      • Journalism is not about reporting the truth, it is about contributing to and competing in an advertising and entertainment industry.

        Your observations are valid but incorrectly attributed. You are confusing journalism with publishing.

      • Re: (Score:3, Insightful)

        to be fair, what you're describing is the media industry, not journalism itself. journalism is a trade/discipline that serves a crucial role in a free & democratic society. that it has been bastardized and corrupted by commercial interests does not preclude the existence of true journalism which is based on professional integrity and a civic duty to keep the public informed.

        what i'm confused about is why the poster accuses the "digital revolution" of undermining in-depth reportage. there's a huge differ

        • by umghhh ( 965931 )

          It may be true that what you called 'monopoly of main stream outlets' is bad for quality of journalism, this quality is not improved however by multiplying the number of outlets. If anything the quality went down which has two reasons: people pay no attention to this many sources of information and the sources of information being rubbish as nobody is willing to pay for quality. Another thing which young people in the so called industrialized world may find difficult to believe is that there is non-digitize

        • You are correct, what I was talking of is journalism as is practiced. I expect that most journalists come out of college with a sparkly eyed passion for the truth and are soon brought down to ground with a major thump when they find that the industry, as a whole, does not want this. Real journalists are few and far between and are seldom linked to the main media outlets.

          The blogosphere does mean that anyone can become a reporter which makes for a far more democratic medium where you are not censored by an e

      • by R2.0 ( 532027 )

        It is no wonder that editors and TV execs are quick to fire and distance themselves from any journalists that forget this and start snooping too deeply. Just look at []

        Hmmm, I wondered about that - my memory didn't agree with your statement. So I followed the link and found this in the article:

        In 1998 Arnett narrated a joint venture between CNN and Time Magazine called NewsStand, which described what he called "Operation Tailwind." The report falsely claimed that the

    • Re: (Score:3, Insightful)

      More to the point, I want to know how you preclude all these shiny-miney algorithms from being tweaked with misinformation.
      Sure, the really gross stuff is going to get dumped, but the real Machiavellis will engage in propaganda oh so subtly...
      • by Qzukk ( 229616 )

        I want to know how you preclude all these shiny-miney algorithms from being tweaked with misinformation.

        Who needs misinformation? This will be as useless for the journalists as datamining is for governments, except "Link Discovered Between Illinois Senator Pick and Price of Canned Spinach!" is more entertaining than being "detained" for days while men in black interrogate your boss and coworkers because the computer said so.

  • In other news... (Score:4, Insightful)

    by djupedal ( 584558 ) on Sunday January 04, 2009 @06:09PM (#26323219)
    Investigative Journalism Rescues Data Mining []
  • by vlm ( 69642 ) on Sunday January 04, 2009 @06:30PM (#26323383)

    -- Local businesses whose ad spend equals 100 and whose owners
    -- are not on the list of the publisher's buddies.
    SELECT *
    FROM advertising_revenue_table AS ad
    JOIN list_of_local_business_table AS biz
      ON ad.business_name = biz.business_name
    WHERE ad.cost_of_ad_space_purchased = 100
      AND biz.owners NOT IN (SELECT names FROM list_of_publishers_buddies)
    ORDER BY ad.cost_of_ad_space_purchased ASC;

  • by Anonymous Coward on Sunday January 04, 2009 @06:31PM (#26323391)

    The digital revolution didn't do-in journalism. That was Watergate. After that, and the Left's orgasm over the idea of reporters taking down presidents, propagandists are now all we have. Remember the 'fight' over which reporter would fly with Obama to Iraq, while no one was fighting to go with McCain all those times he went.

    Ask them: "Why be a journalist?"
                        "To make a difference." is the reply.

    By definition, journalists don't "make a difference", they tell a story. Propagandists "make a difference". Just ask Himmler.

    It's gotten so bad that, despite all the channels, and all the money-losing newsrooms on cable/satellite TV, the stories all use the same words. It's because the left owns almost all of them.

    Some might say this consensus makes them right, but it really doesn't. How many times is Fox News chided because they don't agree? Who's programmed, the TV, or us?

    What they leave OUT of a story is just as important as what gets IN.

    Until just the other day, Charlie Rose and (I think it was) Dan Rather were discussing Obama. "We don't know anything about him- who are his heroes?"

    Meanwhile so much was known about "Joe the plumber" that he could barely get work in his town.

    Meanwhile they sent 30+ reporters to scam information in Alaska about Palin, making up things when nothing was available.

    But no...two years of investigation on Obama turned up nothing. Not a word on broadcast TV about Bill Ayers (an unrepentant bomber of the Pentagon and murderer who got free on a technicality). Not a word about Obama's heroes like Saul Alinsky (sp?) who is so far Left he bumps elbows with Stalin.

    These people are not in the periphery; these are people with whom he's tightly tied. But that doesn't matter any more, he's elected. Just remember you asked for it. He'll make history, alright.

    But now I suppose, we expect reporters to dig through computer data, and the digital revolution might do something for the industry. Well after being the top radio show host for two decades, they still think Limbaugh is racist. (Not hard to disprove) or fat (that was a decade ago). Yeah, those reporters are really hard working investigators. All they need do is *listen* to the show, and they won't do that.

    Journalism suffers from the same thing science does: loss of integrity. "Show me the money". And "vote for my guy". Truth no longer matters to these people, though it should to you.

    This 'digital revolution' will do nothing but help THEIR causes, not truth.

    • These people are not in the periphery; these are people with whom he's tightly tied. But that doesn't matter any more, he's elected. Just remember you asked for it. He'll make history, alright.

      From the context, it sounds like you are phrasing that as a negative.

      So, make a statement that can be tested as to what, specifically, you believe he will do.

      Otherwise you're the same as the people you denigrate.

      This 'digital revolution' will do nothing but help THEIR causes, not truth.

      Truth is a difficult thing. I'l

      • we don't know exactly what Obama will do, but we do know that the beliefs he has, upon which he will base his actions, are fundamentally flawed.

    • My friend, McCain got over it pretty quickly. You should too. Breathe!
    • Propagandists "make a difference". Just ask Himmler.

      I'm not sure Himmler made a huge propaganda difference, otherwise old Germans would be crazy about occult/pagan shit.

    • Not a word on broadcast TV about Bill Ayers (an unrepentant bomber of the Pentagon and murderer who got free on a technicality).

      Being framed by the FBI is a technicality?

    • To summarise your post: Everyone is suffering from a loss of integrity except Limbaugh and Fox News. You also know all your assertions are "true" because Fox and Limbaugh told you they were "true".

      The depressing irony is that propaganda has convinced you that you are immune to propaganda.
    • I think we on the right need to stop crying about the "left wing" media, when, we now have our own media outlets too. We dominate radio, we have a good and growing presence on TV, and our print is expanding while theirs is shrinking.

      The fact is, we lost this election because the Republican Party has tried to fuse libertarian economic policies with social conservatism and that plan could not work at a time when libertarian economics is in considerable doubt. The conventional wisdom is that Republicans shou

      • Are you on crack?

        McCain was everything that you said the GOP needed and he got destroyed.
        The only votes McCain got were the "not Obama" crowd
        • McCain was everything that you said the GOP needed and he got destroyed

          I was wrong about McCain, but a look at the demographics of this election is illuminating. Free trade cost McCain dearly. Every state that McCain lost is a state that has lost big in free trade, and that includes Virginia and North Carolina. The conventional wisdom is that values don't matter and Republicans should stick to their economic guns, but it's just political suicide.

    • It depends a lot on your perspective.

      If you're standing next to Bush, all the media (and even Berlusconi and Sarkozy) will look leftist.

  • All it means is that they sit in an office surfing for who knows what, instead of getting "out there" and discovering the news first hand.

    A lot (most?) TV and print media have well publicised portals for eye witnesses to call in, or send their photos. It's certainly cheaper than having to employ one of your own (or, god forbid, having to pay out for agency or newswire product) to get the pictures for the evening news. Plus, of course, eye witnesses give the impression of "real people" - so it's got to be g

  • Subject (Score:4, Insightful)

    by z-j-y ( 1056250 ) on Sunday January 04, 2009 @06:50PM (#26323549)

    It's not what journalists don't know. It's what they don't report.

    And basically people just don't care. Have we decided who to blame for the economic collapse yet? But bathroom foot tapping, wow, that's the shit we have to get to the bottom of.

  • Oh bull (Score:5, Interesting)

    by Groo Wanderer ( 180806 ) <.moc.etaruccaimes. .ta. .eilrahc.> on Sunday January 04, 2009 @07:12PM (#26323777) Homepage

    As someone who does investigative journalism for a living, data mining won't get you squat. Having done it for a living for 5+ years, and being very familiar with data mining, the two so rarely cross paths that it rounds to zero.

    Why? Because if it is in minable form, it doesn't take any digging to find. If you can run a google search and get even a tidbit about what you need, you don't need investigative journalism.

    Of the stories I have gotten, little ones like the P4 going 64 bits, it never reaching 4GHz, Dell exploding laptops (an assist on that one), and more recently the Nvidia bump cracking problem(s), none of that would have been possible through data mining.

    If it is out there, it doesn't need an investigative journalist. If it isn't, then data mining won't help. The end.


    • Re: (Score:2, Interesting)

      by binpajama ( 1213342 )

      I'm a grad student and have recently been asked to help out on a research grant proposal for the very same thing. I agree with the point made in the parent post - if it's already out there, there's not much investigation needed. Additionally:

      1) How will algorithms figure out if a story is relevant? There's no deus ex machina here. It will see if the article has the relevant buzzwords and if it has been released by a reputable source.

      2) The buzzword factor kills the algorithm's chances of finding somethi

      • Re: (Score:1, Interesting)

        by Anonymous Coward

        The way I understand this Journalistic data mining malarkey is that, as mentioned it helps to discover leads or starting points in public interest stories.

        It's pretty much the way (I think) science should be conducted in the future. The best leads in science come from blips in your data, things that shouldn't be there. Data mining helps to identify these blips in the data, but does nothing to analyse them. As always, there's the hard slog of trying to figure out what the blip means that comes after identifi

        • by wik ( 10258 )

          Finding these blips is the easy part. Any first year grad student can do it. They will even learn something from the process.

          The interesting part is figuring out which blips are important and which don't matter, then explaining why. Pushing the identification part to an algorithm is a waste of time and I don't expect computers to be taking over the research part any time in the foreseeable future.
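To make the "finding blips is the easy part" point concrete, here is a minimal sketch of the flagging step, assuming a simple series of counts. The function name `find_blips`, the sample numbers, and the 3.5 cutoff are all illustrative assumptions, not anything the commenters described. It uses a median-based (modified z-score) test and only flags candidates; deciding which ones matter is still the manual work.

```python
from statistics import median


def find_blips(values, threshold=3.5):
    """Return indices of values whose modified z-score exceeds threshold.

    The modified z-score uses the median and the median absolute
    deviation (MAD), so a single extreme value does not hide itself
    by inflating the spread estimate.
    """
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        # All values (or most of them) are identical: nothing to flag.
        return []
    return [
        i for i, v in enumerate(values)
        if abs(0.6745 * (v - med) / mad) > threshold
    ]


# Hypothetical monthly counts of some recorded event.
monthly_counts = [12, 14, 13, 15, 11, 13, 96, 12, 14]
blips = find_blips(monthly_counts)
```

With the sample data above, only the seventh month (index 6) is flagged. The code says nothing about why that month is unusual.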

    • The one story I'll note was the options backdating scandal of about a year or two ago. The WSJ found that entirely through mathematical analysis of employee options performance. So sometimes data does turn up interesting things.

      • I agree with that, but do you think the person doing the mining decided to mine all option performance for all companies over the course of the last few years, or do you think they were working on a tip/hunch?

        The situation you describe is some investigative journalist noticing something, and then using data mining as a tool to verify the issue. I may be wrong on this, but to me, there was a biological program at work that decided what and how to use the silicon program.

        That is not to say the mining wasn't v

    • and has been used in at least some news organizations, because more than ten years ago I wrote a data mining program for Crain Communications (publisher of "Crain's New York Business," "Advertising Age," "Pensions & Investments," and "Crain's Chicago Business"). They used it to identify trends, which is a crude use of data mining but something used to fill space nonetheless.

      So I don't think data mining per se will help citizens and bloggers do more investigative journalism, but the increasing availabili

    • Charlie: sure, but don't you ever have story leads buried in public data? Some of it is demographic issues ( why are three Alabama counties posting infant mortality rates similar to Africa? ). Some of it is fleshing out stories like why a Senator voted on a specific bill, so you can write from evidence rather than anecdote. These aren't classically investigative, but there's new information of public value, no? It's a different kind of reporting, but it's pretty cool stuff when it works.

      As for an "algorithm

      • Yeah, but the key is simple, did data mining make the story, or did the story get noticed by data mining.

        To use your infant mortality example, do you think that someone or a program was poking away at a database, noticed the correlation, and ran with it, or do you think someone had a theory and tested it by data mining? I would be willing to bet it is the latter, IE someone noticed a high rate of infant mortality, and tossed out a bunch of what if queries.

        To me, and you very well might differ on this, the i

        • Yeah, when I was at the Center for Public Integrity, we very much started with the database and trolled for leads. There are MANY public datasets put out by the government that no one has ever even looked at. One story I worked on crawled legal docs for keywords indicating an appeals court judge rebuked a prosecutor for cheating. Then, we verified the docs (thousands) and had a nice list of the worst prosecutors in America. And guess what: they clustered. Nice little dens of abuse of power, based not on our
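A rough sketch of the workflow described above: scan documents for keywords, then see where the hits cluster. Everything here is assumed for illustration, including the keyword list, the record fields (`county`, `prosecutor`, `text`), and the sample opinions. It is not the actual method or data from that project, and keyword hits would still need the manual verification the commenter mentions.

```python
from collections import Counter

# Hypothetical phrases that might signal a court rebuking a prosecutor.
REBUKE_TERMS = ("prosecutorial misconduct", "brady violation", "improper argument")


def flag_opinions(opinions, terms=REBUKE_TERMS):
    """Return the opinions whose text contains any term (case-insensitive)."""
    flagged = []
    for opinion in opinions:
        text = opinion["text"].lower()
        if any(term in text for term in terms):
            flagged.append(opinion)
    return flagged


def cluster_by(flagged, key):
    """Count flagged opinions per value of the given field."""
    return Counter(opinion[key] for opinion in flagged)


# Invented sample records standing in for thousands of crawled documents.
opinions = [
    {"county": "Adams", "prosecutor": "P. Smith",
     "text": "We find prosecutorial misconduct in closing."},
    {"county": "Adams", "prosecutor": "P. Smith",
     "text": "The Brady violation requires reversal."},
    {"county": "Baker", "prosecutor": "R. Jones",
     "text": "Conviction affirmed."},
    {"county": "Adams", "prosecutor": "L. Lee",
     "text": "An improper argument, but harmless."},
]

flagged = flag_opinions(opinions)
by_county = cluster_by(flagged, "county")
by_prosecutor = cluster_by(flagged, "prosecutor")
```

Sorting `by_county.most_common()` or `by_prosecutor.most_common()` gives the kind of ranked list a reporter would then check by hand against the source documents.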

    • Charles Lewis: "There's a lot of datasets that most people don't know about. I was excited to hear a few years ago, that the USDA has a database of all the bad meat in America that's been recalled. Who knew? I had no idea. I don't know if it's online or if it's there internally, but there are hundreds and hundreds of databases with massive amounts of data that nobody knows about or ever looks at. You could dine off the databases."

      He also does inve

  • Just another use (Score:2, Interesting)

    by emilienne ( 647608 )
    The Cline Center for Democracy at UIUC has been running a data mining project, scanning archives and contents of newspapers around the world for reports of political disturbances such as riots, etc. The project, a collaboration between the center and the UIUC CS department, is meant to facilitate research on domestic stability and the like. Currently it's focused primarily on English papers, but efficiency and completeness will dictate searches in other languages sooner or later.

    Information can be s
  • by DrEasy ( 559739 ) on Sunday January 04, 2009 @07:50PM (#26324117) Journal

    To me a journalist is someone who provides the raw data. In the "Web 2.0" world (pardon the buzzword), anybody can do the data mining and editorializing, and it's great to be able to read different interpretations of the same data by different people.

    This is what happens in the sabermetrics world (i.e. baseball stats analysis). Some source provides the raw data, but people merrily discuss and disagree on its meaning on various blog sites. There is none of this confusing mix of data and biased interpretation that you get in most news reporting nowadays.

    If a blog is commercially successful, it will be an incentive to the blogger to dig out more raw data, or rather get a journalist to find him some, as it's not necessarily the same skill.

    • No, sources provide the raw data. Journalists report the most interesting bits to the public.

      Databases are just another source. There is so much data out there that no one looks at (public records, etc). When I was doing investigative work full time, we had spiders out that just pulled every .MDB file from a .gov URL. All kinds of interesting stuff showed up, most of it not "published" in any usable way, but often of great public interest (example: we located and published all the raw contracts between DOJ
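As a small illustration of the filtering step such a spider would need, the sketch below keeps only URLs that point at an .mdb file on a .gov host. The function name `is_gov_mdb` and the sample URLs are made up for the example; the crawling itself, and whatever tooling was really used, is not shown or known.

```python
from urllib.parse import urlparse


def is_gov_mdb(url):
    """True if the URL's host ends in .gov and its path ends in .mdb."""
    parsed = urlparse(url)
    host = (parsed.hostname or "").lower()
    # Compare the path only, so query strings do not hide the extension.
    return host.endswith(".gov") and parsed.path.lower().endswith(".mdb")


# Hypothetical links harvested by a crawler.
candidate_urls = [
    "http://www.example.gov/data/contracts.MDB",
    "http://www.example.com/data/contracts.mdb",
    "http://agency.example.gov/reports/summary.pdf",
    "https://example.gov/files/recalls.mdb?download=1",
]

hits = [url for url in candidate_urls if is_gov_mdb(url)]
```

Only the first and last sample URLs pass the filter; the .com host and the PDF are dropped.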

      • by DrEasy ( 559739 )

        I don't know, if all a journalist did was wait for press releases to come to him/her, there wouldn't be many facts or much truth uncovered, would there? To me a journalist is someone who actively gathers facts and reports them (and that's a dying breed, maybe that's why our definition of journalism has shifted?). Of course gathering facts is not a neutral and unbiased process (it is natural to tend to look harder for evidence that confirms your own opinion).

        The way I see it, the journalists are doing the initial p

        • Sure, but in many, many cases there are public databases that no one has ever looked at. It's not all that different from investigative reporters digging around for information in court docs. Some of the best investigators work ENTIRELY from legal records. Then they do some interviews to color it up at the end when they have it nailed down. It's the same workflow when reporting from an obscure government or corporate database.

          Oh, and just for perspective I've been working as a reporter since 2001, and I rar

  • Why bother ? "Journalists" already have access to Facebook and MySpace and they can even hit Wikipedia for a quote now and again. What more do they need to write a sensationalist op-ed ?

    They don't even bother harassing family members for photos anymore - they rip them straight from Facebook. All the pics, family links, likes and dislikes...

    Bobby Young (pictured left) died tragically yesterday when...blah..blah..blah. The 8 year old university student, a deeply religious man and devout Jedi, was said t

  • If you're in the world of investigative journalism I'd encourage you to take a look at a new class of semantic data generation tools. New capabilities like Calais (from Thomson Reuters) allow you to ingest unstructured text (news articles, press releases, FOIA documents, whatever) and automatically extract semantic metadata like people, companies, management changes, natural disasters and hundreds of others. You can take the output of these tools and load them directly into databases to
  • by DynaSoar ( 714234 ) on Monday January 05, 2009 @12:31AM (#26326123) Journal

    > a new academic and professional discipline
    > known as "computational journalism."

    Differing only in complexity but not principle from the same sort of search engine journalism that's resulted in decline of both accountability and accuracy of news over the past decade. Perhaps some investigative journalism into the lack of actual investigation into investigation is in order. "Hits" != veracity.

  • Nice to see an interest in computer assisted reporting (CAR), although I'm a little baffled at the article linked calling this an "emerging" practice. I've been at this for about a decade, and there were plenty here when I showed up.

    A few observations:

    1) Regarding other commenters: anyone who talks about "journalism" as if the field is one homogeneous, cohesive group is maybe not thinking too deeply about media. Kind of like how "Americans" or "humans" covers a lot of folks.

    2) All journalism is data mining

  • by tjstork ( 137384 ) <todd,bandrowsky&gmail,com> on Monday January 05, 2009 @02:20AM (#26326771) Homepage Journal

    You may as well rename this, "Crackpottery goes mainstream". Instead of calling a few people, doing a couple of interviews, writing up their impressions as a story, journalists will now have automation to help them do what nuts do. Just like so-called UFO, alien and jfk assassination researchers do manually, journalists will be able to arrange players, dates and events to fit any tale imaginable. Government, UN, corporate, environmental conspiracy stories will abound, and the sky is the limit.

  • Journalists will continue to use the NULL search technique.

    To support your claim of <insert cause here>, do a search that REQUIRES a lot of words that are descriptive of your opposition. Then, when NO RESULTS are found, you can write that no one opposes your claim of <insert cause here>.
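The trick described above is easy to reproduce. In this toy sketch (the `search_all` helper and the three sample documents are invented for illustration), every required term must appear, so piling on terms shrinks the result set to nothing even though opposing documents plainly exist.

```python
def search_all(documents, required_terms):
    """Return documents containing every required term (case-insensitive)."""
    required = [term.lower() for term in required_terms]
    return [
        doc for doc in documents
        if all(term in doc.lower() for term in required)
    ]


# Invented corpus: two of the three documents describe opposition.
documents = [
    "Critics say the bypass will raise taxes.",
    "Residents oppose the bypass over noise.",
    "Council approves the bypass budget.",
]

# A modest query finds the opposition.
broad = search_all(documents, ["bypass", "oppose"])

# An over-constrained query finds nothing, which proves nothing.
narrow = search_all(
    documents, ["bypass", "oppose", "taxes", "lawsuit", "petition"]
)
```

An empty `narrow` result reflects how the query was built, not the absence of opposition in the corpus.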
  • Computational Journalism is much broader than just data-mining. At Georgia Tech I taught courses in the area in 2007 and 2008 which covered everything from mobile newsgathering, to information visualization, automatic content analysis, social computing, storytelling and authorship, aggregation, summarization, information mashups, and consumption interfaces. The bigger question is: how can computation help in every aspect of journalism: gathering, sensemaking, authoring, and dissemination, while still mainta