Books Math Stats Science

Machine-Learning Algorithm Ranks the World's Most Notable Authors

HughPickens.com writes: Every year the works of thousands of authors enter the public domain, but only a small percentage of these end up being widely available. So how do organizations such as Project Gutenberg choose which works to focus on? Allen Riddell has developed an algorithm that automatically generates an independent ranking of notable authors for any given year. It is then a simple task to pick the works to focus on or to spot notable omissions from the past. Riddell's approach is to look at what kind of public domain content the world has focused on in the past and then use this as a guide to find content that people are likely to focus on in the future.

Riddell's algorithm begins with the Wikipedia entries of all authors in the English language edition (PDF)—more than a million of them. His algorithm extracts information such as the article length, article age, estimated views per day, time elapsed since last revision, and so on. This produces a "public domain ranking" of all the authors that appear on Wikipedia. For example, the author Virginia Woolf has a ranking of 1,081 out of 1,011,304 while the Italian painter Giuseppe Amisani, who died in the same year as Woolf, has a ranking of 580,363. So Riddell's new ranking clearly suggests that organizations like Project Gutenberg should focus more on digitizing Woolf's work than Amisani's. Of the individuals who died in 1965 and whose work will enter the public domain next January in many parts of the world, the new algorithm picks out TS Eliot as the most highly ranked individual. Others highly ranked include Somerset Maugham, Winston Churchill, and Malcolm X.
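
The feature-based ranking described above can be illustrated with a minimal sketch. The summary does not say how the features are combined, so the weights, the `AuthorStats` fields, and the sample numbers below are all hypothetical; this is not Riddell's actual model, only one simple way such features could be turned into a ranking.

```python
from dataclasses import dataclass
import math


@dataclass
class AuthorStats:
    # Hypothetical per-article features, named after those listed in the summary.
    name: str
    article_length: int        # characters in the article
    article_age_days: int      # days since the article was created
    views_per_day: float       # estimated daily page views
    days_since_revision: int   # days since the last edit


# Made-up weights for illustration only; the real model's weights are not given here.
WEIGHTS = {"length": 1.0, "age": 0.5, "views": 2.0, "staleness": -0.5}


def score(a: AuthorStats) -> float:
    """Combine log-scaled features into a single notability score."""
    return (
        WEIGHTS["length"] * math.log1p(a.article_length)
        + WEIGHTS["age"] * math.log1p(a.article_age_days)
        + WEIGHTS["views"] * math.log1p(a.views_per_day)
        + WEIGHTS["staleness"] * math.log1p(a.days_since_revision)
    )


def rank(authors: list[AuthorStats]) -> list[tuple[int, str]]:
    """Return (rank, name) pairs, with rank 1 going to the highest score."""
    ordered = sorted(authors, key=score, reverse=True)
    return [(i, a.name) for i, a in enumerate(ordered, start=1)]


# Invented sample numbers, not real Wikipedia statistics.
sample = [
    AuthorStats("Author A", 60_000, 4_500, 3_000.0, 2),
    AuthorStats("Author B", 1_500, 900, 3.0, 400),
]
ranking = rank(sample)
```

With these invented inputs, the long, heavily viewed, recently edited article ranks first. Log-scaling is a design choice here so that one very large feature, such as page views, does not swamp the others.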
  • https://medium.com/the-physics... [medium.com]

    That gave us Linnaeus as the most influential person in world history.

    To be Anglo-centric for a moment, I don't even see William Shakespeare as eligible on the new list.

    Maybe this should be recategorized as "funny things you can do with computers"?

  • by Anonymous Coward

    What a load of crap.

    This is why you get rubbish like the BBC destroying lots of "classic" early TV series (throwing the film into skips). But they made sure there was space for old episodes of Panorama most of which involved cretins of the day talking shite which is irrelevant in a few years.

    The whole point of archiving is that you literally have *no clue whatsoever* what is going to be valuable in the future.

    If you did you would be a stock market billionaire multiple times over.

    • The trouble is budgets and manpower.

      If you know you don't have the resources to save everything, you have to have some way of prioritizing.

      Personally, I would rather save one or two pieces from as many different authors as possible, rather than trying to get everything of the "most important" authors.
    • Because the BBC was basing their decision on a machine learning algorithm?

      No, wait, you seem to be an illiterate moron who was moderated positively because people agree with your basic premise of "archive anything" without realizing that you have nothing whatsoever to do with the topic at hand.

      And when I say illiterate I mean your prostitute slash sister typed these words for you. And the two people who moderated you positively are on some unknown strain of weed that makes them agree with someone who says

  • Ridiculous and sad (Score:5, Insightful)

    by Katatsumuri ( 1137173 ) on Tuesday November 18, 2014 @08:29AM (#48410033)

    Of the individuals who died in 1965 and whose work will enter the public domain next January

    This says so much about our culture...

    Are there jurisdictions where one could legally and openly operate a Project Gutenberg clone with more recent works?

    • by Katatsumuri ( 1137173 ) on Tuesday November 18, 2014 @08:48AM (#48410113)

      I quickly checked Wikipedia [wikipedia.org], and most countries seem to stick with at least "Life + 50yr" term. That is a great achievement of the lobbyists.

      Some island nations seem to have no known copyright legislation, but they are still usually parties to some limiting international treaties, and also have similar restrictions under other names ("unauthorized copying", etc.)

      Seriously, is there no place on Earth with more reasonable terms?

      • by tlhIngan ( 30335 )

        I quickly checked Wikipedia, and most countries seem to stick with at least "Life + 50yr" term. That is a great achievement of the lobbyists.

        Some island nations seem to have no known copyright legislation, but they are still usually parties to some limiting international treaties, and also have similar restrictions under other names ("unauthorized copying", etc.)

        Seriously, is there no place on Earth with more reasonable terms?

        You have to realize that most countries are bound by the Berne Convention [wikipedia.org] w.r.t. c

  • I really like G.K. Chesterton, but how can he be ranked higher [publicdomainrank.org] than Arthur Conan Doyle and Sigmund Freud?
  • by Mikkeles ( 698461 ) on Tuesday November 18, 2014 @08:51AM (#48410129)

    It may make more sense to concentrate on those lower in the list. The works of highly rated authors are likely to remain available anyway whereas those of lower rated authors are more likely to be lost.
    Admittedly, the loss may be deserved, but I am willing to bet there are some (if not many) that will be more highly appreciated in a century or so.

  • What if I translate someone's book, and release my translation into the Public Domain immediately? Would an alternative Project Gutenberg of liberally licensed translations work?

    At least the Berne Convention says that "Translations, adaptations, arrangements of music and other alterations of a literary or artistic work shall be protected as original works without prejudice to the copyright in the original work."

    Of course the translation is not the same thing. Also, it is more complicated than that. The auth

    • by bws111 ( 1216812 )

      Your translation does not make the original copyright invalid, which is what your highlighted phrase means. You still need permission to make the translation in the first place, and if you don't have it, you have committed copyright infringement. However, if you have a license from the copyright holder, then your new work can be released on whatever terms you and the original copyright holder agreed to.

      • by Anonymous Coward

        You still need permission to make the translation in the first place, and if you don't have it, you have committed copyright infringement.

        You are technically incorrect. Making a translation without the author's permission isn't copyright infringement. Distributing it is.

    • Your translation will have at least two copyrights applying to it: the original author's and the translator's. It can't be used without licenses from both. It can't be distributed just with a license from the original author, hence the protection as an original work. It can't be distributed just with a license from the translator, since that would be prejudicial to the original author's copyright.

  • Riddell's algorithm begins with the Wikipedia entries of all authors in the English language edition (PDF)—more than a million of them. His algorithm extracts information such as the article length, article age, estimated views per day, time elapsed since last revision, and so on....Others highly ranked include Somerset Maugham, Winston Churchill, and Malcolm X.

    For folks like Winston Churchill and Malcolm X who had notable careers outside of writing, I wonder how they distinguish what part of their Wikipedia stats is due to their writing and what part comes from the rest of their careers?

  • Glancing at the partial list of topics presented suggests this work won't be too hard to improve on:

    Topic | Characteristic words
    4 | categori of birth death stub date name persondata place metadata
    20 | univers of the faculti colleg at and edu professor alumni
    31 | painter paint of art artist the and in work museum
    35 | he in his was and the to of categori at
    77 | he the his in to was of and on at
    97 | chines china hong kong zh taiwan zhang shanghai wang beij
    100 | the book writer novel fiction of and stori isbn novelist
    149 | of the and in historian univers languag histori studi translat
    160 | she her in the and was to of as with
    168 | the to that in and of ref was had by
    Table 1: Examples of topics derived from text of Wikipedia articles
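
One way to act on that observation is to flag topics whose characteristic words are mostly function words. A rough sketch, using three rows copied from the table above; the `STOP_WORDS` set and the 0.5 threshold are arbitrary assumptions for illustration, not anything taken from the paper:

```python
# Characteristic words for three topics, copied from Table 1.
TOPICS = {
    77: "he the his in to was of and on at",
    97: "chines china hong kong zh taiwan zhang shanghai wang beij",
    100: "the book writer novel fiction of and stori isbn novelist",
}

# Small hand-picked stop-word list (an assumption, not from the paper).
STOP_WORDS = {
    "he", "she", "her", "his", "the", "in", "to", "was", "of",
    "and", "on", "at", "as", "with", "that", "had", "by",
}


def stop_word_fraction(words: str) -> float:
    """Share of a topic's characteristic words that are stop words."""
    tokens = words.split()
    return sum(t in STOP_WORDS for t in tokens) / len(tokens)


def uninformative_topics(topics: dict, threshold: float = 0.5) -> list:
    """Return the ids of topics dominated by stop words, in ascending order."""
    return sorted(
        tid for tid, words in topics.items()
        if stop_word_fraction(words) > threshold
    )


flagged = uninformative_topics(TOPICS)
```

Here only topic 77 is flagged, since every one of its words is a stop word, while topics 97 and 100 carry recognizable subject matter.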

  • This is not a machine picking out which authors are worthy of digitizing; it is a computer scanning Wikipedia and a few other sites. In other words, it is meta: it ranks what regular humans have already ranked through the words and effort they spent describing each author. The merit of those critics/reviewers is questionable.

    Deciding what is worth digitizing based on the merit of the work itself is not part of this article. For now, I'll stick with librarians deciding what to focus on.

  • ... Edward Bulwer-Lytton?

  • It all sounds fairly standard, as these things go. What has earned it the "machine-learning" distinction?

  • So, based on this algorithm, the #1 priority author would be Sherrilyn Kenyon (who writes paranormal romance), followed by Al Sarrantonio (who writes horror, and puts together a bunch of anthologies), and Muammar Gaddafi (yes, that Muammar Gaddafi). Number six is Gardner Dozois, who's also (like Sarrantonio) an anthologist.

    If this is designed to be popularity-based (e.g. designed to determine what people most want to see get scanned/uploaded/entered/produced by something like Gutenberg, rather than an asse

  • Based on his prolific works on Slashdot, I'm wondering where frequent contributor Bennett Haselton is on the list?
  • when a machine actually reads all these books and starts making comparisons based on content.

  • Bram Stoker being #1 in the 1910 decade, way ahead of someone like Mark Twain? In what universe?

    The list is full of mediocrity floating at the top, while profound authors are ranked way lower (Calamity Jane > Chekhov, for instance).

    The complete failure of this ranking experiment just shows how true AI is still 20 years in the future (as it has been for the past 50 years)...

  • For example, the author Virginia Woolf has a ranking of 1,081 out of 1,011,304 while the Italian painter Giuseppe Amisani, who died in the same year as Woolf, has a ranking of 580,363. So Riddell's new ranking clearly suggests that organizations like Project Gutenberg should focus more on digitizing Woolf's work than Amisani's.

    Which will lead to... exactly the thing we started from.

    Wikipedia is a huge circle-jerking effort. If you run this effort over the whole of it, you'll no doubt find out that the "works" of some porn stars are more influential than some of the more obscure philosophers.

    It's not so simple, and while the basic project is interesting, conclusions like "you should focus more on this" are clearly written by imbeciles who don't understand that influence isn't the same as citation count or page rank.

    The pre

  • Every year the works of thousands of authors enter the public domain

    No copyright has expired in the US since 1998, and none will expire until at least 2019. I say "at least", because you can be sure there will be lots of lobbying to extend them even further. I hope the rest of the world is enjoying their public domain... while they still have it.

  • Take a look at the "most important" (highest ranking) deceased author from the 1980s [publicdomainrank.org]. It is science fiction/fantasy writer Tom Godwin. Number two is Stanton A. Coblentz. Also in the top 20 (in order): Lin Carter, Robert A. Heinlein, Mack Reynolds, Theodore Sturgeon, James Tiptree, Jr., Clifford D. Simak. Forty percent of the top 20 are SF&F authors. Meanwhile we have Tuchman at 101, Sartre at 112, Borges at 254, Tennessee Williams at 439, Toynbee at 526, and so on.

    Looking at the 1990s, the top loading by SF

  • ... does that mean I've read too much Harry Potter?