Machine-Learning Algorithm Ranks the World's Most Notable Authors
HughPickens.com writes: Every year the works of thousands of authors enter the public domain, but only a small percentage of these end up being widely available. So how do organizations such as Project Gutenberg choose which works to focus on? Allen Riddell has developed an algorithm that automatically generates an independent ranking of notable authors for any given year. It is then a simple task to pick the works to focus on or to spot notable omissions from the past. Riddell's approach is to look at what kind of public domain content the world has focused on in the past and then use this as a guide to find content that people are likely to focus on in the future.
Riddell's algorithm begins with the Wikipedia entries of all authors in the English language edition (PDF)—more than a million of them. His algorithm extracts information such as the article length, article age, estimated views per day, time elapsed since last revision, and so on. This produces a "public domain ranking" of all the authors that appear on Wikipedia. For example, the author Virginia Woolf has a ranking of 1,081 out of 1,011,304 while the Italian painter Giuseppe Amisani, who died in the same year as Woolf, has a ranking of 580,363. So Riddell's new ranking clearly suggests that organizations like Project Gutenberg should focus more on digitizing Woolf's work than Amisani's. Of the individuals who died in 1965 and whose work will enter the public domain next January in many parts of the world, the new algorithm picks out TS Eliot as the most highly ranked individual. Others highly ranked include Somerset Maugham, Winston Churchill, and Malcolm X.
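To make the summary's description concrete, here is a minimal sketch of ranking authors from per-article statistics. It is not Riddell's actual model, which the summary does not spell out: the `ArticleStats` fields mirror the features named above, but the weights, the log scaling, and the three sample authors are all hypothetical and chosen only for illustration.

```python
import math
from dataclasses import dataclass


@dataclass
class ArticleStats:
    name: str
    length_chars: int         # article length
    age_days: int             # days since the article was created
    views_per_day: float      # estimated page views per day
    days_since_revision: int  # time elapsed since the last revision


# Hypothetical weights: an assumption for this sketch, not values from the paper.
WEIGHTS = {"length": 1.0, "age": 0.5, "views": 2.0, "staleness": -0.5}


def score(a: ArticleStats) -> float:
    """Combine log-scaled features into a single notability score."""
    return (
        WEIGHTS["length"] * math.log1p(a.length_chars)
        + WEIGHTS["age"] * math.log1p(a.age_days)
        + WEIGHTS["views"] * math.log1p(a.views_per_day)
        + WEIGHTS["staleness"] * math.log1p(a.days_since_revision)
    )


def rank(articles):
    """Return {name: rank}, where rank 1 is the highest-scoring author."""
    ordered = sorted(articles, key=score, reverse=True)
    return {a.name: i for i, a in enumerate(ordered, start=1)}


# Made-up sample data, not real Wikipedia statistics.
authors = [
    ArticleStats("Author A", 60000, 4000, 3000.0, 2),
    ArticleStats("Author B", 1500, 1200, 3.0, 400),
    ArticleStats("Author C", 20000, 3500, 150.0, 30),
]
ranking = rank(authors)
print(ranking)
```

A long, old, heavily viewed, recently edited article ends up near the top; a short stub that nobody has touched in a year ends up near the bottom. Note that the score measures attention paid to the article, not the quality of the author's work.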
Let's see the last one of these (Score:2, Insightful)
https://medium.com/the-physics... [medium.com]
That one told us the most influential person in world history was Linnaeus.
Just to be Anglo-centric, I don't even see William Shakespeare as eligible on the new list.
Maybe this should be recategorized as "funny things you can do with computers"?
Of the individuals who died in 1965 (Score:3, Informative)
Just to be Anglo-centric, I don't even see William Shakespeare as eligible on the new list.
Maybe this should be recategorized as "funny things you can do with computers"?
It's only authors who died in 1965. From the SUMMARY:
Of the individuals who died in 1965 and whose work will enter the public domain next January in many parts of the world,
you think I am making joke... (Score:2)
Re:Of the individuals who died in 1965 (Score:5, Informative)
It's only authors who died in 1965. From the SUMMARY:
RTFA MAN
http://publicdomainrank.org/ [publicdomainrank.org]
It starts at authors who died in 1900. If you're going to completely misunderstand the meaning of the point and nitpick on petty details, at least get them right.
Do not use algorithms! (Score:2, Insightful)
What a load of crap.
This is why you get rubbish like the BBC destroying lots of "classic" early TV series (throwing the film into skips). But they made sure there was space for old episodes of Panorama most of which involved cretins of the day talking shite which is irrelevant in a few years.
The whole point of archiving is that you literally have *no clue whatsoever* what is going to be valuable in the future.
If you did you would be a stock market billionaire multiple times over.
Re: (Score:3)
If you know you don't have the resources to save everything, you have to have some way of prioritizing.
Personally, I would rather save one or two pieces from as many different authors as possible, rather than trying to get everything of the "most important" authors.
Re: (Score:2)
Because the BBC was basing their decision on a machine learning algorithm?
No, wait, you seem to be an illiterate moron who was moderated positively because people agree with your basic premise of "archive anything" without realizing that you have nothing whatsoever to do with the topic at hand.
And when I say illiterate I mean your prostitute slash sister typed these words for you. And the two people who moderated you positively are on some unknown strain of weed that makes them agree with someone who says
Ridiculous and sad (Score:5, Insightful)
Of the individuals who died in 1965 and whose work will enter the public domain next January
This says so much about our culture...
Are there jurisdictions where one could legally and openly operate a Project Gutenberg clone with more recent works?
Life + 50 years almost everywhere (Score:5, Interesting)
I quickly checked Wikipedia [wikipedia.org], and most countries seem to stick with at least "Life + 50yr" term. That is a great achievement of the lobbyists.
Some island nations seem to have no known copyright legislation, but they are still usually parties to some limiting international treaties, and also have similar restrictions under other names ("unauthorized copying", etc.)
Seriously, is there no place on Earth with more reasonable terms?
Re: (Score:2)
Check out the $1 videos at any garage sale.
Re: (Score:3)
You have to realize that most countries are bound by the Berne Convention [wikipedia.org] w.r.t. c
The Ben Franklin / Copyright "Pirate" connection (Score:2)
"Ben Franklin and others who owned printers realized that copyright didn't apply to them, so they promptly began making copies of everything - books, sheet music, etc."
I had known that for much of US history there was no respect for foreign copyrights (from other countries). I never saw anyone connect this to Ben Franklin's success before. Interesting!
Now that I look:
"Benjamin Franklin, Copyright Pirate"
http://www.tuxdeluxe.org/node/... [tuxdeluxe.org]
And:
"Benjamin Franklin, the first IP pirate?"
http://arstechnica.com/infor [arstechnica.com]
Re: (Score:2)
I guess if you never knew it existed, then you can't miss it, right?
Bad ranking (Score:2)
Losing Literature (Score:3)
It may make more sense to concentrate on those lower in the list. The works of highly rated authors are likely to remain available anyway whereas those of lower rated authors are more likely to be lost.
Admittedly, the loss may be deserved, but I am willing to bet there are some (if not many) that will be more highly appreciated in a century or so.
Re: Losing Literature (Score:1)
Re: (Score:2)
I agree. The most popular ones may not all need the love and attention of the archivists anywhere near as much as some of the lesser-knowns.
Translation workaround (Score:2)
What if I translate someone's book, and release my translation into the Public Domain immediately? Would an alternative Project Gutenberg of liberally licensed translations work?
At least the Berne Convention says that "Translations, adaptations, arrangements of music and other alterations of a literary or artistic work shall be protected as original works without prejudice to the copyright in the original work."
Of course the translation is not the same thing. Also, it is more complicated than that. The auth
Re: (Score:2)
Your translation does not make the original copyright invalid, which is what your highlighted phrase means. You still need permission to make the translation in the first place, and if you don't have it, you have committed copyright infringement. However, if you have a license from the copyright holder, then your new work can be released on whatever terms you and the original copyright holder agreed to.
Re: (Score:1)
You still need permission to make the translation in the first place, and if you don't have it, you have committed copyright infringement.
You are technically incorrect. Making a translation without the author's permission isn't copyright infringement. Distributing it is.
Re: (Score:2)
Your translation will have at least two copyrights applying to it: the original author's and the translator's. It can't be used without licenses from both. It can't be distributed just with a license from the original author, hence the protection as an original work. It can't be distributed just with a license from the translator, since that would be prejudicial to the original author's copyright.
A riddle, wrapped in a mystery, inside an enigma (Score:1)
Riddell's algorithm begins with the Wikipedia entries of all authors in the English language edition (PDF)—more than a million of them. His algorithm extracts information such as the article length, article age, estimated views per day, time elapsed since last revision, and so on....Others highly ranked include Somerset Maugham, Winston Churchill, and Malcolm X.
For folks like Winston Churchill and Malcolm X who had notable careers outside of writing, I wonder how they distinguish what part of their Wikipedia stats is due to their writing and what part comes from the rest of their careers?
Re: (Score:2)
Asimov died something like 30 years after 1965. His works are nowhere near public domain yet.
Re: (Score:2)
Stop words? (Score:2)
Glancing at the partial list of topics presented suggests this work won't be too hard to improve on:
Topic | Characteristic words
4 | categori of birth death stub date name persondata place metadata
20 | univers of the faculti colleg at and edu professor alumni
31 | painter paint of art artist the and in work museum
35 | he in his was and the to of categori at
77 | he the his in to was of and on at
97 | chines china hong kong zh taiwan zhang shanghai wang beij
100 | the book writer novel fiction of and stori isbn novelist
149 | of the and in historian univers languag histori studi translat
160 | she her in the and was to of as with
168 | the to that in and of ref was had by
Table 1: Examples of topics derived from text of Wikipedia articles
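As a rough illustration of the improvement hinted at here, the sketch below strips common English stop words from a few of the topics in the table and flags the topics that have nothing left. The `STOP_WORDS` set is a small hand-picked list assumed for this example, not the list any particular topic-modeling library uses, and the original work may well have handled stop words differently.

```python
# Hand-picked stop words: an assumption for this sketch only.
STOP_WORDS = {
    "he", "she", "his", "her", "the", "in", "to", "was", "of", "and",
    "on", "at", "as", "with", "that", "by", "had",
}

# Three topics copied from the table above (topic id -> characteristic words).
TOPICS = {
    77: "he the his in to was of and on at",
    100: "the book writer novel fiction of and stori isbn novelist",
    160: "she her in the and was to of as with",
}


def strip_stop_words(words: str) -> list[str]:
    """Keep only the words that are not in the stop-word set."""
    return [w for w in words.split() if w not in STOP_WORDS]


filtered = {tid: strip_stop_words(ws) for tid, ws in TOPICS.items()}

# Topics with no words left carry no subject-matter signal at all.
uninformative = [tid for tid, ws in filtered.items() if not ws]

print(filtered[100])
print(uninformative)
```

Topics 77 and 160 are made up entirely of words in this list, so they say nothing about subject matter (beyond the pronoun split between them), while topic 100 keeps its fiction-related terms.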
Not an independent machine ranking of the work (Score:1)
This is not a machine picking out which authors are worthy of digitizing; it is a computer scanning Wikipedia and a few other sites. In other words, it is meta: ranking what regular humans have already ranked by their words and effort to describe. The merit of the critics/reviewers is questionable.
Deciding what is worth digitizing based on the merit of the work itself is not part of this article. For now, I'll stick with librarians deciding what to focus on.
Where is ... (Score:2)
Where does the "machine-learning" come in? (Score:2)
It all sounds fairly standard, as these things go. What has earned it the "machine-learning" distinction?
Some really weird results (Score:2)
So, based on this algorithm, the #1 priority author would be Sherrilyn Kenyon (who writes paranormal romance), followed by Al Sarrantonio (who writes horror, and puts together a bunch of anthologies), and Muammar Gaddafi (yes, that Muammar Gaddafi). Number six is Gardner Dozois, who's also (like Sarrantonio) an anthologist.
If this is designed to be popularity-based (e.g. designed to determine what people most want to see get scanned/uploaded/entered/produced by something like Gutenberg, rather than an asse
Where's Bennett? (Score:2)
I'll be interested (Score:2)
when a machine actually reads all these books and starts making comparisons based on content.
*something* in, rubbish out... (Score:2)
Bram Stoker being #1 in the 1910 decade, way ahead of someone like Mark Twain? In what universe?
The list is full of mediocrity floating at the top, while profound authors are ranked way lower (Calamity Jane > Chekhov for instance).
The complete failure of this ranking experiment just shows how true AI is still 20 years in the future (as it has been for the past 50 years)...
circle-jerk (Score:2)
For example, the author Virginia Woolf has a ranking of 1,081 out of 1,011,304 while the Italian painter Giuseppe Amisani, who died in the same year as Woolf, has a ranking of 580,363. So Riddell's new ranking clearly suggests that organizations like Project Gutenberg should focus more on digitizing Woolf's work than Amisani's.
Which will lead to... exactly the thing we started from.
Wikipedia is a huge circle-jerking effort. If you run this effort over the whole of it, you'll no doubt find out that the "works" of some porn stars are more influential than some of the more obscure philosophers.
It's not so simple, and while the basic project is interesting, conclusions like "you should focus more on this" are clearly drawn by imbeciles who don't understand that influence isn't the same as citation count or page rank.
The pre
Not in America! (Score:2)
Every year the works of thousands of authors enter the public domain
No copyright has expired in the US since 1998, and none will expire until at least 2019. I say "at least", because you can be sure there will be lots of lobbying to extend them even further. I hope the rest of the world is enjoying their public domain... while they still have it.
I Like Tom Godwin But... (Score:2)
Take a look at the "most important" (highest ranking) deceased author from the 1980s [publicdomainrank.org]. It is science fiction/fantasy writer Tom Godwin. Number two is Stanton A. Coblentz. Also in the top 20 (in order): Lin Carter, Robert A. Heinlein, Mack Reynolds, Theodore Sturgeon, James Tiptree, Jr., Clifford D. Simak. Forty percent of the top 20 are SF&F authors. Meanwhile we have Tuchman at 101, Sartre at 112, Borges at 254, Tennessee Williams at 439, Toynbee at 526, and so on.
Looking at the 1990s, the top loading by SF
If I misread the researcher's name as "Riddle" ... (Score:2)