

Boston Public Library Aims To Increase Access To a Vast Historic Archive Using AI
An anonymous reader quotes a report from NPR: Boston Public Library, one of the oldest and largest public library systems in the country, is launching a project this summer with OpenAI and Harvard Law School to make its trove of historically significant government documents more accessible to the public. The documents date back to the early 1800s and include oral histories, congressional reports and surveys of different industries and communities. "It really is an incredible repository of primary source materials covering the whole history of the United States as it has been expressed through government publications," said Jessica Chapel, the Boston Public Library's chief of digital and online services. Currently, members of the public who want to access these documents must show up in person. The project will enhance the metadata of each document and will enable users to search and cross-reference entire texts from anywhere in the world. Chapel said Boston Public Library plans to digitize 5,000 documents by the end of the year, and if all goes well, grow the project from there. Because of this historic collection's massive size and fragility, getting to this goal is a daunting process. Every item has to be run through a scanner by hand. It takes about an hour to do 300-400 pages.
Harvard University said it could help. Researchers at the Harvard Law School Library's Institutional Data Initiative are working with libraries, museums and archives on a number of fronts, including training new AI models to help libraries enhance the searchability of their collections. AI companies help fund these efforts, and in return get to train their large language models on high-quality materials that are out of copyright and therefore less likely to lead to lawsuits. "Having information institutions like libraries involved in building a sustainable data ecosystem for AI is critical, because it not just improves the amount of data we have available, it improves the quality of the data and our understanding of what's in it," said Burton Davis, vice president of Microsoft's intellectual property group. [...] OpenAI is helping Boston Public Library cover such costs as scanning and project management. The tech company does not have exclusive rights to the digitized data.
I heard this story on NPR today.. (Score:4, Insightful)
Re: (Score:2)
Yes, finally a sensible use is being reported here. LLMs are pretty good at summarization and categorization. I have a pretty big media library I'm thinking of using local models to categorize. Since it doesn't much matter how long it takes, I can underclock everything and let it just plod along.
Re: (Score:2)
Yes, if it can be a thing (like Windows did for the Media Center thing).
The whole ball of wax for Media Center is a whole thing.
Better hurry (Score:4, Informative)
Before the fascist goon makes them rewrite history to fit his agenda [cbsnews.com].
Re: (Score:2)
Re: (Score:2)
(Offtopic)
Of course... that's if the 'other party' has a worthy candidate.
(On topic)
Why does this need AI? I could see maybe converting a scanned page or book to a text file that a computer can search through... but Adobe has been able to do that for quite a while (OCR).
"AI" is not the answer to everything. Humans (last I checked, anyway) are capable of reading books and researching stuff.
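For reference, the "text file that a computer can search through" approach can be sketched in a few lines. This is only a rough illustration, assuming the OCR step has already produced plain text per page; the page IDs and sample text are made up, and `tokenize`, `build_index` and `search` are hypothetical helper names, not part of any library or of the BPL project.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(pages):
    """Map each word to the set of page IDs whose OCR text contains it."""
    index = defaultdict(set)
    for page_id, text in pages.items():
        for word in tokenize(text):
            index[word].add(page_id)
    return index


def search(index, query):
    """Return the page IDs that contain every word of the query."""
    words = tokenize(query)
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result


# Made-up sample pages standing in for OCR output.
pages = {
    "report-p1": "Survey of the textile industry in Massachusetts",
    "report-p2": "Congressional report on railroad construction",
    "oral-p1": "Oral history of mill workers in Lowell",
}
index = build_index(pages)

print(search(index, "textile industry"))
print(search(index, "cloth factories"))
```

The first query matches because both words appear on one page. The second returns nothing, since exact-word lookup only finds pages that use the words you typed.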
Re: (Score:2)
Re: (Score:2)
Of course it does... it removes the extraneous crap and gives you a few highlights... what about the in-between information?
My mind has never hallucinated anything... the closest approximation would be dreams. Not that any medical establishment has any clue about the whole mind/brain thing... the 'soul' is still a mystery (even though people lose a couple dozen grams when they die... they don't know why).
If I want to know something (say 'Chernobyl'), I'll read the first article that seems like it has info,
Re: (Score:2)
I would say yes. People are spitting out AI books and flooding the market with them. They may or may not be able to make a 3D model yet, I don't know. I know they can do the basics of SW programming well, and their syntax is almost perfect. It can edit or even create a video based on specific verbal instructions for anybo
Re: (Score:2)
Yeah... and, are those "AI" books worth taking 5 minutes to glance at? I would much rather read a good book written by Michael Crichton (or whoever), than something written "in the style of Crichton".
What about the potential authors (the humans) who are trying to be authors? Should they lose their livelihood because of AI? What about the people who write movies or even act in movies? Should they be unemployed because of AI?
"AI" can't make the 3D model for me because it can't understand that I want to ma
Re: (Score:2)
Re: (Score:2)
Thank you... that's rare on here.
The "AI" rush is ignoring the fact of what it actually is... it's the same thing that your phone does when you text someone... predictive text.
If you stuff it (the hardware and crap) in a closet without 'net connectivity... can it solve the problem of how to get out?
If it can't (consistently... run the test a dozen times), is it intelligent?
Sure... if I give it access to my Arduino code stores, it can reference the code pile and make something following the standards.
If I as
Re: (Score:2)
Re: (Score:2)
Y'know... you could just write the code.
Funny how that works... the shit (can I use that here?) that is supposed to help us ends up holding us back.
Re: (Score:2)
Re: (Score:2)
Yeah, I could do that sitting here listening to Tatu
What advantage does the "AI" give you?
I'm not gonna do a full standoff because there's no real-world thing.
We'd have to sync watches and crap... my watch sets itself to the atomic clock... a Casio PAW-1500
If anyone musses with the door (without saying anything)... I'm up (without the glasses), and have the katana in hand
Re: (Score:2)
(Offtopic) Of course... that's if the 'other party' has a worthy candidate.
(On topic) Why does this need AI? I could see maybe converting a scanned page or book to a text file that a computer can search through... but Adobe has been able to do that for quite a while (OCR). "AI" is not the answer to everything. Humans (last I checked, anyway) are capable of reading books and researching stuff.
Simple search *can* work, if you know precisely what set of words you're looking for. AI-enabled search works well when it comes to documents. I have a PrivateGPT instance at home running on a not particularly high-spec machine for finding info in my fictional universe, currently somewhere around 400k words. Instead of having to remember precise word patterns, I can just ask, "What rank was $character on $date" or "what color eyes does $character have" or "what was the name of the captain on $shipname in th
Re: (Score:2)
So... why do you need a LLM indexing your Google stories folder?
I know what's in my stories folder, and which has what... I don't need to launch a search across all 8 drives on the tower to find one thing.
It's not "AI"... it's predictive text (it's the same thing your phone does when you text someone... just on a bigger scale).
It can be useful for certain things... but it's not going to replace "remembering where that file is" anytime soon.
Re: (Score:3)
So... why do you need a LLM indexing your Google stories folder?
I don't use google anything for my written work. It's a private, local-only LLM.
I know what's in my stories folder, and which has what... I don't need to launch a search across all 8 drives on the tower to find one thing.
So do I, in theory. But I can't tell you the number of times I've spent half an hour or more trying to track down precisely where in that massive tome of documents I mentioned this fact or that fact, when a machine-enabled search can find it in seconds.
It's not "AI"... it's predictive text (it's the same thing your phone does when you text someone... just on a bigger scale).
I think we're well past the point where we can argue about the definition of AI. Words no longer mean things. It's marketing labels that mean things.
It can be useful for certain things... but it's not going to replace "remembering where that file is" anytime soon.
Actually, that's PRECISELY wh
Re: (Score:2)
eh, so Trump re-writes history, and fakes the GDP, inflation, and jobs numbers for a while. I am still optimistic that in a year and a half, the other party will get control of the House, and I think Trump will be "toast" then. Then we are looking at 2 years of paralysis... then maybe someone better will come along that can move the US forward, instead of backwards.
Forty years of stalemate/backwards/stalemate/backwards and you have hope? I admire your optimism, but I think the best we can hope for is a return of stalemate.
Who will own the results? (Score:2)
Of course, all those books Google digitized are now behind Google's login.
Re: (Score:2)
Well... it's Google or a competitor... can't think of a competitor other than M$
It depends on what that scanned book/article is behind... depending on copyright, it should (maybe has to) be available to public.
Whether it's Google or ArsTechnica or whoever... the actual thing needs to be public, no excuses, especially for rare stuff.
Paywalled articles about new drugs or science papers shouldn't exist... maybe do a "publish the article, but contact info is paid-for" or something.
For the people "complaining" abo
I have approximate knowledge of many things (Score:1)
Oh wonderful. In exchange for company access to the entire library catalogue, the public gets a shitty semantic search engine that can only work by probabilistically stringing words together. This will do nothing for people who need factual information.
Re: (Score:2)
Re: (Score:2)
Sure... you can ask the "AI" to directly quote an ArsTechnica article... and it'll give you what you want. You entirely could go to the site itself and read the article.... what did the "AI" do for you?
It's a very basic search engine, that replies using (more or less) "plain text", and I'm sure there's a lot of censorship and filtering built into it.
I can type something into a search engine and review the results, and click on the closest relative to what I wanted to find.
Of course, I can find my article a
Re: (Score:3)
Re: (Score:2)
Let me boil it down for ya... the entirety of any of them is relying on search engines (which kind of defeats the whole purpose)
Why do you need it to do emulations of physics and material science crap... you can't be bothered to do that stuff on your own?
Re: (Score:1)
If it's an LLM powered search engine, then it's strictly worse than what we used to have because it doesn't return information, it returns a statistically likely arrangement of words. Hallucinations are how LLMs work: mashing together words/tokens in its training set to give you a series of words that are likely to occur. "Training" is just a cover, the industry cannot solve a problem that is the foundation of how the system works.
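To make "a series of words that are likely to occur" concrete, here is a toy next-word predictor built from bigram counts. It is a deliberately tiny stand-in, not how a real LLM is implemented (those use neural networks over tokens), and the corpus and the function names `train_bigrams` and `predict_next` are invented for the illustration.

```python
from collections import Counter, defaultdict


def train_bigrams(text):
    """Count, for each word, which words follow it in the training text."""
    words = text.lower().split()
    following = defaultdict(Counter)
    for current, nxt in zip(words, words[1:]):
        following[current][nxt] += 1
    return following


def predict_next(following, word):
    """Return the most frequent follower of `word`, or None if it was never seen."""
    counts = following.get(word.lower())
    if not counts:
        return None
    return counts.most_common(1)[0][0]


# Made-up training text; "the" is followed by "library" more often than anything else.
corpus = "the library scans the documents and the library indexes the library catalogue"
model = train_bigrams(corpus)

print(predict_next(model, "the"))
print(predict_next(model, "archive"))
```

The predictor picks "library" after "the" purely because that pairing was most frequent in the training text, and it has nothing to offer for a word it never saw.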
more important (Score:2)
The most important legal documents, however, are the annotated decisions, explaining precedent and interpretation, that are very expensive to buy, only available in law libraries, which are closed to the public.
Re: (Score:2)
That would be a good use, because I don't wanna have to read a 500-page law document just to find that "someone VS someone" relied on some obscure law referenced in some other case 50 years ago... a version that gives the highlights of the first case, including mentions of the 'law referenced' (so you could try and look that up if you were so inclined) would be a good thing, instead of paying a bunch to buy the file and finding the information you want is actually in a different one.
I doubt that'll happen,