Google and Harvard debut dataset with 1m public domain books for AI training



Harvard University, in conjunction with Google, has released a dataset of a million public domain books to train the next generation of AI.

The books span genres and languages, and include works by authors such as Dickens, Dante, and Shakespeare, which are no longer protected by copyright because of their age. The initiative comes at a time when AI training data is expensive and largely within reach only of tech firms with deep pockets.

Harvard got financial backing from tech giants

According to a TechCrunch article, the initiative is spearheaded by Harvard’s Institutional Data Initiative (IDI). The dataset contains books derived from Google’s longstanding book-scanning project, Google Books.

Other books contained in the dataset include Czech math textbooks and Welsh pocket dictionaries.

The university teased the IDI in March, stating its plans to create a “trusted conduit for legal data for AI.” Little was heard from it after that until the formal launch on Thursday. Tech giants Microsoft and OpenAI funded the project.

The dataset is not the preserve of Silicon Valley alone. IDI has opened it to anyone, from research labs to AI startups, that wants to train large language models.

IDI executive director Greg Leppert said that opening the dataset to anyone is meant to level the playing field at a time when the cost of training AI remains prohibitive for smaller companies, making it the preserve of those with huge budgets.

Leppert added that the dataset is “rigorously reviewed,” which, according to Fudzilla, presumably means someone checked to ensure that the Bard was really gone and out of the way.

The Harvard dataset will need more resources

Leppert, who compared the dataset’s potential to that of Linux, the open source operating system, said the success of the Harvard dataset hinges on a number of variables. It will require more resources, expertise, and a “sprinkle of magic” from the same deep-pocketed corporations that the initiative is designed to challenge.

The million books in the dataset were scanned as part of the Google Books program. Fudzilla describes the initiative as a digital time capsule from a time when Google’s ambition to scan every book seemed quirky rather than dystopian.

Leppert is nonetheless upbeat about the project’s potential uses, suggesting it could be a treasure trove that helps train AI models for everyone from garage startups to corporate conglomerates.

While some have praised the initiative as a revolutionary leap forward in democratizing AI, Fudzilla opines that some might see this as a subtle means of ensuring that any ambitious upstart with a few terabytes of server space can now compete in a race to develop the next ChatGPT.

However, such upstarts will need more resources to compete and make a dent in the market. ChatGPT launched in November 2022 to immediate success, which spurred a race to build generative AI models across the globe. Developing these models has created a thirst for data to refine them, and that demand has raised the question of how much information developers can obtain without stealing it.

To date, publishers like the Wall Street Journal and the New York Times have sued OpenAI and Perplexity over using their data without permission.
