Swedish Library Harnesses NVIDIA to Turn Library Archives into AI Training Data
The National Library of Sweden is using NVIDIA’s AI technology to turn its paper archives into digital assets that can serve as AI training data. The library is renowned for collecting everything written in the Swedish language and archiving it in a readable format. Today it is transforming that vast archive into digital assets, which should benefit researchers in history, linguistics, media studies and other fields. NVIDIA is playing an important role in this journey from medieval manuscripts to state-of-the-art digital copies.
Let’s look at how the National Library of Sweden and NVIDIA have partnered on AI training data.
There are hundreds of libraries in Europe, but something sets the National Library of Sweden apart and made it a natural fit for NVIDIA’s expertise in AI training data. A mandate in Sweden requires that a copy of everything published in Swedish be submitted to the library, also known as Kungliga biblioteket, or KB. These copies include books, newspapers, radio and TV broadcasts, internet content, dissertations, letters, menus and even video games. In total, the collection amounts to 26 petabytes of data, a gold mine for AI researchers looking to build AI training data for the Swedish language. AI researchers used NVIDIA DGX systems to develop more than two dozen open-source transformer models. These models are currently available on Hugging Face and enable research at the library and other academic institutions.
Using these models, researchers can create specialized datasets to understand the context of Swedish content of every kind, from postcards to internet blogs and videos. The models would also enable language analysts to review how Swedish has changed over the centuries and how it differs from other European languages in formal and informal usage.
The ongoing work at KBLab, established in 2019, was inspired by an early multilingual natural language processing model by Google that included 5GB of Swedish text. Soon, the lab began experimenting with Dutch, German and Norwegian content to develop a multilingual dataset. Together, these datasets may improve AI performance when building larger models for language research and content translation.
KBLab started out with NVIDIA GPUs but soon upgraded to NVIDIA DGX. The lab has two NVIDIA DGX systems from Swedish provider AddPro for on-premises AI development. The systems serve three purposes in the project:
- collect and store sensitive data
- conduct large-scale AI deployments
- fine-tune ML models
The systems are also used to prepare for even larger runs on massive GPU-based supercomputers across the European Union, including the MeluXina system in Luxembourg. The team has also adopted NVIDIA NeMo Megatron, a PyTorch-based framework for training large language models, which uses NVIDIA CUDA and the NVIDIA NCCL library under the hood to optimize GPU usage in multi-node systems.
In addition to transformer models that understand Swedish text, KBLab has an AI tool that transcribes sound to text, enabling the library to transcribe its vast collection of radio broadcasts so that researchers can search the audio records for specific content.
KBLab is also starting to develop generative AI text models. This work would push the research toward building an AI model that could process videos and automatically describe their content.
KBLab has partnered with researchers at the University of Gothenburg, who are developing downstream apps using the lab’s models to conduct linguistic research — including a project supporting the Swedish Academy’s work to modernize its data-driven techniques for creating Swedish dictionaries.
Source: NVIDIA / Isha Salian