IBM AI Provides Ultra-Modern Captioning for News Broadcasts
IBM researchers have devised a software architecture that achieves best-in-class results for captioning news broadcasts. About two years ago, the company reached a similar milestone with speech transcription, which is not as easy as it sounds. The machine learning-driven effort had to overcome numerous obstacles before reaching its goal. Now, researchers at the Armonk, New York-based company have achieved a breakthrough in captioning capabilities. They have detailed their findings in a paper and will present it at an upcoming conference in Brighton.
IBM states the technology was hard to develop because of background noise and news anchors speaking about a wide range of topics. Broadcasts also mix a large volume of disparate material, such as on-site interviews, multimedia content, and TV show clips.
As IBM researcher Samuel Thomas explains in a blog post, the AI combines long short-term memory (LSTM), a type of algorithm capable of learning long-term dependencies, with acoustic neural network language models and complementary language models. The acoustic models contained up to 25 layers of nodes (mathematical functions mimicking biological neurons) trained on speech spectrograms, which are visual representations of signal spectra, while the six-layer LSTM networks learned a "rich" set of acoustic features to enhance language modeling.
The IBM researchers proceeded as follows:
- The system was fed 1,300 hours of data from the Linguistic Data Consortium
- The researchers ran the AI on a first test set of two hours of data from six shows, with 100 overlapping speakers in total
- A second test set contained four hours of data from 12 shows, with 230 overlapping speakers
- For measuring results, IBM worked with speech and search technology firm Appen
- The system scored word error rates of 6.5% and 5.9% on the first and second test sets respectively
- This still trails human performance (3.6% and 2.8% on the first and second test sets respectively)
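
The percentages above read as word error rates (WER), the usual yardstick for speech recognition, where lower is better. As a rough illustration of how such a figure is typically computed, here is a minimal Python sketch. The function name and the example sentences are hypothetical and not taken from IBM's paper, and real evaluations usually normalize text (casing, punctuation, numerals) before scoring, which this sketch does not do.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical example: one substituted word and one dropped word
# against a six-word reference.
example_wer = word_error_rate(
    "the markets closed higher on tuesday",
    "the market closed higher tuesday",
)

# Gap between the reported machine and human scores, in percentage points.
machine = [6.5, 5.9]
human = [3.6, 2.8]
gaps = [round(m - h, 1) for m, h in zip(machine, human)]
```

On the reported figures, the system makes roughly twice as many word errors as the human transcribers on both test sets.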