Cerebras Systems Enables GPU-Impossible Long Sequence Lengths Improving Accuracy in Natural Language Processing Models
Cerebras Systems, the pioneer in accelerating artificial intelligence (AI) compute, released yet another industry-first capability. Customers can now rapidly train Transformer-style natural language AI models with 20x longer sequences than is possible using traditional computer hardware. This new capability is expected to lead to breakthroughs in natural language processing (NLP). By providing vastly more context for the understanding of a given word, phrase or strand of DNA, the long sequence length capability gives NLP models a much finer-grained understanding and better predictive accuracy.
“Earlier this year, the Cerebras CS-2 set the record for training the largest natural language processing (NLP) models of up to 20 billion parameters on a single device,” said Andrew Feldman, CEO and co-founder of Cerebras Systems. “We are now enabling our customers to train with longer sequences on the largest NLP models. This provides previously unobtainable accuracy, unlocking a new world of innovation and possibilities across AI and deep learning.”
Language is context specific. This is why translating word by word with a dictionary fails: without context, the meaning of words is often vague. In language, a word is best understood in the context of the surrounding words, which provide guides to its meaning. This is true in AI as well. Long sequence lengths enable an NLP model to understand a given word within a larger and broader context.
Imagine hearing the expression “To be or not to be” without context, just using a dictionary. And then imagine understanding it within the context of Act III, Scene 1 of Hamlet. And then imagine if you had broader context and could understand it within the context of the entire play or, better yet, within the context of all Shakespearean literature. As the context within which understanding occurs is broadened, the precision of the understanding increases. By vastly enlarging the context (the sequence of words within which the target word is understood), Cerebras enables NLP models to demonstrate a more sophisticated understanding of language. Bigger and more sophisticated context improves the accuracy of understanding in AI.
While many industries will benefit from this new capability, Cerebras’ pharmaceutical and life sciences customers are particularly excited about the implications for their drug discovery efforts. DNA is the language of life, and the analysis of DNA has been a particularly powerful application of large language models.
“Machine learning at GSK involves taking complex datasets generated at scale and answering very challenging biological questions,” said Kim Branson, senior vice president and global head of AI and Machine Learning at GSK. “The long sequence length capability enables us to examine a particular gene in the context of tens of thousands of surrounding genes. We know that surrounding genes have an impact on gene expression, but we have never before been able to explore this within AI.”
The proliferation of NLP has been propelled by the exceptional performance of Transformer-style networks such as BERT and GPT. However, these models are extremely computationally intensive. Even when trained on massive clusters of graphics processing units (GPUs), today these models can only process sequences up to about 2,500 tokens in length. Tokens might be words in a document, amino acids in a protein, or base pairs on a chromosome. But an eight-page document could easily exceed 8,000 words, which means that an AI model attempting to summarize a long document would lack a full understanding of the subject matter. The unique Cerebras wafer-scale architecture overcomes this fundamental limitation and enables sequences up to a heretofore impossible 50,000 tokens in length.
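To give a rough sense of why longer sequences are so demanding, the sketch below works through the arithmetic for the two sequence lengths quoted above. It assumes standard dense self-attention, in which every token is compared with every other token, and 2-byte (16-bit) values; neither assumption comes from the announcement, and the function names are illustrative only.

```python
# Illustrative arithmetic only. Assumes dense self-attention and 16-bit values,
# neither of which is stated in the announcement.

GPU_SEQ_LEN = 2_500        # approximate GPU limit quoted in the article
CEREBRAS_SEQ_LEN = 50_000  # Cerebras limit quoted in the article


def attention_score_elements(seq_len: int) -> int:
    """Entries in one dense attention score matrix (seq_len x seq_len)."""
    return seq_len * seq_len


def attention_score_bytes(seq_len: int, bytes_per_element: int = 2) -> int:
    """Bytes needed to hold one score matrix, for one head in one layer."""
    return attention_score_elements(seq_len) * bytes_per_element


# How much longer the sequence is, and how much larger the score matrix gets.
length_ratio = CEREBRAS_SEQ_LEN // GPU_SEQ_LEN
matrix_ratio = (attention_score_elements(CEREBRAS_SEQ_LEN)
                // attention_score_elements(GPU_SEQ_LEN))

if __name__ == "__main__":
    print(f"Sequence length ratio: {length_ratio}x")
    print(f"Score matrix size ratio: {matrix_ratio}x")
    for n in (GPU_SEQ_LEN, CEREBRAS_SEQ_LEN):
        megabytes = attention_score_bytes(n) / 1e6
        print(f"{n:>6} tokens: {megabytes:,.1f} MB per head per layer")
```

Under these assumptions, a 20x longer sequence means a score matrix 400x larger, which is one reason sequence length is often limited by memory as much as by raw compute.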
This innovation unlocks previously unexplored frontiers of deep learning. Even within traditional language processing, there are many examples of tasks in which this type of extended context matters. Recent work has shown that for tasks such as evaluating intensive care unit patient discharge data and analyzing legal documents, seeing the entire document matters for understanding. These documents can be tens of thousands of words long. The potential applications beyond language are even more exciting. For example, research has shown that protein structures are highly dependent on long-range interactions between building blocks, and training models with longer sequence lengths is likely to yield better results. Now that the Cerebras CS-2 system makes long sequence training not only possible, but easy, researchers are sure to uncover many more applications and solve problems previously thought to be intractable.
Training large models with massive data sets and long sequence lengths is an area in which the Cerebras CS-2 system, powered by the Wafer-Scale Engine (WSE-2), excels. The WSE-2 is the largest processor ever built. It is 56 times larger, has 2.55 trillion more transistors, and has 100 times as many compute cores as the largest GPU. This scale means that the WSE-2 has both the memory to hold computations for the largest layers of the largest models, and the computational power to process such huge computations quickly. In contrast, similar workloads on GPUs have to be parallelized across hundreds or thousands of nodes to train a model in a reasonable amount of time. This type of GPU infrastructure requires specialized expertise and valuable engineering time to set up. Meanwhile, the Cerebras CS-2 system can perform similar workloads with the push of a button, removing the complexity while accelerating time to insight.