Telescent and MIT CSAIL Collaborate to Accelerate Machine Learning Workflows
Telescent, a leading manufacturer of automated fiber patch panels and cross-connects for networks and data centers, announces that results of the company’s collaboration with the Massachusetts Institute of Technology Computer Science & Artificial Intelligence Laboratory (MIT CSAIL), aimed at accelerating training time for machine learning workflows, will be showcased in an invited presentation at the Networked Systems Design and Implementation (NSDI) conference taking place April 17-19, 2023, in Boston, MA.
The collaboration between Telescent and MIT CSAIL focused on improving the training time for ML workflows by optimizing the communication between workers in the GPU cluster through programmable network connections. The resulting system accelerated workflows by 3.4x.
The NSDI conference focuses on the design principles, implementation, and practical evaluation of networked and distributed systems. The goal of the organization is to bring together researchers from across the networking and systems community to foster a broad approach to addressing overlapping research challenges.
Today’s machine learning (ML) training systems are deployed on top of traditional datacenter fabrics with electrical packet switches arranged in a multi-tier topology. The performance and efficiency of this architecture face severe limitations because of localized network bandwidth bottlenecks. The Telescent programmable patch panel can provision and deliver network connections with essentially unlimited network bandwidth (i.e., thousands of terabits per second) within a massive GPU cluster while consuming minimal energy. The collaboration between Telescent and MIT CSAIL focused on improving the training time for machine learning workflows by optimizing the communication between workers in the Graphics Processing Unit (GPU) cluster through programmable network connections. The collaboration accelerated workflows by 3.4 times, a significant performance improvement that overcomes limitations of current GPU clusters in ML training applications.
According to Manya Ghobadi, Associate Professor at MIT CSAIL and program co-chair of NSDI, large-scale ML clusters require enormous computational resources and consume a significant amount of energy. As a prime example, training a ChatGPT model with 65 billion parameters requires 1 million GPU hours and costs over $2.4 million [source]. In January 2023 alone, ChatGPT served 600 million live inference queries and used as much electricity as 175,000 people [source]. “This trend is not sustainable,” said Ghobadi.
To address this challenge, the MIT CSAIL researchers proposed TopoOpt, a reconfigurable optical datacenter for DNN (Deep Neural Network) training that leverages the unique performance and scalability of the Telescent programmable patch panel. TopoOpt is the first ML-centric network architecture that co-optimizes the distributed training process across three dimensions (computation, communication, and network topology) to significantly improve performance. The team at MIT CSAIL integrated TopoOpt with Nvidia’s NCCL library and built a fully functional prototype of TopoOpt with the Telescent robotic patch panel and remote direct memory access (RDMA) forwarding at 100 Gbps. According to Prof. Ghobadi, “This is the only known testbed that allows topology and parallelization co-optimization for ML workloads … our experiments showed that TopoOpt improves the training time of real-world DNNs by a factor of 3.4.”
“Large-scale deep neural networks are reshaping our daily life and how we interact with the world,” adds Weiyang “Frank” Wang, a third-year Ph.D. student in the Network and Mobile System group at MIT CSAIL, advised by Manya Ghobadi. “TopoOpt is our latest attempt to speed up the training process of these large models through innovations in the fundamental infrastructures people use for these processes. Inspired by Telescent’s recent inventions on reconfigurable optical patch panels, we dive deep into the world of reconfigurable topology specifically for DNN training. Using reconfigurable network topology brings a new dimension for optimizing large DNN training workloads.”