Artificial Intelligence | News | Insights | AiThority

Optimizing IBM’s Cloud-Native AI Supercomputer for Superior Performance


A lot has happened in artificial intelligence this past year, from new techniques becoming widely known to models with tens of billions of parameters being put to practical use. These sophisticated AI capabilities were brought to IBM clients across a wide variety of industries with the launch of watsonx, the data and AI platform for enterprises. This was made possible by several inventions that originated in the IBM Research community.

A growing number of tasks across the AI lifecycle require systems with the right computational capacity. For this and other reasons, IBM built the AI supercomputer Vela last year and deployed it in the IBM Cloud. Vela lets IBM streamline the entire AI workflow on the IBM Cloud, including data pre-processing, model training and tuning, deployment, and even incubation of new products.


What is Vela?

Vela was built to be versatile and extensible, so it can train today's large-scale generative AI models and handle future demands as they arise. Its architecture was also designed to be easily deployed and maintained anywhere in the world. IBM AI practitioners have used Vela for AI training and prototyping over the past year, including for watsonx.ai, IBM's next-generation AI studio. Swiftly launching a global platform like watsonx.ai would have been impossible without Vela's cloud-first design.


Features of Vela, the AI Supercomputer

At this time, Vela is available exclusively to members of the IBM Research community. IBM has described the system as its new “go-to environment” for AI researchers within the firm, while also hinting that Vela is a trial run for a broader rollout.

The details first. Rather than IBM’s own Power10 chips, announced in 2021, each Vela node has two Intel Xeon “Cascade Lake” CPUs, eight Nvidia A100 GPUs (each with 80GB of memory), 1.5 terabytes of RAM, and four 3.2-terabyte NVMe drives. IBM announced the system on its blog. According to the company, the nodes are linked through “multiple 100G network interfaces.” To ensure strong cross-rack bandwidth and to isolate component failures, each node is connected to a different top-of-rack switch, and each top-of-rack switch is connected to four separate spine switches. Vela is also “natively integrated” with IBM Cloud’s VPC environment.
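As a quick sanity check, the per-node figures above can be tallied programmatically. This is an illustrative sketch only; the `NodeSpec` class is hypothetical and not IBM tooling, but the numbers are taken from the specs described above.

```python
# Sketch: tallying the per-node resources of a Vela node as described above.
# NodeSpec is an illustrative helper, not part of any IBM software.
from dataclasses import dataclass

@dataclass
class NodeSpec:
    cpus: int = 2                 # Intel Xeon "Cascade Lake" CPUs
    gpus: int = 8                 # Nvidia A100 GPUs
    gpu_mem_gb: int = 80          # memory per GPU, in GB
    dram_tb: float = 1.5          # system RAM, in TB
    nvme_drives: int = 4
    nvme_drive_tb: float = 3.2    # capacity per NVMe drive, in TB

    @property
    def total_gpu_mem_gb(self) -> int:
        return self.gpus * self.gpu_mem_gb

    @property
    def total_nvme_tb(self) -> float:
        return self.nvme_drives * self.nvme_drive_tb

node = NodeSpec()
print(node.total_gpu_mem_gb)   # 640 -> 640 GB of GPU memory per node
print(node.total_nvme_tb)      # 12.8 -> 12.8 TB of local NVMe storage
```

That 640GB of aggregate GPU memory per node is what makes it practical to fit models with tens of billions of parameters on a small number of nodes.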


Vela’s Unique Selling Points

  • Boosting capability

Adding more GPUs to Vela could not be allowed to consume too much space or too many resources. Specifically, the team sought a way to double the density of the server racks, doubling capacity without requiring more floor space or additional networking hardware.

  • Enhanced diagnostics and operations

Improving the system’s efficiency was another goal of the Vela team. The complexity of AI servers causes them to fail more often than many conventional cloud systems, and their failure modes are often surprising and difficult to identify. Moreover, a training job running on hundreds or thousands of GPUs can see its performance degrade when just one or two nodes in the network go down. To keep the environment productive, automation that detects these kinds of problems and raises alerts as quickly as possible is crucial.
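The kind of automation described above can be sketched as a simple poll-and-alert loop. This is a minimal illustration, not IBM's actual monitoring stack; the `probe` and `alert` hooks are hypothetical placeholders for real health checks (GPU ECC errors, NIC link status, and so on).

```python
# Minimal sketch of node-health automation: poll each node and alert as soon
# as a failure is detected. probe/alert are hypothetical stand-ins for real
# health checks and paging systems, not IBM's tooling.
from typing import Callable

def monitor_nodes(nodes: list[str],
                  probe: Callable[[str], bool],
                  alert: Callable[[str], None]) -> list[str]:
    """Return the unhealthy nodes, alerting on each one as it is found."""
    unhealthy = []
    for node in nodes:
        if not probe(node):       # e.g. GPU ECC errors, NIC link down, ...
            alert(node)
            unhealthy.append(node)
    return unhealthy

# Usage with stub probes: only "gpu-node-7" reports unhealthy.
bad = monitor_nodes(
    ["gpu-node-1", "gpu-node-7", "gpu-node-9"],
    probe=lambda n: n != "gpu-node-7",
    alert=lambda n: print(f"ALERT: {n} failed health check"),
)
print(bad)  # ['gpu-node-7']
```

In a real deployment the probe would run continuously and feed a scheduler, so that a failing node can be drained before it stalls a thousand-GPU job.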

  • Boosting Vela’s speed

Training and deploying this wave of AI creates unique interdependencies with the underlying infrastructure. The growing size of the data sets used to train models requires more GPUs per job to make rapid progress, and as the number of GPUs computing in parallel grows, network performance must grow correspondingly so that GPU-to-GPU communication does not become a bottleneck to workload progress. After deploying a major upgrade to the Vela network this year, individual training workloads were successfully scaled to thousands of GPUs per job. The primary enabling technologies on Vela were GDR (GPU-direct RDMA) and RoCE (RDMA over Converged Ethernet).


