Artificial Intelligence means different things to different people and different organizations, especially depending on whether AI is being used to tackle small problems or big problems at scale. Generally, the accepted definition of AI is the ability of a computer system to perform a task typically thought to require intelligence, learn from that task, and refine its capability over time. The tasks associated with AI are often thought to require something close to human-level reasoning, the difference being that AI systems can deal with data at a volume and velocity far beyond human ability. At one end of the spectrum, AI can mean anything from recognising faces on your phone to reading documents to extract content or context. At the other end, we could be talking about large data center clusters processing telemetry from global vehicle fleets to optimize supply chain routes and suppliers, or Large Language Models trained on vast volumes of unstructured information, resulting in tools such as the now-popular ChatGPT.
These AI innovations require highly available, high-capacity architectures to support the volumes of data being distributed and redistributed on demand for training and serving these AI models in production. This creates significant challenges for data center operators, and for hyperscalers especially. They must deal with increasingly diverse workloads at scale, ranging from batch loading of large datasets to train the models to the high volumes of smaller requests that must be serviced once those models are deployed in production. Throwing more infrastructure at the problem is not a long-term solution; however, we are not dealing with technologies whose long-term needs are known. The challenge is compounded by the fact that data center operators can’t stop to take a breath and figure it out; there is no pause button in this market. Customer demands are higher than ever, and whilst current approaches may be suitable for smaller AI workloads, it is a different story when addressing larger requirements.
Allowing an accidental network architecture to evolve through short-term solutions that patch capacity or availability issues is typically a recipe for long-term pain. This is especially true in the world of AI, where an AI cluster can be formed by combining thousands of AI accelerators into one homogeneous “AI Brain”. The challenge here comes not so much from north-south traffic (data moving in and out of the data center), which is still important as we need to get data into these systems, but from the fact that high-performance AI systems are the aggregation of these AI accelerators, which scale horizontally (adding new nodes to the network rather than adding hardware to each node). This means the performance of east-west traffic, the node-to-node communication within the cluster, must be heavily optimized. Connecting thousands of the highest-capacity AI accelerators to behave as a single AI Brain requires a completely separate, dedicated, low-latency network spanning every accelerator in the cluster, leading to far greater fibre infrastructure density requirements in the data hall. Effectively, this dedicated network must act as a super-low-latency backplane for the cluster to perform as efficiently as possible.
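To give a sense of where that fibre density comes from, the rough sketch below (in Python) counts the east-west links in a simple non-blocking two-tier leaf-spine fabric. It is purely illustrative, not a reference design: the 64-port switch radix, the one-uplink-per-spine wiring, the duplex two-strands-per-link figure, and the 2,048-accelerator cluster size are all assumptions chosen for the arithmetic, and real deployments often add redundancy and parallel optics that multiply these counts further.

```python
import math

def two_tier_fabric(accelerators: int, switch_ports: int = 64) -> dict:
    """Rough link and fibre-strand count for a non-blocking two-tier
    leaf-spine fabric. Illustrative assumptions only: each leaf splits
    its ports evenly between accelerator-facing and spine-facing links,
    and every link is a duplex pair (2 fibre strands)."""
    down = switch_ports // 2                    # accelerator ports per leaf
    leaves = math.ceil(accelerators / down)     # leaf switches needed
    fabric_links = leaves * down                # leaf <-> spine uplinks
    spines = fabric_links // switch_ports       # spine switches to terminate them
    total_links = accelerators + fabric_links   # host links plus fabric links
    return {
        "leaves": leaves,
        "spines": spines,
        "links": total_links,
        "fibre_strands": total_links * 2,       # duplex: 2 strands per link
    }

# A hypothetical 2,048-accelerator cluster with 64-port switches:
# {'leaves': 64, 'spines': 32, 'links': 4096, 'fibre_strands': 8192}
print(two_tier_fabric(2048))
```

Even under these simplified assumptions, a cluster of a couple of thousand accelerators needs on the order of four thousand dedicated east-west links, and the strand count only grows once breakout cabling, parallel optics, and redundant paths are added.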
This is demonstrated clearly by Nvidia’s acquisition of Mellanox, which allowed them to include a dedicated high-speed InfiniBand network as the network backplane for their DGX GPU systems. When you consider these needs alongside the pace of innovation in the AI developer community and the increasing scale of the models being trained and served, data centers are faced with a real challenge. Traditional thinking suggests they need to futureproof their architecture; however, what they may have to deal with six months, one year, or three years ahead is uncertain. The ability to futureproof is limited, making it essential to build as flexibly as possible, allowing maximum scalability. At the same time, the needs of the market are so immediate and pressing that operators are forced to implement the most current, highest-performing networking solutions available right now, in the knowledge that the technology may change next year and their infrastructure must accommodate it.
As the pace of innovation in AI continues to accelerate, partnering with a company like AFL that has strong R&D at the core of its organization is critical. We work with customers on a day-to-day basis who are faced with hard decisions around where and how they compromise traditional, established norms and good practices in the pursuit of not just keeping up with the market but staying ahead of it.

Written by Keith Sullivan
Director of Strategic Innovation, AFL

Alan Keizer
Senior Technical Consultant, AFL
April 13, 2023