Rethink physical infrastructure to enable an AI-led future

Interest in the possibilities of AI and machine learning (ML) in real-life applications has undoubtedly been rising. New applications have been emerging daily since ChatGPT went viral in the news. This momentum has proliferated across Asia-Pacific, with regional spending on AI projected to rise to approximately US$78 billion by 2027.

Millions globally are now regularly engaging with AI through interfaces like ChatGPT and Bard. However, most users do not realise that their cosy desktop exchanges with AI assistants are powered by massive data centres worldwide. These data centres increasingly rely on a high density of fibre optic cables.

Enterprises investing in AI are focusing on building these capabilities within data centres, where AI models are constructed, trained, and refined to align with specific business strategies. These physical AI cores consist of racks of GPUs (graphics processing units) that provide the parallel processing power AI models need for exhaustive algorithm training. This power is necessary to analyse and interpret vast datasets, enabling AI solutions like ChatGPT to perform tasks natural to the human mind, such as differentiating the uses of the word “date” in multiple contexts, and to solve complex problems.

This real-time “intelligent” processing has captured the imaginations of organisations everywhere. However, developing useful AI algorithms requires vast amounts of data for training, leading to expensive, power-intensive processes. These can give rise to cost and energy challenges, limiting the deployment of enterprise AI.

Built for speed 

AI clusters must be built for speed. Data centres facilitating AI typically maintain discrete GPU clusters that work collaboratively to process data for AI algorithm training. The substantial heat generated by these energy-intensive GPUs limits the number that can be accommodated in each rack. This necessitates new approaches to cooling and a reevaluation of the fibre connectivity architecture.

Compared to traditional data centres, AI clusters require significantly more inter-rack cabling to ensure sufficient low-latency connectivity at speeds ranging from 100G to 400G. These requirements cannot be met by copper cables. Thus, optimising the physical layout, or even implementing a new data centre architecture, is crucial to reduce heat and minimise link latency.

When training large-scale AI models in day-to-day data centre operations, roughly 30% of the elapsed time is typically consumed by network latency, with the remaining 70% spent on computing. Even a minor latency reduction, such as the 50 nanoseconds gained by shortening a fibre run by 10 metres, can yield substantial time and cost savings, particularly when training large AI models that can cost US$10 million or more.
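The 50-nanosecond figure can be sanity-checked from first principles. A minimal sketch, assuming a typical silica-fibre refractive index of about 1.47 (light travels that much slower in glass than in vacuum):

```python
# Back-of-the-envelope check of the latency figure above.
C_VACUUM = 299_792_458   # speed of light in vacuum, m/s
N_FIBRE = 1.47           # assumed refractive index of a silica fibre core

def propagation_delay_ns(length_m: float) -> float:
    """One-way propagation delay through fibre of the given length, in ns."""
    return length_m / (C_VACUUM / N_FIBRE) * 1e9

# Shortening a fibre run by 10 m removes roughly 50 ns of one-way delay.
saving_ns = propagation_delay_ns(10)
print(f"Delay saved per 10 m of fibre removed: {saving_ns:.0f} ns")
```

With these assumed values the saving works out to about 49 ns per 10 metres, consistent with the rule of thumb of roughly 5 ns of delay per metre of fibre.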

A necessary investment

It has never been more crucial for operators to carefully consider the optical transceivers and fibre cables they use in their AI clusters, aiming to minimise cost and power consumption. Having foresight into future data centre operations is key to avoiding the risk of technical debt, which can result from settling for quick fixes.

To ensure optimal cost efficiency, data centres might consider transceivers that use parallel fibre. One advantage of parallel fibre is that it does not require the optical multiplexers and demultiplexers used for wavelength division multiplexing, leading to both lower costs and reduced power consumption for these transceivers. While parallel fibre cables are marginally more expensive than duplex fibre cables, the transceiver savings outweigh the slight increase in cable price.

Market research suggests that for high-speed transceivers (400G and above), a single-mode transceiver costs roughly double its multimode equivalent. While multimode fibre is slightly more costly than single-mode fibre, the cable cost difference between the two is less significant, as multi-fibre cable costs are driven primarily by the multi-fibre push-on (MPO) connectors.
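The structure of that trade-off can be sketched with a toy link-cost model. All prices below are hypothetical placeholders chosen only to reflect the relationships described above (transceivers dominate; single-mode transceivers cost about twice as much; cable prices differ modestly):

```python
# Illustrative link-cost comparison; every price here is a hypothetical
# assumption, not a quoted market figure.
def link_cost(transceiver_unit: float, cable: float, n_transceivers: int = 2) -> float:
    """Total optics cost for one link: a transceiver at each end plus the cable."""
    return transceiver_unit * n_transceivers + cable

# Assumed: single-mode transceiver ~2x the multimode equivalent;
# multimode cable slightly dearer than single-mode cable.
multimode_link = link_cost(transceiver_unit=500, cable=120)
single_mode_link = link_cost(transceiver_unit=1000, cable=100)
print(f"multimode link:   ${multimode_link:.0f}")
print(f"single-mode link: ${single_mode_link:.0f}")
```

Because the transceivers at both ends dwarf the cable itself, the multimode link comes out well ahead even though its cable is the pricier of the two, which is the argument the market research makes.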

In a single AI cluster, with up to 768 transceivers, using multimode fibre can save up to 1.5 kW. While this may seem minor compared to the 10 kW consumed by each GPU server, in AI clusters, any opportunity to save power is valuable, offering significant savings across multiple instances of AI training and operation.
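The per-device arithmetic behind that figure is simple to make explicit, taking the 768-transceiver cluster and 1.5 kW saving from above:

```python
# Per-transceiver power delta implied by the cluster-level figures above.
transceivers = 768        # transceivers in a single AI cluster (from the text)
total_saving_w = 1500     # 1.5 kW saved cluster-wide by choosing multimode

per_transceiver_w = total_saving_w / transceivers
print(f"Saving per transceiver: {per_transceiver_w:.2f} W")  # ≈ 1.95 W
```

Roughly 2 W per transceiver looks trivial next to a 10 kW GPU server, but multiplied across hundreds of transceivers, many clusters, and months of continuous training, it compounds into meaningful savings.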

Challenging conventional approaches

Typically, most AI and ML clusters, as well as high-performance computing systems, use active optical cables (AOCs) to interconnect GPUs and switches. An active optical cable is a fibre cable with integrated optical transmitters and receivers at both ends. Most AOCs, used for short distances, employ multimode fibre and vertical-cavity surface-emitting lasers (VCSELs).

The transmitter and receiver in an AOC do not need to meet rigorous interoperability specifications; they only need to function with the specific unit at the other end of the cable. And since no optical connectors are exposed to the installer, there is no need for the tools and skills to clean and inspect fibre connectors.

However, AOCs traditionally entail a higher failure rate (roughly double that of pluggable transceivers) and a time-consuming installation and rerouting process that detracts from operational uptime. Additionally, upgrading network links necessitates the complete removal and replacement of the AOCs.

Transceivers, on the other hand, are seen as a generational solution, as the fibre cabling becomes part of the infrastructure and is likely to remain in place for several decades, accommodating various data rates.

Securing the future, today

For all the powers and wonders that AI and ML can muster, the onus of running these systems smoothly and as cost-efficiently as possible still falls on the human operator.

Enterprises that have in place an optimised data centre infrastructure to train AI quickly and efficiently will have a competitive advantage in a fast-changing, super-connected world.

Today’s investment in the advanced fibre infrastructure that drives AI training and operation could very well deliver incredible results tomorrow.