AI Data Centres Are The Building Blocks Of A Tech-Enabled Business Ecosystem

Pallavi Singal, Editor

13 Sept 2024, 2:15 pm GMT+1

As the global AI data centre market accelerates towards $75 billion by 2025, why are advanced AI data centres so important for businesses?

AI's progress is not just a technological marvel; it necessitates a re-evaluation of how data centres are built and managed. With AI models becoming larger and more complex, data centres must adapt to handle the unprecedented scale of these systems. 

As noted by Dylan Patel from SemiAnalysis, "With these large systems, no matter what, you can't fit it on a single chip, even if you're Cerebras. Well, how do I connect all these split-up chips together? If it's 100 that's manageable, but if it's thousands or tens of thousands, then you're starting to have real difficulties."

AI data centres are emerging as pivotal infrastructure components, essential for powering the next generation of AI technologies. They are not merely supporting the existing computational needs but are at the forefront of enabling breakthroughs in AI capabilities. The importance of these facilities extends beyond mere data processing; they are integral to the scalability and efficiency of AI operations, driving forward the boundaries of what artificial intelligence can achieve.

Artificial intelligence (AI) has developed so quickly that it has transformed entire industries and ushered in an era of unprecedented computing demands. With 77% of companies either using or exploring AI and 83% prioritising it in their business plans, those computational demands are only set to grow.

As AI models grow increasingly sophisticated, they require data centres that can keep pace with their immense processing power and storage needs. This shift has prompted a significant evolution in data centre design and operation, reflecting the growing importance of these facilities in supporting AI-driven innovations.

The evolution of AI data centres

The transformative journey of AI data centres began in 2017 with Google's seminal paper, "Attention is All You Need," which introduced the transformer model. This innovation enabled significant parallelisation, drastically reducing AI training times. This breakthrough spurred the development of generative AI models based on transformers, such as OpenAI’s GPT-4, a leading large language model (LLM).

Rishi Bommasani, co-founder of Stanford’s Center for Research on Foundation Models, states, "The model is a combination of lots of data and lots of compute. Once you have a foundation model, you can adapt it for a wide variety of different downstream applications." 

Scaling laws and optimisation

The development of these models hinges on scaling laws, which describe how performance depends on model size, data volume, and computational resources. OpenAI's 2020 paper, led by Jared Kaplan, showed that performance improves predictably as model size, dataset size, and compute power are increased together. However, this approach has since evolved.

In 2022, DeepMind introduced the 'Chinchilla scaling laws', which suggested that previous models were too large for the data they were trained on, carrying too many parameters relative to the number of training tokens. DeepMind's Chinchilla model, with 70 billion parameters, outperformed the much larger 280-billion-parameter Gopher model by following a more compute-optimal strategy. This result highlighted the potential of smaller, more efficient models, shifting the focus from sheer size to optimal resource allocation.
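
To make "compute-optimal" concrete, the sketch below applies the widely cited rule of thumb from this line of work: roughly 20 training tokens per parameter, with training compute approximated as about 6 FLOPs per parameter per token. The constants and the Python framing are illustrative assumptions, not figures taken directly from the DeepMind paper.

```python
# Illustrative sketch of the Chinchilla-style rule of thumb (assumed constants):
# ~20 training tokens per parameter, and training compute C ~= 6 * N * D FLOPs.

def compute_optimal_tokens(n_params: float, tokens_per_param: float = 20.0) -> float:
    """Approximate training tokens for a compute-optimal run of an N-parameter model."""
    return n_params * tokens_per_param

def training_flops(n_params: float, n_tokens: float) -> float:
    """Common approximation: ~6 floating-point operations per parameter per token."""
    return 6.0 * n_params * n_tokens

for n_params in (70e9, 280e9, 1e12):  # Chinchilla-sized, Gopher-sized, one trillion
    tokens = compute_optimal_tokens(n_params)
    flops = training_flops(n_params, tokens)
    print(f"{n_params / 1e9:>6.0f}B params -> ~{tokens / 1e12:.1f}T tokens, ~{flops:.1e} FLOPs")
```

Under this heuristic, a 70-billion-parameter model trained on roughly 1.4 trillion tokens can match or beat a far larger model trained on less data for the same compute budget, which is the essence of the Chinchilla result.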

The cost of scaling

Training these colossal models demands substantial resources. For instance, training a trillion-parameter model on Nvidia A100 GPUs would cost approximately $308 million over three months, excluding additional costs. According to SemiAnalysis, the optimal training of a ten trillion parameter model could reach $28.9 billion over two years, underscoring the immense financial investment required.
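
To see how figures of this magnitude arise, here is a hedged back-of-envelope estimate: total training FLOPs divided by cluster throughput gives wall-clock time, and total GPU-hours multiplied by an hourly price gives cost. Every input below (cluster size, sustained throughput, rental price, token count) is an assumption chosen for illustration, not a figure from SemiAnalysis.

```python
# Hedged back-of-envelope estimate of large-model training cost.
# All inputs are illustrative assumptions, not SemiAnalysis's figures.

def training_cost(n_params, n_tokens, n_gpus, sustained_flops_per_gpu, usd_per_gpu_hour):
    total_flops = 6.0 * n_params * n_tokens                      # ~6 FLOPs/param/token
    wall_clock_s = total_flops / (n_gpus * sustained_flops_per_gpu)
    gpu_hours = total_flops / sustained_flops_per_gpu / 3600.0   # total GPU-hours consumed
    return gpu_hours * usd_per_gpu_hour, wall_clock_s / 86400.0  # (cost in USD, days)

cost_usd, days = training_cost(
    n_params=1e12,                   # a trillion-parameter model
    n_tokens=20e12,                  # ~20 tokens per parameter (assumption)
    n_gpus=50_000,                   # cluster size (assumption)
    sustained_flops_per_gpu=150e12,  # ~150 TFLOPS sustained per A100 (assumption)
    usd_per_gpu_hour=2.0,            # effective rental price (assumption)
)
print(f"~${cost_usd / 1e6:.0f}M over ~{days:.0f} days")
```

The point is not the exact output but how sensitively cost scales: a compute-optimal run that multiplies both parameters and tokens by ten, as in the ten-trillion-parameter scenario above, multiplies the compute bill by roughly a hundred.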

Jaime Sevilla of Epoch AI Research notes, “The most expensive model where we can reasonably compute the cost of training is Google’s [540bn parameter] Minerva,” estimating a $10 million expenditure after multiple training runs. As demand for these models grows, retraining for new data and maintaining accuracy further escalates costs.

Inference and efficiency

Once trained, these models must be deployed, a process known as inference, which, though less resource-intensive than training, requires substantial compute power due to widespread usage. Finbarr Timbers, a former DeepMind researcher, explains, “Making the model bigger is worse in every way except performance. It’s this necessary evil that you do.”

Strategies to reduce inference costs include sparsity, pruning, and mixture of experts (MoE) models. Despite their promise, MoE models face challenges such as complexity and training instability. Dylan Patel of SemiAnalysis anticipates significant growth in MoE utilisation, predicting that efficiency improvements will not halt the drive to scale further.
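
To illustrate why mixture of experts can cut inference cost, the toy NumPy sketch below routes each token to only its top-k experts, so only a fraction of the layer's parameters is touched per token. It is a minimal sketch of the general idea, with made-up dimensions, not any production MoE implementation.

```python
import numpy as np

# Toy mixture-of-experts (MoE) layer: each token is routed to its top-k experts,
# so only ~top_k/n_experts of the expert parameters are used per token.
rng = np.random.default_rng(0)
d_model, n_experts, top_k = 64, 8, 2

experts = [rng.normal(size=(d_model, d_model)) for _ in range(n_experts)]  # expert weights
router = rng.normal(size=(d_model, n_experts))                             # routing weights

def moe_forward(x):                                    # x: (n_tokens, d_model)
    logits = x @ router                                # routing score per token/expert
    logits -= logits.max(axis=-1, keepdims=True)       # stabilise the softmax
    gates = np.exp(logits) / np.exp(logits).sum(axis=-1, keepdims=True)
    chosen = np.argsort(logits, axis=-1)[:, -top_k:]   # top-k experts per token
    out = np.zeros_like(x)
    for t, token in enumerate(x):
        for e in chosen[t]:                            # only the chosen experts run
            out[t] += gates[t, e] * (token @ experts[e])
    return out

tokens = rng.normal(size=(4, d_model))
print(moe_forward(tokens).shape)  # (4, 64), computed with 2 of 8 experts per token
```

The routing step is also where the complexity and training instability mentioned above tend to originate: gating decisions are discrete, and load across experts can become unbalanced.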

Technological challenges and innovations

The rapid advancement of AI technologies has exposed significant limitations in both silicon capabilities and networking infrastructure. As AI models grow increasingly complex, traditional chips struggle to keep pace. "With these large systems, no matter what, you can't fit it on a single chip, even if you're Cerebras," observes Dylan Patel from SemiAnalysis. This highlights the pressing challenge of scaling silicon to meet the demands of massive AI models. The struggle is not just about creating more powerful chips but also about effectively connecting these chips to handle vast amounts of data efficiently.

Networking is another critical area where innovations are essential. The sheer scale of AI systems, which can involve thousands of chips working in concert, requires advanced networking solutions to manage data transfer and communication. Traditional networking architectures are often inadequate for these needs, prompting the development of novel approaches to data centre design.

Cloud company involvement

In response to these challenges, cloud companies are taking a leading role in developing proprietary networking gear and topologies tailored to AI workloads. Amazon Web Services (AWS), for instance, has made significant strides with its Nitro networking cards. By deploying clusters of up to 20,000 GPUs and integrating Elastic Fabric Adapters, AWS has created a high-performance network architecture designed to handle the immense data demands of AI. This bespoke approach allows AWS to optimise data flow and enhance the efficiency of its AI processing capabilities.

Similarly, Google is at the forefront of innovation with its Mission Apollo project. This ambitious endeavour involves deploying custom optical switching technology at an unprecedented scale. Unlike traditional data centre networks that rely on electronic packet switches, Apollo uses optical interconnects to manage data distribution more efficiently. By redirecting beams of light with mirrors, Google’s optical circuit switch enables high-bandwidth communication and dynamic reconfiguration. This technology not only improves data transfer speeds but also allows for more flexible and resilient network topologies.

As AI technology continues to evolve, addressing these technological challenges will be crucial for maintaining performance and efficiency in data centres. The advancements in silicon design and networking infrastructure are pivotal in supporting the ever-growing computational needs of AI models, ensuring that data centres can keep pace with the demands of the future.

Case studies: Cloud giants leading the way

Amazon Web Services (AWS)

Amazon Web Services (AWS) has emerged as a significant player in the realm of AI data centres, leveraging its proprietary technology to push the boundaries of computational capabilities. AWS has deployed large-scale GPU clusters, in some cases with as many as 20,000 GPUs. Central to this infrastructure is AWS’s Nitro technology, which includes Elastic Fabric Adapters (EFAs) that enhance network performance. Chetan Kapoor from AWS highlights, “We leverage our Nitro technology to have our own network adapters, which we call Elastic Fabric Adapters.”

One of the key advancements under AWS’s strategy is its focus on bandwidth expansion. The company is upgrading its Elastic Fabric Adapters to support greater per-node bandwidth: as clusters move from A100-based to H100-based instances, per-node bandwidth is expected to reach up to 3,200Gbps. This expansion is critical for meeting the growing demands of AI workloads, ensuring that data processing and transfer speeds are optimised for large-scale AI models.
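
As a rough, hedged illustration of what 3,200Gbps per node means in practice, the snippet below estimates how long it would take to move a given volume of model or gradient data at that line rate. The payload sizes are assumptions chosen purely for illustration, and real workloads lose further time to protocol overhead and synchronisation.

```python
# Rough illustration of per-node transfer times at a 3,200Gbps line rate.
# Payload sizes are illustrative assumptions; real jobs see extra overhead
# from protocols, collective-communication algorithms, and synchronisation.

LINK_GBPS = 3_200
bytes_per_second = LINK_GBPS * 1e9 / 8          # = 400 GB/s at line rate

payloads_gb = {
    "fp16 weights/gradients, 70B-parameter model": 140,
    "fp16 weights/gradients, 1T-parameter model": 2_000,
}
for name, gigabytes in payloads_gb.items():
    seconds = gigabytes * 1e9 / bytes_per_second
    print(f"{name}: ~{seconds:.2f}s per full exchange at line rate")
```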

Google

Google’s approach to tackling AI data centre challenges involves a multi-year overhaul of its network infrastructure. The company’s ambitious Mission Apollo is a testament to this effort, focusing on deploying custom optical switching technology. Traditionally, data centre networks are structured with spine-and-leaf configurations built around electronic packet switches. Apollo, however, replaces the spine with optical interconnects that use mirrors to redirect light beams. Google’s Amin Vahdat explains, “Apollo has allowed the company to build networking topologies that are more closely matched to the communication patterns of these training algorithms.”

The benefits of Google’s optical switching technology are manifold. Apollo facilitates flexible capacity deployment by enabling dynamic reconfiguration of network connections. This adaptability is crucial for managing the immense data throughput required for training AI models, where network demands can fluctuate rapidly. Additionally, the optical circuit switch's ability to swiftly adjust to changes in network topology enhances overall efficiency and reduces the need for complete system restarts during failures or maintenance.
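
One hedged way to picture an optical circuit switch is as a reconfigurable one-to-one mapping between ports: re-pointing the mirrors installs a new mapping, so the fabric can be re-patched in software to suit a new communication pattern rather than being physically rewired. The toy model below sketches that idea only; it is not a description of Google’s actual Apollo hardware or control software.

```python
# Toy model of an optical circuit switch (OCS): a reconfigurable one-to-one
# port mapping. Purely illustrative; not Google's actual Apollo design.

class OpticalCircuitSwitch:
    def __init__(self, n_ports):
        self.n_ports = n_ports
        self.mapping = {p: p for p in range(n_ports)}   # start with identity patching

    def reconfigure(self, mapping):
        """Re-point the mirrors: install a new one-to-one port mapping."""
        assert sorted(mapping) == sorted(mapping.values()) == list(range(self.n_ports))
        self.mapping = dict(mapping)

    def route(self, in_port):
        return self.mapping[in_port]

ocs = OpticalCircuitSwitch(4)
ocs.reconfigure({0: 1, 1: 2, 2: 3, 3: 0})   # re-patch to suit a ring-style traffic phase
print([ocs.route(p) for p in range(4)])      # [1, 2, 3, 0]
```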

Implications for the wider data centre industry

Generative AI's impact

The rapid evolution and increasing scale of generative AI are reshaping the data centre landscape, necessitating significant adjustments in network architecture and data centre design. As generative AI models continue to expand in complexity and computational demands, data centres may face the need to completely overhaul their existing infrastructures. Ivo Ivanov, CEO of Internet exchange DE-CIX, points out, "If generative AI becomes a major workload, then every data centre in the world could find that it has to rebuild its network."

This shift stems from the substantial computational and data transfer requirements associated with training and deploying large-scale AI models. The traditional data centre setups, often designed for more generalised applications, may not be adequate to handle the intensive demands of AI workloads. Consequently, data centres will need to adapt by implementing new networking strategies and technologies to accommodate the increasing data traffic and connectivity needs.

Critical services

The emergence of generative AI underscores the importance of several critical services within data centres. These include:

  • Cloud exchange: Direct connectivity to individual cloud providers is becoming crucial as organisations increasingly rely on specific cloud services for their AI workloads.
  • Direct interconnection: This involves establishing high-bandwidth connections between different cloud environments used by enterprises, facilitating seamless data transfer and collaboration.
  • Peering: Direct interconnects to other networks and end-user customers are essential for optimising data flows and enhancing network performance.

Industry response

To address the evolving needs of the data centre industry, operators must focus on developing and offering future-proof interconnection platforms. As the landscape shifts, the industry is compelled to innovate and adapt its offerings to stay relevant and competitive. This involves integrating advanced networking solutions, such as high-capacity fibre optics and enhanced interconnection technologies, to provide seamless and efficient connectivity.

The necessity for data centre operators to stay ahead of technological advancements and anticipate future requirements cannot be overstated. As generative AI and other emerging technologies continue to drive change, the ability to provide flexible, scalable, and high-performance network solutions will be critical in maintaining operational efficiency and meeting the expectations of clients and stakeholders.

Structural and geographical shifts in data centre design

The advent of generative AI is driving a transformation in data centre design, leading to both larger facilities and more intensive power requirements. As AI models become more complex and demand more computing power, the traditional data centre configurations are evolving to accommodate these changes. Digital Realty’s CEO, Andy Power, remarks, “It's still new as to how it plays out in the data centre industry, but it's definitely going to be large-scale demand. Just do the math on these quotes of spend and A100 chips and think about the gigawatts of power required for them.”

This shift necessitates the development of bigger data centres capable of supporting higher power densities. With AI models requiring more processing power, data centres are increasingly facing the challenge of managing higher heat outputs. Consequently, there is a growing emphasis on innovative cooling solutions to maintain operational efficiency and prevent overheating. The design of new data centres must integrate advanced cooling technologies and accommodate the dense packing of high-power servers.

In response to the rising demands of AI workloads, the industry is seeing the development of specialised data centre facilities. These dedicated buildings are designed to handle the unique requirements of AI training and inference processes. By focusing on AI-specific needs, these facilities can optimise infrastructure and improve performance for large-scale computational tasks.

Amazon exemplifies this trend with its strategic approach to data centre deployment. The company is building large clusters in key regions, such as Northern Virginia and Oregon, with dedicated facilities for AI training and inference. This strategy includes the use of specialised infrastructure, such as high-speed storage racks and advanced cooling systems, tailored to support AI models effectively.

Similarly, Google is adopting a forward-thinking vision by blurring the line between training and serving. The company’s approach involves creating large-scale clusters specifically for AI training, while also considering how these facilities can evolve to support ongoing model inference. This strategy aims to streamline operations and enhance the efficiency of both the training and deployment phases.

As AI continues to evolve, data centre designs must adapt to accommodate these specialised requirements. The development of dedicated facilities and innovative cooling solutions is essential to meet the growing demands of AI technologies and ensure that data centres remain capable of supporting cutting-edge applications.

Scaling AI data centres for the future

Expanding data centre capacity comes with its own set of challenges. Environmental and logistical constraints are major factors that data centre operators must navigate. Power constraints, for example, are a significant concern; finding sufficient power sources to support large-scale AI operations is becoming increasingly difficult. 

Scaling AI data centres to meet the rapidly growing demands of generative AI is a monumental task. Andy Power, CEO of Digital Realty, highlights the industry's struggle: “Demand keeps out-running supply, [the industry] is bending over coughing at its knees because it's out of gas.”

“The third wave of demand is not coming at a time that is fortuitous for it to be easy streets for growth.” This shift underscores the need for massive and highly specialised facilities to support AI workloads.

The traditional cloud approach, which involves distributing workloads across various regions, may not suffice for AI model training. Due to the high intensity of compute required, AI facilities need to be strategically located, balancing proximity to other data centres with the necessity for powerful infrastructure and data exchange capabilities. This means that while AI data centres will remain focused on major metropolitan areas, finding suitable, contiguous land and power resources is becoming increasingly challenging.

Cooling solutions are another critical aspect of this evolution. As data centres incorporate more power-dense servers to handle AI workloads, they are becoming hotter environments. Digital Realty is exploring advanced cooling technologies, such as liquid cooling, which, though currently niche, may become more standard. This adaptation will be necessary to manage the increased heat generated by densely packed servers.

Amazon Web Services (AWS) is addressing these challenges by developing specialised data centres. According to AWS's Chetan Kapoor, the company is focusing on building large clusters with dedicated infrastructure for AI training, including storage racks to support high-speed file systems. This approach ensures that the required compute power and storage capacity are aligned with the specific needs of AI workloads. For inference tasks, AWS plans to integrate infrastructure across multiple regions to provide real-time support, reflecting the demand for low-latency applications.

Google is also adapting its strategy to the changing landscape. Amin Vahdat from Google acknowledges the need for specific clusters dedicated to large-scale training but suggests, “The interesting question here is, what happens in a world where you're going to want to incrementally refine your models? I think that the line between training and serving will become somewhat more blurred than the way we do things right now.”

The broader data centre industry faces significant challenges in scaling up to meet these demands. As Power points out, the sector is grappling with issues such as power constraints, environmental concerns, and supply chain delays. Addressing these challenges will require a concerted effort to expand capacity rapidly while balancing sustainability and efficiency. Despite these hurdles, the growth of AI presents a tremendous opportunity for the industry to innovate and evolve, ensuring that data centres can support the future of AI advancements.

Pallavi Singal

Editor

Pallavi Singal is the Vice President of Content at ztudium, where she leads innovative content strategies and oversees the development of high-impact editorial initiatives. With a strong background in digital media and a passion for storytelling, Pallavi plays a pivotal role in scaling the content operations for ztudium's platforms, including Businessabc, Citiesabc, IntelligentHQ, Wisdomia.ai, MStores, and many others. Her expertise spans content creation, SEO, and digital marketing, driving engagement and growth across multiple channels. Pallavi's work is characterised by a keen insight into emerging trends in business and society, and in technologies such as AI, blockchain, and the metaverse, making her a trusted voice in the industry.