A well-defined OE process that spans countries and locations is a must for providing seamless services to global DC customers

In an increasingly digitized world, organizations are adopting a cloud-first approach, moving data, processes, and compute to cloud-based servers. In this new environment, IT complexity has gone up dramatically and data centers (DCs) have become the central nervous system of the modern IT network, powering the cloud and providing organizations with global reach.

Data centers have their roots in the computer rooms of the 1940s. For most of that history, before the advent of the digital era, a typical DC was a smaller and less complicated facility than its modern-era counterparts. In these small DCs, it was possible to manually check each component to ensure everything was in working order.

The role and scope of DCs have changed dramatically with rapid digitalization, which is driving demand for more data storage and cloud-based compute and processing. This, in turn, has greatly increased the complexity of DC facilities. In addition, given the mission-critical nature of DCs that are connected to the Internet, they need to be always on and available. In this scenario, a manual approach to maintenance and troubleshooting no longer works. In the DC industry, it is a matter of record that many facility outages can be attributed to human or mechanical error resulting from poor operations and maintenance practices.

Modern-day DC operators find themselves having to simultaneously secure the IT side of operations (connectivity, rack space, and CPU availability) and the physical infrastructure, including UPS (uninterruptible power supply) devices, PDUs (power distribution units), chillers, and HVAC (heating, ventilation, and air conditioning), as well as the processes that ensure the physical security of these facilities.

With the size and the number of moving parts increasing dramatically, it is no longer possible to ensure the 99.99 percent service availability typically demanded in SLAs (service level agreements) without sophisticated systems and processes. Technology and automation are key to ensuring that all moving parts work smoothly in a large DC.
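To put the 99.99 percent figure in perspective, an availability target can be converted into a downtime budget. The sketch below is illustrative only: the function name `downtime_budget_minutes` and the 365-day year are assumptions made for this example, and actual SLAs define their own measurement periods and exclusions.

```python
# Illustrative sketch: convert an availability target into a downtime budget.
# Assumes a 365-day year; real SLAs define their own measurement windows.
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes


def downtime_budget_minutes(availability_percent: float,
                            period_minutes: float = MINUTES_PER_YEAR) -> float:
    """Return the maximum downtime (in minutes) allowed within the period."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability_percent must be between 0 and 100")
    return period_minutes * (1 - availability_percent / 100)


if __name__ == "__main__":
    for target in (99.9, 99.99, 99.999):
        budget = downtime_budget_minutes(target)
        print(f"{target}% availability -> {budget:.2f} minutes of downtime per year")
```

Under these assumptions, a 99.99 percent target leaves roughly 52.6 minutes of downtime per year across the entire facility, which is why manual checks alone cannot sustain it.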

Culture of OE
While technology solutions are available, systems, processes, and training are a constant requirement for DC operators like Princeton Digital Group (PDG). This is why a culture of operational excellence (OE) is vital for companies like PDG: it allows DC operators to create the environment required for mission-critical operations and to minimize the risk of downtime. OE allows operators to ensure that the equipment used to store, process, or route data is consistently secured and operates within a controlled environment.

DC operators use OE methodologies to quantitatively measure every aspect of the DC facility and make improvements based on these measurements. This is a continuous process that successful companies use to improve their systems by making sure all operational aspects are under control. A well-defined OE methodology helps keep data centers available around the clock.

A company-wide OE culture is especially critical for DC operators that have multiple facilities across different countries, each with its local peculiarities. The internet is global, and global internet platforms grow in scale by leveraging multi-country networks that are seamlessly connected. For DC providers to continue to serve these platforms at a relevant scale, it is critical to be a multi-country partner to them. Many of PDG's customers have a presence in geographies where PDG also has facilities, and they expect to receive the same level of service in every country.

Challenges of OE in a multi-country environment
Each DC campus can be thought of as a massive machine that spans acres of space. The campus is connected to power and water utilities, has massive onsite backup generators, delivers tens and sometimes hundreds of megawatts (MW) of power, and dissipates billions of BTUs (British Thermal Units) of heat a month, all the while ensuring 100 percent operational uptime.
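The scale of the heat figure can be checked with simple arithmetic. The sketch below is a rough estimate, not a measurement of any real site: the function name `monthly_heat_btu`, the 10 MW load, the 30-day month, and the simplification that all electrical load is constant and ends up as heat are assumptions made for illustration. It uses the approximate conversion 1 kWh ≈ 3,412.14 BTU.

```python
# Rough estimate of heat dissipated per month by a constant electrical load.
# Assumptions (illustrative only): constant load, 30-day month, and all
# electrical energy delivered to the load ends up as heat.
BTU_PER_KWH = 3412.14  # approximate conversion factor


def monthly_heat_btu(load_mw: float, days: int = 30) -> float:
    """Return the heat, in BTU, produced by a constant load over the period."""
    hours = days * 24
    energy_kwh = load_mw * 1000 * hours  # MW -> kW, then kW * h = kWh
    return energy_kwh * BTU_PER_KWH


if __name__ == "__main__":
    for load in (10, 100):
        billions = monthly_heat_btu(load) / 1e9
        print(f"{load} MW for 30 days -> about {billions:.1f} billion BTU")
```

Under these assumptions, a 10 MW load produces on the order of 25 billion BTU of heat in a month, so "billions of BTUs" holds even at the low end of the campus sizes described above.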

For DCs to deliver true OE, all the components must work in perfect harmony, and this harmony must be resilient to unexpected events such as external shocks and human errors.

Even for a single DC, this is a challenging task. The challenge is multiplied many times over in a multi-country environment, since each country has its own regulations, utility infrastructure, vendor and support ecosystem, and talent pool. Multi-country DC operators like PDG address this challenge with a company-wide set of best practices embedded in their OE methodology. This provides a consistent operational experience for customers across all their sites.

Delivering OE in a multi-country context
Delivering OE consistently relies on three pillars: People, Principles, and Practices. This framework applies at a site level, at a country level, and across a multi-country platform. Of the three, common principles are the easiest to drive across multiple countries. Frameworks that include a consistent basis of design help drive these principles across site selection, engineering, development, and actual operations. Communication within the organization is essential to formulating company-wide principles. The principles of OE should remain inviolate and be communicated often.

While principles set up the framework, practices are what ensure that these principles are applied and absorbed into the organization’s fabric. Practices are a two-way street: while the principles need to be non-negotiable, practices can vary from country to country and sometimes from site to site within a country. For example, practices in countries where the power utility is fundamentally unreliable will differ somewhat from those in countries where a utility power failure is a rare event.

The real challenge is in the people pillar. There are significant variations in the depth of the talent pool across the different markets in which PDG operates. Each market has evolved differently, depending on the talent pool and the diversity of available employees. At a platform level, leveraging this diversity of experience is critical. While maintaining OE is about delivering consistent standards, building true fault tolerance is about being ready to meet the unknown, and this can only be done by harnessing the variance between countries rather than trying to put them all in a homogeneous box.

Going beyond robustness
Nassim Nicholas Taleb, mathematician, writer, and thinker, talks about how scale is sometimes not good, since along with scaling operations you are also scaling risk. Intuitively, this rings true: a large DC campus feels more vulnerable and harder to control than a smaller data center because of its higher complexity. However, we cannot escape the fact that we are building increasingly larger data centers. As we build these larger sites, we must be even more focused on ensuring their robustness. OE provides the methodology to ensure this robustness. Being a multi-country business also brings a measure of potential robustness, as you learn from events that affect not just you but other providers as well, and transmit this learning between countries.

About PDG
Princeton Digital Group (PDG) is a leading investor, developer and operator of Internet infrastructure. Headquartered in Singapore with presence and operations in China, Singapore, India, Indonesia, and Japan, its portfolio of data centers powers the expansion of hyperscalers and enterprises in the fastest-growing digital economies across Asia.

Author Varoon Raghavan

Chief Operating Officer, Co-Founder at PDG
