Cloud

4-minute read

26 January 2024

Understanding site reliability engineering: the big picture

NTT DATA’s site reliability engineering approach supports the full cloud-native development and management lifecycle as well as your overall business goals

Sunil Mathur

Vice-President: Product Engineering at NTT

If you’ve invested time and money to make your IT infrastructure robust enough to support your business operations, you don’t want to end up frustrated by downtime or unable to scale your infrastructure to match your business growth.

Site reliability engineering (SRE) refers to making complex IT infrastructure and operations highly reliable and scalable by applying software engineering principles.

It’s a framework that has been continually enhanced ever since Google coined the term in 2003, and it has become an industry-leading practice for service reliability.

According to the DevOps Institute’s Global SRE Pulse 2022 survey, 62% of respondent organizations are now using SRE processes, with 55% using them for specific teams, products or services and 19% applying SRE across the whole organization.

The key principles of SRE

SRE emphasizes automation and incorporates DevOps practices through a close relationship with development teams to create scalable and reliable systems. In traditional managed services, the level of integration with developers and the DevOps emphasis may vary.

SRE therefore introduces a cultural shift in how developers and operations teams collaborate, with a shared responsibility for reliability, whereas the traditional model may keep the roles of operations and development teams more distinct.

NTT DATA has spent decades helping organizations get the most value out of their infrastructure through managed services. We have recently formalized an SRE Services offering to help our clients drive agile software delivery with optimized reliability, performance and cost – especially as they migrate to the cloud as part of their digital transformation.

This means we collaborate with your team to assess, monitor and improve your processes, identify bottlenecks, streamline your workflows and optimize your resources:

We define service-level objectives (SLOs), which are measurable targets for the reliability of a service. SLOs help align your engineering and operations teams with your overall business goals. This is another way in which SRE differs from traditional managed services: meeting SLOs is prioritized over conventional service-level agreements (SLAs), which define the expected level of service but may not accurately capture user experience.

Then, we use an error budget (the difference between 100% reliability and an agreed-upon SLO) to quantify and manage acceptable levels of service disruption. For example, a 99.9% SLO leaves an error budget of 0.1%. If the error budget is exhausted, new features or changes are postponed until reliability improves.

Capacity planning is also important to confirm that your systems can handle both current and anticipated loads. It involves forecasting traffic patterns, analyzing performance metrics and scaling infrastructure accordingly.

SRE also incorporates principles of FinOps, for cost and performance optimization across IT, finance and business operations teams; DevSecOps, to address best practices for infrastructure, including security and resiliency, and minimize risk; and observability, to focus on system reliability by identifying performance issues and acting quickly to resolve them.
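
To make the error-budget idea above concrete, here is a minimal Python sketch of the arithmetic. The class name, field names and the request-based SLO are illustrative assumptions, not part of any NTT DATA tooling; a real SLO might be based on latency or availability instead.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ErrorBudget:
    """Hypothetical request-based error budget for one SLO window."""
    slo: float             # target success ratio, e.g. 0.999 for 99.9%
    total_requests: int    # requests served during the window
    failed_requests: int   # requests that missed the objective

    @property
    def allowed_failures(self) -> float:
        # The budget is the gap between 100% and the SLO, applied to traffic.
        return (1.0 - self.slo) * self.total_requests

    @property
    def remaining_fraction(self) -> float:
        # 1.0 means the budget is untouched; 0 or below means it is spent.
        if self.allowed_failures == 0:
            return 0.0
        return 1.0 - self.failed_requests / self.allowed_failures

    def can_release(self) -> bool:
        # Assumed policy: feature releases continue only while budget remains.
        return self.remaining_fraction > 0.0


def allowed_downtime_minutes(slo: float, window_days: int = 30) -> float:
    """Time-based view of the same budget for an availability SLO."""
    return (1.0 - slo) * window_days * 24 * 60


if __name__ == "__main__":
    budget = ErrorBudget(slo=0.999, total_requests=1_000_000, failed_requests=400)
    print(round(budget.remaining_fraction, 2), budget.can_release())
    print(round(allowed_downtime_minutes(0.999), 1))
```

With these sample numbers, 400 failures out of an allowance of about 1,000 leaves roughly 60% of the budget, and a 99.9% availability SLO over 30 days allows about 43.2 minutes of downtime.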

Dealing with incidents

When we apply SRE principles to improve system reliability within your organization, we can automate tasks such as deployment, scaling, monitoring and incident response. This creates consistency and reduces the risk of human error.

Monitoring and alerting systems are an essential part of SRE. Alerts based on key performance indicators and service-level indicators can identify and help to address potential problems before they affect users.
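
One common way to alert on service-level indicators is to track how fast the error budget is being consumed (the "burn rate"). The sketch below is an assumption about how such a check might look, not a description of a specific monitoring product; the function names, the two-window rule and the 14.4 threshold are illustrative. A burn rate of 14.4 sustained for one hour would use about 2% of a 30-day budget (14.4 × 1 h / 720 h).

```python
from typing import Tuple

Window = Tuple[int, int]  # (failed_requests, total_requests) in a time window


def burn_rate(errors: int, total: int, slo: float) -> float:
    """Observed error ratio divided by the error ratio the SLO allows."""
    if total == 0:
        return 0.0  # no traffic, so no budget is being consumed
    return (errors / total) / (1.0 - slo)


def should_page(long_window: Window, short_window: Window,
                slo: float, threshold: float = 14.4) -> bool:
    """Alert only if both windows burn budget faster than the threshold.

    The long window shows the problem is significant; the short window
    shows it is still happening, which helps avoid stale alerts.
    """
    return (burn_rate(*long_window, slo) >= threshold
            and burn_rate(*short_window, slo) >= threshold)


if __name__ == "__main__":
    # Sustained 2% errors against a 99.9% SLO: both windows burn at about 20x.
    print(should_page((200, 10_000), (20, 1_000), slo=0.999))
    # The short window has recovered, so the alert does not fire.
    print(should_page((200, 10_000), (1, 1_000), slo=0.999))
```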

And, should an incident occur, we follow well-defined incident response processes to restore service swiftly and we conduct post-incident reviews to learn from the experience.

More automation also means less toil – a term used in the SRE context to describe work that tends to be manual, repetitive, automatable and lacking enduring value. The less time an SRE practitioner spends on toil, the more room there is for long-term engineering project work.

Managing risk

As part of our SRE service, we identify and manage risks to system reliability. This includes evaluating the impact of changes, mapping out potential failure scenarios and introducing measures that will minimize risk.

You can never start minimizing risk too early. We will also work closely with your development teams on the design and architecture of new systems to make them more reliable and prevent reliability issues from being introduced later during the development lifecycle.

This is how we create a balance between the need for rapid development and innovation and the requirement for a highly reliable and available system.

Reliability in the cloud

SRE is core to a cloud-native strategy. Much of cloud transformation starts with a shift to infrastructure as code (IaC), which involves managing and provisioning computing infrastructure through machine-readable definition files rather than through manual or physical hardware configuration.

The goal is to treat infrastructure as if it were software, so it can be version-controlled, tested and automated – all part of the road to reliability.
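
As a rough illustration of treating infrastructure like software, the following Python sketch compares a declared (desired) state with the actual state and lists the changes needed to reconcile them. It is a toy model under stated assumptions: the `plan` function and the resource dictionaries are hypothetical and do not represent the API of any real IaC tool.

```python
from __future__ import annotations


def plan(desired: dict[str, dict], actual: dict[str, dict]) -> list[tuple[str, str]]:
    """Return the (action, resource_name) steps that make actual match desired."""
    changes: list[tuple[str, str]] = []
    for name in sorted(desired.keys() | actual.keys()):
        if name not in actual:
            changes.append(("create", name))   # declared but not yet provisioned
        elif name not in desired:
            changes.append(("delete", name))   # provisioned but no longer declared
        elif desired[name] != actual[name]:
            changes.append(("update", name))   # configuration has drifted
    return changes


if __name__ == "__main__":
    # The desired state would normally live in a version-controlled file.
    desired = {"web": {"replicas": 3}, "db": {"size": "large"}}
    actual = {"web": {"replicas": 2}, "cache": {"size": "small"}}
    for action, name in plan(desired, actual):
        print(action, name)
```

Because the desired state is plain data and `plan` is a pure function, both can be reviewed, versioned and unit-tested like any other code before a change is applied.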

Next, observability comes into play once again to ensure your system reliability continually meets service-level agreements and SLOs.

Take the first step

At NTT DATA, we’re always learning from our own SRE experience to improve our offering.

So, if you’re looking for highly reliable, resilient infrastructure and services, combined with optimized costs, improved performance and less risk, allow us to show you what we can do.

WHAT TO DO NEXT

Read more about NTT DATA’s Site Reliability Engineering Services to see how we can help you modernize with confidence.
