
If you’ve invested time and money to make your IT infrastructure robust enough to support your business operations, you don’t want to end up frustrated by downtime or unable to scale your infrastructure to match your business growth.

Site reliability engineering (SRE) refers to making complex IT infrastructure and operations highly reliable and scalable by applying software engineering principles.

It’s a framework that has been continually enhanced ever since Google coined the term in 2003, and it has become an industry-leading practice for service reliability.

According to the DevOps Institute’s Global SRE Pulse 2022 survey, 62% of respondent organizations now use SRE processes – with 55% using them for specific teams, products or services and 19% applying them across the whole organization.

The key principles of SRE

SRE emphasizes automation and incorporates DevOps practices through a close relationship with development teams to create scalable and reliable systems. In traditional managed services, the level of integration with developers and the DevOps emphasis may vary.

SRE therefore introduces a cultural shift in how development and operations teams collaborate: both share responsibility for reliability, whereas the traditional model tends to keep operations and development roles more distinct.

NTT DATA has spent decades helping organizations get the most value out of their infrastructure through managed services. We have recently formalized an SRE Services offering to help our clients drive agile software delivery with optimized reliability, performance and cost – especially as they migrate to the cloud as part of their digital transformation.

This means we collaborate with your team to assess, monitor and improve your processes, identify bottlenecks, streamline your workflows and optimize your resources:

  • We define service-level objectives (SLOs), which are measurable targets for the reliability of a service. SLOs help align your engineering and operations teams with your overall business goals. This is another way in which SRE differs from traditional managed services: meeting SLOs is prioritized over meeting conventional service-level agreements (SLAs), which define the expected level of service but may not accurately capture user experience.
  • Then, we use an error budget – the difference between 100% reliability and an agreed-upon SLO – to quantify and manage acceptable levels of service disruption. If the error budget is exhausted, new features or changes are postponed until reliability improves.
  • Capacity planning is also important to establish that your systems can handle both current and anticipated loads. It involves forecasting traffic patterns, analyzing performance metrics and scaling infrastructure accordingly.
  • SRE also incorporates principles of FinOps, for cost and performance optimization between IT, finance and business operations teams; DevSecOps, to address best practices for infrastructure, including security and resiliency, and minimize risk; and observability, to focus on system reliability by identifying performance issues and acting quickly to resolve them. 
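
To make the error budget idea above concrete, here is a minimal sketch in Python. It assumes a request-based availability SLO measured over a fixed window. The class name, field names and the "no releases once the budget is spent" rule are illustrative assumptions, not a description of any particular tool or of NTT DATA's implementation.

```python
from dataclasses import dataclass


@dataclass
class ErrorBudget:
    """Hypothetical request-based error budget for one SLO window."""

    slo_target: float      # e.g. 0.999 for a 99.9% availability SLO
    total_requests: int    # requests served during the SLO window
    failed_requests: int   # requests that counted as failures

    @property
    def allowed_failures(self) -> float:
        # Budget = (100% - SLO), applied to the traffic in the window.
        return (1.0 - self.slo_target) * self.total_requests

    @property
    def remaining_fraction(self) -> float:
        # 1.0 means an untouched budget; 0.0 or below means it is exhausted.
        if self.allowed_failures == 0:
            return 0.0
        return 1.0 - self.failed_requests / self.allowed_failures

    def releases_allowed(self) -> bool:
        # Assumed policy: pause feature releases once the budget is spent.
        return self.remaining_fraction > 0.0


budget = ErrorBudget(slo_target=0.999, total_requests=1_000_000, failed_requests=250)
print(f"Allowed failures: {budget.allowed_failures:.0f}")
print(f"Budget remaining: {budget.remaining_fraction:.0%}")
print(f"Releases allowed: {budget.releases_allowed()}")
```

In this example, a 99.9% SLO over one million requests tolerates about 1,000 failures, so 250 failures leave roughly three quarters of the budget available for planned change.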

Dealing with incidents

When we apply SRE principles to improve system reliability within your organization, we can automate tasks such as deployment, scaling, monitoring and incident response. This creates consistency and reduces the risk of human error.

Monitoring and alerting systems are an essential part of SRE. Alerts based on key performance indicators and service-level indicators can identify and help to address potential problems before they affect users.
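
One common way to alert on service-level indicators is to track how fast the error budget is being consumed, often called the burn rate. The sketch below is illustrative only: the function names are hypothetical, and the default threshold of 14.4 is an assumed example value that would need tuning for each service.

```python
def burn_rate(error_ratio: float, slo_target: float) -> float:
    """How many times faster than 'sustainable' the budget is being spent.

    A burn rate of 1.0 would use up exactly the whole error budget
    by the end of the SLO window.
    """
    budget = 1.0 - slo_target
    if budget <= 0:
        raise ValueError("SLO target must be below 100%")
    return error_ratio / budget


def should_page(
    long_window_error_ratio: float,
    short_window_error_ratio: float,
    slo_target: float,
    threshold: float = 14.4,  # assumed example value, not a universal constant
) -> bool:
    # Require both windows to exceed the threshold: the long window shows
    # the problem is significant, the short window shows it is still ongoing.
    return (
        burn_rate(long_window_error_ratio, slo_target) >= threshold
        and burn_rate(short_window_error_ratio, slo_target) >= threshold
    )


# 2% errors over the last hour and 3% over the last five minutes, 99.9% SLO.
print(should_page(0.02, 0.03, slo_target=0.999))
# The same hourly figure, but the last five minutes have recovered.
print(should_page(0.02, 0.001, slo_target=0.999))
```

Checking two windows is a design choice that keeps alerts from firing on brief spikes or continuing long after an issue has been resolved.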

And, should an incident occur, we follow well-defined incident response processes to restore service swiftly and we conduct post-incident reviews to learn from the experience.

More automation also means less toil – a term used in the SRE context to describe work that tends to be manual, repetitive, automatable and lacking enduring value. The less time an SRE practitioner spends on toil, the more room there is for long-term engineering project work.

Managing risk

As part of our SRE service, we identify and manage risks to system reliability. This includes evaluating the impact of changes, mapping out potential failure scenarios and introducing measures that will minimize risk.

You can never start minimizing risk too early. We will also work closely with your development teams on the design and architecture of new systems to make them more reliable and prevent reliability issues from being introduced later during the development lifecycle.

This is how we create a balance between the need for rapid development and innovation and the requirement for a highly reliable and available system.

Reliability in the cloud

SRE is core to a cloud-native strategy. Much of cloud transformation starts with a shift to infrastructure as code (IaC), which involves managing and provisioning computing infrastructure through machine-readable definition files rather than through manual hardware configuration or interactive tools.

The goal is to treat infrastructure as if it were software, so it can be version-controlled, tested and automated – all part of the road to reliability.
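
As a simplified illustration of that idea, the sketch below compares a desired state (what would live in version control) with the actual state of an environment and works out what needs to change. The resource names and the `plan_changes` function are hypothetical. Real IaC tools perform a comparable comparison with far more detail.

```python
def plan_changes(
    desired: dict[str, dict], actual: dict[str, dict]
) -> dict[str, list[str]]:
    """Compare declared infrastructure with what currently exists."""
    # Declared but not yet running.
    create = sorted(name for name in desired if name not in actual)
    # Running but no longer declared.
    delete = sorted(name for name in actual if name not in desired)
    # Present on both sides, but the configuration has drifted.
    update = sorted(
        name
        for name in desired
        if name in actual and desired[name] != actual[name]
    )
    return {"create": create, "update": update, "delete": delete}


# Desired state, as it might be declared in a version-controlled file.
desired_state = {
    "web-server": {"size": "medium", "count": 3},
    "database": {"size": "large", "count": 1},
}

# Actual state, as it might be reported by the environment.
actual_state = {
    "web-server": {"size": "medium", "count": 2},
    "legacy-cache": {"size": "small", "count": 1},
}

print(plan_changes(desired_state, actual_state))
```

Because the desired state is plain data, it can be reviewed, tested and rolled back in the same way as application code.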

Next, observability comes into play once again to ensure your system reliability continually meets service-level agreements and SLOs. 

Take the first step

At NTT DATA, we’re always learning from our own SRE experience to improve our offering.

So, if you’re looking for highly reliable, resilient infrastructure and services, combined with optimized costs, improved performance and less risk, allow us to show you what we can do.

WHAT TO DO NEXT

Read more about NTT DATA’s Site Reliability Engineering Services to see how we can help you modernize with confidence.