2021 State of Chaos Engineering

Tanat Tonguthaisri
10 min read · Jan 27, 2021

Over the past twelve years, I’ve had the opportunity to be part of and watch the growth of Chaos Engineering. From its humble origins, most often met with “Why would you want to do that?” to its position today, helping ensure the reliability of the top companies in the world, it’s been quite the journey.

I first began practicing this discipline, years before it had a name, at Amazon, where it was our job to prevent the retail website from going down. As we were having success, Netflix wrote their canonical blog post on Chaos Monkey (ten years ago this July). The idea hit the mainstream and many engineers were hooked. After my tour of duty at Amazon, I rushed to join Netflix to dive deeper into this space. We were able to advance the art even further, building developer-focused solutions that spanned the entire Netflix ecosystem, ultimately resulting in another nine of availability and a world-renowned customer experience.

Five years ago, my co-founder, Matthew Fornaciari, and I founded Gremlin with a simple mission: Build a more reliable internet. We are both ecstatic to see how far the practice has come in that time. Many within the community have been hungry for more data on how best to leverage this approach, so we are proud to present the inaugural State of Chaos Engineering report.

Engineering teams across the globe use Chaos Engineering to intentionally inject harm into their systems, monitor the impact, and fix failures before they negatively impact customer experiences. In doing so, they avoid costly outages while reducing MTTD and MTTR, prepare their teams for the unknown, and protect the customer experience. In fact, Gartner anticipates that by 2023, 80% of organizations that use Chaos Engineering practices as part of SRE initiatives will reduce their mean time to resolution (MTTR) by 90%. The inaugural State of Chaos Engineering Report shows a similar pattern: top-performing Chaos Engineering teams boast four nines of availability with an MTTR of less than one hour.

Kolton Andrus

CEO, Gremlin

Key findings

  • Increased availability and decreased MTTR are the two most common benefits of Chaos Engineering.
  • Teams that frequently run Chaos Engineering experiments have >99.9% availability.
  • 23% of teams had a mean time to resolution (MTTR) of under 1 hour, and 60% under 12 hours.
  • Network attacks are the most commonly run experiments, in line with the top failures reported.
  • While Chaos Engineering is still an emerging practice, the majority of respondents (60%) have run at least one attack.
  • 34% of respondents run Chaos Engineering experiments in production.

Things break

From the survey, the top 20% of respondents had services with four nines of availability or better, an impressive level. 23% of teams had a mean time to resolution (MTTR) of under an hour, and 60% had an MTTR of under 12 hours.

What is the average availability of your service(s)?

  • <=99%: 42.5%
  • 99.5%–99.9%: 38.1%
  • >=99.99%: 19.4%

Average number of high severity incidents (Sev 0&1) per month

  • 1–10: 81.4%
  • 10–20: 18.6%

What is your mean time to resolution (MTTR)?

  • <1 hour: 23.1%
  • 1 hour – 12 hours: 39.8%
  • 12 hours – 1 day: 15.5%
  • 1 day – 1 week: 15.2%
  • >1 week: 0.5%
  • I don’t know: 5.9%

One of the more beneficial things that we did is run daily red versus blue attacks. We have the platform team come in, make attacks against us and our services and treat it like a real production incident by responding, and going through all of our run books and making sure we were covered.

JUSTIN TURNER

SEE HOW H-E-B PREPARED FOR BAD CODE PUSHES AND OUTAGES

When things do break, the most common causes were bad code pushes and dependency issues. These are not mutually exclusive. A bad code push from one team can cause a service outage for another. In modern systems where teams own independent services, it’s important to test all services for resiliency to failure. Running network-based chaos experiments, such as latency and blackhole, ensures that systems are decoupled and can fail independently, minimizing the impact of a service outage.
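For example, a basic latency experiment can be sketched with nothing more than Linux’s tc/netem facility (a minimal illustration, not the tooling covered by the report; it assumes a Linux host, root privileges, and an interface named eth0):

```python
import subprocess
import time

INTERFACE = "eth0"   # assumed network interface
DELAY_MS = 300       # injected egress latency
DURATION_S = 60      # how long the experiment runs

def inject_latency():
    """Add fixed egress latency on the interface using tc netem."""
    subprocess.run(
        ["tc", "qdisc", "add", "dev", INTERFACE, "root", "netem",
         "delay", f"{DELAY_MS}ms"],
        check=True,
    )

def remove_latency():
    """Roll back the experiment by deleting the netem qdisc."""
    subprocess.run(
        ["tc", "qdisc", "del", "dev", INTERFACE, "root", "netem"],
        check=True,
    )

if __name__ == "__main__":
    inject_latency()
    try:
        # Observe dashboards and alerts for the dependency under test here.
        time.sleep(DURATION_S)
    finally:
        remove_latency()  # always clean up, even if interrupted
```

While the delay is in place, watch whether dependent services time out gracefully, retry sensibly, or cascade into failure; a blackhole experiment is the same idea with packets dropped instead of delayed.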

What percent of your incidents (SEV0&1) have been caused by:

(Share of respondents answering <20%, 21–40%, 41–60%, 61–80%, and >80% of incidents, respectively)

  • Bad code deploy: 39%, 24%, 19%, 11.8%, 6.1%
  • Internal dependency issues: 41%, 25%, 20%, 10.1%, 3.7%
  • Configuration error: 48%, 23%, 14%, 10.1%, 5.2%
  • Networking issues: 50%, 19%, 13%, 15.7%, 1.7%
  • 3rd party dependency issues: 48%, 23%, 13%, 14.3%, 1%
  • Managed service provider issues: 61%, 14%, 9%, 12.5%, 3.9%
  • Machine/infrastructure failure (on-prem): 64%, 14%, 6%, 12%, 4.4%
  • Database, messaging, or cache issues: 58%, 18%, 18%, 5.2%, 1.2%
  • Unknown: 66%, 10%, 15%, 7.4%, 1%

Who finds out

Monitoring for availability varies by company. For example, Netflix’s traffic is so consistent that it can use server-side video starts per second to spot an outage: any deviation from the projected pattern signals a problem. Google uses Real User Monitoring combined with windowing to determine whether a single outage had a large impact or multiple small incidents are affecting a service, which leads to deeper analysis of the cause of the incident(s). Few companies have traffic patterns as consistent, or statistical models as sophisticated, as Netflix and Google. That’s why standard uptime over total time, measured with synthetic monitoring, is the most popular way to track service availability, though many organizations use multiple methods and metrics. We were pleasantly surprised that all of the respondents monitor availability. This is often the first step teams take toward proactively improving the customer experience in their applications.
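As a minimal sketch of the two most common definitions in the chart below, computed from illustrative inputs (the Probe record and the request counts are assumptions for the example, not survey data):

```python
from dataclasses import dataclass

@dataclass
class Probe:
    timestamp: float
    healthy: bool  # result of one synthetic health check

def uptime_availability(probes: list[Probe]) -> float:
    """Uptime / total time period, approximated as the share of passing probes."""
    return sum(p.healthy for p in probes) / len(probes) if probes else 0.0

def success_rate_availability(successful: int, total: int) -> float:
    """Successful requests / total requests (the complement of error rate)."""
    return successful / total if total else 1.0

# Example: 1,438 of 1,440 one-minute probes passed over a day -> ~99.86%
probes = [Probe(timestamp=60.0 * i, healthy=i not in (100, 101)) for i in range(1440)]
print(f"uptime-based availability: {uptime_availability(probes):.3%}")
print(f"request-based availability: {success_rate_availability(999_750, 1_000_000):.3%}")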

What metric do you use to define availability?

  • Error Rate (Failed requests/total requests): 47.9%
  • Latency: 38.3%
  • Orders/transactions vs historical prediction: 21.6%
  • Successful requests/total requests: 44%
  • Uptime/total time period: 53.3%

How do you monitor availability?

  • Real user monitoring: 37.1%
  • Health checks / synthetics: 64.4%
  • Server-side responses: 50.4%

When looking at who receives reports about availability and performance, it was no surprise that the closer a person is to operating applications, the more likely they are to receive reports. We believe the DevOps trend of bringing Operations and Development closer together is putting developers on par with Ops as the build-and-operate mindset becomes pervasive in organizations. We also believe that as digitization increases and the online user experience becomes ever more important, the percentage of C-level staff who receive availability and performance reports will grow.

Who monitors or receives reports on availability?

  • CEO: 15.7%
  • CFO or VP of Finance: 11.8%
  • CTO: 33.7%
  • VP: 30.2%
  • Managers: 51.1%
  • Ops: 61.4%
  • Developers: 54.5%
  • Other

Who monitors or receives reports on performance?

  • CEO: 12%
  • CFO or VP of Finance: 10.6%
  • CTO: 36.1%
  • VP: 28.3%
  • Managers: 51.8%
  • Ops: 53.8%
  • Developers: 54.1%
  • Other

Top performers

Top performers had 99.99%+ availability and an MTTR of under one hour (highlighted above). To understand how they achieve these impressive numbers, we looked into the tooling teams use. Notably, autoscaling, load balancers, backups, select rollouts of deployments, and monitoring with health checks were all more common in the top availability group. Some of these, such as multi-zone redundancy, are expensive, while others, such as circuit breakers and select rollouts, are more a matter of time and engineering expertise.
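To illustrate one of those patterns, a circuit breaker can be captured in a few lines (a simplified sketch, not a production implementation): after a run of failures it stops calling the dependency and fails fast until a cool-down expires.

```python
import time

class CircuitBreaker:
    """Minimal circuit breaker: open after N consecutive failures, retry after a cool-down."""

    def __init__(self, failure_threshold=3, reset_timeout_s=30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout_s = reset_timeout_s
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_timeout_s:
                raise RuntimeError("circuit open: failing fast")
            self.opened_at = None  # half-open: allow one trial call
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()
            raise
        self.failures = 0  # success closes the circuit again
        return result
```

Pairing a breaker like this with a latency or blackhole experiment against the dependency is a direct way to verify that it actually opens and that callers degrade gracefully.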

Teams that consistently run chaos experiments have higher levels of availability than those that have never performed an experiment, or do so only ad hoc. But ad-hoc experiments are still an important part of the practice, and teams with >99.9% availability are performing more ad-hoc experiments.

Frequency of Chaos Engineering experiments by availability

(Chart comparing how often teams in each availability group (<99%, 99%–99.9%, >99.9%) run experiments: never performed an attack, ad-hoc attacks, quarterly, monthly, weekly, or daily and more frequent attacks.)

Tool use by availability

(Percent of respondents using each tool, by availability group: <99%, 99%–99.9%, >99.9%)

  • Autoscaling: 43%, 52%, 65%
  • DNS failover/elastic IPs: 24%, 33%, 49%
  • Load balancers: 71%, 64%, 77%
  • Active-active multi-region, AZ, or DC: 19%, 29%, 38%
  • Active-passive multi-region, AZ, or DC: 30%, 34%, 45%
  • Circuit breakers: 16%, 22%, 32%
  • Backups: 51%, 46%, 61%
  • DB replication: 37%, 47%, 51%
  • Retry logic: 31%, 33%, 41%
  • Select rollouts of deployments (Blue/Green, Canary, feature flags): 27%, 36%, 51%
  • Cached static pages when dynamic unavailable: 19%, 26%, 26%
  • Monitoring with health checks: 53%, 58%, 70%

Evolution of Chaos Engineering

In 2010, Netflix introduced Chaos Monkey into their systems. This pseudo-random termination of nodes was a response to instances and servers failing at random. Netflix wanted teams prepared for these failure modes, so they accelerated the process by demanding resiliency to instance outages. Chaos Monkey both tested reliability mechanisms and forced developers to build with failure in mind. Based on the success of the project, Netflix open sourced Chaos Monkey and created a Chaos Engineer role. Chaos Engineering has since evolved to follow the scientific method, and experiments have expanded beyond host failure to test for failures up and down the stack.
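The core idea is simple enough to sketch. The snippet below illustrates pseudo-random instance termination in the spirit of Chaos Monkey; it is not Netflix’s implementation, and the opt-in tag, region, and dry-run default are assumptions for the example.

```python
import random
import boto3
from botocore.exceptions import ClientError

TAG_KEY, TAG_VALUE = "chaos-opt-in", "true"  # hypothetical opt-in tag

def terminate_random_instance(region="us-east-1", dry_run=True):
    """Pick one running, opted-in EC2 instance at random and terminate it."""
    ec2 = boto3.client("ec2", region_name=region)
    reservations = ec2.describe_instances(
        Filters=[
            {"Name": f"tag:{TAG_KEY}", "Values": [TAG_VALUE]},
            {"Name": "instance-state-name", "Values": ["running"]},
        ]
    )["Reservations"]
    instances = [i["InstanceId"] for r in reservations for i in r["Instances"]]
    if not instances:
        return None
    victim = random.choice(instances)
    try:
        ec2.terminate_instances(InstanceIds=[victim], DryRun=dry_run)
    except ClientError as err:
        # With DryRun=True, AWS signals "would have succeeded" as a DryRunOperation error.
        if err.response["Error"]["Code"] != "DryRunOperation":
            raise
    return victim
```

Running something like this only against instance groups that are expected to tolerate the loss is what turns random failure into a controlled experiment.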

Google searches for “Chaos Engineering”

(Chart of yearly search volume, 2016–2020, showing steady year-over-year growth.)

For every dollar spent in failure, learn a dollar’s worth of lessons

JESSE ROBBINS, “MASTER OF DISASTER”

In 2020, Chaos Engineering went mainstream and made headlines in Politico and Bloomberg. Gremlin hosted the largest Chaos Engineering event ever, with over 3,500 registrants. GitHub hosts over 200 Chaos Engineering-related projects with 16K+ stars. And most recently, AWS announced its own public Chaos Engineering offering, AWS Fault Injection Simulator, coming later this year.

Chaos Engineering today

Chaos Engineering is becoming more popular and more mature: 60% of respondents said they have run a Chaos Engineering attack. Netflix and Amazon, the creators of Chaos Engineering, are cutting-edge, large organizations, but we’re also seeing adoption by more traditional enterprises and by smaller teams. The diversity of teams using Chaos Engineering is also growing. What began as an engineering practice was quickly adopted by Site Reliability Engineering (SRE) teams, and now many platform, infrastructure, operations, and application development teams are adopting the practice to improve the reliability of their applications. Host failure, which we categorize as a State attack, is far less popular than network and resource attacks. We’ve seen an uptick in simulating lost connections to a dependency or a spike in demand for a service. We’re also seeing many more organizations move their experimentation into production, although this is still in its early days.
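As an example of what a resource attack looks like in its simplest form, the sketch below saturates a couple of CPU cores for a fixed window (a bare-bones illustration; purpose-built tooling adds scheduling, safeguards, and automatic halting). The core count and duration are arbitrary values for the example.

```python
import multiprocessing
import time

def burn_cpu(stop_at: float):
    """Busy-loop until the deadline to simulate CPU pressure."""
    while time.monotonic() < stop_at:
        pass

def cpu_attack(cores: int = 2, duration_s: int = 60):
    """Saturate the given number of cores for duration_s seconds."""
    stop_at = time.monotonic() + duration_s
    workers = [multiprocessing.Process(target=burn_cpu, args=(stop_at,))
               for _ in range(cores)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()

if __name__ == "__main__":
    # Watch autoscaling, alerting, and latency dashboards while this runs.
    cpu_attack(cores=2, duration_s=60)
```

Running it while watching autoscaling and latency dashboards answers the "spike in demand" question directly.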

459,548

ATTACKS USING THE GREMLIN PLATFORM

68%

OF CUSTOMERS USING K8S ATTACKS

How frequently does your organization practice Chaos Engineering?

(Responses by company size: Daily or more frequent / Weekly / Monthly / Quarterly / Ad-hoc / Never performed an attack)

  • >10,000 employees: 5.7%, 8%, 4.6%, 16.1%, 31%, 34.5%
  • 5,001–10,000 employees: 0%, 13.2%, 18.4%, 21.1%, 23.7%, 23.7%
  • 1,001–5,000 employees: 8.3%, 11.1%, 8.3%, 9.7%, 22.2%, 40.3%
  • 100–1,000 employees: 10.9%, 10.9%, 8.6%, 10.9%, 22.7%, 35.9%
  • <100 employees: 3.7%, 7.3%, 9.8%, 8.5%, 15.9%, 54.9%

What teams are involved in conducting chaos experiments?

  • Application Developers: 52%
  • C-level: 10%
  • Infrastructure: 42%
  • Managers: 32%
  • Operations: 49%
  • Platform or Architecture: 37%
  • SRE: 50%
  • VPs: 14%

What percentage of your organization uses Chaos Engineering?

  • 76%+: 7.3%
  • 51–75%: 17.7%
  • 26–50%: 21%
  • <25%: 54%

What environment have you performed chaos experiments on?

  • Dev/Test: 63%
  • Staging: 50%
  • Production: 34%

Percent of attacks by type

  • Network: 46%
  • Resource: 38%
  • State: 15%
  • Application: 1%

Percent of attacks by target type

  • Host: 70%
  • Container: 29%
  • Application: 1%

Results of chaos experiments

One of the most exciting and rewarding aspects of Chaos Engineering is discovering or verifying a bug. The practice makes it easier to uncover unknown issues before they impact customers and to identify the real cause of an incident, speeding up the patching process. Another major benefit that showed up in the write-in responses to our survey was a better understanding of architectures. Running chaos experiments helps identify tight coupling and unknown dependencies that adversely affect applications and often erase many of the benefits of a microservices architecture. From our own product data, we found that customers were frequently identifying incidents, mitigating the issues, and verifying the fixes with Chaos Engineering. Our survey respondents frequently found that their applications' availability increased while their MTTR dropped.

After using Chaos Engineering, what benefits have you experienced?

  • Increased availability: 47%
  • Reduced mean time to resolution (MTTR): 45%
  • Reduced mean time to detection (MTTD): 41%
  • Reduced # of bugs shipped to production: 38%
  • Reduced # of outages: 37%
  • Reduced # of pages: 25%

Future of Chaos Engineering

What is the biggest inhibitor to adopting/expanding Chaos Engineering?

  • Lack of awareness: 20%
  • Other priorities: 20%
  • Lack of experience: 20%
  • Lack of time: 17%
  • Security concerns: 12%
  • Fear something might go wrong: 11%

The biggest inhibitors to adopting Chaos Engineering are a lack of awareness and a lack of experience. These are followed closely by 'other priorities,' but interestingly, more than 10% said the fear that something might go wrong was also an inhibitor. It's true that practicing Chaos Engineering means injecting failure into systems, but by using modern methods that follow scientific principles and methodically isolating experiments to a single service, we can be intentional about the practice and avoid disrupting customer experiences.
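One way to make that isolation concrete is to scope every experiment to a single service and a small sample of its hosts. The helper below is a hypothetical illustration of that kind of blast-radius limit; the host records and the "service" field are assumptions for the example.

```python
import random

def pick_targets(hosts: list[dict], service: str, fraction: float = 0.1) -> list[dict]:
    """Limit the blast radius: only one service, and only a small sample of its hosts."""
    candidates = [h for h in hosts if h.get("service") == service]
    sample_size = max(1, int(len(candidates) * fraction)) if candidates else 0
    return random.sample(candidates, sample_size)

# Example: attack at most 10% of the checkout service's hosts
fleet = [{"id": f"host-{i}", "service": "checkout" if i % 3 else "search"} for i in range(30)]
print(pick_targets(fleet, "checkout", fraction=0.10))
```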

We believe the next stage of Chaos Engineering involves opening up this important testing process to a broader audience and making it easier to safely experiment in more environments. As the practice matures and tooling evolves, we expect it to become more accessible and faster for engineers and operators to design and run experiments that improve the reliability of their systems across environments; today, about a third of respondents are running chaos experiments in production. We believe that chaos experiments will become more targeted and automated, while also becoming more commonplace and frequent.

We’re excited about the future of Chaos Engineering and its role in making systems more reliable.

Demographics

The data sources for this report include a comprehensive survey with 400+ responses and Gremlin's product data. Survey respondents come from a range of company sizes and industries, primarily Software and Services. Adoption of Chaos Engineering has hit the enterprise, with nearly 50% of respondents working for companies with more than 1,000 employees, and more than 20% working for companies with more than 10,000 employees.

The survey highlighted a tipping point in cloud computing: nearly 60% of respondents run a majority of their workloads in the cloud and deploy a majority of them through a CI/CD pipeline. Containers and Kubernetes are reaching a similar level of maturity, but the survey confirmed that service meshes are still in their early days. The most common cloud platform is AWS at nearly 40%, with GCP, Azure, and on-premises private cloud each following at around 11–12%.

400+

QUALIFIED RESPONDENTS

How many employees work at your company?

  • >10,000: 21.4%
  • 5,001–10,000: 9.3%
  • 1,001–5,000: 17.7%
  • 100–1,000: 31.4%
  • <100: 20.1%

How old is your company?

  • Over 25 years old: 25.8%
  • 10 to 25 years old: 32.9%
  • 2 to 10 years old: 27.3%
  • Less than 2 years old: 14%

What industry is your company in?

  • Software & Services: 50.2%
  • Banks, Insurance & Financial Services: 23.2%
  • Energy Equipment & Services: 10.7%
  • Retail & eCommerce: 8.3%
  • Technology Hardware, Semiconductors, & Related Equipment: 7.6%

What is your job title?

  • Software Engineer: 32.2%
  • SRE: 25.3%
  • Engineering Manager: 18.2%
  • System Administrator: 8.8%
  • Non-technical Executive (ex: CEO, COO, CMO, CRO): 4.9%
  • Technical Executive (ex: CTO, CISO, CIO): 10.6%

What percent of production workloads are in the cloud?

  • >75%: 35.1%
  • 51–75%: 23.1%
  • 25–50%: 21.4%
  • <25%: 20.4%

What percent of production workloads are deployed using a CI/CD pipeline?

  • >75%: 39.8%
  • 51–75%: 21.1%
  • 25–50%: 20.4%
  • <25%: 18.7%

What percent of production workloads use containers?

  • >75%: 27.5%
  • 51–75%: 19.9%
  • 25–50%: 23.6%
  • <25%: 29%

What percent of production workloads use Kubernetes (or another container orchestrator)?

  • >75%: 19.4%
  • 51–75%: 22.4%
  • 25–50%: 18.4%
  • <25%: 39.8%

What percent of production environment routes leverage service mesh?

  • >75%: 10.1%
  • 51–75%: 16.5%
  • 25–50%: 17.9%
  • <25%: 55.5%

In addition to examining the survey results, we also aggregated information about the technical environments of Gremlin users to understand what specific tools and layers of the stack are most often targets of Chaos Engineering experiments. Those findings are below.

What is your cloud provider?

  • Amazon Web Services: 38%
  • Google Cloud Platform: 12%
  • Microsoft Azure: 12%
  • Private Cloud (On Premises): 11%
  • Oracle

What is your container orchestrator?

  • Amazon Elastic Container Service: 13%
  • Amazon Elastic Kubernetes Service: 19%
  • Custom Kubernetes: 16%
  • Google Kubernetes Engine: 12%
  • OpenShift: 6%

What is your messaging provider?

  • ActiveMQ: 5%
  • AWS SQS: 17%
  • Kafka: 25%
  • IBM MQ: 13%
  • RabbitMQ

What is your monitoring tool?

  • Amazon CloudWatch: 20%
  • Datadog: 13%
  • Grafana: 18%
  • New Relic: 9%
  • Prometheus: 18%

What is your database?

  • Cassandra: 5%
  • DynamoDB: 14%
  • MongoDB: 16%
  • MySQL: 22%
  • Postgres: 22%

https://www.gremlin.com/state-of-chaos-engineering/2021/
