Kavin Arvind Ragavan

Speaker

Cloud Performance Engineer with a demonstrated history of working in the Information Technology and services industry. Has over a decade of experience in non-functional testing, with strong expertise in Performance Testing & Engineering, Chaos Engineering, and Site Reliability Engineering in the resiliency and observability areas.

Specialized in AWS & GCP cloud performance testing and in designing and implementing cloud test frameworks for performance, resiliency, and observability. Has been involved in creating automated performance and resilience engineering frameworks and in implementing continuous integration and continuous delivery to perform early performance, resilience, and accessibility testing and to identify potential performance bottlenecks during the development phase. Has presented four whitepapers on cloud performance testing, chaos engineering, and microservices at software conferences.

Title: Achieving IT Resilience within Google Cloud: GCP Well-Architected Frameworks

Abstract:

GCP Chaos Engineering

• Resilience/Chaos engineering is the discipline of experimenting on a system in order to build confidence in the system’s capability to withstand turbulent conditions in production.

• Applying chaos engineering experiments to Google Cloud Platform and its cloud services helps continuously improve an application’s performance, observability, and resiliency.

• Google Cloud (also known as Google Cloud Platform or GCP) is a provider of computing resources for developing, deploying, and operating applications on the Web.

• GCP is mainly a service for building and maintaining original applications, which may then be published via the Web from its hyperscale data center facilities.

GCP Resiliency Focus Areas

  • Resiliency is the ability of a system to gracefully handle and recover from hardware and software failures while providing an acceptable level of service to the business.
  • To test a system for resiliency, introduce the failures below and ensure that the system recovers fully:
  1. SPOF Failures - Failure of one service or component should not have a cascading impact on the other components.
  2. Dependency Failures - Failure of a dependent service, such as the database or cache, should not bring the application down.
  3. App-Level Failure Injections - Introduce resource, state, and network level faults into the application.
  4. Data Failures - Data must remain available to applications if the system that originally hosted the data fails.
  5. Canary Deployment Failures - Verify the automated rollback mechanism for code in production in case of failure.
  • GCP Hotspots for Failure Injections (a sample fault-injection sketch follows this list):
  • Compute - Compute Engine (VM instances)
  • Compute - Kubernetes Engine (containers)
  • Storage - Cloud SQL
  • Storage - Cloud Memorystore
  • Storage - Cloud Bigtable
  • Networking - Cloud Load Balancing
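
As a minimal illustration of a state-level SPOF fault against the first hotspot above, the sketch below stops and then restarts a Compute Engine VM with the google-cloud-compute Python client. The project, zone, and instance names are placeholders, and a recent client release (where operations expose result()) is assumed.

```python
# A minimal sketch, assuming the google-cloud-compute client library;
# project, zone and instance names are placeholders, not part of any real test model.
from google.cloud import compute_v1

PROJECT, ZONE, INSTANCE = "my-gcp-project", "us-central1-a", "orders-api-vm-1"

client = compute_v1.InstancesClient()

# Inject the SPOF fault: stop one VM behind the load balancer and observe
# whether traffic drains to the surviving instances without cascading errors.
client.stop(project=PROJECT, zone=ZONE, instance=INSTANCE).result()

# ... measure detection and recovery from monitoring while the VM is down ...

# Roll back: restart the instance and confirm it passes its health checks
# and rejoins the instance group.
client.start(project=PROJECT, zone=ZONE, instance=INSTANCE).result()
```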

Compute and Kubernetes Engine Fault Simulations:

  • Infrastructure Level Failures:
  • State level (Compute instances - reboot/stop, database - stop/reboot/failover, zone/region failures, container/pod - terminate/failure, process failures); a pod-termination sketch follows this list
  • Resource level (CPU, memory, disk, IO)
  • Network level (latency, packet loss, corruption, blackhole)
  • Application-Level Failures - app-specific failure injections such as error code injection, time shift, certificate issues, API throttling, etc.
  • Database-Level Failures - DB-specific failure injections such as connection pool exhaustion, DB locks, etc.
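
The sketch below shows one way the container/pod terminate fault could be injected on GKE with the official kubernetes Python client; the namespace and label selector are hypothetical, and kubectl is assumed to already point at the target cluster.

```python
# A hedged sketch of a pod-termination fault on GKE; "payments" namespace and
# the label selector are placeholders for the app under test.
from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()

# Pick one pod of the target workload and delete it; a resilient deployment
# should schedule a replacement within its recovery objective.
pods = v1.list_namespaced_pod("payments", label_selector="app=payments-svc")
victim = pods.items[0].metadata.name
v1.delete_namespaced_pod(victim, "payments")
print(f"Terminated pod {victim}; now watch replacement pods and error rates")
```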

Storage and Network Failovers:

  • SQL Server failover (a failover sketch follows this list)
  • Regional persistent disk failover
  • Failover for Memorystore for Redis
  • Cloud Bigtable failover
  • Failover for External TCP/UDP Network Load Balancing
  • Failover for Internal TCP/UDP Load Balancing
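
As a minimal sketch of the first failover above, a Cloud SQL high-availability failover can be triggered by shelling out to the gcloud CLI; the instance name is a placeholder and the instance is assumed to be HA-configured.

```python
# A minimal sketch, assuming an authenticated gcloud CLI and an HA-configured
# Cloud SQL instance (placeholder name "orders-db").
import subprocess

subprocess.run(
    ["gcloud", "sql", "instances", "failover", "orders-db", "--quiet"],
    check=True,
)
# While the failover runs, watch application error rates and DB connection
# metrics to see how quickly connections re-establish against the new primary.
```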

Test Models (will be illustrated in the slides)

  • Google Compute Engine Resilience Test Model
  • Google Kubernetes Engine Resilience Test Model
  • GCP Cloud SQL Failover Model

Tools for GCP

Gremlin - Gremlin provides a library of attacks to safely, securely, and easily simulate real outages in cloud platforms.

    GCP Use case: Google Compute and Kubernetes Engine failure injections

Chaos Mesh - Chaos Mesh is a chaos engineering platform built specifically for Kubernetes applications.

    GCP Use case: Google Kubernetes Engine failure injections

NetHavoc - NetHavoc can be used to inject various faults into the application infrastructure during a load test.

    GCP Use case: Google Compute and Kubernetes Engine failure injections

Chaos Toolkit - Chaos Toolkit is a framework for creating custom chaos experiments, with a strong focus on extensibility.

    GCP Use case: SQL failover (a sample experiment sketch follows below)
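
To make the Chaos Toolkit use case concrete, the sketch below shows what a SQL failover experiment definition might look like when built as a Python dict and written to experiment.json (run with `chaos run experiment.json`); the health-check URL and instance name are placeholders.

```python
# A hedged sketch of a Chaos Toolkit experiment for the SQL failover use case;
# endpoint and instance names are placeholders.
import json

experiment = {
    "title": "Cloud SQL failover does not break the checkout API",
    "description": "Force an HA failover and verify the app stays healthy.",
    "steady-state-hypothesis": {
        "title": "Checkout API responds with 200",
        "probes": [{
            "type": "probe",
            "name": "checkout-health",
            "tolerance": 200,
            "provider": {"type": "http", "url": "https://app.example.com/healthz"},
        }],
    },
    "method": [{
        "type": "action",
        "name": "trigger-cloud-sql-failover",
        "provider": {
            "type": "process",
            "path": "gcloud",
            "arguments": ["sql", "instances", "failover", "orders-db", "--quiet"],
        },
    }],
}

with open("experiment.json", "w") as f:
    json.dump(experiment, f, indent=2)
```

The steady-state hypothesis is checked before and after the method runs, so the experiment fails if the API does not return 200 once the failover completes.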

Resiliency Metrics:

The following metrics are collected during a GCP chaos test.

Performance Metrics

  1. Performance Degradation
  2. Failure rate
  3. Mean Time to Detect (MTTD)
  4. Mean Time to Recover (MTTR) (see the calculation sketch after this list)
  5. Turnaround time
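
As a worked example, assuming the fault-injection, first-alert, and full-recovery timestamps are captured during a run, MTTD and MTTR for that run can be derived as below; the timestamps are purely illustrative, and some teams measure MTTR from detection rather than injection.

```python
# Purely illustrative MTTD/MTTR arithmetic for a single chaos run.
from datetime import datetime

injected  = datetime.fromisoformat("2024-05-01T10:00:00")  # fault injected
detected  = datetime.fromisoformat("2024-05-01T10:02:30")  # first alert fired
recovered = datetime.fromisoformat("2024-05-01T10:07:00")  # steady state restored

mttd = (detected - injected).total_seconds()    # 150 s from injection to detection
mttr = (recovered - injected).total_seconds()   # 420 s from injection to full recovery
print(f"MTTD={mttd:.0f}s, MTTR={mttr:.0f}s")
```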

Cloud Monitoring Metrics

  1. CPU Utilization (a query sketch follows this list)
  2. Memory Utilization
  3. Network In/Out
  4. DB Connections
  5. Queries per second (QPS)
  6. Read/Write IOPS
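
A sketch of pulling the CPU utilization metric for the chaos window with the google-cloud-monitoring client; the project id and the 10-minute lookback are assumptions for illustration.

```python
# A sketch using the google-cloud-monitoring client; project id is a placeholder.
import time

from google.cloud import monitoring_v3

client = monitoring_v3.MetricServiceClient()
now = int(time.time())
interval = monitoring_v3.TimeInterval(
    {"end_time": {"seconds": now}, "start_time": {"seconds": now - 600}}
)

series = client.list_time_series(
    request={
        "name": "projects/my-gcp-project",
        "filter": 'metric.type = "compute.googleapis.com/instance/cpu/utilization"',
        "interval": interval,
        "view": monitoring_v3.ListTimeSeriesRequest.TimeSeriesView.FULL,
    }
)
for ts in series:
    instance = ts.resource.labels.get("instance_id", "unknown")
    latest = ts.points[0].value.double_value if ts.points else None
    print(instance, latest)
```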

SRE Metrics

  1. Alerts/Alarms
  2. Health check status
  3. Logs and traces

Title: Designing for Site Reliability & Observability using AWS FIS

Abstract:

Designing for Site Reliability & Observability using AWS FIS

AWS chaos experiments help stress an application in testing or production environments by creating real-world disruptive events, such as a sudden increase in CPU or memory consumption, observing how the system responds, and implementing improvements that enhance the application’s performance, observability, and resiliency. AWS Fault Injection Simulator (FIS) is an AWS managed service for running chaos experiments on AWS. For AWS fault simulations, FIS offers an efficient pay-per-use model compared to commercial tools on the market such as Gremlin and NetHavoc.

FIS Real Time Uses: 

  1. Design & Simulate Failures in Pre-Prod Environments: Simulate real-world failures in a test environment such as stage or perf to understand autoscaling thresholds, health checks, MTTR, etc.
  2. Production Chaos: Continue the tests in production, creating potential failure conditions and observing how effectively the team and the system respond.
  3. Build Observability: FIS can be used to validate the observability of the system by testing alerts and monitoring dashboards using fault simulations.
  4. CI/CD Integration: AWS Fault Injection Simulator can be integrated into a continuous delivery pipeline, which helps to repeatedly test the impact of fault actions as part of the software delivery process (a pipeline sketch follows this list).
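
A hedged sketch of the CI/CD integration in item 4: a pipeline stage that starts a pre-created FIS experiment template with boto3 and waits for the outcome; the template id is a placeholder created beforehand (see the design steps later in this abstract).

```python
# A hedged sketch, not AWS's prescribed pattern; the template id is a placeholder.
import time

import boto3

fis = boto3.client("fis")

experiment_id = fis.start_experiment(
    experimentTemplateId="EXT123example"          # hypothetical template id
)["experiment"]["id"]

# Poll until the experiment leaves the running states; a CloudWatch stop
# condition or an error should fail this pipeline stage.
while True:
    status = fis.get_experiment(id=experiment_id)["experiment"]["state"]["status"]
    if status in ("completed", "stopped", "failed"):
        break
    time.sleep(15)

print(f"FIS experiment {experiment_id} finished with status: {status}")
```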

Supported AWS Services 

  • FIS supports AWS services such as EC2, RDS, EKS, and ECS, and can also be customized to support additional fault simulations for AWS Fargate, EBS volumes, etc.
  • FIS can also inject API throttle, API unavailable, and API exception faults.

FIS Experiment - Design Steps

  1. Create & assign an FIS IAM role for the experiment
  2. Specify target instances
  3. Pass the inputs for the experiment as JSON in the actions
  4. Configure or select CloudWatch alarms for the experiment stop conditions
  5. Run from the console or CLI (a sketch of these steps follows this list)
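
The steps above might translate into boto3 roughly as follows; the role ARN, tags, alarm ARN, and the chosen action (stop tagged EC2 instances for 10 minutes) are illustrative assumptions, not a prescribed template.

```python
# A hedged sketch of the FIS design steps; ARNs and tags are placeholders.
import boto3

fis = boto3.client("fis")

template = fis.create_experiment_template(
    description="Stop one tagged app instance and verify self-healing",
    roleArn="arn:aws:iam::123456789012:role/fis-experiment-role",    # step 1
    targets={                                                        # step 2
        "app-instances": {
            "resourceType": "aws:ec2:instance",
            "resourceTags": {"chaos-ready": "true"},
            "selectionMode": "COUNT(1)",
        }
    },
    actions={                                                        # step 3
        "stop-instance": {
            "actionId": "aws:ec2:stop-instances",
            "parameters": {"startInstancesAfterDuration": "PT10M"},
            "targets": {"Instances": "app-instances"},
        }
    },
    stopConditions=[{                                                # step 4
        "source": "aws:cloudwatch:alarm",
        "value": "arn:aws:cloudwatch:us-east-1:123456789012:alarm:HighErrorRate",
    }],
)
print("Experiment template id:", template["experimentTemplate"]["id"])
# Step 5: start it from the console, the CLI (aws fis start-experiment) or the API.
```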

FIS Benefits:

  • Customizable Fault Simulations - FIS allows different levels of fault simulations (state, resource, network) to be combined, and the fault actions can be customized and saved per use case.
  • Experiment Control & Visibility - FIS supports CloudWatch and uses existing metrics to monitor FIS experiments. Running and completed statuses and triggers are all visible in the console.
  • No Setup or Agents Needed - FIS needs no prerequisite setup; the dependencies for fault simulations are managed by AWS.
  • Cost - Since FIS uses a pay-per-use model, the overall cost incurred is far less than that of commercial tools.
  • Security - Experiments are tied to IAM for security. As FIS is an AWS managed service, it is safe and secure, eliminating the need to install any other agents on the instances.
  • Console Access - FIS can be used from the console, CLI, and AWS APIs, which helps with continuous integration.

Sample Fault Simulations: 

  1. Terminate single/multiple EC2 Instances across zones and regions
  2. Reboot single/multiple App/Cache/DB Instances across zones and regions
  3. Stop single/multiple EC2 Instances across zones and regions
  4. CPU Stress in the EC2 Instances (High CPU / Throttle CPU / CPU Burn)
  5. Memory Stress in the EC2 Instances (Insufficient Memory)
  6. Hybrid Resource Stress in the EC2 Instances
  7. Latency in the EC2 Instances
  8. Disconnect Primary DB - Reboot DB Instance
  9. Failover RDS DB (a sketch follows this list)
  10. Insufficient Memory issues with the DB instance
  11. Kill a particular Microservice/process (by PID or name) in an instance
  12. Latency in Producer or Consumer Instances
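
As a sketch of sample simulation 9 (Failover RDS DB): besides FIS, a Multi-AZ failover can also be forced directly through the RDS API with boto3; the instance identifier is a placeholder.

```python
# A sketch of forcing an RDS Multi-AZ failover; "orders-db" is a placeholder
# identifier and the instance must be Multi-AZ for the failover to occur.
import boto3

rds = boto3.client("rds")

# Rebooting with ForceFailover=True promotes the standby replica; the
# application should reconnect to the new primary within its recovery SLO.
rds.reboot_db_instance(DBInstanceIdentifier="orders-db", ForceFailover=True)
```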