Kavin Arvind Ragavan

Speaker

Cloud Performance Engineer with a demonstrated history of working in the Information Technology and services industry. Has over a decade of experience in non-functional testing, with strong expertise in Performance Testing & Engineering, Chaos Engineering, and Site Reliability Engineering in the resiliency and observability areas.

Specialized in AWS & GCP cloud performance testing and in designing and implementing cloud test frameworks for performance, resiliency, and observability. Has been involved in creating automated performance and resilience engineering frameworks and in implementing continuous integration and continuous delivery to perform early performance, resilience, and accessibility testing and to identify potential performance bottlenecks during the development phase. Has presented four whitepapers on cloud performance testing, chaos engineering, and microservices at software conferences.

Title: Achieving IT Resilience within Google Cloud: GCP Well-Architected Frameworks

Abstract:

GCP Chaos Engineering

• Resilience/Chaos engineering is the discipline of experimenting on a system in order to build confidence in the system’s capability to withstand turbulent conditions in production.

• Applying chaos engineering experiments to Google Cloud Platform and its cloud services helps continuously improve an application’s performance, observability, and resiliency.

• Google Cloud (also known as Google Cloud Platform or GCP) is a provider of computing resources for developing, deploying, and operating applications on the Web.

• GCP is mainly a service for building and maintaining original applications, which may then be published via the Web from its hyperscale data center facilities.

GCP Resiliency Focus Areas

  • Resiliency is the ability of a system to gracefully handle and recover from hardware and software failures while providing an acceptable level of service to the business.
  • To test a system for resiliency, introduce the failures below and ensure that the system recovers fully:
  1. SPOF Failures - Failure of one service or component should not have a cascading impact on the other components.
  2. Dependency Failures - Failure of a dependent service, such as the database or cache, should not bring the application down.
  3. App-Level Failure Injections - Introduce resource, state, and network level faults into the application.
  4. Data Failures - Data must remain available to applications if the system that originally hosted the data fails.
  5. Canary Deployment Failures - Verify the automated rollback mechanism for code in production in case of failure.
  • GCP Hotspots for Failure Injections (a sample fault-injection sketch follows this list):
  • Compute - Compute Engine (VM instances)
  • Compute - Kubernetes Engine (containers)
  • Storage - Cloud SQL
  • Storage - Cloud Memorystore
  • Storage - Cloud Bigtable
  • Networking - Cloud Load Balancing
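
As a minimal illustration of a state-level SPOF fault against the first hotspot above, the sketch below stops and then restarts a Compute Engine VM with the google-cloud-compute Python client. The project, zone, and instance names are placeholders, and a recent client release (where operations expose result()) is assumed.

```python
# A minimal sketch, assuming the google-cloud-compute client library;
# project, zone and instance names are placeholders, not part of any real test model.
from google.cloud import compute_v1

PROJECT, ZONE, INSTANCE = "my-gcp-project", "us-central1-a", "orders-api-vm-1"

client = compute_v1.InstancesClient()

# Inject the SPOF fault: stop one VM behind the load balancer and observe
# whether traffic drains to the surviving instances without cascading errors.
client.stop(project=PROJECT, zone=ZONE, instance=INSTANCE).result()

# ... measure detection and recovery from monitoring while the VM is down ...

# Roll back: restart the instance and confirm it passes its health checks
# and rejoins the instance group.
client.start(project=PROJECT, zone=ZONE, instance=INSTANCE).result()
```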

Compute and Kubernetes Engine Fault Simulations:

  • Infrastructure Level Failures:
  • State level (Compute instances - reboot/stop, database - stop/reboot/failover, zone/region failures, container/pod - terminate/failure, process failures); a pod-termination sketch follows this list
  • Resource level (CPU, memory, disk, IO)
  • Network level (latency, packet loss, corruption, blackhole)
  • Application-Level Failures - app-specific failure injections such as error code injection, time shift, certificate issues, API throttling, etc.
  • Database-Level Failures - DB-specific failure injections such as connection pool exhaustion, DB locks, etc.
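
The sketch below shows one way the container/pod terminate fault could be injected on GKE with the official kubernetes Python client; the namespace and label selector are hypothetical, and kubectl is assumed to already point at the target cluster.

```python
# A hedged sketch of a pod-termination fault on GKE; "payments" namespace and
# the label selector are placeholders for the app under test.
from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()

# Pick one pod of the target workload and delete it; a resilient deployment
# should schedule a replacement within its recovery objective.
pods = v1.list_namespaced_pod("payments", label_selector="app=payments-svc")
victim = pods.items[0].metadata.name
v1.delete_namespaced_pod(victim, "payments")
print(f"Terminated pod {victim}; now watch replacement pods and error rates")
```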

Storage and Network Failovers:

  • SQL Server failover (a failover sketch follows this list)
  • Regional persistent disk failover
  • Failover for Memorystore for Redis
  • Cloud Bigtable failover
  • Failover for External TCP/UDP Network Load Balancing
  • Failover for Internal TCP/UDP Load Balancing
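
As a minimal sketch of the first failover above, a Cloud SQL high-availability failover can be triggered by shelling out to the gcloud CLI; the instance name is a placeholder and the instance is assumed to be HA-configured.

```python
# A minimal sketch, assuming an authenticated gcloud CLI and an HA-configured
# Cloud SQL instance (placeholder name "orders-db").
import subprocess

subprocess.run(
    ["gcloud", "sql", "instances", "failover", "orders-db", "--quiet"],
    check=True,
)
# While the failover runs, watch application error rates and DB connection
# metrics to see how quickly connections re-establish against the new primary.
```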

Test Models (will be illustrated in the slides)

  • Google Compute Engine Resilience Test Model
  • Google Kubernetes Engine Resilience Test Model
  • GCP Cloud SQL Failover Model

Tools for GCP

Gremlin - Gremlin provides a library of attacks to safely, securely, and easily simulate real outages in cloud platforms.

    GCP Use case: Google Compute and Kubernetes Engine failure injections

Chaos Mesh - Chaos Mesh is a chaos engineering platform built specifically for Kubernetes applications.

    GCP Use case: Google Kubernetes Engine failure injections

NetHavoc - NetHavoc can be used to inject various faults into the application infrastructure during a load test.

    GCP Use case: Google Compute and Kubernetes Engine failure injections

Chaos Toolkit - Chaos Toolkit is a framework for creating custom chaos experiments, with a strong focus on extensibility.

    GCP Use case: SQL failover (a sample experiment sketch follows below)
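
To make the Chaos Toolkit use case concrete, the sketch below shows what a SQL failover experiment definition might look like when built as a Python dict and written to experiment.json (run with `chaos run experiment.json`); the health-check URL and instance name are placeholders.

```python
# A hedged sketch of a Chaos Toolkit experiment for the SQL failover use case;
# endpoint and instance names are placeholders.
import json

experiment = {
    "title": "Cloud SQL failover does not break the checkout API",
    "description": "Force an HA failover and verify the app stays healthy.",
    "steady-state-hypothesis": {
        "title": "Checkout API responds with 200",
        "probes": [{
            "type": "probe",
            "name": "checkout-health",
            "tolerance": 200,
            "provider": {"type": "http", "url": "https://app.example.com/healthz"},
        }],
    },
    "method": [{
        "type": "action",
        "name": "trigger-cloud-sql-failover",
        "provider": {
            "type": "process",
            "path": "gcloud",
            "arguments": ["sql", "instances", "failover", "orders-db", "--quiet"],
        },
    }],
}

with open("experiment.json", "w") as f:
    json.dump(experiment, f, indent=2)
```

The steady-state hypothesis is checked before and after the method runs, so the experiment fails if the API does not return 200 once the failover completes.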

Resiliency Metrics:

The following metrics are collected during a GCP chaos test.

Performance Metrics

  1. Performance Degradation
  2. Failure rate
  3. Mean Time to Detect (MTTD)
  4. Mean Time to Recover (MTTR) (see the calculation sketch after this list)
  5. Turnaround time
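
As a worked example, assuming the fault-injection, first-alert, and full-recovery timestamps are captured during a run, MTTD and MTTR for that run can be derived as below; the timestamps are purely illustrative, and some teams measure MTTR from detection rather than injection.

```python
# Purely illustrative MTTD/MTTR arithmetic for a single chaos run.
from datetime import datetime

injected  = datetime.fromisoformat("2024-05-01T10:00:00")  # fault injected
detected  = datetime.fromisoformat("2024-05-01T10:02:30")  # first alert fired
recovered = datetime.fromisoformat("2024-05-01T10:07:00")  # steady state restored

mttd = (detected - injected).total_seconds()    # 150 s from injection to detection
mttr = (recovered - injected).total_seconds()   # 420 s from injection to full recovery
print(f"MTTD={mttd:.0f}s, MTTR={mttr:.0f}s")
```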

Cloud Monitoring Metrics

  1. CPU Utilization (a query sketch follows this list)
  2. Memory Utilization
  3. Network In/Out
  4. DB Connections
  5. Queries per second (QPS)
  6. Read/Write IOPS
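
A sketch of pulling the CPU utilization metric for the chaos window with the google-cloud-monitoring client; the project id and the 10-minute lookback are assumptions for illustration.

```python
# A sketch using the google-cloud-monitoring client; project id is a placeholder.
import time

from google.cloud import monitoring_v3

client = monitoring_v3.MetricServiceClient()
now = int(time.time())
interval = monitoring_v3.TimeInterval(
    {"end_time": {"seconds": now}, "start_time": {"seconds": now - 600}}
)

series = client.list_time_series(
    request={
        "name": "projects/my-gcp-project",
        "filter": 'metric.type = "compute.googleapis.com/instance/cpu/utilization"',
        "interval": interval,
        "view": monitoring_v3.ListTimeSeriesRequest.TimeSeriesView.FULL,
    }
)
for ts in series:
    instance = ts.resource.labels.get("instance_id", "unknown")
    latest = ts.points[0].value.double_value if ts.points else None
    print(instance, latest)
```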

SRE Metrics

  1. Alerts/Alarms
  2. Health check status
  3. Logs and traces

Title: Designing for Site Reliability & Observability using AWS FIS

Abstract:

Designing for Site Reliability & Observability using AWS FIS

AWS chaos experiments help stress an application in testing or production environments by creating real-world disruptive events, such as a sudden increase in CPU or memory consumption, observing how the system responds, and implementing improvements that enhance the application’s performance, observability, and resiliency. AWS Fault Injection Simulator (FIS) is an AWS managed service for running chaos experiments on AWS. For AWS fault simulations, FIS offers an efficient pay-per-use model compared to commercial tools on the market such as Gremlin and NetHavoc.

FIS Real Time Uses: 

  1. Design & Simulate Failures in Pre-Prod Environments: Simulate real-world failures in a test environment such as stage or perf to understand autoscaling thresholds, health checks, MTTR, etc.
  2. Production Chaos: Continue the tests in production, creating potential failure conditions and observing how effectively the team and the system respond.
  3. Build Observability: FIS can be used to validate the observability of the system by testing alerts and monitoring dashboards using fault simulations.
  4. CI/CD Integration: AWS Fault Injection Simulator can be integrated into a continuous delivery pipeline, which helps to repeatedly test the impact of fault actions as part of the software delivery process (a pipeline sketch follows this list).
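
A hedged sketch of the CI/CD integration in item 4: a pipeline stage that starts a pre-created FIS experiment template with boto3 and waits for the outcome; the template id is a placeholder created beforehand (see the design steps later in this abstract).

```python
# A hedged sketch, not AWS's prescribed pattern; the template id is a placeholder.
import time

import boto3

fis = boto3.client("fis")

experiment_id = fis.start_experiment(
    experimentTemplateId="EXT123example"          # hypothetical template id
)["experiment"]["id"]

# Poll until the experiment leaves the running states; a CloudWatch stop
# condition or an error should fail this pipeline stage.
while True:
    status = fis.get_experiment(id=experiment_id)["experiment"]["state"]["status"]
    if status in ("completed", "stopped", "failed"):
        break
    time.sleep(15)

print(f"FIS experiment {experiment_id} finished with status: {status}")
```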

Supported AWS Services 

  • FIS supports AWS services such as EC2, RDS, EKS, and ECS, and can also be customized to support additional fault simulations for AWS Fargate, EBS volumes, etc.
  • FIS can also inject API throttle, API unavailable, and API exception faults.

FIS Experiment - Design Steps

  1. Create & assign an FIS IAM role for the experiment
  2. Specify target instances
  3. Pass the inputs for the experiment as JSON in the actions
  4. Configure or select CloudWatch alarms for the experiment stop conditions
  5. Run from the console or CLI (a sketch of these steps follows this list)
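
The steps above might translate into boto3 roughly as follows; the role ARN, tags, alarm ARN, and the chosen action (stop tagged EC2 instances for 10 minutes) are illustrative assumptions, not a prescribed template.

```python
# A hedged sketch of the FIS design steps; ARNs and tags are placeholders.
import boto3

fis = boto3.client("fis")

template = fis.create_experiment_template(
    description="Stop one tagged app instance and verify self-healing",
    roleArn="arn:aws:iam::123456789012:role/fis-experiment-role",    # step 1
    targets={                                                        # step 2
        "app-instances": {
            "resourceType": "aws:ec2:instance",
            "resourceTags": {"chaos-ready": "true"},
            "selectionMode": "COUNT(1)",
        }
    },
    actions={                                                        # step 3
        "stop-instance": {
            "actionId": "aws:ec2:stop-instances",
            "parameters": {"startInstancesAfterDuration": "PT10M"},
            "targets": {"Instances": "app-instances"},
        }
    },
    stopConditions=[{                                                # step 4
        "source": "aws:cloudwatch:alarm",
        "value": "arn:aws:cloudwatch:us-east-1:123456789012:alarm:HighErrorRate",
    }],
)
print("Experiment template id:", template["experimentTemplate"]["id"])
# Step 5: start it from the console, the CLI (aws fis start-experiment) or the API.
```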

FIS Benefits:

  • Customizable Fault Simulations - FIS allows different levels of fault simulations (state, resource, network) to be combined, and the fault actions can be customized and saved per use case.
  • Experiment Control & Visibility - FIS supports CloudWatch and uses existing metrics to monitor FIS experiments. Running and completed statuses and triggers are all visible in the console.
  • No Setup or Agents Needed - FIS needs no prerequisite setup; the dependencies for fault simulations are managed by AWS.
  • Cost - Since FIS uses a pay-per-use model, the overall cost incurred is far less than that of commercial tools.
  • Security - Experiments are tied to IAM for security. As FIS is an AWS managed service, it is safe and secure, eliminating the need to install any other agents on the instances.
  • Console Access - FIS can be used from the console, CLI, and AWS APIs, which helps with continuous integration.

Sample Fault Simulations: 

  1. Terminate single/multiple EC2 Instances across zones and regions
  2. Reboot single/multiple App/Cache/DB Instances across zones and regions
  3. Stop single/multiple EC2 Instances across zones and regions
  4. CPU Stress in the EC2 Instances (High CPU / Throttle CPU / CPU Burn)
  5. Memory Stress in the EC2 Instances (Insufficient Memory)
  6. Hybrid Resource Stress in the EC2 Instances
  7. Latency in the EC2 Instances
  8. Disconnect Primary DB - Reboot DB Instance
  9. Failover RDS DB (a sketch follows this list)
  10. Insufficient Memory issues with the DB instance
  11. Kill a particular Microservice/process (by PID or name) in an instance
  12. Latency in Producer or Consumer Instances
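
As a sketch of sample simulation 9 (Failover RDS DB): besides FIS, a Multi-AZ failover can also be forced directly through the RDS API with boto3; the instance identifier is a placeholder.

```python
# A sketch of forcing an RDS Multi-AZ failover; "orders-db" is a placeholder
# identifier and the instance must be Multi-AZ for the failover to occur.
import boto3

rds = boto3.client("rds")

# Rebooting with ForceFailover=True promotes the standby replica; the
# application should reconnect to the new primary within its recovery SLO.
rds.reboot_db_instance(DBInstanceIdentifier="orders-db", ForceFailover=True)
```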