Speaker
Kavin Arvind Ragavan
Cloud Performance Engineer with a demonstrated history of working in the information technology and services industry. Has over a decade of experience in non-functional testing, with strong expertise in Performance Testing & Engineering, Chaos Engineering, and Site Reliability Engineering in the resiliency and observability areas.
Specializes in AWS & GCP cloud performance testing and in designing & implementing cloud test frameworks for performance, resiliency and observability. Has been involved in creating automated Performance & Resilience Engineering frameworks and implementing continuous integration & continuous delivery to perform early performance, resilience and accessibility testing and identify potential performance bottlenecks during the development phase. Has presented 4 whitepapers related to cloud performance testing, chaos engineering and microservices at software conferences.
Title: Achieving IT Resilience within Google Cloud: GCP Well Architected Frameworks
Abstract:
GCP Chaos Engineering
• Resilience/Chaos engineering is the discipline of experimenting on a system in order to build confidence in the system’s capability to withstand turbulent conditions in production
• Applying Chaos Engineering experiments to Google Cloud Platform and its cloud services helps to continuously improve an application’s performance, observability, and resiliency.
• Google Cloud (also known as Google Cloud Platform or GCP) is a provider of computing resources for developing, deploying, and operating applications on the Web.
• GCP is mainly a service for building and maintaining original applications, which may then be published via the Web from its hyperscale data center facilities
GCP Resiliency Focus Areas
- Resiliency is the ability of the system to gracefully handle and recover from hardware and software failures and provide an acceptable level of service to the business
- To test the system for resiliency, introduce the failures below and ensure that the system recovers fully
- SPOF failures: Failure of one service or component should not have a cascading impact on the other components
- Dependency failures: Failure of a dependent service such as the database or cache should not bring the application down
- App-level failure injections: Introduce resource, state, and network-level faults into the application
- Data failures: Data should remain available to applications if the system that originally hosted the data fails
- Canary deployment failures: Verify the automated rollback mechanism for code in production in case of failure
- GCP Hotspots for Failure Injections:
- Compute: Compute Engine (VM instances)
- Compute: Kubernetes Engine (containers)
- Storage: Cloud SQL
- Storage: Cloud Memorystore
- Storage: Cloud Bigtable
- Networking: Cloud Load Balancing
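The dependency-failure item in the focus areas above can be made concrete with a small sketch. It assumes a read path that tries a cache first and a database second; `CacheUnavailable`, `get_profile`, and the stub clients are hypothetical names used only for illustration and are not part of any GCP library.

```python
class CacheUnavailable(Exception):
    """Raised by the hypothetical cache client when the cache is down."""


def get_profile(user_id, cache_get, db_get, default=None):
    """Return a profile, degrading gracefully when a dependency fails.

    cache_get and db_get are hypothetical callables standing in for
    real cache and database clients.
    """
    try:
        return cache_get(user_id)
    except CacheUnavailable:
        pass  # cache is down: fall through to the database
    try:
        return db_get(user_id)
    except ConnectionError:
        return default  # both dependencies failed: serve a degraded response


# Stub dependencies that simulate healthy and failed services.
def failed_cache(user_id):
    raise CacheUnavailable("cache node unreachable")


def healthy_db(user_id):
    return {"id": user_id, "source": "db"}


def failed_db(user_id):
    raise ConnectionError("database unreachable")


from_db = get_profile(7, failed_cache, healthy_db)
degraded = get_profile(7, failed_cache, failed_db, default={"id": 7, "source": "default"})
```

A chaos experiment against the cache or database would then check that responses keep arriving from the fallback path rather than the application going down.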
Compute and Kubernetes Engine Fault Simulations:
- Infrastructure Level Failures
- State level (Compute instances: reboot/stop; database: stop/reboot/failover; zone/region failures; container/pod: terminate/failure; process failures)
- Resource level (CPU, memory, disk, I/O)
- Network level (latency, packet loss, corruption, blackhole)
- Application-level failures: App-specific failure injections such as error code injection, time shift, certificate issues, and API throttling
- Database-level failures: DB-specific failure injections such as connection pool exhaustion and DB locks
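As a rough illustration of the application-level injections listed above (added latency and error injection), the sketch below wraps a function in a decorator. It is a minimal in-process example under stated assumptions, not how Gremlin, Chaos Mesh, or any other tool implements its attacks; `inject_faults` and `InjectedFault` are hypothetical names.

```python
import functools
import random
import time


class InjectedFault(Exception):
    """Error raised in place of a real failure during an experiment."""


def inject_faults(latency_s=0.0, error_rate=0.0, rng=None):
    """Wrap a function so calls are delayed and/or fail at a given rate."""
    rng = rng or random.Random()

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            if latency_s > 0:
                time.sleep(latency_s)  # simulated latency before the call
            if rng.random() < error_rate:
                raise InjectedFault("injected error")  # simulated error response
            return func(*args, **kwargs)
        return wrapper
    return decorator


@inject_faults(latency_s=0.01, error_rate=1.0)
def always_fails():
    return "ok"


@inject_faults(error_rate=0.0)
def never_fails():
    return "ok"
```

Setting `error_rate` between 0 and 1 fails only a fraction of calls, which is closer to how partial outages usually look in practice.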
Storage and Network Failovers:
- SQL Server failover
- Regional persistent disk failover
- Failover for Memorystore for Redis
- Cloud Bigtable failover
- Failover for External TCP/ UDP Network Load Balancing
- Failover for Internal TCP/UDP Load Balancing
Test Models (Will be illustrated in Slides)
- Google Compute Engine Resilience Test Model
- Google Kubernetes Engine Resilience Test Model
- GCP Cloud SQL Failover Model
Tools for GCP
Gremlin- Gremlin provides a library of attacks to safely, securely, and easily simulate real outages on cloud platforms
GCP Use case: Google Compute and Kubernetes Engine Failure Injections
Chaos Mesh- Chaos Mesh is a chaos platform made exclusively for Kubernetes applications
GCP Use case: Google Kubernetes Engine Failure Injections
NetHavoc- NetHavoc can be used to inject various faults into the application infrastructure during a load test.
GCP Use case: Google Compute and Kubernetes Engine Failure Injections
Chaos Toolkit- Chaos Toolkit is a framework for creating custom chaos experiments, with a strong focus on extensibility.
GCP Use case: SQL Failover
Resiliency Metrics:
The following metrics are collected during a GCP chaos test.
Performance Metrics
- Performance Degradation
- Failure rate
- Mean Time to Detect
- Mean Time to Recover
- Turnaround time
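The performance metrics above can be derived from a handful of timestamps and counters recorded during an experiment. The sketch below is one possible calculation, assuming that recovery time is measured from the moment the fault is injected (some teams measure it from detection instead); the function name and sample values are made up for illustration. Averaging the per-experiment times across runs gives Mean Time to Detect and Mean Time to Recover.

```python
from datetime import datetime


def resiliency_metrics(fault_at, detected_at, recovered_at,
                       baseline_ms, during_ms, failed, total):
    """Summarise a single chaos experiment from raw observations."""
    return {
        # Seconds from fault injection until an alert or health check fired.
        "time_to_detect_s": (detected_at - fault_at).total_seconds(),
        # Seconds from fault injection until service was back to normal.
        "time_to_recover_s": (recovered_at - fault_at).total_seconds(),
        # Response-time increase during the fault, relative to baseline.
        "degradation_pct": 100.0 * (during_ms - baseline_ms) / baseline_ms,
        # Share of requests that failed during the fault window.
        "failure_rate_pct": 100.0 * failed / total,
    }


# Illustrative values only, not measurements from a real test.
m = resiliency_metrics(
    fault_at=datetime(2024, 1, 1, 10, 0, 0),
    detected_at=datetime(2024, 1, 1, 10, 0, 45),
    recovered_at=datetime(2024, 1, 1, 10, 4, 0),
    baseline_ms=200, during_ms=350,
    failed=12, total=400,
)
```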
Cloud Monitoring Metrics
- CPU Utilization
- Memory Utilization
- Network In/Out
- DB Connections
- Queries per second (QPS)
- Read/Write IOPS
SRE Metrics
- Alerts/ Alarms
- Health check status
- Logs and Traces
Title: Designing for Site Reliability & Observability using AWS FIS
Abstract:
AWS chaos experiments help stress an application in testing or production environments by creating real-world disruptive events, such as a sudden increase in CPU or memory consumption, observing how the system responds, and implementing improvements that raise the application’s performance, observability, and resiliency. AWS Fault Injection Simulator (FIS) is an AWS managed service for running chaos experiments on AWS. For AWS fault simulations, FIS offers a pay-per-use model that can be more cost-efficient than commercial tools on the market such as Gremlin and NetHavoc.
FIS Real Time Uses:
- Design & Simulate Failures in PreProd Environment: Simulate real-world failures in a test environment such as Stage or Perf to understand autoscaling thresholds, health checks, MTTR, etc.
- Production Chaos: Continue the tests in production, creating potential failure conditions and observing how effectively the team and system respond
- Build Observability: FIS can be used to verify the observability of the system by testing alerts and monitoring dashboards with fault simulations
- CI/CD Integration: AWS Fault Injection Simulator can be integrated into a continuous delivery pipeline, which helps to repeatedly test the impact of fault actions as part of the software delivery process.
Supported AWS Services
- FIS supports AWS services such as EC2, RDS, EKS and ECS, and can also be customized to support additional fault simulations for AWS Fargate, EBS volumes, etc.
- FIS can also inject API throttling, API unavailability and API exceptions
FIS Experiment- Design Steps
- Create & assign the FIS IAM role for the experiment
- Specify target instances
- Pass the inputs for the experiment as JSON in the actions
- Configure or select CloudWatch alarms as experiment stop conditions
- Run from the console or CLI
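The design steps above map onto the parts of an FIS experiment template: an IAM role, targets, actions with their parameters, and stop conditions. The sketch below assembles such a template as a plain Python dictionary and serialises it to JSON without calling AWS. The field names follow the FIS template format as understood here and should be checked against the current AWS documentation; the ARNs, tag, and the `build_fis_template` helper are placeholders invented for illustration.

```python
import json


def build_fis_template(role_arn, alarm_arn, tag_key, tag_value):
    """Assemble an experiment template document as a plain dict."""
    return {
        "description": "Stop one tagged EC2 instance for five minutes",
        # Step 1: IAM role that FIS assumes to run the experiment.
        "roleArn": role_arn,
        # Step 2: target instances, selected here by resource tag.
        "targets": {
            "app-instances": {
                "resourceType": "aws:ec2:instance",
                "resourceTags": {tag_key: tag_value},
                "selectionMode": "COUNT(1)",
            }
        },
        # Step 3: the fault action and its inputs.
        "actions": {
            "stop-instance": {
                "actionId": "aws:ec2:stop-instances",
                "parameters": {"startInstancesAfterDuration": "PT5M"},
                "targets": {"Instances": "app-instances"},
            }
        },
        # Step 4: CloudWatch alarm that halts the experiment if it fires.
        "stopConditions": [
            {"source": "aws:cloudwatch:alarm", "value": alarm_arn}
        ],
    }


# Placeholder ARNs and tag values, not real resources.
template = build_fis_template(
    role_arn="arn:aws:iam::123456789012:role/example-fis-role",
    alarm_arn="arn:aws:cloudwatch:us-east-1:123456789012:alarm:example-alarm",
    tag_key="env",
    tag_value="perf",
)
template_json = json.dumps(template, indent=2)
```

For the final step, the resulting JSON could be saved to a file and supplied to the AWS CLI or an SDK call that creates experiment templates; the exact invocation depends on the tooling in use.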
FIS Benefits:
- Customizable Fault Simulations- FIS allows combining different levels of fault simulation (state, resource, network) and customizing/saving the fault actions per use case
- Experiment Control & Visibility- FIS supports CloudWatch and uses existing metrics to monitor FIS experiments. Running and completed experiment statuses and triggers are all visible in the console
- No Setup or Agents Needed- FIS needs no prerequisite setup; the dependencies for fault simulations are managed by AWS.
- Cost- Since FIS uses a pay-per-use model, the overall cost incurred can be far lower than with commercial tools
- Security- Experiments are tied to IAM for security. As FIS is an AWS managed service, it is safe and secure, eliminating the need to install any other agents on the instances
- Console access- FIS can be used from the console, CLI and AWS APIs, which helps with continuous integration
Sample Fault Simulations:
- Terminate single/multiple EC2 Instances across zones and regions
- Reboot single/multiple App/Cache/DB Instances across zones and regions
- Stop single/multiple EC2 Instances across zones and regions
- CPU Stress in the EC2 Instances (High CPU/ Throttle CPU/CPU Burn)
- Memory Stress in the EC2 Instances (Insufficient Memory)
- Hybrid Resource Stress in the EC2 Instances
- Latency in the EC2 Instances
- Disconnect Primary DB-Reboot DB Instance
- Failover RDS DB
- Insufficient Memory Issues with DB instance
- Kill a particular Microservice/ process (by PID/ name) in an instance
- Latency in Producers or Consumers Instances