
A Decade in Automation – Learnings and Gotchas from Automated Testing of Cybersecurity Products

By Balachandran Sivak

This June, after over 10 years at McAfee, I switched jobs and joined Netskope. While that was a happy change, the last 10 years at McAfee were some of the most memorable of my professional career. While I intend to write a series of blogs on “life @ McAfee”, I’d like to start with the role I was hired for in 2010 – test automation – and talk about how we achieved extensive automation that got us to nearly complete CI/CD with no manual effort.

I was originally part of the Email and Web Security appliances team. These are Layer-7 gateway appliances that get deployed in customers’ data centers and, in most cases, are Internet facing. These appliances scan thousands of emails and hundreds of thousands of HTTP requests and responses per day. The number of functional and non-functional requirements and UX expectations from customers, along with the basic fact that these appliances secure the customers’ systems and data, meant that there could be absolutely no lapse in either quality or performance.

Background 

To set the context – the Email and Web Security product (later split into two separate products, McAfee Email Gateway and McAfee Web Gateway) catered to customers ranging from large financial institutions doing algorithmic trading (where HTTP latencies have to stay under 300 ms) and insurance companies (receiving thousands of large emails per day) to defense contractors (holding the defense secrets of some of the nuclear superpowers of the world). So a bug in the software could have disastrous consequences, not just for one person but for entire nation states.

Test Infrastructure and Strategies

The team, located in India and the UK, was fortunate to have very talented, enthusiastic and experienced engineers leading the way. It maintained three checklists:

  • The Dev Design checklist – This included things like feature flags, threat modelling, etc.
  • The Test Design checklist – The primary areas to cover when coming up with test cases for a feature/bug: performance numbers, error cases, mails with Unicode data in subjects/headers, large attachments, etc.
  • The Test Plan checklist – Execution of performance/longevity tests, UI/UX, L10N, etc.

Having these checklists gave everyone a baseline to start with when coming up with test cases for a feature. Also, the Test Plan and the Test Design almost always tied back to the UACs (user acceptance criteria) of the stories being worked on.

Once test cases were identified for a feature, the team held a very rigorous “test case review” meeting where all engineers involved in the feature work, along with senior engineers, got together to review and brainstorm the identified test cases. The three checklists mentioned above helped a lot in reviewing whether the identified test cases and the test plan covered all aspects of the feature.

The Parallel Strategy – Functional testing

Another important thing the team did very early was to automate everything that could be automated. The team built a very robust, reliable and flexible test automation framework specifically for the product. It may come as a surprise to many, but our team relied on automated testing even early in the feature development phase. That is, automated tests were not just for regression, but for actual feature validation. We achieved this by running the development and test tasks in parallel, which in turn was made possible by having a proper mix of development and automation engineers in each scrum team.

While the developers work on feature design, the automation engineers work on framework design (if needed) and make the changes to the test framework required to support the new feature. While the development team starts implementing the feature, the automation team starts writing automated test scripts based on the reviewed and agreed-upon test cases (using the framework changes they have just made). So, by the time the first couple of meaningful builds come out of the CI/CD system, there is automation in place to validate them.

A word on the framework – it was fully homegrown, and it took considerable effort to make it robust, stable and reliable. But the ROI was far higher than the initial effort and the potential slowdown we might have run into. To give an idea of its scale – it supported installing, configuring and running validations on multiple versions of three different products, all of them appliance gateways, where “installation” means installing a complete ISO that includes the OS and the product.
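To make that idea a little more concrete, here is a minimal, hypothetical sketch of the kind of abstraction such a framework might expose for installing, configuring and smoke-validating one appliance. The class, method and field names (and all the values) are my own illustration, not the actual framework’s API:

```python
# Hypothetical sketch only: names, fields and values are illustrative,
# not the real framework's API.
from dataclasses import dataclass


@dataclass
class ApplianceSpec:
    product: str      # e.g. "email_gateway" or "web_gateway"
    version: str      # product/ISO version under test
    iso_url: str      # location of the full ISO (OS + product)
    mgmt_ip: str      # management address of the target appliance


class Appliance:
    """Installs, configures and smoke-validates one appliance under test."""

    def __init__(self, spec: ApplianceSpec):
        self.spec = spec

    def install(self) -> None:
        # A real framework would drive virtual media / PXE and wait for
        # first boot; here we only record the intent.
        print(f"Installing {self.spec.product} {self.spec.version} "
              f"from {self.spec.iso_url} onto {self.spec.mgmt_ip}")

    def configure(self, policy: dict) -> None:
        # Push a baseline scanning policy via the appliance's management API.
        print(f"Applying policy {policy} to {self.spec.mgmt_ip}")

    def validate(self) -> bool:
        # End-to-end smoke check: is the appliance actually scanning traffic?
        print(f"Running smoke validation against {self.spec.mgmt_ip}")
        return True


spec = ApplianceSpec("email_gateway", "7.x", "http://builds.example/meg.iso", "10.0.0.5")
appliance = Appliance(spec)
appliance.install()
appliance.configure({"spam_threshold": 5})
assert appliance.validate()
```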

Hardware Infrastructure and Automation

By the time I left the company/team, we had over 4,000 automated test cases, and they were run on every build of every feature under development. On average, at least 4 features and 1 patch release were in flight in parallel, which meant 5 “branches” of code with at least 3 builds a day on each: 15 builds (5 branches x 3 builds), each running 4,000 automated tests.

At such large numbers, a serial/sequential run would take an inordinate amount of time, so we had to make the test runs parallel. This necessitated automating the creation of test setups (which we called rigs). We had about 50 rigs, and we could create or tear down rigs as needed at any time. The 4,000 test cases per run were split across 8 to 12 rigs, depending on how soon we needed the results.
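As a rough illustration of that split (not the scheduler we actually used), distributing a run across rigs can be as simple as a round-robin assignment:

```python
# A rough illustration only: the rig names and the simple round-robin policy
# below are assumptions, not the scheduler we actually used.
from collections import defaultdict


def split_across_rigs(test_cases, rigs):
    """Distribute test cases round-robin across the available rigs."""
    assignment = defaultdict(list)
    for i, test in enumerate(test_cases):
        assignment[rigs[i % len(rigs)]].append(test)
    return assignment


tests = [f"test_case_{n:04d}" for n in range(4000)]   # ~4,000 automated tests per run
rigs = [f"rig-{n:02d}" for n in range(12)]            # split across 8 to 12 rigs
for rig, batch in split_across_rigs(tests, rigs).items():
    print(rig, len(batch))                            # each rig gets roughly 330-340 tests
```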

UI/UX strategies

Since these were appliance gateways with dozens of features, the product had a sophisticated UI. To guarantee a good UX, we limited the number of supported browsers; as of today, only Edge, Chrome and Firefox are supported, and we also define the supported versions of each browser. Since those have been set and communicated to customers, testing happens only on those. The test framework referred to earlier had UI components as well, built by integrating Selenium and adding some convenience wrappers on top of it. The UI elements were mapped to XML files, with differences in UI elements between versions handled through version-specific attributes. This made handling UI changes much easier than having the UI element locators embedded directly in the test scripts.
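Below is a minimal sketch of that locator-map idea, assuming Selenium and Python. The XML layout, attribute names and wrapper class are my own illustration of the technique, not the framework’s actual format:

```python
# A sketch of the locator-map idea, assuming Selenium WebDriver. The XML
# layout, attribute names and wrapper class here are illustrative only.
import xml.etree.ElementTree as ET

from selenium.webdriver.remote.webdriver import WebDriver

LOCATOR_XML = """
<elements>
  <!-- default locator, used for all versions unless overridden -->
  <element name="login_button" by="id" value="loginBtn"/>
  <!-- version-specific override for a (hypothetical) 8.0 UI -->
  <element name="login_button" by="css selector" value="button.login" version="8.0"/>
</elements>
"""


def load_locators(xml_text: str, product_version: str) -> dict:
    """Build {element name: (by, value)}, letting version-specific entries win."""
    locators = {}
    for el in ET.fromstring(xml_text):
        version = el.get("version")
        if version is None or version == product_version:
            # "by" uses Selenium's locator strategy strings ("id", "xpath",
            # "css selector", ...), so it can be passed straight to find_element.
            locators[el.get("name")] = (el.get("by"), el.get("value"))
    return locators


class UiPage:
    """Thin wrapper so test scripts refer to element names, never raw locators."""

    def __init__(self, driver: WebDriver, locators: dict):
        self.driver = driver
        self.locators = locators

    def click(self, name: str) -> None:
        by, value = self.locators[name]
        self.driver.find_element(by, value).click()


print(load_locators(LOCATOR_XML, "7.5"))  # {'login_button': ('id', 'loginBtn')}
print(load_locators(LOCATOR_XML, "8.0"))  # {'login_button': ('css selector', 'button.login')}
# In a real test, pass a live driver (e.g. selenium.webdriver.Firefox()) to UiPage
# and call page.click("login_button") instead of embedding locators in the script.
```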

Non-Functional Test Strategies 

These products ran on very critical network infrastructure, as explained in the “Background” section, so testing robustness and performance was extremely critical. To that end, yet another framework was created (back in 2010-2011, when such tools were quite uncommon). This tool had four components:

  • A console-based client to trigger performance/soak (a.k.a. longevity) tests against the appliance backend and collect test data
  • A web portal to view the results of all tests historically run (so that we could compare results against previous releases)
  • A web service that the client and portal talk to, to put and get data
  • And a massive DB

The team had also created traffic generators that helped us generate traffic at massive throughput and transaction rates. The data collected and monitored through the portal helped us detect even a 1% drop in performance, either build-over-build or release-over-release.
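To illustrate what that build-over-build comparison boils down to (the metric names and numbers below are made up, and this is not the portal’s actual code), flagging a 1% regression is essentially:

```python
# Illustrative only: metric names and numbers are made up.
def find_regressions(previous: dict, current: dict, threshold: float = 0.01) -> dict:
    """Return metrics whose value dropped by at least `threshold` (1% by default)."""
    regressions = {}
    for metric, old_value in previous.items():
        new_value = current.get(metric)
        if new_value is None or old_value == 0:
            continue
        drop = (old_value - new_value) / old_value
        if drop >= threshold:
            regressions[metric] = drop
    return regressions


previous_build = {"http_requests_per_sec": 20000, "emails_per_hour": 150000}
current_build = {"http_requests_per_sec": 19750, "emails_per_hour": 150200}
print(find_regressions(previous_build, current_build))
# {'http_requests_per_sec': 0.0125}  -> a 1.25% throughput drop gets flagged
```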

Now, Some Gotchas….

Though the above paragraphs may convey a sense of excellent processes, tools and techniques to achieve high quality, it wasn’t without its share of glitches, hiccups and gotchas. For example, to get a fluid, responsive web portal for test monitoring, we had two failed attempts over six months before getting it right.

Also, the checklists that we discussed earlier weren’t there on day 1. We’ve had some bad, ugly bugs which gave us the opportunity to retrospect and introduce changes.

When testing for scale, in the initial days we did not talk to customer-facing folks to understand, for example, the number of internal certificates that some organisations use; we then had to add that to both our checklists and our perf tests to see how the product performed when huge certificate lists were in use. And finally, the framework itself, now over 14 years in use, did not mature in a week or a month. On occasion, owing to the depth and breadth of things it did, we had made design decisions that did not allow us to accommodate changes that came a few years later, so we had to rewrite some parts of the framework.

In Closing

Before I finish, there are a couple of things that I need to share. The above details might make someone think we had a massive team of automation engineers. That wasn’t the case. We had about 8 engineers on average (across India and the UK) who did all this work. But the team worked as an extremely cohesive force, with utmost respect for and trust in each other. No matter the years of experience one had, thoughts and suggestions were not just welcome, they were usually incorporated immediately as well. So, the key takeaways:

  • Use automation for feature validation, not just regression
  • Automate infrastructure management as well, not just feature tests
  • Automate both functional and non-functional tests
  • Use a good mix of homegrown tools and pre-existing tools to achieve maximum productivity
  • Work together as a great team 🙂

About the Author

Staff Software Engineer at Netskope
