Important lessons I learned from a large cloud migration project.
Recently I was contracted to help with load and performance testing for a high-traffic web app migrating to Google Cloud Platform (GCP). I learned a few important lessons about GCP and cloud migration from a load and performance testing standpoint, which I would like to share. Although these lessons are specific to GCP, I believe some of them will also help project teams migrating their applications to other cloud providers.
There are subtle differences between the cloud and traditional on-premises data center environments in every aspect of networking. Though I learned and understood those critical differences, the part of GCP networking that surprised me most was the Global Cloud Load Balancer.
The Google cloud load balancer is a distributed network of regional GFEs (Google Front Ends) that cascades requests through a complicated waterfall of routes to your backend server instances. It feels like a magical experience. You get a single IP address, and your customers’ traffic enters Google’s network at the location closest to them. It then travels over Google’s premium dark fiber until it is routed to the closest region where your backend instances are running.
How Google Cloud Load Balancer Works
Google’s global load balancer can introduce multi-second latency on a small fraction of requests, and this is due to new load balancers “learning” routes to backends. The more backends you have, the more route learning you will see. A learned path sticks around for a day. We observed this latency behavior during our large load tests and later learned from the Google support team that GCP global load balancers differ from traditional load balancers because of these learning algorithms. For us, multi-second latency on a very small share of requests, relative to the overall traffic, was not a blocker but a small wrinkle in the user experience.
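A rare multi-second request is easy to miss if you only look at averages or even the 99th percentile. Here is a minimal sketch of how such outliers could be surfaced from load test results. It assumes you already have per-request latencies in milliseconds as a plain list; the function name and the 1000 ms threshold are my own placeholders, not anything GCP or our load testing tool provides.

```python
from statistics import quantiles


def tail_latency_report(latencies_ms, slow_threshold_ms=1000.0):
    """Summarize request latencies (in milliseconds) with a focus on the tail.

    slow_threshold_ms is an assumed cutoff for a "multi-second" style outlier.
    """
    if len(latencies_ms) < 2:
        raise ValueError("need at least two latency samples")

    # quantiles(n=100) returns 99 cut points: index 49 is p50, index 98 is p99.
    cuts = quantiles(latencies_ms, n=100, method="inclusive")
    slow = [value for value in latencies_ms if value >= slow_threshold_ms]

    return {
        "p50_ms": cuts[49],
        "p99_ms": cuts[98],
        "max_ms": max(latencies_ms),
        "slow_fraction": len(slow) / len(latencies_ms),
    }


if __name__ == "__main__":
    # Made-up sample: 998 fast requests and 2 multi-second ones.
    sample = [50.0] * 998 + [2500.0, 3100.0]
    print(tail_latency_report(sample))
```

With a sample like the made-up one above, the median and the 99th percentile both look healthy, and only the maximum and the slow fraction reveal the outliers. That is why it helps to track both in every test run.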
Google Compute Engine offers a really unique technology called “Live Migration”, which keeps your instances running even when a host undergoes downtime, such as during a software or hardware update. Google Compute Engine migrates your running instances to another physical host in the same zone rather than requiring your instances to be rebooted.
Live migration lets Google perform the maintenance that is integral to keeping the infrastructure protected and reliable, without interrupting any of your instances. Live Migration is a very cool feature, but your instances might experience a short period of decreased performance. Migrating an instance to a new host takes around a few hundred milliseconds; during that time your application might see higher latency, but there won’t be any connection drops or errors. During our load and performance tests, we were able to trace a few sudden CPU usage spikes on our caching servers back to Live Migration.
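To give an idea of how this kind of correlation can be done, here is a rough sketch. It assumes you have already exported two lists of timestamps, one for the CPU spikes from your monitoring tool and one for the migration events of the affected instances. The function name and the 60 second window are hypothetical choices for illustration, not part of any GCP API.

```python
from datetime import datetime, timedelta


def spikes_near_migrations(spike_times, migration_times,
                           window=timedelta(seconds=60)):
    """Pair each CPU spike with the first migration event within `window`.

    Both inputs are lists of datetime objects that were exported beforehand;
    the default window is an assumption and should be tuned to your metrics
    resolution.
    """
    matches = []
    for spike in spike_times:
        for migration in migration_times:
            if abs(spike - migration) <= window:
                matches.append((spike, migration))
                break  # one matching event is enough to explain this spike
    return matches


if __name__ == "__main__":
    # Made-up timestamps, for illustration only.
    spikes = [datetime(2018, 5, 1, 10, 0, 30), datetime(2018, 5, 1, 14, 0, 0)]
    migrations = [datetime(2018, 5, 1, 10, 0, 0)]
    for spike, migration in spikes_near_migrations(spikes, migrations):
        print(f"spike at {spike} is close to migration at {migration}")
```

Spikes that do not match any migration event are the ones worth investigating further, since they point to a cause in your own application or test.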
During our high-volume distributed load tests with thousands of virtual users, we found that one of our backend stacks was not performing as expected. We decided to increase the number of servers in the stack from 200 to 360. All of these servers were connected to the frontend web application through Google’s ILB (Internal Load Balancer). Even after adding the extra servers, the performance of our backend stack didn’t improve much. After a detailed investigation, we found that more than 100 of the newly added servers were not receiving any traffic from our load tests.
After checking with the Google support team, we found that Google Internal Load Balancers have a hard limit of 250 backend servers, and we were offered a solution: split our backend stack so that its servers sit behind multiple Internal Load Balancers.
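The arithmetic behind that split is simple, and a small sketch makes it concrete. The helper below is hypothetical and only plans the groups; it does not create anything in GCP. The 250 limit is the number Google support gave us at the time, so treat it as an assumption and check the current limits for your project.

```python
import math


def split_backends(servers, max_per_lb=250):
    """Split a list of servers into evenly sized groups, one per load balancer.

    max_per_lb defaults to the limit we were told about; verify it for your
    own setup before relying on it.
    """
    if max_per_lb < 1:
        raise ValueError("max_per_lb must be at least 1")
    if not servers:
        return []

    group_count = math.ceil(len(servers) / max_per_lb)
    # Spread servers evenly so that no group sits right at the limit.
    base, extra = divmod(len(servers), group_count)

    groups, start = [], 0
    for index in range(group_count):
        size = base + (1 if index < extra else 0)
        groups.append(servers[start:start + size])
        start += size
    return groups


if __name__ == "__main__":
    # Made-up server names matching the 360 servers from our test.
    backend_stack = [f"backend-{n:03d}" for n in range(360)]
    for number, group in enumerate(split_backends(backend_stack), start=1):
        print(f"ILB {number}: {len(group)} servers")
```

For our 360 servers this gives two groups of 180, which leaves headroom under the limit on each load balancer instead of filling one to 250 and putting the remaining 110 on the other.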
There are other configurable restrictions in Google Cloud, which are classified into Quotas and Limits. It’s critical to understand these restrictions and to take them into account when setting up your infrastructure in the cloud. Google establishes these restrictions to give us better control over spending and security in the cloud.
There are two aspects of Google Cloud Virtual Machines that were important for us when troubleshooting performance-related issues.
During the initial days of our cloud migration journey, we were using Intel Sandy Bridge processors for our database servers. We noticed a few performance lags and were looking for options. Based on Google’s recommendations, we changed our DB servers to use Intel’s best-in-class Skylake processor and saw good improvements in the servers’ CPU utilization. We ran a load test on our backend servers to compare CPU performance across the 3 major processors offered in GCP and found that servers running on Skylake performed better than the others, with a 10% improvement.
CPU utilization among different Intel processors available in GCP
As you may have noticed by now, you shouldn’t expect a lift and shift cloud migration strategy to work smoothly. You should expect to discover many surprises on your cloud migration journey. The most important lesson of our journey was that we validated every step of our cloud migration process through continuous load and performance testing, which helped us learn many new things and eventually make a successful transition into the cloud without any rollback.
Kindly share any lessons you may have learned or best practices followed in your cloud migration project. Thanks for reading!