The Great 502 Chase of 2024
The clock was ticking as we chased down a mysterious heisenbug that threatened to bring down our very high-profile app launch.
This July, my team at Comcast’s Emerging Products and Technologies group launched “Passport to Paris”, a digital “stamps” experience powered by the digital keepsakes platform I have been working on for the last couple of years. The keepsakes platform lets customers collect digital collectibles (keepsakes) by participating in various activities, ranging from trivia and polls to watching something specific on TV. You can read more about it on the Comcast PR blog.
Passport to Paris was part of a suite of experiences supporting the 2024 Paris Olympics, and with a ton of eyes on it, it's safe to say this was the biggest launch I have had in a while.
Behind the scenes, the digital collectibles dev team is a very small group of full-stack engineers / generalists, each of us wearing different hats on different days. Mine have included the product concept, front-end (Flutter) and middleware engineering, and occasionally even design when we didn't have design resources to fill a gap. We use Strapi as our CMS, with a ton of custom controllers and services. Strapi also lets us create our database models via its own higher-level ORM / UI, which has been a great convenience, though that convenience becomes suspect when things start going wrong.
The whole setup is hosted on AWS. Pretty standard architecture: The middleware is deployed as a Docker container and managed by ECS. An ALB fronts the Strapi instances and is itself fronted by CloudFront that caches all the non-individual-level APIs. We have CloudWatch alarms set up to track system health.
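For the curious, the overall shape of the stack is roughly what the CDK sketch below describes. This is illustrative only (our actual infrastructure code, names and sizes differ), but it captures the moving parts: an ECS service running the Strapi container, an ALB in front of it, and CloudFront in front of the ALB.

```typescript
// Simplified, illustrative CDK sketch of the stack described above (not our real infra code).
import * as cdk from 'aws-cdk-lib';
import * as ec2 from 'aws-cdk-lib/aws-ec2';
import * as ecs from 'aws-cdk-lib/aws-ecs';
import * as ecsPatterns from 'aws-cdk-lib/aws-ecs-patterns';
import * as cloudfront from 'aws-cdk-lib/aws-cloudfront';
import * as origins from 'aws-cdk-lib/aws-cloudfront-origins';

export class KeepsakesStack extends cdk.Stack {
  constructor(scope: cdk.App, id: string) {
    super(scope, id);

    const vpc = new ec2.Vpc(this, 'Vpc', { maxAzs: 2 });
    const cluster = new ecs.Cluster(this, 'Cluster', { vpc });

    // The Strapi middleware runs as a Docker container on ECS, fronted by an ALB.
    const service = new ecsPatterns.ApplicationLoadBalancedFargateService(this, 'Strapi', {
      cluster,
      cpu: 1024,            // 1 vCPU: the original sizing in this story
      memoryLimitMiB: 2048,
      desiredCount: 2,
      taskImageOptions: {
        image: ecs.ContainerImage.fromRegistry('example/strapi-middleware'), // placeholder image
        containerPort: 1337, // Strapi's default port
      },
    });

    // CloudFront sits in front of the ALB; in reality, per-path cache behaviors decide
    // which APIs get cached (the non-individual-level ones) and which go straight through.
    new cloudfront.Distribution(this, 'Cdn', {
      defaultBehavior: {
        origin: new origins.LoadBalancerV2Origin(service.loadBalancer, {
          protocolPolicy: cloudfront.OriginProtocolPolicy.HTTP_ONLY,
        }),
        cachePolicy: cloudfront.CachePolicy.CACHING_OPTIMIZED,
      },
    });
  }
}
```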
We soft-launched the new application a week ahead of real traffic and things looked good. On Monday, the partner teams flipped the switch to start sending us production traffic. The app performed pretty well for most of the day.
Until the evening. That's when one of our CloudWatch alarms, the one tracking "client-getting-502-errors", triggered. We got together over Slack to figure out what could be causing the issue. While the numbers were low (our alarm threshold was pretty low), it made us worry about what would happen when the real traffic hit later in the month. The problem was clearly correlated with load: the alarm would trigger and un-trigger as traffic on the site rose and fell.
"Is it a Downstream service?"
We had a few downstream services that we had never tested with production traffic (we had mocked them in our own load tests), so my first theory was that one of them was taking too long to respond. But the errors were occurring even in code paths that stayed entirely within our application, and our AWS X-Ray traces showed the downstream calls were fine. That theory was discarded fast.
"Why isn't it auto-scaling?"
Our application was designed to auto-scale based on traffic patterns, including average CPU utilization, but scaling hadn't kicked in and average CPU utilization looked fine. Some of the APIs were occasionally slow to respond, but we figured that was something to deal with after the immediate problem.
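For context, the scaling policy was a standard target-tracking policy on average CPU, something like this CDK sketch (illustrative; the capacities and thresholds are made up, not our real numbers):

```typescript
// Target-tracking auto-scaling on *average* CPU utilization (illustrative CDK sketch).
// Note: the underlying ECS metric averages CPU across all tasks over the period,
// so short spikes to 100% on individual tasks can hide under a healthy-looking average.
import * as cdk from 'aws-cdk-lib';
import * as ecs from 'aws-cdk-lib/aws-ecs';

declare const service: ecs.FargateService; // the Strapi service defined elsewhere in the stack

const scaling = service.autoScaleTaskCount({ minCapacity: 2, maxCapacity: 10 });
scaling.scaleOnCpuUtilization('CpuScaling', {
  targetUtilizationPercent: 60,                // scale out when AVERAGE CPU exceeds ~60%
  scaleOutCooldown: cdk.Duration.seconds(60),
  scaleInCooldown: cdk.Duration.seconds(300),
});
```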
"Load Balancer timeout?"
Another theory was "Hey, maybe the load balancer is waiting too long for the origin server and giving up", but what we were seeing was 502 (Bad Gateway), not 504 (Gateway Timeout). So adjusting the ALB's idle timeout didn't seem like it would fix anything.
We weren't getting anywhere and there were definitely production issues, so as an immediate patch we increased the number of containers/ECS tasks, which reduced the errors.
Investigations continued the next day, as did the alarms. Our ops engineer set up AWS Athena over our ALB access logs so we could query them. At this point we were starting to question even more fundamental elements of our system: was something broken in our Docker, load balancing or networking setup? Or would we have to re-create our entire infrastructure?
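If you haven't done this before: ALB access logs land in S3, and Athena can query them once you create a table over the bucket (the AWS docs include a ready-made CREATE TABLE statement, usually named alb_logs). The kind of slicing we were doing looked roughly like the sketch below. In practice you'd just paste the SQL into the Athena console; the SDK wrapper here is only to show it end to end, and the table/column names assume that standard alb_logs schema.

```typescript
// Sketch: count ELB-generated 502s per minute from ALB access logs via Athena.
// Assumes the standard "alb_logs" table from the AWS documentation; bucket name is a placeholder.
import { AthenaClient, StartQueryExecutionCommand } from '@aws-sdk/client-athena';

const athena = new AthenaClient({ region: 'us-east-1' });

const query = `
  SELECT date_trunc('minute', from_iso8601_timestamp(time)) AS minute,
         count(*) AS elb_502s
  FROM alb_logs
  WHERE CAST(elb_status_code AS varchar) = '502'
  GROUP BY 1
  ORDER BY 1
`;

async function count502sPerMinute() {
  // Kicks off the query; results land in the configured S3 output location.
  return athena.send(new StartQueryExecutionCommand({
    QueryString: query,
    QueryExecutionContext: { Database: 'default' },
    ResultConfiguration: { OutputLocation: 's3://your-athena-results-bucket/' }, // placeholder
  }));
}
```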
Then something did crop up: while average CPU utilization was OK (and hence not triggering any auto-scaling policies), we hadn't noticed that throughout the incident we were seeing spikes in maximum CPU utilization, often hitting 100%. Something in our code was doing a lot of processing and hitting the limit on these machines, which could crash the entire container and cause TCP resets or other bad responses to be sent back to the load balancer. A quick fix seemed to be upgrading the machine sizing and bumping the CPU allocation from 1 to 4 vCPUs. We did another deploy and the errors came down significantly: not eliminated, but way better.
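For ECS the "bump" itself is just a change to the task-level CPU and memory settings, on the order of this sketch (illustrative, assuming a Fargate-style task definition; our real definitions differ):

```typescript
// Bumping the task size from 1 vCPU to 4 vCPUs (illustrative CDK sketch).
// Fargate task-level CPU is expressed in CPU units: 1024 units = 1 vCPU,
// and each CPU value only pairs with certain memory sizes.
import * as cdk from 'aws-cdk-lib';
import * as ecs from 'aws-cdk-lib/aws-ecs';

declare const stack: cdk.Stack;

const taskDef = new ecs.FargateTaskDefinition(stack, 'StrapiTaskDef', {
  cpu: 4096,            // was 1024 (1 vCPU); now 4 vCPUs
  memoryLimitMiB: 8192, // memory must be within the valid range for 4 vCPUs
});

taskDef.addContainer('strapi', {
  image: ecs.ContainerImage.fromRegistry('example/strapi-middleware'), // placeholder image
  logging: ecs.LogDrivers.awsLogs({ streamPrefix: 'strapi' }),
});
```

The spikes themselves only show up if you look at the Maximum statistic of the CPUUtilization metric; the Average, which the scaling policy watches, stayed comfortably low the whole time.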
But it felt weird: we were now running way over our expected resource requirements. While we weren't in emergency mode anymore, we needed to find out what was going on. With a CPU-bound process now deemed the villain, we decided to take another look at our entire codebase.
The villain appears
We looked at the performance of all the APIs and one clearly stood out: the call that returns all the keepsakes a user has collected so far was taking up to 5-6 seconds to return an array of at most 9 items. Digging deeper, we found the bug: an incorrectly written ORM query that pulled the keepsakes FOR EVERY USER in the system into application memory and then filtered them there, instead of filtering in the database. Yikes!
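I won't reproduce the real code, but the shape of the bug was roughly the following (a hypothetical Strapi v4-style sketch; the content-type and field names are made up):

```typescript
declare const strapi: any; // Strapi's global instance, available inside controllers and services

// BEFORE (the bug): fetch *every* claimed keepsake in the system,
// then filter down to one user's rows in JavaScript.
async function getKeepsakesForUserSlow(userId: number) {
  const all = await strapi.entityService.findMany('api::claimed-keepsake.claimed-keepsake', {
    populate: ['keepsake', 'owner'],
    // no filters: this pulls the whole table into application memory
  });
  return all.filter((entry: any) => entry.owner?.id === userId);
}

// AFTER (the fix): let the database do the filtering and return only one user's rows.
async function getKeepsakesForUser(userId: number) {
  return strapi.entityService.findMany('api::claimed-keepsake.claimed-keepsake', {
    filters: { owner: { id: userId } },
    populate: ['keepsake'],
  });
}
```

With at most nine keepsakes per user, the difference is invisible in a small dev database; but once the table holds every user's keepsakes, the unfiltered version scales with the total number of users, which is exactly the CPU and latency profile we were seeing.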
We patched the query, deployed, and... ta daaa! The 502s disappeared and the system returned to normal.
Lessons learned
As they say, "whatever doesn't kill you only makes you stronger", and that was the case here too. Badly written queries bringing an application to a crawl (or a halt) is nothing new in the annals of software development, but this was the first time I had to deal with it myself, and the whole journey definitely leveled up my AWS skills as our small team (three of us) debugged the weirdness.
There were definitely some lessons learned:
Test with production-like data: We didn't catch the error in our own load testing because our dev environment didn't have nearly as much data as production, so even the expensive database query was handled fine. One of our plans going forward is to add a lot more mock data to our dev database so that it mimics production more closely.
Better alarms: While it may not have helped in this case, we do think our CloudWatch alarms could be configured better. Some of our alarms are based on absolute counts over a time window (x errors in 10 minutes) and don't take traffic volume into account; alarming on error rates relative to traffic is something we should add, as sketched below.
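One direction is to alarm on the percentage of 5xx responses relative to request volume rather than on a raw count, roughly like this CDK sketch (illustrative names and thresholds, not our production config):

```typescript
// Alarm on the ELB 5xx *rate* (errors / requests) so the threshold tracks traffic,
// instead of a fixed "x errors in 10 minutes" count. Illustrative CDK sketch.
import * as cdk from 'aws-cdk-lib';
import * as cloudwatch from 'aws-cdk-lib/aws-cloudwatch';
import * as elbv2 from 'aws-cdk-lib/aws-elasticloadbalancingv2';

declare const stack: cdk.Stack;
declare const alb: elbv2.ApplicationLoadBalancer; // the ALB in front of Strapi

const errors = alb.metrics.httpCodeElb(elbv2.HttpCodeElb.ELB_5XX_COUNT, {
  statistic: 'Sum',
  period: cdk.Duration.minutes(5),
});
const requests = alb.metrics.requestCount({
  statistic: 'Sum',
  period: cdk.Duration.minutes(5),
});

// Metric math: error percentage over each 5-minute window.
const errorRate = new cloudwatch.MathExpression({
  expression: '100 * errors / requests',
  usingMetrics: { errors, requests },
  period: cdk.Duration.minutes(5),
});

errorRate.createAlarm(stack, 'ElbErrorRateAlarm', {
  threshold: 1, // alarm if more than 1% of requests return a 5xx (illustrative threshold)
  evaluationPeriods: 2,
  comparisonOperator: cloudwatch.ComparisonOperator.GREATER_THAN_THRESHOLD,
});
```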
There is a follow-up to this story that I will post soon, but this is already long enough, so I'm pausing here for now. If you have lessons of your own from similar experiences, do let me know as well.