How 8 bytes broke production!

✏️Preface

This blog is one of many in a series I hope to write about a project I was assigned in my professional career. The project is in the edtech domain and is used by many of the big players. Let's call it "Revolutionary Education Knowledge Technology", a.k.a. "REKT".

There are many stories I would like to share about project REKT, but let me start with this one: the story of how a mere 8 bytes broke the entire production.

📖A Little bit of Backstory

It was the day before the production deployment was scheduled and, if I remember correctly, also the day before a major content update for a game I was playing. I can't go into much detail about the project, but as a quick overview, imagine Udemy with proprietary, domain-specific content for companies. There were many things I had to implement in REKT, and one of them was an admin/content management panel to manage educational modules. One of its features was CRUD for the content that users attempt inside modules. Just like in Udemy, a module contains many lectures and pieces of content for the user to attempt.

Well, I was not the first developer on this project. Before me, there was a group of developers who had developed the entire "architecture" and chosen the tech stack for REKT.

Now, I will tell my programming journey in another post, but the bird's-eye view is that I believe in grasping the fundamentals rather than just learning a language, because to me, knowing how to solve a problem is more important than knowing the syntax for solving it. A simple example: if I need to do something repeatedly, I need a loop. The loop is the solution to my problem; how to write it differs from one programming language to another.

// For loop in JavaScript
for (let i = 0; i < 10; i++) {
  console.log(i);
}

# For loop in Python
for x in range(0, 10):
  print(x)

// For loop in PHP
for ($x = 0; $x < 10; $x++) {
  echo $x . "\n";
}

Hence, I focus on developing a problem-solving mindset rather than on the language-specific ways and syntax of solving a problem.

Well, coming back to the point: at the time, I had some idea of frontend, backend and servers from an MVC and monolith perspective. I wasn't new to component-driven frontends; fortunately I had learnt that from SvelteJs❤️, but unfortunately I hadn't had the chance to delve into frameworks like ReactJs and VueJs. And guess what the tech stack for project REKT was... frontend in ReactJs, backend in Fastify, a tightly coupled dependency on the AWS ecosystem and, as the cherry on top, all of it serverless and managed by Terraform. At the time, all I knew about serverless was that it basically spins up a server-like runtime environment when a request comes in.

Also, there was little to no documentation of what the entire architecture looked like or which services were used.

🎮Deployment and Gaming

Past aside, fast forward to the day of the feature's deployment. The QA and I had extensively tested the feature with almost every scenario we could come up with. Everything was greenlit, and with the manager's approval we pushed it to production at the start of the day. After the deployment, the QA started testing in the live environment and, call it overconfidence, naivety or lack of experience, I installed the update for my game without any worries and started playing. Our manager saw this, but since everything was going fine they didn't bat an eye. The first half of the day went very smoothly: everything was working perfectly in production, we had verified and tested it several times, and we reported that it was ready for client usage.

⚡Getting REKTed

Then came the second half of the workday. I had just come back from lunch and was planning to pick up other tickets when suddenly a message popped up in the Teams chat.

"The questions aren't being added when I save the content." - Client

Thoughts started flooding my mind:

How can this be possible? We tested everything multiple times on dev, staging and production too? Surely, this must be a misunderstanding.

"Can you tell them to clear the browser cache or hard refresh and try again?" - Me

"We are already running this incognito mode." - Client

Well fu*k, I thought! Instantly, I opened up the dev deployment and inspected it. I created a lesson, added a piece of content (a quiz) with a few questions and options, clicked the submit button and voilà!🪄 It saved perfectly.

I immediately tried it again, but it worked perfectly every time. I screen-recorded it and sent the video in the Teams chat.

"It works when I do it." - Me

And boom, the client also sent a screen recording showing the issue. By this point, the QA and I had gone through each and every scenario on all three environments, and everything worked perfectly for us.

The clock kept ticking and time flew, but we weren't getting a breakthrough at all. While searching for why the content wasn't getting saved for the client, the QA and I found other major bugs in production. Everything had been fine for the first half of the day: I was enjoying my game and everything seemed to be going smoothly. Now here we were at 11:00 PM with more than one production-breaking bug.

We were too tired to think of anything at that point. I tried adding some patches, but they didn't work, and we were starting to get desperate.

"We aren't getting anywhere. Let's call it a day, come back tomorrow and look at it with a fresh mind" - QA

He was right. We had spent the whole day doing the same thing over and over and over again, and we were getting tunnel vision.

💪Chad Senior Dev

Before we called it a day, something interesting happened which eventually paved the way for me to solve this clusterfu*k of problems. A senior dev had some time left after completing his work and offered to help us debug this. We tried our best to sum up the situation, but even we didn't have any idea why the content wasn't getting saved. (We kept the other bugs between the QA and me!😉)

The senior dev had already extended his working hours and was now helping us; we felt bad and somewhat relieved at the same time. Well, the senior dev did nothing extraordinary or out of this world. He just opened up the dev tools and asked me to enter the data and save it. It saved perfectly. Then he asked me to reproduce what the client was facing, and I gladly did.

"I think here lies your problem..." - Chad Senior Dev

He showed us the request payload size for both cases:

  • Passing Case: ~8,192 bytes

  • Failing Case: ~8,200 bytes
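For anyone who wants to check the same thing outside the dev tools, here is a minimal JavaScript sketch of measuring a JSON request body in bytes. The 8,192-byte threshold is only an assumption taken from the two sizes above, and the `quiz` object and the function names are made up for illustration; none of this is REKT's actual code.

```javascript
// Hypothetical threshold, taken from the two sizes seen in the dev tools.
const SUSPECTED_LIMIT_BYTES = 8 * 1024; // 8,192 bytes

// Size of a JSON request body in UTF-8 bytes (not in characters).
function jsonBodySizeInBytes(payload) {
  return new TextEncoder().encode(JSON.stringify(payload)).length;
}

// True when the serialized payload is larger than the suspected limit.
function exceedsSuspectedLimit(payload, limit = SUSPECTED_LIMIT_BYTES) {
  return jsonBodySizeInBytes(payload) > limit;
}

// Made-up quiz payload, only for illustration.
const quiz = {
  title: "Sample quiz",
  questions: Array.from({ length: 3 }, (_, i) => ({
    text: `Question ${i + 1}`,
    options: ["A", "B", "C", "D"],
  })),
};

console.log(jsonBodySizeInBytes(quiz), exceedsSuspectedLimit(quiz));
```

Counting bytes rather than characters matters here, because any non-ASCII character in a question takes more than one byte once the body is encoded.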

"Heh, that can't be the case." - Me

"Surely that can't be the case, it is just 8 KB. It has to be something else." - Me

With this thought in mind, we wrapped up the day.

🤬Next time write some documentation!

For the most part, developers frown when asked to write documentation, but at the very least take 5 minutes and think about the future person who will be maintaining your code. I mean, do you really want them to write a ~2k-word article about how frustrating it is when someone doesn't write something basic down?

Alright, back to the incident. The next day, we came to the office early, around 7:00 AM, with fresh minds. One thing had stuck with me: that the 8 KB payload size thing was bullshit (no offense to the Chad senior). So, I tried increasing the POST body size limit in Fastify and redeployed the backend, but the same issue persisted. Maybe it was due to CORS or some shit, so I allowed everything for the time being and redeployed. Nothing worked.
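For reference, this is roughly what raising the body size limit looks like in Fastify, which accepts a `bodyLimit` option (in bytes) when the server is created. It is a minimal sketch, not the actual REKT configuration, and the 5 MiB value is an arbitrary assumption. Fastify's documented default is already 1 MiB, far above a request of about 8,200 bytes, which fits with this change making no difference.

```javascript
// Fastify's documented default for `bodyLimit` is 1 MiB (1,048,576 bytes).
const FASTIFY_DEFAULT_BODY_LIMIT = 1024 * 1024;

// Hypothetical server options: raise the limit to 5 MiB (arbitrary value).
const serverOptions = {
  bodyLimit: 5 * 1024 * 1024,
};

// With Fastify installed, the options would be passed to the factory:
//   const app = require("fastify")(serverOptions);

// The failing request from the story was roughly 8,200 bytes.
const failingBodyBytes = 8200;
console.log(failingBodyBytes < FASTIFY_DEFAULT_BODY_LIMIT); // true: already under the default
console.log(failingBodyBytes < serverOptions.bodyLimit);
```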

I thought I'd leave this for a bit and work on the other bugs before the client found them. So, I put this one off and solved the others. They were not easy, but they weren't a black box: I could trace through the code and find the tiny flaw in the logic each time. I fixed each of the other bugs. Even though the main issue was left, I finally had some hope. Sometimes little victories help your state of mind and change the direction of your thinking. That is what happened to me.

Up until then, I had been looking through the logs for an error in both the frontend and the backend, but there was no sign of an error anywhere. It was truly a black box to me. We could reproduce the bug, but it wasn't leaving any trail to follow.

💡Thinking with a different perspective

Wait, why am I not getting any logs for this particular request? It is getting fired from the frontend, and it sometimes reaches the backend too. Let's assume that the 8 KB thing is right; that means something is blocking the request from reaching the backend. What could it be, though? We don't have any server...
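One way to test the payload size assumption directly is to bisect on body size: keep sending requests of different sizes until the exact cut-off shows up. Below is a minimal sketch of that idea. `trySend` is a hypothetical function that would POST a body of the given size to a test endpoint and resolve to `true` if it was saved; here it is replaced by a fake that pretends the cut-off is 8,192 bytes, which is an assumption based on the sizes seen in the dev tools.

```javascript
// Finds the largest body size (in bytes) for which `trySend` still succeeds,
// assuming every size up to some threshold passes and every size above it fails.
// Precondition: `low` is known to pass and `high` is known to fail.
async function findLargestPassingSize(trySend, low, high) {
  while (high - low > 1) {
    const mid = Math.floor((low + high) / 2);
    if (await trySend(mid)) {
      low = mid; // mid passed, so the cut-off is at or above mid
    } else {
      high = mid; // mid failed, so the cut-off is below mid
    }
  }
  return low;
}

// Stand-in for a real request: pretends something upstream
// rejects bodies larger than 8,192 bytes (assumed cut-off).
async function fakeTrySend(size) {
  return size <= 8192;
}

findLargestPassingSize(fakeTrySend, 1, 100000).then((size) => {
  console.log(`largest passing size: ${size} bytes`);
});
```

A sharp cut-off at a round number like this usually points at a configured limit somewhere on the request path rather than at a bug in the application logic.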

"We should revert back the allow all CORS it is not secure." - QA

Aha!! Secure, security... do we have something like a security group for routes, or anything else that has rules? I logged into the AWS dashboard and went to the security groups but couldn't find anything. Even Google searches didn't help much. Then it struck me: since the request wasn't getting to the backend, there must be something in front of it, and since it was serverless, I figured security groups couldn't be the one. What else could there be? I searched Google for "aws website security products", looking for something that blocks requests before they reach the backend, and there it was, the culprit of this clusterfu*k...

AWS Web Application Firewall (WAF)

Then everything started falling into place. I searched for "aws waf body size limit" and found my answer.

"We also use AWS WAF?" - Me, QA, Manager, maybe client

This came as a surprise not only to me but to the entire team. Well then, fixing it was as easy as it gets. I just logged into the AWS console, changed the WAF rules to allow oversized request payloads, and tested again.

Everything Works!🥂

Redeployed after cleaning up some code and tested again... it failed again.

The hell?! I thought.

Ah yes, since the entire infrastructure was managed by Terraform, the redeploy reset the WAF rules. Well, just change it again manually for the time being and it should work! Automating it is a problem for future me.

🔭Conclusion

After all this settled down, I took an hour or so and documented the entire incident, with guides, my thought process, links to the AWS docs and the manual solution as an in-depth, step-by-step tutorial, to help anyone who runs into something similar in project REKT in the future!


If you’d like to know more about project REKT or my journey, please feel free to reach out or follow me!

Github: Hetarth02
LinkedIn: Hetarth Shah
Website: Portfolio

💖Thank you for joining me on this journey, and look forward to more interesting articles!

Credits:

Cover Image from, Image Source.