Prologue - Calm before the storm
Middle of June 2023, a Wednesday - it was a beautiful sunny day, and I was just about ready to call it a day. We were supposed to go to my girlfriend's parents' for dinner and have a nice long walk with our golden retriever puppy, Krypto.
He was ready and raring to go, and so were we. I had no idea what was about to hit me. We had released a project just a few weeks before, so I grabbed my backpack and my laptop... you know... just in case.
All good, we get there, I sit down, and no more than 5-10 minutes later my phone starts vibrating like crazy. My first thought was... yup, that’s a hotfix incoming, what a great way to start dinner.
I take a look at my phone and notice my work Slack was all over the place:
Did we just release anything?
Did we change something weird today?
Any idea why we’re this slow?
I can only see the spinner!!!!
My first thought was - yeah, we probably messed something up, no rush, we'll roll back. Around the same time, I get a message from our CTO and the Head of Engineering: So it looks like we're reeeaaally slow, to the point you can't use the app. Would you mind jumping on a call?
Part 1 - It gets worse…
I rush to my laptop, open the AWS console and start digging into things:
Me: So, did we have any releases today => nope, that’s not it.
Me: Hmmm…let’s check the database => same traffic, if not less than what we have at peak hour, hmm, that’s strange.
A, our Head of Engineering: I did notice a Load Balancer alert being triggered around half an hour ago, but nothing new since then.
Me: Oh… that's because… it's still on. Holy moly, look at that traffic!
T, our CTO: So is it bad? People are spamming me about us being slow.
Me: Looks like we have like 10x the traffic we usually do at peak time…
Me: Oh, that’s odd, why did the API Gateway scale out? It shouldn’t unless…. How much traffic do we have?A: Oh lord!
T: Is this a DDoS?
I rushed to check some of our other dashboards and then it struck me… this truly was a DDoS.
Part 2 - … before it gets better
A few months earlier, we had started the due diligence process for a security certification for the company, and we did our best to act on most of that feedback. Luckily, one of the first things we prioritized was the Web Application Firewall (AWS WAF).
I remember we debated long and hard about whether it was really worth it, whether it justified the cost for our business, whether we really knew what we wanted to filter out with it. We were about to find out for sure.
Me: Ok so I checked the dashboards… it’s bad
A & T: How bad?
Me: DDoS bad… but we have the WAF configured, let me switch that on
So I turn it on and… nothing happens. Well, not exactly nothing - we probably filtered out something like 10% of the traffic, but the load was still heavy. I ping the other half of the SRE team, and my colleague O. joins the call.
We got the Avengers assembled, now let’s go:
Me: Hey O, TL;DR there’s a DDoS going on, app is slow as hell, I turned on the WAF, filtered out like 10% of traffic, but it’s still hitting us like crazy.
O: Do we have the Load Balancer access logs?
A & T: We sure do, let's grab those while you guys try to figure out what’s going on
Before anyone managed to do much with the logs, we started getting sampled requests in from the WAF. I quickly looked at those and saw that most IPs originated in North Korea, Russia, China - you know, all the fun stuff, none of the places where we serve the app.
I quickly set up a geo-blocking rule, enabled it, and BAM! The DDoS was over!
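For reference, because people always ask what such a rule actually looks like: in AWS WAFv2 a geo-match block is just a small statement in the web ACL. Here's a minimal sketch in Python/boto3 shape - the rule name, priority and country list below are illustrative, not our exact config:

# Minimal sketch of a WAFv2 geo-match block rule (illustrative values only).
# This dict goes into the web ACL's "Rules" list, e.g. via wafv2 update_web_acl.
geo_block_rule = {
    "Name": "block-non-serving-countries",
    "Priority": 0,
    "Action": {"Block": {}},
    "Statement": {
        "GeoMatchStatement": {
            # ISO country codes seen in the WAF request samples.
            "CountryCodes": ["KP", "RU", "CN"],
        }
    },
    "VisibilityConfig": {
        "SampledRequestsEnabled": True,
        "CloudWatchMetricsEnabled": True,
        "MetricName": "block-non-serving-countries",
    },
}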
Everyone was happy. We started on our postmortem, for customers and internally - pretty much everyone who needed to hear about how good we were.
We had two more things to do:
export all the logs and make them queryable with AWS Athena
add alerts based on the WAF rule metrics to measure allowed and blocked traffic (rough sketches of both follow right after this list).
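Here's roughly the shape of both, again sketched in Python/boto3. The table, database, bucket, web ACL name, threshold and SNS topic are placeholders rather than our real values, and the Athena table itself comes from the Load Balancer access-log DDL in the AWS docs:

import boto3

# 1) Query the exported Load Balancer logs with Athena: top talkers.
# Assumes an "alb_access_logs" table was already created over the log bucket.
athena = boto3.client("athena", region_name="us-east-1")
top_clients_query = """
    SELECT client_ip, user_agent, count(*) AS hits
    FROM alb_access_logs
    GROUP BY client_ip, user_agent
    ORDER BY hits DESC
    LIMIT 50
"""
athena.start_query_execution(
    QueryString=top_clients_query,
    QueryExecutionContext={"Database": "access_logs_db"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
)

# 2) Alarm on the web ACL's built-in CloudWatch metrics. "Rule": "ALL"
# aggregates across all rules; BlockedRequests / AllowedRequests are the
# metrics WAF publishes out of the box.
cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")
cloudwatch.put_metric_alarm(
    AlarmName="waf-blocked-requests-spike",
    Namespace="AWS/WAFV2",
    MetricName="BlockedRequests",
    Dimensions=[
        {"Name": "WebACL", "Value": "app-web-acl"},
        {"Name": "Rule", "Value": "ALL"},
        {"Name": "Region", "Value": "us-east-1"},
    ],
    Statistic="Sum",
    Period=60,
    EvaluationPeriods=1,
    Threshold=10000,
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:oncall-topic"],
)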
That’s it, we can go back to business; nothing can break us.
Part 3 - there be dragons
Over the next few days, my colleague O and I chatted about it, even mocked it at times. It wasn't that bad; we joked that all the big guys had been DDoSed over the past few days and some had been down for hours, while for us it was barely 15-30 minutes of slowness and that was it.
I looked at the metrics and was in awe. We had something like 60 million requests in 10-15 minutes. It might not seem like much, but trust me, for a startup it is. We stopped it just in time. Great job, everyone!
Funny how life has a way of striking back when you're cocky.
Friday, 09:30 a.m. EST, peak morning traffic time - the alarms went crazy, like everything we had set up the night before and early that morning was literally screaming.
A minute later the CTO calls us:
T: So… not sure how to say this, but we’re down.
Me: But we have the WAF…
T: Yeah, but we’re still down.
I felt the slap in the face as I opened the WAF dashboard. I had never seen anything like it in my life. If 60 million requests in 15 minutes was a lot, well, imagine the shock when it was that much in 3 minutes. More exactly, we had over 300 million requests hitting our servers in less than 15 minutes.
Me: Ok, they’re probably just using different geo locations.
O: On it, let’s grab the logs
A & T: So can we block them?
Me: Sure thing, let’s block all non-US traffic and be done with it. We can re-assess after that.
5 minutes later
O: Yeah, well, they just redirected all traffic, we’re still being hit.
I start looking at those requests and dissecting every part of them:
URL => nope, nothing fishy
Auth headers => nope, nothing weird
User-Agent => yeah, some seem fishy, we block those, still there
Then it strikes me:
Connection: keep-alive
Every Xth request in our samples had this header set. I quickly check our FE - nope, we're not setting it, and neither does CloudFront. Now this is weird.
I find an article from Hackmag and there it was, right in my face:
The slow loris attack attempts to overwhelm a targeted server by opening and maintaining many simultaneous HTTP connections to the target - https://hackmag.com/security/connect-and-rule/
I quickly set a rule up, enable it aaaaaand it’s gone. DDoS stopped. Was that it? I disable it one more time aaaaaaand DDoS is back :). Ok, that’s clearly it, EUREKA!!!!
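For the curious, in WAFv2 terms a rule like that is just a byte match on a single header - roughly the sketch below (illustrative name and priority, not our exact rule; also keep in mind plenty of legitimate clients send Connection: keep-alive too, so test in count mode before you block):

# Sketch of a WAFv2 rule blocking requests carrying "Connection: keep-alive".
keep_alive_rule = {
    "Name": "block-connection-keep-alive",
    "Priority": 1,
    "Action": {"Block": {}},
    "Statement": {
        "ByteMatchStatement": {
            "FieldToMatch": {"SingleHeader": {"Name": "connection"}},
            "SearchString": b"keep-alive",
            "PositionalConstraint": "CONTAINS",
            # Lowercase the header value first so "Keep-Alive" matches too.
            "TextTransformations": [{"Priority": 0, "Type": "LOWERCASE"}],
        }
    },
    "VisibilityConfig": {
        "SampledRequestsEnabled": True,
        "CloudWatchMetricsEnabled": True,
        "MetricName": "block-connection-keep-alive",
    },
}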
Again a postmortem, apologies to customers, an internal debrief - all the fun stuff to reassure everyone we're on top of our game. Friday ended not great, not terrible: 10 minutes of downtime, but we were still alive!
Part 4 - Mayday, mayday, we are sinking!
Monday morning we get right back at it. We look at the logs to see if we can get any extra rules in there; nothing pops up, at least not in the samples we've looked at. We had more than 300 million data points to check, and this time I had a feeling they'd be back.
08:30 a.m. EST - almost an hour before the expected D-Day time, and my heart was racing. It was Monday evening for me, but I knew I needed to keep my laptop around. And then, like a Swiss watch, at exactly 09:30 it began:
O: Oh sh*t they’re back…
A&T: Not again….
Me: We got this, we beat them twice, we can beat them again.
5 minutes later we were down - not slow, not sluggish, dead down. Me and my big mouth…
But this time it was weird. We had alerts showing we were blocking a ton, but a ton more was still hitting the mark. To my surprise, all of the requests were now originating in the US - all 300 million of them in the initial 5-minute batch, and all of the continuous flow for the next half an hour or so :)
It was worse than the previous days and the logs were spamming like crazy:
com.mysql.jdbc.exceptions.jdbc4.MySQLNonTransientConnectionException:
Data source rejected establishment of connection,
message from server: "Too many connections"
Did this mean the DB was down? Oh, it sure was - down and flooded with requests. Somehow they were getting through and hitting hard.
Me: Checking the logs, it looks like we have a ton of API requests to this endpoint? Did we get hacked?
O: Dunno, let me check the IPs again to see if any got through.
A&T: …
O: So nothing blocked got through, but all the IPs are from the US, and some of them are weird. I remember reading about those - are they botnet-class IPs? I think it's the Onion Network!!!!
Me: Darn, let’s block those.
We did just that. We blocked all the known IPs originating from the Dark Web, anonymous cloud-provider IPs, bad-reputation IPs, and heck, we even blocked a few random ones we grabbed with Athena. We were now blocking around 60-70% of the traffic, but what remained was still enough to keep us down.
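If you're wondering how you block "Dark Web" and anonymous-proxy IPs without curating lists yourself: AWS ships managed rule groups for exactly this. The sketch below shows the two obvious ones; I'm not claiming it's verbatim what we ran, but it's the same idea:

# Sketch: AWS-managed IP rule groups added to the web ACL (priorities illustrative).
# Managed rule groups use OverrideAction instead of Action.
managed_ip_rules = [
    {
        "Name": "anonymous-ip-list",
        "Priority": 2,
        "OverrideAction": {"None": {}},
        "Statement": {
            "ManagedRuleGroupStatement": {
                "VendorName": "AWS",
                # Covers VPNs, proxies, Tor nodes and hosting providers.
                "Name": "AWSManagedRulesAnonymousIpList",
            }
        },
        "VisibilityConfig": {
            "SampledRequestsEnabled": True,
            "CloudWatchMetricsEnabled": True,
            "MetricName": "anonymous-ip-list",
        },
    },
    {
        "Name": "ip-reputation-list",
        "Priority": 3,
        "OverrideAction": {"None": {}},
        "Statement": {
            "ManagedRuleGroupStatement": {
                "VendorName": "AWS",
                # Amazon's own list of known bad-reputation sources.
                "Name": "AWSManagedRulesAmazonIpReputationList",
            }
        },
        "VisibilityConfig": {
            "SampledRequestsEnabled": True,
            "CloudWatchMetricsEnabled": True,
            "MetricName": "ip-reputation-list",
        },
    },
]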
At that point, we just decided to block all unauthenticated traffic for the time being, and that was it. A few minutes later, around 90% of our users were able to access the app, but some were also being blocked unjustly.
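For the record, "block all unauthenticated traffic" was nothing smarter than keying off the Authorization header at the WAF. A sketch of the idea follows - not our exact rule, just the common missing-header trick of wrapping a size check in a NOT, and as we found out, it happily catches legitimate pre-login traffic too:

# Sketch: block requests whose Authorization header is missing or empty.
# If the header is absent, the inner size check doesn't match, the
# NotStatement flips that, and the request gets blocked. Crude, with
# real collateral damage on anything that isn't logged in yet.
block_unauthenticated = {
    "Name": "block-missing-authorization",
    "Priority": 4,
    "Action": {"Block": {}},
    "Statement": {
        "NotStatement": {
            "Statement": {
                "SizeConstraintStatement": {
                    "FieldToMatch": {"SingleHeader": {"Name": "authorization"}},
                    "ComparisonOperator": "GT",
                    "Size": 0,
                    "TextTransformations": [{"Priority": 0, "Type": "NONE"}],
                }
            }
        }
    },
    "VisibilityConfig": {
        "SampledRequestsEnabled": True,
        "CloudWatchMetricsEnabled": True,
        "MetricName": "block-missing-authorization",
    },
}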
That night I did a hotfix and patched both our frontend and our backend. It was not acceptable to have requests getting past our API Gateway, hitting our backend and flooding our DB…
Released the fix, removed the rules, and bingo - the DDoS was over.
Part 5 - it ain’t over ‘til it’s over
The conversation the next day, early in the morning EST:
Me: Okay, we must be hurting them like crazy. We blocked so many requests, they’re surely about to stop soon, right? Right? Right?
O: Well, not exactly. If those are bot farms, it probably costs them something like $5-10 per million requests.
A, T & Me: ….
And of course they didn't stop. They went on and on for almost two whole weeks, each morning at the same time, but now we were consistently blocking something like 99.9% of their requests.
Finally, the day came when they stopped, just as suddenly as they had begun, and things went back to normal.
Epilogue - a story to remember
I won't lie, these were probably the most exhausting two weeks or so I'd had in years. It turned out to be a happy ending for us, but that's often not the case with a DDoS.
There are so many stories of companies, both big and small, that have suffered terribly because of this. To put it plainly, there's nothing you can do to prevent a DDoS; it doesn't matter if you're big or small, whether you run a SaaS or a simple blog, it's just a matter of time until it happens to you.
For when it does, you need to get some things straight, and here's what you can do:
Make sure you're using a cloud provider that has a Web Application Firewall, and learn how to set it up and how to use it. Even if you need to add it on the fly during an attack, having the muscle memory for it will be golden. A reverse proxy can still do the trick if a WAF is missing, though it will be a lot more work on your part.
If you're using a managed app service such as Vercel and the like, make sure it offers DDoS protection and a spend limit; if it doesn't, RUN!!! There are plenty that do, so find the right one.
If you're on AWS and can afford the commitment (1 year, $3k+ a month, $40k+ a year), go for AWS Shield Advanced. If you can't afford that, the WAF does pretty much the same thing, but you won't get reimbursed for the traffic costs of a DDoS it fails to block.
If you can afford it, use an API Gateway, it helps you shield your services better than you can imagine, several orders of magnitude cheaper than a WAF and still really effective.
Do enable Load Balancer access logs. Yes, it's gonna be some extra cost you need to commit to, but how can you study patterns if you can't see the requests?
Learn how to best use tooling to scrape and aggregate information from access logs. On AWS, Athena is your friend; it's anything but cheap, though it is really fast.
Practice, practice, practice! You need to know your tooling if you are to be at the top of your game.
Sometimes, it’s cheaper to just shut everything down. Choose your battles well. But if you do, beware, you might not be able to get back online for some time! If you have a small blog or personal site, it’s not worth fighting for.
Do not expect to find any WAF rules out there, free for the taking. People will not share their secret sauce!
The last piece of advice I can give you is GOOD LUCK - and a lot of it. You're gonna need it if you get DDoSed, and you're gonna need it not to get DDoSed in the first place :)