There are no "No-Brainer" decisions
— Startups, DevOps — 6 min read
It all started with an incident in my startup
Some context on how our system works
Let me begin by explaining a bit about how our system operates. When a new company is onboarded onto our platform, we sync their whole CRM using job queues. When an employee from that company is onboarded, we sync the user's mailbox and calendar using job queues. We then try to make sense of all that data, converting it into an internal representation using, you guessed it... job queues. And once that's done, we start updating their CRM with new data using, yet again, job queues. Data synchronisation happens every few minutes, which triggers builds of the internal representations, which in turn trigger jobs to update the CRM. So you can imagine that for a user with 1,000 emails and 10,000 CRM records, we execute a minimum of 100,000 jobs to update anything on their CRM.
For job queues, we use an open-source library called Graphile Worker, which is built on Postgres's SKIP LOCKED functionality. It's very performant if used correctly, as the author intended (we'll come to this towards the end). Our system is hosted on AWS.
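If you haven't seen the pattern before, here is a minimal sketch of how a Postgres-backed queue can claim its next job with SKIP LOCKED. The table and column names are illustrative only, not Graphile Worker's actual schema.

```sql
-- Minimal sketch of the SKIP LOCKED pattern a Postgres-backed queue relies on.
-- Table and column names are illustrative, not Graphile Worker's internal schema.
CREATE TABLE IF NOT EXISTS jobs (
  id        bigserial PRIMARY KEY,
  task      text        NOT NULL,
  payload   jsonb       NOT NULL DEFAULT '{}',
  run_at    timestamptz NOT NULL DEFAULT now(),
  locked_at timestamptz
);

-- Claim the next runnable job: rows already locked by other workers are
-- skipped instead of blocking, so many workers can poll the same table.
UPDATE jobs
SET    locked_at = now()
WHERE  id = (
  SELECT id
  FROM   jobs
  WHERE  run_at <= now()
    AND  locked_at IS NULL
  ORDER  BY run_at
  LIMIT  1
  FOR UPDATE SKIP LOCKED
)
RETURNING *;
```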
The day of the incident
Date: Late September 2022.
As the startup grew and more users were onboarded onto the platform, it started suffering from performance degradation. One fine day, my Engineering Manager (EM) announced on our common Slack channel that the Customer Success (CS) team had raised many support requests asking why users' emails were not being updated on their CRM. At that moment, we realised the issue had become severe: we couldn't just scale our servers up to push it off our plates, and doing so was already adding significantly to our AWS bill. We quickly jumped on a Slack huddle to discuss solutions that could be implemented immediately and resolve the issue permanently.
Intuitions and Solutions
We all knew what the issue was -
Graphile Worker isn't performant because AWS RDS Postgres isn't meant to be used as a job queue. As the system scales, we should move to a dedicated job queue.
Discussing potential solutions
The reason I'm writing this blog post is what I learned from how we discussed those solutions.
Since the performance degradation had first come to light (about a month before the incident), we had only used AWS RDS Performance Insights to look at the queries taking the most time - which happened to be Graphile Worker's get_job function - and decided that it was not our code's problem, but the library maintainer's.
In the slack huddle, the following solutions were discussed:
- "Move to AWS SQS or another dedicated job queue solution ASAP, like within the next 2 weeks." It's true, AWS RDS Performance Insights was telling us this for a long time. But the timeline wasn't feasible. And migrating a core component to a new one isn't easy and can cause more problems than it can solve. It was decided that we would start the migration process within the next two months.
- "Remove dead/failed jobs from the jobs table." Makes sense. Since the Graphile worker does a sort on the
run_at
(the time when a job should be executed) on the whole table, dead/failed jobs would only slow the worker trying to fetch the next job details. We implemented this solution soon after, but the performance gains were not very significant. - "Remove all jobs which are scheduled to run after 3 months. Find a solution to schedule jobs within the next three months only." Another caveat in our system - we fetch the calendar events and put it all on the job queue! Even recurring ones!! This is a good idea to implement, but it wouldn't resolve the issue. We implemented this many months later in December. It had some performance gains as we had more than a million jobs scheduled to be run in this manner (all the way to 2039!).
- "Split RDS into two: app and worker." This came from my EM, as a way to quickly know if Graphile Worker was really the issue. The thing is, we didn't really consider this the best solution at the time. Even though I was on board with this, folks were more hell-bent on trying to move away from Graphile Worker ASAP than trying to empirically determine it being the bottleneck.
The reason solutions 2 and 3 became apparent is that the library maintainer never intended the job queue to be used for scheduling millions of jobs far into the future - all the way to 2039 😂! It was intended for jobs scheduled into the near future (a few days or so).
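For the curious, the clean-up behind solutions 2 and 3 boils down to something like the following. I'm assuming Graphile Worker's default graphile_worker.jobs table here, and the failure criteria are my own approximation; column names differ between versions, so treat this as a sketch rather than something to paste into production.

```sql
-- Hypothetical clean-up, assuming Graphile Worker's default schema.
-- Verify column names against your own database before running anything like this.

-- Solution 2: drop permanently failed jobs, i.e. jobs that have exhausted
-- their retries and recorded an error.
DELETE FROM graphile_worker.jobs
WHERE attempts >= max_attempts
  AND last_error IS NOT NULL;

-- Solution 3: drop jobs scheduled unreasonably far into the future
-- (recurring calendar events get re-scheduled closer to their run time instead).
DELETE FROM graphile_worker.jobs
WHERE run_at > now() + interval '3 months';
```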
I was tasked with doing solution 4 within the next two days, including migrating the jobs. Since the underlying job queue remained the same, the task was much less risky than moving to a whole new job queue solution, and I started on it immediately.
Mind-blowing discovery post-solution implementation
Once my task of splitting the RDS and migrating the jobs was complete, AWS RDS Performance Insights showed something our engineering team hadn't expected at all: our application queries were very sub-optimal! The app RDS was running at 99% CPU while the worker RDS sat at a healthy 40% utilisation, both on db.r6g.large instances. 🤦
Man, were we wrong all along! The problem was with us! We quickly scuttled to our DB IDE to run some EXPLAIN ANALYSE queries to see which indexes were being used and which weren't.
We found that a lot of our indexes weren't being used by Postgres's query planner, mainly because of our over-reliance on JSONB columns to store data. We had indexed fields inside those columns, but the query planner isn't very fond of them. That deserves a blog post of its own, because right now we are doing a massive rewrite to move away from JSONB as much as possible; I'll share those learnings in a separate post.
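To give a flavour of the kind of mismatch we kept finding (with made-up table and column names): a plain GIN index on a JSONB column serves operators like @>, but a filter written with ->> can't use it, so the planner falls back to a sequential scan unless there is a matching expression index.

```sql
-- Hypothetical example of the kind of mismatch we kept finding.
CREATE TABLE crm_records (
  id   bigserial PRIMARY KEY,
  data jsonb NOT NULL
);
CREATE INDEX crm_records_data_gin ON crm_records USING gin (data);

-- This filter extracts a field with ->>, which the GIN (jsonb_ops) index
-- cannot serve, so the planner falls back to a sequential scan.
EXPLAIN ANALYSE
SELECT * FROM crm_records WHERE data->>'email' = 'jane@example.com';

-- Rewriting the predicate as a containment check (@>) lets the planner
-- use the GIN index...
EXPLAIN ANALYSE
SELECT * FROM crm_records WHERE data @> '{"email": "jane@example.com"}';

-- ...or add a B-tree expression index that matches the original query exactly.
CREATE INDEX crm_records_email_idx ON crm_records ((data->>'email'));
```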
Learnings and Takeaways
The main problem during this incident was the lack of data points to make meaningful decisions. My learning is that some development time in startups should be allotted to instrumentation, not just feature development. Consider them separate tasks, but give them equal priority: one is incomplete without the other. You cannot make meaningful decisions without meaningful data, and good decisions make or break a startup. In this case, we relied far more on our "intuition" than on data.
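As one almost-free example of what I mean by instrumentation: Postgres already tracks how often every index is scanned, so a query like the one below against the built-in statistics views would have told us much earlier which of our indexes the planner was actually ignoring.

```sql
-- Indexes the planner has rarely or never used since statistics were last reset.
-- pg_stat_user_indexes is a built-in statistics view; no extra tooling required.
SELECT schemaname,
       relname      AS table_name,
       indexrelname AS index_name,
       idx_scan     AS times_used
FROM   pg_stat_user_indexes
ORDER  BY idx_scan ASC
LIMIT  20;
```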
Secondly, if you lack the data needed to make an informed decision (which is quite often the case in the always-busy startup life), the next step should be whatever generates the most insight with the least risk, treated as a high priority and done in the least amount of time. For us, that turned out to be splitting the RDS instance in two.