Lessons from my longest production crisis
Incidents are a great way to learn and improve! Sounds like manual remediation for multiple days was a miss but I am sure it left a mark on you :)
Thanks for sharing your story. We use a 3rd party API to draw a chart in an email. So when the scheduled task sent the emails, they went out without the chart. Of course this didn’t block operations, but the business value was lost. The logs showed a 429 (Too Many Requests) status code too.
I wrote directly to the API provider to find out their rate limit. I would have even called them. 😀
In another similar incident, the API provider blacklisted us, and that was some fun.
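For anyone hitting the same 429s, here is a minimal sketch of retrying with exponential backoff while honoring the `Retry-After` header when the provider sends one. Everything in it is an assumption for illustration: `fetch_with_backoff` and the `fetch` callable (which returns a status code, headers, and body) are hypothetical names, not part of any real provider's API, and the delays are placeholder values.

```python
import random
import time
from typing import Callable, Mapping, Optional, Tuple

# Hypothetical response shape: (status code, headers, body).
Response = Tuple[int, Mapping[str, str], bytes]


def fetch_with_backoff(
    fetch: Callable[[], Response],
    max_attempts: int = 5,
    base_delay: float = 1.0,
    max_delay: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[bytes]:
    """Call `fetch`, retrying only on HTTP 429, and return the body or None."""
    for attempt in range(max_attempts):
        status, headers, body = fetch()
        if status != 429:
            # Any non-429 answer ends the loop; only 2xx counts as success.
            return body if 200 <= status < 300 else None
        if attempt == max_attempts - 1:
            break  # out of attempts, do not sleep again
        retry_after = headers.get("Retry-After", "")
        if retry_after.isdigit():
            # Only the delay-in-seconds form of Retry-After is handled here.
            delay = float(retry_after)
        else:
            # Exponential backoff plus jitter so retries do not line up.
            delay = base_delay * (2 ** attempt) + random.uniform(0, base_delay)
        sleep(min(delay, max_delay))
    return None
```

If this returns `None`, the caller can decide whether to send the email without the chart or hold it for a later run, which is a business decision rather than a technical one.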
Love this article and the personal experience sharing, Anton.
3rd party APIs have been a pain to deal with, and it's hard to know how to approach them most effectively.
Another place I thought this article might have gone is needing to copy and store our own versions of the data and keep them in sync with the 3rd party via background jobs. I've had to work through that in the past because of how slow and unreliable the 3rd party was.
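To make the idea concrete, here is a rough sketch of a local copy that a background job refreshes, assuming the 3rd party exposes some way to pull the full dataset. `LocalMirror` and `fetch_all` are hypothetical names, and a real version would likely persist to a database rather than an in-memory dict.

```python
import time
from typing import Callable, Dict, Optional


class LocalMirror:
    """In-memory copy of 3rd party records, refreshed by a background job."""

    def __init__(self, clock: Callable[[], float] = time.time) -> None:
        self._records: Dict[str, dict] = {}
        self._last_synced: Optional[float] = None
        self._clock = clock

    def sync(self, fetch_all: Callable[[], Dict[str, dict]]) -> bool:
        """Pull everything from the 3rd party; keep the old copy on failure."""
        try:
            fresh = fetch_all()
        except Exception:
            # The provider is slow or down: serve stale data instead of none.
            return False
        self._records = dict(fresh)
        self._last_synced = self._clock()
        return True

    def get(self, key: str) -> Optional[dict]:
        """Read from the local copy; never calls the 3rd party."""
        return self._records.get(key)

    def age_seconds(self) -> Optional[float]:
        """Seconds since the last successful sync, or None if never synced."""
        if self._last_synced is None:
            return None
        return self._clock() - self._last_synced
```

The background job would call `sync` on a schedule, while request handlers only ever call `get`, so a slow or failing provider degrades freshness rather than availability. `age_seconds` gives something to alert on when the copy gets too stale.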
Sorry for your weekend :P
The manual work decision is a tough one. It’s not easy to decide what’s worth automating versus not. We all know the story of taking 5 hours to build an automation for something that takes 5 minutes to do manually :)
But in this case, it’s a good lesson learned.
Love that you're sharing real-life tech stories. Awesome visuals and thanks for the mention.