Even though I’m an Uber user, I’ve never thought very much about their IT infrastructure other than to form a (hilariously) simplified mental model of what it probably looks like. It turns out to be a pretty sophisticated operation that has, among other things, a dedicated “Observability Team” that builds and monitors Uber’s telemetry subsystems. The system gathers fairly fine-grained metrics on what’s happening, and these are monitored to spot problems and reveal opportunities to tune the system for better performance.
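To make “fine-grained metrics” a bit more concrete, here’s a minimal sketch, in Go since that’s what Uber’s system is written in, of timing an operation and emitting a latency sample. The helper and the metric name are my own inventions for illustration; a real pipeline ships these samples to a metrics backend rather than printing them.

```go
package main

import (
	"fmt"
	"time"
)

// timeOp is a hypothetical helper illustrating the kind of fine-grained
// timing a telemetry pipeline collects. Real systems would send the
// sample to a metrics backend instead of printing it.
func timeOp(name string, op func()) {
	start := time.Now()
	op()
	fmt.Printf("metric: %s.latency_ms=%d\n", name, time.Since(start).Milliseconds())
}

func main() {
	timeOp("ingest_event", func() {
		time.Sleep(12 * time.Millisecond) // stand-in for real work
	})
}
```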
Wilfred Hughes pointed me to this post by Richard Artoul over at the Uber Engineering blog that tells the story of how, after a recent deployment of some updates, the latency for collecting and filing events increased from about 10 seconds to 20 seconds. They reverted the change and went looking for the source of the latency. That turned out to be harder than you’d guess because the obvious strategies, like profiling, failed. Furthermore, the problem manifested only in the production system, so they had to debug on the live system rather than on one of their testbeds.
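For context on what “the obvious strategies” look like: the standard first move when profiling a live Go service is to expose the runtime’s profiling endpoints with net/http/pprof. This is a generic sketch of that setup, not anything taken from the Uber post; the port is a placeholder.

```go
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers /debug/pprof handlers on the default mux
)

func main() {
	// Expose the profiling endpoints on a side port so the running
	// service can be profiled without redeploying it.
	go func() {
		log.Println(http.ListenAndServe("localhost:6060", nil))
	}()
	// ... the service’s real work would run here ...
	select {} // block forever in this sketch
}
```

One would then point `go tool pprof` at `http://localhost:6060/debug/pprof/profile` to capture a CPU profile. In Uber’s case, of course, this kind of profiling came up empty, which is what made the hunt interesting.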
What follows is a tale of some impressive engineering and debugging. Even after they located the offending code, it was hard to understand how it was causing the latency. Before they were finished, they had even instrumented the Go compiler to help them understand what was happening. By the time they were done, they completely understood the problem and had changed one of their algorithms to prevent it from happening again.
If you write code, you really should read this post. It’s an example of debugging at its best.