Early in my last role I was convinced we should shard our Postgres database horizontally to handle growing write volume. I put together a design doc and presented it to the team. One of my colleagues pushed back and suggested we try read replicas and query optimisation first before committing to the operational complexity of sharding. I thought about it and realised they had a valid point, so we went with that approach instead. It ended up working well — write latency stayed within acceptable bounds for another eight months, so it was the right call.
Six months ago I was convinced horizontal sharding was the right fix for our write latency spike — p99 had climbed to 420 milliseconds under peak load. I wrote the design doc, ran the cost model, and was ready to greenlight it. Then our data infrastructure lead shared profiling output showing 60 percent of our slow writes were hitting a single poorly-indexed foreign key — nothing sharding would touch. I ran the same profiling independently to verify, added a composite index, and rewrote two query paths. P99 dropped to 85 milliseconds within a week, with no added operational complexity. I updated the doc, closed the sharding RFC, and presented the full evidence trail to the team so we had a shared model of the decision.