We launched a ranking model for our recommendations feed that had a strong offline AUC of 0.87. After launch, click-through rate dropped by about four percent. I dug into the logs and found a feature that was stale in production because of a caching delay: the training data had fresh values, but serving was seeing stale ones. I updated the feature pipeline to reduce the cache TTL, retrained the model, and click-through recovered within two weeks. It was a good reminder to always validate feature freshness end to end before launch.
We had a retrieval model with 0.91 recall at ten offline. After launch, downstream conversion dropped seven percent. I diagnosed a feature distribution shift. Fine. But then I asked a harder question: why did 0.91 recall at ten give us false confidence? I found that our offline holdout was sampled from the same time window as the training data, so it could not surface temporal distribution shift at all. I proposed and built a time-stratified evaluation protocol in which the holdout is always drawn from a later window than the training data. The team adopted it as the standard for every retrieval model we launched after that. The real failure wasn't the model; it was the evaluation design I had signed off on.
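
To make the time-stratified idea concrete, here is a minimal sketch of how such a split might look. It is not the protocol the team actually built. The `Interaction` record, the `time_stratified_split` and `assert_no_temporal_overlap` helpers, and the optional `gap` parameter are all hypothetical names chosen for illustration. The only property it tries to show is that every holdout row comes from a later window than every training row.

```python
from datetime import datetime, timedelta
from typing import List, NamedTuple, Sequence, Tuple


class Interaction(NamedTuple):
    # Hypothetical log record; a real pipeline would carry more fields.
    user_id: str
    item_id: str
    timestamp: datetime


def time_stratified_split(
    rows: Sequence[Interaction],
    cutoff: datetime,
    gap: timedelta = timedelta(0),
) -> Tuple[List[Interaction], List[Interaction]]:
    """Split rows by time instead of sampling them at random.

    Rows strictly before `cutoff` go to training. Rows at or after
    `cutoff + gap` go to the holdout. Rows that fall inside the gap
    are dropped (an assumed buffer against leakage near the boundary).
    """
    holdout_start = cutoff + gap
    train = [r for r in rows if r.timestamp < cutoff]
    holdout = [r for r in rows if r.timestamp >= holdout_start]
    return train, holdout


def assert_no_temporal_overlap(
    train: Sequence[Interaction], holdout: Sequence[Interaction]
) -> None:
    """Raise ValueError if any holdout row is not later than all training rows."""
    if not train or not holdout:
        return
    latest_train = max(r.timestamp for r in train)
    earliest_holdout = min(r.timestamp for r in holdout)
    if earliest_holdout <= latest_train:
        raise ValueError(
            f"holdout starts at {earliest_holdout}, "
            f"but training data runs until {latest_train}"
        )


if __name__ == "__main__":
    start = datetime(2024, 1, 1)
    # Ten days of synthetic interactions, one per day.
    logs = [Interaction("u1", f"item{d}", start + timedelta(days=d)) for d in range(10)]
    train, holdout = time_stratified_split(
        logs, cutoff=start + timedelta(days=7), gap=timedelta(days=1)
    )
    assert_no_temporal_overlap(train, holdout)
    print(len(train), "training rows;", len(holdout), "holdout rows")
```

A check like `assert_no_temporal_overlap` could run as a gate in the evaluation job, so that a holdout sampled from the training window fails loudly instead of producing a reassuring offline number.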