When I see a gap between offline and online metrics, I start by thinking through the common root causes. First, I'd look at data drift — whether the input distribution has shifted since training. Then I'd check for training-serving skew, making sure the feature engineering logic is consistent between the two environments. I'd also consider label quality issues or delayed feedback loops. I'd pull some sample predictions and compare them to what we'd expect offline. Once I narrow it down, I'd work with the data engineering team to fix the pipeline.
I had this happen on a ranking model I owned. We saw CTR drop roughly four percent in the first forty-eight hours post-launch. I had dashboards tracking feature distribution skew, prediction score distributions, and label delay in real time — so I knew within hours it wasn't a serving infrastructure issue. I isolated it to a feature that was computed differently at serving time than at training time: a freshness signal that used request time in training but event time in production. I filed a fix, rolled back the feature, and shipped a corrected version within a week. After that I added a pre-launch skew check to our deployment checklist so this class of failure would be caught before the next launch.