I'd start with click-through rate on the top results as the primary signal: if users are clicking, the results are relevant. I'd also track average session length and whether users return to the results page after clicking, which tells you about result quality. For longer-term trends, I'd look at query volume and week-over-week user retention. NDCG is useful for offline evaluation against human rater judgments. Together these give a fairly comprehensive picture of whether Search quality is improving.
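As a rough sketch of the offline NDCG piece, assuming each result in a ranked list has a graded relevance label from a human rater (0 = irrelevant, higher = better): the function names here are illustrative, and the gain is the linear form (the label itself) rather than the exponential `2^rel - 1` variant some teams prefer.

```python
import math


def dcg(relevances, k):
    """Discounted cumulative gain over the top-k results.

    Assumes linear gain: each label is discounted by log2(rank + 1),
    with ranks starting at 1.
    """
    return sum(
        rel / math.log2(position + 2)  # position is 0-based, so rank + 1 = position + 2
        for position, rel in enumerate(relevances[:k])
    )


def ndcg(relevances, k):
    """DCG of the ranking as served, divided by the DCG of the ideal ordering."""
    ideal = dcg(sorted(relevances, reverse=True), k)
    if ideal == 0:
        return 0.0  # no relevant results at all; define NDCG as 0
    return dcg(relevances, k) / ideal


# Rater labels for one query, in the order the results were shown.
served = [3, 2, 0, 1]
score = ndcg(served, k=4)
```

A score of 1.0 means the served order matches the ideal order for those labels. Averaging the score over a fixed query set gives one number to compare between ranker versions.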
Before proposing any metric, I'd define 'better' from the user's perspective: did they get the right answer with minimal effort? That frames the whole system. I'd structure it as a hierarchy. The North Star is task success rate, measured via satisfaction surveys and zero-long-click rate. Leading indicators include reformulation rate and time-to-first-click. Guardrail metrics protect against gaming: if CTR rises but reformulation also rises, that's a red flag, not a win. I'd also flag the long-run validity problem: changes that lift metrics in the short term can erode user trust over months, so I'd pair any live experiment readout with longitudinal cohort tracking.
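To make the guardrail idea concrete, here is a minimal sketch of how an experiment readout might be classified. Everything in it is hypothetical: the `Readout` fields, the `classify` function, and the `tolerance` parameter are illustrative names, and a real readout would also need confidence intervals rather than raw deltas.

```python
from dataclasses import dataclass


@dataclass
class Readout:
    """Treatment-vs-control deltas from one experiment (hypothetical schema)."""
    ctr_delta: float            # relative change in click-through rate
    reformulation_delta: float  # relative change in query reformulation rate


def classify(readout, tolerance=0.0):
    """Apply the guardrail: a CTR gain only counts if reformulation did not rise.

    `tolerance` is an assumed allowance for how much reformulation may
    increase before the guardrail trips.
    """
    if readout.ctr_delta <= 0:
        return "no_ctr_gain"
    if readout.reformulation_delta > tolerance:
        return "red_flag"  # more clicks, but users are also rephrasing more
    return "win"


# CTR up 2%, reformulation up 3%: more clicks, yet users rephrase more often.
verdict = classify(Readout(ctr_delta=0.02, reformulation_delta=0.03))
```

The point of encoding it this way is that the leading metric never gets read on its own. The guardrail is part of the decision rule instead of a footnote on the dashboard.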