I would design this as a two-stage system: retrieval with a two-tower model to generate candidate videos, then a ranking model with features like watch time, likes, and shares. For freshness, I'd feed real-time user signals into the ranker through something like a feature store. To balance short-term and long-term engagement, I'd use a multi-objective loss that weights immediate signals alongside session-level watch time. I'd validate offline on holdout sets and then run A/B tests to confirm that online metrics improve before any full rollout.
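As a rough sketch of what I mean by a multi-objective loss: one simple form is a weighted sum of per-objective losses. Everything named here is an assumption for illustration, not a fixed design: the field names (`p_like`, `p_share`, `log_session_watch`), the choice of cross-entropy for likes and shares, and squared error on log-scaled session watch time.

```python
import math


def binary_cross_entropy(p, y, eps=1e-7):
    """Cross-entropy for one predicted probability p and a 0/1 label y."""
    p = min(max(p, eps), 1 - eps)  # clamp to avoid log(0)
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))


def multi_objective_loss(pred, label, weights):
    """Weighted sum of immediate-signal losses and a session-level loss.

    All keys are hypothetical names chosen for this sketch.
    """
    # Immediate signals: did the user like or share this video?
    like_loss = binary_cross_entropy(pred["p_like"], label["liked"])
    share_loss = binary_cross_entropy(pred["p_share"], label["shared"])

    # Session-level signal: squared error on log1p(seconds), so a few
    # very long sessions do not dominate the gradient.
    target = math.log1p(label["session_watch_seconds"])
    watch_loss = (pred["log_session_watch"] - target) ** 2

    return (
        weights["like"] * like_loss
        + weights["share"] * share_loss
        + weights["session_watch"] * watch_loss
    )


# Example call with made-up numbers.
example_loss = multi_objective_loss(
    pred={"p_like": 0.8, "p_share": 0.1, "log_session_watch": 5.0},
    label={"liked": 1, "shared": 0, "session_watch_seconds": 180},
    weights={"like": 1.0, "share": 1.0, "session_watch": 2.0},
)
```

The weights are where the short-term versus long-term trade-off gets encoded, so I'd treat them as tunable and pick them based on A/B results rather than fix them up front.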
I'd decompose this into three stages: two-tower ANN retrieval to get candidates, a light ranker that filters on freshness and basic quality signals, then a heavy ranker with dense user and video embeddings. The short-term versus long-term tension is where the real design work lives. Raw watch time is a noisy proxy for satisfaction: optimized on its own, it can reward clickbait and hurt retention. I'd complement it with explicit satisfaction signals: survey-derived satisfaction scores, plus repeat consumption of the same creators as a long-run health proxy. I'd track separate A/B metrics for session engagement and seven-day return rate, and I'd monitor both in production dashboards with alerts when they diverge. That split is how you catch reward hacking before it compounds.
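To make the divergence alert concrete, here is a minimal sketch of the check I have in mind: compare treatment against control on both metrics and flag the case where session engagement rises while seven-day return falls. The `ArmMetrics` type, the function names, and the thresholds are all hypothetical, and the sketch deliberately skips statistical significance testing, which a real alert would need.

```python
from dataclasses import dataclass


@dataclass
class ArmMetrics:
    """Aggregated metrics for one A/B arm (hypothetical schema)."""
    session_engagement: float     # e.g. mean watch minutes per session
    seven_day_return_rate: float  # fraction of users returning within 7 days


def relative_lift(treatment, control):
    """Relative change of treatment over control, e.g. 0.05 for +5%."""
    if control == 0:
        raise ValueError("control metric must be non-zero")
    return (treatment - control) / control


def divergence_alert(treatment, control, min_gain=0.01, max_drop=-0.005):
    """Flag engagement going up while seven-day return goes down.

    min_gain and max_drop are illustrative thresholds, not tuned values.
    """
    engagement_lift = relative_lift(
        treatment.session_engagement, control.session_engagement
    )
    return_lift = relative_lift(
        treatment.seven_day_return_rate, control.seven_day_return_rate
    )
    diverging = engagement_lift >= min_gain and return_lift <= max_drop
    return {
        "engagement_lift": engagement_lift,
        "return_lift": return_lift,
        "diverging": diverging,
    }


# Example: engagement is up but return rate is down, so this should alert.
control = ArmMetrics(session_engagement=10.0, seven_day_return_rate=0.40)
treatment = ArmMetrics(session_engagement=10.5, seven_day_return_rate=0.39)
report = divergence_alert(treatment, control)
```

A ranker that wins on engagement and loses on return rate is the pattern I'd read as likely reward hacking, so that combination is the one worth paging on.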