I would build a two-tower model — one tower for the user, one for the item — trained on historical click data. At query time, I would run approximate nearest neighbour search using FAISS to retrieve the top candidates, then pass them through a ranking model. To handle latency I would add a Redis cache for popular queries and pre-compute user embeddings nightly. The ranking model would be a lightweight gradient-boosted tree to keep inference fast. This should comfortably handle 500 million users if we scale the serving fleet horizontally.
Sub-100ms p99 is the first constraint I design around, not the last. I would split the budget: roughly 20ms for two-tower ANN retrieval over a pre-built FAISS index, and 60ms for a lightweight ranker — leaving 20ms for network and feature serving overhead. User tower embeddings are pre-computed and pushed to a low-latency feature store refreshed every 15 minutes, balancing freshness against compute cost. Popular query results are cached with a TTL tuned to query frequency. If retrieval breaches its budget, the system falls back to a pre-ranked static set — the user still gets a result, just a less personalised one. I would instrument p99 at every stage boundary so we catch regressions before they compound.