The Bar Raiser's Debrief · Amazon Machine Learning Engineer

"How would you design a system to serve personalised search rankings to 500 million users with sub-100ms p99 latency?"

Invent and Simplify · Machine Learning Engineer · 5–7 min
Why candidates fail: Candidates jump straight into model architecture without decomposing the problem into retrieval and ranking stages, revealing they have never operated a production ML system at scale.
Two voices. One question. The insider reaction you don't usually see.
Also on YouTube 5–7 min 2026
Competency tested: Invent and Simplify
Who asks it: Bar Raiser · HM · Peer
What they're really asking: Does latency shape your design from the start?
The answer that fails — and why
Candidate answer Does not raise the bar — Invent and Simplify

I would build a two-tower model — one tower for the user, one for the item — trained on historical click data. At query time, I would run approximate nearest neighbour search using FAISS to retrieve the top candidates, then pass them through a ranking model. To handle latency I would add a Redis cache for popular queries and pre-compute user embeddings nightly. The ranking model would be a lightweight gradient-boosted tree to keep inference fast. This should comfortably handle 500 million users if we scale the serving fleet horizontally.

Bar Raiser evaluation
Latency treated as afterthought — cache and fleet scaling added last
No retrieval/ranking budget split; p99 constraint is unquantified
Pre-compute cycle is daily — personalisation freshness risk unaddressed
No graceful degradation path when latency budget is exceeded
Amazon debrief · MLE loop · Bar Raiser evaluation · Below Bar
Leadership Principle: Invent and Simplify
Does not demonstrate Invent and Simplify.
Latency constraint named but never used to drive architectural decisions.
No explicit latency budget split between retrieval and ranking stages.
Daily pre-computation proposed without acknowledging personalisation freshness risk.
No graceful degradation strategy when p99 budget is breached at scale.
interview101.com · Invent and Simplify · Amazon MLE · Bar Raiser debrief reference
Now here's what a strong answer actually sounds like
The answer that works — in full
Strong answer Raises the bar — Invent and Simplify

Sub-100ms p99 is the first constraint I design around, not the last. I would split the budget: roughly 20ms for two-tower ANN retrieval over a pre-built FAISS index, and 60ms for a lightweight ranker — leaving 20ms for network and feature serving overhead. User tower embeddings are pre-computed and pushed to a low-latency feature store refreshed every 15 minutes, balancing freshness against compute cost. Popular query results are cached with a TTL tuned to query frequency. If retrieval breaches its budget, the system falls back to a pre-ranked static set — the user still gets a result, just a less personalised one. I would instrument p99 at every stage boundary so we catch regressions before they compound.
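One way to picture the budget split and the fallback described in this answer is the sketch below. The 20/60/20 ms figures and the 100 ms target come from the answer itself; `ann_retrieve`, `retrieve_with_fallback`, `STATIC_FALLBACK`, and the simulated delay are hypothetical stand-ins, not a real retrieval client or FAISS call.

```python
import asyncio

# Per-stage allocations taken from the answer; all other names are hypothetical.
BUDGET_MS = {"retrieval": 20, "ranking": 60, "overhead": 20}
P99_TARGET_MS = 100
assert sum(BUDGET_MS.values()) <= P99_TARGET_MS, "stage budgets exceed the p99 target"

# Pre-ranked, non-personalised results served when retrieval is too slow.
STATIC_FALLBACK = ["popular-1", "popular-2", "popular-3"]


async def ann_retrieve(user_id: str, delay_ms: float) -> list[str]:
    """Stand-in for the ANN lookup; delay_ms simulates index latency."""
    await asyncio.sleep(delay_ms / 1000)
    return [f"{user_id}-item-{i}" for i in range(3)]


async def retrieve_with_fallback(user_id: str, delay_ms: float) -> tuple[list[str], bool]:
    """Return (candidates, degraded); degrade if retrieval exceeds its budget."""
    try:
        items = await asyncio.wait_for(
            ann_retrieve(user_id, delay_ms),
            timeout=BUDGET_MS["retrieval"] / 1000,
        )
        return items, False
    except asyncio.TimeoutError:
        # The user still gets a result, just a less personalised one.
        return STATIC_FALLBACK, True
```

Calling `asyncio.run(retrieve_with_fallback("u1", 200))` simulates a 200 ms index lookup, which exceeds the 20 ms allocation and so returns the static set with the degraded flag set. Returning the flag alongside the results is a design choice of this sketch: it lets downstream code count how often the floor is hit.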

Bar Raiser evaluation
Latency budget decomposed explicitly across retrieval and ranking stages
Freshness vs compute tradeoff quantified — 15-minute refresh cycle justified
Graceful degradation path named and scoped to the retrieval stage
Instrumentation plan proposed proactively — shows production ownership instinct
Amazon debrief · MLE loop · Bar Raiser evaluation · Raises Bar
Leadership Principle: Invent and Simplify
Strong signal. Raises the bar.
Decomposed p99 budget into per-stage allocations before choosing any model.
Articulated freshness versus compute tradeoff with a concrete refresh cadence.
Named and bounded a graceful degradation path — production ownership signal.
Proposed per-stage instrumentation proactively; did not wait to be asked.
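The freshness versus compute tradeoff can be made concrete with some back-of-envelope arithmetic comparing the nightly pre-compute from the weaker answer with the 15-minute cadence from the strong one. The sketch assumes every refresh recomputes all user embeddings and ignores job run time; `refresh_profile` is a hypothetical helper, and an incremental refresh would change the cost side.

```python
from datetime import timedelta


def refresh_profile(interval: timedelta) -> dict[str, float]:
    """Worst-case embedding age and refresh jobs per day for a fixed cadence."""
    one_day = timedelta(days=1)
    return {
        # An embedding can be at most one full interval old just before the next refresh.
        "max_staleness_min": interval.total_seconds() / 60,
        # Dividing one timedelta by another yields a plain float ratio.
        "refreshes_per_day": one_day / interval,
    }


nightly = refresh_profile(timedelta(days=1))
frequent = refresh_profile(timedelta(minutes=15))
```

Under these assumptions the 15-minute cadence cuts worst-case staleness from 1,440 minutes to 15 at the price of 96 refresh runs per day instead of one. Stating that ratio is what turns "refreshed every 15 minutes" from a guess into a justified choice.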
Run your story through these three questions
1. Can you split your latency budget across every stage before naming a model?
If not, latency is decoration, not a design constraint, and the Bar Raiser will hear it.
2. Have you named the freshness versus compute tradeoff for your personalisation signals?
If not, your pre-computation story has an unacknowledged production risk sitting in it.
3. Do you have a graceful degradation path when the latency budget is exceeded?
If not, your system has no floor, and at 500 million users the floor always gets tested.