I find Byzantine fault tolerance genuinely difficult. The idea that a distributed system has to keep working even when some nodes send actively misleading information, rather than just failing silently, is conceptually tricky. I spent time on the original paper and I get the core proof, but I still have to think carefully whenever I work through a specific consensus algorithm. It's one of those areas where I know the theory, but I'd want more hands-on implementation experience before I'd say I fully own it.
The topic I'm still working to understand is backpressure in distributed stream processing, specifically why naive implementations oscillate instead of settling. I understand the surface mechanism: a slow consumer signals the producer to slow down. What I can't yet reason through reliably is the control-theory side: why certain feedback loop configurations overshoot before settling, and how to tune the loop's parameters without empirical trial and error. I've read two papers on it and built a toy model, and I can predict the failure mode, but I don't yet have a mental model that lets me choose the right parameters from first principles. That gap bothers me, and I'm actively working through it.
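To make the overshoot concrete, here is a minimal sketch of one way it can arise. It is not the toy model mentioned above, and every name and number in it (`simulate`, `gain`, `delay`, `target`, `consume_rate`, `max_rate`) is a hypothetical chosen for illustration. It assumes a single producer that knows the consumer's steady rate, applies simple proportional control to the queue depth, and only sees that depth as it was `delay` ticks earlier.

```python
def simulate(gain, delay, steps=200, target=50.0,
             consume_rate=10.0, max_rate=40.0):
    """Return the queue depth per tick for a producer throttled by stale feedback.

    All parameters are illustrative assumptions, not values from a real system.
    """
    depth = 0.0
    history = [depth]  # history[t] is the queue depth at tick t
    for t in range(steps):
        # The producer sees the depth as it was `delay` ticks ago.
        observed = history[max(0, t - delay)]
        # Proportional control: send faster when the queue looks short,
        # slower when it looks long.
        rate = consume_rate + gain * (target - observed)
        # Clamp to the assumed physical limits of the producer.
        rate = min(max(rate, 0.0), max_rate)
        # The consumer drains at a fixed rate; the queue cannot go negative.
        depth = max(0.0, depth + rate - consume_rate)
        history.append(depth)
    return history


if __name__ == "__main__":
    for gain, delay in [(0.8, 0), (0.1, 3), (0.8, 3)]:
        depths = simulate(gain, delay)
        tail = depths[-50:]
        print(f"gain={gain} delay={delay} peak={max(depths):.1f} "
              f"late swing={max(tail) - min(tail):.1f}")
```

Under these assumptions, the aggressive gain settles without crossing the target when feedback is instantaneous, and the gentle gain settles even with a three-tick delay. Combining the aggressive gain with the delay is what produces the overshoot and a sustained swing: the producer keeps reacting to a queue depth that is already out of date. This only reproduces the failure mode; it does not by itself say how to pick the gain for a given delay.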