Last quarter, our team released a feature that had a memory leak. It wasn't caught in testing because our test data was smaller. When we saw production issues, I immediately jumped in to help debug. We traced it to an unclosed connection in the caching layer. I worked with the team lead to implement proper connection pooling and added monitoring. It was a learning experience about the importance of production-scale testing.
I made a significant architectural decision that caused a 6-hour outage for 50,000 users. I chose to implement eventual consistency without fully understanding our payment flow requirements. When transactions started failing, I initially tried quick fixes instead of admitting the fundamental design was wrong. After we rolled back, I realized I had made the decision in isolation. Now I write one-page design docs for any cross-service changes and get explicit sign-off from domain experts before implementation. I also instituted a 'devil's advocate' review where someone challenges my assumptions.