We had a service that was showing elevated latency in our dashboards. I pulled the CloudWatch metrics and dug into the logs to identify where the slowdown was happening. After reviewing the data, I traced the issue to a database query that was running a full table scan. I worked with the team, we added an index, and latency dropped back to normal. The whole investigation took about two days and the fix was deployed the following week.
Our payment service showed a p99 latency spike every Tuesday at 11 AM. My team had already investigated twice and closed it as a traffic anomaly. I was not convinced. I pulled six weeks of raw request logs, rather than the aggregated CloudWatch metrics everyone had been using, and correlated them hour by hour with our DynamoDB consumed-capacity data. I found that a weekly batch job owned by a separate team was saturating a partition key that both teams shared. No one had made that cross-team connection. I brought the data to both teams, we redesigned the partition key and isolated the batch workload, and p99 dropped from 800 ms to 60 ms and stayed there. I also documented the investigation pattern as a runbook so on-call engineers could identify cross-team contention without repeating the investigation from scratch.