We were building a batch ingestion pipeline to load event data into our warehouse nightly. I had designed it around a scheduled job that pulled full table snapshots. During a design review, a colleague pointed out that the snapshot approach would get expensive as the table grew. I agreed, and we switched to an incremental load using a watermark timestamp. The migration went smoothly — we reduced processing time by about thirty percent and storage costs came down. It was a good callout and I was glad we caught it before it became a bigger problem.
I had designed a Dataflow streaming pipeline assuming downstream consumers could tolerate eventual consistency — I thought near-real-time was good enough and that adding exactly-once guarantees would be over-engineering at our scale. Two months in, a peer on the analytics team showed me query results that were silently double-counting events during Dataflow worker restarts. I had been wrong about the failure mode — I assumed restarts were rare enough not to matter, but I had never actually measured restart frequency under our traffic patterns. Once I saw the data, we redesigned the pipeline around idempotent writes using a deduplication key in BigQuery. We also added a reconciliation job that compared source counts to warehouse counts every hour. Late-arriving event discrepancies dropped from roughly four percent to under zero point one percent. More importantly, I updated my default assumption for any stateful streaming pipeline: exactly-once is the baseline, not the optimization.