The Hidden Prometheus Histogram Trap That Created “Phantom” Kafka Lag Spikes
How a tiny clock skew and one missing `max(0, lag)` turned into a thrilling debugging adventure

This issue surfaced while I was refactoring part of the monitoring stack for Talos Trading, where I work on Coin Metrics products. It was one of those situations where a routine improvement led to a behavior so strange that it demanded investigation.
One afternoon, our recently updated Kafka consumer lag dashboards started showing significant spikes affecting multiple Kafka topics on the same broker. At first glance, the chart looked like we were falling dangerously behind in consuming messages.
But something was off.
The 99.99th percentile — completely clean
histogram_quantile(
  0.9999,
  sum by (le, topic) (
    rate(api_trades_consumer_lag_seconds_bucket[1m])
  )
)
While the “average lag” line shot upward, our high percentiles — including the 99.99th — remained absolutely flat and healthy 🤯.
If real lag had occurred, the percentile curves (at least one bucket) would have been the first to show it.
This contradiction alone made the situation feel impossible.
Yet the average lag chart told a different story.
🚨 Act I — The Mystery of the Impossible Lag
The suspicious part wasn’t just the spikes — it was how natural they appeared.
They didn’t look like random noise or periodic artifacts. They looked like real phenomena: rising smoothly, peaking, then falling.
Except they weren’t real, as our external monitoring showed.
There was no backlog.
No consumer slowdown.
No offset stalls.
No correlated CPU or network dips.
No smoking gun in logs.
But the average-lag formula behind the chart insisted something dramatic was happening:
rate(api_trades_consumer_lag_seconds_sum[1m])
/
rate(api_trades_consumer_lag_seconds_count[1m])
This expression assumes that both *_sum and *_count behave like monotonically increasing counters.
Soon we would learn that they… didn’t.
🔎 Act II — The Histograms That Cried Wolf
The lags were recorded as a Prometheus histogram:
val tradesConsumerLag: Histogram =
    Histogram
        .builder()
        .name("api_trades_consumer_lag_seconds")
        .help("Distribution of trades Kafka consumer lag.")
        .classicUpperBounds(0.1, 0.5, 1.0, 2.0, 5.0, 10.0, 30.0, 60.0, 120.0)
        .labelNames("topic")
        .register(registry)
Note: The metrics and expressions in this post were simplified.
Histograms always expose:
<metric>_count
<metric>_sum
<metric>_bucket (per-bucket breakdowns)
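On the metrics endpoint, that looks roughly like the sample below (illustrative numbers and topic label, only a few buckets shown):
api_trades_consumer_lag_seconds_bucket{topic="trades",le="0.1"} 10213
api_trades_consumer_lag_seconds_bucket{topic="trades",le="0.5"} 10874
api_trades_consumer_lag_seconds_bucket{topic="trades",le="+Inf"} 10902
api_trades_consumer_lag_seconds_sum{topic="trades"} 1423.7
api_trades_consumer_lag_seconds_count{topic="trades"} 10902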
Prometheus describes *_sum and *_count as:
Counters …as long as all observations are non-negative.
That condition turned out to be the heart of the problem.
⏱️ Act III — When Clocks Drift, Lag Becomes Negative
Our lag was computed like this:
val lagSec = (System.currentTimeMillis() - message.creationTimestampMs) / 1000.0
// update the histogram
monitoring.tradesConsumerLag.labelValues(topic).observe(lagSec)
But time in distributed systems is never perfectly aligned:
producers and consumers live on different machines
clocks drift
NTP adjustments jump time forward or backward
containers inherit imperfect host time
message timestamps come from different system layers
So sometimes messages come from the future, even within the same cluster:
consumer_time < message_creation_timestamp
Which produces:
negative lag
Harmless from a math perspective. Disastrous for histogram semantics.
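Here is a tiny, self-contained illustration of that failure mode (hypothetical values, not the production code):
// Illustrative only: the producer's clock runs ~300 ms ahead of the consumer's.
fun main() {
    val consumerNowMs = System.currentTimeMillis()
    val messageCreationTimestampMs = consumerNowMs + 300 // stamped "in the future"
    val lagSec = (consumerNowMs - messageCreationTimestampMs) / 1000.0
    println(lagSec) // -0.3: a perfectly valid Double, but a poisonous histogram observation
}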
💥 Act IV — The Day *_sum Went Down
Negative lag values subtract from the histogram’s *_sum.
A counter is supposed to only increase — never decrease. So when Prometheus sees:
*_sum(t) < *_sum(t - 1)
it concludes:
“Counter reset detected — must have been a restart!”
Prometheus then applies counter-reset compensation logic inside increase() and rate().
This produces artificial, inflated spikes — the “phantom lag” we were trying to understand.
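Here is a minimal sketch of that compensation logic (my own simplified model, not the actual Prometheus code), showing how a 0.5-second dip in *_sum turns into an increase of roughly a thousand "seconds of lag":
// Simplified model of counter-reset handling: a decrease is treated as
// "the counter restarted from zero", so the new absolute value is added
// instead of the (negative) delta.
fun increaseWithResetCompensation(sumSamples: List<Double>): Double {
    var total = 0.0
    for (i in 1 until sumSamples.size) {
        val delta = sumSamples[i] - sumSamples[i - 1]
        total += if (delta >= 0) delta else sumSamples[i]
    }
    return total
}

fun main() {
    // *_sum over one window: a single -0.5 s observation makes it dip from 1000.0 to 999.5
    val sums = listOf(998.0, 999.0, 1000.0, 999.5, 1000.5)
    println(increaseWithResetCompensation(sums)) // 1002.5 instead of the true 2.5
}
The size of such a spike tracks the absolute value of *_sum at that moment, not any real change in lag, which is why the spikes looked so large and yet so natural.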
You can see such unexpected spikes on the “increase” chart:
increase(api_trades_consumer_lag_seconds_sum[1m])

But once we graphed the raw *_sum (see below), the truth became obvious: it occasionally dipped downward. Each dip corresponded to a negative lag sample.
api_trades_consumer_lag_seconds_sum

Such dips also corresponded to the positive spikes in the “increase” chart.
Prometheus wasn’t wrong. It was faithfully applying counter rules to a metric that violated monotonically increasing counter assumptions.
🧠 Act V — The Hidden Lesson About Histogram Semantics
Prometheus histograms are powerful. They let you:
understand latency distributions
compute percentiles
derive averages
track performance patterns
But they require something some engineers don’t explicitly think about:
✔️ All observations must be non-negative.
Violating that rule leads to:
*_sum becoming non-monotonic
*_count becoming inconsistent
increase() and rate() misbehaving
averages producing nonsensical spikes
dashboards effectively lying
Most histogram use cases (latency, durations) naturally satisfy this.
Metrics based on time differences do not.
🛠️ Act VI — The One-Line Fix
After understanding the issue, the solution was beautifully simple:
import kotlin.math.max

// wrong
// val lagSec = (System.currentTimeMillis() - message.creationTimestampMs) / 1000.0
// right!
val lagSec = max(0.0, (System.currentTimeMillis() - message.creationTimestampMs) / 1000.0)
One small clamp.
No more negative observations. No more histogram resets. No more phantom spikes.
The average lag chart immediately became stable and accurate.
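If clamping at every call site feels fragile, the same guard can live in one small helper (a sketch with my own naming, not the real codebase):
import kotlin.math.max

// Hypothetical helper: compute consumer lag and clamp it in one place,
// so no call site can push a negative observation into a histogram.
fun nonNegativeLagSeconds(nowMs: Long, creationTimestampMs: Long): Double =
    max(0.0, (nowMs - creationTimestampMs) / 1000.0)

// usage:
// monitoring.tradesConsumerLag.labelValues(topic)
//     .observe(nonNegativeLagSeconds(System.currentTimeMillis(), message.creationTimestampMs))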
✨ Act VII — Lessons Learned
This was a great reminder that even mature, well-understood systems have edge cases that surface only under subtle circumstances.
If you use Prometheus histograms for:
lag
TTL
“age” metrics
“seconds since event”
time difference measurements
remember:
Clock skew + negative values = Prometheus hallucinations.
This experience illustrated a few broader truths:
Time across machines is never perfectly synchronized
Observability tools can amplify tiny inconsistencies
Histogram semantics must be respected
Counter math is ruthless when assumptions break
The most interesting bugs come from things that “can’t happen”
If you ever see an average histogram-derived metric spike unexpectedly:
Check for negative observations
Check for clock drift
Clamp the value with max(0, value) if necessary
Don't immediately blame Kafka, Prometheus, Kubernetes, or your deployment
Sometimes, the culprit is simply that time lied to you.
Happy monitoring — and may your clocks stay reasonably aligned. 🕒