
The Hidden Prometheus Histogram Trap That Created “Phantom” Kafka Lag Spikes

How a tiny clock skew and one missing `max(0, lag)` turned into a thrilling debugging adventure


This issue surfaced while I was refactoring part of the monitoring stack for Talos Trading, where I work on Coin Metrics products. It was one of those situations where a routine improvement led to a behavior so strange that it demanded investigation.

One afternoon, our recently updated Kafka consumer lag dashboards started showing significant spikes affecting multiple Kafka topics on the same broker. At first glance, the chart looked like we were falling dangerously behind in consuming messages.

But something was off.

The 99.99th percentile — completely clean

histogram_quantile(
  0.9999,
  sum by (le, topic) (
    rate(api_trades_consumer_lag_seconds_bucket[1m])
  )
)

While the “average lag” line shot upward, our high percentiles — including the 99.99th — remained absolutely flat and healthy 🤯.
If real lag had occurred, the percentile curves, or at least one histogram bucket, would have been the first to show it.

This contradiction alone made the situation feel impossible.

Yet the average lag chart told a different story.


🚨 Act I — The Mystery of the Impossible Lag

The suspicious part wasn’t just the spikes — it was how natural they appeared.
They didn’t look like random noise or periodic artifacts. They looked like real phenomena: rising smoothly, peaking, then falling.

Except they weren’t real, as our external monitoring showed.

  • There was no backlog.

  • No consumer slowdown.

  • No offset stalls.

  • No correlated CPU or network dips.

  • No smoking gun in logs.

But the average-lag formula behind the chart insisted something dramatic was happening:

rate(api_trades_consumer_lag_seconds_sum[1m])
/
rate(api_trades_consumer_lag_seconds_count[1m])

This expression assumes that both *_sum and *_count behave like monotonically increasing counters.
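Mechanically, the expression divides the per-window growth of `*_sum` by the per-window growth of `*_count`. A minimal sketch with made-up deltas (illustrative only, not any real Prometheus API):

```kotlin
// Hypothetical sketch: the "average lag" is the growth of *_sum divided
// by the growth of *_count over the same scrape window.
fun averageLag(sumDelta: Double, countDelta: Double): Double =
    if (countDelta > 0.0) sumDelta / countDelta else 0.0

fun main() {
    // e.g. _sum grew by 12.0 lag-seconds while _count grew by 8 observations
    println(averageLag(12.0, 8.0)) // 1.5: average lag of 1.5s per message
}
```

This only yields a meaningful average if both deltas are non-negative, which is exactly the assumption about to break.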

Soon we would learn that they… didn’t.


🔎 Act II — The Histograms That Cried Wolf

The lags were recorded as a Prometheus histogram:

val tradesConsumerLag: Histogram =
  Histogram
    .builder()
    .name("api_trades_consumer_lag_seconds")
    .help("Distribution of trades Kafka consumer lag.")
    .classicUpperBounds(0.1, 0.5, 1.0, 2.0, 5.0, 10.0, 30.0, 60.0, 120.0)
    .labelNames("topic")
    .register(registry)

Note: The metrics and expressions in this post were simplified.

Histograms always expose:

  • <metric>_count (total number of observations)

  • <metric>_sum (running sum of all observed values)

  • <metric>_bucket (cumulative per-bucket counts)

Prometheus describes *_sum and *_count as:

“Counters, as long as all observations are non-negative.”

That condition turned out to be the heart of the problem.


⏱️ Act III — When Clocks Drift, Lag Becomes Negative

Our lag was computed like this:

val lagSec = (System.currentTimeMillis() - message.creationTimestampMs) / 1000.0
// update the histogram (labelValues in the 1.x Java client)
monitoring.tradesConsumerLag.labelValues(topic).observe(lagSec)

But time in distributed systems is never perfectly aligned:

  • producers and consumers live on different machines

  • clocks drift

  • NTP adjustments jump time forward or backward

  • containers inherit imperfect host time

  • message timestamps come from different system layers

So sometimes messages come from the future, even within the same cluster:

consumer_time < message_creation_timestamp

Which produces:

negative lag

Harmless from a math perspective. Disastrous for histogram semantics.
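To see why, here is a toy model of the `_sum` series; a hypothetical stand-in for the client library, not its actual code:

```kotlin
// Minimal sketch: a histogram's _sum is just a running total of
// observations, so a single negative lag sample makes the exposed
// series dip instead of climbing.
class FakeHistogramSum {
    var sum = 0.0
        private set
    fun observe(value: Double) { sum += value }
}

fun main() {
    val h = FakeHistogramSum()
    h.observe(2.0)   // normal lag: _sum = 2.0
    h.observe(1.5)   // normal lag: _sum = 3.5
    h.observe(-0.4)  // "message from the future": _sum drops below 3.5
    println(h.sum)   // roughly 3.1, no longer monotonic
}
```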


💥 Act IV — The Day *_sum Went Down

Negative lag values subtract from the histogram’s *_sum.

A counter is supposed to only increase — never decrease. So when Prometheus sees:

*_sum(t) < *_sum(t - 1)

it concludes:

“Counter reset detected — must have been a restart!”

Prometheus then applies counter-reset compensation logic inside increase() and rate().

This produces artificial, inflated spikes — the “phantom lag” we were trying to understand.

You can see such unexpected spikes on the “increase” chart:

increase(api_trades_consumer_lag_seconds_sum[1m])

But once we graphed the raw *_sum (see below), the truth became obvious: it occasionally dipped downward. Each dip corresponded to a negative lag sample.

api_trades_consumer_lag_seconds_sum

Such dips also corresponded to the positive spikes in the “increase” chart.

Prometheus wasn’t wrong. It was faithfully applying counter rules to a metric that broke the monotonically-increasing-counter assumption.
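The reset compensation can be sketched like this. It is a simplified model of what increase() does, not Prometheus's actual implementation (which also extrapolates to the range boundaries):

```kotlin
// Sketch of counter-reset compensation: a sample lower than its
// predecessor is treated as a restart from zero, so the new value
// counts as pure growth.
fun compensatedIncrease(samples: List<Double>): Double {
    var increase = 0.0
    for (i in 1 until samples.size) {
        val delta = samples[i] - samples[i - 1]
        increase += if (delta >= 0) delta else samples[i] // dip => assumed reset
    }
    return increase
}

fun main() {
    // one negative observation pushed _sum from 100 down to 95
    println(compensatedIncrease(listOf(100.0, 95.0)))  // 95.0: a phantom spike
    println(compensatedIncrease(listOf(100.0, 110.0))) // 10.0: honest growth
}
```

A 5-second dip in `_sum` thus shows up as a 95-second jump in `increase()`, which is exactly the shape of the phantom spikes.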


🧠 Act V — The Hidden Lesson About Histogram Semantics

Prometheus histograms are powerful. They let you:

  • understand latency distributions

  • compute percentiles

  • derive averages

  • track performance patterns

But they rely on a precondition that many engineers never think about explicitly:

✔️ All observations must be non-negative.

Violating that rule leads to:

  • *_sum becoming non-monotonic (while *_count keeps climbing)

  • increase() and rate() firing spurious counter-reset compensation

  • averages producing nonsensical spikes

  • dashboards effectively lying

Most histogram use cases (latency, durations) naturally satisfy this.

Metrics based on time differences do not.


🛠️ Act VI — The One-Line Fix

After understanding the issue, the solution was beautifully simple:

// wrong
// val lagSec = (System.currentTimeMillis() - message.creationTimestampMs) / 1000.0

// right! (max comes from kotlin.math)
val lagSec = max(0.0, (System.currentTimeMillis() - message.creationTimestampMs) / 1000.0)

One small clamp.

No more negative observations. No more histogram resets. No more phantom spikes.

The average lag chart immediately became stable and accurate.
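As a quick sanity check on the clamp (the timestamps here are made up):

```kotlin
import kotlin.math.max

// A skewed clock that makes the message look 250ms "from the future"
// now observes 0.0 instead of a negative value.
fun clampedLagSec(nowMs: Long, creationMs: Long): Double =
    max(0.0, (nowMs - creationMs) / 1000.0)

fun main() {
    println(clampedLagSec(1_000_000L, 1_000_250L)) // 0.0 (raw lag was -0.25)
    println(clampedLagSec(1_002_000L, 1_000_000L)) // 2.0, normal lag
}
```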


✨ Act VII — Lessons Learned

This was a great reminder that even mature, well-understood systems have edge cases that surface only under subtle circumstances.

If you use Prometheus histograms for:

  • lag

  • TTL

  • “age” metrics

  • “seconds since event”

  • time difference measurements

remember:

Clock skew + negative values = Prometheus hallucinations.

This experience illustrated a few broader truths:

  • Time across machines is never perfectly synchronized

  • Observability tools can amplify tiny inconsistencies

  • Histogram semantics must be respected

  • Counter math is ruthless when assumptions break

  • The most interesting bugs come from things that “can’t happen”

If you ever see an average histogram-derived metric spike unexpectedly:

  1. Check for negative observations

  2. Check for clock drift

  3. Clamp the lag value with max(0, value) if necessary

  4. Don’t immediately blame Kafka, Prometheus, Kubernetes, or your deployment

Sometimes, the culprit is simply that time lied to you.

Happy monitoring — and may your clocks stay reasonably aligned. 🕒