<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0"><channel><title><![CDATA[Alexander Kharitonov's blog]]></title><description><![CDATA[A software engineer with a spark in the eyes. I like to build the best things in the world.]]></description><link>https://blog.alexanderkharitonov.com</link><generator>RSS for Node</generator><lastBuildDate>Wed, 29 Apr 2026 14:58:17 GMT</lastBuildDate><atom:link href="https://blog.alexanderkharitonov.com/rss.xml" rel="self" type="application/rss+xml"/><language><![CDATA[en]]></language><ttl>60</ttl><item><title><![CDATA[Humans are no longer your primary customers]]></title><description><![CDATA[For decades, product companies have obsessed over GUIs, click-through rates, and capturing human attention. But a massive paradigm shift is happening right now:

Before: Humans interacted directly wit]]></description><link>https://blog.alexanderkharitonov.com/humans-are-no-longer-your-primary-customers</link><guid isPermaLink="true">https://blog.alexanderkharitonov.com/humans-are-no-longer-your-primary-customers</guid><dc:creator><![CDATA[Alexander Kharitonov]]></dc:creator><pubDate>Mon, 23 Feb 2026 23:57:13 GMT</pubDate><content:encoded><![CDATA[<p>For decades, product companies have obsessed over GUIs, click-through rates, and capturing human attention. But a massive paradigm shift is happening right now:</p>
<ul>
<li><p>Before: Humans interacted directly with software.</p>
</li>
<li><p>Now: Humans delegate goals to AI Agents, and the Agents interact with the software on their behalf.</p>
</li>
</ul>
<p>The reality check: If your company doesn’t expose clean, accessible interfaces for AI Agents to use, your product will become completely invisible to the end-user. The companies that win the next decade won't just have the most intuitive UI—they will have the most accessible APIs for AI delegation. If an AI can't use your product, you will lose the market.</p>
<p>Will your product survive the AI-first era?</p>
<img src="https://cloudmate-test.s3.us-east-1.amazonaws.com/uploads/covers/6917f52114dc406b0c6d8d32/0d9eb6e3-46a0-4aa7-ba03-3f775ebb3d74.png" alt="" style="display:block;margin:0 auto" />]]></content:encoded></item><item><title><![CDATA[The New Definition of Being Smart in the Age of AI]]></title><description><![CDATA[The bar keeps moving.

Yesterday’s breakthrough was building smarter machines.

Today’s breakthrough is doing the work those machines can’t do.


This is where human creativity, reasoning, and insight matter more than ever.]]></description><link>https://blog.alexanderkharitonov.com/the-new-definition-of-being-smart-in-the-age-of-ai</link><guid isPermaLink="true">https://blog.alexanderkharitonov.com/the-new-definition-of-being-smart-in-the-age-of-ai</guid><category><![CDATA[AI]]></category><category><![CDATA[Intelligence]]></category><category><![CDATA[human]]></category><dc:creator><![CDATA[Alexander Kharitonov]]></dc:creator><pubDate>Wed, 19 Nov 2025 14:00:19 GMT</pubDate><content:encoded><![CDATA[<p>The bar keeps moving.</p>
<ul>
<li><p>Yesterday’s breakthrough was building smarter machines.</p>
</li>
<li><p>Today’s breakthrough is doing the work those machines can’t do.</p>
</li>
</ul>
<p>This is where human creativity, reasoning, and insight matter more than ever.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1763560741665/1eb9bd90-3e42-4149-89a0-26b234e12c48.png" alt class="image--center mx-auto" /></p>
]]></content:encoded></item><item><title><![CDATA[The Hidden Prometheus Histogram Trap That Created “Phantom” Kafka Lag Spikes]]></title><description><![CDATA[This issue surfaced while I was refactoring part of the monitoring stack for Talos Trading, where I work on Coin Metrics products. It was one of those situations where a routine improvement led to a behavior so strange that it demanded investigation....]]></description><link>https://blog.alexanderkharitonov.com/the-hidden-prometheus-histogram-trap-that-created-phantom-kafka-lag-spikes</link><guid isPermaLink="true">https://blog.alexanderkharitonov.com/the-hidden-prometheus-histogram-trap-that-created-phantom-kafka-lag-spikes</guid><category><![CDATA[#prometheus]]></category><category><![CDATA[distributed systems]]></category><category><![CDATA[clock-synchronization]]></category><category><![CDATA[monitoring]]></category><category><![CDATA[kafka]]></category><dc:creator><![CDATA[Alexander Kharitonov]]></dc:creator><pubDate>Sun, 16 Nov 2025 13:57:43 GMT</pubDate><content:encoded><![CDATA[<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1763254313362/27ed45a6-fcbd-46f4-9737-1cdefbcc876c.png" alt /></p>
<p>This issue surfaced while I was refactoring part of the monitoring stack for <a target="_blank" href="https://talos.com">Talos Trading</a>, where I work on <a target="_blank" href="https://coinmetrics.io">Coin Metrics</a> products. It was one of those situations where a routine improvement led to a behavior so strange that it demanded investigation.</p>
<p>One afternoon, our recently updated Kafka consumer lag dashboards started showing <strong>significant spikes</strong> affecting multiple Kafka topics on the same broker. At first glance, the chart looked like we were falling dangerously behind in consuming messages.</p>
<p>But something was off.</p>
<h3 id="heading-the-9999th-percentile-completely-clean">The 99.99th percentile — completely clean</h3>
<pre><code class="lang-kotlin">histogram_quantile(
  <span class="hljs-number">0.9999</span>,
  sum <span class="hljs-keyword">by</span> (le, topic) (
    rate(api_trades_consumer_lag_seconds_bucket[<span class="hljs-number">1</span>m])
  )
)
</code></pre>
<p>While the “average lag” line shot upward, our <strong>high percentiles</strong> — including the 99.99th — remained absolutely flat and <strong>healthy</strong> 🤯.<br />If real lag had occurred, at least one of the upper buckets, and with it the percentile curves, would have been the first to show it.</p>
<p>This contradiction alone made the situation feel <strong>impossible</strong>.</p>
<p>Yet the <strong>average lag</strong> chart told a different story.</p>
<hr />
<h2 id="heading-act-i-the-mystery-of-the-impossible-lag">🚨 Act I — The Mystery of the Impossible Lag</h2>
<p>The suspicious part wasn’t just the spikes — it was how <em>natural</em> they appeared.<br />They didn’t look like random noise or periodic artifacts. They looked like real phenomena: rising smoothly, peaking, then falling.</p>
<p>Except <strong>they weren’t real</strong>, as our external monitoring showed.</p>
<ul>
<li><p>There was no backlog.</p>
</li>
<li><p>No consumer slowdown.</p>
</li>
<li><p>No offset stalls.</p>
</li>
<li><p>No correlated CPU or network dips.</p>
</li>
<li><p>No smoking gun in logs.</p>
</li>
</ul>
<p>But our average lag formula behind the chart insisted something dramatic was happening:</p>
<pre><code class="lang-kotlin">rate(api_trades_consumer_lag_seconds_sum[<span class="hljs-number">1</span>m])
/
rate(api_trades_consumer_lag_seconds_count[<span class="hljs-number">1</span>m])
</code></pre>
<p>This expression assumes that both <code>*_sum</code> and <code>*_count</code> behave like <strong>monotonically increasing counters</strong>.</p>
<p>Soon we would learn that they… <strong>didn’t</strong>.</p>
<hr />
<h2 id="heading-act-ii-the-histograms-that-cried-wolf">🔎 Act II — The Histograms That Cried Wolf</h2>
<p>The lags were recorded as a Prometheus <strong>histogram</strong>:</p>
<pre><code class="lang-kotlin"><span class="hljs-keyword">val</span> tradesConsumerLag: Histogram =
  Histogram
    .builder()
    .name(<span class="hljs-string">"api_trades_consumer_lag_seconds"</span>)
    .help(<span class="hljs-string">"Distribution of trades Kafka consumer lag."</span>)
    .classicUpperBounds(<span class="hljs-number">0.1</span>, <span class="hljs-number">0.5</span>, <span class="hljs-number">1.0</span>, <span class="hljs-number">2.0</span>, <span class="hljs-number">5.0</span>, <span class="hljs-number">10.0</span>, <span class="hljs-number">30.0</span>, <span class="hljs-number">60.0</span>, <span class="hljs-number">120.0</span>)
    .labelNames(<span class="hljs-string">"topic"</span>)
    .register(registry)
</code></pre>
<blockquote>
<p>Note: The metrics and expressions in this post were simplified.</p>
</blockquote>
<p>Histograms always expose:</p>
<ul>
<li><p><code>&lt;metric&gt;_count</code></p>
</li>
<li><p><code>&lt;metric&gt;_sum</code></p>
</li>
<li><p><code>&lt;metric&gt;_bucket</code> - bucket breakdowns</p>
</li>
</ul>
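<p>For the histogram defined above, a scrape of those three series might look roughly like this (label and sample values are purely illustrative, and only a few buckets are shown):</p>
<pre><code>api_trades_consumer_lag_seconds_bucket{topic="trades",le="0.1"} 10532
api_trades_consumer_lag_seconds_bucket{topic="trades",le="0.5"} 10700
api_trades_consumer_lag_seconds_bucket{topic="trades",le="+Inf"} 10703
api_trades_consumer_lag_seconds_count{topic="trades"} 10703
api_trades_consumer_lag_seconds_sum{topic="trades"} 1894.2
</code></pre>
<p>Each bucket is itself a cumulative counter of observations less than or equal to its <code>le</code> bound, and the <code>le="+Inf"</code> bucket always equals <code>*_count</code>.</p>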
<p>Prometheus describes <code>*_sum</code> and <code>*_count</code> as:</p>
<blockquote>
<p><strong>Counters</strong> …<em>as long as all observations are non-negative.</em></p>
</blockquote>
<p>That condition turned out to be the heart of the problem.</p>
<hr />
<h2 id="heading-act-iii-when-clocks-drift-lag-becomes-negative">⏱️ Act III — When Clocks Drift, Lag Becomes Negative</h2>
<p>Our lag was computed like this:</p>
<pre><code class="lang-kotlin"><span class="hljs-keyword">val</span> lagSec = (System.currentTimeMillis() - message.creationTimestampMs) / <span class="hljs-number">1000.0</span>
<span class="hljs-comment">// update the histogram</span>
monitoring.tradesConsumerLag.labelValues(topic).observe(lagSec)
</code></pre>
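<p>Where that creation timestamp comes from varies by pipeline. Purely for illustration, here is the same measurement taken straight off the Kafka record; <code>record.timestamp()</code> is assigned either by the producer (CreateTime) or by the broker (LogAppendTime), so in both cases it reflects some other machine’s clock:</p>
<pre><code class="lang-kotlin">// Illustrative sketch, not our production code: `consumer` is an
// already-configured KafkaConsumer, `monitoring.tradesConsumerLag` is the
// histogram defined earlier, and Duration is java.time.Duration.
val records = consumer.poll(Duration.ofMillis(500))
for (record in records) {
    // record.timestamp() comes from the producer or the broker, i.e. from a
    // clock that is not the consumer's.
    val lagSec = (System.currentTimeMillis() - record.timestamp()) / 1000.0
    monitoring.tradesConsumerLag.labelValues(record.topic()).observe(lagSec)
}
</code></pre>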
<p>But time in distributed systems is never perfectly aligned:</p>
<ul>
<li><p>producers and consumers live on different machines</p>
</li>
<li><p>clocks drift</p>
</li>
<li><p>NTP adjustments jump time forward or backward</p>
</li>
<li><p>containers inherit imperfect host time</p>
</li>
<li><p>message timestamps come from different system layers</p>
</li>
</ul>
<p>So sometimes messages come from the <strong>future,</strong> even within the same cluster:</p>
<pre><code class="lang-kotlin">consumer_time &lt; message_creation_timestamp
</code></pre>
<p>Which produces:</p>
<blockquote>
<p><strong>negative lag</strong></p>
</blockquote>
<p>Harmless from a math perspective. Disastrous for histogram semantics.</p>
<hr />
<h2 id="heading-act-iv-the-day-sum-went-down">💥 Act IV — The Day <code>*_sum</code> Went Down</h2>
<p>Negative lag values subtract from the histogram’s <code>*_sum</code>.</p>
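<p>To see the bookkeeping concretely, here is a tiny toy model (not our production code and not the real client library) of what a classic histogram does on every observation: each bucket whose upper bound is at or above the value gets +1, the count gets +1, and the value is added to the sum.</p>
<pre><code class="lang-kotlin">// Toy model of a classic Prometheus histogram, cumulated the way the
// exposition format presents it. Real clients differ in the details.
class ToyHistogram(private val upperBounds: DoubleArray) {
    val bucketCounts = LongArray(upperBounds.size + 1) // last slot is +Inf
    var count = 0L
    var sum = 0.0

    fun observe(value: Double) {
        for (i in upperBounds.indices) {
            if (value &lt;= upperBounds[i]) bucketCounts[i]++
        }
        bucketCounts[upperBounds.size]++ // the +Inf bucket counts everything
        count++                          // ...and so does _count
        sum += value                     // _sum is the only series that can shrink
    }
}

fun main() {
    val h = ToyHistogram(doubleArrayOf(0.1, 0.5, 1.0))
    h.observe(0.3)   // a normal lag sample
    h.observe(-2.0)  // a "message from the future" caused by clock skew
    println("count=${h.count}, sum=${h.sum}") // count=2, sum=-1.7
}
</code></pre>
<p>The buckets and the count still only move upward; the sum is the only series a negative observation can drag down.</p>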
<p>A counter is supposed to only increase — never decrease. So when Prometheus sees:</p>
<pre><code class="lang-kotlin">*_sum(t) &lt; *_sum(t - <span class="hljs-number">1</span>)
</code></pre>
<p>it concludes:</p>
<blockquote>
<p>“Counter reset detected — must have been a restart!”</p>
</blockquote>
<p>Prometheus then applies counter-reset compensation logic inside <code>increase()</code> and <code>rate()</code>.</p>
<p>This produces <strong>artificial, inflated spikes</strong> — the “phantom lag” we were trying to understand.</p>
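<p>A rough sketch of that compensation logic makes the effect obvious (the real implementation also extrapolates to the window boundaries; the sample values below are made up):</p>
<pre><code class="lang-kotlin">// Whenever a sample is lower than the one before it, Prometheus assumes the
// counter restarted from zero and adds the previous value back in.
fun increaseWithResetCompensation(samples: List&lt;Double&gt;): Double {
    var correction = 0.0
    for (i in 1 until samples.size) {
        if (samples[i] &lt; samples[i - 1]) correction += samples[i - 1]
    }
    return samples.last() - samples.first() + correction
}

fun main() {
    // Hypothetical *_sum samples scraped over one window; the dip to 1002.5
    // is a single negative observation of -0.5 seconds.
    val sums = listOf(1000.0, 1003.0, 1002.5, 1006.0)
    println(increaseWithResetCompensation(sums)) // 1009.0, not the true net change of 6.0
}
</code></pre>
<p>Because the correction re-adds the counter’s entire accumulated value, a dip of half a second in <code>*_sum</code> can surface as an increase of a thousand “lag seconds”. That is exactly the shape of the spikes on our dashboard.</p>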
<p>You can see such unexpected spikes on the “increase” chart:</p>
<pre><code class="lang-kotlin">increase(api_trades_consumer_lag_seconds_sum[<span class="hljs-number">1</span>m])
</code></pre>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1763260058859/7af1036d-ee90-4391-9473-9aee42929c18.png" alt class="image--center mx-auto" /></p>
<p>But once we graphed the <strong>raw</strong> <code>*_sum</code> (see below), the truth became obvious: it occasionally <strong>dipped downward</strong>. Each dip corresponded to a <strong>negative lag sample</strong>.</p>
<pre><code class="lang-kotlin">api_trades_consumer_lag_seconds_sum
</code></pre>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1763258411145/1a552d85-5860-4c18-9110-3b79f9572d55.png" alt class="image--center mx-auto" /></p>
<p>Such dips also corresponded to the positive spikes in the “increase” chart.</p>
<p>Prometheus wasn’t wrong. It was faithfully applying counter rules to a metric that <strong>violated monotonically increasing counter assumptions</strong>. This also resolves the Act I contradiction: <code>*_bucket</code> and <code>*_count</code> only ever move upward, so the percentile queries stayed healthy, while the average, which depends on <code>*_sum</code>, inherited every phantom reset.</p>
<hr />
<h2 id="heading-act-v-the-hidden-lesson-about-histogram-semantics">🧠 Act V — The Hidden Lesson About Histogram Semantics</h2>
<p>Prometheus histograms are powerful. They let you:</p>
<ul>
<li><p>understand latency distributions</p>
</li>
<li><p>compute percentiles</p>
</li>
<li><p>derive averages</p>
</li>
<li><p>track performance patterns</p>
</li>
</ul>
<p>But they rely on something many engineers never think about explicitly:</p>
<h3 id="heading-all-observations-must-be-non-negative">✔️ <strong>All observations must be non-negative.</strong></h3>
<p>Violating that rule leads to:</p>
<ul>
<li><p><code>*_sum</code> becoming non-monotonic</p>
</li>
<li><p>the <code>*_sum</code>/<code>*_count</code> pair no longer describing the real distribution</p>
</li>
<li><p><code>increase()</code> and <code>rate()</code> misbehaving</p>
</li>
<li><p><strong>averages producing nonsensical spikes</strong></p>
</li>
<li><p>dashboards effectively lying</p>
</li>
</ul>
<p>Most histogram use cases (latency, durations) naturally satisfy this.</p>
<p>Metrics based on <strong>time differences between two different clocks</strong> do not.</p>
<hr />
<h2 id="heading-act-vi-the-one-line-fix">🛠️ Act VI — The One-Line Fix</h2>
<p>After understanding the issue, the solution was beautifully simple:</p>
<pre><code class="lang-kotlin"><span class="hljs-comment">// wrong</span>
<span class="hljs-comment">// val lagSec = (System.currentTimeMillis() - message.timestampMs) / 1000.0</span>

<span class="hljs-comment">// right!</span>
<span class="hljs-keyword">val</span> lagSec = max(<span class="hljs-number">0.0</span>, (System.currentTimeMillis() - message.timestampMs) / <span class="hljs-number">1000.0</span>)
</code></pre>
<p>One small clamp.</p>
<p>No more negative observations. No more histogram resets. No more phantom spikes.</p>
<p>The average lag chart immediately became stable and accurate.</p>
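<p>As a purely stylistic alternative, the same clamp can be written with Kotlin’s standard-library <code>coerceAtLeast</code>, which avoids the <code>kotlin.math.max</code> import; the behavior is identical:</p>
<pre><code class="lang-kotlin">val lagSec = ((System.currentTimeMillis() - message.creationTimestampMs) / 1000.0)
    .coerceAtLeast(0.0)
</code></pre>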
<hr />
<h2 id="heading-act-vii-lessons-learned">✨ Act VII — Lessons Learned</h2>
<p>This was a great reminder that even mature, well-understood systems have edge cases that surface only under subtle circumstances.</p>
<p>If you use Prometheus histograms for:</p>
<ul>
<li><p>lag</p>
</li>
<li><p>TTL</p>
</li>
<li><p>“age” metrics</p>
</li>
<li><p>“seconds since event”</p>
</li>
<li><p>time difference measurements</p>
</li>
</ul>
<p>remember:</p>
<h3 id="heading-clock-skew-negative-values-prometheus-hallucinations"><strong>Clock skew + negative values = Prometheus hallucinations.</strong></h3>
<p>This experience illustrated a few broader truths:</p>
<ul>
<li><p>Time across machines is <strong>never perfectly synchronized</strong></p>
</li>
<li><p>Observability tools can <strong>amplify tiny inconsistencies</strong></p>
</li>
<li><p>Histogram semantics must be <strong>respected</strong></p>
</li>
<li><p>Counter math is ruthless when assumptions break</p>
</li>
<li><p>The most interesting bugs come from things that <strong>“can’t happen”</strong></p>
</li>
</ul>
<p>If you ever see an <strong>average histogram-derived metric spike unexpectedly</strong>:</p>
<ol>
<li><p>Check for <strong>negative observations</strong></p>
</li>
<li><p>Check for <strong>clock drift</strong></p>
</li>
<li><p>Clamp the lag value with <code>max(0, value)</code> if necessary</p>
</li>
<li><p>Don’t immediately blame Kafka, Prometheus, Kubernetes, or your deployment</p>
</li>
</ol>
<p>Sometimes, the culprit is simply that time lied to you.</p>
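<p>To hunt for negative observations (check #1 above), PromQL itself can help: <code>resets()</code> counts how many times a series decreased within the range. A <code>*_sum</code> should only ever drop on a genuine process restart, so nonzero values between restarts are a strong hint that negative observations are sneaking in. With the metric from this post:</p>
<pre><code>resets(api_trades_consumer_lag_seconds_sum[1h])
</code></pre>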
<p>Happy monitoring — and may your clocks stay reasonably aligned. 🕒</p>
]]></content:encoded></item></channel></rss>