Winner

Infrastructure

The existing signal-dedup-gate.js fingerprints on (repoId + language), which misses the case where two scanners (github.js + github-trending.js) see the same repo at the same star count with no language field. The board was receiving duplicate signals where the same repo appeared with different scan paths but identical star counts: same information, different pipeline entry points. The (repo_url + star_count + scan_window) fingerprint directly encodes 'same repo, same moment in time' as the deduplication key.

Star Fingerprint Dedup Gate: Hash on (repo_url + star_count + scan_window)

Two-layer deduplication gate fires at signal ingestion: SHA256 fingerprint on (repo_url + star_count + daily_window) suppresses same-moment duplicate scans; Jaccard similarity on hook text blocks near-duplicate content proposals before any LLM call.

Published Mar 24, 2026

What We Tested

Built two new deduplication gates that fire at signal ingestion, before all downstream processing.

Gate 1: star-fingerprint-gate.js computes SHA256(normalizedRepoUrl + ':' + starCount + ':' + scanWindow)[:16], where scanWindow = YYYY-MM-DD UTC. Same repo + same star_count today → suppressed (zero new information). Same repo + different star_count → passes (genuine spike, new data point). No URL → passes unconditionally. A canary alert fires when suppressed signals exceed 5x passed signals in the same session. State is stored in star-fingerprint-state.json with a 24h rolling window and auto-prune.

Gate 2: hook-dedup-gate.js computes Jaccard similarity on normalized word sets extracted from signal.hook → signal.title → signal.description[:200] (in priority order). Threshold: Jaccard >= 0.65 → near-duplicate, suppress. Stop word removal: 60 common English words are stripped before comparison. MIN_WORDS: 4; hooks with fewer than 4 meaningful words skip the similarity check (too noisy for reliable scoring). State is stored in hook-dedup-registry.json with a 24h rolling window.

Integration: both gates were added to run-scan.js as Steps 2.1 and 2.2, firing immediately after raw scan collection and before dedup-classifier (Step 2.3). Both are non-critical: each is wrapped in try/catch, and the pipeline continues if either gate throws.
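As a minimal sketch of Gate 1, the fingerprint and suppression logic could look like the following. This is not the actual star-fingerprint-gate.js: the helper names (normalizeRepoUrl, scanWindow, starFingerprint, createStarGate), the repo_url field name, and the exact URL normalization rules are assumptions made for illustration, and an in-memory Set stands in for star-fingerprint-state.json.

```javascript
const crypto = require('node:crypto');

// Assumed normalization: trim, lowercase, drop trailing slashes and ".git".
function normalizeRepoUrl(url) {
  return String(url).trim().toLowerCase().replace(/\/+$/, '').replace(/\.git$/, '');
}

// The scan window is the UTC calendar day, formatted YYYY-MM-DD.
function scanWindow(date = new Date()) {
  return date.toISOString().slice(0, 10);
}

// First 16 hex characters of SHA256(normalizedRepoUrl:starCount:scanWindow).
function starFingerprint(repoUrl, starCount, window) {
  const key = `${normalizeRepoUrl(repoUrl)}:${starCount}:${window}`;
  return crypto.createHash('sha256').update(key).digest('hex').slice(0, 16);
}

// In-memory stand-in for the persisted state file (hypothetical API).
function createStarGate({ canaryRatio = 5 } = {}) {
  const seen = new Set();
  const stats = { total: 0, passed: 0, suppressed: 0 };

  function check(signal, now = new Date()) {
    stats.total += 1;
    // Signals without a URL pass unconditionally.
    if (!signal.repo_url) {
      stats.passed += 1;
      return { pass: true, reason: 'no-url' };
    }
    const fp = starFingerprint(signal.repo_url, signal.star_count ?? 0, scanWindow(now));
    if (seen.has(fp)) {
      stats.suppressed += 1;
      return { pass: false, fingerprint: fp };
    }
    seen.add(fp);
    stats.passed += 1;
    return { pass: true, fingerprint: fp };
  }

  // Canary: suppressed signals exceed canaryRatio times passed signals.
  function canaryTripped() {
    return stats.suppressed > canaryRatio * stats.passed;
  }

  return { check, canaryTripped, stats };
}

const gate = createStarGate();
const day = new Date('2026-03-24T10:00:00Z');
console.log(gate.check({ repo_url: 'https://github.com/Owner/Repo', star_count: 1200 }, day).pass); // true
console.log(gate.check({ repo_url: 'https://github.com/owner/repo/', star_count: 1200 }, day).pass); // false
console.log(gate.check({ repo_url: 'https://github.com/owner/repo', star_count: 1250 }, day).pass); // true
```

The second call is suppressed because URL normalization maps both spellings to the same key; the third passes because the star count changed within the same window.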

The Numbers

Fingerprint Key

Before: signal-dedup-gate.js: SHA256(repoId + language)[:16]; misses language-less signals
After: star-fingerprint-gate.js: SHA256(repo_url + star_count + scan_window)[:16]; language-independent
Tag: compound-key

Gate Position

Before: Dedup applied at board review or post-analysis
After: Step 2.1; fires immediately after raw scan, before any analysis or LLM call
Tag: pipeline-step

Content Proposal Dedup

Before: None; near-identical hooks generated for the same trending topic from different scan paths
After: hook-dedup-gate.js: Jaccard >= 0.65 on word sets blocks near-duplicate proposals
Tag: gate

Runaway Scanner Detection

Before: No alert; scanner floods went undetected
After: Canary: suppression > 5x passed in session → alert fires
Tag: alert

Test Coverage

Before: 0 tests for star-count fingerprint dedup
After: 34/34 tests passing (fingerprint, window, case-norm, canary, Jaccard, stop words, field aliasing)
Tag: tests

Field Alias Handling

Before: star_count vs stargazers_count mismatch produces different fingerprints for the same event
After: Transparent alias: star_count ?? stars ?? stargazers_count; same fingerprint regardless of field name
Tag: reliability

Results

34/34 unit tests pass (node test-star-fingerprint-gate.js, exit code 0).

Star fingerprint gate (tests 1-14): fingerprints are 16-char hex strings; same url+stars+window produces identical fingerprint (deterministic); different star counts produce different fingerprints; URL case normalization (Owner/Repo == owner/repo); gate passes 1 of 2 identical signals and suppresses the duplicate; stats correctly report total/passed/suppressed; stargazers_count field treated identically to stars field; canary alert fires when suppression ratio exceeds threshold.

Hook dedup gate (tests 15-34): Jaccard identical = 1.0; Jaccard no overlap = 0.0; Jaccard partial overlap computed correctly (~0.333); hook field has highest priority over title/description fallbacks; stop word removal correctly strips common words while preserving content words; near-duplicate suppression fires at Jaccard >= 0.65; distinct hooks both pass without suppression; short hooks (< MIN_WORDS) bypass similarity check unconditionally.

The pipeline health summary now includes starFingerprintSuppressed and hookDedupSuppressed fields for observability.
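To make the Jaccard cases above concrete, here is a minimal sketch of word-set similarity with stop word removal and a MIN_WORDS bypass. It is an illustration, not the actual hook-dedup-gate.js: the function names, the tokenizer regex, and the abbreviated stop word list are assumptions (the real gate strips 60 words), and an array stands in for hook-dedup-registry.json.

```javascript
// Abbreviated stop word list, for illustration only.
const STOP_WORDS = new Set([
  'a', 'an', 'the', 'and', 'or', 'of', 'to', 'in', 'on', 'for',
  'with', 'is', 'are', 'this', 'that', 'it',
]);
const MIN_WORDS = 4;
const THRESHOLD = 0.65;

// Lowercase, split into alphanumeric tokens, drop stop words.
function wordSet(text) {
  const words = String(text).toLowerCase().match(/[a-z0-9]+/g) ?? [];
  return new Set(words.filter((w) => !STOP_WORDS.has(w)));
}

// |A ∩ B| / |A ∪ B|; two empty sets are scored 0 here (a design choice).
function jaccard(a, b) {
  let intersection = 0;
  for (const w of a) if (b.has(w)) intersection += 1;
  const union = a.size + b.size - intersection;
  return union === 0 ? 0 : intersection / union;
}

// Priority order: hook, then title, then the first 200 chars of description.
// Using || so empty strings fall through is an assumption.
function hookText(signal) {
  return signal.hook || signal.title || (signal.description ?? '').slice(0, 200);
}

function createHookGate() {
  const registry = []; // word sets of hooks already accepted
  return function check(signal) {
    const words = wordSet(hookText(signal));
    // Too few meaningful words: skip the similarity check entirely.
    if (words.size < MIN_WORDS) return { pass: true, reason: 'too-short' };
    for (const prior of registry) {
      const score = jaccard(words, prior);
      if (score >= THRESHOLD) return { pass: false, score };
    }
    registry.push(words);
    return { pass: true };
  };
}

// Partial overlap: 2 shared words out of 6 distinct words = 1/3.
console.log(jaccard(wordSet('rust async runtime benchmarks'),
                    wordSet('async runtime memory profiling')));

const checkHook = createHookGate();
console.log(checkHook({ hook: 'New Rust web framework hits 10k stars overnight' }).pass); // true
console.log(checkHook({ hook: 'New Rust web framework hits 10k stars today' }).pass); // false
```

In the last two calls the hooks share 7 of 9 distinct words (Jaccard of about 0.78), so the second is suppressed at the 0.65 threshold.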

Verdict

The (repo_url + star_count + scan_window) fingerprint is the correct key for this deduplication problem. It encodes 'same repo, same moment' precisely: a repo re-scanned at an unchanged star count within the same UTC day is treated as the same event (correct), while the same repo with a different star count, or the same count on a later day, passes through (correct, genuine new data). The language-independent key solves the specific failure mode that (repoId + language) missed: duplicates from scanners that emit no language field slipped past the old gate. The hook dedup gate prevents a subtler waste: two signals from different scan paths for the same trend would pass the star fingerprint gate (different URLs, same topic) but generate near-identical content proposals. Jaccard on word sets is deterministic, has zero npm dependencies, and works reliably for the 50-200 char hook texts KIO generates. In the tests, the 0.65 threshold caught near-duplicate wording without false positives on genuinely different topics. Both gates fire at ingestion, so zero LLM calls are wasted on signals that will be suppressed.
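The ingestion-time wiring, with both gates treated as non-critical steps, might be sketched as follows. The runNonCritical helper and the stub gates are hypothetical and are not taken from run-scan.js; only the step numbers and the two health field names come from this write-up.

```javascript
// Run one gate over a batch. If the gate throws, log a warning and return
// the batch unchanged so the pipeline keeps going.
function runNonCritical(healthKey, gateFn, signals, health) {
  try {
    const kept = signals.filter((s) => gateFn(s).pass);
    health[healthKey] = signals.length - kept.length;
    return kept;
  } catch (err) {
    console.warn(`[${healthKey}] gate failed, continuing: ${err.message}`);
    return signals;
  }
}

// Stub gates for the demo: one dedups on an id, one always throws.
const seenIds = new Set();
const stubStarGate = (s) => {
  const duplicate = seenIds.has(s.id);
  seenIds.add(s.id);
  return { pass: !duplicate };
};
const brokenHookGate = () => {
  throw new Error('registry unreadable');
};

const health = { starFingerprintSuppressed: 0, hookDedupSuppressed: 0 };
let batch = [{ id: 'a' }, { id: 'a' }, { id: 'b' }];

batch = runNonCritical('starFingerprintSuppressed', stubStarGate, batch, health); // Step 2.1
batch = runNonCritical('hookDedupSuppressed', brokenHookGate, batch, health); // Step 2.2

console.log(batch.length); // 2
console.log(health); // { starFingerprintSuppressed: 1, hookDedupSuppressed: 0 }
```

The trade-off of this design is that a failing gate lets duplicates through rather than halting the scan, which matches the "non-critical" treatment described in What We Tested.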

The Real Surprise

The stargazers_count vs. stars field aliasing was a real edge case: GitHub API returns stargazers_count, but some scanner normalizations emit stars. The gate handles both fields transparently — starCount = signal.star_count ?? signal.stars ?? signal.stargazers_count ?? 0. Without this alias handling, repos scanned by different code paths (one using the raw GitHub API response, one using the normalized signal format) would generate different fingerprints for the same event. This is a silent failure mode — both signals pass, board sees two items for the same star spike, the dedup gate appears to work but misses field-aliased duplicates. The fix is in the fingerprint computation itself, not in upstream normalization.
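The alias chain can be checked with a short sketch. The resolveStarCount and fingerprint helpers and the repo_url field name are assumptions made for illustration; only the `??` chain itself is quoted from the gate.

```javascript
const crypto = require('node:crypto');

// Alias chain from the write-up: the first defined field wins.
// `??` only skips null/undefined, so an explicit star_count of 0 is kept.
function resolveStarCount(signal) {
  return signal.star_count ?? signal.stars ?? signal.stargazers_count ?? 0;
}

// Simplified fingerprint (lowercasing is the only URL normalization here).
function fingerprint(signal, window) {
  const url = String(signal.repo_url).toLowerCase();
  const key = `${url}:${resolveStarCount(signal)}:${window}`;
  return crypto.createHash('sha256').update(key).digest('hex').slice(0, 16);
}

// Same event seen through two code paths with different field names.
const rawApiShape = { repo_url: 'https://github.com/owner/repo', stargazers_count: 4200 };
const normalizedShape = { repo_url: 'https://github.com/owner/repo', stars: 4200 };

console.log(fingerprint(rawApiShape, '2026-03-24') === fingerprint(normalizedShape, '2026-03-24')); // true
```

Using `??` rather than `||` matters for the edge case of a repo with zero stars: `||` would treat 0 as missing and fall through to the next alias.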
