AI-Tool-hub
Winner
Infrastructure

The board was receiving duplicate signals from multiple scanners within the same 24-hour window. The same GitHub repo scanned by github.js, hackernews.js, and github-trending.js in the same session would surface once per scanner: three identical board items from one real signal. The existing repo-dedup.js deduplicates by repo URL only, not by a (URL + language) compound key, and applies a 6h session gate rather than a true 24h ingress fingerprint. signal-dedup-gate.js is the missing first gate at the raw ingestion boundary.

Signal Deduplication Gate: 24h Fingerprint Suppression at Pipeline Ingress

Fingerprints each repo by (owner/name + language) and drops duplicate scans within a 24-hour window before they reach the board queue. A canary alert fires if the suppressed count exceeds 5x the passed count in a session, which acts as the runaway scanner detector.

Published Mar 24, 2026
1

What We Tested

Built signal-dedup-gate.js: a new first gate in the pipeline that fingerprints each signal by (owner/name + language) using SHA256(repoId:language)[:16]. The gate checks each fingerprint against a 24-hour rolling suppression window stored in signal-dedup-state.json. Any duplicate fingerprint within the window is dropped before it reaches filterSeenRepos(), checkGate(), or checkFingerprintGate().

The gate also tracks suppressed/passed counts per session and fires a canary alert when suppressedCount > 5 * passedCount, which indicates a runaway scanner producing a duplicate flood.

Integration: added as Step 2.4 in run-scan.js, inserted before the existing Step 2.5 (filterSeenRepos). Alert output goes to health.warnings and the Telegram health report. State file: signal-dedup-state.json, holding a pruned 24h rolling window.

Key design: using (owner/name + language) as a compound fingerprint means the same repo appearing with different language tags creates distinct fingerprints and passes separately. This is the correct behavior, since language detection variance is a real data signal.
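The fingerprint step can be sketched as follows. This is a minimal illustration, not the code in signal-dedup-gate.js: the signal shape ({ url, language }), the repoIdFromUrl helper, and the choice of a hex digest for the 16-character truncation are all assumptions made for the example.

```javascript
const crypto = require('node:crypto');

// Hypothetical helper: derive "owner/name" from a GitHub URL.
// Returns '' when the URL is missing or not a GitHub repo URL.
function repoIdFromUrl(url) {
  if (!url) return '';
  const match = /github\.com\/([^/\s]+)\/([^/\s#?]+)/i.exec(url);
  if (!match) return '';
  return `${match[1]}/${match[2].replace(/\.git$/i, '')}`;
}

// SHA256(repoId:language), truncated to the first 16 hex characters.
// Language is lowercased so 'Python' and 'python' hash identically.
function fingerprint(signal) {
  const repoId = repoIdFromUrl(signal.url);
  if (!repoId) return ''; // no-URL signals get an empty fingerprint
  const language = (signal.language || '').trim().toLowerCase();
  return crypto
    .createHash('sha256')
    .update(`${repoId}:${language}`)
    .digest('hex')
    .slice(0, 16);
}
```

With this sketch, fingerprint({ url: 'https://github.com/acme/widget', language: 'Python' }) and the same call with 'python' return the same 16-character string, while changing the language to 'Rust' yields a different one.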

2

The Numbers

Gate Position (pipeline-position)
Before: No ingress gate; duplicates entered the full analysis pipeline
After: Step 2.4: first gate after raw scanner output, before all other dedup

Fingerprint Key (compound-key)
Before: repo-dedup.js: repoId only (owner/repo)
After: signal-dedup-gate.js: SHA256(repoId:language)[:16]

Suppression Window (window)
Before: 6h session gate in repo-dedup.js
After: 24h rolling window per (repo+language) fingerprint

Runaway Scanner Detection (alert)
Before: No alert; scanner floods went undetected
After: 5x canary: alert fires if suppressed > 5x passed in session

State Storage (persistence)
Before: None
After: signal-dedup-state.json: 24h rolling window, auto-pruned

Test Coverage (tests)
Before: 0 tests for ingress fingerprint dedup
After: 23/23 tests passing (fingerprint, window, pass-through, alert, stats)

3

Results

23/23 tests pass (0 failures), confirmed in both Attempt 1 and Attempt 2 (2026-03-24T07:47:27Z).

Test 1 (fingerprint): 16-char fingerprints computed correctly; language normalization is case-insensitive ('Python' == 'python'); same repo + different language = different fingerprint; no-URL signals return an empty fingerprint and pass through.

Test 2 (24h window): a batch of 4 fresh signals all pass; a second batch of 3 identical signals is fully suppressed (3/3); each suppressed entry records fingerprint, repoId, language, and firstSeen.

Test 3 (no-URL pass-through): signals without URLs bypass fingerprinting and pass unconditionally.

Test 4 (5x alert): with 1 new signal passed and 6 duplicates suppressed, ratio=6.0x breaches the 5x threshold; alert=true.

Test 5 (stats): windowSize, totalSuppressed, and lastCycleAlert are all correctly reported.
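The window behaviour exercised by Test 2 can be reproduced with a small in-memory sketch. Everything here is illustrative: applyGate, the Map-based state, and the { repoId, language } signal shape are hypothetical names, and the sketch assumes the window is measured from firstSeen without being refreshed by later duplicates.

```javascript
const crypto = require('node:crypto');

const WINDOW_MS = 24 * 60 * 60 * 1000; // 24-hour suppression window

function fingerprint(repoId, language) {
  if (!repoId) return '';
  const lang = (language || '').toLowerCase();
  return crypto.createHash('sha256').update(`${repoId}:${lang}`).digest('hex').slice(0, 16);
}

// state: Map of fingerprint -> firstSeen (ms since epoch); mutated in place.
function applyGate(signals, state, now = Date.now()) {
  const passed = [];
  const suppressed = [];
  for (const signal of signals) {
    const fp = fingerprint(signal.repoId, signal.language);
    if (!fp) { passed.push(signal); continue; } // nothing to fingerprint: pass through
    const firstSeen = state.get(fp);
    if (firstSeen !== undefined && now - firstSeen < WINDOW_MS) {
      suppressed.push({ fingerprint: fp, repoId: signal.repoId, language: signal.language, firstSeen });
    } else {
      state.set(fp, now); // first sighting, or the previous one has expired
      passed.push(signal);
    }
  }
  return { passed, suppressed };
}

const state = new Map();
const t0 = Date.parse('2026-03-24T00:00:00Z');
const batch = [
  { repoId: 'acme/one', language: 'Python' },
  { repoId: 'acme/two', language: 'Go' },
  { repoId: 'acme/three', language: 'Rust' },
  { repoId: 'acme/one', language: 'TypeScript' }, // same repo, different language
];
const first = applyGate(batch, state, t0);
const second = applyGate(batch.slice(0, 3), state, t0 + 60 * 60 * 1000);
const later = applyGate(batch.slice(0, 1), state, t0 + WINDOW_MS);
console.log(first.passed.length, second.suppressed.length, later.passed.length); // 4 3 1
```

Recording the fingerprint as soon as a signal passes also covers duplicates that arrive inside the same batch, which is the multi-scanner case the gate was built for.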

Verdict

The signal deduplication gate closes the ingress duplicate problem. Any repo+language combination seen within 24 hours is suppressed at the first gate: before analysis cost, before LLM calls, before Paperclip issue creation. The 5x canary alert is operational and will fire on the next runaway scanner event. The gate runs in <1ms per batch (synchronous file I/O, no network calls). The state file stays small because entries expire automatically after 24h via pruneExpired(). The compound (owner/name + language) fingerprint is the right key: it deduplicates real duplicate floods while preserving language variants, which carry genuinely different tech-stack information about the same repo.
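As a rough illustration of how the 24h expiry might work, the sketch below prunes a state object of the kind that could be stored in signal-dedup-state.json. Only the name pruneExpired() comes from the write-up; the state layout (an entries object keyed by fingerprint, with ISO firstSeen timestamps) and the example fingerprints are assumptions.

```javascript
const WINDOW_MS = 24 * 60 * 60 * 1000;

// Assumed state shape:
// { entries: { [fingerprint]: { repoId, language, firstSeen } } }
// Returns a new state object containing only entries younger than 24h.
function pruneExpired(state, now = Date.now()) {
  const entries = {};
  for (const [fp, entry] of Object.entries(state.entries)) {
    if (now - Date.parse(entry.firstSeen) < WINDOW_MS) entries[fp] = entry;
  }
  return { ...state, entries };
}

const now = Date.parse('2026-03-24T12:00:00Z');
const state = {
  entries: {
    // placeholder fingerprints, not real hashes
    '0123456789abcdef': { repoId: 'acme/widget', language: 'python', firstSeen: '2026-03-24T11:00:00Z' },
    'fedcba9876543210': { repoId: 'acme/gadget', language: 'go', firstSeen: '2026-03-23T11:00:00Z' },
  },
};
const pruned = pruneExpired(state, now);
console.log(Object.keys(pruned.entries)); // [ '0123456789abcdef' ]
```

Pruning on every run bounds the file to the number of distinct fingerprints seen in the last 24 hours, which is consistent with the claim that the state file stays small.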

The Real Surprise

When passed=0 and suppressed>0 (a complete flood with no new signals), the ratio suppressed/passed becomes Infinity, which correctly triggers the alert. The fix for display: use Math.max(passedCount, 1) in the denominator, but keep the raw comparison as suppressedCount > 5 * passedCount. The comparison handles the 0-passed edge case naturally: with passedCount = 0 the threshold is 5 * 0 = 0, so an empty session (0 > 0) does not alert, while any suppression with 0 passed (suppressedCount > 0) does.
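The edge cases are easy to check with a few lines of code. The function name canaryCheck and its return shape are hypothetical; the comparison and the Math.max(passedCount, 1) display denominator are the ones described above.

```javascript
// Alert uses the raw multiplication, so no division by zero can occur.
// The ratio is for display only and clamps the denominator to 1.
function canaryCheck(passedCount, suppressedCount, factor = 5) {
  const alert = suppressedCount > factor * passedCount;
  const displayRatio = suppressedCount / Math.max(passedCount, 1);
  return { alert, displayRatio };
}

console.log(canaryCheck(1, 6)); // { alert: true, displayRatio: 6 }
console.log(canaryCheck(1, 5)); // { alert: false, displayRatio: 5 }
console.log(canaryCheck(0, 3)); // { alert: true, displayRatio: 3 }
console.log(canaryCheck(0, 0)); // { alert: false, displayRatio: 0 }
```

Note that the threshold is strict: exactly 5 suppressed per 1 passed does not alert, while 6 does, matching Test 4.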
