Star Fingerprint Dedup Gate: Hash on (repo_url + star_count + scan_window)
A two-layer deduplication gate fires at signal ingestion: a SHA256 fingerprint on (repo_url + star_count + scan_window), where scan_window is a daily bucket, suppresses duplicate scans of the same repo at the same star count; Jaccard similarity on hook text blocks near-duplicate content proposals before any LLM call.
What We Tested
We built two new deduplication gates that fire at signal ingestion, before all downstream processing.

Gate 1: star-fingerprint-gate.js computes SHA256(normalizedRepoUrl + ':' + starCount + ':' + scanWindow)[:16], where scanWindow = YYYY-MM-DD in UTC. The same repo with the same star_count on the same day is suppressed (zero new information). The same repo with a different star_count passes (genuine spike, new data point). A signal with no URL passes unconditionally. A canary alert fires when suppressed signals exceed 5x the passed signals in the same session. State is stored in star-fingerprint-state.json with a 24h rolling window and auto-prune.

Gate 2: hook-dedup-gate.js computes Jaccard similarity on normalized word sets extracted from signal.hook, then signal.title, then signal.description[:200] (in priority order). Threshold: a Jaccard score >= 0.65 marks a near-duplicate, which is suppressed. Stop word removal: 60 common English words are stripped before comparison. MIN_WORDS is 4: hooks with fewer than 4 meaningful words skip the similarity check (too noisy for reliable scoring). State is stored in hook-dedup-registry.json with a 24h rolling window.

Integration: both gates were added to run-scan.js as Steps 2.1 and 2.2, firing immediately after raw scan collection and before dedup-classifier (Step 2.3). Both are non-critical: each is wrapped in try/catch, and the pipeline continues if either gate throws.
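To make Gate 1 concrete, here is a minimal sketch of the fingerprint and the suppress/pass decision described above. It is not the contents of star-fingerprint-gate.js: the helper names (normalizeRepoUrl, scanWindow, computeFingerprint, createGate), the signal.url field, and the exact URL normalization rules are assumptions made for illustration, and an in-memory Set stands in for star-fingerprint-state.json.

```javascript
const crypto = require('node:crypto');

// Assumed normalization: trim, lowercase, drop trailing slashes and ".git".
function normalizeRepoUrl(url) {
  return url.trim().toLowerCase().replace(/\/+$/, '').replace(/\.git$/, '');
}

// Daily bucket in UTC, formatted YYYY-MM-DD.
function scanWindow(date = new Date()) {
  return date.toISOString().slice(0, 10);
}

// First 16 hex characters of SHA256(url:stars:window).
function computeFingerprint(repoUrl, starCount, window) {
  const key = `${normalizeRepoUrl(repoUrl)}:${starCount}:${window}`;
  return crypto.createHash('sha256').update(key).digest('hex').slice(0, 16);
}

// In-memory stand-in for the persisted state file (no 24h pruning here).
function createGate(canaryRatio = 5) {
  const seen = new Set();
  const stats = { total: 0, passed: 0, suppressed: 0 };
  return {
    stats,
    // Returns true if the signal passes, false if it is suppressed.
    check(signal, now = new Date()) {
      stats.total += 1;
      if (!signal.url) {
        stats.passed += 1; // no URL: pass unconditionally
        return true;
      }
      const fp = computeFingerprint(signal.url, signal.star_count ?? 0, scanWindow(now));
      if (seen.has(fp)) {
        stats.suppressed += 1; // same repo, same stars, same day
        return false;
      }
      seen.add(fp);
      stats.passed += 1;
      return true;
    },
    // Canary: suppressed signals exceed canaryRatio times the passed signals.
    canaryTripped() {
      return stats.suppressed > canaryRatio * stats.passed;
    },
  };
}
```

With this sketch, calling check twice with the same URL and star_count on the same day passes the first signal and suppresses the second, while bumping star_count by one produces a new fingerprint and passes.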
The Numbers
Fingerprint Key
Gate Position
Content Proposal Dedup
Runaway Scanner Detection
Test Coverage
Field Alias Handling
Results
34/34 unit tests pass (node test-star-fingerprint-gate.js, exit code 0).

Star fingerprint gate (tests 1-14): fingerprints are 16-char hex strings; the same url+stars+window produces an identical fingerprint (deterministic); different star counts produce different fingerprints; URL case is normalized (Owner/Repo == owner/repo); the gate passes 1 of 2 identical signals and suppresses the duplicate; stats correctly report total/passed/suppressed; the stargazers_count field is treated identically to the stars field; the canary alert fires when the suppression ratio exceeds the threshold.

Hook dedup gate (tests 15-34): Jaccard of identical sets = 1.0; Jaccard with no overlap = 0.0; Jaccard with partial overlap is computed correctly (~0.333); the hook field takes priority over the title/description fallbacks; stop word removal strips common words while preserving content words; near-duplicate suppression fires at Jaccard >= 0.65; distinct hooks both pass without suppression; short hooks (< MIN_WORDS) bypass the similarity check unconditionally.

The pipeline health summary now includes starFingerprintSuppressed and hookDedupSuppressed fields for observability.
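The Jaccard cases listed above (1.0, 0.0, and roughly 0.333) can be reproduced with a short sketch. This is an illustration under assumptions, not the code in hook-dedup-gate.js: the stop word list is abbreviated (the real gate strips 60 words), the tokenizer is a guess, and the function names wordSet, jaccard, hookText, and isNearDuplicate are hypothetical.

```javascript
// Abbreviated stop word list for illustration only.
const STOP_WORDS = new Set([
  'a', 'an', 'the', 'and', 'or', 'of', 'to', 'in', 'on', 'for',
  'with', 'is', 'are', 'this', 'that', 'it',
]);
const MIN_WORDS = 4;
const THRESHOLD = 0.65;

// Lowercase, split into alphanumeric tokens, drop stop words.
function wordSet(text) {
  const words = text.toLowerCase().match(/[a-z0-9]+/g) ?? [];
  return new Set(words.filter((w) => !STOP_WORDS.has(w)));
}

// |A ∩ B| / |A ∪ B|; two empty sets are scored 0 so they never match.
function jaccard(a, b) {
  if (a.size === 0 && b.size === 0) return 0;
  let intersection = 0;
  for (const w of a) {
    if (b.has(w)) intersection += 1;
  }
  return intersection / (a.size + b.size - intersection);
}

// Field priority: hook, then title, then the first 200 chars of description.
function hookText(signal) {
  return signal.hook || signal.title || (signal.description ?? '').slice(0, 200);
}

// priorWordSets stands in for the entries kept in hook-dedup-registry.json.
function isNearDuplicate(signal, priorWordSets) {
  const words = wordSet(hookText(signal));
  if (words.size < MIN_WORDS) return false; // too short to score reliably
  return priorWordSets.some((prior) => jaccard(words, prior) >= THRESHOLD);
}

// Partial overlap: 2 shared words out of 6 distinct words.
const a = wordSet('rust async runtime benchmark');
const b = wordSet('rust async web framework');
console.log(jaccard(a, b).toFixed(3)); // 0.333
```

The partial-overlap case works out as 2 shared words (rust, async) over 6 distinct words, which is where the ~0.333 figure comes from.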
Verdict
The (repo_url + star_count + scan_window) fingerprint is the correct key for this deduplication problem. It encodes 'same repo, same moment' precisely: a repo rescanned at the same star count within the same day is treated as the same event (correct), while the same repo with a higher star count passes through, whether later that day or the next day (correct, genuine new data). The language-independent key solves the specific failure mode that (repoId + language) missed: scanners that emit no language field were creating fingerprint collisions in the old gate.

The hook dedup gate prevents a subtler waste: two signals from different scan paths for the same trend would pass the star fingerprint gate (different URLs, same topic) but generate near-identical content proposals. Jaccard on word sets is deterministic, has zero npm dependencies, and works reliably for the 50-200 char hook texts KIO generates. In our tests, the 0.65 threshold caught near-duplicate wording without false positives on genuinely different topics. Both gates fire at ingestion, so zero LLM calls are wasted on signals that will be suppressed.
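Since both gates run at ingestion as non-critical steps, one way to picture the wiring is a small fail-open wrapper. This is a sketch only, not the contents of run-scan.js: runNonCriticalGate, applyIngestionGates, and the shape of the gates object are hypothetical, while the two health field names are the ones reported in the pipeline health summary.

```javascript
// Hypothetical wrapper: run a gate as a non-critical step that fails open.
// gateFn returns true to keep a signal and false to suppress it.
function runNonCriticalGate(healthField, gateFn, signals, health) {
  try {
    const kept = signals.filter((signal) => gateFn(signal));
    health[healthField] = signals.length - kept.length;
    return kept;
  } catch (err) {
    // The gate threw: log it and continue with the unfiltered signals.
    console.warn(`${healthField}: gate threw, continuing unfiltered (${err.message})`);
    health[healthField] = 0;
    return signals;
  }
}

// Both gates run on raw signals before any downstream (LLM-backed) step.
function applyIngestionGates(rawSignals, gates) {
  const health = {};
  let signals = rawSignals;
  // Step 2.1: star fingerprint gate
  signals = runNonCriticalGate('starFingerprintSuppressed', gates.starFingerprint, signals, health);
  // Step 2.2: hook dedup gate
  signals = runNonCriticalGate('hookDedupSuppressed', gates.hookDedup, signals, health);
  // The survivors would then be handed to dedup-classifier (Step 2.3).
  return { signals, health };
}
```

Failing open is the design choice worth noting: a bug in a dedup gate costs some wasted downstream work, but it never drops signals or halts the scan.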
The Real Surprise
The stargazers_count vs. stars field aliasing was a real edge case: the GitHub API returns stargazers_count, but some scanner normalizations emit stars. The gate handles both fields transparently: starCount = signal.star_count ?? signal.stars ?? signal.stargazers_count ?? 0. Without this alias handling, repos scanned by different code paths (one using the raw GitHub API response, one using the normalized signal format) would generate different fingerprints for the same event. This is a silent failure mode: both signals pass, the board sees two items for the same star spike, and the dedup gate appears to work while missing field-aliased duplicates. The fix is in the fingerprint computation itself, not in upstream normalization.
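The aliasing failure is easy to demonstrate. In the sketch below, the resolution chain is the one quoted above; everything else (the function names, the signal.url field, the sample values, and the simplified lowercase-only URL normalization) is assumed for illustration.

```javascript
const crypto = require('node:crypto');

// Alias resolution as quoted above: first field that is not null/undefined wins.
function resolveStarCount(signal) {
  return signal.star_count ?? signal.stars ?? signal.stargazers_count ?? 0;
}

function sha16(key) {
  return crypto.createHash('sha256').update(key).digest('hex').slice(0, 16);
}

// Alias-aware fingerprint (URL normalization simplified to lowercasing).
function fingerprint(signal, window) {
  return sha16(`${signal.url.toLowerCase()}:${resolveStarCount(signal)}:${window}`);
}

// Naive variant that reads only `stars`, shown for contrast.
function naiveFingerprint(signal, window) {
  return sha16(`${signal.url.toLowerCase()}:${signal.stars}:${window}`);
}

// The same star spike seen through two hypothetical code paths.
const rawApiShape = { url: 'https://github.com/owner/repo', stargazers_count: 1200 };
const normalizedShape = { url: 'https://github.com/owner/repo', stars: 1200 };
const day = '2024-01-15';

console.log(fingerprint(rawApiShape, day) === fingerprint(normalizedShape, day)); // true
console.log(naiveFingerprint(rawApiShape, day) === naiveFingerprint(normalizedShape, day)); // false
```

In the naive variant the raw API shape hashes the string "undefined" in place of the star count, so the two signals get different fingerprints and both pass the gate.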