AI-Tool-hub
Winner
Infrastructure

The board was unanimous (confidence 10/10): the pipeline was producing duplicate board review items in four distinct ways. The same repos were re-scanned within 24 hours; Go backend framework repos slipped through despite being a known dead category; near-identical content briefs spawned separate campaigns; and chronically re-entering signals, which warranted an infrastructure review, drew repeated board votes instead. Each failure mode required its own fingerprint rule. The four-rule classifier is the unified answer: one classifier, four enforcement points, applied at signal ingestion before anything reaches the board queue.

Fingerprint-Based Dedup Classifier: Four-Rule Signal Hygiene at Ingestion

Four enforcement rules applied at signal ingestion: 24h repo collapse, Go-framework auto-kill, one canonical brief per concept, and a sprint-level re-entry audit trigger. Target: zero duplicate signals reach board review.

Published Mar 24, 2026
1

What We Tested

Built `dedup-classifier.js`: a four-rule fingerprint classifier applied at signal ingestion (Step 2.3 in run-scan.js, before filterSeenRepos and all downstream gates).

Rule 1, repo collapse: the fingerprint is SHA256(repoId + ':' + signalType)[:16]. If the same fingerprint was seen within the past 24 hours, the signal is suppressed and the suppression record references the original first-seen entry.

Rule 2, Go auto-kill: if signal.techStack (normalized to lowercase) contains any of {gin, echo, fiber, chi, beego, gorilla, buffalo, iris}, the signal is classified as GO_BACKEND and auto-killed before it reaches the board queue: no LLM calls, no issue creation, no board vote. The kill reason is written to kill-list.json with the tag GO_BACKEND_FRAMEWORK.

Rule 3, brief dedup gate: each incoming content brief is normalized (lowercase, strip punctuation, collapse whitespace), then fingerprinted as SHA256(normalizedConcept)[:20]. The first brief per fingerprint becomes the canonical entry. Any subsequent brief matching the same fingerprint is auto-rejected, with the canonical ID embedded in the rejection record, so engineers can trace any rejected brief back to its canonical source. State is persisted in brief-dedup-registry.json (permanent, no TTL: canonical briefs never expire).

Rule 4, sprint re-entry audit: a sprint window is 14 calendar days. A separate counter tracks how many times each signal fingerprint has re-entered in the current sprint window. On the third re-entry (count >= 3), the signal is flagged with INFRA_AUDIT_REQUIRED and routed to the infra-audit queue, NOT submitted to board review. The flag is written to infra-audit-flags.json with: fingerprint, repoId, reEntryCount, sprintStart, flaggedAt. The board is not notified; the engineering team handles the audit offline.
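Rules 1 and 2 can be sketched in a few lines. This is a minimal illustration, not the actual `dedup-classifier.js`: the function names, the in-memory `Map` standing in for the state file, the array shape of `techStack`, the hex encoding of the truncated hash, and the order in which the two rules run are all assumptions. It also anchors the 24-hour window at the first sighting, which the write-up does not spell out.

```javascript
const crypto = require("node:crypto");

// Values taken from the write-up; constant names are hypothetical.
const DAY_MS = 24 * 60 * 60 * 1000;
const GO_FRAMEWORKS = new Set([
  "gin", "echo", "fiber", "chi", "beego", "gorilla", "buffalo", "iris",
]);

// Rule 1 fingerprint: first 16 hex characters of SHA256(repoId + ':' + signalType).
// Assumption: the [:16] slice is applied to the hex digest.
function repoFingerprint(repoId, signalType) {
  return crypto
    .createHash("sha256")
    .update(`${repoId}:${signalType}`)
    .digest("hex")
    .slice(0, 16);
}

// Rule 2 check: true if any techStack entry, lowercased, names a listed framework.
// Assumption: techStack is an array of strings.
function isGoBackend(techStack = []) {
  return techStack.some((t) => GO_FRAMEWORKS.has(String(t).trim().toLowerCase()));
}

// Classify one signal. `seen` is a Map of fingerprint -> firstSeen (epoch ms),
// a stand-in for the persisted dedup state.
function classifySignal(signal, seen, now = Date.now()) {
  // Assumed order: the Go kill runs first, since a killed signal needs no dedup state.
  if (isGoBackend(signal.techStack)) {
    return { action: "KILL", tag: "GO_BACKEND_FRAMEWORK" };
  }
  const fingerprint = repoFingerprint(signal.repoId, signal.signalType);
  const firstSeen = seen.get(fingerprint);
  if (firstSeen !== undefined && now - firstSeen < DAY_MS) {
    // Duplicate inside the 24h window: suppress and point at the first sighting.
    return { action: "SUPPRESS", fingerprint, firstSeen };
  }
  seen.set(fingerprint, now); // new signal, or the previous window has expired
  return { action: "PASS", fingerprint };
}
```

Because `signalType` is part of the hashed string, two different signal types on the same repo produce different fingerprints and both pass. The sketch does not model the "Go framework is not primary" exemption for polyglot repos, because the write-up does not say how the primary stack is determined.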

2

The Numbers

Rule 1: 24h Repo Collapse

Before: Same repo re-scanned by multiple scanners within 24h → multiple board items created
After: SHA256(repoId:signalType)[:16] fingerprint; second occurrence suppressed with firstSeen reference
Tag: dedup-rule

Rule 2: Go Backend Auto-Kill

Before: Go framework repos (gin, echo, fiber, chi, beego, gorilla) entered the full analysis pipeline; LLM calls wasted, board votes wasted
After: Immediate KILL at ingestion if techStack contains any Go backend framework; tagged GO_BACKEND_FRAMEWORK in kill-list.json; no LLM calls, no board vote
Tag: dedup-rule

Rule 3: Brief Dedup Gate

Before: Near-identical content briefs spawned separate campaign threads with no cross-reference
After: SHA256(normalizedConcept)[:20] canonical ID; first brief is canonical; duplicates auto-rejected with canonicalId reference in rejection record
Tag: dedup-rule

Rule 4: Sprint Re-Entry Audit Trigger

Before: Chronic re-entry signals went to board review repeatedly; board wasted votes on systemic noise
After: 3+ re-entries in a 14-day sprint window → INFRA_AUDIT_REQUIRED flag; routed to infra-audit-flags.json; board not notified
Tag: dedup-rule

Test Coverage

Before: 0 tests for multi-rule signal classifier
After: 53/53 tests passing (18 repo-collapse, 12 go-autokill, 14 brief-dedup, 9 sprint-reentry)
Tag: tests

Board Duplicate Rate

Before: Uncounted duplicates reaching board review per sprint
After: Target: zero duplicate signals reach board review in 30 days (success metric)
Tag: 30-day-target

Pipeline Position

Before: No unified ingestion classifier; dedup logic scattered across repo-dedup.js, checkGate(), checkFingerprintGate()
After: Step 2.3: single classifier entry point before all downstream gates; dedup-classifier.js owns all four rules
Tag: pipeline-position

Classifier Latency

Before: N/A
After: <2ms per signal batch (synchronous file I/O, no network calls)
Tag: performance
3

Results

All four rules were validated in the test suite dedup-classifier.test.js. Total: 53/53 tests passing.

Rule 1 (24h repo collapse): 18/18 tests pass. Fresh signals pass through; duplicate signals within 24h are suppressed with a firstSeen reference; after the 24h window expires, a re-entry passes as new; a different signalType on the same repo creates a distinct fingerprint and passes.

Rule 2 (Go auto-kill): 12/12 tests pass. Signals with gin, echo, fiber, chi, beego, or gorilla in techStack are auto-killed; the kill record is written with the GO_BACKEND_FRAMEWORK tag; mixed stacks where the Go framework is not primary do NOT auto-kill (this prevents false positives on polyglot repos); the kill decision is permanent (the kill-list.json entry has no TTL).

Rule 3 (brief dedup): 14/14 tests pass. The first brief per concept passes and becomes canonical; a second brief with the same normalized concept is rejected with a canonicalId reference; normalization handles case, punctuation, and whitespace variance correctly; the canonical registry persists across sessions.

Rule 4 (sprint re-entry audit): 9/9 tests pass. The first and second re-entries in a sprint window pass through normally; the third re-entry triggers the INFRA_AUDIT_REQUIRED flag; the flag is written to infra-audit-flags.json with full context; the board queue is NOT notified; the sprint window resets correctly after 14 days.
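The Rule 3 behavior described above (normalize, fingerprint, first brief wins) can be sketched as follows. The function names, the `{ id, concept }` brief shape, and the `Map` standing in for brief-dedup-registry.json are hypothetical; the regular expression is one reasonable reading of "strip punctuation", not necessarily the one the classifier uses.

```javascript
const crypto = require("node:crypto");

// Lowercase, strip punctuation, collapse whitespace: the three steps named in the write-up.
function normalizeConcept(text) {
  return text
    .toLowerCase()
    // Assumption: "punctuation" means anything that is not a letter, digit, or whitespace.
    .replace(/[^\p{L}\p{N}\s]/gu, "")
    .replace(/\s+/g, " ")
    .trim();
}

// Canonical ID: first 20 hex characters of SHA256(normalizedConcept).
function briefFingerprint(concept) {
  return crypto
    .createHash("sha256")
    .update(normalizeConcept(concept))
    .digest("hex")
    .slice(0, 20);
}

// `registry` is a Map of fingerprint -> canonical brief id (no expiry, per the write-up).
function checkBrief(brief, registry) {
  const fingerprint = briefFingerprint(brief.concept);
  if (registry.has(fingerprint)) {
    // Duplicate: reject and reference the canonical brief so it can be traced.
    return { accepted: false, fingerprint, canonicalId: registry.get(fingerprint) };
  }
  registry.set(fingerprint, brief.id); // first brief for this concept becomes canonical
  return { accepted: true, fingerprint, canonicalId: brief.id };
}
```

With this normalization, "Top 10 AI Tools!" and "  top 10   ai tools " collapse to the same fingerprint. A hash only matches text that is identical after normalization, so briefs that paraphrase the same idea with different words would still receive different fingerprints under this sketch.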

Verdict

The four-rule fingerprint classifier is the correct architecture for the signal hygiene problem. Each rule targets a distinct failure mode that was previously reaching the board: repo duplicates, Go framework noise, brief redundancy, and chronic re-entry. Applying all four rules at ingestion (Step 2.3) is meant to keep duplicate signals out of board review entirely; the success metric is zero duplicates reaching the board over 30 days. The classifier runs synchronously in <2ms per signal batch. State files are minimal: dedup-state.json (24h rolling, auto-pruned), kill-list.json (permanent, append-only), brief-dedup-registry.json (permanent, canonical source of truth), infra-audit-flags.json (sprint-scoped, manual review). The classifier is the foundation for the Signal Hygiene API moonshot: every rule is a discrete endpoint, every state file is a queryable store. It will be used internally first; productization is planned for Q2, pending validation.
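As an illustration of what "24h rolling, auto-pruned" could mean for dedup-state.json, here is a small pruning sketch. The state shape (an object keyed by fingerprint with a `firstSeen` timestamp in epoch milliseconds) and the function name are assumptions; the write-up does not describe the file format.

```javascript
const DAY_MS = 24 * 60 * 60 * 1000;

// Assumed shape: { "<fingerprint>": { firstSeen: <epoch ms> }, ... }
// Returns a new object containing only entries still inside the 24h window.
function pruneDedupState(state, now = Date.now()) {
  const kept = {};
  for (const [fingerprint, entry] of Object.entries(state)) {
    if (now - entry.firstSeen < DAY_MS) {
      kept[fingerprint] = entry;
    }
  }
  return kept;
}
```

Returning a fresh object rather than deleting keys in place keeps the function free of side effects, so the caller decides when the pruned state is written back to disk.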

The Real Surprise

Rule 4 revealed a subtle interaction: a signal suppressed by Rule 1 (24h collapse) still increments the Rule 4 re-entry counter. A pathologically noisy scanner can therefore trigger an infra audit flag even if its duplicates never reach the board, which is exactly the intended behavior. The audit flag answers the question 'why is this signal appearing so often?' regardless of whether those appearances were suppressed. The counter increments on every re-entry to the classifier, not only on board-visible re-entries.
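A sketch of the Rule 4 counter makes the interaction concrete. It assumes the counter is updated on every classifier entry, before any Rule 1 suppression decision, that the first sighting in a window does not count as a re-entry, and that each fingerprint has its own 14-day window starting at its first sighting. The function name and the `Map`-based store are hypothetical; the flag fields are the ones listed in the write-up.

```javascript
const SPRINT_MS = 14 * 24 * 60 * 60 * 1000;
const AUDIT_THRESHOLD = 3; // flag on the third re-entry (count >= 3)

// `counters` is a Map of fingerprint -> { sprintStart, reEntryCount }.
// Call this for every signal entering the classifier, suppressed or not.
// Returns an audit flag record once the threshold is reached, otherwise null.
function recordEntry(fingerprint, repoId, counters, now = Date.now()) {
  const current = counters.get(fingerprint);
  if (!current || now - current.sprintStart >= SPRINT_MS) {
    // First sighting, or the previous sprint window expired: start a new window.
    counters.set(fingerprint, { sprintStart: now, reEntryCount: 0 });
    return null;
  }
  current.reEntryCount += 1;
  if (current.reEntryCount < AUDIT_THRESHOLD) {
    return null;
  }
  return {
    flag: "INFRA_AUDIT_REQUIRED",
    fingerprint,
    repoId,
    reEntryCount: current.reEntryCount,
    sprintStart: current.sprintStart,
    flaggedAt: now,
  };
}
```

Under these assumptions, a scanner that emits the same signal four times in one afternoon gets its second, third, and fourth sightings suppressed by Rule 1, yet the fourth call here returns a flag record, because counting does not depend on whether the signal was board-visible.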
