Ingestion-Layer Dedup Gate: Block Re-scans of KILL-Decided Repos at the Scraper Level
Every board KILL decision now writes the repo identifier, kill reason, and tech stack tags to a scraper-level filter list. No KILL-decided repo reaches the board queue within a 7-day window.
What We Tested
The existing repo-dedup.js tracked decided repos in seen-repos.json with a 7-day rolling window. Once that window expired, a KILL-decided repo could re-enter the pipeline as a 'fresh signal,' and the board would re-evaluate it, wasting a full analysis cycle. We built a dedicated kill-list.js store that persists KILL decisions separately from the pending-repo rolling window.
Integration has three parts: (1) kill-list.js exports writeKillDecision(repoId, {killReason, techStackTags, title, source}) and checkKillList(repoId); (2) repo-dedup.js calls checkKillList() as Step 1 in filterSeenRepos(), before the domain blocklist, the 6h session gate, and every other check; (3) feedback-loop.js calls writeKillDecision() when a Paperclip issue is cancelled (a board KILL decision).
The kill list enforces a 7-day blocking window with automatic expiry. Each kill entry stores: repoId (canonical), killReason (extracted from the Paperclip description), techStackTags (tech keywords for the Phase 2 pattern-matching classifier), killedAt (ISO timestamp), title, and source scanner.
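To make the store's contract concrete, here is a minimal sketch of the two exported functions. It is not the actual kill-list.js: it keeps entries in an in-memory Map instead of kill-list.json, and the createKillList() factory, the KILL_TTL_MS constant, and the injectable now parameter are assumptions added so the example runs on its own.

```javascript
// Sketch only: the real kill-list.js persists to kill-list.json.
// createKillList, KILL_TTL_MS, and the `now` parameter are hypothetical.
const KILL_TTL_MS = 7 * 24 * 60 * 60 * 1000; // 7-day blocking window

function createKillList() {
  const entries = new Map(); // repoId -> kill entry

  function writeKillDecision(repoId, { killReason, techStackTags = [], title, source }, now = Date.now()) {
    const entry = {
      repoId,
      killReason,
      techStackTags,
      killedAt: new Date(now).toISOString(), // ISO timestamp, as described above
      title,
      source,
    };
    entries.set(repoId, entry);
    return entry;
  }

  function checkKillList(repoId, now = Date.now()) {
    const entry = entries.get(repoId);
    if (!entry) return { killed: false };
    const ageMs = now - Date.parse(entry.killedAt);
    if (ageMs >= KILL_TTL_MS) {
      entries.delete(repoId); // automatic expiry once the window has passed
      return { killed: false };
    }
    return { killed: true, entry };
  }

  return { writeKillDecision, checkKillList };
}

// Usage with fixed timestamps so the result does not depend on the clock.
const killList = createKillList();
const killedAtMs = Date.parse('2024-01-01T00:00:00Z');
const oneDayMs = 24 * 60 * 60 * 1000;
killList.writeKillDecision(
  'github.com/example/repo',
  { killReason: 'no clear market', techStackTags: ['node'], title: 'Example', source: 'example-scanner' },
  killedAtMs
);
console.log(killList.checkKillList('github.com/example/repo', killedAtMs + oneDayMs).killed); // true
console.log(killList.checkKillList('github.com/example/repo', killedAtMs + 8 * oneDayMs).killed); // false
```

Passing the clock in as a parameter is a design choice for the sketch: it lets expiry be tested without waiting or mocking timers.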
The Numbers
KILL Decision Storage
Kill Reason Capture
Tech Stack Tags
Pipeline Check Order
7-Day Window Integrity
Test Coverage
Results
All 21 unit tests pass (0 failures).
Test 1: writeKillDecision stores all 8 metadata fields correctly.
Test 2: checkKillList returns killed:true with the entry for a KILL-listed repo.
Test 3: checkKillList returns killed:false for a clean repo.
Test 4: filterSeenRepos integration. A kill-listed repo is blocked at Step 1, the killListBlocked stat is incremented, and a fresh repo passes.
Test 5: expired kill entries (8 days old) return killed:false and do not appear in listActiveKills().
Stats test: getKillListStats() returns the correct activeKills count.
The kill list is checked as the first gate in the 6-step filterSeenRepos() pipeline: before the domain blocklist, before the 6h session gate, and before the adjudicated check. The log output '[repo-dedup] [KILL-LIST] Blocked: {repoId} | reason: {killReason} [{techStackTags}]' confirms the gate fires with full metadata.
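The check order can be sketched roughly as follows. This is an illustration of the "kill list first" ordering, not the real filterSeenRepos(): the options object, the laterGates array standing in for the remaining steps, the otherBlocked and passed counters, and the comma-joined tag format in the log line are all assumptions.

```javascript
// Sketch only: the signature, laterGates, and the extra stats are hypothetical.
function filterSeenRepos(repos, { checkKillList, laterGates = [], log = console.log }) {
  const stats = { killListBlocked: 0, otherBlocked: 0, passed: 0 };
  const fresh = [];

  for (const repo of repos) {
    // Step 1: kill list, checked before any other gate.
    const { killed, entry } = checkKillList(repo.repoId);
    if (killed) {
      stats.killListBlocked += 1;
      log(
        `[repo-dedup] [KILL-LIST] Blocked: ${repo.repoId} | reason: ${entry.killReason} ` +
          `[${entry.techStackTags.join(', ')}]`
      );
      continue;
    }

    // Later steps (domain blocklist, 6h session gate, adjudicated check, ...),
    // modelled here as predicates that return true when the repo is blocked.
    if (laterGates.some((gate) => gate(repo))) {
      stats.otherBlocked += 1;
      continue;
    }

    stats.passed += 1;
    fresh.push(repo);
  }

  return { fresh, stats };
}

// Usage with a stubbed kill list: one killed repo, one fresh repo.
const stubCheck = (repoId) =>
  repoId === 'github.com/example/killed'
    ? { killed: true, entry: { killReason: 'duplicate of existing tool', techStackTags: ['python'] } }
    : { killed: false };

const result = filterSeenRepos(
  [{ repoId: 'github.com/example/killed' }, { repoId: 'github.com/example/fresh' }],
  { checkKillList: stubCheck }
);
console.log(result.stats.killListBlocked, result.fresh.length); // 1 1
```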
Verdict
The dedup gate closes the sprint-zero governance bug. KILL-decided repos are now blocked at the ingestion layer for 7 days. The kill reason and tech stack tags are stored with each entry, enabling the Phase 2 pattern-matching classifier to train on historical KILL decisions. The integration chain is complete: board KILL decision (Paperclip cancelled) → writeKillDecision() in feedback-loop.js → kill-list.json → checkKillList() in repo-dedup.js → blocked at Step 1 before any analysis cost is incurred. Acceptance criteria met: zero duplicate repo scans reach board evaluation within a 24-hour window; board cycle count per unique repo is 1.
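The first link in that chain, the feedback-loop.js hook, might look something like the sketch below. The shape of a Paperclip issue is not documented here, so the status, repoId, title, description, and source fields are assumptions, as are the extraction rules: taking the first non-empty description line as the kill reason and matching against a small hypothetical keyword list for the tags.

```javascript
// Sketch only: issue fields, TECH_KEYWORDS, and both extract* helpers are hypothetical.
const TECH_KEYWORDS = ['react', 'python', 'rust', 'node', 'docker'];

// Assumption: the first non-empty line of the description states the kill reason.
function extractKillReason(description) {
  const firstLine = (description || '')
    .split('\n')
    .map((line) => line.trim())
    .find((line) => line.length > 0);
  return firstLine || 'unspecified';
}

// Whole-word, case-insensitive keyword match over the given text.
function extractTechStackTags(text) {
  return TECH_KEYWORDS.filter((kw) => new RegExp(`\\b${kw}\\b`, 'i').test(text));
}

// Called when an issue changes; only a cancelled issue counts as a board KILL.
function onIssueUpdated(issue, writeKillDecision) {
  if (issue.status !== 'cancelled') return null;
  return writeKillDecision(issue.repoId, {
    killReason: extractKillReason(issue.description),
    techStackTags: extractTechStackTags(`${issue.title}\n${issue.description}`),
    title: issue.title,
    source: issue.source,
  });
}

// Usage with a stand-in writer that just echoes what it would store.
const echoWriter = (repoId, meta) => ({ repoId, ...meta });
const stored = onIssueUpdated(
  {
    status: 'cancelled',
    repoId: 'github.com/example/repo',
    title: 'Rust CLI for log parsing',
    description: 'Too niche for the portfolio.\nShips as a Docker image.',
    source: 'example-scanner',
  },
  echoWriter
);
console.log(stored.killReason); // Too niche for the portfolio.
console.log(stored.techStackTags); // [ 'rust', 'docker' ]
```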
The Real Surprise
The most important architectural decision was storing KILL decisions separately from seen-repos.json. Initially, decisionState:'cancelled' in seen-repos.json seemed sufficient. But seen-repos.json is pruned on a rolling window, so KILL-decided entries expire just like pending entries. A dedicated kill-list.json with its own 7-day TTL ensures KILL decisions are checked independently. It also makes the kill list queryable (listActiveKills(), getKillListStats()) without scanning the full seen-repos state, and it stores the kill reason and tech stack tags for Phase 2 classifier training, data that has no place in the generic seen-repos structure.
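As a rough illustration of why a separate store is easy to query, the two read helpers can be written as small pure functions over the kill entries. The signatures below are assumptions (the real functions presumably read kill-list.json themselves), and the totalEntries field is added only for the example; the source confirms just the activeKills count.

```javascript
// Sketch only: signatures, SEVEN_DAYS_MS, and totalEntries are hypothetical.
const SEVEN_DAYS_MS = 7 * 24 * 60 * 60 * 1000;

// Entries whose killedAt timestamp is still inside the 7-day window.
function listActiveKills(entries, now = Date.now()) {
  return entries.filter((entry) => now - Date.parse(entry.killedAt) < SEVEN_DAYS_MS);
}

function getKillListStats(entries, now = Date.now()) {
  return {
    activeKills: listActiveKills(entries, now).length,
    totalEntries: entries.length,
  };
}

// Usage: one entry killed a day ago, one killed eight days ago.
const nowMs = Date.parse('2024-01-10T00:00:00Z');
const sampleEntries = [
  { repoId: 'github.com/example/recent', killedAt: '2024-01-09T00:00:00.000Z' },
  { repoId: 'github.com/example/old', killedAt: '2024-01-02T00:00:00.000Z' },
];
console.log(listActiveKills(sampleEntries, nowMs).map((entry) => entry.repoId)); // [ 'github.com/example/recent' ]
console.log(getKillListStats(sampleEntries, nowMs).activeKills); // 1
```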