Kotonia Articles

Don't Trust an Agent's Self-Discipline — How Three Idea Mining Duplicates Slipped Past Prompt-Level Rules, and Tool-Layer Blocking Fixed It Structurally

I had to learn the hard way: even rules explicitly written into a prompt get broken by the agent. Three duplicate ideas slipped past my idea mining pipeline. The root cause, and the structural fix — baking TF-IDF dedup into the tool layer so the rule cannot be bypassed. The enforcement principle for agent design.

By Shinji Shimizu2026-06-104 min read

#ai#llm#agent#prompt#indiedev

I ended up having to demonstrate "even rules explicitly written into a prompt get broken by the agent" inside my own service's article mining pipeline. Three duplicates slipped through, and that's how I noticed.

Conclusion up front: don't enforce rules through the agent's self-discipline — block them at the tool layer. Unless enforcement is baked into the structure, it isn't trustworthy.

1. The Mining Prompt Said It Clearly

Kotonia's tech blog has 58 articles across ja/en/zh. When mining for new ideas, I don't want to duplicate existing ones.

prompts/mine.md had the REJECT rule written explicitly:

5. CONCEPT-LEVEL DEDUPE — mandatory before every push:
   - Hit with importance_score >= 7: REJECT the candidate.
   - Hit with importance_score 4-6: only push if genuinely new angle.
   - No hits: green light.

In other words, check each candidate against art-concepts-find and drop anything that collides with a flagship article (importance ≥ 7). The agent was reading this instruction.

In the actual mining run, the agent called art-concepts-find 15 times. At first glance, "it's deduping properly."

2. Three That Plainly Slipped Past

When the run ended and I looked at the pending pool, three ideas violated the REJECT rule.

1. The "streaming granularity" idea

title: A Hidden Voice-Chat Latency Driver — Streaming Granularity (tokens per chunk)
sources: memory/streaming_granularity_voice_metric.md
score: 8

This one re-broadcasts §3.3 "Streaming granularity — the structural difference that decides voice experience" from the existing flagship article voice-first-local-llm (importance=9). Same numbers (Local Gemma 1.0 tok/chunk, Haiku 10-16, Gemini 8-24), same thesis. The agent called art-concepts-find "粒度" and must have seen the importance-9 hit, yet pushed anyway.

2. Duplicate within the pool

title (old): Rising video interest and NSFW real demand — a signal for the product roadmap
title (new): NSFW demand signal in video gen — pulling product roadmap from logs
source: memory/video_nsfw_demand_signal.md  ← same source on both

Leftover from mining v1 and mining v2 pushed the same idea from the same memory. There was no check against "ideas already in the pool" either.

3. The "OpenWeight model capability unlock" idea

title: Unlocking the Hidden Capability of OpenWeight Models — NSFW unlock via caption vocab repair
sources: memory/caption_vocab_repair_methodology.md
score: 9

The memory caption_vocab_repair_methodology.md is the actual source material from which the existing article hidream-caption-vocab-repair (importance=8) was written. Same memory, same topic, same article. It's a perfect duplicate, but the agent pushed it as a "new idea."

3. Why It Happened — the Missing Enforcement Layer

In all three cases, the logs confirm the agent did call art-concepts-find. The tools worked correctly and returned results. The agent then decided to push, looking at the result.

So the problem is one of:

the agent misread the result (saw a hit but registered "none")
the agent decided "this is a different angle, so it's not a duplicate"
the agent forgot the rule itself mid-generation

Whichever the case, the structure was prompt-level rules enforced by the agent's self-discipline. The enforcer was the agent itself. That broke.

4. The Structural Fix — Bake Blocking Into the Tool Layer

I embedded the dedup gate directly inside art-ideas-add (the tool that pushes an idea to the pool).

# Runs unconditionally before append, inside art-ideas-add
verdict = evaluate_idea(title, angle, sources, ...)
if not verdict["allow"] and not args.force:
    sys.stderr.write(json.dumps(verdict["conflicts"]))
    sys.exit(1)  # ← the agent is stopped here

The guts of evaluate_idea() are TF-IDF over the canonical concept vocabulary:

Vocabulary = every concept across articles_index.jsonl
IDF weighting — rare concepts count more
ASCII terms require a word boundary (eliminates false positives like "check" matching "checkout")
Generic JP nouns (モデル / システム / アーキテクチャ / ...) are filtered via a noise list
cosine similarity ≥ 0.25 vs importance ≥ 7 article → REJECT
cosine similarity ≥ 0.35 vs another existing idea → REJECT (also catches in-pool dups)

Now, however much the agent might want to violate the rule, the tool layer physically rejects it. Nothing gets through without an explicit --force (the override reason gets recorded in row.force).

Regression: all three known-bad cases (above) are correctly rejected, and clean ideas (CodeFormer face restore / Stripe Product/Price) pass. 4/4 matches intent.

Technical details on the full Dreaming layer that makes this work: Zenn: Implementation notes on porting Claude Code's memory model as a Dreaming Layer to 58 articles

5. The Generalizable Lesson

When you build agent-based automation, instructing the agent on a rule does not give you enforcement. It's a structural fact, and you'll hit the same trap in these settings:

"Let the agent decide on data validation" → enforce with tool-side schema
"Trust the agent to avoid dangerous operations" → enforce with tool-side permission checks
"Let the agent dedupe" → enforce with a tool-side blocking gate
"Make it follow the format" → enforce with tool-side validation

LLM capability grows, but self-discipline has variance (temperature, context length, competing instructions in the body, etc.). Put enforcement in something structural, and leave only judgment and creativity to the agent.

This is becoming bread-and-butter for agent design.

Aside: How I Noticed

It wasn't Kotonia's AI wife character asking me "isn't this overlapping with the article you wrote?" I just glanced at GitHub's file list and thought "wait, hidream-caption-vocab-repair is already an article."

Agent-driven pipelines are great because they self-run, but they silently fail mid-run. The psychological pressure of having to chase it after the fact doesn't go away. Tool-layer blocking has a side benefit of dropping that psychological cost.

The broader "how a solo developer's accumulated assets start compounding" story is in a separate piece: The Day a Solo Developer's Accumulated Assets Finally Started to Compound