Kotonia Articles

The night I tried to build a specialized LoRA, ran headfirst into the collapse of legal NSFW datasets, and ended up inventing 3-stage captioning

Starting from the under-fit signal 'v1 nipple LoRA at @1.5 > @1.0', I structurally analyzed how legal NSFW dataset paths are closed off by CSAM contamination and DRM, then trained V2 LoRA using CivitAI RED + 3-stage captioning + informational value frame, confirming dramatic prompt adherence improvement and generalization preservation via A/B. A single-night research log.

By Shinji Shimizu2026-06-1611 min read

#LoRA#image-generation#HiDream#indie#AI-training

Also inJapanese Chinese

Figure 1: /studio after V2 deployment. Multiple LoRAs are selectively stacked to produce explicit content. The final artifact of this writeup.

Figure 1: The endpoint of this writeup. /studio's LoRA dropdown now includes the newly-trained nipple_v2, here stacked with the pre-existing kotonia03 (anime + aesthetic). The star-rating UI for dataset labeling is also visible bottom-right, running concurrently.

"The LoRA produces higher quality at strength @1.5 than at @1.0"

This sounds like a casual observation — "of course, turning up the dial makes it stronger" — but it's actually a quiet, ugly signal that the LoRA is under-fit.

I run kotonia.ai with an HiDream-O1-Image (open-weight image model) resident in fp8, with multiple self-trained LoRAs stacked on top to express specialized differences. One of them, nipple_v1, was meant to lift the rendering quality of anatomy detail. When users reported that strength 1.5 looked better than 1.0, I knew immediately: that's under-fit talking.

What followed was one session — scrape, redesign the caption pipeline, scale up 17×, train, A/B, deploy — and I want to write the whole thing down. Along the way I discovered that the entire NSFW dataset world is structurally locked down by CSAM contamination and DRM, so we'll start there and then climb to the methodology I ended up inventing: 3-stage captioning.

1. What "@1.5 > @1.0" was actually telling me

A LoRA adds a low-rank delta ΔW to the base model's weights, and at inference time you can apply a scalar α to that delta: W' = W + α · ΔW. Training is calibrated for α=1.0 as the "name plate" magnitude, with α=0.5 or 1.5 available as user-facing dials.

Normally @1.0 is the natural sweet spot, and @1.5 either over-pushes the effect, breaks anatomy, or feels unnatural. With kotonia.ai's nipple_v1, the opposite happened — @1.5 looked better than @1.0 in anatomy quality.

What this means: during training, the LoRA's magnitude never fully extended. The model converged early before reaching the optimal delta length. Concretely:

Rank was too low — representation space couldn't hold the data variability
LR was too conservative — each step moved too small a distance
Or the dataset was too small to even induce the variability in the first place

v1 was trained on 525 images, rank 32, 1500 steps. A smoke-test config that I never scaled to production. The user-side symptom — "use it at 1.5" — was the bill coming due.

Fixing it required raising rank, expanding data, running more steps, and fundamentally improving caption quality. The story starts with hunting for that data.

2. Stepping on the CSAM mine that closed the public NSFW dataset world

"I want a photoreal anatomy dataset to train a specialized LoRA" is a pretty pedestrian ML request, and historically you'd reach for NudeNet, LAION-NSFW subset, or academic releases like Figshare 14495367 (Image and video dataset for adult content detection).

In 2025 the air changed.

The Canadian Centre for Child Protection (C3P) audited the NudeNet classifier dataset (~700,000 images) and found 680 CSAM images, including 120 identified victims (cross-matched against law enforcement databases). Academic Torrents took the dataset down on notice. The archive.org URL is now a 404. 250+ academic papers citing NudeNet woke up to find themselves in a difficult position overnight.

This wasn't a one-off — it was structural. Stanford Internet Observatory had already found 3,200+ CSAM images in LAION-5B in 2023. The common pattern: scraped from porn sites. Every public NSFW dataset built that way inherits the same structural contamination risk.

Figshare 14495367 documents that it was assembled by "browsing porn sites." It hasn't been publicly audited, but with the same sourcing methodology you'd statistically expect a similar contamination rate (~0.1%). Across 30,800 images, that means roughly 30 CSAM images statistically expected.

The legal calculus is severe:

Mere possession of CSAM is a criminal offense in most jurisdictions, including Japan. Downloading for AI training counts as possession
"I'll filter it before use" doesn't help — the act of downloading already creates the violation, and you have no safe way to identify CSAM yourself
Traceability also exists. Membership inference attacks on LoRA weights can retroactively identify which dataset was used in training

At this point the public NSFW dataset path is structurally closed. Time to look elsewhere.

3. "But I bought it from FANZA" was only half-right

Hypothesis two: legally purchase a commercial Japanese adult video and extract frames. Under Japanese copyright law (Article 30 — private use), copying for personal use should be permitted, and individual frame extraction from a downloaded video falls within that scope. Right?

I bought a single S1-label title on FANZA. A .dcv file came down. Running file on it shows: ISO Media MP4 v1. Looks like an ordinary MP4.

But ffprobe -read_intervals "%+#1" -show_packets reveals:

[PACKET]
codec_type=video
flags=K__
[SIDE_DATA]
side_data_type=Encryption info    ← here
[/SIDE_DATA]

CENC (Common Encryption, MPEG-DASH-style DRM) is encrypting the video samples. The container (MP4 ftyp/moov) is readable, but every H.264 NAL unit is encrypted. The DMM Player retrieves a key from the DRM server at playback time and decrypts in memory (Widevine/PlayReady equivalent).

This collides with Article 30(1)(ii) and Article 30(2) of the Japanese Copyright Act (2012 amendment, strengthened 2018). Copying that requires circumventing technological protection measures is illegal even for private use — criminal penalty (up to 2 years / ¥2,000,000 fine). "I bought it" and "I'm only using it internally for AI training" do not cure it. The act of circumvention is itself the violation.

Physical media (Blu-ray, DVD) carries AACS/CSS DRM — same structure, same problem.

Which means the major commercial AV download channels are structurally closed off as AI training data sources. The 9.6 GB I bought is a sunk cost for personal viewing only. Frame extraction is off the table.

4. Arriving at CivitAI RED — the AI-generated path is what's left

Public NSFW datasets dead from CSAM. Commercial AV dead from DRM. Self-funded shoots aren't realistic at indie budget scale. What remained was CivitAI's library of AI-generated images.

CivitAI exposes community-generated images via API, classified by browsingLevel (1=PG ... 16=XXX). The structural properties that make this safe:

No CSAM by construction. AI-generated content has no identified victim — the concept doesn't apply
Explicit age policy. CivitAI enforces TOS removal of minor depictions
Authenticated bulk API access. Token-based, paginated
Clean provenance. Each image carries the originating prompt, model name, and creator

A common concern: "Training on AI-generated images causes mode collapse." That concern is about self-distillation — same model's outputs fed back into the same model. CivitAI is a heterogeneous mix of PonyDiffusion / Flux / SDXL / various community LoRAs, all distinct from HiDream. Cross-architecture transfer, no self-distillation.

I wrote civitai_collect.py and ran it with --target 20000 --weights "8:0.25,16:0.75" --min_hearts 30. The CivitAI API returned 500/503 several times during pagination, so I resumed from cursor multiple times. Final yield: 5,000 at level 8 + 4,707 at level 16 = 9,707 images. Short of the 20k target, but at 18.5× v1's 525, easily into useful scale.

Data secured. Now the caption design.

5. Scaling 17× alone wasn't enough — the caption mode collapse problem

The naive intuition "more data fixes under-fit" is only half right.

v1's 525-image captions were generated with a Grok 2-stage pipeline: Stage 1 extracted attributes as JSON; Stage 2 forced verbatim use of all of them in caption synthesis. When I scaled this same pipeline to 9,707 images, here's what happened.

Every caption was structurally identical. Because:

Stage 1 inventoried a fixed set of 7 attribute fields
Stage 2 used a single template prompt
Temperature 0.25 was nearly deterministic

Per-image uniqueness existed at the attribute level, but sentence structure, word choice, conjunctions, paragraph layout were essentially the same across all images. Captions had structural identity — a kind of duplication that TF-IDF cannot detect because lexical overlap looks low but structural overlap is total.

This was caption mode collapse, and once I recognized it, the diagnosis was clean: a LoRA trained on this corpus would over-fit to the specific syntactic patterns, leaving it brittle to prompt variation at inference.

The fix had to be diversity at generation time. Post-hoc dedup (TF-IDF / cosine clustering) only catches lexical overlap — it can't see structural duplication. You have to change the generation-time probability distribution.

That's where 3-stage captioning was born.

6. Designing 3-stage Captioning — the informational value frame

Design goals:

Extract more structural uniqueness fodder per image
Lower the probability of structurally identical outputs at generation time
Raise the vocabulary density in the nipple region (color + shape + state + areola, woven naturally)
Process all 9,707 in parallel via local LLM (Gemma 4 26B A4B Uncensored on vLLM, port 8899)

Three stages:

Stage 1: Factual inventory (T=0.1)

Structured JSON extraction over 12 fields. Beyond the original 7, added nipple_detail (color/shape/state/areola), breast_detail (size/shape/skin_finish), pose_framing, expression_mood, color_palette.

Stage 2: Analytical reasoning (T=0.4) ← new

Takes the Stage 1 JSON as input and asks the model: "in 2-3 sentences, analyze why this image is informative as training data." Explicitly:

Do NOT make subjective quality judgments — no "beautiful", "attractive", "well-composed", "high quality", "stunning". Describe informational value only.

This is the informational value frame. It structurally prevents subjective quality vocabulary from contaminating the reasoning. The reason matters: if "beautiful" gets baked into captions, the trained model learns "beautiful is a precondition for visible anatomy" — a brittleness you don't want.

Per-image reasoning traces differ — different images have different distinctive features — and this asymmetry becomes the diversity driver for the final caption.

Stage 3: Caption synthesis with skeleton rotation (T=0.7)

Takes Stage 1 attributes + Stage 2 reasoning, generates the final caption. Eight sentence skeletons assigned round-robin per image, forcing different syntactic structures:

S0: "Close-up of [subject + pose]. [nipple/breast detail]. [lighting], [style]."
S1: "[Style] image: [pose_framing], [breast detail]. [nipple detail]. [mood], [lighting]."
S2: "[Mood] [subject] in [pose]; visible anatomy [list]. [props]. [lighting], [style]."
... (S3-S7)

And explicit guidance for raising nipple-region vocabulary density:

For the nipple region, weave 2-4 complementary terms naturally into the caption. Use complementary attributes (color + shape + state + areola) rather than synonym-stuffing. Vary which subset appears across captions in this batch.

This produces lines like "pink, pointed nipples with prominent areolae" — four attributes woven naturally instead of booru-tag-style enumeration. HiDream's Qwen3VL text encoder is built for natural language captions, so this form actually lands.

Figure 2: Sample output from V2 LoRA stacked with kotonia03. The captions used during training described nipples with multi-vocab patterns ("pink, pointed nipples with prominent areolae"), and the resulting LoRA reproduces that anatomical specificity.

Figure 2: A single generation from the V2 LoRA stacked with kotonia03 (anime aesthetic). The natural-language multi-vocab caption patterns described earlier ("pink, pointed nipples with prominent areolae") translate into anatomical consistency at inference: areola texture and nipple shape land cleanly together.

A 525-image smoke test:

areola mentions: v1 was 0/15, V2 is 4/15 (new vocabulary unlocked)
erect mentions: 2 → 7 (3.5x)
pointed mentions: 0 → 3 (new)
unique trigram ratio: 0.873 → 0.919 (+5.3pp)
self-BLEU proxy: 0.018 → 0.008 (−55%, less self-similar)

At full 9,707-image scale, captions finished in 51 minutes with concurrency 16 (Gemma 26B uncensored's vLLM continuous batching pulls its weight).

7. Training and A/B — under-fit really did resolve

Training config:

9,138 images (9,707 minus 515 blocked by age safety filter and 54 below discriminator score 3.0)
rank 64 / alpha 64 (2× v1)
LR 1e-4 (2× v1)
30,000 steps / batch 1 (20× v1)
fresh from base (no continual training from v1, no anime bias inheritance)
patch_align preserves natural aspect ratio (batch>1 was abandoned due to aspect distortion)

On GPU 0 (Blackwell 6000, 96GB) with image_server / LTX-2 / Gemma 26B all stopped and dedicated, training completed in ~3 hours. Loss stable in 0.03-0.06 range, no divergence.

Figure 3: A/B comparison grid across 5 configs × 8 prompts. Columns left-to-right: base / v1@1.0 / v1@1.5 / v2@1.0 / v2@1.5. Same seed within each row, so config differences are isolated.

Figure 3: A/B evaluation grid (40 samples). Columns: base / [email protected] / [email protected] / [email protected] / [email protected]. Rows: 8 eval prompts. Visible in v1 columns: several prompts explicitly say "topless" but v1 outputs clothed subjects (prompt adherence failure). v2 columns: 8/8 correct topless interpretation. This is the direct visual evidence of the 3-stage caption pipeline's effect on prompt adherence.

A/B evaluation across 8 prompts × 5 configs:

config	observation
base (no LoRA)	anatomy rendering at base-model level, conservative
[email protected]	failed to interpret "topless" 4-5 times out of 8, output clothed instead
[email protected]	overshoot recovered quality, the practical v1 best
[email protected]	8/8 topless interpretation, clean nipple detail (areola/shape), aesthetic on par or better
[email protected]	barely distinguishable from [email protected] (small overshoot margin = under-fit resolved)

[email protected] ≈ [email protected] is the resolution signal. Contrast with [email protected] ≪ [email protected].

Then 8 prompts × 6 strength values for a sweep:

Nipple-focused prompts (4): 0.5 weak, 0.75 sweet, 1.0 strong, 1.25-1.5 body inflation appears
Generalization probes (4) (clothed woman, mountain landscape, anime school uniform, ramen bowl): landscape and food completely unaffected at every strength. Clothing intent preserved at all strengths (no stripping). Mild breast emphasis at high strengths but doesn't break the prompt's clothed framing.

This shows the LoRA's specialization is locally confined to the nipple/breast domain — unrelated domains (landscape, food) are not corrupted at all.

8. What this revealed about the indie compounding loop

Deployment: added nipple_v2 (anatomy quality + generalization preserved) to /studio's LoRA dropdown with default weight 0.75, registered in image_server's HIDREAM_LORAS env, rebuilt the frontend. ~10 minutes.

What this single session produced as reusable assets:

3-stage caption is methodologically transferable. The next specialized LoRA (hand quality / clothing detail / specific style) can reuse the same pipeline
Age safety filter + discriminator + caption-presence triple-filter is generic. It becomes the standard post-processing for CivitAI scrapes
A/B + strength sweep + generalization probe is a reusable evaluation framework for any LoRA

Beyond the artifacts, it surfaced something about indie tooling that's worth naming.

Public datasets closed by CSAM. Commercial AV closed by DRM. In that landscape, what's left is the path of hitting the API yourself, designing the caption pipeline yourself, burning the GPU yourself, deploying it yourself. This is movement that VC-backed dumping startups can't really imitate, and because external dependencies are zero, each attempt's output becomes input to the next attempt's precision. The compounding loop closes.

"Importing data" becomes "distilling data yourself," "purchasing captions" becomes "designing captions yourself," "outsourcing training" becomes "running training yourself." Each delta is small, but for an indie holding the full stack, the compounding from zero external dependence is real.

V3 carryover items are also clear:

Vulva bias root fix: exclude vulva from anatomy_visible at Stage 1, or filter vulva-visible images upstream with a detector
Lying pose elongation: CivitAI's lying poses skew toward portrait aspect ratios, suspected to have caused training-time bias. Aspect-ratio-weighted sampling should normalize
Stack rank mismatch (v2 rank 64 vs k03 rank 32): add_weighted_adapter linear requires same rank, so the workaround route is peft_model.base_model.set_adapter([list]) + per-adapter scaling

These are problems that yield to "caption and sampling design" rather than "more data" — V3 will be more methodological refinement than scale.

I started this session at night and had v2 deployed by morning. If I'm allowed one piece of self-congratulation, it's that the indie compounding loop closed an entire cycle in a single night.