⚠️ NSFW research note. Comparison images with explicit anatomical depictions appear in the closing sections. 18+ only. This is a research write-up.
TL;DR
While tuning HiDream-O1-Image (8B), a top-of-arena OpenWeight image generator, for an NSFW + anime-style specialization, we hit a wall: explicit anatomical depictions (a penis being the canonical case) refused to surface. The initial hypothesis was that the base model's safety alignment was burned into the pixel head and could only be reached by unfreezing it. The real cause was different: our captioner (Gemma-4 E4B) was rounding "fellatio" into "oral sex" and flattening explicit anatomical nouns, diluting the training signal at the source.
We swapped the captioner for Grok-4.3 and ran a two-stage prompt (multiple-choice concept extraction → mandatory-inclusion caption assembly) to recaption 1,250 images. The hit rate for "penis" jumped from 13% → 55%, "fellatio" from 3% → 19%. With bitsandbytes' 8-bit Adam we full-fine-tuned the 8B model on a single 96 GB GPU, 3,000 steps in 28 minutes. On the same training image whose old caption read "oral sex", the model failed to render a penis; with the recaptioned text "fellatio on an erect penis", the penis was rendered cleanly.
A high-quality general-purpose SFW model fell out as a byproduct. Even after 9,000 steps of NSFW-focused full FT, there was no catastrophic forgetting — SFW / mature / topless / fully nude / photoreal ⇔ anime style switching were all preserved.
One important correction, though: what we actually observed is two different learning regimes coexisting. The penis case is "learn-from-scratch" — base has effectively no concept, and we walk the classic trajectory of broken shapes → approximate shapes. The nipple case is "unlock" — base already has the concept but at a low quality floor, and our intervention barely moved it (details below).

Background
Our service, kotonia.ai, runs HiDream-O1-Image as a resident model at /studio, with our in-house LoRAs kotonia01/02 layered on top for anime-style quality boosts. kotonia02 was trained on 191 hand-picked images at rank 16 for 1,200 steps; it convincingly delivered the modern-anime / community-art aesthetic.
What worked and what didn't:
- Removing tops / topless prompts came out fine
- But explicit genital or sex-act prompts would drift back into the base model's safety bias and dodge softly
We decided to attack that head-on. The tricks we'd been using through kotonia02 ("hide it in noisy images, push it through brute force") had clearly bottomed out.
Approach 1: just stack more LoRA (the failure timeline)
The first naive hypothesis was "we just need more data and more steps."
We collected 5,000 images from a public image-sharing platform via its authenticated API (its native 5-tier NSFW classification blended 15/15/20/25/25, sorted by top engagement, engagement count ≥ 50) using our own parallel downloader, and captioned all 5,000 with Gemma-4 E4B. The data side of the pipeline came together cleanly.
Three iterations on the training side:
- kotonia03: rank 32 / 3,000 steps / 5,000 imgs → clear quality gains, explicit content still wouldn't land
- kotonia04: resumed from kotonia03 + 6,000 more steps → more steps, but explicit acts were still dragged back toward the base prior
- kotonia05: also unfroze `final_layer2` (the pixel projection head) and trained for 3,000 steps → the first real progress, "a man appeared in the frame" / "couple compositions emerged"
So the "layer hypothesis" (base safety burned into `final_layer2`) was half-right: unfreezing the head moved us forward. But the penis still wouldn't render. When we looked at the actual training images behind those captions, plenty of them contained penises; the model just couldn't draw them. Something didn't add up.
The distortion that showed up when we actually looked at the data
The breakthrough started with a grep over the caption files:
- Out of n=1,250 lvl 16 (XXX) captions, the number containing each term:
  - `penis`: 162 (13%)
  - `fellatio`: 38 (3%)
  - `vaginal penetration`: 196 (16%)
  - `cum` / `semen`: 74 (6%)
Completely misaligned with the actual image content. The images clearly show penises, but captions had been rounded to neutral / superordinate terms like "oral sex" or "vaginal penetration". When we tried feeding E4B the same image plus a multiple-choice prompt (pick from sexual_acts: [fellatio | cunnilingus | vaginal penetration | ... | none]), E4B always returned ["none"] on explicit sex-act images. It would happily name anatomical nouns (nipples, vulva, etc.), but structurally refused to acknowledge an act.
Gemma-4 is a local model so refusal-style guardrails should be weak — but explicit act recognition was softened anyway, consistently. Multiple choice, stronger system prompts, user prompts embedded with explicit vocabulary — none of it moved the needle.
The training signal — "explicit penis pixels ↔ neutral token 'oral sex'" — was attenuated, and at inference time "oral sex" decayed back into the base's prior of "something explicit-ish ≈ kissing + fluids". The real cause wasn't the model; it was the captioner.
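The caption audit described above can be sketched in a few lines of Python. This is a minimal sketch, not the script we ran: the function name `term_hit_rates`, the one-`.txt`-per-image layout, and the case-insensitive substring matching are assumptions for illustration. Substring matching lets stems such as `penetrat` count every inflection.

```python
from pathlib import Path


def term_hit_rates(caption_dir, terms, pattern="*.txt"):
    """Return {term: (count, fraction)} over the caption files in caption_dir.

    A caption counts as a hit when the term appears anywhere in its text
    (case-insensitive substring match). Each file is counted at most once
    per term, so the fraction is "share of captions mentioning the term".
    """
    files = sorted(Path(caption_dir).glob(pattern))
    texts = [f.read_text(encoding="utf-8", errors="replace").lower() for f in files]
    total = len(texts)
    rates = {}
    for term in terms:
        needle = term.lower()
        hits = sum(1 for text in texts if needle in text)
        rates[term] = (hits, hits / total if total else 0.0)
    return rates


def format_report(rates):
    """Render the hit rates as aligned text lines, most frequent term first."""
    ordered = sorted(rates.items(), key=lambda kv: kv[1][0], reverse=True)
    return "\n".join(f"{term:<24}{count:>6}  {frac:6.1%}" for term, (count, frac) in ordered)
```

Calling `print(format_report(term_hit_rates("dataset/16", terms)))` with the terms of interest (the directory name is hypothetical) gives the same kind of per-term count and percentage as the list above, and it is cheap enough to run before any GPU time is spent.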
Hypothesis B: swap in Grok-4.3
We hit xAI's grok-4.3 with the same image and the same multiple-choice prompt:

    {
      "sexual_acts": ["fellatio", "ejaculation on face", "hand job"],
      "anatomy_visible": ["penis", "testicles", "breasts", "tongue"],
      "fluids_visible": ["semen", "saliva"]
    }

On the exact image where E4B returned ["none"], every act and anatomy was named explicitly and correctly.
Adopted. recaption_two_stage.py implements the following two stages:
- Stage 1 (vocabulary extraction): concept enumeration forced through structured JSON output. The `sexual_acts` field is multiple-choice over a fixed label set (fellatio | cunnilingus | vaginal penetration | anal penetration | hand job | breast play / paizuri | female masturbation | male masturbation | kissing | ejaculation on face | ejaculation on body | none). Reframing the task as classification rather than description makes safety alignment easier to slip past
- Stage 2 (mandatory inclusion): the Stage 1 vocabulary list is fed back in as required terms; the model writes a 1–3 sentence prose caption that uses every term verbatim. Grok-4.3 doesn't omit terms here either
Unlike vLLM-style continuous batching, Grok is an HTTP API, so we ran with concurrency 8 (httpx + ThreadPoolExecutor). 1,250 images in ~47 minutes for ~$30.
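The control flow around the two stages can be sketched as below. This is a minimal sketch under stated assumptions, not the contents of `recaption_two_stage.py`: the two stage functions are passed in as hypothetical callables (`extract_terms`, `write_caption`), the HTTP client is left out entirely, and the retry on a missing term is an illustrative safeguard we did not need with Grok-4.3.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def missing_terms(caption, required):
    """Return the required terms that do not appear verbatim in the caption.

    Matching is a case-insensitive substring test.
    """
    low = caption.lower()
    return [term for term in required if term.lower() not in low]


def recaption_all(image_paths, extract_terms, write_caption, workers=8, max_retries=2):
    """Run both stages for every image on a thread pool.

    extract_terms(path) -> list[str]     stage 1 (hypothetical callable)
    write_caption(path, terms) -> str    stage 2 (hypothetical callable)

    Returns (results, failures): results maps path -> (terms, caption),
    failures maps path -> repr of the exception raised for that image.
    A caption that still misses terms after the retries is returned as-is,
    so callers can re-check it with missing_terms().
    """
    def process(path):
        terms = extract_terms(path)
        caption = ""
        for _ in range(max_retries + 1):
            caption = write_caption(path, terms)
            if not missing_terms(caption, terms):
                break
        return terms, caption

    results, failures = {}, {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {pool.submit(process, path): path for path in image_paths}
        for future in as_completed(futures):
            path = futures[future]
            try:
                results[path] = future.result()
            except Exception as exc:  # one failed request must not stop the batch
                failures[path] = repr(exc)
    return results, failures
```

Threads are enough here because each worker spends nearly all of its time waiting on the network, and `workers=8` mirrors the concurrency we used. Keeping the failures in a separate dict makes it easy to rerun only those images.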
Result: dramatic improvement in term hit rate

| Term | old caption (E4B) | recapt (Grok-4.3) | improvement |
|---|---|---|---|
| penis | 13% | 55% | 4.2× |
| fellatio | 3% | 19% | 6.4× |
| semen | 4% | 41% | 9.4× |
| ejaculat | 0% | 28% | (new) |
| tongue | 2% | 29% | 16× |
| nipple | 19% | 60% | 3.1× |
| vulva | 13% | 42% | 3.2× |
| penetrat | 16% | 31% | 1.9× |
Given that these are high-NSFW (XXX) images, frequencies in this range are what "actually corresponds to reality" looks like. The E4B numbers were anomalously low because they weren't reflecting image content.
Full-FT implementation note: 8-bit Adam fits 8B into 96 GB
VRAM budget:
| Item | Size |
|---|---|
| Weights bf16 (8B) | 16 GB |
| Gradients bf16 | 16 GB |
| Adam state (fp32 default) | 64 GB ← the bottleneck |
| Activations + scratch (batch 1, res 1024) | ~7 GB |
| Total (naive) | ~103 GB → OOM |
bitsandbytes.optim.AdamW8bit compresses Adam state to 16 GB:
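
    import bitsandbytes as bnb

    # 8-bit optimizer state: both Adam moments take 1 byte per parameter instead of 4
    opt = bnb.optim.AdamW8bit(
        (p for p in model.parameters() if p.requires_grad),
        lr=args.lr, weight_decay=0.0,
    )
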
Total resident: 54.6 GB. LR 1e-5, warmup 100 steps, 3,000 steps → 28 minutes; 9,000 steps → 84 minutes. Each model.save_pretrained() writes ~17 GB of checkpoint shards (standard HuggingFace format).
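The budget above is plain arithmetic, and a small sketch makes it easy to redo for another model size. The function name `training_vram_gb` and its parameters are made up for this note; it assumes decimal gigabytes, two optimizer states per parameter (Adam's first and second moments), and the ~7 GB activation figure from the table. It ignores the small per-block quantization constants that 8-bit optimizers also keep.

```python
GB = 1e9  # decimal gigabytes, matching the table


def training_vram_gb(n_params, optim_bytes_per_state, weight_bytes=2,
                     grad_bytes=2, n_optim_states=2, activations_gb=7.0):
    """Estimate resident training memory in GB for full fine-tuning.

    optim_bytes_per_state is 4 for fp32 Adam state and 1 for 8-bit Adam.
    """
    budget = {
        "weights": n_params * weight_bytes / GB,
        "gradients": n_params * grad_bytes / GB,
        "optimizer_state": n_params * optim_bytes_per_state * n_optim_states / GB,
        "activations": activations_gb,
    }
    budget["total"] = sum(budget.values())
    return budget


naive = training_vram_gb(8e9, optim_bytes_per_state=4)      # fp32 Adam state
eight_bit = training_vram_gb(8e9, optim_bytes_per_state=1)  # 8-bit Adam state

for name, budget in (("fp32 Adam", naive), ("8-bit Adam", eight_bit)):
    print(f"{name:<11} optimizer {budget['optimizer_state']:5.1f} GB, "
          f"total {budget['total']:6.1f} GB")
```

The estimate gives 103 GB for fp32 Adam state and 55 GB for 8-bit state, which lines up with the 54.6 GB we measured as resident.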
Same-caption reproduction test: the real killshot
The training image 16/34854503.jpg has these two captions, old vs. recapt:
Old (E4B):
...She is depicted in the act of oral sex, with drool and fluids visible around her mouth, set against a dark red, atmospheric background with small floating hearts. ...NSFW explicit, oral sex, cum on face, visible fluids.
Recapt (Grok-4.3):
...performs fellatio on an erect penis with visible tongue amid ejaculation on face, semen on face and saliva against a dark red background with floating pink hearts under dramatic side lighting with glossy highlights.
We sent both to the same fine-tuned model (full_nsfw02_recapt):

- Left (old: "oral sex"): the composition (red braid + white lingerie + male torso + dark red bg + hearts) is perfect, but the act renders as "something kiss-shaped + drool" — no penis depicted
- Right (recapt: "fellatio on an erect penis"): same composition, but with a hand holding a penis and active fellatio, cum on face, white gloves
This is the strongest evidence in the run. The model has the visual information to draw a penis. The vocabulary gate was just closed.
2D matrix: training axis × prompt axis
When you vary the training-time vocabulary (E4B → Grok recapt + step count) and the inference-time prompt vocabulary independently and lay them out in 2D, the phenomenon becomes clear:

- Vertical axis (model evolution): LoRA (kotonia04) → full FT 3,000 step E4B → full FT 3,000 step recapt → full FT 9,000 step recapt
- Horizontal axis (inference prompt): old "oral sex" / recapt "fellatio on an erect penis"
Readings:
- The horizontal (prompt vocabulary) effect is large: at any model stage, a recapt-vocabulary prompt makes a penis appear. Even kotonia04 (a LoRA trained purely on E4B captions) draws a penis when the inference prompt uses recapt vocabulary
- The vertical (training-side) effect is finer-grained: recapt training + step count drive anatomical coherence and compositional stability (is the penis actually a penis, or a wobbly approximation?)
- Top-left / bottom-right: old prompt + old training (top-left) is fully penis-less. Recapt prompt + recapt training (bottom-right) renders explicit content most reliably
- Reading the left column top-to-bottom: even after recapt-ing the training data, the old prompt still doesn't elicit a penis. This is the asymmetry: fixing the training side alone is not enough — the prompt-side vocabulary also has to match
In short: prompt vocabulary and training vocabulary have to be aligned together for the depiction to land. Both kinds of "rounding" need to be removed; either one alone is incomplete.
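Laying out such a matrix is mostly bookkeeping: every checkpoint is paired with every prompt so that each cell differs from its neighbours along one axis only. The sketch below is illustrative; `build_eval_grid`, the checkpoint labels, and the placeholder prompts are hypothetical names, and holding one seed fixed across all cells is an assumption of this sketch, not something recorded for the original run.

```python
from itertools import product


def build_eval_grid(checkpoints, prompts, seed=0):
    """Return one job dict per (checkpoint, prompt) cell, in row-major order.

    Rows are checkpoints (training axis), columns are prompts (prompt axis).
    Every cell shares the same seed so that differences between cells come
    from the checkpoint or the prompt, not from sampling noise.
    """
    return [
        {"row": row, "col": col, "checkpoint": ckpt, "prompt": prompt, "seed": seed}
        for (row, ckpt), (col, prompt) in product(enumerate(checkpoints), enumerate(prompts))
    ]


# Hypothetical labels standing in for the four model stages and two prompts.
checkpoints = ["lora_kotonia04", "full_ft_3000_e4b", "full_ft_3000_recapt", "full_ft_9000_recapt"]
prompts = ["<old caption text>", "<recaptioned text>"]

grid = build_eval_grid(checkpoints, prompts, seed=1234)
for job in grid:
    print(job["row"], job["col"], job["checkpoint"], job["prompt"])
```

Each job dict can then be handed to whatever inference entry point is in use; the `row` and `col` indices are enough to paste the outputs back into the grid image.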
Byproduct: a surprisingly strong general-purpose SFW model
After 9,000 steps of NSFW-focused full FT on 2,500 images (lvl 8 + 16), not one SFW capability was broken:

- Rooftop sunset (photoreal): photo-quality European city — doesn't fall into anime style. SFW prompts behave as expected
- Café reading (anime): watercolor anime, extremely cute, and the model spontaneously slapped "KOTONIA" onto the book cover (trigger-word easter egg)
- Beach bikini: anime semi-realistic, natural cleavage, no failure on Mature prompts
- Forest sundress: anime, deep cleavage + mystical lighting, anatomy on point
The "photoreal ⇔ anime style auto-switching based on prompt context" behavior is preserved. This is better than we hoped: unlocking NSFW capability didn't cost any general performance. What we called a "byproduct" in the original plan turned out to be a fairly strong general-purpose SFW model.
The boundary of NSFW capability (with anatomical reference)
To measure what's been acquired and what's still out of reach, we ran 8 explicit prompts. We're including the cleanest result (a paizuri / breast-play scene, with the most coherent anatomy):

- Acquired: penis (standalone), fellatio (paizuri-type), handjob, ejaculation on face, cum visible, topless, fully nude, suggestive pose
- Still entangled: vulva ↔ oral cavity (the vulva still tends to hallucinate as a "mouth with a tongue in it"), cunnilingus geometry (the face-to-genitals contact structure)
- Overfit signs: cum is often rendered red (the dark-red background + heart icons in the training set baked a color prior into the cum tokens)
The vulva and cum issues are within range of vocabulary repair: recapt-ing lvl 8 (X tier) and supplementing more background-color diversity should both help. Methodologically, this isn't unknown territory anymore.
As a methodology: generalizing caption vocabulary repair
Patterns that generalize beyond this specific project:
- OpenWeight foundation models often have the visual capability: to be competitive at the top of arena leaderboards, they've seen massive amounts of data (including NSFW). Safety alignment is a layer applied on top; the underlying capability is asleep in the weights
- The vocabulary gate hides the capability: general-purpose captioners almost always round explicit act nouns (fellatio, penetration, etc.) into neutral superordinate terms (oral sex, act, intimate moment, etc.). This produces a weak signal at fine-tuning time
- Captioner choice dominates the entire training pipeline: if you want a fine-tuned model to express its NSFW capability, the first problem to solve is captioner vocabulary fidelity, before any model-side tuning
- Multiple choice + mandatory inclusion bypasses safety alignment structurally: where natural-language description gets softened, reframing the task as classification + verbatim inclusion lets it through
The crucial point: the captioner is not refusing. E4B returns captions cheerfully. But the captions it returns are softened. "Zero refusals" is not a valid health indicator for the pipeline — that's one lesson from this run.
Things we wish we'd done differently
- The three iterations on hypothesis A (kotonia03 → 04 → 05) ate real time. We should have started with `grep -c penis dataset/*.txt`, which takes one minute. Six hours of GPU time later, we did exactly that, and the answer fell out immediately
- "Zero refusals" from E4B lulled us into a false sense of security. Safety alignment shows up not only as refusals but also as quiet softening / rounding
- That said, unfreezing `final_layer2` in kotonia05 wasn't a waste: it confirmed the layer problem partially exists, that the base model's safety is also somewhat encoded into the pixel head. Hypothesis A wasn't refuted; it now coexists with hypothesis B
Revisiting the hypothesis: "unlock" and "from-scratch" are not the same thing
Mid-way through, the simple story of "vocab repair unlocks the model's hidden capability" started feeling incomplete. What we actually observed is two distinct learning regimes coexisting.
The observed asymmetry
| Concept | Behavior on base | Behavior after FT | Learning type |
|---|---|---|---|
| Nipple | The base, given prompt "topless", renders nipples from the first attempt (anatomy is crude, but they're there) | After 9,000 steps of FT, the quality floor is roughly the same. 60% dataset coverage doesn't move the needle | Unlock-type — the concept exists within capacity, but quality improvement is a separate problem |
| Penis | The base, given prompt "fellatio", renders no penis at all. Only when the recapt vocabulary is forced does a shape begin to appear | As steps accumulate: "deformed → odd protrusion → object close to a penis → anatomically valid" — a textbook progression of a new concept being acquired | From-scratch type — possibly deliberately scrubbed at pretrain stage |
Why the asymmetry exists
Our working hypothesis, a selective scrub at the pretraining stage, is the most compelling explanation:
- Penis classifiers are high-demand and high-quality: in the commercial moderation market, a penis is a clear single-feature target, and discriminators can be built with high precision. Pretrain datasets can be cleanly excised
- Nipple classifiers are structurally hard: men have nipples, paintings universally feature them, medical imagery includes them, anatomy atlases require them → it's not realistic to remove every nipple-containing image from pretraining. So the base ships with a crude nipple concept intact
Under this hypothesis, the penis emerging in the bottom-right cell of the matrix (full FT 9,000 step + recapt prompt) is not "unlock" but "effectively additional pretraining-equivalent concept teaching". That's why the penis trajectory looks like a typical new-concept-being-acquired curve, not an unlock curve.
The nipple sits in "already learned at pretrain, but with a low quality floor and no further pressure" — and a full FT on a dataset where nipple appears in 60% of captions barely moved the quality. The likely root cause: the caption contains no quality judgment. The discriminator between "well-rendered nipple" and "broken nipple" is not present in the training signal.
This is also consistent with our other finding that MSE pixel loss is weakly correlated with anatomical quality: for dataset coverage numbers (penis 13% → 55%) to translate into rendering capability, a minimum concept representation has to exist in the weights. Nipples satisfied that prerequisite but didn't improve in quality; penises didn't satisfy it but moved forward via new-concept learning.
Related: the "concept ablation" lineage in NLP
Recent work on model editing and activation engineering (ROME and MEMIT for weight-level edits of stored associations, representation engineering and activation steering for activation-level interventions) locates where a specific concept is encoded and then edits or rescales it. It's plausible that OpenAI / Anthropic post-training pipelines include similar mechanisms, and if OpenWeight models have undergone something like it, you get a "still in weights but scaled down at activation" state. The penis behavior ("won't come out unless you push hard, but once it starts coming out it follows a new-concept-learning trajectory") fits this category.
Correcting the article's main claim
The original "vocab repair unlocks hidden capability" story does not apply to the nipple case. Nipples were visible from the start; the issue is quality. Vocab repair can't fix that (the caption vocabulary contains no axis for "beautiful nipple vs. broken nipple", so the training signal is the 1-bit "present / not present").
For the penis case, vocab repair did open the door to learning a new concept — but that's strictly "enable", not "unlock". If the base had completely scrubbed explicit anatomy, this run would have failed. It scrubbed most of it, so recapt vocabulary + 9,000 steps were enough to bring the shape to life. That's the more precise statement.
Outstanding work and homework for the next phase
- Nipple quality: a GAN-style approach
  - Existing dataset training doesn't fix the quality floor, so the next iteration will try user-driven reward labeling (200–500 images rated 1–5 stars) → lightweight CNN discriminator → reward fine-tune on the diffusion side (Diffusion-DPO family)
  - A "kotonia-preferences discriminator" falls out as a byproduct and can be reused for article-cover auto-selection
- Replace Grok with an uncensored captioner
  - Grok-4.3 cost ~$30 (lvl 16) + $0.5 (pilot), which is workable, but scaling to lvl 8 and beyond gets heavy. Next time we'll try JoyCaption alpha-2 or community uncensored derivatives of CogVLM2 / InternVL-2.5 locally, aiming for $0 / unlimited
- Recapt lvl 8 too: run the 1,250 X-tier images through the same pipeline. Anatomical noun diversity and nipple/vulva coverage should both improve
- Background-color diversity: to escape the cum=red overfit, deliberately collect images with light backgrounds + cum on body
- Cunnilingus geometry: collect more samples showing clean face ↔ genitals contact
- kotonia.ai integration: multi-model VRAM deployment
  - The nsfw03 checkpoint is currently standalone; `/studio` still ships kotonia01/02 LoRAs on the base
  - On a 96 GB GPU we can keep base (fp8, 8.79 GB) and nsfw03 (bf16, 16 GB) resident simultaneously, enabling per-request model selection with no cold start (peak generation included, total still ~40 GB)
- Article distribution strategy: still deciding whether to bet on Google's tolerance for anatomical content in academic context and let this article be indexed, or to keep it `noindex` for SafeSearch reasons
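For the reward-labeling item above, one way to connect star ratings to a Diffusion-DPO-style stage is to turn them into (winner, loser) pairs per prompt. This is a sketch of one possible data-preparation step, not something we have built: the function name `preference_pairs`, the `(prompt_id, image_id, stars)` record shape, and the minimum rating gap are all assumptions for illustration.

```python
from itertools import combinations


def preference_pairs(ratings, min_gap=2):
    """Turn star ratings into (prompt_id, winner_id, loser_id) triples.

    ratings: iterable of (prompt_id, image_id, stars) with stars in 1..5.
    Only images generated from the same prompt are compared, and a pair is
    kept only when the ratings differ by at least min_gap stars, which
    drops near-ties where the label is mostly noise.
    """
    by_prompt = {}
    for prompt_id, image_id, stars in ratings:
        by_prompt.setdefault(prompt_id, []).append((image_id, stars))

    pairs = []
    for prompt_id, items in by_prompt.items():
        for (img_a, stars_a), (img_b, stars_b) in combinations(items, 2):
            if abs(stars_a - stars_b) >= min_gap:
                winner, loser = (img_a, img_b) if stars_a > stars_b else (img_b, img_a)
                pairs.append((prompt_id, winner, loser))
    return pairs


example = [
    ("p1", "a", 5),
    ("p1", "b", 2),
    ("p1", "c", 4),
    ("p2", "d", 3),  # a single image for this prompt yields no pair
]
print(preference_pairs(example))
```

The same rated set could instead train the lightweight discriminator directly; pairs are shown here only because preference-based objectives consume them in this form.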
Related code
Self-hosted monorepo, private:
- `collect.py`: parallel download of 5,000 images using a public image-sharing platform's authenticated API, top-engagement / all-time / NSFW-tier mix
- `caption_v1.py`: E4B legacy captioner (where the rounding problem was originally found)
- `recaption_two_stage.py`: Grok-4.3 vocab-repair captioner with `--backend xai|e4b` switching
- `combine_for_training.py`: symlinks `.recapt.txt` over `.txt` when `--use_recapt` is set
- `train_full.py`: bnb 8-bit Adam, `--init_from` resume, `save_pretrained()` HF format
- `infer_full.py`: full-FT model inference
- `maintenance_shim.py` + `start_maintenance.sh` / `stop_maintenance.sh`: drop-in replacement for `/studio` + LTX endpoints during training, returning 503 with localized maintenance messages (3 languages, dynamic ETA)
That's the research note. Next direction: lvl 8 recapt + background-color diversity + the remaining outstanding items above.