The hypothesis: drop the safeguards and the niche demand might be large
While looking at the analytics for my own platform, one thing stood out: video-generation actions dwarf every other feature in volume.
That suggested a bet. General AI chat and expert-collaboration tooling are a cost race against suppliers — Anthropic, OpenAI, Google. A solo developer can't win that fight on the same field. But there's one space the frontier labs structurally can't enter: free creation with the safety guardrails removed. They won't touch it for policy reasons. Overlay that with the unusually high interest in video generation, and there may be a niche that's small in breadth but deep in demand.
To test it, I ran the uncensored community fine-tunes myself and compared them. What follows is the experiment log and how I ended up making I2V the mainstay.
(This article stays SFW. The explicit examples and raw detail live behind a gate at the end.)
The experiment: what I compared, and how
On a single local GPU (RTX PRO 6000 Blackwell, 96 GB) I lined up a base checkpoint against community fine-tunes and compared them fairly: identical prompt set × multiple models. It's the same A/B method I'd already used for image generation (curated prompts → side-by-side contact sheets → identify where each model wins), ported to video.
Two axes:
- Model A/B (general base vs community fine-tune) — who wins where.
- Generation path (T2V = text→video vs I2V = still→video) — which is viable for reliable output.
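A minimal sketch of how that comparison grid can be laid out. The function names, the model labels, and the idea of pinning one seed across the whole grid are my illustrative assumptions here, not a dump of the actual harness.

```python
from itertools import product

def build_grid(prompts, models, paths=("t2v", "i2v"), seed=1234):
    """Cross every prompt with every model and path, holding the seed fixed.

    Keeping the seed identical (an assumption of this sketch) means the
    differences on the contact sheet come from the model or the path.
    """
    return [
        {"prompt": p, "model": m, "path": path, "seed": seed}
        for p, m, path in product(prompts, models, paths)
    ]

def contact_sheet_rows(grid):
    """Group jobs by (prompt, path) so each row shows the models side by side."""
    rows = {}
    for job in grid:
        rows.setdefault((job["prompt"], job["path"]), []).append(job["model"])
    return rows

# Hypothetical labels; substitute whatever checkpoints are being compared.
grid = build_grid(["portrait", "walk cycle"], ["base", "fine-tune-a"])
rows = contact_sheet_rows(grid)
```

Each row of `rows` is one prompt on one path, with every model in the same order, which is what makes the side-by-side read fair.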
What I found
| axis | result |
|---|---|
| base vs community fine-tune | base has stronger physical coherence (fewer breakdowns). Fine-tunes are more expressive but trade away generalization, so anatomy breaks more easily |
| T2V (text→video) | seed gacha + anatomy breakdown; not viable for reliable output. You're rolling until you hit a good seed |
| multi-person scenes | two or more interacting bodies collapse (limbs mix between people). Establish solo quality first |
| resolution | higher resolution = better detail and anatomy. Downscaling-then-upscaling doesn't recover the lost information |
| anatomy negatives / seed | picking a clean seed plus a negative prompt such as "extra fingers / fused limbs / distorted anatomy …" noticeably stabilizes solo output |
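The resolution row has a simple intuition behind it. The toy below is a one-dimensional stand-in of my own (a row of pixel values, a box-filter downscale, a nearest-neighbour upscale), not anything taken from the video models: averaging is many-to-one, so the round trip cannot bring the original detail back.

```python
def downscale_2x(row):
    """Average each pair of neighbouring pixels (box filter)."""
    return [(row[i] + row[i + 1]) / 2 for i in range(0, len(row) - 1, 2)]

def upscale_2x(row):
    """Nearest-neighbour upscale: repeat every pixel twice."""
    return [value for value in row for _ in range(2)]

fine_detail = [0, 10, 0, 10, 0, 10, 0, 10]   # alternating pixel-level detail
round_trip = upscale_2x(downscale_2x(fine_detail))
detail_survived = round_trip == fine_detail
```

The alternating pattern collapses to a flat row at half resolution, and no upscaler can tell afterwards whether the flat row started out as detail or as a flat surface.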
Conclusion: I2V is the mainstay
No amount of prompt + negative tuning gets T2V's gacha to "reliable." So I flipped the approach.
Build a clean still first, then animate it gently with I2V. Because the starting frame's anatomy is already correct, breakdown is structurally unlikely across the clip — frame by frame, the source's coherence holds for several seconds. The gacha is dodged by path design, not luck.
It also matches the actual roleplay use case: animate a user's existing character. So: I2V is the mainstay.
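As a rough sketch of that path design, with every name hypothetical: `make_still`, `looks_clean` and `animate` stand in for the image model, whatever acceptance check is used (manual or automated), and the I2V model. Rolling several seeds at the still stage is an assumption of this sketch.

```python
from typing import Callable, Iterable, Optional, TypeVar

Still = TypeVar("Still")
Clip = TypeVar("Clip")

def still_then_animate(
    prompt: str,
    make_still: Callable[[str, int], Still],
    looks_clean: Callable[[Still], bool],
    animate: Callable[[Still], Clip],
    seeds: Iterable[int] = range(8),
) -> Optional[Clip]:
    """Roll seeds on the still stage only; animate the first still that passes."""
    for seed in seeds:
        still = make_still(prompt, seed)
        if looks_clean(still):
            # I2V starts from a frame whose anatomy has already been accepted.
            return animate(still)
    return None  # no acceptable still, so no video generation was spent
```

The point of the shape is where the rerolls happen: on the still, where a bad result is quick to spot and discard, instead of on the full clip.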
The bottleneck wasn't "not enough hardware" — it was an assumption
Running I2V at high resolution made VRAM spike: 83.5 GB at 1024×768, OOM when co-resident with other services. I half-gave-up — "maybe 96 GB just isn't enough."
It wasn't a hardware problem at all. One line in the inference harness, a lazy decode running outside a no_grad block, was holding a 54 GB autograd graph. Fixing it took the peak from 83.5 GB → 29.5 GB, and 2048×1536 (3.1 MP) now runs in 33.6 GB. I can now act on "higher resolution is better" without flinching at the VRAM ceiling.
(The full debugging — phase-by-phase profiling to localize the peak, and the VAE-tiling dead end I tried first — is in a separate technical write-up → My high-res image-to-video kept OOMing — turns out I was decoding outside no_grad)
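The numbers above are worth checking against each other. This is plain arithmetic on the figures quoted in this section; the one assumption is that the 29.5 GB post-fix peak was measured at the same 1024×768 as the 83.5 GB one.

```python
before_gb, after_gb = 83.5, 29.5      # peak VRAM before / after the fix
low_res = (1024, 768)
high_res = (2048, 1536)
high_res_peak_gb = 33.6               # peak at 2048x1536 after the fix

# Memory that the stray autograd graph was holding.
freed_gb = before_gb - after_gb

# How many more pixels the high-res run pushes through.
pixel_ratio = (high_res[0] * high_res[1]) / (low_res[0] * low_res[1])
megapixels = high_res[0] * high_res[1] / 1e6

# How much the peak grew for that increase in pixels.
peak_ratio = high_res_peak_gb / after_gb

print(f"freed: {freed_gb:.1f} GB")
print(f"pixels: x{pixel_ratio:.0f} ({megapixels:.1f} MP)")
print(f"peak: x{peak_ratio:.2f}")
```

Four times the pixels for roughly 14% more peak memory: under that assumption, most of the remaining footprint does not scale with output resolution.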
Solo dev work is resource-constrained, which is exactly why taking "not enough" at face value, when it is really a measurement artifact, means giving up things you could actually do. This one drove that home.
Running experiments in production without stalling users — GPU traffic control
This is the unglamorous bit that earns its keep. There's only one GPU, and it's contested by user-facing generation (image / video / voice) and my own experiments + benchmarks. Run two heavy diffusion jobs at once and you OOM instantly, taking a user's request down with you. If you're going to experiment in production, you have to manage this.
What I did is traffic control:
- A single lock for heavy jobs (semaphore, concurrency 1) serializes all of it (image gen, video gen, benchmarks) through one slot. Only one heavy diffusion job runs on the GPU at a time, so two heavy jobs can never collide in VRAM.
- Priority: if a paying user's job is waiting, my experiment (treated as "free") yields the slot. Diffusion can't be preempted mid-run, so instead of interrupting, the next free slot always goes to the user.
- External experiments take the same lock: benchmarks I drive locally acquire the same slot over an HTTP permit broker, releasing per prompt so user requests can slip in between.
- The unglamorous cleanup: a benchmark once crashed while holding a permit and hung all generation. So I added a 5-minute TTL reaper that reclaims abandoned permits. The classic "add it after it bites you."
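The four pieces above fit together roughly like this. It's a stripped-down, single-process sketch of the idea, not the production code: the `PermitBroker` class, its method names and the `PAID` / `FREE` constants are hypothetical, and the HTTP layer is left out entirely. The clock is injectable so the TTL can be exercised without waiting five minutes.

```python
import heapq
import itertools
import time
from typing import Callable, Optional

PAID, FREE = 0, 1  # lower value is served first

class PermitBroker:
    """One heavy-job slot, handed out by priority and reclaimed after a TTL."""

    def __init__(self, ttl_s: float = 300.0,
                 clock: Callable[[], float] = time.monotonic):
        self.ttl_s = ttl_s
        self.clock = clock
        self._queue = []                    # heap of (priority, arrival, job_id)
        self._arrivals = itertools.count()  # tie-breaker: FIFO within a priority
        self.holder: Optional[str] = None
        self._granted_at = 0.0

    def enqueue(self, job_id: str, priority: int) -> None:
        heapq.heappush(self._queue, (priority, next(self._arrivals), job_id))

    def reap(self) -> bool:
        """Take the permit back if its holder has kept it longer than the TTL."""
        expired = (self.holder is not None
                   and self.clock() - self._granted_at > self.ttl_s)
        if expired:
            self.holder = None
        return expired

    def grant_next(self) -> Optional[str]:
        """If the slot is free, give it to the best waiter; return the holder."""
        self.reap()
        if self.holder is None and self._queue:
            _, _, self.holder = heapq.heappop(self._queue)
            self._granted_at = self.clock()
        return self.holder  # a running job is never interrupted

    def release(self, job_id: str) -> bool:
        """Free the slot; a stale release (after a reap) is ignored."""
        if self.holder != job_id:
            return False
        self.holder = None
        return True
```

A benchmark that releases after every prompt just calls `release` and `enqueue` again each time, which is the gap a waiting user request slips into. Ignoring stale releases matters once the reaper exists: a job that comes back to life after being reaped must not free a slot that now belongs to someone else.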
This traffic control is what makes "experiment in production without stalling users" actually true. For a solo operation, downtime = lost opportunity, so this is not the place to cut corners.
Where it stands now
- Clean still → I2V runs stably in the 30 GB range, co-existing with the image/voice stacks without contention.
- Peak is roughly flat as resolution climbs, so high resolution is usable day-to-day.
- Per-request model switching for A/B is wired into production, and I'm tuning the recipes that go live there.
A niche the big labs walked away from, attacked by one person on one GPU. Every time a "bottleneck" turns out to have been an assumption, the odds on this bet get a little better.
The uncensored version (NSFW)
Everything above is SFW. Which community models (base / Sulphur / 10Eros), what the outputs actually look like, and how usable it is in practice — there's an uncensored companion with comparison clips and raw detail on a separate page. For those who actually want to see it.
🔞 See the uncensored version (NSFW) → Contains adult examples. Do not open if you are under 18 or do not wish to see explicit material.
This platform (kotonia.ai) is a solo-built, solo-run immersive voice × video roleplay service.