Betting on the video niche the big labs walked away from: from model A/B to making I2V the mainstay

Is there niche demand in free creation with the guardrails off? A solo dev on one local GPU: from model A/B to making high-res I2V the mainstay.

5 min read
#solo-dev #generative-ai #video-generation #local-gpu #i2v

The hypothesis: drop the safeguards and the niche demand might be large

While looking at the analytics for my own platform, one thing stood out: video-generation actions dwarf every other feature in volume.

That suggested a bet. General AI chat and expert-collaboration tooling are a cost race against the suppliers themselves (Anthropic, OpenAI, Google), and a solo developer can't win that fight on the same field. But there's one space the frontier labs stay out of as a matter of policy: free creation with the safety guardrails removed. Overlay that with the unusually high interest in video generation, and there may be a niche that's narrow in breadth but deep in demand.

To test it, I ran the uncensored community fine-tunes myself and compared them. What follows is the experiment log and how I ended up making I2V the mainstay.

(This article stays SFW. The explicit examples and raw detail live behind a gate at the end.)

The experiment: what I compared, and how

On a single local GPU (RTX PRO 6000 Blackwell, 96 GB) I lined up a base checkpoint against community fine-tunes and compared them fairly: identical prompt set × multiple models. It's the same A/B method I'd already used for image generation (curated prompts → side-by-side contact sheets → identify where each model wins), ported to video.

Two axes:

  1. Model A/B (general base vs community fine-tune) — who wins where.
  2. Generation path (T2V = text→video vs I2V = still→video) — which is viable for reliable output.
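
As a rough illustration of "identical prompt set × multiple models", the grid can be built as a plain cross product and then grouped so that each contact sheet holds every model/path variant of one prompt and seed. This is a minimal sketch, not the harness used here: the model names, prompts, seeds, and the `Job`, `build_grid`, and `contact_sheets` helpers are all placeholders.

```python
from collections import defaultdict
from itertools import product
from typing import Dict, List, NamedTuple, Sequence, Tuple


class Job(NamedTuple):
    model: str   # which checkpoint renders the clip
    path: str    # "t2v" or "i2v"
    prompt: str
    seed: int


def build_grid(models: Sequence[str], paths: Sequence[str],
               prompts: Sequence[str], seeds: Sequence[int]) -> List[Job]:
    """Every model and path sees exactly the same prompts and seeds."""
    return [Job(m, p, pr, s)
            for pr, s, m, p in product(prompts, seeds, models, paths)]


def contact_sheets(jobs: Sequence[Job]) -> Dict[Tuple[str, int], List[Job]]:
    """Group jobs so one sheet compares all variants of a single prompt + seed."""
    sheets: Dict[Tuple[str, int], List[Job]] = defaultdict(list)
    for job in jobs:
        sheets[(job.prompt, job.seed)].append(job)
    return dict(sheets)


# Placeholder inputs, purely for illustration.
MODELS = ["base", "finetune-a"]
PATHS = ["t2v", "i2v"]
PROMPTS = ["a person waving", "a person walking a dog"]
SEEDS = [0, 1]

jobs = build_grid(MODELS, PATHS, PROMPTS, SEEDS)
sheets = contact_sheets(jobs)
print(len(jobs), len(sheets))  # 16 4
```

Holding prompt and seed fixed within a sheet is what makes the comparison fair: any visible difference comes from the model or the generation path.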

What I found

| axis | result |
| --- | --- |
| base vs community fine-tune | Base has stronger physical coherence (fewer breakdowns). Fine-tunes are more expressive but trade away generalization, so anatomy breaks more easily. |
| T2V (text→video) | Seed gacha + anatomy breakdown; not viable for reliable output. You're rolling until you hit a good seed. |
| multi-person scenes | Two or more interacting bodies collapse (limbs mix between people). Establish solo quality first. |
| resolution | Higher resolution = better detail and anatomy. Downscaling-then-upscaling doesn't recover the lost information. |
| anatomy negatives / seed | Picking a clean seed + an "extra fingers / fused limbs / distorted anatomy …" negative noticeably stabilizes solo output. |

Conclusion: I2V is the mainstay

No amount of prompt + negative tuning gets T2V's gacha to "reliable." So I flipped the approach.

Build a clean still first, then animate it gently with I2V. Because the starting frame's anatomy is already correct, breakdown is structurally unlikely across the clip — frame by frame, the source's coherence holds for several seconds. The gacha is dodged by path design, not luck.
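
One way to picture the path design, as a hedged sketch: any re-rolling happens at the still stage, and the video stage runs once on a frame that has already passed inspection. Re-rolling stills until one passes is my reading of "build a clean still first", not a description of the actual pipeline; `make_still`, `looks_clean`, and `animate` are hypothetical callables standing in for the image model, a quality check (manual or automated), and the I2V model.

```python
from typing import Callable, Iterable, Optional, TypeVar

Still = TypeVar("Still")
Clip = TypeVar("Clip")


def still_then_animate(
    prompt: str,
    seeds: Iterable[int],
    make_still: Callable[[str, int], Still],   # hypothetical image generator
    looks_clean: Callable[[Still], bool],      # hypothetical anatomy check
    animate: Callable[[Still], Clip],          # hypothetical I2V call
) -> Optional[Clip]:
    """Roll seeds on the still only; animate the first still that passes."""
    for seed in seeds:
        still = make_still(prompt, seed)
        if looks_clean(still):
            return animate(still)   # the video stage runs exactly once
    return None                     # no acceptable still among these seeds


# Toy stand-ins so the control flow can be exercised without a GPU.
clip = still_then_animate(
    "a portrait",
    seeds=[1, 2, 3, 4],
    make_still=lambda prompt, seed: f"{prompt}#{seed}",
    looks_clean=lambda still: still.endswith("#3"),
    animate=lambda still: ("clip", still),
)
print(clip)  # ('clip', 'a portrait#3')
```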

It also matches the actual roleplay use case: animate a user's existing character. So: I2V is the mainstay.

The bottleneck wasn't "not enough hardware" — it was an assumption

Running I2V at high resolution made VRAM spike: 83.5 GB at 1024×768, OOM when co-resident with other services. I half-gave-up — "maybe 96 GB just isn't enough."

It wasn't a hardware problem at all. One line in the inference harness, a lazy decode running outside a no_grad block, was holding a 54 GB autograd graph. Fixing it took the peak from 83.5 GB to 29.5 GB, and 2048×1536 (3.1 MP) now runs in 33.6 GB. "Higher resolution is justice" now holds without flinching at the VRAM ceiling.
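
The class of bug is easy to reproduce at toy scale. The sketch below assumes PyTorch and uses a small `nn.Conv2d` as a stand-in for the real decoder, which is not shown in this article. It only demonstrates the mechanism: a forward pass outside `torch.no_grad()` records an autograd graph, and the same call inside it does not.

```python
import torch
from torch import nn

# Stand-in for a video VAE decoder; its weights require grad by default.
decoder = nn.Conv2d(in_channels=4, out_channels=3, kernel_size=3, padding=1)
latents = torch.randn(1, 4, 8, 8)

# Decoding outside no_grad: autograd records the graph and keeps the tensors
# needed for a backward pass that inference never runs.
leaky = decoder(latents)

# Decoding inside no_grad: nothing is recorded, so those tensors can be freed.
with torch.no_grad():
    clean = decoder(latents)

print(leaky.grad_fn is not None, clean.grad_fn is None)  # True True
```

At this size the retained graph is negligible; on a high-resolution video decode the same pattern is what scales into tens of gigabytes. `torch.inference_mode()` is a stricter alternative to `torch.no_grad()` for pure inference code.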

(The full debugging — phase-by-phase profiling to localize the peak, and the VAE-tiling dead end I tried first — is in a separate technical write-up → My high-res image-to-video kept OOMing — turns out I was decoding outside no_grad)

Solo dev work is resource-constrained, which is exactly why accepting "not enough" at face value, when it is really a measurement artifact, means giving up things you could actually do. This one drove that home.

Running experiments in production without stalling users — GPU traffic control

This is the unglamorous bit that earns its keep. There's only one GPU, and it's contested by user-facing generation (image / video / voice) and my own experiments + benchmarks. Run two heavy diffusion jobs at once and you OOM instantly, taking a user's request down with you. If you're going to experiment in production, you have to manage this.

What I did is traffic control:

  • A single lock for heavy jobs (semaphore, concurrency 1) serializes all of it (image gen, video gen, benchmarks) through one slot. Only one heavy diffusion job runs on the GPU at a time, so two heavy jobs can never collide and OOM each other.
  • Priority: if a paying user's job is waiting, my experiment (treated as "free") yields the slot. Diffusion can't be preempted mid-run, so instead of interrupting, the next free slot always goes to the user.
  • External experiments take the same lock: benchmarks I drive locally acquire the same slot over an HTTP permit broker, releasing per prompt so user requests can slip in between.
  • The unglamorous cleanup: a benchmark once crashed while holding a permit and hung all generation. So I added a 5-minute TTL reaper that reclaims abandoned permits. The classic "add it after it bites you."
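
The four rules above fit in a small amount of code. What follows is a minimal in-process sketch under stated assumptions, not the production broker: the `PermitBroker` class, its method names, and the polling style are invented for illustration, the HTTP layer is left out, and the clock is injectable only so the TTL can be exercised without waiting five minutes.

```python
import threading
import time
from dataclasses import dataclass, field
from typing import Callable, Optional, Set


@dataclass
class PermitBroker:
    """One slot for heavy GPU jobs; paid jobs outrank free ones; stale permits expire."""

    ttl_seconds: float = 300.0                    # the 5-minute TTL from the text
    clock: Callable[[], float] = time.monotonic   # injectable so the TTL is testable
    _holder: Optional[str] = field(default=None, init=False)
    _granted_at: float = field(default=0.0, init=False)
    _paid_waiting: Set[str] = field(default_factory=set, init=False)
    _lock: threading.Lock = field(default_factory=threading.Lock, init=False, repr=False)

    def _reap_locked(self) -> None:
        # Reclaim a permit whose holder went silent (e.g. a crashed benchmark).
        if self._holder is not None and self.clock() - self._granted_at > self.ttl_seconds:
            self._holder = None

    def try_acquire(self, job_id: str, paid: bool) -> bool:
        """Non-blocking: callers poll until they get True, then run one job."""
        with self._lock:
            self._reap_locked()
            if paid:
                self._paid_waiting.add(job_id)
            if self._holder is not None:
                return False   # slot busy; a running job is never interrupted
            if not paid and self._paid_waiting:
                return False   # free work yields the next slot to a waiting user
            self._paid_waiting.discard(job_id)
            self._holder = job_id
            self._granted_at = self.clock()
            return True

    def release(self, job_id: str) -> None:
        with self._lock:
            if self._holder == job_id:
                self._holder = None


now = [0.0]                                   # fake clock for the walkthrough
broker = PermitBroker(clock=lambda: now[0])
log = [
    broker.try_acquire("bench-1", paid=False),   # benchmark takes the idle slot
    broker.try_acquire("user-1", paid=True),     # user waits, but is now registered
]
broker.release("bench-1")                        # benchmark releases after one prompt
log.append(broker.try_acquire("bench-2", paid=False))  # yields: a paid job is waiting
log.append(broker.try_acquire("user-1", paid=True))    # user gets the next slot
now[0] = 301.0                                   # pretend user-1's worker died
log.append(broker.try_acquire("bench-2", paid=False))  # permit reaped after the TTL
print(log)  # [True, False, False, True, True]
```

Two simplifications are worth naming. A paid job that registers as waiting and then vanishes would block free jobs indefinitely, so a real broker would also expire waiters. And the TTL has to exceed the longest legitimate job, otherwise the reaper would hand out a second permit while the first job is still running.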

This traffic control is what makes "experiment in production without stalling users" actually true. For a solo operation, downtime = lost opportunity, so this is not the place to cut corners.

Where it stands now

  • Clean still → I2V runs stably in the 30 GB range, co-existing with the image/voice stacks without contention.
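  • Peak is roughly flat as resolution climbs: going from 1024×768 to 2048×1536 is 4× the pixels but only about 14% more peak VRAM (29.5 GB → 33.6 GB), so high resolution is usable day-to-day.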
  • Per-request model switching for A/B is wired into production, and I'm tuning the recipes that ship directly in production.

A niche the big labs walked away from, attacked by one person on one GPU. Every time a "bottleneck" turns out to have been an assumption, the odds on this bet get a little better.

The uncensored version (NSFW)

Everything above is SFW. Which community models (base / Sulphur / 10Eros), what the outputs actually look like, and how usable it is in practice — there's an uncensored companion with comparison clips and raw detail on a separate page. For those who actually want to see it.

🔞 See the uncensored version (NSFW) → Contains adult examples. Do not open if you are under 18 or do not wish to see explicit material.


This platform (kotonia.ai) is a solo-built, solo-run immersive voice × video roleplay service.
