Running LTX-2.3 Alongside TTS on a Single 96GB GPU with a Cold-Start Architecture


When integrating LTX-2.3 (a 22B audio-to-video model) into a voice roleplay product, I ran straight into a VRAM wall. The classic dead-end: running it as a persistent server ate 86 GiB, instantly OOM-ing the TTS / Ditto / MuseTalk stack sharing the same GPU. This is the story of switching to a cold-start design that idles at 0 GiB and peaks at 40 GiB.

Hardware: RTX Pro 6000 Blackwell Max-Q (94.97 GiB). Software: LTX-2 official repo and bitsandbytes 0.49.1.

What I Was Trying to Do

A2V (audio-to-video) mode generates lip-sync video from audio + a reference image + a text prompt. Specifically, it uses A2VidPipelineTwoStage:

prompt + audio_path + image
   ↓ stage_1 (generate video latent at low resolution, audio fixed)
   ↓ spatial upsample 2x
   ↓ stage_2 (refinement at high resolution, distilled LoRA-384 applied)
   ↓ video VAE decode + embed original input audio
mp4 output

The official pipeline builds → runs → frees each component inside every __call__, which means ~50 seconds of disk I/O per request. I wanted to keep everything resident in memory.

Dead-End 1: VRAM Breakdown in Persistent Mode

Loading every LTX-2 component into VRAM at once (all bf16):

Component                                       VRAM
embeddings processor                            5.91 GiB
Gemma3-12B text encoder                        22.78 GiB
stage_1 transformer                            35.38 GiB
stage_2 transformer (distilled LoRA applied)   35.38 GiB
video VAE encoder                               0.60 GiB
audio VAE encoder                               0.04 GiB
spatial upsampler                               0.92 GiB
video decoder                                   0.76 GiB
Total                                         101.77 GiB

The 101.77 GiB total doesn't fit in the card's 94.97 GiB. Loading died partway through the stage_2 transformer with "CUDA out of memory. Tried to allocate 128.00 MiB."
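
As a quick sanity check, the sizes in the table can be summed against the card's capacity. This is only a sketch: it assumes components load in the order the table lists them, which the article does not state, and the function name is made up for illustration.

```python
# Per-component bf16 sizes in GiB, copied from the table above.
COMPONENTS_GIB = {
    "embeddings processor": 5.91,
    "Gemma3-12B text encoder": 22.78,
    "stage_1 transformer": 35.38,
    "stage_2 transformer": 35.38,
    "video VAE encoder": 0.60,
    "audio VAE encoder": 0.04,
    "spatial upsampler": 0.92,
    "video decoder": 0.76,
}
GPU_CAPACITY_GIB = 94.97  # from the hardware line


def first_component_that_overflows(components, capacity_gib):
    """Return the first component whose load pushes the running total past capacity.

    Assumes load order equals dict order (hypothetical). Returns None if all fit.
    """
    running = 0.0
    for name, size_gib in components.items():
        running += size_gib
        if running > capacity_gib:
            return name
    return None


total = sum(COMPONENTS_GIB.values())
overflow_at = first_component_that_overflows(COMPONENTS_GIB, GPU_CAPACITY_GIB)
print(f"total={total:.2f} GiB, overflows while loading: {overflow_at}")
```

Under that ordering assumption the running total crosses the limit at the stage_2 transformer, which matches where the load actually failed.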

Dead-End 2: "Gemma Is Small" Is a Misconception

My intuition was "a 12B text encoder can't be that heavy" — but it actually loads at 22.78 GiB. With 12B parameters in bf16, that's exactly what you'd expect.

The model filename is gemma-3-12b-it-qat-q4_0-unquantized. Here, qat-q4_0 means it was trained with Quantization-Aware Training for q4_0, and unquantized means the weights are stored as pre-quantization bf16. If you're using it as intended, you should load it in q4_0. Loading it in bf16 is technically valid but wasteful — like running a quantized model at full precision.
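
A back-of-the-envelope estimate shows why the bf16 number is unsurprising. The sketch below assumes a nominal 12e9 parameters; the real count differs somewhat, and a 4-bit load also keeps embeddings and quantization metadata at higher precision, so treat both figures as rough floors rather than predictions.

```python
GIB = 1024 ** 3


def weight_size_gib(num_params: float, bits_per_param: float) -> float:
    """Raw weight storage in GiB, ignoring buffers and allocator overhead."""
    return num_params * bits_per_param / 8 / GIB


NOMINAL_PARAMS = 12e9  # assumption: "12B" taken literally

bf16_gib = weight_size_gib(NOMINAL_PARAMS, 16)
four_bit_gib = weight_size_gib(NOMINAL_PARAMS, 4)
print(f"bf16: {bf16_gib:.2f} GiB, 4-bit: {four_bit_gib:.2f} GiB")
```

The bf16 estimate lands within about half a GiB of the measured 22.78 GiB, so nothing unusual is happening in the loader.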

Fix 1: 4-bit Loading with bitsandbytes

LTX-2's Gemma loader uses transformers.Gemma3ForConditionalGeneration internally, so bnb 4-bit works cleanly. I bypass the LTX-2 custom loader path and use from_pretrained directly:

import torch
from transformers import BitsAndBytesConfig, Gemma3ForConditionalGeneration

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
)
model = Gemma3ForConditionalGeneration.from_pretrained(
    gemma_root,
    quantization_config=quant_config,
    device_map={"": "cuda:0"},
    torch_dtype=torch.bfloat16,  # ← dtype for non-quantized layers (embeddings, etc.)
    local_files_only=True,
)

If you omit torch_dtype, embeddings load as fp16 and clash with Linear4bit's bnb_4bit_compute_dtype (bf16): mat1 and mat2 must have the same dtype, but got Half and BFloat16. I hit that too.

The patches LTX-2 applies to Gemma (RoPE inv_freq / embed_scale / position_ids register_buffer) still work fine — just call create_and_populate(encoder). Since bnb quantization only replaces nn.Linear, Embedding layers and buffers pass through untouched.

Result: Gemma's VRAM drops from 22.78 GiB → 7.26 GiB, freeing about 15.5 GiB.

Dead-End 3: Even With That, Persistent Mode Can't Coexist

With Gemma at 4-bit, the total persistent footprint is 86.26 GiB allocated (reserved 88.27 GiB, nvidia-smi shows 91 GiB). Headroom: 4 GiB. Inference workspace during generation (with CFG, roughly +5 GiB) blows past that, peaking at 91 GiB. Adding TTS (3.4 GiB) + Ditto (3.0 GiB) = 6.4 GiB on top makes OOM inevitable no matter how you slice it.

Three options:

  1. Offload TTS+Ditto (voice chat unavailable while A2V runs)
  2. Keep only one transformer resident (still leaves OOM risk)
  3. Cold-start: build → run → free all weights per request

Since I wanted to keep real-time conversation (MuseTalk + TTS, TTFA ~930ms) running while using LTX-2 as a "cinematic" feature, I went with option 3.

Fix 2: Cold-Start Architecture

The key insight: the pipeline object itself is lightweight — the Builder only mmaps, it doesn't load actual weights into VRAM. So I hold the A2VidPipelineTwoStage instance in memory, and let the official implementation's context-manager-per-component build → run → free on every __call__.

class PersistentA2VPipeline:
    def __init__(self, ..., cold_start: bool):
        self.pipeline = A2VidPipelineTwoStage(...)  # builder only, nearly zero VRAM
        if cold_start:
            return  # done here
        # persistent mode only: start preloading components from here

    def _generate_cold(self, ...):
        # pipeline.__call__ handles component build/free internally
        video, audio = self.pipeline(prompt=..., audio_path=..., images=...)
        encode_video(video, audio, output_path, ...)
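
The build → run → free cycle can be pictured with a small context manager. This is a toy model, not the LTX-2 implementation: FakeVram, component, and generate_cold are hypothetical names, and memory is simulated with a counter so the sketch runs without a GPU.

```python
from contextlib import contextmanager


class FakeVram:
    """Stand-in for GPU memory accounting (hypothetical, for illustration only)."""

    def __init__(self):
        self.allocated = 0.0
        self.peak = 0.0

    def alloc(self, gib):
        self.allocated += gib
        self.peak = max(self.peak, self.allocated)

    def free(self, gib):
        self.allocated -= gib


@contextmanager
def component(vram, name, size_gib):
    """Build a component on entry and free it on exit, even if the body raises."""
    vram.alloc(size_gib)  # real code: load the weights onto the GPU
    try:
        yield name
    finally:
        vram.free(size_gib)  # real code: drop references, then torch.cuda.empty_cache()


def generate_cold(vram):
    """Run the two stages back to back; each transformer lives only inside its block."""
    with component(vram, "stage_1 transformer", 35.38):
        pass  # stage 1 denoising would run here
    with component(vram, "stage_2 transformer", 35.38):
        pass  # stage 2 refinement would run here
    return vram.allocated, vram.peak


vram = FakeVram()
allocated_after, peak = generate_cold(vram)
print(f"allocated after: {allocated_after:.2f} GiB, peak: {peak:.2f} GiB")
```

In this toy model the peak equals one transformer rather than two, because the first block exits before the second is entered.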

Since stage_1 and stage_2 run sequentially, only one transformer is in VRAM at a time. Measured peak: 39.50 GiB. After generation completes, everything is freed — back to allocated 0.01 GiB / nvidia-smi 0.55 GiB (CUDA context only).

[mode] cold-start: components load per-request (slow first call, low idle VRAM)
[cuda] cold-start startup (no preload): allocated=0.00GiB
...
[cuda] after cold-start generate: allocated=0.01GiB peak=39.50GiB
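
Log lines in this shape can be produced from PyTorch's allocator counters. The article does not show its logging code, so the helper below is an assumption about how one might do it; format_cuda_line and cuda_snapshot are hypothetical names, while torch.cuda.memory_allocated and torch.cuda.max_memory_allocated are the standard PyTorch calls.

```python
GIB = 1024 ** 3


def format_cuda_line(tag, allocated_bytes, peak_bytes=None):
    """Format byte counts as a '[cuda] ...' log line in GiB."""
    line = f"[cuda] {tag}: allocated={allocated_bytes / GIB:.2f}GiB"
    if peak_bytes is not None:
        line += f" peak={peak_bytes / GIB:.2f}GiB"
    return line


def cuda_snapshot(tag, with_peak=False):
    """Read the current allocator counters and format them (needs PyTorch + CUDA)."""
    import torch  # imported lazily so the formatter above works without PyTorch

    allocated = torch.cuda.memory_allocated()
    peak = torch.cuda.max_memory_allocated() if with_peak else None
    return format_cuda_line(tag, allocated, peak)
```

Calling torch.cuda.reset_peak_memory_stats() before each request would make the peak value per-request rather than per-process. Note that these counters track PyTorch's own allocations, which is why nvidia-smi reports a higher number.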

While voice chat runs (TTS 3.4 + Ditto 3.0 = 6.4 GiB), LTX is at 0 GiB. When an A2V request comes in, it spikes to 40 GiB and drops back to 0 about 60 seconds later — fully dynamic allocation.
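
The coexistence argument comes down to simple arithmetic on the figures quoted in this article. The sketch below is a rough budget only: it mixes nvidia-smi and allocator numbers, leaves out MuseTalk (no footprint is given for it), and ignores fragmentation. The fits helper is hypothetical.

```python
GPU_CAPACITY_GIB = 94.97
NEIGHBORS_GIB = 3.4 + 3.0  # TTS + Ditto, as quoted in the article


def fits(ltx_peak_gib, neighbors_gib=NEIGHBORS_GIB, capacity_gib=GPU_CAPACITY_GIB):
    """Return (fits?, remaining margin in GiB) for LTX peak plus resident neighbors."""
    total = ltx_peak_gib + neighbors_gib
    return total <= capacity_gib, capacity_gib - total


persistent_ok, persistent_margin = fits(91.0)  # persistent-mode peak
cold_ok, cold_margin = fits(39.50)             # cold-start measured peak

print(f"persistent: ok={persistent_ok}, margin={persistent_margin:.2f} GiB")
print(f"cold-start: ok={cold_ok}, margin={cold_margin:.2f} GiB")
```

Persistent mode comes out a couple of GiB over budget, while cold-start leaves roughly half the card free even at its peak.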

Gotcha: Audio VAE Preprocessing

The A2V audio VAE encoder expects a 2-channel (stereo) waveform, but TTS output is typically mono. Passing mono gives you expected input[1, 1, 207, 66] to have 2 channels, but got 1 channels instead from Conv2d.

Also, if the input audio is shorter than num_frames / frame_rate, the encoded audio latent ends up shorter than expected and causes a shape mismatch at the transformer input.

Both handled with a single ffmpeg call:

# mono → stereo + silence padding in one pass
ffmpeg -y -i input.wav -ac 2 -af apad -t 2.041667 output.wav

On the server side, check channels and duration with av, run the ffmpeg subprocess only when needed, and pass the temp file. If both conditions are already satisfied, pass the original file directly with zero copying.
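
A minimal sketch of that check-then-convert logic follows. It swaps in the standard-library wave module for probing instead of av so that it stays self-contained, which limits it to WAV input. The function names are hypothetical, and the 49 frames at 24 fps in the usage lines are just one combination that yields the 2.041667 seconds used above, not a value confirmed by the article.

```python
import wave


def probe_wav(path):
    """Return (channels, duration in seconds) for a WAV file."""
    with wave.open(path, "rb") as w:
        return w.getnchannels(), w.getnframes() / w.getframerate()


def ffmpeg_fix_command(src, dst, channels, duration_s, required_s):
    """Return the ffmpeg argv that fixes the audio, or None if src is usable as-is.

    Usable means stereo and at least required_s long.
    """
    if channels == 2 and duration_s >= required_s:
        return None  # pass the original file through, no copy
    return [
        "ffmpeg", "-y", "-i", src,
        "-ac", "2",                   # mono -> stereo
        "-af", "apad",                # pad the tail with silence
        "-t", f"{required_s:.6f}",    # cut the padded output at the target length
        dst,
    ]


required = 49 / 24  # assumption: num_frames / frame_rate
cmd_mono = ffmpeg_fix_command("input.wav", "output.wav", 1, 1.0, required)
cmd_ok = ffmpeg_fix_command("input.wav", "output.wav", 2, 3.0, required)
```

When a command is returned, it can be run with subprocess.run(cmd, check=True) and the temp file passed to the pipeline in place of the original.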

Numbers and Tradeoffs

Metric                        Persistent              Cold-Start
Idle VRAM                     86 GiB                  0 GiB
Peak VRAM during generation   91 GiB                  40 GiB
Time per request              ~17s (inference only)   ~60s (including disk I/O)
TTS+Ditto coexistence         Impossible (OOM)        Possible
OS page cache effect          None                    ~25-30s from 2nd request onward

The cost of cold-start is disk I/O time (reading 73 GB from NVMe, ~40 seconds). First request: ~60s. After OS page cache warms up: ~25-30s. Not suitable for rapid-fire generation, but perfectly fine for "one cinematic shot every 1-2 minutes" or "inserted at scene transitions."
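
The first-request figure is roughly the sum of inference and load time, and the load time implies an effective read rate. The arithmetic below uses only the numbers quoted above and assumes inference time is the same in both modes, which the article does not confirm; the helper names are hypothetical.

```python
def request_latency_s(inference_s, load_s):
    """Total latency when weight loading and inference do not overlap."""
    return inference_s + load_s


def read_throughput_gb_s(size_gb, seconds):
    """Effective read rate implied by loading size_gb in the given time."""
    return size_gb / seconds


first_request_s = request_latency_s(17, 40)   # ~17s inference + ~40s disk I/O
effective_rate = read_throughput_gb_s(73, 40)  # 73 GB read in ~40s

print(f"first request: ~{first_request_s}s, effective read rate: ~{effective_rate:.2f} GB/s")
```

The sum comes out close to the observed ~60s, so disk reads account for most of the cold-start penalty on a first request.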

Strategic Role

I originally planned to use LTX-2 as the main real-time avatar for live conversation. The idea was to generate at low resolution and upscale for speed — but when I tested 256×256, quality fell apart (out of the training bucket distribution). AI upscaling from degraded input can't restore lip-sync accuracy.

The revised split:

  • Real-time conversation: MuseTalk + multilingual TTS (TTFA ~930ms, already running)
  • Async cinematic moments: LTX-2 for scene transitions, emotional peaks, travel-sequence avatars — anywhere a 60-second generation wait is acceptable

The cold-start design only makes sense under the premise that "the wait is part of the production value." That's what this architecture is built around.


We're continuing to develop voice roleplay × multilingual high-quality TTS × lip-sync avatar systems. Engineering posts on LTX-2 integration, how we compressed Qwen3-TTS VRAM from 15 GB to 7 GB, and more are at /articles.
