Running LTX-2.3 Alongside TTS on a Single 96GB GPU with a Cold-Start Architecture
When integrating LTX-2.3 (a 22B audio-to-video model) into a voice roleplay product, I ran straight into a VRAM wall. The classic dead-end: running it as a persistent server ate 86 GiB, instantly OOM-ing the TTS / Ditto / MuseTalk stack sharing the same GPU. This is the story of switching to a cold-start design that idles at 0 GiB and peaks at 40 GiB.
Hardware: RTX Pro 6000 Blackwell Max-Q (94.97 GiB). Software: LTX-2 official repo and bitsandbytes 0.49.1.
What I Was Trying to Do
A2V (audio-to-video) mode generates lip-sync video from audio + a reference image + a text prompt. Specifically, it uses A2VidPipelineTwoStage:

    prompt + audio_path + image
      ↓ stage_1 (generate video latent at low resolution, audio fixed)
      ↓ spatial upsample 2x
      ↓ stage_2 (refinement at high resolution, distilled LoRA-384 applied)
      ↓ video VAE decode + embed original input audio
    mp4 output

The official pipeline builds → runs → frees each component inside every __call__, which means ~50 seconds of disk I/O per request. I wanted to keep everything resident in memory.
Dead-End 1: VRAM Breakdown in Persistent Mode
Loading every LTX-2 component into VRAM at once (all bf16):
| Component | VRAM |
|---|---|
| embeddings processor | 5.91 GiB |
| Gemma3-12B text encoder | 22.78 GiB |
| stage_1 transformer | 35.38 GiB |
| stage_2 transformer (distilled LoRA applied) | 35.38 GiB |
| video VAE encoder | 0.60 GiB |
| audio VAE encoder | 0.04 GiB |
| spatial upsampler | 0.92 GiB |
| video decoder | 0.76 GiB |
| Total | 101.77 GiB |
The ~102 GiB total doesn't fit on a 96 GB card (94.97 GiB usable). Loading died midway through the stage_2 transformer with `CUDA out of memory. Tried to allocate 128.00 MiB.`
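As a quick sanity check on the table, the per-component figures can be summed and compared with the card's capacity. This is only a bookkeeping sketch: the dictionary copies the measured values from the table, and `GPU_CAPACITY_GIB` is the 94.97 GiB figure quoted in the hardware line.

```python
# Per-component VRAM in GiB, copied from the table above (all bf16).
COMPONENTS_GIB = {
    "embeddings processor": 5.91,
    "Gemma3-12B text encoder": 22.78,
    "stage_1 transformer": 35.38,
    "stage_2 transformer (distilled LoRA applied)": 35.38,
    "video VAE encoder": 0.60,
    "audio VAE encoder": 0.04,
    "spatial upsampler": 0.92,
    "video decoder": 0.76,
}
GPU_CAPACITY_GIB = 94.97  # capacity quoted for this card

total_gib = sum(COMPONENTS_GIB.values())
shortfall_gib = total_gib - GPU_CAPACITY_GIB

print(f"total={total_gib:.2f} GiB  capacity={GPU_CAPACITY_GIB:.2f} GiB  "
      f"shortfall={shortfall_gib:.2f} GiB")
```

The two transformers alone account for about 70 GiB, which is why the load fails partway through the second one.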
Dead-End 2: "Gemma Is Small" Is a Misconception
My intuition was "a 12B text encoder can't be that heavy", but it actually loads at 22.78 GiB. That is in line with the arithmetic: 12B parameters × 2 bytes (bf16) is about 24 GB, or roughly 22.4 GiB.
The model filename is gemma-3-12b-it-qat-q4_0-unquantized. Here, qat-q4_0 means it was trained with Quantization-Aware Training targeting q4_0, and unquantized means the weights are stored as pre-quantization bf16. The checkpoint is meant to be quantized to 4 bits before use, so loading it in bf16 is technically valid but wasteful: you pay full-precision memory for weights that were trained to tolerate 4-bit quantization.
Fix 1: 4-bit Loading with bitsandbytes
LTX-2's Gemma loader uses transformers.Gemma3ForConditionalGeneration internally, so bnb 4-bit works cleanly. I bypass the LTX-2 custom loader path and use from_pretrained directly:

    import torch
    from transformers import BitsAndBytesConfig, Gemma3ForConditionalGeneration

    # NF4 4-bit weights with double quantization; matmuls run in bf16.
    quant_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_compute_dtype=torch.bfloat16,
        bnb_4bit_use_double_quant=True,
        bnb_4bit_quant_type="nf4",
    )

    model = Gemma3ForConditionalGeneration.from_pretrained(
        gemma_root,  # local directory holding the Gemma checkpoint
        quantization_config=quant_config,
        device_map={"": "cuda:0"},
        torch_dtype=torch.bfloat16,  # dtype for non-quantized layers (embeddings, etc.)
        local_files_only=True,
    )

If you omit `torch_dtype`, the embeddings load as fp16 and clash with Linear4bit's `bnb_4bit_compute_dtype` (bf16), which fails with `mat1 and mat2 must have the same dtype, but got Half and BFloat16`. I hit that too.
The patches LTX-2 applies to Gemma (RoPE inv_freq / embed_scale / position_ids register_buffer) still work fine — just call create_and_populate(encoder). Since bnb quantization only replaces nn.Linear, Embedding layers and buffers pass through untouched.
Result: Gemma's VRAM drops from 22.78 GiB to 7.26 GiB, freeing about 15.5 GiB.
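A back-of-the-envelope estimate helps put both measurements in context. The sketch below assumes a nominal 12e9 parameters (the real count differs somewhat) and counts weight storage only; `weight_gib` is a hypothetical helper, not part of any library.

```python
GIB = 2**30

def weight_gib(num_params: float, bits_per_param: float) -> float:
    """Weight storage in GiB for a given parameter count and precision."""
    return num_params * bits_per_param / 8 / GIB

NOMINAL_PARAMS = 12e9  # assumption: "12B" taken literally

bf16_gib = weight_gib(NOMINAL_PARAMS, 16)
four_bit_gib = weight_gib(NOMINAL_PARAMS, 4)

print(f"bf16  estimate: {bf16_gib:.2f} GiB (measured 22.78 GiB)")
print(f"4-bit estimate: {four_bit_gib:.2f} GiB (measured 7.26 GiB)")
```

The measured 4-bit figure sits above the idealized estimate, which is consistent with the point above that bitsandbytes only replaces nn.Linear: embeddings and buffers stay in bf16, and the quantization metadata takes some space as well.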
Dead-End 3: Even With That, Persistent Mode Can't Coexist
With Gemma at 4-bit, the total persistent footprint is 86.26 GiB allocated (88.27 GiB reserved; nvidia-smi shows 91 GiB), which leaves about 4 GiB of headroom. The inference workspace during generation (roughly +5 GiB with CFG) blows past that, peaking at 91 GiB. Adding TTS (3.4 GiB) + Ditto (3.0 GiB) = 6.4 GiB on top makes OOM inevitable no matter how you slice it.
Three options:
- Offload TTS+Ditto (voice chat unavailable while A2V runs)
- Keep only one transformer resident (still leaves OOM risk)
- Cold-start: build → run → free all weights per request
Since I wanted to keep real-time conversation (MuseTalk + TTS, TTFA ~930ms) running while using LTX-2 as a "cinematic" feature, I went with option 3.
Fix 2: Cold-Start Architecture
The key insight: the pipeline object itself is lightweight. The Builder only mmaps; it doesn't load actual weights into VRAM. So I keep the A2VidPipelineTwoStage instance alive and let the official implementation's per-component context managers build, run, and free each component on every `__call__`.

    class PersistentA2VPipeline:
        def __init__(self, ..., cold_start: bool):
            self.pipeline = A2VidPipelineTwoStage(...)  # builder only, nearly zero VRAM
            if cold_start:
                return  # done here
            # persistent mode only: start preloading components from here

        def _generate_cold(self, ...):
            # pipeline.__call__ handles component build/free internally
            video, audio = self.pipeline(prompt=..., audio_path=..., images=...)
            encode_video(video, audio, output_path, ...)

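The build → run → free cycle itself can be illustrated with a small generic helper. This is a minimal sketch of the pattern, not LTX-2's actual implementation: `run_transient` and `_release_cuda_cache` are hypothetical names, and the only PyTorch call involved is `torch.cuda.empty_cache()`.

```python
import gc
from typing import Callable, TypeVar

C = TypeVar("C")
R = TypeVar("R")

def _release_cuda_cache() -> None:
    """Return cached blocks to the driver if PyTorch with CUDA is present."""
    try:
        import torch
    except ImportError:
        return  # the sketch still runs on machines without PyTorch
    if torch.cuda.is_available():
        torch.cuda.empty_cache()

def run_transient(build: Callable[[], C], run: Callable[[C], R]) -> R:
    """Build a component, use it once, and drop it before returning."""
    component = build()
    try:
        return run(component)
    finally:
        # Drop the only reference so the weights can be reclaimed, then
        # ask the caching allocator to hand the freed blocks back.
        del component
        gc.collect()
        _release_cuda_cache()
```

Because the caller never holds a reference to the component, nothing survives the call except the result. Running each stage through its own `run_transient` call would keep the two transformers from ever overlapping in memory.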
Since stage_1 and stage_2 run sequentially, only one transformer is in VRAM at a time. Measured peak: 39.50 GiB. After generation completes, everything is freed — back to allocated 0.01 GiB / nvidia-smi 0.55 GiB (CUDA context only).

    [mode] cold-start: components load per-request (slow first call, low idle VRAM)
    [cuda] cold-start startup (no preload): allocated=0.00GiB
    ...
    [cuda] after cold-start generate: allocated=0.01GiB peak=39.50GiB

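Log lines in this shape can be produced from PyTorch's allocator counters. The helper below is a hypothetical reconstruction, not the author's logging code; it relies on `torch.cuda.memory_allocated()` and `torch.cuda.max_memory_allocated()`, and imports torch lazily so the formatter can be used on its own.

```python
from typing import Optional

GIB = 2**30

def format_cuda_line(tag: str, allocated_bytes: int,
                     peak_bytes: Optional[int] = None) -> str:
    """Format byte counts as a '[cuda] ...' log line in GiB."""
    line = f"[cuda] {tag}: allocated={allocated_bytes / GIB:.2f}GiB"
    if peak_bytes is not None:
        line += f" peak={peak_bytes / GIB:.2f}GiB"
    return line

def log_cuda(tag: str, with_peak: bool = False) -> str:
    """Read the current allocator counters and print one log line."""
    import torch  # lazy import: only needed when actually measuring

    allocated = torch.cuda.memory_allocated()
    peak = torch.cuda.max_memory_allocated() if with_peak else None
    line = format_cuda_line(tag, allocated, peak)
    print(line)
    return line
```

To get a per-request peak rather than a process-lifetime peak, `torch.cuda.reset_peak_memory_stats()` can be called before each generation.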
While voice chat runs (TTS 3.4 + Ditto 3.0 = 6.4 GiB), LTX is at 0 GiB. When an A2V request comes in, it spikes to 40 GiB and drops back to 0 about 60 seconds later — fully dynamic allocation.
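The coexistence argument reduces to one inequality: LTX peak plus the voice stack must stay under the card's capacity. The sketch below uses the measured figures from this post; `fits` and `peak_with_voice` are hypothetical helpers, and the check ignores allocator reserve and CUDA context overhead, so real headroom is somewhat smaller.

```python
GPU_CAPACITY_GIB = 94.97
VOICE_STACK_GIB = 3.4 + 3.0  # TTS + Ditto

def peak_with_voice(ltx_peak_gib: float) -> float:
    """Combined peak when LTX generates while the voice stack is resident."""
    return ltx_peak_gib + VOICE_STACK_GIB

def fits(ltx_peak_gib: float) -> bool:
    """True if the combined peak stays within the card's capacity."""
    return peak_with_voice(ltx_peak_gib) <= GPU_CAPACITY_GIB

for mode, ltx_peak in [("persistent", 91.0), ("cold-start", 39.50)]:
    verdict = "fits" if fits(ltx_peak) else "OOM"
    print(f"{mode:>10}: {peak_with_voice(ltx_peak):.2f} GiB -> {verdict}")
```

Persistent mode overshoots the card by a couple of GiB, while cold-start leaves roughly half the card free even at its peak.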
Gotcha: Audio VAE Preprocessing
The A2V audio VAE encoder expects a 2-channel (stereo) waveform, but TTS output is typically mono. Passing mono makes Conv2d fail with `expected input[1, 1, 207, 66] to have 2 channels, but got 1 channels instead`.
Also, if the input audio is shorter than num_frames / frame_rate, the encoded audio latent ends up shorter than expected and causes a shape mismatch at the transformer input.
Both handled with a single ffmpeg call:

    # mono → stereo + silence padding in one pass
    # (-t is the target duration in seconds, i.e. num_frames / frame_rate)
    ffmpeg -y -i input.wav -ac 2 -af apad -t 2.041667 output.wav

On the server side, check the channel count and duration with PyAV (`av`), run the ffmpeg subprocess only when needed, and pass the resulting temp file to the pipeline. If both conditions are already satisfied, pass the original file directly with no copy.
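The check-then-convert logic might look like the sketch below. To keep it dependency-free it probes the file with the standard-library `wave` module instead of PyAV, so it only handles PCM WAV input; all function names are hypothetical. The 49 frames at 24 fps in the usage note are an assumption chosen because they reproduce the 2.041667 seconds in the command above.

```python
import subprocess
import wave
from typing import List, Tuple

def probe_wav(path: str) -> Tuple[int, float]:
    """Return (channel count, duration in seconds) of a PCM WAV file."""
    with wave.open(path, "rb") as wav:
        return wav.getnchannels(), wav.getnframes() / wav.getframerate()

def target_duration(num_frames: int, frame_rate: float) -> float:
    """Audio length in seconds that the video length requires."""
    return num_frames / frame_rate

def build_ffmpeg_cmd(src: str, dst: str, duration_s: float) -> List[str]:
    """Stereo conversion plus silence padding, cut at duration_s."""
    # Note: -t also trims input that is longer than duration_s.
    return ["ffmpeg", "-y", "-i", src, "-ac", "2",
            "-af", "apad", "-t", f"{duration_s:.6f}", dst]

def prepare_audio(src: str, dst: str, num_frames: int, frame_rate: float) -> str:
    """Return a path to audio that is stereo and long enough."""
    channels, duration = probe_wav(src)
    needed = target_duration(num_frames, frame_rate)
    if channels == 2 and duration >= needed:
        return src  # already usable: skip ffmpeg, no copy
    subprocess.run(build_ffmpeg_cmd(src, dst, needed), check=True)
    return dst
```

For example, `prepare_audio("tts.wav", "/tmp/a2v_input.wav", 49, 24)` would return `"tts.wav"` unchanged when the file is already stereo and at least 2.041667 s long, and otherwise write the converted file to the temp path.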
Numbers and Tradeoffs
| Metric | Persistent | Cold-Start |
|---|---|---|
| Idle VRAM | 86 GiB | 0 GiB |
| Peak VRAM during generation | 91 GiB | 40 GiB |
| Time per request | ~17s (inference only) | ~60s (including disk I/O) |
| TTS+Ditto coexistence | Impossible (OOM) | Possible |
| OS page cache effect | None | ~25-30s from 2nd request onward |
The cost of cold-start is disk I/O time (reading 73 GB from NVMe, ~40 seconds). First request: ~60s. After OS page cache warms up: ~25-30s. Not suitable for rapid-fire generation, but perfectly fine for "one cinematic shot every 1-2 minutes" or "inserted at scene transitions."
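The timing figures can be cross-checked with simple arithmetic. This is a rough sketch built only from the numbers quoted in this post; the split of a warm request into inference plus load time is an inference from those numbers, not a separate measurement.

```python
WEIGHTS_GB = 73       # data read from NVMe per cold request
COLD_READ_S = 40      # approximate disk I/O time on a cold page cache
INFERENCE_S = 17      # inference-only time from the persistent-mode column
WARM_REQUEST_S = (25, 30)  # observed range once the page cache is warm

implied_throughput_gbps = WEIGHTS_GB / COLD_READ_S
cold_estimate_s = COLD_READ_S + INFERENCE_S
warm_load_s = tuple(t - INFERENCE_S for t in WARM_REQUEST_S)

print(f"implied NVMe throughput: {implied_throughput_gbps:.2f} GB/s")
print(f"cold request estimate:   {cold_estimate_s} s (observed ~60 s)")
print(f"warm load time:          {warm_load_s[0]}-{warm_load_s[1]} s")
```

The cold estimate lands close to the observed 60 seconds, and the warm numbers suggest that loading from page cache still costs on the order of ten seconds per request.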
Strategic Role
I originally planned to use LTX-2 as the main real-time avatar for live conversation. The idea was to generate at low resolution and upscale for speed, but at 256×256 quality fell apart because that resolution is outside the training bucket distribution. AI upscaling from degraded input can't restore lip-sync accuracy.
The revised split:
- Real-time conversation: MuseTalk + multilingual TTS (TTFA ~930ms, already running)
- Async cinematic moments: LTX-2 for scene transitions, emotional peaks, travel-sequence avatars — anywhere a 60-second generation wait is acceptable
The cold-start design only makes sense under the premise that "the wait is part of the production value." That's what this architecture is built around.