NAVA — Audio-Video Generator (ZeroGPU)

Single H200 · FP8 · Default 5s @ 24fps · 25 steps. ~5 minutes per request when the queue is short.

Tip: ① Type a short prompt (Chinese or English). ② Optionally upload a first-frame image; I2V mode is enabled automatically and the aspect ratio is set automatically. ③ Click Rewrite Prompt: Qwen3 expands your input into the long Chinese caption format NAVA was trained on and, when an image is uploaded, Qwen3-VL captions the scene and that caption is composed into the rewrite. Wrap any spoken line in <S>...<E>; the rewriter preserves these verbatim.
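The <S>...<E> convention in the tip can be checked on the client side before submitting. The following is a minimal sketch, not NAVA's own code: the helper names `extract_speech` and `speech_preserved` and the sample prompts are hypothetical, and it assumes the tags are literal `<S>` and `<E>` markers that do not nest.

```python
import re

# Assumption: spoken lines are wrapped in literal <S>...<E> markers, not nested.
SPEECH_RE = re.compile(r"<S>(.*?)<E>", re.DOTALL)


def extract_speech(prompt: str) -> list[str]:
    """Return the spoken lines wrapped in <S>...<E>, in order of appearance."""
    return SPEECH_RE.findall(prompt)


def speech_preserved(original: str, rewritten: str) -> bool:
    """True if the rewrite carries the same spoken lines, verbatim and in order."""
    return extract_speech(original) == extract_speech(rewritten)


# Hypothetical prompts for illustration only.
original = "A woman waves and says <S>Hello, nice to meet you<E>."
rewritten = (
    "A young woman stands in a sunlit park, waves, and says "
    "<S>Hello, nice to meet you<E> with a smile."
)

print(extract_speech(original))
print(speech_preserved(original, rewritten))
```

A check like this is useful after clicking Rewrite Prompt: if it fails, the spoken line was altered and the rewritten prompt should be edited by hand before generating.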

Prompt (original input)

VL Caption (generated automatically when an image is uploaded; empty for text-only input)

Rewritten Prompt (filled in after you click Rewrite; if you do not click, the original input is used)

Speech Check

Image (optional; uploading one enables I2V mode)

First Frame Image

Speaker Reference (optional, max 2)

Speaker 1 WAV

Speaker 2 WAV

Inference Steps

Range: 10 to 100

Duration (seconds, 24 fps) — values above 6s may exceed the 330s ZeroGPU budget; 10s is very slow

Range: 2 to 10
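To get a feel for how much work each duration setting implies, the nominal frame count is simply duration times frame rate. This is a rough sketch that assumes only the 24 fps figure stated on this page; the function name `frame_count` is hypothetical, and the model may internally round to a different frame count.

```python
FPS = 24  # frame rate stated on this page


def frame_count(duration_s: float, fps: int = FPS) -> int:
    """Nominal number of frames for a clip of the given duration."""
    return round(duration_s * fps)


# Slider minimum, the default, the suggested upper bound, and the slider maximum.
for duration in (2, 5, 6, 10):
    print(f"{duration:>2}s -> {frame_count(duration)} frames")
```

The default 5s clip is nominally 120 frames, while the 10s maximum is 240, which is consistent with the warning that long durations are slow and may exceed the ZeroGPU budget.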

Aspect Ratio (auto-set when you upload an image)

Generated Video