# lipsync-wasm v3 — GestureVRM (Body Motion)

**⚠️ INTERNAL TEST ONLY — no encryption, do not deploy externally.**

Audio → rotations for 20 VRM bones @ 30 fps, 128 frames (~4.3 s) per generated clip.
Companion to v1/v2 (face/blendshape models); v3 covers body motion.

```
audio (16 kHz mono PCM)
  ├─ onset + amplitude (JS, no WASM)             → (T_audio, 2)
  └─ modality_encoder.onnx                       → at_feat  (1, 32, 768)
      ├─ denoiser.onnx × 4  (CFG × n_steps=2)    → flow_v   (1, 384, 1, 32)
      │   x_{t+1} = x_t + dt · (uncond + G·(cond − uncond))
      └─ split + scale × 5
          rvqvae_spine.onnx  → (1, 128, 36)  spine 6 bones × 6D
          rvqvae_arms.onnx   → (1, 128, 48)  arms  8 bones × 6D
          rvqvae_legs.onnx   → (1, 128, 39)  legs  6 bones × 6D + 3 trans
              denorm + reassemble (20 × 6D) → (T, 20, 4) quaternion
```
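The update rule in the diagram is a classifier-free-guidance (CFG) Euler step. A minimal sketch of that loop follows, assuming flat `Float32Array` latents, uniform steps with `dt = 1 / nSteps`, and time running from 0 to 1. `runDenoiser` is a hypothetical callback standing in for one `denoiser.onnx` call; it is not the actual `orchestrator.js` API.

```js
// CFG velocity: uncond + G * (cond - uncond), element-wise.
function cfgVelocity(cond, uncond, guidance) {
    const v = new Float32Array(cond.length);
    for (let i = 0; i < v.length; i++) {
        v[i] = uncond[i] + guidance * (cond[i] - uncond[i]);
    }
    return v;
}

// One Euler step: x_{t+1} = x_t + dt * v.
function eulerStep(x, v, dt) {
    const next = new Float32Array(x.length);
    for (let i = 0; i < x.length; i++) next[i] = x[i] + dt * v[i];
    return next;
}

// Full loop: two denoiser calls per step, so nSteps = 2 gives the 4 calls
// shown in the diagram.
// ASSUMPTION: runDenoiser(x, t, useCond) is a hypothetical async callback that
// returns the predicted velocity as a Float32Array with the same length as x.
async function sampleLatent(x0, nSteps, guidance, runDenoiser) {
    const dt = 1 / nSteps; // ASSUMPTION: uniform steps over t in [0, 1)
    let x = x0;
    for (let s = 0; s < nSteps; s++) {
        const t = s * dt;
        const cond = await runDenoiser(x, t, true);    // audio-conditioned branch
        const uncond = await runDenoiser(x, t, false); // null-condition branch
        x = eulerStep(x, cfgVelocity(cond, uncond, guidance), dt);
    }
    return x;
}
```

With `guidance = 1` the update reduces to the conditioned branch alone; larger values push the sample further away from the null-condition prediction.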

## Files

```
v3/
├── js/
│   ├── lipsync-gesture-wrapper.js   public API + JS audio onset/amplitude
│   └── orchestrator.js              CFG diffusion loop + RVQVAE decode + 6D→quat
├── models/                          symlinks → ../../motion/GestureVRM/outputs/onnx/
│   ├── modality_encoder.onnx[+.data]   15 MB
│   ├── denoiser.onnx[+.data]           117 MB
│   ├── rvqvae_{spine,arms,legs}.onnx[+.data]   3× 37 MB
│   ├── null_cond_embed.npy             97 KB  — CFG null branch
│   ├── tpose_seed.npy                  6 KB   — first 4 latent frames (constant)
│   └── mean_std.json                   5 KB   — per-part denormalization
├── scripts/
│   └── build_tpose_seed.py          regenerates tpose_seed.npy from training clip
├── demo/
│   └── index.html                   minimal upload → generate → log timings
├── serve.py                         tiny COOP/COEP static server
└── README.md
```
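`orchestrator.js` is listed as handling the 6D→quat conversion. For readers unfamiliar with the 6D rotation representation, here is a rough sketch of the standard technique (Gram-Schmidt to a rotation matrix, then matrix to quaternion). It assumes the two 3-vectors are the first two columns of the rotation matrix; some exporters store them as rows instead, which yields the inverse rotation, so check the convention against the model before relying on this. `sixDToQuat` is a hypothetical name, not a function from this package.

```js
// Convert one 6D rotation (two 3-vectors) to a unit quaternion [x, y, z, w].
// ASSUMPTION: d6[0..2] and d6[3..5] are the first two COLUMNS of the matrix.
function sixDToQuat(d6) {
    // Gram-Schmidt: b1 = norm(a1), b2 = norm(a2 - (b1 . a2) b1), b3 = b1 x b2.
    const n1 = Math.hypot(d6[0], d6[1], d6[2]);
    const b1 = [d6[0] / n1, d6[1] / n1, d6[2] / n1];
    const dot = b1[0] * d6[3] + b1[1] * d6[4] + b1[2] * d6[5];
    const u = [d6[3] - dot * b1[0], d6[4] - dot * b1[1], d6[5] - dot * b1[2]];
    const n2 = Math.hypot(u[0], u[1], u[2]);
    const b2 = [u[0] / n2, u[1] / n2, u[2] / n2];
    const b3 = [
        b1[1] * b2[2] - b1[2] * b2[1],
        b1[2] * b2[0] - b1[0] * b2[2],
        b1[0] * b2[1] - b1[1] * b2[0],
    ];
    // Matrix entries m<row><col>, with b1, b2, b3 as columns.
    const m00 = b1[0], m10 = b1[1], m20 = b1[2];
    const m01 = b2[0], m11 = b2[1], m21 = b2[2];
    const m02 = b3[0], m12 = b3[1], m22 = b3[2];
    const trace = m00 + m11 + m22;
    let x, y, z, w;
    // Branch on the largest diagonal term to keep the division well-conditioned.
    if (trace > 0) {
        const s = 2 * Math.sqrt(trace + 1); // s = 4w
        w = 0.25 * s; x = (m21 - m12) / s; y = (m02 - m20) / s; z = (m10 - m01) / s;
    } else if (m00 > m11 && m00 > m22) {
        const s = 2 * Math.sqrt(1 + m00 - m11 - m22); // s = 4x
        w = (m21 - m12) / s; x = 0.25 * s; y = (m01 + m10) / s; z = (m02 + m20) / s;
    } else if (m11 > m22) {
        const s = 2 * Math.sqrt(1 + m11 - m00 - m22); // s = 4y
        w = (m02 - m20) / s; x = (m01 + m10) / s; y = 0.25 * s; z = (m12 + m21) / s;
    } else {
        const s = 2 * Math.sqrt(1 + m22 - m00 - m11); // s = 4z
        w = (m10 - m01) / s; x = (m02 + m20) / s; y = (m12 + m21) / s; z = 0.25 * s;
    }
    return [x, y, z, w];
}
```

The input does not need to be orthonormal: Gram-Schmidt normalizes it, which is why raw decoder output can be fed in directly. A zero-length vector would produce `NaN`, so a real implementation would want a guard for that case.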

Total model weight footprint: **~244 MB FP32**. The models are not quantized because:
- FP16 denoiser: the auto-converter leaves an FP32/FP16 type mix in the Shortcut `Timesteps` + `time_embedding` sub-graph that it cannot reconcile.
- FP16 RVQVAE: rejected by the ONNX opset 17 `Resize` op.
- Dynamic INT8: destroys activations in the conv-heavy WavEncoder and RVQVAE decoder (no calibration data).

Static QDQ + calibration is the production path (deferred). See
`docs/02-design/onnx-export-and-wasm-integration-v1.md` §10.5 for full notes.

## Run the demo

```bash
cd package/lipsync-wasm/v3
python3 serve.py                # http://localhost:8090/demo/
```

1. Click **Initialize** to load ORT-Web from the CDN and fetch the 244 MB of model
   bytes (once per page load; they are held in memory for the session only).
2. Pick an audio file (≥ 4.3 s; shorter clips are tail-padded).
3. Click **Generate** — logs per-stage latency and the final
   `(128 × 20 × 4)` quaternion buffer.

Open devtools and evaluate `window.gesture.last` to inspect the output.
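The tail padding mentioned in step 2 could look roughly like the following. This is a sketch only: it assumes the PCM is a `Float32Array`, that padding is done with zeros (silence), and that the minimum length is exactly 128 frames at 30 fps; the wrapper may pad differently. `padTail` and the constants are hypothetical names.

```js
const SAMPLE_RATE = 16000; // 16 kHz mono PCM, as in the pipeline diagram
const CLIP_FRAMES = 128;   // frames per generated clip
const FRAME_RATE = 30;     // fps

// Samples needed for one full clip: ceil(128 / 30 * 16000) = 68267.
const MIN_SAMPLES = Math.ceil((CLIP_FRAMES / FRAME_RATE) * SAMPLE_RATE);

// Return pcm unchanged if it is long enough, otherwise a zero-padded copy.
// ASSUMPTION: zero padding; the actual wrapper may repeat or fade the tail.
function padTail(pcm, minSamples = MIN_SAMPLES) {
    if (pcm.length >= minSamples) return pcm;
    const out = new Float32Array(minSamples); // zero-initialised
    out.set(pcm);                             // copy the audio to the front
    return out;
}
```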

## Wiring into a VRM viewer (three-vrm sketch)

```js
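import { LipSyncGestureWrapper } from './lipsync-wasm/v3/js/lipsync-gesture-wrapper.js';

// Assumed to exist already: `ort` (the ORT-Web module), `vrm` (a loaded
// three-vrm 1.x VRM) and `pcm16k` (16 kHz mono PCM samples).
const wrapper = new LipSyncGestureWrapper();
await wrapper.init({ ort, modelsUrl: '/lipsync-wasm/v3/models/' });

const { quaternions, trans, frameRate, boneNames } =
    await wrapper.generateGesture(pcm16k);
// quaternions: Float32Array (128 frames × 20 bones × 4)   [x, y, z, w]
// trans:       Float32Array (128 frames × 3)              hips world position

const numBones = boneNames.length;                      // 20
const numFrames = quaternions.length / (numBones * 4);  // 128

// Sketch-only pacing helper: resolves after one frame period (~1/30 s).
const waitFrame = () =>
    new Promise((resolve) => setTimeout(resolve, 1000 / frameRate));

// Drive a VRM bone graph. three-vrm exposes humanoid bones as Object3D nodes,
// so rotations and the hips translation are written onto those nodes.
for (let f = 0; f < numFrames; f++) {
    for (let b = 0; b < numBones; b++) {
        const i = (f * numBones + b) * 4;
        const node = vrm.humanoid.getNormalizedBoneNode(boneNames[b]);
        if (!node) continue;   // bone not present on this avatar
        node.quaternion.set(
            quaternions[i + 0], quaternions[i + 1],
            quaternions[i + 2], quaternions[i + 3],
        );
    }
    const hips = vrm.humanoid.getNormalizedBoneNode('hips');
    if (hips) {
        hips.position.set(trans[f * 3], trans[f * 3 + 1], trans[f * 3 + 2]);
    }
    await waitFrame();
}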
```

(Real integration: schedule by absolute time, not frame waits, and crossfade
to idle when the clip ends — same pattern as v1/v2 bone animation.)
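Scheduling by absolute time means deriving the frame index from the elapsed wall-clock time on every render tick, so a slow tick skips frames instead of stretching the clip. A minimal sketch, assuming a browser render loop; `applyFrame` and `onEnd` are hypothetical callbacks (write frame `f` to the bones, start the crossfade to idle) and are not part of the wrapper API.

```js
// Map elapsed wall-clock time to a frame index, clamped to the valid range.
function frameIndexAt(elapsedMs, frameRate, numFrames) {
    const f = Math.floor((elapsedMs / 1000) * frameRate);
    return Math.min(Math.max(f, 0), numFrames - 1);
}

// True once the clip's full duration has elapsed.
function clipEnded(elapsedMs, frameRate, numFrames) {
    return elapsedMs >= (numFrames / frameRate) * 1000;
}

// Browser playback loop driven by requestAnimationFrame.
// ASSUMPTION: applyFrame(f) and onEnd() are supplied by the caller.
function playClip(frameRate, numFrames, applyFrame, onEnd) {
    const start = performance.now();
    function tick(now) {
        const elapsed = now - start;
        if (clipEnded(elapsed, frameRate, numFrames)) {
            onEnd(); // e.g. begin the crossfade to idle
            return;
        }
        applyFrame(frameIndexAt(elapsed, frameRate, numFrames));
        requestAnimationFrame(tick);
    }
    requestAnimationFrame(tick);
}
```

Usage would be along the lines of `playClip(frameRate, 128, (f) => applyPose(f), () => startIdleCrossfade())`, where both callbacks are placeholders for viewer code.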

## Security note

This sub-crate uses **plain, unencrypted ONNX files** served over HTTP for
internal testing convenience. Before any external deploy:

1. Port the v1/v2 `include_bytes!` + AES-256-GCM pattern (`shared/src/crypto.rs`).
2. Gate decryption behind the license-server token (`shared/src/license.rs`).
3. Drop the `serve.py` debug entry; ship via the dev gateway proxy only.

The current `INTERNAL_TEST_ONLY` markers in v1/v2 (license bypass) are unrelated
to this v3 ship path — v3 simply skips the encryption layer entirely.

## TODO (P5–P7)

- [ ] OPFS persistent cache for the 244 MB of model bytes (currently there is no persistent cache).
- [ ] Streaming mode: ingest audio chunks via AudioWorklet, sliding 128-frame
      window over the latent x, similar to v2's StreamingFeatureExtractor.
- [ ] Static QDQ INT8 quantization with calibration → drop footprint to ~80 MB.
- [ ] Optional Rust/WASM module for librosa-accurate onset detection
      (current JS port uses a simple energy-ratio fallback).
- [ ] Encrypt + license-gate before any external exposure.
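For context on the onset item above, an energy-ratio onset detector can be sketched as follows. This illustrates the general technique and is not the code in `lipsync-gesture-wrapper.js`: the function name, frame length, ratio threshold, noise floor, and the `[amplitude, onset]` column order are all assumptions chosen for the example.

```js
// Per-frame RMS amplitude and a binary onset flag, interleaved as (T, 2).
// ASSUMPTIONS: frameLen, ratio and floor are illustrative defaults, and the
// column order [amplitude, onset] is a guess at the (T_audio, 2) layout.
function onsetAmplitude(pcm, frameLen = 512, ratio = 2.0, floor = 1e-4) {
    const T = Math.floor(pcm.length / frameLen); // trailing partial frame dropped
    const out = new Float32Array(T * 2);
    let prevEnergy = 0;
    for (let t = 0; t < T; t++) {
        let energy = 0;
        for (let i = 0; i < frameLen; i++) {
            const s = pcm[t * frameLen + i];
            energy += s * s;
        }
        energy /= frameLen;             // mean-square energy of the frame
        out[t * 2] = Math.sqrt(energy); // RMS amplitude
        // Onset: energy exceeds `ratio` times the previous frame's energy.
        // The floor keeps near-silent frames from triggering on noise.
        out[t * 2 + 1] = energy > ratio * Math.max(prevEnergy, floor) ? 1 : 0;
        prevEnergy = energy;
    }
    return out;
}
```

A detector like this only reacts to loudness jumps, which is why a librosa-accurate (spectral-flux based) port is listed as a possible upgrade.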