Streaming Protocol

OpenHuman supports two streaming transports for real-time animation: a WebSocket binary protocol for low-latency bidirectional streams (live lip sync, mocap, remote control), and an HTTP chunked stream for pre-recorded or server-push scenarios. Both are consumed by the same StreamingClient API on the client.


When to Use Each Transport

                 WebSocket                                                    HTTP Chunked
──────────────   ──────────────────────────────────────────────────────────   ────────────────────────────────────────────────────────
Latency target   < 50 ms end-to-end                                           100–500 ms typical
Direction        Bidirectional                                                Server → client only
Best for         Live AI TTS lip sync, real-time mocap, remote puppeteering   Pre-recorded speech playback, server-sequenced animations
Reconnect        Manual or automatic                                          Automatic (fetch retry)
Binary frames    ✅ Native                                                     ✅ Via ReadableStream
JSON frames      ✅ Supported                                                  ✅ Newline-delimited


Transport A - WebSocket Binary Stream

Connecting

main.js
import { OpenHuman, StreamingClient } from "@openhuman/sdk"
 
const human = await OpenHuman.load("character.ohb", canvas)
 
const client = new StreamingClient({
    transport: "websocket",
    url: "wss://your-server.example.com/animation-stream",
    jitterBuffer: 80, // ms - smooths latency spikes (default: 80)
    reconnect: true, // auto-reconnect on drop (default: true)
    reconnectDelay: 1000, // ms between reconnect attempts (default: 1000)
})
 
// Attach to the character - poses are applied automatically each frame
client.attach(human)
 
// Open the connection
await client.connect()

Once attach() is called, every frame arriving from the server is pulled from the jitter buffer and applied to the character with no additional code needed.


Binary Frame Format

Each WebSocket message is a single binary frame (ArrayBuffer). The layout is a tightly packed struct; the reference server below writes every multi-byte field in little-endian order:

Offset     Size      Type       Field
──────     ────      ──────     ──────────────────────────────────────────────
0          4         f32        timestamp     - server clock (seconds)
4          2         u16        jointCount    - number of joints in this frame
6          2         u16        facsCount     - number of FACS weights (0 or 52)
8          n×32      f32×8      joints[]      - joint data (see below)
8 + n×32   m×2       i16×m      facs[]        - quantised FACS weights

Joint data - 8 × f32 = 32 bytes per joint:

Offset   Field
──────   ─────────────────────────────────────────
0–11     position    vec3 (x, y, z) - world space, metres
12–27    rotation    quaternion (x, y, z, w) - unit quaternion
28–31    scale       f32 - uniform scale (1.0 = no scale)

FACS weights are transmitted as 16-bit signed integers quantised over [-32 768, 32 767] mapping to [-1.0, 1.0]. The engine dequantises on receipt:

weight_f32 = facs_i16 / 32767.0

This cuts FACS bandwidth by 50% compared to sending raw f32 values (104 bytes vs 208 bytes per frame for 52 targets).
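
The layout above can be read back with a DataView. The following is a minimal sketch rather than the SDK's own parser: decodeFrame and dequantiseFacs are hypothetical names, and little-endian byte order is assumed because that is what the reference server writes.

```javascript
// Sketch of a decoder for the binary frame layout described above.
// Assumption: all fields are little-endian, matching the reference server.
const HEADER_BYTES = 8
const FLOATS_PER_JOINT = 8 // position (3) + rotation (4) + scale (1)
const JOINT_BYTES = FLOATS_PER_JOINT * 4

// i16 → f32, as in the dequantisation formula above
function dequantiseFacs(q) {
    return q / 32767
}

function decodeFrame(buffer) {
    const view = new DataView(buffer)
    const timestamp = view.getFloat32(0, true)
    const jointCount = view.getUint16(4, true)
    const facsCount = view.getUint16(6, true)

    // Reject frames that are shorter than their header claims
    const expected = HEADER_BYTES + jointCount * JOINT_BYTES + facsCount * 2
    if (buffer.byteLength < expected) {
        throw new RangeError(`Frame truncated: got ${buffer.byteLength} B, expected ${expected} B`)
    }

    // Joints stay flattened: 8 floats per joint, in wire order
    const joints = new Float32Array(jointCount * FLOATS_PER_JOINT)
    let offset = HEADER_BYTES
    for (let i = 0; i < joints.length; i++) {
        joints[i] = view.getFloat32(offset, true)
        offset += 4
    }

    // FACS weights follow the joint block as quantised i16 values
    const facs = new Float32Array(facsCount)
    for (let i = 0; i < facsCount; i++) {
        facs[i] = dequantiseFacs(view.getInt16(offset, true))
        offset += 2
    }

    return { timestamp, jointCount, facsCount, joints, facs }
}
```

Reading through a DataView avoids the alignment problem a Float32Array view would hit when the FACS block starts at an offset that is not a multiple of four.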


Frame Size Reference

Content                Bytes per frame
────────────────────   ───────────────────────────────
Header only            8 B
256 joints, no FACS    8 + 256 × 32 = 8 200 B
FACS only, no joints   8 + 52 × 2 = 112 B
256 joints + 52 FACS   8 + 256 × 32 + 52 × 2 = 8 304 B

At 30 fps with 256 joints + FACS, raw bandwidth is approximately 2 Mbps. Enable WebSocket per-message deflate on your server to cut this by ~60%.
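
The frame sizes and the bandwidth figure follow directly from the layout (8-byte header, 32 bytes per joint, 2 bytes per FACS weight). A small sketch of the arithmetic, with frameBytes and bandwidthMbps as hypothetical helper names:

```javascript
// Bytes in one frame: 8-byte header + 32 B per joint + 2 B per FACS weight
function frameBytes(jointCount, facsCount) {
    return 8 + jointCount * 32 + facsCount * 2
}

// Uncompressed bandwidth in megabits per second (1 Mbps = 1e6 bits/s)
function bandwidthMbps(bytesPerFrame, fps) {
    return (bytesPerFrame * 8 * fps) / 1e6
}

console.log(frameBytes(256, 52)) // 8304
console.log(frameBytes(0, 52)) // 112
console.log(bandwidthMbps(frameBytes(256, 52), 30).toFixed(2)) // "1.99"
```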


Reference Server - Node.js

A minimal WebSocket server that streams poses at 30 fps:

server/stream-server.js
import { WebSocketServer } from "ws"
 
const wss = new WebSocketServer({ port: 8080 })
 
const JOINT_COUNT = 256
const FACS_COUNT = 52
const FRAME_BYTES = 8 + JOINT_COUNT * 32 + FACS_COUNT * 2
 
wss.on("connection", (ws) => {
    console.log("Client connected")
 
    const interval = setInterval(() => {
        const buf = new ArrayBuffer(FRAME_BYTES)
        const view = new DataView(buf)
        let offset = 0
 
        // Header
        view.setFloat32(offset, performance.now() / 1000, true)
        offset += 4
        view.setUint16(offset, JOINT_COUNT, true)
        offset += 2
        view.setUint16(offset, FACS_COUNT, true)
        offset += 2
 
        // Joint data - replace with your actual pose source
        for (let i = 0; i < JOINT_COUNT; i++) {
            view.setFloat32(offset, 0, true) // position.x
            view.setFloat32(offset + 4, 0, true) // position.y
            view.setFloat32(offset + 8, 0, true) // position.z
            view.setFloat32(offset + 12, 0, true) // rotation.x
            view.setFloat32(offset + 16, 0, true) // rotation.y
            view.setFloat32(offset + 20, 0, true) // rotation.z
            view.setFloat32(offset + 24, 1, true) // rotation.w (identity)
            view.setFloat32(offset + 28, 1, true) // scale
            offset += 32
        }
 
        // FACS weights - replace with your TTS / phoneme output
        for (let i = 0; i < FACS_COUNT; i++) {
            const weight = 0.0 // float in range 0.0–1.0
            view.setInt16(offset, Math.round(weight * 32767), true)
            offset += 2
        }
 
        ws.send(buf)
    }, 1000 / 30) // 30 fps
 
    ws.on("close", () => clearInterval(interval))
})
 
console.log("Streaming server on ws://localhost:8080")

Transport B - HTTP Chunked Stream

HTTP chunked streaming reads the response body through the browser ReadableStream API, so no WebSocket upgrade is required. It is well suited to server-sequenced animation produced alongside AI-generated audio.

Connecting

main.js
import { OpenHuman, StreamingClient } from "@openhuman/sdk"
 
const human = await OpenHuman.load("character.ohb", canvas)
 
const client = new StreamingClient({
    transport: "http",
    url: "https://your-server.example.com/animation-stream",
    format: "ndjson", // 'ndjson' (default) or 'binary'
})
 
client.attach(human)
await client.connect()

Newline-delimited JSON (NDJSON) format

Each line is a UTF-8 JSON object terminated by \n - one object per animation frame. A network chunk may carry several lines or end partway through one:

{"t":1.033,"facs":[0,0,0,0,0,0,0,0,0.41,0,0,0,0,0,0,0,0,0,0,0,0,0,0.61,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0]}
{"t":1.066,"facs":[0,0,0,0,0,0,0,0,0.55,0,0,0,0,0,0,0,0,0,0,0,0,0,0.71,0,0.3,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0]}

Field    Type       Description
──────   ────────   ────────────────────────────────────────────────────────
t        number     Server timestamp in seconds
joints   number[]   Optional - flattened joint array (8 floats × jointCount)
facs     number[]   Optional - 52 FACS weights in canonical order
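
Because HTTP chunk boundaries need not line up with line boundaries, a consumer has to buffer partial lines. The sketch below shows one way to do that; createNdjsonParser is a hypothetical helper, not part of the SDK, and StreamingClient presumably handles this internally.

```javascript
// Incremental NDJSON splitter: partial lines are held back until their
// terminating "\n" arrives in a later chunk.
function createNdjsonParser(onFrame) {
    const decoder = new TextDecoder("utf-8")
    let pending = ""

    // chunk is a Uint8Array, e.g. one read from a ReadableStream reader
    return function push(chunk) {
        // stream: true keeps multi-byte UTF-8 sequences intact across chunks
        pending += decoder.decode(chunk, { stream: true })

        let newline
        while ((newline = pending.indexOf("\n")) !== -1) {
            const line = pending.slice(0, newline).trim()
            pending = pending.slice(newline + 1)
            if (line) onFrame(JSON.parse(line)) // skip blank lines
        }
    }
}
```

In a browser the chunks would typically come from response.body.getReader() after a fetch() call, with each value passed to push().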

Binary chunked format

For lower CPU overhead, set format: 'binary'. The server writes length-prefixed binary frames into the HTTP response body. Each frame is identical to the WebSocket binary format, prefixed with a 4-byte u32 frame length:

[frameLength: u32][timestamp: f32][jointCount: u16][facsCount: u16][joints...][facs...]
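
Length-prefixed frames have to be reassembled because a chunk can end in the middle of a frame or even in the middle of the prefix. A minimal sketch follows, under two assumptions the text above does not state: the u32 prefix is little-endian, and frameLength counts only the bytes after the prefix. createFrameSplitter is a hypothetical name.

```javascript
// Reassembles length-prefixed binary frames from arbitrary chunk boundaries.
// Assumptions: little-endian u32 prefix; frameLength excludes the 4 prefix bytes.
function createFrameSplitter(onFrame) {
    let pending = new Uint8Array(0)

    // chunk is a Uint8Array read from the response body
    return function push(chunk) {
        // Append the new chunk to the bytes left over from the previous call
        const merged = new Uint8Array(pending.length + chunk.length)
        merged.set(pending, 0)
        merged.set(chunk, pending.length)
        pending = merged

        let offset = 0
        while (pending.length - offset >= 4) {
            const view = new DataView(pending.buffer, pending.byteOffset + offset, 4)
            const frameLength = view.getUint32(0, true)
            if (pending.length - offset - 4 < frameLength) break // frame incomplete

            const start = offset + 4
            // slice() copies, so each frame owns an ArrayBuffer of exactly frameLength bytes
            onFrame(pending.slice(start, start + frameLength).buffer)
            offset = start + frameLength
        }

        // Keep only the unconsumed tail for the next call
        pending = pending.slice(offset)
    }
}
```

Each ArrayBuffer handed to onFrame then has the same layout as a WebSocket binary frame.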

Reference Server - Node.js

server/http-stream-server.js
import http from "http"
 
http.createServer((req, res) => {
    // Reject any other path with a 404 instead of an empty 200 response
    if (req.url !== "/animation-stream") {
        res.writeHead(404)
        return res.end()
    }
 
    res.writeHead(200, {
        "Content-Type": "application/x-ndjson",
        "Transfer-Encoding": "chunked",
        "Access-Control-Allow-Origin": "*",
    })
 
    let t = 0
    const interval = setInterval(() => {
        const facs = new Array(52).fill(0)
        facs[22] = Math.sin(t) * 0.5 + 0.5 // jawOpen oscillates (demo)
 
        res.write(JSON.stringify({ t, facs }) + "\n")
        t += 1 / 30
    }, 1000 / 30)
 
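    // Listen on the response: since Node 16 the request's "close" event fires
    // once the request has been read, not when the client disconnects
    res.on("close", () => clearInterval(interval))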
}).listen(8080, () => console.log("HTTP stream server on http://localhost:8080"))

Jitter Buffer

Network packets rarely arrive at a perfectly uniform interval. The jitter buffer absorbs bursts and gaps so the engine always has a frame ready at render time.

Server sends:   ──▮──▮▮───▮──▮▮▮──▮──   (irregular)
Jitter buffer:  ──▮──▮──▮──▮──▮──▮──    (uniform 33 ms output)
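
The idea in the diagram can be sketched as a small playout queue: each frame is scheduled a fixed delay after the time it would have arrived on a jitter-free link. This is an illustrative sketch only, not the engine's implementation; the JitterBuffer class is hypothetical, and it assumes the clock offset can be estimated from the first frame.

```javascript
// Minimal playout buffer: frames are released at (server time + offset + depth).
class JitterBuffer {
    constructor({ targetDepthMs = 80 } = {}) {
        this.targetDepthMs = targetDepthMs
        this.queue = [] // entries: { frame, dueMs }, in arrival order
        this.clockOffsetMs = null // local clock minus server clock, from the first frame
    }

    // frame.timestamp is the server clock in seconds; nowMs is the local clock in ms
    push(frame, nowMs) {
        const serverMs = frame.timestamp * 1000
        if (this.clockOffsetMs === null) this.clockOffsetMs = nowMs - serverMs
        // Due time depends on the server timestamp, not on the jittery arrival time
        this.queue.push({ frame, dueMs: serverMs + this.clockOffsetMs + this.targetDepthMs })
    }

    // Returns the newest frame that is due at nowMs, or null on underrun
    pop(nowMs) {
        let due = null
        while (this.queue.length > 0 && this.queue[0].dueMs <= nowMs) {
            due = this.queue.shift().frame
        }
        return due
    }
}
```

A frame that arrives up to targetDepthMs late is still played on schedule; anything later than that shows up as an underrun.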

Configuration

const client = new StreamingClient({
    transport: "websocket",
    url: "wss://...",
    jitterBuffer: 80, // target buffer depth in ms (default: 80)
    maxBuffer: 300, // drop frames older than this ms (default: 300)
    extrapolate: true, // predict pose on buffer underrun (default: true)
})

Option         Type      Default   Description
────────────   ───────   ───────   ──────────────────────────────────────────────────────────────────────────────────────────────────────
jitterBuffer   number    80        Target depth in ms. Higher = smoother but more latency.
maxBuffer      number    300       Discard frames older than this many ms. Prevents unbounded memory growth on stalls.
extrapolate    boolean   true      Linearly extrapolate the last known pose from its velocity rather than freezing when the buffer runs dry.
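
Linear extrapolation from velocity can be sketched as follows. This is an assumption about what "linearly extrapolate" amounts to, not the engine's code: extrapolateLinear is a hypothetical helper that works on a flat array of values, estimating velocity from the last two frames.

```javascript
// Predicts values at targetTime from the two most recent frames.
// prev and last are { timestamp (seconds), values (Float32Array) }.
function extrapolateLinear(prev, last, targetTime) {
    const dt = last.timestamp - prev.timestamp
    const out = new Float32Array(last.values.length)

    // Without a usable time delta there is no velocity: hold the last pose
    if (dt <= 0) {
        out.set(last.values)
        return out
    }

    const ahead = targetTime - last.timestamp
    for (let i = 0; i < out.length; i++) {
        const velocity = (last.values[i] - prev.values[i]) / dt
        out[i] = last.values[i] + velocity * ahead
    }
    return out
}
```

This form suits positions and FACS weights. Quaternion components extrapolated this way would need renormalising before use, and predicted FACS weights may need clamping to their valid range.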

Tuning guide

Network condition                   Recommended jitterBuffer
─────────────────────────────────   ────────────────────────
Local / same datacenter             40–60 ms
Cross-region (< 100 ms RTT)         80 ms (default)
Intercontinental (100–300 ms RTT)   120–160 ms
Unreliable mobile network           200 ms

Setting jitterBuffer lower than your actual network jitter causes frequent buffer underruns, visible as momentary pose freezes. If you see stuttering, increase the value before looking for other causes.


Facial-Only Streaming (Lip Sync)

For AI TTS lip sync you typically only need FACS weights - no joint data required. Set jointCount = 0 in server frames and the engine skips the skinning upload, saving ~8 KB per frame.

const client = new StreamingClient({
    transport: "websocket",
    url: "wss://tts-server.example.com/lipsync",
    mode: "facs", // skip joint processing entirely
    smoothing: 0.7, // exponential moving average α (default: 0.7)
    jitterBuffer: 60,
})
 
client.attach(human)
await client.connect()

The smoothing parameter applies an exponential moving average to all incoming FACS weights before forwarding them to human.morph:

smoothed[i] = α × previous[i] + (1 − α) × incoming[i]

smoothing   Effect
─────────   ───────────────────────────────────────────
0.0         No smoothing - raw values applied instantly
0.5         Moderate smoothing, fast response
0.7         Default - natural speech feel
0.9         Heavy smoothing, very soft transitions
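
The moving-average formula above translates directly into code. In this sketch createFacsSmoother is a hypothetical helper; seeding the state with the first frame is an assumption, since the text does not say how the engine initialises previous[i].

```javascript
// Exponential moving average over FACS weights:
// smoothed[i] = α × previous[i] + (1 − α) × incoming[i]
function createFacsSmoother(alpha = 0.7) {
    let previous = null

    return function smooth(incoming) {
        // Assumption: the first frame passes through unchanged and seeds the state
        if (previous === null) {
            previous = Float32Array.from(incoming)
            return previous
        }

        const smoothed = new Float32Array(incoming.length)
        for (let i = 0; i < incoming.length; i++) {
            smoothed[i] = alpha * previous[i] + (1 - alpha) * incoming[i]
        }
        previous = smoothed
        return smoothed
    }
}
```

With α = 0.5, a weight that jumps from 0 to 1 reaches 0.5 on the first frame and 0.75 on the second, which is the "moderate smoothing, fast response" behaviour in the table.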

Events

client.on("connected", () => console.log("Stream connected"))
client.on("disconnected", ({ code, reason }) => console.warn("Stream dropped:", reason))
client.on("reconnecting", ({ attempt }) => console.log("Reconnect attempt", attempt))
client.on("frame", (pose) => {
    // pose: { timestamp, joints?: Float32Array, facs?: Float32Array }
    // Fired after jitter buffer output - before GPU upload
})
client.on("bufferUnderrun", () => console.warn("Jitter buffer ran dry"))
client.on("bufferOverflow", () => console.warn("Buffer full - frames dropped"))

Manual Pose Application

If you need to pre-process pose data before it reaches the character (retargeting, blending, filtering), skip client.attach() and handle frames yourself:

// Do NOT call client.attach(human)
 
client.on("frame", ({ joints, facs }) => {
    if (joints) {
        // Retarget from a different skeleton before applying
        const retargeted = myRetargeter.apply(joints)
        human.skeleton.setPose(retargeted)
    }
 
    if (facs) {
        // Blend 50/50 with a local procedural expression
        const blended = facs.map((w, i) => w * 0.5 + localFacs[i] * 0.5)
        human.morph.setFromArray(blended)
    }
})
 
await client.connect()

Complete Example - AI TTS Avatar

A full end-to-end integration connecting an AI speech backend to a live avatar:

ai-avatar.js
import { OpenHuman, StreamingClient } from "@openhuman/sdk"
 
// 1. Load character
const human = await OpenHuman.load("character.ohb", canvas)
human.animation.play("idle")
 
// 2. Create lip sync streaming client
const lipsync = new StreamingClient({
    transport: "websocket",
    url: "wss://tts-backend.example.com/lipsync",
    mode: "facs",
    jitterBuffer: 80,
    smoothing: 0.7,
    reconnect: true,
})
 
lipsync.attach(human)
 
// 3. Wire UI → TTS → stream
const input = document.getElementById("user-input")
const btn = document.getElementById("send-btn")
 
btn.addEventListener("click", async () => {
    const text = input.value.trim()
    if (!text) return
 
    // Transition to talking state
    human.animation.crossFadeTo("talk", 0.3)
    human.morph.applyPreset("neutral")
 
    // POST to TTS backend - it begins streaming FACS over WebSocket
    await fetch("https://tts-backend.example.com/speak", {
        method: "POST",
        headers: { "Content-Type": "application/json" },
        body: JSON.stringify({ text }),
    })
})
 
// 4. Return to idle when speech ends
lipsync.on("speechEnd", () => {
    human.animation.crossFadeTo("idle", 0.5)
    human.morph.animateTo({}, { duration: 0.3 })
})
 
// 5. Diagnostics
lipsync.on("bufferUnderrun", () => {
    console.warn("Lip sync buffer underrun - consider increasing jitterBuffer")
})
 
setInterval(() => {
    const s = lipsync.getStats()
    console.log(`Buffer: ${s.bufferDepth}ms | Latency: ${s.estimatedLatency}ms | FPS: ${s.fps}`)
}, 2000)
 
await lipsync.connect()

Troubleshooting

Symptom                                      Likely cause                          Fix
──────────────────────────────────────────   ───────────────────────────────────   ──────────────────────────────────────────────────────────
Pose freezes briefly every few seconds       Jitter buffer underruns               Increase jitterBuffer to 120–160 ms
Visible lag between audio and lip movement   Buffer too deep                       Decrease jitterBuffer to 40–60 ms
Character snaps between poses                smoothing too low                     Increase smoothing to 0.7–0.9
WebSocket drops after ~30 s                  Proxy / load-balancer timeout         Send a ping frame every 20 s or configure keep-alive
High CPU at 60 fps                           Parsing large JSON frames each tick   Switch from NDJSON to the binary format (format: 'binary')
bufferOverflow warnings                      Server sending faster than realtime   Throttle server to target fps
CORS error on HTTP stream                    Missing response headers              Add Access-Control-Allow-Origin: * to your stream response
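
For the proxy timeout case, a server-side heartbeat can be sketched as below. It assumes a socket object shaped like the ones the Node.js ws library hands to the connection handler (a readyState property and a ping() method); sendHeartbeat and startHeartbeat are hypothetical helper names.

```javascript
const OPEN = 1 // WebSocket readyState value for an open connection

// Sends one ping if the socket is still open; returns whether a ping was sent
function sendHeartbeat(socket) {
    if (socket.readyState !== OPEN) return false
    socket.ping()
    return true
}

// Pings every intervalMs (20 s by default); returns a function that stops the timer
function startHeartbeat(socket, intervalMs = 20000) {
    const timer = setInterval(() => sendHeartbeat(socket), intervalMs)
    return () => clearInterval(timer)
}
```

In the reference WebSocket server this could be wired up inside the connection handler as const stop = startHeartbeat(ws) followed by ws.on("close", stop).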

API Reference

StreamingClient options

Option           Type                   Default                              Description
──────────────   ────────────────────   ──────────────────────────────────   ───────────────────────────────────────────────────────
transport        'websocket' | 'http'   'websocket'                          Network transport
url              string                 -                                    wss:// or https:// endpoint (required)
format           'binary' | 'ndjson'    'binary' for WS, 'ndjson' for HTTP   Frame encoding
mode             'full' | 'facs'        'full'                               'facs' skips joint processing for lip-sync-only streams
jitterBuffer     number                 80                                   Buffer depth in ms
maxBuffer        number                 300                                  Max frame age in ms before discard
extrapolate      boolean                true                                 Predict pose on buffer underrun
smoothing        number                 0.7                                  EMA α for FACS weights (0 = off, 1 = frozen)
reconnect        boolean                true                                 Auto-reconnect on disconnect
reconnectDelay   number                 1000                                 Ms between reconnect attempts

StreamingClient methods

Method       Signature            Description
──────────   ──────────────────   ──────────────────────────────────────────────────────────────────
attach       (human: OpenHuman)   Bind to a character - poses applied automatically each frame
connect      () → Promise<void>   Open the connection
disconnect   ()                   Close the connection and flush the buffer
pause        ()                   Pause consuming from the jitter buffer (character holds last pose)
resume       ()                   Resume consuming
on           (event, handler)     Subscribe to a lifecycle event
getStats     () → StreamStats     Return latency, buffer depth, and frame rate diagnostics

StreamStats object

interface StreamStats {
    connected: boolean
    bufferDepth: number // current jitter buffer depth in ms
    framesReceived: number // total frames received since connect
    framesDropped: number // frames discarded (too old or overflow)
    estimatedLatency: number // ms - server timestamp vs local clock delta
    fps: number // observed incoming frame rate
}

Next Steps