OpenHuman - TTS / Lip Sync Integration Guide


Overview

Lip sync in OpenHuman works by converting TTS audio into a sequence of viseme events, then mapping each viseme to a weighted blend of FACS morph targets applied to the character's face over time.

Two integration paths are supported:

| Path | Latency | Quality | Use Case |
| --- | --- | --- | --- |
| Server-side streaming | ~50ms | Highest | Production - server sends MORPH frames via WebSocket |
| Client-side viseme API | ~80ms | High | Simple integration - no custom server needed |

Path 1: Client-Side Viseme API

The simplest path. You call the TTS API, receive timing data, and feed it directly into the SDK's VisemeScheduler.

Supported TTS Providers (Client-Side)

| Provider | Timing Data | Format |
| --- | --- | --- |
| Azure Cognitive Services | Word + viseme events | Viseme IDs (0–21) |
| ElevenLabs | Alignment data | Character timestamps |
| Web Speech API | Word boundaries only | Limited accuracy |
| ParrotSpeech | Phoneme timestamps | IPA phonemes |
| Resemble AI | Phoneme timestamps | ARPABET |

Azure Cognitive Services (Recommended - best timing accuracy)

Azure TTS returns viseme events with millisecond-accurate timing via SpeechSynthesizer.

Setup

npm install microsoft-cognitiveservices-speech-sdk
import * as sdk from "microsoft-cognitiveservices-speech-sdk"
import { OpenHuman } from "@openhuman/sdk"
 
const human = new OpenHuman({ canvas })
await human.loadCharacter("./character.ohb")
 
const speechConfig = sdk.SpeechConfig.fromSubscription("YOUR_AZURE_KEY", "YOUR_AZURE_REGION")
speechConfig.speechSynthesisVoiceName = "en-US-JennyNeural"
 
// Pass null as the audio config so the SDK returns audio data instead of playing it itself
const synthesizer = new sdk.SpeechSynthesizer(speechConfig, null)

Viseme Event Integration

// Collect viseme events before audio starts
const visemeQueue: Array<{ audioOffsetMs: number; visemeId: number }> = []
 
synthesizer.visemeReceived = (s, e) => {
    visemeQueue.push({
        audioOffsetMs: e.audioOffset / 10000, // convert 100ns ticks → ms
        visemeId: e.visemeId,
    })
}
 
synthesizer.speakTextAsync("Hello, how can I help you today?", (result) => {
    // Audio is ready - schedule visemes then play
    const audioData = result.audioData
    playAudioWithLipSync(audioData, visemeQueue)
})

Playing Audio + Scheduling Visemes

async function playAudioWithLipSync(audioData: ArrayBuffer, visemes: Array<{ audioOffsetMs: number; visemeId: number }>) {
    const ctx = new AudioContext()
    const buffer = await ctx.decodeAudioData(audioData)
    const source = ctx.createBufferSource()
    source.buffer = buffer
    source.connect(ctx.destination)
 
    // Small lead time so the first visemes are not scheduled in the past
    const startTime = ctx.currentTime + 0.1
 
    // Schedule each viseme as the audio plays
    visemes.forEach(({ audioOffsetMs, visemeId }) => {
        const triggerAt = startTime + audioOffsetMs / 1000
 
        // Schedule via SDK - handles blend-in/out automatically
        human.scheduleViseme({
            visemeId, // Azure viseme ID 0–21
            triggerAt: triggerAt, // AudioContext time
            audioContext: ctx,
        })
    })
 
    source.start(startTime)
}

Azure Viseme ID → OpenHuman Mapping

Azure uses 22 viseme IDs (0–21). OpenHuman maps these to FACS morph weights:

| Azure ID | Description | Primary Morphs | Weight |
| --- | --- | --- | --- |
| 0 | Silence | (all at rest) | - |
| 1 | æ, ə, ʌ | jawOpen=0.6, mouthSmileLeft=0.1, mouthSmileRight=0.1 | - |
| 2 | ɑ | jawOpen=0.8, mouthFunnel=0.1 | - |
| 3 | ɔ | jawOpen=0.5, mouthFunnel=0.5 | - |
| 4 | ɛ, ʊ | jawOpen=0.4, mouthSmileLeft=0.2, mouthSmileRight=0.2 | - |
| 5 | ɝ | jawOpen=0.3, mouthFunnel=0.3 | - |
| 6 | j, i, ɪ | jawOpen=0.2, mouthSmileLeft=0.5, mouthSmileRight=0.5 | - |
| 7 | w, u | jawOpen=0.2, mouthPucker=0.9 | - |
| 8 | o | jawOpen=0.4, mouthFunnel=0.7 | - |
| 9 | aʊ | jawOpen=0.7, mouthFunnel=0.2 | - |
| 10 | ɔɪ | jawOpen=0.5, mouthFunnel=0.4 | - |
| 11 | aɪ | jawOpen=0.7, mouthSmileLeft=0.2, mouthSmileRight=0.2 | - |
| 12 | h | jawOpen=0.3 | - |
| 13 | ɹ | mouthFunnel=0.4, mouthPucker=0.2 | - |
| 14 | l | jawOpen=0.25, tongueOut=0.1 | - |
| 15 | s, z | jawOpen=0.15, mouthShrugUpper=0.2 | - |
| 16 | ʃ, tʃ, dʒ, ʒ | jawOpen=0.3, mouthFunnel=0.6 | - |
| 17 | ð | tongueOut=0.5, jawOpen=0.2 | - |
| 18 | f, v | mouthLowerDownLeft=0.4, mouthLowerDownRight=0.4 | - |
| 19 | d, t, n | jawOpen=0.2, tongueOut=0.2 | - |
| 20 | k, g, ŋ | jawOpen=0.4 | - |
| 21 | p, b, m | mouthClose=1.0, mouthPressLeft=0.3, mouthPressRight=0.3 | - |
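
The mapping table can be read as a lookup that is scaled by a peak weight before it reaches the face. The following is a minimal sketch of that idea, not the SDK's internal implementation: `AZURE_VISEME_MORPHS` and `morphsForViseme` are hypothetical names, only a few rows of the table are reproduced, and falling back to the rest pose for unknown IDs is an assumption.

```typescript
type MorphWeights = Record<string, number>

// A few rows copied from the mapping table; extend with the remaining IDs as needed.
const AZURE_VISEME_MORPHS: Record<number, MorphWeights> = {
    0: {}, // silence: all morphs at rest
    2: { jawOpen: 0.8, mouthFunnel: 0.1 },
    7: { jawOpen: 0.2, mouthPucker: 0.9 },
    21: { mouthClose: 1.0, mouthPressLeft: 0.3, mouthPressRight: 0.3 },
}

// Returns the morph targets for a viseme, scaled by a peak weight clamped to [0, 1].
// Unknown IDs fall back to the rest pose (an assumption made for this sketch).
function morphsForViseme(visemeId: number, weight = 1.0): MorphWeights {
    const base = AZURE_VISEME_MORPHS[visemeId] ?? {}
    const clamped = Math.min(1, Math.max(0, weight))
    const scaled: MorphWeights = {}
    for (const morph of Object.keys(base)) {
        scaled[morph] = (base[morph] ?? 0) * clamped
    }
    return scaled
}
```

Keeping the table as data rather than branching code makes it straightforward to tune individual weights per character.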

ElevenLabs Integration

ElevenLabs provides character-level timestamp alignment via the /v1/text-to-speech/{voice_id}/with-timestamps endpoint.

async function elevenLabsWithLipSync(text: string, voiceId: string) {
    const response = await fetch(`https://api.elevenlabs.io/v1/text-to-speech/${voiceId}/with-timestamps`, {
        method: "POST",
        headers: {
            "xi-api-key": "YOUR_ELEVENLABS_KEY",
            "Content-Type": "application/json",
        },
        body: JSON.stringify({
            text,
            model_id: "eleven_turbo_v2",
            voice_settings: { stability: 0.5, similarity_boost: 0.75 },
        }),
    })
 
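    // Fail early on HTTP errors (bad key, unknown voice, rate limit)
    if (!response.ok) throw new Error(`ElevenLabs request failed: ${response.status}`)
    const data = await response.json()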
 
    // data.alignment: { characters[], character_start_times_seconds[], character_end_times_seconds[] }
    // data.audio_base64: base64-encoded MP3
 
    const phonemes = convertAlignmentToPhonemes(data.alignment)
    const audioBuffer = base64ToArrayBuffer(data.audio_base64)
 
    await playAudioWithPhonemes(audioBuffer, phonemes)
}
 
function convertAlignmentToPhonemes(alignment) {
    // ElevenLabs gives character-level timing - map to approximate visemes
    return alignment.characters.map((char, i) => ({
        char,
        startMs: alignment.character_start_times_seconds[i] * 1000,
        endMs: alignment.character_end_times_seconds[i] * 1000,
        viseme: charToViseme(char), // simple character → viseme mapping
    }))
}

Character → Viseme Mapping (ElevenLabs)

const CHAR_TO_VISEME: Record<string, string> = {
    a: "AA",
    e: "EE",
    i: "II",
    o: "OO",
    u: "UU",
    p: "PP",
    b: "PP",
    m: "PP",
    f: "FF",
    v: "FF",
    s: "SS",
    z: "SS",
    r: "RR",
    l: "NN",
    n: "NN",
    k: "KK",
    g: "KK",
    t: "DD",
    d: "DD",
    " ": "sil",
    ".": "sil",
    ",": "sil",
}
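
The ElevenLabs example calls `charToViseme` and `base64ToArrayBuffer` without defining them. One possible shape for these helpers is sketched below. The names match the calls in the example, but the bodies are assumptions: the map here is an abbreviated copy of the character table, and treating unmapped characters as silence is a choice made for this sketch.

```typescript
// Abbreviated copy of the character table; use the full CHAR_TO_VISEME map in practice.
const CHAR_TO_VISEME_SAMPLE: Record<string, string> = {
    a: "AA", e: "EE", o: "OO", p: "PP", b: "PP", m: "PP", s: "SS", " ": "sil",
}

// Case-insensitive lookup. Characters without an entry fall back to silence
// (an assumption; the guide does not say how unmapped characters are handled).
function charToViseme(char: string): string {
    return CHAR_TO_VISEME_SAMPLE[char.toLowerCase()] ?? "sil"
}

// Decodes a base64 string into an ArrayBuffer suitable for decodeAudioData().
function base64ToArrayBuffer(base64: string): ArrayBuffer {
    const binary = atob(base64)
    const bytes = new Uint8Array(binary.length)
    for (let i = 0; i < binary.length; i++) {
        bytes[i] = binary.charCodeAt(i)
    }
    return bytes.buffer
}
```

Because character timing only approximates pronunciation, a per-character lookup like this is best treated as a rough fallback compared with providers that return phonemes or viseme IDs.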

ParrotSpeech Integration

ParrotSpeech returns IPA phoneme timestamps natively - the highest-accuracy option for OpenHuman lip sync.

import { ParrotSpeechClient } from "@parrotspeech/client"
 
const parrot = new ParrotSpeechClient({ apiKey: "YOUR_KEY" })
 
const result = await parrot.synthesize({
    text: "Hello, how can I help you today?",
    voice: "nova_en_f",
    format: "wav",
    phonemes: true, // request IPA phoneme timestamps
})
 
// result.phonemes: [{ ipa: 'h', startMs: 0, endMs: 60 }, ...]
// result.audioBuffer: ArrayBuffer
 
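// Decode first so scheduling and playback share one start time
const ctx = new AudioContext()
const buffer = await ctx.decodeAudioData(result.audioBuffer)
const source = ctx.createBufferSource()
source.buffer = buffer
source.connect(ctx.destination)
 
const audioStartTime = ctx.currentTime + 0.1 // 100ms lead time
 
result.phonemes.forEach(({ ipa, startMs, endMs }) => {
    human.scheduleVisemeByIPA({
        ipa,
        startMs,
        endMs,
        audioContext: ctx,
        audioStartTime, // same reference time for every phoneme
    })
})
 
source.start(audioStartTime)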

VisemeScheduler API

The VisemeScheduler is the core of the client-side lip sync system. It receives viseme events (from any TTS provider), enqueues them, and drives morph weight animations frame-by-frame.

scheduleViseme()

human.scheduleViseme({
  visemeId: number,              // Azure viseme ID (0–21)
  triggerAt: number,             // AudioContext.currentTime (seconds)
  audioContext: AudioContext,
  blendInMs?: number,            // default: 60ms
  blendOutMs?: number,           // default: 80ms
  weight?: number,               // peak weight (default: 1.0)
})

scheduleVisemeByName()

human.scheduleVisemeByName({
  viseme: 'AA' | 'EE' | 'II' | 'OO' | 'UU' | 'PP' | 'FF' |
          'TH' | 'DD' | 'KK' | 'CH' | 'SS' | 'NN' | 'RR' | 'sil',
  triggerAt: number,
  audioContext: AudioContext,
  blendInMs?: number,
  blendOutMs?: number,
  weight?: number,
})

scheduleVisemeByIPA()

human.scheduleVisemeByIPA({
    ipa: string, // IPA phoneme string
    startMs: number, // ms from audio start
    endMs: number, // ms from audio start
    audioContext: AudioContext,
    audioStartTime: number, // AudioContext time when audio started
})

clearVisemeQueue()

// Call when interrupting speech (e.g. user interrupts the character)
human.clearVisemeQueue()
human.resetMorphWeights()
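
Interrupting speech usually means stopping the audio as well as clearing the queue. The sketch below wraps both steps in a hypothetical `interruptSpeech` helper. The `LipSyncTarget` and `Stoppable` interfaces are stand-ins defined here so the example compiles without the SDK; in an application they would be the `human` instance and the `AudioBufferSourceNode` that is playing.

```typescript
// Minimal structural types so the sketch compiles without the SDK.
interface LipSyncTarget {
    clearVisemeQueue(): void
    resetMorphWeights(): void
}
interface Stoppable {
    stop(): void
}

// Stops audio first, then drops pending visemes and relaxes the face.
function interruptSpeech(human: LipSyncTarget, source: Stoppable | null): void {
    try {
        // AudioBufferSourceNode.stop() throws if start() was never called
        source?.stop()
    } catch {
        // nothing was playing, so there is nothing to stop
    }
    human.clearVisemeQueue()
    human.resetMorphWeights()
}
```

Stopping the source before clearing the queue avoids a brief window where audio continues while the face has already gone neutral.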

Blend Timing

Each viseme schedules a smooth weight curve: blend in → hold peak → blend out.

When two visemes overlap in time, their weights are additively blended and the result is normalized. This produces smooth coarticulation - the mouth shape for "pr" is a blend of PP and RR, not a hard cut.
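
To make the curve and the overlap rule concrete, here is a minimal sketch under stated assumptions: the ramps are linear, blend-in starts before `startMs` and blend-out ends after `endMs`, and normalization rescales only when the summed weight exceeds 1. The SDK's actual curve shape and normalization rule may differ; `visemeEnvelope` and `blendAt` are hypothetical names.

```typescript
interface TimedViseme {
    viseme: string
    startMs: number
    endMs: number
}

// Linear blend in -> hold at peak -> blend out (linear ramps are an assumption).
function visemeEnvelope(tMs: number, v: TimedViseme, blendInMs = 60, blendOutMs = 80): number {
    if (tMs < v.startMs - blendInMs || tMs > v.endMs + blendOutMs) return 0
    if (tMs < v.startMs) return (tMs - (v.startMs - blendInMs)) / blendInMs
    if (tMs > v.endMs) return 1 - (tMs - v.endMs) / blendOutMs
    return 1
}

// Sums overlapping envelopes per viseme, then rescales so the total never exceeds 1.
function blendAt(tMs: number, visemes: TimedViseme[]): Record<string, number> {
    const weights: Record<string, number> = {}
    let total = 0
    for (const v of visemes) {
        const w = visemeEnvelope(tMs, v)
        if (w <= 0) continue
        weights[v.viseme] = (weights[v.viseme] ?? 0) + w
        total += w
    }
    if (total > 1) {
        for (const name of Object.keys(weights)) {
            weights[name] = (weights[name] ?? 0) / total
        }
    }
    return weights
}
```

For "pr", with PP held from 0 to 50ms and RR starting at 50ms, sampling at 40ms gives a mix dominated by PP with RR already ramping in, which is the coarticulation effect described above.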

Tuning Blend Times

human.setVisemeDefaults({
    blendInMs: 50, // lower = snappier, more articulate; higher = smoother
    blendOutMs: 70,
    minWeight: 0.0,
    maxWeight: 1.0,
    coarticulationWindow: 2, // how many visemes ahead to pre-blend
})

Path 2: Server-Side Streaming (Production)

For production deployments, phoneme extraction and morph weight computation run on the server, and the results are streamed to the client via the OHP WebSocket protocol as MORPH or POSE_MORPH frames.

Server-Side Morph Frame Generation

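// Node.js server - using @openhuman/server-sdk
// `parrot` is assumed to be a ParrotSpeechClient created as in the client-side example
import { setTimeout as sleep } from "node:timers/promises"
import { OHPSession, MorphFrame } from "@openhuman/server-sdk"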
 
async function streamLipSync(session, characterId, text) {
    // 1. Get TTS + phoneme timing
    const tts = await parrot.synthesize({ text, phonemes: true })
 
    // 2. Build morph weight timeline (60fps)
    const timeline = buildMorphTimeline(tts.phonemes, 60)
 
    // 3. Play audio on client
    session.sendAudio(characterId, tts.audioBuffer)
 
    // 4. Stream morph frames in sync
    const startTime = Date.now()
    for (const frame of timeline) {
        const elapsed = Date.now() - startTime
        const delay = frame.timeMs - elapsed
        if (delay > 0) await sleep(delay)
        session.sendMorphFrame(characterId, frame.weights)
    }
}
 
function buildMorphTimeline(phonemes, fps) {
    const frameDuration = 1000 / fps
    const frames = []
 
    phonemes.forEach(({ ipa, startMs, endMs }) => {
        const visemeMorphs = IPA_TO_FACS[ipa] || {}
 
        // Blend in/out across overlapping frames
        for (let t = startMs - 60; t <= endMs + 80; t += frameDuration) {
            const weight = computeVisemeWeight(t, startMs, endMs, 60, 80)
            frames.push({ timeMs: t, weights: scaleWeights(visemeMorphs, weight) })
        }
    })
 
    return mergeAndSortFrames(frames)
}
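
`buildMorphTimeline` relies on `scaleWeights` and `mergeAndSortFrames`, which the listing does not define. The sketch below shows one plausible implementation of each. The behavior is an assumption rather than the server SDK's own: frames are snapped to a fixed frame grid, contributions landing on the same tick are summed and clamped to 1.0, and frames before t=0 are dropped. `TimelineFrame` is a hypothetical type name.

```typescript
type MorphWeights = Record<string, number>
interface TimelineFrame {
    timeMs: number
    weights: MorphWeights
}

// Multiplies every morph in a viseme pose by one envelope weight.
function scaleWeights(morphs: MorphWeights, weight: number): MorphWeights {
    const out: MorphWeights = {}
    for (const name of Object.keys(morphs)) {
        out[name] = (morphs[name] ?? 0) * weight
    }
    return out
}

// Snaps frames to a fixed grid, sums contributions that land on the same tick,
// clamps each morph to 1.0, drops frames before t=0, and sorts by time.
function mergeAndSortFrames(frames: TimelineFrame[], fps = 60): TimelineFrame[] {
    const frameDuration = 1000 / fps
    const byTick: Record<number, MorphWeights> = {}
    for (const frame of frames) {
        const tick = Math.round(frame.timeMs / frameDuration)
        if (tick < 0) continue // nothing can play before the audio starts
        const merged = byTick[tick] ?? {}
        for (const name of Object.keys(frame.weights)) {
            merged[name] = Math.min(1, (merged[name] ?? 0) + (frame.weights[name] ?? 0))
        }
        byTick[tick] = merged
    }
    return Object.keys(byTick)
        .map(Number)
        .sort((a, b) => a - b)
        .map((tick) => ({ timeMs: tick * frameDuration, weights: byTick[tick] ?? {} }))
}
```

Merging onto a grid matters because each phoneme's loop starts at its own offset, so without it the timeline would contain several near-simultaneous frames that overwrite each other on the client.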

When to Use Server-Side vs Client-Side

| Factor | Client-Side | Server-Side |
| --- | --- | --- |
| Setup complexity | Low | Medium |
| Latency | ~80ms | ~50ms |
| TTS API exposure | Key in client | Key stays on server |
| Custom phoneme pipeline | Limited | Full control |
| Works offline | No | No |
| Recommended for | Prototypes, demos | Production |

Emotion + Speech Blending

Lip sync morphs and emotional expression morphs blend additively. You can drive both simultaneously:

// Set a happy expression while talking
human.setMorphWeights({
  mouthSmileLeft:  0.6,
  mouthSmileRight: 0.6,
  cheekSquintLeft: 0.3,
  cheekSquintRight: 0.3,
  browInnerUp: 0.2,
})
 
// Lip sync runs on top - jawOpen, mouthFunnel etc. add to the above
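// Named visemes go through scheduleVisemeByName; scheduleViseme expects a numeric visemeId
human.scheduleVisemeByName({
  viseme: 'AA',
  triggerAt: ctx.currentTime, // AudioContext time at which the viseme should fire
  audioContext: ctx,
})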

By default, the MorphController clamps the per-morph sum of all weights so that no value exceeds 1.0. The blend mode can be changed to allow overflow:

human.setMorphBlendMode("additive_clamped") // default - clamp sum to 1.0
human.setMorphBlendMode("additive") // allow overflow (for stylized looks)
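
The difference between the two modes can be illustrated with a small per-morph blend function. This is a sketch of the behavior as described here, not the MorphController's source; `blendMorphs` and `MorphBlendMode` are hypothetical names.

```typescript
type MorphWeights = Record<string, number>
type MorphBlendMode = "additive_clamped" | "additive"

// Adds lip sync weights on top of expression weights, morph by morph.
// "additive_clamped" caps each sum at 1.0; "additive" lets the sum overflow.
function blendMorphs(
    expression: MorphWeights,
    lipSync: MorphWeights,
    mode: MorphBlendMode = "additive_clamped",
): MorphWeights {
    const out: MorphWeights = {}
    for (const name of Object.keys(expression)) {
        out[name] = expression[name] ?? 0
    }
    for (const name of Object.keys(lipSync)) {
        const sum = (out[name] ?? 0) + (lipSync[name] ?? 0)
        out[name] = mode === "additive_clamped" ? Math.min(1, sum) : sum
    }
    return out
}
```

With a smile of 0.6 and a viseme that adds 0.5 to the same morph, the clamped mode yields 1.0 while the plain additive mode yields roughly 1.1, which is what produces the exaggerated, stylized look.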

Procedural Head Motion

During speech, subtle head motion improves realism. Enable the built-in procedural head motion system:

human.setHeadMotion({
    enabled: true,
    intensity: 0.4, // 0.0–1.0 - scale of motion
    noddingRate: 0.3, // nods per second during speech
    swayAmplitude: 0.02, // meters - lateral sway
    tiltAmplitude: 1.5, // degrees - head tilt range
})

Head motion is driven via IK on the neck + head joint chain and is synchronized with the speech audio envelope (louder speech = slightly more motion).


Lip Sync Events

human.on("lipsync:start", () => {}) // speech begins
human.on("lipsync:viseme", (viseme, weight) => {}) // each viseme fires
human.on("lipsync:end", () => {}) // speech ends
human.on("lipsync:silence", () => {}) // mid-speech silence

Use lipsync:end to trigger the character returning to idle:

human.on("lipsync:end", () => {
    human.setParam("isTalking", false)
    human.animateMorphWeights({}, { duration: 300 }) // smooth reset to neutral
})

Full Example: Azure TTS End-to-End

import * as sdk from "microsoft-cognitiveservices-speech-sdk"
import { OpenHuman } from "@openhuman/sdk"
 
async function setup() {
    // 1. Init character
    const human = new OpenHuman({ canvas: document.getElementById("canvas") })
    await human.loadCharacter("./character.ohb")
    human.setParam("isTalking", false)
    human.setBlink({ enabled: true })
 
    // 2. Azure TTS setup
    const speechConfig = sdk.SpeechConfig.fromSubscription("YOUR_AZURE_KEY", "YOUR_AZURE_REGION")
    speechConfig.speechSynthesisVoiceName = "en-US-JennyNeural"
    // null audio config: the SDK returns audio data instead of playing it itself
    const synthesizer = new sdk.SpeechSynthesizer(speechConfig, null)
    // One shared AudioContext: browsers limit how many can be open at once
    const audioCtx = new AudioContext()
 
    // 3. Speak function
    async function speak(text) {
        return new Promise((resolve, reject) => {
            const visemes = []
 
            synthesizer.visemeReceived = (s, e) => {
                visemes.push({
                    audioOffsetMs: e.audioOffset / 10000,
                    visemeId: e.visemeId,
                })
            }
 
            synthesizer.speakTextAsync(
                text,
                async (result) => {
                    human.setParam("isTalking", true)
 
                    const buffer = await audioCtx.decodeAudioData(result.audioData)
                    const source = audioCtx.createBufferSource()
                    source.buffer = buffer
                    source.connect(audioCtx.destination)
 
                    const startTime = audioCtx.currentTime + 0.1 // 100ms lead time
                    visemes.forEach(({ audioOffsetMs, visemeId }) => {
                        human.scheduleViseme({
                            visemeId,
                            triggerAt: startTime + audioOffsetMs / 1000,
                            audioContext: audioCtx,
                        })
                    })
 
                    source.start(startTime)
                    source.onended = () => {
                        human.setParam("isTalking", false)
                        resolve()
                    }
                },
                reject
            )
        })
    }
 
    return { human, speak }
}
 
// Usage
const { human, speak } = await setup()
await speak("Hello! I am Sarah, your AI assistant.")
await speak("How can I help you today?")

Troubleshooting

| Issue | Likely Cause | Fix |
| --- | --- | --- |
| Mouth moves but out of sync | AudioContext not started before viseme scheduling | Pass audioContext to scheduleViseme - it reads currentTime |
| No mouth movement | Viseme queue not connected to audio timing | Check triggerAt values match audio playback time |
| Mouth snaps between shapes | blendInMs / blendOutMs too short | Increase to 60–80ms |
| Mouth barely moves | weight too low or morph weights overridden by stream | Check setStreamMergeMode - use 'masked' to allow local morphs |
| Plosives (p, b, m) not visible | Azure viseme 21 not in queue | Verify visemeReceived event fires - some voices suppress plosive events |
| Speech sounds fine, face frozen | isTalking param not set | Call human.setParam('isTalking', true) before speaking |

Next Steps