OpenHuman - TTS / Lip Sync Integration Guide

Overview

Lip sync in OpenHuman works by converting TTS audio into a sequence of viseme events, then mapping each viseme to a weighted blend of FACS morph targets applied to the character's face over time.

Two integration paths are supported:

Path	Latency	Quality	Use Case
Server-side streaming	~50ms	Highest	Production - server sends MORPH frames via WebSocket
Client-side viseme API	~80ms	High	Simple integration - no custom server needed

Path 1: Client-Side Viseme API

The simplest path. You call the TTS API, receive timing data, and feed it directly into the SDK's VisemeScheduler.

Supported TTS Providers (Client-Side)

Provider	Timing Data	Format
Azure Cognitive Services	Word + viseme events	Viseme IDs (0–21)
ElevenLabs	Alignment data	Character timestamps
Web Speech API	Word boundaries only	Limited accuracy
ParrotSpeech	Phoneme timestamps	IPA phonemes
Resemble AI	Phoneme timestamps	ARPABET

Azure Cognitive Services (Recommended - best timing accuracy)

Azure TTS returns viseme events with millisecond-accurate timing via SpeechSynthesizer.

Setup

npm install microsoft-cognitiveservices-speech-sdk

import * as sdk from "microsoft-cognitiveservices-speech-sdk"
import { OpenHuman } from "@openhuman/sdk"
 
const human = new OpenHuman({ canvas })
await human.loadCharacter("./character.ohb")
 
const speechConfig = sdk.SpeechConfig.fromSubscription("YOUR_AZURE_KEY", "YOUR_AZURE_REGION")
speechConfig.speechSynthesisVoiceName = "en-US-JennyNeural"
 
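// Pass null as the audio config so the SDK does not play the audio itself;
// playback goes through the Web Audio API, in step with the visemes
const synthesizer = new sdk.SpeechSynthesizer(speechConfig, null)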

Viseme Event Integration

// Collect viseme events before audio starts
const visemeQueue: Array<{ audioOffsetMs: number; visemeId: number }> = []
 
synthesizer.visemeReceived = (s, e) => {
    visemeQueue.push({
        audioOffsetMs: e.audioOffset / 10000, // convert 100ns ticks → ms
        visemeId: e.visemeId,
    })
}
 
synthesizer.speakTextAsync(
    "Hello, how can I help you today?",
    (result) => {
        // Synthesis finished: audio is ready - schedule visemes then play
        playAudioWithLipSync(result.audioData, visemeQueue)
    },
    (error) => console.error("Speech synthesis failed:", error)
)

Playing Audio + Scheduling Visemes

async function playAudioWithLipSync(audioData: ArrayBuffer, visemes: Array<{ audioOffsetMs: number; visemeId: number }>) {
    const ctx = new AudioContext()
    const buffer = await ctx.decodeAudioData(audioData)
    const source = ctx.createBufferSource()
    source.buffer = buffer
    source.connect(ctx.destination)
 
    // 100ms lead time so the first visemes are not scheduled in the past
    const startTime = ctx.currentTime + 0.1
 
    // Schedule each viseme as the audio plays
    visemes.forEach(({ audioOffsetMs, visemeId }) => {
        const triggerAt = startTime + audioOffsetMs / 1000
 
        // Schedule via SDK - handles blend-in/out automatically
        human.scheduleViseme({
            visemeId, // Azure viseme ID 0–21
            triggerAt: triggerAt, // AudioContext time
            audioContext: ctx,
        })
    })
 
    source.start(startTime)
}

Azure Viseme ID → OpenHuman Mapping

Azure uses 22 viseme IDs (0–21). OpenHuman maps these to FACS morph weights:

Azure ID	Description	Primary Morphs (weight)
0	Silence	(all at rest)
1	æ, ə, ʌ	`jawOpen=0.6`, `mouthSmileLeft=0.1`, `mouthSmileRight=0.1`
2	ɑ	`jawOpen=0.8`, `mouthFunnel=0.1`
3	ɔ	`jawOpen=0.5`, `mouthFunnel=0.5`
4	ɛ, ʊ	`jawOpen=0.4`, `mouthSmileLeft=0.2`, `mouthSmileRight=0.2`
5	ɝ	`jawOpen=0.3`, `mouthFunnel=0.3`
6	j, i, ɪ	`jawOpen=0.2`, `mouthSmileLeft=0.5`, `mouthSmileRight=0.5`
7	w, u	`jawOpen=0.2`, `mouthPucker=0.9`
8	o	`jawOpen=0.4`, `mouthFunnel=0.7`
9	aʊ	`jawOpen=0.7`, `mouthFunnel=0.2`
10	ɔɪ	`jawOpen=0.5`, `mouthFunnel=0.4`
11	aɪ	`jawOpen=0.7`, `mouthSmileLeft=0.2`, `mouthSmileRight=0.2`
12	h	`jawOpen=0.3`
13	ɹ	`mouthFunnel=0.4`, `mouthPucker=0.2`
14	l	`jawOpen=0.25`, `tongueOut=0.1`
15	s, z	`jawOpen=0.15`, `mouthShrugUpper=0.2`
16	ʃ, tʃ, dʒ, ʒ	`jawOpen=0.3`, `mouthFunnel=0.6`
17	ð	`tongueOut=0.5`, `jawOpen=0.2`
18	f, v	`mouthLowerDownLeft=0.4`, `mouthLowerDownRight=0.4`
19	d, t, n, θ	`jawOpen=0.2`, `tongueOut=0.2`
20	k, g, ŋ	`jawOpen=0.4`
21	p, b, m	`mouthClose=1.0`, `mouthPressLeft=0.3`, `mouthPressRight=0.3`
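
The table above can be held as a plain lookup from viseme ID to morph weights. The following is a minimal sketch, not the SDK's internal implementation: `AZURE_VISEME_MORPHS` and `morphsForAzureViseme` are hypothetical names, and only four rows of the table are copied in to keep the example short.

```typescript
type MorphWeights = Record<string, number>

// Excerpt of the mapping table (IDs 0, 1, 7, 21); the other IDs follow the same shape.
const AZURE_VISEME_MORPHS: Record<number, MorphWeights> = {
    0: {}, // silence: all morphs at rest
    1: { jawOpen: 0.6, mouthSmileLeft: 0.1, mouthSmileRight: 0.1 },
    7: { jawOpen: 0.2, mouthPucker: 0.9 },
    21: { mouthClose: 1.0, mouthPressLeft: 0.3, mouthPressRight: 0.3 },
}

// Return the morph weights for one Azure viseme ID, scaled by a peak weight.
// Hypothetical helper: throws for IDs that are missing from the lookup.
function morphsForAzureViseme(visemeId: number, peakWeight = 1): MorphWeights {
    const base = AZURE_VISEME_MORPHS[visemeId]
    if (base === undefined) {
        throw new RangeError(`No morph mapping for Azure viseme ID ${visemeId}`)
    }
    const scaled: MorphWeights = {}
    for (const name of Object.keys(base)) {
        scaled[name] = base[name] * peakWeight
    }
    return scaled
}
```

A lookup like this is mainly useful on the server-side path or for debugging, since `scheduleViseme` already applies the mapping when given an Azure viseme ID.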

ElevenLabs Integration

ElevenLabs provides character-level timestamp alignment via the /v1/text-to-speech/{voice_id}/with-timestamps endpoint.

async function elevenLabsWithLipSync(text: string, voiceId: string) {
    const response = await fetch(`https://api.elevenlabs.io/v1/text-to-speech/${voiceId}/with-timestamps`, {
        method: "POST",
        headers: {
            "xi-api-key": "YOUR_ELEVENLABS_KEY",
            "Content-Type": "application/json",
        },
        body: JSON.stringify({
            text,
            model_id: "eleven_turbo_v2",
            voice_settings: { stability: 0.5, similarity_boost: 0.75 },
        }),
    })
 
    if (!response.ok) {
        throw new Error(`ElevenLabs request failed with status ${response.status}`)
    }
    const data = await response.json()
 
    // data.alignment: { characters[], character_start_times_seconds[], character_end_times_seconds[] }
    // data.audio_base64: base64-encoded MP3
 
    const phonemes = convertAlignmentToPhonemes(data.alignment)
    const audioBuffer = base64ToArrayBuffer(data.audio_base64)
 
    await playAudioWithPhonemes(audioBuffer, phonemes)
}
 
function convertAlignmentToPhonemes(alignment) {
    // ElevenLabs gives character-level timing - map to approximate visemes
    return alignment.characters.map((char, i) => ({
        char,
        startMs: alignment.character_start_times_seconds[i] * 1000,
        endMs: alignment.character_end_times_seconds[i] * 1000,
        viseme: charToViseme(char), // simple character → viseme mapping
    }))
}
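
The example above calls `base64ToArrayBuffer` without defining it. One possible implementation is sketched below, assuming an environment with a global `atob` (browsers, or Node.js 16 and later); it is an illustration, not part of the OpenHuman SDK.

```typescript
// Decode a base64 string (such as data.audio_base64) into an ArrayBuffer
// that can be handed to AudioContext.decodeAudioData.
function base64ToArrayBuffer(base64: string): ArrayBuffer {
    const binary = atob(base64) // one character per decoded byte
    const bytes = new Uint8Array(binary.length)
    for (let i = 0; i < binary.length; i++) {
        bytes[i] = binary.charCodeAt(i)
    }
    return bytes.buffer
}
```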

Character → Viseme Mapping (ElevenLabs)

const CHAR_TO_VISEME: Record<string, string> = {
    a: "AA",
    e: "EE",
    i: "II",
    o: "OO",
    u: "UU",
    p: "PP",
    b: "PP",
    m: "PP",
    f: "FF",
    v: "FF",
    s: "SS",
    z: "SS",
    r: "RR",
    l: "NN",
    n: "NN",
    k: "KK",
    g: "KK",
    t: "DD",
    d: "DD",
    " ": "sil",
    ".": "sil",
    ",": "sil",
}
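
A rough sketch of how a mapping table like this could be applied to ElevenLabs alignment data follows. The names `charToViseme` and `alignmentToVisemeEvents` and the `VisemeEvent` shape are assumptions for illustration. The table is passed in as a parameter so the block stands alone. Unmapped characters (for example "h" or "w") extend the previous viseme instead of forcing silence, which is one design choice among several.

```typescript
interface Alignment {
    characters: string[]
    character_start_times_seconds: number[]
    character_end_times_seconds: number[]
}

interface VisemeEvent {
    viseme: string
    startMs: number
    endMs: number
}

// Case-insensitive lookup; returns undefined for characters the table does not cover.
function charToViseme(char: string, table: Record<string, string>): string | undefined {
    return table[char.toLowerCase()]
}

// Collapse character timings into viseme events. Consecutive characters with the
// same viseme, and unmapped characters, extend the previous event.
function alignmentToVisemeEvents(alignment: Alignment, table: Record<string, string>): VisemeEvent[] {
    const events: VisemeEvent[] = []
    alignment.characters.forEach((char, i) => {
        const startMs = alignment.character_start_times_seconds[i] * 1000
        const endMs = alignment.character_end_times_seconds[i] * 1000
        const viseme = charToViseme(char, table)
        const last = events[events.length - 1]
        if (last && (viseme === undefined || viseme === last.viseme)) {
            last.endMs = endMs
        } else if (viseme !== undefined) {
            events.push({ viseme, startMs, endMs })
        }
    })
    return events
}
```

Each resulting event can then be passed to `scheduleVisemeByName`, with `startMs` converted to a time on the AudioContext clock.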

ParrotSpeech Integration

ParrotSpeech returns IPA phoneme timestamps natively, so phonemes map to visemes directly, with no character-level or viseme-ID approximation in between.

import { ParrotSpeechClient } from "@parrotspeech/client"
 
const parrot = new ParrotSpeechClient({ apiKey: "YOUR_KEY" })
 
const result = await parrot.synthesize({
    text: "Hello, how can I help you today?",
    voice: "nova_en_f",
    format: "wav",
    phonemes: true, // request IPA phoneme timestamps
})
 
// result.phonemes: [{ ipa: 'h', startMs: 0, endMs: 60 }, ...]
// result.audioBuffer: ArrayBuffer
 
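// Decode first, so the start time chosen below is the time the audio really starts
const ctx = new AudioContext()
const buffer = await ctx.decodeAudioData(result.audioBuffer)
const source = ctx.createBufferSource()
source.buffer = buffer
source.connect(ctx.destination)

// One shared start time for audio and visemes keeps them on the same clock
const audioStartTime = ctx.currentTime + 0.1 // 100ms lead time

result.phonemes.forEach(({ ipa, startMs, endMs }) => {
    human.scheduleVisemeByIPA({
        ipa,
        startMs,
        endMs,
        audioContext: ctx,
        audioStartTime,
    })
})

source.start(audioStartTime)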

VisemeScheduler API

The VisemeScheduler is the core of the client-side lip sync system. It receives viseme events (from any TTS provider), enqueues them, and drives morph weight animations frame-by-frame.

scheduleViseme()

human.scheduleViseme({
  visemeId: number,              // Azure viseme ID (0–21)
  triggerAt: number,             // AudioContext.currentTime (seconds)
  audioContext: AudioContext,
  blendInMs?: number,            // default: 60ms
  blendOutMs?: number,           // default: 80ms
  weight?: number,               // peak weight (default: 1.0)
})

scheduleVisemeByName()

human.scheduleVisemeByName({
  viseme: 'AA' | 'EE' | 'II' | 'OO' | 'UU' | 'PP' | 'FF' |
          'TH' | 'DD' | 'KK' | 'CH' | 'SS' | 'NN' | 'RR' | 'sil',
  triggerAt: number,
  audioContext: AudioContext,
  blendInMs?: number,
  blendOutMs?: number,
  weight?: number,
})

scheduleVisemeByIPA()

human.scheduleVisemeByIPA({
    ipa: string, // IPA phoneme string
    startMs: number, // ms from audio start
    endMs: number, // ms from audio start
    audioContext: AudioContext,
    audioStartTime: number, // AudioContext time when audio started
})

clearVisemeQueue()

// Call when interrupting speech (e.g. user interrupts the character)
human.clearVisemeQueue()
human.resetMorphWeights()

Blend Timing

Each viseme schedules a smooth weight curve: blend in → hold peak → blend out.

When two visemes overlap in time, their weights are additively blended and the result is normalized. This produces smooth coarticulation - the mouth shape for "pr" is a blend of PP and RR, not a hard cut.
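
As an illustration of the blend in → hold → blend out curve, here is a minimal sketch of a per-viseme weight envelope. The linear ramps are an assumption (the SDK's actual curve shape is not specified here), and the function name simply mirrors the `computeVisemeWeight` helper used in the server-side example later in this guide. The blend-in ramp ends at `startMs` and the blend-out ramp begins at `endMs`.

```typescript
// Weight of a single viseme at time tMs (all values in milliseconds).
// Ramps linearly from 0 to 1 over blendInMs before startMs, holds at 1
// between startMs and endMs, then ramps back to 0 over blendOutMs.
function computeVisemeWeight(
    tMs: number,
    startMs: number,
    endMs: number,
    blendInMs: number,
    blendOutMs: number
): number {
    if (tMs < startMs) {
        const rampStart = startMs - blendInMs
        return tMs <= rampStart ? 0 : (tMs - rampStart) / blendInMs
    }
    if (tMs <= endMs) {
        return 1
    }
    const rampEnd = endMs + blendOutMs
    return tMs >= rampEnd ? 0 : (rampEnd - tMs) / blendOutMs
}
```

With the default 60ms blend-in and 80ms blend-out, a viseme held from 100ms to 200ms is at half weight at 70ms and again at 240ms.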

Tuning Blend Times

human.setVisemeDefaults({
    blendInMs: 50, // lower = more articulate, higher = smoother
    blendOutMs: 70,
    minWeight: 0.0,
    maxWeight: 1.0,
    coarticulationWindow: 2, // how many visemes ahead to pre-blend
})

Path 2: Server-Side Streaming (Production)

For production deployments, phoneme extraction and morph weight computation run on the server, and the results are streamed to the client via the OHP WebSocket protocol as MORPH or POSE_MORPH frames.

Server-Side Morph Frame Generation

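// Node.js server - using @openhuman/server-sdk
// `parrot`, IPA_TO_FACS, computeVisemeWeight, scaleWeights and mergeAndSortFrames
// are assumed to be defined elsewhere in your server code
import { OHPSession, MorphFrame } from "@openhuman/server-sdk"

const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms))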
 
async function streamLipSync(session, characterId, text) {
    // 1. Get TTS + phoneme timing
    const tts = await parrot.synthesize({ text, phonemes: true })
 
    // 2. Build morph weight timeline (60fps)
    const timeline = buildMorphTimeline(tts.phonemes, 60)
 
    // 3. Play audio on client
    session.sendAudio(characterId, tts.audioBuffer)
 
    // 4. Stream morph frames in sync
    const startTime = Date.now()
    for (const frame of timeline) {
        const elapsed = Date.now() - startTime
        const delay = frame.timeMs - elapsed
        if (delay > 0) await sleep(delay)
        session.sendMorphFrame(characterId, frame.weights)
    }
}
 
function buildMorphTimeline(phonemes, fps) {
    const frameDuration = 1000 / fps
    const frames = []
 
    phonemes.forEach(({ ipa, startMs, endMs }) => {
        const visemeMorphs = IPA_TO_FACS[ipa] || {}
 
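        // Blend in/out across overlapping frames. Snap to the shared frame grid so
        // overlapping phonemes land on identical timestamps, and skip times before 0
        const first = Math.max(0, Math.ceil((startMs - 60) / frameDuration))
        const last = Math.floor((endMs + 80) / frameDuration)
        for (let i = first; i <= last; i++) {
            const t = i * frameDuration
            const weight = computeVisemeWeight(t, startMs, endMs, 60, 80)
            frames.push({ timeMs: t, weights: scaleWeights(visemeMorphs, weight) })
        }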
    })
 
    return mergeAndSortFrames(frames)
}
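
The timeline builder above relies on `scaleWeights` and `mergeAndSortFrames`, which are not shown. The sketch below is one way they might look; it is an assumption, not the server SDK's implementation. It merges only frames whose `timeMs` values are exactly equal, so it assumes frames from overlapping phonemes sit on a shared time grid, and it clamps summed weights to 1.0.

```typescript
type MorphWeights = Record<string, number>

interface TimedFrame {
    timeMs: number
    weights: MorphWeights
}

// Scale every morph in a viseme pose by the envelope weight for this frame.
function scaleWeights(morphs: MorphWeights, weight: number): MorphWeights {
    const scaled: MorphWeights = {}
    for (const name of Object.keys(morphs)) {
        scaled[name] = morphs[name] * weight
    }
    return scaled
}

// Combine frames that share a timestamp by summing each morph (clamped to 1.0),
// then return the merged frames in playback order.
function mergeAndSortFrames(frames: TimedFrame[]): TimedFrame[] {
    const byTime: Record<string, TimedFrame> = {}
    for (const { timeMs, weights } of frames) {
        const key = String(timeMs)
        const merged = byTime[key] ?? (byTime[key] = { timeMs, weights: {} })
        for (const name of Object.keys(weights)) {
            merged.weights[name] = Math.min(1, (merged.weights[name] ?? 0) + weights[name])
        }
    }
    return Object.keys(byTime)
        .map((key) => byTime[key])
        .sort((a, b) => a.timeMs - b.timeMs)
}
```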

When to Use Server-Side vs Client-Side

Factor	Client-Side	Server-Side
Setup complexity	Low	Medium
Latency	~80ms	~50ms
TTS API exposure	Key in client	Key stays on server
Custom phoneme pipeline	Limited	Full control
Works offline	No	No
Recommended for	Prototypes, demos	Production

Emotion + Speech Blending

Lip sync morphs and emotional expression morphs blend additively. You can drive both simultaneously:

// Set a happy expression while talking
human.setMorphWeights({
  mouthSmileLeft:  0.6,
  mouthSmileRight: 0.6,
  cheekSquintLeft: 0.3,
  cheekSquintRight: 0.3,
  browInnerUp: 0.2,
})
 
// Lip sync runs on top - jawOpen, mouthFunnel etc. add to the above
human.scheduleVisemeByName({ viseme: 'AA', triggerAt: ..., audioContext: ctx })

By default, the MorphController clamps the per-morph sum of all weights to 1.0. The blend mode can be changed to allow overflow:

human.setMorphBlendMode("additive_clamped") // default - clamp sum to 1.0
human.setMorphBlendMode("additive") // allow overflow (for stylized looks)
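
To make the two modes concrete, here is a minimal sketch of per-morph additive blending between an expression layer and a lip sync layer. `blendMorphLayers` is a hypothetical helper written for illustration; it is not the MorphController's actual code.

```typescript
type MorphWeights = Record<string, number>
type BlendMode = "additive_clamped" | "additive"

// Sum the two layers morph by morph. In "additive_clamped" mode each sum is
// capped at 1.0; in "additive" mode sums above 1.0 are allowed through.
function blendMorphLayers(
    expression: MorphWeights,
    lipSync: MorphWeights,
    mode: BlendMode = "additive_clamped"
): MorphWeights {
    const result: MorphWeights = { ...expression }
    for (const name of Object.keys(lipSync)) {
        const sum = (result[name] ?? 0) + lipSync[name]
        result[name] = mode === "additive_clamped" ? Math.min(1, sum) : sum
    }
    return result
}
```

For example, a smile at `mouthSmileLeft: 0.6` combined with a viseme that adds `mouthSmileLeft: 0.5` is capped at 1.0 in the clamped mode and exceeds 1.0 in the plain additive mode.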

Procedural Head Motion

During speech, subtle head motion improves realism. Enable the built-in procedural head motion system:

human.setHeadMotion({
    enabled: true,
    intensity: 0.4, // 0.0–1.0 - scale of motion
    noddingRate: 0.3, // nods per second during speech
    swayAmplitude: 0.02, // meters - lateral sway
    tiltAmplitude: 1.5, // degrees - head tilt range
})

Head motion is driven via IK on the neck + head joint chain and is synchronized with the speech audio envelope (louder speech = slightly more motion).

Lip Sync Events

human.on("lipsync:start", () => {}) // speech begins
human.on("lipsync:viseme", (viseme, weight) => {}) // each viseme fires
human.on("lipsync:end", () => {}) // speech ends
human.on("lipsync:silence", () => {}) // mid-speech silence

Use lipsync:end to trigger the character returning to idle:

human.on("lipsync:end", () => {
    human.setParam("isTalking", false)
    human.animateMorphWeights({}, { duration: 300 }) // smooth reset to neutral
})

Full Example: Azure TTS End-to-End

import * as sdk from "microsoft-cognitiveservices-speech-sdk"
import { OpenHuman } from "@openhuman/sdk"
 
async function setup() {
    // 1. Init character
    const human = new OpenHuman({ canvas: document.getElementById("canvas") })
    await human.loadCharacter("./character.ohb")
    human.setParam("isTalking", false)
    human.setBlink({ enabled: true })
 
    // 2. Azure TTS setup
    const speechConfig = sdk.SpeechConfig.fromSubscription("YOUR_AZURE_KEY", "YOUR_AZURE_REGION")
    speechConfig.speechSynthesisVoiceName = "en-US-JennyNeural"
    const synthesizer = new sdk.SpeechSynthesizer(speechConfig, null)
 
    // 3. Speak function
    // Reuse one AudioContext for every utterance; browsers limit how many can be open
    const audioCtx = new AudioContext()

    async function speak(text) {
        return new Promise((resolve, reject) => {
            const visemes = []
 
            synthesizer.visemeReceived = (s, e) => {
                visemes.push({
                    audioOffsetMs: e.audioOffset / 10000,
                    visemeId: e.visemeId,
                })
            }
 
            synthesizer.speakTextAsync(
                text,
                async (result) => {
                    human.setParam("isTalking", true)
 
                    const buffer = await audioCtx.decodeAudioData(result.audioData)
                    const source = audioCtx.createBufferSource()
                    source.buffer = buffer
                    source.connect(audioCtx.destination)
 
                    const startTime = audioCtx.currentTime + 0.1 // 100ms lead time
                    visemes.forEach(({ audioOffsetMs, visemeId }) => {
                        human.scheduleViseme({
                            visemeId,
                            triggerAt: startTime + audioOffsetMs / 1000,
                            audioContext: audioCtx,
                        })
                    })
 
                    source.start(startTime)
                    source.onended = () => {
                        human.setParam("isTalking", false)
                        resolve()
                    }
                },
                reject
            )
        })
    }
 
    return { human, speak }
}
 
// Usage
const { human, speak } = await setup()
await speak("Hello! I am Sarah, your AI assistant.")
await speak("How can I help you today?")

Troubleshooting

Issue	Likely Cause	Fix
Mouth moves but out of sync	AudioContext suspended or not started before viseme scheduling	Call `audioContext.resume()` after a user gesture, then pass the same `audioContext` to `scheduleViseme` - it reads `currentTime`
No mouth movement	Viseme queue not connected to audio timing	Check `triggerAt` values match audio playback time
Mouth snaps between shapes	`blendInMs` / `blendOutMs` too short	Increase to 60–80ms
Mouth barely moves	`weight` too low or morph weights overridden by stream	Check `setStreamMergeMode` - use `'masked'` to allow local morphs
Plosives (p, b, m) not visible	Azure viseme 21 not in queue	Verify `visemeReceived` event fires - some voices suppress plosive events
Speech sounds fine, face frozen	`isTalking` param not set	Call `human.setParam('isTalking', true)` before speaking

Next Steps

Animation Graph Reference - isTalking param, talking state
Streaming Protocol Spec - server-side MORPH frames
FACS Morph Target Index - all 52 morph names
Error Code Reference - SDK error codes for audio / lip sync failures