OpenHuman - TTS / Lip Sync Integration Guide
Overview
Lip sync in OpenHuman works by converting TTS audio into a sequence of viseme events, then mapping each viseme to a weighted blend of FACS morph targets applied to the character's face over time.
Two integration paths are supported:
| Path | Latency | Quality | Use Case |
|---|---|---|---|
| Server-side streaming | ~50ms | Highest | Production - server sends MORPH frames via WebSocket |
| Client-side viseme API | ~80ms | High | Simple integration - no custom server needed |
Path 1: Client-Side Viseme API
The simplest path. You call the TTS API, receive timing data, and feed it directly into the SDK's VisemeScheduler.
Supported TTS Providers (Client-Side)
| Provider | Timing Data | Format |
|---|---|---|
| Azure Cognitive Services | Word + viseme events | Viseme IDs (0–21) |
| ElevenLabs | Alignment data | Character timestamps |
| Web Speech API | Word boundaries only | Limited accuracy |
| ParrotSpeech | Phoneme timestamps | IPA phonemes |
| Resemble AI | Phoneme timestamps | ARPABET |
Azure Cognitive Services (Recommended - best timing accuracy)
Azure TTS returns viseme events with millisecond-accurate timing via SpeechSynthesizer.
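Azure reports each viseme's `audioOffset` in 100-nanosecond ticks, while `AudioContext` time is in seconds, so every event needs a unit conversion before it can be scheduled. A minimal sketch of that arithmetic follows; the helper names `ticksToMs` and `toAudioContextTime` are illustrative and are not part of either SDK.

```typescript
// Azure Speech SDK viseme offsets are expressed in 100 ns ticks.
const TICKS_PER_MS = 10000

/** Convert an Azure audioOffset (100 ns ticks) to milliseconds. */
function ticksToMs(ticks: number): number {
  return ticks / TICKS_PER_MS
}

/**
 * Convert an offset (ms from the start of the utterance) into an absolute
 * time on the AudioContext clock (seconds), given when playback starts.
 */
function toAudioContextTime(playbackStartSec: number, offsetMs: number): number {
  return playbackStartSec + offsetMs / 1000
}

// Example: an event 500,000 ticks into audio whose playback starts at t = 2.0 s
const exampleOffsetMs = ticksToMs(500000) // 50 ms
const exampleTriggerAt = toAudioContextTime(2.0, exampleOffsetMs) // about 2.05 s
```

Keeping the conversion in one place makes it harder to mix tick, millisecond and second values when several providers feed the same scheduler.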
Setup
npm install microsoft-cognitiveservices-speech-sdk
import * as sdk from "microsoft-cognitiveservices-speech-sdk"
import { OpenHuman } from "@openhuman/sdk"
const human = new OpenHuman({ canvas })
await human.loadCharacter("./character.ohb")
const speechConfig = sdk.SpeechConfig.fromSubscription("YOUR_AZURE_KEY", "YOUR_AZURE_REGION")
speechConfig.speechSynthesisVoiceName = "en-US-JennyNeural"
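// Pass null as the audio config so the SDK does not play the audio itself; playback is handled below
const synthesizer = new sdk.SpeechSynthesizer(speechConfig, null)
Viseme Event Integration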
// Collect viseme events before audio starts
const visemeQueue: Array<{ audioOffsetMs: number; visemeId: number }> = []
synthesizer.visemeReceived = (s, e) => {
  visemeQueue.push({
    audioOffsetMs: e.audioOffset / 10000, // convert 100ns ticks → ms
    visemeId: e.visemeId,
  })
}
synthesizer.speakTextAsync("Hello, how can I help you today?", (result) => {
  // Audio is ready - schedule visemes then play
  const audioData = result.audioData
  playAudioWithLipSync(audioData, visemeQueue)
})
Playing Audio + Scheduling Visemes
async function playAudioWithLipSync(audioData: ArrayBuffer, visemes: Array<{ audioOffsetMs: number; visemeId: number }>) {
  const ctx = new AudioContext()
  const buffer = await ctx.decodeAudioData(audioData)
  const source = ctx.createBufferSource()
  source.buffer = buffer
  source.connect(ctx.destination)
  const startTime = ctx.currentTime
  // Schedule each viseme as the audio plays
  visemes.forEach(({ audioOffsetMs, visemeId }) => {
    const triggerAt = startTime + audioOffsetMs / 1000
    // Schedule via SDK - handles blend-in/out automatically
    human.scheduleViseme({
      visemeId, // Azure viseme ID 0–21
      triggerAt, // AudioContext time
      audioContext: ctx,
    })
  })
  source.start(startTime)
}
Azure Viseme ID → OpenHuman Mapping
Azure uses 22 viseme IDs (0–21). OpenHuman maps these to FACS morph weights:
| Azure ID | Description | Primary Morphs | Weight |
|---|---|---|---|
| 0 | Silence | (all at rest) | - |
| 1 | æ, ə, ʌ | jawOpen=0.6, mouthSmileLeft=0.1, mouthSmileRight=0.1 | - |
| 2 | ɑ | jawOpen=0.8, mouthFunnel=0.1 | - |
| 3 | ɔ | jawOpen=0.5, mouthFunnel=0.5 | - |
| 4 | ɛ, ʊ | jawOpen=0.4, mouthSmileLeft=0.2, mouthSmileRight=0.2 | - |
| 5 | ɝ | jawOpen=0.3, mouthFunnel=0.3 | - |
| 6 | j, i, ɪ | jawOpen=0.2, mouthSmileLeft=0.5, mouthSmileRight=0.5 | - |
| 7 | w, u | jawOpen=0.2, mouthPucker=0.9 | - |
| 8 | o | jawOpen=0.4, mouthFunnel=0.7 | - |
| 9 | aʊ | jawOpen=0.7, mouthFunnel=0.2 | - |
| 10 | ɔɪ | jawOpen=0.5, mouthFunnel=0.4 | - |
| 11 | aɪ | jawOpen=0.7, mouthSmileLeft=0.2, mouthSmileRight=0.2 | - |
| 12 | h | jawOpen=0.3 | - |
| 13 | ɹ | mouthFunnel=0.4, mouthPucker=0.2 | - |
| 14 | l | jawOpen=0.25, tongueOut=0.1 | - |
| 15 | s, z | jawOpen=0.15, mouthShrugUpper=0.2 | - |
| 16 | ʃ, tʃ, dʒ, ʒ | jawOpen=0.3, mouthFunnel=0.6 | - |
| 17 | ð | tongueOut=0.5, jawOpen=0.2 | - |
| 18 | f, v | mouthLowerDownLeft=0.4, mouthLowerDownRight=0.4 | - |
| 19 | d, t, n | jawOpen=0.2, tongueOut=0.2 | - |
| 20 | k, g, ŋ | jawOpen=0.4 | - |
| 21 | p, b, m | mouthClose=1.0, mouthPressLeft=0.3, mouthPressRight=0.3 | - |
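The table can be read as a lookup from viseme ID to a set of peak morph weights. The sketch below encodes a few of its rows and scales them by a per-event weight. `VISEME_MORPHS` and `morphsForViseme` are hypothetical names used for illustration, not SDK exports; the SDK performs this mapping internally when you call `scheduleViseme`.

```typescript
type MorphWeights = Record<string, number>

// A subset of the rows from the table above, keyed by Azure viseme ID.
const VISEME_MORPHS: Record<number, MorphWeights> = {
  0: {}, // silence: all morphs at rest
  2: { jawOpen: 0.8, mouthFunnel: 0.1 },
  7: { jawOpen: 0.2, mouthPucker: 0.9 },
  21: { mouthClose: 1.0, mouthPressLeft: 0.3, mouthPressRight: 0.3 },
}

/** Return the morph weights for a viseme, scaled by the event's peak weight. */
function morphsForViseme(visemeId: number, weight = 1.0): MorphWeights {
  const base = VISEME_MORPHS[visemeId] ?? {} // unknown IDs fall back to rest
  const scaled: MorphWeights = {}
  for (const [name, value] of Object.entries(base)) {
    scaled[name] = value * weight
  }
  return scaled
}

// Viseme 2 (ɑ) at half strength: jawOpen 0.4, mouthFunnel 0.05
const halfOpen = morphsForViseme(2, 0.5)
```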
ElevenLabs Integration
ElevenLabs provides character-level timestamp alignment via the /v1/text-to-speech/{voice_id}/with-timestamps endpoint.
async function elevenLabsWithLipSync(text: string, voiceId: string) {
  const response = await fetch(`https://api.elevenlabs.io/v1/text-to-speech/${voiceId}/with-timestamps`, {
    method: "POST",
    headers: {
      "xi-api-key": "YOUR_ELEVENLABS_KEY",
      "Content-Type": "application/json",
    },
    body: JSON.stringify({
      text,
      model_id: "eleven_turbo_v2",
      voice_settings: { stability: 0.5, similarity_boost: 0.75 },
    }),
  })
  if (!response.ok) {
    throw new Error(`ElevenLabs request failed with status ${response.status}`)
  }
  const data = await response.json()
  // data.alignment: { characters[], character_start_times_seconds[], character_end_times_seconds[] }
  // data.audio_base64: base64-encoded MP3
  const phonemes = convertAlignmentToPhonemes(data.alignment)
  // base64ToArrayBuffer and playAudioWithPhonemes are application helpers not shown in this guide
  const audioBuffer = base64ToArrayBuffer(data.audio_base64)
  await playAudioWithPhonemes(audioBuffer, phonemes)
}
interface ElevenLabsAlignment {
  characters: string[]
  character_start_times_seconds: number[]
  character_end_times_seconds: number[]
}
function convertAlignmentToPhonemes(alignment: ElevenLabsAlignment) {
  // ElevenLabs gives character-level timing - map to approximate visemes
  return alignment.characters.map((char, i) => ({
    char,
    startMs: alignment.character_start_times_seconds[i] * 1000,
    endMs: alignment.character_end_times_seconds[i] * 1000,
    viseme: charToViseme(char), // simple character → viseme mapping
  }))
}
Character → Viseme Mapping (ElevenLabs)
const CHAR_TO_VISEME: Record<string, string> = {
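  a: "AA",
  e: "EE",
  i: "II",
  o: "OO",
  u: "UU",
  p: "PP",
  b: "PP",
  m: "PP",
  f: "FF",
  v: "FF",
  s: "SS",
  z: "SS",
  r: "RR",
  l: "NN",
  n: "NN",
  k: "KK",
  g: "KK",
  t: "DD",
  d: "DD",
  " ": "sil",
  ".": "sil",
  ",": "sil",
}
// Lookup used by convertAlignmentToPhonemes. Falling back to "sil" for unmapped characters is an assumption.
function charToViseme(char: string): string {
  return CHAR_TO_VISEME[char.toLowerCase()] ?? "sil"
}
ParrotSpeech Integration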
ParrotSpeech returns IPA phoneme timestamps natively - the highest-accuracy option for OpenHuman lip sync.
import { ParrotSpeechClient } from "@parrotspeech/client"
const parrot = new ParrotSpeechClient({ apiKey: "YOUR_KEY" })
const result = await parrot.synthesize({
  text: "Hello, how can I help you today?",
  voice: "nova_en_f",
  format: "wav",
  phonemes: true, // request IPA phoneme timestamps
})
// result.phonemes: [{ ipa: 'h', startMs: 0, endMs: 60 }, ...]
// result.audioBuffer: ArrayBuffer
const ctx = new AudioContext()
// Capture the start time once so every phoneme is scheduled against the same reference
const audioStartTime = ctx.currentTime
result.phonemes.forEach(({ ipa, startMs, endMs }) => {
  human.scheduleVisemeByIPA({
    ipa,
    startMs,
    endMs,
    audioContext: ctx,
    audioStartTime,
  })
})
// playAudio is an application helper not shown in this guide
playAudio(ctx, result.audioBuffer)
VisemeScheduler API
The VisemeScheduler is the core of the client-side lip sync system. It receives viseme events (from any TTS provider), enqueues them, and drives morph weight animations frame-by-frame.
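To make the enqueue-then-drive idea concrete, here is a minimal sketch of a time-ordered viseme queue that a render loop could poll once per frame. It is not the SDK's implementation; `SimpleVisemeQueue` and its methods are hypothetical names, and the real scheduler also handles blending.

```typescript
interface QueuedViseme {
  visemeId: number
  triggerAt: number // seconds on the audio clock
}

class SimpleVisemeQueue {
  private events: QueuedViseme[] = []

  /** Insert an event, keeping the queue sorted by trigger time. */
  enqueue(event: QueuedViseme): void {
    let i = this.events.length
    while (i > 0 && this.events[i - 1].triggerAt > event.triggerAt) i--
    this.events.splice(i, 0, event)
  }

  /** Remove and return every event that is due at or before `now`. */
  drainDue(now: number): QueuedViseme[] {
    let count = 0
    while (count < this.events.length && this.events[count].triggerAt <= now) count++
    return this.events.splice(0, count)
  }

  /** Drop all pending events, e.g. when speech is interrupted. */
  clear(): void {
    this.events.length = 0
  }

  get size(): number {
    return this.events.length
  }
}

// Events may arrive out of order; the queue hands them back in time order.
const queue = new SimpleVisemeQueue()
queue.enqueue({ visemeId: 21, triggerAt: 0.3 })
queue.enqueue({ visemeId: 2, triggerAt: 0.1 })
const dueNow = queue.drainDue(0.15) // only the viseme at 0.1 s is due
```

In a browser, a loop driven by `requestAnimationFrame` would call `drainDue(audioContext.currentTime)` each frame and start a weight animation for every returned event.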
scheduleViseme()
human.scheduleViseme({
  visemeId: number, // Azure viseme ID (0–21)
  triggerAt: number, // AudioContext.currentTime (seconds)
  audioContext: AudioContext,
  blendInMs?: number, // default: 60ms
  blendOutMs?: number, // default: 80ms
  weight?: number, // peak weight (default: 1.0)
})
scheduleVisemeByName()
human.scheduleVisemeByName({
  viseme: 'AA' | 'EE' | 'II' | 'OO' | 'UU' | 'PP' | 'FF' |
          'TH' | 'DD' | 'KK' | 'CH' | 'SS' | 'NN' | 'RR' | 'sil',
  triggerAt: number,
  audioContext: AudioContext,
  blendInMs?: number,
  blendOutMs?: number,
  weight?: number,
})
scheduleVisemeByIPA()
human.scheduleVisemeByIPA({
  ipa: string, // IPA phoneme string
  startMs: number, // ms from audio start
  endMs: number, // ms from audio start
  audioContext: AudioContext,
  audioStartTime: number, // AudioContext time when audio started
})
clearVisemeQueue()
// Call when interrupting speech (e.g. user interrupts the character)
human.clearVisemeQueue()
human.resetMorphWeights()
Blend Timing
Each viseme schedules a smooth weight curve: blend in → hold peak → blend out.
When two visemes overlap in time, their weights are additively blended and the result is normalized. This produces smooth coarticulation - the mouth shape for "pr" is a blend of PP and RR, not a hard cut.
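The curve and the overlap rule can be sketched as two small functions. This is an illustration under stated assumptions, not the SDK's actual math: the ramps are linear, the blend-in happens before `startMs` and the blend-out after `endMs`, and "normalized" is taken to mean that contributions are scaled down only when their weights sum to more than 1. `visemeWeightAt` and `blendOverlapping` are hypothetical names.

```typescript
/**
 * Weight of one viseme at time t (ms): ramp up over blendInMs before startMs,
 * hold at 1.0 until endMs, then ramp down over blendOutMs.
 */
function visemeWeightAt(
  t: number,
  startMs: number,
  endMs: number,
  blendInMs = 60,
  blendOutMs = 80
): number {
  if (t < startMs - blendInMs || t > endMs + blendOutMs) return 0
  if (t < startMs) return (t - (startMs - blendInMs)) / blendInMs
  if (t <= endMs) return 1
  return 1 - (t - endMs) / blendOutMs
}

interface Contribution {
  morphs: Record<string, number> // peak morph weights for the viseme
  weight: number // current curve value, 0..1
}

/** Additively blend overlapping visemes, normalizing if weights sum past 1. */
function blendOverlapping(contributions: Contribution[]): Record<string, number> {
  const total = contributions.reduce((sum, c) => sum + c.weight, 0)
  const norm = Math.max(total, 1)
  const out: Record<string, number> = {}
  for (const { morphs, weight } of contributions) {
    for (const [name, value] of Object.entries(morphs)) {
      out[name] = (out[name] ?? 0) + value * (weight / norm)
    }
  }
  return out
}

// "pr": PP and RR both at full weight, so each contributes half after normalization.
const prShape = blendOverlapping([
  { morphs: { mouthClose: 1.0 }, weight: 1 },
  { morphs: { mouthFunnel: 0.4, mouthPucker: 0.2 }, weight: 1 },
])
// prShape is approximately { mouthClose: 0.5, mouthFunnel: 0.2, mouthPucker: 0.1 }
```

A smoother easing function could replace the linear ramps without changing the overlap logic.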
Tuning Blend Times
human.setVisemeDefaults({
  blendInMs: 50, // lower = more articulate, higher = smoother
  blendOutMs: 70,
  minWeight: 0.0,
  maxWeight: 1.0,
  coarticulationWindow: 2, // how many visemes ahead to pre-blend
})
Path 2: Server-Side Streaming (Production)
For production deployments, phoneme extraction and morph weight computation run on the server, and the results are streamed to the client via the OHP WebSocket protocol as MORPH or POSE_MORPH frames.
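When morph weights are computed per phoneme, neighbouring phonemes produce frames whose timestamps overlap but do not line up exactly, so they have to be snapped to a common frame grid and combined before streaming. The sketch below shows one way a merge step could work; `TimedMorphFrame` and `mergeFramesOnGrid` are hypothetical names, and summing with a clamp at 1.0 is an assumption rather than the server SDK's documented behaviour.

```typescript
interface TimedMorphFrame {
  timeMs: number
  weights: Record<string, number>
}

/**
 * Snap frames to a fixed-rate grid, sum the weights of frames that land on
 * the same tick (clamped to 1.0), and return them in time order.
 */
function mergeFramesOnGrid(frames: TimedMorphFrame[], fps = 60): TimedMorphFrame[] {
  const frameDuration = 1000 / fps
  const byTick = new Map<number, Record<string, number>>()
  for (const { timeMs, weights } of frames) {
    // Frames that start before the audio (negative time) collapse onto tick 0.
    const tick = Math.max(0, Math.round(timeMs / frameDuration))
    const merged = byTick.get(tick) ?? {}
    for (const [name, value] of Object.entries(weights)) {
      merged[name] = Math.min(1, (merged[name] ?? 0) + value)
    }
    byTick.set(tick, merged)
  }
  return Array.from(byTick.entries())
    .sort((a, b) => a[0] - b[0])
    .map(([tick, weights]) => ({ timeMs: tick * frameDuration, weights }))
}

// Two phonemes contribute to the same 50 fps tick (20 ms grid) at around 40 ms.
const mergedFrames = mergeFramesOnGrid(
  [
    { timeMs: 41, weights: { jawOpen: 0.3 } },
    { timeMs: 0, weights: { mouthClose: 1.0 } },
    { timeMs: 39, weights: { jawOpen: 0.4, mouthFunnel: 0.2 } },
  ],
  50
)
```

Snapping to a grid keeps the stream at one frame per tick, which bounds the message rate regardless of how many phonemes overlap.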
Server-Side Morph Frame Generation
// Node.js server - using @openhuman/server-sdk
import { OHPSession, MorphFrame } from "@openhuman/server-sdk"
// `parrot` is a ParrotSpeechClient instance, created as in the ParrotSpeech section.
// IPA_TO_FACS, computeVisemeWeight, scaleWeights and mergeAndSortFrames are application helpers not shown here.
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms))
async function streamLipSync(session, characterId, text) {
  // 1. Get TTS + phoneme timing
  const tts = await parrot.synthesize({ text, phonemes: true })
  // 2. Build morph weight timeline (60fps)
  const timeline = buildMorphTimeline(tts.phonemes, 60)
  // 3. Play audio on client
  session.sendAudio(characterId, tts.audioBuffer)
  // 4. Stream morph frames in sync
  const startTime = Date.now()
  for (const frame of timeline) {
    const elapsed = Date.now() - startTime
    const delay = frame.timeMs - elapsed
    if (delay > 0) await sleep(delay)
    session.sendMorphFrame(characterId, frame.weights)
  }
}
function buildMorphTimeline(phonemes, fps) {
  const frameDuration = 1000 / fps
  const frames = []
  phonemes.forEach(({ ipa, startMs, endMs }) => {
    const visemeMorphs = IPA_TO_FACS[ipa] || {}
    // Blend in/out across overlapping frames (60ms in, 80ms out)
    for (let t = startMs - 60; t <= endMs + 80; t += frameDuration) {
      const weight = computeVisemeWeight(t, startMs, endMs, 60, 80)
      frames.push({ timeMs: t, weights: scaleWeights(visemeMorphs, weight) })
    }
  })
  return mergeAndSortFrames(frames)
}
When to Use Server-Side vs Client-Side
| Factor | Client-Side | Server-Side |
|---|---|---|
| Setup complexity | Low | Medium |
| Latency | ~80ms | ~50ms |
| TTS API exposure | Key in client | Key stays on server |
| Custom phoneme pipeline | Limited | Full control |
| Works offline | No | No |
| Recommended for | Prototypes, demos | Production |
Emotion + Speech Blending
Lip sync morphs and emotional expression morphs blend additively. You can drive both simultaneously:
// Set a happy expression while talking
human.setMorphWeights({
  mouthSmileLeft: 0.6,
  mouthSmileRight: 0.6,
  cheekSquintLeft: 0.3,
  cheekSquintRight: 0.3,
  browInnerUp: 0.2,
})
// Lip sync runs on top - jawOpen, mouthFunnel etc. add to the above
human.scheduleVisemeByName({ viseme: 'AA', triggerAt: ..., audioContext: ctx })
By default, the MorphController clamps the per-morph sum of weights so that it never exceeds 1.0:
human.setMorphBlendMode("additive_clamped") // default - clamp sum to 1.0
human.setMorphBlendMode("additive") // allow overflow (for stylized looks)
Procedural Head Motion
During speech, subtle head motion improves realism. Enable the built-in procedural head motion system:
human.setHeadMotion({
  enabled: true,
  intensity: 0.4, // 0.0–1.0 - scale of motion
  noddingRate: 0.3, // nods per second during speech
  swayAmplitude: 0.02, // meters - lateral sway
  tiltAmplitude: 1.5, // degrees - head tilt range
})
Head motion is driven via IK on the neck + head joint chain and is synchronized with the speech audio envelope (louder speech = slightly more motion).
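As a rough illustration of what following the audio envelope means, the sketch below computes a windowed RMS envelope from PCM samples and uses it to scale a base motion intensity. The SDK does this internally; the function names and the 0.5 boost factor are assumptions chosen for the example.

```typescript
/** Root-mean-square loudness of consecutive windows of PCM samples (-1..1). */
function rmsEnvelope(samples: Float32Array, windowSize: number): number[] {
  const envelope: number[] = []
  for (let start = 0; start < samples.length; start += windowSize) {
    const end = Math.min(start + windowSize, samples.length)
    let sumSquares = 0
    for (let i = start; i < end; i++) sumSquares += samples[i] * samples[i]
    envelope.push(Math.sqrt(sumSquares / (end - start)))
  }
  return envelope
}

/**
 * Scale the configured intensity by loudness: silence keeps the base motion,
 * full-scale audio adds up to 50% more (the 0.5 factor is arbitrary).
 */
function motionScale(rms: number, intensity: number): number {
  const loudness = Math.min(1, Math.max(0, rms))
  return intensity * (1 + 0.5 * loudness)
}

// One loud window followed by one silent window
const envelope = rmsEnvelope(new Float32Array([1, -1, 0, 0]), 2) // [1, 0]
const scales = envelope.map((rms) => motionScale(rms, 0.4)) // about [0.6, 0.4]
```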
Lip Sync Events
human.on("lipsync:start", () => {}) // speech begins
human.on("lipsync:viseme", (viseme, weight) => {}) // each viseme fires
human.on("lipsync:end", () => {}) // speech ends
human.on("lipsync:silence", () => {}) // mid-speech silence
Use lipsync:end to trigger the character returning to idle:
human.on("lipsync:end", () => {
  human.setParam("isTalking", false)
  human.animateMorphWeights({}, { duration: 300 }) // smooth reset to neutral
})
Full Example: Azure TTS End-to-End
import * as sdk from "microsoft-cognitiveservices-speech-sdk"
import { OpenHuman } from "@openhuman/sdk"
// Placeholders - replace with your Azure credentials
const KEY = "YOUR_AZURE_KEY"
const REGION = "YOUR_AZURE_REGION"
async function setup() {
  // 1. Init character
  const human = new OpenHuman({ canvas: document.getElementById("canvas") })
  await human.loadCharacter("./character.ohb")
  human.setParam("isTalking", false)
  human.setBlink({ enabled: true })
  // 2. Azure TTS setup
  const speechConfig = sdk.SpeechConfig.fromSubscription(KEY, REGION)
  speechConfig.speechSynthesisVoiceName = "en-US-JennyNeural"
  // null audio config: the SDK synthesizes without playing; playback happens below
  const synthesizer = new sdk.SpeechSynthesizer(speechConfig, null)
  // One AudioContext shared by all utterances
  const audioCtx = new AudioContext()
  // 3. Speak function
  function speak(text) {
    return new Promise((resolve, reject) => {
      const visemes = []
      synthesizer.visemeReceived = (s, e) => {
        visemes.push({
          audioOffsetMs: e.audioOffset / 10000, // 100ns ticks → ms
          visemeId: e.visemeId,
        })
      }
      synthesizer.speakTextAsync(
        text,
        async (result) => {
          try {
            human.setParam("isTalking", true)
            const buffer = await audioCtx.decodeAudioData(result.audioData)
            const source = audioCtx.createBufferSource()
            source.buffer = buffer
            source.connect(audioCtx.destination)
            const startTime = audioCtx.currentTime + 0.1 // 100ms lead time
            visemes.forEach(({ audioOffsetMs, visemeId }) => {
              human.scheduleViseme({
                visemeId,
                triggerAt: startTime + audioOffsetMs / 1000,
                audioContext: audioCtx,
              })
            })
            source.onended = () => {
              human.setParam("isTalking", false)
              resolve()
            }
            source.start(startTime)
          } catch (err) {
            // Decoding or scheduling failed - return to idle and surface the error
            human.setParam("isTalking", false)
            reject(err)
          }
        },
        reject
      )
    })
  }
  return { human, speak }
}
// Usage
const { human, speak } = await setup()
await speak("Hello! I am Sarah, your AI assistant.")
await speak("How can I help you today?")
Troubleshooting
| Issue | Likely Cause | Fix |
|---|---|---|
| Mouth moves but out of sync | AudioContext not started before viseme scheduling | Pass audioContext to scheduleViseme - it reads currentTime |
| No mouth movement | Viseme queue not connected to audio timing | Check triggerAt values match audio playback time |
| Mouth snaps between shapes | blendInMs / blendOutMs too short | Increase to 60–80ms |
| Mouth barely moves | weight too low or morph weights overridden by stream | Check setStreamMergeMode - use 'masked' to allow local morphs |
| Plosives (p, b, m) not visible | Azure viseme 21 not in queue | Verify visemeReceived event fires - some voices suppress plosive events |
| Speech sounds fine, face frozen | isTalking param not set | Call human.setParam('isTalking', true) before speaking |
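For the sync-related rows above, it can help to check scheduled times against the playback window before digging further. The helper below is a hypothetical debugging aid, not an SDK function: it flags visemes whose `triggerAt` is already in the past or falls outside the audio's duration.

```typescript
interface ScheduledViseme {
  visemeId: number
  triggerAt: number // seconds on the AudioContext clock
}

/**
 * Return the visemes that cannot line up with playback: scheduled before
 * `now`, before playback starts, or after the audio has finished.
 */
function findMistimedVisemes(
  visemes: ScheduledViseme[],
  now: number,
  playbackStart: number,
  durationSec: number
): ScheduledViseme[] {
  const playbackEnd = playbackStart + durationSec
  return visemes.filter(
    (v) => v.triggerAt < now || v.triggerAt < playbackStart || v.triggerAt > playbackEnd
  )
}

// Audio starts at t = 5.0 s and lasts 2 s; the clock currently reads 4.9 s.
const mistimed = findMistimedVisemes(
  [
    { visemeId: 2, triggerAt: 0.25 }, // raw offset never added to the start time
    { visemeId: 7, triggerAt: 5.5 }, // inside the playback window
    { visemeId: 21, triggerAt: 9.0 }, // after the audio ends
  ],
  4.9,
  5.0,
  2.0
)
// mistimed contains viseme 2 and viseme 21
```

A common cause of the first case is passing the provider's raw offset as `triggerAt` instead of adding it to the playback start time.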
Next Steps
- Animation Graph Reference - isTalking param, talking state
- Streaming Protocol Spec - server-side MORPH frames
- FACS Morph Target Index - all 52 morph names
- Error Code Reference - SDK error codes for audio / lip sync failures