What is a Viseme?
The visual mouth shape that corresponds to a phoneme (a unit of sound). English is usually modeled with roughly a dozen distinct visemes, though the exact count depends on the scheme. Several phonemes can share one viseme: for example, the "OO" lip-rounding shape covers several different phonemes that look the same on camera. AI lip-sync engines map audio to visemes frame by frame to produce believable mouth movement.
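The many-to-one relationship between phonemes and visemes can be sketched as a lookup table. This is a minimal illustration only: the phoneme symbols are ARPAbet-style, and the viseme names and groupings (`closed_lips`, `lip_to_teeth`, `OO`, `open`) are hypothetical, not a standard viseme set or any particular engine's table.

```python
# Hypothetical, simplified phoneme-to-viseme table (ARPAbet-style symbols).
# Real viseme sets differ in both size and grouping.
PHONEME_TO_VISEME = {
    "P": "closed_lips", "B": "closed_lips", "M": "closed_lips",
    "F": "lip_to_teeth", "V": "lip_to_teeth",
    "UW": "OO", "UH": "OO", "OW": "OO", "W": "OO",
    "AA": "open", "AE": "open", "AH": "open",
}


def to_visemes(phonemes, default="neutral"):
    """Map each phoneme to a viseme; unknown phonemes fall back to `default`."""
    return [PHONEME_TO_VISEME.get(p, default) for p in phonemes]


def collapse_repeats(visemes):
    """Merge consecutive identical visemes, since the mouth shape does not change."""
    merged = []
    for v in visemes:
        if not merged or merged[-1] != v:
            merged.append(v)
    return merged


# "move" is roughly M UW V in ARPAbet-style notation.
move_visemes = to_visemes(["M", "UW", "V"])
```

Here `P`, `B`, and `M` all resolve to the same closed-lips shape, which is why a viseme inventory is much smaller than a phoneme inventory.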
Related Terms
Phoneme
The smallest unit of sound that distinguishes one word from another in a language. Speech-recognition and lip-sync systems commonly decompose audio into phonemes first, then map each phoneme to a viseme (mouth shape) for animation. English has roughly 44 phonemes, though the count varies by dialect.
Lip Sync / Lip Syncing
The process of matching mouth movements to spoken audio. In AI video, lip-sync algorithms analyze audio waveforms and generate realistic mouth shapes frame by frame. Puppetry uses LivePortrait + Wav2Lip for production-quality lip sync across 65+ languages.
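The frame-by-frame idea can be illustrated with a small sketch that turns timed mouth-shape segments into one label per video frame. It does not reflect how LivePortrait or Wav2Lip work internally (neural models such as Wav2Lip learn the audio-to-mouth mapping directly); the function name `visemes_per_frame`, the segment format, and the viseme labels are all assumptions made for illustration.

```python
import math


def visemes_per_frame(segments, fps=25, rest="neutral"):
    """Return one viseme label per video frame.

    segments: list of (start_ms, end_ms, viseme), sorted and non-overlapping.
    Each frame is sampled at its midpoint; gaps between segments get `rest`.
    """
    if not segments:
        return []
    end_ms = segments[-1][1]
    total_frames = math.ceil(end_ms * fps / 1000)
    frames = []
    for i in range(total_frames):
        t_ms = (i + 0.5) * 1000 / fps  # midpoint of frame i
        label = rest
        for start, end, viseme in segments:
            if start <= t_ms < end:
                label = viseme
                break
        frames.append(label)
    return frames


# 200 ms of speech at 25 fps gives five 40 ms frames.
timeline = visemes_per_frame([(0, 80, "closed_lips"), (80, 200, "OO")], fps=25)
```

Sampling at the frame midpoint is one simple choice; a production system would more likely blend between neighboring shapes so the mouth does not snap from one pose to the next.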
Wav2Lip
A neural network that generates accurate lip movements from audio input. It takes a face image (or video) and an audio waveform, then produces video frames whose mouth movements are closely synced to the speech. Known for high accuracy across languages and accents.
Speech-to-Text (STT)
The reverse of text-to-speech (TTS): converting spoken audio into written text. Modern STT systems like OpenAI's Whisper handle accents, background noise, and many languages. Puppetry uses STT internally to align speech to visemes for accurate lip sync.
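One rough way to picture the alignment step: given word-level timestamps of the kind some STT systems can emit, split each word's time span among its phonemes, which can then be mapped to mouth shapes. The sketch below is an assumption-laden illustration, not Puppetry's pipeline; the `PRONUNCIATIONS` dictionary, the `phoneme_segments` function, and the even split are all hypothetical simplifications.

```python
# Hypothetical mini pronunciation dictionary (ARPAbet-style symbols).
PRONUNCIATIONS = {
    "moo": ["M", "UW"],
    "bam": ["B", "AE", "M"],
}


def phoneme_segments(words):
    """Spread each word's duration evenly across its phonemes.

    words: list of (word, start_sec, end_sec) tuples, e.g. from an STT
    transcript with word timestamps. Words missing from the dictionary
    are skipped. Returns a list of (phoneme, start_sec, end_sec) tuples.
    """
    segments = []
    for word, start, end in words:
        phonemes = PRONUNCIATIONS.get(word.lower(), [])
        if not phonemes:
            continue
        step = (end - start) / len(phonemes)
        for i, phoneme in enumerate(phonemes):
            segments.append((phoneme, start + i * step, start + (i + 1) * step))
    return segments


aligned = phoneme_segments([("Moo", 0.0, 0.5)])
```

An even split ignores the fact that vowels usually last longer than consonants; dedicated forced-alignment tools estimate per-phoneme timing from the audio itself.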