What is Speech-to-Text (STT)?
The reverse of TTS: converting spoken audio into written text. Modern STT systems like OpenAI's Whisper handle accents, background noise, and many languages. Puppetry uses STT internally to align speech to visemes for accurate lip sync.
Related Terms
Text-to-Speech (TTS)
Technology that converts written text into spoken audio. Modern TTS systems produce natural-sounding voices with emotion, pacing, and accent control. Puppetry offers 500+ AI voices across 65+ languages.
Phoneme
The smallest unit of sound that distinguishes one word from another in a language. Speech-recognition and lip-sync systems first decompose audio into phonemes, then map each phoneme to a viseme (mouth shape) for animation. English has roughly 44 phonemes.
Lip Sync / Lip Syncing
The process of matching mouth movements to audio speech. In AI video, lip sync algorithms analyze audio waveforms and generate realistic mouth shapes frame-by-frame. Puppetry uses LivePortrait + Wav2Lip for production-quality lip sync across 65+ languages.
Viseme
The visual mouth shape that corresponds to a phoneme (a unit of sound). English uses roughly a dozen distinct visemes — for example, the "OO" lip-rounding shape covers several different phonemes that look the same on camera. AI lip-sync engines map audio to visemes frame-by-frame to produce believable mouth movement.