What is Speech-to-Text (STT)?

The reverse of text-to-speech (TTS): converting spoken audio into written text. Modern STT systems such as OpenAI's Whisper are designed to cope with accents, background noise, and many languages. Puppetry uses STT internally to align speech to visemes for accurate lip sync.

Related Terms

Text-to-Speech (TTS)

Technology that converts written text into spoken audio. Modern TTS systems produce natural-sounding voices with emotion, pacing, and accent control. Puppetry offers 500+ AI voices across 65+ languages.

Phoneme

The smallest unit of sound that distinguishes one word from another in a language. Speech-recognition and lip-sync systems first decompose audio into phonemes, then map each phoneme to a viseme (mouth shape) for animation. English has roughly 44 phonemes.

Lip Sync / Lip Syncing

The process of matching mouth movements to spoken audio. In AI video, lip sync algorithms analyze audio waveforms and generate realistic mouth shapes frame-by-frame. Puppetry uses LivePortrait + Wav2Lip for production-quality lip sync across 65+ languages.

Viseme

The visual mouth shape that corresponds to one or more phonemes (units of sound). English uses roughly a dozen distinct visemes; for example, the "OO" lip-rounding shape covers several different phonemes that look the same on camera. AI lip-sync engines map audio to visemes frame-by-frame to produce believable mouth movement.
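
As a rough illustration of the many-to-one mapping described above, the Python sketch below turns a list of timed phonemes into one viseme label per video frame. The phoneme symbols, viseme labels, groupings, and the function name `visemes_per_frame` are all assumptions made for this example; they are not a standard viseme set and not Puppetry's actual pipeline.

```python
# Illustrative many-to-one table: phoneme symbols -> viseme labels.
# The groupings and label names are assumptions for this sketch, not a standard.
PHONEME_TO_VISEME = {
    "P": "PP", "B": "PP", "M": "PP",      # lips pressed together
    "F": "FF", "V": "FF",                 # lower lip against upper teeth
    "UW": "OO", "OW": "OO", "W": "OO",    # rounded lips
    "AA": "AA", "AE": "AA",               # open jaw
    "IY": "EE", "IH": "EE",               # spread lips
}


def visemes_per_frame(timed_phonemes, fps=25, rest="REST"):
    """Return one viseme label per video frame.

    timed_phonemes: list of (phoneme, start_sec, end_sec), sorted by time.
    Frames not covered by any phoneme, and phonemes missing from the
    table, fall back to the `rest` label.
    """
    if not timed_phonemes:
        return []
    total_frames = round(timed_phonemes[-1][2] * fps)
    frames = [rest] * total_frames
    for phoneme, start, end in timed_phonemes:
        viseme = PHONEME_TO_VISEME.get(phoneme, rest)
        first = round(start * fps)
        last = min(round(end * fps), total_frames)
        for i in range(first, last):
            frames[i] = viseme
    return frames


# The word "moo": a closed-lip "M" followed by a rounded "UW".
print(visemes_per_frame([("M", 0.0, 0.08), ("UW", 0.08, 0.2)]))
```

In a real system the timed phonemes would come from an alignment step (for example, STT output with timestamps), and the per-frame labels would drive blended mouth shapes rather than hard switches between them.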