What is Transcription (speech-to-text)?

Transcription is the process of converting spoken audio into written text. Automatic Speech Recognition (ASR) systems do this with neural networks; the major engines include OpenAI's Whisper, Google's Cloud Speech-to-Text API, and Apple's Speech framework. YouTube uses ASR to produce its auto-generated captions.

Also called: asr · speech recognition · speech-to-text · whisper

ASR accuracy depends on audio quality, language, accent, and the model used. As rough word-level figures: English with clean audio reaches about 95% or better, heavy accents or background music bring that down to 70-85%, and major non-English languages land at 80-95%, with lower numbers for less widely spoken ones.
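Accuracy figures like these are usually derived from the word error rate (WER): the number of word substitutions, insertions, and deletions needed to turn the ASR output into a human-made reference transcript, divided by the number of words in the reference. The sketch below is a minimal illustration of that calculation, not how any particular engine reports its numbers. The function names are hypothetical, and the normalization (lowercasing and splitting on whitespace) is a simplifying assumption; real evaluations typically also strip punctuation and normalize numbers.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance between two transcripts, divided by reference length."""
    # Assumption: lowercase + whitespace split is enough normalization for a sketch.
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # (before the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # turning i reference words into an empty hypothesis = i deletions
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference word missing from the output
                curr[j - 1] + 1,     # insertion: extra word in the output
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


def word_accuracy(reference: str, hypothesis: str) -> float:
    """Accuracy as 1 - WER, clamped at 0 because WER can exceed 1."""
    return max(0.0, 1.0 - word_error_rate(reference, hypothesis))


if __name__ == "__main__":
    reference = "the quick brown fox jumps over the lazy dog"
    hypothesis = "the quick brown box jumps over lazy dog"
    # One substitution (fox -> box) and one deletion (the): 2 errors over 9 words.
    print(f"WER: {word_error_rate(reference, hypothesis):.3f}")
    print(f"Accuracy: {word_accuracy(reference, hypothesis):.1%}")
```

On this toy pair the two errors over nine reference words give a WER of about 0.22, or roughly 78% accuracy. A transcript at the "95% accuracy" level would have about one wrong word in every twenty.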

For YouTube specifically: videos are auto-transcribed to generate captions, and where a transcript exists it is publicly accessible. Open the three-dot menu on the video and choose "Show transcript". This is the same data we expose more cleanly through [/youtube-subtitle-downloader](/youtube-subtitle-downloader) and [/youtube-transcribe](/youtube-transcribe).
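A transcript is essentially a list of short text segments with timestamps, which is why it converts easily into subtitle files. As a rough sketch, the code below turns such segments into the common SRT subtitle format. The `Segment` shape (start time and duration in seconds, plus text) and the function names are assumptions made for illustration; they are not YouTube's or VidPickr's actual data format.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class Segment:
    # Hypothetical segment shape: times are in seconds.
    start: float
    duration: float
    text: str


def format_timestamp(seconds: float) -> str:
    """Render seconds as an SRT timestamp: HH:MM:SS,mmm."""
    total_ms = round(seconds * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def to_srt(segments: Iterable[Segment]) -> str:
    """Build an SRT document: numbered cues separated by blank lines."""
    blocks = []
    for index, seg in enumerate(segments, start=1):
        start = format_timestamp(seg.start)
        end = format_timestamp(seg.start + seg.duration)
        blocks.append(f"{index}\n{start} --> {end}\n{seg.text.strip()}")
    return "\n\n".join(blocks) + "\n" if blocks else ""


if __name__ == "__main__":
    demo = [
        Segment(0.0, 2.5, "Welcome back to the channel."),
        Segment(2.5, 3.0, "Today we are talking about captions."),
    ]
    print(to_srt(demo))
```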

As of 2026, the state of the art is Whisper Large Turbo. It runs entirely in the browser via WebGPU, transcribes a 10-minute video in roughly 3-5 minutes, and handles non-English audio significantly better than YouTube's auto-generated captions. Our Plus tier uses Large Turbo; the free tier uses Whisper Base, which is faster but less accurate.
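The timing above can be expressed as a real-time factor (RTF): processing time divided by audio length. A 10-minute video transcribed in 3-5 minutes corresponds to an RTF of about 0.3-0.5. The sketch below does that arithmetic and projects it onto a longer video. The function names are hypothetical, and the assumption that processing time scales linearly with audio length is a simplification; actual speed depends on the device's GPU and the model in use.

```python
def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """Processing time divided by audio duration; below 1.0 is faster than real time."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


def estimate_processing_seconds(audio_seconds: float, rtf: float) -> float:
    """Project processing time, assuming it scales linearly with audio length."""
    return audio_seconds * rtf


if __name__ == "__main__":
    ten_minutes = 10 * 60
    fast = real_time_factor(3 * 60, ten_minutes)  # best case quoted above
    slow = real_time_factor(5 * 60, ten_minutes)  # worst case quoted above
    print(f"RTF range: {fast:.2f} to {slow:.2f}")

    # Rough projection for a 45-minute video under the same assumptions.
    video = 45 * 60
    low = estimate_processing_seconds(video, fast) / 60
    high = estimate_processing_seconds(video, slow) / 60
    print(f"45-minute video: about {low:.1f} to {high:.1f} minutes")
```

Under these assumptions a 45-minute video would take somewhere around 13.5 to 22.5 minutes.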

Common questions

How accurate is YouTube's auto-transcription?
For English with clean audio, ~92-95% word-level accuracy. For non-English, 70-90% depending on language. For accented English, music, or technical jargon, accuracy can drop to 60-75%. Always proofread auto-captions before publishing.

VidPickr is a free, browser-based YouTube downloader. Every term in this glossary either describes how YouTube delivers video or why your downloads behave the way they do.