The Great Dictation Boom Is Here

The dream of computers that can type for you is finally coming true.

As a little girl, I often found myself in my family’s basement, doing battle with a dragon. I wasn’t gaming or playing pretend: My dragon was a piece of enterprise voice-dictation software called Dragon NaturallySpeaking, launched in 1997 (and purchased by my dad, an early adopter).

As a kid, I was enchanted by the idea of a computer that could type for you. The premise was simple: Wear a headset, pull up the software, and speak. Your words would fill a document on-screen without your hands having to bear the indignity of actually typing. But no matter how much I tried to enunciate, no matter how slowly I spoke, the program simply did not register my tiny, high-pitched voice. The page would stay mostly blank, occasionally transcribing the wrong words. Eventually, I’d get frustrated, give up, and go play with something else.

Much has changed in the intervening decades. Voice recognition—the computer-science term for the ability of a machine to accurately transcribe what is being said—is improving rapidly thanks in part to recent advances in AI. Today, I’m a voice-texting wizard, often dictating obnoxiously long paragraphs on my iPhone to friends and family while walking my dog or driving. I find myself speaking into my phone’s text box all the time now, simply because I feel like it. Apple updated its dictation software last year, and it’s great. So are many other programs. The dream of accurate speech-to-text—long held not just in my parents’ basement but by people all over the world—is coming together. The dragon has nearly been slain.

“All of these things that we’ve been working on are suddenly working,” Mark Hasegawa-Johnson, a professor of electrical and computer engineering at the University of Illinois Urbana-Champaign, told me. Scientists have been researching speech-recognition tools since at least the mid-20th century; early examples include the IBM Shoebox, a rudimentary computer housed within a wooden box that could measure sounds from a microphone and associate them with 16 different preprogrammed words. By the end of the 1980s, voice-dictation models could process thousands of words. And by the late ’90s, as the personal-computing boom was in full swing, dictation software was beginning to reach consumers. These programs were joined in the 2010s by digital assistants such as Siri, but even these more advanced tools were far from perfect.

“For a long time, we were making gradual, incremental progress, and then suddenly things started to get better much faster,” Hasegawa-Johnson said. Experts pointed me to a few different factors that helped accelerate this technology over the past decade. First, researchers had more digitized speech to work with. Large open-source data sets were compiled, including LibriSpeech, which contains 1,000 hours of recorded speech from public-domain audiobooks. Consumers also started regularly using voice tools such as Alexa and Siri, which likely gave private companies more data to train on. Data is key to quality: The more speech data that a model has access to, the better it can recognize what’s being said—“water,” say, not “daughter” or “squatter.” Models were once trained on just a few thousand hours of speech; now they are trained on a lifetime’s worth.

The models themselves also got more sophisticated as part of larger, industry-wide advances in machine learning and AI. The rise of end-to-end neural networks, which map audio directly to words rather than first breaking speech into smaller units such as phonemes and reassembling them, has also accelerated models’ accuracy. And improved hardware has packed more processing power into our personal devices, allowing bigger and fancier models to run in the palm of your hand.
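For the curious, here is a minimal sketch of what "end-to-end" means in practice, using OpenAI's open-source Whisper model (which comes up again later in this piece) through the openai-whisper Python package. The audio file name is a placeholder of my own invention, and commercial products such as Apple's dictation run their own proprietary pipelines; this is only an illustration of the general approach.

    # A minimal sketch: an end-to-end model takes audio in and gives text out,
    # with no separate steps for sounds, phonemes, and words.
    # "morning_walk.m4a" is a placeholder file name, not from the article.
    import whisper

    model = whisper.load_model("base")             # a small pretrained model that can run on a laptop
    result = model.transcribe("morning_walk.m4a")  # one call: audio in, transcript out
    print(result["text"])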

Of course, the tools are not yet perfect. For starters, their quality can depend on who is speaking: Voice-recognition models have been found to have higher error rates for Black speakers compared with white speakers, and they also sometimes struggle to understand people with dysarthric, or irregular, speech, such as those with Parkinson’s disease. (Hasegawa-Johnson, who compiles stats related to these issues, is the principal researcher at the Speech Accessibility Project, which aims to train models on more dysarthric speech to improve their outputs.)

The future of voice dictation will also be further complicated by the rise of generative AI. Large language models of the sort that power ChatGPT can also be used with audio, which would allow a program to better predict which word should come next in a sequence. For example, when transcribing, such an audio tool might reason that, based on the context, a person is likely saying that their dog—not their frog—needs to go for its morning walk.
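To make that idea concrete (and only as an illustration, not how Apple's or OpenAI's products are actually built), the sketch below scores two candidate transcriptions with a small general-purpose language model, GPT-2, and keeps the one the model finds more plausible. The example sentences are invented.

    import torch
    from transformers import GPT2LMHeadModel, GPT2TokenizerFast

    tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
    model = GPT2LMHeadModel.from_pretrained("gpt2")
    model.eval()

    def avg_neg_log_likelihood(text: str) -> float:
        # Lower values mean the language model finds the sentence more plausible.
        ids = tokenizer(text, return_tensors="pt").input_ids
        with torch.no_grad():
            loss = model(ids, labels=ids).loss
        return loss.item()

    candidates = [
        "My dog needs to go for its morning walk.",
        "My frog needs to go for its morning walk.",
    ]
    print(min(candidates, key=avg_neg_log_likelihood))  # expected: the "dog" reading

In a real dictation system the acoustic model and the language model are coupled far more tightly, but the principle is the same: context breaks the tie between words that sound alike.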

Yet like their text counterparts, voice-recognition tools that use large language models can “hallucinate,” transcribing words that were never actually spoken. A team of scholars recently documented violent and unsavory hallucinations, as well as ones that perpetuate harmful stereotypes, coming from OpenAI’s new audio model, Whisper. (In response to a request for comment about this research, a spokesperson for OpenAI said, in part, “We continually conduct research on how we can improve the accuracy of our models, including how we can reduce hallucinations.”)

So goes the AI boom: The technology is both creating impressive new things and introducing new problems. In voice dictation, the chasm between two once-distinct mediums, audio and text, is closing, leaving us to appreciate the marvel available in our hands—and to proceed with caution.
