What Happened When I Cloned My Own Voice

The promises and perils of AI voice software

Recently my colleague Charlie Warzel, who covers technology, introduced me to the most sophisticated voice-cloning software available. It had already been used to clone President Joe Biden’s voice to create a fake robocall discouraging people from voting in the New Hampshire primary. I signed up, fed it a few hours of me speaking on various podcasts, and waited for the Hanna Rosin clone to be born. The way it works is you type a sentence into a box. For example, Please give me your Social Security number, or JoJo Siwa has such great fashion!, and then your manufactured voice, created from samples of your actual voice, says the sentence back to you. You can make yourself say anything, and shift the intensity of the intonation until it sounds uncannily like you.
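Concretely, the workflow the box-and-a-sentence interface exposes mirrors the company’s public API: pick a cloned voice, send text plus intonation settings, get audio back. Below is a minimal sketch in Python, assuming ElevenLabs’ documented text-to-speech REST endpoint; the API key and voice ID are placeholders, and the exact field names and settings are assumptions that may have changed since this was written.

```python
# A minimal sketch, assuming ElevenLabs' documented text-to-speech REST
# endpoint. The API key and voice ID below are placeholders, and the
# voice_settings fields are assumptions based on the public API docs.
import requests

API_KEY = "YOUR_ELEVENLABS_API_KEY"  # placeholder: your account's key
VOICE_ID = "your-cloned-voice-id"    # placeholder: ID of the cloned voice

url = f"https://api.elevenlabs.io/v1/text-to-speech/{VOICE_ID}"
payload = {
    "text": "JoJo Siwa has such great fashion!",
    "model_id": "eleven_multilingual_v2",
    # "Shifting the intensity of the intonation" corresponds roughly to
    # these knobs: lower stability allows a more expressive delivery.
    "voice_settings": {"stability": 0.4, "similarity_boost": 0.8},
}
headers = {"xi-api-key": API_KEY, "Content-Type": "application/json"}

response = requests.post(url, json=payload, headers=headers, timeout=60)
response.raise_for_status()

# The endpoint returns raw audio of the cloned voice reading the text.
with open("cloned_voice.mp3", "wb") as f:
    f.write(response.content)
```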

Warzel visited the small company that made the software, and what he found was a familiar Silicon Valley story. The people at this company are dreamers, inspired by the Babel fish, a fictional translation device from The Hitchhiker’s Guide to the Galaxy. They imagine a world where people can speak to one another across languages and still sound like themselves. They also may not be able to put the genie back in the bottle if (or when) the technology leads to world-altering chaos, particularly in this coming year, when countries that are home to more than half of the world’s population will hold elections.

In this episode of Radio Atlantic, Warzel and I discuss how this small company perfected the cloned voice, and what good and bad actors might do with it. We spoke at a live show in Seattle, which allowed us to play a few tricks on the audience.

Listen to the conversation here:

Subscribe here: Apple Podcasts | Spotify | YouTube | Overcast | Pocket Casts


The following is a transcript of the episode:

Hanna Rosin: So a few weeks ago, my colleague staff writer Charlie Warzel introduced me to something that’s either amazing or sinister—probably both.

Charlie’s been on the show before. He writes about technology. And most recently, he wrote about AI voice software. And I have to say: It’s uncannily good. I signed up for it—uploaded my voice—and man does it sound like me.

So, of course, what immediately occurred to me was all the different flavors of chaos this could cause in our future.

I’m Hanna Rosin. This is Radio Atlantic. And this past weekend, I was in Seattle, Washington, for the Cascade PBS Ideas Festival. It’s a gathering of journalists and creators, and we discussed topics ranging from homelessness to the Supreme Court to the obsession with true crime.

Charlie and I talked about this new voice software. And we tried to see if the AI voices would fool the audience.

For this week’s episode, we bring you a live taping with me and Charlie. Here’s our conversation.

[Applause]

Rosin: So today we’re going to talk about AI. We’re all aware that there’s this thing barreling toward us called AI that’s going to lead to huge changes in our world. You’ve probably heard something, seen something about deepfakes. And then the next big word I want to put in the room is election interference.

Today, we’re going to connect the dots between those three big ideas and bring them a little closer to us because there are two important truths that you need to know about this coming year. One is that it is extremely easy—by which I mean ten-dollars-a-month easy—to clone your own voice, and possibly anybody’s voice, well enough to fool your mother. Now, why do I know this? Because I cloned my voice, and I fooled my mother. And I also fooled my partner, and I fooled my son. You can clone your voice so well now that it really, really, really sounds a lot like you or the other person. And the second fact that it’s important to know about this year is that about half the world’s population is about to undergo an election.

So those two facts together can lead to some chaos. And that’s something Charlie’s been following for a while. Now, we’ve already had our first taste of AI-voice election chaos. That came in the Democratic primary. Charlie, tell us what happened there.

Charlie Warzel: A bunch of New Hampshire voters—I think it was about 5,000 people—got a phone call, and it would say “robocall” when you picked it up, which is standard if you live in a state holding a primary. And the voice on the other end of the line was this kind of grainy-but-real-sounding voice of Joe Biden urging people not to go out and vote in the primary that was coming up on Tuesday.

Rosin: Let’s, before we keep talking about it, listen to the robocall. Okay? We’re going to play it.

Joe Biden (AI): Republicans have been trying to push nonpartisan and Democratic voters to participate in their primary. What a bunch of malarkey. We know the value of voting Democratic when our votes count. It’s important that you save your vote for the November election. We’ll need your help in electing Democrats up and down the ticket. Voting this Tuesday only enables the Republicans in their quest to elect Donald Trump again. Your vote makes a difference in November, not this Tuesday.

Rosin: I’m feeling like some of you are dubious, like that doesn’t sound like Joe Biden. Clap if you think that does not sound like Joe Biden.

[Small amount of clapping]

Rosin: Well, okay. Somewhere in there. So when you heard that call, did you think, Uh-oh. Here it comes? Like, what was the lesson you took from that call? Or did you think, Oh, this got solved in a second and so we don’t have to worry about it?

Warzel: When I saw this, I was actually reporting out a feature for The Atlantic about the company ElevenLabs, whose technology was used to make that phone call. So it was very resonant for me.

You know, when I started writing—I’ve been writing about deepfakes and things like that for quite a while (I mean, in internet time), since 2017. But there’s always been this feeling of, you know, What is the actual level of concern that I should have here? Like, What is theoretical? With technology, and especially with misinformation, we tend to, you know, talk and freak out about the theoretical so much that sometimes we’re not really grounding it in plausibility.

So with this, I was actually trying to get a sense of: Is this something that would actually have any real sway in the primary? Like, did people believe it? Right? It’s sort of what you just asked the audience, which is: Is this plausible? And I think when you’re sitting here, listening to this with hindsight, and, you know, trying to evaluate, that’s one thing.

Are you really gonna question, like, at this moment in time, if you’re getting that, especially if you aren’t paying close attention to technology—are you really gonna be thinking about that? This software is still working out some of the kinks, but I think the believability has crossed this threshold that is alarming.

Rosin: So just to give these guys a sense, what can it do now? Like, we heard a robocall. Could it give a State of the Union speech? Could it talk to your wife? What are the things that it can do now that it’s made this leap that it couldn’t do a few months ago, convincingly?

Warzel: Well, the convincing part is the biggest part of it, but the other part of these models is the ability to ingest more characters and throw it out there. So this company, ElevenLabs, has a level that you can pay for where you can—if you’re an author, you can throw your whole novel in there, and it can do it in a matter of minutes, essentially, and then you can go through and you can tweak it. It could definitely do a whole State of the Union. Essentially, it’s given anyone who’s got 20 bucks a month the ability to take anything that they want to do content-wise and have it come out in their voice.

So a lot of people that I know who are independent journalists or authors or people like that are doing all of their blog posts, their email newsletters as podcasts—but also as YouTube videos, because they hook this technology, the voice AI, into one of the video or image generators, so it generates an image on YouTube every few paragraphs and keeps people hooked in.

So it’s this idea of: I’m no longer a writer, right? I am a content human.

Rosin: I’m a multi-platform human. Okay. That sounds—you fill in the adjective.

Warzel: Yeah, it’s intense.

Rosin: Okay, so Charlie went to visit the company that has brought us here. And it’s really interesting to look at them because they did not set out to clone Joe Biden’s voice. They did not set out, obviously—nobody sets out to run fake robocalls. So getting behind that fortress and learning, like, Who are these people? What do they want? was an interesting adventure.

So it’s called ElevenLabs—and, by the way, The Atlantic, I will say, uses ElevenLabs to read out some articles in our magazine. Just so you know. A disclaimer.

I was really surprised to learn that it was a small company. I would have expected Google to be the one to cross this threshold, not this small company in London. How did that happen?

Warzel: So one of the most interesting things I learned when I was there—I was interested in them because they were small and because they had produced this tech that is, I think, better than everyone else.

There are a few companies: Meta has one that they have not released to the public, and OpenAI also has one that they have released only to select users—partly because they aren’t quite sure how to keep it from being abused. But that aside, ElevenLabs is quite good. They are quite small.

What I learned when I was there talking to them is they talked about their engineering team. Their engineering team is seven people.

Rosin: Seven?

Warzel: Yeah, so it’s, like, former—this is the engineering research team. It’s this small, little team, and they describe them almost as, like, these brains in a tank that would just—they would say, Hey, you know, what we really want to do is we want to create a dubbing part of our technology, where you can feed it video of a movie in, you know, Chinese, and it will just sort of, almost in real time running it through the technology, dub it out in English or, you know, you name the language.

Rosin: Is that because dubbing is historically tragic?

Warzel: It’s quite bad. It’s quite flat in a lot of places. Obviously, if you live in a couple of the big markets, you can get some good voice acting in the dubbing. But in Poland, where these guys are from, it is all dubbed in a completely flat voice—the narrators are called lektors. That’s the name for it. But, like, when The Real Housewives was dubbed into Polish, it was one male voice that just spoke like this for all the real housewives.

Rosin: Oh, my God. That’s amazing.

Warzel: So that’s a good example of, like, this isn’t good. And so people, you know, watching U.S. cinema or TV in Poland is, like, kind of a grinding, terrible experience. So they wanted to change things like that.

Rosin: For some reason, I’m stuck on this, and I’m imagining RuPaul being dubbed in a completely flat, accentless, like, sashay away. You know?

Warzel: Totally. So this is actually one of the problems that they initially were setting out to solve, this company. And they kind of, not lucked into, but found the rest of the voice-cloning stuff in that space. They talk about this research team as these brains in the tank. And they’ll just be like, Well, now the model does this. Now the model laughs like a human being. Like, Last week it didn’t.

And again, when you try to talk to them about what they did, it’s not like pushing a button, right? They’re like, It’s too complicated to really describe. But they’ll just say that it’s this small group of people who are, essentially—the reason the technology is good or does things that other people’s can’t do is because they had an idea, an academic idea, that they put into the model, crunched the numbers, and this came out.

And that, to me, was kind of staggering because what it showed me was that with artificial intelligence—unlike, you know, something like social networking where you just got to get a giant mass of people connected, right? It’s network effects. But with this stuff, it really is like Quantum Leap–style computer science. And, you know, obviously, money is good. Obviously, compute is good. But a very small group of people can throw something out into the world that is incredibly powerful.

And I think that is a real revelation that I had from that.

[Music]

Rosin: We’re going to take a short break. And when we come back, Charlie explains what the founders of ElevenLabs hope their technology will accomplish.

[Music]

Rosin: So these guys, like a lot of founders, they did not set out to disrupt the election. They probably have a dream. Besides just better dubbing, what is their dream? When they’re sitting around and you get to enter their brain space, what is the magical future of many languages that they envision?

Warzel: The full dream is, basically, breaking down the walls of translation completely. Right? So there’s this famous science-fiction book, The Hitchhiker’s Guide to the Galaxy, where there’s this thing called the Babel fish that can translate any language seamlessly in real time, so anyone can understand everyone.

That’s what they ultimately want to make. They want to have this—you know, dubbing has a little bit of latency now, but it’s getting faster. That plus all the different, you know, voices. And what they essentially want to do is create a tool at the end, down the line, that you can put an AirPod in your ear, and you can go anywhere, and everyone else has an AirPod in their ear, and you’re talking, and so you can hear everything immediately in whatever language. That’s the end goal.

Rosin: So the beautiful dream, if you just take the purest version of it, is all peoples of the world will be able to communicate with each other.

Warzel: Yeah. When I started talking to them—because, living in America, I have a different experience than they do. Most of them are European—the two founders are European. You know, they said, You grow up, and you have to learn English in school, right?

There are only a few places where you don’t grow up having to learn English if you want to go to university, do whatever, and participate in the world. And they said, If we do this, then you don’t have to do that anymore.

Rosin: Ooh, there goes our hegemony.

Warzel: Imagine the time you would save, of not having to learn this other language.

Rosin: So they’re thinking about Babel and this beautiful dream, and we’re thinking, like, Oh, my god, who’s gonna scam my grandmother, and who’s gonna mess up my election?

Do they think about that? Did you talk to them about that? Like, how aware are they of the potential chaos coming down?

Warzel: They’re very aware. I mean, I’ve dealt with a lot of, in my career, tech executives who are sort of—they’re not willing to really entertain the question. Or if they do, it’s kind of glib, or there’s a little bit of resentment, you can tell. They were very—and I think because of their age (the CEO is 29)—very earnest about it. They care a lot. They obviously look at all this and see—they’re not blinded by the opportunity, but the opportunity looms so large that these negative externalities are just problems they will solve, or that they can solve.

And so we had this conversation, where I called it “the bad things,” right? And I just kept, like: What are you going to do about the jobs this takes away? What are you going to do about all this misinformation stuff? What are you going to do about scams? And they have these ideas, like digitally watermarking all voices and working with all sorts of different companies to build a watermarking coalition, so when you voice-record something on your phone, it has its own metadata that says, like, This came from Charlie’s phone at this time.

Rosin: Uh-huh.

Warzel: You know, like, This is real. Or when you post the ElevenLabs thing, it says—and people can quickly decode it, right? So there’s all these ideas.

But I can tell you, it was like smashing my head against a brick wall for an hour and a half with this really earnest, nice person who’s like, Yeah. No, no. It’s gonna take a while before we, you know, societally all get used to all these different tools, not just ElevenLabs.

And I was like, And in the meantime? And they would never say it this way, but the vibe is sort of like, Well, you gotta break a lot of eggs to get the, you know, universal-translation omelet situation. But you know, some of those eggs might be like the 2024 election. It’s a big egg.

Rosin: Right, right, right. So it’s the familiar story but more earnest and more self-aware.
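What Warzel describes is essentially a provenance scheme: a recording carries metadata about where and when it was made, signed so that any later tampering is detectable. Here is a minimal sketch of that idea in Python; the field names and key handling are invented for illustration, and this is not ElevenLabs’ or any phone maker’s actual design.

```python
# A hypothetical illustration of the provenance idea discussed above:
# sign a recording with device metadata so others can verify its origin.
# This is NOT ElevenLabs' actual watermarking scheme; names are invented.
import hashlib
import hmac
import json
import time

DEVICE_KEY = b"per-device secret provisioned at manufacture"  # assumption

def sign_recording(audio_bytes: bytes, device_id: str) -> dict:
    """Attach tamper-evident metadata: which device recorded this, and when."""
    metadata = {
        "device_id": device_id,
        "recorded_at": int(time.time()),
        "audio_sha256": hashlib.sha256(audio_bytes).hexdigest(),
    }
    message = json.dumps(metadata, sort_keys=True).encode()
    metadata["signature"] = hmac.new(DEVICE_KEY, message, hashlib.sha256).hexdigest()
    return metadata

def verify_recording(audio_bytes: bytes, metadata: dict) -> bool:
    """Recompute the signature; any edit to the audio or metadata breaks it."""
    claimed = dict(metadata)
    signature = claimed.pop("signature")
    if hashlib.sha256(audio_bytes).hexdigest() != claimed["audio_sha256"]:
        return False
    message = json.dumps(claimed, sort_keys=True).encode()
    expected = hmac.new(DEVICE_KEY, message, hashlib.sha256).hexdigest()
    return hmac.compare_digest(signature, expected)
```

The hard part, as the conversation suggests, is not the cryptography but the coalition: every device and platform has to participate before anyone can rely on the check.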

Do you guys want to do another test? Okay. You’ve been listening to me talk for a while. Charlie and I both fed our voices into the system. We’re gonna play you me saying the same thing twice. One of them is me, recorded. I just recorded it—me, the human being, in the flesh right here. And one of them is my AI avatar saying the same thing. There are only two. So we’re gonna vote at the end for which one is fake-AI Hanna. Okay, let’s play the two Hannas.

Rosin (Real): Charlie, how far do you think artificial intelligence is from being able to spit out a million warrior robots programmed to destroy humanity?

Rosin (AI): Charlie, how far do you think artificial intelligence is from being able to spit out a million warrior robots programmed to destroy humanity?

Rosin: Okay, who thinks that number one is fake Hanna?

[Audience claps]

Rosin: Who thinks that number two is fake Hanna?

[Audience claps]

Warzel: It’s pretty even.

Rosin: It’s pretty even. I would say two is more robust, and two is correct—that’s the fake one.

Warzel: I’m zero for two.

Rosin: But man, it’s close. Like, Charlie spent time at this place, and he’s gotten both of them wrong so far.

Warzel: We work together!

Rosin: We work together. This is really, really close.

Warzel: You know, the only, like, bulwark right now against this stuff is that I do think people are, generally, pretty dubious now of most things. Like, I do think there is just a general suspicion of stuff that happens online. And I also think that one thing we have seen from some of these is—there’s been a couple of ransom calls, right? Like you get a—it’s a scam but it’s your mom’s voice, right? Or something like that.

Those things sort of come down the line pretty quickly. Like, you can pretty quickly realize that your mom isn’t being kidnapped. As administrators, you can pretty quickly get to the bottom of that. Basically, I don’t know how effective these things are yet, because of the human element. Right? It seems like we have a little bit more of a defense now than we did, you know, let’s say, in 2016.

And I do think that time is our greatest asset here. With all of this, the problem is, you know, it only takes one, right? It only takes some person in late October or early November who puts out something just good enough that it’s the last thing someone sees before they go to the polls, right?

And it’s too hard to debunk, or that person doesn’t see the debunking, right? And so, those are the things that make you nervous. But also, I don’t think yet that we’re dealing with godlike ability to just totally destroy reality.

It’s sort of somewhere in the middle, which is still, you know, nerve-wracking.

Rosin: So the danger scenario is a thin margin, very strategic use of this technology. Like, less-informed voters, a suppress-the-vote effort—someplace where you could use it in small, strategic ways. That’s a realistic fear.

Warzel: Yeah, like, hyper-targeted in some way.

I mean, it’s funny. I’ve talked to a couple of AI experts and people in the field of this, and they’re so worried about it. It’s really hard to coax out nightmare scenarios from them. They’re like, No, I’ve got mine. And I’m absolutely not telling a journalist. Like, no way. I do not want this printed. I do not want anyone to know about it. But I do think—and this could be the fact that they’re too close to something, or it could be that they’re right, and they are really close to it. But there’s so much fear from people who work with these tools. I’m not talking about the ElevenLabs people, necessarily.

Rosin: But AI people.

Warzel: But AI people. I mean, true believers in the sense of, you know, If it doesn’t happen this time around, well, wait ’til you see what it’s going to be in four years.

Rosin: I know. That really worries me, that the people inside are so worried about it. It’s like they’ve birthed a monster kind of vibe.

Warzel: It’s also good marketing. You can go back and forth on this, right? Like the whole idea of, you know, We’re building the Terminator. We’re building Skynet. It could end humanity. Like, there’s no better marketing than like, We are creating the potential apocalypse. Pay attention.

Rosin: Right. All right. I’m going to tell you my two fears, and you tell me how realistic they are. One is the absolute perfection of scams designed to target older people who are slightly losing their memories, scams that are already pretty good. Like, they’re already pretty good, and you already hear so many stories of people losing a lot of money. That is one I’m worried about. Like, how easy it is to consistently call someone in the voice of a grandson, or in the voice of whatever. That one seems like a problem.

Warzel: Yeah, I think it will be, and I don’t think it has to be limited to people who are so old they’re losing their memories. It’s difficult to discern this stuff. And, I think, what I have learned from a lot of time reporting on the internet is that nobody is immune to a scam.

Rosin: Yes.

Warzel: There’s a scam waiting to match with you. And, you know, when you find your counterpoint, it’s—

Rosin: It’s like true love.

Warzel: Exactly.

Rosin: Out there is the perfect scam for you. Okay, one more worry and then we’re going to do our last test.

My real worry is that people will know that things are fake, but it won’t matter, because people are so attached to whatever narrative they have that it won’t matter to them if you prove something is real or fake.

Like, you can imagine that Trump would put out a thing that was fake and everybody would kind of know it’s fake, but everyone would collude and decide that it’s real, and proceed based on that. Like, real and fake just—it’s not a line people worry about anymore, so it doesn’t matter.

Warzel: I fully think we live in that world right now. I mean, honestly.

I think a good example is a lot of the stuff, not only the stuff that you see coming out of the Middle East in the way that—I mean, obviously there’s so much literal digital propaganda and misinformation coming from different places, but also just from the normal stuff that we see. And this is a little less AI-involved, but I think there’s just a lot of people, especially younger people, who just don’t trust the establishment media to do the thing. And they’re like, Oh, I’m gonna watch this, and I don’t really care. And so I think the level of distrust is so high at the moment that we’re already in that situation.

Rosin: Like we’re of a generation, and we’re journalists, and so we sit and worry about what’s real and what’s fake, but that’s not actually the line that people are paying attention to out there.

Warzel: Yeah. I think the real thing is, like, getting to a point where you have built enough of a para-social trust relationship with someone that they’re just gonna believe what you say and then try to be responsible about it, about delivering them information, which is crazy.

Rosin: Okay. One final fake-voice trick. This one’s on me since, Charlie, you were wrong both times. Now it’s my turn.

My producers wanted to give me the experience of knowing what it’s like to have your voice saying something that you didn’t say. So they took my account, they had my voice say things, and I haven’t heard it, and I don’t know what it is. So we are going to listen to that now. It will be a surprise for all of us, including me. So let’s listen to these fake voicemails created by my wonderful producers.

Rosin (AI): Hi! I’m calling to leave a message about after-school pickup for my kids. Just wanted to let their homeroom teacher know that Zeke in the white van is a dear family friend, and he’ll be picking them up today.

Rosin: (Laughs.) Okay.

Rosin (AI): Hi, mom. I’m calling from jail, and I can’t talk long. I’ve only got one phone call. I really need you to send bail money as soon as you can. I need about $10,000. Cash App, Venmo, or Bitcoin all work.

Rosin: My mom does not have $10,000.

Rosin (AI): Hey, I hope I have the right number. This is a voicemail for the folks running the Cascade PBS Ideas Festival. I’m running late at the moment and wondering if I’m going to make it. Honestly, I feel like I should just skip it. I can’t stand talking to that Charlie-whatever character. Why am I even here? Washington, D.C., is clearly the superior Washington anyway.

[Crowd boos]

Rosin: Oooh. Yeah, okay, okay. Now, I would say I was talking too fast.

Warzel: So one thing I did with my voice is I had it say a whole bunch of far worse things, like, COVID came from a—whatever, you know, just to see what those things would be like. And they were sort of believable, whatever.

But also, what if then you took audio—so the one from jail, right? What if you took audio—your producers, our producers are great—and inserted a lot of noise that sounded like it was coming from a crowd, or like a slamming of a cell door or something like that in the background, faded it in nicely? That would be enough to ratchet it up, right?

And I think all those things can become extremely believable if you layer the right context on them.

Rosin: Right. You know what, Charlie? Here’s the last thing. You, as someone who’s been really close to this, fluctuate between, Okay, we don’t need to be that alarmed. It’s only got these small uses, and, But also, it’s got these uses, and they’re really scary.

Having been close to this and gone through this experience, is there a word you would use to sum up how you feel now? Because, clearly, it’s uncertain. We don’t actually know—we don’t know how quickly this technology is going to move.

How should we feel about it?

Warzel: I think disorientation is the word because—so a big reason I wanted to go talk to this company was not just because of what they were doing, but to be kind of closer, to get some proximity to the generative-AI revolution, whatever we’re gonna call it. Right? To see these people doing it. To feel like I could moor my boat to something and just feel like—

Rosin: You have control.

Warzel: Yeah, and I understand what we’re building towards, or that they understand what they’re building towards. And the answer is that you can walk up to these people and stare them in the face and have them answer questions and just sort of feel really at sea about a lot of this stuff, because there are excellent transformative applications for this. But also, I see, you know, this voice technology with the other generative-AI technologies—basically, a good way to think of them is like plug-ins to each other, right? And people are going to use, you know, voice technology with ChatGPT with some of the video stuff, and it’s going to just make the internet—make media—weirder. Right?

Everything you see is going to be weirder. The provenance of it is going to be weirder. It’s not necessarily always going to be worse, right? But it could be. And it could maybe be better. But everyone seems like they’re speeding towards this destination, and it’s unknown where we’re going.

And I just feel that disorientation is sort of the most honest and truthful way to look at this. And I think when you’re disoriented, it’s best to be really wary of your surroundings, to pay very close attention. And that’s what it feels like right now.

Rosin: We can handle the truth. Thank you for giving us the truth. And thank you, all, for coming today and for listening to this talk, and be prepared to be disoriented.

[Music]

Rosin (AI): Thanks for listening. And thank you to the production staff of the Cascade PBS Ideas Festival. This is the AI version of Hanna Rosin speaking, as made by ElevenLabs.

This episode of Radio Atlantic was produced by Kevin Townsend. He’s typing these words into ElevenLabs right now and can make me say anything. “You may hate me, but it ain’t no lie. Baby, bye, bye, bye. Bye, bye.”

This episode was edited by Claudine Ebeid and engineered by Rob Smierciak. Claudine Ebeid is the executive producer of Atlantic audio, and Andrea Valdez is our managing editor. I’m not Hanna Rosin. Thank you for listening.
