
In May 2015, The Simpsons voice actor Harry Shearer – who plays a number of key characters including, quite incredibly, both Mr Burns and Waylon Smithers – announced that he was leaving the show.

By then, the animated series had been running for more than 25 years, and the pay of its vocal cast had risen from $30,000 an episode in 1998 to $400,000 an episode from 2008 onwards. But Fox, the producer of The Simpsons, was looking to cut costs – and was threatening to cancel the series unless the voice actors took a 30 per cent pay cut.

Most of them agreed, but Shearer (who had been critical of the show’s declining quality) refused to sign – after more than two decades, he wanted to break out of the golden handcuffs, and win back the freedom and the time to pursue his own work. Showrunner Al Jean said Shearer’s iconic characters – who also include Principal Skinner, Ned Flanders and Otto Mann – would be recast.

But you’ll never stop The Simpsons. After a few months, Shearer relented and signed a new deal. The show often jokes about the replaceability of voice actors in animation, but as it pushes through its fourth decade, it’s the iconic voices behind the laughter that could pose the biggest threat to its continued existence. The actors who play Springfield’s residents are approaching retirement age – most are in their 60s or 70s, and Shearer is 77 – and they might soon decide they don’t want to do it anymore. They certainly don’t need the money: between fees for new episodes and residuals from repeats of old ones, they’re sitting on tens of millions of dollars.

But maybe the producers of the show don’t actually need voice actors anymore. In a recent episode, Edna Krabappel – Bart’s long-suffering teacher, whose character was retired from the show after the death of voice actor Marcia Wallace in 2013 – was brought back for a final farewell using recordings that had been made for previous episodes.

Advances in computing power mean that you could extend that principle to any character. Deepfake technology can make convincing lookalikes from a limited amount of training data, and the producers of the show have thirty years’ worth of audio to work from. So could The Simpsons replace its voice cast with an AI?

“You could certainly come up with an episode of The Simpsons that is voiced by the characters in a believable way,” says Tim McSmythurs, a Canada-based AI researcher and media producer who has built a speech model that can be trained to mimic anyone’s voice. “Whether that would be as entertaining is another question.”

On his YouTube channel, Speaking of AI, McSmythurs recasts an iconic scene from Notting Hill with Homer playing the Julia Roberts role; Donald Trump stands in for Ralph Wiggum, and Joe Biden ties an onion to his belt, which was the style at the time.

McSmythurs built a generic AI model that can turn any English text into spoken audio. When he wants to create a new voice, he fine-tunes the model with two or three hours of audio of that particular person speaking, along with a text transcript. “It focuses in on what makes a Homer voice a Homer voice, and the different frequencies,” he says.
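To make the mechanics concrete, here is a rough sketch, in PyTorch, of the freeze-and-fine-tune pattern McSmythurs describes – a generic English model whose shared layers are locked while the speaker-specific parts adapt to the new voice. The model, data and hyperparameters are toy placeholders, not his actual system.

```python
import torch
import torch.nn as nn

# Toy stand-in for a pre-trained generic text-to-speech network. A real
# system (Tacotron, FastSpeech and friends) is far larger; this only
# illustrates the freeze-and-fine-tune pattern.
class ToyTTS(nn.Module):
    def __init__(self, vocab=60, mel_bins=80):
        super().__init__()
        self.encoder = nn.Embedding(vocab, 128)      # generic "any English" layers
        self.decoder = nn.GRU(128, 256, batch_first=True)
        self.voice_head = nn.Linear(256, mel_bins)   # speaker-specific layers

    def forward(self, text_ids):
        x = self.encoder(text_ids)
        x, _ = self.decoder(x)
        return self.voice_head(x)                    # predicted spectrogram frames

model = ToyTTS()
# In reality you would load pre-trained generic weights here, e.g.:
# model.load_state_dict(torch.load("generic_english_tts.pt"))

# Freeze the generic layers; only the voice-specific parts learn
# "what makes a Homer voice a Homer voice".
for p in model.encoder.parameters():
    p.requires_grad = False

opt = torch.optim.Adam(
    [p for p in model.parameters() if p.requires_grad], lr=1e-4)
loss_fn = nn.L1Loss()

# Two or three hours of (transcript, audio) pairs would be batched here;
# random tensors stand in for phoneme ids and target spectrograms.
for step in range(1000):
    text_ids = torch.randint(0, 60, (8, 50))
    target_mel = torch.randn(8, 50, 80)
    loss = loss_fn(model(text_ids), target_mel)
    opt.zero_grad()
    loss.backward()
    opt.step()
```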

After that, it’s a matter of asking the model to generate multiple takes – each one will vary slightly – and choosing the best one for your purposes. The outputs are recognisably Homer, but they sound a little emotionally flat, as if he’s reading out something that he doesn’t really understand the meaning of. “It does depend on the training data,” McSmythurs says. “If the model hasn’t been exposed to those quite wide ranges of emotion it can’t create it from scratch. So it doesn’t sound as energetic as Homer might.”
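Cherry-picking takes can be automated, at least crudely. The sketch below stubs out the synthesiser and scores each take by the variance of its frame-level energy – a hypothetical proxy for the vocal energy McSmythurs mentions, not a method any of these companies has described.

```python
import numpy as np

def synthesize(text: str, seed: int) -> np.ndarray:
    """Stub for a voice model; each seed yields a slightly different take."""
    rng = np.random.default_rng(seed)
    return rng.normal(size=16000)  # one second of fake audio at 16 kHz

def liveliness(wav: np.ndarray, frame: int = 400) -> float:
    # Crude proxy: variance of per-frame energy. Flat, monotone takes
    # score low; more dynamic takes score high.
    frames = wav[: len(wav) // frame * frame].reshape(-1, frame)
    return float(np.var((frames ** 2).mean(axis=1)))

line = "D'oh! I mean... woo-hoo!"
takes = [synthesize(line, seed) for seed in range(10)]
best = max(takes, key=liveliness)
```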

British startup Sonantic has developed a way of bringing that emotional range to AI voices. It works with voice actors to gather a wide range of training data – several hours of the actors running through different lines with different emotional tones. “We know the difference between sarcasm and sincerity, and the tiny little clues in sound,” says John Flynn, Sonantic co-founder and CTO. “We stretch those natural points and nuances and inflections.”
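Flynn doesn’t detail Sonantic’s architecture, but one common way to expose that kind of emotional control in a neural synthesiser is to condition it on an explicit emotion label, learned as an embedding. A minimal, purely illustrative sketch:

```python
import torch
import torch.nn as nn

# One common conditioning scheme (not necessarily Sonantic's): each
# emotion label gets a learned vector that is added to every frame of
# the text encoder's output before the audio is generated.
EMOTIONS = ["neutral", "sincere", "sarcastic", "angry", "afraid"]

class EmotionConditioning(nn.Module):
    def __init__(self, dim=128):
        super().__init__()
        self.emotion = nn.Embedding(len(EMOTIONS), dim)

    def forward(self, text_features, emotion_id):
        # text_features: (batch, time, dim) from the TTS encoder
        return text_features + self.emotion(emotion_id)[:, None, :]

cond = EmotionConditioning()
feats = torch.randn(1, 50, 128)                       # 50 frames of encoded text
sarcastic = torch.tensor([EMOTIONS.index("sarcastic")])
out = cond(feats, sarcastic)                          # same line, sarcastic delivery
```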

The amount of training data required has decreased drastically, Flynn says, from 30 to 50 hours down to just 10 or 20 minutes. Brisbane-based Replica Studios has built a model that can be trained to recreate a voice simply by being fed recordings of 20 short but specific sentences. “The more data you have the more performance you can get, but we can do something in a couple of minutes,” says Shreyas Nivas, Replica co-founder and CEO.

Words are constructed from syllables, which are built from phonemes – all the individual sounds that your mouth is able to make. In theory, a training model could get everything it needed from a single sentence known as a phonetic pangram, which contains every phoneme of English, although in practice this varies depending on your accent. (For example, try thinking of all the different ways there are to say: “The beige hue on the waters of the loch impressed all, including the French queen, before she heard that symphony again, just as young Arthur wanted.”)
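You can test that coverage claim yourself with NLTK’s copy of the CMU Pronouncing Dictionary, which transcribes American English into a 39-phoneme ARPAbet inventory. Whether the sentence scores a perfect 39 depends on which pronunciations the dictionary happens to list – the accent caveat in action:

```python
import re
from nltk.corpus import cmudict  # requires: nltk.download("cmudict")

pangram = ("The beige hue on the waters of the loch impressed all, "
           "including the French queen, before she heard that symphony "
           "again, just as young Arthur wanted.")

pron = cmudict.dict()

def phonemes(word):
    # First listed pronunciation, with stress digits stripped (AH0 -> AH).
    entries = pron.get(word.lower())
    if not entries:
        return set()
    return {re.sub(r"\d", "", p) for p in entries[0]}

covered = set()
for word in re.findall(r"[a-zA-Z']+", pangram):
    covered |= phonemes(word)

# The full ARPAbet inventory used anywhere in the dictionary (39 phonemes).
inventory = {re.sub(r"\d", "", p)
             for prons in pron.values() for pr in prons for p in pr}

print(f"covered {len(covered)} of {len(inventory)} phonemes")
print("missing:", sorted(inventory - covered))
```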

Voice generation technology is already finding a use in video games – Sonantic is working with Obsidian, the studio behind Fallout: New Vegas and The Outer Worlds, while Replica has a number of AAA and indie games studios as clients. In games, AI voices can be used to fill out an open world with a much wider range of conversations, instead of limiting characters to lines that were recorded by a voice actor in a studio.

Nivas says the technology is particularly useful in the development stage, where an AI version of the voice can act as a stand-in that enables the creators of the game to try out various options before getting the real actor in. It could also drive increased customisation – commentators screaming your actual name in games like FIFA could be one application, while Replica developed a mod for Cyberpunk 2077 that changes the main character’s name and enables every character who interacts with them to say it. Combining AI voice generation, speech recognition, and a text-generation model like GPT-3 could mean players can actually converse with non-player characters, with dialogue generated right there and then.
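A rough sketch of that loop, wired together from off-the-shelf parts: the speech_recognition library listens to the player, the original GPT-3 completion endpoint (via the pre-2023 openai package) improvises a reply, and pyttsx3 speaks it as a stand-in for a cloned character voice. The persona prompt and the NPC are invented for illustration.

```python
import openai              # pip install openai<1.0 (classic GPT-3 completion API)
import pyttsx3             # generic desktop TTS, standing in for a cloned NPC voice
import speech_recognition as sr  # pip install SpeechRecognition pyaudio

openai.api_key = "YOUR_KEY"
recognizer = sr.Recognizer()
voice = pyttsx3.init()

# Hypothetical persona for illustration only.
PERSONA = ("You are Viktor, a gruff ripperdoc in a cyberpunk city. "
           "Reply in one or two short sentences.\n")

def npc_reply(player_line: str) -> str:
    resp = openai.Completion.create(
        engine="davinci",
        prompt=PERSONA + f"Player: {player_line}\nViktor:",
        max_tokens=60,
        stop=["Player:"],
    )
    return resp.choices[0].text.strip()

while True:
    with sr.Microphone() as mic:                 # 1. listen to the player
        audio = recognizer.listen(mic)
    try:
        heard = recognizer.recognize_google(audio)  # 2. speech recognition
    except sr.UnknownValueError:
        continue
    line = npc_reply(heard)                      # 3. dialogue generated on the fly
    voice.say(line)                              # 4. AI voice speaks the reply
    voice.runAndWait()
```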

However, unless Fox decides to hand over scriptwriting and animation to an AI too, you wouldn’t need any of those features for something scripted like The Simpsons. In fact, using an AI to recast a character would probably be more trouble than just finding someone who can do a pretty good Homer impression. “If the goal is to produce another episode of the show, the best way would be to get the acting cast together with a script and have them perform it – they would deliver a higher quality performance because they have been doing so successfully for decades and they can embody the characters perfectly,” says Nivas. “Using an AI voice actor would require more iterations and more work than just reassembling the cast.”

There’s also a legal minefield to navigate for any producer seeking to recast unruly voice actors with an AI. “This area of the law is thorny,” says Jennifer Rothman, a law professor at the University of Pennsylvania, and the author of The Right of Publicity: Privacy Reimagined for a Public World.

On the one hand, contracts may limit what the studio is allowed to do with the recordings. Added to that are collective bargaining issues – the actors union SAG-AFTRA has, Rothman says, “been very active in trying to regulate the reanimation and reuse of both voice actors and on-screen actors”.

On the other hand, in the absence of any contractual stipulations, copyright law comes into play. “Whoever owns the copyright to The Simpsons would hold all of the rights to reproduce the copyrighted works they’ve already made – including the captured recordings of the actors’ performances, and the right under copyright law to make derivative works,” Rothman says.