A person with a machine for a brain. Graphic by Callum Abbott.

Will AI Lead to New Creative Frontiers, or Take the Pleasure Out of Music?

We can now train machines to play and write music that goes beyond mere mimicry—but does that mean we should?

At Barcelona’s Sónar festival last fall, artist and researcher Mat Dryhurst stepped up to the microphone and began to sing, but the voice of his wife—the electronic musician and technologist Holly Herndon—came out instead. When Dryhurst giggled, the sound was unmistakably hers, high and clear like a bell—and not, as far as anyone could hear, some kind of electronic trick, but as seemingly real as the sound of any human larynx can be.

The performance was part of a demonstration of Holly+, Herndon’s latest experiment in artificial intelligence, which takes one sound and, through the magic of a neural net, turns it into another. Imagine Nicolas Cage and John Travolta swapping visages in Face/Off, only this time it’s their voices that trade places.

The effect—watching Herndon’s voice emit from Dryhurst’s mouth—was uncanny. It was also a likely sign of things to come, of a world of shapeshifting forms looming on the horizon: identity play, digital ventriloquism, categories of art and artifice we don’t even have names for yet. The audiovisual forgeries known as deepfakes have been around since the late ’10s, and the technique is becoming increasingly common in pop culture; just this month, Kendrick Lamar’s “The Heart Part 5” video eerily morphed the rapper’s likeness into the faces of O.J. Simpson, Will Smith, and Kanye West.

But the real-time aspect of Holly+ feels new. Created in collaboration with AI researchers and instrument builders Never Before Heard Sounds, Holly+ is a vocal model, a species of deep neural network, that has been trained on Herndon’s voice. To make it, she recorded hours and hours of speaking and singing, from which the system learned to synthesize her vocal timbre. Feed the system a line of text or a snippet of audio, and it regurgitates the sound back in Herndon’s voice.
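Conceptually, the mechanics are easy to caricature: analyze what the incoming audio is doing, then resynthesize it with timbral characteristics learned from the target voice. The sketch below is only that, a caricature; the sample rate, the toy “timbre,” and every function name are invented for illustration and have nothing to do with how Holly+ is actually built.

```python
# A crude stand-in for the idea behind a vocal model: keep the incoming signal's
# loudness contour, but resynthesize it with a "target" timbre learned elsewhere.
# Every name and number here is invented; this is not drawn from Holly+ itself.
import numpy as np

SAMPLE_RATE = 16000
TARGET_HARMONICS = [(220.0, 1.0), (440.0, 0.5), (660.0, 0.25)]  # a pretend "learned" timbre

def amplitude_envelope(audio, frame=400):
    """Frame-by-frame loudness of the incoming audio, stretched back to full length."""
    pad = int(np.ceil(len(audio) / frame)) * frame - len(audio)
    framed = np.pad(audio, (0, pad)).reshape(-1, frame)
    return np.repeat(np.abs(framed).mean(axis=1), frame)[: len(audio)]

def convert(audio):
    """Impose the source's loudness contour onto the target timbre."""
    t = np.arange(len(audio)) / SAMPLE_RATE
    timbre = sum(amp * np.sin(2 * np.pi * freq * t) for freq, amp in TARGET_HARMONICS)
    return amplitude_envelope(audio) * timbre

source = np.random.uniform(-1, 1, SAMPLE_RATE)  # one second of stand-in input audio
converted = convert(source)
print(converted.shape)  # same length as the input, but a different "voice"
```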

You can try it out right now, using the web interface. There are caveats; the online tool doesn’t have the fidelity of what was heard in Herndon’s Sónar presentation. The sound is tinny, crinkly like cellophane, haunting. When the outgoing signal is garbled, it sounds like electronic voice phenomena, or EVP—unintelligible recordings of audio interference ostensibly emanating from the spirit world. Just for fun, I fed it a snippet of Alvin Lucier’s “I Am Sitting in a Room,” a landmark 1969 tape-music composition in which a recording of the composer’s speech is re-recorded in a room until it dissolves into the space’s resonant frequencies, and what came back sounded like it might have come from a horror film.

But machine-learning experiments like Holly+ are only as eerie as the assumptions we project upon them. “When a lot of people think about AI, they just think about this alien, crazy intelligent other,” Herndon tells me over Zoom, having just returned from a think tank in Italy. “But machine learning is just aggregated human intelligence.”

This observation may not inspire much confidence in the technology; human intelligence, and human righteousness, are notoriously fallible. The prospect of artificial intelligence has been with us for decades, but the speed with which it is advancing is enough to give anyone pause. In music, the stakes are not nearly as high as they are in the realm of international relations, where deepfakes of political leaders can spread dangerous disinformation. A neural net capable of mimicking Elvis Presley or Katy Perry, as OpenAI’s Jukebox can, is somewhere between a proof of concept and a nifty parlor trick. But just as the internet irreversibly changed the way people engage with music, AI is likely to change the way we both make and consume it. And because machine learning is already purring away under the hood of so many of our digital tools—photo apps, voice assistants, late-night Shazams—it tends to go unnoticed and unremarked upon. Music and art offer a chance to shine a spotlight on the ghost in the machine.

Herndon has been working with AI for a number of years now. On her 2019 album PROTO, with the assistance of AI researcher Jules LaPlace, Herndon trained a neural net—which she called Spawn, and to which she gave she/her pronouns—on human voices; she then invited Spawn to sing along with an ensemble of human vocalists. The album tracks Spawn’s progress as she learns and improves: On “Birth,” you can hear her stuttering attempts at speech against a backdrop of jittering choral harmonies. Elsewhere, she flickers, holographic, the star vocalist in a transhuman ensemble. The imperfections of Spawn’s voice—glitchy, pixel-pocked, riddled with digital artifacts—lent themselves to Herndon’s signature style of experimental electronic music, which revels in fractured rhythms and twisted timbres.

But the technology has evolved rapidly, meaning that whatever comes next for Herndon and Holly+, it is likely to be vastly different in form and scope. “I was working with lo-fi audio in this super hi-fi digital world,” Herndon says. “Now everything’s becoming much more realistic and high-fidelity, so the thing I was dreaming about during PROTO—a world where you could just sing through someone else’s voice—is becoming a reality. With that comes the realization that, oh shit, the music community isn’t ready for this. Our infrastructure isn’t set up for this.”

In other words, Holly+, and the techniques behind it, raise new questions—about ownership, copyright, ethics, even basic media literacy. Who owns the sound of a voice? What is the line between self-expression and appropriation? And how do we know when not to believe our ears?

For Herndon, Holly+ is a way of testing the waters of technologically aided identity play; the next phase of the project is an interface to allow anyone to input text and generate audio of her speaking voice, which they can use however they like. “Which is kind of fucked up and scary,” she says. “But here we go!” She believes that the shifts—conceptual, legal, economic—that will accompany AI-assisted art and music are akin to those that accompanied sampling. Decades after hip-hop producers first started lifting breakbeats from old funk records, we’re still struggling to catch up to the myriad implications of borrowing others’ creations without their consent. So it’s crucial, she says, that we begin grappling with the changes that AI is likely to bring.

Committed advocates of Web3 and blockchain technology, Herndon and Dryhurst are investigating ways that works created using her voice could generate royalties via “smart contracts” written into the technology. (Early in May, she auctioned off a set of NFTs based on 70 different artists’ songs made using her voice.) She hopes that, by volunteering her own voice as an AI guinea pig, she might help establish precedents for future cases—whether they involve artists like her, who willingly clone their own voices, or, more ominously, artists whose voices are appropriated and reused in ways beyond their control. “I think people should have the right to decide what happens with their digital likeness,” she says. “There should be some sort of sovereignty over one’s own voice.”


For a long time, AI projects tended toward mimicry. In 1980, the American composer David Cope started teaching computers to write music. After training a program called Emmy—a nickname for EMI, or “Experiments in Musical Intelligence”—to reproduce the complex counterpoints of Johann Sebastian Bach, he ultimately produced written scores for 5,000 “new” Bach chorales, some of which he included on his 1994 album Bach by Design.

Emmy’s parroting was convincing: In a 1997 Turing test that set one of Emmy’s ersatz compositions against a Bach-inspired piece written by a music professor and an actual work by the German composer, the audience thought Emmy’s rendition was the real deal. Later came albums like Virtual Mozart and Virtual Rachmaninoff, but eventually, discomfited by the sheer volume of Emmy’s output, Cope unplugged the machine: The potential limitlessness of the work had the effect of devaluing individual pieces, no matter how compelling any one might be on its own.

Cope’s next act was to use AI as a creative tool. Feeding a prompt to a program named Emily Howell, he would then cherry-pick from the results and create his own composition out of them—treating AI as a transhuman collaborative partner, in effect. Other musicians, like YACHT and even David Bowie, have similarly utilized AI as an endless idea generator. On her 2018 album I AM AI, the actor, singer, and filmmaker Taryn Southern fed suggestions into a neural network, then used the exported audio as the foundation for her own songwriting. For such an ostensibly radical approach, I AM AI’s twinkly pop-rock turned out to be disappointingly conventional—a reminder that AI is only as adventurous as the human imagination at the controls.

But increasingly, artists want to hear what computers think—not just in their own words but also in their own voice. Like Holly Herndon’s PROTO, the German electronic group Mouse on Mars’ 2021 album AAI—short for “anarchic artificial intelligence”—is built around a voice model trained on human speech. For that model, they turned to Louis Chude-Sokei, a Boston-based professor who has written extensively on the ethical dimension of AI. In a sense, you could say that Chude-Sokei’s work is also concerned with fair-use principles, but he is just as concerned with what might be fair for the AI itself.

As a scholar of African American studies, Chude-Sokei is deeply invested in questions of power and subjugation. “History has taught us that humans can be made inhuman, transformed into animals or treated like objects,” he writes in “Creolization and Machine Synthesis,” an essay accompanying AAI. “Because technology has been implicated in this history, we wanted to imagine and make possible the opposite. We wanted to craft a story in which inhuman objects could redefine life and reimagine what it is to be human.”

When I speak with Mouse on Mars’ Jan St. Werner, he poses related questions like: Could a tool have its own identity? How can we own something that has its own life? The group hit upon the idea of an “anarchic AI,” a machine entity that blithely makes its own decisions, and decided to zero in on the voice as the source of its essence. To train the neural network, they invited Chude-Sokei to read a selection of his texts into the machine, and then modeled the AI on his voice. To make it playable, in a musical sense, they scripted in various parameters, allowing them to make it noisier, more abstract, more onomatopoeic. The results were “guttural, swarming,” recalls St. Werner. They were also primitive. “It was like building a synthesizer in 1910.”

The AAI’s jumpy, stuttering speech then became the basis for Mouse on Mars’ jittery compositions, which turned out to be even knottier and more complex than usual. “The beats are really, really fucked up,” says St. Werner. “It’s super hard to play them live, because you can’t even count them.”

But for Mouse on Mars, the whole point was to use AI to make something difficult, even ungainly. St. Werner believes that there are great risks in the use of AI as a tool of perfection and standardization—whether that means programming an app to write the perfect pop song or to automatically Photoshop the perfect cheekbones into our selfies. They were more interested in exploiting the AI’s potential for error, and using those unexpected glitches as the inspiration for previously unimagined rhythms and sounds. “I think it can be helpful to understand the possibilities of something that is odd, or isn’t behaving properly,” he says. Like any tool, AI is a means of interacting with our world—of altering it, maybe, but also being altered by it. “The real potential of AI is to make us more humble.”


While Herndon, Mouse on Mars, and their experimental cohort in the electronic-music space—Arca, Lee Gamble, Debit, Ash Koosha—are exploring the conceptual dimensions of AI, a growing number of companies are designing AI-enabled music-creation and streaming apps. These tools have enormous implications for the way we make and listen to music—not to mention the way musicians get paid (or don’t) to make it.

Drew Silverstein was originally a film composer. He studied music composition at Vanderbilt and went into business writing music for movies, TV shows, and video games. The story he tells will be familiar to anyone in a creative field. Time and again, his clients told him something along these lines: We love your music, but we just don’t have the budget for it right now. Just this once, do you have something you could give us for free, as a favor?

As a business model, of course, favors are neither sustainable nor particularly scalable. But Silverstein and his colleagues came up with another idea: What if they could create an AI that was capable of fulfilling the sorts of jobs they were being asked to do for free? The end result would never be confused for Hans Zimmer or Danny Elfman, but it might just be good enough that some people—independent video producers, YouTubers, local advertisers—would be willing to pay for it. Silverstein mapped out the original algorithm in Excel, and by 2016, they had created Amper, a tool that allows anyone, no matter how non-musical, to generate soundtrack-ready audio at, literally, the click of a mouse.

And it works. You can try it right now. When you launch a new project, you’re given the choice of nine different musical genres that break down into subcategories like “electronic trip-hop” or “documentary futuristic”; next, you select from a palette of moods: “determined,” “dreamy,” “mysterious,” etc. It helps if you have a piece of video to work to; I uploaded a short clip of my daughter doing handstands and opted for “sweet,” which gave me an array of preset melodic themes, including “Lush Forest,” “Piano Staircase,” and “It’s Happening Babe” (probably a popular choice for pop-the-question videos). I chose “Treetops,” clicked “compose,” and seconds later had a peppy snippet of music, the kind of thing you might hear soundtracking a TV commercial for a digital camera: perky marimba, plucky strings, glistening chimes.

It was okay, but I thought it could be better. Amper Score, the company’s entry-level web product, is meant for non-musicians, so it isn’t a fully functional digital-audio workstation like Logic or Ableton; you can’t draw in notes or write chord changes. (The company offers a more robust API to musicians interested in more in-depth projects; Taryn Southern used Amper’s technology on her AI album.) But you can swap out instruments and add additional ones, alter the reverb on each, and add layers of nature sounds. By clicking a button called “remix,” you can randomize the notes and rhythms, generating new patterns until you find the one that you like.

“It was our fundamental thesis that in creative spaces, there is no right answer,” says Silverstein. That meant that Amper couldn’t look to the same methodologies used by AI technologies for doctors or Teslas. An AI based on yes-or-no questions—Does the X-ray show a visible fracture? Will the self-driving car hit the pedestrian?—has no way of discerning that a death-metal growl probably won’t sound good on a wedding video. “So we designed Amper to be wholly collaborative, meaning you could provide feedback and say, ‘Here’s what I like, here’s what I don’t like, here’s what I’d like to change,’” just as any client might send notes back to a composer.

To that end, I added a bright glockenspiel and changed the key from D to F, which sounded somehow zippier. Because my video was outdoors, I thought it would be cool to add a soundscape of katydids. Finally, I clicked the “remix” button and was treated to a nice modulating chord toward the end. Perhaps the coolest thing was the way the strings played a final root chord just as my daughter stuck her landing—the sonic equivalent of a “ta-da!” I realize the AI wasn’t actually “reading” the video, but the touch seemed eerily human.

Amper functions, essentially, as the sum of numerous chance operations; every time you click on a given preset, the system is rolling multiple dice, and the output will be different every time. Utilizing a vast library of sampled sounds, the software composes note by note, “like a human would,” says Silverstein. To provide the raw materials, his team created what he calls “the world’s largest sample library,” capturing the sounds of thousands of unique instruments, both acoustic and electronic; they then designed a dataset containing all the information the AI would need to understand the basics of contemporary music and allow it to riff on what it had learned. Having taught the AI the meaning of concepts like guitars, sadness, and ’80s rock, he says, “We could say to the AI, ‘Please compose a piece of music for the guitar in an ’80s rock style that will make someone feel sad.’ And it would synthesize it from scratch.”
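As a rough illustration of that dice-rolling idea, and emphatically not Amper’s actual code, a minimal sketch might look like this; the preset, the instrument names, and the note choices are all invented:

```python
import random

# Invented presets: in Amper's case, these would point into a huge sample library.
PRESETS = {
    "sweet": {
        "instruments": ["marimba", "pizzicato strings", "chimes"],
        "scale": [62, 64, 66, 69, 71],  # D major pentatonic, as MIDI note numbers
    },
}

def compose(mood, bars=8, seed=None):
    """Roll the dice note by note; a new seed is, in effect, a new 'remix.'"""
    rng = random.Random(seed)
    preset = PRESETS[mood]
    score = []
    for _ in range(bars):
        instrument = rng.choice(preset["instruments"])
        for _ in range(4):  # four beats per bar
            pitch = rng.choice(preset["scale"])
            duration = rng.choice([0.5, 1.0])
            score.append((instrument, pitch, duration))
    return score

print(compose("sweet", seed=1)[:3])
print(compose("sweet", seed=2)[:3])  # same preset, different roll of the dice
```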

Crucially, building their own dataset, rather than working off recorded music, allowed Amper to sidestep potential copyright issues. In infringement cases, U.S. copyright law considers not only intent but also access: That is, might a previous encounter with a given piece of music have led to an instance of entirely unconscious copying? But unlike Robin Thicke, Amper can never be accused of copying Marvin Gaye, because Amper has never heard a lick of Marvin Gaye’s music. “And in a magical world where Amper one day created a piece of music that is Marvin Gaye?” Silverstein replies, anticipating my next thought: “It’s monkeys typing Shakespeare.”

The economic implications of an AI capable of writing convincingly human-sounding music for commercial use are obviously vast. And in 2020, the stock footage company Shutterstock acquired Amper. (Silverstein briefly took on a VP role but is no longer with Shutterstock; in our interview, he stresses that he’s speaking as an individual, not a representative of the company.) The synergy is obvious: Shutterstock is a marketplace for royalty-free imagery, while Amper is a tool for creating royalty-free original music; a single content producer can create the sort of professional-quality audio-visual presentation that would once have required an entire creative team.

But what about Silverstein’s former colleagues, the ones still toiling away arranging string sections? “Nobody’s ever asked me that before,” he says, and it takes me a moment to realize he’s being sarcastic. Silverstein’s answer is blunt: If you make your living in music, the job you have today will probably not exist in five to 10 years. But that’s not all bad, “because a job is really just a way to accomplish a task,” he adds—a rhetorical leap that offers little solace to working musicians who’ve already seen their incomes plummet over the past decade.

To Silverstein, AI is a means, not an end. While it may displace the creators of what he calls “functional” music, resourceful music-makers will find ways of incorporating the technology into their art. “When we think about creative AI, rather than something out of left field that the aliens brought to Earth, I’m thinking of it as the next logical step in a millennia-long technological evolution of creative tools,” he says. “It’s imperative that we as creative folk don’t think of ourselves as the last bastion of humanity fighting to stop the zombies from taking over the Earth,” but the vanguard tasked with exploring the possibilities of a once-unimaginable technology. Of course, that’s easy for him to say; he traded a job writing soundtracks for a successful startup.


Similar automatic-music solutions are everywhere these days, and while many of them are geared toward functional rather than artistic music, there’s no doubt that they are harbingers of things to come—if not the musical future, certainly one of them, and one that is getting closer every day. A growing field of streaming-music apps uses AI to create unique, endlessly regenerative soundtracks designed to aid concentration, sleep, meditation, and exercise. Many of these apps are steeped in the language of science, with a heavy dose of Silicon Valley solutionism.

Brain.fm touts a science-first approach that uses algorithms to create audio “that sounds different—and affects your brain differently—than any other music.” Its offerings are keyed to the productivity-optimizing holy trinity of focus, relaxation, and sleep, and its site boasts of patents on “technology to elicit strong neural phase locking.”

Endel, another app geared toward self-improvement through sound, is even more insistent in its message of better living through circuitry. “We’re not evolving fast enough,” proclaims a manifesto on the company’s website. “Our bodies and minds are not fit for the new world we live in. Information overload is destroying our psyche.” Like Brain.fm, Endel uses AI to generate evolving soundscapes geared to working, meditation, and sleep; Endel’s differentiator is that it factors in variables like weather, time of day, location, and even heart rate in order to tailor its audio streams to each user’s situation.

But on a recent weekday afternoon, when I Zoomed with co-founder and chief composer Dmitry Evgrafov, he was at home in Berlin, wearing not a lab coat but what looked like a very plush, comfortable robe. Where Endel’s marketing stresses hard numbers, Evgrafov came across more like a philosophical ambient musician. Which, in fact, he is.

Active since the early 2010s, Evgrafov has released over a dozen recordings of delicate ambient music on labels including the hallowed classical imprint Deutsche Grammophon. But around the time that Endel’s six co-founders were launching the company, in 2017, he found himself burning out on the whole idea of being an artist. Putting so much focus on the recording felt precious, he thought. He wanted a sense of accomplishment from his music, but it seemed unlikely that his contemplative instrumentals could have much impact on the world. “The only answer I found,” he told me, “is functional music. The catch is that you have to leave your ego at the door. This music has to put people to sleep.”

How Endel works is pretty simple, as far as AI goes: Evgrafov and his fellow sound designers create discrete stems, or musical parts—looping basslines, wafting synth pads, shimmering chimes—which Endel then remixes on the fly, ad infinitum. I spent a few days listening to it, and the results are absolutely unremarkable, although that is the point. Its muted keys and heartbeat-pulse rhythms are meant to be unobtrusive.
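In miniature, that remixing logic might look something like the following sketch; the stem names, the context signals, and the way they map to an “intensity” value are invented for illustration, not taken from Endel:

```python
import datetime
import random

# Invented stem pool; in Endel's case, these would be audio parts made by sound designers.
STEMS = {
    "bass": ["bass_slow", "bass_warm"],
    "pad": ["pad_airy", "pad_deep"],
    "texture": ["soft_chimes", "rain_hiss"],
}

def intensity(hour, heart_rate):
    """Fold context signals into a single 0-1 value: calmer at night and at rest."""
    time_factor = 0.3 if hour >= 22 or hour < 7 else 0.7
    hr_factor = min(heart_rate / 120, 1.0)
    return (time_factor + hr_factor) / 2

def next_mix(heart_rate=60):
    """Each call layers a fresh combination of stems, remixing on the fly, ad infinitum."""
    hour = datetime.datetime.now().hour
    layers = [random.choice(STEMS["bass"]), random.choice(STEMS["pad"])]
    if intensity(hour, heart_rate) > 0.5:  # add a brighter layer only when the listener is active
        layers.append(random.choice(STEMS["texture"]))
    return layers

for _ in range(3):
    print(next_mix(heart_rate=72))
```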

While Endel is helping listeners to tune in and chill out, the company has loftier goals, like enabling musicians to plug in and scale up, using AI to create infinitely regenerating artworks in their own signature style. As luck would have it, Grimes was already a fan of the app; having recently given birth to her first child, she designed a continuously evolving sleep soundscape of her sighing, cooing, and chirping “I love you” over gently undulating synthesizers—a lullaby for exhausted mothers as much as their restless progeny. Since that initial experiment, Endel has collaborated with musicians from across the spectrum: R&B visionary Miguel, minimal-techno pioneer Richie Hawtin, sad-soul singer-songwriter James Blake. “Even though you Endelize the thing, and it doesn’t sound like the original stems at all, on a very deep level, the core of the artist remains,” says Evgrafov.

When I ask Evgrafov what he thinks the next five years of AI will bring, he sounds surprisingly doubtful. “I’m a bit pessimistic,” he says. “Our priority is to shatter stereotypes about AI being low quality,” Evgrafov continues. “We’d like to do something that stands alongside the godfathers of ambient music, like Brian Eno and Laraaji.” Yet his fear is that the proliferation of AI-driven audio will lower listeners’ standards. “The problem is not that AI will take over and steal musicians’ bread and butter,” he says. “It’s that people will get used to the shitty sound of AI.” The hypothetical he describes sounds a lot like the one we’ve already seen at work on mood-based streaming playlists, which have filled up with no-name artists cranking out functional music designed to blend into the wallpaper—and bolster the platforms’ bottom lines.

Ironically, Endel may be contributing to that same process. As part of the company’s artist outreach program, they have begun releasing collaborative albums on streaming services—an hour-long snapshot, essentially, of what the artists’ endlessly recombinant compositions might sound like at any given moment. “We’re able to create albums with the push of a button,” he says, telling me about an upcoming long-player from an unannounced artist. “We just pushed a button and created this thing and called it a day, and now it’s being released.”

The birth of a new medium or the death of an old business model? The future of AI holds much promise, but it also gives plenty of cause for concern. To music lovers, not to mention artists, albums generated at the push of a button may not sound like a very inviting prospect, just as metabolically optimized mood music may sound like sonic Soylent, a music replacement that takes the pleasure out of listening. What for an artist is provocative identity play could, in the hands of a malicious actor, become a high-stakes deception.

Perhaps it’s worth recalling something that Brian Eno once said: Artificial intelligence is, in many ways, a misnomer; what computers really have is “artificial stupidity.” A given technology is only as useful as the tasks its creators assign to it. AI can only do what we teach it to do—whether that means copying Elvis, counterfeiting a head of state, or synthesizing a sound never before heard. The imaginative horizon, like the moral one, is up to us.


This week, we’re exploring how music and technology intersect, and what today’s trends and innovations might mean for the future.