Magenta: Music and Art Generation (TensorFlow Dev Summit 2017)

I’m Doug, and I’m a member of the Google Brain team. And I’m going to talk to you
about a project called Magenta that’s focusing on a really
hard challenge, which is generating music and
art using deep learning. So why are we here? We’re here to ask the question,
can we use deep learning and reinforcement learning
to generate compelling media. And treating that
generally, we’re talking about music, and
images, and video, and text, for example, of joke telling. And this is a hard problem. I’d missed the
keynote this morning, but someone in
the audience maybe still is working with
robots and painting. Is that person still here? All right. So you know this is hard, right? This is not an easy
thing just to generate– Furthermore, we want
to generate things that are interesting
and are surprising, and we want to measure success. So we want to understand
what people are really doing with this media. And I think it’s
an exciting time to be able to try a
project like this, where we have some of the
tools available to us that weren’t there before. And this is what we’re about. So one of the things
that we’re trying to do is do this out in the open
and build a community. So a community of
creative coders, and a community of artists
and musicians to work with us. So everything that
we’re doing in Magenta is being put into open source on
GitHub in the TensorFlow repo, or rather the Magenta
repo in TensorFlow. And we’re trying really
hard to have tools, for example, Jupyter notebooks,
that allow people to really quickly do some things, so
that we can draw them in and all work on this together. So this is an example of style
transfer that’s out there. But I think more
to the point and– I think the challenge that I
want people to leave with– if you’re going to think
about this problem, and I think it’s a problem we’re
thinking about as a challenge– is understanding
just how important it is for these models to
be embedded in the world and to get critical feedback. As a thought experiment,
you can imagine we train some
model that can just generate tens of thousands or
hundreds of thousands of songs that sound just like “the
Beatles”, or whatever. It’s not very
interesting to just think of pushing that button, right? You just keep generating
more and more material and it becomes overwhelming. I think more
interesting is thinking about making a
feedback loop where Magenta is being
used by musicians, being used by artists
in some interactive way. And also the stuff
that’s being generated is being evaluated by people. Another way to look at
this is with respect to what we can do as
engineers, and what we can do as artists and musicians. Art and technology
have always co-evolved. The film camera was
initially thought of as A, this device that
is not artistic and is there to just capture reality,
and B, something that is not very
interesting, compared to what painters are doing. And people came along and turned
it into an artistic instrument. And you would note that
when people turn things into artistic instruments,
they tend to try to break them or to use them in ways in
which they’re not intended. So on the slide
you’re looking at now on the left you’ll see Les Paul. He’s one of the inventors
of the electric guitar. Rickenbacker is another. And what they were trying to
do with the electric guitar was make it louder,
so they could compete with other instruments
in a concert setting. And they weren’t trying
to make it distort, right? And they weren’t trying to– distortion was a failure case. And you have people like Jimi
Hendrix, above, or St. Vincent, on the bottom, that come along
and they take this technology, and they do things with
it that are unexpected. And I think that’s a really
important part of the recipe. What makes, I think, Magenta
interesting with respect to this, if you
will, is to say, OK, what if we can make something
like an electric guitar, or like a drum machine,
or like a camera, but that it itself has some
machine learning intelligence. So it’s pushing
back a little bit. The idea’s that you might
actually have something that you, as an artist or a
musician, can push against, that listens to you,
that responds to you. And the question then is, can
we, the people on Magenta, probably more like Les
Paul or a Rickenbacker than like Jimi Hendrix–
so I do play guitar and I am left-handed,
so there’s a little bit of Jimi Hendrix in me. The basic idea is,
can we, as engineers, build some technology
and then create that loop of collaboration
with them, with artists, right? So let’s dive into a
little bit of research now. Let’s talk about a nice
problem, image inpainting. How many people in the room are
familiar with this basic idea? OK, quite a few. It’s interesting that you
chop a chunk out of an image and then you try
to fill it back in. And it’s a kind of generative
problem, except, in one way, it’s easier than generating
an image from whole cloth, because you have the
surrounding context. It’s also harder, because you
have to obey the surrounding context. This image is
actually research from [INAUDIBLE], not
from the [INAUDIBLE] team of Google Brain. But I thought it
was a great image. You see that the human
artist on the upper right has filled in the missing
spot quite nicely. Probably those of
you that have looked at the recent work in GANs,
generative adversarial networks, are aware
of this finding that a normal L2
loss will give rise to something kind of
blurry and boring and safe. And that you need something
like an adversarial network or reinforcement
learning to force models to generate away
from the mean and do something more interesting. So we had an idea,
led by Anna Huang, who was a Google Brain intern
working with me in the Magenta team, to say, can we compose
music using something similar. And this wasn’t a
one-to-one reuse of inpainting. The idea was that you would take
multi-voice music, in this case Bach, and you would
then remove voices. A voice is a sort
of one melodic line. And there are four
voices in this music. We use Bach because
the data was available, and because for some
reason AI music generation people obsess on Bach. So we did the same thing. And you can either remove
part of the score as a chunk, or, more interestingly, you
can remove a voice at a time. So you remove one voice,
and the other three voices are providing context. And then, using Gibbs sampling,
you sample in the missing bits. And then you remove another
voice from the original data, and you sample in the
missing bits and continue. And so you can do this
really interesting conditional
generation, conditioned by the original data. Or you can simply start
with something empty and start to build
voices one at a time. And it turns out the
sampling helps a lot. So the first thing I’d like
to do is just play this. Don’t pay attention to
the quality of the sound. It’s just MIDI. Pay attention to
the melody and see if you think there’s anything
there in what’s been generated. So this would be the
time to play the video. [MUSIC PLAYING] All right. So it’s not perfect, but
it’s quite interesting. And what’s interesting about
it– the graph that I’m showing you is from our paper. We compared the three different
sampling routines to real Bach. And we found that when listeners
were asked to rate the music, they actually preferred the
Gibbs sampling to Bach himself, thus proving that we’re
better than Bach, right? Game over. No, what it shows, I
think, is that these models capture something interesting. I think the way we put it
in listening to them is, of course, it’s not
as good as Bach. But it captures almost some
cartoonish aspects of Bach. It’s almost like
more Bach than Bach, because it’s simpler than
Bach, but it captures some salient Bach pieces. I’m also convinced,
if we had gone out to musical experts, not
standard side-by-sides, that we would have found
something different. That said, we were
happy to see that we had the kind of structure
where people didn’t just sit and go, god, this is horrible, right? So yeah, more Bach than
Bach, with very big quotes around either “more” or
“Bach”, you take your pick. OK. So now, with what
little time is left, let’s move on to some
image-based work. Also some work done in
Brain with an intern, Vincent Dumoulin. Some style transfer,
where we can do style transfer very quickly. We participated in
Magenta in this project and actually put
out on our GitHub some, I think, really nice
code for doing style transfer. This was a tweet yesterday
from Josh, here at Google. “Recommended. Magenta style transfer
code works out of the box with
their Docker image.” So I put that in as
a plug to maybe get you people to try things out. I also think they’re
really beautiful images. It’s a really nice use
of multi-style transfer, or pastiche. Here’s the obligatory video. I believe we’ll be at
a point where we can do this in real time on device. I also would point out that
because there’s an embedding space, you can move back
and forth between styles and mix them and match them. This is a Magenta
team member’s dog. Shout out to Peekaboo. However, I think the
main reason we’ve all seen a lot of style transfer– I want to point out that,
at least from the viewpoint of Magenta, the style transfer
work that’s being done is extremely preliminary. I mean, first, it’s
really interesting. I think it’s really cool
that we can do this, that we can pull
style from one network and then via convolution
apply it to another. But to give you an example
where the challenges lie– we just have this kind of
standard style transfer up above. And down below are four
portraits by Picasso. And what I would argue
from looking at portraits by an artist like this is
that when artists are playing with geometry, when they’re
playing with sort of the deeper geometry of the human face
or something else that has to do with deeper
structure, it’s clear that these convolutional
patch-based, pixel-based models are lost. They simply can’t do
these kinds of transfers. And it turns out that
structure, whether it’s structure in language
to understand meaning across paragraphs,
or structure in music, understanding phrases of music
that span longer time scales, or structure in art–
it all boils down to that, to this
understanding of meaning at deeper and deeper layers. And I think that’s
where the Grail is for us in terms of research. So why TensorFlow? Magenta is a TensorFlow project on GitHub. We are at And we’re dedicated
to being, if you will, the glue
between TensorFlow, and the artistic community
and the music community. It has some real
advantages for us. The first one is,
because it’s Python, we’re able to work
with everything. A lot of working
with music and art has to do with just
piping data around, being able to do
the simple things. It’s hard to see in
the blue over there, but I just grabbed some
of our HTTP archive. Dependencies– pretty_midi
and mido are two big MIDI I/O packages that we can use
immediately in Magenta. Also, in TensorFlow we have
really fast and flexible image, audio and video I/O,
which is crucial. A lot of our machine
learning for even music becomes I/O bound. If we’re learning
from say 16 kHz audio, it’s just a lot of
data to move around. We have TensorBoard. If you’re not using
TensorBoard, you want to be using TensorBoard. Here’s a view of
just some spectrograms from some audio generation
stuff we’re working on. It’s so nice to be able to,
while models are training, sample from them, look at,
for us, their spectra and listen to them,
play with them without having to stop
the model training while the models are working. And finally, there is a
developer community there, all of you. So the ability to
build out a community of coders, and of artists
and musicians that can work with us. What’s next? Well, first, you can follow our
blog at We use the blog to communicate
the research that we’re doing and to also highlight some
code that we’ve written. And also, please,
if you’re a coder– and I guess everybody
in this room is, this is a developer conference–
give it a shot at trying out some things with Magenta. We’re just sitting there at And we have a Docker install. We also have a pip install. It’s all pretty painless to use. So in closing, I want to
leave you with this quote. I won’t have time
to read it all but there’s a beautiful
thing at the end. “The excitement of grainy
film, of bleached-out black and white, is the excitement of
witnessing events too momentous for the medium assigned
to record them.” I think the exciting part
of Magenta is this idea of actually taking machine
learning and its generative capacities, and actually
connecting it to people and allowing people to
experience and work with this. It’s– the opportunity to create
something like a drum machine, but that actually has some
intelligence to it, I think, is compelling. I want to thank
you for your time. I just want to mention,
we will have a demo during drinks that lets
you play along on a piano keyboard with
a Magenta piano model. So that can be kind of fun. Stop by. I’ll have a beer in my hand and
we can look at that together. And then I’m going to introduce
our next speaker, Lilly, who’s going to talk
about TensorFlow and analyzing retinal images to
diagnose causes of blindness. Thank you very much
for your attention. Cheers. [APPLAUSE]
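
The voice-by-voice resampling described in the talk (remove one voice, let the other three voices provide context, sample the missing voice back in, then move on to another voice) can be sketched roughly as below. This is a minimal illustration under stated assumptions, not the Magenta model: the trained network is replaced by a toy rule that simply favors consonant intervals, and the names PITCHES, CONSONANT, toy_conditional_weights, resample_voice, and gibbs_generate are all hypothetical.

```python
import random

NUM_VOICES = 4                      # four melodic lines, as in the Bach data
PITCHES = list(range(48, 84))       # assumed MIDI pitch range (illustrative)
CONSONANT = {0, 3, 4, 5, 7, 8, 9}   # toy assumption: "good" intervals mod 12


def toy_conditional_weights(context_pitches):
    """Stand-in for a trained model: weight each candidate pitch by how many
    of the context pitches it forms a consonant interval with."""
    weights = []
    for pitch in PITCHES:
        weight = 1  # baseline so that every pitch stays possible
        for other in context_pitches:
            if abs(pitch - other) % 12 in CONSONANT:
                weight += 3
        weights.append(weight)
    return weights


def resample_voice(score, voice, rng):
    """Erase one voice and sample it again, one time step at a time,
    conditioned on whatever the other voices contain at that step."""
    for t in range(len(score[voice])):
        context = [score[v][t] for v in range(NUM_VOICES)
                   if v != voice and score[v][t] is not None]
        weights = toy_conditional_weights(context)
        score[voice][t] = rng.choices(PITCHES, weights=weights, k=1)[0]


def gibbs_generate(num_steps, sweeps, seed=0, initial=None):
    """Run several sweeps over the voices. With initial=None the score starts
    empty; otherwise generation is conditioned on a copy of the given score
    (and num_steps is ignored in favor of the given score's length)."""
    rng = random.Random(seed)
    if initial is None:
        score = [[None] * num_steps for _ in range(NUM_VOICES)]
    else:
        score = [list(voice) for voice in initial]
    for _ in range(sweeps):
        for voice in range(NUM_VOICES):
            resample_voice(score, voice, rng)
    return score


score = gibbs_generate(num_steps=8, sweeps=5)
for voice in score:
    print(voice)
```

Passing an existing four-voice score as `initial` corresponds to the conditioned case mentioned in the talk, while leaving it as None corresponds to starting from something empty and building the voices up. In a real system the toy weighting function would be replaced by the learned conditional distribution.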

