Performance and Memory Improvements in the Android Runtime (ART) (Google I/O ’17)


DAVID SEHR: Hello, everyone. I’m David Sehr. I’m the lead of the Mountain
View Android Runtime Team. We’re the folks that build the
code that runs on your phone and loads your applications,
runs them, and manages their memory. So last year our team
presented the evolution of ART, and we described some
of the techniques we were using in the runtime,
and how you can really see the goodness that we bring. So last year we talked about
Profile Guided Compilation, how you can make your
app experience better by understanding how
programs run on your phone. We talked about how important
it is to keep memory usage down and to make memory
allocation blindingly fast. And we talked about
just-in-time compilation, and how we can get
great performance from your applications without
the annoying optimizing dialog. So you heard last year
about how great things are, but it was just the beginning. Now that your phone
knows more about how your applications
execute, we can focus on loading only the
parts that are important to you so apps load faster
and use less memory. And speaking of
less memory, we made your applications spend
less time reclaiming, and even less time
allocating memory. And of course with the
just-in-time compiler, we can focus on making your
applications run even faster. Nicolas and Aart will
be talking to you in just a moment about that. You’ll notice I mentioned
using less memory twice. So let’s start by
talking about that. I’m going to begin by talking
a little bit about how we rearrange your application
to use less memory. And then Mathieu is
going to tell you about how we improved our
heap sizes, reduced the jank, reduced the pause times, and
have even better allocation times. First, a little review of
how Android applications work on your Android phone. An application comes
in an APK file, which contains one or more dex files. These contain the instructions
for executing your application. ART uses the dex file to get
information about instructions to run, strings, class
relationships, and lots of other things. And the dex files are
loaded into memory by ART. Loading these files has both
a RAM and a startup time cost. So when you run,
for instance, Maps, one of our favorite
applications, your phone reads the
dex files into memory, and you can see it amounts to thousands of pages. Multiple megabytes' worth of information are loaded into memory. Now the dex files were
produced by a developer, and the typical developer
tools don’t really know about the use cases
that are going to be important to you on your phone. So the dex files may have unimportant things on the same pages as important things. For example, in the way you use Maps, you may only use some of the methods in the dex files, or some of the strings there may never be seen in your use case. As I told you
before, in Android N, ART introduced profile
based compilation. In Android O, we
have clever new uses for this, several clever
new uses, in fact. But one of the ones
I’m here to talk about is improving dex file locality. The idea is simple. Use the JIT profile information
to move the important things, the things that have been used by you in your usage of the application, closer together. And this is all done seamlessly
and transparently to the user as we compile applications
on the device. So after we run your
application once, at least, to collect the
profile information, we use information about
what parts of the application were important to you. We gather the important
methods together, and leave the unimportant
methods together. As you can see on the right there, a lot fewer pages are accessed, especially among the methods, when we use profile information.
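Conceptually, the layout pass amounts to a hot/cold partition of the dex file. Here is a minimal sketch with illustrative names; the real work happens inside the on-device compiler using the collected JIT profiles, not in Java:

```java
import java.util.*;

// Conceptual sketch of profile-guided dex layout (names are illustrative).
class DexLayout {
    // Order methods so the profiled "hot" methods share pages; cold ones follow.
    static List<String> layout(List<String> methods, Set<String> hotFromProfile) {
        List<String> hot = new ArrayList<>(), cold = new ArrayList<>();
        for (String m : methods) {
            (hotFromProfile.contains(m) ? hot : cold).add(m);
        }
        List<String> ordered = new ArrayList<>(hot);
        ordered.addAll(cold);   // hot methods now cluster on the first pages
        return ordered;
    }
}
```

With the hot methods packed together, a typical launch touches far fewer pages, which is where both the RAM and the launch-time savings come from.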
Now let's take a look at what type of RAM improvements we see after launching a few apps, a few of our favorites again. As you can see, there's a significant reduction, typically around a third, in the memory these applications use for dex files after layout. This RAM reduction also
improves launch time on devices that have slower
flash, because you’re pulling in fewer pages and
hence spending less time paging applications off of the flash. So that’s it for my part. Now over to Mathieu, who is
going to tell you about our New Garbage Collector. MATHIEU CHARTIER:
Thank you, David. Another thing that saves RAM– [APPLAUSE] –is the new concurrent and
copying Garbage Collector. Now Android has had a compacting
garbage collector that runs for background
apps since Android L. But unfortunately, since this
collector is non-concurrent, it meant that there was a GC
pause for the entire duration. So basically the problem
of having a GC pause for the entire
duration of the GC is that it can last
hundreds of milliseconds, and this would very
likely cause jank for foreground applications. In Android O, ART now
concurrently compacts the heaps of background and
foreground applications. This enables compaction of
many long-lived applications, such as the Android system
process and Google Play services. Since these processes are long-lived, they tend to accumulate a lot of fragmentation over time. Let's take a look
at the GC process. Unlike its predecessor,
the new GC is region based. Starting out, the GC
does a brief pause to identify which regions
we are going to evacuate. These regions are also
known as the source regions. Threads then resume
from the pause after having walked
their stacks. Next up is the largest phase
of the GC, the copying phase. In the copying phase,
reachable objects are copied from the source
regions to destination regions. Finally, in the
reclaim phase, the GC frees the RAM for
the source regions. Starting with the pause phase,
the pause is pretty small. During the pause,
one of the key steps is identifying which
regions to evacuate. The goal of
evacuation is to copy all of the reachable
objects out of regions with high fragmentation. After this is
accomplished, the GC can release the RAM
for these regions. In this example, the GC picks the middle two regions as the source regions, because these both have more than 20% fragmentation.
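The region choice can be pictured like this. This is a simplified sketch: the 20% threshold is just the figure used in the talk's example, and ART's real heuristics are more involved:

```java
import java.util.*;

// Illustrative sketch of choosing evacuation (source) regions by fragmentation.
class RegionChooser {
    static class Region {
        final int capacityBytes;
        final int liveBytes;
        Region(int cap, int live) { capacityBytes = cap; liveBytes = live; }
        double fragmentation() { return 1.0 - (double) liveBytes / capacityBytes; }
    }

    // Regions above the fragmentation threshold are worth evacuating: copying
    // their live objects out lets the GC reclaim the whole region.
    static List<Region> pickSourceRegions(List<Region> heap, double threshold) {
        List<Region> sources = new ArrayList<>();
        for (Region r : heap) {
            if (r.fragmentation() > threshold) sources.add(r);
        }
        return sources;
    }
}
```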
Next up is the copying phase. In the copying phase, the GC copies all reachable objects from the source regions to destination regions, with the goal that no object will reference the source regions after collection has completed. The GC also updates the references into these regions to point to the new addresses of the objects. Now since application threads are running concurrently with the garbage collector, the GC needs a way to make sure that these threads don't end up reading a field that points to a source region. To accomplish this,
the GC uses a technique called a read barrier. A read barrier is a small amount of work done for every field read. The read barrier prevents threads from ever seeing references to the source regions by intercepting the reads and copying the objects to the destination regions. In this example, there's a thread attempting to read foo.x, which is a reference to a bar object in the source region. The read barrier intercepts this read, copies the object to the destination region, and also returns the new address.
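In Java-like form, the barrier behaves roughly as follows. This is illustrative only: the real barrier is machine code the compiler emits around each field read, and the heap model here is a toy:

```java
import java.util.*;

// Simplified model of a concurrent-copying read barrier (illustrative only).
class ReadBarrierDemo {
    static class Obj {
        int value;
        boolean inSourceRegion;   // object lives in a region being evacuated
        Obj forwarded;            // destination copy, once evacuated
        Obj(int v, boolean src) { value = v; inSourceRegion = src; }
    }

    static final List<Obj> destinationRegion = new ArrayList<>();

    // Every field read goes through this check, so a thread can never observe
    // a reference into a source region.
    static Obj readBarrier(Obj ref) {
        if (ref != null && ref.inSourceRegion) {
            if (ref.forwarded == null) {           // copy on first access
                Obj copy = new Obj(ref.value, false);
                destinationRegion.add(copy);
                ref.forwarded = copy;
            }
            return ref.forwarded;                   // return the new address
        }
        return ref;
    }
}
```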
The copying process continues moving objects, performing read barriers as necessary, until there are no longer any references to the source regions. At this point, the GC
begins the reclaim phase. As you can see here,
there are no longer any references to
the source regions, so the Garbage Collector
can free all of the RAM for these regions. And what we are
left with is a heap that has much less
wasted RAM compared to when the collection began. Now you might be wondering
how much RAM can we save by compacting foreground
applications as well as background applications. Well, the average heap
size is 32% smaller, compared to the Android N
Concurrent Mark Sweep Garbage Collector. And RAM lost to GC overhead
is also a little bit smaller. One important factor about
concurrent garbage collectors is how long application
threads are suspended. Every millisecond that the
threads are paused or suspended is one less
millisecond to prepare the next frame for the UI. With the new GC,
the average pause time is 0.4 milliseconds,
compared to 2.5 milliseconds in Android N. And the 99%
worst case for large heaps is 2.6 milliseconds, instead of
11 milliseconds in Android N. Finally, always
compacting the heap has enabled ART to switch to a
new thread local bump pointer allocator. This allocator is simpler
and faster than the free list allocator it replaces. Overall, this means the
allocation is around 70% faster compared to Android N.
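The thread-local bump pointer scheme can be sketched in a few lines. This is illustrative: ART's allocator hands out thread-local buffers within real heap regions, not slots in a Java array:

```java
// Minimal sketch of a thread-local bump-pointer allocator (illustrative only).
class BumpAllocator {
    private final byte[] buffer;   // this thread's local allocation buffer
    private int top = 0;           // the bump pointer

    BumpAllocator(int capacity) { buffer = new byte[capacity]; }

    // Allocation is just a capacity check plus a pointer bump; there is no
    // free list to search. Because the GC always compacts, whole regions are
    // reclaimed at once and no per-object free operation is needed.
    int allocate(int size) {
        if (top + size > buffer.length) return -1;  // would need a new buffer/GC
        int offset = top;
        top += size;
        return offset;
    }
}
```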
And if you go further back, allocations are 18
times faster than KitKat on a device adjusted basis. And now off to Nicolas
for other optimizations. [APPLAUSE] NICOLAS GEOFFRAY:
Thank you, Mathieu. So besides managing
application memory, ART is also responsible for
executing application code. And for every Android
release, our team spends a significant amount of time optimizing wherever we can. So in this talk,
we’re going to share with you two major
accomplishments we’ve made for this release. The first is a
performance case study. And the second is a new
optimization framework for loops. Let me start with the
performance case study. And then Aart will later talk
about loop optimizations. So to validate all the
optimizations we do, we have our own benchmarks
where we look for regressions and improvements. But we also want to make
sure that optimizations do matter for real Android apps. So for the O release, we have focused our optimization efforts on one of our Android applications, Sheets. Over the years, the ART
team and the Sheets team have worked on benchmarking
the core computational logic in the Sheets app. Our Sheets benchmark
suite is composed of three types of benchmarks: benchmarks of low-level runtime capabilities, core Sheets formula evaluation, and benchmarks that shuffle the sheets themselves around. And we're very excited to share
with you that in Android O, we’ve made significant
improvements to ART’s runtime and compiler, up to the point
that we’re now running eight out of the nine
benchmarks in Sheets from two times to
three times faster, compared to our previous release, Nougat. You may remember a
similar graph to this one that we presented at the
Keynote two days ago. This is actually a snapshot
of our benchmark Money Train tool, where we track
over time the impact of each individual change we
do in ART on our benchmarks. For this graph, we see that
the Sheets aggregate score keeps on improving month after
month between the Android N and Android O releases. So let me explain to you the major improvements we've made for O. So one
major change is of course the New Garbage Collector
that Mathieu just introduced. Overall, it improved the
performance of Sheets by 40%, thanks to thread-local allocation buffers and faster collections. The next major
improvement we made is our Inliner, which
gave us another 20% boost. This was one of the easiest optimizations we've made, because we already had an inliner in N. And since in N we now do all the compilation in the background, and only compile the important things, we can avoid code bloat when we inline. With those two improvements in place in N, it was a lot easier for us to do more aggressive inlining in O. So for example, we now inline across dex files. We inline methods that could end up throwing. And we give a much larger inlining budget, so more methods get inlined. Now let's move to
an optimization that did require some work, and it helped us improve the Sheets score by 15%. This optimization is called code sinking. It essentially moves instructions next to the instructions that actually use them. So typically, instructions whose results are rarely needed get moved closer to where they are used. Instructions that are not in the regular flow of your application get moved into the exceptional cases. That's fairly abstract, so let me give you an example
to make it more concrete. Here's an excerpt of Sheets code. There's a method called createRange. It takes two integers and returns a new range. The createRange method calls a helper method called checkCondition, which checks the conditions required for creating a range. If those conditions are not met, it throws an error. Notice that checkCondition is a varargs method. This is important to notice, because it affects the code that is
being sent to our runtime. Here's how. So you start with this simple method. The Java compiler actually desugars the varargs call into allocating a new array, putting boxed versions of start and end into the array, and then passing this array to checkCondition. And this isn't Android specific. This is what the Java compiler does whenever it compiles such code to bytecode.
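Here is a hypothetical reconstruction of the kind of code being described; the actual Sheets sources and the exact checkCondition signature are not shown in the talk:

```java
import java.util.Arrays;

// Hypothetical reconstruction of the Sheets pattern described in the talk.
class Ranges {
    static class Range {
        final int start, end;
        Range(int start, int end) { this.start = start; this.end = end; }
    }

    // Varargs helper: throws if the condition does not hold.
    static void checkCondition(boolean condition, Object... args) {
        if (!condition) {
            throw new IllegalArgumentException("bad range: " + Arrays.toString(args));
        }
    }

    static Range createRange(int start, int end) {
        // javac desugars this call to roughly:
        //   Object[] args = { Integer.valueOf(start), Integer.valueOf(end) };
        //   checkCondition(start <= end, args);
        // i.e. one array allocation plus two boxing allocations on EVERY call,
        // even though args is only needed on the throwing path.
        checkCondition(start <= end, start, end);
        return new Range(start, end);
    }
}
```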
And these instructions are not completely free. Yes, we've made allocations extremely fast with the new garbage collector. But having to create this array and the boxed versions of start and end every time the method is executed is not really ideal. So here's how we avoid
executing it completely. First, we can easily inline the call to checkCondition. So internally, this is how the code is transformed in ART. Notice how args is now only used in the if branch. So what the compiler can do is move the allocation and the boxing closer to where args is being used, into the exceptional flow. At the end of this optimization, the compiler has made sure that only the required instructions will be executed in the normal flow.
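Expressed back at the source level, the result of inlining plus code sinking looks roughly like this; the real transformation happens on ART's compiler IR, not on Java source:

```java
import java.util.Arrays;

// Illustrative source-level view of the effect of inlining + code sinking.
class SinkDemo {
    static class Range {
        final int start, end;
        Range(int start, int end) { this.start = start; this.end = end; }
    }

    static Range createRange(int start, int end) {
        if (!(start <= end)) {
            // The array allocation and boxing have been sunk into this
            // rarely-taken throwing branch, so the normal path pays nothing.
            Object[] args = { Integer.valueOf(start), Integer.valueOf(end) };
            throw new IllegalArgumentException("condition failed: " + Arrays.toString(args));
        }
        return new Range(start, end);
    }
}
```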
The last optimization I want to mention here is Class Hierarchy Analysis. Class Hierarchy Analysis is a pretty common technique in object-oriented languages, where the runtime tries to infer classes and methods that can be treated as final even though they aren't marked as final. Having this information internally gives the compiler a lot of room for more optimizations. For example, it can do more inlining. And because Java has dynamic class loading, the runtime will bail out of these optimizations if a new class suddenly gets loaded that invalidates the inferred final classes and methods.
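A small example of the assumption CHA enables; the class names here are illustrative, not from the talk:

```java
// Sketch of what Class Hierarchy Analysis lets the runtime assume.
class ChaDemo {
    static class Shape {
        // Not declared final, but as long as no loaded class overrides area(),
        // CHA lets the runtime treat it as effectively final, so calls can be
        // devirtualized and inlined.
        double area() { return 1.0; }
    }

    static double totalArea(Shape[] shapes) {
        double sum = 0.0;
        for (Shape s : shapes) {
            sum += s.area();   // direct (inlinable) call under the CHA assumption
        }
        return sum;
        // If a subclass overriding area() is loaded later, the assumption is
        // invalidated and code compiled against it must be deoptimized.
    }
}
```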
This sums up the optimizations I wanted to talk to you about. For the record, we also made a bunch of other improvements in this release, such as faster type checks, faster access to compiled code for classes and strings, faster JNI transitions, more compiler intrinsics, and a lot more than we have space to list here. So in the interest
of time, I’m not going to go into details
on these optimizations. Instead I’m going
to hand it over to Aart, who is
going to tell you about a whole new set
of loop optimizations. [APPLAUSE] AART BIK: Thank you, Nicolas. So if you set aside virtual method overhead and general runtime overhead, programs tend to spend most of their time in loops. So besides all the good stuff that Nicolas already talked about, it can really pay off to optimize loops specifically. And I will discuss some
of these in this part. Optimizations really always
consist of an analysis part where you look at the program
and an optimization part where you actually perform
the transformations. And you can see here a list
of all the work we did for O. But I’ll only touch on a few of
those in the interest of time. But before I do that, let’s
first look at what loop optimizations can do for you. So here you see a graph that
puts Android O versus Android N performance, so
higher is better. And the color encoding
shows you specifically where loop optimizations
in general, the blue stuff, has helped, and the red stuff
where vectorization, which I'll talk about briefly, has helped. You see that benchmarks on the left, like Loop and Sum, get unrealistically high speedups. And that is really because we broke the benchmark: we were able to transform those loops into closed-form expressions that execute really fast. And although it's always nice to have that ability in your compiler, it's not very realistic. But as you go to the
right of the graph, you see more
realistic speed ups. LU and Linpack obtained 10%
improvement by vectorization. So central to all
optimizations is always Induction Variable Recognition. And that consists of
finding expressions that progress in a regular and
predictable way in your loop. And the most common example
is a linear induction. So here in this example, the basic loop counter i is a linear induction. But the expression j, shown in blue, is also a linear induction: every time around the loop, it increments by a loop-invariant constant.
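For example, here is a loop of the assumed shape, with both inductions visible:

```java
// Both i and j are linear inductions: each trip around the loop they
// increase by a loop-invariant constant (1 and 4, respectively).
class InductionDemo {
    static int lastOffset(int n) {
        int j = 0;
        for (int i = 0; i < n; i++) {
            j += 4;            // j is a linear induction in i
        }
        return j;              // recognizable in closed form: n > 0 ? 4 * n : 0
    }
}
```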
There are many more kinds of induction variables, and you see some of them depicted here in the graph. And detecting as many of them as possible is always good for the next phase, the actual optimizations. So let's look at what loop
optimizations can benefit from induction variables. So here you see a somewhat
synthetic example, where the compiler can easily
see that in the innermost loop, the increment to sum
actually is very predictable. Since the loop only iterates 100 times, the compiler can replace the sum by hoisting out a closed-form expression that just adds 100 at a time. After you've done that, you can hoist it again out of the next loop, if you take a little care of the fact
that the loop may not be taken when N is negative. So after that, the whole computation has been hoisted out of the loop, and the whole double loop can be eliminated.
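In source form, the transformation amounts to something like this; it is a synthetic reconstruction in the spirit of the slide, not the exact benchmark code:

```java
// Synthetic nested loop with a fully predictable innermost increment.
class ClosedFormDemo {
    static int sumLoop(int n) {
        int sum = 0;
        for (int i = 0; i < n; i++) {
            for (int j = 0; j < 100; j++) {
                sum += 1;      // inner loop always contributes exactly 100
            }
        }
        return sum;
    }

    // What the optimizer can reduce the double loop to: first "sum += 100" per
    // outer iteration, then a closed form, guarded for n <= 0 (loop not taken).
    static int sumClosedForm(int n) {
        return n > 0 ? n * 100 : 0;
    }
}
```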
So in this case, the whole loop is replaced by a closed form. And that's one of the examples that I showed at the beginning, where we broke the benchmark. And obviously your
code will probably not benefit this greatly
from optimizations, but having this
ability can really kick in, like after you
inline library code. Not to the same
degree, but it’s still nice to be able to optimize
your induction variables. Another example where induction
variables can help you is the Bounds Check Elimination. So when you access
an array, you always need to test to see whether
the subscripts can go out of bounds. And if they do, you
throw an exception. But if you know the exact range that the induction variables will take, you can often eliminate those tests completely. So here you see an example where both the accesses to a and b can never go out of bounds. The compiler can statically remove those tests, and the program, as a result, will execute a lot faster.
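For instance, in this illustrative loop (not the exact slide code), the induction range proves both accesses safe:

```java
// Example where induction analysis can prove both accesses stay in bounds.
class BceDemo {
    static void copyScaled(int[] a, int[] b) {
        int n = Math.min(a.length, b.length);
        // i stays in [0, n), and n <= a.length and n <= b.length, so neither
        // a[i] nor b[i] can throw; the compiler can drop both bounds checks.
        for (int i = 0; i < n; i++) {
            a[i] = 2 * b[i];
        }
    }
}
```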
Induction variables also tell you how often loops iterate. So if you know that it is only a few times, you can completely unroll the loop. And the advantage of unrolling
is that you completely remove the loop control overhead. It can reduce code size, because you no longer need the control instructions to iterate. And it often enables constant folding. So you see an example here where the blue part shows that the loop only iterates from 10 to 10, so it actually iterates just once. You can replace the whole loop with a single statement, and as a result, you can also constant-fold the multiplication, already at compile time.
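A loop of the assumed shape, with its unrolled and folded equivalent:

```java
// The trip count is known to be exactly one (i runs only over 10), so the loop
// can be fully unrolled and the multiplication constant-folded at compile time.
class UnrollDemo {
    static int before() {
        int x = 0;
        for (int i = 10; i < 11; i++) {
            x += i * 4;        // i is known to be 10 here
        }
        return x;
    }

    static int after() {
        return 40;             // single statement: 10 * 4 folded away
    }
}
```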
So the program will run a lot faster. Again, typical code won't optimize like this right away. But after inlining or other forms of specialization, these situations occur. And they can help
improve your performance. So the last loop optimization
I want to talk about is the ability of
Android O to take advantage of SIMD instructions. So SIMD instructions
are instructions that perform a single operation simultaneously on multiple data operands. So you see an example here
where one instruction actually does four additions
at the same time. And all our target platforms– ARM, Intel, MIPS– they
have such instructions. And if you take
advantage of them, it can greatly improve
the performance of certain classes of programs. So converting sequential
code– and we typically focus on loops, but it doesn’t have
to be restricted to that– into code that exploits
the SIMD instructions, that process is
called vectorization. And it’s illustrated
here, where on the left you see a loop that
iterates sequentially. And on the right, you
see the vector loop that actually goes by four. In order to do this
transformation, this vectorization, you need
some very specific analysis. You have to detect and
resolve memory conflicts between the loop iterations
to see if you can actually execute them in parallel. You may want to be a little bit
more strict about alignments, because SIMD instructions often
require stricter alignments on the memory. And you want to detect
idiomatic constructs– you’ll see an example
of that shortly– stuff that you cannot really
express with single operators in your sequential code. But if you detect
them, you can map them onto very efficient
SIMD instructions. So let’s just explore
this whole vectorization with a case study. Suppose that you want to
display a stream of pictures, like these paintings here on
the graph, on your display. And you want to
transition smoothly from one picture to the other. You don’t want to just show
one and then the other. You want to sort of have
a transition in between. And you want to
do that real time. You don’t know the
stream in advance. And of course you want to
do it in an efficient way. So one way to get
a smooth transition is actually performing
a cross-fade. So what you do is called
rounding halving add. So basically you take the average of two pictures, and you show that in
between the two pictures. So in the example here, you
see the shipwreck on the top and the quiet fire
on the bottom. And the picture that
you show in between will sort of be an average
of the two pictures, showing them at the same time. It forms a nice,
smooth transition. So how do you code
it? You don't want to go to a native solution. You don't want to use the NDK. You don't want to use GPU code. You just want to
stay at source level. So here you see a method that could do such a cross-fade. You have two incoming byte arrays holding the pixels of pictures x1 and x2. You zero-extend each byte, perform the averaging operation, and then compute the output x.
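The method could look like this; the talk's exact demo code is not reproduced, but this is the same rounding-halving-add idea:

```java
// Source-level cross-fade between two equal-sized byte images.
class CrossFade {
    static void crossFade(byte[] x1, byte[] x2, byte[] out) {
        for (int i = 0; i < out.length; i++) {
            int a = x1[i] & 0xFF;                 // zero-extend byte to int
            int b = x2[i] & 0xFF;
            out[i] = (byte) ((a + b + 1) >>> 1);  // rounding halving add
        }
    }
}
```

On ARM, this pattern maps naturally onto a single idiomatic SIMD instruction (NEON's unsigned rounding halving add) that processes 16 bytes at once, which is what makes the vectorized version so much shorter and faster.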
So keeping this at source level is, of course, a much easier way of expressing such an algorithm. It's easy to write, test, debug, maintain, et cetera. You don't have to use the NDK. You don't have to use GPU code. So Android N will actually
translate the loop that I just showed to the
sequential code on the left. And Android O will now
generate the SIMD code shown on the right. And I don’t expect you to
digest this whole assembly right away. But just focus on the parts
that have been highlighted. So first of all, in orange, you
see that the sequential loop goes by one. It does one byte
at a time before it goes to the next one. In contrast, the
SIMD code goes by 16. It does 16 bytes
at the same time. The other thing
that you notice is that the loop body is a lot
shorter for the SIMD code. And that’s a result of
the yellow highlighted instructions. On the left, there’s
a lot of instructions required to do
the zero extension and the rounding halving add. On the right, there’s a
single idiomatic instruction that can take care of that. So all this combined
makes sure that the loop runs about 10 times faster
if you just look at the loop. However, if you look at
the real application, where this loop is actually part
of a larger thing where there’s a handler requesting
new pictures, et cetera, you can see that
vectorization made the difference between rendering at a slow 20 frames per second without it, and rendering at 60 frames per second with it. And that's actually
as fast as you can go. You cannot render much faster. So that really shows
that vectorization has given you the
power to render at a very acceptable speed. And you actually have
CPU cycles to spare, because it doesn’t need
the full power to go to the 60 frames per second. And all this was made
possible by the vectorization. You don't need to write tedious NDK code anymore. You can just express the
loop at source level. [MUSIC PLAYING]
