Google I/O 2009 – Mastering the Android Media Framework

Sparks: Good afternoon.
My name’s Dave Sparks. I’m on the Android team. And I’m the technical lead
for the multimedia framework. I’ve been working on Android
since October of 2007. But actually, technically,
I started before that, because I worked on the MIDI
engine that we’re using. So I kind of have
a long, vested interest in the project. So today, we have kind of
an ambitious title, called “Mastering
the Media Framework.” I think the reality is
that if you believe that– that we’re going
to do that in an hour, it’s probably pretty ambitious. And if you do believe that, I have a bridge
just north of here that you might be interested in. But I think we actually will
be able to cover a few kind of
interesting things. In thinking
about this topic, I wanted to cover stuff
that wasn’t really available in the SDK,
so we’re really going to del– delve into the lower parts
of the framework, the infrastructure
that basically everything’s built on. Kind of explain
some of the design philosophy. So… Oh, I guess I should have
put that up first. Here we go. So on the agenda, in the cutesy fashion
of the thing, we’re talking
about the architecture– Frank Lloyd Android. What’s new
in our Cupcake release, which just came out recently. And those of you
who have the phone, you’re running that
on your device today. And then a few common problems
that people run into when they’re writing
applications for the framework. And then
there probably will be a little bit of time
left over at the end for anybody who has questions. So moving along, we’ll start
with the architecture. So when we first started
designing the architecture, we had some goals in mind. One of the things
was to make development of applications that use media,
rich media applications, very easy to develop. And so that was one
of the key goals that we wanted to accomplish
in this. And I think you’ll see it
as we look at the framework. It’s really simple
to play audio, to display a video,
and things like that. One of the key things,
because this is a multi-tasking
operating system, is we have–
you could potentially have things happening
in the background. For example, you could have
a music player playing in the background. We need the ability
to share resources among all these applications, and so that’s one
of the key things, was to design an architecture that could easily
share resources. And the other thing is,
you know, paramount in Android
is the security model. And if you’ve looked over
the security stuff– I’m not sure we had a talk today
on security. But security is really important
to us. And so we needed a way
to be able to sandbox parts of the application
that are– that are particularly
vulnerable, and I think you’ll see
as we look at the– the framework,
that it’s designed to isolate parts of the system that are particularly vulnerable
to hacking. And then, you know,
providing a way to add features in the future that are backwards compatible. So that’s the–
the room for future growth. So here’s kind of
a 30,000-foot view of the way
the media framework works. So on the left side,
you’ll notice that there is the application. And the red line–
red dashed line there– is denoting
the process boundary. So applications run
in one process. And the media server
actually runs in its own process
that’s actually booted up– brought up during boot time. And so the codecs and the file parsers
and the network stack and everything that has to do
with playing media is actually sitting
in a separate process. And then underneath that
are the hardware abstractions for the audio and video paths. So Surface Flinger is the abstraction for video and graphics. And Audio Flinger's
the abstraction for audio. So looking at a typical
media function, there's a lot of stuff– because of this inter-process
communication that’s going on, there’s a lot of things
that are involved in moving a call
down the stack. So I wanted to give you
an idea– for those of you who’ve looked
at the source code, it’s sometimes hard to follow,
you know, how is a call– A question that comes up
quite frequently is how does a function call,
like, you know, prepare or make its way all the way down
to the framework and into the–
the media engine? So this is kind of
a top-level view of what a stack might look like. At the very top
is the Dalvik VM proxy. So that’s the Java object
that you’re actually talking to. So, for example,
for a media player, there’s a media player object. If you look at
the media player definition, it’s a pretty–
I mean, there’s not a lot of code
in Java. It’s pretty simple. And basically,
it’s a proxy for– in this case, actually,
the native proxy, which it’s underneath,
and then eventually, the actual implementation. So from that,
we go through JNI, which is
the Java Native Interface. And that is just
a little shim layer that’s static bindings to an actual
MediaPlayer object. So when you create
a MediaPlayer in Java, what you’re actually doing
is making a call through this JNI layer to instantiate a C++ object. That’s actually
the MediaPlayer. And there’s a reference to that
that’s held in the Java object. And then some tricky stuff– weak references
to garbage collection and stuff like that,
which is a little bit too deep for the talk today. Like I said, you’re not going
to master the framework today, but at least get an idea
of what’s there. So in the native proxy, this is actually
a proxy object for the service. So there is a little bit of code
in the native code. You know, a little bit of logic
in the native code. But primarily,
most of the implementation is actually sitting down
in this media server process. So the native proxy is actually
the C++ object that talks through
this binder interface. The reason we have
a native proxy instead of going directly
through JNI is a lot of the other pieces
of the framework does. So we wanted to be able
to provide access to native applications
in the future to use MediaPlayer objects. So it makes it
relatively easy, because that’s something
you’d probably want to do with games
and things like that that are kind of more natural
to write in native code. We wanted to provide
the ability to do that. So that’s why the native proxy
sits there and then the Java layer
just sits on top of that. So the binder proxy
and the binder native piece– Binder is our abstraction
for inter-process communication. Binder, basically,
what it does, is it marshals objects across
this process boundary through a special kernel driver. And through that, we can
do things like move data, move file descriptors
that are duped across processes so that they can be accessed
by different processes. And we can also do something
which–we can share memory between processes. And this is a really efficient
way of moving data back and forth
between the application and the media server. And this is used extensively in Audio Flinger
and Surface Flinger. So the binder proxy is basically
the marshalling code on the applications side. And the binder native code
is the marshalling code for the server side
of the process. And if you’re looking
at all the pieces of the framework–
they start with mediaplayer.java, for example– there's an android_media_MediaPlayer.cpp, which is the JNI piece. There's a mediaplayer.cpp, which is
the native proxy object. Then there’s an
imediaplayer.cpp, which is actually a–
a binder proxy and the binder native code
in one chunk. So you actually see
the marshalling code for both pieces
in that one file. And one is called
bpmediaplayer.cpp– or, sorry,
BP MediaPlayer object. And a BN MediaPlayer object. So when you’re looking
at that code, you can see the piece
that’s on the native side– the server side
and the proxy. And then the final piece
of the puzzle is the actual implementation
itself. So in the case
of the media server– sorry, the MediaPlayer–
there’s a MediaPlayer service which instantiates
a MediaPlayer object in the service that’s,
you know, proxied in the application by this
other MediaPlayer object. That’s basically–
each one of the calls goes through this stack. Now, because the stack is,
you know, fairly lightweight in terms of we don’t make
a lot of calls through it, we can afford a little bit
of overhead here. So there’s a bit of code
that you go through to get to this place,
but once you’ve started playing, and you’ll see this
later in the slides, you don’t have to do
a lot of calls to maintain
the application playing. So this is actually kind of
a top-level diagram of what the media server
process looks like. So I’ve got this media player
service. And it can instantiate a number
of different players. So on the left-hand side,
you’ll see, bottom, we have OpenCORE, Vorbis,
and MIDI. And these are three different
media player types. So going from the simplest one,
which is the Vorbis player– Vorbis basically just plays
Ogg Vorbis files, which is a–
we’ll get into the specifics of the codec, but it’s
a psycho-acoustic codec that’s open sourced. We use this for a lot
of our internal sounds, because it’s very lightweight. It’s pretty efficient. And so we use that
for our ringtones and for our application sounds. The MIDI player,
a little more complex. But basically, it’s just
another instantiation of a media player. These all share
a common interface, so if you look at
the MediaPlayer.java interface, there’s almost, you know,
one-for-one correspondence between what you see there
and what’s actually happening in the players themselves. And then the final one
is OpenCORE. So anything that isn’t
an Ogg file or a MIDI file is routed over
to OpenCORE. And OpenCORE is basically the–
the bulk of the framework. It consists of all
of the major codecs, like, you know,
MP3 and AAC and AMR and the video codecs,
H.263 and H.264 (AVC). So any file that's not
specifically one of those two ends up going to OpenCORE
to be played. Now, this provides
some extensibility. The media player service
is smart enough to sort of recognize
these file types. And we have a media scanner
that runs at boot time– that goes out,
looks at the files, figures out what they are. And so we can actually,
you know, replace or add new player types by just
instantiating a new type of player. In fact, there are
some projects out there where they’ve replaced OpenCORE
with GStreamer or other media frameworks. And we’re talking
to some other– some different types
of player applications that might have new codecs
and new file types, and that’s one way of doing it. The other way of doing it
is you– if you wanted
to add a new file type, you could actually implement it
inside of OpenCORE. And then on the right-hand side, we have
the media recorder service. Prior to–
in the 1.0, 1.1 releases, that was basically just
an audio record path. In Cupcake, we’ve added
video recording. So this is now integrated
with a camera service. And so the media recorder–
again, it’s sort of a proxy. There’s a proxy, um– it uses the same sort
of type of thing, where there’s a media recorder–
media recorder object in the Java layer. And there’s
a media recorder service that actually does
the recording. And for the actual
authoring engine, we’re using OpenCORE. And it has the–
the encoder side. So we’ve talked about
the decoders, and the encoders would be
H.263, H.264, and also AVC. Sorry, and MPEG-4 SP. And then,
the audio codecs. So all those sit
inside of OpenCORE. And then the camera service
both operates in conjunction
with the media recorder and also independently
for still images. So if your application wants
to take a still image, you instantiate
a camera object, which again is just a proxy
for this camera service. The camera surface takes care
of handling preview for you, so again, we wanted to limit
the amount of traffic between the application
and the hardware. So this actually provides a way
for the preview frames to go directly out
to the display. Your application doesn’t have
to worry about it, it just happens.
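A minimal sketch of that still-image path from application code (not from the talk; the class and callback handling here are illustrative only):

```java
// Rough sketch (not from the talk) of still capture through the camera service.
import android.hardware.Camera;
import android.view.SurfaceHolder;

public class StillCaptureSketch {
    private Camera mCamera;

    public void startPreview(SurfaceHolder holder) throws Exception {
        mCamera = Camera.open();            // proxy object for the camera service
        mCamera.setPreviewDisplay(holder);  // preview frames go straight to this surface
        mCamera.startPreview();
    }

    public void capture() {
        mCamera.takePicture(null, null, new Camera.PictureCallback() {
            public void onPictureTaken(byte[] jpegData, Camera camera) {
                // jpegData is the compressed image; write it out somewhere useful.
                camera.startPreview();      // preview stops after a capture; restart it
            }
        });
    }

    public void shutdown() {
        if (mCamera != null) {
            mCamera.stopPreview();
            mCamera.release();              // hand the camera back to the system
            mCamera = null;
        }
    }
}
```

The preview frames never pass through the application; only the compressed JPEG comes back through the callback.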
And then in the case where the media recorder is actually doing
video record, we take those frames
into the OpenCORE and it does the encoding there. So kind of looking at what
a media playback session would look like. The application provides
three main pieces of data. It’s going to provide
the source URI. The “where is this file
coming from.” It’ll either come from
a local file that’s on the– you know, on the SD card. It could come from a resource that’s in the application,
the .apk, or it could come
from a network stream. And so the application provides
that information. It provides a surface
that basically, at the application level,
called a surface view. This, at the binder level,
is an ISurface interface, which is an abstraction
for the–the view that you see. And then it also provides
the audio types, so that the hardware knows
where to route the audio. So once those
have been established, the media server basically
takes care of everything from that point on. So you–once you have called
the prepare function and the start function, the frames–video frames,
audio frames, whatever, are– they’re going to be decoded
inside the media server process. And they get output directly
to either Audio Flinger or Surface Flinger,
depending on whether it’s an audio stream
or a video stream. And all the synchronization is
handled for you automatically. Again, it’s a very low overhead. There’s no data that’s flowing
back up to the application at this point–it’s all
happening inside the hardware.
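As a rough sketch of what that session looks like on the application side (not from the talk; it assumes a SurfaceHolder from a SurfaceView in your layout and a placeholder URI):

```java
// Minimal playback sketch (not from the talk): the three inputs are the data
// source, the surface for video, and the audio stream type.
import android.media.AudioManager;
import android.media.MediaPlayer;
import android.view.SurfaceHolder;

public class PlaybackSketch {
    private MediaPlayer mPlayer;

    public void start(SurfaceHolder holder, String uri) throws Exception {
        mPlayer = new MediaPlayer();
        mPlayer.setDataSource(uri);                             // file path, content URI, or http/rtsp URL
        mPlayer.setDisplay(holder);                             // where decoded video frames should go
        mPlayer.setAudioStreamType(AudioManager.STREAM_MUSIC);  // how the audio should be routed
        mPlayer.prepare();   // parsers and codecs get set up in the media server process
        mPlayer.start();     // from here on, frames flow to Audio Flinger / Surface Flinger
    }

    public void stop() {
        if (mPlayer != null) {
            mPlayer.release();   // free the server-side resources
            mPlayer = null;
        }
    }
}
```

Everything after start() (decoding, A/V synchronization, output to Audio Flinger and Surface Flinger) happens over in the media server process.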
One other reason for doing that, which we mentioned earlier, is that in many cases–for example,
the G1 and the Sapphire, the device that you guys
got today– those devices actually have
hardware codecs. And so we’re able
to take advantage of a DSP that’s in the device
to accelerate. In the case of,
for example, H.264, we can accelerate
the decoded video in there and offload some of that
from the main processor. And that frees the processor
up to do other things, either, you know,
doing sync in the background, or just all sorts of things
that it might need– you might need
those cycles for. So again, that’s–
all that is happening inside the media server process. We don’t want to give
applications direct access to the hardware,
so it’s another good reason for putting this inside
the media server process. So in the media recorder side, we have a similar sort of thing. It’s a little more complex. The application
can either, in the case of– it can actually create
its own camera and then pass that
to the media server or it can let the media server
create a camera for it. And then the frames
from the camera go directly into the encoders. It again is going to provide
a surface for the preview, so as you’re taking your video,
the preview frames are going directly to the–
to the display surface so you can see
what you’re recording. And then you can select
an audio source. Right now that’s just
the microphone input, but in the future,
it could be other sources. You know, potentially
you could be recording from, you know, TV or some–
some other hardware device that’s on the device. And then–so once
you’ve established that, the camera service
will then start feeding frames through the camera service
up to the media server and then they’re pushed out
to the Surface Flinger and they’re also pushed out
into OpenCORE for encoding. And then there’s
a file authoring piece that actually takes the frames
from audio and video, boxes them together,
and writes them out to a file.
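A sketch of that recording session through the Cupcake-era MediaRecorder API (not from the talk; the output path and the particular format choices are just examples):

```java
// Sketch of a Cupcake-era video recording session (not from the talk); the
// output path is a placeholder and the recorder opens the camera itself.
import android.media.MediaRecorder;
import android.view.Surface;

public class RecordingSketch {
    private MediaRecorder mRecorder;

    public void start(Surface previewSurface, String outputPath) throws Exception {
        mRecorder = new MediaRecorder();
        mRecorder.setAudioSource(MediaRecorder.AudioSource.MIC);         // microphone input
        mRecorder.setVideoSource(MediaRecorder.VideoSource.CAMERA);      // frames from the camera service
        mRecorder.setOutputFormat(MediaRecorder.OutputFormat.THREE_GPP); // 3GPP container
        mRecorder.setAudioEncoder(MediaRecorder.AudioEncoder.AMR_NB);    // the software audio encoder
        mRecorder.setVideoEncoder(MediaRecorder.VideoEncoder.H263);      // H.263 video
        mRecorder.setOutputFile(outputPath);
        mRecorder.setPreviewDisplay(previewSurface); // preview goes straight to the display
        mRecorder.prepare();
        mRecorder.start();   // encoding and file authoring happen in the media server
    }

    public void stop() {
        mRecorder.stop();
        mRecorder.release();
        mRecorder = null;
    }
}
```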
So, to get into a little more detail about the codecs: we have three different
video codecs. So one of the questions
that comes a lot– comes up a lot
from the forums is what kind of codecs
are available, what should they be used for,
and things like that. So just kind of a little bit
of history about the different codecs. So H.263 is a codec from–
I think it was– came out about 1996,
was when it was standardized. It was originally intended
for video conferencing, so it’s really
low bit-rate stuff. You know, designed to go over
an ISDN line or something like that. So it’s actually worked out
pretty well for mobile devices, and a lot of mobile devices
support H.263. The encoder is pretty simple. The decoder
is pretty simple. So it’s a lightweight kind
of codec for an embedded device. It’s part of the 3GPP standard. So it’s adopted by a number
of different manufacturers. And it’s actually used
by a number of existing video sites–
of websites– for their encode. For example, YouTube–
if you go to, like, the m.youtube.com, typically you’ll end up
at an H.263 stream. Because it’s supported
on most mobile devices. So MPEG-4 SP
was originally designed as a replacement
for MPEG-1 and MPEG-2. MPEG-1, MPEG-2–fairly early
standardized codecs. They wanted to do
something better. Again, it has a very simple
encoder model, similar to H.263. There’s just single frame
references. And there’s some question
about whether it’s actually a better codec
or not than H.263, even though they’re– they came out
very close together. It’s missing
the deblocking filter, so– I didn’t mention that before. H.263 has a deblocking filter. If you’ve ever looked
at video, it typically comes out
in, like, 8×8 pixel blocks. And you get kind of
a blockiness. So there’s an in-loop
deblocking filter in H.263, which basically smooths
some of those edges out. The MPEG-4 SP,
in its basic profile, is missing that. So it–the quality of MPEG-4, some people don’t think
it’s quite as good, even though it came out
at roughly the same time. Then the final codec
we support is a fairly recent development. I think it’s a 2003,
or something like that. The H.264 AVC codec came out. Compression’s much better. It includes the ability to have
multiple reference frames, although
on our current platforms, we don’t actually support that. But theoretically, you could get
better compression in the main–
what’s called the main profile. We support base profile. It has this mandatory
in-loop deblocking filter that I mentioned before, which gets rid of the blockiness
in the frames. One of the really nice things is it has a number
of different profiles. And so different devices
support different levels of–of profiles. It specifies things like
frame sizes, bit rates, the–the types
of advanced features that it has to support. And there’s a number
of optional features in there. And basically,
each of those levels and profiles defines
what’s in those codecs. It’s actually used in a pretty
wide range of things. Everything from digital cinema,
now, HDTV broadcasts, and we’re starting to see it
on mobile devices like the G1. When you do a–if you’re using
the device itself today, and you do a YouTube playback, you’re actually–
on Wi-Fi, you’re actually getting
a H.264 stream, which is why
it’s so much better quality. On the downside, it’s a lot
more complex than H.263 because it has these
advanced features in it. So it takes a lot more CPU. And in the case of the G1,
for example, that particular hardware, some of the acceleration
happens in the DSP, but there’s still some stuff
that has to go on the application processor. On the audio side,
MP3 is pretty– everybody’s
pretty familiar with. It uses what’s called
a psycho-acoustic model, which is why we get better
compression than a typical, you know, straight
compression algorithm. So psycho-acoustic means you
look for things in the– that are hidden
within the audio. There are certain sounds that are going to be masked
by other sounds. And so the psycho-acoustic model will try to pick out
those things, get rid of them,
and you get better– much better compression there. You get approximately
10:1 compression over a straight linear PCM
at 128kbits per second, which is pretty reasonable,
especially for a mobile device. And then if you want to,
you know, be a purist, most people figure
you get full sonic transparency at about 192kbits per second. So that’s where most people
won’t be able to hear the difference between
the original and the compressed version. For a more advanced codec, AAC came out
sometime after MP3. It’s built on
the same basic principles, but it has
much better compression ratios. You get sonic transparency
at roughly 128kbits per second. So, you know,
much, much better compression. And another mark
that people use is 128kbits per second– MP3 is roughly equivalent
to 96kbits per second AAC. We also find it’s–
it’s used, commonly used, in MPEG-4 streams. So if you have an MPEG-4
audio–video stream, you’re likely to find
an AAC codec with it. In the case of our high-quality
YouTube streams, they’re typically
a 96 kbit-per-second AAC format. And then finally, Ogg Vorbis,
which I’d mentioned earlier, we’re using
for a lot of our sounds. Again, it’s another
psycho-acoustic model. It’s an open source codec, so it doesn’t have
any patent, you know, issues
in terms of licensing– whereas any of the other codecs,
if you’re selling a device, you need to go, you know, get the appropriate
patent licenses. Or I probably shouldn’t
say that, because I’m not a lawyer, but you should probably
see your lawyer. From our perspective,
it’s very low overhead. It doesn’t bring in all
of the OpenCORE framework, ’cause it’s just
an audio codec. So it uses–
it’s very lightweight in terms of the amount
of memory usage it uses and also the amount
of code space that it has to load in
in order to play a file. So that’s why we use it
for things like ringtones and other things that need
fairly low latency and we know we’re gonna
use it a lot. The other thing is that,
unlike MP3– MP3 doesn’t have a native way
of specifying a seamless loop. For those of you
who aren’t audio guy– audio experts, “seamless loop”
basically means you can play the whole thing
as one seamless, no clips, no pops loop
to play over and over again. A typical application for that
would be a ringtone, where you want it
to continue playing the same sound
over and over again without–without
the pops and clicks. MP3 doesn’t have a way to
specify that accurately enough that you can actually do that
without having some sort of gap. There are people that have added
things in the ID3 tags to get around that,
but there isn’t any standardized way to do it. Ogg does it–
actually, both Ogg and AAC have conventions for specifying
a seamless loop. So that’s another reason
why we use Ogg is that we can get
that nice seamless loop. So if you’re doing anything
in a game application where you want to get,
you know, some sort of– a typical thing would be like
an ambient sound that’s playing over and over
in the background. You know, the factory sound
or, you know, some eerie swamp noises
or whatever. That’s the way to do it
is to use the Ogg file. You’ll get pretty good
compression. It’s pretty low overhead
for decoding it. And you can get those loops
that won't click.
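A tiny sketch of such an ambient loop (not from the talk; the caller passes in a raw resource ID, for example R.raw.ambient, pointing at an Ogg file placed under res/raw):

```java
// Tiny sketch of a gapless ambient loop (not from the talk).
import android.content.Context;
import android.media.MediaPlayer;

public class AmbientLoopSketch {
    private MediaPlayer mLoop;

    public void start(Context context, int resId) {
        mLoop = MediaPlayer.create(context, resId);  // create() prepares the player for you
        mLoop.setLooping(true);                      // Ogg can loop without an audible gap
        mLoop.start();
    }

    public void stop() {
        if (mLoop != null) {
            mLoop.release();
            mLoop = null;
        }
    }
}
```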
And then finally, the last codecs we're going to talk about
in terms of audio are the AMR codecs. AMR is a speech codec, so it doesn’t get
the full bandwidth. If you ever try to encode one
with music on it, it will sound pretty crappy. That’s because it–
it wants to kind of focus in on one central tone. That’s how it gets
its high compression rate. But at the same time,
it throws away a lot of audio. So it’s typically used
for video codecs. And in fact,
GSM basically is based on AMR-type codecs. It’s–the input is, for the AMR narrow band,
is 8 kilohertz. So going back to Nyquist,
that basically means your highest frequency
you can represent is just shy of 4 kilohertz. And the output bit-rates
are, you know, anywhere from just under
5kbits per second up to 12.2. AMR wide band is a little bit
better quality. It’s got a 16 kilohertz input,
and slightly higher bandwidths. But again,
it’s a speech codec primarily, and so you’re not going to get
great audio out of it. We do use these,
because in the package, the OpenCORE package,
the AMR narrow band codec is the only audio encoder– native audio encoder
we have in software. So if your hardware platform
doesn’t have an encoder, that’s kind of
the fallback codec. And in fact, if you use
the audio recorder application like MMS,
and attach an audio, this is the codec
you’re going to get. If you do a video record
today, that’s the codec
you’re going to get. We’re expecting that future
hardware platforms will provide, you know,
native encoders for AAC. It’s a little too heavy
to do AAC on the application processor while you’re doing video record
and everything else. So we really need
the acceleration in order to do it. AMR is specified
in 3GPP streams. So most phones
that will decode an H.263 will also decode the AMR. So it’s a fairly compatible
format. If you look at the–the other
phones that are out there that support, you know,
video playback, they typically
will support AMR as well. So we’ve talked about codecs. Both audio and video codecs. The other piece of it,
when you’re doing a stream, is what’s the container format? And so I’m going to talk
a little bit about that. So 3GPP is the stream
that’s defined by the 3GPP organization. These are phones that support
that standard and are going to support
these types of files. 3GPP is actually
an MPEG-4 file format. But it’s–very, very
restricted set of– of things that
you can put into that file, designed for compatibility
with these embedded devices. So you really want to use
a H.263 video codec for–for broad compatibility
across a number of phones. You probably want to use
a low bit rate for the video, typically like 192kbits
per second. And you also want to use
the AMR narrow band codec. For MPEG-4 streams,
which we also support, they’re typically
higher quality. They typically
are going to use either an H.264 or a higher–
bigger size H.263 format. Usually they use
an AAC codec. And then
on our particular devices, the G1 and the device
that you just received today– I’m not even sure
what we’re calling it– I– is capable of
up to 500kbits per second on the video side and 96kbits per second on the audio side. So a total of about
600kbits per second, sustained. If you do your encoding well, you’re going to actually
get more than that out of it. We’ve actually been able
to do better than 1 megabit per second,
but you have to be– have a really good encoder. If it gets “burst-y,”
it will interfere with the performance
of the codec. So one question that comes up
a lot on the forums is what container
should I use if I’m either authoring
or if I’m doing video recording? So for authoring
for our Android device, if you want
the best quality– the most bang for your bits,
so to speak– you want to use
an MPEG-4 codec– er, container file
with an H.264 encoded stream. It needs to be,
for these devices today, a baseline profile roughly,
as I was saying before, at 500kbits per second HVGA
or smaller, and AAC codec
up to 96kbits per second. That will get you
a pretty high quality– that’s basically
the screen resolution. So it looks really good on–
on the display. For other– you’re going to create content
on an Android device, so you have a video record
application, for example. And you want to be able
to send that via MMS or some other email or whatever
to another phone, you probably want to stick
to a 3GPP format, because not all phones
will support an MPEG-4 stream, particularly
the advanced codecs. So in that case
we recommend… I’m getting
ahead of myself here. So in that case we recommend
using the QCIF format. That’s 192kbits per second. Now, if you’re
creating content on the Android device itself, intended for another
Android device, we have an H.263 encoder. We don’t have an H.264 encoder, so you’re restricted to H.263. And for the same reason
I’ve discussed before, we won’t have an AAC encoder, so you’re going to use
an AMR narrow band encoder, at least on the current range
of devices. So those are kind of
the critical things in terms of inter-operability
with other devices. And then the other thing is–
a question that comes up a lot is if I want to stream
to an Android device, what do I need to do
to make that work? The thing where most people
fail on that is the “moov” atom,
which is the index of frames that tells–basically tells
the organization of the file, needs to precede the data–
the movie data atom. And…the… Most applications
will not do that naturally. I mean, it’s more–
it’s easier for a programmer to write something that builds
that index afterwards. So you have–
you typically have to give it a specific–
you know, turn something on, depending on what
the application is, or if you’re using FFmpeg, you have to give it
a command line option that tell it to–
to put that atom at the beginning
instead of the end. So… For–we just recently came out
with what we’ve been calling the Cupcake release,
or the 1.5 release. That’s the release
that’s on the phones you just received today. Some of the new features
we added in the media framework. We talked about
video recording before. We added
an AudioTrack interface and an AudioRecord interface
in Java, which allows direct access
to raw audio. And we added the JET
interactive MIDI engine. These are kind of the–
the highlights in the media framework area. So kind of digging
into the specifics here… AudioTrack–
we’ve had a lot of requests for getting
direct access to audio. And…so what AudioTrack does
is allow you to write a raw stream
from Java directly to the Audio Flinger
mixer engine. Audio Flinger
is a software mixer engine that abstracts the hardware
interface for you. So it could actually–
it could mix multiple streams from different applications. To give you an example, you could be listening
to an MP3 file while the phone rings. And the ringtone will play while the MP3 file
is still playing. Or a game could have
multiple sound effects that are all playing
at the same time. And the mixer engine takes care
of that automatically for you. You don’t have to write
a special mixer engine. It’s in–
built into the device. Potentially could be hardware
accelerated in the future. And it also allows you
to… It does sample rate conversion
for you. So you can mix multiple streams
at different sample rates. You can modify the pitch
and so on and so forth. So what AudioTrack does,
it gives you direct access to that mixer engine. So you can take
a raw Java stream, you know, 16-bit PCM samples,
for example, and you can–
you can send that out to the mixer engine. Have it do the sample rate
conversion for you. Do volume control for you. It does–
has anti-zipper volume filters so–if anybody’s ever played
with audio before, if you change the volume, it changes the volume
in discrete steps so you don’t get
the pops or clicks or what we typically refer to
as zipper noise. And that’s all done
with… Either you can do writes
on a thread in Java, or you can use the callback
engine to fill the buffer. Similarly, AudioRecord gives you
direct access to the microphone. So in the same sort of way, you could pull up a stream
from the microphone. You specify the sample rate
you want it in. And, you know,
with the combination of the two of those, you can now take a stream
from the microphone, do some processing on it,
and now put it back out via the… the AudioTrack interface too,
that mixer engine. And that mixer engine will go
wherever audio is routed. So, for example,
a question that comes up a lot is, well, what if
they have a Bluetooth device? Well, that’s actually
handled for you automatically. There’s nothing you have to do
as an application programmer. If there’s a Bluetooth device
paired that supports A2DP, then that audio
is going to go directly to the…to the A2DP headset. Your…whether it’s a headset
or even your car or whatever. And then we’ve got
this call mack– callback mechanism
so you can actually just set up a buffer
and just keep– when you get a callback,
you fill it. You know, if you’re doing
a ping-pong buffer, where you have half of it
being filled and the other half is actually
being output to the device. And there’s also
a static buffer mode where you give it a–
for example, a sound effect
that you want to play and it only does a single copy. And then it just
automatically mixes it, so each time
you trigger the sound, it will mix it for you, and you don’t have to do
additional memory copies. So those are kind of
the big highlights in terms of the–
the audio pieces of it. Then another new piece
that’s actually been in there for a while, but we’ve finally
implemented the Java support, is the JET Interactive
MIDI Engine. So JET is– it’s based upon
the EAS MIDI engine. And what it does is allow you
to pre-author some content that is very interactive. So what you do
is you, if you’re an author,
you’re going to create content in a–
your favorite authoring tool. Digital authoring
workstation tool. It has a VST plugin,
so that you can, you know, basically write your–
your game code– your–your audio
in the tool and hear it back played as it
would be played on the device. You can take and have
multiple tracks that are synchronized
and mute them and unmute them synchronous with the segment. So basically, your piece
is going to be divided up into a bunch of little segments. And just as an example, I might have an A section,
like the intro, and maybe I have a verse
and I have a chorus. And I can interactively
get those to place one after another. So, for example,
if I have a game that, um– it has kind of levels,
I might start with a certain background noise,
and perhaps, you know, my character’s taking damage. So I bring in
some little element that heightens the tension
in the game and this
is all done seamlessly. And it’s very small content,
because it’s MIDI. And then you can actually have
little flourishes that play in synchronization
with it– with the music
that’s going on. So some–for example,
let’s say you, you know, you take out an enemy. There’s a little trumpet sound
or whatever. A sound effect
that’s synchronized with the rest of the–
the audio that’s playing. Now all this is done under–
under program control. In addition to that,
you also have the ability to have callbacks
that are synchronized. So a good example would be
a Guitar Hero-type game where you have music
playing in the background. What you really want to do
is have the player do something in synchronization
with the rhythm of the sound. So you can get a callback
in your Java application that tells you when
a particular event occurred. So you could create
these tracks of–of events that you’ve been–
you know, measured– did they hit
before or after? And we actually have
a sample application in the SDK that shows you
how to do this. It’s a–I think a, like,
two- or three-level game that with–
complete with graphics and sound and everything
to show you how to do it. The code–the code itself
is written in native code that’s sitting on top
of the EAS engine, so again, in keeping
with our philosophy of trying to minimize the– the overhead
from the application, this is all happening
in background. You don’t have to do anything
to keep it going other than
keep feeding it segments. So periodically,
you’re going to wake up and say, “Oh, well, here’s the next
segment of audio to play,” and then it will play
automatically for whatever the length
of that segment is.
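A rough sketch of driving JET from Java (not from the talk; the file path, segment numbers, and flag values are placeholders, and the JetBoy sample in the SDK is the complete reference):

```java
// Rough JetPlayer sketch (not from the talk). The .jet path, segment numbers,
// and flag values are placeholders; arguments follow the android.media.JetPlayer
// documentation.
import android.media.JetPlayer;

public class JetSketch {
    private JetPlayer mJet;

    public void start(String jetFilePath) {
        mJet = JetPlayer.getJetPlayer();   // a single shared engine instance
        mJet.clearQueue();
        mJet.loadJetFile(jetFilePath);     // content authored with the JET creator tools
        // Queue a couple of segments; playback moves seamlessly from one to the next.
        // Arguments: segment, DLS library (-1 for none), repeat count, transpose,
        // mute flags, and a user ID that comes back in event callbacks.
        mJet.queueJetSegment(0, -1, 0, 0, 0, (byte) 0);
        mJet.queueJetSegment(1, -1, 0, 0, 0, (byte) 1);
        mJet.play();
    }

    public void stop() {
        if (mJet != null) {
            mJet.release();
            mJet = null;
        }
    }
}
```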
It's all open source. Not only is the code itself open source, but the tools are open sourced, including the VST plugin. So if you are ambitious and you want to do something
interesting with it, it’s all sitting out there
for you to play with. I think it’s out there now. If not, it will be shortly. And so those are
the big highlights of the– the MIDI–
the MIDI engine. Oh, I forgot.
One more thing. The DLS support–
so one of the critiques of general MIDI,
or MIDI in general, is the quality
of the instruments. And admittedly, what we ship
with the device is pretty small. We try to keep
the code size down. But what the DLS support
does with JET is allow you
to load your own samples. So you can either
author them yourself or you can go
to a content provider and author these things. So if you want
a high-quality piano or you want, you know,
a particular drum set, you’re going for a techno sound
or whatever, you can actually, you know, put these things
inside the game, use them as a resource, load them in and–
and your game will have a unique flavor
that you don’t get from the general MIDI set. So… I wanted to talk about
a few common problems that people run into. Start with the first one here. This one I see a lot. And that is the behavior
of the application for the volume control is–
is inconsistent. So, volume control
on Android devices is an overloaded function. And as you can see
from here, if you’re in a call,
what the volume control does is adjust the volume
that you’re hearing from the other end
of the phone. If you’re not in a call,
if it’s ringing, pressing the volume button
mutes the–the ringer. Oh, panic. I’m in a, you know,
middle of a presentation and my phone goes off. So that’s how you mute it. If we can detect
that a media track is active, then we’ll adjust the volume
of whatever is playing. But otherwise,
it adjusts the ringtone volume. The issue here is that if your–
if your game is– or your application is just
sporadically making sounds, like, you know,
you just have little UI elements or you play a sound effect
periodically, you can only adjust the volume
of the application during that short period
that the sound is playing. It’s because we don’t
actually know that you’re going to make sound
until that particular instant. So if you want
to make it work correctly, there’s an–
there’s an API you need to call. It’s in–it’s part
of the activity package. It’s called
setVolumeControlStream. So you can see a little chunk
of code here. In your onCreate, you’re going to call this
setVolumeControlStream and tell it what kind of stream
you’re going to play. In the case of most applications
that are in the foreground, that are playing audio, you probably want the music stream, STREAM_MUSIC, which is kind of our generic placeholder for audio that's in the foreground. If you're a ringtone application–you know, you're playing ringtones–you would select a different type.
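The slide code isn't captured in the transcript, but the call being described is roughly this (a sketch assuming a foreground activity that plays game or music audio):

```java
// Minimal sketch of the volume-control call being described (not the slide code).
import android.app.Activity;
import android.media.AudioManager;
import android.os.Bundle;

public class GameActivity extends Activity {
    @Override
    protected void onCreate(Bundle savedInstanceState) {
        super.onCreate(savedInstanceState);
        // Tell the framework which stream the hardware volume keys should
        // adjust by default while this activity is in the foreground.
        setVolumeControlStream(AudioManager.STREAM_MUSIC);
    }
}
```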
But this basically tells the activity manager, when you press the volume button, if none of those… previous things are–
in other words, if we’re not in call,
if it’s not ringing, and if there’s–
if– if none of these other things
are happening, then that’s the default behavior
of the volume control. Without that,
you’re probably going to get pretty inconsistent behavior
and frustrated users. That’s probably
the number one problem I see with applications
in the marketplace today is they’re not using that. Another common one I see
on the–in a– on the forums
is people saying, “How do I–how do I play
a file from my APK? “I just want to have
an audio file that I ship with the–
with the package,” and they get this wrong
for whatever reason. I think we have
some code out there from a long time ago
that looks like this. And so this doesn’t work. This is the correct way
to do it. So there’s this
AssetFileDescriptor. I talked a little bit earlier
about the binder object and how we pass things through, so we’re going to pass
the file descriptor, which is a pointer
to your resource, through the binder
to the… I don’t know
how that period got in there. It should be setDataSource. So it’s setDataSource,
takes a FileDescriptor, StartOffset,
and a Length, and so what this will do is,
using a resource ID, it will find, you know,
open it, find the offset
where that raw– that resource starts. And it will, you know,
pass– set those values
so that we can tell the media player
where to find it, and the media player
will then play that from that offset
in the FileDescriptor.
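A sketch of that resource-playback path (not from the talk; the resource ID would be something like R.raw.sound, passed in by the caller and stored uncompressed as described below):

```java
// Sketch of playing a sound shipped inside the .apk (not from the talk).
import android.content.Context;
import android.content.res.AssetFileDescriptor;
import android.media.MediaPlayer;

public class ResourcePlayerSketch {
    public MediaPlayer play(Context context, int resId) throws Exception {
        AssetFileDescriptor afd = context.getResources().openRawResourceFd(resId);
        MediaPlayer mp = new MediaPlayer();
        // Pass the file descriptor plus the offset and length of the resource
        // inside the .apk, so the media server knows exactly where to read.
        mp.setDataSource(afd.getFileDescriptor(), afd.getStartOffset(), afd.getLength());
        afd.close();
        mp.prepare();
        mp.start();
        return mp;
    }
}
```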
I had another thought there. Oh, yeah. So–yeah. Raw resources, make sure
that when you put your file in, you’re putting it in
as a raw resource, so it doesn’t get compressed. We don’t compress things
like MP3 files and so on. They have to be
in the raw directory. Another common one
I see on the forums is people running out
of MediaPlayers. And this is kind of
an absurd example, but, you know,
just to give you a point. There is a limited amount
of resources. This is an embedded device. A lot of people who are
moving over from the desktop don’t realize that they’re
working with something that’s, you know,
equivalent to a desktop system from maybe ten years ago. So don’t do this. If you’re going to use
MediaPlayers, try to recycle them. So our solution is,
you know, there are resources
that are actually allocated when you create a MediaPlayer. It’s allocating memory,
it may be loading codecs. It may–there may actually
be a hardware codec that’s been instantiated
that you’re preventing the rest of the system
from using. So whenever
you’re done with them, make sure you release them. So you’re going to call release, you set null
on the MediaPlayer object. Or you can call reset and set–
do a new setDataSource, which, you know, is basically
just recycling your MediaPlayer. And try to keep it to, you know,
two or three maximum. ‘Cause you are sharing with
other applications, hopefully. And so if you get a little piggy
with your MediaPlayer resources, somebody else can’t get them. And also, if you go
into the background– so, and you’re in–
on pause, you definitely want to release
all of your MediaPlayers so that other applications
can get access to them.
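A sketch of that recycle-and-release pattern (not from the talk; the activity and field names are only illustrative):

```java
// Sketch of recycling one MediaPlayer and releasing it in onPause (not from the talk).
import android.app.Activity;
import android.media.MediaPlayer;

public class MusicActivity extends Activity {
    private MediaPlayer mPlayer;

    void playClip(String path) throws Exception {
        if (mPlayer == null) {
            mPlayer = new MediaPlayer();
        } else {
            mPlayer.reset();              // recycle the same player for the next clip
        }
        mPlayer.setDataSource(path);
        mPlayer.prepare();
        mPlayer.start();
    }

    @Override
    protected void onPause() {
        super.onPause();
        if (mPlayer != null) {
            mPlayer.release();            // free codecs and memory for other applications
            mPlayer = null;
        }
    }
}
```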
Another big one that happens a lot is the CPU…
“My CPU is saturated.” And you look at the logs
and you see this. You know, CPU is–
is– can’t remember
what the message is now. But it’s pretty clear
that the CPU is unhappy. And this is kind of
the typical thing, is that you’re trying to play
too many different compressed streams
at a time. Codecs take
a lot of CPU resources, especially ones that are running
on software. So, you know, a typical, say,
MP3 decode of a high-quality MP3
might take 20% of the CPU. You add up two or three
of those things, and you’re talking about
some serious CPU resources. And then you wonder why your,
you know, frame rate on your game is pretty bad. Well, that’s why. So we actually have
a solution for this problem. It’s called SoundPool. Now, SoundPool had some problems
in the 1.0, 1.1 release. We fixed those problems
in Cupcake. It’s actually pretty useful. So what it allows you to do
is take resources that are encoded in MP3 or AAC
or Ogg Vorbis, whatever
your preferred audio format is. It decodes them and loads them
into memory so they’re ready to play, and then uses
the AudioTrack interface to play them out
through the mixer engine just like
we were talking about before. And so you can get
much lower overhead. You know, some are in the order
of about 5% per stream as compared to these, you know,
20% or 30%. Depending on what
the audio codec is. So it gives you
the same sort of flexibility. You can modify–in fact,
it actually gives you a little more flexibility,
because you can set the rates. It can–
will manage streams for you. So if you want to limit
the number of streams that are playing,
you tell it upfront, “I want,” let’s say,
“eight streams maximum.” If you exceed that,
it will automatically, based on the priority, you know,
select the least priority, get rid of that one,
and start the new sound. So it’s kind of managing
resources for you. And then you can do things
like pan in real time. You can change the pitch. So if you want to get
a Doppler effect or something like that,
this is the way to do it.
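A SoundPool sketch along those lines (not from the talk; the stream count, priority, and rate values are illustrative, and resId would be something like R.raw.laser):

```java
// SoundPool sketch (not from the talk): decode once, then trigger cheaply.
import android.content.Context;
import android.media.AudioManager;
import android.media.SoundPool;

public class GameSoundsSketch {
    private SoundPool mPool;
    private int mLaserId;

    public void load(Context context, int resId) {
        // Up to 8 simultaneous streams; the lowest-priority one is stolen
        // automatically when the limit is exceeded.
        mPool = new SoundPool(8, AudioManager.STREAM_MUSIC, 0);
        mLaserId = mPool.load(context, resId, 1);   // decoded into memory once, up front
    }

    public void fire(float rate) {
        // left volume, right volume, priority, loop count (0 = no loop), playback rate
        int streamId = mPool.play(mLaserId, 1.0f, 1.0f, 1, 0, rate);
        // streamId can later be passed to setVolume(), setRate(), pause(), or stop().
    }

    public void release() {
        mPool.release();
        mPool = null;
    }
}
```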
So that's pretty much it. We have about ten minutes left for questions, if anybody wants to go up
to a microphone. [applause] Thank you. man: Hi, thank you.
That was a great talk. Is setting the
streamed music, so you can respond
to the volume control– do you have to do that every
time you create a new activity, or is it sticky
for the life of the app? Sparks: It’s sticky– you’re going to call it
in your onCreate function. man: But in
every single activity? Sparks: Yeah, yeah.
man: Okay. man: Hi, my first question
is that currently, Android is using OpenCORE for the multimedia framework. And my question is, does Google have any plan to support any other middleware, such as GStreamer
or anything else? Sparks: Not at this time. We don’t have any plans
to support anything else. man: Okay. What’s the strategy of Google for supporting other pioneers providing this
multimedia middleware? Sparks: Well, so,
because of the flexibility of the MediaPlayer service,
you could easily add another code–another media
framework engine in there and replace OpenCORE. man: Okay. So my second question
is that, um– [coughs] that currently– Google, you mentioned
implementing the MediaPlayer and the recording service. Is there any plan to support
the mobile TV and other, such as video conference,
in frameworks? Sparks: We’re–we’re looking
at video conferencing. Digital TV is probably
a little bit farther out. We kind of need a platform
to do the development on. So we’ll be working
with partners. Basically, if there’s
a partner that’s interested in something that isn’t there, we will–we can
work with you on it. man: Okay, thank you. man: Does the media framework
support RTSP control? Sparks: Yes. So RTSP support is not as good
as we’d like it to be. It’s getting better
with every release. And we’re expecting
to make some more strides in the next release
after this. But Cupcake is slightly better. man: And that’s specified by… in the URL, by specifying
the RTSP? Sparks: Yeah. Right.
man: Okay. And you mentioned, like,
500 kilobits per second being the maximum, or– What if you tried
to play something that is larger than that? Sparks: Well, the codec
may fall behind. What will typically happen
is that you’ll get a– if you’re using our MovieView,
you’ll get an error message that says that
it can’t keep up. man: Mm-hmm. So it will try,
but it will– It might fall behind.
Sparks: Yeah. man: Thank you. man: My question is ask– how about–
how much flexibility we have to control the camera services? For example,
can I control the frame rate, and the color tunings,
and et cetera? Sparks: Yeah, some of that’s
going to depend on the– on the device. We’re still kind of struggling with some
of the device-specific things, but in the case of the camera, there’s a setParameters
interface. And there’s access,
depending on the device, to some of those parameters. The way you know that is,
you do a setParameter. Let’s say you ask
for a certain frame rate. You–you do a getParameter. You find out if it accepted
your frame rate or not. Because there’s a number
of parameters.
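A sketch of that set-then-read-back negotiation (not from the talk; whether a given parameter is honored depends entirely on the device and its driver):

```java
// Sketch of asking the camera for a setting and reading back what it accepted.
import android.hardware.Camera;

public class CameraTuningSketch {
    public int requestFrameRate(Camera camera, int desiredFps) {
        Camera.Parameters params = camera.getParameters();
        params.setPreviewFrameRate(desiredFps);   // ask for the rate you want
        camera.setParameters(params);

        // Read the parameters back to see what the driver actually accepted.
        int actualFps = camera.getParameters().getPreviewFrameRate();
        return actualFps;   // may differ from desiredFps; fall back or adapt if so
    }
}
```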
man: Yeah, but also, in the– for example, the low light. So you want–not only you want
to slow the frame rate, but also you want to increase
the integration time. Sparks: Right. man: So in the–
sometimes you want, even in the low light, but you want
to slow the frame rate. But you still want to keep
the normal integration time. So how you–do you have those
kind of flexibility to control? Sparks: Well,
so that’s going to depend on whether the hardware
supports it or not. If the hardware supports it,
then there should be a parameter for that. One of the things
we’ve done is– for hardware dev–
manufacturers that have specific things
that they want to support, that aren’t like, standard– they can add a prefix to their
parameter key value pairs. So that will, you know–
it’s unique to that device. And we’re certainly open
to manufacturers suggesting, you know, new–
new standard parameters. And we’re starting to adopt
more of those. So, for example, like,
white balance is in there. Scene modes, things like that
are all part of it. man: Okay.
Sparks: Yeah. man: I was wondering
what kind of native code hooks the audio framework has? I’m working on an app
that basically would involve, like, actively doing
a fast Fourier transform, you know, on however many
samples you can get at a time. And so, it seems like
for now– or in the Java, for example, it’s mostly built
toward recording audio and– and doing things with that. What sort of active control
do you have over the device? Sparks: So officially,
we don’t support native API access
to audio yet. The reason for that is, we, you know–
any API we publish, we’re going to have to live with
for a long whi– a long time. We’re still playing
with APIs, trying to, you know, get–
make them better. And so the audio APIs have changed a little bit
in Cupcake. They’re going to change again
in the next two releases. At that point,
we’ll probably be ready to start providing
native access. What you can do, very shortly we’ll have
a native SDK, which will give you access
to libc and libm. You can get access
to the audio from the Java–
official Java APIs, do your processing
in native code, and then feed it back,
and you’ll be able to do that without having to do MEMcopies. man: And so basically,
that would just be accessing the buffer
that the audio writes to. And also, just a very tiny
question about the buffer. Does it– does it loop back
when you record the audio? Or is it–does it record in,
essentially, like, blocks? Do you record an entire buffer
once in a row, or does it sort of go back to
the start and then keep going? Sparks: You can either have it
cycle through a static buffer, or you can just pass in
new buffers each time, depending on how you want
to use it. man: Okay. Thanks. man: Let’s say
you have a game where you want to generate
a sound instantly on a button press or a touch. Sparks: “Instantly”
is a relative term. man: As instantly
as you can get. Would you recommend,
then, the JET MIDI stuff, or an Ogg, or what? Sparks: You–you’re probably
going to get best results with SoundPool, because SoundPool’s
really aimed at that. What SoundPool
doesn’t give you– and we don’t have an API
for it, we get a lot of requests
for it, so, you know, it’s on my list
of things to do– is synchronization. So if you’re trying to do
a rhythm game where you–you want to be able
to have very precise control of–of, say, a drum track– you–there isn’t a way
to do that today. But if you’re just trying
to do– man: Like gunfire kind of thing. Sparks: Gunfire?
SoundPool is perfect for that. That’s–that’s what it was
intended for. man: Yeah, if I use
the audio mixer, can I control the volume of the different sources
differently? Sparks: Yes.
man: Okay. Sparks: So, SoundPool
has a volume control for each of
its channels that you– basically, when you trigger
a SoundPool sound, you get an ID back. And you can use that
to control that sound. If you’re using
the AudioTrack interface, there’s a volume control
interface on it. man: My question is, for the testing sites,
how– does Google have a plan
to release a certain application or testing program
to verify MediaPlayer and other media middleware
like this? Sparks: Right. man: 3D and everything else? Sparks:
So we haven’t announced what we’re doing there yet. I can’t talk about it. But it’s definitely something
we’re thinking about. man: Okay. Another question is about
the concurrency there for the mobile devices. The resource is very limited. So for example,
the service you mentioned. The memory
is very limited. So how do we handle any– or maybe you have
any experience– handle the 3D surface and also the multimedia surface and put together
a raw atom surface or something like that? Sparks: So when you say “3D,”
you’re talking about– man: Like OpenGL,
because you do the overlay and you use the overlay
and you– Sparks: Yeah, I’m–
I’m not that up on it. I’m not a graphics guy. I’m really an audio guy. But I actually manage the team
that does the 3D stuff. So I’m kind of familiar
with it. There’s definitely
limited texture memory that’s available–that’s
probably the most critical thing that we’re running into–
but obviously, you know, that– we’re going to figure out
how to share that. And so– I don’t have a good answer
for you, but we’re aware of the problem. man: Okay.
Yeah. Just one more question
is do you have any plan to move OpenGL 2.0
for the Android? Sparks: Yes. If you– man: Do you have a time frame? Sparks: Yeah,
if you’re following the master source tree
right now, you’ll start to see changes
come out for– we’re–we’re marrying 2D
and 3D space. So the 2D framework will be
running as an OpenGL context, which will allow you, then,
to, you know– ES 2.0 context. So you’ll be able to share
between the 3D app and the 2D app. Currently,
if you have a 3D app, it takes over the frame buffer and nothing else can run. You’ll actually be able
to run 3D inside the 2D framework. man: Okay, thank you. man: I think this question
is sort of related. I was wondering how would you
take, like, the– the surface that you use
to play back video and use it as a texture,
like in OpenGL? Sparks: That’s coming, yeah. Yeah, that–so you actually
would be able to map that texture onto a 3D– man: Is there any way
you can do that today with the current APIs? Sparks: Nope. Yeah, there’s no access
to the– to the video after it leaves
the media server. man: And no time frame as far as
when there’ll be some type of communication
as far as how to about doing that
in your applications? Sparks: Well, it’s–
so it’s in our– what we call
our Eclair release. So that’s master today. man: Okay.
Okay, thank you. Sparks: I think–
are we out of time? woman: [indistinct] Sparks: Okay. woman: Hi, do you have
any performance metrics as to what are
the performance numbers with the certain playback
of audio and video to share, or any memory footprints
available that we can look up, maybe? Sparks: Not today. It’s actually part of some
of the work we’re doing that somebody was asking about
earlier. That I can’t talk about yet.
But yeah. There’s definitely some–
some plans to do metrics and to have baselines
that you can depend on. woman: And then the second
question that I have is that do you have
any additional formats that are lined up
or are in the roadmap? Like VC-1 and additional
audio formats? Sparks: No, not–
not officially, no. woman: Okay. woman: Hi, this is back
to the SoundPool question. Is it possible
to calculate latency or at least know, like, when the song actually went
to the sound card so I could at least know
when it actually did play– if there’s any sort of callback
or anything? Sparks: So you can get
a playback complete callback that tells you
when it left the player engine. There’s some additional latency
in the hardware that we…we don’t have
complete visibility into, but it’s reported back through the audio track
interface, theoretically,
if it’s done correctly. So at
the MediaPlayer level, no. At the AudioTrack level, yes. If that’s…makes any sense. woman: Okay, so I can at least
get that, even if I can’t actually
calculate latency for every single call? Sparks: Right, right. woman: Okay. Thank you. Sparks: Uh-huh. man: Yeah, this is a question about the samples processing. You partially touched
upon that. But in your architecture
diagram, where do you think
the sound processing effect really has to be placed? For example, it could be
an equalizer or different kind
of audio post processing that needs to be done. Because in the current
Cupcake version, 1.5, I do not see a placeholder or any implementation
of that sort. Sparks: So one of the things
we’re in the process of doing is we’re–
we’re looking at OpenAL– Have I got that right?
OpenAL ES? As the, um–possibly the–
an abstraction for that. But it definitely
is something you want to do on an application-by-application
basis. For example,
you don’t want to have effects running on, you know,
a notification if… The–you–you wouldn’t want
the application in the foreground
and forcing something on some other application
that’s running in background. So that’s kind of the direction
we’re headed with that. man: What’s the current
recommendation? How do you want the developers
to address? Sparks: Well, the–
since there isn’t any way, there’s no recommendation. I mean,
if you were doing native code, it’s kind of up to you. But our recommendation would be
if you’re, you know, doing some special version
of the code, you would probably want
to insert it at the application level
and not sitting at the bottom
of the Audio Flinger stack. man: Okay, thanks. woman: Is it better to get
the system service once and share it across activities
in an application, or let each activity
fetch the service? Sparks: I mean, there’s
a certain amount of overhead, ’cause it’s a binder call
to do it. So if you know
you’re going to use it, I would just keep it around. I mean, it’s just a–
a Java object reference. So it’s pretty cheap
to hold around. man: Is there any way
to listen to music on a mono Bluetooth? Sparks: Ah, on a SCO? Yeah, no.
[chuckles] The reason
we haven’t done that is the audio quality
is really pretty poor. I mean, it’s designed for–
for call audio. So the experience isn’t going
to be very good. Theoretically, you know,
it’s possible. We just don’t think
it’s a good idea. [chuckling] man: If you want to record
for a long period of time, you know, like a half-hour, can you frequency scale
the processor or put it to sleep, or… Sparks: It–well,
that happens automatically. I mean, it’s–
it’s actually going to sleep and waking up all the time. So it’s just depending
on what’s– man: But if you’re doing, like,
a raw 8k sample rate, how big a buffer can you have,
and then will it sleep in– while that buffer’s filling? Sparks: So the–the size
of those buffers is defined
in the media recorder service. And I think they’re… I want to say they’re like 2–
2k at… whatever the output rate is. So they’re pretty good size. I mean, it’s like
a half a second of audio. So the processor,
theoretically, would be asleep
for quite some time. man: So is that handled
by the codec, or is it handled by–
I mean, the DSP on a codec? Or is it handled by– Sparks: So the…
the process is going to wake up
when there’s audio available. It’s going to… you know, route it over
to the AMR encoder. It’s going to do its thing. Spit out a bunch of bits
that’ll go to the file composer to be written out. And then theoretically, it’s gonna go back
to sleep again. man: No, I mean
on the recorder. If you’re recording the audio. If you’re off the microphone. Sparks: I’m sorry? man: If you’re recording
raw audio off the microphone. Sparks: Yeah. Oh, oh, are you talking about
using the AudioTrack or AudioRecord interface? man: The AudioRecord interface.
ADPCM. Sparks: Yeah, that’s… So it’s pretty much
the same thing. I mean, if you define
your buffer size large enough, whatever that buffer size is,
that’s the buffer size it’s going to use
at the lower level. So it’ll be asleep
for that amount of time. man: And the DSP will be
the one filling the buffer? Sparks: Yeah, yeah.
The DSP fills the buffer. man: All right, thanks. man: One last question. From a platform perspective, would you be able to state
a minimum requirement on OpenGL performance? Sparks: I’m not ready
to say that today. But… at some point we’ll– we’ll be able
to tell you about that. man: Okay, thanks.
Sparks: Uh-huh. Guess that’s my time.
Thanks, everyone. [applause]
