LIBV Intentionally Breaks Videodrivers
Luc Verhaegen
Hardware MPEG2 Slice decoding added to unichrome driver. [Nov. 4th, 2009|10:24 pm]
[Current Location |couch]
[music |The Dust Brothers - Medula Oblongata]

I just pushed out code which adds MPEG2 slice decoding to my graphics driver. It is based on XvMC, but unlike "standard" XvMC implementations, it sends the MPEG slices to the graphics driver over the X protocol.

The base idea is the following: the MPEG engine gets MPEG slices, and outputs to a buffer. This buffer then needs to be displayed by the overlay engine. So, we need to spend most of our time managing the communication and syncing of those engines. We already have the other video engines implemented nicely, so why not stick the MPEG code next to that and have a nice and clean implementation?

The XvMC protocol, X-wise, is mostly about telling the driver that the MPEG hardware is in use, and subsequently claiming buffers in the framebuffer, managed by the X driver. Everything else is expected to happen in the client library. For what reason, I do not know, but part of it could be that, this way, X is not seen to eat any CPU cycles. In any case, this makes it a very weird protocol, with things spread all over the stack.

Things were made worse with the advent of the XvMC wrapper. Instead of expanding the XvMC protocol slightly to provide the name of the XvMC client library to be loaded, DRI is abused for this purpose. So... a pointless hard dependency on DRI is added, and now, no working DRI means no working XvMC... Curious. Makes the pointless dependency on Shm look harmless.

So what I did now is send all the data over the X protocol, over a tiny X extension, so that it can be fed into the hardware and synced inside the X driver. An XvPutImage with a longword buffer containing the mpeg buffer id then makes sure that everything gets displayed correctly. While the overlay is being set up, the mpeg engine can finish its work; at the very last minute, the overlay code waits for the mpeg engine to finish, and then the overlay gets told to display the new image.
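To illustrate the handoff described above, here is a minimal sketch of the display path: the overlay setup overlaps with the MPEG engine finishing its decode, and the sync happens only at the last moment before the flip. All the names and structures here are hypothetical, purely for illustration, and not the actual unichrome driver code.

```c
/* Sketch of the mpeg-engine/overlay handoff; names are illustrative,
 * not the real driver API. */
#include <stdint.h>

struct mpeg_engine {
    uint32_t busy_buffer;   /* buffer id being decoded into, 0 = idle */
};

struct overlay {
    uint32_t displayed_buffer;
};

/* Feed one slice to the engine. The real driver would write the slice
 * data to the hardware; here we only mark the target buffer as busy. */
static void mpeg_engine_submit_slice(struct mpeg_engine *e, uint32_t buffer_id,
                                     const uint8_t *slice, unsigned len)
{
    (void)slice; (void)len;
    e->busy_buffer = buffer_id;
}

/* Block until the engine has finished writing the buffer. The real
 * driver would poll a hardware status register here. */
static void mpeg_engine_wait(struct mpeg_engine *e)
{
    e->busy_buffer = 0;
}

/* The XvPutImage path: overlay setup (scaling, colour keying, ...)
 * overlaps with the engine finishing; we sync at the very last minute
 * and only then flip the overlay to the freshly decoded buffer. */
static void overlay_display(struct overlay *o, struct mpeg_engine *e,
                            uint32_t buffer_id)
{
    /* ... overlay setup work happens here, in parallel with decode ... */
    mpeg_engine_wait(e);
    o->displayed_buffer = buffer_id;
}
```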

Other XvMC implementations went and completely reimplemented the overlay in the client library, and even involved 2D acceleration to be able to send mpeg slices to the hardware a bit faster. A syncing nightmare. A further advantage of my approach is that it can support the newer mpeg2 engines in just a few hundred highly hardware-specific lines.

Of course, sending this data over the X protocol in tiny bits does incur some more CPU cycles, and I am also not yet feeding mpeg data into the hardware over the command buffers. Because of this, my code uses about 30-35% CPU for a normal DVD (write a comment if you guessed which :)) on a VIA C3 Samuel2 (yes, half speed FPU, not quite PPro compatible) at 600MHz, while openchrome uses 20-25%, roughly 2/3rds. But the performance of my code is still very good, good enough to not bother with speeding things up just yet.

As usual, it is easy to get this new code. It builds and runs against all common Xorg versions, and the debian package build system has also been updated for the xvmc code.

For xine there is one caveat, due to the horrible implementation of video_out_xxmc. We need to fool xine into thinking that we support subpictures (we do not, as the xxmc way of implementing them does not even come close to how the hardware implements it). For that, the following option needs to be set:

Option "XvMCBrokenXine" "True"

in the device section of xorg.conf.
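In context, the device section would look something like this (the Identifier and Driver values below are illustrative; use whatever your existing xorg.conf already has):

```
Section "Device"
    Identifier "unichrome"
    Driver     "unichrome"
    Option     "XvMCBrokenXine" "True"
EndSection
```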

Enjoy!

Comments:
From: (Anonymous)
2009-11-04 11:01 pm (UTC)

shm?


Wouldn't it make sense to just use shm for the slices when the display is local?

-Tor
From: libv
2009-11-04 11:06 pm (UTC)

Re: shm?


Well, individual slices are almost always less than 2kb, often in the 300kb range, mostly just a few bytes. Implementing shm really will not bring more speed; shm is good for big blobs, not for many minute blobs. What could be done is to send a frame over as a whole, combining all or some frames into one and then feeding them into the hardware. But before that happens, the command queue needs to be implemented.
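The combining idea mentioned above could be sketched like this: gather a frame's worth of tiny slices into one buffer and hand that to the hardware in a single go, instead of shipping each slice separately. The names and the buffer size here are hypothetical, not the driver's actual interface.

```c
/* Illustrative slice-batching sketch; names and sizes are made up. */
#include <stdint.h>
#include <string.h>

#define FRAME_BUFFER_SIZE (256 * 1024)

struct slice_batch {
    uint8_t  data[FRAME_BUFFER_SIZE];
    unsigned used;    /* bytes accumulated so far */
    unsigned slices;  /* number of slices in this batch */
};

/* Append one slice to the batch; returns 0 on success, -1 when full. */
static int batch_add_slice(struct slice_batch *b, const uint8_t *slice,
                           unsigned len)
{
    if (b->used + len > FRAME_BUFFER_SIZE)
        return -1;
    memcpy(b->data + b->used, slice, len);
    b->used += len;
    b->slices++;
    return 0;
}

/* Submit the whole batch at once and reset it; returns bytes sent.
 * The real driver would hand b->data to the command queue here. */
static unsigned batch_flush(struct slice_batch *b)
{
    unsigned sent = b->used;
    b->used = 0;
    b->slices = 0;
    return sent;
}
```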

All of that is a bit of a waste of resources atm, it is fast enough.

From: libv
2009-11-04 11:07 pm (UTC)

Re: shm?


Err, 300b not kb, and combining all or some _slices_ not frames.