I just pushed out code which adds MPEG2 slice decoding to my graphics driver. It is based on XvMC, but unlike "standard" XvMC implementations, it sends the MPEG slices to the graphics driver over the X protocol.
The base idea is the following: The MPEG engine gets MPEG slices, and outputs to a buffer. This buffer needs to then be displayed by the overlay engine. So, we need to spend most of our time managing the communication and syncing of those engines. We already have the other video engines implemented nicely, so, why not stick the MPEG code next to that and have a nice and clean implementation?
The XvMC protocol, X-wise, is mostly about telling the driver that the MPEG hardware is in use, and subsequently claiming buffers in the framebuffer, managed by the X driver. Everything else is expected to happen in the client library. Why, I do not know, but part of it could be that, this way, X is not seen to eat any CPU cycles. In any case, this makes it a very weird protocol, with things spread all over the stack.
Things were made worse with the advent of the XvMC wrapper. Instead of expanding the XvMC protocol slightly to provide the name of the XvMC client library to be loaded, DRI is abused for this purpose. So... a pointless hard dependency on DRI is added, and now, no working DRI means no working XvMC... Curious. Makes the pointless dependency on Shm look harmless.
So what I did is send all the data over the X protocol, through a tiny X extension, so that it can be fed into the hardware and synced inside the X driver. An XvPutImage with a longword buffer containing the MPEG buffer id then makes sure that everything gets displayed correctly. While the overlay is being set up, the MPEG engine can finish its work; at the very last minute, the overlay code waits for the MPEG engine to finish, and then the overlay gets told to display the new image.
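The client-side flow can be sketched roughly as follows. This is C-style pseudocode under stated assumptions: XvCreateImage and XvPutImage are real libXv calls, but SendMpegSlice and FOURCC_XVMC are hypothetical names standing in for the tiny private extension request and its image format id.

```c
/* Sketch only: SendMpegSlice and FOURCC_XVMC are made-up names for
 * the private extension request and image format described above. */

/* 1. Push each MPEG slice to the X driver over the private extension;
 *    the driver feeds them to the MPEG engine and handles syncing. */
for (i = 0; i < slice_count; i++)
    SendMpegSlice(dpy, context, surface_id, slices[i].data, slices[i].len);

/* 2. Display: the XvImage "data" is just a longword holding the id of
 *    the buffer the MPEG engine decoded into. */
CARD32 buffer_id = surface_id;
XvImage *image = XvCreateImage(dpy, port, FOURCC_XVMC,
                               (char *)&buffer_id, width, height);
XvPutImage(dpy, port, window, gc, image,
           0, 0, width, height,   /* source rectangle */
           0, 0, width, height);  /* destination rectangle */

/* Inside the X driver, the overlay is set up, the driver waits at the
 * very last moment for the MPEG engine to finish, and only then is the
 * overlay told to display the new image. */
```

The point of the longword-buffer trick is that the existing Xv display path is reused unchanged; only the image "payload" differs.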
Other XvMC implementations went and completely reimplemented the overlay in the client library, and even involved 2D acceleration to be able to send MPEG slices to the hardware a bit faster. A syncing nightmare. A further advantage of my approach is that the newer MPEG2 engines can be implemented in just a few hundred highly hardware-specific lines.
Of course, sending this data over the X protocol in tiny bits does cost some extra CPU cycles, and I am also not feeding MPEG data into the hardware through the command buffers. Because of this, my code uses about 30-35% CPU for a normal DVD (write a comment if you guessed which :)) on a VIA C3 Samuel2 (yes, half speed FPU, not quite PPro compatible) at 600MHz, while openchrome uses 20-25%, roughly two thirds of that. But the performance of my code is still very good, good enough to not bother with speeding things up just yet.
As usual, it is easy to get this new code. It builds and runs against all common Xorg versions, and the debian package build system has also been updated for the XvMC code.
For xine there is one caveat, due to the horrible implementation of video_out_xxmc. We need to fool xine into thinking that we support subpictures (we do not, as the xxmc way of implementing them did not even come close to how the hardware implements them). For that, the following option needs to be set:
Option "XvMCBrokenXine" "True"
in the device section of xorg.conf.