| Luc Verhaegen @ 2008-07-30 11:44:00 |
| Current location: | office |
| Current mood: | |
| Current music: | Amon Tobin - Supermodified 02 - Four ton mantis |
| Entry tags: | x radeonhd ati amd |
The RadeonHD Command Submission (CS) branch.
As you all might have noticed Tuesday last week, I pushed a new branch into the radeonhd driver: CS, which stands for Command Submission.
In the true spirit of the radeonhd driver, this separates out all the muck and yuckiness of feeding commands, be they register writes or CP (command processor) command packets, into the hardware's various engines.
CS has become an infrastructure which does command buffer handling and (command) engine maintenance in a clean way, for both MMIO and DRM CP. MMIO means only register writes to the MMIO aperture. With DRM CP, we use the command processor by feeding it (indirect) command buffers through the DRM.
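To give a rough idea of what "one infrastructure, several back-ends" can look like, here is a minimal sketch in C. It is not the actual radeonhd code: every name in it (`sketch_cs`, `sketch_cs_backend`, the fake register array and so on) is made up for illustration, and the "hardware" is just an array. The MMIO-like back-end writes each register immediately, while the CP-like back-end queues (register, value) pairs in a buffer and hands them over in one go on flush.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_REG_COUNT 64   /* hypothetical size of the fake register file */
#define SKETCH_BUF_WORDS 32   /* hypothetical command buffer size, in dwords */

/* Stand-in for the hardware: a plain array, not a real MMIO aperture. */
struct sketch_hw {
    uint32_t regs[SKETCH_REG_COUNT];
};

struct sketch_cs;

/* What a back-end has to provide (hypothetical interface). */
struct sketch_cs_backend {
    const char *name;
    void (*reg_write)(struct sketch_cs *cs, uint16_t reg, uint32_t value);
    void (*flush)(struct sketch_cs *cs);
};

struct sketch_cs {
    const struct sketch_cs_backend *backend;
    struct sketch_hw *hw;
    uint32_t buf[SKETCH_BUF_WORDS];  /* only used by the buffered back-end */
    size_t wptr;                     /* next free dword in buf */
    unsigned flushes;                /* number of non-empty flushes so far */
};

/* MMIO-like back-end: every write lands in the register right away. */
static void sketch_mmio_reg_write(struct sketch_cs *cs, uint16_t reg, uint32_t value)
{
    assert(reg < SKETCH_REG_COUNT);
    cs->hw->regs[reg] = value;
}

static void sketch_mmio_flush(struct sketch_cs *cs)
{
    (void)cs;  /* nothing is buffered, so there is nothing to do */
}

/* CP-like back-end: replay the queued (reg, value) pairs, then reset. */
static void sketch_cp_flush(struct sketch_cs *cs)
{
    size_t i;

    for (i = 0; i + 1 < cs->wptr; i += 2)
        cs->hw->regs[cs->buf[i]] = cs->buf[i + 1];
    if (cs->wptr)
        cs->flushes++;
    cs->wptr = 0;
}

static void sketch_cp_reg_write(struct sketch_cs *cs, uint16_t reg, uint32_t value)
{
    assert(reg < SKETCH_REG_COUNT);
    if (cs->wptr + 2 > SKETCH_BUF_WORDS)  /* buffer full: submit it first */
        sketch_cp_flush(cs);
    cs->buf[cs->wptr++] = reg;
    cs->buf[cs->wptr++] = value;
}

static const struct sketch_cs_backend sketch_mmio_backend = {
    "mmio", sketch_mmio_reg_write, sketch_mmio_flush
};

static const struct sketch_cs_backend sketch_cp_backend = {
    "cp", sketch_cp_reg_write, sketch_cp_flush
};

static void sketch_cs_init(struct sketch_cs *cs,
                           const struct sketch_cs_backend *backend,
                           struct sketch_hw *hw)
{
    cs->backend = backend;
    cs->hw = hw;
    cs->wptr = 0;
    cs->flushes = 0;
}

/* Acceleration code would only ever call these two. */
static void sketch_cs_reg_write(struct sketch_cs *cs, uint16_t reg, uint32_t value)
{
    cs->backend->reg_write(cs, reg, value);
}

static void sketch_cs_flush(struct sketch_cs *cs)
{
    cs->backend->flush(cs);
}
```

The point of the sketch is only that the caller does not care which back-end sits underneath: swapping `sketch_mmio_backend` for `sketch_cp_backend` in `sketch_cs_init()` changes when the writes reach the registers, not how the acceleration code is written.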
So now we have both XAA and EXA acceleration running on top of the CS infrastructure. XAA is the existing radeonhd "MMIO-only" code now working on top of DRM CP as well. For EXA, i now also added faster Up/Downloads.
When developing CS, i first went for a clean implementation and making sure the code was well-structured and as nicely condensed as possible. Once this was achieved, i ran benchmarks and optimised the code. When doing so I noticed that our code was running 10x slower on OpenSuSE 10.3 (old, i know) with inline functions compared to macros. After talking to the GCC people here, they confirmed that this was still often the case for the gcc version included in OpenSuSE 10.3, and that it was fixed for the gcc version in OpenSuSE 11. Since our driver needs to give a decent user-experience on a wide variety of systems, i therefore was left with no other option but to turn the calls used in the XAA callbacks (Grab, Write/RegWrite, Advance, all of which are used _very_ often) into macros. Nasty, but some things cannot be avoided :(
But even so, the macros themselves are small and readable, as the underlying code is nice and condensed, and we get really good performance out of them :)
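For those who have not had to make this trade-off before, here is a small sketch in C of the same three operations written once as inline functions and once as macros. This is not the radeonhd code: the names (`sketch_ring`, `SKETCH_GRAB` and friends) are invented, and the semantics are my simplification for illustration, namely that Grab reserves a number of dwords, Write stores one dword, and Advance checks that exactly the reserved amount was written.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_RING_WORDS 16  /* hypothetical buffer size, in dwords */

struct sketch_ring {
    uint32_t buf[SKETCH_RING_WORDS];
    size_t wptr;     /* next free dword */
    size_t grabbed;  /* end of the current reservation */
};

/* Function flavour: type-checked, but relies on the compiler to inline. */
static inline void sketch_grab(struct sketch_ring *r, size_t count)
{
    assert(r->wptr + count <= SKETCH_RING_WORDS);
    r->grabbed = r->wptr + count;
}

static inline void sketch_write(struct sketch_ring *r, uint32_t value)
{
    r->buf[r->wptr++] = value;
}

static inline void sketch_advance(struct sketch_ring *r)
{
    assert(r->wptr == r->grabbed);  /* wrote exactly what was reserved */
}

/*
 * Macro flavour: always expanded in place, whatever the compiler version.
 * The argument r is evaluated more than once, so only pass a plain pointer.
 */
#define SKETCH_GRAB(r, count) \
    do { \
        assert((r)->wptr + (count) <= SKETCH_RING_WORDS); \
        (r)->grabbed = (r)->wptr + (count); \
    } while (0)

#define SKETCH_WRITE(r, value) \
    do { (r)->buf[(r)->wptr++] = (value); } while (0)

#define SKETCH_ADVANCE(r) \
    do { assert((r)->wptr == (r)->grabbed); } while (0)
```

Both flavours do the same thing to the buffer. The macros lose type checking and evaluate their arguments more than once, but as long as the bodies stay this small they remain readable, which is the trade-off described above.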
So, now that we have the radeonhd acceleration code working when DRI is enabled, we can start working on adding the remaining r5xx acceleration, like EXA Render and Textured Video. After that, new CS back-ends (like direct CP) can be added in.
With Direct CP we are setting up and handling the CP directly from the X driver. Without the kernel to depend on (where we currently fall back to MMIO), we have no other option than to stick the CP's main command buffer into the graphics card's memory. Thanks to the memory virtualisation in the hardware, the command processor knows no different, even though we are not as fast writing into this buffer as when going through the GART :)
The advantage of having a working CP is that we can use command packets, and for our hardware this means that we can have Textured Video and EXA Render acceleration (as the MMIO path there is broken), and we then can, eventually, get rid of the MMIO path completely and further optimize the CS infrastructure.
For R5xx direct CP works, but for rs690, i am still running into issues. I know that having the main command buffer in the framebuffer on this IGP hardware is in itself ok, as high speed memory writes to unbuffered registers are perfectly ok, but as soon as buffered registers are written, one of the FIFOs loses count :(
In the short term, i am going to push out the code and enable it just for R5xx, and keep MMIO around for rs690, but this kind of defeats the purpose of Direct CP, as that was getting rid of MMIO in the first place :)
When trying to get rs690 going, the CS really showed its power. I coded up the two other options for stuffing commands into the CP (indirect buffers and stuffing the CSQ directly, as opposed to using the main ringbuffer directly) in a matter of hours. Neither was successful, as they showed the exact same issue as when using the main buffer, but it was nice to see how adaptable the CS infrastructure was.
But the ability to easily add different back-ends is important for more than direct CP.
Jerome Glisse has been working on cleaning up/adding to the DRM driver's ioctls to provide cleaner handling of the CP. The cleanup of the drm driver here is much overdue, there are many issues with this code (CP IDLE on a modern CPU is a good example), and it's fantastic that Jerome is working on it. The work does sometimes resemble archeology more than software development, so bear with him :)
Once Jerome's code is working, a new back-end can be written up for radeonhd in a few hundred lines, and we can take full advantage of his work almost immediately. It will ease the burden on Jerome for testing, and it will give us a faster and cleaner DRM CP right away.
So after some hard labour on creating the infrastructure and fitting it into the driver nicely, we are reaping the rewards already :)