WaveOps and undefined mappings
November 2022 (1204 Words, 7 Minutes)
If you have spent any time doing graphics these days, you have probably written a compute shader. What you probably haven’t done is spend much time thinking about how threads are ordered relative to their lane ID (and neither did Microsoft until very recently). In this post we’ll explore this undefined mapping and its implications, and we’ll do it through the eras of shader models.
The Pre SM6.0 era - FXC
When compute shaders were first introduced, the shading language was specifically constructed to abstract away the HW details. You launch threads and use them. This was done to facilitate the multi-vendor landscape. You don’t need to worry about how wide the HW is, or how many threads or thread groups are executing. If you want to communicate between threads, you can do it through the Local Data Share (LDS). Things are pretty straightforward.
The Pre SM6.6 era - DXC and WaveOps
After some time, people noticed that in many situations the data you want already exists in a register, just not one assigned to your thread. Initially limited to the consoles and vendor extensions, WaveOps finally entered the API with DXC and SM6.0. Clever algorithms could gain a significant speed-up by letting threads communicate without going through LDS. Others use WaveOps for scalarization, converting vector operations into scalar ones. Pixel shaders gained even more power through QuadOps, operations that allow quick access to the other lanes that make up the quad.
But not all was well. WaveOps expose HW implementation details that were hidden before. Suddenly, you needed to know how wide the HW is, how many threads are in a wave, and how to write algorithms that still work when these details vary. Most importantly, and the main topic of this post, you now needed to know where the threads were.
Where the wild threads are
Let’s take a simplified model, a HW that is 8-lanes wide, and launch the following shader:
[numthreads(4,2,1)]
void main(uint3 gid : SV_GroupThreadID )
{
    uint laneId = WaveGetLaneIndex();
}
What will gid be? That is well defined (z is always 0 and will be omitted here):
(0,0) (1,0) (2,0) (3,0)
(0,1) (1,1) (2,1) (3,1)
But what is laneId relative to gid? That is undefined by the spec and completely up to the implementation. It could be this:
0 1 2 3
4 5 6 7
or this:
0 1 4 5
2 3 6 7
or even this:
7 4 6 3
2 5 1 0
The spec doesn’t say. But if we want to be pedantic, it gets even worse. How about the next shader:
[numthreads(8,1,1)]
void main(uint3 gid : SV_GroupThreadID )
{
    uint laneId = WaveGetLaneIndex();
}
Again, the gid is well defined (y and z are 0 here):
(0) (1) (2) (3) (4) (5) (6) (7)
and surely laneId is something sane like this:
0 1 2 3 4 5 6 7
Luckily (and as far as I know) it is, but crucially the spec does not enforce this. If our 8-wide HW is composed of two 4-wide SIMD units and we want the laneId order to be:
4 5 6 7 0 1 2 3
we can go right ahead.
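To make the point concrete, here is a small CPU-side Python sketch (not shader code) that models the 8-lane, [numthreads(4,2,1)] example. The helper names are hypothetical, and the three grids are simply the ones shown above; a real driver may pick any of them or something else entirely. The only property the sketch checks is that every thread gets exactly one lane.

```python
# CPU-side model of one 8-lane wave running a [numthreads(4,2,1)] group.
# Each grid is indexed as grid[y][x] and holds the lane ID that the thread
# with SV_GroupThreadID (x, y) would see. All three grids come from the text.
TYPEWRITER = [[0, 1, 2, 3],
              [4, 5, 6, 7]]
QUAD_SWIZZLED = [[0, 1, 4, 5],
                 [2, 3, 6, 7]]
ARBITRARY = [[7, 4, 6, 3],
             [2, 5, 1, 0]]


def is_valid_mapping(grid, wave_size=8):
    """True if every lane ID in [0, wave_size) is used exactly once."""
    lanes = [lane for row in grid for lane in row]
    return sorted(lanes) == list(range(wave_size))


def lane_to_gid(grid):
    """Invert a grid: lane ID -> (x, y) group thread ID."""
    return {lane: (x, y)
            for y, row in enumerate(grid)
            for x, lane in enumerate(row)}


if __name__ == "__main__":
    for name, grid in [("typewriter", TYPEWRITER),
                       ("quad-swizzled", QUAD_SWIZZLED),
                       ("arbitrary", ARBITRARY)]:
        # Where does lane 4 live under each mapping?
        print(name, is_valid_mapping(grid), lane_to_gid(grid)[4])
```

All three grids are valid mappings, yet lane 4 sits at (0,1), (2,0) and (1,0) respectively, so any shader that assumes one of them is relying on behavior outside the spec.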
If you did not define it, it is undefined
So how does any of this ever work? Most of the time, it doesn’t matter. Sometimes the algorithm doesn’t care about this detail (for example, if you are scalarizing a shader, the WaveOps themselves abstract it away). Other times, people relied on the undefined behavior that laneId simply counts left-to-right, top-to-bottom (known as “typewriter” order), and if a shader broke, the driver could force typewriter order for that shader. Relying on the order of linear thread groups (defined using [numthreads(X,1,1)]) is generally safe. Using WaveGetLaneIndex() is even better if you want to be extra spec-compliant. With that index in hand, you can create your own well-defined equivalent of SV_GroupThreadID. SPD is a great example of how to do it (with some nice helper functions, such as ARmpRed8x8, that you can reuse). It is tedious, but it ensures spec compliance and working shaders across different vendors and drivers.
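As a rough illustration of building your own coordinate from a linear index, here is a CPU-side Python sketch. The function lane_to_coord and its group_width parameter are hypothetical names, and the remap is a simple "Z inside each 2x2 quad, quads laid out row by row" scheme chosen for clarity; it is not a port of ARmpRed8x8. It also assumes the whole thread group fits in a single wave.

```python
def lane_to_coord(lane, group_width):
    """Map a linear lane index to an (x, y) coordinate.

    Lanes 4k..4k+3 form one 2x2 quad in Z order (left-to-right, then down),
    and quads themselves are placed left-to-right, then down.
    """
    if group_width % 2 != 0:
        raise ValueError("group_width must be even to hold whole quads")
    quads_per_row = group_width // 2
    quad, within = divmod(lane, 4)                 # which quad, position inside it
    quad_y, quad_x = divmod(quad, quads_per_row)   # quad position in the group
    x = quad_x * 2 + (within & 1)                  # bit 0 selects the column
    y = quad_y * 2 + (within >> 1)                 # bit 1 selects the row
    return x, y


if __name__ == "__main__":
    # 8 lanes arranged as a 4x2 group.
    for lane in range(8):
        print(lane, lane_to_coord(lane, group_width=4))
```

Because the coordinate is derived purely from the lane index, it does not matter how the implementation happens to assign lanes to SV_GroupThreadID; the shader uses its own mapping instead.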
The Post SM6.6 era (AKA the present)
With much fanfare (well, a blog post), Microsoft announced SM6.6. Along with many nice features such as Dynamic Resource Binding and Integer Atomics, Compute Shader Derivatives and Samples were introduced. Up until SM6.6, compute shaders could not compute derivatives. But this was a somewhat artificial limitation: derivatives are computed numerically, as simple differences between adjacent values. The only issue is who is adjacent to whom.
MS wanted a nice and stable reference point and so chose to rely on consecutive lane indices. A choice that is at once sensible and outright nonsensical. Let’s go back to our previous example to see why. We’ll look again at our fictional 8-wide HW and the following shader (now using SM6.6):
[numthreads(4,2,1)]
void main(uint3 gid : SV_GroupThreadID )
{
    uint laneId = WaveGetLaneIndex();
    float x_derivative = ddx(laneId);
    float y_derivative = ddy(laneId);
}
Again a reminder of the gid:
(0,0) (1,0) (2,0) (3,0)
(0,1) (1,1) (2,1) (3,1)
and the laneId (relying here on undefined behavior that almost always works):
0 1 2 3
4 5 6 7
What are the values of x_derivative and y_derivative? Taking each derivative as the neighbor’s value minus the thread’s own (right minus left, bottom minus top), under the original rule in the blog post x_derivative is:
1 1 1 1
1 1 1 1
Looks fine at first glance, nothing to see here. How about y_derivative?
2 2 2 2
2 2 2 2
Ugh, what just happened? The derivative followed the lane order instead of the more natural spatial order implied by gid! So instead of ddy = 4 - 0 = 4 (the value at (0,1) minus the value at (0,0)) for the first thread, we got 2: the value at (2,0) minus the value at (0,0)! The quads were not arranged the way we thought they were! In a more realistic shader, it gets worse. Frequently we do something like this in a compute shader:
[numthreads(4,2,1)]
void main(uint3 tid : SV_DispatchThreadID )
{
    float3 color = colorTexture[tid.xy];
    ...
}
Each thread loads a pixel, and the threads are launched on a grid “aligned” to the image. This is some nice syntactic sugar that helps with writing clearer code (imagine if all access was through linear indices :shudder:). But it has strong spatial implications that SM6.6 originally ignored completely. If you were to Sample or do a gradient operation, the result would be complete junk. Some of the x derivatives will make sense, but all the y derivatives will be nonsense.
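The difference between the two quad rules can be modeled on the CPU. The Python sketch below is an illustration under stated assumptions, not a model of any particular GPU: it assumes typewriter lane order for the 4x2 group, assumes each thread holds its own lane ID as the value, and takes derivatives as right minus left and bottom minus top within a quad. Both function names are hypothetical.

```python
LANE_GRID = [[0, 1, 2, 3],
             [4, 5, 6, 7]]   # grid[y][x] = lane ID (assumed typewriter order)


def derivs_lane_quads(grid, value_of_lane):
    """Quads are lanes 4k..4k+3: lanes +0/+1 are the top row, +2/+3 the bottom."""
    out = {}
    for y, row in enumerate(grid):
        for x, lane in enumerate(row):
            base = lane & ~3                       # first lane of this quad
            left = base + 2 * ((lane >> 1) & 1)    # left lane of my quad row
            top = base + (lane & 1)                # top lane of my quad column
            ddx = value_of_lane[left + 1] - value_of_lane[left]
            ddy = value_of_lane[top + 2] - value_of_lane[top]
            out[(x, y)] = (ddx, ddy)
    return out


def derivs_gid_quads(grid, value_of_lane):
    """Quads are 2x2 blocks of the thread grid, aligned to even x and y."""
    out = {}
    for y, row in enumerate(grid):
        for x, _ in enumerate(row):
            ox, oy = x & ~1, y & ~1                # top-left corner of my quad
            ddx = value_of_lane[grid[y][ox + 1]] - value_of_lane[grid[y][ox]]
            ddy = value_of_lane[grid[oy + 1][x]] - value_of_lane[grid[oy][x]]
            out[(x, y)] = (ddx, ddy)
    return out


if __name__ == "__main__":
    values = {lane: float(lane) for row in LANE_GRID for lane in row}
    print(derivs_lane_quads(LANE_GRID, values)[(0, 0)])
    print(derivs_gid_quads(LANE_GRID, values)[(0, 0)])
```

With these assumptions, the lane-based rule gives (1, 2) for every thread while the grid-based rule gives (1, 4): the x derivative agrees, and the y derivative is the one that goes wrong.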
However, with some nudges from AMD, MS changed the spec. There is still no strong mapping between SV_GroupThreadID and the result of WaveGetLaneIndex(), but in SM6.6 quads are well defined. Now that they are defined using SV_GroupThreadID instead of the original consecutive WaveGetLaneIndex() values, derivatives and QuadOps do “the right thing”. Within a quad, threads are now in “Z” order:
0 1 4 5
2 3 6 7
However, between quads, anything goes. If, for example, we had 16-wide HW and launched with [numthreads(4,4,1)], the order might be:
0 1 4 5
2 3 6 7
8 9 12 13
10 11 14 15
That would be Z within quads and Z between quads. But it can also be Z and N:
8 9 0 1
10 11 2 3
12 13 4 5
14 15 6 7
The spec makes no promises, and it is up to the HW vendors to arrange the quads as they see fit. While it isn’t the full mapping we were hoping for, at least derivatives and QuadOps make sense.
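The quad property shown in the two 4x4 grids can be written as a small check. This CPU-side Python sketch encodes the property as the grids above suggest it (every aligned 2x2 block of threads holds four consecutive lanes in Z order); that reading is an assumption drawn from the examples rather than wording quoted from the spec, and the function name is hypothetical.

```python
Z_THEN_Z = [[0, 1, 4, 5],
            [2, 3, 6, 7],
            [8, 9, 12, 13],
            [10, 11, 14, 15]]
Z_THEN_N = [[8, 9, 0, 1],
            [10, 11, 2, 3],
            [12, 13, 4, 5],
            [14, 15, 6, 7]]
TYPEWRITER_4X4 = [[0, 1, 2, 3],
                  [4, 5, 6, 7],
                  [8, 9, 10, 11],
                  [12, 13, 14, 15]]


def quads_are_well_formed(grid):
    """True if each aligned 2x2 block holds lanes 4k..4k+3 in Z order."""
    for oy in range(0, len(grid), 2):
        for ox in range(0, len(grid[0]), 2):
            base = grid[oy][ox]
            quad = [grid[oy][ox], grid[oy][ox + 1],
                    grid[oy + 1][ox], grid[oy + 1][ox + 1]]
            if base % 4 != 0 or quad != [base, base + 1, base + 2, base + 3]:
                return False
    return True


if __name__ == "__main__":
    for name, grid in [("Z then Z", Z_THEN_Z),
                       ("Z then N", Z_THEN_N),
                       ("typewriter", TYPEWRITER_4X4)]:
        print(name, quads_are_well_formed(grid))
```

Both layouts from the text pass, while plain typewriter order fails, which is exactly why the inter-quad arrangement can stay vendor-specific without breaking derivatives.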