NicolasRobidoux wrote:One last "manic perfectionist" thing: Some of the positions, you know ahead of time that they are within 1, or farther than 1. So, you could use a special weight function for these special cases and skip some branches for these "indexes".
What I mean is this:
I assume that you fix things so that the sampling point is within the convex hull of the four central input pixel locations within the 4x4. (I could figure this from your code but I'm too lazy.)
If so, you know right off the bat that these four closest input pixels cannot be at a distance of more than 2; in fact they are within sqrt(1+1) = sqrt(2) < 2. This means that the third branch of the weight computation (the one for the largest distances) is not applicable to the four "inner" input pixels.
You also know right off the bat that the outer input pixels (the 16-4 = 12 that are not discussed above) cannot be at a distance of less than 1. This means that the first branch of the weight computation is not applicable to the 12 "outer" input pixels.
Now, the weight computation for all input pixels has only two branches, instead of three. You should be able to exploit this to make the code faster. (This may require computing contributions one position at a time instead of looping. That is, getting speed out of this may require manually unrolling the loop that goes over all 16 input pixel positions.)
P.S.
This next comment is separate from the unrolling above: unless your library/compiler is really smart, you probably should reorganize
Code:
color = mul(weights[0], float4x3(c00, c10, c20, c30));
color += mul(weights[1], float4x3(c01, c11, c21, c31));
color += mul(weights[2], float4x3(c02, c12, c22, c32));
color += mul(weights[3], float4x3(c03, c13, c23, c33));
like this
Code:
color1 = mul(weights[0], float4x3(c00, c10, c20, c30));
color2 = mul(weights[1], float4x3(c01, c11, c21, c31));
color3 = mul(weights[2], float4x3(c02, c12, c22, c32));
color4 = mul(weights[3], float4x3(c03, c13, c23, c33));
color = ( color1 + color2 ) + ( color3 + color4 );
The reason for this is that, just like I suggested with min and max, you are splitting the operations into two parallel tracks, which are merged at the end. This is a standard trick, the name of which I forget. (In standard C, it's basically "Use multiple accumulators to minimize latency." If you visualise the components of the computation as (very short) trees, it's basically a red-black trick.) Having split things like this, it's easy to integrate the advice I give at the top: One now has a natural split of the input data and weights in four groups of four, which gives 4+12 painlessly.
(Hopefully, I am not making incorrect assumptions about your computing environment. This is how I'd go at things if I was working with an HLSL programmer.)
P.S. I don't like playing Sudoku, but I love doing this kind of optimization puzzle.
