NicolasRobidoux wrote:One last "manic perfectionist" thing: Some of the positions, you know ahead of time that they are within 1, or farther than 1. So, you could use a special weight function for these special cases and skip some branches for these "indexes".
What I mean is this:
I assume that you fix things so that the sampling point is within the convex hull of the four central input pixel locations within the 4x4. (I could figure this from your code but I'm too lazy.)
If so, you know right off the bat that these four closest input pixels cannot be at a distance of more than 2; in fact they are within sqrt(1+1) = sqrt(2) < 2. This means that the third branch of the weight computation (the one for the largest distances) is not applicable to the four "inner" input pixels.
You also know right off the bat that the outer input pixels (the 16-4 = 12 that are not discussed above) cannot be at a distance of less than 1. This means that the first branch of the weight computation is not applicable to the 12 "outer" input pixels.
Now, the weight computation for all input pixels has only two branches, instead of three. You should be able to exploit this to make the code faster. (This may require computing contributions one position at a time instead of looping. That is, getting speed out of this may require manually unrolling the loop that goes over all 16 input pixel positions.)
P.S.
This next comment is separate from the unrolling above: unless your library/compiler is really smart, you probably should reorganize
Code:
color = mul(weights[0], float4x3(c00, c10, c20, c30));
color += mul(weights[1], float4x3(c01, c11, c21, c31));
color += mul(weights[2], float4x3(c02, c12, c22, c32));
color += mul(weights[3], float4x3(c03, c13, c23, c33));
like this
Code:
color1 = mul(weights[0], float4x3(c00, c10, c20, c30));
color2 = mul(weights[1], float4x3(c01, c11, c21, c31));
color3 = mul(weights[2], float4x3(c02, c12, c22, c32));
color4 = mul(weights[3], float4x3(c03, c13, c23, c33));
color = ( color1 + color2 ) + ( color3 + color4 );
The reason for this is that, just like I suggested with min and max, you are splitting the operations into two parallel tracks, which are merged at the end. This is a standard trick, the name of which I forget. (In standard C, it's basically "Use multiple accumulators to minimize latency." If you visualise the components of the computation as (very short) trees, it's basically a red-black trick.) Having split things like this, it's easy to integrate the advice I give at the top: One now has a natural split of the input data and weights in four groups of four, which gives 4+12 painlessly.
(Hopefully, I am not making incorrect assumptions about your computing environment. This is how I'd go at things if I was working with an HLSL programmer.)
P.S. I don't like playing Sudoku, but I love doing this kind of optimization puzzle.
