I've been doing some benchmark tests on SGIs (with
test.blend from www.eofw.org/bench), during which I got
to wondering about how Blender renders the sub-regions
of an image with multiple threads.
Assuming the use of N threads, once fewer than N areas
remain (call that number K), some threads go unused,
so the tail end of the render process is not as fast. Worst
case is if the final region happens to be a complex one
in the image: only one thread is running and it takes
much longer than normal. This wouldn't matter if every
subregion was equally complex, but in real images this is
never the case.
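
To put rough numbers on that tail effect, here's a small Python sketch. It is only a toy model, not how Blender actually schedules its render parts: makespan() is a hypothetical helper, the tile costs are made up, and it assumes each idle thread simply takes the next tile in order.

```python
import heapq

def makespan(tile_costs, n_threads):
    """Total render time when each free thread grabs the next tile in order."""
    free_at = [0.0] * n_threads  # time at which each thread becomes idle
    heapq.heapify(free_at)
    for cost in tile_costs:
        start = heapq.heappop(free_at)  # earliest-idle thread takes the tile
        heapq.heappush(free_at, start + cost)
    return max(free_at)

# Hypothetical costs: 15 cheap tiles and one complex tile rendered last.
uneven = [1.0] * 15 + [4.0]
print(makespan(uneven, 8))      # 5.0, though 19 units / 8 threads is about 2.4

# Same work, with the complex tile cut into four pieces of cost 1 each
# (assumes the cost divides evenly, which a real scene won't guarantee).
print(makespan([1.0] * 19, 8))  # 3.0
```

In this made-up case one thread spends 4 time units alone on the last tile while the other seven sit idle for most of that time.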
In other words, if there are N threads, the parallelism
drops off as soon as there are N-1 areas left to render.
Unless the overhead kills it, surely it would be better
once K < N to halve the width/height of the remaining K
areas (giving 4K smaller ones), which would mean being able
to use N threads again, i.e. maximum speed. Depending on the resolution of the
image, this could be done once or twice and should speed
up the rendering of the final N-1 pieces quite a lot.
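
The idea could be sketched roughly like this in Python. This is just an illustration under my own assumptions, not Blender code: the (x, y, width, height) tile format, split_tile(), refill_queue() and the min_size cutoff are all hypothetical, and it only deals with tiles that haven't been started yet.

```python
from typing import List, Tuple

Tile = Tuple[int, int, int, int]  # (x, y, width, height) in pixels

def split_tile(tile: Tile) -> List[Tile]:
    """Halve a tile's width and height, giving up to four quadrants."""
    x, y, w, h = tile
    w1, h1 = w // 2, h // 2
    w2, h2 = w - w1, h - h1  # right/bottom parts absorb any odd pixel
    quads = [(x, y, w1, h1), (x + w1, y, w2, h1),
             (x, y + h1, w1, h2), (x + w1, y + h1, w2, h2)]
    return [q for q in quads if q[2] > 0 and q[3] > 0]

def refill_queue(pending: List[Tile], n_threads: int,
                 min_size: int = 16) -> List[Tile]:
    """Subdivide the unstarted tiles while there are fewer of them than threads."""
    while 0 < len(pending) < n_threads:
        # Stop once another halving would make tiles smaller than min_size,
        # on the assumption that per-tile overhead dominates below that.
        if any(min(t[2], t[3]) < 2 * min_size for t in pending):
            break
        pending = [q for t in pending for q in split_tile(t)]
    return pending

# 800 x 600 image in a 4 x 4 grid, with the last 7 tiles still to do.
grid = [(cx * 200, cy * 150, 200, 150) for cy in range(4) for cx in range(4)]
smaller = refill_queue(grid[-7:], n_threads=8)
print(len(smaller))  # 28
```

The loop repeats the halving if one pass still leaves fewer tiles than threads, which covers the "once or twice" case for larger images.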
Example: 8 threads (very common these days with the latest
dual/quad-core CPUs). Image split into the default 4 x 4
pieces. When 7 pieces remain, halve the width/height of
the pieces, so that 28 remain, and 8 threads can be used
again. As before, when only 7 pieces of this smaller size
remain, the efficiency will slide, but the final result
will be quicker than without. If the image is large enough,
a further resolution-halving would still be effective. At
[....]