I am in the process of adding ray-packet tracing to my ray tracer, which from now on shall be known by the silly name 'Photonizer' (as a reminder that the ultimate goal is a photon mapper).
I decided, for the sake of 'going ahead faster', to support only triangles as primitives from now on (goodbye, my ellipsoid test scene... sniff).
I changed the triangle intersector from Möller-Trumbore to Wald's optimized projection method.
I did this because it should be nicer to SIMDify (no cross products). A nice side effect, however, was that I also got a 25% speed-up with the new intersection code.
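For readers who have not seen the projection approach, here is a minimal scalar sketch of the idea. It is not the actual Photonizer code: the names `Vec3`, `ProjTri`, `precompute` and `intersect` are made up for illustration, and the layout is only one plausible way to store the precomputed values. The normal is scaled so that its dominant component is 1, and the hit point is projected onto the two remaining axes, where the barycentric coordinates come out of plain multiply-adds.

```cpp
#include <array>
#include <cassert>
#include <cmath>

using Vec3 = std::array<float, 3>;

// Hypothetical per-triangle data, computed once when the scene is loaded.
struct ProjTri {
    int k;             // dominant axis of the triangle normal
    float nu, nv, nd;  // plane equation, scaled so that n[k] == 1
    float au, av;      // vertex A projected onto the (u, v) plane
    float bu, bv;      // coefficients that yield barycentric beta
    float cu, cv;      // coefficients that yield barycentric gamma
};

ProjTri precompute(const Vec3& a, const Vec3& b, const Vec3& c) {
    const Vec3 e1{b[0] - a[0], b[1] - a[1], b[2] - a[2]};
    const Vec3 e2{c[0] - a[0], c[1] - a[1], c[2] - a[2]};
    // The only cross product, paid once per triangle rather than per ray.
    const Vec3 n{e1[1] * e2[2] - e1[2] * e2[1],
                 e1[2] * e2[0] - e1[0] * e2[2],
                 e1[0] * e2[1] - e1[1] * e2[0]};
    int k = 0;
    if (std::fabs(n[1]) > std::fabs(n[k])) k = 1;
    if (std::fabs(n[2]) > std::fabs(n[k])) k = 2;
    assert(n[k] != 0.0f && "degenerate triangle");
    const int u = (k + 1) % 3, v = (k + 2) % 3;
    // n[k] equals e1[u]*e2[v] - e1[v]*e2[u], the 2D determinant in (u, v).
    const float inv = 1.0f / n[k];
    ProjTri t;
    t.k = k;
    t.nu = n[u] * inv;
    t.nv = n[v] * inv;
    t.nd = (a[0] * n[0] + a[1] * n[1] + a[2] * n[2]) * inv;
    t.au = a[u];
    t.av = a[v];
    t.bu = e2[v] * inv;
    t.bv = -e2[u] * inv;
    t.cu = -e1[v] * inv;
    t.cv = e1[u] * inv;
    return t;
}

// Per-ray test: only multiplies, adds, one divide and a few compares.
bool intersect(const ProjTri& t, const Vec3& o, const Vec3& d, float& tHit) {
    const int k = t.k, u = (k + 1) % 3, v = (k + 2) % 3;
    const float denom = d[k] + t.nu * d[u] + t.nv * d[v];
    if (denom == 0.0f) return false;  // ray parallel to the triangle plane
    const float dist = (t.nd - o[k] - t.nu * o[u] - t.nv * o[v]) / denom;
    if (dist <= 0.0f) return false;   // plane is behind the ray origin
    const float hu = o[u] + dist * d[u] - t.au;
    const float hv = o[v] + dist * d[v] - t.av;
    const float beta = hu * t.bu + hv * t.bv;
    if (beta < 0.0f) return false;
    const float gamma = hu * t.cu + hv * t.cv;
    if (gamma < 0.0f || beta + gamma > 1.0f) return false;
    tHit = dist;
    return true;
}
```

Because the per-ray part has no branches that depend on vector math beyond simple compares, each line maps fairly directly onto one or two SIMD instructions.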
After that I wrote a parallel version of the intersector that can take a 2x2 ray packet. This took a full day of researching, coding, and going through the Visual Studio SSE/SSE2 intrinsics documentation. Some time was also wasted writing cpuid code for CPU feature detection.
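To give an idea of what a packet intersector can look like, here is a rough sketch that tests four rays (one 2x2 pixel block) against a single triangle using SSE intrinsics. Again, this is an illustration under assumptions and not the Photonizer source: `ProjTri`, `RayPacket4` and `intersect4` are hypothetical names, the triangle data is assumed to be precomputed for the projection method, and the rays are assumed to be stored one component per register (structure of arrays).

```cpp
#include <cassert>
#include <xmmintrin.h>  // SSE intrinsics

// Hypothetical precomputed triangle for the projection test.
struct ProjTri {
    int k;                       // dominant axis of the normal
    float nu, nv, nd;            // plane equation scaled so n[k] == 1
    float au, av;                // vertex A in the (u, v) plane
    float bu, bv, cu, cv;        // barycentric coefficients
};

// Four rays, one per SSE lane: o[axis] and d[axis] each hold 4 floats.
struct RayPacket4 {
    __m128 o[3];
    __m128 d[3];
};

// Returns a 4-bit mask (bit i set means ray i hits); tHit gets all 4 distances.
int intersect4(const ProjTri& t, const RayPacket4& r, __m128& tHit) {
    assert(t.k >= 0 && t.k < 3);
    const int k = t.k, u = (k + 1) % 3, v = (k + 2) % 3;
    const __m128 zero = _mm_setzero_ps();
    const __m128 one = _mm_set1_ps(1.0f);
    const __m128 nu = _mm_set1_ps(t.nu);
    const __m128 nv = _mm_set1_ps(t.nv);

    // Distance along each ray to the triangle plane.
    const __m128 denom = _mm_add_ps(
        r.d[k], _mm_add_ps(_mm_mul_ps(nu, r.d[u]), _mm_mul_ps(nv, r.d[v])));
    const __m128 num = _mm_sub_ps(
        _mm_set1_ps(t.nd),
        _mm_add_ps(r.o[k],
                   _mm_add_ps(_mm_mul_ps(nu, r.o[u]), _mm_mul_ps(nv, r.o[v]))));
    const __m128 dist = _mm_div_ps(num, denom);

    // Lanes with a parallel ray or a hit behind the origin are masked out.
    __m128 mask = _mm_and_ps(_mm_cmpneq_ps(denom, zero),
                             _mm_cmpgt_ps(dist, zero));

    // Hit point relative to vertex A, projected onto the (u, v) plane.
    const __m128 hu = _mm_sub_ps(
        _mm_add_ps(r.o[u], _mm_mul_ps(dist, r.d[u])), _mm_set1_ps(t.au));
    const __m128 hv = _mm_sub_ps(
        _mm_add_ps(r.o[v], _mm_mul_ps(dist, r.d[v])), _mm_set1_ps(t.av));

    // Barycentric coordinates for all four rays at once.
    const __m128 beta = _mm_add_ps(_mm_mul_ps(hu, _mm_set1_ps(t.bu)),
                                   _mm_mul_ps(hv, _mm_set1_ps(t.bv)));
    const __m128 gamma = _mm_add_ps(_mm_mul_ps(hu, _mm_set1_ps(t.cu)),
                                    _mm_mul_ps(hv, _mm_set1_ps(t.cv)));

    mask = _mm_and_ps(mask, _mm_cmpge_ps(beta, zero));
    mask = _mm_and_ps(mask, _mm_cmpge_ps(gamma, zero));
    mask = _mm_and_ps(mask, _mm_cmple_ps(_mm_add_ps(beta, gamma), one));

    tHit = dist;  // only lanes whose mask bit is set are meaningful
    return _mm_movemask_ps(mask);
}
```

Instead of early-outs, every rejection test becomes a lane mask that is ANDed together, so all four rays follow the same instruction stream and the caller decides what to do with the resulting bit mask.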
I have not parallelized the BVH traversal yet, but I was too eager to wait, so I tried it out with only primary rays and without shading, and I was extremely pleased with the result.
I got a 4.5x speed-up rendering the Cornell box brute force, without any space partitioning, and for the first time I saw a 1M rays/sec stat on my console (copy-paste: 1040062.855307 rays/sec.).
So although there is still work to do, this is the '1M, for the record' post.