Friday, October 31, 2008

GI - Photon Mapping - Post 8 - Optimization Summary / Rant

In the last 3 days all the work has been on optimization. Below is a summary of the changes and the speedups they produced (speedup = old render time / new render time):
  • Changing the intersector from Möller/Trumbore to the optimized Wald projection -> 1.25
  • 2x2 Packet Tracing (SSE/SSE2 using SIMD intrinsics) -> 3.0-4.0
  • Including ordering information in the traversal (do not traverse a BVH box if an intersection was already found at a t smaller than the box's entry t): Mono -> 1.8, 2x2 -> 1.3
  • SIMDifying primary ray packet generation (instead of looping to generate the 4 rays) -> 1.05
  • Changing BVH traversal from recursive to iterative: Mono->1.04, 2x2->soon
  • Changing BVH memory allocation scheme to improve traversal cache friendliness: soon
  • Changing BVH build algorithm (to SAH?): soon
Some if not all of these speedups are of course scene dependent; some were tested with the Stanford Bunny, others with the Cornell Box, so the numbers here are very rough and relative, and some changes might even produce slowdowns in certain scenes/configurations. This is just a simple summary. After I have all of the high-level optimizations done, I will probably try toxie's test benchmark and then have a go at lower-level optimizations using a profiler, but for that I am waiting for my new hardware, on which I plan to install Linux; I will then be able to comfortably use valgrind, gprof or VTune.
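For the record, the recursive-to-iterative change and the ordering early-out can be sketched roughly like this. This is a minimal illustration, not the Photonizer code: the node layout is hypothetical, and the box entry t stands in for the real triangle test at the leaves; only the explicit stack and the "skip boxes behind the current hit" trick are the point here.

```cpp
#include <cfloat>
#include <vector>

// Hypothetical minimal BVH node: an inner node stores two child indices,
// a leaf stores a primitive index. Layout is illustrative only.
struct BvhNode {
    float bmin[3], bmax[3];
    int left, right;   // child indices, unused for leaves
    int prim;          // primitive index, -1 for inner nodes
};

// Slab test: returns the entry t of the box, or FLT_MAX on a miss.
inline float intersectBox(const BvhNode& n, const float o[3],
                          const float invD[3], float tMax) {
    float t0 = 0.0f, t1 = tMax;
    for (int a = 0; a < 3; ++a) {
        float tNear = (n.bmin[a] - o[a]) * invD[a];
        float tFar  = (n.bmax[a] - o[a]) * invD[a];
        if (tNear > tFar) { float tmp = tNear; tNear = tFar; tFar = tmp; }
        if (tNear > t0) t0 = tNear;
        if (tFar  < t1) t1 = tFar;
        if (t0 > t1) return FLT_MAX;
    }
    return t0;
}

// Iterative traversal with an explicit stack instead of recursion.
// A node is skipped when a hit was already found at a t smaller than
// the node's entry t (the ordering early-out from the list above).
int traverse(const std::vector<BvhNode>& nodes,
             const float o[3], const float d[3]) {
    float invD[3] = {1.0f / d[0], 1.0f / d[1], 1.0f / d[2]};
    float bestT = FLT_MAX;
    int bestPrim = -1;
    int stack[64];
    int top = 0;
    stack[top++] = 0; // root
    while (top > 0) {
        const BvhNode& n = nodes[stack[--top]];
        float tBox = intersectBox(n, o, invD, bestT);
        if (tBox >= bestT) continue; // miss, or box entirely behind current hit
        if (n.prim >= 0) {
            // leaf: the real tracer calls the triangle test here; we use
            // the box entry t as a stand-in hit distance for illustration
            bestT = tBox;
            bestPrim = n.prim;
        } else {
            stack[top++] = n.right;
            stack[top++] = n.left; // left is popped (visited) first
        }
    }
    return bestPrim;
}
```

The stack-based form trades the call overhead of recursion for a small fixed array, which is also a prerequisite for tracing packets through the tree later.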

I am also planning to have a look at gcc's profile guided optimization. Acovea seems to be a fun thing to try as well.

Talking about low-level optimization, I recently read Randall Hyde's Write Great Code, Vol. I, and am now reading Vol. II. They are nice refreshers on assembly and low-level optimization, as well as nice summaries of what modern compilers do and, more importantly, do not do, even when one might expect them to.
The publicly available Software Optimization Guide for AMD64 Processors is similar in a way: it provides a list of do's and don'ts for those processors, many of which apply in general as well, and the nice thing is that there is an example for each case.

Another fun thing is superoptimizers, and the fact that perfect superoptimization is actually an NP-hard problem. That is of course not a surprise; the interesting thing is that most if not all fun games are also NP-hard: as soon as a game is found not to be NP-hard it is considered 'solved', and it is not really fun to play when one knows the solution.
This might explain why so many of us are attracted to optimization: it is actually NP-hard (fun!). The same goes for AI and other game-related technologies.

Basically, all we are trying to do is surf on the edge of current technology and push the most out of it, using algorithms that were not possible before but are almost possible now if well optimized.

Technological Singularity

I tend to agree with the critics though...

Wednesday, October 29, 2008

GI - Photon Mapping - Post 7 - 1 Million Rays

I am in the process of adding ray-packet tracing to my ray tracer, that from now on shall be known by the silly name 'Photonizer' (as a reminder that the ultimate goal is a photon mapper).

I decided, for the sake of 'going ahead faster', to support only triangles as primitives from now on (goodbye my ellipsoid test scene... sniff).
I changed the triangle intersector from Trumbore/Moller to Wald's optimized projection test.
I did this because I found it would be nicer to SIMDify (no cross products). A nice side effect, however, was that I also got a 25% speed-up from the new intersection code.
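The projection idea can be sketched like this. It is a simplified illustration, not Wald's exact precomputed layout (he precomputes the plane and the 2D edge terms per triangle): intersect the triangle's plane first, then drop the dominant axis of the normal and do the inside test as a 2D barycentric solve, with no cross products in the per-ray part.

```cpp
#include <cmath>

// Simplified projection-style ray/triangle test (a sketch, not Wald's
// exact precomputed data layout). Returns true on a hit and writes the
// hit distance to *tOut.
bool intersectTriangle(const float o[3], const float dir[3],
                       const float a[3], const float b[3], const float c[3],
                       float* tOut) {
    // edges and geometric normal; in Wald's scheme these terms are
    // precomputed once per triangle, not per ray
    float e1[3] = {b[0]-a[0], b[1]-a[1], b[2]-a[2]};
    float e2[3] = {c[0]-a[0], c[1]-a[1], c[2]-a[2]};
    float n[3]  = {e1[1]*e2[2] - e1[2]*e2[1],
                   e1[2]*e2[0] - e1[0]*e2[2],
                   e1[0]*e2[1] - e1[1]*e2[0]};

    float denom = n[0]*dir[0] + n[1]*dir[1] + n[2]*dir[2];
    if (std::fabs(denom) < 1e-12f) return false; // ray parallel to the plane
    float t = (n[0]*(a[0]-o[0]) + n[1]*(a[1]-o[1]) + n[2]*(a[2]-o[2])) / denom;
    if (t <= 0.0f) return false;

    // pick the 2D projection plane by dropping the normal's dominant axis k
    int k = 0;
    if (std::fabs(n[1]) > std::fabs(n[k])) k = 1;
    if (std::fabs(n[2]) > std::fabs(n[k])) k = 2;
    int u = (k + 1) % 3, v = (k + 2) % 3;

    // 2D barycentric inside test: a 2x2 solve, no cross products
    float hu = o[u] + t*dir[u] - a[u];
    float hv = o[v] + t*dir[v] - a[v];
    float det   = e1[u]*e2[v] - e1[v]*e2[u]; // equals n[k], never 0 here
    float beta  = (hu*e2[v] - hv*e2[u]) / det;
    float gamma = (e1[u]*hv - e1[v]*hu) / det;
    *tOut = t;
    return beta >= 0.0f && gamma >= 0.0f && beta + gamma <= 1.0f;
}
```

The per-ray arithmetic is all multiplies, adds and one divide, which is exactly the kind of code that maps cleanly onto 4-wide SSE later.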

After that I wrote a parallel version of the intersector that can take a 2x2 ray packet. This took a full day between researching, coding, and going through the Visual Studio SSE/SSE2 intrinsics documentation; some time was also wasted writing cpuid code for CPU feature detection.

I still have not parallelized the BVH traversal, but I was too eager not to try it out, even with only primary rays and without shading, and I was extremely pleased with the result.
I got a 4.5x speedup rendering the Cornell Box brute force, without any space partitioning, and for the first time I saw a 1M rays/sec stat on my console (copy-paste: 1040062.855307 rays/sec.).

So although there is still work to do, this is the 1M for the record post.

Monday, October 27, 2008

GI - Photon Mapping - Post 6

Finally the bunny...

Of course, the 'performance' is still absolutely lame; using the first BVH (an almost naive build, without SAH), I am getting:
1526622 rays, 25.317707 secs, 60298.589844 rays/sec

but still, without a BVH this would have taken hours if not days...

Saturday, October 25, 2008

GI - Photon Mapping - Post 5

Basic Features
Here is a list of renders representing the still very basic state of the ray tracer.
You can see here flat planes, spheres, quadrics and triangles; point lights; flat shading; Lambertian diffuse; Blinn-Phong specular; hard shadows; reflections; refractions; exposure; super sampling; Wavefront .obj loading (using the code by 'Micah Taylor' a.k.a. kixornet, found here); and finally multi-core rendering in 2 flavors (Task Thread Scheduler, OpenMP).

2 renders show artifacts (also known as surface acne).

Self Shadowing Artifacts
The first artifacts were caused by 'self shadowing': numerical imprecision causes the calculated intersection point to lie slightly inside the surface it intersects, so shooting a new ray from that point to determine light visibility causes an intersection with the surface itself again, a common problem... There is more than one solution to the problem, but I chose to avoid 'epsilons' whenever possible.
I solved this without epsilons, in a way that works for primitives that do not self-shadow, which is for now the case for all the supported primitives:

When a ray-primitive intersection function is called on a primitive (A), a flag is passed to indicate whether the ray's origin was itself obtained by an intersection with (A). For flat surfaces like planes and triangles I can directly conclude that no new intersection is possible.

For spheres and general quadrics I also use the flag. If the flag is false (the ray comes from another primitive) I switch to the normal intersection test (finding the smallest non-negative root). If the flag is true, however, I use a 'smarter' and a tiny bit more expensive algorithm, based on our knowledge that the ray's origin lies on the surface. We have 3 cases, based on the 2 possible roots (tMin and tMax):

1. We obtain 2 positive roots: this can only mean the imprecision pushed the point just outside the primitive with the ray's direction pointing into it (most probably a refraction ray), therefore we return tMax.

2. We obtain 2 negative roots: this can only mean the imprecision pushed the point just outside the primitive with the ray's direction pointing away from it (most probably a shadow or reflection ray), therefore there is no intersection.

3. We obtain roots with opposite signs: the point was pushed inside the primitive; in this case we signal an intersection only if the positive root is larger than the absolute value of the negative root.

This solves the problem for general quadrics (including spheres) without using epsilons or some iterative technique. Of course this might not be feasible for other types of surfaces; if I add such surfaces later I will add other techniques, and the technique will be chosen based on the primitive, or at least... that's the current plan...
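For spheres, the flag-based test described above can be sketched as follows. This is a minimal illustration of the three cases; `intersectSphere` and its parameter names are my labels for this post, not the actual Photonizer code.

```cpp
#include <cmath>

// Epsilon-free ray/sphere test following the three cases above.
// 'fromSelf' is the flag: true when the ray's origin was itself produced
// by an intersection with this same sphere.
// Returns true on a hit and writes the distance to *tOut.
bool intersectSphere(const float o[3], const float d[3],
                     const float center[3], float radius,
                     bool fromSelf, float* tOut) {
    float L[3] = {o[0]-center[0], o[1]-center[1], o[2]-center[2]};
    float a = d[0]*d[0] + d[1]*d[1] + d[2]*d[2];
    float b = 2.0f * (L[0]*d[0] + L[1]*d[1] + L[2]*d[2]);
    float c = L[0]*L[0] + L[1]*L[1] + L[2]*L[2] - radius*radius;
    float disc = b*b - 4.0f*a*c;
    if (disc < 0.0f) return false;
    float s = std::sqrt(disc);
    float tMin = (-b - s) / (2.0f*a);
    float tMax = (-b + s) / (2.0f*a);

    if (!fromSelf) { // normal test: smallest non-negative root
        if (tMax < 0.0f) return false;
        *tOut = (tMin >= 0.0f) ? tMin : tMax;
        return true;
    }
    // the origin is known to lie on the surface: classify by root signs
    if (tMin > 0.0f && tMax > 0.0f) { *tOut = tMax; return true; } // case 1
    if (tMax < 0.0f) return false;                                 // case 2
    // case 3: opposite signs; hit only if the positive root is larger
    // than the absolute value of the negative one
    if (tMax > -tMin) { *tOut = tMax; return true; }
    return false;
}
```

The spurious near-zero root (the surface the ray started on) is never compared against an epsilon; its sign alone, together with the sign of the other root, decides the case.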

Big Float / Small Float Loss of precision Artifacts
The second kind of artifact is seen in the Test Cornell Box render (Test because it uses one point light). The point lights are currently represented by spheres; this scene is 540 units in height, and the light sphere is created with a radius of 0.1 units. The problem occurred when shooting a ray from a surface point toward the light: the sphere was missed at times, because normalizing the (point to light) distance vector was causing enough loss of precision for the intersection test with the sphere to report no intersection! My first idea was to not normalize the direction vector. This eliminated the artifact but produced others: intersections were now being detected between the ray and nearby triangles. There were only very few of these artifacts, but still, they were there. So I went back to normalizing the direction vector and simply increased the radius of the light sphere. This is probably not the best solution; I will look into it some more...

Wednesday, October 8, 2008

GI - Photon Mapping - Update 4 (OpenMP)

I decided to make the code support 4 ways of handling multi-core systems:
  1. No Multi-Threading
  2. Do-It-Yourself (DIY) with all its disadvantages (using a multi-threaded task manager) which is what I already have
  3. OpenMP
  4. Intel TBB
I got the first OpenMP version working, but I will keep working on it a bit in the next sessions.

It seems pretty well suited for the task, and for now I simply parallelized the render loop, allowing each horizontal line to be processed by a different thread.
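A per-scanline parallel render loop in OpenMP looks roughly like this. The structure matches what is described above; `renderPixel` is a hypothetical stand-in for the actual ray tracing work, and the `schedule` clause is one knob for the granularity tuning discussed below.

```cpp
#include <vector>

// Hypothetical stand-in for the real per-pixel ray tracing work.
inline float renderPixel(int x, int y) { return float(x + y); }

// Render loop parallelized per horizontal line: each iteration of the
// outer loop may be processed by a different thread. The pragma is
// simply ignored when OpenMP is disabled, so the code stays portable.
void render(std::vector<float>& image, int width, int height) {
    // scanlines are independent and the scene is read-only while
    // rendering, so no synchronization is needed inside the loop
    #pragma omp parallel for schedule(dynamic, 1)
    for (int y = 0; y < height; ++y)
        for (int x = 0; x < width; ++x)
            image[y * width + x] = renderPixel(x, y);
}
```

`schedule(dynamic, 1)` hands out one scanline at a time, which balances uneven scenes at the cost of more scheduling overhead; `static` with larger chunks is the opposite trade-off.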

I played around tuning it to my machine and test scenes; logically, I got the best results by splitting the scene in 2, since I have 2 processors and the load is pretty balanced in these scenes.

This of course does not scale well; it would be nice to make the code scale better with the number of processors and adapt to the nature of the scene (load balance).
Simply increasing the task granularity (for example, splitting the scene into more chunks, or even going pixel by pixel) comes at the cost of more scheduling overhead, which is not negligible at all. Additionally, OpenMP scheduling directives are defined at compile time. Yes, the 'free lunch is over'...

I thought this would work out better with OpenMP than with the DIY solution, but for now I have the feeling both have to be fine-tuned to a specific type of machine and processor count.
OpenMP is 'standard' and takes care of all the argument-marshalling headaches; it also surely beats any experimental DIY solution in terms of robustness and stability, so unless really necessary it is of course the better choice. 'Really necessary' depends on the specific needs: OpenMP can be too simple and too limiting, making it unusable for many cases, but when it fits, it fits!

That said, I will still be trying to improve the OpenMP solution a bit before coding a basic TBB version.

Tuesday, October 7, 2008

The McCarthy Show (Building Great Teams)

If this URL says nothing to you, IMMEDIATELY head there and start reading, starting with 'The Core'.
Examples of application are also at

This should have been a course in my university!

One could quote the whole thing, but I just selected this one to get you interested: "Asking in time of trouble means you waited too long to ask for help. Ask for help when you are doing well."

Monday, October 6, 2008

GI - Photon Mapping - Day 3 (Multi-Threading)

Today I added multi-threading to the Ray tracing renderer.
On my dual-core machine at work, rendering with multi-threading enabled takes a bit more than half the time needed with multi-threading off, which is of course excellent.

The absolute times are still irrelevant since the rendering itself is not yet optimized (no spatial subdivision).

I added multi-threading by letting the renderer split the render area into a configurable number of regions, each region creating a RenderJob which is then sent to a TaskManager as a task.
The TaskManager, in turn, has a configurable number of threads waiting to execute any virtual tasks they get; it is all pretty robust and thread-safe (where it needs to be).
Of course, since we are only rendering, the whole scene is 'read only' during the job, so there is not much to be done in terms of synchronization; ray tracing is pretty parallel out of the box.

Sunday, October 5, 2008

GI - Photon Mapping - Day 2 (More Ray-Tracer Features)

I had some time to work on the ray tracer a bit more: I got specular, reflections, refractions and exposure working. I also added a simple camera control that enters a 'preview mode' rendering at slow but still interactive rates; this way I can move the camera around, and when I stop, the scene is rendered in normal (non-preview) mode.
One thing I noticed is how easy it is to debug artifacts, just find the coordinates of the problem pixel and follow the ray...
One thing I am currently working on is removing any magical tolerances. I have seen such tolerances used in tutorials for cases like this: a ray collides with a primitive, and you generate a new ray with the collision point as origin but a new direction; you do not want the new ray to be detected as colliding with the source primitive. It is possible to solve this by using a small tolerance to move the collision point to the outside of the surface. I do not like that, because those tolerances are usually scene dependent, determined both by the limits of floating-point precision AND by the metrics and nature of the scene. That is why I like to avoid such work-arounds whenever possible, even at the cost of some performance.
I changed the collision routines for the spheres to handle the cases cleanly without using any tolerances.

Next I will fix the plane intersection routines (for now all I have is spheres and planes) to handle the case as nicely.
After that, I will probably take a small break from adding 'features' and play around with some multi-core support for the ray tracer. Even if I did not add spatial subdivision yet, one important goal of this project is playing around with multi-core, so the sooner that happens the better, even if the 'single-core' version is not optimized yet.

A screenie for the record.