Monday, April 27, 2009

Screenshots of some of my work at AIGamedev.com



Many thanks to Alex Champandard; it has been (and still is) great to work with you.

Sunday, April 26, 2009

Framework1 / Quake3 level lightmaps

It is Sunday; instead of doing something fun (which I am regretting now that I see that it's 23:41 ...) I added lightmap support to the Quake3 rendering part of the framework. 20 years too late, but still, it looks kinda cool...

Can't wait for the framework to be halfway finished so I can start using it for graphics experiments that are not 'old hat'.


Lightmapped:


Detail texture only:

Lightmap texture only:

Detail+Lightmap textures:

The Lightmaps of the level (all 128x128):

Friday, April 24, 2009

Rendering Framework1 first video, Streaming Quake3

Aside from working on animation and AI related things at AIGamedev.com, which by the way just went live with a brand new website :), I am slowly working on an XNA rendering framework.

I figured it would be a good and time efficient idea to write a Quake3 map loader this time, since much content is directly available. All the nice geometry can then be used as a basis for graphics experiments.

This is the first version that successfully loads Quake3 levels; it took approximately 24 hours to write. It uses only colored vertices, but it already does something cool: leafs and their faces are streamed as needed rather than pre-loaded, and the streaming uses Generics and Reflection and is vertex format agnostic. Of course, streaming maps of this size (~100k vertices) does not really make sense, since they would all fit nicely in one very small vertex/index buffer. Still, one of the goals of the framework is to focus on streaming and parallelization (think submission engine), so the code is being designed around that, with the Quake3 levels providing the data.
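The streaming itself is C#/XNA, where generics plus reflection keep it vertex format agnostic. As a rough illustration of the on-demand idea only, here is a hypothetical C++ template analogue; every name in it is invented and it is not the framework's code:

    // Hypothetical C++ analogue of on-demand leaf streaming.
    // A template parameter stands in for the C# generics that make the
    // real code vertex format agnostic.
    #include <cstddef>
    #include <vector>

    struct VertexColored { float pos[3]; unsigned char rgba[4]; };

    template <typename Vertex>
    class LeafStreamer {
    public:
        explicit LeafStreamer(std::size_t leafCount) : resident_(leafCount, false) {}

        // Called when the camera enters a leaf: its faces are converted
        // and uploaded on demand instead of being pre-loaded.
        void touchLeaf(std::size_t leaf, const std::vector<Vertex>& faces) {
            if (resident_[leaf]) return;   // already streamed in
            upload(faces);                 // append to a shared vertex buffer
            resident_[leaf] = true;
        }

    private:
        void upload(const std::vector<Vertex>&) { /* vertex/index buffer write */ }
        std::vector<bool> resident_;
    };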

More technical details and source code will come as appropriate. For now, have fun with the (not very exciting, colored vertices only) video; at least it shows how the pre-baked radiosity in the vertex colors is already enough to create a sense of nice lighting.




References I used for the bsp format and loading:
The nice test maps are from:

Thursday, April 23, 2009

3D Transform-Matrix multiplication reference

There is a well known and well documented confusion about the order of matrix multiplication: row major, column major, C++ memory layout....
In general you just need to pay a bit of attention to detail to get it under control, once for each new library or SDK (or combination thereof) you get to deal with.

I dug up a small reference document I wrote some time ago, and added to it how XNA handles matrices, since I am doing a bit of XNA development.
It is not complete, so if you would like to add to it I will gladly send you the OpenOffice original.
If you find anything I wrote wrong, please let me know.
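For quick orientation, here is my from-memory summary of the two usual conventions (not a page from the reference document itself). With row vectors, the XNA/DirectX convention, a point transforms as $v' = v\,M$ and world matrices compose left to right:

    $M_{world} = S\,R\,T$   (scale, then rotate, then translate)

With column vectors, the classic OpenGL convention, the same transform reads $v' = M\,v$ and composes right to left:

    $M_{world} = T\,R\,S$

The two conventions are related by transposition: $(S\,R\,T)^\top = T^\top\,R^\top\,S^\top$.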



Saturday, April 18, 2009

Rocket science Tic-Tac-Toe testbed

It has been some time since the last post.

I have been busy working on the AIGamedev.com sandbox; some of the work was related to A*, hierarchical A*, terrain clustering, pathfinding, animation mirroring, locomotion planning ...

I also had the time to start my 'own private course' in academic Machine Learning thanks to the many good online resources and video lectures.

Remotely related, and since I decided to share much of what I do, no matter how simple or meaningless I might find it, here is a Tic-Tac-Toe AI testbed. It started as a 3 hour interview coding test, but then received some fun additions: an alpha-beta minimax AI adapted from Steve Chapel's code, and a reinforcement learning value iteration AI (my own 1 hour exercise in applying one of the very basic machine learning techniques).

http://jadnohra.net/release/TicTacToeTestbed_rev28.zip
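For readers who have never written one, here is what the alpha-beta minimax core of such an AI typically looks like. This is a generic sketch written from scratch, not the testbed's code (which adapts Steve Chapel's):

    // Minimal alpha-beta negamax for Tic-Tac-Toe: a generic sketch.
    #include <algorithm>

    char board[9];  // 'X', 'O' or ' '; a global keeps the sketch short

    bool won(char p) {
        static const int L[8][3] = {{0,1,2},{3,4,5},{6,7,8},{0,3,6},
                                    {1,4,7},{2,5,8},{0,4,8},{2,4,6}};
        for (const auto& l : L)
            if (board[l[0]] == p && board[l[1]] == p && board[l[2]] == p)
                return true;
        return false;
    }

    // Score for 'me' under optimal play: +1 win, 0 draw, -1 loss.
    int negamax(char me, char opp, int alpha, int beta) {
        if (won(opp)) return -1;        // the opponent's last move won
        int best = -2;                  // sentinel below any real score
        for (int i = 0; i < 9; ++i) {
            if (board[i] != ' ') continue;
            board[i] = me;
            best = std::max(best, -negamax(opp, me, -beta, -alpha));
            board[i] = ' ';
            alpha = std::max(alpha, best);
            if (alpha >= beta) break;   // alpha-beta cutoff
        }
        return best == -2 ? 0 : best;   // no legal moves left: a draw
    }

Calling negamax('X', 'O', -2, 2) on an empty board returns 0, the well-known draw.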


Friday, March 20, 2009

Writing modular software / There is no Perfection only Beauty

I have been working for a while now on the AIGamedev.com sandbox.
For the sandbox, we decided to put extra weight on the 'modular' constraint. This of course puts it in conflict with other constraints; I will try to explain why and how.

I like to base this on a nice and short but deep list summarizing good software design:
  1. Create crisp and resilient abstractions.
  2. Maintain a good separation of concerns.
  3. Create a balanced distribution of responsibilities.
  4. Focus on simplicity.

Taking modularity as our main constraint, we can explore the depth of the statement: create a balanced distribution of responsibilities. We quickly notice that there is a catch here, and therefore depth. The depth comes from the word 'balanced'. It is pretty vague, and that is how it should be.

I remember dealing with exactly that balancing problem while coding. At some point the code was so 'modular' that almost every single class did nothing on its own.
It was very modular, but it felt strange, and extreme modularity has its disadvantages like any other constraint. Balancing those conflicting constraints is what it practically comes down to; sometimes it is possible, but most of the time it is not.

To give an example we can take any physical piece of electronic equipment. Let us take a server as an example, or even better, a server rack.



In a server rack it is nice to think of one server as a module, and the first thing we see is ... the problem ... a big mess of cables. Even if organized, it is still pretty messy, but there is no way around it. This means: the higher the modularity, the more cables, and the more small extra things to hold those cables nicely bundled and manageable.

In software this would mean taking those modules and using groups of them together to let them do something useful. This is where you have many options as well, and flavors thereof: composition, aggregation, component based designs, virtual inheritance, delegates, functors... each with its own advantages and disadvantages.

Let us go one level deeper and examine the insides of one server: how modular is it? It is modular; we can see that simply from the fact that it has cables and removable components inside.
But we could ask for more. We could ask for the motherboard itself to be more modular, by demanding to be able to replace the chipset chip or the secondary cache chips; for a graphics card we might require the ability to replace the RAM easily. But why are these parts not modular? Of course, because of other constraints: performance, cost ...



In software it might be a bit more difficult to judge the limit where modularization should stop. Eventually, every single statement that is repeated somewhere in the code could be 'modularized', but then the amount of screws, pipes, cables and glue would become larger and much more complicated than the code that actually does something.

It is a tricky balance, and an important one. There is no recipe for it, but there are hints we get from the modules themselves ... Does a module need too many cables (function parameters, for example) to do something useful? Do I need 10 modules (classes) and a similar number of pipes (each module holding a pointer to some other module) to build the smallest module that actually achieves something at the level of detail I am working with? If yes, then it is time to redesign. The price can sometimes be duplication, but we have to learn to accept duplication sometimes.
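To make the 'cables and pipes' hint concrete, here is a deliberately contrived C++ sketch; all names are invented:

    // Over-modularized: the smallest useful unit needs several modules,
    // pointer-'pipes' between them, and many parameter-'cables'.
    struct Clamper   { float apply(float v, float lo, float hi) const { return v < lo ? lo : (v > hi ? hi : v); } };
    struct Scaler    { float apply(float v, float s) const { return v * s; } };
    struct BiasAdder { float apply(float v, float b) const { return v + b; } };

    struct OverModularRemap {
        const Clamper*   clamper;   // pipe
        const Scaler*    scaler;    // pipe
        const BiasAdder* biaser;    // pipe
        float process(float v, float lo, float hi, float s, float b) const {
            return biaser->apply(scaler->apply(clamper->apply(v, lo, hi), s), b);
        }
    };

    // More balanced: one module with one clear responsibility, at the
    // price of duplicating the clamp if someone else needs it too.
    struct Remap {
        float lo, hi, scale, bias;
        float process(float v) const {
            const float c = v < lo ? lo : (v > hi ? hi : v);
            return c * scale + bias;
        }
    };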

We can see from all of this that something that is modular and nice from one point of view (or level of detail) might become a pain to work with from another one. It is all really relative; there is no one design that can look good at all levels of detail, I am convinced of that.

The more levels of detail we want to cover, and the more general we want to be, the more we lose the ability to be optimal for any single task or level of detail. So optimizing for the task at hand, at the level of detail at hand, will always be inevitable, unless we have infinite resources, which I do not see coming in the next few infinitely many years. So remember: there is no perfection, only beauty. Or: only strive for perfection up to the visible horizon (beauty).

Tuesday, January 27, 2009

Beam me up Scotty, update...

One small step for a man, one giant leap for teleportation:
http://news.cnet.com/8301-17938_105-10150272-1.html

Tuesday, December 30, 2008

Game STL ...... released ....

Do you know "Legendary Megadeth"'s song 'Holy Wars'?
Yes, Dave Mustaine was and always will be a prophet!
The words "Game STL" alone are enough to start an atomic war.
Enjoy the carnage at: http://www.gamedev.net/community/forums/topic.asp?topic_id=518784

Friday, October 31, 2008

GI - Photon Mapping - Post 8 - Optimization Summary / Rant



In the last 3 days all the work has been on optimization. Below is a summary of the changes and the speedups they produced (speedup = old render time / new render time):
  • Changing intersector from Trumbore/Moller to Optimized Wald Projection -> 1.25
  • 2x2 Packet Tracing (SSE/SSE2 using SIMD intrinsics) -> 3.0-4.0
  • Including ordering information into traversal (do not traverse a BVH box if an intersection was already found in previous boxes at a t nearer than the box) -> Mono: 1.8, 2x2 -> 1.3
  • SIMDifying primary ray packet generation (instead of looping to generate the 4 rays) -> 1.05
  • Changing BVH traversal from recursive to iterative (a sketch follows below): Mono->1.04, 2x2->soon
  • Changing BVH memory allocation scheme to improve traversal cache friendliness: soon
  • Changing BVH build algorithm (to SAH?): soon
Some if not all speedups are of course scene dependent; some were tested with the Stanford Bunny, others with the Cornell Box, so the numbers here are very rough and relative. Additionally, some changes might produce slowdowns in some scenes/configurations, but this is just a simple summary. After I have all of the high level optimizations done, I will probably try toxie's test benchmark and then have a go at lower level optimizations using a profiler, but for that I am waiting for my new hardware, on which I am planning to have Linux installed; I will then be able to comfortably use valgrind, gprof or VTune.
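Since the recursive-to-iterative change is listed above, here is a minimal sketch of what the iterative formulation usually looks like, with an explicit stack instead of recursion. The types are invented and the triangle test is assumed to exist elsewhere (e.g. the Wald projection test); this is not the renderer's code:

    #include <algorithm>

    struct BVHNode {
        float boxMin[3], boxMax[3];
        int left, right;          // child indices, -1 when this is a leaf
        int firstTri, triCount;   // triangle range for leaves
    };

    struct Hit { float t; int tri; };

    // Assumed to exist elsewhere (e.g. the Wald projection test):
    bool intersectTri(int tri, const float org[3], const float dir[3], Hit& hit);

    // Standard slab test against the node's box, pruned by the best t so far.
    bool intersectBox(const BVHNode& n, const float org[3],
                      const float invDir[3], float tBest) {
        float t0 = 0.0f, t1 = tBest;
        for (int a = 0; a < 3; ++a) {
            float tNear = (n.boxMin[a] - org[a]) * invDir[a];
            float tFar  = (n.boxMax[a] - org[a]) * invDir[a];
            if (tNear > tFar) std::swap(tNear, tFar);
            t0 = std::max(t0, tNear);
            t1 = std::min(t1, tFar);
            if (t0 > t1) return false;
        }
        return true;
    }

    void traverse(const BVHNode* nodes, const float org[3],
                  const float dir[3], const float invDir[3], Hit& hit) {
        int stack[64];                                  // explicit stack
        int top = 0;
        stack[top++] = 0;                               // push the root
        while (top > 0) {
            const BVHNode& n = nodes[stack[--top]];
            if (!intersectBox(n, org, invDir, hit.t))   // also culls boxes
                continue;                               // behind the best hit
            if (n.left < 0) {                           // leaf: test triangles
                for (int i = 0; i < n.triCount; ++i)
                    intersectTri(n.firstTri + i, org, dir, hit);
            } else {                                    // inner: push children
                stack[top++] = n.left;
                stack[top++] = n.right;
            }
        }
    }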

I am also planning to have a look at gcc's profile guided optimization. Acovea seems to be a fun thing to try as well.

Talking about low level optimization: I recently read Randall Hyde's Write Great Code Vol. I, and am reading Vol. II. They are nice refreshers on assembly and low level optimization, as well as nice summaries of what modern compilers do, and more importantly do not do, although one might expect it.
The publicly available Software Optimization Guide for AMD64 Processors is similar in a way, in that it provides a list of do's and don'ts for the mentioned processors, though many things apply in general as well; the nice thing is that there is an example for each case.

Another fun thing is superoptimizers, and the fact that perfect superoptimization is actually an NP-hard problem. That is of course not a surprise; the interesting thing is that most if not all fun games are also NP-hard: as soon as a game is found not to be NP-hard it is considered 'solved', and it is not really fun to play when one knows the solution. This might explain why so many of us are attracted to optimization: it is actually NP-hard (fun!). The same goes for AI and other game related technologies.

Basically all we are trying to do is to surf on the edge of the current technology and try to push the most out of it by using algorithms that were not possible before but are almost possible now if optimized.

Technological Singularity

http://en.wikipedia.org/wiki/Technological_singularity

I tend to agree with the critics though...

Wednesday, October 29, 2008

GI - Photon Mapping - Post 7 - 1 Million Rays

I am in the process of adding ray-packet tracing to my ray tracer, that from now on shall be known by the silly name 'Photonizer' (as a reminder that the ultimate goal is a photon mapper).

I decided, for the sake of 'going ahead faster', to support only triangles as primitives from now on (goodbye my ellipsoid test scene... sniff).
I changed the triangle intersector from Trumbore/Moller to Wald's optimized projection test.
I did this because I found it would be nicer to SIMDify (no cross products). A nice side effect was that I also got a 25% speed-up with the new intersection code.

After that I wrote a parallel version of the intersector that can take a 2x2 ray packet. This took a full day between researching, coding and going through the Visual Studio SSE/SSE2 intrinsics documentation; some time was also wasted writing cpuid code for CPU feature detection.
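To give a taste of what SIMDifying with intrinsics looks like, here is a minimal 4-wide example: four rays, stored structure-of-arrays, tested against a single plane. This is only the basic SSE pattern, not the actual 2x2 Wald packet intersector:

    #include <xmmintrin.h>

    struct RayPacket4 {           // four rays, structure-of-arrays layout
        __m128 ox, oy, oz;        // origins
        __m128 dx, dy, dz;        // directions
    };

    // For the plane dot(n, x) + d = 0, the hit distance is
    // t = -(dot(n, o) + d) / dot(n, dir), computed for 4 rays at once.
    __m128 planeHitT(const RayPacket4& r, float nx, float ny, float nz, float d) {
        const __m128 Nx = _mm_set1_ps(nx), Ny = _mm_set1_ps(ny), Nz = _mm_set1_ps(nz);
        __m128 denom = _mm_add_ps(_mm_mul_ps(Nx, r.dx),
                       _mm_add_ps(_mm_mul_ps(Ny, r.dy), _mm_mul_ps(Nz, r.dz)));
        __m128 num   = _mm_add_ps(_mm_mul_ps(Nx, r.ox),
                       _mm_add_ps(_mm_mul_ps(Ny, r.oy),
                       _mm_add_ps(_mm_mul_ps(Nz, r.oz), _mm_set1_ps(d))));
        return _mm_div_ps(_mm_sub_ps(_mm_setzero_ps(), num), denom);
    }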

I still have not parallelized the BVH traversal, but I was too eager to try it out, even with only primary rays and without shading, and I was extremely pleased with the result.
I got a 4.5x speedup rendering the Cornell Box brute force without any space partitioning, and for the first time I saw a 1M rays/sec stat on my console (copy-paste: 1040062.855307 rays/sec.).

So although there is still work to do, this is the 1M for the record post.

Monday, October 27, 2008

GI - Photon Mapping - Post 6


Finally the bunny...

Of course, the 'performance' is still absolutely lame; using the first BVH (almost naive building, without SAH) tests I am getting:
1526622 rays, 25.317707 secs, 60298.589844 rays/sec

but still, without BVH this would have taken hours if not days...
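For readers wondering what 'almost naive building without SAH' looks like, here is a rough generic sketch, not the actual code: a median split, with round-robin axis choice standing in for picking the longest axis, and all types invented:

    #include <algorithm>
    #include <vector>

    struct Tri { float centroid[3]; /* vertex data omitted */ };

    struct BuildNode {
        // bounding box computation omitted for brevity
        BuildNode* child[2] = { nullptr, nullptr };
        int first = 0, count = 0;   // leaf: triangle range in 'tris'
    };

    // Median split, no SAH: partition the triangles so the median element
    // along 'axis' ends up in the middle, then recurse on both halves.
    BuildNode* build(std::vector<Tri>& tris, int first, int count, int axis) {
        BuildNode* n = new BuildNode;
        if (count <= 4) {                 // small enough: make a leaf
            n->first = first;
            n->count = count;
            return n;
        }
        int mid = first + count / 2;
        std::nth_element(tris.begin() + first, tris.begin() + mid,
                         tris.begin() + first + count,
                         [axis](const Tri& a, const Tri& b) {
                             return a.centroid[axis] < b.centroid[axis];
                         });
        n->child[0] = build(tris, first, mid - first, (axis + 1) % 3);
        n->child[1] = build(tris, mid, first + count - mid, (axis + 1) % 3);
        return n;
    }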

Saturday, October 25, 2008

GI - Photon Mapping - Post 5

Basic Features
Here is a list of renders representing the still very basic state of the ray tracer.
You can see here flat planes, spheres, quadrics and triangles, point lights, flat shading, Lambertian diffuse, Blinn-Phong specular, hard shadows, reflections, refractions, exposure, super sampling and wavefront .obj loading (using the code by 'Micah Taylor' a.k.a. kixornet, found here), and finally multi core rendering in 2 flavors (Task Thread Scheduler, OpenMP).

2 renders show artifacts (also known as surface acne).

Self Shadowing Artifacts
The first artifacts were caused by 'self shadowing': numerical imprecision causes the calculated intersection point to lie inside the surface it intersects, so shooting a new ray from the intersection point to determine light visibility causes an intersection with the surface itself again; a common problem... There is more than one solution, but I chose to avoid 'epsilons' whenever possible.
I solved this without epsilons in a way that works for primitives that do not self shadow, which is for now the case for all the supported primitives:

When a ray-primitive intersection function is called on primitive (A), a flag is passed in to indicate whether the ray's origin was also obtained by an intersection with (A). For flat surfaces like planes and triangles I can directly conclude that no new intersection is possible.

For spheres and general quadrics I also use the flag. If the flag is false (the ray originates from another primitive) I switch to the normal intersection test (finding the smallest non-negative root). However, if the flag is true, I use a 'smarter' and a tiny bit more expensive algorithm, based on our knowledge that the ray's origin is on the surface. We have 3 cases, based on the 2 possible roots (tMin and tMax):

1. We obtain 2 positive roots: this can only mean the imprecision pushed the intersection point into the primitive with the ray's direction pointing out of the primitive (most probably a refraction ray); therefore we return tMax.

2. We obtain 2 negative roots: this can only mean the imprecision pushed the intersection point out of the primitive with the ray's direction pointing into the primitive (most probably a light or reflection ray); therefore there is no intersection.

3. We obtain roots with opposite signs: the point was pushed inside the primitive; in this case we signal an intersection only if the positive root is larger than the absolute value of the negative root.

This solves the problem for general quadrics (including spheres) without using epsilons or some iterative technique. Of course this might not be feasible for other types of surfaces; if I add such surfaces later I will add other techniques and choose the technique based on the primitive, or at least ... that's the current plan...
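In code, the three cases boil down to a few sign checks. A compact sketch, with invented names, where tMin <= tMax are the two real roots of the quadratic obtained by substituting the ray into the quadric's equation:

    // Resolves a hit for a ray whose origin lies on this quadric's surface.
    // Returns true and the hit distance t if there is an intersection.
    bool resolveOnSurfaceHit(float tMin, float tMax, float& t) {
        if (tMin > 0.0f && tMax > 0.0f) {   // case 1: pushed inside, ray
            t = tMax;                       // points out (refraction ray)
            return true;
        }
        if (tMin < 0.0f && tMax < 0.0f)     // case 2: pushed outside, ray
            return false;                   // points away: no intersection
        // case 3: opposite signs, the point was pushed inside; hit only
        // if the positive root dominates the negative one.
        if (tMax > -tMin) { t = tMax; return true; }
        return false;
    }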

Big Float / Small Float Loss of Precision Artifacts
The second kind of artifact is seen in the test Cornell Box render ('test' because it uses one point light). The point lights are currently represented by spheres; this scene is 540 units in height, and the light sphere is created with a radius of 0.1 units. The problem was occurring because, when shooting a ray from a surface point toward the light, the sphere was missed at times. This was due to the fact that normalizing the (point to light) distance vector was causing enough loss of precision for the intersection test with the sphere to report no intersection! My first idea was to not normalize the direction vector; this eliminated the artifact but produced others: now intersections were being detected between the ray and nearby triangles. There were only very few of those artifacts, but still, they were there. So I went back to normalizing the direction vector and simply increased the radius of the light sphere. This is probably not the best solution; I will look into it some more ...






Wednesday, October 8, 2008

GI - Photon Mapping - Update 4 (OpenMP)

I decided to make the code support 4 ways of handling multi-core systems:
  1. No Multi-Threading
  2. Do-It-Yourself (DIY), with all its disadvantages, using a multi-threaded task manager (which is what I already have)
  3. OpenMP
  4. Intel TBB
I got the first OpenMP version working, but I will keep working on it a bit in the next sessions.

It seems pretty well suited to the task; for now I simply parallelized the render loop, allowing each horizontal line to be processed by a different thread.
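In OpenMP this is essentially one pragma on the scanline loop. A minimal sketch with invented names; the schedule clause is exactly the granularity knob discussed below:

    #include <omp.h>

    // Each scanline is one unit of work; the scene is read-only while
    // rendering, so the threads need no per-pixel synchronization.
    void renderImage(int width, int height) {
    #pragma omp parallel for schedule(dynamic)
        for (int y = 0; y < height; ++y) {
            for (int x = 0; x < width; ++x) {
                // tracePrimaryRay(x, y);  // hypothetical per-pixel work
            }
        }
    }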

I played around tuning it for my machine and test scenes. Logically, I got the best results by splitting the scene in 2: I have 2 processors and the load is pretty balanced in these scenes.

This of course does not scale well; it would be nice to make the code scale with the number of processors and with the nature of the scene (load balance).
Simply increasing the task granularity (for example splitting the scene into more chunks, or even going pixel by pixel) comes at the cost of more scheduling overhead, which is not negligible at all; additionally, scheduling directives for OpenMP are defined at compile time. Yes, the 'free lunch is over'....

I thought this would work out better in OpenMP than in the DIY solution, but for now I have the feeling both have to be fine tuned to a specific type of machine and processor count.
OpenMP is 'standard' and takes care of all the argument marshalling headaches; it also surely beats any experimental DIY solution in terms of robustness and stability, so unless really necessary it is of course the better choice. 'Really necessary' depends on the specific needs: OpenMP can be too simple and too limiting, making it unusable for many cases, but when it fits, it fits!

That said, I will still be trying to improve the OpenMP solution a bit before coding a basic TBB version.

Tuesday, October 7, 2008

The McCarthy Show (Building Great Teams)

http://www.mccarthyshow.com/

If this URL says nothing to you, IMMEDIATELY head there and start reading, starting with 'the Core'.
Examples of application are also at http://aigamedev.com/team-dynamics/core-protocols

This should have been a course in my university!

One could quote the whole thing, but I just selected this one to make you interested: "Asking in time of trouble means you waited too long to ask for help. Ask for help when you are doing well."

Monday, October 6, 2008

GI - Photon Mapping - Day 3 (Multi-Threading)

Today I added multi-threading to the ray tracing renderer.
On my dual core machine at work, rendering with multi-threading enabled takes a bit more than half the time needed with multi-threading off, which is of course excellent.

The absolute times are still irrelevant since the rendering itself is not yet optimized (no spatial subdivision).

I added multi-threading by allowing the renderer to split the render area into a configurable number of regions, each region creating a RenderJob which is then sent to a TaskManager as a task.
The TaskManager, in turn, has a configurable number of threads waiting to execute any virtual tasks they get; it is all pretty robust and thread safe (where it needs to be).
Of course, since we are only rendering, the whole scene is 'read only' during the job, so there is not much to be done in terms of synchronization; ray tracing is pretty parallel out of the box.
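As a rough sketch of the RenderJob/TaskManager split (all details invented; the original predates C++11, but std::thread keeps the sketch short, and a real implementation would use a condition variable instead of this busy wait):

    #include <atomic>
    #include <functional>
    #include <mutex>
    #include <queue>
    #include <thread>
    #include <vector>

    class TaskManager {
    public:
        explicit TaskManager(unsigned threadCount) {
            for (unsigned i = 0; i < threadCount; ++i)
                workers_.emplace_back([this] { workLoop(); });
        }
        // A RenderJob covering one screen region would be submitted here,
        // wrapped in a callable.
        void submit(std::function<void()> task) {
            std::lock_guard<std::mutex> lock(mutex_);
            tasks_.push(std::move(task));
        }
        ~TaskManager() {
            running_ = false;
            for (auto& w : workers_) w.join();
        }
    private:
        void workLoop() {
            while (running_) {
                std::function<void()> task;
                {
                    std::lock_guard<std::mutex> lock(mutex_);
                    if (tasks_.empty()) continue;   // busy wait, for brevity
                    task = std::move(tasks_.front());
                    tasks_.pop();
                }
                task();   // the scene is read-only here, so no further locking
            }
        }
        std::vector<std::thread> workers_;
        std::queue<std::function<void()>> tasks_;
        std::mutex mutex_;
        std::atomic<bool> running_{true};
    };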