On Programming

A discussion of programming strategies and results

Custom Binary Heap

| Comments

As a games programmer, one of the basic tasks that everyone accomplishes is implementing A* Path Finding. First, let me say that path finding is not basic in video games, but the underlying concepts like A* can be. One way to improved the speed of A* Path Finding is to store the open list in a binary heap. Lucky for us there is already one in the standard library called priority queue. Upon further inspection I noticed some limitations of the priority queue.

  1. There was no find() or contains() function to see if a node was already in the open list
  2. There was no way to re-sort the list after invalidating it by changing the parent node of a node already on the open list

To fix this I implemented my own binary heap with these options available. They should be used sparingly because they are not fast operations for binary heaps, but I feel like they should still be available. Below is my code for the binary heap. Please note that I use C++11 features in this code so you will need a modern compiler like Clang 3.1.

bheap.h
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
  #pragma once
  #include <functional>
  #include <vector>

  template <class T>
  class bheap
  {
  public:
    /** 
     * Initialize the bheap with a function to compare the entries and an expected size to reduce allocations during insertion
     * 
     * @param _compare A function to compare the entries. If compare(a,b) returns true, A is located higher up in the queue to be popped off first
     * @param expected_size The amount of space to allocate for values
     */
    bheap(std::function< bool(T& a, T& b) > _compare = [](T& a, T& b) {return a < b;}, size_t expected_size = 100) :
      compare(_compare)
    {
      heap.reserve(expected_size);
    }
    /** 
     * Initialize the bheap with a list of initial values, a function to compare the entries and an expected size to reduce allocations during insertion
     * 
     * @param input An initializer list of values to be inserted into the heap
     * @param _compare A function to compare the entries. If compare(a,b) returns true, A is located higher up in the queue to be popped off first
     * @param expected_size The amount of space to allocate for values
     */
    bheap(const std::initializer_list<T> & input, std::function< bool(T& a, T& b) > _compare = [](T& a, T& b) {return a < b;}, size_t expected_size = 100) :
      compare(_compare)
    {
      heap.reserve((expected_size > input.size() ? expected_size : input.size()));
      for (const T& val : input)
        heap.push_back(val);
      sort();
    }
    ~bheap()
    {

    }

    /** 
     * Check to see if the heap has no values
     * 
     * 
     * @return true if it has no values, otherwise false
     */
    bool empty() const
    {
      return heap.empty();
    }

    /** 
     * Get the size of the heap
     * 
     * 
     * @return The size of the heap
     */
    size_t size() const
    {
      return heap.size();
    }

    /** 
     * Add a new value to the heap in the correct position
     * 
     * @param value The new value to add
     */
    void push(const T& value)
    {
      heap.push_back(value);
      percolate_up(heap.size()-1);
    }

    /** 
     * Get the lowest value in the heap
     * 
     * Gets the lowest value in the heap. This is determined by finding which value returns true compared against all other items when passed as the first parameter to the compare function. By defaul the compare function performs a simple operator<()
     * 
     * 
     * @return The lowest value in the heap
     */
    const T& top() const
    {
      return heap[0];
    }

    /** 
     * Remove the lowest value from the heap
     * 
     */
    void pop()
    {
      heap[0] = heap[heap.size()-1];
      heap.pop_back();
      percolate_down(0);
    }

    /** 
     * Fully sort the heap assuming nothing about the current state. This operation is expensive so only call it if necessary. The pop and push functions will keep the heap sorted on their own.
     * 
     */
    void sort()
    {
      if (empty())
        return;
      for (size_t i = (heap.size()-1)/2; i != -1; --i)
        percolate_down(i);
    }

    /** 
     * Find the first entry that matches value.
     * 
     * Find the first entry that matches value. If there are multiple entries with the same value there is no guarantee which one will be returned. Please note that if you change a value that affects the sorting then you must also make a call to sort()
       * 
       * @param value The value to look for
       * 
       * @return A pointer to the value in the heap matching the input value or nullptr if none is found
       */
    T* find(const T& value)
    {
      for (T& val : heap)
        if (val == value)
          return &val;
      return nullptr;
    }

    /** 
     * Remove a value from the heap.
     * 
     * Remove a value from the heap given a pointer to its location. This will call sort() automatically since the sorting of the heap will be destroyed. To remove by value, use find()
     * 
     * @param value A pointer to the value in the heap that needs to be removed.
     */
    void remove(const T* value)
    {
      bool found = false;
      for (auto it = heap.begin(); it != heap.end(); ++it)
        {
          if (&*it == value)
            {
              heap.erase(it);
              found = true;
              break;
            }
        }
      if (found)
        sort();
    }
  private:
    void percolate_down(size_t i)
    {
      while (2*i + 1 < heap.size())
        {
          size_t left_child = 2*i + 1;
          size_t right_child = 2*i + 2;
          size_t parent = (i-1)/2;
          size_t last_leaf = heap.size()-1;
          size_t child;
          if (right_child < heap.size() && compare(heap[right_child], heap[left_child]))
            child = right_child;
          else
            child = left_child;
          if (compare(heap[child], heap[i]))
            {
              //swap
              T tmp = heap[i];
              heap[i] = heap[child];
              heap[child] = tmp;
              i = child;
            } else {
            break;
          }
        }
    }

    void percolate_up(size_t i)
    {
      size_t parent = (i-1)/2;
      while (parent < i)
        {
          if (compare(heap[i], heap[parent]))
            {
              T tmp = heap[i];
              heap[i] = heap[parent];
              heap[parent] = tmp;
              i = parent;
              parent = (i-1)/2;
            } else {
            break;
          }
        }
    }

    std::function< bool(T& a, T& b) > compare;
    std::vector<T> heap;
  };

Runtime CPU Feature Checking

| Comments

Writing game engines is all about getting the most performance out of your machine. It is always on the bleeding edge pushing the limits of today’s hardware, but in the PC market we can’t forget about those who have older machines. This leaves two options: write for old machines and end up with a game looking like it was written half a decade ago or write for both old and new machines.

The choice is obviously the latter, but how do we detect what the end user’s CPU can handle? Enter CPUID. CPUID is an assembly instruction used to identify all the fun extensions on the processor like SSE and AES-NI. Luckily, we can embed assembly directly inside C and C++ so this information is actually easy to acquire. First we start with the function from wikipedia:

1
2
3
4
5
6
7
8
9
10
11
  void cpuid(unsigned info, unsigned *eax, unsigned *ebx, unsigned *ecx, unsigned *edx)
  {
    *eax = info;
    __asm volatile
      ("mov %%ebx, %%edi;" /* 32bit PIC: don't clobber ebx */
       "cpuid;"
       "mov %%ebx, %%esi;"
       "mov %%edi, %%ebx;"
       :"+a" (*eax), "=S" (*ebx), "=c" (*ecx), "=d" (*edx)
       : :"edi");
  }
Now all we need is some helper functions to make this much easier to read and use. Below is my helper function to detect if the user has the hardware accelerated AES instruction set:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
  bool has_aes_ni()
  {
    unsigned int eax, ebx, ecx, edx;
    cpuid(1, &eax, &ebx, &ecx, &edx);
    return ((edx & 0x2000000) != 0);
  }

  int main()
  {
    if (has_aes_ni())
      printf("AES-NI Detected!\n");
    else
      printf("You do not have AES-NI support!\n");
    return 0;
  }
Naturally theres a lot of CPU extensions to look through and write helper functions for, but this should get you started. After you’ve detected the CPU features its a simple matter of redirecting functions calls to either the optimized or fallback functions. Happy coding!

On Game Object Architecture

| Comments

This post will serve as a fact finding mission. In class my professor described two object models for video games: Object-Centric Architecture and Property-Centric Architecture. In class it was suggested that while Property-Centric Architecture is less intuitive to write, it will run faster due to less cache misses by packing all the relevant data into the same block of memory. I decided that there was a third option which I will call Hook-Centric Architecture which I believed would be the fastest through avoiding branches. To run this test I am going to write a small C++ program to perform Explicit Euler Numerical Integration.

Object-Centric Architecture

This is your classic design of objects. All the properties relating to the object are stored in the object itself. Below is the structure I used.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
  typedef struct
  {
      double x;
      double y;
      double z;
  } triplet;

  typedef struct
  {
      triplet acceleration;
      triplet velocity;
      triplet position;
      bool in_use;                /**< Only used for Property-Centric */
  } transform;

  typedef struct
  {
      bool has_transform;
      transform* p;
      float mass;
      void* model;                /**< Since this is just a test we can use void* */
      void* texture;
  } object;
You’ll notice that this closely mimics the classic object oriented design which closely mimics real life objects.

Property-Centric Architecture

The idea behind the Property-Centric Architecture is that all the related data is packed together in contiguous memory in order to avoid cache misses. For this test we don’t need to actually create the object structs since I am only performing operating on the transform.

1
2
3
4
5
6
7
  transform* transform_objects;

  void init_property_centric()
  {
      transform_objects = new transform[MAX_NUM_OF_OBJECTS];
      //Perform other initialization
  }
You’ll notice that since this data is allocated in an array it will be in one contiguous block in memory, whereas depending on how I allocate the Object-Centric objects they could be scattered across the full spectrum of memory.

Hook-Centric Architecture

The idea here is that every object would store all the fields it owns just like in Object-Centric Architecture, but now the fields would only be processed by putting function pointers into a long list of functions to get executed at every frame. This way only the objects with the specified property will spend cycles calculating the changes. Allow me to illustrate:

Classic Update

1
2
3
4
5
6
7
  for (object & o : classic_objects) //C++11 range-based for loop
  {
      if (o.has_transform) // BRANCHING!
      {
          update_transform(o);
      }
  }

Hook Update

1
2
3
4
5
  //Passing in a double representing number of milliseconds since last frame
  for (function<void (double)> & f : callbacks)
  {
      f(time);
  }

Explanation

As you can see, the Object-Centric Archtecture introduces branching through its if statement, whereas the Hook-Centric Architecture can blindly chug along processing everything in its path.

Running The Tests

For all three architectures I initialized the transform object using this function:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
  /** 
   * Initialize the phyics objects with blank data and random acceleration
   * 
   * @param data The transform object to initialize
   */
  inline void init_transform(transform & data)
  {
      data.position.x = 0;
      data.position.y = 0;
      data.position.z = 0;
      data.velocity.x = 0;
      data.velocity.y = 0;
      data.velocity.z = 0;
      data.acceleration.x = ((double)(rand()%100)) / 10.0;
      data.acceleration.y = ((double)(rand()%100)) / 10.0;
      data.acceleration.z = ((double)(rand()%100)) / 10.0;
  }
To perform the integration I used this function:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
  /**
   * Perform explicit euler numeric integration
   * 
   * @param data The transform object
   * @param time The amount of time in seconds since last run
   */
  inline void explicit_euler_numeric_integration(transform & data, const double & time)
  {
      data.velocity.x += data.acceleration.x * time;
      data.velocity.y += data.acceleration.y * time;
      data.velocity.z += data.acceleration.z * time;
      data.position.x += data.velocity.x * time;
      data.position.y += data.velocity.y * time;
      data.position.z += data.velocity.z * time;
  }
The tests were run on a Intel(R) Core(TM) i7-2630QM CPU @ 2.00GHz with 200000 objects created for each test, and each test went through 10000 integrations. Before performing the test I made sure to disable all CPU freqency scaling since the processor will be running at full blast inside a game anyway.
1
  for i in 0 1 2 3 4 5 6 7; do sudo cpufreq-set -c $i -g performance; done

Results

Times are listed in number of seconds

No Compiler Optimizations

Object-CentricProperty-CentricHook-Centric
4531135
4430135
4531136
4629137
4531135

Compiler Optimizations

Object-CentricProperty-CentricHook-Centric
362928
342929
332928
333026
363028

Analysis

As you can see, the Hook-Centric model without compiler optimizations is just awful, but once you put on compiler optimizations it suddenly becomes the best option. I’ve uploaded the code for these tests to http://static.paphus.com/blog/architecture_test.tar.bz2. I encourage everyone to try it out for themselves. The comment section is open for anyone who can enlighten us further or provide another alternative!

To compile the code install premake and run

1
2
3
4
  premake4 gmake
  make config=debug #no optimizations
  # -or-
  make config=release #compiler optimizations

Octopress and Org-Mode

| Comments

OUT OF DATE

WARNING This post is out of date. A newer infinitely better method is located at http://blog.paphus.com/blog/2012/08/01/introducing-octopress-blogging-for-org-mode/.

I am making the switch to octopress for my blogging platform. Those of you that know me know that I am an avid user of Emacs Org-Mode so naturally I want to use it to write my blog.

The first tutorial I followed is located at http://jaderholm.com/blog/blogging-with-org-mode-and-octopress. It managed to get me up and running but I noticed a few issues with it. The first issue was that it didn’t handle exported babel blocks like dot/graphviz. To fix that I needed to change two items in my .emacs file. First I define a save-then-publish function which I found online. Sadly I lost the original location.

1
2
3
4
5
  (defun save-then-publish ()
    (interactive)
    (save-buffer)
    (org-save-all-org-buffers)
    (org-publish-current-project))
Then I declared an additional project that handles the publishing of image files to the proper directory to be synced with rsync.
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
  (setq org-publish-project-alist
        '(("blog-org" .  (:base-directory "~/blog/source/org_posts/"
                                          :base-extension "org"
                                          :publishing-directory "~/blog/source/_posts/"
                                          :sub-superscript ""
                                          :recursive t
                                          :publishing-function (set-octopress org-publish-org-to-html unset-octopress)
                                          :headline-levels 4
                                          :html-extension "markdown"
                                          :body-only t))
          ("blog-extra" . (:base-directory "~/blog/source/org_posts/"
                                           :publishing-directory "~/blog/source/"
                                           :base-extension "css\\|pdf\\|png\\|jpg\\|gif\\|svg"
                                           :publishing-function org-publish-attachment
                                           :recursive t
                                           :author nil
                                           ))
          ("blog" . (:components ("blog-org" "blog-extra")))
          ))
This worked great, but then I noticed that code blocks were getting exported as simple <pre> blocks and not getting highlighted. I managed to fix that with a simple regular expression to replace the code blocks before export and then restore the file back to its original state after the export. Please note that this is the first iteration of the function so it is not robust, it will only work for blocks that are simple begin_src blocks with a language specified. If you would like to improve on the function the comment section is open and I would appreciate it!

NOTE I had to locate this on pastebin because my function was matching its own code and messing with the highlighting. Now instead of hitting C-c C-e F I just hit M-x save-then-publish ret and everything seems to work. Hope this helps someone out there looking to use org-mode to its full potential!