Excerpts from a Virtual Reality Researcher

Multi-threaded loading of data in OpenGL applications

During implementation of the vizHOME viewer point rendering application, it quickly became apparent that with the amount of data we are gathering with the LiDAR scanner, we would need a multi-threaded technique for loading points outside of our main rendering thread, often known as an “out-of-core” technique.

Implementation details on how to properly execute this sort of technique weren’t exactly in abundance.  A common use in graphics applications is loading level-of-detail versions of large terrains on the fly, in games or geographical applications.  Previous point cloud renderers have done it in various forms, but not much has been written about the correct way to do it.

Having had various multi-threaded programming experiences over the years, I’ve been able to put together a couple of different techniques for loading “out-of-core” in the current viewing tool.  In working through this problem, two variations on the technique stood out, namely, whether we should create multiple threads and a single OpenGL context (MTSC – multiple threads, single context) or whether we should create multiple threads, each with their own OpenGL context (MTMC – multiple threads, multiple contexts).
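
Both variations share the same overall shape: reader threads pull octant load requests off a shared queue while the main thread keeps rendering.  Here’s a minimal sketch of that loop; the names (LoadRequest, readerThread, and so on) are mine for illustration, not the viewer’s actual code:

```cpp
// Hypothetical sketch of the out-of-core reading loop; names are illustrative.
#include <condition_variable>
#include <cstdint>
#include <mutex>
#include <queue>

struct LoadRequest { std::uint64_t octantId; };

std::queue<LoadRequest> g_requests;   // octants the renderer wants loaded
std::mutex              g_mutex;
std::condition_variable g_cv;
bool                    g_shutdown = false;

void readerThread() {
    for (;;) {
        LoadRequest req;
        {
            std::unique_lock<std::mutex> lock(g_mutex);
            g_cv.wait(lock, [] { return g_shutdown || !g_requests.empty(); });
            if (g_shutdown && g_requests.empty()) return;
            req = g_requests.front();
            g_requests.pop();
        }
        // Read the octant's points from disk here, then either hand the
        // buffer back to the render thread (MTSC) or upload it directly
        // on this thread through a shared OpenGL context (MTMC).
    }
}

// The render thread spawns N of these, e.g.:
//   std::vector<std::thread> readers;
//   for (int i = 0; i < numReaders; ++i) readers.emplace_back(readerThread);
```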

The big difference between these two variations is that with MTSC, physical memory has to be allocated each time a file containing points is read from disk, whereas with MTMC, we can share OpenGL objects across threads and read data directly onto the GPU via the glMapBufferRange function in the actual reading thread, instead of passing the read data in a physical memory buffer back to the main rendering thread for upload to the GPU via the glBufferData function.  Uploading to the GPU on the reading thread saves an extra copy of the data and also means we don’t have to worry about things like memory fragmentation from dynamic memory allocations (supposing that we aren’t using some sort of memory pool to alleviate this).  On the other hand, the literature states that the more OpenGL contexts you have active and the more you have going on in those contexts, the more often the OpenGL pipeline needs to perform a “context switch”, which, according to this nice blog post about multiple contexts, incurs a performance penalty.
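
To make the difference concrete, here’s a rough sketch of the two upload paths.  The GL calls are the ones named above; the surrounding function names, blobPtr, and vbo are just illustrative:

```cpp
// Sketch of the two upload paths; surrounding names are illustrative.
#include <cstddef>
#include <cstring>
#include <GL/glew.h>  // assumption: GLEW (or similar) loads the GL entry points

// MTSC: the reader thread has no GL context, so it copies the points into a
// temporary CPU buffer that the render thread later uploads and frees.
float* readOctantMTSC(const float* blobPtr, std::size_t numFloats) {
    float* cpuBuffer = new float[numFloats];            // per-read allocation
    std::memcpy(cpuBuffer, blobPtr, numFloats * sizeof(float));
    return cpuBuffer;  // render thread: glBufferData(...); delete[] cpuBuffer;
}

// MTMC: the reader thread owns a context that shares objects with the render
// thread's context, so it maps the buffer and copies straight onto the GPU.
// Assumes the buffer's storage was created earlier (glBufferData with nullptr).
void readOctantMTMC(GLuint vbo, const float* blobPtr, std::size_t numFloats) {
    GLsizeiptr bytes = static_cast<GLsizeiptr>(numFloats * sizeof(float));
    glBindBuffer(GL_ARRAY_BUFFER, vbo);
    void* gpuPtr = glMapBufferRange(GL_ARRAY_BUFFER, 0, bytes,
                                    GL_MAP_WRITE_BIT | GL_MAP_INVALIDATE_BUFFER_BIT);
    if (gpuPtr) {
        std::memcpy(gpuPtr, blobPtr, numFloats * sizeof(float));  // no staging copy
        glUnmapBuffer(GL_ARRAY_BUFFER);
    }
}
```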

In our situation, we actually aren’t performing any rendering within the other contexts, just uploading to the GPU, so the question becomes: does this still create a significant performance hit?  And if so, is it worse than having to pass a bunch of memory buffers around at run time instead of being able to upload directly to the GPU?
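
For reference, setting up MTMC on Windows (where the test below was run) boils down to giving each reader thread its own context that shares objects with the main rendering context.  A rough sketch, with the caveat that the exact setup in the viewer may differ:

```cpp
// Hypothetical MTMC context setup on Windows: each reader thread gets its own
// HGLRC that shares objects with the main context. Error handling omitted.
#include <windows.h>
#include <GL/gl.h>

HGLRC createReaderContext(HDC hdc, HGLRC mainContext) {
    HGLRC readerContext = wglCreateContext(hdc);  // same pixel format as main
    wglShareLists(mainContext, readerContext);    // share buffer objects, etc.
    return readerContext;                         // call this before first use
}

// Each reader thread then binds its context once, up front:
//   wglMakeCurrent(hdc, readerContext);
// after which glMapBufferRange/glUnmapBuffer can be called from that thread.
```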

To test this out, I ran a short experiment comparing the rate at which data loads, and the overall frame rate of our point rendering application, between the two techniques.  The test was performed in a 1280×800 window on a Windows 7 64-bit machine with 8 GB of RAM, 8 cores, and an NVIDIA GeForce GT 750M card with 4 GB of graphics memory.  The test point cloud consisted of 156.4 million points.

Timing was performed by starting at a position within the model where 2,259 octants and 4,433,172 points would need to be loaded, and recording the time between the start of reading and the moment all the reading queues became empty.  All times were recorded after nodes had been cached by the system, to remove this factor from the measurements as much as possible (no other files or applications were accessed in between).  Also, the reading here is actually done from a big binary blob that has been memory mapped (this will be discussed in a future post), so a memcpy simply copies the data out of the binary blob, either into a physically allocated float array (MTSC) or into a direct GPU memory pointer returned from glMapBufferRange (MTMC).  The test was conducted with 1, 2, 4, and 8 threads in each case.  For each thread count I recorded 10 timings for MTSC and 10 for MTMC, threw out the min and max of each set of 10, took the average of the remaining 8 samples, and plotted the results (milliseconds to converge on the y-axis, number of threads on the x-axis):

[Figure: thread_reading_analysis – time to convergence in milliseconds vs. number of reader threads, for MTSC and MTMC]
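
(As an aside, the averaging described above is just a trimmed mean; in code form it would look something like:)

```cpp
// Drop the min and max of the ten recorded timings, average the remaining eight.
#include <algorithm>
#include <numeric>
#include <vector>

double trimmedAverageMs(std::vector<double> timingsMs) {
    std::sort(timingsMs.begin(), timingsMs.end());
    double sum = std::accumulate(timingsMs.begin() + 1, timingsMs.end() - 1, 0.0);
    return sum / static_cast<double>(timingsMs.size() - 2);
}
```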

A couple of fairly interesting things stand out here.  There was a decent difference in overall convergence time at lower thread counts, where Multiple Threads, Single OpenGL Context was faster.  However, when the thread count increased to 4-8, the times to convergence were very similar, with Multiple Threads, Multiple OpenGL Contexts actually performing about a quarter to a half millisecond faster.

So it appears that using multiple OpenGL contexts for reading can be a win: at a larger number of threads the reading speeds are the same, but there’s no need for physical memory allocation.  It would be interesting to see whether trying this with more samples, or more threads, yields the same results.  Also, there are things going on in this test case other than reading (it’s actually drawing the data as well until it converges), so it’d be interesting to see if there’s any difference in drawing time between the two – although frame rates during both tests leveled out at 30 fps.

Has anyone else out there dealt with multiple OpenGL contexts?  How has performance been affected in your application?