2019 Retrospective

With 2019 comming to a close I though it’d be nice to have a look back and see what I’ve been up to.

The hot topic for this year was clearly Ray Tracing, which has been a goal of mine for quite some time. I’ve spent some time reading Ray Tracing in One Weekend, The Next Weekend, and The Rest of Your Life which are great resources and entry points to the subject. I created a few projects on the subject with an SDL2 base game where I added multi threading support and calculating ambient occlusion.

I’ve started to use vcpkg ( C++ Library Manager for Windows, Linux, and MacOS ) which is a neat little tool that allows me to easily grab libraries and keep them updated. No more issues with making SDL apps, importing Assimp, etc.

I started to familiarise myself with CMake. It’s still a hassle to get things working correctly on my first try, but I now have a couple of projects that I can look up to.

I forked my OpenGL sanbox and created a creativelly named Engine Sandbox project with CMake, used vcpkg for SDL, Assimp, nlohmann, etc.. I’ve been mostly fixing it, and working on neat new things, like Hot Reloading Shaders, investigating on how to Hot Reload Cpp code ( will need a bit of re-structuring ). I’ll probably make a new one next year, with a bit of planning to organize it better, but for now, It will do.

Vulkan: I started to readup on that, but after a few tutorials, and not getting anything on screen I eventually ventured of to other topics. Maybe someday i’ll come back to that. SIMD: I’ve read a few tutorials and articles on the subject. Data Oriented Design is another topics I’m looking into.
C++ Concurrency In Action 2nd edition: I’ve been reading this book, which has been on my whishlist for some time, and that naturally spun up a few projects.

I have many topics for next year, namelly, GPU RayTracing, Bounding Volume Hierarchies and other acceleration structures, Data oriented design, AI Design and Control, more MultiThreading, and I’m sure that list will keep on growing.

Parallelizing Ray Tracing

Index

Setup

I’m going to use Ray Tracing in one weekend as a basis for this article. I’ve implemented a ray tracer based on these books by Peter Shirley, but the books have since been updated and moved into a new repository. So this is a nice and clean entry point if you want to follow along.

Single Threaded

You can fork my version of RayTracingOneWeekend, I’ve added a CMakeFile and edited the project so it writes the resulting image into a file. I made a baseline branch for the default result for creating a 1200×800 image with 10 samples per pixel.

Singlethreaded, 1200×800, 10spp, 88 seconds.

On my CPU this takes around 88 seconds, and overall uses only 13Mb of memory during runtime.

Naive Jobification

In order to have an idea of different approaches to make this run faster, I decided to start by using std::async, and create one job per pixel. I assumed from the beginning that this would bring “some” speed gains, but some caveats.

Once again, I’ve made a new branch for the Jobs version of the Ray Tracer, feel free to download that and experiment with it. The main changes include creating a std::async job for each pixel, saving the std::future<ResultJob> in a vector, and using a std::condition_variable to wait until all jobs are complete.

std::mutex mutex;
std::condition_variable cvResults;
std::vector<std::future<RayResult>> m_futures;

for (int j = 0; j < ny; ++j) {
	for (int i = 0; i < nx; ++i) {
		auto future = std::async(std::launch::async | std::launch::deferred, 
		[&cam, &world, &ns, i, j, nx, ny, &cvResults]() -> RayResult {
			const unsigned int index = j * nx + i;
			vec3 col(0, 0, 0);
			for (int s = 0; s < ns; ++s) {
				float u = float(i + random_double()) / float(nx);
				float v = float(j + random_double()) / float(ny);

				ray r = cam.get_ray(u, v);
				col += color(r, world, 0);
			}
			col /= float(ns);

			RayResult result;
			result.index = index;
			result.col = vec3(sqrt(col[0]), sqrt(col[1]), sqrt(col[2]));
			return result;
		});

		{
			std::lock_guard<std::mutex> lock(mutex);
			m_futures.push_back(std::move(future));
		}
	}
}

This is what the main loop now looks like. RayResult is a simple struct containing the image index and resulting color value. After this, I wait for m_futures to be the same value as the number of pixels in the image, and then build the image before writing it to a file.

// launched jobs. need to build image.
// wait for number of jobs = pixel count
{
	std::unique_lock<std::mutex> lock(mutex);
	cvResults.wait(lock, [&m_futures, &pixelCount] {
		return m_futures.size() == pixelCount;
	});
}

// reconstruct image.
for (std::future<RayResult>& rr : m_futures)
{
	RayResult result = rr.get();
	image[result.index] = result.col;
}
1 Job per Pixel, 1200×800, 10spp, 23 seconds.

The resulting image is the same. As expected this reduced execution time to around 23 seconds, 3.8x improvement. But not everything is looking good, on the contrary, we now started to use a lot more memory!

Now this took almost 1 Gb of memory while running, compared to the 13Mb for the single threaded version! CPU usage is almost 100% across all the execution, meaning most cores where used, but that memory usage is way too high. I think we can do better!

Threads and Blocks

The next implementation involves creating N-Threads, the number of threads my CPU can run concurrently, and splitting the image into N blocks of image rows. I’ll be using a std::condition_variable to determine if each thread has finished as well, and we’ll see if this improves our program.

We do get around the same speed benefit and a small enough increase in memory consumption from the baseline test. std::async jobs still performs faster, but I suspect that is it because some of the blocks had less work to do than other, and therefor, finished first. This will make some of our CPU cores idle while the threads finish their blocks ( we can see that from the decrease CPU usage in the screenshot above ). The image is less computationally intensive in some areas than others, think about diffuse spheres versus refractive ones.

Now, I also think that if we used std::async, and split work equally in blocks, we would also reduce memory consumption and calculate the image slower. I think we need to find a nice balance between jobs sizes, obviously one job per pixel is too little and a too big block might cause idle time if the jobs is performed too fast ( assuming that thread doesn’t have another job to perform )

You can grab the source code on GitHub

Fine Tuning Job Sizes

If we have less jobs than CPU cores, some of them become idle and have no more jobs to take on. I’ve created new tests to try out different job sizes. You can check out the code for the image block version using threads and using std::async.

In both branches, you can edit nRowsPerJob to test different job sizes.

const int nThreads = std::thread::hardware_concurrency();
int nRowsPerJob = 10; // play with amount of rows for each job
int nJobs = ny / nRowsPerJob;
int leftOver = ny % nThreads;

I managed to get the same results on both methods. I no longer get a gigantic 1Gb memory usage with std::async, but now using a reasonable amount of pixels to generate, instead of one. There is no visible benefit in terms of performance from threads vs std::async that I could see. On both versions, with various block sizes, I had the same results: around 24 seconds per image and 30Mb of memory usage. By keeping the number of images rows per block low, more jobs will be created and this is ideal to split jobs evenly across CPU cores.

Take away

I set out to expand and consolidate my knowledge on multi threading paradigms and concepts using c++, using some older and some newer c++ features, and I had a lot of fun doing so.

I’m sure there’s lot of room for improvement and provably made lots of mistakes, if I did, let me know. In any case I hope you can take your own conclusions and maybe learn something too.

Exploring Multi-Threading in C++: Loading Textures

Index

Problem Overview

Let’s say we have a game engine that uses OpenGL and we need to load textures asynchronously so that we don’t block the main thread, and we can load the editor or game much faster.

Now as previously demonstrated, I could launch a new set of threads and have them load up the textures ( I’m using stb_image for that ) and generate textures with glGenTextures. The main issue here is that OpenGL context is only availabe in the main thread so if we want to take advantage of multi-threading texture loading we need to split the loading and generating for textures.

Loading is going to be done in worker threads, and generating textures will be done in the main thread. The following diagram shows a simplified workflow of what we’ll achieve.

Simplified execution of loading textures in worker threads, and processing them in the main thread.

In our main thread we have a method that will check our processing textures queue for a job. If it finds one, Generates the OpengGL texture and assigns it back to the material.

void AssetManager::Update()
{
	if (!m_processingTexturesQueue.Empty())
	{
		TextureLoadJob assetJob;
		if (m_processingTexturesQueue.TryPop(assetJob))
		{
			// Generate OpenGL texture
			Texture outputTexture = GenerateTexture(assetJob.loadedData, assetJob.textureType);
			// Update Material
			assetJob.materialOwner->AddTexture(outputTexture);
		}
	}
}

The loader thread will continuously run and check the loading textures queue for jobs. In this case, I load the texture from a file path and assigning the result into the loaded data.

void AssetManager::LoaderThread()
{
	while (m_loadingThreadActive)
	{
		if (!m_loadingTexturesQueue.Empty())
		{
			TextureLoadJob assetJob;
			if (m_loadingTexturesQueue.TryPop(assetJob))
			{
				// Load texture data into asset job
				assetJob.loadedData = LoadTextureData(assetJob.texturePath);
				// push job into processing queue
				m_processingTexturesQueue.Push(assetJob);
			}
		}
		// ....
	}
}
In game sponza scene.

This architecture allows me to load textures while the game is running without blocking the main thread. Its a bit pointless to compare times here since I’m using my own sandbox instead of a sample program to test only this matter. See Part 1 and Part 2 for more info and code you can follow along.

Full source code can be found on GitHub ( commit )

In the next part, we’ll parallelize a toy Ray Tracer. This a different problem on its own where we need to use the resulting values of multiple threads or jobs to build a final image.

Continue Reading

Exploring Multi-Threading in C++ Cont

Index

Specific Worker Threads for Specific Jobs

My next test case is to have different worker thread to run different kinds of tasks. The idea is to have a couple of threads for important jobs, others for less important jobs. I’ve split the tasks into different Job queues for simplicity.

static std::mutex g_mutexLowJobQ;
static std::mutex g_mutexMediumJobQ;
static std::mutex g_mutexHighJobQ;

std::queue<CalcPiJob*> GetJobsOfType(int count, int iterations)
{
	std::queue<CalcPiJob*> jobQ;
	for (int i = 0; i < count; ++i)
	{
		jobQ.emplace(new CalcPiJob(iterations));
	}
	return jobQ;
}

void RunThreadedPriority()
{
	int nHighThreads = 3;
	int nMediumThreads = 2;
	int nLowThreads = 2;
	
	std::queue<CalcPiJob*> lowJobQ = GetJobsOfType(Settings::JobCountLow, Settings::IterationCountLow);
	std::queue<CalcPiJob*> mediumJobQ = GetJobsOfType(Settings::JobCountMedium, Settings::IterationCountMedium);
	std::queue<CalcPiJob*> highJobQ = GetJobsOfType(Settings::JobCountHigh, Settings::IterationCountHigh);

	std::vector<std::thread> threads;

	std::atomic<bool> hasHighJobsLeft = true;
	for (int i = 0; i < nHighThreads; ++i)
	{
		std::thread t([&]() {
			ExecuteJobsQ(hasHighJobsLeft, highJobQ, g_mutexHighJobQ);
		});
		threads.push_back(std::move(t));
	}

	std::atomic<bool> hasMediumJobsLeft = true;
	for (int i = 0; i < nMediumThreads; ++i)
	{
		std::thread t([&]() {
			ExecuteJobsQ(hasMediumJobsLeft, mediumJobQ, g_mutexMediumJobQ);
		});
		threads.push_back(std::move(t));
	}

	std::atomic<bool> hasLowJobsLeft = true;
	for (int i = 0; i < nLowThreads; ++i)
	{
		std::thread t([&]() {
			ExecuteJobsQ(hasLowJobsLeft, lowJobQ, g_mutexLowJobQ);
		});
		threads.push_back(std::move(t));
	}

	// main thread
	while (hasHighJobsLeft || hasMediumJobsLeft || hasLowJobsLeft)
	{
		if (hasHighJobsLeft) 
		{
			ExecuteJobsQ(hasHighJobsLeft, highJobQ, g_mutexHighJobQ);
		}
		else
		{
			// wait for other threads to complete.
			std::this_thread::sleep_for(std::chrono::milliseconds(10));
		}
	}

	const int threadCount = threads.size();
	for (int i = 0; i < threadCount; ++i)
	{
		threads[i].join();
	}
}

Run time with 8 threads: 6059 ms. ( 4 High Job threads, 2 medium and 2 low threads. )

( click to expand )

The profile image show the 4 threads handling only big jobs, 2 threads handling medium jobs and the other 2 threads handling smaller jobs. As we can see, this won’t win us much time, since when some thread finish their work, they stand idle, not contributing to the bigger picture.

We can try to fix that by implementing some kind of work stealing. When a thread has no more jobs meant for them, they can steal jobs from other thread queues.

Specific Threads with Work Stealing

This next test is just that. Each thread type was setup to grab a job of less priority from their main one, once they run out of jobs. Hopefully we will prevent threads from going idle.

void RunThreadedPriorityWorkStealing()
{
	int nHighThreads = 5;
	int nMediumThreads = 1;
	int nLowThreads = 1;

	std::queue<CalcPiJob*> lowJobQ = GetJobsOfType(Settings::JobCountLow, Settings::IterationCountLow);
	std::queue<CalcPiJob*> mediumJobQ = GetJobsOfType(Settings::JobCountMedium, Settings::IterationCountMedium);
	std::queue<CalcPiJob*> highJobQ = GetJobsOfType(Settings::JobCountHigh, Settings::IterationCountHigh);

	std::vector<std::thread> threads;

	std::atomic<bool> isHighPriorityThreadsActive = true;
	for (int i = 0; i < nHighThreads; ++i)
	{
		std::thread t([&]() {
			
			while (isHighPriorityThreadsActive)
			{
				CalcPiJob* currentJob = GetAndPopJob(highJobQ, g_mutexHighJobQ);

				// if no more High Jobs, take on Medium ones.
				if (!currentJob)
				{
					currentJob = GetAndPopJob(mediumJobQ, g_mutexMediumJobQ);
				}

				// if no more Medium Jobs, take on Small ones.
				if (!currentJob)
				{
					currentJob = GetAndPopJob(lowJobQ, g_mutexLowJobQ);
				}

				if (currentJob)
				{
					currentJob->DoWork();
					delete currentJob;
				}
				else
				{
					isHighPriorityThreadsActive = false;
				}
			}
		});
		threads.push_back(std::move(t));
	}

	std::atomic<bool> isMediumThreadsActive = true;
	for (int i = 0; i < nMediumThreads; ++i)
	{
		std::thread t([&]() {
			while (isMediumThreadsActive)
			{
				CalcPiJob* currentJob = GetAndPopJob(mediumJobQ, g_mutexMediumJobQ);

				// if no more Medium Jobs, take on Small ones.
				if (!currentJob)
				{
					currentJob = GetAndPopJob(lowJobQ, g_mutexLowJobQ);
				}

				if (currentJob)
				{
					currentJob->DoWork();
					delete currentJob;
				}
				else
				{
					isMediumThreadsActive = false;
				}
			}
		});
		threads.push_back(std::move(t));
	}

	std::atomic<bool> isLowThreadsActive = true;
	for (int i = 0; i < nLowThreads; ++i)
	{
		std::thread t([&]() {
			while (isLowThreadsActive)
			{
				CalcPiJob* currentJob = GetAndPopJob(lowJobQ, g_mutexLowJobQ);

				if (currentJob)
				{
					currentJob->DoWork();
					delete currentJob;
				}
				else
				{
					isLowThreadsActive = false;
				}
			}
			});
		threads.push_back(std::move(t));
	}

	// main thread
	while (isLowThreadsActive || isMediumThreadsActive || isHighPriorityThreadsActive)
	{
		if (isHighPriorityThreadsActive)
		{
			CalcPiJob* currentJob = GetAndPopJob(highJobQ, g_mutexHighJobQ);

			// if no more High Jobs, take on Medium ones.
			if (!currentJob)
			{
				currentJob = GetAndPopJob(mediumJobQ, g_mutexMediumJobQ);
			}

			// if no more Medium Jobs, take on Small ones.
			if (!currentJob)
			{
				currentJob = GetAndPopJob(lowJobQ, g_mutexLowJobQ);
			}

			if (currentJob)
			{
				currentJob->DoWork();
				delete currentJob;
			}
			else
			{
				isHighPriorityThreadsActive = false;
			}
		}
		else
		{
			// wait for other threads to complete.
			std::this_thread::sleep_for(std::chrono::milliseconds(10));
		}
	}

	const int threadCount = threads.size();
	for (int i = 0; i < threadCount; ++i)
	{
		threads[i].join();
	}
}

Run time with 8 threads: 2625 ms.

(click to expand)

Now we can see that the high priority worker threads started to take on medium sized jobs as soon as the higher ones depleted, and then the small jobs followed.

Synchronizing Threads

Now lets say I’m processing data and I need to start Jobs in sync in between multiple threads, or maybe I’m building a game engine and my main update loop needs to start at the same time as the physics loop in some other thread. Whichever the case, I tough looking up synchronization mechanisms was worth doing as well.

std::mutex g_syncMutex;
std::condition_variable g_conditionVariable;

void RunSynchronizedThreads()
{
	int nThreads = std::thread::hardware_concurrency() - 1;
	std::vector<std::thread> threads;

	std::queue<CalcPiJob*> jobQ = GetJobsQ();

	std::atomic<bool> signal = false;
	std::atomic<bool> threadsActive = true;
	for (int i = 0; i < nThreads; ++i)
	{
		std::thread t([&]() {
			while (threadsActive)
			{
				// Tell main thread, worker is available for work
				{
					std::unique_lock<std::mutex> lk(g_syncMutex);
					g_conditionVariable.wait(lk, [&] { return signal == true; });
				}

				CalcPiJob* currentJob = GetAndPopJob(jobQ, g_mutexJobQ);

				if (currentJob)
				{
					currentJob->DoWork();
					delete currentJob;
				}
				else
				{
					threadsActive = false;
				}
			}
		});
		threads.push_back(std::move(t));
	}

	// main thread
	std::atomic<bool> mainThreadActive = true;
	while (mainThreadActive && threadsActive)
	{
		// send signal to worker threads, they can start work.
		{
			std::lock_guard<std::mutex> lk(g_syncMutex);
			signal = true;
		}
		g_conditionVariable.notify_all();

		// send signal to worker threads, so they have to wait for their next update.
		std::this_thread::sleep_for(std::chrono::milliseconds(1));
		{
			std::lock_guard<std::mutex> lk(g_syncMutex);
			signal = false;
		}
		g_conditionVariable.notify_all();

		// main thread work.
		CalcPiJob* currentJob = GetAndPopJob(jobQ, g_mutexJobQ);

		if (currentJob)
		{
			currentJob->DoWork();
			delete currentJob;
		}
		else
		{
			mainThreadActive = false;
		}
	}

	for (int i = 0; i < nThreads; ++i)
	{
		threads[i].join();
	}
}

Run time: 2674 ms

(click to expand)

I’ve setup this one up so worker thread only start at the same frequency of the main thread. The goal here was to use condition variables to synchronize the threads, and hopefully confirm it with the profiler., which we can look at in the image above.

Test RunTime (ms)Improvement
One Thread 103961.99x
Threaded 26257.88x
Threaded with Priority 60593.4x
Threaded with Work Stealing 26257.8x
Synchronized Threads 26747.7x

Download code from GitHub

Continue Reading

Exploring Multi-Threading in C++

The pursuit of performance is something that interests me as a developer, so as a learning exercise I decided to experiment and consolidate my knowledge about multi-threading. Nowadays it’s becoming even more important since our CPUs get more and more cores. Modern game engines and applications use multiple CPU cores to stay fast and responsive.

Index

Setup and Baseline Result

As a test case, I decided to create a series of tasks, ones small and other big, to simulate different workload types. As an easy test case, I grabbed a method to calculate Pi, and run that method multiple times, depending on how heavy I want the workload to be.

double CalcPi(int n)
{
	double sum = 0.0;
	int sign = 1;
	for (int i = 0; i < n; ++i)
	{
		sum += sign / (2.0 * i + 1.0);
		sign *= -1;
	}
	return 4.0 * sum;
}

Now I create a couple of different Jobs running CalcPi and add them into a vector or a queue ( depending on the test I’m running ). My CalcPiJob class looks something like this.

class CalcPiJob
{
public:
	CalcPiJob(int iterations)
		: m_iterations(iterations)
	{ }

	void DoWork()
	{
		float p = 0.0f;
		for (int i = 0; i < m_iterations; ++i) {
			p += CalcPi(m_iterations);
		}

		p /= m_iterations;
		std::this_thread::sleep_for(std::chrono::milliseconds(Settings::ThreadPause));
	}

private:
	int m_iterations;
};

Creating a series of different workload types looks something like:

std::queue<CalcPiJob*> GetJobsQ()
{
	std::queue<CalcPiJob*> jobQ;
	for (int i = 0; i < Settings::JobCountHigh; ++i)
	{
		jobQ.emplace(new CalcPiJob(Settings::IterationCountHigh));
	}

	for (int i = 0; i < Settings::JobCountMedium; ++i)
	{
		jobQ.emplace(new CalcPiJob(Settings::IterationCountMedium));
	}

	for (int i = 0; i < Settings::JobCountLow; ++i)
	{
		jobQ.emplace(new CalcPiJob(Settings::IterationCountLow));
	}
	return jobQ;
}

std::vector<CalcPiJob*> GetJobVector()
{
	std::vector<CalcPiJob*> jobs;
	for (int i = 0; i < Settings::JobCountHigh; ++i)
	{
		jobs.push_back(new CalcPiJob(Settings::IterationCountHigh));
	}

	for (int i = 0; i < Settings::JobCountMedium; ++i)
	{
		jobs.push_back(new CalcPiJob(Settings::IterationCountMedium));
	}

	for (int i = 0; i < Settings::JobCountLow; ++i)
	{
		jobs.push_back(new CalcPiJob(Settings::IterationCountLow));
	}
	return jobs;
}

I have also defined a couple of constants to help out.

struct Settings
{
	enum class Priority : int {
		Low = 0,
		Medium,
		High
	};

	static const int JobCountLow = 120;
	static const int JobCountMedium = 60;
	static const int JobCountHigh = 25;

	static const int ThreadPause = 100;

	static const int IterationCountLow = 5000;
	static const int IterationCountMedium = 10000;
	static const int IterationCountHigh = 20000;

	static const int PrecisionHigh = 100;
	static const int PrecisionMedium = 100;
	static const int PrecisionLow = 100;
};

Now for baseline, I go through all Jobs and execute DoWork sequentially.

void RunSequential()
{
	std::queue<CalcPiJob*> jobQ = GetJobsQ();
	while (!jobQ.empty())
	{
		CalcPiJob* job = jobQ.front();
		jobQ.pop();

		job->DoWork();
		delete job;
	}
}

I’m running all my tests on a i7 4770K, that has 4 cores and 8 threads. All timings where taken from a release build, and all profile images from debug builds ( for illustration of workload purposes ).

Sequential run time: 20692 ms

First Worker Thread

Let the interesting part begin. As an easy step towards a multi-threading application, I’m going to create only one thread, to share the workload with the main thread.

This already brings a few new concepts to be aware of such as sharing data across multiple threads. We protect our data access with a std::mutex, and lock it with std::scoped_lock ( introduced in C++17. Use similar std::lock_guard if your compiler doesn’t support it ).

You’ll need a few includes first.

// you should already have these.
#include <vector>
#include <queue>

#include <thread> // thread support
#include <mutex>  // mutex support
#include <atomic> // atomic variables
#include <future> // later on for std::async
CalcPiJob* GetAndPopJob(std::queue<CalcPiJob*>& jobQ, std::mutex& mutex)
{
	std::scoped_lock<std::mutex> lock(mutex);
	if (!jobQ.empty())
	{
		CalcPiJob* job = jobQ.front();
		jobQ.pop();

		return job;
	}
	return nullptr;
}

GetAndPopJob does exactly what is says, it will get a job if one exists and pop it from the queue. empty(), front() and pop() are protected inside this method with the use of the std::scoped_lock.

void ExecuteJobsQ(std::atomic<bool>& hasWork, 
	std::queue<CalcPiJob*>& jobQ, 
	std::mutex& mutex)
{
	while (hasWork)
	{
		CalcPiJob* currentJob = GetAndPopJob(jobQ, mutex);
		if (currentJob)
		{
			currentJob->DoWork();
			delete currentJob;
		}
		else
		{
			hasWork = false;
		}
	}
}

ExecuteJobsQ will run in the main thread and the worker thread. It gets a job, execute it, and continue until there is no more work to do.

// global mutex for read/write access to Job Queue
static std::mutex g_mutexJobQ;

void RunOneThread()
{
	std::queue<CalcPiJob*> jobQ = GetJobsQ();

	std::atomic<bool> jobsPending = true;

	// Starting new thread
	std::thread t([&]() {
		ExecuteJobsQ(jobsPending, jobQ, g_mutexJobQ);
	});

	// main thread, also does the same.
	ExecuteJobsQ(jobsPending, jobQ, g_mutexJobQ);

	t.join();
}

One worker thread run time: 10396 ms

(click to expand)

The image above show the execution of the jobs, the larger ones first, then the medium sized ones and lastly the smaller ones. This was the order at which the tasks where added into the queue.

More Worker Threads

Now this is nice, so lets add more threads! How many? Well, I know my CPU has 8 thread, but nothing guarantees they will only run for my program tho. Operating system time slice program execution across multiple cores/threads, so even if you create more threads than your max CPU threads, there’s no “problem” because the operating system will switch execution time for them on its own.

C++ provides us a way of determining how many concurrent threads our system supports, so lets just use that: std::thread::hardware_concurrency()

void RunThreaded()
{
	// -1 to make space for main thread
	int nThreads = std::thread::hardware_concurrency() - 1;
	std::vector<std::thread> threads;

	std::queue<CalcPiJob*> jobQ = GetJobsQ();

	std::atomic<bool> hasJobsLeft = true;
	for (int i = 0; i < nThreads; ++i)
	{
		std::thread t([&]() {
			ExecuteJobsQ(hasJobsLeft, jobQ, g_mutexJobQ);
		});
		threads.push_back(std::move(t));
	}

	// main thread
	ExecuteJobsQ(hasJobsLeft, jobQ, g_mutexJobQ);

	for (int i = 0; i < nThreads; ++i)
	{
		threads[i].join();
	}
}

Run time with 8 threads: 2625 ms.

8 threads ( click to expand )

Now this is a nicer view. 7 worker threads working with the main thread to process all jobs. Again, first we see the bigger jobs, then medium, then smaller ones being processed. This is being processed in the order they were added.

Async Tasks

When spawning tasks with std::async, we don’t manually create threads, they are spawned from a thread pool.

void RunJobsOnAsync()
{
	std::vector<CalcPiJob*> jobs = GetJobVector();

	std::vector<std::future<void>> futures;
	for (int i = 0; i < jobs.size(); ++i)
	{
		auto j = std::async([&jobs, i]() {
			jobs[i]->DoWork();
			});
		futures.push_back(std::move(j));
	}

	// Wait for Jobs to finish, .get() is a blocking operation.
	for (int i = 0; i < futures.size(); ++i)
	{
		futures[i].get();
	}

	for (int i = 0; i < jobs.size(); ++i)
	{
		delete jobs[i];
	}
}

Run time: 2220 ms

(click to expand)

Overview

This time table only serves as an overview for this particular case. Of course, in real applications, results vary.

Test RunTime (ms)Improvement
Sequential 206921.x
One Thread 103961.99x
Threaded 26257.88x
Async Tasks (1)22209.3x

The sample codes are my exploration of this specific case and by no means is free of bugs. But it is interesting to see how the code would run across multiple thread, how to synchronize and make the most of my system.

All screenshots are taken with the debug version of the program, so we could clearly see the workload in the profiler. For that I used Superluminal Profiler. I found out that it is an amazing, lightweight profiler. You can also use Intel’s VTune for free.

Download code from GitHub

Continue Reading

Note

(1) I’ve revisited Async Tasks to make sure all tasks where fully complete before exiting the function call. I managed to get worse performance than the baseline test. Upon inspection with vTune, I found out, that it doesn’t launch new threads. This change happens from Debug/Release builds. I believe there is some optimization that makes it be that fast, even more so because most of the time spent on each jobs is an artificial wait. In any case, read those values with a grain of salt. Do your own tests and make your own conclusions.

Ludum Dare 38 Post Mortem : Astromike

Ready, set, GO!


TL;DR; Marco Vale and I made a game for Ludum Dare 38 in 72h that you can play here. What follows is a small post mortem.


I was awake at 3 am to know the theme for Ludum Dare 38. I still hadn’t made my mind if I was going to participate, it would depend on the theme and if I could come up with something. The theme was released, ‘A Small World’, and I immediately went to bed. There’s nothing like a good night sleep, and letting my brain think about the theme.

So in the morning, I started researching the theme, I had micro-organisms in mind, micro and macro ecosystems but nothing really that would stand out. I thought about how Small World could lead to having a simple set of rules in a small constrained world, and with the micro-organisms idea, I shifted towards a space exploration game, heavily inspired by Hitman GO and Lara Croft GO games.

Initially, the idea would be to have your main character cleanup each level from organisms there but quickly shifted to a puzzle-solver sci-fi game. Having Hitman GO in mind, I prototyped a level and started to create what would become the node-based navigation system. A collection of nodes that the main character and enemies could travel to, based on their distances.

At this point, I had a friend of mine, Marco Vale, asking if I needed an artist, so we teamed-up and he started to create assets for the game’s environment, as well as a character, gun, jetpack and a crashed ship, while I was developing game mechanics. Soon after that, I had a node-based navigation working, and something a lot prettier than cubes to show.

At the end of day 1, we had a placeholder character moving around a map, an exit to finish the level, an enemy that would follow and kill the player, items to pick up, a simple inventory system, and trigger objects that would fire events, either to open or close passages and activate traps.

On day 2, Marco started with the main character, gun, and jetpack. Meanwhile, I was working on creating more levels, introducing new components to the game and building up from the previous levels. For instance, on level 2 you grab the gun, therefore level 3 has enemies that you’re now able to shoot at in order to solve the puzzle and go on. The jetpack is unlocked further and will allow the player to jump across gaps.

At this point, I was still implementing a Hitman GO movement style, because I thought that we wouldn’t have time to make animations. Ultimately we had, so we threw that away, and gladly because it looks and feels much better.

The last couple of hours before the compo deadline, I swapped all placeholders for the gun and jetpack, integrated the animations for the main character, made some quick particle effects, created a teleport mechanic and added sound.

On day 3, we focused on making it look better and fixing a few issues that were making the game difficult to play. We also added the crash site with the spaceship, decorations, particle effects and camera post processing.

I started the Ludum Dare alone, with a faint idea of what to do, and with Marco‘s amazing help, we were able to make something really nice, and learn a lot in the process. I think this is the most fun I’ve had making games in quite a while, and definitely my best LD entry so far. 72h later, I’m proud of the game we’ve created. Feel free to try it out, and leave a comment if you like.

See you next Ludum Dare.

Postmortem: Jumpy Rope is born

Jumpy Rope is an endless arcade jumper game for iOS, and on Android platforms, featuring low poly graphic style, customizable characters and simple gameplay mechanics.

Jumpy Rope was inspired from Final Fantasy IX mini jump rope game, which has Vivi play jump the rope. I thought it could be fun to have a small game that would bring back these memories. So from the beginning, this is what I wanted to game to play like.

pmjr_3
Art Style orientation

As for visual style, Crossy Road, Monument Valley where all people talked about, and for good reason, they look very good, and they play very well. I decided to go for the same direction and aim for low poly flat shading. I was really fond of the idea to have a small, contained environment. After browsing for inspiration, I found a mix of floating islands like Captain Toad Treasure Tracker. By now, I had gameplay and art orientation, so I started to work on it.

I started to work as I always do, gameplay first, so I focused on the jumping mechanism, which is the most important part of the game, making a tutorial at the same time for that ( which as evolved from the tutorial, and is was not implemented in the game ).

A week later I had a prototype which ugly models, art, and no polish, as prototypes should be. With a couple of polish here and there, and a couple of assets from Unity’s asset store, I tough I could make it, and release it. It wouldn’t be the best thing in the world, but having no team to work with, it was all I had. Luckily, I stumble upon this:

When I saw this, I immediately knew this was the style and islands I was looking for. After contacting @fifsilva and explaining the project, she jumped along to make a couple of floating islands for Jumpy Rope.

The game went from mediocre looking to awesome! I really love those tiny floating islands she made. This pushed the overall game quality, and I had no more programmer art anymore. All I needed now was characters. I met some people and then was references to others. I was lucky enough to get concept art and models from them.

Characters Concept
Concept art by Jenny Harder

At some point the game looked this this:

And then came the environments:

Couple of months later I was working on two player mode, while waiting for characters and animation ( by Guilherme Martins ), to be finished to be integrated in the game.

While waiting for assets from the team, I had made UI, a Replay System to save people’s games and replay them as ghosts ( never made it ), local two player mode ( was removed because people didn’t play it ), changed leaderboard system to be cross-platform and integrated with facebook, changed gameplay input based on player feedback ( twice ), and of course the api to handle all that stuff online. The UI took several iterations, I really think that at each step it got better, but at some point I had to close on a final version.

In the later stages of development, Pedro Costa came in to make the game stand out with audio, sound effects and music. Now the game is freely available on the AppStore, and on GooglePlay.

Here’s the trailer for the final version:

In conclusion.
The bad: No planning whatsoever. I didn’t think ahead and plan the game through. I also probably should have got in contact with the community sooner, to ask for help or get feedback for the game. The lack of planning, made is easy to make ‘just one more feature’, that in the end, either didn’t make it to the final build or took way too much effort for what its worth.

The good: You always learn something, some new technology, new and better code practices, and most importantly, you ship a game! Jumpy Rope made me value planning, and careful consideration of features, what to kill and what to invest in. Get feedback early, get it out to the community as soon as possible.

That’s it for now.
Cheers.

Data Serialization on iOS with Unity and AOT Problems

Super quick tip!
I’ve come up with a problem when making a generic data serialization method in which I write my data to a binary file, which is an error that only occurs on iOS because it can’t run JIT ( Just-in-time ) compilation and/or AOT ( Ahead-of-time ) compilation because it doesn’t allow runtime code generation.

There are some answers on Unity Ansers, but this forum post had the same problem as I was, and there is a fix that did it for me, that’s on Unity Answers here with a pretty good explanation.

So, yup, my solution was to add that same piece of code on the file where I am invoking the code.

 void Awake( )
    {
    #if UNITY_IOS
        // Forces a different code path in the BinaryFormatter that doesn't rely on run-time code generation (which would break on iOS).
        Environment.SetEnvironmentVariable("MONO_REFLECTION_SERIALIZER", "yes");
    #endif
        // ...
        LoadData();
    }

So just in case, I’ll leave this here for future reference. ;)