Elegant Power Set Calculation Algorithm

An easy-to-understand and elegant algorithm for calculating the power set of a given set of numbers, based on the properties of binary numbers. An example in C code, along with an explanation, is provided.

Hello all. While looking around for algorithms to calculate the power set of a set, I came across the usual recursive algorithm, which tries to exhaust all the possible subsets, and also its iterative version, which does pretty much the same thing. But then I saw an algorithm that calculates the power set and is simple to understand, fast and elegant, and I felt the need to write about it and share a sample C implementation.

But first of all, what exactly is a power set? For the whole explanation please visit the Wikipedia page. In short, the power set of a set S is the set of all possible subsets of S, including S itself and the empty set. For example the power set of S = {3,4,5} is P(S) = { {3,4,5}, {3,4}, {4,5}, {3,5}, {3}, {4}, {5}, ∅ }. The number of subsets in the power set is 2^s, where s is the number of elements in S. s is also called the cardinality of S.

The algorithm works in a very simple and elegant way. You represent the original set S with a string of bits, all with a value of 1. For the above example of S, that number would be 2^s - 1 = 2^3 - 1 = 7. In binary, seven is 111, as we can see from the figure below.

Binary representation of 7
Binary representation of 7

Now just picture that each bit of this number is associated with one element of the original set. So in our example the first bit would be associated with 3, the second bit with 4 and so on. With a simple iteration from 0 to this number (7 in our case) we can go through all the possible subsets of the set, by taking the binary representation of the counter at each iteration and matching its bits to the presence (or absence) of the corresponding elements of the set. Since pretty pictures are much better than words, I made the following diagram to illustrate the algorithm graphically.

Binary Power Set Representation
Binary Power Set Representation

And just like that we can obtain the power set of any given set.
Let’s see a quick implementation that I cooked up in C.

#include <math.h>
#include <stdio.h>
 
//an example set
int A[] = {1, 5, 7, 8, 2, 10, 12};
 
//The function to calculate the subset of the original set
//corresponding to the number given at num
void printSet(unsigned int num,int* arr,unsigned int size)
{
    unsigned int i;
    printf("\t{");
    //for each bit
    for(i=0;i<sizeof(unsigned int)*8;i++)
    {
        if(num&(unsigned int)0x1)//if it's ON print the corresponding value
        {
            if(num >>1 ==0)//if it's the last ON bit
            {//print without comma and terminate the loop
                printf("%d",arr[i]);
                break;
            }
            else
                printf("%d,",arr[i]);
        }
        num = num>>1;//keep shifting right to get more bits
    }
    printf("}\n");
}
 
int main()
{
    unsigned int size,i,n;
 
    //get the cardinality of the set and check
    //if it can be represented by an unsigned int
    size = sizeof(A)/sizeof(int);
    if(size > sizeof(unsigned int)*8)
    {
        printf("The given set can not be represented by the size"
               " of unsigned int in the system\nQuitting ...\n");
        return -1;
    }
    //get the number until which we need to iterate
    n = pow(2,size)-1;
 
    //calculate and print the powerset
    printf("The powerset of {");
    for(i=0;i<size-1;i++)
        printf("%d,",A[i]);
    printf("%d} is:\n{\n",A[size-1]);
    for(i=0;i<=n;i++)
        printSet(i,A,size);
    printf("}\n");
 
    return 0;
}

The above is just a minimal example to demonstrate the algorithm. It should work in most C compilers, including GCC and MSVC, and give the following output:

The powerset of {1,5,7,8,2,10,12} is:
{
{}
{1}
{5}
{1,5}
{7}
{1,7}
{5,7}
{1,5,7}
{8}
{1,8}
{5,8}
{1,5,8}
{7,8}
{1,7,8}
{5,7,8}
{1,5,7,8}
{2}
{1,2}
{5,2}
{1,5,2}
{7,2}
{1,7,2}
{5,7,2}
{1,5,7,2}
{8,2}
{1,8,2}
{5,8,2}
{1,5,8,2}
{7,8,2}
{1,7,8,2}
{5,7,8,2}
{1,5,7,8,2}
{10}
{1,10}
{5,10}
{1,5,10}
{7,10}
{1,7,10}
{5,7,10}
{1,5,7,10}
{8,10}
{1,8,10}
{5,8,10}
{1,5,8,10}
{7,8,10}
{1,7,8,10}
{5,7,8,10}
{1,5,7,8,10}
{2,10}
{1,2,10}
{5,2,10}
{1,5,2,10}
{7,2,10}
{1,7,2,10}
{5,7,2,10}
{1,5,7,2,10}
{8,2,10}
{1,8,2,10}
{5,8,2,10}
{1,5,8,2,10}
{7,8,2,10}
{1,7,8,2,10}
{5,7,8,2,10}
{1,5,7,8,2,10}
{12}
{1,12}
{5,12}
{1,5,12}
{7,12}
{1,7,12}
{5,7,12}
{1,5,7,12}
{8,12}
{1,8,12}
{5,8,12}
{1,5,8,12}
{7,8,12}
{1,7,8,12}
{5,7,8,12}
{1,5,7,8,12}
{2,12}
{1,2,12}
{5,2,12}
{1,5,2,12}
{7,2,12}
{1,7,2,12}
{5,7,2,12}
{1,5,7,2,12}
{8,2,12}
{1,8,2,12}
{5,8,2,12}
{1,5,8,2,12}
{7,8,2,12}
{1,7,8,2,12}
{5,7,8,2,12}
{1,5,7,8,2,12}
{10,12}
{1,10,12}
{5,10,12}
{1,5,10,12}
{7,10,12}
{1,7,10,12}
{5,7,10,12}
{1,5,7,10,12}
{8,10,12}
{1,8,10,12}
{5,8,10,12}
{1,5,8,10,12}
{7,8,10,12}
{1,7,8,10,12}
{5,7,8,10,12}
{1,5,7,8,10,12}
{2,10,12}
{1,2,10,12}
{5,2,10,12}
{1,5,2,10,12}
{7,2,10,12}
{1,7,2,10,12}
{5,7,2,10,12}
{1,5,7,2,10,12}
{8,2,10,12}
{1,8,2,10,12}
{5,8,2,10,12}
{1,5,8,2,10,12}
{7,8,2,10,12}
{1,7,8,2,10,12}
{5,7,8,2,10,12}
{1,5,7,8,2,10,12}
}

The limitation above is the size of unsigned int on the system. If that size is, say, 32 bits, then the above will only work for sets with a cardinality of up to 32. This is a problem that is easy to overcome: just use bigger integers. For example, if you use a compiler that follows the C99 standard, you can include the <stdint.h> header and use the uint64_t type instead of unsigned int, so the algorithm works for sets with a cardinality of up to 64. For bigger sets you would have to use some big-integer library or code one yourself.

Well, that’s it. I hope some of you also find this algorithm elegant, and that it has been helpful for you. As always, any comments, corrections and feedback are more than welcome. Please use the form below.

Additionally if the post caught your interest and you love C programming in general why not check my work in the Refu Library? It currently needs more contributors and people who are able and willing to provide feedback! You could be one of them 🙂

Playing with OpenCL: Gaussian Blurring

Introducing the use of OpenCL with a tutorial on how to setup the software development environment for it depending on the vendor of your graphics card. Also presenting a full-fledged example of its use with the implementation of a parallel processing algorithm for applying a gaussian blur filter on an image.

EDIT(2 years and something later):
The code in this tutorial is ugly. I checked it 2 years and something after writing it and can definitely say it’s bad. I don’t have enough time to “fix” it, but at least I changed the github repo to provide a simple Makefile and removed a nonsensical dependency on an immature version of a library of mine.

Start of original post:
I haven’t written a blog post in ages, since I was pretty busy with my university class. Lately I have been playing around with OpenCL and completed a small demo that can perform Gaussian Blurring on a bitmap image either with the CPU or the GPU. In this post I will try to explain how to do it step by step so that people with no OpenCL knowledge may find a good starting point to get comfortable with OpenCL. This post is accompanied by source code located at Github.

First of all, what is OpenCL and how do I get it? Well, OpenCL is an API for your graphics card that allows for GPGPU (General-Purpose GPU) computing. It is an open standard made and maintained by the same group that works on OpenGL, the Khronos Group. The first version of this tutorial will showcase how to work with OpenCL in Windows, but a Linux and maybe even a Mac OS X version may be added in the future. (Theoretically this code should work on Linux too, but it has not yet been tested.)

Getting OpenCL

As I mentioned above, OpenCL is an API for your graphics card, so it stands to reason that the drivers for it and any SDK would be available from your GPU’s vendor. So in order to get the OpenCL SDK you have to visit your graphics card vendor’s website. Below you can see the links at the time of writing of this post:

Now that you have the SDK, let’s see how to use it from inside a project. Just like with any other library, you have to link to the library object itself and expose the functions via header files. For now I will show how to do this under Windows with the nVidia OpenCL SDK and the gcc compiler in the Codeblocks IDE. For Microsoft Visual Studio the process is similar. If I can find more time in the future I will show how to accomplish this with other IDEs and other OSes too.

Linking to OpenCL

So assuming that your IDE is Codeblocks, fire it up and make a new project. In Codeblocks all linking options are in the menu “Project”->”Build Options”. First we should link to the OpenCL library itself, so navigate to the Linker Settings and add the directory of OpenCL.lib there as can be seen in the picture below. If you have a nVidia card and are on Windows the path should be similar to the one below.

C:\ProgramData\NVIDIA Corporation\NVIDIA GPU Computing SDK 4.2\OpenCL\common\lib\Win32\OpenCL.lib
Codeblocks Linker Settings
Linking to OpenCL in Codeblocks

Afterwards we have to tell the compiler where the OpenCL headers are located. For the nVidia and Windows 7 case they are located in:

C:\ProgramData\NVIDIA Corporation\NVIDIA GPU Computing SDK 4.2\OpenCL\common\inc

In order to make gcc aware of that, while still inside the project’s build options, navigate to the Search Directories tab and its Compiler subtab and insert the location of the headers as shown in the image below.

OpenCL Headers and Codeblocks
Showing Codeblocks the location of the OpenCL headers

Finally, the last thing we need to do is tell the compiler where the library file is located. That’s pretty easy, since in our case, as we saw above, it is:

C:\ProgramData\NVIDIA Corporation\NVIDIA GPU Computing SDK 4.2\OpenCL\common\lib\Win32

To show that to gcc, just navigate to the Linker subtab and enter the path to the library there, as shown in the picture below.

Codeblocks OpenCL lib Linking
Showing Codeblocks the location of OpenCL lib

With the OpenCL SDK and drivers downloaded and installed and the project settings set up correctly, we are finally ready to run OpenCL code.

Gaussian Blurring project

As mentioned in the beginning, all the code for this tutorial can be found on Github. Just visit the link and download/clone the repo. Then create a new project in the IDE of your choice and link against the OpenCL library as shown above, and also against refu.dll or refu.so (depending on your OS) and the refu headers included in the repository. Remember that at the moment of writing this code has only been tested on Windows, but it should theoretically work on Linux too with minor to no changes. When I find the time I will test it on Linux and adjust the repository so that it’s guaranteed to work on both OSes.

The code that you downloaded creates a program that accepts some arguments and, according to them, blurs bitmap images with both the CPU and the GPU. The arguments are given below:

  • -h   Displays a help message
  • -i   The name of the input image for blurring
  • -g   The size of the gaussian kernel. Default is 3
  • -s   The sigma parameter of the gaussian. Default is 0.8

Bitmap files

The tutorial’s code contains some functions to load and save .bmp bitmap images. It is code that you should use only for this project and is definitely not well written enough for general use. It only works with 24-bits-per-pixel uncompressed bitmap images. The test images included in the repository are all compatible, so I recommend you use them for the purposes of this tutorial.

The functions can be seen below

//! Initializes a bmp image object from a file
//!
//! @param bmp The bmp object to initialize
//! @param filename The name of the file from which to initialize the image
//! @return true for successful initialization and false otherwise
char meImageBMP_Init(ME_ImageBMP* bmp,char* filename);
 
//! @brief A bitmap saving to disk function.
//!
//! Saves the image only as 24 BPP uncompressed bitmap
//! @param bmp The bitmap image
//! @param filename The name of the file to save
int meImageBMP_Save(ME_ImageBMP* bmp,char* filename);

The program uses the first function to read the data of the bitmap, then performs gaussian blurring and in the end saves the blurred image back to disk with the second function.

Blurring on the CPU

Performing Gaussian blurring on the CPU is a sequential process. First a Gaussian kernel (also known as a mask or filter) is created using the following equation.

The 2D Gaussian Function
The 2D Gaussian Function

The result is a square matrix whose greatest value is located at the center; as you gradually progress outwards, the values get smaller. For example, for sigma = 1 and a square gaussian of size 5, the following kernel is generated. You may notice that the values are normalized. We normalize the gaussian because otherwise the final image may contain excessively light or dark areas.

0.002969 0.013306 0.021938 0.013306 0.002969
0.013306 0.059634 0.098320 0.059634 0.013306
0.021938 0.098320 0.162103 0.098320 0.021938
0.013306 0.059634 0.098320 0.059634 0.013306
0.002969 0.013306 0.021938 0.013306 0.002969

Performing gaussian blur is as simple as sliding this kernel along the image and convolving the pixels of the image with the kernel’s values. It is computationally cheaper to convolve two 1-dimensional gaussian kernels with the image, one in the horizontal direction and one in the vertical, but since this article deals with OpenCL that is out of the scope of the tutorial.

Below we can see the actual function that performs the gaussian blurring on the CPU.

//Blurs the given image using the CPU algorithm
char pna_blur_cpu(char* imgname,uint32_t size,float sigma)
{
    uint32_t i,x,y,imgLineSize;
    int32_t center,yOff,xOff;
    float* matrix,value;
    matrix = createGaussianKernel(size,sigma);
    //read the bitmap
    ME_ImageBMP bmp;
    if(meImageBMP_Init(&bmp,imgname)==false)
    {
        printf("Image \"%s\" could not be read as a .BMP file\n",imgname);
        return false;
    }
    //find the size of one line of the image in bytes and the center of the gaussian kernel
    imgLineSize = bmp.imgWidth*3;
    center = size/2;

At the beginning of the function the gaussian kernel is created with the arguments that the user of the program provides at execution. Then the bitmap image is read, and the size of a line in bytes and the center of the gaussian kernel are found.

    //convolve all valid pixels with the gaussian kernel
    for(i = imgLineSize*(size-center)+center*3; i < bmp.imgHeight*bmp.imgWidth*3-imgLineSize*(size-center)-center*3;i++)
    {
        value = 0;
        for(y=0;y<size;y++)
        {
            yOff = imgLineSize*(y-center);
            for(x=0;x<size;x++)
            {
                xOff = 3*(x - center);
                value += matrix[y*size+x]*bmp.imgData[i+xOff+yOff];
            }
        }
        bmp.imgData[i] = value;
    }

Afterwards, as we can see above, we start a loop over all the pixels on which the gaussian kernel can be centered. In this loop each pixel’s new value is the convolution of itself and its surrounding pixels with the gaussian kernel.

    //free memory and save the image
    free(matrix);
    meImageBMP_Save(&bmp,"cpu_blur.bmp");
    return true;

Finally in the end, memory is freed and the blurred image is saved on disk.

Gaussian Blur
Gaussian blur result with size=5 and sigma=0.8

Blurring on the GPU

Blurring on the GPU is a little more complicated but is also the interesting part of the article. This is where we will utilize the OpenCL API. The beginning of the function is of similar fashion to the CPU one.

//Blurs the given image using the GPU
char pna_blur_gpu(char* imgname,uint32_t size,float sigma)
{
    uint32_t imgSize;
    float* matrix;
    cl_int ret;//the openCL error code/s
    //get the image
    ME_ImageBMP bmp;
    meImageBMP_Init(&bmp,imgname);
    imgSize = bmp.imgWidth*bmp.imgHeight*3;
    //create the gaussian kernel
    matrix = createGaussianKernel(size,sigma);
    //create the pointer that will hold the new (blurred) image data
    unsigned char* newData;
    RF_MALLOC(newData,imgSize);

The gaussian kernel is created and the bitmap image is read.

This is where things start to diverge. Let’s see how OpenCL treats your graphics card. OpenCL considers your GPU an OpenCL-enabled device. In order to do non-graphics calculations on this device you need to write a program for it, in a language almost identical to C. It is sent to the graphics card, compiled there, and waits to run whenever we need it to. It is also called a kernel. In the tutorial’s repository the kernel’s code is located inside the “kernel.cl” file, whose contents we will soon examine. For now let’s just see how to use it from the side of our machine.

    // Read in the kernel code into a c string
    FILE* f;
    char* kernelSource;
    size_t kernelSrcSize;
    if( (f = fopen("kernel.cl", "r")) == NULL)
    {
        fprintf(stderr, "Failed to load OpenCL kernel code.\n");
        return false;
    }
    RF_MALLOC(kernelSource,MAX_SOURCE_SIZE)
    kernelSrcSize = fread( kernelSource, 1, MAX_SOURCE_SIZE, f);
    fclose(f);

The code above just opens the file that holds the kernel code and reads it into a buffer, the kernelSource.

Afterwards we are starting to call OpenCL related code as we can see below.

    //Get platform and device information
    cl_platform_id platformID;//will hold the ID of the openCL available platform
    cl_uint platformsN;//will hold the number of openCL available platforms on the machine
    cl_device_id deviceID;//will hold the ID of the openCL device
    cl_uint devicesN; //will hold the number of OpenCL devices in the system
    if(clGetPlatformIDs(1, &platformID, &platformsN) != CL_SUCCESS)
    {
        printf("Could not get the OpenCL Platform IDs\n");
        return false;
    }
    if(clGetDeviceIDs(platformID, CL_DEVICE_TYPE_DEFAULT, 1,&deviceID, &devicesN) != CL_SUCCESS)
    {
        printf("Could not get the system's OpenCL device\n");
        return false;
    }

At first we call the clGetPlatformIDs() function to get the ID of the OpenCL platform, and then the clGetDeviceIDs() function to get the ID of the OpenCL device, if any is available on the machine. Even though the arguments should be self-explanatory, feel free to visit the documentation of the API functions to learn more about each one of them.

// Create an OpenCL context
    cl_context context = clCreateContext( NULL, 1, &deviceID, NULL, NULL, &ret);
    if(ret != CL_SUCCESS)
    {
        printf("Could not create a valid OpenCL context\n");
        return false;
    }
    // Create a command queue
    cl_command_queue cmdQueue = clCreateCommandQueue(context, deviceID, 0, &ret);
    if(ret != CL_SUCCESS)
    {
        printf("Could not create an OpenCL Command Queue\n");
        return false;
    }

Following that, we have to create an OpenCL context, just like we would with OpenGL, using the clCreateContext() function. As you can observe, the context is created for the device whose ID we obtained before. Finally, a command queue is created using the clCreateCommandQueue() function; OpenCL commands will be entered into it and executed.

The bitmap image and the gaussian kernels are basically buffers of bytes allocated with malloc() on the side of our machine. In order to send them to the GPU we have to create buffers in the Graphics memory. They are called (just like in OpenGL) buffer objects.

 /// Create memory buffers on the device for the two images
    cl_mem gpuImg = clCreateBuffer(context,CL_MEM_READ_ONLY,imgSize,NULL,&ret);
    if(ret != CL_SUCCESS)
    {
        printf("Unable to create the GPU image buffer object\n");
        return false;
    }
    cl_mem gpuGaussian = clCreateBuffer(context,CL_MEM_READ_ONLY,size*size*sizeof(float),NULL,&ret);
    if(ret != CL_SUCCESS)
    {
        printf("Unable to create the GPU gaussian kernel buffer object\n");
        return false;
    }
    cl_mem gpuNewImg = clCreateBuffer(context,CL_MEM_WRITE_ONLY,imgSize,NULL,&ret);
    if(ret != CL_SUCCESS)
    {
        printf("Unable to create the GPU output image buffer object\n");
        return false;
    }

The buffer objects are created using the clCreateBuffer() function. The image data and the gaussian kernel buffers are created with the CL_MEM_READ_ONLY flag since the GPU kernel program will only read from these buffers and the blurred image to be returned from the GPU is created with the CL_MEM_WRITE_ONLY flag since it will only be populated by the GPU kernel program.

After the buffers have been created in graphics memory, we enqueue writes of the data we want to send to them into the command queue that we created before.

    //Copy the image data and the gaussian kernel to the memory buffer
    if(clEnqueueWriteBuffer(cmdQueue, gpuImg, CL_TRUE, 0,imgSize,bmp.imgData, 0, NULL, NULL) != CL_SUCCESS)
    {
        printf("Error during sending the image data to the OpenCL buffer\n");
        return false;
    }
    if(clEnqueueWriteBuffer(cmdQueue, gpuGaussian, CL_TRUE, 0,size*size*sizeof(float),matrix, 0, NULL, NULL) != CL_SUCCESS)
    {
        printf("Error during sending the gaussian kernel to the OpenCL buffer\n");
        return false;
    }

This is accomplished using the clEnqueueWriteBuffer() function as we can observe from the above code.

At this point we should talk about the kernel, the code that runs on the GPU. As we said in our example code this is contained inside “kernel.cl”. Let’s start taking a look at its code below.

//The Gaussian blur function that runs on the gpu
__kernel void gaussian_blur(__global const unsigned char *image, __global const float* G, const int W,const int H,const int size, __global unsigned char* newImg) 
{ 
	unsigned int x,y,imgLineSize;
	float value;
	int i,xOff,yOff,center;
    //Get the index of the current element being processed
    i = get_global_id(0);

As you can immediately notice, the kernel code is indeed C, with a few distinct differences. The function is preceded by the __kernel keyword, which marks it as an entry point of the program that the host can launch; this is what runs on each GPU core. As far as the function’s arguments are concerned, image is the bitmap image buffer, G is the gaussian kernel, W and H are the image’s width and height respectively, size is the size of the gaussian kernel and newImg is the buffer where the blurred image will be returned. Arguments preceded by the __global keyword refer to memory objects allocated from the global memory pool. A final noticeable part is the call to get_global_id(), which returns the index of the work-item that is being processed in this particular call.

Further below in the GPU kernel program the code becomes quite similar to the CPU one.

	//Calculate some needed variables
	imgLineSize = W*3;
	center = size/2;
	//pass the pixel through the kernel if it can be centered inside it
	if(i >= imgLineSize*(size-center)+center*3 &&
	   i < W*H*3-imgLineSize*(size-center)-center*3)
	{
		value=0;
	  for(y=0;y<size;y++)
          {
            yOff = imgLineSize*(y-center);
            for(x=0;x<size;x++)
            {
                xOff = 3*(x-center);
                value += G[y*size+x]*image[i+xOff+yOff];
            }
          }
          newImg[i] = value;
	}
	else//if it's in the edge keep the same value
	{
		newImg[i] = image[i];
	}

The main differences are that there is no outer for() loop, since the index is known from get_global_id(), and that for the edge pixels the original value has to be preserved.

Returning to the source code of our example that runs on the machine, we will attempt to build the “kernel.cl” program from the source that we read from the file into the kernelSource buffer at the beginning of the function.

    //Create a program object and associate it with the kernel's source code.
    cl_program program = clCreateProgramWithSource(context, 1,(const char **)&kernelSource, (const size_t *)&kernelSrcSize, &ret);
    free(kernelSource);
    if(ret != CL_SUCCESS)
    {
        printf("Error in creating an OpenCL program object\n");
        return false;
    }
    //Build the created OpenCL program
    if((ret = clBuildProgram(program, 1, &deviceID, NULL, NULL, NULL))!= CL_SUCCESS)
    {
        printf("Failed to build the OpenCL program\n");
        //create the log string and show it to the user. Then quit
        char* buildLog;
        RF_MALLOC(buildLog,MAX_LOG_SIZE);
        if(clGetProgramBuildInfo(program,deviceID,CL_PROGRAM_BUILD_LOG,MAX_LOG_SIZE,buildLog,NULL) != CL_SUCCESS)
        {
            printf("Could not get any Build info from OpenCL\n");
            free(buildLog);
            return false;
        }
        printf("**BUILD LOG**\n%s",buildLog);
        free(buildLog);
        return false;
    }

The program is created from source with the call to the clCreateProgramWithSource() function. Right after that we attempt to build the newly created program with the clBuildProgram() function.

If there is a problem building the program, then by calling the clGetProgramBuildInfo() function with the CL_PROGRAM_BUILD_LOG argument for the program that failed to build, we can obtain a build log containing any build errors/warnings, which we can then print to show the user what the problem appears to be.

If the program is built successfully, we have to create the kernel of the program and set its arguments.

// Create the OpenCL kernel. This is basically one function of the program declared with the __kernel qualifier
    cl_kernel kernel = clCreateKernel(program, "gaussian_blur", &ret);
    if(ret != CL_SUCCESS)
    {
        printf("Failed to create the OpenCL Kernel from the built program\n");
        return false;
    }
    ///Set the arguments of the kernel
    if(clSetKernelArg(kernel, 0, sizeof(cl_mem), (void *)&gpuImg) != CL_SUCCESS)
    {
        printf("Could not set the kernel's \"gpuImg\" argument\n");
        return false;
    }
    if(clSetKernelArg(kernel, 1, sizeof(cl_mem), (void *)&gpuGaussian) != CL_SUCCESS)
    {
        printf("Could not set the kernel's \"gpuGaussian\" argument\n");
        return false;
    }
    if(clSetKernelArg(kernel, 2, sizeof(int), (void *)&bmp.imgWidth) != CL_SUCCESS)
    {
        printf("Could not set the kernel's \"imageWidth\" argument\n");
        return false;
    }
    if(clSetKernelArg(kernel, 3, sizeof(int), (void *)&bmp.imgHeight) != CL_SUCCESS)
    {
        printf("Could not set the kernel's \"imgHeight\" argument\n");
        return false;
    }
    if(clSetKernelArg(kernel,4,sizeof(int),(void*)&size) != CL_SUCCESS)
    {
        printf("Could not set the kernel's \"gaussian size\" argument\n");
        return false;
    }
    if(clSetKernelArg(kernel, 5, sizeof(cl_mem), (void *)&gpuNewImg) != CL_SUCCESS)
    {
        printf("Could not set the kernel's \"gpuNewImg\" argument\n");
        return false;
    }

That is accomplished, as we can see above, by calling the clCreateKernel() function on the successfully built program. The second argument is the name of the kernel function: the function of the program sent to the GPU that is preceded by __kernel.

After creating the kernel we also have to set its arguments, one by one. This is accomplished by calling clSetKernelArg() for each argument of the kernel. The arguments of this function are easy to deduce, especially if you visit the link to its documentation. Something that requires attention here is that the buffers (image data, gaussian kernel and returned image) are not set to point at the machine’s buffer but at the memory buffer objects that have been created on the graphics memory side for them.

Finally we come to the point where we actually call on the GPU to act on the compiled kernel program.

    ///enqueue the kernel into the OpenCL device for execution
    size_t globalWorkItemSize = imgSize;//the total size of 1 dimension of the work items. Basically the whole image buffer size
    size_t workGroupSize = 64; //The size of one work group
    ret = clEnqueueNDRangeKernel(cmdQueue, kernel, 1, NULL, &globalWorkItemSize, &workGroupSize,0, NULL, NULL);

This is accomplished by enqueuing the kernel into the OpenCL device with a call to clEnqueueNDRangeKernel(). This function also chooses the total number of work items for this operation, which in our case is the size of the image (imgSize), and the number of items in a work group. For a more detailed explanation of the arguments visit the function’s documentation link.

After the kernel is executed we need to read the results of the blurring.

///Read the memory buffer of the new image on the device to the new Data local variable
    ret = clEnqueueReadBuffer(cmdQueue, gpuNewImg, CL_TRUE, 0,imgSize, newData, 0, NULL, NULL);

This is accomplished by a call to the function clEnqueueReadBuffer() with arguments similar to that of clEnqueueWriteBuffer() but this time we are writing to the newData buffer allocated at the beginning of the function.

Finally at the end, all OpenCL related structures and buffers have to be cleaned up and the resulting blurred image saved to the disk.

///Clean up everything
    free(matrix);
    clFlush(cmdQueue);
    clFinish(cmdQueue);
    clReleaseKernel(kernel);
    clReleaseProgram(program);
    clReleaseMemObject(gpuImg);
    clReleaseMemObject(gpuGaussian);
    clReleaseMemObject(gpuNewImg);
    clReleaseCommandQueue(cmdQueue);
    clReleaseContext(context);
    ///save the new image and return success
    bmp.imgData = newData;
    meImageBMP_Save(&bmp,"gpu_blur.bmp");
    return true;

Analyzing the results

Well, since we went to all the trouble of making this small OpenCL demo, why don’t we measure its performance and compare the CPU and GPU on different sample images? This is what we will see in this section. The images used are all included in the repository. These tests were run on a Windows 7 machine equipped with an nVidia GeForce 320M graphics card. The gaussian kernel used in all of them is the same: a square kernel with a size of 5 and a sigma value of 0.8.

(edit: 2 years later, the timer code has been removed from the repository. Feel free to replace it with your own timing functions.)

In order to perform these tests we measure the performance of the CPU and GPU blurring functions using timers.

 //start a timer to time the CPU performance
    RF_Timer timer;
    rfTimer_Init(&timer,RF_TIMER_MICROSECONDS);
    //perform CPU blurring
    if(pna_blur_cpu(imgname,gaussianSize,gaussianSigma)==false)//time it
        return -2;
    //query the timer for the CPU
    units  = rfTimer_Query(&timer,RF_TIMER_MICROSECONDS);
    printf("CPU Gaussian blurring took %f microseconds\n",units);
    //reset the timer for the GPU
    rfTimer_Query(&timer,RF_TIMER_MICROSECONDS);
    //perform GPU blurring and then read the timer
    if(pna_blur_gpu(imgname,gaussianSize,gaussianSigma)==false)
        return -3;
    units  = rfTimer_Query(&timer,RF_TIMER_MICROSECONDS);//time it
    printf("GPU Gaussian blurring took %f microseconds\n",units);

As can be seen above, a new RF_Timer is initialized and starts counting right before the blurring functions with a call to the rfTimer_Init() function. After each blurring call finishes we query the timer with rfTimer_Query() to learn how much time elapsed in microseconds.

Below we can see the results of running this experiment for all 4 sample images.

CPU vs GPU
Comparing CPU and GPU performance on blurring

What is immediately noticeable is that as the size of the image increases, so does the benefit of using OpenCL and the GPU to parallelize the blurring process. Conversely, as the image gets smaller the benefits fade, and for a 0.1 MB bitmap it is actually faster to perform Gaussian blurring on the CPU than on the GPU.

Why do smaller image sizes mean fewer parallelization benefits? Well, there are many reasons, but the main one is the added communication overhead. When using the GPU to perform parallel operations we lose a lot of precious time compiling the kernel program, creating the buffers in graphics memory, sending the contents of the buffers to graphics memory and reading back the results.

Total performance vs communication overhead
Comparing GPU total time with communication overhead

In the graph above we can see the total GPU time taken to blur an image (blue bar) compared to how much of it is spent communicating to/from the GPU (red bar). Naturally, as the data gets bigger the communication takes up a bigger part of the computation because of the size of the created buffers.

A surprising result is that even though for the smaller image sizes the communication overhead shrinks to a very small percentage of the total GPU calculation time, as we can deduce from the first graph it, along with the differences between the CPU and GPU implementations, is enough to make blurring faster on the CPU for those sizes.

And thus concludes the post on playing with OpenCL and Gaussian blurring. It turned out to be much bigger than I originally planned. I will be awaiting your comments, feedback and constructive criticism, either in the comments section below or by email at: lefteris *at* refu *dot* co