Counting pairs of numbers with a given difference out of a set

This is just a post detailing the solution to a small algorithmic challenge from HackerRank. The exercise is to find the pairs of numbers out of a set of numbers N that have difference K. The link to the actual HackerRank problem can be seen here. I am not sharing my solution because it’s unique or even good. There may be much better solutions out there and in fact if someone knows something better I encourage you to comment on it so that both me and the readers of this blog can learn something new!

All of the code mentioned in this tutorial can be cloned from Github

The actual text of the challenge is given below:

Given N numbers [N<=10^5], count the total pairs of numbers that have a difference of K. [K>0 and K<1e9]

Input Format:

1st line contains N & K (integers).
2nd line contains N numbers of the set. All the N numbers are assured to be distinct.

Output Format:

One integer saying the no of pairs of numbers that have a diff K.

Sample Input #00:

5 2
1 5 3 4 2

Sample Output #00:

3


Now with challenges like this, speed is what is important. We want our program to solve the problem for all possible input in the least amount of computational time. I will be using C to solve this problem. First of all we have to figure out a way to read in the input from the stdin as is described in the problem definition.

1
2
3
4
5
6
7
8
9
10
int N,K,*set,i=0,j;
char* buff,* token;
int result = 0;
 
 
//read 1st input line
scanf("%d %d \n",&N,&K);
//allocate space for the set numbers string and the actual set
buff = malloc(13*N);//13 here is for ULONG_MAX number of digits + space
set = malloc(sizeof(int)*N);

Above is the beginning of main() for the problem. N and K are the numbers given in the problem definition. set will be an array holding all of our numbers while buff will be a buffer holding the string of the second line from which to read the numbers into set. After reading both N and K with scanf() we allocate the buffer to hold the string. It may be a bit sloppy doing it like that but here we allocate 13 bytes for each number. This is an approximation judging from the typical maximum value of ULONG_MAX as shown here, plus the space separating each number. It should be safe since we are using ints and ULONG_MAX is for unsigned ints. We also allocate the array of numbers.

//read second line of input
gets(buff);
//find the individual numbers
token = strtok(buff," ");
while(token != NULL)
{
    sscanf(token,"%d",&set[i]);
    token = strtok(NULL," ");
    i++;
}

Further down in reading the input we can see from above that we read the second input line into the buffer. Then since we know from the problem definition that the numbers are space separated we can simply use libc’s strtok() function to obtain each number of the set. This is what is accomplished by the while loop in there. It tokenizes the string located inside buff until all of the numbers are read into set.

Now since we have read all of the input we have to decide on the strategy with which to determine how many pairs of numbers have difference of K. The naive approach would be to simply go through the array NxN times and count the difference of all numbers increasing a counter for those which match K. Something like below

//naive approach
for(i=0;i<N;i++)
{
    for(j=0;j<N;j++)
    {
        if( abs(set[i] - set[j]) == K)
            result++;
    }
}

This approach is not very good and it will actually fail the test in the last 2 data sets in HackerRank. The complexity here is O(N2) and is not good. We can do better than that! For reference let me mention the processing times that HackerRank returned for this approach:

Status: Wrong Answer

# 0 0.06s : Success

# 1 0.08s : Success

# 2 0.14s : Success

# 3 0.23s : Success

# 4 0.36s : Success

# 5 0.54s : Success

# 6 0.74s : Success

# 7 0.98s : Success

# 8 1.26s : Success

# 9 1.58s : Success

# 10 1.93s : Success

# 11 2.32s : Success

# 12 2.74s : Success

# 13 3.22s : Time limit exceeded

# 14 3.22s : Time limit exceeded

So what could an alternate way to approach this problem be? Well let’s see! We have an array of numbers and these numbers are also guaranteed to be positive. The relation between the numbers that we are trying to explore is their difference. What operation on the array could allow us to deal with the difference in a much better way? Yes you guessed it correctly! We can sort the array! By having a sorted array of numbers we can easily get the differences between numbers without wasting too much time.

Counting pairs of numbers with a given difference out of a set of numbers

//comparison function for qsort()
int compare(const void* a, const void* b)
{
    //a less than b
    if( *(int*)a < *(int*)b)
        return -1;
    //a greater than b
    return 1;
    //we don't take into account a == b since according to the problem definition it is impossible
}
//sort the array
qsort(set,N,sizeof(int),compare);

As we can see from the above code the C standard library includes an implementation of quicksort as long as you #include <stdlib.h>. Since C has no templates you need to implement a comparison function using the signature int compare(const void* a, const void* b) which will be returning -1 in case of a being less than b, 0 in case they are equal and 1 in case a is greater than b. But in our comparison implementation we omit one of those checks. The problem definition clearly states that the numbers in the array are unique and also that the difference K can’t be 0. So there is no need for a and b being equal comparison.

So afterwards we can call qsort() on the array which will sort it in place. For more information on stdlib quicksort implementation take a look at the relevant documentation.

int diff;
for(i=0;i<N;i++)
{
    for(j=i+1;j<N;j++)
    {
        diff = set[j]-set[i];
        if(diff > K)
            break;
        if( diff == K)
        {
            result++;
            break;
        }
    }
}
printf("%d",result);

And here is the actual counting part of this approach. We check each number of the array against each other trying to find number pairs which satisfy the difference of K. Thanks to the sorting of the array we also know that if we go to differences greater than K we can safely proceed to the next number since all subsequent numbers would be greater and so would the difference. In addition, we don’t need to be checking the previous numbers that were already checked and we know that since K>0, the second for() loop starts at i+1.

This approach shows a very big difference in the results as can be seen by the details from the HackerRank data sets.



# 0 0.03s : Success

# 1 0.03s : Success

# 2 0.03s : Success

# 3 0.04s : Success

# 4 0.03s : Success

# 5 0.04s : Success

# 6 0.04s : Success

# 7 0.05s : Success

# 8 0.05s : Success

# 9 0.05s : Success

# 10 0.05s : Success

# 11 0.06s : Success

# 12 0.08s : Success

# 13 0.08s : Success

# 14 0.08s : Success


There is a small way to squeeze out a bit more performance for this algorithm though. It does not involve a change of strategy but rather a change of implementation. C stdlib quicksort has one disadvantage. In order to provide for any possible type of input it uses void* and a function pointer for the comparison of the elements of the array that has to be defined by the user. C++ makes up for that with the use of templates but since we are using C there is another way. We can have a dedicated quicksort implementation only for integers. I actually had one lying around and so I inserted it into the code to compare.

int partition(int* a,int left,int right,int pivotI)
{
    int swap,newPivotI,i;
    int pivotValue = a[pivotI];
    //move the pivot to the end
    a[pivotI] = a[right];
    a[right] = pivotValue;
    newPivotI = left;//initialize the new pivot position to the leftmost index
    //find the new pivot position
    for(i=left;i<right;i++)
    {//if the value at the iterated index is less than the pivot value then swap that value
    // with the potential pivot position and increase the potential pivot position
        if(a[i] <= pivotValue)
        {
            swap = a[i];
            a[i] = a[newPivotI];
            a[newPivotI] = swap;
            newPivotI++; //increase the potential pivot index
        }
    }
    //in the end move the pivot to its final position
    a[right] = a[newPivotI];
    a[newPivotI] = pivotValue;
    return newPivotI;
}
 
void quicksort_inplace(int* a,int left,int right)
{
    int pivotI,newPivotI;
    if(left < right)//if the list has at least 2 items
    {
        //choose the pivot
        pivotI = left+(right-left)/2;
        //partition the list
        newPivotI = partition(a,left,right,pivotI);
        //recursively quick sort elements smaller than the pivot
        quicksort_inplace(a,left,newPivotI-1);
        //recursively quick sort elements greater than the pivot
        quicksort_inplace(a,newPivotI+1,right);
    }
}
 
void quicksort(int* a,int size)
{
    quicksort_inplace(a,0,size-1);
}

As can be seen from the code above this is a standard in-place quicksort implementation for integers. It involves a partition() function and a function that’s called recursively on the created partitions to sort the array in place. It’s easy to study the code and I believe it is well commented. If you want more information on the actual algorithm you can find it here.

The only change that needs to be done in the code that was presented before is swapping the call to stdlib’s qsort() with the following:

quicksort(set,N);

The results as can be seen from the HackerRank data set details showed a very small improvement. So maybe for this data and this problem there is an improvement but whether it’s worth the effort is up for discussion.

# 0 0.02s : Success

# 1 0.03s : Success

# 2 0.03s : Success

# 3 0.03s : Success

# 4 0.03s : Success

# 5 0.04s : Success

# 6 0.04s : Success

# 7 0.04s : Success

# 8 0.04s : Success

# 9 0.05s : Success

# 10 0.05s : Success

# 11 0.05s : Success

# 12 0.05s : Success

# 13 0.06s : Success

# 14 0.06s : Success



A final and perhaps not so important improvement would be to remember the last j index that was reached in the inner loop. This way we can compare and see if the difference of the value at i and at the last j is less than K. If it is, then the inner loop can actually start from the next item after the last j.

int diff;
int lastJ=1;
for(i=0;i<N;i++)
{
    j = i+1;
    if(set[lastJ]-set[i]<K)
        j = lastJ+1;
    for(;j<N;j++)
    {
        diff = set[j]-set[i];
        if(diff > K)
            break;
        lastJ = j;
        if( diff == K)
        {
            result++;
            break;
        }
    }
}
printf("%d",result);

Above we can see what I said implemented in code. There is an extra if check but depending on the data some improvement could be gained. But as far as HackerRank’s data is concerned there was no visible improvement when I ran this code through it.

EDIT1:
Christian Sommer made a very useful comment below detailing a very clever approach to this problem! Taking advantage of the fact that all the values are guaranteed to be unique and that we simply want to count the number of the pairs and not the pairs themselves there is an even smarter approach.

If during reading a number M from the input stream we have already read another number satisfying either M+K or M-K then that means that we have found a pair. It’s that simple! The specific code given by Christian is in C++ and utilizes std::set from the STL. It assumes that stream is an ifstream and that we have already read the first line that contains N and K.

int pairCount = 0;
int el;
std::set<int> s;
while(stream >> el) {
if (s.count(el+K) > 0 || s.count(el-K) > 0) ++pairCount;
s.insert(el);
}
return pairCount;

Well that’s all I guess. I simply wanted to share my implementation of a solution to this particular problem. I hope it helped the readers and maybe even taught something new to some people. If a reader has a better and faster solution I encourage you to share it with me in the comments section. I am really looking forward to seeing if there are more awesome ways to approach this problem.

Reading .pdf files comfortably on Kindle

Lately I have been playing around with the idea of having an e-book reader. I read so many papers, academic books and novels that it would be a real big savior if I could keep some of them in a gadget that is made just for that. As far as the academic papers are concerned, it could save whole forests just because I wouldn’t have to actually print them on paper in order to read them.

Just as I was thinking about this a certain special somebody gave me the following kindle paper-white as a gift, one of the best and most appropriately targeted gifts I have ever received.

A kindle paperwhite device

Reading e-books using a kindle paperwhite is a breeze and an amazing experience. You can read them under any possible lighting conditions and without having to worry at all about the battery running out. Moreover they fit nicely to the screen of your e-reader and you can adjust the font-size to fit your preferred reading style. PDF files are, unfortunately, not that easy to work with from the perspective of an e-reader.

All the academic papers that I know and am interested in reading come in .pdf format. As I said above the .pdf format is really not the most suitable format for an e-reader. From my experience in kindle paperwhite you have the ability to pinch-zoom but it is so difficult and so hard to achieve a desired zoom that in my opinion it just isn’t worth it. There is always the possibility to use popular software like Calibre to convert your documents from .pdf to a more e-reader friendly format but in my experience the end result is not very good and does not constitute a pleasant reading experience.

After a little research around the web I came across this very informative post that did an analysis of all the options we have as far as reading .pdfs from an e-reader is concerned. Here I would like to basically extend my succesfull experience of using the last software he presents in his post, the k2pdfopt.

k2pdfopt is a command line tool that allows you to turn any .pdf into yet another .pdf that is adjusted and resized exactly for the dimensions of your particular e-reader device. It also has the ability to use OCR libraries to recognize the text of the .pdf and make it possible for you to take notes on it, highlight the text or even use the dictionary on some words of the PDF text. Let’s see the steps that you have to follow to make it work for you.

  • Download k2pfopt: First of all go and download k2pdfopt from the download link and choose your system. It is available for Windows, Linux and MacOsX!
  • Download Tesseract OCR: You can omit this step if you don’t want to be able to highlight text, recognize it and use the dictionary but why wouldn’t you? It’s very simple to accomplish. Go to the Tesseract download page and choose the appropriate language data file for your language. For example in my case and at the time of writing of this post (the 3.02 version of tesseract was the latest) I downloaded “tesseract-ocr-3.02.eng.tar.gz” for the English language data and “tesseract-ocr-3.02.jpn.tar.gz” for the japanese OCR data.
  • Installing Tesseract OCR: Put all the language data inside a directory in your computer. Let’s assume that this directory is Path/To/Tesseract/. Now depending on your Operating system you will have to create and set an environment variable called TESSDATA_PREFIX. Set this variable to the value of the directory you keep the downloaded data, in our case Path/To/Tesseract/ and you will be set to go. Remember that in Windows you may need to restart your system after setting the environment variable. For a more in-depth analysis of how to do this check the k2pdfopt OCR page and the Tesseract Read-me site.

After these steps are done you are ready to use the software to make those nice .pdfs nicely viewable in your kindle or any other e-reader device you may have. All you have to do is open a terminal window and call the k2pdfopt program with the right parameters. Look below for an example invocation of the program

k2pdfopt document.pdf -ocr -ocrlang eng -dev kpw -bp -f2p -1

Let us analyze the call to the program a bit here.

  • document.pdf: This is the input .pdf file we would like to convert for comfortable reading in the e-reader device.
  • -ocr:This option enables Optical Character Recognition (OCR) with the Tesseract engine that we downloaded above.
  • -ocrlang eng: This option selects the OCR language that Tesseract will use. If for example you had a Japanese text you would have to use -ocrlang jpn and the program would perform OCR on the Japanese text. It works, I have tried it.
  • -dev kpw: This option selects the resolution of the device that we would like the new .pdf to be optimized for. In the example above I used the value kpw which stands for Kindle Paper White but the program offers many other precomputed values for various e-reader devices such as k2 for Kindle 2 and nookst for Nook Simple Touch. In the worst case you can specify the dimensions of your e-reader manually via the -w and -h options.
  • -bp: This is a very important option that instructs the program to force break the pages of the output when the input document has a page break. You need this option turned on unless you like having the output document having page breaks in random places.
  • -f2p <val>: Fit-to-page option (Taken directly from the program’s documentation). The quantity controls fitting tall or small contiguous objects (like figures or photographs) to the device screen. Normally these are fit to the width of the device, but if they are too small or too tall, then if =10, for example, they are allowed to be 10% wider (if too small) or narrower (if too tall) than the screen in order to fit better. Use -1 to fit the object no matter what. Use -2 as a special case–all “red-boxed” regions (see -sm option) are placed one per page. Default is -f2p 0. See also -jf. Note: -f2p -2 will automatically also set -vb -2 to exactly preserve the spacing in the red-boxed region. If you want to compress the vertical spacing in the red-boxed region, use -f2p -2 -vb -1.

After we run this command we will get another pdf file perfectly formatted and OCRed for our e-reader device. It will have exactly the same name as the input document only with a _k2opt suffix appended to it. All you have to do is transfer it to your e-reader and enjoy reading and learning.

I hope this post comes of use to some of you in the same position as me, trying to figure out a way to utilize your kindle/e-reader to make your research and reading activities easier. If you have any comments, suggested feedback or questions do not hesitate to leave a comment below.