Reading .pdf files comfortably on Kindle

In this post we look at a way to read PDF files on a Kindle or other e-reader almost as comfortably as a native format like .mobi or .epub. We explain how to use a piece of software called k2pdfopt to achieve that. Steps to install this software along with the Google Tesseract OCR engine are included. An example invocation is also given, with an explanation of each parameter, so that you can learn to convert your own PDF files for comfortable reading on your e-reader.

Lately I have been playing around with the idea of getting an e-book reader. I read so many papers, academic books and novels that it would be a real lifesaver if I could keep some of them on a gadget made just for that. As far as academic papers are concerned, it could save whole forests simply because I wouldn’t have to print them on paper in order to read them.

Just as I was thinking about this, a certain special somebody gave me the following Kindle Paperwhite as a gift, one of the best and most appropriately targeted gifts I have ever received.

A Kindle Paperwhite device

Reading e-books on a Kindle Paperwhite is a breeze and an amazing experience. You can read them under any possible lighting conditions and without having to worry at all about the battery running out. Moreover, they fit the screen of your e-reader nicely and you can adjust the font size to suit your preferred reading style. PDF files are, unfortunately, not that easy to work with on an e-reader.

All the academic papers that I know of and am interested in reading come in .pdf format. As I said above, .pdf is really not the most suitable format for an e-reader. In my experience the Kindle Paperwhite lets you pinch-zoom, but it is so hard to reach the zoom level you want that in my opinion it just isn’t worth it. You can always use popular software like Calibre to convert your documents from .pdf to a more e-reader-friendly format, but in my experience the end result is not very good and does not make for a pleasant reading experience.

After a little research around the web I came across a very informative post that analyzed all the options we have for reading .pdfs on an e-reader. Here I would like to build on my successful experience with the last piece of software presented in that post, k2pdfopt.

k2pdfopt is a command-line tool that turns any .pdf into another .pdf that is adjusted and resized for the dimensions of your particular e-reader device. It can also use OCR libraries to recognize the text of the .pdf, which makes it possible to take notes on it, highlight text or even look up words in the dictionary. Let’s see the steps that you have to follow to make it work for you.

  • Download k2pdfopt: First of all, download k2pdfopt from the download link, choosing the version for your system. It is available for Windows, Linux and Mac OS X!
  • Download Tesseract OCR: You can omit this step if you don’t want to be able to highlight text, recognize it and use the dictionary, but why wouldn’t you? It’s very simple to accomplish. Go to the Tesseract download page and choose the appropriate language data file for your language. For example, at the time of writing (when 3.02 was the latest version of Tesseract) I downloaded “tesseract-ocr-3.02.eng.tar.gz” for the English language data and “tesseract-ocr-3.02.jpn.tar.gz” for the Japanese OCR data.
  • Install Tesseract OCR: Put all the language data inside a directory on your computer. Let’s assume that this directory is Path/To/Tesseract/. Now, depending on your operating system, you will have to create an environment variable called TESSDATA_PREFIX. Set this variable to the directory where you keep the downloaded data, in our case Path/To/Tesseract/, and you will be good to go. Remember that on Windows you may need to restart your system after setting the environment variable. For a more in-depth explanation of how to do this, check the k2pdfopt OCR page and the Tesseract README page.
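
On Linux or Mac OS X, the environment variable from the last step can be set from a terminal roughly as follows. This is only a minimal sketch: Path/To/Tesseract/ is the placeholder directory used above, not a real location, and the variable lasts only for the current terminal session unless you also add the export line to your shell's startup file.

```shell
# Placeholder directory from the step above; replace it with the folder
# where you actually put the Tesseract language data.
TESSDATA_PREFIX="Path/To/Tesseract/"

# Export it so that programs started from this terminal, such as k2pdfopt,
# can see it.
export TESSDATA_PREFIX

# Print the value to confirm that it is set.
echo "TESSDATA_PREFIX=$TESSDATA_PREFIX"
```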

After these steps are done you are ready to use the software to make those .pdfs nicely viewable on your Kindle or any other e-reader device you may have. All you have to do is open a terminal window and call the k2pdfopt program with the right parameters. Here is an example invocation of the program:

k2pdfopt document.pdf -ocr -ocrlang eng -dev kpw -bp -f2p -1

Let us go through this call piece by piece.

  • document.pdf: This is the input .pdf file we would like to convert for comfortable reading in the e-reader device.
  • -ocr: This option enables Optical Character Recognition (OCR) with the Tesseract engine that we downloaded above.
  • -ocrlang eng: This option selects the OCR language that Tesseract will use. If, for example, you had a Japanese text, you would use -ocrlang jpn and the program would perform OCR on the Japanese text. It works; I have tried it.
  • -dev kpw: This option selects the device whose resolution the new .pdf should be optimized for. In the example above I used the value kpw, which stands for Kindle Paperwhite, but the program offers many other predefined values for various e-reader devices, such as k2 for Kindle 2 and nookst for Nook Simple Touch. If your device is not listed, you can specify its dimensions manually via the -w and -h options.
  • -bp: This is a very important option that instructs the program to break the pages of the output wherever the input document has a page break. You need this option turned on unless you like the output document having page breaks in random places.
  • -f2p <val>: Fit-to-page option (taken directly from the program’s documentation). The quantity <val> controls fitting tall or small contiguous objects (like figures or photographs) to the device screen. Normally these are fit to the width of the device, but if they are too small or too tall, then if <val>=10, for example, they are allowed to be 10% wider (if too small) or narrower (if too tall) than the screen in order to fit better. Use -1 to fit the object no matter what. Use -2 as a special case: all “red-boxed” regions (see -sm option) are placed one per page. Default is -f2p 0. See also -jf. Note: -f2p -2 will automatically also set -vb -2 to exactly preserve the spacing in the red-boxed region. If you want to compress the vertical spacing in the red-boxed region, use -f2p -2 -vb -1. The example above uses -f2p -1.

After we run this command we will get another PDF file, formatted and OCRed for our e-reader device. It will have the same name as the input document, only with a _k2opt suffix appended to it. All you have to do is transfer it to your e-reader and enjoy reading and learning.
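
If you have a whole folder of papers, the same call can be repeated for each file. The following is a rough sketch rather than a feature of k2pdfopt itself: the function names and the DRY_RUN variable are made up for illustration, and it assumes that k2pdfopt is on your PATH and that the output file is named as described above. By default it only prints the commands so that you can check them first. When run for real, k2pdfopt may wait for a key press on each file.

```shell
#!/bin/sh
# Expected name of the converted file, following the description above:
# the input name with _k2opt appended before the .pdf extension.
k2opt_name() {
    printf '%s_k2opt.pdf\n' "${1%.pdf}"
}

# Print the command line used in this post for one input file.
k2opt_cmd() {
    printf 'k2pdfopt "%s" -ocr -ocrlang eng -dev kpw -bp -f2p -1\n' "$1"
}

# Hypothetical switch: 1 (the default) only prints the commands,
# anything else actually runs k2pdfopt.
DRY_RUN="${DRY_RUN:-1}"

for pdf in ./*.pdf; do
    # Skip the unexpanded pattern when the folder holds no PDF files.
    [ -e "$pdf" ] || continue
    # Skip files that are themselves the output of an earlier run.
    case "$pdf" in *_k2opt.pdf) continue ;; esac
    # Skip inputs that already have a converted counterpart.
    if [ -e "$(k2opt_name "$pdf")" ]; then continue; fi

    if [ "$DRY_RUN" = 1 ]; then
        k2opt_cmd "$pdf"
    else
        k2pdfopt "$pdf" -ocr -ocrlang eng -dev kpw -bp -f2p -1
    fi
done
```

Saved as, say, convert_all.sh (a made-up name), running `sh convert_all.sh` lists what would be converted, and `DRY_RUN=0 sh convert_all.sh` performs the conversions.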

I hope this post is of use to those of you in the same position as me, trying to figure out a way to use your Kindle or other e-reader to make your research and reading easier. If you have any comments, feedback or questions, do not hesitate to leave a comment below.