Makoto Development Journal 2: Character Strings

This post provides some details on the development of the Makoto Engine character strings. It is out of date: since it was written, those strings have been moved into the Refu library as the Refu Strings.

This is another post about the Makoto Engine, which is being developed in whatever free time my graduate studies at Tokyo University allow. In this post I would like to address the topic of character strings. We have all tried to use std::string and know how frustrating its deficiencies can be. In my opinion it is not well designed and does not provide enough functionality. Over the years I have used many string classes, such as wxString from the wxWidgets library, QString from the Qt library and the Boost library’s string. Of these, I would say the most useful has been QString, simply because of how well it integrates with the rest of Qt, with the Boost library’s string a close second. The Boost string library is very complete and provides an enormous number of text-processing functions for strings.

But in the case of the Makoto Engine, smooth integration with the rest of the engine was a top priority, and a complete string library that can be extended as needed was a requirement. Moreover, since the Makoto Engine is mostly written in C, the engine needs strings that are usable from C rather than a C++ string class. For that reason I chose the “Refu Library”, a library I have been developing for a very long time for use in my personal projects. It contains code that I use in more than one project, such as a complete string library, an XML reader, threads, etc. A bit of trivia: I named the library after the nickname the Japanese gave me to shorten my name when I first visited Japan on a youth exchange program. In every part of the Refu library I strive for optimality and speed, but I do not claim that it is better than any of the alternatives out there as far as strings are concerned. It is simply the library I have worked with for years and feel most comfortable with, not to mention that I can make any change I like at will.

So let us see how the String is defined. In C the string is defined as a struct holding a pointer to an array of bytes. There are two types of Strings, the normal String and StringX, both shown below in their C versions. The normal String’s internal representation is as minimal as that of a C string, which saves space and keeps operations fast. StringX, on the other hand, contains two additional fields which prove quite useful if you are manipulating frequently changing text or parsing text from a file. So for text editing StringX is the better choice.

typedef struct RF_String
{
    //! The string's data
    char* bytes;
}RF_String;
 
typedef struct RF_StringX
{
    //! The string's data
    char* bytes;
    //! The buffer index, denotes how far from the start of the buffer the start of the string has moved
    int bIndex;
    //! The size of the buffer allocated for this extended String
    int bSize;
}RF_StringX;
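
To make the role of the two extra fields more concrete, here is a minimal sketch. It assumes that moving the start of the string simply advances `bytes` and records the distance in `bIndex`, which is my reading of the field comments above. `Demo_StringX` and all `demo_` functions are hypothetical names used for illustration only and are not part of the Refu API; the real library may manage its buffer differently.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for RF_StringX with the same three fields. */
typedef struct Demo_StringX
{
    char* bytes;  /* current start of the string inside the buffer */
    int bIndex;   /* how far bytes has moved from the buffer start */
    int bSize;    /* total size of the allocated buffer in bytes */
} Demo_StringX;

/* Allocates a buffer with some spare room and copies the text into it.
   Returns 0 on success and -1 if the allocation fails. */
int demo_stringx_init(Demo_StringX* s, const char* text)
{
    size_t len;
    assert(s != NULL && text != NULL);
    len = strlen(text);
    s->bSize = (int)(len * 2 + 1);
    s->bytes = malloc((size_t)s->bSize);
    if (s->bytes == NULL)
        return -1;
    memcpy(s->bytes, text, len + 1);
    s->bIndex = 0;
    return 0;
}

/* Moves the start of the string n bytes forward without copying anything. */
void demo_stringx_move_forward(Demo_StringX* s, int n)
{
    assert(s != NULL && n >= 0 && (size_t)n <= strlen(s->bytes));
    s->bytes += n;
    s->bIndex += n;
}

/* Restores the start of the string to the start of the buffer. */
void demo_stringx_reset(Demo_StringX* s)
{
    assert(s != NULL);
    s->bytes -= s->bIndex;
    s->bIndex = 0;
}

/* Frees the whole buffer, wherever the string start currently is. */
void demo_stringx_deinit(Demo_StringX* s)
{
    assert(s != NULL);
    free(s->bytes - s->bIndex);
    s->bytes = NULL;
    s->bIndex = 0;
    s->bSize = 0;
}
```

With this layout, skipping over already parsed text costs two additions instead of a copy, which is the kind of saving that makes an extended string attractive for parsing.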

Of course a String library would not even deserve the name if it did not support Unicode. Refu Strings are Unicode by default. In the past I kept a separate option for ASCII strings, but I decided against it since I saw no use for it. The internal representation is UTF-8, and functions exist to convert to and from the other Unicode encodings. Why choose UTF-8 over the other encodings?

The advantages of UTF-8 are many; for a complete list do check the Wikipedia article on UTF-8. Of those, the three biggest advantages in my opinion are:

  • In UTF-8 all the ASCII characters are represented with the same bytes as in ASCII. A text written in English by a program totally oblivious to the existence of Unicode can still be read by the String, and an application written a decade ago can still read UTF-8 strings that contain only ASCII characters.
  • UTF-8 is the most memory-efficient Unicode encoding for text that consists mostly of ASCII characters, so saving memory is a big advantage. The main memory disadvantage is in East Asian languages, where most characters take 2 bytes in UTF-16 but 3 bytes in UTF-8.
  • But above all, the biggest advantage, and one that makes a particularly big difference in a String library, is that the end-of-string character ‘\0’ is represented by a single zero byte and not two as in UTF-16 or four as in UTF-32. That is the same as in ASCII, which in turn means that all the original C string functions can work with UTF-8 strings, albeit being oblivious to the actual number of characters inside the string.

For example, the functions below will work:

#include <string.h>

//assume we got two strings encoded in UTF-8
char* str = func_that_inits_string_utf8();
char* otherstr = func_that_inits_another_string_utf8();
 
//The string comparison function still compares the strings byte by byte, oblivious to the fact that they are UTF-8.
//It returns 0 when the two strings are identical
int cmp = strcmp(str,otherstr);
 
//Will actually get the BYTE length of the string. Be careful, this is not the character length, since we would
//need to traverse it as UTF-8 and count chars with a special function to get that one
size_t byteLength = strlen(str);
 
//Will copy "str" into otherstr successfully, byte by byte, provided the buffer of otherstr is large enough
strcpy(otherstr,str);
 
//Works as intended. Will return a pointer to the position of "otherstr" inside "str", or NULL if it is not found
char* pos = strstr(str,otherstr);

So, as you can see, with a little bit of care UTF-8 makes things really easy as far as using the C string functions is concerned.

A Few RF_String functions

Now let’s see some of the actual string functions implemented in the Refu String library. Below we have a function that retrieves the code point, that is the Unicode number, of a character in the String.

//! Retrieves the unicode code point of the parameter character. <i>Can be used with StringX</i>
//! @param thisstr The string whose character code point we need
//! @param c The character index whose unicode code point to return. Must be a positive (including zero) integer.
//! @return Returns the code point or OPERATION_FAILURE in case of character index out of bounds
int rfString_GetChar(RF_String* thisstr,unsigned int c);
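
The post does not show how the lookup is implemented, so the following is only a sketch of how a code point could be retrieved by character index from a UTF-8 byte array. It assumes well-formed input and does not reject overlong encodings. `demo_utf8_get_char` is a hypothetical name, and `DEMO_FAILURE` is a hypothetical stand-in for OPERATION_FAILURE, whose real value is not given here.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_FAILURE (-1)  /* hypothetical stand-in for OPERATION_FAILURE */

/* Hypothetical helper, not part of the Refu API.
   Decodes the code point of the c-th character (zero based) of a UTF-8
   string, or returns DEMO_FAILURE if c is out of bounds or the bytes at
   that position are malformed. */
int demo_utf8_get_char(const char* s, unsigned int c)
{
    const unsigned char* p = (const unsigned char*)s;
    unsigned int index = 0;
    int codepoint;
    int extra;
    assert(s != NULL);

    /* Walk forward to the lead byte of the requested character. */
    while (*p != '\0' && index < c)
    {
        ++p;
        if ((*p & 0xC0) != 0x80)  /* reached the next lead byte or the end */
            ++index;
    }
    if (*p == '\0')
        return DEMO_FAILURE;      /* character index out of bounds */

    /* The lead byte tells how many continuation bytes follow. */
    if (*p < 0x80)                { codepoint = *p;        extra = 0; }
    else if ((*p & 0xE0) == 0xC0) { codepoint = *p & 0x1F; extra = 1; }
    else if ((*p & 0xF0) == 0xE0) { codepoint = *p & 0x0F; extra = 2; }
    else if ((*p & 0xF8) == 0xF0) { codepoint = *p & 0x07; extra = 3; }
    else return DEMO_FAILURE;     /* not a valid lead byte */

    /* Each continuation byte contributes its low six bits. */
    while (extra-- > 0)
    {
        ++p;
        if ((*p & 0xC0) != 0x80)
            return DEMO_FAILURE;  /* truncated or malformed sequence */
        codepoint = (codepoint << 6) | (*p & 0x3F);
    }
    return codepoint;
}
```

Note that the lookup is linear in the character index, since UTF-8 characters have variable width and the position of the c-th character cannot be computed directly.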

Here we have a function that returns a substring existing between two strings.

//! Returns the first substring existing between the left and right parameter substrings.  <i>Can be used with StringX</i>
//! @note The returned substring needs to be freed by the user. BEWARE when assigning the result of this function to a string: any string previously held there is NOT freed. You have to free it explicitly
//! @param thisstr This current string
//! @param lstr The left substring that will define the new substring
//! @param rstr The right substring that will define the new substring
//! @return Returns the substring between left and right substrings if they are found. If they are not returns a null pointer.
RF_String* rfString_Between(RF_String* thisstr,RF_String* lstr,RF_String* rstr);
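
As a rough illustration of the idea, here is a sketch of the same operation on plain NUL-terminated UTF-8 strings, built on `strstr`. Byte-wise search is safe here because a valid UTF-8 delimiter can only match at a character boundary. `demo_between` is a hypothetical name and this is not the library's implementation.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper, not part of the Refu API.
   Returns a newly allocated copy of the text between the first occurrence
   of left and the first occurrence of right after it. Returns NULL if
   either delimiter is missing or the allocation fails.
   The caller must free() the result. */
char* demo_between(const char* s, const char* left, const char* right)
{
    const char* start;
    const char* end;
    size_t len;
    char* result;
    assert(s != NULL && left != NULL && right != NULL);

    start = strstr(s, left);
    if (start == NULL)
        return NULL;
    start += strlen(left);          /* skip past the left delimiter */

    end = strstr(start, right);
    if (end == NULL)
        return NULL;

    len = (size_t)(end - start);
    result = malloc(len + 1);
    if (result == NULL)
        return NULL;
    memcpy(result, start, len);
    result[len] = '\0';
    return result;
}
```

The ownership rule from the note applies to the sketch as well: if the result is stored in a variable that already owns a string, that older string has to be freed first, otherwise it leaks.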

This is a function to append another String to this String.

//! Appends the parameter String to this one. <b>Can't be used with RF_StringX</b>
//! @param thisstr The string to append to
//! @param other The string to add to this string
void rfString_Append(RF_String* thisstr,RF_String* other);

This function removes characters from inside the string at a given position, counting backwards from that position.

//! Removes n characters from the position p (including the character at p) of the string counting backwards. If there is no space to do so, nothing is done and returns false.
//! <i>Can be used with StringX</i>
//! @param thisstr The string to prune from
//! @param p The position to remove the characters from. Must be a positive integer. Indexing starts from zero.
//! @param n The number of characters to remove from the position and back. Must be a positive integer.
//! @return Returns true in case of successful removal and false in any other case.
char rfString_PruneMiddleB(RF_String* thisstr,unsigned int p,unsigned int n);
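
Because positions are given in characters and not bytes, a function like this first has to translate character indices into byte offsets. The sketch below shows one way to do that on a plain `char` array, assuming valid UTF-8. `demo_utf8_seek` and `demo_prune_middle_back` are hypothetical names and not the library's code.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical helper: returns a pointer to the character with the given
   index, a pointer to the terminator if idx equals the character count,
   or NULL if idx lies beyond that. */
static char* demo_utf8_seek(char* s, unsigned int idx)
{
    while (idx > 0)
    {
        if (*s == '\0')
            return NULL;               /* ran off the end of the string */
        ++s;
        while (((unsigned char)*s & 0xC0) == 0x80)
            ++s;                       /* skip continuation bytes */
        --idx;
    }
    return s;
}

/* Hypothetical sketch, not part of the Refu API.
   Removes n characters ending at character position p (inclusive),
   counting backwards. Returns 1 on success and 0 if nothing was removed. */
int demo_prune_middle_back(char* s, unsigned int p, unsigned int n)
{
    char* first;
    char* after;
    assert(s != NULL);
    if (n == 0 || n > p + 1)
        return 0;                      /* not enough characters up to p */
    first = demo_utf8_seek(s, p - n + 1);  /* first character to remove */
    after = demo_utf8_seek(s, p + 1);      /* first character kept after the gap */
    if (first == NULL || after == NULL)
        return 0;                      /* p is out of bounds */
    memmove(first, after, strlen(after) + 1);  /* close the gap, terminator included */
    return 1;
}
```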

A Few RF_StringX functions

In this section we have examples of StringX functions. These are used for Strings that are intended for heavy text editing.
Below we can see a function that inserts a string at a given character position in an extended string.

//! Inserts a string to this extended string at the parameter character position.
//! @param thisstr The extended string to insert to
//! @param pos     The character position in the string to add it. Should be a positive (or zero) integer. If the position is over the string's size nothing happens.
//! @param other   The string to add
void rfStringX_Insert(RF_StringX* thisstr,unsigned int pos,RF_String* other);
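
Insertion is where a recorded buffer size pays off, since the buffer only has to be reallocated when the spare room runs out. The following is a sketch of that idea under my own assumptions and not the library's implementation. `Demo_Buffer` and `demo_insert` are hypothetical names, and the `size` field merely plays the role that `bSize` has in RF_StringX.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical growable string, not part of the Refu API. */
typedef struct Demo_Buffer
{
    char* bytes;   /* NUL-terminated UTF-8 text, heap allocated */
    size_t size;   /* allocated size of bytes */
} Demo_Buffer;

/* Inserts other at character position pos. Returns 1 on success and 0 if
   pos is past the end of the string or the allocation fails. */
int demo_insert(Demo_Buffer* b, unsigned int pos, const char* other)
{
    size_t len;
    size_t add;
    size_t offset = 0;
    assert(b != NULL && b->bytes != NULL && other != NULL);
    len = strlen(b->bytes);
    add = strlen(other);

    /* Translate the character position into a byte offset. */
    while (pos > 0)
    {
        if (b->bytes[offset] == '\0')
            return 0;                  /* position is over the string's size */
        ++offset;
        while (((unsigned char)b->bytes[offset] & 0xC0) == 0x80)
            ++offset;                  /* skip continuation bytes */
        --pos;
    }

    /* Grow the buffer only when the spare room is not enough. */
    if (len + add + 1 > b->size)
    {
        size_t new_size = (len + add + 1) * 2;
        char* grown = realloc(b->bytes, new_size);
        if (grown == NULL)
            return 0;
        b->bytes = grown;
        b->size = new_size;
    }

    /* Shift the tail (terminator included) and copy the new text in. */
    memmove(b->bytes + offset + add, b->bytes + offset, len - offset + 1);
    memcpy(b->bytes + offset, other, add);
    return 1;
}
```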

This function replaces whatever lies between the left and right substrings inside the string with the replacement string, either for one specific occurrence or for all of them.

//! Replaces what exists between the ith left and right substrings of this extended String. Utilizes the internal string pointer.
//! @param thisstr The extended string to work on
//! @param left The left substring that will define the new substring
//! @param right The right substring that will define the new substring
//! @param rstr The string to act as replacement
//! @param i The specific between occurrence to replace. Should be 1 or greater. If 0, all occurrences will be replaced
//! @return Returns true if the replacing happened and false if either the left or the right strings were not found
char rfStringX_ReplaceBetween(RF_StringX* thisstr,RF_String* left,RF_String* right,RF_String* rstr,int i);

Finally, below is a function that makes use of the internal pointer of StringX: it returns the substring located between two specific sequences in the string and also moves the pointer past them.

//! Returns the first substring existing between the left and right substrings of this String and moves the internal pointer right after them
//! @note The returned substring needs to be freed by the user. BEWARE when assigning the result of this function to a string: any string previously held there is NOT freed. You have to free it explicitly
//! @param thisstr The extended string to work on
//! @param left The left substring that will define the new substring
//! @param right The right substring that will define the new substring
//! @return Returns the substring between left and right substrings if they are found. If they are not returns a NULL String.
RF_StringX* rfStringX_BetweenMove(RF_StringX* thisstr,RF_String* left,RF_String* right);
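
To show why moving the pointer is handy when parsing, here is a sketch of the same idea with a small cursor over a plain C string. Each call extracts the next delimited value and continues from where the previous call stopped. `Demo_Cursor` and `demo_between_move` are hypothetical names, and the bookkeeping is my assumption about how such an internal pointer could work, not the library's code.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical read cursor, not part of the Refu API. */
typedef struct Demo_Cursor
{
    const char* bytes;  /* current start of the unread text */
    int bIndex;         /* bytes consumed so far, like bIndex in RF_StringX */
} Demo_Cursor;

/* Returns a newly allocated copy of the text between the next left and
   right delimiters and moves the cursor just past the right delimiter.
   Returns NULL and leaves the cursor untouched if either delimiter is
   missing or the allocation fails. The caller must free() the result. */
char* demo_between_move(Demo_Cursor* c, const char* left, const char* right)
{
    const char* start;
    const char* end;
    size_t len;
    char* result;
    assert(c != NULL && c->bytes != NULL && left != NULL && right != NULL);

    start = strstr(c->bytes, left);
    if (start == NULL)
        return NULL;
    start += strlen(left);
    end = strstr(start, right);
    if (end == NULL)
        return NULL;

    len = (size_t)(end - start);
    result = malloc(len + 1);
    if (result == NULL)
        return NULL;
    memcpy(result, start, len);
    result[len] = '\0';

    /* Move the cursor just past the right delimiter. */
    end += strlen(right);
    c->bIndex += (int)(end - c->bytes);
    c->bytes = end;
    return result;
}
```

Calling the function in a loop until it returns NULL walks through every delimited value in a text without ever searching the already consumed part again.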

C++ String Wrapper

Of course, as mentioned above, this is a C library with all the functions written for use in C, but for usage in C++ a wrapper is provided which presents the String as a C++ class.

class RF_String
{
    public:
        /** String Constructors/Destructor **/
 
        //! The string's main constructor
        //! @param str the string's content in UTF-8 encoding
        RF_String(const char* str);
        //! The string default constructor, for uninitialized NULL strings
        RF_String();
 
        //e.t.c.
        ...
        ...
};

And as an example, below we can see some functions of the C++ wrapper class:

        //! Adds two strings together
        //! @param s1 A constant reference to the first string to be added
        //! @param s2 A constant reference to the second string to be added
        //! @return Returns the new string which is the addition of s1 and s2
        friend RF_String operator+(RF_String const& s1,RF_String const& s2);
        //! Adds a string and an integer, converting the integer to a string in the process
        //! @param s1 A constant reference to the string to be added
        //! @param num A constant reference to the number to be added
        //! @return Returns the new string which is the addition of the string and the number
        friend RF_String operator+(RF_String const& s1, const int& num);

with their implementations being as simple as calling the equivalent C functions from the C library:

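RF_String operator+(RF_String const& s1,RF_String const& s2)
{
    //copy the left operand so that s1 itself stays untouched (this assumes the class has a copy constructor)
    RF_String result(s1);
    //"str" is taken here to be the wrapper's pointer to the underlying C string
    rfString_Append(result.str,s2.str);
    return result;
}
 
RF_String operator+(RF_String const& s1, const int& num)
{
    //same pattern: work on a copy and return it
    RF_String result(s1);
    rfString_Append_i(result.str,num);
    return result;
}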

which in turn allows nice things not available in C, namely operator overloading, such as

RF_String str("This is No.");
str+=5;
//now str contains :"This is No.5"

So basically the Refu Strings can be used by both C and C++ projects, but what is used in the core of the Makoto Engine is the C version, since the engine itself is written in C.

In conclusion, the Refu String library is under continuous development, with functionality added whenever it is deemed necessary, and it has fit very well with the development of the Makoto Engine. It is a very useful String library, but as I see it, it has one big disadvantage which I plan to correct in the near future: it has no implementation of regular expressions, which are very useful in string manipulation. Regular expression functions will be added soon. Finally, as soon as the Makoto Engine is released the String will be available to all users, since it is the string type used throughout the engine, and users can utilize it for whatever purpose they like.
