Saturday, December 15, 2018

String tokenization in C

Let's say you have to write a C program to tokenize a string that contains a list of tokens separated by some delimiter, say a comma.

You could figure out your own algorithm for doing that, but usually a saner approach is to find a library function that already does that.

The C standard library does provide such a function, called strtok. You give it a string and a delimiter string, and it splits the string into tokens wherever one of the delimiter characters appears.

So instead of writing our own function, we shall go forward and try to use strtok for our task. In doing so, we shall learn how to use it and other such functions.


Here is the declaration of strtok:

char *strtok(char *str, const char *delim);


From the function declaration above, we can see that strtok takes the string as its first argument and the delimiter/separator string as its second argument.
The first call to strtok takes the string to be parsed and the delimiter string. It returns a pointer to the first token: the text from the start of the string (after skipping any leading delimiter characters) up to the first occurrence of a delimiter character. Note that the delimiter string is treated as a set of characters, not as a multi-character separator. To get the next token, we pass NULL as the first argument and the delimiter string as the second. strtok maintains its context between calls and knows it has to continue parsing the string you provided earlier.

The following program shows an example.


#include <stdio.h>
#include <string.h>
int main()
{
    char str[50];
    char *token;

    strcpy(str,"abc,def,ghi");
    token = strtok(str,",");
    printf("%s \n",token);
    
    token = strtok(NULL,",");
    printf("%s \n",token);
    
    token = strtok(NULL,",");
    printf("%s \n",token);
    
    return 0;
}


In the above example, we have a string "abc,def,ghi" and we parse it into tokens abc, def and ghi.
Note how we pass str as the first argument in the first call to strtok and NULL as the first argument in the subsequent calls. Here is the output of this program:

abc
def
ghi



strtok() returns NULL when it cannot find any more tokens. We can use this property when the number of tokens is not known in advance: after the first call, we keep calling strtok() in a loop until it returns NULL.
#include <stdio.h>
#include <string.h>
int main()
{
    char str[50];
    char *token;

    strcpy(str,"abc,def,ghi,jkl,mno");
    token = strtok(str,",");
    
    while ( token != NULL ) 
    {
        printf("%s \n",token);
        token = strtok(NULL,",");
    }
    
    return 0;
}
Here is the output of the above program:
abc
def
ghi
jkl
mno
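
The same loop can be wrapped in a small helper that collects the tokens instead of printing them. This is only a sketch: split_tokens is a hypothetical name, not a library function, and it assumes the caller supplies a writable string and an output array with room for max pointers.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not part of the C library).
 * Stores up to max token pointers in out and returns how many were stored.
 * The pointers refer to positions inside str, which strtok modifies. */
size_t split_tokens(char *str, const char *delims, char **out, size_t max)
{
    size_t count = 0;
    char *token = strtok(str, delims);

    while (token != NULL && count < max)
    {
        out[count++] = token;
        token = strtok(NULL, delims);
    }

    return count;
}
```

Because the stored pointers point into the original buffer, that buffer has to stay alive for as long as the tokens are used.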


We can also provide multiple delimiters to strtok, as shown below.
#include <stdio.h>
#include <string.h>
int main()
{
    char str[50];
    char *token;

    strcpy(str,"abc,def:ghi;jkl,mno");
    
    token = strtok(str,",;:");
    
    while ( token != NULL ) 
    {
        printf("%s \n",token);
        token = strtok(NULL,",;:");
    }

    return 0;
}
Here we pass a delimiter string that contains a comma, a semicolon and a colon. strtok checks each character of the input against this set and ends the current token as soon as it finds any one of them.
Output of the above program:
abc
def
ghi
jkl
mno
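
The delimiter string does not have to be the same in every call; strtok uses whatever set it is given for that particular call. As a rough sketch, assuming input of the form key=value;key=value, a hypothetical print_pairs function could alternate between "=" and ";":

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: prints each key/value pair found in settings
 * and returns the number of complete pairs. settings must be writable. */
int print_pairs(char *settings)
{
    int pairs = 0;
    char *key = strtok(settings, "=");      /* a key ends at '=' */

    while (key != NULL)
    {
        char *value = strtok(NULL, ";");    /* a value ends at ';' */
        if (value == NULL)
        {
            break;                          /* key without a value */
        }
        printf("%s -> %s\n", key, value);
        pairs++;
        key = strtok(NULL, "=");
    }

    return pairs;
}
```

For the input "color=red;size=10" this prints color -> red and size -> 10 and returns 2.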
Another thing we need to know about strtok is that when it encounters two or more delimiters in succession, it treats them as a single delimiter. The program below shows this scenario:
#include <stdio.h>
#include <string.h>
int main()
{
    char str[50];
    char *token;

    strcpy(str,"abc,,,def,,,,,,ghi");
    
    token = strtok(str,",");

    while ( token != NULL ) 
    {
        printf("%s \n",token);
        token = strtok(NULL,",");
    }

    return 0;
}
Output of the above program:
abc
def
ghi 
Also, if delimiters are encountered at the start or end of a string, strtok() ignores them, as shown below.
#include <stdio.h>
#include <string.h>
int main()
{
    char str[50];
    char *token;

    strcpy(str,",,,abc,,,def,,,,,,,ghi,,,,,,");
    
    token = strtok(str,",");
    while ( token != NULL ) 
    {
        printf("%s \n",token);
        token = strtok(NULL,",");
    }

    return 0;
}
Output of the above program:
abc
def
ghi 

There are a few common pitfalls that we should avoid while using strtok().

Never pass a string literal (or any other read-only string) as the first argument of strtok. strtok writes into the string it parses, and modifying a string literal is undefined behavior. On most platforms the literal lives in read-only memory, so the program crashes with a segmentation fault.

Something like the code below is likely to crash:
#include <stdio.h>
#include <string.h>
int main()
{
    char *str = "abc,def";
    char *token;

    token = strtok(str,",");
    printf("%s \n",token);

    return 0;
}
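
A simple way out is to tokenize a writable array instead of a literal. The sketch below uses a hypothetical helper, first_token_len, to show what strtok actually does to the buffer: the delimiter that ends the token is overwritten with a null character.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: returns the length of the first token in buf,
 * or 0 if there is none. buf must be writable, e.g. a char array. */
size_t first_token_len(char *buf, const char *delims)
{
    char *token = strtok(buf, delims);

    return (token != NULL) ? strlen(token) : 0;
}
```

With char buf[] = "abc,def"; the call first_token_len(buf, ",") returns 3, and afterwards buf[3] holds '\0' instead of the comma.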

Also, if you want to keep the original string intact for later use, you should avoid passing it directly to strtok, because strtok modifies the string it parses. It's advisable to copy the string into a temporary buffer and pass that buffer to strtok to get your tokens.
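
A minimal sketch of that copy-first approach is shown here. The function name print_tokens_from_copy is made up for illustration, and the copy is made with malloc and memcpy so that only standard C functions are needed.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: tokenizes a private copy of input, so input itself
 * is never modified (it may even be a string literal).
 * Returns the number of tokens printed, or -1 if allocation fails. */
int print_tokens_from_copy(const char *input, const char *delims)
{
    size_t size = strlen(input) + 1;    /* include the terminating '\0' */
    char *copy = malloc(size);
    char *token;
    int count = 0;

    if (copy == NULL)
    {
        return -1;
    }
    memcpy(copy, input, size);

    token = strtok(copy, delims);
    while (token != NULL)
    {
        printf("%s\n", token);
        count++;
        token = strtok(NULL, delims);
    }

    free(copy);
    return count;
}
```

The tokens point into the temporary copy, so they must be used (or copied elsewhere) before the buffer is freed.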

Next, strtok is not thread-safe. That's because it remembers its position in the string in internal static state that is shared by every caller. So, you should take care that only one thread in your program calls strtok at a time.

If you have a multithreaded program, you should use the strtok_r function instead of strtok. strtok_r is a reentrant version of strtok. Let's understand how we can use strtok_r.

Here is the function declaration:

char *strtok_r(char *str, const char *delim, char **saveptr);


strtok_r has an additional third argument, saveptr, a pointer to a char pointer that is provided by the caller. strtok_r uses saveptr to maintain context between successive calls on the same string. The caller must pass the same saveptr in every call for that string and must not modify it between calls, otherwise strtok_r will not work correctly.


Here is an example program showing strtok_r usage:
#include <stdio.h>
#include <string.h>


int main()
{
    char str[] = "apple,orange,banana";
    char *saveptr;
    char *token;

    token = strtok_r(str,",",&saveptr);

    while ( token != NULL )
    {
        printf("%s \n",token);
        token = strtok_r(NULL,",",&saveptr);
    }

    return 0;
}
Output of the above program:
apple
orange
banana
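
Reentrancy matters in single-threaded code too. Splitting records and then splitting each record into fields needs two independent parsing contexts, which plain strtok cannot provide because a nested call would overwrite its single internal position. The following is only a sketch, assuming records separated by ';' and fields separated by ','; print_records is a hypothetical name.

```c
#define _POSIX_C_SOURCE 200809L   /* request the POSIX declaration of strtok_r */
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: prints one line per record and returns the total
 * number of fields seen. text must be writable. */
int print_records(char *text)
{
    char *record_state;   /* context for the outer (record) loop */
    char *field_state;    /* context for the inner (field) loop */
    char *record;
    char *field;
    int fields = 0;

    record = strtok_r(text, ";", &record_state);
    while (record != NULL)
    {
        field = strtok_r(record, ",", &field_state);
        while (field != NULL)
        {
            printf("%s ", field);
            fields++;
            field = strtok_r(NULL, ",", &field_state);
        }
        printf("\n");
        record = strtok_r(NULL, ";", &record_state);
    }

    return fields;
}
```

Each loop has its own saveptr variable, so the inner loop does not disturb the position of the outer one.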

Both strtok and strtok_r modify the string passed as their first argument in the first call: every delimiter that ends a token is overwritten with a null character. As a programmer, you should be aware of this while using strtok or strtok_r in your programs.


Now, let's say that you would like to know if there was no content between successive delimiters in the input string. With strtok you wouldn't be able to get this information, because it treats successive delimiters as one. For this task, it's best to use the strsep() function instead of strtok/strtok_r.
Here is an example of strsep usage:
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define TESTSTRING ("abc,,,def,,,ghi")
int main()
{
    char *buffer;   /* start of the allocation, kept so that it can be freed */
    char *str;      /* cursor that strsep advances and finally sets to NULL */
    char *token;

    buffer = (char *) malloc(sizeof(char) * (strlen(TESTSTRING)+1));
    if ( buffer == NULL )
    {
        return 1;
    }

    strcpy(buffer,TESTSTRING);
    str = buffer;
    
    token = strsep(&str,",");

    while (token != NULL)
    {
        if ( strlen(token) == 0 )
        {
            printf("No Content\n");
        }
        else
        {
            printf("%s\n",token);
        }

        token = strsep(&str,",");
    }

    free(buffer);
    return 0;
}
Here is the output of the above program:
abc
No Content
No Content
def
No Content
No Content
ghi 

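Again with strsep, we have to keep in mind that it changes the pointer whose address is passed to it as the first argument, and sets that pointer to NULL once the last token has been returned. Like strtok, it also overwrites the delimiters in the string itself.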
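
strsep is not part of the C standard or POSIX; it is a BSD extension that glibc also provides. Where it is unavailable, similar behavior can be approximated with the standard strcspn function. The sketch below is an assumption-laden stand-in, not the library implementation, and next_field is a hypothetical name.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for strsep built on standard C only.
 * Returns the next field (possibly empty) starting at *cursor and advances
 * *cursor past the delimiter, or sets it to NULL after the last field. */
char *next_field(char **cursor, const char *delims)
{
    char *start = *cursor;
    size_t len;

    if (start == NULL)
    {
        return NULL;                 /* no more fields */
    }

    len = strcspn(start, delims);    /* length up to the next delimiter */
    if (start[len] != '\0')
    {
        start[len] = '\0';           /* terminate this field */
        *cursor = start + len + 1;   /* continue after the delimiter */
    }
    else
    {
        *cursor = NULL;              /* this was the last field */
    }

    return start;
}
```

For the buffer "abc,,def" successive calls return "abc", an empty string and "def", and then NULL.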

With both strsep and strtok/strtok_r, it's best to copy the string to be split into a temporary buffer and then tokenize that buffer. This way we don't have to worry about modifying the original string, which may have been passed in from another function. In that case you may not know which memory the string resides in or what the caller intends to do with it after you have returned.

So, as we can see, the C library does provide functions that help us with string tokenization, but we need to take care in how we use them.

1 comment:

  1. So: strtok sucks, strtok_r still sucks, and strsep isn't portable.
