Sunday 16 October 2016

Software Engineering : RegEx in C++ 11/14 with STL

I want to show you how the STL regular expressions in C++ work. Note: the complete source code and makefile are at the bottom.

First we'll need a makefile. I'm using Ubuntu Linux with GNU G++ v5.4.0, so I open a terminal, get a text editor up and create this makefile:

CC=g++
STD=c++14
WARNINGS=-Wall -Wfatal-errors
FLAGS=-pedantic
OUTPUT=application
FILES=main.cpp

all:
	$(CC) -std=$(STD) $(WARNINGS) $(FLAGS) $(FILES) -o $(OUTPUT)
clean:
	rm $(OUTPUT)
And I save that as "makefile". Next we need a simple "main.cpp" file to test this with, so go ahead and write:

#include <iostream>
#include <string>

int main ()
{
    std::cout << "Hello World" << std::endl;
}

Save that, then from the same folder simply type "make".  Everything should complete cleanly, and you can then type "./application" to run the resulting "Hello World" application.

Now let's go back into main.cpp and write a function. I'm not going to show you this as proper C++, and I am not going to teach you about classes, so just go with me. Above the "int main()" we just created, we'll define a function to split a string into words wherever it finds whitespace:

#include <regex>
#include <vector>

std::vector<std::string> SplitString (const std::string& p_Source)
{
    std::vector<std::string> l_result;
    // The actual regular expression
    std::regex l_regularExpression ("(\\S+)");
    // Process the whole source string through the filter
    auto l_regularExpressionResult = std::sregex_iterator(
        p_Source.begin(),
        p_Source.end(),
        l_regularExpression);
    // Use the result iterator to get all the individual strings
    // into the result vector of strings
    for (auto i = l_regularExpressionResult;
         i != std::sregex_iterator();
         ++i)
    {
        auto l_item = (*i);
        std::string l_TheString = l_item.str();
        l_result.push_back(l_TheString);
    }
    // Return the result
    return l_result;
}

Let's just take a look at this working. Go into your main and do this:

int main ()
{
    const std::string l_SourceString ("Mary Had a Little Lamb");
    std::vector<std::string> l_words = SplitString(l_SourceString);
    for (auto i = l_words.cbegin();
         i != l_words.cend();
         ++i)
    {
        std::cout << (*i) << std::endl;
    }
}

We can save, exit and build the program again, running it we see this:

Mary
Had
a
Little
Lamb

So what did our new "SplitString" function do?  Well, let's first of all hope you're comfortable with STL iterators, because we use a pair of them to mark out the source string and then another to go through the expression results.

Our important lines of code are: std::regex l_regularExpression ("(\\S+)");  where we define the regular expression. No, I'm not going to teach you all the ins and outs of writing these expressions; this one simply matches each run of one or more non-whitespace characters, which is to say each word.
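To make that pattern a little more concrete, here is a minimal sketch. "CountWords" is a hypothetical helper of mine, not part of the program above; it uses a raw string literal so the backslash does not need doubling, and counts the matches rather than collecting them.

```cpp
#include <cstddef>
#include <iterator>
#include <regex>
#include <string>

// Hypothetical helper, not part of the post's program: counts the runs of
// non-whitespace characters ("words") found in p_Source.
std::size_t CountWords (const std::string& p_Source)
{
    // R"(\S+)" is a raw string literal holding the same pattern as "\\S+":
    // \S is any non-whitespace character, + means one or more of them
    const std::regex l_words (R"(\S+)");
    const auto l_begin = std::sregex_iterator(p_Source.begin(), p_Source.end(), l_words);
    const auto l_end = std::sregex_iterator();
    // Walking from the first match to the end iterator visits every match once
    return static_cast<std::size_t>(std::distance(l_begin, l_end));
}
```

Calling CountWords("Mary Had a Little Lamb") should give 5, and tabs or repeated spaces between the words make no difference, because any whitespace ends a match.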

The next important line is: auto l_regularExpressionResult = std::sregex_iterator(  where we are going to use the sregex_iterator constructor to actually apply the filter we created on the previous line, and we apply it to the span of the whole source string "begin()" to "end()" on the std::string::iterator there.

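We could use "cbegin()" and "cend()" here too. Because p_Source is a const reference, "begin()" and "end()" already hand back a std::string::const_iterator, which is what sregex_iterator works with, so the substitution changes nothing in behaviour.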

The final parameter is passing the actual filtering regular expression into place.

The result, and we don't need to worry about the type as we're leveraging auto there, is an iterator positioned on the first match.  When the processing takes place is worth knowing: the constructor runs the search for the first match, and each increment runs the search for the next one, so the input is processed on the fly as you iterate rather than all up front.  This can be a performance trap for some, as they assume all the work was done on the line which constructs the iterator, when most of it actually happens inside the loop, and confusion ensues when stepping through or profiling the code.
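One consequence of searching match by match is that you can stop early. The sketch below assumes that is what you want; "FirstWords" and its "p_Limit" parameter are hypothetical names of mine, not part of the program above.

```cpp
#include <cstddef>
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: collects at most p_Limit words from p_Source.
std::vector<std::string> FirstWords (const std::string& p_Source, std::size_t p_Limit)
{
    std::vector<std::string> l_result;
    const std::regex l_words ("\\S+");
    const std::sregex_iterator l_end;
    // The constructor finds the first match; each ++i searches for the next.
    // Once the limit is reached the loop ends, so beyond the single match
    // that the final ++i looked ahead to, the rest of the input is never searched.
    for (std::sregex_iterator i (p_Source.begin(), p_Source.end(), l_words);
         i != l_end && l_result.size() < p_Limit;
         ++i)
    {
        const std::smatch& l_item = (*i);
        l_result.push_back(l_item.str());
    }
    return l_result;
}
```

For example, FirstWords("Mary Had a Little Lamb", 2) should return just "Mary" and "Had".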

The last important piece of code is actually going through the result to see if there is anything in the resulting iterator.

The awkward piece of using auto shows up here. When you iterate through the result and get each string, you might want to write "(*i).str()" directly rather than assigning the dereference (*i) to an auto first.  However, I have had compilers (GCC with -pedantic on this one) not like this, so to make the code more maintainable, and to pre-empt it landing on any platform where the dereference of the iterator is reported to not contain a definition for "str()", I simply assign the dereference to an auto called "l_item" and then use "l_item.str()". That's a lesson in maintainable code right there, folks.
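If you would rather sidestep the dereference altogether, one alternative (not what the code above does) is std::sregex_token_iterator, whose elements convert straight to std::string. A minimal sketch, with "SplitStringTokens" being a hypothetical name of mine:

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical variant of SplitString built on std::sregex_token_iterator
// instead of std::sregex_iterator.
std::vector<std::string> SplitStringTokens (const std::string& p_Source)
{
    const std::regex l_regularExpression ("\\S+");
    // With the default sub-match index of 0, each element is a whole match
    const std::sregex_token_iterator l_first (
        p_Source.begin(),
        p_Source.end(),
        l_regularExpression);
    const std::sregex_token_iterator l_last;
    // Each sub-match converts to std::string as the vector is filled
    return std::vector<std::string>(l_first, l_last);
}
```

Whether this reads better than the explicit loop is a matter of taste; the loop makes it more obvious where the searching happens.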

That is a very basic introduction to regular expressions. You can read why I have gone through this below.

Right now though, let's use a more complex regular expression and avoid the complexity of the iterator stuff. Let's just validate a string as a UK postcode:

const bool ValidatePostcode (const std::string& p_UKPostcode)
{
    std::regex l_Validate ("^([A-PR-UWYZ0-9][A-HK-Y0-9][AEHMNPRTVXY0-9]?[ABEHMNPRVWXY0-9]? {1,2}[0-9][ABD-HJLN-UW-Z]{2}|GIR 0AA)$");

    return std::regex_match(p_UKPostcode.c_str(), l_Validate);
}

This might look a little cramped, but I did not want to make a mistake with the regex.  This isn't a perfect solution btw: I'm still writing a test routine to check it against a full list of UK postcodes online. I think it will let some stranger codes through as valid, but they are edge cases; by my estimate this will work for 99.5% of addresses, and it works for 100% of those I've tested so far.
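A test routine like the one mentioned above might start out as a small table of inputs and expected answers. This is only a sketch: "PostcodeCase" and "CountPostcodeFailures" are hypothetical names of mine, ValidatePostcode is repeated so the block stands alone, and the table holds just the four inputs tried at the bottom of this post plus the special case "GIR 0AA" from the pattern itself, not a real postcode list.

```cpp
#include <cstddef>
#include <regex>
#include <string>
#include <vector>

// Repeated from the post so this block compiles on its own
const bool ValidatePostcode (const std::string& p_UKPostcode)
{
    std::regex l_Validate ("^([A-PR-UWYZ0-9][A-HK-Y0-9][AEHMNPRTVXY0-9]?[ABEHMNPRVWXY0-9]? {1,2}[0-9][ABD-HJLN-UW-Z]{2}|GIR 0AA)$");

    return std::regex_match(p_UKPostcode.c_str(), l_Validate);
}

// Hypothetical test record: an input and the answer we expect for it
struct PostcodeCase
{
    std::string m_Input;
    bool m_Expected;
};

// Hypothetical test routine: returns how many cases gave the wrong answer
std::size_t CountPostcodeFailures ()
{
    const std::vector<PostcodeCase> l_cases = {
        { "NG16 5BP", true },
        { "NG10 1NQ", true },
        { "GIR 0AA", true },   // the literal alternative in the pattern
        { "Robert", false },   // lowercase letters never match
        { "FP52 JTY", false }  // no digit after the space
    };
    std::size_t l_failures = 0;
    for (const auto& l_case : l_cases)
    {
        if (ValidatePostcode(l_case.m_Input) != l_case.m_Expected)
        {
            ++l_failures;
        }
    }
    return l_failures;
}
```

Growing the table from a real postcode list is then just a matter of loading more rows; a return value of zero means every case agreed.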

There you go, good luck!


=== WHY DOES THIS EXIST ===
Today I've been using regular expressions in C++. Some might consider this dark magic; however, I assure you it is all above board. The problem was the validation of a UK postcode, for which a quite terrible function had been defined:

bool Validate(char *Code);

It had all manner of hackery and trouble within, not least that it could not handle some London postcodes (we'll come back to postcodes later). I replaced all the functionality of this code with two lines of code. Literally two: it went from around 500 lines of un-maintainable junk to the two lines of active code which you see above. I could in fact have placed the regular expression string into our master list of "strings", to further minimise where constants are defined, but I left that to him. I left him that small victory to sweeten the pill of my rather drastically demonstrating that he had not thought about the code changes needed, and had spent all week on something which took me two lines and about 10 minutes to make sure the regex was right!

Handing it back to the owner after my peer review, I think they wanted to cry. Instead they rushed off to our common Director, bypassing any input from fellow programmers at code management level, and said I had "shown them up by using a third party library".

I had used STL, something we use elsewhere, I had also followed the coding standards which exist, so the function had become:

const bool ValidatePostcode(const std::string& p_UKPostcode) const;

This, I think you must agree, is more informative as to what it does. It tells us we can't edit the values, we're still passing everything by reference but we're not changing the type of our system string handling from "std::string" to "char*", and with the trailing const we also define that the function changes nothing in the class it is within.
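For anyone who has not met the trailing const before, here is a minimal sketch of that prototype living inside a class. "AddressChecker" and its "m_Validate" member are hypothetical, invented purely to give the function a home; the real class is not shown in this post.

```cpp
#include <regex>
#include <string>

// Hypothetical class, only here to host the member function
class AddressChecker
{
public:
    AddressChecker ()
        : m_Validate ("^([A-PR-UWYZ0-9][A-HK-Y0-9][AEHMNPRTVXY0-9]?[ABEHMNPRVWXY0-9]? {1,2}[0-9][ABD-HJLN-UW-Z]{2}|GIR 0AA)$")
    {
    }

    // The const reference means the caller's string cannot be altered here.
    // The trailing const means the function cannot alter the AddressChecker,
    // so it can be called on a const object.
    // (The leading const follows the coding standard; on a bool returned by
    // value it does not change anything for the caller.)
    const bool ValidatePostcode (const std::string& p_UKPostcode) const
    {
        return std::regex_match(p_UKPostcode, m_Validate);
    }

private:
    const std::regex m_Validate;
};
```

With that in place, a "const AddressChecker" can still call ValidatePostcode, which is exactly what the trailing const promises.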

All these rules are in the coding standard. Folks, before you go around a peer to complain, and a more senior peer at that, please check you are in fact on the right track.

So, having validated my changing the function prototype, I had to explain why I had used a third party library (as all such libraries need formal evaluation). "Regular expressions are in the standard library" was my simple reply. "Only in the latest technical release!" was the mouth-frothing reply from the hurt chap.  "No, they're in C++11. We use STL all over the code; it is formally evaluated and signed off by everyone, including yourself."

The guy looked extremely crestfallen, and whatever his motivations for having a go at me, I realised he just didn't know. He'd not read the books I had, he'd not used the code as I have, and he'd simply always used regular expressions from third party sources. That's fine, but please folks, just check your coding standard and at least have a look on Google before you go shouting to those above in an unprofessional manner.


---- THE COMPLETE SOURCE (main.cpp) ----

#include <iostream>
#include <string>
#include <regex>
#include <vector>

std::vector<std::string> SplitString (const std::string& p_Source)
{
    std::vector<std::string> l_result;
    // The actual regular expression
    std::regex l_regularExpression ("(\\S+)");
    // Process the whole source string through the filter
    auto l_regularExpressionResult = std::sregex_iterator(
        p_Source.begin(),
        p_Source.end(),
        l_regularExpression);
    // Use the result iterator to get all the individual strings
    // into the result vector of strings
    for (auto i = l_regularExpressionResult;
         i != std::sregex_iterator();
         ++i)
    {
        auto l_item = (*i);
        std::string l_TheString = l_item.str();
        l_result.push_back(l_TheString);
    }
    // Return the result
    return l_result;
}

const bool ValidatePostcode (const std::string& p_UKPostcode)
{
    std::regex l_Validate ("^([A-PR-UWYZ0-9][A-HK-Y0-9][AEHMNPRTVXY0-9]?[ABEHMNPRVWXY0-9]? {1,2}[0-9][ABD-HJLN-UW-Z]{2}|GIR 0AA)$");

    return std::regex_match(p_UKPostcode.c_str(), l_Validate);
}

int main ()
{
    const std::string l_SourceString ("Mary Had a Little Lamb");
    std::vector<std::string> l_words = SplitString(l_SourceString);
    for (auto i = l_words.cbegin();
         i != l_words.cend();
         ++i)
    {
        std::cout << (*i) << std::endl;
    }

    // Postcodes
    std::cout << "--- Postcodes ---" << std::endl;
    std::cout << ValidatePostcode("NG16 5BP") << std::endl;
    std::cout << ValidatePostcode("NG10 1NQ") << std::endl;
    std::cout << ValidatePostcode("Robert") << std::endl;
    std::cout << ValidatePostcode("FP52 JTY") << std::endl;
}

---- makefile ----

CC=g++
STD=c++14
WARNINGS=-Wall -Wfatal-errors
FLAGS=-pedantic
OUTPUT=application
FILES=main.cpp

all:
	$(CC) -std=$(STD) $(WARNINGS) $(FLAGS) $(FILES) -o $(OUTPUT)
clean:
	rm $(OUTPUT)


P.S. Yes this will all work with "STD=c++11" in the make file!
