For a while now, I’ve been meaning to start dabbling with regular expressions in C#. I’ve held off, though, mostly because between work and family I just haven’t had the chance to get into anything in C# in depth. At the end of this past week, Zibings finally got started on a very long-awaited project which will be done in ASP.NET, and in starting it I found myself needing to port my old (almost 11 years old) email validation algorithm from C++ to C#. I ported it to PHP years ago, and it works beautifully there, but porting it to C# gave me the chance to tweak the algorithm to take advantage of some of the tools available in .NET.
I first did an almost exact port of the script. It worked, but I thought I should check its performance. It seemed slow, so I tried a version using C#’s List datatype. This seemed faster, but I thought I could do better. I asked for a bit of help from some nice people on FreeNode’s ##asp.net channel (specifically Kim^J) and was given a blazingly fast regular expression version.
Even so, it seemed odd to me that hand-written code, without the overhead of a complex system like regular expressions, couldn’t outperform a regular expression, so I went back and started tweaking my List and Manual versions of the validation routine. After a lot of work, I’ve made both consistently faster (on average) than a comparably accurate compiled regular expression. Before I go further, here’s a sampling of the rather consistent results:
Also, you can view the entire source code of the .cs file here.
The three ‘algorithms’ are each enclosed in their own class. The only difference between the List and Manual classes is that, instead of using a List collection to store and search the acceptable characters, the Manual class simply traverses an array of those characters. Otherwise, the two should be identical in the logic they use to verify that the email address and domain name are valid.
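To make the List/Manual distinction concrete, here is a minimal sketch of the two lookup styles. It is not the code from the uploaded .cs file: the class name, method names, and the set of allowed characters are all assumptions made for illustration.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical sketch: names and the allowed character set are assumptions,
// not taken from the original source file.
public static class AllowedCharLookup
{
    // Assumed set of acceptable characters; the real algorithm's set may differ.
    private static readonly char[] AllowedArray =
        "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789._-+".ToCharArray();

    private static readonly List<char> AllowedList = new List<char>(AllowedArray);

    // "List" style: let the collection's built-in Contains() do the search.
    public static bool IsAllowedList(char c)
    {
        return AllowedList.Contains(c);
    }

    // "Manual" style: walk the array by hand and bail out on the first match.
    public static bool IsAllowedManual(char c)
    {
        for (int i = 0; i < AllowedArray.Length; i++)
        {
            if (AllowedArray[i] == c)
            {
                return true;
            }
        }
        return false;
    }

    // Returns true only if the text is non-empty and every character passes the lookup.
    public static bool AllAllowed(string text, Func<char, bool> isAllowed)
    {
        if (string.IsNullOrEmpty(text))
        {
            return false;
        }
        foreach (char c in text)
        {
            if (!isAllowed(c))
            {
                return false;
            }
        }
        return true;
    }
}
```

Calling `AllowedCharLookup.AllAllowed("andy.smith", AllowedCharLookup.IsAllowedList)` and the same call with `IsAllowedManual` should give the same answer; only the way each character is looked up differs.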
The regular expression is almost entirely based on one readily available at this site. If anyone out there knows of a better or faster pattern that I should try, I would love to hear about it; I am not a RegExpert in the least.
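For readers who haven’t used a compiled regular expression in .NET, a rough sketch looks like this. The pattern here is a deliberately simplified stand-in that I am assuming for illustration; it is not the pattern used in the test, and it is nowhere near a complete email grammar. The class and method names are also hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical sketch: the pattern is a simplified placeholder,
// not the expression used in the actual test.
public static class CompiledEmailRegex
{
    // RegexOptions.Compiled trades a slower start-up for faster repeated matching,
    // so the Regex is built once and reused for every call.
    private static readonly Regex Pattern = new Regex(
        @"^[A-Za-z0-9._+-]+@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+$",
        RegexOptions.Compiled);

    public static bool IsValid(string email)
    {
        return email != null && Pattern.IsMatch(email);
    }
}
```

Keeping the Regex in a static field matters for a fair comparison: constructing a compiled pattern on every call would likely swamp the cost of the match itself.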
The above image is the result of running the source code I’ve uploaded. The numbers come from running three emails through a test 500 times, taking the average, and then running through again. All told, the above test validated 15,000 email addresses (though of course they were the same 3 addresses). I have run the test with as few as 5 attempts and as many as 50,000 attempts. Regardless of the number of attempts or how many times I tell it to run the test, the order is always the same: the List version is always first and fastest, the Manual version is second and mostly consistent, and the Regex version ends up in last place by various margins.
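The general shape of a timing loop like this can be sketched with System.Diagnostics.Stopwatch. This is not the test code from the uploaded file; the class name, method name, and the warm-up pass are my own assumptions about how such a harness might reasonably be written.

```csharp
using System;
using System.Diagnostics;

// Hypothetical sketch of a timing harness; not the test code from the original source.
public static class ValidatorBenchmark
{
    // Runs the validator over every address `attempts` times and returns
    // the average elapsed milliseconds per pass over the whole address list.
    public static double AverageMilliseconds(
        Func<string, bool> validate, string[] emails, int attempts)
    {
        if (attempts <= 0)
        {
            throw new ArgumentOutOfRangeException("attempts");
        }

        // One untimed warm-up pass so JIT compilation is not counted (an assumption,
        // the original test may not do this).
        foreach (string email in emails)
        {
            validate(email);
        }

        Stopwatch watch = Stopwatch.StartNew();
        for (int i = 0; i < attempts; i++)
        {
            foreach (string email in emails)
            {
                validate(email);
            }
        }
        watch.Stop();

        return watch.Elapsed.TotalMilliseconds / attempts;
    }
}
```

Passing each validator in turn with the same three addresses and the same attempt count, then repeating the whole run several times, gives averages that can be compared side by side.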
I’ve always taken it as general knowledge that doing things by hand is faster from a computer’s perspective. Interestingly enough, though, these results suggest that, at least in C#, that’s not always true. The List approach uses a supplied method, Contains(), to search for a character within the List, whereas the Manual approach loops through the array itself and bails out when the first match is found.
It should also be noted that just because the computer gets through the List/Manual methods more quickly doesn’t mean they save time overall. Most people are not going to be trying to validate 50,000 email addresses in a few seconds regardless of what they’re doing, so the time I spent writing this algorithm all those years ago (and today) was really wasted in a sense, as it would take a very long time to make it back in saved milliseconds. Regardless, I had a lot of fun working with Kim^J to look into the possibilities here.
If anyone finds anything that could help any of the algorithms become faster (and remain accurate), I’d be really excited to hear about your ideas. Thanks again to Kim^J for the help with the regular expression version and with the test code.
- Andy