Editor's
Note:
Jackie Rosenberger
Staff Writer
04.19.03
If you run SQL 2000, then you've definitely heard about the Slammer worm. I hope
you patched in time and your servers stayed safe! Much has been said about the
worm itself, yet little written about how the Slammer problem got out of hand
so fast (from nothing to a crisis in ten minutes, by some estimates). Read more
on Slammer here: http://www.sysadminnews.com/articles/0204.html.
Today's article deals with a more mundane but useful topic - regular expressions. Dan Farino explains how to use this extremely useful technique to simplify your daily work. Enjoy!
xp_regex: Regular Expressions in SQL Server 2000 By Dan Farino
Download source files/compiled DLL - 58 Kb
Regular Expressions are probably the best tool for text parsing. Using Regular Expressions greatly simplifies anything for which you would normally use CHARINDEX, PATINDEX, SUBSTRING, STUFF, etc.
xp_regex is an Extended Stored Procedure for SQL Server 2000 written in Managed C++ that lets you use Regular Expressions from T-SQL. In case you’re skeptical about the performance when mixing non-.NET code (in this case SQL Server) and .NET code in the same process, I can tell you that I wrote a completely non-.NET version using the Boost Regex++ Regular Expression library and the speed was actually slightly slower. Plus, the .NET Regular Expression library is pretty much a complete implementation of Perl 5.6's (the de factostandard).
There are two Extended Stored Procedures in the DLL:
xp_regex_split
xp_regex_format
1. XP_REGEX_FORMAT
Syntax:
xp_regex_format @input, @regex, @format_string, @result OUTPUT
@input is the text to parse.
@regex is the regular expression to match.
@format_string is used to format the results
@result is an output parameter that will hold the formatted results
All parameters are either VARCHAR or CHAR of any length.
(TEXT might work too. Haven't tried…)
xp_regex_format is used to parse an input string and format the results. Probably the best example to demonstrate is by doing some telephone number parsing. I’ll assume you know the Regular Expression syntax for this documentation.
The regex [^d]*(d{3})[^d]*(d{3})
[^d]*(d{4}) will parse just about any
phone-number-like string you throw at it. For instance, this code:
DECLARE @out VARCHAR(50)
EXEC xp_regex_format '(310)555-1212', '[^d]*(d{3})[^d]*(d{3})[^d]*(d{4})' , '($1) $2-$3', @out OUTPUT
PRINT @out
EXEC xp_regex_format '310.555.1212', '[^d]*(d{3})[^d]*(d{3})[^d]*(d{4})'
, '($1) $2-$3', @out OUTPUT
PRINT @out
EXEC xp_regex_format ' 310!555 hey! 1212', '[^d]*(d{3})[^d]*(d{3})[^d]* (d{4})','($1) $2-$3', @out OUTPUT
PRINT @out
EXEC xp_regex_format ' hello, ( 310 ) 555.1212 is my phone number. Thank you.', '[^d]*(d{3})[^d]*(d{3})[^d]*(d{4})', '($1) $2-$3',
@out OUTPUT
PRINT @out
prints out:
(310) 555-1212
(310) 555-1212
(310) 555-1212
(310) 555-1212
(Cool!)
| |
FREE
Trial! Integral
Accounting Professional. - a complete, powerful and fully customizable accounting
system implemented in Microsoft Visual Basic 6.0, ADO, and SQL Server for great
performance, stability and reliability! |
2. XP_REGEX_SPLIT
Syntax:
xp_regex_split @input, @regex, @column_number, @result OUTPUT
@input is the text to parse.
@regex is a regular expression that matches the delimiter
@column_number: We’re basically doing a "text-to-columns" here, so
@column_number lets you specify which of the resulting columns should be passed
back in the @result parameter
@result is an output parameter that will hold the formatted
results
@input, @regex and @result are either VARCHAR
or CHAR of any length.
@column_number is an INT. Columns are numbered starting at 1.
This function splits text data on some sort of delimiter (comma, pipe,
whatever). The cool thing about a split using regular expressions is that the
delimiter does not have to be as consistent as you would normally expect.
For example, take this line as your source data:
one ,two|three : four
In this case, our delimiter is either a comma, pipe or colon with any number of
spaces either before or after (or both). In regex form, that is written: s*[,|:]s*
For example:
DECLARE @out VARCHAR(8000)
EXEC xp_regex_split 'one ,two|three : four', 's*[,|:]s*', 1, @out OUTPUT
PRINT @out
EXEC xp_regex_split 'one ,two|three : four', 's*[,|:]s*', 2, @out OUTPUT
PRINT @out
EXEC xp_regex_split 'one ,two|three : four', 's*[,|:]s*', 3, @out OUTPUT
PRINT @out
EXEC xp_regex_split 'one ,two|three : four', 's*[,|:]s*', 4, @out OUTPUT
PRINT @out
prints out:
one
two
three
four

Please note that in the case above, there is no performance penalty for running
the same split more than once. xp_regex_split caches both the input
string and the regular expression. So calling it multiple times while changing
only the @column_number parameter is perfectly fine. The actual split
is only done the first time and the rest is pulled from cache.
3. FN_XP_REGEX_SPLIT and FN_XP_REGEX_FORMAT
These are user-defined functions that wrap the stored procedures. This way you
can use the function as part of a select list or a where clause:
SELECT master.dbo.fn_regex_format(phone_number,
'[^d]*(d{3})[^d]*(d{3})[^d]*(d{4})', '$1 $2 $3')
FROM
customers
This would format every phone number in the "customers" table.
4. Installation
a. Copy xpRegex.dll to your Program FilesMicrosoft SQL ServerMSSQLinn
folder.
b. Run the SQL script INSTALL.SQL. This will register the procedures and
create the user-defined functions.
5. Misc
I've created a set of Performance Counters (the Performance Object is called
"xp_regex"). You can use the Windows 2000 System Monitor to see various
statistics about xp_regex while it is running.
Comments/corrections/additions are welcome. Please let me know if you find
this useful! Thanks!
Originally appeared at http://www.codeproject.com/useritems/xpRegex.asp.
dan@stamps.com
Stamps.com, Inc.
Postage from Your Printer |