PHP Funda

Chirag

What are Regular Expressions?

The term regular expressions is one of those technical terms where the words do very little to explain what the technology does. regular expressions are a feature common to many programming languages and are a topic on which entire can, and indeed, have been written. Fortunately (or unfortunately depending on your perspective), MySQL does not provide as extensive support for regualr expressions as some other programming languages. This is good in that it makes the subject area easier to learn, but may be frustrating if you are already proficient with the rich regular expression syntax available in other languages.

Regular expressions are essentially a set of instructions using a predefined syntax for matching text in a variety of flexible ways. For example, you might want to extract all the occurrences of a particular word sequence from a block of text. You might also want to perform a search for a particular piece of text and replace it with some alternate text.

Regular Expression Character Matching

In order to introduce the REGEXP operator, we will begin by looking at a use of regular expressions that could similarly be used with the LIKE operator. As in the previous chapter we need to retrieve rows from a table taking into consideration the difference in spelling of the color gray (grey). To perform this retrieval we will use the regex dot character matching (.). Rather like the LIKE underscore wildcard, this character indicates that any character in this position in the text will be considered a match. For example:

 
mysql> SELECT * FROM product WHERE prod_name REGEXP 'Gr.y Computer Case';
+-----------+--------------------+-------------------+
| prod_code | prod_name          | prod_desc         |
+-----------+--------------------+-------------------+
|        11 | Grey Computer Case | ATX PC CASE       |
|        12 | Gray Computer Case | ATX PC CASE (USA) |
+-----------+--------------------+-------------------+
2 rows in set (0.05 sec)

So far we haven't done anything that could not have been achieved using wildcards. Regular expressions, however, do not stop here.

Matching from a Group of Characters

One problem with the approach outlined above is that any letter between the 'Gr' and the 'y' would have registered as a match. In reality we are only interested in words that contain either an 'a' or an 'e' in that location. Fortunately, regular expressions allow us to specify a group of acceptable character matches for any character position. The syntax for this requires that the characters be places in square brackets at the desired location in the match text. For example:

 
mysql> SELECT * FROM product WHERE prod_name REGEXP 'Gr[ae]y Computer Case';
+-----------+--------------------+-------------------+
| prod_code | prod_name          | prod_desc         |
+-----------+--------------------+-------------------+
|        11 | Grey Computer Case | ATX PC CASE       |
|        12 | Gray Computer Case | ATX PC CASE (USA) |
+-----------+--------------------+-------------------+
2 rows in set (0.00 sec)

Use of this syntax ensures that only the words 'Grey' and 'Gray will match the search criteria. There is no limit to the number of characters that can be grouped in the brackets when using this filtering technique.

Matching from a Range of Characters

The character group matching syntax can be extended to cover range of characters. For example, instead of declaring a regular expression to cover the letters between A and F as [ABCDEF] we could simply specify a range of characters using the '-' character between the upper and lower ranges [A-F]. We could, for example, list a product based on certian model numbers which begin with numbers ranging from 1 to 6:

 
mysql> SELECT * FROM product WHERE prod_name REGEXP 'CD-RW Model [1-6]543';
+-----------+------------------+-----------+
| prod_code | prod_name        | prod_desc |
+-----------+------------------+-----------+
|         1 | CD-RW Model 4543 | CD Writer |
|        14 | CD-RW Model 5543 | CD Writer |
|        15 | CD-RW Model 6543 | CD Writer |
|        16 | CD-RW Model 2543 | CD Writer |
+-----------+------------------+-----------+
4 rows in set (0.00 sec)

Handling Special Characters

As you have seen, regular expressions assign special meaning to particular characters. For example the dot (.) and square brackets ([]) all have special meaning. Those who studying critical thinking at college will already be questioning what to do if the character sequence that is the subject of a search contains one or more of these characters. Obviously if you are are looking for the following text that looks like a regular expression, the text for which you want to search is, itself, going to be viewed as regular expression syntax.

To address this issue, a concept known as escaping is used. In SQL, escaping involves preceding any characters that may be mis-interpreted as a regular expression special character with double back slashes (\\). For example, suppose we have a row in our product table which reads as follows:

 
+-----------+--------------------+-----------+
| prod_code | prod_name          | prod_desc |
+-----------+--------------------+-----------+
|        17 | CD-RW Model [7543] | CD Writer |
+-----------+--------------------+-----------+

If we were to search for this without regard to the fact that the prod_name value contains regular expression special characters we will not get what we wanted:

 
mysql> SELECT * FROM product WHERE prod_name REGEXP 'CD-RW Model [7543]';
+-----------+------------------+-----------+
| prod_code | prod_name        | prod_desc |
+-----------+------------------+-----------+
|         1 | CD-RW Model 4543 | CD Writer |
|        14 | CD-RW Model 5543 | CD Writer |
+-----------+------------------+-----------+
2 rows in set (0.00 sec)

The cause of the problem is that the regular expression has been interpreted as a search for any entries that read 'CD-RW Model' followed by either a 7, 5, 4 or 3. Clearly, what we wanted was the actual text [7543]. If instead we escape the brackets with the \\ escape sequence we get the result we want:

 
mysql> SELECT * FROM product WHERE prod_name REGEXP 'CD-RW Model \\[7543\\]';
+-----------+--------------------+-----------+
| prod_code | prod_name          | prod_desc |
+-----------+--------------------+-----------+
|        17 | CD-RW Model [7543] | CD Writer |
+-----------+--------------------+-----------+
1 row in set (0.00 sec)

Regular Expressions and Whitespace Characters

Regular expression syntax also provides a mechanism to reference whitespace characters such as tabs, carriage returns and line feeds. These are referenced in a regular expression using metacharacters. These metacharacters are outlined in the following table:

Metacharacter	Description
\\n	New line (line feed)
\\f	Form feed
\\t	Tab
\\r	Carriage return
\\v	Vertical tab

Matching by Character Type

Another useful regular expression trick is to match characters by type or class. For example we might need to specify that a character must be a letter, a number or a alphanumeric. This involves the use of some special class definitions outlines in the following table. Some examples of these special classes in action follow the table:

Class Keyword	Description of Matches
[[:alnum:]]	Alphanumeric - any number or letter. Equivalent to [a-z], [A-Z] and [0-9]
[[:alpha:]]	Alpha - any letter. Equivalent to [a-z] and [A-Z]
[[:blank:]]	Space or Tab. Equivalent to [\\t] and [ ]
[[:cntrl:]]	ASCII Control Character
[[:digit:]]	Numeric. Equivalent to [0-9]
[[:graph:]]	Any character with the exception of space
[[:lower:]]	Lower case letters. Equivalent to [a-z]
[[:print:]]	Any printable character
[[:punct:]]	Characters that are neither control characters, nor alphanumeric (i.e punctuation characters)
[[:space:]]	Any whitespace character (tab, new line, form feed, space etc)
[[:upper:]]	Upper case letters. Equivalent to [A-Z]
[[:xdigit:]]	Any hexadecimal digit. Equivalent to [A-F], [a-f] and [0-9]

Let's now look at some examples. Suppose in our product database we have two products with similar names, the 'One&One VoIP Headset' and the "One2One USB Hub'. In order to retrieve the 'One2One' product we would use the [:digit:] character class:

 
mysql> SELECT * FROM product WHERE prod_name REGEXP 'One[[:digit:]]One';
+-----------+-----------------+----------------+
| prod_code | prod_name       | prod_desc      |
+-----------+-----------------+----------------+
|        19 | One2One USB Hub | 4 Port USB Hub |
+-----------+-----------------+----------------+
1 row in set (0.00 sec)

Similarly, to extract the 'One&One' product we would use the [:punct:] class:

 
mysql> SELECT * FROM product WHERE prod_name REGEXP 'One[[:punct:]]One';
+-----------+----------------------+----------------+
| prod_code | prod_name            | prod_desc      |
+-----------+----------------------+----------------+
|        18 | One&One VoIP Headset | Stereo Headset |
+-----------+----------------------+----------------+
1 row in set (0.02 sec)

Regular Expression Repetition Metacharacters

In addition to allowing searches on single instances, regular expressions can also be written to look for repetition in text. this is achieved using a set of repetition metacharacters:

Metacharacter	Description
*	Any number of matches
+	One or more matches
{n}	n number of matches
{n,}	Not less than n number of matches
{n1,n2}	A range of matches between n1 and n2
?	Optional single character match (character my be present or not to qualify for a match)

As always, example do a much better job of demonstrating a concept than data in a table. Let's begin by searching for all 4 digit number sequences in our prod_name column:

 
mysql> SELECT * FROM product WHERE prod_name REGEXP '[[:digit:]]{3}';
+-----------+---------------------+-----------------+
| prod_code | prod_name           | prod_desc       |
+-----------+---------------------+-----------------+
|         1 | CD-RW Model 4543    | CD Writer       |
|         2 | EasyTech Mouse 7632 | Cordless Mouse  |
|         3 | WildTech 250Gb 1700 | SATA Disk Drive |
|         7 | Dell XPS 400        | Desktop PC      |
|        14 | CD-RW Model 5543    | CD Writer       |
|        15 | CD-RW Model 6543    | CD Writer       |
|        16 | CD-RW Model 2543    | CD Writer       |
|        17 | CD-RW Model [7543]  | CD Writer       |
+-----------+---------------------+-----------------+
8 rows in set (0.00 sec)

In the above example we have indicated that we are looking for any sequence of 3 digits by using the digit:{3} regular expression. In this case we have picked up entries with both 3 and 4 digits in a sequence. Suppose, instead we wanted only 4 digit sequences:

 
mysql> SELECT * FROM product WHERE prod_name REGEXP '[[:digit:]]{4,}';
+-----------+---------------------+-----------------+
| prod_code | prod_name           | prod_desc       |
+-----------+---------------------+-----------------+
|         1 | CD-RW Model 4543    | CD Writer       |
|         2 | EasyTech Mouse 7632 | Cordless Mouse  |
|         3 | WildTech 250Gb 1700 | SATA Disk Drive |
|        14 | CD-RW Model 5543    | CD Writer       |
|        15 | CD-RW Model 6543    | CD Writer       |
|        16 | CD-RW Model 2543    | CD Writer       |
|        17 | CD-RW Model [7543]  | CD Writer       |
+-----------+---------------------+-----------------+
7 rows in set (0.00 sec)

Here we see that our Dell XPS 400 is no longer listed because it has only 3 digits.

The '?' metacharacter is particularly useful when we need to allow for plural words. For example, we may want to list any product descriptions where the word Drive or Drives is used. To achieve this we simply follow the 's' with a '?', thereby making the trailing 's' optional for a match:

 




mysql> SELECT * FROM product WHERE prod_desc REGEXP 'Drives?';
+-----------+------------------------+--------------------+
| prod_code | prod_name              | prod_desc          |
+-----------+------------------------+--------------------+
|         3 | WildTech 250Gb 1700    | SATA Disk Drive    |
|        20 | MasterSlave Multi-pack | 5 SATA Disk Drives |
+-----------+------------------------+--------------------+
2 rows in set (0.00 sec)

Matching by Text Position

The final area of regular expressions to cover in this chapter involves matching based on the location of text in a string. For example, we may want to find a particular match that requires that a word appears at the beginning or end of a piece of text. Once again, this requires the use of some special metacharacters:

Metacharacter	Description
^	Beginning of text
$	End of text
[[:<:]]	Start of word
[[:>:]]	End of word

For example, to search for text that begins with a digit:

 
mysql> SELECT prod_desc FROM product WHERE prod_desc REGEXP '^[[:digit:]]';
+--------------------+
| prod_desc          |
+--------------------+
| 4 Port USB Hub     |
| 5 SATA Disk Drives |
+--------------------+
2 rows in set (0.00 sec)

Similarly, to search for text which ends in the word 'Phone':

 
mysql> SELECT prod_desc FROM product WHERE prod_desc REGEXP 'Phone$';
+--------------+
| prod_desc    |
+--------------+
| Smart Phone  |
| Mobile Phone |
+--------------+
2 rows in set (0.00 sec)

We can also find instances where string of characters are a separate word. For example if we search for the word 'one' we might get the following:

 
mysql> SELECT prod_name FROM product WHERE prod_name REGEXP 'one';
+----------------------+
| prod_name            |
+----------------------+
| Apple iPhone 8Gb     |
| One2One USB Hub      |
| Level One Firewall   |
+----------------------+
3 rows in set (0.00 sec)

As we can see from the above example, because the word 'Phone' contains the word 'one' we have retrieved more rows than we anticipated. Using the beginning and and end of word metacharacters we can isolate the instances of 'one' which are a standalone word:

 
mysql> SELECT prod_name FROM product WHERE prod_name REGEXP '[[:<:]]One[[:>:]]';
+----------------------+
| prod_name            |
+----------------------+
| Level One Firewall   |
+----------------------+
1 row in set (0.00 sec)

Chirag

Regular Expressions in MySQL

Introduction

A very interesting and useful capability of MySQL is to incorporate Regular Expressions (regex) in SQL queries. The regular expression support in MySQL is extensive. Let's take a look at using Regular Expressions in queries and the supported metacharacters.

Using Regular Expressions in queries

A simple example of using Regular Expressions in a SQL query would be to select all names from a table that start with 'A'.

Code: SQL

SELECT name FROM employees WHERE name REGEXP '^A'

A slight modification in the above example to look for names starting with 'A' or 'D' or 'F' will look like this.

Code: SQL

SELECT name FROM employees WHERE name REGEXP '^(A|D|F)'

If we want to select all names ending with 'P', then the SQL query goes like this

Code: SQL

SELECT name FROM employees WHERE name REGEXP 'P$'

We can use much complex patterns in our SQL queries, but first let's have a look at various MySQL Regular Expression metacharacters.

Regular Expression Metacharacters

*

Matches zero or more instances of the string preceding it

+

Matches one or more instances of the string preceding it

?

Matches zero or one instances of the string preceding it

.

Matches any single character, except a newline

[xyz]

Matches any of x, y, or z (match one of enclosed characters)

[^xyz]

Matches any character not enclosed

[A-Z]

Matches any uppercase letter

[a-z]

Matches any lowercase letter

[0-9]

Matches any digit

^

Anchors the match from the beginning

$

Anchors the match to the end

|

Separates alternatives

{n,m}

String must occur at least n times, but not more than m times

{n}

String must occur exactly n times

{n,}

String must occur at least n times

[[:<:]]

Matches beginning of words

[[:>:]]

Matches ending of words

[:class:]

match a character class i.e.,

[:alpha:] for letters

[:space:] for whitespace

[:punct:] for punctuation

[:upper:] for upper case letters

Extras

MySQL interprets a backslash (\) character as an escape character. To use a backslash in a regular expression, you must escape it with another backslash (\\).

Whether the Regular Expression match is case sensitive or otherwise is decided by the collation method of the table. If your collation method name ends with ci then the comparison/match is case-insensitive, else if it end in cs then the match is case sensitive.

Examples

Checking only for numbers

Code: SQL

SELECT age FROM employees WHERE age REGEXP '^[0-9]+$'

/* starts, ends and contains numbers */

Contains a specific word, for example the skill PHP in skill sets

Code: SQL

SELECT name FROM employees WHERE skill_sets REGEXP '[[:<:]]php[[:>:]]'

Fetching records where employees have entered their 10-digit mobile number as the contact number.

Code: SQL

SELECT name FROM employees WHERE contact_no REGEXP '^[0-9]{10}$'

For more information refer http://dev.mysql.com/doc/refman/5.1/en/regexp.html

Chirag

A regular expression (regex) is a powerful way of specifying a complex search.

MySQL uses Henry Spencer's implementation of regular
expressions, which is aimed at conformance with POSIX
1003.2. MySQL uses the extended version.

This is a simplistic reference that skips the details. To get more exact
information, see Henry Spencer's regex(7) manual page that is
included in the source distribution. See section C Credits.

A regular expression describes a set of strings. The simplest regexp is
one that has no special characters in it. For example, the regexp
hello matches hello and nothing else.

Non-trivial regular expressions use certain special constructs so that
they can match more than one string. For example, the regexp
hello|word matches either the string hello or the string
word.

As a more complex example, the regexp B[an]*s matches any of the
strings Bananas, Baaaaas, Bs, and any other string
starting with a B, ending with an s, and containing any
number of a or n characters in between.

A regular expression may use any of the following special
characters/constructs:

^

Match the beginning of a string.

 
mysql> SELECT "fo\nfo" REGEXP "^fo$";           -> 0
mysql> SELECT "fofo" REGEXP "^fo";              -> 1

$

Match the end of a string.

 
mysql> SELECT "fo\no" REGEXP "^fo\no$";         -> 1
mysql> SELECT "fo\no" REGEXP "^fo$";            -> 0

.

Match any character (including newline).

 
mysql> SELECT "fofo" REGEXP "^f.*";             -> 1
mysql> SELECT "fo\nfo" REGEXP "^f.*";           -> 1

a*

Match any sequence of zero or more a characters.

 
mysql> SELECT "Ban" REGEXP "^Ba*n";             -> 1
mysql> SELECT "Baaan" REGEXP "^Ba*n";           -> 1
mysql> SELECT "Bn" REGEXP "^Ba*n";              -> 1

a+

Match any sequence of one or more a characters.

 
mysql> SELECT "Ban" REGEXP "^Ba+n";             -> 1
mysql> SELECT "Bn" REGEXP "^Ba+n";              -> 0

a?

Match either zero or one a character.

 
mysql> SELECT "Bn" REGEXP "^Ba?n";              -> 1
mysql> SELECT "Ban" REGEXP "^Ba?n";             -> 1
mysql> SELECT "Baan" REGEXP "^Ba?n";            -> 0

de|abc

Match either of the sequences de or abc.

 
mysql> SELECT "pi" REGEXP "pi|apa";             -> 1
mysql> SELECT "axe" REGEXP "pi|apa";            -> 0
mysql> SELECT "apa" REGEXP "pi|apa";            -> 1
mysql> SELECT "apa" REGEXP "^(pi|apa)$";        -> 1
mysql> SELECT "pi" REGEXP "^(pi|apa)$";         -> 1
mysql> SELECT "pix" REGEXP "^(pi|apa)$";        -> 0

(abc)*

Match zero or more instances of the sequence abc.

 
mysql> SELECT "pi" REGEXP "^(pi)*$";            -> 1
mysql> SELECT "pip" REGEXP "^(pi)*$";           -> 0
mysql> SELECT "pipi" REGEXP "^(pi)*$";          -> 1

{1}

{2,3}

The is a more general way of writing regexps that match many
occurrences of the previous atom.

a*: Can be written as a{0,}.
a+: Can be written as a{1,}.
a?: Can be written as a{0,1}.

To be more precise, an atom followed by a bound containing one integer
i and no comma matches a sequence of exactly i matches of
the atom. An atom followed by a bound containing one integer i
and a comma matches a sequence of i or more matches of the atom.
An atom followed by a bound containing two integers i and
j matches a sequence of i through j (inclusive)
matches of the atom.

Both arguments must be in the range from 0 to RE_DUP_MAX
(default 255), inclusive. If there are two arguments, the second must be
greater than or equal to the first.

[a-dX]

[^a-dX]

Matches
any character which is (or is not, if ^ is used) either a, b,
c, d or X. To include a literal ] character,
it must immediately follow the opening bracket [. To include a
literal - character, it must be written first or last. So
[0-9] matches any decimal digit. Any character that does not have
a defined meaning inside a [] pair has no special meaning and
matches only itself.

 
mysql> SELECT "aXbc" REGEXP "[a-dXYZ]";         -> 1
mysql> SELECT "aXbc" REGEXP "^[a-dXYZ]$";       -> 0
mysql> SELECT "aXbc" REGEXP "^[a-dXYZ]+$";      -> 1
mysql> SELECT "aXbc" REGEXP "^[^a-dXYZ]+$";     -> 0
mysql> SELECT "gheis" REGEXP "^[^a-dXYZ]+$";    -> 1
mysql> SELECT "gheisa" REGEXP "^[^a-dXYZ]+$";   -> 0

[[.characters.]]

The sequence of characters of that collating element. The sequence is a
single element of the bracket expression's list. A bracket expression
containing a multi-character collating element can thus match more than
one character, for example, if the collating sequence includes a ch
collating element, then the regular expression [[.ch.]]*c matches the
first five characters of chchcc.

[=character_class=]

An equivalence class, standing for the sequences of characters of all
collating elements equivalent to that one, including itself.

For example, if o and (+) are the members of an
equivalence class, then [[=o=]], [[=(+)=]], and
[o(+)] are all synonymous. An equivalence class may not be an
endpoint of a range.

[:character_class:]

Within a bracket expression, the name of a character class enclosed in
[: and :] stands for the list of all characters belonging
to that class. Standard character class names are:

Name	Name	Name
alnum	digit	punct
alpha	graph	space
blank	lower	upper
cntrl	print	xdigit

These stand for the character classes defined in the ctype(3) manual
page. A locale may provide others. A character class may not be used as an
endpoint of a range.

 
mysql> SELECT "justalnums" REGEXP "[[:alnum:]]+";       -> 1
mysql> SELECT "!!" REGEXP "[[:alnum:]]+";               -> 0

[[:<:]]

[[:>:]]

These match the null string at the beginning and end of a word
respectively. A word is defined as a sequence of word characters which
is neither preceded nor followed by word characters. A word character is
an alnum character (as defined by ctype(3)) or an underscore
(_).

 
mysql> SELECT "a word a" REGEXP "[[:<:]]word[[:>:]]";      -> 1
mysql> SELECT "a xword a" REGEXP "[[:<:]]word[[:>:]]";     -> 0

 
mysql> SELECT "weeknights" REGEXP "^(wee|week)(knights|nights)$"; -> 1

PHP Funda

Tuesday, January 20, 2009

MySQL Regular Expressions - Part 3

What are Regular Expressions?

Regular Expression Character Matching

Matching from a Group of Characters

Matching from a Range of Characters

Handling Special Characters

Regular Expressions and Whitespace Characters

Matching by Character Type

Regular Expression Repetition Metacharacters

Matching by Text Position

MySQL Regular Expressions - Part 2

Regular Expressions in MySQL

MySQL Regular Expressions - Part 1

Make Donation

Blog Archive

HitMap

Bookmark Site

Referance Site

Total Hit

Page Rank Checking tool

Earning Vahicle

Snap Shots