I have been a fan of regular expressions for a long time now.
Regular expressions provide a concise and flexible notation for matching and replacing patterns of text within a body of text. Some might say that regular expressions are concise and cryptic, perhaps because most regular expressions are built from a combination of metacharacters such as ^$*+?. The actual regular expression itself is known as a pattern.
They are, in my opinion, very much underused. Perhaps they are a visual turn off? After all, looking at this regular expression…
([a-zA-Z0-9_\-\.]+)@((\[[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.)|(([a-zA-Z0-9\-]+\.)+))([a-zA-Z]{2,4}|[0-9]{1,3})
…and it’s no wonder people don’t use them as much as they should.
However, there is a plethora of good web-sites that list common regular expressions, relieving us of the need to type them in manually. Equally, there are many good tools that will help us build regular expressions using a pleasing visual interface…after which it’s typically a matter of cut’n’paste.
—
A few months ago I had to write a little application that scraped the HTML that makes up a web-page. I needed to extract all the e-mail addresses that were in the HTML – it was all legit, the e-mail addresses were made available to registered organisations (of which we are one) – it was just a shame that the sheer number of e-mail addresses didn’t lend itself to a mailshot. That was, until I wrote a few lines of code that used a pre-defined regular expression to extract all the e-mail addresses, formatting them nicely on the way.
The application worked by asking the user to paste in the HTML source code of the web-page that contained the e-mail addresses, which were embedded within anchor tags. The user could then run the regular expression over the HTML source code: a treeview of the matches appeared on the right-hand side and a neatly formatted, comma-separated list appeared in the textbox at the bottom. Here’s a screenshot of the application:

Here’s the source code:
[code lang="C#"]
using System;
using System.Text.RegularExpressions;
using System.Windows.Forms;

namespace HTML_Scraper
{
    public partial class Extractor : Form
    {
        public Extractor()
        {
            InitializeComponent();
        }

        private void btnProcess_Click(object sender, EventArgs e)
        {
            Boolean found = false;
            lblMessage.Visible = false;
            Match m;
            // Build the regular expression from the pattern textbox.
            Regex r = new Regex(tbRegEx.Text,
                RegexOptions.IgnoreCase
                | RegexOptions.CultureInvariant
                | RegexOptions.IgnorePatternWhitespace
                | RegexOptions.Compiled
                );
            tvTree.Nodes.Clear();
            this.Cursor = Cursors.WaitCursor;
            // Walk every match found in the pasted HTML source.
            for (m = r.Match(tbSource.Text); m.Success; m = m.NextMatch())
            {
                if (m.Value.Length > 0)
                {
                    found = true;
                    // One top-level tree node per match.
                    tvTree.Nodes.Add("[" + m.Value + "]");
                    // Append the match to the comma-separated output.
                    if (tbOutput.Text.Length > 0) { tbOutput.Text = tbOutput.Text + ", "; }
                    tbOutput.Text = tbOutput.Text + m.Value;
                    int ThisNode = tvTree.Nodes.Count - 1;
                    tvTree.Nodes[ThisNode].Tag = m;
                    if (m.Groups.Count > 1)
                    {
                        // Group 0 is the whole match, so start at group 1.
                        for (int i = 1; i < m.Groups.Count; i++)
                        {
                            tvTree.Nodes[ThisNode].Nodes.Add(r.GroupNameFromNumber(i) + ": [" + m.Groups[i].Value + "]");
                            tvTree.Nodes[ThisNode].Nodes[i - 1].Tag = m.Groups[i];
                            int Number = m.Groups[i].Captures.Count;
                            // List individual captures only when a group captured more than once.
                            if (Number > 1)
                            {
                                for (int j = 0; j < Number; j++)
                                {
                                    tvTree.Nodes[ThisNode].Nodes[i - 1].Nodes.Add(m.Groups[i].Captures[j].Value);
                                    tvTree.Nodes[ThisNode].Nodes[i - 1].Nodes[j].Tag = m.Groups[i].Captures[j];
                                }
                            }
                        }
                    }
                }
            }
            if (found)
            {
                // Copy the formatted list to the clipboard and tell the user.
                tbOutput.SelectAll();
                Clipboard.SetText(tbOutput.Text);
                lblMessage.Visible = true;
            }
            this.Cursor = Cursors.Default;
        }

        private void btnGetText_Click(object sender, EventArgs e)
        {
            // Paste the HTML source from the clipboard.
            tbSource.Text = "";
            tbSource.Text = Clipboard.GetText();
        }

        private void textBox1_TextChanged(object sender, EventArgs e)
        {
            // Only allow processing once there is some source text.
            btnProcess.Enabled = (tbSource.Text.Length > 0);
        }

        private void btnEmail_Click(object sender, EventArgs e)
        {
            // Load the pre-defined e-mail address pattern.
            tbRegEx.Text = @"([a-zA-Z0-9_\-\.]+)@((\[[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.)|(([a-zA-Z0-9\-]+\.)+))([a-zA-Z]{2,4}|[0-9]{1,3})";
        }
    }
}
[/code]
As long as you don’t get too bogged down in the actual regular expression itself, the code is fairly self-explanatory.
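For anyone who wants to try the matching loop without the form, here is a minimal console sketch of the same idea. It is not the application itself: the class name EmailExtractor, the method Extract and the sample HTML (with made-up addresses) are my own assumptions for illustration; only the pattern is the one used above.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class EmailExtractor
{
    // The e-mail pattern from the post, unchanged.
    private const string EmailPattern =
        @"([a-zA-Z0-9_\-\.]+)@((\[[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.)|(([a-zA-Z0-9\-]+\.)+))([a-zA-Z]{2,4}|[0-9]{1,3})";

    // Hypothetical helper: returns every e-mail address found in the HTML, in order.
    public static List<string> Extract(string html)
    {
        var regex = new Regex(EmailPattern,
            RegexOptions.IgnoreCase | RegexOptions.CultureInvariant);
        var found = new List<string>();
        // Same Match/NextMatch loop as the form's button handler.
        for (Match m = regex.Match(html); m.Success; m = m.NextMatch())
        {
            found.Add(m.Value);
        }
        return found;
    }

    public static void Main()
    {
        // Made-up sample input: addresses embedded in anchor tags.
        string html = "<a href=\"mailto:jane@example.com\">Jane</a> "
                    + "<a href=\"mailto:bob@example.org\">Bob</a>";
        // Prints: jane@example.com, bob@example.org
        Console.WriteLine(string.Join(", ", Extract(html)));
    }
}
```

Keeping the matching in its own method, away from the treeview and textboxes, also makes it easy to check the pattern against a few known inputs before pointing it at a real page.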
A simple example
Consider the following strings: A123, 234, C456. I’ve deliberately missed the ‘B’ from the second string.
It would be useful to be able to scan these strings to pick out strings similar to A123, i.e. an alphabetic character followed by some numeric content. Alphabetic characters are represented using character sets enclosed in square brackets. Assuming the alphabetic character was allowed a range of A through to Z, we could represent this set like this: [A-Z]. Numeric sets work in the same way; the range 0 to 9 is the pattern [0-9]. A trailing + means “one or more of the preceding set”. Thus, given the pattern [A-Z][0-9]+, we can match the two strings A123 and C456, while 234 is skipped because it has no leading letter.
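A quick way to check this is to run the pattern over the three strings. The sketch below does just that; the class name LetterDigitDemo and the helper MatchingValues are names I have made up for the example.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class LetterDigitDemo
{
    // Hypothetical helper: returns the text of every match of pattern in input.
    public static string[] MatchingValues(string input, string pattern)
    {
        return Regex.Matches(input, pattern)
                    .Cast<Match>()
                    .Select(m => m.Value)
                    .ToArray();
    }

    public static void Main()
    {
        // One capital letter followed by one or more digits.
        string[] hits = MatchingValues("A123, 234, C456", "[A-Z][0-9]+");
        Console.WriteLine(string.Join(" ", hits)); // prints "A123 C456"
    }
}
```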
If we augmented the strings to be A123, B234, C345, we could use the pattern [AC][0-9]+ to match only A123 and C345, skipping B234. The members of a set are simply listed one after another; no separator is needed.
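A set that lists just A and C is written [AC]. A comma placed inside the brackets is not a separator; it becomes one more character the set will accept. The sketch below compares the two forms. The class name CharacterSetDemo, the helper IsWholeMatch and the extra test string ",42" are my own additions for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class CharacterSetDemo
{
    // Hypothetical helper: true when the whole candidate string matches the pattern.
    public static bool IsWholeMatch(string candidate, string pattern)
    {
        return Regex.IsMatch(candidate, "^(?:" + pattern + ")$");
    }

    public static void Main()
    {
        // ",42" is included only to show what a comma inside the set does.
        foreach (string s in new[] { "A123", "B234", "C345", ",42" })
        {
            Console.WriteLine("{0,-5} [AC]: {1,-6} [A,C]: {2}",
                s,
                IsWholeMatch(s, "[AC][0-9]+"),
                IsWholeMatch(s, "[A,C][0-9]+"));
        }
    }
}
```

Both patterns accept A123 and C345 and reject B234; they differ only on ",42", which the comma version accepts.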
Resources:
http://www.regular-expressions.info/
http://regexlib.com/
Tools:
http://www.regular-expressions.info/regexbuddy.html
http://www.editpadpro.com/
Via Search Engines:
http://search.live.com/results.aspx?q=REGULAR+EXPRESSIONS&src=IE-SearchBox
http://www.google.co.uk/search?hl=en&q=REGULAR+EXPRESSIONS&meta=