Extract email addresses from tags

Ran into another cool hurdle today for my Fwd:Vault development. When I grab the message content to archive it in the system, first thing I do is scrub it out to ensure that (a) it displays properly, and (b) there are no misbehaving characters. I grab both plain text and HTML email formats (if present), so the scrubbing process is a little different in each case. For the plain text, I take some extra steps to ensure there is no HTML whatsoever. Naturally, at one point this involves a call to PHP’s ultra-useful <a href="http://www.php.net/strip_tags">strip_tags()</a> function. However, in the course of testing today, I realized that when a message is forwarded, sometimes the forward header will encode the email address, which gets stripped when I process the message. Allow me to demonstrate. Here’s the body an example message that someone might send to Fwd:Vault for safe keeping… “` ———- Forwarded message ———-

From: "Office Flirt” flirt@example.com

Date: Wed, Jan 14, 2009 at 10:14 AM

Subject: Delete those images

To: you@example.com My boss is sniffing around. I want you to delete those pictures I sent you right away. Signed,

Office Flirt “`

Obviously you’re tucking this one away in Fwd:Vault to provide a little CYA-insurance when the boss calls you into his office. Good call. Now, before today, this message would come out of the scrubbing process looking like this: ”` ———- Forwarded message ———-

From: Office Flirt

Date: Wed, Jan 14, 2009 at 10:14 AM

Subject:

To: you@example.com

… “`

Look at the bolded red line. The email address is gone. You don’t have any other copies of it, so your boss doesn’t believe your story, and you get the blame. You’re forced to attend one of those god-awful sexual harassment classes. Fail. So, what happened? Remember, you are looking at the body of a message in plain text. That "Forwarded message” block at the beginning is just part of the body text. So when the text was scrubbed by strip_tags(), the function picked it up as just another tag, which it dutifully removed. To handle this situation, I came up with a piece of code that will look for email addresses in “tagged format” — i.e. surrounded by < and > — and remove the surrounding symbols, leaving us with harmless text.lang="php">$test = 'some surrounding text';

$test = preg_replace( "/(<)(.+@[^();:,<>]+.[a-zA-Z]{2,4})(>)/",

                  ' $2 ',

                  $test);

$test = preg_replace(’/[\s]+/’, “, $test);

echo $test; ”`

Let’s break this down. First, we have a regular expression that identifies email addresses: .+@[^\(\);:,<>]+\.[a-zA-Z]{2,4}. This is the same expression set in the example on the Quanetic Software Regular Expression Tester (an excellent tool). We surround that in parentheses to isolate it as a subpattern. Then on either end of the expression, we tack on more regex voodoo to look for tag syntax: (\<) and (\>). These also get parentheses to identify them as subpatterns. Once its finished, we have an expression that will only match addresses wrapped in tagging structure. The second argument in preg_replace() is the replacement, or what we should replace any matches with. In this case, we’ve isolated the address from the tags using subpatterns. So all we need to do is make a single call to the proper reference, which is $2, because its the second set of parentheses in the expression. Confused? You can learn about subpatterns on the PHP manual page for preg_replace(). Note the spaces around the $2 in the second argument. Sometimes the address will not have any spaces between the person’s name and the actual address. This could lead to the address being combined with the name which, in the case of Fwd:Vault, would screw up our search indexing. So we add spaces during the replace, then make a second call to preg_replace() to eliminate extra spaces: $test = preg_replace('/[\s]+/', '', $test);. Legal Disclaimer: In case you do end up using Fwd:Vault when it launches, I’m fairly certain the service wouldn’t be liable in this silly hypothetical. Just make sure you read the terms before you sign up if you play the field at your office. Sorry to everyone going “duh” right now; it’s a sue-happy world. Update: When I went to implement this change today, I discovered that the code was catching newlines (\n or \r) in the crossfire. It was actually due to the second call to preg_replace(), the “\s” character class includes not only spaces but line terminators as well. Oops. The revised version looks like this:lang="php"> $body_text = preg_replace('/[ ]{2,}/', ' ', $body_text);