Use the Spam Word Filter to add Regular Expressions

This tutorial was provided by Chactory. The original version can be found on his website.

Spamihilator's Spam Word Filter supports Regular Expressions (sometimes called RegExes). This tutorial explains how to use Regular Expressions to identify spam words in spam mails. Comprehensive information can be found on the following websites: Regenechsen (German) or Regular Expressions.

The Spam Word Filter

Most Spam Word Filters use a list of words which typically occur in spam mails. If the filter finds such a word in a mail's subject or body, it marks the mail as spam. Spamihilator's Spam Word Filter goes further than that.

Each spam word has a certain probability between 0% and 100%. This ensures that a mail is not blocked immediately when the filter finds only one fairly ordinary word like benefit (40%). The mail is considered spam only if it contains three or more spam words and the sum of their probabilities is higher than 100%.
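
The scoring rule described above can be sketched in a few lines of Python. This is only an illustration under assumptions: the function name is_spam, its parameters and every probability except benefit (40%) are made up here, and Spamihilator's actual implementation may differ in detail.

```python
def is_spam(hits, min_words=3, threshold=100):
    """Apply the rule from the text: at least three spam words
    whose probabilities add up to more than 100%.

    hits is a list of (word, probability_in_percent) pairs.
    """
    total = sum(probability for _, probability in hits)
    return len(hits) >= min_words and total > threshold


# Hypothetical hits; only "benefit" (40%) comes from the tutorial.
one_word = [("benefit", 40)]
two_words = [("benefit", 40), ("pills", 70)]
three_words = [("benefit", 40), ("pills", 70), ("discount", 30)]

print(is_spam(one_word))     # one word, 40% in total
print(is_spam(two_words))    # 110% in total, but only two words
print(is_spam(three_words))  # three words, 140% in total
```

Only the last call satisfies both conditions at once, which is why a single ordinary word never blocks a mail on its own.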

Spamihilator's Spam Word Filter is a substring filter. That means it also matches longer words which contain a spam word as a substring, but this advantage has to be used with care. Look at the following examples: The string sex could filter the words sex, sexy, sexual and others, but it could also filter regular words like sexagenarian. The spam word anal could filter analyse or analgesic. Spam words can be disabled by resetting their probability to 0%. The following text shows how to redefine them as regular expressions.
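
The difference between a plain substring search and a regular expression with word boundaries can be tried out with Python's re module, used here as a stand-in for Spamihilator's own engine (which may behave differently in details). The helper names and the pattern \bsexy?\b are assumptions made for this sketch; (?i)\banal\b is taken from the example list in this tutorial.

```python
import re

words = ["sex", "sexy", "sexagenarian", "analyse", "analgesic"]


def substring_hits(needle, candidates):
    """Plain substring search, as a simple spam word would behave."""
    return [w for w in candidates if needle in w]


def regex_hits(pattern, candidates):
    """Search each candidate with a regular expression instead."""
    rx = re.compile(pattern)
    return [w for w in candidates if rx.search(w)]


print(substring_hits("sex", words))         # also catches sexagenarian
print(substring_hits("anal", words))        # catches analyse and analgesic
print(regex_hits(r"(?i)\bsexy?\b", words))  # whole words only
print(regex_hits(r"(?i)\banal\b", words))   # no harmless word is hit
```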

Definition

Regular Expressions are used to find strings in long texts. RegExes consist of characters, character classes/groups, wildcards and modifiers/meta-characters.

Rules

How to find one character: The character a can be used to find the first occurrence of a, for example in hat.

How to find characters with character classes/groups: [Vv] can be used to find the first occurrence of V or v. The expression analy[sz]e can be used to find analyse or analyze.

How to use wildcards: \s matches whitespace characters, \b matches boundaries of words. The dot (.) matches any character except a line break.

How to use modifiers: you may add a modifier as a suffix to one of the expressions explained above. ? matches the preceding expression zero times or once. + matches it once or more. * matches it zero or more times. For example: hat(ch)? matches hat or hatch.
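
The rules so far can be checked with a short Python script. Python's re module is used as a stand-in for Spamihilator's engine, and the helper first_match as well as the sample texts are assumptions chosen for this sketch.

```python
import re


def first_match(pattern, text):
    """Return the first substring of text matched by pattern, or None."""
    m = re.search(pattern, text)
    return m.group(0) if m else None


print(first_match(r"a", "hat"))                   # single character
print(first_match(r"analy[sz]e", "to analyze"))   # character class
print(first_match(r"hat\s", "a hat trick"))       # \s: whitespace
print(first_match(r"h.t", "hot"))                 # dot: any character
print(first_match(r"hat(ch)?", "hatch"))          # optional group present
print(first_match(r"hat(ch)?", "hats"))           # optional group absent
print(first_match(r"go+d", "gd"))                 # + needs at least one o
print(first_match(r"go*d", "gd"))                 # * accepts zero o's

# \b only matches at a word boundary, so the "hat" inside "that" is skipped.
print(re.search(r"\bhat\b", "that hat").start())
```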

Greediness of the asterisk: The modifier * is greedy. A regular expression with an asterisk does not stop at the shortest possible match, but extends the match as far as possible. This can lead to false positives. A question mark after the asterisk (*?) makes it non-greedy, so that it matches as little as possible.
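
A small Python experiment makes the difference visible. The sample text and the variable names are assumptions made for this sketch, and Python's re module stands in for Spamihilator's engine.

```python
import re

text = "<b>free</b> and <b>cheap</b>"

# Greedy: .* runs to the last </b> it can reach.
greedy = re.search(r"<b>.*</b>", text).group(0)

# Non-greedy: .*? stops at the first </b>.
lazy = re.search(r"<b>.*?</b>", text).group(0)

print(greedy)
print(lazy)
```

The greedy version swallows the harmless words between the two tags, which is the kind of over-matching that can turn into a false positive.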

Meta-characters are \, [, {, ^, $, ., |, ?, *, +, ( and ). \ (Backslash) disables a meta-character, for example: 1\+1 matches 1+1.
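
The effect of the backslash can be checked in Python, again as a stand-in for Spamihilator's engine. The sample strings are assumptions; re.escape is a standard library helper that adds the backslashes automatically.

```python
import re

# Escaped plus sign: matches the literal text 1+1.
escaped = bool(re.search(r"1\+1", "1+1=2"))

# Unescaped plus sign: means "one or more 1, followed by 1".
unescaped_on_sum = bool(re.search(r"1+1", "1+1=2"))
unescaped_on_ones = bool(re.search(r"1+1", "111"))

print(escaped, unescaped_on_sum, unescaped_on_ones)

# re.escape puts a backslash in front of the meta-character for you.
print(re.escape("1+1"))
```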

Overview

a matches a
. matches all characters except line break (newline)
\d digits
\D non-digits
\w alpha-numeric characters and underscore
\W non-\w-characters
\s whitespace characters (tab, space, etc.)
\S non-whitespace characters
\b boundary of a word
\B non-boundary of a word
\A beginning of a string
\z actual end of a string
\Z end of a string, or the last character before a line break
^ beginning of a line
$ line end
| alternation: matches the expression on its left or the one on its right
? expression right before the question mark is optional (matches 0 or 1 time)
+ matches 1 or more times
* matches 0 or more times
*? question mark decreases greediness of the asterisk
\ disables meta-characters ?, [, {, (, ), +, *, ., ^, $, | and \
[] character class, contains several characters or ranges of characters
() expression group
{x,y} x up to y times. y is optional: {x} means x times, {x,} x or more times
(?i) in front of an expression makes the match case-insensitive
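
A few entries of the overview can be tried out as follows. This is a sketch in Python, whose re module stands in for Spamihilator's engine; the sample texts are assumptions. The anchors \A, \z and \Z are left out because regex engines differ in how they treat them.

```python
import re

# (pattern, text, substring expected to be found first)
examples = [
    (r"\d{2,4}", "call 12345 now", "1234"),   # at most four digits
    (r"\D+", "12ab34", "ab"),                 # non-digits
    (r"\w+", "foo_bar baz", "foo_bar"),       # underscore counts as \w
    (r"\S+", "  two words", "two"),           # non-whitespace
    (r"^spam", "spam and eggs", "spam"),      # beginning of the line
    (r"eggs$", "spam and eggs", "eggs"),      # end of the line
    (r"cat|dog", "hot dog", "dog"),           # alternation
    (r"a{2,}", "baaad", "aaa"),               # two or more times
    (r"(?i)viagra", "Buy VIAGRA", "VIAGRA"),  # case-insensitive
]

for pattern, text, expected in examples:
    found = re.search(pattern, text).group(0)
    print(pattern, "on", repr(text), "finds", repr(found))
```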

Examples

How to find viagra, viaaggggra, via gra, via.gra, \/|AGRA, \/I/\GR/\, v1aqra, vi@gr@ and variants:

The Regular Expression [Vv].?[Ii].?[Aa].?[Gg].?[Rr].?[Aa] from Spamihilator's default list of spam words matches viagra, viaggra, via gra and via.gra.

The expression v[\W_]{0,2}[i1][\W_]{0,2}[a@][\W_]{0,2}g[\W_]{0,2}r[\W_]{0,2}[a@] from the website of another anti-spam software matches viagra, via gra and via.gra, but not viaggra.

The RegEx (?i)[v\\]+/?.?[i:1!\|]+.?[a@/]+\\?.?[gq]+.?r+.?[a@/]+\\? matches all variants from above.
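
The three expressions can be compared against the variants listed above with a short script. It uses Python's re module as a stand-in for Spamihilator's engine; the variable names and the helper matched are assumptions made for this sketch.

```python
import re

default_rx = re.compile(r"[Vv].?[Ii].?[Aa].?[Gg].?[Rr].?[Aa]")
other_rx = re.compile(
    r"v[\W_]{0,2}[i1][\W_]{0,2}[a@][\W_]{0,2}g[\W_]{0,2}r[\W_]{0,2}[a@]"
)
broad_rx = re.compile(
    r"(?i)[v\\]+/?.?[i:1!\|]+.?[a@/]+\\?.?[gq]+.?r+.?[a@/]+\\?"
)

# Backslashes are doubled because these are ordinary Python strings.
variants = [
    "viagra", "viaggra", "viaaggggra", "via gra", "via.gra",
    "\\/|AGRA", "\\/I/\\GR/\\", "v1aqra", "vi@gr@",
]


def matched(rx):
    """Return the variants in which rx finds a match."""
    return [v for v in variants if rx.search(v)]


for name, rx in [("default", default_rx), ("other", other_rx), ("broad", broad_rx)]:
    print(name, matched(rx))
```

Only the third expression catches every variant in the list; the first two each miss the spellings that replace letters with slashes, digits or other symbols in positions they do not allow for.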

The more comprehensive a regular expression is, the more likely it is to match a “good” string as well, so that a non-spam mail may be classified as spam.
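
Such a false positive is easy to reproduce. In the following Python sketch the harmless sentence is an assumption made up for the illustration, and Python's re module stands in for Spamihilator's engine; the two expressions are the first and the third one from the examples above.

```python
import re

default_rx = re.compile(r"[Vv].?[Ii].?[Aa].?[Gg].?[Rr].?[Aa]")
broad_rx = re.compile(
    r"(?i)[v\\]+/?.?[i:1!\|]+.?[a@/]+\\?.?[gq]+.?r+.?[a@/]+\\?"
)

harmless = "We drove via Granada."

# Both expressions allow one arbitrary character between "via" and "gra",
# so the space between the two harmless words is accepted.
default_hit = default_rx.search(harmless).group(0)
broad_hit = broad_rx.search(harmless).group(0)

print(repr(default_hit))
print(repr(broad_hit))
```

This is one reason to give such expressions a probability below 100%, so that a single hit does not decide on its own.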

Additional examples:

ambien (?i)[a@/]+\\?.?m+.?[b8]+.?[i:1!\|]+.?[e3€]+.?n+\b
anal (?i)\banal\b
bulgary (?i)\b[b8].?[uv].?[li17\|].?[gq].?[a@/][\\]?r.?[yi1!:\|]?
cheapest pills (?i)cheapest\spills
cialis (?i)\bc.?[i1!:\|].?[a@/][\\]?.?[li17\|].?[i1!:\|].?[s235$]
credit (?i)\bc.?r.?[e3€].?d.?[i1!:\|].?[t\+]
discount (?i)\bd.?[i1!:\|].?[s235$].?c.?[oQ0].?[uv].?n.?[t\+]?
ejaculation (?i)ejaculation
enlargement (?i)enlargement
levitra \b[LlIi17\|].?[Ee3].?[Vv\\][/]?.?[Ii1!:\|].?[Tt\+].?[Rr].?[Aa@/]?[\\]? (v not optional!)
money (?i)\bm.?[oQ0].?n.?[e3€].?y
mortgage \b[Mm].?[OoQ0].?[Rr].?[Tt\+].?[Ggq].?[Aa@/][\\]?.?[Ggq].?[Ee3€]?
natural weight loss (?i)(natural\s)?weight(\s)?loss
omega \b[OoQ0].?[Mm].?[Ee3€].?[Ggq].?[Aa@/][\\]?
online pharmacy (?i)online\spharmacy
penis (?i)\bp+.?[e3€]+.?n+.?[i1!:\|]+.?[s5$]+\b
pharmaceuticals (?i)pharmaceuticals
porno (?i)\bp.?[oQ0].?r.?n.?[oQ0]?
premature (?i)premature
prescription (?i)prescription
refinance \b[Rr].?[Ee3€].?[Ff].?[Ii1!:\|].?[Nn].?[Aa@/][\\]?.?[Nn].?[CcZz].?[Ee3€]?
rolex \b[Rr].?[OoQ0].?[LlIi17\|].?[Ee3€].?[Xx]
sex \b[Ss25$].?[Ee3€].?[Xx].?[Yy]?\b
soma [Ss25$].?[OoQ0].?[Mm].?[Aa]?\b or \b[Ss25$].?[OoQ0].?[Mm].?[Aa/][\\]?\b
src=3D"cid: src=3D"cid:
src="cid: src="cid:
valium (?i)[v\\]+/?.?[a@/]+\\?.?[li17\|]+.?[il1!:\|]+.?[uv]+.?m+\b
xanax \b[Xx].?[Aa@/][\\]?.?[Nn].?[Aa@/][\\]?.?[Xx]?
X-Spam-Level: ******** X-Spam-Level:\s\*{8,30}
**** SPAM **** (?i)\*{1,6}spam\*{1,6}
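
Before adding expressions like these to the filter, it can help to run them against a few sample strings. The following Python sketch does this for four entries of the list; the sample strings, the expected results and the helper hits are assumptions made here, and Python's re module is only a stand-in for Spamihilator's engine.

```python
import re

sex_rx = r"\b[Ss25$].?[Ee3€].?[Xx].?[Yy]?\b"

# (pattern from the list, sample text, should the pattern hit?)
cases = [
    (r"X-Spam-Level:\s\*{8,30}", "X-Spam-Level: ********", True),
    (r"X-Spam-Level:\s\*{8,30}", "X-Spam-Level: ***", False),
    (r"(?i)\banal\b", "please analyse this", False),
    (r"(?i)cheapest\spills", "CHEAPEST PILLS online", True),
    (sex_rx, "s3x", True),
    (sex_rx, "sexy", True),
    (sex_rx, "sexagenarian", False),
]


def hits(pattern, text):
    """True if pattern matches anywhere in text."""
    return re.search(pattern, text) is not None


for pattern, text, expected in cases:
    status = "ok" if hits(pattern, text) == expected else "UNEXPECTED"
    print(status, repr(text), pattern)
```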

Tools

If you want to create your own Regular Expressions, you may use the program The Regex Coach by Edi Weitz. It can be used to test regular expressions, even step by step. The Regex Coach is freeware.

08/07/07, Chactory
English translation by Michel Krämer

 
en/tutorials/regex.txt · Last modified: 2008/03/22 16:06 by michel