How to build the regex for phone number scraping

You're looking to put a backslash before any characters like + and ( or ), but not before dashes.

Add a real space wherever the number has a space, and for each group of digits use

\d{x}

where x is the number of digits in that group. So you can look at my example and the 3 examples built into ScrapeBox, then make up your own masks for your other formats.

 

OK, let's take one of the included USA examples

1-555-555-5555

With a regex of

\d{1}-\d{3}-\d{3}-\d{4}

 

Now let me break that down

First you type a backslash.

Then comes the d, which stands for a digit, hence the d.

Then the number in the curly braces is the number of digits.

So the first section is

\d{1}

So backslash, then d for digits, then {1} because there is 1 digit in the first part of the phone number.  So it matches on the

1

That you see in

1-555-555-5555

 

Then dashes are just there.  So the dash after the 1 is

-

So that so far gives us

\d{1}-

Then for the new group of numbers we backslash it out, then d for digits then {3} because there are 3 digits in the next part.

So

\d{3}

That matches on the

555

Which thus far gives us

\d{1}-\d{3}

Which matches

1-555

So hopefully that makes sense.  Then we continue on for the rest, and since there is 1 more set of 3 digits, the next part of the regex is the same

\d{1}-\d{3}-\d{3}-

Then that last segment has 4 digits so it’s a 4 in there instead of a 3

\d{1}-\d{3}-\d{3}-\d{4}

So that matches

 

1-555-555-5555
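If you want to sanity check a mask outside of ScrapeBox, one option is Python's re module, which understands the same \d{n} syntax. This is only a rough sketch: the HTML snippet is made up for illustration, and ScrapeBox's own regex engine may differ in small ways.

```python
import re

# The mask built above: 1 digit, then groups of 3, 3 and 4 digits, joined by dashes.
USA_MASK = r"\d{1}-\d{3}-\d{3}-\d{4}"

# Made-up page source; a real page will have far more markup around the number.
html = "<p>Call us on 1-555-555-5555 or email sales@example.com</p>"

matches = re.findall(USA_MASK, html)
print(matches)  # ['1-555-555-5555']
```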

 

~~~~~~~~~~~~~~~

 

Let's take another format

+444 333-222 111

The regex would be

\+\d{3} \d{3}-\d{3} \d{3}

 

So you put in a backslash and the plus, then backslash, d and {3}, which matches the 444 in that number, and then a space.

Then follow through the same way for the remaining groups.
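Here is a small Python check of that mask, again just a sketch with made-up text. The second number in the text uses a space where the mask expects a dash, so only the first one should be picked up.

```python
import re

# Backslash before the plus, real spaces where the number has spaces.
PLUS_MASK = r"\+\d{3} \d{3}-\d{3} \d{3}"

# Made-up text: the fax number has a space instead of the dash.
text = "Tel: +444 333-222 111, Fax: +444 333 222 111"

found = re.findall(PLUS_MASK, text)
print(found)  # ['+444 333-222 111']
```

A mask only matches the exact layout it was built from, so each format you want to scrape needs its own mask.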

 

~~~~~~~~~~~~~~~

 

Another Example

 

+43(0)7243-50724

\+\d{2}\(\d{1}\)\d{4}-\d{5}
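To see why the parentheses need the backslash, here is a short Python sketch (the sample text is made up). With the backslashes the mask matches the number. Without them, Python treats ( and ) as a group rather than as literal characters, so nothing matches.

```python
import re

text = "Kontakt: +43(0)7243-50724"  # made-up line of page text

# Plus sign and both parentheses are backslashed out.
escaped = r"\+\d{2}\(\d{1}\)\d{4}-\d{5}"
escaped_hits = re.findall(escaped, text)
print(escaped_hits)  # ['+43(0)7243-50724']

# Same mask without backslashes on the parentheses: ( ) now form a group,
# so the regex expects a digit where the text has a literal "(".
unescaped = r"\+\d{2}(\d{1})\d{4}-\d{5}"
unescaped_hit = re.search(unescaped, text)
print(unescaped_hit)  # None
```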

 

~~~~~~~~~~~~~~~

 

Notes: Dashes and spaces are just there, so you can put them in as they are; they don't need to be backslashed out.

Plus signs and parentheses DO need to be backslashed out.
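Those rules are mechanical enough to script. Below is a rough Python sketch of a helper that turns a sample number into a mask. The name build_mask is made up for this post and is not part of ScrapeBox. It assumes the sample contains only digits, dashes, spaces and punctuation such as + ( and ).

```python
import re

def build_mask(sample: str) -> str:
    """Turn a sample phone number into a regex mask (hypothetical helper)."""
    parts = []
    # Walk the sample as runs of digits and single non-digit characters.
    for m in re.finditer(r"(\d+)|(\D)", sample):
        digits, other = m.groups()
        if digits is not None:
            # A run of n digits becomes \d{n}.
            parts.append(r"\d{%d}" % len(digits))
        elif other in "- ":
            # Dashes and spaces go in as they are.
            parts.append(other)
        else:
            # Anything else (+, parentheses, ...) gets backslashed out.
            parts.append(re.escape(other))
    return "".join(parts)

print(build_mask("1-555-555-5555"))    # \d{1}-\d{3}-\d{3}-\d{4}
print(build_mask("+444 333-222 111"))  # \+\d{3} \d{3}-\d{3} \d{3}
print(build_mask("+43(0)7243-50724"))  # \+\d{2}\(\d{1}\)\d{4}-\d{5}
```

The three printed masks are the same ones worked out by hand in the examples.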

 

Any regex should have the leading ^ and ending $ removed. Those two characters mean match the start and end of the line. However, when scraping from HTML the data isn’t going to sit perfectly at the start and end of a line of source code; there’s going to be other HTML and content before and after the data being scraped.

The regexes listed at http://www.regexlib.com/Search.aspx?k=phone%20number should all work, but if one starts with ^ and ends with $, those two characters simply need to be removed.
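The effect of the anchors is easy to see in a quick Python sketch. The HTML line is made up, and strip_anchors is a hypothetical helper written for this post that only handles the simple case of one leading ^ and one trailing $.

```python
import re

def strip_anchors(pattern: str) -> str:
    """Drop a leading ^ and a trailing unescaped $ (hypothetical helper)."""
    if pattern.startswith("^"):
        pattern = pattern[1:]
    if pattern.endswith("$") and not pattern.endswith("\\$"):
        pattern = pattern[:-1]
    return pattern

anchored = r"^\d{1}-\d{3}-\d{3}-\d{4}$"
html = "<td>Phone: 1-555-555-5555</td>"  # made-up line of page source

# With the anchors, the number would have to be the whole line.
with_anchors = re.findall(anchored, html)
print(with_anchors)  # []

# Without them, the number is found in the middle of the markup.
without_anchors = re.findall(strip_anchors(anchored), html)
print(without_anchors)  # ['1-555-555-5555']
```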
