Scrapebox Everything Else

If I reformat my hard drive or get a new computer what do I do?

Simply re-download and reinstall ScrapeBox.  When you load the program, click the Activate button just like you did the first time.  Then fill out the required info, check the transfer license box, and hit Submit.

(You can re-download ScrapeBox here: http://www.scrapebox.com/payment-received)

How do I transfer my ScrapeBox license to a new PC?

You are permitted to transfer your ScrapeBox license to another PC once per month for free in case you get a new PC, re-install Windows etc.  For complete instructions on how to do this go here: http://www.scrapebox.com/scrapebox-license-transfer

Pinging your links to get them indexed

If you want to ping your links to get them indexed, you need to use the RSS ping function, which is labeled simply RSS in the commenter section of ScrapeBox. The option labeled PING is for inflating page views and won't get your URLs indexed.

RSS ping uses the XML-RPC spec: http://www.xmlrpc.com

So the way to do it is: import the file that contains the URLs you want to get indexed into the harvester grid, then go to Export URL List >> Export as RSS XML List. Then scan the URLs (which fetches the link titles and descriptions), set how many entries go in each feed, and export. It saves as one or more .xml files, which then need to be uploaded to your domain, and will look like: http://www.scrapeboxfaq.com/feed.xml
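The export step produces an ordinary RSS 2.0 XML file. As a rough sketch of what that file looks like, here is a minimal Python version of building one from a URL list (the titles and descriptions below are placeholders; ScrapeBox fetches the real ones when you scan the URLs, and the channel details are example values):

```python
import xml.etree.ElementTree as ET

def build_rss(entries):
    """Build a minimal RSS 2.0 feed from (url, title, description) tuples."""
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    # Channel metadata -- example values, not anything ScrapeBox requires.
    ET.SubElement(channel, "title").text = "My Links"
    ET.SubElement(channel, "link").text = "http://www.example.com/"
    ET.SubElement(channel, "description").text = "Links to index"
    for url, title, desc in entries:
        item = ET.SubElement(channel, "item")
        ET.SubElement(item, "title").text = title
        ET.SubElement(item, "link").text = url
        ET.SubElement(item, "description").text = desc
    return ET.tostring(rss, encoding="unicode")

xml_text = build_rss([("http://www.example.com/page1", "Page 1", "First link")])
print(xml_text)
```

You would then upload the resulting file to your domain, e.g. as feed.xml.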

Then select RSS in the commenter section. Load the RSS services and feed URLs to ping them.  There are default RSS services that come with ScrapeBox, or you can use your own.  The feed URLs are the ones you uploaded to your domain, like above.
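Under the hood, RSS ping services speak the weblogUpdates XML-RPC protocol mentioned above. A hedged Python sketch of what a single ping looks like (the endpoint shown is one well-known public ping service, used here only as an example; the actual call goes over the network, so it is commented out):

```python
import xmlrpc.client

def ping_service(service_url, feed_title, feed_url):
    """Send a weblogUpdates.ping to one XML-RPC ping service."""
    server = xmlrpc.client.ServerProxy(service_url)
    # Most services respond with a dict like
    # {'flerror': False, 'message': '...'}.
    return server.weblogUpdates.ping(feed_title, feed_url)

# Example usage (makes a live network request, so it's commented out):
# ping_service("http://rpc.pingomatic.com/",
#              "My Feed", "http://www.scrapeboxfaq.com/feed.xml")
```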

Scrape emails from Craigslist

You can grab emails with the Email Grabber in the Harvested URLs section. It will let you harvest emails from a URL or a local file.

Say you wanted to harvest emails from the Jobs category on Craigslist.

In a regular web browser, open up Craigslist. Find the category you want to harvest from; for the Jobs category in most major cities, it looks like this:

http://losangeles.craigslist.org/jjj/

I got this by selecting the city I wanted, and then clicking the "jobs" link at the top of the category.

Then you would copy down that URL, which is what is above.  Note: make sure that if it gives you a spam warning, you follow through to get the actual URL of the page that lists the ads.

If you like, you can also copy down the URLs of the "Next 100 results" pages.

Then save off all of the URLs from the categories you want.

Then import them into the Link Extractor addon.

Choose Internal only.

Then let it harvest all the URLs from those pages.  This will give you all the current Craigslist ads for each category, from all the pages you chose.

Then export the results to a txt file.

Then import that txt file into the Harvested URLs section.

Then use the Email Grabber to get the emails from those URLs.  Thus you have scraped all the emails from Craigslist for the current ads in the categories you chose.

The best part is the category URLs are static, but the URLs you harvest from them change daily, so you can repeat this process over and over.
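The email-grabbing step itself boils down to a regex scan over each page's HTML. A rough Python sketch of that final step, under the assumption of a simple name@domain.tld pattern (the Email Grabber itself is more thorough, and the file name in the usage comment is hypothetical):

```python
import re
import urllib.request

# Basic email pattern; catches ordinary name@domain.tld addresses.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

def emails_from_html(html):
    """Return the unique email addresses found in a blob of HTML."""
    return sorted(set(EMAIL_RE.findall(html)))

# Usage over the exported ad URLs (file name is an example):
# for url in open("craigslist_ads.txt"):
#     html = urllib.request.urlopen(url.strip()).read().decode("utf-8", "ignore")
#     print(emails_from_html(html))

print(emails_from_html('Contact <a href="mailto:job-abc123@craigslist.org">reply</a>'))
# ['job-abc123@craigslist.org']
```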

What does the delay apply to?

The delay option in the comment poster section lets you set a delay in seconds. If RND is chosen, it pulls a random value from the Adjust RND Delay range under Settings.

The delay only works with:
Single Threaded Harvester
Ping Mode
Page Rank Checker
Email Grabber

How do I filter adult URLs or content in ScrapeBox?

Load the list of URLs you want to filter in the Harvested URLs section.  Then go to Remove/Filter >> Remove URLs Containing Entries From.

It will then ask you to load a text file.  Take the terms below, plus any others you want to add, put them in a text file, and save it.  Then load that file when prompted by ScrapeBox, after clicking Remove URLs Containing Entries From.

Of course, some of these terms could filter out good URLs as well, such as "bride", but this is the best alternative for filtering adult URLs.

List:

sex
porn
pron
szex
xxx
x-live
x-video
xvideo
hentai
erotic
chick
tit
boob
slut
anal
poker
babe
blonde
brunette
russian
bride
fuk
redhead
penis
dick
blowjob
oral
gay
lesbian
pussy
vagina
gangbang
bondage
adult
teen
girl
woman
dirty
fuck
ass
bitch
shit
butt
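The filtering step itself is a simple substring check, which can be sketched in a few lines of Python (the URLs and terms below are examples; in practice you would read them from your saved text files):

```python
def filter_urls(urls, terms):
    """Drop any URL whose text contains one of the terms (case-insensitive)."""
    terms = [t.lower() for t in terms]
    return [u for u in urls if not any(t in u.lower() for t in terms)]

# Example list -- note the "bride" caveat from above: the middle URL
# is dropped even though it might have been a good one.
urls = [
    "http://example.com/wedding-ideas",
    "http://example.com/russian-bride-photos",
    "http://example.com/cooking-tips",
]
print(filter_urls(urls, ["bride", "russian"]))
# ['http://example.com/wedding-ideas', 'http://example.com/cooking-tips']
```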

Will Scrapebox work with TOR?

While I have not personally attempted to Torify ScrapeBox to see if it will work, I have looked into it and am fairly confident it won't.  Even if it did, given TOR's many limitations and how ScrapeBox works, it wouldn't work well.  Simply scraping public proxies would work better.

Does ScrapeBox work on WINE?

No, ScrapeBox will not work on WINE.  WINE lacks many of the APIs and other components that ScrapeBox needs to work.

Why does the remove subdomains from URLs function not work for some domains?

With the new ability to register generic TLDs, things have gotten a bit confusing.  So ScrapeBox uses a database of effective TLDs to remove subdomains from URLs.

However, sometimes you will have a list where most of it seems to work, but you will be left with entries like

something.blogspot.com

and wonder why that subdomain wasn't removed.  The answer is that it's not a subdomain; it's a domain, because blogspot.com is treated as an effective TLD.  It would seem .com is the TLD, but for subdomain-removal purposes it's blogspot.com.  So

car.something.blogspot.com

is a subdomain but

something.blogspot.com is a regular domain, just like car.com is a regular domain.  You can view the complete list here:

https://publicsuffix.org/list/effective_tld_names.dat
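As a sketch of how such a lookup works, here is a minimal Python version using a tiny hard-coded suffix set for illustration (the real list linked above has thousands of entries, and ScrapeBox's internal database may differ in details):

```python
# A tiny sample of effective TLDs; the real list at publicsuffix.org
# contains thousands of entries, including multi-label suffixes
# like blogspot.com and co.uk.
EFFECTIVE_TLDS = {"com", "org", "co.uk", "blogspot.com"}

def registered_domain(host):
    """Strip subdomains, keeping the effective TLD plus one extra label."""
    labels = host.lower().split(".")
    # Scan from the longest candidate suffix down to the shortest;
    # the first match wins, and we keep one label in front of it.
    for i in range(len(labels)):
        suffix = ".".join(labels[i:])
        if suffix in EFFECTIVE_TLDS and i > 0:
            return ".".join(labels[i - 1:])
    return host

print(registered_domain("car.something.blogspot.com"))  # something.blogspot.com
print(registered_domain("something.blogspot.com"))      # something.blogspot.com
print(registered_domain("www.example.co.uk"))           # example.co.uk
```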

How to build the regex for phone number scraping

You're looking to put a backslash before any characters like + and ( or ), but not dashes.

Add a real space when there is a space and then do a

\d{x}

Where x is the quantity of digits in that set.  So you can look at my example and the 3 examples built into ScrapeBox, and make up masks for your other formats.

 

OK, let's take one of the included USA examples:

1-555-555-5555

With a regex of

\d{1}-\d{3}-\d{3}-\d{4}

 

Now let me break that down

First comes the backslash.

Then the d, which stands for a digit.

Then the number in curly brackets is the number of digits to match.

So the first section is

\d{1}

So backslash, then d for digits, then {1} because there is 1 digit in the first part of the phone number.  So it matches on the

1

That you see in

1-555-555-5555

 

Then dashes are just there.  So the dash after the 1 is

-

So that so far gives us

\d{1}-

Then for the next group of numbers we backslash it out, then d for digits, then {3} because there are 3 digits in the next part.

So

\d{3}

That matches on the

555

Which thus far gives us

\d{1}-\d{3}

Which matches

1-555

So hopefully that makes sense.  Then we continue on for the rest, and since there is one more set of 3 digits, the next part of the regex is the same:

\d{1}-\d{3}-\d{3}-

Then the last segment has 4 digits, so it's a 4 in there instead of a 3:

\d{1}-\d{3}-\d{3}-\d{4}

So that matches

 

1-555-555-5555

 

~~~~~~~~~~~~~~~

 

Let's take another format:

+444 333-222 111

The regex would be

\+\d{3} \d{3}-\d{3} \d{3}

 

So you put in the backslashed plus, then backslash, d, and {3}, which matches the 444 in that number, and then a real space.

Then follow through for the rest.

 

~~~~~~~~~~~~~~~

 

Another Example

 

+43(0)7243-50724

\+\d{2}\(\d{1}\)\d{4}-\d{5}

 

~~~~~~~~~~~~~~~

 

Notes: Dashes and spaces are just literal, so you can put them in as-is; they don't need to be backslashed out.

Plus signs and parentheses DO need to be backslashed out.

 

Any regex should have the leading ^ and ending $ removed. Those anchors mean "match the start and end of the line"; however, when scraping from HTML, the data isn't going to sit perfectly at the start and end of a line of source code, since there's going to be other HTML and content before and after the data being scraped.

The regexes here should all work: http://www.regexlib.com/Search.aspx?k=phone%20number but if one starts with ^ and ends with $, those anchors simply need to be removed.
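The masks worked through above can be checked directly in Python. Note the use of re.search rather than an anchored match, which mirrors the point about dropping ^ and $, since the number can sit anywhere in the surrounding HTML (the sample strings here are made up for illustration):

```python
import re

# The three phone masks worked through above, each paired with a
# made-up snippet of page text containing a matching number.
masks = {
    r"\d{1}-\d{3}-\d{3}-\d{4}": "call 1-555-555-5555 today",
    r"\+\d{3} \d{3}-\d{3} \d{3}": "tel: +444 333-222 111",
    r"\+\d{2}\(\d{1}\)\d{4}-\d{5}": "office +43(0)7243-50724",
}

for mask, text in masks.items():
    match = re.search(mask, text)  # no ^/$ anchors, so mid-string hits work
    print(mask, "->", match.group(0) if match else None)
```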