Scrapebox Scraping

How many urls can I scrape from each engine per keyword?

Due to the API limits set by the engines, 1000 urls is the most you can scrape per engine per keyword.

How do I adjust the number of urls I scrape from each engine per keyword?

In the "Select Engines & Proxies" section there is a box labeled "Results:". This is where you specify how many urls you want returned for each keyword from each engine you choose.

What is the "M" button in the harvester section?

It lets you merge a .txt file with your keyword list.  For example, if you wanted to run a site: search on a list of urls, you can load the urls in the keyword section and then merge a file that contains site: with it.  It will add whatever is in the .txt file in front of every keyword in the list.

You can include multiple lines in the .txt file, and each keyword will be combined with each line in the file.

How can I scrape all the urls for a domain?

You can trim the url to root, so you have the domain homepage.  Then you can use the site: operator to scrape the additional urls.

site:http://www.rootdomain.com
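
If you want to prepare these queries outside of Scrapebox, the trim-to-root and site: steps can be sketched in a few lines of Python. This is only an illustration, not how Scrapebox does it internally; the function names and sample urls are made up for the example.

```python
from urllib.parse import urlsplit


def trim_to_root(url):
    """Reduce a full url to scheme plus host (the domain homepage)."""
    parts = urlsplit(url)
    return f"{parts.scheme}://{parts.netloc}"


def site_queries(urls):
    """Build one site: query per unique root, keeping first-seen order."""
    roots = dict.fromkeys(trim_to_root(u) for u in urls)
    return [f"site:{root}" for root in roots]


# Hypothetical harvested urls, used only for illustration.
harvested = [
    "http://www.rootdomain.com/blog/post-1",
    "http://www.rootdomain.com/about",
    "http://www.example.org/page?id=2",
]

queries = site_queries(harvested)
for query in queries:
    print(query)
```

The resulting lines can be pasted into the keyword box as they are.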

Tip:  When using search operators, it's important that you understand what the operators do and how they work with the different engines.  A few Google searches on search operators will turn up a lot of useful information.

How do I scrape all of the backlinks/inlinks to any given url?

You can scrape the backlinks to any given url (up to the 1000 url maximum imposed by the engines' API limits) by using the link: operator.

link:http://www.site.com/fullurl

At this time, using this feature in Yahoo is the same as using Yahoo Site Explorer.

How do I use tokens with the M (merge) option?

Ok, let me explain the "M" merge a little. Say you have the following keywords in the ScrapeBox keyword box:

black
white

If you hit the "M" to load a footprint file and it contains this:

"powered by wordpress" %KW%
intitle:%KW%
intitle:%KW% inurl:%KW%

The outcome will be:

"powered by wordpress" black
"powered by wordpress" white
intitle:black
intitle:white
intitle:black inurl:black
intitle:white inurl:white

I hope that makes sense. If not, just experiment with it.
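
If it helps to see the token merge spelled out, here is a rough Python sketch that reproduces the example above. It assumes the merge simply replaces every %KW% in each footprint line with each keyword, going through the footprint lines in order. The function name is made up, and this is not Scrapebox's actual code.

```python
def merge_with_tokens(keywords, footprints, token="%KW%"):
    """Replace every token in each footprint line with each keyword.

    Results are grouped by footprint line, then by keyword,
    matching the order shown in the example above.
    """
    return [
        footprint.replace(token, keyword)
        for footprint in footprints
        for keyword in keywords
    ]


keywords = ["black", "white"]
footprints = [
    '"powered by wordpress" %KW%',
    "intitle:%KW%",
    "intitle:%KW% inurl:%KW%",
]

merged = merge_with_tokens(keywords, footprints)
for line in merged:
    print(line)
```

With 2 keywords and 3 footprint lines you get 2 x 3 = 6 queries, so large keyword lists combined with large footprint files grow quickly.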

Can I scrape more than 1 million urls at a time?

Yes.  You can scrape a virtually unlimited number of urls.  Urls are stored in chunks of 1 million in the following folder:

Scrapebox Folder>> Harvester Sessions >> Harvester_ XXX_XXX

Each session creates a new folder.  You may want to delete these from time to time.

Where are harvested results stored?

Urls are stored in chunks of 1 million in the following folder:

Scrapebox Folder>> Harvester Sessions >> Harvester_ XXX_XXX

Each session creates a new folder.  You may want to delete these from time to time.

How can Scrapebox scrape blogsearch.google.com?

To scrape blogsearch.google.com, add this as a custom Google (use the drop-down arrow next to Google in the search engine selection, in the lower left quadrant of the main GUI):

google.com+&tbm=blg&

Then make sure it's checked when you hit that same drop-down arrow. The reason it hasn't been specifically included is that Google includes a lot of non-blogs in there. scrapebox.com, for instance, is there, along with lots of forums and other stuff, so a custom footprint that searches for blogs is going to give more accurate results than the Google blog search.

Can Scrapebox scrape from Google Groups?

No.  Google Groups uses a completely different html format, and it would require a different parser altogether, as well as other changes.

How can I scrape news.google.com?

You can add the following as a custom Google and it will scrape news.google.com:

google.com+&tbm=nws
