Scrapebox Scraping

How many urls can I scrape from each engine per keyword

1000 urls is the maximum number of urls you can scrape per keyword per engine.  This limitation is not imposed by ScrapeBox, but rather by the engines themselves.

Even if you use a browser and page through to page 100, at 10 results per page, the engines will not give you a Next button to go any further.

How do I adjust the number of urls I scrape from each engine per keyword

In the "Select Engines & Proxies" section there is a box labeled "Results:" this is where you specify how many urls you want returned for each keyword for each engine you choose.

What is the "M" button in the harvester section?

It lets you Merge a .txt file with your keyword list.  So if you wanted to prefix site: to a list of urls, you can load the urls in the keyword section and then merge a file that contains site: with it.  It will add whatever is in the .txt file in front of each keyword in the list.

You can include multiple lines in the .txt file and it will take each keyword and add it to each line in the .txt file.

How can I scrape all the urls for a domain?

You can trim the url to root, so you have the domain homepage.  Then you can use the site: operator to scrape the additional urls.

site:http://www.rootdomain.com

Tip:  When using search operators, it's important that you understand what operators do and how they work with the different engines.  A few Google searches on search operators will yield a lot of useful information.
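As a rough sketch of what "trim to root" plus the site: operator produces, here is a small Python snippet (the function names are made up for illustration, not part of ScrapeBox):

```python
from urllib.parse import urlparse

def trim_to_root(url):
    """Reduce a full url to its homepage (scheme + host)."""
    parts = urlparse(url)
    return f"{parts.scheme}://{parts.netloc}"

def site_queries(urls):
    """Prefix each unique root domain with the site: operator."""
    roots = {trim_to_root(u) for u in urls}
    return sorted("site:" + r for r in roots)

print(site_queries([
    "http://www.rootdomain.com/page/one",
    "http://www.rootdomain.com/page/two",
]))
```

Both input urls collapse to the same root, so only one site: query is produced.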

How do I scrape all of the backlinks/inlinks to any given url?

You can scrape all the backlinks to any given url (max 1000, per the engines' API limitation) by using the link: operator.

link:http://www.site.com/fullurl

At this time, using this feature in Yahoo is the same as using Yahoo Site Explorer.

How do I use tokens with the M (merge) option?

Ok, let me explain the "M" merge a little.  Say you have the following keywords in the ScrapeBox keyword box:

black
white

If you hit the "M" to load a footprint file and it contains this:

"powered by wordpress" %KW%
intitle:%KW%
intitle:%KW% inurl:%KW%

The outcome will be:

"powered by wordpress" black
"powered by wordpress" white
intitle:black
intitle:white
intitle:black inurl:black
intitle:white inurl:white

I hope that makes sense, if not just experiment with it.
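The merge behaviour described above can be sketched in a few lines of Python (this is an illustration of the described behaviour, not ScrapeBox source code; the handling of lines without a %KW% token follows the "added in front of the keyword" description earlier in this FAQ):

```python
def merge_footprints(footprint_lines, keywords):
    """Expand each footprint line once per keyword.

    Lines containing %KW% have the token replaced with the keyword;
    lines without it are simply placed in front of the keyword.
    """
    out = []
    for line in footprint_lines:
        for kw in keywords:
            if "%KW%" in line:
                out.append(line.replace("%KW%", kw))
            else:
                out.append(f"{line} {kw}")
    return out

footprints = [
    '"powered by wordpress" %KW%',
    "intitle:%KW%",
    "intitle:%KW% inurl:%KW%",
]
result = merge_footprints(footprints, ["black", "white"])
print("\n".join(result))
```

Running this reproduces the six-line outcome shown above, starting with "powered by wordpress" black.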

Can I scrape more than 1 million urls at a time?

Yes.  You can scrape a virtually unlimited number of urls.  Urls are stored in chunks of 1 million in the following folder:

Scrapebox Folder>> Harvester Sessions >> Harvester_ XXX_XXX

Each session creates a new folder.  You may want to delete these from time to time.

Where are harvested results stored?

Urls are stored in chunks of 1 million in the following folder:

Scrapebox Folder>> Harvester Sessions >> Harvester_ XXX_XXX

Each session creates a new folder.  You may want to delete these from time to time.

How can Scrapebox scrape blogsearch.google.com?

For scraping blogsearch.google.com you need to add this to the custom Googles (drop-down arrow next to Google in the select search engines area, lower-left quadrant of the main GUI):

google.com+&tbm=blg&

Then make sure it's checked when you hit that same drop-down arrow. The reason they haven't specifically included it is that Google includes a lot of non-blogs in there. scrapebox.com, for instance, is there, along with lots of forums and lots of other stuff, so a custom footprint that searches for blogs is going to provide more accurate results than the Google blog search.

Can Scrapebox scrape from Google Groups?

No.  Google Groups uses a completely different html format and it would require a different parser altogether, as well as other changes.

How can I scrape news.google.com?

You can add the following as a custom Google and it will scrape news.google.com:

google.com+&tbm=nws

Can I match / associate scraped results with the respective keywords they came from?

There are two ways to do this.  The first is to scrape 1 keyword at a time and save off the results.

The second way is to untick "use multi threaded harvester" and "use custom harvester" (under settings) and then scrape.  You will then be using the single-threaded harvester.

The single threaded harvester will save all urls in the exact order they are scraped and then when it is done it will give you a count of the number of urls scraped for each keyword.  So you would then export the count and export all the urls that were harvested.

Then you can count down the number harvested for each keyword.

So say you harvest for

car
boat
truck

and you get these results

car - 15
boat - 30
truck - 400

You could export the scraped urls and then open them in excel and count down.  The first 15 results are from the word car, the next 30 results are from boat and the final 400 results are from the word truck.
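The count-down step can be automated instead of done by hand in Excel.  Here is a sketch (the function name is made up; `counts` is the per-keyword tally the single-threaded harvester reports, in scrape order):

```python
def split_by_counts(urls, counts):
    """Re-associate an exported url list with its keywords.

    counts: list of (keyword, number_harvested) pairs in scrape order.
    Returns a dict mapping each keyword to its slice of the url list.
    """
    result, pos = {}, 0
    for keyword, n in counts:
        result[keyword] = urls[pos:pos + n]
        pos += n
    return result

# Hypothetical export of 445 harvested urls, in scrape order.
urls = [f"http://example.com/{i}" for i in range(445)]
groups = split_by_counts(urls, [("car", 15), ("boat", 30), ("truck", 400)])
print(len(groups["car"]), len(groups["boat"]), len(groups["truck"]))
```

This is just the same "first 15 are car, next 30 are boat" logic, done with list slices.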

How can I change the time span in the Automator or Custom Harvester for Google?

When the Automator runs, it uses whatever timespan you have set in ScrapeBox. If it's set to show results from the last month, and you load and run a job, then that's what it will use. Some people prefer it this way: you don't have to go and edit your job files just to adjust the timespan.

If you want to make individual job files to use specific time spans, you could do this with the custom harvester. You can duplicate the existing Google engine and just add the "time based search" parameter to the url and give the engine a new name like "Google 1hr"

Last Hour: &tbs=qdr:h
Last Day: &tbs=qdr:d
Last Week: &tbs=qdr:w
Last Month: &tbs=qdr:m
Last Year: &tbs=qdr:y

Then when you make a job file, just select whatever engine/timespan you want the job to always use.
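To see how those time-based parameters attach to a query string, here is a minimal sketch.  The base url is a simplified assumption for illustration, not ScrapeBox's exact engine definition:

```python
from urllib.parse import quote_plus

# Timespan names mapped to Google's "time based search" parameter values,
# as listed above.
TBS = {
    "hour": "qdr:h",
    "day": "qdr:d",
    "week": "qdr:w",
    "month": "qdr:m",
    "year": "qdr:y",
}

def google_query(keyword, timespan=None):
    """Build a simplified Google query url, optionally time-limited."""
    url = "http://www.google.com/search?q=" + quote_plus(keyword)
    if timespan:
        url += "&tbs=" + TBS[timespan]
    return url

print(google_query("scrapebox tips", "hour"))
```

A duplicated "Google 1hr" engine would effectively append &tbs=qdr:h in the same way.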

What useragent do I choose in the ScrapeBox email grabber?

The ScrapeBox email grabber requires a useragent in order to operate.   Some domains, such as facebook for instance, will work with some useragents, but not with others.  So to allow the most flexibility, ScrapeBox includes an option that lets you choose a useragent.

As a general rule you can choose any useragent from the list.  However, if you find that a particular domain does not work with the useragent you're using, try a different one from the list.

How does the "number of proxies to retrieve before starting" setting work in the custom harvester?

Harvesting urls will start when either the required number of proxies has been found, or there are no more proxies to test and there is at least 1 working proxy to work with.

So if it's set to get 10 proxies, and after testing all proxies from all sources only 5 are found, it will still start.

If you have a list of 50 proxies and set it to 10, it will get the first 10 working and start. After X minutes it will load the entire list of proxies, test them until 10 are found then continue and it will repeat this until either there are 0 working proxies or all the keywords have been done.
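The start condition described above can be sketched as a small predicate (an illustration of the described rule, not ScrapeBox internals):

```python
def should_start(working, required, untested_remaining):
    """Return True when harvesting may begin.

    Begins when the required number of working proxies is found, OR when
    there is nothing left to test and at least one proxy works.
    """
    if len(working) >= required:
        return True
    return untested_remaining == 0 and len(working) >= 1

print(should_start(["p1"] * 5, 10, 0))   # all sources exhausted, 5 working -> True
print(should_start(["p1"] * 5, 10, 40))  # still proxies left to test -> False
```

With 0 working proxies and nothing left to test, it would not start at all.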

Why Does Scrapebox 2.0 Not Support The Option To Replace Proxies Every X Mins?

There are over 20 engines built into the Scrapebox 2.0 proxy harvester by default, and you can add as many of your own as you like as well.  Scrapebox automatically replaces proxies when they are all dead and it needs new ones.

Every thread has its own copy of proxies and works with them, it's designed so threads are not accessing a single global list, but using their own copy because proxies not working with engine 1 might work with engines 2, 3, 4.  So each engine gets a list of proxies to work with.

Currently each engine will auto replace the proxies when all the current working proxies die.  If Scrapebox were to replace proxies every X mins, then proxies that do not work with Engine 1 would get replaced, but proxies that are still working with engine 2 would also get replaced and that makes no sense.

This would slow things down overall, because in order to replace the proxies it would mean suspending every thread, loading the proxies, then creating copies for each thread.  It's also tricky, because you can't just suspend all threads; you have to wait until they are all in some sort of idle state, or else they will crash.

Replacing proxies every 5 minutes, for example, is a little crazy when all threads have to be terminated (which can take some time if running 200 threads across all engines), all the proxies replaced and everything started again. Also, there's no guarantee the proxies tossed out because of a timer were not working better for a number of engines than the ones loaded.  So it makes no sense to replace on a timer; each engine keeps track of its own proxies and replaces them automatically when needed.

So in short, forcing Scrapebox to replace proxies across all engines automatically every X mins would slow Scrapebox down, make it more prone to crashing, and potentially reduce the overall success rate.  Plus you would potentially be tossing out a lot of good proxies for many engines, and replacing them with potentially worse proxies.

It's far better to let ScrapeBox manage all the proxies independently for each engine, replacing the proxies for an engine when they are dead and continuing to use proxies for the other engines as long as they are working.  Basically, ScrapeBox is already optimized to produce the highest success rates possible, so there is no need to try to force it to replace proxies based on a timer, as that's not as efficient.


What do the colors mean in the keyword scraper in Scrapebox 2.0?

Green=completed
Red=error
Blue=aborted
Magenta=timeout
Black=still working (or trying)

When it's red, there should also be an entry in the keywordscraper.log file in the errorlogs folder.


How do I set the harvester to not harvest "suggested results" in Google?

You can do this by using the "Verbatim" link on the side of the search results, which corresponds to adding this to the query string:

&tbs=li:1

Go to settings >> harvester engine configuration - click on google, paste the

&tbs=li:1

on the end of the query string, and then click update engine.  Then when you choose Google it will not harvest suggested results.  Alternatively, you could give this a different display name and save it as a new engine as well.

For Example:

Verbatim: http://www.google.com/search?q=site:http://newyork.craigslist.org+%22ebay+seo%22&complete=0&hl=en&gbv=2&prmd=ivns&source=lnt&tbs=li:1&sa=X

Regular: http://www.google.com/search?q=site:http://newyork.craigslist.org+%22ebay+seo%22&complete=0&hl=en&gbv=2&prmd=ivns&tbas=0&source=lnt&sa=X

How many private proxies do I need to scrape Google, and how many connections should I use?

As a general rule for using private proxies with Google at the moment, it's recommended to use 1 connection per every 20-30+ proxies for basic keyword scraping.  For advanced operator scraping it's recommended to use 1 connection for every 30-50+ proxies.  If you don't have that many proxies, you can use the detailed harvester and add a delay.

The concept here is that the ips are used slowly enough that they don't trigger a ban.  So if you don't have enough proxies to do that, you can use the detailed harvester with a delay.

So if you are searching for words like "cloud" and you have 60 proxies, then you could try using 2 connections.

If you were searching for something like "inurl:contact cloud" and you had 100 proxies, you could try using 2 connections, but if you only had say 10 proxies you might need a 10-20 second delay.

However there is no hard-and-fast rule, as there are many factors that can't be predicted or known.  For example, the more often a proxy is banned, the quicker it is banned in the future and the longer it is banned for.  So if the person who had the proxies before you got them banned a lot, then you will have to use more proxies or add a delay.  There is no way to know this for certain; Google gives no codes that indicate anything, so it's merely a matter of trial and error.  It's generally a good idea to pick a safe range, such as is listed above, and stay with it.  If you find that's not enough and your proxies still get banned, then you just need to increase the delay in the detailed harvester or use fewer connections.

Note: The detailed harvester is single-threaded, and the custom harvester has no delay.  So if you need more than 1 connection, use the custom harvester; if you need a delay then you really shouldn't be using more than 1 connection anyway, so you can use the detailed harvester.
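The rule of thumb above can be turned into a quick calculator.  This is a hedged sketch of the guidance in this section, not an official formula: it assumes roughly 1 connection per 25 proxies for basic keyword scraping, 1 per 40 for operator queries, and a 15-second delay (within the 10-20 second range suggested above) when you fall below one full ratio:

```python
def harvester_plan(num_proxies, advanced_operators=False):
    """Suggest harvester settings from a proxy count (rule of thumb only)."""
    # Assumed midpoints of the recommended ranges: 20-30+ basic, 30-50+ operators.
    per_connection = 40 if advanced_operators else 25
    connections = num_proxies // per_connection
    if connections >= 1:
        # Enough proxies: multiple connections, custom harvester, no delay.
        return {"harvester": "custom", "connections": connections, "delay_seconds": 0}
    # Too few proxies: single-threaded detailed harvester with a delay
    # (15s chosen arbitrarily from the suggested 10-20 second range).
    return {"harvester": "detailed", "connections": 1, "delay_seconds": 15}

print(harvester_plan(60))                        # basic scraping with 60 proxies
print(harvester_plan(10, advanced_operators=True))  # operator scraping, only 10 proxies
```

With 60 proxies for basic scraping this suggests 2 connections, matching the "cloud" example above; with 10 proxies and operators it falls back to the detailed harvester with a delay.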