How many private proxies do I need to scrape Google, and how many connections should I use?

As a general rule for using private proxies with Google at the moment, it's recommended to use 1 connection for every 20-30+ proxies for basic keyword scraping. For advanced operator scraping, it's recommended to use 1 connection for every 30-50+ proxies. If you don't have that many proxies, you can use the detailed harvester and add a delay.

The concept here is that each IP is used slowly enough that it doesn't trigger a ban. When you don't have enough proxies to spread the requests out that thinly, the delay in the detailed harvester does the same job.

So if you are searching for words like "cloud" and you have 60 proxies, then you could try using 2 connections.

If you were searching for something like "inurl:contact cloud" and you had 100 proxies, you could try using 2 connections, but if you only had, say, 10 proxies, you might need a 10-20 second delay.
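The arithmetic behind those two examples can be sketched in a few lines of Python. This is only a rough illustration, not part of any tool: the function name `suggest_connections` and the `PROXIES_PER_CONNECTION` table are made up here, and the numbers 30 and 50 are simply the cautious end of the ranges given above.

```python
# Proxies needed per connection, using the cautious end of the ranges above.
# These names and values are illustrative assumptions, not tool settings.
PROXIES_PER_CONNECTION = {"basic": 30, "advanced": 50}


def suggest_connections(proxy_count, query_type="basic"):
    """Return (connections, needs_delay) for a given number of proxies."""
    per_connection = PROXIES_PER_CONNECTION[query_type]
    connections = proxy_count // per_connection
    if connections == 0:
        # Not enough proxies for even one full-speed connection,
        # so fall back to a single connection with a delay.
        return 1, True
    return connections, False


print(suggest_connections(60, "basic"))      # (2, False)
print(suggest_connections(100, "advanced"))  # (2, False)
print(suggest_connections(10, "advanced"))   # (1, True)
```

Treat the result as a starting point to try, not a guarantee that the proxies will stay unbanned.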

However, there is no hard and fast rule, as there are many factors that can't be predicted or known. For example, the more often a proxy is banned, the quicker it is banned in the future and the longer it is banned for. So if the person who had the proxies before you got them banned a lot, you will have to use more proxies or add a delay. There is no way to know this for certain; Google gives no codes that indicate anything, so it's merely a matter of trial and error. It's generally a good idea to pick a safe range, such as the one listed above, and stay with it. If you find it's not enough and your proxies still get banned, then you just need to increase the delay in the detailed harvester or use fewer connections.
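The trial and error step can be written down as a simple rule: if bans keep happening, drop a connection first, and once you are down to 1 connection, start raising the delay. The sketch below assumes exactly that order; the function name `adjust_after_bans` and the 5 second `delay_step` are hypothetical choices for illustration, not recommended values.

```python
def adjust_after_bans(connections, delay_seconds, delay_step=5):
    """Return slower (connections, delay_seconds) settings after bans are seen.

    delay_step is an arbitrary illustrative increment, in seconds.
    """
    if connections > 1:
        # Still running multiple connections: use one fewer.
        return connections - 1, delay_seconds
    # Already at 1 connection: lengthen the delay instead.
    return 1, delay_seconds + delay_step


print(adjust_after_bans(2, 0))   # (1, 0)
print(adjust_after_bans(1, 10))  # (1, 15)
```

You would apply this once per round of testing, then harvest again and see whether the bans stop.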

Note: The detailed harvester is single threaded, and the custom harvester has no delay. So if you need more than 1 connection, use the custom harvester. If you need a delay, then you really shouldn't be using more than 1 connection anyway, so you can use the detailed harvester.
