Scraping Data from Google Search Using Python and Scrapy
Scraping Google SERPs (search engine results pages) is as easy or as complicated as the tools we use allow. For this tutorial, we’ll be using Scrapy, a web scraping framework designed for Python. Python and Scrapy combine to create a powerful duo that we can use to scrape almost any website.
Scrapy has many useful built-in features that make scraping Google a walk in the park without compromising any of the data we want to collect.
For instance, with Scrapy all it takes is a single command to format our data as CSV or JSON files – a process we would otherwise have to code ourselves.
Before jumping into the code itself, let’s first explore a few reasons why a Google scraper can be useful.
Why Scrape Google?
There’s no dispute: Google is the king of search engines. That means there’s a lot of data available in its search results for a savvy scraper to take advantage of.
Here are a few applications for a Google scraper:
Collecting Customer Feedback Data to Inform Your Marketing
In the modern shopping experience, it’s common for customers to search for product reviews before deciding on a purchase.
With this in mind, a powerful application for a Google SERPs scraper is to collect reviews and customer feedback on your competitors’ products to understand what’s working and what’s not working for them.
That could be to improve your own product, to find a way to differentiate yourself from the competition, or to know which features or experiences to highlight in your marketing.
Keep this in mind, because we’ll be building our scraper around this exact scenario.
Inform Your SEO and PPC Strategy
According to Oberlo, “Google has 92.18 percent of the market share as of July 2019” and it “has been visited 62.19 billion times this year”. With that many eyes on the SERPs, getting your company to the top of these pages for relevant keywords can mean serious money.
Web scraping is primarily an information-gathering tool. We can use it to understand our positions in Google better and benchmark ourselves against the competition.
If we look at our positions and compare ourselves to the top-ranking pages, we can devise a strategy to outrank them.
The same goes for PPC campaigns. Because ads appear at the top of every SERP – and sometimes at the bottom – we can tell our scraper to pull the title, description, and link of every ad showing at the top of the search results for our target keywords.
This analysis will help us find un-targeted keywords, understand our competitors’ strategies, and evaluate the copy of their ads to differentiate ours.
Generate Content Ideas
Google also packs many more features into its SERPs, like related searches, “people also ask” boxes, and more. Scraping hundreds of keywords lets you collect all this information in a few hours and organize it in an easy-to-analyze database.
These are just a few use cases. Depending on the type of data and your end goal, you can use a Google scraper for many different purposes.
How to Build a Google Web Scraper Without Getting Blocked
As we said earlier, for this example we’ll build our Google web scraper to collect competitors’ reviews. So, let’s imagine we’re a new startup building project management software, and we want to understand the state of the industry.
Let’s start from there:
1. Choose Your Target Keywords
Now that we know our main goal, it’s time to choose the keywords we want to scrape to support it.
To select your target keywords, consider the terms customers might search to find your offering, and identify your competitors. In this example, we’ll target four keywords:
- “asana reviews”
- “clickup reviews”
- “best project management software”
- “best project management software for small teams”
We could add many more keywords to this list, but for this scraper tutorial they’ll be more than enough.
Also, notice that the first two queries relate to direct competitors, while the last two will help us identify other competitors and get an initial picture of the state of the industry.
2. Set Up Your Development Environment
The next step is to get our machine ready to develop our Google scraper. For this, we’ll need a few things:
- Python version 3 or later
- Pip – to install Scrapy and any other packages we may need
- ScraperAPI
Your machine may have a Python version pre-installed. Enter python --version into your command prompt to see if that’s the case.
If you need to install everything from scratch, follow our Python and Scrapy scraping tutorial. We’ll be using the same setup, so get that done and come back.
Note: one thing to keep in mind is that the team behind Scrapy recommends installing Scrapy in a virtual environment (VE) instead of globally on your PC or laptop. If you’re unfamiliar with this, the Python and Scrapy tutorial above shows you how to create the VE and install all dependencies.
In this tutorial we’re also going to be using ScraperAPI to avoid any IP bans or other repercussions. Google doesn’t really want us to scrape its SERPs – especially for free. As such, it has implemented advanced anti-scraping techniques that’ll quickly identify any bots trying to extract data automatically.
To get around this, ScraperAPI uses third-party proxies, machine learning, huge browser farms, and years of statistical data to make sure our scraper won’t get blocked from any site, rotating our IP address for every request, setting wait times between requests, and handling CAPTCHAs.
In other words, by adding just a few lines of code, ScraperAPI will supercharge our scraper, saving us headaches and hours of work.
All we need for this tutorial is to get our API key from ScraperAPI. To get it, just create a free ScraperAPI account to redeem 5,000 free API requests.
3. Create Your Project’s Folder
After installing Scrapy in your VE, enter this snippet into your terminal to create the necessary folders:
scrapy startproject google_scraper
Scrapy will first create a new project folder called “google_scraper,” which also serves as the project’s name. Next, go into this folder and run the “genspider” command (shown below) to create a web scraper named “google”.
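Assuming the project and spider names above, and the ScraperAPI domain we’ll allow later in the spider, those commands would look roughly like this:
cd google_scraper
scrapy genspider google api.scraperapi.com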
We now have several configuration files, a “spiders” folder containing our scraper, and a Python module folder containing package files.
4. Import All Necessary Dependencies to Your google.py File
The next step is to build the various components that will make our script as efficient as possible. To do this, we need to make our dependencies available to our scraper by adding them at the top of our file:
import scrapy
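# In addition to scrapy, we'll likely need json to read ScraperAPI's
# autoparse responses and urlencode/urlparse to build our query URLs
# (a sketch of the remaining imports, not a definitive list):
import json
from urllib.parse import urlencode, urlparse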
With these dependencies in place, we can use them to build requests and handle JSON responses. This last part is important because we’ll be using ScraperAPI’s autoparse functionality.
After sending the HTTP request, it will return the data in JSON format, simplifying the process and saving us from writing and maintaining our own parser.
5. Construct the Google Search Query
Google employs a standard, queryable URL structure. You just need to know the URL parameters for the data you need and you can generate a URL to query Google with.
That said, the following makes up the URL structure for all Google search queries:
http://www.google.com/search
There are several standard parameters that make up Google search queries:
- q represents the search keyword parameter. http://www.google.com/search?q=tshirt, for example, will search for results containing the keyword “tshirt.”
- The offset point is specified by the start parameter. http://www.google.com/search?q=tshirt&start=100 is an example.
- hl is the language parameter. http://www.google.com/search?q=tshirt&hl=en is a good example.
- The as_sitesearch parameter lets you search within a specific domain (or website). http://www.google.com/search?q=tshirt&as_sitesearch=amazon.com is one example.
- The number of results per page (the maximum is 100) is specified by the num parameter. http://www.google.com/search?q=tshirt&num=50 is an example.
- The safe parameter returns only “safe” results. http://www.google.com/search?q=tshirt&safe=active is a good example.
Note: Moz’s full list of Google search parameters is extremely useful for building a queryable URL. Bookmark it for more sophisticated scraping projects in the future.
Alright, let’s define a method to build our Google URLs using this information:
def create_google_url(query, site=''):
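    # A possible body for this method (a sketch, not the only way to do it):
    # default to 100 results per page and, if a site is passed in, restrict
    # the search to that domain with as_sitesearch.
    google_dict = {'q': query, 'num': 100}
    if site:
        google_dict['as_sitesearch'] = urlparse(site).netloc
    return 'http://www.google.com/search?' + urlencode(google_dict)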
In our method we set ‘q’ to query because we’ll specify our exact keywords later in the script, which makes it easier to tweak our scraper.
6. Define the ScraperAPI Method
To use ScraperAPI, all we need to do is send our request through ScraperAPI’s server by appending our query URL to the proxy URL provided by ScraperAPI, using payload and urlencode. The code looks like this:
def get_url(url):
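    # A sketch of what this method could look like, assuming an API_KEY
    # constant near the top of the file holds your ScraperAPI key.
    # 'autoparse' asks ScraperAPI to return the SERP as parsed JSON, and
    # 'country_code' sets the country the request is sent from.
    payload = {'api_key': API_KEY, 'url': url,
               'autoparse': 'true', 'country_code': 'us'}
    return 'http://api.scraperapi.com/?' + urlencode(payload)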
Now that we have defined the logic our scraper will use to build our target URLs, it’s time to build the main spider.
7. Write the Spider Class
In Scrapy we can create different classes, called spiders, to scrape specific pages or groups of sites. Thanks to this feature, we can build several spiders inside the same project, making it much easier to scale and maintain.
class GoogleSpider(scrapy.Spider):
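    # A sketch of the class attributes described below: the spider's name,
    # ScraperAPI's domain in allowed_domains, robots.txt ignored, and limits
    # of 10 concurrent requests and 5 retries to stay within a free plan.
    name = 'google'
    allowed_domains = ['api.scraperapi.com']
    custom_settings = {
        'ROBOTSTXT_OBEY': False,
        'CONCURRENT_REQUESTS_PER_DOMAIN': 10,
        'RETRY_TIMES': 5,
    }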
We need to give our spider a name, as this is how Scrapy determines which script you want to run. The name you choose should be specific to what you’re trying to scrape, as projects with multiple spiders can get confusing if they aren’t clearly named.
Because our URLs will start with ScraperAPI’s domain, we also need to add “api.scraperapi.com” to allowed_domains. ScraperAPI changes the IP address and headers between every retry before returning a failed message (which doesn’t count against our total available API calls).
We also want to tell our scraper to ignore the directive in the robots.txt file. This is because, by default, Scrapy won’t scrape any site that has a contradictory directive in said file.
Finally, we’ve set a few constraints so that we don’t exceed the limits of our free ScraperAPI account. As you can see in the custom_settings code above, we’re telling ScraperAPI to send at most 10 concurrent requests and to retry 5 times after any failed response.
8. Send the Initial Request
It’s finally time to send our HTTP request. This is very easy to do with the start_requests(self) method:
def start_requests(self):
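    # A sketch of the request logic: the four target keywords from step 1
    # (spaces replaced by '+' signs), each turned into a Google URL, wrapped
    # in a ScraperAPI request, and sent off with the position counter at 0.
    queries = ['asana+reviews', 'clickup+reviews',
               'best+project+management+software',
               'best+project+management+software+for+small+teams']
    for query in queries:
        url = create_google_url(query)
        yield scrapy.Request(get_url(url), callback=self.parse, meta={'pos': 0})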
It will loop through a list of queries, each of which is passed to the create_google_url function as the query URL keyword.
The query URL we create is then sent to Google Search via the proxy connection we set up in the get_url function, using Scrapy’s yield. The result is then handed to the parse function to be processed (it should be in JSON format). The {‘pos’: 0} key-value pair is also added to the meta parameter, which is used to keep track of the position of the results scraped.
Note: when typing keywords, remember that every word in a keyword is separated by a + sign rather than a space.
9. Write the Parse Function
Thanks to ScraperAPI’s auto-parsing functionality, our scraper should get a JSON file back as the response to our request. Make sure it does by enabling the parameter ‘autoparse’: ‘true’ in the get_url function.
Next, we’ll load the entire JSON response and loop through each result, taking the data and combining it into a new item that we can use later.
def parse(self, response):
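    # A sketch of a possible parse method. The field names used here
    # ('organic_results', 'title', 'snippet', 'link', 'pagination',
    # 'nextPageUrl') are assumptions about ScraperAPI's autoparse JSON;
    # adjust them to match the response you actually receive.
    di = json.loads(response.text)
    pos = response.meta['pos']
    for result in di.get('organic_results', []):
        yield {
            'title': result.get('title'),
            'snippet': result.get('snippet'),
            'link': result.get('link'),
            'position': pos,
        }
        pos += 1
    # If the response includes a link to another page of results, follow it.
    next_page = di.get('pagination', {}).get('nextPageUrl')
    if next_page:
        yield scrapy.Request(get_url(next_page), callback=self.parse, meta={'pos': pos})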
This process also checks whether another page of results is available. If an additional page is present, the request is invoked again, repeating until there are no more pages.
10. Run the Spider
Congratulations, we built our first Google scraper! Remember, the code can always be modified to add any functionality we find missing, but for now we have a working scraper. If you’ve been following along, your google.py file should look something like this by now:
import scrapy
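# The rest of this listing is a sketch assembled from the snippets above;
# the ScraperAPI autoparse field names and the API_KEY placeholder are
# assumptions you may need to adjust.
import json
from urllib.parse import urlencode, urlparse

API_KEY = 'YOUR_SCRAPERAPI_KEY'


def create_google_url(query, site=''):
    google_dict = {'q': query, 'num': 100}
    if site:
        google_dict['as_sitesearch'] = urlparse(site).netloc
    return 'http://www.google.com/search?' + urlencode(google_dict)


def get_url(url):
    payload = {'api_key': API_KEY, 'url': url,
               'autoparse': 'true', 'country_code': 'us'}
    return 'http://api.scraperapi.com/?' + urlencode(payload)


class GoogleSpider(scrapy.Spider):
    name = 'google'
    allowed_domains = ['api.scraperapi.com']
    custom_settings = {
        'ROBOTSTXT_OBEY': False,
        'CONCURRENT_REQUESTS_PER_DOMAIN': 10,
        'RETRY_TIMES': 5,
    }

    def start_requests(self):
        queries = ['asana+reviews', 'clickup+reviews',
                   'best+project+management+software',
                   'best+project+management+software+for+small+teams']
        for query in queries:
            url = create_google_url(query)
            yield scrapy.Request(get_url(url), callback=self.parse, meta={'pos': 0})

    def parse(self, response):
        di = json.loads(response.text)
        pos = response.meta['pos']
        for result in di.get('organic_results', []):
            yield {
                'title': result.get('title'),
                'snippet': result.get('snippet'),
                'link': result.get('link'),
                'position': pos,
            }
            pos += 1
        next_page = di.get('pagination', {}).get('nextPageUrl')
        if next_page:
            yield scrapy.Request(get_url(next_page), callback=self.parse, meta={'pos': pos})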
Note: If you want to scrape Google SERPs from a different country (let’s say Italy), all you need to do is change the country_code parameter in the get_url function. Check out our documentation to learn about every parameter you can customize in ScraperAPI.
To run our scraper, navigate to the project’s folder in the terminal and use the following command:
scrapy crawl google -o serps.csv
Now our spider will run and store all scraped data in a new CSV file named “serps.” This feature is a huge time saver and one more reason to use Scrapy for scraping Google.
The saved data can then be analyzed and used to provide insights for your product, your marketing, and more.
If you’d like to dive deeper into web scraping with Python, check out our Python and Beautiful Soup tutorial. Beautiful Soup is a simpler web scraping library for Python that’s just as effective for scraping static pages.
To get the most out of ScraperAPI, take a look at our web scraping and ScraperAPI best practices cheat sheet. You’ll learn about the most common challenges when scraping large websites and how to overcome them.
Happy scraping!