How To Scrape Amazon Product Data

Amazon, the largest e-commerce company in the United States, offers the widest range of products in the world. Their product data can be useful in a variety of ways, and you can easily extract this data with web scraping. This guide will help you develop your approach for extracting product and pricing data from Amazon, and you'll better understand how to use web scraping tools and tips to effectively gather the data you need.

The Benefits of Scraping Amazon

Web scraping Amazon data helps you evaluate competitor pricing, monitor prices in real time and track seasonal shifts so you can offer customers better product deals. Web scraping lets you extract the relevant data from the Amazon site and save it in a spreadsheet or JSON format. You can even automate the process to update the data on a regular weekly or monthly basis.

There is currently no way to simply export product data from Amazon to a spreadsheet. Whether it's for competitor research, comparison shopping, creating an API for your app project or any other business need, we've got you covered. Web scraping solves this problem.

Here are some other specific benefits of using a web scraper for Amazon:

  • Use details from product search results to improve your Amazon SEO standing or Amazon marketing campaigns
  • Compare and contrast your offering with that of your competitors
  • Use review data for review management and product optimization for retailers or manufacturers
  • Discover the products that are trending and look up the top-selling product lists for a category

Scraping Amazon is an intriguing business today, with a number of companies offering product, price, review and other types of monitoring solutions specifically for Amazon. Attempting to scrape Amazon data at a large scale, however, is a difficult process that often gets blocked by their anti-scraping technology. It's no easy task to scrape such a large site if you're a beginner, so this step-by-step guide should help you scrape Amazon data, especially if you're using Python Scrapy and Scraper API.

First, Decide On Your Web Scraping Approach

One method for scraping data from Amazon is to crawl each keyword's category or shelf list, then request the product page for each one before moving on to the next. This is best for smaller-scale, less-repetitive scraping. Another option is to create a database of products you want to monitor by keeping a list of products or ASINs (unique product identifiers), then have your Amazon web scraper scrape each of these individual pages every day/week/etc. This is the most common approach among scrapers who monitor products for themselves or as a service. A minimal sketch of the ASIN-based approach follows.
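
The snippet below is only a sketch of the ASIN-list idea; the ASINs shown are placeholders, not real products:

# A minimal sketch of the ASIN-list approach: keep a list of the products you want
# to monitor and build their product-page URLs from it.
asins_to_monitor = ['B000000000', 'B000000001']  # placeholder ASINs

product_urls = [f"https://www.amazon.com/dp/{asin}" for asin in asins_to_monitor]
for url in product_urls:
    print(url)  # feed these URLs to your scraper on a daily/weekly schedule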

Scrape Data From Amazon Using Scraper API with Python Scrapy 

Scraper API lets you scrape even the most challenging websites like Amazon at scale, for a fraction of the cost of using residential proxies. We designed anti-bot bypasses right into the API, and you can access additional features like IP geotargeting (&country_code=us) for over 50 countries, JavaScript rendering (&render=true), JSON parsing (&autoparse=true) and more by simply adding extra parameters to your API requests. Send your requests to our single API endpoint or proxy port, and we'll return a successful HTML response.
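
As a quick illustration, a single request through the API endpoint with a few of these parameters could look like the sketch below; the API key and product URL are placeholders:

# A minimal sketch of calling the Scraper API endpoint with the requests library.
import requests

payload = {
    'api_key': 'YOUR_API_KEY',                      # placeholder key
    'url': 'https://www.amazon.com/dp/B000000000',  # placeholder product URL
    'country_code': 'us',                           # geotargeting
    'render': 'true',                               # JavaScript rendering
}
response = requests.get('http://api.scraperapi.com/', params=payload)
print(response.status_code)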

Start Scraping with Scrapy

Scrapy is a web crawling and data extraction platform that can be used for a variety of purposes such as data mining, information retrieval and historical archiving. Since Scrapy is written in the Python programming language, you'll need to install Python before you can use pip (a Python package manager).

To install Scrapy using pip, run:

pip install scrapy

Then go to the folder where your project will be saved and run the "startproject" command along with the project name, "amazon_scraper". Scrapy will build a web scraping project folder for you, with everything already set up:

scrapy startproject amazon_scraper

The result should look like this:

├── scrapy.cfg                # deploy configuration file
└── tutorial                  # project's Python module, you'll import your code from here
    ├── __init__.py
    ├── items.py              # project items definition file
    ├── middlewares.py        # project middlewares file
    ├── pipelines.py          # project pipelines file
    ├── settings.py           # project settings file
    └── spiders               # a directory where the spiders are located
        ├── __init__.py
        └── amazon.py        # the spider we just created


Scrapy creates all the files you'll need, and each file serves a particular purpose:

  1. Items.py – Can be used to build your base dictionary, which you can then import into the spider (see the sketch after this list).
  2. Settings.py – All of your request settings, pipeline, and middleware activation happens in settings.py. You can adjust the delays, concurrency, and many other parameters here.
  3. Pipelines.py – The item yielded by the spider is passed to pipelines.py, which is mainly used to clean the text and connect to databases (Excel, SQL, etc.).
  4. Middlewares.py – When you want to modify how the request is made and how Scrapy handles the response, middlewares.py is useful.
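
For example, if you prefer to define that base dictionary in items.py rather than yielding plain dictionaries from the spider (as this tutorial does later), a minimal, illustrative sketch could look like this; the class and field names are assumptions, not part of the generated project:

## items.py - illustrative sketch only; the spider in this guide yields plain dicts instead
import scrapy

class AmazonProductItem(scrapy.Item):
    asin = scrapy.Field()
    title = scrapy.Field()
    price = scrapy.Field()
    rating = scrapy.Field()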

Create an Amazon Spider

You've established the project's overall structure, so now you're ready to start working on the spiders that will do the scraping. Scrapy has a variety of spider types, but we'll focus on the most popular one, the generic Spider, in this tutorial.

Simply run the "genspider" command to make a new spider:

# syntax is --> scrapy genspider name_of_spider site.com 
scrapy genspider amazon amazon.com

Scrapy now creates a new file with a spider template, and you'll find a new file called "amazon.py" in the spiders folder. Your code should look like the following:

import scrapy
class AmazonSpider(scrapy.Spider):
    name = 'amazon'
    allowed_domains = ['amazon.com']
    start_urls = ['http://www.amazon.com/']
    def parse(self, response):
        pass

Delete the default code (allowed_domains, start_urls, and the parse function) and replace it with your own, which should include these four functions (a bare skeleton follows the list below):

  1. start_requests – sends an Amazon search query with a particular keyword.
  2. parse_keyword_response – extracts the ASIN value for each product returned in an Amazon keyword query, then sends a new request to Amazon for the product listing. It will also move to the next page and do the same thing.
  3. parse_product_page – extracts all the desired data from the product page.
  4. get_url – sends the request to Scraper API, which will return an HTML response.
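
Put together, a bare skeleton of amazon.py with these four functions could look like the sketch below; the bodies are filled in step by step throughout the rest of this guide:

## amazon.py - skeleton sketch; each method is implemented in the sections that follow
import scrapy

def get_url(url):
    pass  # wrap the target URL in a Scraper API request (implemented later)

class AmazonSpider(scrapy.Spider):
    name = 'amazon'

    def start_requests(self):
        pass  # send an Amazon search query for each keyword

    def parse_keyword_response(self, response):
        pass  # extract each product's ASIN, request its page, follow pagination

    def parse_product_page(self, response):
        pass  # extract the desired fields from the product page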

Send a Search Query to Amazon

You can now scrape Amazon for a particular keyword using the following steps, with an Amazon spider and Scraper API as the proxy solution. This will let you scrape all the key details from the product page and extract each product's ASIN. All pages returned by the keyword query will be parsed by the spider. Try using these fields for the spider to scrape from the Amazon product page:

  • ASIN
  • Product title
  • Price
  • Product description
  • Image URL
  • Available sizes and colors
  • Customer ratings
  • Number of reviews
  • Seller rank

The first step is to create start_requests, a function that sends Amazon search requests containing our keywords. Outside of AmazonSpider, you can simply define a list variable with our search keywords. Input the keywords you want to search for on Amazon into your script:

queries = ['tshirt for men', 'tshirt for women']

Inside the AmazonSpider, you can build your start_requests function, which will submit the requests to Amazon. Submit a search query "k=SEARCH_KEYWORD" to access Amazon's search feature via a URL:

https://www.amazon.com/s?k=<SEARCH_KEYWORD>

It looks like this when we use it in the start_requests function:

## amazon.py
from urllib.parse import urlencode

import scrapy

queries = ['tshirt for men', 'tshirt for women']

class AmazonSpider(scrapy.Spider):
    def start_requests(self):
        for query in queries:
            url = 'https://www.amazon.com/s?' + urlencode({'k': query})
            yield scrapy.Request(url=url, callback=self.parse_keyword_response)

You will urlencode each query in your queries list so that it is safe to use as a query string in a URL, and then use scrapy.Request to request that URL.

Use yield instead of return since Scrapy is asynchronous, so the functions can either return a request or a completed dictionary. If a new request is received, the callback method is invoked. If an object is yielded, it will be sent to the data cleaning pipeline. The parse_keyword_response callback function will then extract the ASIN for each product when scrapy.Request triggers it.

How to Scrape Amazon Products

One of the most popular ways to scrape Amazon involves extracting data from a product listing page. Using an Amazon product page ASIN ID is the easiest and most common way to retrieve this data. Every product on Amazon has an ASIN, which is a unique identifier. We can use this ID in our URLs to get the product page for any Amazon product, such as the following:

https://www.amazon.com/dp/<ASIN>

Using Scrapy's built-in XPath selector methods, we can extract the ASIN value from the product listing page. You can build an XPath selector in Scrapy Shell that captures the ASIN value for each product on the product listing page and generates a URL for each product:

products = response.xpath('//*[@data-asin]')
for product in products:
    asin = product.xpath('@data-asin').extract_first()
    product_url = f"https://www.amazon.com/dp/{asin}"
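
For example, you can open Scrapy Shell against a search results page and experiment with the selector interactively; note that a direct request like this may be blocked by Amazon, in which case wrap the URL with Scraper API as described later:

scrapy shell "https://www.amazon.com/s?k=tshirt+for+men"
>>> products = response.xpath('//*[@data-asin]')
>>> products[0].xpath('@data-asin').extract_first()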

The function will then be configured to send a request to this URL and call the parse_product_page callback function when it receives a response. This request will also include the meta parameter, which is used to pass items between functions or edit certain settings.

def parse_keyword_response(self, response):
        products = response.xpath('//*[@data-asin]')
        for product in products:
            asin = product.xpath('@data-asin').extract_first()
            product_url = f"https://www.amazon.com/dp/{asin}"
            yield scrapy.Request(url=product_url, callback=self.parse_product_page, meta={'asin': asin})

 

Extract Product Data From the Amazon Product Page

After the parse_keyword_response function requests the product page URL, it passes the response it receives from Amazon, along with the ASIN ID in the meta parameter, to the parse_product_page callback function. We now want to extract the data we need from a product page, such as a product page for a t-shirt.

You should create XPath selectors to extract each field from the HTML response we get from Amazon:

def parse_product_page(self, response):
        # note: this uses the re module, so add `import re` at the top of amazon.py
        asin = response.meta['asin']
        title = response.xpath('//*[@id="productTitle"]/text()').extract_first()
        image = re.search('"large":"(.*?)"', response.text).groups()[0]
        rating = response.xpath('//*[@id="acrPopover"]/@title').extract_first()
        number_of_reviews = response.xpath('//*[@id="acrCustomerReviewText"]/text()').extract_first()
        bullet_points = response.xpath('//*[@id="feature-bullets"]//li/span/text()').extract()
        seller_rank = response.xpath('//*[text()="Amazon Best Sellers Rank:"]/parent::*//text()[not(parent::style)]').extract()


Try using a regex selector instead of an XPath selector for scraping the image URL if the XPath is extracting the image in base64.

When working with large websites like Amazon that have a variety of product page layouts, you'll find that writing a single XPath selector isn't always enough, since it will work on certain pages but not others. To deal with the different page layouts, you'll need to write several XPath selectors in cases like these.

When you run into this issue, give the spider three different XPath options:

def parse_product_page(self, response):
        asin = response.meta['asin']
        title = response.xpath('//*[@id="productTitle"]/text()').extract_first()
        image = re.search('"large":"(.*?)"', response.text).groups()[0]
        rating = response.xpath('//*[@id="acrPopover"]/@title').extract_first()
        number_of_reviews = response.xpath('//*[@id="acrCustomerReviewText"]/text()').extract_first()
        bullet_points = response.xpath('//*[@id="feature-bullets"]//li/span/text()').extract()
        seller_rank = response.xpath('//*[text()="Amazon Best Sellers Rank:"]/parent::*//text()[not(parent::style)]').extract()
        price = response.xpath('//*[@id="priceblock_ourprice"]/text()').extract_first()
        if not price:
            price = response.xpath('//*[@data-asin-price]/@data-asin-price').extract_first() or \
                    response.xpath('//*[@id="price_inside_buybox"]/text()').extract_first()


If the spider is unable to find a price using the first XPath selector, it moves on to the next. If we look at the product page again, we can see that there are different sizes and colors of the product.

To get this information, we'll write a quick check to see if this section is on the page, and if it is, we'll use regex selectors to extract it.

temp = response.xpath('//*[@id="twister"]')
sizes = []
colors = []
if temp:
    # note: this uses the json module, so add `import json` at the top of amazon.py
    s = re.search('"variationValues" : ({.*})', response.text).groups()[0]
    json_acceptable = s.replace("'", "\"")
    di = json.loads(json_acceptable)
    sizes = di.get('size_name', [])
    colors = di.get('color_name', [])

When all the pieces are in place, the parse_product_page function will yield a JSON object, which will be sent to the pipelines.py file for data cleaning:

def parse_product_page(self, response):
        asin = response.meta['asin']
        title = response.xpath('//*[@id="productTitle"]/text()').extract_first()
        image = re.search('"large":"(.*?)"', response.text).groups()[0]
        rating = response.xpath('//*[@id="acrPopover"]/@title').extract_first()
        number_of_reviews = response.xpath('//*[@id="acrCustomerReviewText"]/text()').extract_first()
        price = response.xpath('//*[@id="priceblock_ourprice"]/text()').extract_first()
        if not price:
            price = response.xpath('//*[@data-asin-price]/@data-asin-price').extract_first() or \
                    response.xpath('//*[@id="price_inside_buybox"]/text()').extract_first()
        temp = response.xpath('//*[@id="twister"]')
        sizes = []
        colors = []
        if temp:
            s = re.search('"variationValues" : ({.*})', response.text).groups()[0]
            json_acceptable = s.replace("'", "\"")
            di = json.loads(json_acceptable)
            sizes = di.get('size_name', [])
            colors = di.get('color_name', [])
        bullet_points = response.xpath('//*[@id="feature-bullets"]//li/span/text()').extract()
        seller_rank = response.xpath('//*[text()="Amazon Best Sellers Rank:"]/parent::*//text()[not(parent::style)]').extract()
        yield {'asin': asin, 'Title': title, 'MainImage': image, 'Rating': rating, 'NumberOfReviews': number_of_reviews,
               'Price': price, 'AvailableSizes': sizes, 'AvailableColors': colors, 'BulletPoints': bullet_points,
               'SellerRank': seller_rank}

How To Scrape Every Amazon Product on Amazon Product Pages

Our spider can now search Amazon using the keyword we provide and scrape the product data it returns on the page. What if, however, we want our spider to go through every page and scrape the products on each one?

To accomplish this, we just need to add a few lines of code to our parse_keyword_response function:

def parse_keyword_response(self, response):
        products = response.xpath('//*[@data-asin]')
        for product in products:
            asin = product.xpath('@data-asin').extract_first()
            product_url = f"https://www.amazon.com/dp/{asin}"
            yield scrapy.Request(url=product_url, callback=self.parse_product_page, meta={'asin': asin})
        next_page = response.xpath('//li[@class="a-last"]/a/@href').extract_first()
        if next_page:
            url = urljoin("https://www.amazon.com", next_page)  # requires: from urllib.parse import urljoin
            yield scrapy.Request(url=url, callback=self.parse_keyword_response)

After scraping all the products on the first page, the spider checks whether there is a next page button. If there is, it retrieves the URL extension and generates a new URL for the next page. For example:

https://www.amazon.com/s?k=tshirt+for+men&page=2&qid=1594912185&ref=sr_pg_1

It will then use the callback to run the parse_keyword_response function again and extract the ASIN IDs for each product, along with all the product data, as before.

Test Your Spider

Once you've developed your spider, you can test it with the built-in Scrapy CSV exporter:

scrapy crawl amazon -o test.csv

You may find that there are two issues:

  1. The text is messy and some values appear to be in lists.
  2. You're getting 429 responses from Amazon, which means Amazon has detected that your requests are coming from a bot and is blocking the spider.

If Amazon detects a bot, it's likely that Amazon will ban your IP address and you won't be able to scrape Amazon. To solve this issue, you need a large proxy pool and you also have to rotate the proxies and headers for every request. Luckily, Scraper API can help eliminate this hassle.

Connect Your Proxies with Scraper API to Scrape Amazon

Scraper API is a proxy API designed to make web scraping proxies easier to use. Instead of finding and building your own proxy infrastructure to rotate proxies and headers for each request, or detecting bans and bypassing anti-bots, you can simply send the URL you want to scrape to Scraper API. Scraper API will handle all of your proxy needs and ensure that your spider keeps working so that you can effectively scrape Amazon.

Scraper API should be integrated with your spider, and there are three ways to do so:

  1. Via a single API endpoint
  2. Scraper API Python SDK
  3. Scraper API proxy port

If you integrate the API by configuring your spider to send all of its requests to their API endpoint, you just need to build a simple function that sends a GET request to Scraper API with the URL we want to scrape.

First, sign up for Scraper API to get a free API key that lets you scrape 1,000 pages per month. Fill in the API_KEY variable with your API key:

API_KEY = '<YOUR_API_KEY>'

def get_url(url):
    payload = {'api_key': API_KEY, 'url': url}
    proxy_url = 'http://api.scraperapi.com/?' + urlencode(payload)
    return proxy_url

Then, by setting the url parameter in scrapy.Request to get_url(url), we can change our spider functions to use the Scraper API proxy:

def start_requests(self):
        ...
        yield scrapy.Request(url=get_url(url), callback=self.parse_keyword_response)

def parse_keyword_response(self, response):
        ...
        yield scrapy.Request(url=get_url(product_url), callback=self.parse_product_page, meta={'asin': asin})
        ...
        yield scrapy.Request(url=get_url(url), callback=self.parse_keyword_response)

Simply add an extra parameter to the payload to enable geotargeting, JS rendering, residential proxies, and other features. We'll use Scraper API's geotargeting feature to make Amazon think our requests are coming from the US, because Amazon adjusts the price and seller data displayed depending on the country you make the request from. To accomplish this, we should add the flag "&country_code=us" to the request, which can be done by adding another parameter to the payload variable.

Requests geotargeted to the United States would look like the following:

def get_url(url):
    payload = {'api_key': API_KEY, 'url': url, 'country_code': 'us'}
    proxy_url = 'http://api.scraperapi.com/?' + urlencode(payload)
    return proxy_url

Then, based on the concurrency limit of our Scraper API plan, we need to adjust the number of concurrent requests we're allowed to make in the settings.py file. Concurrency is the number of requests you can make in parallel at any given time. The more concurrent requests you can make, the faster you can scrape.

The spider's maximum concurrency is set to 5 concurrent requests by default, as that's the maximum concurrency allowed on Scraper API's free plan. If your plan allows you to scrape with higher concurrency, you'll want to increase the maximum concurrency in settings.py.

Set RETRY_TIMES to 5 to tell Scrapy to retry any failed requests, and make sure DOWNLOAD_DELAY and RANDOMIZE_DOWNLOAD_DELAY aren't enabled, because they reduce concurrency and aren't required with Scraper API.

## settings.py
CONCURRENT_REQUESTS = 5
RETRY_TIMES = 5
# DOWNLOAD_DELAY
# RANDOMIZE_DOWNLOAD_DELAY

Don’t Forget to Clean Up Your Data With Pipelines
As a final step, clean up the data using the pipelines.py file, since the text is messy and some of the values appear as lists.

class TutorialPipeline:
    def process_item(self, item, spider):
        for k, v in item.items():
            if not v:
                item[k] = ''  # replace empty lists or None with an empty string
                continue
            if k == 'Title':
                item[k] = v.strip()
            elif k == 'Rating':
                item[k] = v.replace(' out of 5 stars', '')
            elif k == 'AvailableSizes' or k == 'AvailableColors':
                item[k] = ", ".join(v)
            elif k == 'BulletPoints':
                item[k] = ", ".join([i.strip() for i in v if i.strip()])
            elif k == 'SellerRank':
                item[k] = " ".join([i.strip() for i in v if i.strip()])
        return item

After the spider has yielded a JSON object, the item is passed to the pipeline for cleaning. We need to add the pipeline to the settings.py file to make it work:

## settings.py

ITEM_PIPELINES = {'tutorial.pipelines.TutorialPipeline': 300}

Now you're good to go, and you can use the following command to run the spider and save the results to a CSV file:

scrapy crawl amazon -o test.csv
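
Scrapy's feed exporters also support other formats; to save the results as JSON instead, for example:

scrapy crawl amazon -o test.json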

How to Scrape Other Popular Amazon Pages

You can change the language, response encoding and other aspects of the data returned by Amazon by adding extra parameters to these URLs. Remember to always make sure these URLs are safely encoded. We already went over how to scrape an Amazon product page, but you can also try scraping the search and seller pages by making the following modifications to your script.

Search Page

  • To get the search results, simply enter a keyword into the URL and safely encode it.
  • You can add extra parameters to the search to filter the results by price, brand and other factors (see the sketch below).
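
As an illustration, you can build a safely encoded search URL the same way as before; the price-filter parameter names shown here (low-price and high-price) are an assumption about Amazon's search URL and may change, so verify them against a real search before relying on them:

# Sketch of a safely encoded, filtered search URL; the filter parameter names are assumptions.
from urllib.parse import urlencode

params = {'k': 'tshirt for men', 'low-price': '10', 'high-price': '25'}
search_url = 'https://www.amazon.com/s?' + urlencode(params)
print(search_url)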

Sellers Page

Forget Headless Browsers and Use the Right Amazon Proxy

99.9% of the time you don't need to use a headless browser. You can usually scrape Amazon more quickly, cheaply and reliably by using standard HTTP requests rather than a headless browser. If you go this route, don't enable JS rendering when using the API.

Residential Proxies Aren’t Essential

Scraping Amazon at scale can be done without resorting to residential proxies, as long as you use high-quality datacenter IPs and fully manage the proxy and user agent rotation (a minimal rotation sketch follows).
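
If you manage rotation yourself rather than letting an API handle it, a minimal user-agent rotation sketch for Scrapy could look like the following; the middleware name and user-agent strings are illustrative, and you would still need a matching proxy rotation layer:

## middlewares.py - illustrative user-agent rotation sketch
import random

USER_AGENTS = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',       # placeholder strings
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36',
]

class RandomUserAgentMiddleware:
    def process_request(self, request, spider):
        # assign a random User-Agent header to every outgoing request
        request.headers['User-Agent'] = random.choice(USER_AGENTS)

## settings.py - enable the middleware (the priority number is arbitrary)
# DOWNLOADER_MIDDLEWARES = {'tutorial.middlewares.RandomUserAgentMiddleware': 400}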

Don’t Forget About Geotargeting

Geotargeting is a must if you're scraping a site like Amazon. When scraping Amazon, make sure your requests are geotargeted correctly, or Amazon may return incorrect data.

Previously, you could rely on cookies to geotarget your requests; however, Amazon has improved its detection and blocking of those kinds of requests. As a result, to geotarget a particular country you should use proxies located in that country. To do this with Scraper API, for instance, set country_code=us.

If you want to see results that Amazon would show to a user in the U.S., you'll need a US proxy, and if you want to see results that Amazon would show to a user in Germany, you'll need a German proxy. You should likewise use proxies located in that area if you want to precisely geotarget a particular state, city or postcode.

Scraping Amazon doesn't have to be difficult with this guide, no matter your coding experience, scraping needs or budget. You'll be able to gather complete data and make good use of it thanks to the many scraping tools and tips available.