Don’t be fooled by that nice-looking Amazon spider: it might be stealing your content!

Hey, look at me! I’m innocent! I’ll just steal your post…

After reading Neil Patel’s post about some content marketing tips, I’ve adjusted my .htaccess file in order to block web scrapers from Amazon Web Services (a platform that can be used for scraping websites).

Update: I now use Sucuri Website Firewall, which blocks malicious traffic very effectively. If you use this product, you will not need tricks like the one described below.

Why Block Web Scrapers From Amazon Web Services?

  • Web scrapers can steal your content and publish it somewhere else, creating duplicate content. This can hurt your site’s rankings in search engines.
  • Web scrapers consume your hosting server’s bandwidth, making your website load more slowly and forcing you to pay for a more expensive hosting plan if you exceed your bandwidth limit.

Here are the Apache directives to include in your website’s .htaccess file:
# Blocking AWS scrapers (Amazon Web Services IPs)
deny from 103.246.148.0/23
deny from 103.246.150.0/23
deny from 103.4.8.0/21
deny from 103.8.172.0/22
deny from 107.20.0.0/14
deny from 107.23.255.0/26
deny from 122.248.192.0/18
deny from 172.96.97.0/24
deny from 174.129.0.0/16
deny from 175.41.128.0/18
deny from 175.41.192.0/18
deny from 176.32.104.0/21
deny from 176.32.112.0/21
deny from 176.32.120.0/22
deny from 176.32.125.0/25
deny from 176.32.64.0/19
deny from 176.32.96.0/21
deny from 176.34.0.0/19
deny from 176.34.128.0/17
deny from 176.34.159.192/26
deny from 176.34.32.0/19
deny from 176.34.64.0/18
deny from 177.71.128.0/17
deny from 177.71.207.128/26
deny from 177.72.240.0/21
deny from 178.236.0.0/20
deny from 184.169.128.0/17
deny from 184.72.0.0/18
deny from 184.72.128.0/17
deny from 184.72.64.0/18
deny from 184.73.0.0/16
deny from 185.143.16.0/24
deny from 185.48.120.0/22
deny from 203.83.220.0/22
deny from 204.236.128.0/18
deny from 204.236.192.0/18
deny from 204.246.160.0/22
deny from 204.246.164.0/22
deny from 204.246.168.0/22
deny from 204.246.174.0/23
deny from 204.246.176.0/20
deny from 205.251.192.0/19
deny from 205.251.192.0/21
deny from 205.251.224.0/22
deny from 205.251.228.0/22
deny from 205.251.232.0/22
deny from 205.251.236.0/22
deny from 205.251.240.0/22
deny from 205.251.244.0/23
deny from 205.251.247.0/24
deny from 205.251.248.0/24
deny from 205.251.249.0/24
deny from 205.251.250.0/23
deny from 205.251.252.0/23
deny from 205.251.254.0/24
deny from 205.251.255.0/24
deny from 207.171.160.0/20
deny from 207.171.176.0/20
deny from 216.137.32.0/19
deny from 216.182.224.0/20
deny from 23.20.0.0/14
deny from 27.0.0.0/22
deny from 43.250.192.0/24
deny from 43.250.193.0/24
deny from 46.137.0.0/17
deny from 46.137.128.0/18
deny from 46.137.192.0/19
deny from 46.137.224.0/19
deny from 46.51.128.0/18
deny from 46.51.192.0/20
deny from 46.51.216.0/21
deny from 46.51.224.0/19
deny from 50.112.0.0/16
deny from 50.16.0.0/15
deny from 50.18.0.0/16
deny from 50.19.0.0/16
deny from 52.0.0.0/15
deny from 52.10.0.0/15
deny from 52.12.0.0/15
deny from 52.15.0.0/16
deny from 52.16.0.0/15
deny from 52.18.0.0/15
deny from 52.192.0.0/15
deny from 52.196.0.0/14
deny from 52.2.0.0/15
deny from 52.20.0.0/14
deny from 52.200.0.0/13
deny from 52.208.0.0/13
deny from 52.216.0.0/15
deny from 52.218.0.0/18
deny from 52.218.128.0/18
deny from 52.218.192.0/18
deny from 52.218.64.0/18
deny from 52.219.0.0/20
deny from 52.219.16.0/22
deny from 52.219.20.0/22
deny from 52.219.24.0/21
deny from 52.219.32.0/21
deny from 52.219.40.0/22
deny from 52.220.0.0/15
deny from 52.222.0.0/17
deny from 52.222.128.0/17
deny from 52.24.0.0/14
deny from 52.28.0.0/16
deny from 52.29.0.0/16
deny from 52.30.0.0/15
deny from 52.32.0.0/14
deny from 52.36.0.0/14
deny from 52.4.0.0/14
deny from 52.40.0.0/14
deny from 52.44.0.0/15
deny from 52.46.0.0/18
deny from 52.48.0.0/14
deny from 52.52.0.0/15
deny from 52.54.0.0/15
deny from 52.57.0.0/16
deny from 52.58.0.0/15
deny from 52.62.0.0/15
deny from 52.64.0.0/17
deny from 52.64.128.0/17
deny from 52.65.0.0/16
deny from 52.66.0.0/16
deny from 52.67.0.0/16
deny from 52.68.0.0/15
deny from 52.70.0.0/15
deny from 52.72.0.0/15
deny from 52.74.0.0/16
deny from 52.76.0.0/17
deny from 52.76.128.0/17
deny from 52.77.0.0/16
deny from 52.78.0.0/16
deny from 52.79.0.0/16
deny from 52.8.0.0/16
deny from 52.80.0.0/16
deny from 52.84.0.0/15
deny from 52.86.0.0/15
deny from 52.88.0.0/15
deny from 52.9.0.0/16
deny from 52.90.0.0/15
deny from 52.92.0.0/20
deny from 52.92.16.0/20
deny from 52.92.248.0/22
deny from 52.92.252.0/22
deny from 52.92.32.0/22
deny from 52.92.39.0/24
deny from 52.92.40.0/21
deny from 52.92.48.0/22
deny from 52.92.52.0/22
deny from 52.92.56.0/22
deny from 52.92.60.0/22
deny from 52.92.64.0/22
deny from 52.92.68.0/22
deny from 52.92.72.0/22
deny from 52.92.76.0/22
deny from 52.92.80.0/22
deny from 52.92.92.0/22
deny from 52.93.0.0/24
deny from 52.93.1.0/24
deny from 52.93.12.0/22
deny from 52.93.16.0/24
deny from 52.93.2.0/24
deny from 52.93.3.0/24
deny from 52.93.4.0/24
deny from 52.93.8.0/22
deny from 52.94.0.0/22
deny from 52.94.10.0/24
deny from 52.94.11.0/24
deny from 52.94.12.0/24
deny from 52.94.13.0/24
deny from 52.94.192.0/22
deny from 52.94.196.0/24
deny from 52.94.197.0/24
deny from 52.94.198.0/28
deny from 52.94.198.16/28
deny from 52.94.198.32/28
deny from 52.94.198.48/28
deny from 52.94.198.64/28
deny from 52.94.198.80/28
deny from 52.94.204.0/23
deny from 52.94.206.0/23
deny from 52.94.208.0/21
deny from 52.94.216.0/21
deny from 52.94.224.0/20
deny from 52.94.252.0/23
deny from 52.94.254.0/23
deny from 52.94.4.0/24
deny from 52.94.5.0/24
deny from 52.94.6.0/24
deny from 52.94.7.0/24
deny from 52.94.8.0/24
deny from 52.94.9.0/24
deny from 52.95.0.0/20
deny from 52.95.100.0/22
deny from 52.95.104.0/22
deny from 52.95.128.0/22
deny from 52.95.132.0/22
deny from 52.95.16.0/21
deny from 52.95.192.0/20
deny from 52.95.212.0/22
deny from 52.95.24.0/22
deny from 52.95.240.0/24
deny from 52.95.241.0/24
deny from 52.95.242.0/24
deny from 52.95.243.0/24
deny from 52.95.244.0/24
deny from 52.95.245.0/24
deny from 52.95.246.0/24
deny from 52.95.247.0/24
deny from 52.95.248.0/24
deny from 52.95.249.0/24
deny from 52.95.251.0/24
deny from 52.95.252.0/24
deny from 52.95.255.0/28
deny from 52.95.255.112/28
deny from 52.95.255.128/28
deny from 52.95.255.144/28
deny from 52.95.255.16/28
deny from 52.95.255.32/28
deny from 52.95.255.48/28
deny from 52.95.255.64/28
deny from 52.95.255.80/28
deny from 52.95.255.96/28
deny from 52.95.28.0/24
deny from 52.95.30.0/23
deny from 52.95.34.0/24
deny from 52.95.35.0/24
deny from 52.95.36.0/22
deny from 52.95.40.0/24
deny from 52.95.48.0/22
deny from 52.95.52.0/22
deny from 52.95.56.0/22
deny from 52.95.60.0/24
deny from 52.95.61.0/24
deny from 52.95.62.0/24
deny from 52.95.63.0/24
deny from 52.95.64.0/20
deny from 52.95.80.0/20
deny from 52.95.96.0/22
deny from 54.144.0.0/14
deny from 54.148.0.0/15
deny from 54.150.0.0/16
deny from 54.151.0.0/17
deny from 54.151.128.0/17
deny from 54.152.0.0/16
deny from 54.153.0.0/17
deny from 54.153.128.0/17
deny from 54.154.0.0/16
deny from 54.155.0.0/16
deny from 54.156.0.0/14
deny from 54.160.0.0/13
deny from 54.168.0.0/16
deny from 54.169.0.0/16
deny from 54.170.0.0/15
deny from 54.172.0.0/15
deny from 54.174.0.0/15
deny from 54.176.0.0/15
deny from 54.178.0.0/16
deny from 54.179.0.0/16
deny from 54.182.0.0/16
deny from 54.183.0.0/16
deny from 54.183.255.128/26
deny from 54.184.0.0/13
deny from 54.192.0.0/16
deny from 54.193.0.0/16
deny from 54.194.0.0/15
deny from 54.196.0.0/15
deny from 54.198.0.0/16
deny from 54.199.0.0/16
deny from 54.200.0.0/15
deny from 54.202.0.0/15
deny from 54.204.0.0/15
deny from 54.206.0.0/16
deny from 54.207.0.0/16
deny from 54.208.0.0/15
deny from 54.210.0.0/15
deny from 54.212.0.0/15
deny from 54.214.0.0/16
deny from 54.215.0.0/16
deny from 54.216.0.0/15
deny from 54.218.0.0/16
deny from 54.219.0.0/16
deny from 54.220.0.0/16
deny from 54.221.0.0/16
deny from 54.222.0.0/19
deny from 54.222.128.0/17
deny from 54.222.58.0/28
deny from 54.223.0.0/16
deny from 54.224.0.0/15
deny from 54.226.0.0/15
deny from 54.228.0.0/16
deny from 54.228.16.0/26
deny from 54.229.0.0/16
deny from 54.230.0.0/16
deny from 54.231.0.0/17
deny from 54.231.128.0/19
deny from 54.231.160.0/19
deny from 54.231.192.0/20
deny from 54.231.224.0/21
deny from 54.231.232.0/21
deny from 54.231.240.0/22
deny from 54.231.244.0/22
deny from 54.231.248.0/22
deny from 54.231.252.0/24
deny from 54.231.253.0/24
deny from 54.231.254.0/24
deny from 54.232.0.0/16
deny from 54.232.40.64/26
deny from 54.233.0.0/18
deny from 54.233.128.0/17
deny from 54.233.64.0/18
deny from 54.234.0.0/15
deny from 54.236.0.0/15
deny from 54.238.0.0/16
deny from 54.239.100.0/23
deny from 54.239.104.0/23
deny from 54.239.108.0/22
deny from 54.239.114.0/24
deny from 54.239.116.0/22
deny from 54.239.120.0/21
deny from 54.239.128.0/18
deny from 54.239.16.0/20
deny from 54.239.192.0/19
deny from 54.239.2.0/23
deny from 54.239.32.0/21
deny from 54.239.4.0/22
deny from 54.239.48.0/22
deny from 54.239.52.0/23
deny from 54.239.54.0/23
deny from 54.239.56.0/21
deny from 54.239.64.0/21
deny from 54.239.8.0/21
deny from 54.239.96.0/24
deny from 54.239.98.0/24
deny from 54.239.99.0/24
deny from 54.240.128.0/18
deny from 54.240.192.0/22
deny from 54.240.196.0/24
deny from 54.240.197.0/24
deny from 54.240.198.0/24
deny from 54.240.199.0/24
deny from 54.240.200.0/24
deny from 54.240.202.0/24
deny from 54.240.203.0/24
deny from 54.240.204.0/22
deny from 54.240.208.0/22
deny from 54.240.212.0/22
deny from 54.240.216.0/22
deny from 54.240.220.0/22
deny from 54.240.225.0/24
deny from 54.240.226.0/24
deny from 54.240.227.0/24
deny from 54.240.228.0/23
deny from 54.240.230.0/23
deny from 54.240.232.0/22
deny from 54.240.236.0/22
deny from 54.240.240.0/24
deny from 54.240.244.0/22
deny from 54.240.248.0/21
deny from 54.241.0.0/16
deny from 54.241.32.64/26
deny from 54.242.0.0/15
deny from 54.243.31.192/26
deny from 54.244.0.0/16
deny from 54.244.52.192/26
deny from 54.245.0.0/16
deny from 54.245.168.0/26
deny from 54.246.0.0/16
deny from 54.247.0.0/16
deny from 54.248.0.0/15
deny from 54.248.220.0/26
deny from 54.250.0.0/16
deny from 54.250.253.192/26
deny from 54.251.0.0/16
deny from 54.251.31.128/26
deny from 54.252.0.0/16
deny from 54.252.254.192/26
deny from 54.252.79.128/26
deny from 54.253.0.0/16
deny from 54.254.0.0/16
deny from 54.255.0.0/16
deny from 54.255.254.192/26
deny from 54.64.0.0/15
deny from 54.66.0.0/16
deny from 54.67.0.0/16
deny from 54.68.0.0/14
deny from 54.72.0.0/15
deny from 54.74.0.0/15
deny from 54.76.0.0/15
deny from 54.78.0.0/16
deny from 54.79.0.0/16
deny from 54.80.0.0/13
deny from 54.88.0.0/14
deny from 54.92.0.0/17
deny from 54.92.128.0/17
deny from 54.93.0.0/16
deny from 54.94.0.0/16
deny from 54.95.0.0/16
deny from 67.202.0.0/18
deny from 72.21.192.0/19
deny from 72.44.32.0/19
deny from 75.101.128.0/17
deny from 79.125.0.0/17
deny from 87.238.80.0/21
deny from 96.127.0.0/17
# End of Blocking AWS scrapers (Amazon Web Services IPs)
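
Before deploying a list like the one above, it can help to check whether a particular address (your own, or one belonging to a service you rely on) would be caught by it. The following is a minimal Python sketch, not part of the original setup; the function name `matching_rules` and the three-line sample are illustrative assumptions.

```python
import ipaddress

def matching_rules(htaccess_text, address):
    """Return the CIDR ranges from 'deny from' lines that contain the address."""
    ip = ipaddress.ip_address(address)
    hits = []
    for line in htaccess_text.splitlines():
        parts = line.split()
        # Only consider lines shaped like: deny from <range>
        if len(parts) != 3 or parts[0].lower() != "deny" or parts[1].lower() != "from":
            continue
        try:
            network = ipaddress.ip_network(parts[2])
        except ValueError:
            # Skip values that are not IP ranges, e.g. "deny from all"
            continue
        if ip in network:
            hits.append(parts[2])
    return hits

# A short excerpt in the same format as the list above
sample = """# Blocking AWS scrapers
deny from 50.16.0.0/15
deny from 54.183.0.0/16
deny from 54.183.255.128/26
"""

print(matching_rules(sample, "54.183.255.130"))
print(matching_rules(sample, "8.8.8.8"))
```

An empty result means none of the listed ranges covers the address.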

The source list of AWS IPs is taken from here (thanks Sid from appsecho.com for helping me update the list).
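
Lists like this often contain overlapping or adjacent ranges. As a rough sketch (an assumption about how one might tidy such a list, not the method used to build this one), Python’s standard `ipaddress` module can collapse those ranges before they are turned into `deny from` lines. The helper name `build_deny_lines` is hypothetical.

```python
import ipaddress

def build_deny_lines(cidrs):
    """Collapse overlapping or adjacent CIDR ranges and format them for .htaccess."""
    networks = [ipaddress.ip_network(c.strip()) for c in cidrs if c.strip()]
    # collapse_addresses merges nested and adjacent networks of the same IP version
    collapsed = ipaddress.collapse_addresses(networks)
    return [f"deny from {network}" for network in collapsed]

ranges = [
    "205.251.192.0/19",
    "205.251.192.0/21",   # already covered by the /19 above
    "54.183.0.0/16",
    "54.183.255.128/26",  # already covered by the /16 above
    "52.93.0.0/24",
    "52.93.1.0/24",       # adjacent to the previous /24, so the two merge into a /23
]

for line in build_deny_lines(ranges):
    print(line)
```

A shorter list matches the same addresses while being easier to review and maintain.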

AWS-hosted machines may also be a source of hacking attempts, which is one more reason to block these ranges.

However, please keep in mind that some harmless services (e.g. the CloudFront CDN) may operate from some of these IP ranges. If you use such services, you will need to remove the corresponding ranges from the deny list.
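
AWS publishes its address ranges as a JSON file (`ip-ranges.json`) in which each entry carries an `ip_prefix` and a `service` tag. Assuming input in that layout, the Python sketch below builds deny lines while sparing every range that overlaps a service you want to keep reachable. The function name `deny_lines_sparing` is hypothetical, and the sample entries are illustrative only, not a statement of which service owns which range today.

```python
import ipaddress
import json

# Illustrative sample in the assumed ip-ranges.json layout (not real, current data)
SAMPLE = """
{"prefixes": [
  {"ip_prefix": "54.230.0.0/16", "region": "GLOBAL", "service": "AMAZON"},
  {"ip_prefix": "54.230.0.0/16", "region": "GLOBAL", "service": "CLOUDFRONT"},
  {"ip_prefix": "54.224.0.0/15", "region": "us-east-1", "service": "EC2"},
  {"ip_prefix": "50.16.0.0/15", "region": "us-east-1", "service": "EC2"}
]}
"""

def deny_lines_sparing(document, spared_services=("CLOUDFRONT",)):
    """Return 'deny from' lines for all prefixes that do not overlap a spared service."""
    prefixes = document["prefixes"]
    spared = [
        ipaddress.ip_network(p["ip_prefix"])
        for p in prefixes
        if p["service"] in spared_services
    ]
    # A set removes prefixes that are listed once per service tag
    candidates = {ipaddress.ip_network(p["ip_prefix"]) for p in prefixes}
    blocked = sorted(
        net for net in candidates
        if not any(net.overlaps(keep) for keep in spared)
    )
    return [f"deny from {net}" for net in blocked]

for line in deny_lines_sparing(json.loads(SAMPLE)):
    print(line)
```

Dropping any range that merely overlaps a spared one errs on the side of blocking too little rather than cutting off a service you depend on.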

If you are not sure what you are doing, or if you want more protection from malicious traffic, look at external website firewall services (I’m using Sucuri WAF (CloudProxy), which is comparatively affordable).

By the way, here are detailed articles on how to protect your WordPress site from hackers: easy step-by-step do-it-yourself instructions and using security plugins.

FYI, here are some technical references about IPs in case you need them:
Blocking multiple ip ranges using mod access in htaccess on StackOverflow.com
To understand how to calculate subnet masks: Quick subnet calculating techniques

Comments

  1. Scraper bots have been driving me crazy and causing massive damage to my news articles. I need RSS on, as I have a Google News approved website that shows in Top Stories every day. I tried everything: Cloudflare rules, Bot Fight Mode, premium security plugins, and all at the same time.

    Nothing was stopping these disrespectful, lazy crooks from using my own content to cause canonical issues and rank above me, demoting my own original content.

    The biggest question: with all the technology Google has, they can’t stop scraped, stolen content from being indexed and ranking? Apparently they can’t. I think they won’t, and you are forced to spend hours of your day filling out the Google copyright form. It can take them over a week to remove the stolen content (by then it’s too late, because you lose all your trending headline traffic and may not get your original rightful ranking position back either).

    Anyway, I found one of the best ways to stop 95% of this rubbish was actually just to use Wordfence.
    Wordfence won’t take care of this by default, so you have to set it up manually.

    Here is how.

    Most of these scrapers are hosted on, and come in via, AWS and Linode hosting. So just jump into Wordfence, select “Blocking”, and then choose “Custom Pattern”.

    From there, make two blocking rules.

    Under the custom pattern, enter these two in the “Hostname” section:

    *.amazonaws.com
    And
    *.linode.com

    This will block a massive number of scrapers, including RSS feed scrapers.

    You can also do this via Cloudflare. With the free account you only get 5 rules. I always lock down my wp-admin and wp-login.php to my personal dedicated IP address for access. You should always have at least those two rules in place, or at least lock down those admin URLs to your country only.

    However, I have the $20 per month premium account, because you really need much more than just 5 rules, plus the other protections the Cloudflare WAF offers, if you have a good website that scammers and crooks want a piece of.

    So what I do is also keep a WAF rule in Cloudflare blocking *.amazonaws.com and *.linode.com.

    This blocks these scrapers both at the edge and locally.

    Note: this is of course no good if you are actually hosting your own website with one of these companies. You will block yourself : ) So don’t follow what I wrote here if you are hosting with any of them.

    Now, there are some manual content crooks out there. They will come and do a simple copy and paste. This can be stopped by using a few different plugins.

    E.g. Hide My WP Ghost, or you can simply search for one of the many other copy & paste blocking plugins. The only problem with these copy-blocking plugins is that every single one of them has a performance impact on your website, due to JavaScript I think.

    So, this is the best way to stop a huge number of scrapers:

    Just use Wordfence to block AWS and Linode hostnames.
    Also block AWS and Linode via a Cloudflare WAF rule.

    No need to add all those IPs to block lists. However, nice list. If you are going to use them, it would be a good idea to check whether they are still allocated to AWS. The list could be outdated by now.

  2. Can you give an update on how to handle AWS? You say a firewall but what settings?

    • Hi Jen,
      if you mean the firewall I mentioned at the end of the article, then it handles malicious traffic automatically. You don’t need any special settings there.
      Moreover, using a firewall like that is the preferred method, since blocking IP ranges is too crude a method that works only in the short run. Malicious traffic sources do not stay on the same IPs for months.
      Meanwhile, you can still block IPs by editing .htaccess as described. However, for a more solid solution a firewall is advised.

  3. In my opinion, you should only allow major search spiders on your site. Why? Major search engines have a legit reason to be on your site; all other spiders serve no real purpose, and in the worst case they may be crawling your content and scooping it up as they go to serve some other lowlife who does not have the ability or money to come up with his/her own content. Your biggest and only friend is Googlebot; no other search engine, legit or scam, will bring in the traffic that Google can provide. There is no legit reason why spiders overseas should be on my site, so they get blocked. There is no reason for amazonaws to be on my site, so they are blocked too. It does not matter if spiders are harmless: if they are not helping me the way Google’s spider does, then those spiders are not useful.

    • Hi Jo-Ann,
      Thanks for your comment.
      What you say does make sense.
      However, in my opinion this is too strict an approach. I don’t mind my site being indexed by alternative search engines (whose names may change over time), as well as by some other services like Ahrefs.
      Using a firewall (e.g. the one I mentioned in the article) seems like an easier solution to me.
      Besides, when you talk about allowing only selected bots to crawl your site, you probably mean robots.txt, which harmful bots may ignore. So restricting access via robots.txt will probably affect only good bots. Again, using a firewall service looks like a better idea.

      • There are two sides to every story; another option is to do nothing. Google has done a good job placing fear into the minds of webmasters concerning this or that. If your content is going to be stolen, then it will be stolen. Even if you stop amazonaws, there are tons of other spiders that can and will slip by and do what they are designed to do. I know webmasters would prefer to keep their content on their own site and not on others; however, this is nothing to lose sleep over, as your content is going to be scraped at some point anyway. Do a Google search on “is duplicate content penalty a myth?” and you will find multiple authority sites explaining that the duplicate content penalty is a myth. However, if you have a new site and a larger site steals your content, then that larger site will benefit, since it has more authority. If the smaller site steals content from the authority site, it will not have any effect on the larger site, since that larger site already has authority and trust in the eyes of Google. Not only that, I have seen authority sites based on nothing but duplicate content, and half the internet has duplicate content.

  4. _removed are providing amazon scraping tools without IP blocked and Banned.Using that tools any one can scrape million of records easily.
    Below is Few Tools we provide

    1.Amazon Scraping and Reprice tools
    2.Amazon competitor products monitor tools
    3.FBA scraping tools
    4.Buybox Scraping tools
    5.Amazon title modifications alert tools
    6.Amazon to Ebay Price comparisons
    7.Amazon to Ebay automatic scraping and listing tools and maintain price and stocks
    8.Aliexpress to Ebay Automatic listing tools and maintain price and stocks
    9.Walmart,Bhphotovideo,best buy and many other website to Ebay listing tools and maintain price and stocks
    10.Ebay scraping tools and Tracking tools
    11.ASIN track tools
    12.Ebay Listing tools
    13.Scrape million of data from any website. etc…..
    based on your needs i can develop or modify this tools
    Contact us for demo

    #1 Web Scraping Software – _removed |‎ Free Developer Support

    • I’m leaving this spam comment up, slightly edited (I’ve removed the URL), as a reminder to everyone that there are lots of bots stealing your bandwidth and data. To protect yourself from such activity, I recommend using a website firewall, for example Sucuri WAF, which I use.

  5. murad abuseta says

    Hi professional researcher,
    you have saved me three times.
    What you did here is not just talk: you research and give us the most important things on each subject. Thank you.

  6. Hi Michael,
    Thanks for this valuable post with the great list of web scraper IPs.

  7. This is a great article and guide. Thanks for sharing.

  8. Wow, thanks for making us aware of this threat. Does adding this code to the .htaccess file affect SEO?

    Please tell.

    • Hi Rajiv,

      it affects SEO in a good way 🙂

      The main SEO advantage of using this code is that it helps protect your content from being stolen by web spiders that use AWS IPs. Otherwise, stolen and thus duplicated content may drop your rankings.

      Thanks for your question!

  9. Hello Michael,
    Thanks for sharing this. This is my first time hearing of this spider, though, and I obviously need to stop it as well.

    So you mean adding #Blocking AWS scrapers will do the magic? Thanks for this and do have a great day!

  10. Hi Michael,

    That’s interesting. So basically it’s people using AWS services to scrape your content and use it, not actually Amazon, right?

    Have you found it made a big difference (In terms of saving bandwidth etc)?

    I have never heard of people using AWS to scrape sites so I’ll have to do some more reading on this.

    Thanks for the post.

    Robert.

    • Hi Robert,
      Thanks for your questions.

      > So basically it’s people using AWS services to scrape your content and use it, not actually Amazon, right?
      Absolutely.

      > Have you found it made a big difference (In terms of saving bandwidth etc)?
      As regards saving bandwidth, only 3 weeks have passed since I implemented AWS blocking (not much time to judge), but anyway here are the results:

      – I’ve compared the 22 days since implementing AWS blocking (i.e. from the 24th of Jul to the 14th of Aug) with the 22 days before the implementation (from the 2nd of Jul to the 23rd of Jul).
      – I’ve got 83% more visitors (with the same pageviews per visitor stats) [according to Google Analytics]
      – I’ve got just 30% increase of bandwidth. [according to AWStats]
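      Thus, roughly, had bandwidth grown in step with visitors, it would have been about 40% higher than it actually was ((1.83-1.3)/1.3 ≈ 41%), which is a saving of about 29% of that projected figure ((1.83-1.3)/1.83 ≈ 29%).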

      However, I believe that websites with different levels of popularity may see different bandwidth savings.

    • Not exactly. Amazon lets users rebind IPs very easily. So Amazon could be aware of this feature and could know that rebinding IP addresses on AWS has a single purpose only: scraping.

      • Hi Adam,
        Thanks for your comment.
        I’d agree with you. I wrote this article almost 3 years ago, so the part that definitely remains relevant now is the recommendation to use a firewall service against threats like this.
        Thanks for your comment!

  11. Hey Michael, thanks for sharing these IPs that we should block in order to host a faster, better-working site focused on the core business. I also read Neil Patel’s blog, but couldn’t find the specific post you mention in this update.

    Anyway thanks for the update, that helped me.
