Bypassing Anti-Scraping Tools: What Actually Works
At first, I thought scraping was simple—send a request, parse the HTML, done. Nope. Some sites refused to load, others threw CAPTCHAs, and my IP got temporarily blocked. That’s when I hit r/learnpython and realized: bots get blocked, browsers don’t.
I didn’t expect scraping Web3 news to be this annoying. I just wanted a steady pipeline of fresh crypto articles, processed and embedded into Pinecone for my Retrieval-Augmented Generation (RAG) system. But websites had different plans.
Some pages loaded fine. Others returned 403 Forbidden. Some worked on the first attempt, then blocked me on the second. And then came the dreaded CAPTCHAs.
At first, I assumed I needed some sophisticated solution: machine learning-based bot evasion, or an AI that rewrites request signatures on the fly. It turned out I just needed to act like a real person.
Scrapers Get Caught Because They’re Too Perfect
Real users don’t make 100 requests per second from the same IP.
Real users don’t send requests without cookies.
Real users don’t always use the same browser version every time they visit.
But scrapers do. That’s why they stand out.
Websites don’t block scraping because they hate automation. They block it because predictable, high-volume behavior breaks their data pipelines and clogs their servers. The trick isn’t hacking through barriers—it’s not tripping alarms in the first place.
The Moment I Realized I Was the Problem
I was sitting there, frustrated, wondering why some news sites worked fine while others locked me out. Then I stumbled upon this r/learnpython thread and this blog where someone else was having the same issue.
One comment hit me hard:
“If you look like a bot, you get treated like one.”
That’s when it clicked. I wasn’t being blocked because I was scraping. I was being blocked because I was making it obvious.
So I rewrote everything.
Step 1: Looking Like a Real Browser
By default, Python’s requests module sends a user-agent that screams “I’m a bot”:
python-requests/2.31.0
That alone was getting me flagged. Every browser has a User-Agent string—it tells websites what OS, browser, and version you’re using. Real users have diverse, inconsistent User-Agent values. Bots don’t.
So I grabbed one from my own browser and added it:
import requests

session = requests.Session()
session.headers.update({
    # A full desktop Chrome User-Agent copied from my own browser
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 '
                  '(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
    'Accept-Language': 'en-US,en;q=0.9',
    'Referer': 'https://google.com',
    'Connection': 'keep-alive'
})

response = session.get("https://target-site.com")
Instant improvement. Some sites that previously blocked me started working again.
✅ Result: No instant bans, fewer CAPTCHA triggers.
Step 2: Avoiding the “Too Many Requests” Ban
Even with the right headers, sites still flagged me after a few minutes.
Why? I was sending requests too fast, too predictably, from the same IP. A normal user doesn’t request 100 pages in 10 seconds.
So I slowed it down and added randomness:
import time
import random

def delay_request():
    time.sleep(random.uniform(2, 5))  # Random delay between 2-5 seconds

# `session` is the Session from Step 1; `urls` is the list of article URLs to fetch
for url in urls:
    response = session.get(url)
    delay_request()
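One refinement on top of the random delay: when a site does answer with HTTP 429 (Too Many Requests), it sometimes includes a Retry-After header saying how long to back off. Here is a small sketch of honoring it, reusing the same session and urls as the loop above:

import time
import random

# `session` and `urls` are the same objects used in the loop above
for url in urls:
    response = session.get(url)
    if response.status_code == 429:
        # Back off for however long the server asks, or a minute by default
        retry_after = response.headers.get("Retry-After", "60")
        time.sleep(int(retry_after) if retry_after.isdigit() else 60)
        response = session.get(url)  # one retry after waiting it out
    time.sleep(random.uniform(2, 5))  # same human-ish delay as before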
✅ Result: My scraper no longer got flagged after a few minutes.
Step 3: Rotating IPs to Avoid Rate Limits
Even with throttling, some sites were tracking my IP and cutting me off.
A real user isn’t always using the same IP address, so I set up a simple proxy rotation system:
# 'proxy_ip:port' is a placeholder for a real proxy endpoint
proxies = {
    'http': 'http://proxy_ip:port',
    'https': 'http://proxy_ip:port',  # the https key usually points at the same http:// proxy URL
}

response = requests.get("https://target-site.com", proxies=proxies)
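The snippet above pins one proxy for every request, so it only changes which IP the site sees, not how often it changes. For actual rotation, the simplest version is picking a proxy at random per request. A minimal sketch, with placeholder proxy addresses and a hypothetical helper:

import random
import requests

# Placeholder endpoints; real ones come from whatever proxy provider you use
PROXY_POOL = [
    "http://proxy1_ip:port",
    "http://proxy2_ip:port",
    "http://proxy3_ip:port",
]

def get_with_rotating_proxy(url):
    proxy = random.choice(PROXY_POOL)  # a different exit IP on roughly every call
    return requests.get(url, proxies={"http": proxy, "https": proxy}, timeout=10)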
For small-scale scraping, free proxies worked sometimes but were unreliable.
For high-volume scraping, residential rotating proxies were the only real option.
Lesson learned: If your IP stays the same for too long, you’re asking to get blocked.
Step 4: Handling JavaScript-Rendered Content
Some sites don’t even serve real HTML—they load content dynamically with JavaScript. requests and BeautifulSoup alone can’t see it.
I wasted hours trying to parse an empty page before realizing this.
Instead of fighting the website, I rendered the page the way a browser would, using Playwright to run the JavaScript:
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto("https://target-site.com")
    page.wait_for_selector("div.article-content")  # Wait for the JS-rendered content to load
    content = page.content()  # Full HTML after JavaScript has run
    browser.close()
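Once Playwright has rendered the page, the HTML can go back into the normal parsing flow. A small sketch, assuming BeautifulSoup (bs4) is installed and reusing the same placeholder selector:

from bs4 import BeautifulSoup

# `content` is the fully rendered HTML returned by page.content() above
soup = BeautifulSoup(content, "html.parser")
article = soup.select_one("div.article-content")
if article:
    print(article.get_text(strip=True))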
Why Playwright?
• Selenium was too slow for large-scale scraping.
• Playwright handles headless browsers faster and more efficiently.
• It even mimics user actions, making detection harder.
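On that last point, here is a rough sketch of what mimicking user actions can look like on top of the earlier snippet: uneven scrolling and reading pauses instead of grabbing the page the instant it loads. The URL and selector are the same placeholders as before:

from playwright.sync_api import sync_playwright
import random

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto("https://target-site.com")
    page.wait_for_selector("div.article-content")

    # Scroll down in a few uneven steps, pausing like a reader would
    for _ in range(3):
        page.mouse.wheel(0, random.randint(300, 800))
        page.wait_for_timeout(random.randint(500, 1500))

    content = page.content()
    browser.close()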
Step 5: Avoiding CAPTCHAs Instead of Solving Them
Once CAPTCHAs start appearing, you’ve already lost. They mean:
🚨 The site knows you’re a bot.
🚨 You’re being rate-limited.
I tested 2Captcha and AI-based CAPTCHA solvers—they worked, but were slow and unreliable. The better approach? Avoid triggering CAPTCHAs in the first place.
How I did it:
• Session persistence → Kept cookies from previous visits.
• Reduced request frequency → Fewer requests = less suspicion.
• Switched User-Agents dynamically → Made each request look unique (see the sketch below).
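To make the first and third points concrete, here is a sketch of a persistent session (which keeps cookies between visits) combined with a User-Agent drawn at random for each request. The UA strings and the fetch helper are just examples, not a curated list:

import random
import requests

# Example desktop User-Agent strings; swap in ones copied from real browsers
USER_AGENTS = [
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/119.0.0.0 Safari/537.36",
    "Mozilla/5.0 (X11; Linux x86_64; rv:121.0) Gecko/20100101 Firefox/121.0",
]

session = requests.Session()  # cookies persist across requests automatically

def fetch(url):
    # A different browser signature on each request, same cookie jar throughout
    return session.get(url, headers={"User-Agent": random.choice(USER_AGENTS)})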
After these changes, CAPTCHAs stopped appearing altogether.