Crawling a billion web pages in just over 24 hours, in 2025

(andrewkchan.dev)

174 points | by pseudolus 1 day ago

14 comments

bndr 16 hours ago
I run a small startup called SEOJuice, where I need to crawl a lot of pages all the time, and I can say that the biggest issue with crawling is the blocking part and how much you need to invest to circumvent Cloudflare and similar, just to get access to any website. Bandwidth and storage are the smallest cost factors.
Even though, in my case, users add their own domains, it still took me quite a bit of time to reach a 99% chance of successfully crawling a website, using a mix of residential proxies, captcha solvers, rotating user-agents, and stealth Chrome binaries. Otherwise I would get a 403 immediately with no HTML being served.
[-]
- mettamage 11 hours ago
  I wonder if circumvention is legal. It's so odd. In the US it seems you can just do this whereas if I'd start something like this in the EU, I don't think I could.
  [-]
  - fuomag9 10 hours ago
    In Italy it’s a crime punishable by up to 12 years in prison to access any protected computer system without authorization, especially if it causes a DoS to the owner
    Consider the case of selfhosting a web service on a low performance server and the abusive crawling goes on loop fetching data (which was happening when I was self hosting gitlab!)
    https://www.brocardi.it/codice-penale/libro-secondo/titolo-x...
- mrweasel 15 hours ago
  Can't your users just whitelist your IPs?
  [-]
  - dewey 13 hours ago
    I'm in a similar boat and getting customers to whitelist IPs is always a big ask. In the best case they call their "tech guy", in the worst case it's a department far away and it has to go through 3 layers of reviews for someone to adapt some Cloudflare / Akamai rules.
    And then you better make sure your IP is stable and a cloud provider isn't changing any IP assignments in the future, where you'll then have to contact all your clients again with that ask.
  - bndr 15 hours ago
    They're mostly non-technical/marketing people, but yes that would be a solution. I try to solve the issue "behind the scenes" so for them it "just works", but that means building all of these extra measures.
    [-]
    - cassepipe 14 hours ago
      Would it make sense to offer the more technically minded a discount if they set up an IP whitelist, using a tutorial you could provide? A discount in exchange for reduced costs to you?
- peter_d_sherman 4 hours ago
  Very interesting!
  Yes, in this day and age, I could definitely see web pages being harder to crawl by search engines (and SEO companies and other users of automated web crawling technologies (AI agents?)) than they were in the early days of the Internet due to many possible causes -- many of which you've excellently described!
  In other words, there's more to be aware of for anyone writing a search engine (or search-engine-like piece of software -- SEO, AI Agent, etc., etc.) than there was in the early days of the Internet, where everything was straight unencrypted http and most URLs were easily accessible without having to jump through additional hoops...
  Which leads me to wonder... on the one hand, a website owner may not want bots and other automated software agents spidering their site (we have ROBOTS.TXT for this), but on the flip side, most business owners DO want publicity and easy accessibility for sales and marketing purposes, thus, they'd never want to issue a 403 (or other error code) for any public-facing product webpage...
  Thus there may be a market for testing public facing business/product websites against faulty "I can't give you that web page for whatever reason" error codes from a wide variety of clients, from a wide variety of locations around the world.
  That market is related to the market for testing if a website is up and functioning properly (the "uptime market"), again, from a wide variety of locations around the world, using a wide variety of browsers...
  So, a very interesting post!
  Also (for future historians!) compare all of the restrictive factors which may prevent access to a public-facing web page today vs. Tim Berners-Lee's original vision for the web, which was basically to let scientists (and other academic types!) SHARE their data PUBLICLY with one another!
  (Things have changed... a bit! :-) )
  Anyway, a very interesting post, and a very interesting article -- for both present and future Search Engine programmers!
- 0xdeadbeefbabe 13 hours ago
  Blocking seems really popular. I wonder if it coincides with stack overflow closing.
- gilrain 15 hours ago
  > the biggest issue with crawling is the blocking part and how much you need to invest to circumvent Cloudflare and similar … mix of residential proxies, captcha solvers, rotating user-agents, stealth chrome binaries
  I would like to register my hatred and contempt for what you do. I sincerely hope you suffer drastic consequences for your antisocial behavior.
  [-]
  - bndr 15 hours ago
    Please elaborate, why exactly is it antisocial? Because Cloudflare decides who can or can't access a user's website? Even when they specifically signed up for my service?
    [-]
    - demetris 8 hours ago
      But how does that work?
      Does Cloudflare force firewall rules for those who choose to use it for their websites?
      If the tool that does the crawling identifies itself properly, does Cloudflare block it even if users do not tell Cloudflare to block it?
    - gilrain 15 hours ago
      It intentionally circumvents the explicit desires of those who own the websites being exploited. It is nonconsensual. It says “fuck you, yes” to a clearly-communicated “please no”.
      [-]
      - joncrane 14 hours ago
        OP literally said that users add their domains, meaning they are explicitly ASKING OP to scrape their websites.
      - bndr 15 hours ago
        Users sign up for my service.
        [-]
        gilrain 15 hours ago
        You employ residential proxies. As such, you enable and exploit the ongoing destruction of the Internet commons. Enjoy the money!
        [-]
        christoff12 14 hours ago
        This is kind of like getting upset with people who go to ATMs because drug dealers transact in cash lol.
        toomuchtodo 13 hours ago
        Cloudflare and Big Tech are primary contributors to the impairment and decline of the Internet commons for moats, control, and profit; you are upset at the wrong parties.
        [-]
        conception 4 hours ago
        Why not both?
  - prettyblocks 10 hours ago
    I would argue that the ability to crawl and scrape is core to the original ethos of the internet and all the hoops people jump through to block non-abusive scraping of content is in fact more anti-social than circumventing these mechanisms.
- spiderfarmer 14 hours ago
  Just stop scraping. I'll do everything to block you.
  [-]
  - ssgodderidge 14 hours ago
    > in my case, users add their own domains
    Seems like they're only scraping websites their clients specifically ask them to
  - Keyframe 14 hours ago
    Now you've gamified it :)
    [-]
    - shimman 14 hours ago
      It's a pretty easy game to win as the blocker. If you receive too many 404s against pages that don't exist, just ban the IP for a month. Actually got the idea from a hackernews comment too. Also thinking that if you crawl too many pages you should get banned as well.
      There's no point in playing tug of war against unethical actors, just ban them and be done with it.
      I don't think behaving this way is an uncommon stance, nor are crawlers users I want to help in any capacity.
      [-]
      - Klonoar 7 hours ago
        If you think the game is played on a single IP address, you are not adept enough to be weighing in on this discussion.
      - stevewodil 12 hours ago
        What if the crawler is using a shared IP and you end up blocking legitimate users along with the bad actor?
        [-]
        Keyframe 12 hours ago
        He said "it's pretty easy", probably not realizing there are whole industries on both sides of that cat and mouse game, making it not easy.
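For readers who want to see the "too many 404s, ban the IP" heuristic from this subthread concretely, here is a minimal sketch. The class name, thresholds, and one-month ban length are assumptions for illustration, not anything a commenter posted. It keys on a single IP address, so it has exactly the weaknesses raised in the replies: rotating proxies evade it and shared IPs can catch legitimate users.

```python
import time
from collections import defaultdict, deque


class NotFoundBanList:
    """Hypothetical helper: ban an IP that triggers too many 404s in a sliding window."""

    def __init__(self, max_404s=20, window_s=60.0, ban_s=30 * 24 * 3600):
        self.max_404s = max_404s      # 404s tolerated inside one window (assumed value)
        self.window_s = window_s      # sliding window length in seconds (assumed value)
        self.ban_s = ban_s            # ban duration, roughly one month
        self._hits = defaultdict(deque)   # ip -> timestamps of recent 404s
        self._banned_until = {}           # ip -> time at which the ban expires

    def is_banned(self, ip, now=None):
        now = time.time() if now is None else now
        until = self._banned_until.get(ip)
        if until is None:
            return False
        if now >= until:
            # Ban has expired; forget it.
            del self._banned_until[ip]
            return False
        return True

    def record_404(self, ip, now=None):
        """Record one 404 for `ip`; return True if this pushed the IP into a ban."""
        now = time.time() if now is None else now
        hits = self._hits[ip]
        hits.append(now)
        # Drop 404s that have fallen out of the sliding window.
        while hits and now - hits[0] > self.window_s:
            hits.popleft()
        if len(hits) > self.max_404s:
            self._banned_until[ip] = now + self.ban_s
            del self._hits[ip]
            return True
        return False
```

A request handler would call `is_banned(ip)` before serving and `record_404(ip)` whenever it returns a 404.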
umairnadeem123 45 minutes ago
the real cost isn't compute or bandwidth - it's the URL frontier management. at a billion pages you need a deduplication layer that can handle hundreds of thousands of URL lookups per second while also tracking crawl-delay per domain, last-visit timestamps, and priority scoring. that's essentially a specialized database problem.
bloom filters work for basic dedup but they can't handle recrawl scheduling. we ended up using a combination of rocksdb for the frontier state and an in-memory hash ring for domain-level rate limiting. the frontier alone consumed more engineering time than the actual HTTP client.
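To illustrate why a Bloom filter covers dedup but not recrawl scheduling, here is a minimal sketch. It is not the commenter's implementation: the `BloomFilter` and `Frontier` classes, their sizes, and the plain dict standing in for a persistent store such as RocksDB are all assumptions. The filter can only answer "probably seen / definitely not seen", so last-visit timestamps have to live in a separate keyed store.

```python
import hashlib


class BloomFilter:
    """Small pure-Python Bloom filter (illustrative sizes, not tuned for 1B URLs)."""

    def __init__(self, num_bits=1 << 20, num_hashes=5):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, item):
        # Double hashing: derive k bit positions from two 64-bit halves of one digest.
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))


class Frontier:
    """Hypothetical frontier: Bloom filter for dedup, keyed store for scheduling."""

    def __init__(self):
        self.seen = BloomFilter()
        self.last_visit = {}  # url -> timestamp; stand-in for a persistent KV store

    def should_enqueue(self, url):
        # False positives are possible: a few new URLs may be wrongly skipped.
        if url in self.seen:
            return False
        self.seen.add(url)
        return True

    def mark_visited(self, url, now):
        self.last_visit[url] = now

    def due_for_recrawl(self, now, max_age_s):
        # Needs per-URL timestamps, which the Bloom filter cannot provide.
        return sorted(u for u, t in self.last_visit.items() if now - t >= max_age_s)
```

The linear scan in `due_for_recrawl` is only for clarity; at the scale discussed it would need an index ordered by next-visit time.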
throwaway77385 17 hours ago
> spinning disks have been replaced by NVMe solid state drives with near-RAM I/O bandwidth
Am I missing something here? Even Optane is an order of magnitude slower than RAM.
Yes, under ideal conditions, SSDs can have very fast linear reads, but IOPS / latency have barely improved in recent years. And that's what really makes a difference.
Of course, compared to spinning disks, they are much faster, but the comparison to RAM seems wrong.
In fact, for applications like AI, even using system RAM is often considered too slow, simply because of the distance to the GPU, so VRAM needs to be used. That's how latency-sensitive some applications have become.
[-]
- fluoridation 15 hours ago
  >for applications like AI, even using system RAM is often considered too slow, simply because of the distance to the GPU
  That's not why. It's because RAM has a narrower bus than VRAM. If it was a matter of distance it'd just have greater latency, but that would still give you tons of bandwidth to play with.
  [-]
  - dist-epoch 14 hours ago
    You could be charitable and say the bus is narrow because it has to travel a long distance and this makes it hard to have a lot of traces.
    [-]
    - fluoridation 14 hours ago
      It's not. It's narrow even between the CPU and RAM. That's just the way x86 is designed. Nvidia and AMD by contrast have the luxury of being able to rearchitect their single-board computers each generation as long as they honor the PCIe interface.
      It is also true that having a 384-bit memory bus shared with the video card would necessitate a redesigned PCIe slot as well as an outrageous number of traces on the motherboard, though.
      [-]
      - adrian_b 9 hours ago
        Traditionally, the width of the GPU memory interfaces was many times greater than that of CPUs.
        However, the maximum width in consumer GPUs, of up to 1024-bit, was reached many years ago.
        Since then the width of the memory interfaces in consumer GPUs has been decreasing continuously, and this decrease has been only partially compensated by higher memory clock frequencies. This reduction has been driven by NVIDIA, in order to increase their profit margins by reducing the memory cost.
        Nowadays, most GPU owners must be content with a memory interface no better than 192-bit, like in RTX 5070, which is only 50% wider than for a desktop CPU and much narrower than for a workstation or server CPU.
        The reason why using the main memory in GPUs is slow has nothing to do with the width of the CPU memory interface, but it is caused by the fact that the GPU accesses the main memory through PCIe, so it is limited by the throughput of at most 16 PCIe lanes, which is much lower than that of either the GPU memory interface or the CPU memory interface.
      - dist-epoch 14 hours ago
        ThreadRipper has 8 memory channels versus 2 for a desktop AMD CPU. It's not an x86 limitation.
        [-]
        fluoridation 13 hours ago
        "x86" as in the computer architecture, not the ISA. Why do you think they put extra channels instead of just having a single 512-bit bus?
        [-]
        adrian_b 10 hours ago
        The memory interface of CPUs is made wider by adding more channels because there are no memory modules with a 512-bit interface. Thus you must add multiples of the module width to the CPU memory interface.
        This has nothing to do with x86, but it is determined by the JEDEC standards for DRAM packages and DRAM modules. The ARM server CPUs use the same number of memory channels, because they must use the same memory modules.
        A standard DDR5 memory module has a memory interface that is 64, 72, or 80 bits wide, depending on how many extra bits are available for ECC. The module's interface is partitioned into 2 channels, to allow concurrent accesses at different memory addresses. Although current memory channels are therefore 32, 36, or 40 bits wide, few people are aware of this, so by "memory channel" most people mean 64 bits (or 72 bits with ECC), because that was the width of the memory channel in older memory generations.
        Not counting ECC bits, most desktop and laptop CPUs have a 128-bit memory interface, some cheaper server and workstation CPUs have a 256-bit memory interface, many server CPUs and some workstation CPUs have a 512-bit memory interface, while the state-of-the-art server CPUs have a 768-bit memory interface.
        For comparison, RTX 5070 has a 192-bit memory interface, RTX 5080 has a 256-bit memory interface and RTX 5090 has a 512-bit memory interface. However, the GDDR7 memory has a transfer rate that is 4 to 5 times higher than DDR5, which makes the GPU interfaces faster, despite their similar or even lower widths.
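The width-times-transfer-rate arithmetic behind this subthread can be checked in a few lines. The bus widths come from the comments above; the transfer rates are assumptions not stated in the thread (DDR5-5600 at 5.6 GT/s, GDDR7 at 28 GT/s per pin, PCIe 5.0 at 32 GT/s per lane with 128b/130b encoding), and real parts vary by model.

```python
def peak_gb_per_s(bus_width_bits, giga_transfers_per_s):
    """Theoretical peak bandwidth: bytes moved per transfer times transfers per second."""
    return bus_width_bits / 8 * giga_transfers_per_s


# Assumed transfer rates, see the note above; widths are taken from the thread.
ddr5_desktop = peak_gb_per_s(128, 5.6)    # about 89.6 GB/s
rtx_5070 = peak_gb_per_s(192, 28.0)       # about 672 GB/s
rtx_5090 = peak_gb_per_s(512, 28.0)       # about 1792 GB/s

# PCIe 5.0 x16, one direction: 16 lanes, 32 GT/s each, 128b/130b line encoding.
pcie5_x16 = 16 * 32.0 * (128 / 130) / 8   # about 63 GB/s

for name, gbps in [("DDR5-5600, 128-bit", ddr5_desktop),
                   ("RTX 5070, 192-bit GDDR7", rtx_5070),
                   ("RTX 5090, 512-bit GDDR7", rtx_5090),
                   ("PCIe 5.0 x16", pcie5_x16)]:
    print(f"{name:26s} {gbps:8.1f} GB/s")
```

Under these assumptions the PCIe link is the narrowest of the four, which is consistent with the point that a GPU reading main memory is limited by PCIe rather than by the CPU memory interface.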
finnlab 22 hours ago
Nice work, but I feel like it's not required to use AWS for this. There are small hosting companies with specialized servers (50gbit shared medium for under 10$), you could probably do this under 100$ with some optimization.
[-]
- nurettin 17 hours ago
  I did some crawling on Hetzner back in the day. They monitor traffic and make sure you don't automate publicly available data retrieval. They send you an email telling you that they are concerned because you got the IP blacklisted. Funny thing is: they own the blacklist that they refer to.
  [-]
  - jeroenhd 9 hours ago
    If Hetzner actually puts their own customers on their blacklist then that list becomes more trustworthy.
    They were right to blacklist you, they were right to complain to you, and they were right not to assume malice and kick you off their platform/shut down your server.
    [-]
    - nurettin 5 hours ago
      Yes, I wasn't banned or anything, they aren't barbarians. Also, explain your opinion; don't just put it out there. This is not a football match.
  - qingcharles 10 hours ago
    This. I tried to run a very slow DHT scraper I was writing on a Hetzner server and within minutes they were on my ass. I don't want to make an enemy of them so I killed it immediately, but they are clearly very sensitive to anything outside of "normal".
- varispeed 18 hours ago
  This. AWS is like a cash furnace, only really usable for VC backed efforts with more money than sense.
lovelearning 1 hour ago
As an experiment, it's interesting.
If anyone actually needs such a dataset, look into CommonCrawl first. I feel using something that already exists will be more cooperative and considerate than everyone overloading every website with their spider. https://commoncrawl.org/overview
dangoodmanUT 16 hours ago
> because redis began to hit 120 ops/sec and I’d read that any more would cause issues
Suspicious. I don’t think I’ve ever read anything that says redis taps out below tens of thousands of ops…
thefounder 17 hours ago
Well, the most important part seems to be glossed over, and that’s the IP addresses. Many websites simply block (or want to block) anything that’s not Google and is not a “real user”.
snowhale 13 hours ago
The anti-bot stuff mentioned upthread is real, but at this scale per-domain politeness queuing also becomes a genuine headache. You end up needing to track crawl-delay directives per domain, rate-limit your outbound queues by host, and handle DNS TTL properly to avoid hammering a CDN edge that's mapping thousands of domains to the same IPs. Most crawlers that work fine at 100M pages break somewhere in that machinery at 1B+.
[-]
- overfeed 9 hours ago
  > this scale per-domain politeness queuing also becomes a genuine headache
  Not really a headache - if you've ever implemented resource-based, server-side rate limiting (per-endpoint, with client-ID and/or IP buckets), that's all the logic that's required, adapted for the client side. One could wrap rate-limiting libraries designed for server-side usage and call it a day.
  I hate how people who are bad at parallelizing their user-agents across the internet are causing needless pain and giving scrapers a bad name. They are also causing blowback on the more well-behaved scrapers.
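A client-side version of per-host rate limiting, as discussed in this subthread, might look like the sketch below. The `PolitenessQueue` class and its default one-second delay are assumptions, and it deliberately leaves out fetching robots.txt and the DNS and shared-CDN concerns raised in the parent comment. Per-host delays would be filled in from each site's `Crawl-delay` directive where one exists.

```python
import heapq
from collections import defaultdict, deque
from urllib.parse import urlsplit


class PolitenessQueue:
    """Hypothetical per-host queue: never fetch the same host twice within its delay."""

    def __init__(self, default_delay_s=1.0):
        self.default_delay_s = default_delay_s
        self.crawl_delay = {}                # host -> delay override in seconds
        self._pending = defaultdict(deque)   # host -> URLs waiting to be fetched
        self._heap = []                      # (earliest allowed fetch time, host)
        self._scheduled = set()              # hosts that currently have a heap entry
        self._next_ok = {}                   # host -> earliest time of its next fetch

    def push(self, url, now):
        host = urlsplit(url).hostname or ""
        self._pending[host].append(url)
        if host not in self._scheduled:
            # Respect the delay left over from this host's previous fetch, if any.
            ready = max(now, self._next_ok.get(host, now))
            heapq.heappush(self._heap, (ready, host))
            self._scheduled.add(host)

    def pop(self, now):
        """Return a URL whose host may be fetched at `now`, or None if none is ready."""
        if not self._heap or self._heap[0][0] > now:
            return None
        _, host = heapq.heappop(self._heap)
        url = self._pending[host].popleft()
        delay = self.crawl_delay.get(host, self.default_delay_s)
        self._next_ok[host] = now + delay
        if self._pending[host]:
            heapq.heappush(self._heap, (now + delay, host))
        else:
            self._scheduled.discard(host)
            del self._pending[host]
        return url
```

The heap keeps `pop` at logarithmic cost in the number of hosts with pending work, so one slow host never blocks the rest of the queue.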
ph4rsikal 17 hours ago
When I read this, I realize how small Google makes the Internet.
sunpolice 16 hours ago
I was able to get 35k req/sec on a single node with Rust (custom HTTP stack, custom HTML parser, custom queue, custom KV database) and obsessive optimization. It's possible to scrape a Bing-sized index (say 100B docs) each month with only 10 nodes, for under $15k.
Thought about making it public but probably no one would use it.
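As a rough sanity check on the figures in this comment and in the article title, the required request rates can be worked out directly. This assumes exactly 24 hours for the billion-page crawl, a 30-day month, and a sustained rate with no retries or duplicate fetches; only the 35k req/sec, 100B docs, and 10 nodes come from the comment.

```python
SECONDS_PER_DAY = 86_400


def required_rate(pages, seconds):
    """Average pages per second needed to fetch `pages` within `seconds`."""
    return pages / seconds


# Article title: a billion pages in (approximately) one day.
billion_in_a_day = required_rate(1e9, SECONDS_PER_DAY)       # about 11,574 pages/s

# Comment: 100B docs per 30-day month, spread across 10 nodes.
monthly_100b = required_rate(100e9, 30 * SECONDS_PER_DAY)    # about 38,580 pages/s
per_node = monthly_100b / 10                                 # about 3,858 pages/s

# Headroom if each node really sustains the claimed 35k req/sec.
headroom = 35_000 / per_node                                 # about 9.1x

print(f"1B pages/day:        {billion_in_a_day:,.0f} pages/s")
print(f"100B docs/month:     {monthly_100b:,.0f} pages/s total")
print(f"per node (10 nodes): {per_node:,.0f} pages/s")
print(f"headroom at 35k/s:   {headroom:.1f}x")
```

On these assumptions the claim is arithmetically plausible with roughly ninefold headroom, though the calculation says nothing about bandwidth, blocking, or politeness limits.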
[-]
- charlesdenault 16 hours ago
  please do
  [-]
  - mamsouuu 15 hours ago
    Yes! Please do!
handfuloflight 17 hours ago
There was a time when being able to do this meant you were on the path to becoming a (m)(b)illionaire. Still is, I think.
corv 12 hours ago
Python is obviously too slow for web-scale
gethly 13 hours ago
> I also truncated page content to 250KB before passing it to the parser.
WTF did I just read?
[-]
- tengada1 12 hours ago
  It's just HTML, presumably not requesting JS libraries. So 250K is a large amount.
  [-]
  - gethly 10 hours ago
    Exactly - how can an HTML page need to be trimmed to 250 KB??? That is insane. Something is not right with this article.
    [-]
    - iggldiggl 7 hours ago
      A transcript for a half-hour radio comedy show with some formatting takes up about 60 kB. The English Wikipedia page for Monty Python is about 130 kB in pure UTF-8 text, and the actual HTML page takes up around 660 kB (plus/minus, depending on which Wikipedia theme exactly you use).
      So large, text-heavy pages don't seem too unlikely to exceed 250 kB, especially if they also include some amount of formatting that's more substantial than just a minimal bunch of <p> tags.
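For context, truncating a response body before parsing can be as simple as the sketch below. This is not the article's code: the function name is hypothetical, and treating "250KB" as 250 * 1024 bytes is an assumption. The one subtlety is that a byte-level cut can split a multi-byte UTF-8 character, so the decode step has to tolerate that.

```python
MAX_BYTES = 250 * 1024  # assumes "250KB" means 250 * 1024 bytes


def truncate_for_parser(body: bytes, limit: int = MAX_BYTES) -> str:
    """Clip a response body to `limit` bytes and decode it for an HTML parser."""
    clipped = body[:limit]
    # errors="ignore" silently drops undecodable bytes, including a multi-byte
    # UTF-8 sequence that was cut in half at the truncation boundary.
    return clipped.decode("utf-8", errors="ignore")
```

For a page like the roughly 660 kB Wikipedia example above, this would keep a bit more than the first third of the markup, so links and text near the end of the page would be lost to the parser.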