Inside The Internet Archive's Infrastructure

(hackernoon.com)

306 points | by dvrp 1 day ago

18 comments

initialg 3 minutes ago
Is it still year 2006 and websites haven’t figured out responsive design?
hedora 10 hours ago
It's frustrating that there's no way for people to (selectively) mirror the Internet Archive. $25-30M per year is a lot for a non-profit, but it's nothing for government agencies, or private corporations building Gen AI models.
I suspect having a few different teams competing (for funding) to provide mirrors would rapidly reduce the hardware cost too.
The density + power dissipation numbers quoted are extremely poor compared to enterprise storage. Hardware costs for the enterprise systems are also well below AWS (even assuming a short 5 year depreciation cycle on the enterprise boxes). Neither this article nor the vendors publish enough pricing information to do a thorough total cost of ownership analysis, but I can imagine someone the size of IA would not be paying normal margins to their vendors.
[-]
- nodja 9 hours ago
  It's insane to me that in 2008 a bunch of pervs decentralized storage and made hentai@home to host hentai comics. Yet here we are almost 20 years later and we haven't generalized this solution. Yes I'm aware of the privacy issues h@h has (as a hoster you're exposing your real IP and people reading comics are exposing their IP to you) but those can be solved with tunnels, the real value is the redundant storage.
- philipkglass 10 hours ago
  I would like to be able to pull content out of the Wayback Machine with a proper API [1]. I'd even be willing to pay a combination of per-request and per-gigabyte fees to do it. But then I think about the Archive's special status as a non-profit library, and I'm not sure that offering paid API access (even just to cover costs) is compatible with the organization as it exists.
  [1] It looks like this might exist at some level, e.g. https://github.com/hartator/wayback-machine-downloader, but I've been trying to use this for a couple of weeks and every day I try I get a HTTP 5xx error or "connection refused."
  [-]
  - 986aignan 5 hours ago
    I wish there were some kind of file search for the Wayback Machine. Like "list all .S3M files on members.aol.com before 1998". It would've made looking for obscure nostalgia much easier.
  - toomuchtodo 10 hours ago
    https://github.com/internetarchive/wayback/tree/master/wayba...
    https://akamhy.github.io/waybackpy/
    https://wiki.archiveteam.org/index.php/Restoring
    [-]
    - philipkglass 10 hours ago
      Yes, there are documents and third party projects indicating that it has a free public API, but I haven't been able to get it to work. I presume that a paid API would have better availability and the possibility of support.
      I just tried waybackpy and I'm getting errors with it too when I try to reproduce their basic demo operation:
      >>> from waybackpy import WaybackMachineSaveAPI >>> url = "https://nuclearweaponarchive.org" >>> user_agent = "Mozilla/5.0 (Windows NT 5.1; rv:40.0) Gecko/20100101 Firefox/40.0" >>> save_api = WaybackMachineSaveAPI(url, user_agent) >>> save_api.save() Traceback (most recent call last): File "<python-input-4>", line 1, in <module> save_api.save() ~~~~~~~~~~~~~^^ File "/Users/xxx/nuclearweapons-archive/venv/lib/python3.13/site-packages/waybackpy/save_api.py", line 210, in save self.get_save_request_headers() ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^ File "/Users/xxx/nuclearweapons-archive/venv/lib/python3.13/site-packages/waybackpy/save_api.py", line 99, in get_save_request_headers raise TooManyRequestsError( ...<4 lines>... ) waybackpy.exceptions.TooManyRequestsError: Can not save 'https://nuclearweaponarchive.org'. Save request refused by the server. Save Page Now limits saving 15 URLs per minutes. Try waiting for 5 minutes and then try again.
      [-]
      - toomuchtodo 10 hours ago
        Reach out to patron services, support @ archive dot org. Also, your API limits will be higher if you specify your API key from your IA user versus anonymous requests when making requests.
- qingcharles 9 hours ago
  The fact AI companies are stripping mining IA for content and not helping to be part of the solution is egregious.
  [-]
  - sailfast 50 minutes ago
    Might be easier for them to just pay for the mirrors and do an on-site copy and move the data in a container?
    That way they would provide some more value back to the community as a mirror?
  - Gormo 4 hours ago
    How is it "egregious" that people are obtaining content to use for their own purposes from a resource intentionally established as a repository of content for people to obtain and use for their own purposes?
    [-]
    - Intralexical 2 hours ago
      Because nobody who opens a public library does so intending, nor consenting, for random companies to jam the entrance trying to cart off thousands of books solely to use for their own enrichment.
      https://xkcd.com/1499/
    - decremental 4 hours ago
      [dead]
  - astrange 7 hours ago
    Has any evidence been provided for this fact?
    [-]
    - textfiles 6 hours ago
      They absolutely are.
      [-]
      - razakel 5 hours ago
        Of course they are. Had to block anything at work coming from one certain company because it wasn't respecting robots.txt and the bill was just getting silly.
    - stonogo 6 hours ago
      Yes.
- quux 8 hours ago
  Is running an IPFS node and pinning the internet archive's collections a good way to do this?
- toomuchtodo 10 hours ago
  Pick the items you want to mirror and seed them via their torrent file.
  https://help.archive.org/help/archive-bittorrents/
  https://github.com/jjjake/internetarchive
  https://archive.org/services/docs/api/internetarchive/cli.ht...
  u/stavros wrote a design doc for a system (codename "Elephant") that would scale this up: https://news.ycombinator.com/item?id=45559219
  (no affiliation, I am just a rando; if you are a library, museum, or similar institution, ask IA to drop some racks at your colo for replication, and as always, don't forget to donate to IA when able to and be kind to their infrastructure)
  [-]
  - billyhoffman 8 hours ago
    There are real problems with the Torrent files for collections. They are automatically created when a collection is first created and uploaded, and so they only include the files of the initial upload. For very large collections (100+ GB) it is common for a creator to add/upload files into a collection in batches, but the torrent file is never regenerated, so download with the torrent results in just a small subset of the entire collection.
    https://www.reddit.com/r/torrents/comments/vc0v08/question_a...
    The solution is to use one of the several IA downloader script on GitHub, which download content via the collection's file list. I don't like directly downloading since I know that is most cost to IA, but torrents really are an option for some collections.
    Turns out, there are a lot of 500BG-2TB collections for ROMs/ISOs for video game consoles through the 7th and 8th generation, available on the IA...
    [-]
    - Wowfunhappy 8 hours ago
      Is this something the Internet Archive could fix? I would have expected the torrent to get replaced when an upload is changed, maybe with some kind of 24 hour debounce.
      [-]
      - rincebrain 7 hours ago
        "They're working on it." [1]
        It sounds like they put this mechanism into place that stops regenerating large torrents incrementally when it caused massive slowdowns for them, and haven't finished building something to automatically fix it, but will go fix individual ones on demand for now.
        [1] - https://www.reddit.com/r/theinternetarchive/comments/1ij8go9...
        [-]
        textfiles 6 hours ago
        It is on my desk to fix this soon.
- hinkley 8 hours ago
  I'd like a Public Broadcasting Service for the Internet but I'm afraid that money would just be pulled from actual PBS at this point to support it.
  [-]
  - xp84 6 hours ago
    Too late, PBS is already defunded. CPB was deleted. PBS is now an indie organization without a dime of public money. They should probably rebrand and lose the word “Public”
- skywhopper 5 hours ago
  Don’t put any stock into the numbers in the article. They are mostly made up out of thin air.
- Gormo 4 hours ago
  > $25-30M per year is a lot for a non-profit
  $25 million a year is not remotely a lot for a non-profit doing any kind of work at scale. Wikimedia's budget is about seven times that. My local Goodwill chapter has an annual budget greater than that.
  [-]
  - esseph 1 hour ago
    You have an extremely skewed view of the average nonprofit
  - Medium_Taco 3 hours ago
    You're being purposefully obtuse. Most non-profits don't function at scale (neither do they do best at scale). They serve their local community
BryantD 12 hours ago
They have come a very long way since the late 1990s when I was working there as a sysadmin and the data center was a couple of racks plus a tape robot in a back room of the Presidio office with an alarmingly slanted floor. The tape robot vendor had to come out and recalibrate the tape drives more often than I might have wanted.
[-]
- textfiles 11 hours ago
  There is a fundamental resistance to tape technology that exists to this day as a result of all those troubles.
  [-]
  - EvanAnderson 7 hours ago
    That's sad, but it mirrors my experience with commercial customers. Tape is so fiddly but the cost efficiency for large amounts of data and at-rest stability is so good. Tape is caught in a spiral of decreasing market share so industry has no incentive to optimize it.
    Edit: Then again, I recently heard a podcast that talked about the relatively good at-rest stability of SATA hard disk drives stored outdoors. >smile<
    [-]
    - duskwuff 6 hours ago
      Tape is also an extraordinarily poor option for a service like Internet Archive which intends to provide interactive, on-demand access to its holdings.
      [-]
      - BryantD 1 hour ago
        Back in the day, if you loaded a page from the web archive that wasn’t in cache, it’d tell you to come back in a couple of minutes. If it was in cache, it was reasonably speedy.
        Cache in this case was the hard drives. If I recall correctly, we were using SAM-FS, which worked fairly well for the purpose even though it was slow as dirt —- we could effectively mount the tape drive on Solaris servers, and access the file system transparently.
        Things have gotten better. I’m not sure if there were better affordable options in the late 1990s, though. I went from Alexa/IA to AltaVista, which solved the problem of storing web crawl data by being owned by DEC and installing dozens of refrigerator sized Alpha servers. Not an option open to Alexa/IA.
      - EvanAnderson 5 hours ago
        I presume backing-up the archive is a desirable thing. That's a place where I would see tape fitting well for them.
        [-]
        duskwuff 5 hours ago
        Perhaps? But unless tape, and the infrastructure to support it, is dramatically cheaper than disk, they might still be better served by more disk - having two or more copies of data on disk means that both of them can service load, whereas a tape backup is only passively useful as a backup.
        [-]
        stonogo 11 minutes ago
        unless tape, and the infrastructure to support it, is dramatically cheaper than disk,
        This turns out to be the case, with the cost difference growing as the archive size scales. Once you hit petascale, it's not even close. However, most large-scale tape deployments also have disk involved, so it's usually not one or the other.
      - stonogo 5 hours ago
        This is a common use for tape, which can via tools like HPSS have a couple petabytes of disk in front of it, and present the whole archive in a single POSIX filesystem namespace, handling data migration transparently and making sure hot data is kept on low-latency storage.
        [-]
        BryantD 1 hour ago
        Yeah, it was like this (except not petabytes).
- hinkley 8 hours ago
  We had a little server room where the AC was mounted directly over the rack. I don't think we ever put an umbrella in there but it sure made everyone nervous the drain pipe would clog.
  Much more recently, I worked at a medium-large SaaS company but if you listened to my coworkers you'd think we were Google (there is a point where optimism starts being delusion, and a couple of my coworkers were past it.)
  Then one day I found the telemetry pages for Wikipedia. I am hoping some of those charts were per hour not per second, otherwise they are dealing with mind numbing amounts of traffic.
mcpar-land 9 hours ago
Is this some kind of copypasted AI output? There are unformatted footnote numbers at the end of many sentences.
[-]
- NetOpWibby 9 hours ago
  I was thinking the same thing. No proofreading is a sure sign to me. I also feel like I've read parts of this before.
- sltkr 7 hours ago
  Some of the images are AI generated (see the Gemini watermark in the bottom right), and the final paragraph also reads extremely AI-generated.
rarisma 8 hours ago
I think this was writen wholly by deep research.
It just reads like a clunky low quality article
[-]
- astrange 7 hours ago
  It's clearly AI writing ("hum", "delve") but oddly I don't think deep research models use those words.
  [-]
  - joemi 6 hours ago
    I think relying on the vocabulary to indicate AI is pointless (unless they're actually using words that AI made up). There's a reason they use words such as those you've pointed out: because they're words, and their training material (a.k.a. output by humans) use them.
    [-]
    - astrange 5 hours ago
      No American used "delve" before ChatGPT 3.5, and nobody outside fanfiction uses the metaphors it does (which are always about "secrets" "quiet" "humming" "whispers" etc). It's really very noticeable.
      https://www.nytimes.com/2025/12/03/magazine/chatbot-writing-...
      [-]
      - pests 40 minutes ago
        But now Americans do use "delve" since 3.5. So what? No Americans used "cromulent" as a word either until Simpsons invented it. Is it not a real word? Does using it mean the Simpsons wrote it?
jarboot 12 minutes ago
Hate to be the guy in the comments complaining about the css, but the sides of the text of this article are cut off. It looks like I'm zoomed in, and there's no way I can see the first few columns of the text without going to Reader view. I'm on a modern iPhone using safari, accessibility settings font larger than usual.
semiquaver 6 hours ago
This article is way too LLMey for my taste.
alfgrimur 2 hours ago
I love to imagine this is all a cover and the Internet Archive is located in a remote cave in northern Sweden and consists of a series of endlessly self replicating flash drives powered by the sun.
segalord 13 minutes ago
this is every data hoarders dream setup haha
bpiche 7 hours ago
IA is hosting a couple more of Rick Prelinger’s shows this month. Looking forward to visiting
ghm2199 7 hours ago
Does any one know how the size of this compares to archive.today?
[-]
- textfiles 6 hours ago
  We absolutely lap them with many, many more petabytes of material. But archive.today is also not doing speculative or multiple scheduled captures of the amount of sites that archive.org is.
vladiim 7 hours ago
How long will it take for them to send the PetaBox to space?
[-]
- textfiles 6 hours ago
  That project gets discussed every once in a while.
brcmthrowaway 11 hours ago
Does IA do deduplication?
[-]
- textfiles 11 hours ago
  Not in the way I think you're talking about. The archive has always tried to maintain a situation where the racks could be pushed out of the door or picked up after being somewhere and the individual drives will contain complete versions of the items. We have definitely reached out to people who seem to be doing redundant work and ask them to stop or for permission to remove the redundant item. But that's a pretty curatorial process.
- HumanOstrich 11 hours ago
  [flagged]
  [-]
  - zxcvasd 9 hours ago
    heres the second paragraph in full:
    "Here, amidst the repurposed neoclassical columns and wooden pews of a building constructed to worship a different kind of permanence, lies the physical manifestation of the "virtual" world. We tend to think of the internet as an ethereal cloud, a place without geography or mass. But in this building, the internet has weight. It has heat. It requires electricity, maintenance, and a constant battle against the second law of thermodynamics. As of late 2025, this machine—collectively known as the Wayback Machine—has archived over one trillion web pages.1 It holds 99 petabytes of unique data, a number that expands to over 212 petabytes when accounting for backups and redundancy.3"
    can you help my small brain by pointing out where in this paragraph they talk about deduplication?
  - sltkr 10 hours ago
    I don't think the article mentions anything about deduplication. Can you be less snarky and actually quote the relevant sentence?
schmuckonwheels 10 hours ago
Disappointed with the lack of pictures.
[-]
- parttimelarry 9 hours ago
  Probably because this looks more like a Deep Research agent "delving" into the infrastructure -- with a giant list of sources at the end. The Archive is not just a library; it is a service provider.
  [-]
  - schmuckonwheels 7 hours ago
    I wasn't expecting to read a podcast when clicking.
- textfiles 6 hours ago
  What do you want some pictures of?
  [-]
  - schmuckonwheels 4 hours ago
    An article about "infrastructure" that opens up with a dramatic description of a datacenter stuffed into an old church, I would expect more than just generic clipart you'd see in the back half of Wired magazine.
    [-]
    - textfiles 3 hours ago
      Here's some photos I took a long time ago.
      https://www.flickr.com/photos/textfiles/albums/7215763372220...
      [-]
      - Tempest1981 46 minutes ago
        Thanks! The church attendees (employees?) have a Severence Kier vibe... although I'm guessing the TV show came much later.
brcmthrowaway 11 hours ago
[flagged]
krunck 9 hours ago
[flagged]
[-]
- mjmas 8 hours ago
  Was this reply meant for this story instead? https://news.ycombinator.com/item?id=46637127
lysace 9 hours ago
The IA needs perhaps not just more money, but also more talented people, IMO. I worry that it has stagnated, from a tech pov.
[-]
- mixologic 6 hours ago
  They can offer a perk that literally no other tech job can offer: Someday have a statue of your likeness preserved in ceramic: https://www.atlasobscura.com/places/internet-archive-headqua...
  "Inside the church's main room, with its still-intact pews, there are more than 120 ceramic sculptures of the Internet Archive's current and former employees, created by artist Nuala Creed and inspired by the statues of the Xian warriors in China."
- textfiles 6 hours ago
  We've hired a few dozen people over the past couple of years. We think they're pretty talented.
  [-]
  - lysace 5 hours ago
    Is retreival from the wayback machine intentionally made slow?
    [-]
    - textfiles 4 hours ago
      Show me the faster wayback machine we are competing against.
      [-]
      - brokensegue 1 hour ago
        i'm a big fan of IA and wayback machine. i donate. but i do wish it were faster. i understand that would cost a lot more though.
        i wonder if maybe donors above a certain level could get priority on archiving pages or something.
cowhax 10 hours ago
>And the rising popularity of generative AI adds yet another unpredictable dimension to the future survival of the public domain archive.
I'd say the nonprofit has found itself a profitable reason for its existence