Miasma: A tool to trap AI web scrapers in an endless poison pit

(github.com)

153 points | by LucidLynx 5 hours ago

29 comments

bobosola 47 minutes ago
I dunno... it feels like the same approach as those people who tell you gleeful stories of how they kept a phone spammer on a call for 45 minutes: "That'll teach 'em, ha ha!" Do these types of techniques really work? I’m not convinced.
Also, inserting hidden or misleading links is specifically a no-no for Google Search [0], who have this to say: We detect policy-violating practices both through automated systems and, as needed, human review that can result in a manual action. Sites that violate our policies may rank lower in results or not appear in results at all.
So you may well end up doing more damage to your own site than to the bots by using dodgy links in this manner.
[0]https://developers.google.com/search/docs/essentials/spam-po...
[-]
- trinsic2 13 minutes ago
  >I dunno... it feels like the same approach as those people who tell you gleeful stories of how they kept a phone spammer on a call for 45 minutes: "That'll teach 'em, ha ha!" Do these types of techniques really work? I’m not convinced
  If you are automating it, I don't see why not. Kitboga, a YouTuber, kept scam callers in AI call-center loops, tying up their resources so they can't use them on unsuspecting victims.[0]
  That's a guerrilla tactic, as in warfare: when you steal resources from an enemy, you get stronger and they get weaker. It's pretty effective.
  [0]: https://www.youtube.com/watch?v=ZDpo_o7dR8c
- xyzal 1 minute ago
  One would assume legit spiders obey robots.txt.
tasuki 2 hours ago
> If you have a public website, they are already stealing your work.
I have a public website, and web scrapers are stealing my work. I just stole this article, and you are stealing my comment. Thieves, thieves, and nothing but thieves!
[-]
- margalabargala 24 minutes ago
  The problem I have, is they hammer my site so hard they take it down.
  The content is for everyone. They can have it. Just don't also take it away from everybody else.
- coldpie 47 minutes ago
  I agree theft isn't a good analogy, but there is something similar going on. I put my words out into the world as a form of sharing. I enjoy reading things others write and share freely, so I write so others might enjoy the things I write. But now the things I write and share freely are being used to put money in the bank accounts of the worst people on the planet. They are using my work in a way I don't want it to be used. It makes me not want to share anymore.
  [-]
  - tasuki 40 minutes ago
    > But now the things I write and share freely are being used to put money in the bank accounts of the worst people on the planet.
    I don't think that's the case. I'm not even arguing they aren't the worst people on the planet - might as well be. But all I see them doing is burning money all over the place.
    [-]
    - FromTheFirstIn 23 minutes ago
      They’re getting the money to burn, though
  - gruez 20 minutes ago
    >but there is something similar going on [...]
    No, what you're basically describing is "I shared something but then I didn't like how it ended up being used". If you put stuff out in public for anyone to use, then find out it's used in a way you don't like, it's your right to stop sharing, but it's not "similar" to stealing beyond "I hate stealing"
    [-]
    - Hendrikto 9 minutes ago
      > If you put stuff out in public for anyone to use, then find out it's used in a way you don't like
      Nope. Copyright is a thing, licenses are a thing. Both are completely ignored by LLM companies, which was already proven in court, and for which they already had to pay billions in fines.
      Just because something is publicly accessible, that does not mean everybody is entitled to abuse it for everything they see fit.
      [-]
      - gruez 6 minutes ago
        >Nope. Copyright is a thing, licenses are a thing. Both are completely ignored by LLM companies, which was already proven in court,
        ...the same courts that ruled that AI training is probably fair use? Fair use trumps whatever restrictions author puts on their "licenses". If you're an author and it turned out that your book was pirated by AI companies then fair enough, but "I put my words out into the world as a form of sharing" strongly implied that's not what was happening, eg. it was a blog on the open internet or something.
- spiderfarmer 1 hour ago
  If someone hands out cookies in the supermarket, are you allowed to grab everything and leave?
  [-]
  - drfloyd51 1 hour ago
    Odd thing about cookies… they disappear after one serving.
    Websites are an endless stream of cookies.
    The analogy doesn’t hold.
    [-]
    - ghywertelling 1 hour ago
      If copying content from one hard drive to another is theft, then so is DNA copying itself.
      Everything is a Remix culture. We should promote remix culture rather than hamper it.
      Everything is a Remix (Original Series) https://youtu.be/nJPERZDfyWc
    - GeoAtreides 10 minutes ago
      how about this analogy: I created a most tasty cookie recipe. I give it out for free, and all copies have my name because I am vain person who likes to be known far and wide as the best baking chef ever. Is it ok to get the recipe, remove my name, and write in LLM-Codex as the creator? again, i'm ok with giving the recipe for free, i just want my name out there.
    - z3c0 1 hour ago
      Digital information may be our first post-scarce resource. It's interesting, and sad, to see so many attempt to fit it within scarcity-based economic models.
      [-]
      - Terretta 45 minutes ago
        > digital information may be our first post-scarce resource
        … browses memory and storage prices on NewEgg …
        Hmm.
        But the word digital is distracting us.
        The word information is the important one. The question isn't where information goes. It's where information comes from.
        Is new information post scarcity?
        Can it ever be?
    - throwaway613746 1 hour ago
      [dead]
  - bengale 58 minutes ago
    It’s interesting to see twists on the old anti-piracy arguments recycled for anti-ai.
    [-]
    - gruez 17 minutes ago
      Turns out many (most?) people on the internet were never anti-copyright in the first place. They were just anti-copyright (or at least, refused to challenge the anti-copyright people) because they wanted free movies and/or hated corporations.
  - falcor84 1 hour ago
    That really depends, but the quick answer is that according to our human social contract, we'd just ask "how many can I take?". Until now, the only real tool to limit scrapers has been throttling, but I don't see any reason for there not to be a similar conversational social contract between machines.
    [-]
    - volemo 1 hour ago
      Isn’t robots.txt such a “social contract between machines”? But AI scrapers couldn’t care less.
  - GaggiX 1 hour ago
    I will copy the supermarket and paste it somewhere else.
    I'm also going to download a car.
  - pbasista 1 hour ago
    This is a dishonest analogy. In your example, there is only a limited number of cookies available, while there is no practical limit on the number of times a given piece of digital media can be viewed.
    You are allowed to take one cookie. But you are allowed to view a public website multiple times if you so want.
    [-]
    - hollow-moe 1 hour ago
      There sure is a limit to the load that the server you're DDoSing can take, and to people's will to post new, worthwhile content in public. The supply is limited, just not at the first degree. Let's make a small edit: are you allowed to take all the cookies and then sell them with a small ribbon with your name on it?
      [-]
      - spiderfarmer 22 minutes ago
        There is no arguing with pirates. They’ll take what’s yours and forget about you while you tend to the ashes.
    - spiderfarmer 25 minutes ago
      Multiple AI scrapers are downloading every page of my 6M page website as we speak. They don’t care about the fact that I have dedicated 20 years to building it, nor that I have to maintain multiple VPSes just to serve it to them.
      If I can poison them and their families, I will.
    - throwaway613746 1 hour ago
      [dead]
effnorwood 0 minutes ago
certainly don't allow anyone to access your content. perhaps shut the site down just to be safe.
aldousd666 1 hour ago
This is ultimately just going to give them training material for how to avoid this crap. They'll have to up their game to get good code. The arms race just took another step, and if you're spending money creating or hosting this kind of content, it's not going to make up for the money you're losing by your other content getting scraped. The bottom has always been threatening to fall out of the ads-paid-for-eyeballs model, and nobody could anticipate the trigger for the downfall. Looks like we found it.
[-]
- johneth 1 hour ago
  > This is ultimately just going to give them training material for how to avoid this crap.
  > The arms race just took another step, and if you're spending money creating or hosting this kind of content, it's not going to make up for the money you're losing by your other content getting scraped.
  So we should all just do nothing and accept the inevitable?
  [-]
  - ninjagoo 17 minutes ago
    > So we should all just do nothing and accept the inevitable?
    I daresay rate-limiting will result in better outcomes than well-poisoning with hidden links that are against the policies of search engines.
    Lots of potential for collateral damage, including your own websites' reputations and search visibility, with the well-poisoning approach.
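For readers wondering what the rate-limiting alternative looks like in practice, here is a minimal token-bucket sketch. It is not taken from Miasma or any project in this thread; the class names, the per-IP keying, and the default rate and capacity are all assumptions chosen for illustration.

```python
import time


class TokenBucket:
    """Allow bursts of up to `capacity` requests, refilled at `rate` tokens/second."""

    def __init__(self, rate, capacity, now=None):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity  # start full so a new client may burst
        self.updated = time.monotonic() if now is None else now

    def allow(self, now=None):
        now = time.monotonic() if now is None else now
        elapsed = max(0.0, now - self.updated)
        # Refill in proportion to elapsed time, never above capacity.
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


class RateLimiter:
    """One bucket per client key (hypothetical choice: the client IP)."""

    def __init__(self, rate=1.0, capacity=5.0):
        self.rate = rate
        self.capacity = capacity
        self.buckets = {}

    def allow(self, client_ip, now=None):
        bucket = self.buckets.get(client_ip)
        if bucket is None:
            bucket = self.buckets[client_ip] = TokenBucket(self.rate, self.capacity, now)
        return bucket.allow(now)
```

A server would call `limiter.allow(ip)` per request and answer with HTTP 429 when it returns `False`. Keying on IP is the weak point: as other commenters note, scrapers that rotate through thousands of addresses will each look like a light user.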
- aldousd666 1 hour ago
  To be clear, I mean AI is going to be the downfall of ad supported content. But let's face it. We have link farms and spam factories as a result of the ad supported content market. I think this is going to eventually do justice for users because it puts a premium on content quality that someone will want to pay a direct licensing fee to scrape for your AI bots as opposed to tricking somebody into clicking on a link and looking at an impression for something they won't buy.
- Apocryphon 1 hour ago
  Tech is just a series of arms races
Art9681 19 minutes ago
Can't we simply parse and remove any style="display: none;", aria-hidden="true", and tabindex="1" attributes before the text is processed and get around this trick? What am I missing?
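A rough sketch of the scraper-side filter the comment above describes, using only the Python standard library. The class and function names are made up for illustration, and it assumes reasonably well-formed HTML with inline hiding; links hidden through a CSS class or an external stylesheet would pass straight through, which is one thing a filter this naive misses.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not be pushed on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


def _is_hidden(attrs):
    """True if the element is hidden via inline style, aria-hidden, or `hidden`."""
    a = dict(attrs)
    style = (a.get("style") or "").replace(" ", "").lower()
    aria = (a.get("aria-hidden") or "").lower()
    return "display:none" in style or aria == "true" or "hidden" in a


class VisibleLinkExtractor(HTMLParser):
    """Collect hrefs from <a> tags that are not inside a hidden element."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._stack = []  # one flag per open element: is it (or an ancestor) hidden?

    def handle_starttag(self, tag, attrs):
        hidden = _is_hidden(attrs) or (bool(self._stack) and self._stack[-1])
        if tag == "a" and not hidden:
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)
        if tag not in VOID_TAGS:
            self._stack.append(hidden)

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self._stack:
            self._stack.pop()


def visible_links(html):
    parser = VisibleLinkExtractor()
    parser.feed(html)
    parser.close()
    return parser.links
```

The filter deliberately ignores `tabindex`, since that attribute alone says nothing about visibility.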
madeofpalk 3 hours ago
Is there any evidence or hints that these actually work?
It seems pretty reasonable that any scraper would already have mitigations for things like this as a function of just being on the internet.
[-]
- raincole 1 hour ago
  It might work against people who just use their Mac Mini with OpenClaw to summarize news every morning, but it certainly won't work against Google.
  More centralized web ftw.
  [-]
  - hexage1814 1 hour ago
    It also probably won't work if the person actually wants your content and is checking whether the thing they scraped actually makes sense or is just noise. Like, none of these are new things. Site owners have been sending junk/fake data to web scrapers since web scraping was invented.
  - otherme123 1 hour ago
    In my experience, Google (among others) plays nice. Just put "disallow: *" in your robots.txt, and they won't bother you again.
    My current problem is OpenAI, which scans massively, ignoring every limit, 426, 444 and whatever you throw at them, and botnets from East Asia, using one IP per scrape, but thousands of IPs.
  - LaGrange 1 hour ago
    > It might work against people who just use their Mac Mini with OpenClaw to summarize news every morning,
    Good enough for me.
    > More centralized web ftw.
    This ain't got anything to do with "centralized web," this kind of epistemological vandalism can't be shunned enough.
- xyzal 3 minutes ago
  About two years ago, I made up reference to a nonexistent python library and put code "using" it in just 5 GitHub repos. Several months later the free ChatGPT picked it up. So IMO it works.
- sd9 3 hours ago
  Even if it did work, I just can't bring myself to care enough. It doesn't feel like anything I could do on my site would make any material difference. I'm tired.
  [-]
  - 20k 3 hours ago
    I definitely get this. The thing that gives me hope is that you only need to poison a very small % of content to damage AI models pretty significantly. It helps combat the mass scraping, because a significant chunk of the data they get will be useless, and it's very difficult to filter it by hand.
    [-]
    - lucasfin000 1 hour ago
      The asymmetry is what makes this very interesting. The cost to inject poison is basically zero for the site owner, but the cost to detect and filter it at scale is significant for the scraper. That math gets a lot worse for them as more sites adopt it. It doesn't solve the problem, but it changes the economics.
- spiderfarmer 1 hour ago
  There are hundreds of bots using residential proxies. That is not free. Make them pay.
- bediger4000 51 minutes ago
  The search engine crawlers are sophisticated enough, but Meta's are not. Neither is Anthropic's Claude crawler. Source: personal experience trying garbage generators on Yandex, Blexbot, Meta's and Anthropic's crawlers.
  I'm completely uncertain that the unsophisticated garbage I generated makes any difference, much less "poisons" the LLMs. A fellow can dream, can't he?
- m00dy 2 hours ago
  it won't work, especially on gemini. Googlebot is very experienced when it comes to crawling. It might work for OpenAI and others maybe.
- nubg 2 hours ago
  What kind of mitigations? How would you detect the poison fountain?
  [-]
  - avereveard 2 hours ago
    style="display: none;" aria-hidden="true" tabindex="1"
    Many scrapers already know not to follow these, as it's how sites used to "cheat" PageRank by serving keyword soups.
    [-]
    - m00dy 2 hours ago
      Google will give your website a penalty for doing this.
  - GaggiX 2 hours ago
    Because the internet is noisy and not up to date, all recent LLMs are trained using Reinforcement Learning with Verifiable Rewards. If a model has learned the wrong signature of a function, for example, it would be apparent when executing the code.
- phoronixrly 2 hours ago
  It does work, on two levels:
  1. Simple, cheap, easy-to-detect bots will scrape the poison, and feed links to expensive-to-run browser-based bots that you can't detect in any other way.
  2. Once you see a browser visit a bullshit link, you insta-ban it, as you can now see that it is a bot because it has been poisoned with the bullshit data.
  My personal preference is using iocaine for this purpose though, in order to protect the entire server as opposed to a single site.
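The second step above (ban whoever requests a poisoned link) can be sketched in a few lines. This is an illustration only, not how Miasma or iocaine work internally; the `/bots` prefix is borrowed from the README excerpt quoted elsewhere in the thread, and the client identifier, class name, and status codes are assumptions.

```python
class TrapBanList:
    """Ban any client that requests a path under the trap prefix."""

    def __init__(self, trap_prefix="/bots"):
        self.trap_prefix = trap_prefix
        self.banned = set()

    def check(self, client_id, path):
        """Return an HTTP status: 403 for banned clients, 200 otherwise."""
        if client_id in self.banned:
            return 403
        # Match the prefix itself or anything below it, but not e.g. "/botswana".
        if path == self.trap_prefix or path.startswith(self.trap_prefix + "/"):
            self.banned.add(client_id)
            return 403
        return 200
```

The trap paths only reach a browser-based bot because a cheaper scraper harvested them first, so a hit on one is strong evidence that the visitor is not a human following visible links. What to use as `client_id` (IP, session, fingerprint) is left open here.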
kristopolous 33 minutes ago
I did a related approach:
A toll charging gateway for llm scrapers: a modification to robots.txt to add price sheets in the comment field like a menu.
This was for a hackathon by forking certbot. Cloudflare has an enterprise version of this but this one would be self hosted
I think it has legs but I think I need to get pushed and goaded otherwise I tend to lose interest ...
It was for the USDC company btw so that's why there's a crypto angle - this might be a valid use case!
I'm open to crypto not all being hustles and scams
Tell me what you think?
https://github.com/kristopolous/tollbot
eliottre 36 minutes ago
The data poisoning angle is interesting. Models trained on scraped web data inherit whatever biases, errors, and manipulation exist in that data. If bad actors can inject corrupted data at scale, it creates a malign incentive structure where model training becomes adversarial. The real solution is probably better data provenance -- models trained on licensed, curated datasets will eventually outcompete those trained on the open web.
theandrewbailey 1 hour ago
Or you can block bots with these (until they start using them) https://developer.mozilla.org/en-US/docs/Glossary/Fetch_meta...
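As a hedged illustration of the Fetch Metadata idea: browsers attach `Sec-Fetch-Site`, `Sec-Fetch-Mode`, and `Sec-Fetch-Dest` to requests, and a top-level page load arrives with mode `navigate` and destination `document`. The function name and the decision to treat missing headers as "not a browser" are assumptions for this sketch; older browsers and plain-HTTP requests omit these headers, so a real deployment would need a fallback.

```python
FETCH_METADATA_HEADERS = ("sec-fetch-site", "sec-fetch-mode", "sec-fetch-dest")


def looks_like_browser_navigation(headers):
    """Heuristic for HTML page routes: does this request carry the Fetch
    Metadata headers a browser sends for a top-level navigation?"""
    # Header names are case-insensitive, so normalise before comparing.
    lowered = {name.lower(): value.strip().lower() for name, value in headers.items()}
    if any(name not in lowered for name in FETCH_METADATA_HEADERS):
        return False
    return lowered["sec-fetch-mode"] == "navigate" and lowered["sec-fetch-dest"] == "document"
```

As the comment says, this only holds until scrapers start sending the same headers, which costs them one line of configuration.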
ninjagoo 53 minutes ago
Isn't this a trope at this point? That AI companies are indiscriminately training on random websites?
Isn't it the case that AI models learn better and are more performant with carefully curated material, so companies do actually filter for quality input?
Isn't it also the case that the use of RLHF and other refinement techniques essentially 'cures' the models of bad input?
Isn't it also, potentially, the case that the ai-scrapers are mostly looking for content based on user queries, rather than as training data?
If the answers to the questions lean a particular way (yes to most), then isn't the solution rate-limiting incoming web-queries rather than (presumed) well-poisoning?
Is this a solution in search of a problem?
jstanley 1 hour ago
If you want to ruin someone's web experience based on what kind of thing they are, rather than the content of their character, consider that you might be the baddies.
[-]
- mrweasel 1 hour ago
  If you're constantly being harassed by someone and despite your best efforts, nothing is being done to help you, quite the opposite in fact, tons of people cheer your assailant on in the name of profit and progress, it's only natural that you lash out.
  It's not all that productive, it's an act of desperation. If you can't stop the enemy, at least you can make their action more costly.
  One positive outcome I could see is AI companies becoming more critical of their training data.
nosmokewhereiam 1 hour ago
My asthmar
I'm assuming this is a reference to Lord of the flies
[-]
- cwnyth 53 minutes ago
  Miasma is bad or poisonous air. It's a Greek word.
ninjagoo 1 hour ago
This is essentially machine-generated spam.
The irony of machine-generated slop to fight machine-generated slop would be funny, if it weren't for the implications. How long before people start sharing ai-spam lists, both pro-ai and anti-ai?
Just like with email, at some point these share-lists will be adopted by the big corporates, and just like with email will make life hard for the small players.
Once a website appears on one of these lists, legitimately or otherwise, what'll be the reputational damage hurting appearance in search indexes? There have already been examples of Google delisting or dropping websites in search results.
Will there be a process to appeal these blacklists? Based on how things work with email, I doubt this will be a meaningful process. It's essentially an arms race, with the little folks getting crushed by juggernauts on all sides.
This project's selective protection of the major players reinforces that effect; from the README:
" Be sure to protect friendly bots and search engines from Miasma in your robots.txt!
User-agent: Googlebot
User-agent: Bingbot
User-agent: DuckDuckBot
User-agent: Slurp
User-agent: SomeOtherNiceBot
Disallow: /bots
Allow: /
"
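To see how those README rules play out, the sketch below feeds them to Python's standard `urllib.robotparser`. The sample paths and the `UnlistedScraper` agent name are hypothetical; the rules themselves are copied from the quote above.

```python
from urllib.robotparser import RobotFileParser

# Rules as quoted from the Miasma README.
ROBOTS_TXT = """\
User-agent: Googlebot
User-agent: Bingbot
User-agent: DuckDuckBot
User-agent: Slurp
User-agent: SomeOtherNiceBot
Disallow: /bots
Allow: /
"""


def build_parser(text):
    """Parse robots.txt text without fetching anything over the network."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# Listed crawlers are steered away from the trap but may read everything else.
googlebot_trap = parser.can_fetch("Googlebot", "/bots/page")
googlebot_site = parser.can_fetch("Googlebot", "/articles/1")

# An agent with no matching group is given no restrictions at all.
unlisted_trap = parser.can_fetch("UnlistedScraper", "/bots/page")
```

Here `googlebot_trap` is `False` while the other two are `True`: only the named crawlers are told about the trap, and everyone else, polite or not, is free to walk into it.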
superkuh 48 minutes ago
Of course Googlebot, Bingbot, Applebot, Amazonbot, YandexBot, etc from the major corps are HTTP useragent spiders that will have their downloaded public content used by corporations for AI training too. Might as well just drop the "AI" and say "corporate scrapers".
meta-level 3 hours ago
Isn't posting projects like this the most visible way to report a bug and have it fixed as soon as possible?
[-]
- suprfsat 3 hours ago
  "disobeys robots.txt" is more of a feature
foxes 1 hour ago
Wonder if you can just avoid hiding it to make it more believable
Why not have a Library of Babel-esque labyrinth visible to normal users on your website,
Like anti surveillance clothing or something they have to sift through
snehesht 3 hours ago
Why not simply blacklist or rate limit those bot IPs?
[-]
- xprnio 2 hours ago
  If you have real traffic and bot traffic, you still need to identify which is which. On top of that, bots very likely don’t reuse the same IPs over and over again. I assume if we knew all the IPs used only by bots ahead of time, then yeah it would be simple to blacklist them. But although it’s simple in theory, the practice of identifying what to blacklist in the first place is the part that isn’t as simple
  [-]
  - snehesht 1 hour ago
    You wouldn’t permanently block them, it’s more like a rolling window.
    You can use security challenges as a mechanism to identify false positives.
    Sure bots can get tons of proxies for cheap, doesn’t mean you can’t block them similar to how SSH Honeypots or Spamhaus SBL work albeit temporarily.
- phyzome 2 hours ago
  Because punishment for breaking the robots.txt rules is a social good.
- arbol 1 hour ago
  The AI companies are using virtually unlimited "clean" residential IPs so this is not a valid strategy.
  [-]
  - DaiPlusPlus 1 hour ago
    How? They run their scraping and training infrastructure - and models themselves - from within those “AI datacenters”[1] we hear about in the news - and not proxying through end-users’ own pipes.
    [1]: in quotes, because I dislike the term, because it’s immaterial whether or not an ugly block of concrete out in the sticks is housing LLM hardware - or good ol’ fashioned colo racks.
    [-]
    - AyyEye 17 minutes ago
      Residential proxy networks.
- aduwah 2 hours ago
  There are way too many to do that
  [-]
  - snehesht 1 hour ago
    True, most of the blacklist systems today aren’t real-time like Amazon WAF or Cloudflare.
    We need a crawler blacklist that can stream list deltas in real time to a centralized list, from which local DBs can pull changes.
    Verified domains can push suspected bot IPs, where this engine would run heuristics to see if there is a pattern across data sources and issue a temporary block with exponential TTL.
    There are many problems to solve here, but as with any OSS it will evolve over time if there is enough interest in it.
    Costs of running this system will be huge, though, and corporate sponsors may not work, but individual sponsors may be incentivized, as it helps them reduce bandwidth and compute costs related to bot traffic.
    [-]
    - pixl97 1 hour ago
      In the real-time spam market, the lists worked well with honest groups for a bit, but started falling apart when once-good lists got taken over by actors who realized they could use their position to make more money. It's a really difficult trap to avoid.
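The "temporary block with exponential TTL" proposed upthread could look roughly like the sketch below. It covers only the local bookkeeping, not the shared list or the heuristics engine; the class name, the doubling schedule, and the default TTL values are assumptions made for illustration.

```python
import time


class TemporaryBlocklist:
    """Block reported IPs for a TTL that doubles with each repeat report."""

    def __init__(self, base_ttl=60.0, max_ttl=86400.0):
        self.base_ttl = base_ttl
        self.max_ttl = max_ttl
        self._strikes = {}        # ip -> number of reports so far
        self._blocked_until = {}  # ip -> timestamp when the block expires

    def report(self, ip, now=None):
        """Record a suspected-bot report and return the TTL applied, in seconds."""
        now = time.monotonic() if now is None else now
        strikes = self._strikes.get(ip, 0) + 1
        self._strikes[ip] = strikes
        # 1st report: base_ttl, 2nd: 2x, 3rd: 4x, ... capped at max_ttl.
        ttl = min(self.max_ttl, self.base_ttl * 2 ** (strikes - 1))
        self._blocked_until[ip] = now + ttl
        return ttl

    def is_blocked(self, ip, now=None):
        now = time.monotonic() if now is None else now
        return now < self._blocked_until.get(ip, float("-inf"))
```

Because every block expires on its own, a residential address that was briefly abused by a proxy network recovers without an appeal process, which speaks to part of the concern about permanent lists raised elsewhere in the thread.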
rob 1 hour ago
"/brainstorming git checkout this miasma repo source code and implement a fix to prevent the scraper from not working on sites that use this tool"
imdsm 2 hours ago
Applied model collapse
Imustaskforhelp 3 hours ago
I wish there were some regulation that could force companies who scrape for profit to reveal who they are to the end websites. Many new AI companies don't seem to respect any decision made by the person who owns the website and shares their knowledge for other humans, only for it to get distilled for a few cents.
rvz 3 hours ago
> > Be sure to protect friendly bots and search engines from Miasma in your robots.txt!
Can't the LLMs just ignore or spoof their user agents anyway?
[-]
- phoronixrly 2 hours ago
  Well-behaved agents will obey robots.txt and not fall into the trap.
maltyxxx 1 hour ago
[dead]
devnotes77 1 hour ago
[dead]
SophieVeldman 2 hours ago
[dead]
firekey_browser 2 hours ago
[dead]
GaggiX 3 hours ago
These projects are the new "To-Do List" app.
obsidianbases1 2 hours ago
Why do this though?
It's like if someone was trying to "trap" search crawlers back in the early 2000s.
Seems counterproductive
[-]
- integralid 1 hour ago
  Search crawlers used to bring people TO your site. LLM bots are used to keep people OUT of your site, because knowledge is indexed and distributed by corporations.
- bilekas 2 hours ago
  Because of bots that don't respect robots.txt.
  If you want an AI bot to crawl your website while you pay for that bandwidth, then you won't use the tool.
- Forgeties79 2 hours ago
  Web crawlers didn’t routinely take down public resources or use the scraped info to generate facsimiles that people are still having ethical debates over. Their presence didn’t even register, and their indexing helped the sites they crawled. It isn’t remotely the same thing.
  https://www.libraryjournal.com/story/ai-bots-swarm-library-c...
splitbrainhack 3 hours ago
-1 for the name
[-]
- QuantumNomad_ 3 hours ago
  https://en.wikipedia.org/wiki/Miasma_theory
  Seems a clever and fitting name to me. A poison pit would probably smell bad. And at the same time, the theory that this tool would actually cause “illness” (bad training data) in AI is not proven.