Perplexity AI Exposed: The Ruthless Web Scraper Killing the Open Internet
Hello everyone, and welcome to another episode of “How Not to Build Trust in the AI Age.” Today, we’re diving into the murky waters of Perplexity AI, a company that seems to have taken the Hippocratic Oath and turned it into “First, do harm to the open web.” Yes, Perplexity is back in the headlines, and not for curing any digital ailments, but for actively bypassing website blocks to scrape content, all while defending its actions with the kind of bravado you’d expect from a snake oil salesman at a medical convention.
The Diagnosis: Perplexity’s Persistent Scraping
Let’s start with the symptoms. In 2024, Perplexity was caught red-handed, or should I say, red-coded, ignoring robots.txt files to scrape content from websites. For those not versed in the arcane arts of web development, robots.txt is a simple text file at the root of a site that tells web crawlers which pages they may not fetch: “Hey, don’t touch this.” It’s the digital equivalent of a “Do Not Disturb” sign on a hospital door, and like that sign, it only works if the visitor chooses to obey it. Most reputable companies, like Apple and Google, respect this. Perplexity, however, seems to think it’s above such pleasantries.
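For the curious, here is roughly what “respecting the sign” looks like from the crawler’s side. This is only a minimal sketch using Python’s standard urllib.robotparser module; the bot name ExampleBot, the example.com URLs, and the rules themselves are made up for illustration and are not taken from any real site or from Perplexity’s crawler.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: one named bot is banned from the whole site,
# everyone else is only kept out of /private/.
ROBOTS_TXT = """\
User-agent: ExampleBot
Disallow: /

User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())  # parse the rules without any network call
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # A polite crawler runs this check before every request and skips the page on False.
    print(is_allowed(ROBOTS_TXT, "ExampleBot", "https://example.com/article"))
    print(is_allowed(ROBOTS_TXT, "OtherBot", "https://example.com/article"))
```

Note that the check runs entirely inside the crawler. Nothing on the website’s side enforces it, which is why the whole arrangement depends on good faith.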
According to a report from Cloudflare, Perplexity has not only continued this practice but has become more sophisticated in its methods. They’re like the Dr. House of web scraping: brilliant, unorthodox, and utterly unconcerned with bedside manner. When their declared bot is turned away, they send in a different bot with a new user agent, a new IP address, and even a new ASN (autonomous system number, which identifies the network the traffic comes from). It’s like a patient showing up in disguise to get a second opinion after being told “no” by the first doctor.
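To see why the disguise works, consider the simplest kind of server-side block: one that trusts the User-Agent header the visitor sends. The sketch below is hypothetical (the ExampleBot token and both header sets are invented, and it is not how Cloudflare’s detection works), but it shows how a filter keyed on a self-reported name is defeated the moment the name changes.

```python
# Hypothetical name of a declared crawler that the site owner wants to block.
BLOCKED_AGENT_TOKENS = ("examplebot",)


def blocked_by_user_agent(headers: dict[str, str]) -> bool:
    """Return True if the self-reported User-Agent matches a blocked token."""
    user_agent = headers.get("User-Agent", "").lower()
    return any(token in user_agent for token in BLOCKED_AGENT_TOKENS)


# The crawler announcing itself honestly.
declared = {"User-Agent": "ExampleBot/1.0 (+https://example.com/bot)"}

# The same crawler claiming to be an ordinary desktop browser.
disguised = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36"
}

if __name__ == "__main__":
    print(blocked_by_user_agent(declared))   # the honest request is stopped
    print(blocked_by_user_agent(disguised))  # the disguised one walks straight in
```

Swap the IP address and the ASN as well, and a block list built on those fails in the same way, which is why catching this behavior takes more than a list of names.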
The Methodology: A Case Study in Deception
Cloudflare’s investigation was thorough. They created new websites that had never been scraped before and asked Perplexity AI about them. When the declared crawling bot was blocked, a new, unlabeled bot appeared and circumvented the restrictions. The results were telling: when the new bots got through, Perplexity provided accurate information about the sites. When they were blocked too, the AI hallucinated or gave less specific answers. The pattern makes it clear that these bots are feeding information directly to Perplexity, regardless of the website’s wishes.
This isn’t just a minor ethical lapse; it’s a full-blown malpractice suit waiting to happen. Perplexity is undermining the trust that forms the backbone of the open web, all in the name of feeding its AI products. And let’s be clear: while there’s no legal backing to robots.txt, ignoring it paints Perplexity as shady and untrustworthy, a diagnosis no company wants on its medical chart.
Old News, New Tricks
This isn’t the first time Perplexity has been caught with its hand in the digital cookie jar. Reports from Wired and 404 Media in June 2024 highlighted similar behavior. The only difference now is that Perplexity seems to be getting better at covering its tracks, using new ASNs and more sophisticated methods to avoid detection. It’s like a patient who keeps switching doctors to get the prescription they want, rather than the treatment they need.
Meanwhile, other companies like Apple, Google, and OpenAI (the maker of ChatGPT) honor robots.txt. Apple, in particular, has been clear that it only uses ethically sourced data for its Apple Intelligence initiative. Perplexity’s actions stand in stark contrast, making it the black sheep of the AI family, a family that’s already under scrutiny for its data practices.
Perplexity’s Defense: A Masterclass in Gaslighting
Now, you might think that being caught red-handed would prompt some soul-searching at Perplexity HQ. Maybe a mea culpa, a promise to do better, perhaps even a donation to the Webmasters’ Benevolent Fund. But no, Perplexity has chosen to double down, publishing a blog post that can only be described as a masterclass in gaslighting.
In their response, Perplexity claims that their web scraper and AI agents are two different entities. They blame Cloudflare for being unable to distinguish between the two, suggesting that Cloudflare’s systems are “fundamentally inadequate for distinguishing between legitimate AI assistants and actual threats.” It’s the digital equivalent of a patient blaming the doctor for not recognizing that their twin was the one with the contagious disease.
Perplexity goes on to argue that if you can’t tell a helpful digital assistant from a malicious scraper, you shouldn’t be making decisions about what constitutes legitimate web traffic. This is, of course, ludicrous. Websites have every right to protect their content, especially when AI companies are siphoning off data without so much as a thank you note.
The Real Issue: The Death of the Open Web
What Perplexity fails to grasp is that their actions are contributing to the slow, painful death of the open web. As AI data scrapers become more prevalent, human web traffic is declining. Websites that rely on ad revenue or subscriptions are seeing their business models eroded by AI agents that provide information without directing users to the source.
Perplexity’s defense, that their agents aren’t storing data for training, misses the point entirely. It’s not just about training data; it’s about respecting the ecosystem that makes the web valuable in the first place. If all the human-run websites go out of business, there will be nothing left for Perplexity to scrape. It’s a self-defeating cycle, like a parasite that kills its host and then wonders why it’s starving.
Apple’s Role: A Beacon of Ethical AI?
In contrast, Apple has taken a more measured approach. When it was discovered that Applebot had been crawling the web for years, Apple clarified that it abided by robots.txt and only used ethically sourced data. While the initial revelation was unfortunate, Apple has shown considerable restraint in a world full of ethically questionable AI companies.

Apple’s approach combines local models, private cloud models running on servers powered by renewable energy, and a promise never to train on user data or prompts. If Apple is to continue acting as a kind of ethical beacon in artificial intelligence, it’s going to need to steer clear of companies like Perplexity.
The Prognosis: A Grim Future for Perplexity
So, what’s the prognosis for Perplexity? If they continue down this path, they’re likely to find themselves increasingly isolated. Their reputation is already in tatters, and any potential partnerships with companies like Apple are probably off the table. More importantly, they’re contributing to the erosion of the very ecosystem they depend on.
Perplexity’s actions are a cautionary tale for the AI industry. Trust is hard to earn and easy to lose. By ignoring the basic rules of the web, Perplexity is burning bridges faster than it can build them. It’s a risky strategy, and one that could ultimately prove fatal.
Conclusion: A Prescription for Perplexity
In summary, Perplexity’s behavior is a textbook case of what not to do in the AI space. Their disregard for robots.txt, their defensive posturing, and their lack of respect for the open web are all symptoms of a deeper malaise. If they want to survive in the long term, they need to change their ways, and fast.
And that, ladies and gentlemen, is entirely my opinion.
Source: Perplexity defensive over ignoring robots.txt and stealing data, https://appleinsider.com/articles/25/08/05/perplexity-defensive-over-ignoring-robotstxt-and-stealing-data