Perplexity argues that a platform’s inability to differentiate between helpful AI assistants and harmful bots causes misclassification of legitimate web traffic.
So, I assume Perplexity sends appropriately identifiable User-Agent headers, to allow hosts to decide whether to serve them one way or another?
And I’m assuming that if robots.txt states their User-Agent isn’t allowed to crawl, they obey it, right? :P
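For reference, the opt-out mechanism being joked about here already exists: a site lists disallowed user-agents in robots.txt, and a well-behaved client checks it before fetching. A minimal sketch using Python's stdlib parser (the "PerplexityBot" token is an assumption, used purely for illustration):

```python
# Sketch: checking robots.txt the way a well-behaved client would.
# "PerplexityBot" is assumed here as the crawler's User-Agent token.
from urllib.robotparser import RobotFileParser

robots_txt = """\
User-agent: PerplexityBot
Disallow: /

User-agent: *
Allow: /
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

# A client identifying itself honestly gets told "no"...
print(rp.can_fetch("PerplexityBot", "https://example.com/article"))  # False
# ...while everyone else is still welcome.
print(rp.can_fetch("Mozilla/5.0", "https://example.com/article"))    # True
```

The whole dispute is about what happens when a client either doesn't identify itself honestly or decides the rules don't apply to it.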
No, as per the article, their argument is that they are not web crawlers generating an index; they are user-action-triggered agents working live for the user.
Except it’s not a live user hitting 10 sites all at the same time, trying to crawl the entire site… Live users can’t do that.
That said, if my robots.txt forbids them from hitting my site, as a proxy, they obey that, right?
yeah it’s almost like there was already a system for this in place
Uh, are they admitting they are trying to circumvent technological protections set up to restrict access to a system?
Isn’t that a literal computer crime?
No-no, see. When an AI-first company does it, it’s actually called courageous innovation. Crimes are for poor people
See: Facebook/Meta
*puts on evil hat* CloudFlare should DRM their protection, then DMCA Perplexity and other US-based “AI” companies into oblivion. Side effect: might break the Internet.
The Internet was already ruined; Cloudflare is just band-aids on top of band-aids.
Can’t believe I’ve lived to see Cloudflare be the good guys
That’s the entire point, dipshit. I wish we got one of the cool techno dystopias rather than this boring corporate idiot one.
This is a nice CloudFlare ad
good, that means it’s working
I’m gonna be frustrated (though not surprised) if the response is anything other than this.
I don’t like cloudflare but it’s nice that they allow people to stop AI scraping if they want to
CloudFlare has become an Internet protection racket and I’m not happy about it.
It’s been this from the very beginning. But they don’t fit the definition of a protection racket, as they’re not the ones attacking you if you don’t pay up. So they’re more like a security company with no competitors, due to the investment needed to operate.
Cloudflare are notorious for shielding cybercrime sites. You can’t even complain to Cloudflare about abuse; they’ll just forward your complaint on to the likely-dodgy host of the cybercrime site. They don’t even have a channel for complaints about network abuse of their DNS services.
So they certainly are an enabler of the cybercriminals they purport to protect people from.
Any internet service provider needs to be completely neutral. Not only in their actions, but also in their liability.
Same goes for other services like payment processors.
If companies that provide content-agnostic services are allowed to police the content, that opens the door to really nasty stuff. You can’t chop off everyone’s arms to stop a few people from stealing.
If they think their services are being used in a reprehensible manner, what they need to do is alert the authorities, not act like vigilantes.
When a firm outright admits to bypassing, or trying to bypass, measures taken to keep them out, you’d think that would be a slam-dunk case of unauthorized access under the CFAA, with felony enhancements.
Right? Isn’t this a textbook DMCA violation, too?
Fuck that. I don’t need prosecutors and the courts to rule that accessing publicly available information in a way the website owner doesn’t want is literally a crime. That logic would extend to ad blockers and to editing HTML/JS in an “inspect element” tab.
That logic would not extend to ad blockers, as the point of concern is gaining unauthorized access to a computer system or asset. Blocking ads would not be considered gaining unauthorized access to anything. In fact it would be the opposite of that.
It’s difficult to be a shittier company than OpenAI, but Perplexity seems to be trying hard.
Step 1, SOMEHOW find a more punchable face than Altman
ask AI how to do it?
Uh… good?
Skill issue. Cope and seethe
Good. I went through my CF panel, and blocked some of those “AI Assistants” that by default were open, including Perplexity’s.
You’d think that a competent technology company, with their own AI, would be able to figure out a way to spoof Cloudflare’s checks. I’d still think that.
Or find a more efficient way to manage data, since their current approach is basically DDOSing the internet for training data and also for responding to user interactions.
This is not about training data, though.
Perplexity argues that Cloudflare is mischaracterizing AI Assistants as web crawlers, saying that they should not be subject to the same restrictions since they are user-initiated assistants.
Personally I think that claim is a decent one: user-initiated requests should not be subject to robot limitations, and they are not the source of DDoS attacks on web sites.
I think the solution is quite clear, though: either use the user’s identity to waltz through the blocks, or even use the user’s browser to do it. Once a captcha appears, let the user solve it.
Though technically making all this happen flawlessly is quite a big task.
> Personally I think that claim is a decent one: user-initiated requests should not be subject to robot limitations, and they are not the source of DDoS attacks on web sites.
They are one of the sources!
The AI scraping when a user enters a prompt is DDOSing sites in addition to the scraping for training data that is DDOSing sites. These shitty companies are repeatedly slamming the same sites over and over again in the least efficient way because they are not using the scraped data from training when they process a user prompt that does a web search.
Scraping once extensively and scraping a bit less but far more frequently have similar impacts.
When a user enters a prompt, the backend may retrieve a handful of pages to serve that prompt. It won’t retrieve all the pages of a site. Hardly different from a user using a search engine and opening the 5 topmost links in tabs. If that is not a DoS attack, then an agent doing the same isn’t a DDoS attack.
Constructing the training material in the first place is a different matter, but if you’re asking about fresh events or new APIs, the training data just doesn’t cut it. The training, and subsequently the material retrieval, was done a long time ago.
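To make the "handful of pages, not a crawl" distinction concrete, here is a hypothetical sketch (not Perplexity's actual code; all names and limits are illustrative) of per-prompt retrieval capped at a few pages total and a couple per host, which is closer to a user opening five tabs than to a site-wide crawl:

```python
# Hypothetical sketch of capped per-prompt retrieval: pick at most
# `limit` pages overall and `per_host` pages per site, so no single
# site gets crawled wholesale. Limits here are illustrative.
from urllib.parse import urlparse

def plan_fetch(candidate_urls, limit=5, per_host=2):
    per_host_count = {}
    picked = []
    for url in candidate_urls:
        host = urlparse(url).netloc
        if per_host_count.get(host, 0) >= per_host:
            continue  # skip: already taking enough from this site
        per_host_count[host] = per_host_count.get(host, 0) + 1
        picked.append(url)
        if len(picked) == limit:
            break
    return picked

results = plan_fetch([
    "https://a.example/1", "https://a.example/2", "https://a.example/3",
    "https://b.example/1", "https://c.example/1", "https://d.example/1",
])
print(results)  # 5 URLs; the third a.example page is skipped
```

Whether real agent traffic actually stays within limits like these is exactly what the two companies disagree about.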
see, but they’re not competent. further, they don’t care. most of these ai companies are snake oil. they’re selling you a solution that doesn’t meaningfully solve a problem. their main way of surviving is saying “this is what it can do now, just imagine what it can do if you invest money in my company.”
they’re scammers, the lot of them, running ponzi schemes with our money. if the planet dies for it, that’s no concern of theirs. ponzi schemes require the schemer to have no long term plan, just a line of credit that they can keep drawing from until they skip town before the tax collector comes
Can someone with more knowledge shine a bit more light on this whole situation? I’m out of the loop on the technical details.
AI crawlers tend to overwhelm websites by doing the least efficient scraping of data possible, basically DDOSing a huge portion of the internet. Perplexity already scraped the net for training data and is now hammering it inefficiently for searches.
Cloudflare is just trying to keep the bots from overwhelming everything.
Cloudflare runs as a CDN/cache/gateway service in front of a ton of websites. Their service is to help protect against DDOS and malicious traffic.
A few weeks ago cloudflare announced they were going to block AI crawling (good, in my opinion). However they also added a paid service that these AI crawlers can use, so it actually becomes a revenue source for them.
This is a response to that from Perplexity who run an AI search company. I don’t actually know how their service works, but they were specifically called out in the announcement and Cloudflare accused them of “stealth scraping” and ignoring robots.txt and other things.
> A few weeks ago cloudflare announced they were going to block AI crawling (good, in my opinion). However they also added a paid service that these AI crawlers can use, so it actually becomes a revenue source for them.
I think it’s also worth pointing out that all of the big AI companies are currently burning through cash at an absolutely astonishing rate, and none of them are anywhere close to being profitable. So pay-walling the data they use is probably gonna be pretty painful for their already-tortured bottom line (good).