• Kissaki@feddit.org · 7 days ago (edited)

    Perplexity argues that a platform’s inability to differentiate between helpful AI assistants and harmful bots causes misclassification of legitimate web traffic.

    So I assume Perplexity sends an appropriate, identifiable User-Agent header, so hosts can decide whether and how to serve them?
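
    If they did, the check on the host's side would be trivial. A minimal sketch, with the User-Agent substrings as placeholders (Perplexity does publish its bot names, but verify against their docs):

    ```python
    # Minimal sketch of serving decisions based on the User-Agent header.
    # The substrings below are illustrative; check each vendor's published
    # documentation for the real strings.
    AI_AGENT_MARKERS = ("PerplexityBot", "Perplexity-User", "GPTBot")

    def classify_request(user_agent: str) -> str:
        """Return 'ai-agent' for a self-identified AI bot, else 'browser'."""
        if any(marker in user_agent for marker in AI_AGENT_MARKERS):
            return "ai-agent"
        return "browser"

    print(classify_request("Mozilla/5.0 (compatible; PerplexityBot/1.0)"))   # ai-agent
    print(classify_request("Mozilla/5.0 (X11; Linux x86_64) Firefox/128.0")) # browser
    ```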

    • ubergeek@lemmy.today · 6 days ago

      And I’m assuming that if a site’s robots.txt says their user agent isn’t allowed to crawl, it obeys that, right? :P
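
      If they do, checking is easy enough; Python even ships a robots.txt parser in the standard library. A minimal sketch (the URL and agent name are placeholders):

      ```python
      # Minimal sketch: ask a site's robots.txt whether a given user agent
      # may fetch a URL. The URL and agent name are placeholders.
      from urllib.robotparser import RobotFileParser

      rp = RobotFileParser()
      rp.set_url("https://example.com/robots.txt")
      rp.read()  # download and parse the file

      # True only if robots.txt allows this agent to fetch the path
      print(rp.can_fetch("PerplexityBot", "https://example.com/some/page"))
      ```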

      • Kissaki@feddit.org · 6 days ago

        No, as per the article, their argument is that they are not web crawlers generating an index; they are user-action-triggered agents working live for the user.

        • ubergeek@lemmy.today · 6 days ago

          Except it’s not a live user hitting 10 sites at the same time, trying to crawl the entire site… Live users can’t do that.

          That said, if my robots.txt forbids them from hitting my site, as a proxy, they obey that, right?

    • lime!@feddit.nu · 7 days ago

      yeah it’s almost like there was already a system for this in place

  • Amberskin@europe.pub · 6 days ago

    Uh, are they admitting they are trying to circumvent technological protections set up to restrict access to a system?

    Isn’t that a literal computer crime?

  • ubergeek@lemmy.today · 7 days ago

    Good. I went through my CF panel and blocked some of those “AI Assistants” that were open by default, including Perplexity’s.

    • tempest@lemmy.ca · 6 days ago

      CloudFlare has become an Internet protection racket and I’m not happy about it.

      • Laser@feddit.org · 6 days ago

        It’s been like this from the very beginning. But they don’t fit the definition of a protection racket, as they’re not the ones attacking you if you don’t pay up. So they’re more like a security company with no competitors, due to the investment needed to operate at that scale.

        • A1kmm@lemmy.amxl.com · 5 days ago

          Cloudflare are notorious for shielding cybercrime sites. You can’t even complain to Cloudflare about abuse: they just forward your complaint to the (likely dodgy) host of the cybercrime site. They don’t even have a channel for complaints about network abuse of their DNS services.

          So they certainly are an enabler of the very cybercriminals they purport to protect people from.

          • MithranArkanere@lemmy.world · 5 days ago

            Any internet service provider needs to be completely neutral, not only in their actions but also in their liability.
            The same goes for other services, like payment processors.
            If companies that provide content-agnostic services are allowed to police content, that opens the door to really nasty stuff.

            You can’t chop off everyone’s arms to stop a few people from stealing.

            If they think their services are being used in a reprehensible way, what they should do is alert the authorities, not act like vigilantes.

  • Glitchvid@lemmy.world · 7 days ago

    When a firm outright admits to bypassing, or trying to bypass, measures taken to keep them out, you’d think that would be a slam-dunk case of unauthorized access under the CFAA, with felony enhancements.

    • GamingChairModel@lemmy.world · 7 days ago

      Fuck that. I don’t need prosecutors and the courts ruling that accessing publicly available information in a way the website owner doesn’t want is literally a crime. That logic would extend to ad blockers and to editing HTML/JS in an “inspect element” panel.

      • Encrypt-Keeper@lemmy.world · 7 days ago

        That logic would not extend to ad blockers, as the point of concern is gaining unauthorized access to a computer system or asset. Blocking ads would not be considered gaining unauthorized access to anything. In fact it would be the opposite of that.

  • sylver_dragon@lemmy.world · 7 days ago

    You’d think that a competent technology company, with their own AI, would be able to figure out a way to spoof Cloudflare’s checks. I’d still think that.

    • snooggums@lemmy.world · 7 days ago (edited)

      Or find a more efficient way to manage data, since their current approach basically DDoSes the internet, both for training data and for responding to user interactions.
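
      Even a dumb shared cache with a TTL would cut most of the repeat traffic. A rough sketch of the idea, with invented names and numbers:

      ```python
      # Rough sketch: a shared TTL cache so repeated lookups of the same
      # page don't hit the origin site again. Names and TTL are invented.
      import time
      import urllib.request

      _cache: dict[str, tuple[float, bytes]] = {}
      TTL_SECONDS = 3600  # reuse a cached copy for up to an hour

      def fetch_cached(url: str) -> bytes:
          entry = _cache.get(url)
          if entry and time.time() - entry[0] < TTL_SECONDS:
              return entry[1]  # cache hit: the origin never sees a request
          with urllib.request.urlopen(url) as resp:
              body = resp.read()
          _cache[url] = (time.time(), body)
          return body
      ```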

      • flux@lemmy.ml · 6 days ago

        This is not about training data, though.

        Perplexity argues that Cloudflare is mischaracterizing AI Assistants as web crawlers, saying that they should not be subject to the same restrictions since they are user-initiated assistants.

        Personally I think that claim is a decent one: user-initiated requests should not be subject to robot limitations, and they are not the source of DDoS attacks on web sites.

        I think the solution is quite clear, though: either use the user’s identity to waltz through the blocks, or even use the user’s browser to do it. Once a captcha appears, let the user solve it.

        Though making all of this happen flawlessly is technically quite a big task.

        • snooggums@lemmy.world · 6 days ago

          Personally I think that claim is a decent one: user-initiated requests should not be subject to robot limitations, and they are not the source of DDoS attacks on web sites.

          They are one of the sources!

          The AI scraping that happens when a user enters a prompt is DDoSing sites, on top of the scraping for training data that is DDoSing sites. These shitty companies repeatedly slam the same sites in the least efficient way possible, because they don’t reuse the scraped training data when they process a user prompt that does a web search.

          Scraping once extensively and scraping a bit less but far more frequently have similar impacts.

          • flux@lemmy.ml · 6 days ago

            When a user enters a prompt, the backend may retrieve a handful of pages to serve it. It won’t retrieve all the pages of a site. That’s hardly different from a user running a search and opening the five topmost links in tabs. If that is not a DoS attack, then an agent doing the same isn’t a DDoS attack.
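
            Concretely, the access pattern I mean looks roughly like this; the cap and the delay are invented, but that’s the scale of it:

            ```python
            # Sketch of per-prompt retrieval: a handful of result pages,
            # paced roughly like a person opening a few tabs.
            # The limits and the UA string are invented for illustration.
            import time
            import urllib.request

            MAX_PAGES = 5        # a "handful", not a whole-site crawl
            DELAY_SECONDS = 1.0  # pause between requests

            def retrieve_for_prompt(result_urls: list[str]) -> list[bytes]:
                pages = []
                for url in result_urls[:MAX_PAGES]:
                    req = urllib.request.Request(
                        url,
                        headers={"User-Agent": "ExampleAssistant/0.1 (user-initiated)"},
                    )
                    with urllib.request.urlopen(req) as resp:
                        pages.append(resp.read())
                    time.sleep(DELAY_SECONDS)  # don't hammer the origin
                return pages
            ```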

            Constructing the training material in the first place is a different matter, but if you’re asking about fresh events or new APIs, the training data just doesn’t cut it. The training, and with it the material retrieval, happened a long time ago.

    • The Quuuuuill@slrpnk.net · 7 days ago

      see, but they’re not competent. further, they don’t care. most of these ai companies are snake oil. they’re selling you a solution that doesn’t meaningfully solve a problem. their main way of surviving is saying “this is what it can do now, just imagine what it can do if you invest money in my company.”

      they’re scammers, the lot of them, running ponzi schemes with our money. if the planet dies for it, that’s no concern of theirs. ponzi schemes require the schemer to have no long term plan, just a line of credit that they can keep drawing from until they skip town before the tax collector comes

  • Ekybio@lemmy.world · 7 days ago

    Can someone with more knowledge shed a bit more light on this whole situation? I’m out of the loop on the technical details.

    • snooggums@lemmy.world · 7 days ago

      AI crawlers tend to overwhelm websites by scraping data in the least efficient way possible, basically DDoSing a huge portion of the internet. Perplexity already scraped the net for training data and is now hammering it inefficiently for searches.

      Cloudflare is just trying to keep the bots from overwhelming everything.

    • panda_abyss@lemmy.ca · 7 days ago (edited)

      Cloudflare runs as a CDN/cache/gateway service in front of a ton of websites. Their service is to help protect against DDOS and malicious traffic.

      A few weeks ago cloudflare announced they were going to block AI crawling (good, in my opinion). However they also added a paid service that these AI crawlers can use, so it actually becomes a revenue source for them.

      This is a response to that from Perplexity, who run an AI search company. I don’t actually know how their service works, but they were specifically called out in the announcement, where Cloudflare accused them of “stealth scraping” and of ignoring robots.txt, among other things.
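
      As I understand the pay-per-crawl idea, it leans on plain HTTP signalling: a gated page answers 402 Payment Required until the crawler pays. A sketch of the crawler side under that assumption (the details here are my guess, not Cloudflare’s documented API):

      ```python
      # Sketch, assuming a pay-per-crawl gate that answers HTTP 402 until
      # the crawler has paid. The retry/payment details are assumptions,
      # not taken from any documentation.
      import urllib.request
      from urllib.error import HTTPError

      def fetch_or_back_off(url: str) -> bytes | None:
          try:
              with urllib.request.urlopen(url) as resp:
                  return resp.read()
          except HTTPError as err:
              if err.code == 402:  # Payment Required: origin wants payment
                  print(f"{url}: behind a pay-per-crawl gate, skipping")
                  return None
              raise
      ```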

      • very_well_lost@lemmy.world · 7 days ago (edited)

        A few weeks ago cloudflare announced they were going to block AI crawling (good, in my opinion). However they also added a paid service that these AI crawlers can use, so it actually becomes a revenue source for them.

        I think it’s also worth pointing out that all of the big AI companies are currently burning through cash at an absolutely astonishing rate, and none of them are anywhere close to being profitable. So pay-walling the data they use is probably gonna be pretty painful for their already-tortured bottom line (good).