• Aatube@kbin.social
    7 months ago

    robots.txt is purely textual; you can’t run JavaScript or log anything. Plus, one who doesn’t intend to follow robots.txt wouldn’t query it.

    • BrianTheeBiscuiteer@lemmy.world
      7 months ago

      If it doesn’t get queried that’s the fault of the webscraper. You don’t need JS built into the robots.txt file either. Just add some line like:

      here-there-be-dragons.html
      

      Any client that hits that page (and maybe doesn’t pass a captcha check) gets banned. Or even better, they get a long stream of nonsense.
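A minimal sketch of the trap described above, in Python. It assumes the decoy page is listed in robots.txt under a `Disallow` directive, so a client only reaches it by ignoring the file or by mining it for paths. The `HoneypotGate` class, the in-memory ban set, and the response bodies are hypothetical names chosen for illustration; the captcha check and the "stream of nonsense" variant are left out.

```python
# Path name taken from the comment above; everything else is illustrative.
TRAP_PATH = "/here-there-be-dragons.html"

# Assumed robots.txt content: tell all crawlers to stay away from the trap.
ROBOTS_TXT = f"User-agent: *\nDisallow: {TRAP_PATH}\n"


class HoneypotGate:
    """Decides how to answer a request and bans clients that hit the trap."""

    def __init__(self):
        # In-memory only; a real deployment would need persistent storage.
        self.banned = set()

    def handle(self, client_ip, path):
        """Return an (HTTP status, body) pair for one request."""
        if client_ip in self.banned:
            return 403, "banned"
        if path == "/robots.txt":
            return 200, ROBOTS_TXT
        if path == TRAP_PATH:
            # Well-behaved crawlers never get here, so ban the client.
            self.banned.add(client_ip)
            return 403, "banned"
        return 200, "ok"
```

A well-behaved crawler that fetches `/robots.txt` and skips the disallowed path is never affected; a client that requests the trap page gets a 403 on that request and on every later one.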

    • ShitpostCentral@lemmy.world
      7 months ago

      Your second point is a good one, but you absolutely can log the IP that requested robots.txt. That’s a standard feature of any HTTP server; no JavaScript needed.

      • GenderNeutralBro@lemmy.sdf.org
        7 months ago

        You’d probably have to go out of your way to avoid logging this. I’ve always seen such logs enabled by default when setting up web servers.
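As a rough illustration of the logging point, here is a sketch in Python that pulls the client IPs that requested robots.txt out of access-log lines. It assumes the lines follow the Common Log Format (client address first, request in double quotes); the function name `ips_requesting` and the sample addresses are made up for the example, and a server configured with a different log format would need a different pattern.

```python
import re

# Matches the start of a Common Log Format line:
# client address, two fields, [timestamp], then "METHOD PATH ...".
LOG_LINE = re.compile(r'^(\S+) \S+ \S+ \[[^\]]+\] "(\S+) (\S+)[^"]*"')


def ips_requesting(lines, path="/robots.txt"):
    """Return the set of client addresses that requested `path`."""
    ips = set()
    for line in lines:
        match = LOG_LINE.match(line)
        if match is None:
            continue  # skip lines that are not in the assumed format
        ip, _method, requested = match.groups()
        # Ignore any query string when comparing paths.
        if requested.split("?", 1)[0] == path:
            ips.add(ip)
    return ips


sample = [
    '203.0.113.5 - - [10/Oct/2023:13:55:36 +0000] "GET /robots.txt HTTP/1.1" 200 68',
    '198.51.100.7 - - [10/Oct/2023:13:55:40 +0000] "GET /index.html HTTP/1.1" 200 512',
]
print(ips_requesting(sample))
```

No JavaScript is involved at any point: the server writes the log line on its own, and the script only reads it afterwards.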