• lemmyingly@lemm.ee · 7 months ago

    Whilst it’s true that anyone can scrape data off Reddit, I think it’s more of a pain, since even before the API updates the rate limit was 2 API calls per second. You also have to find or create a scraper. With Lemmy, you follow the instructions (copy and paste) on join-lemmy.org to create your instance and you’re done. With both methods you have to configure it to subscribe to communities, so in that respect they’re about the same.
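That 2-calls-per-second ceiling is just a client-side throttle from the scraper's point of view. A minimal sketch of what honouring it looks like (the class and its structure are my own illustration, not anything from Reddit's API):

```python
import time


class RateLimiter:
    """Allow at most `rate` calls per second by sleeping between calls."""

    def __init__(self, rate: float):
        self.min_interval = 1.0 / rate  # e.g. rate=2 -> 0.5s between calls
        self.last_call = 0.0

    def wait(self):
        # Sleep just long enough that calls are spaced min_interval apart.
        elapsed = time.monotonic() - self.last_call
        if elapsed < self.min_interval:
            time.sleep(self.min_interval - elapsed)
        self.last_call = time.monotonic()


limiter = RateLimiter(rate=2)
for _ in range(3):
    limiter.wait()  # a real scraper would issue one API request here
```

At 2 calls per second, a subreddit with a few million comments takes weeks to pull down, which is the "pain" being described.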

    In the EU at least there is a right to be forgotten, so yeah, Reddit and other platforms are forced to delete the data on request. I’m not sure how the same can be applied to a distributed network like Lemmy.

    There were publicly available archives of Reddit. The last time I checked, you couldn’t find the latest submissions and comments. Maybe things have changed, maybe newer alternatives have appeared.

    • For the right to be forgotten, this only applies to personal information, i.e. information that is associated with you or that could be used to identify you.

      Since you usually sign up with an email, that makes the data fall under personal information. But Reddit could just delete the email address and your user name and show something like:

      [deleted]
      When does the Narwhal bacon?

      And well, it is pretty difficult to find out whether, when, and where there are backups that still contain your information and could be handed to the AI model trainers too. To find that out, we’d need a precedent case that makes a data protection agency investigate Reddit thoroughly.
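The "[deleted]" approach described above amounts to pseudonymization: strip the fields that identify you, keep the content. A rough sketch, where the field names are my assumption rather than Reddit's actual schema:

```python
def pseudonymize(comment: dict) -> dict:
    """Return a copy of a comment with identifying fields removed.

    The content itself stays, which is exactly why this may not satisfy
    the right to be forgotten if the text is identifying on its own.
    """
    scrubbed = dict(comment)          # don't mutate the caller's record
    scrubbed["author"] = "[deleted]"  # replace the user name
    scrubbed.pop("email", None)       # drop the account email entirely
    return scrubbed


record = {
    "author": "narwhal_fan",
    "email": "someone@example.com",
    "body": "When does the Narwhal bacon?",
}
print(pseudonymize(record))
```

Note this does nothing for backups or third-party copies, which is the point made above.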

    • Kichae@lemmy.ca · 7 months ago

      Creating a new instance only gets you access to content that users of your instance have subscribed to, and then mostly only content that comes in after subscription (I believe Lemmy primes the pump a bit on community subs, pulling in a handful of posts at the time of discovery, but discovery is done by users). So, there’s a limit on what you can scrape with your own private instance, and you’re taking a bit of a bet on which communities will yield what you’re looking for in the future.

      It’d be easier and more reliable to just crawl the network and scrape it the old-fashioned way.
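For the crawl-and-scrape route, Lemmy instances expose a public HTTP API, so no federation is needed at all. A sketch of building the paginated requests a crawler would issue (the endpoint path and parameters are my assumption based on Lemmy's v3 API, and the instance name is just an example):

```python
def post_list_url(instance: str, page: int, limit: int = 50) -> str:
    """Build a URL for an instance's public post listing.

    Assumes Lemmy's v3 HTTP API; public posts need no authentication,
    which is what makes plain crawling straightforward.
    """
    return (
        f"https://{instance}/api/v3/post/list"
        f"?sort=New&page={page}&limit={limit}"
    )


# A crawler would walk page 1, 2, 3, ... per instance until a page
# comes back empty, then move on to the next instance in its list.
print(post_list_url("lemmy.ca", page=1))
```

Unlike the private-instance approach, this sees everything the instance hosts, not just communities someone subscribed to after the fact.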