• jmcs@discuss.tchncs.de
    1 year ago

    I guess they will get to analyze OpenAI’s dataset during discovery. I bet OpenAI didn’t have authorization to use even 1% of the content they used.

    • maynarkh@feddit.nl
      1 year ago

      That’s why they don’t feel they can operate in the EU, as the EU will require AI companies to publish which datasets they trained their solutions on.

    • Jaded@lemmy.dbzer0.com
      1 year ago

      Things might change, but right now you simply don’t need anyone’s authorization.

      Hopefully it doesn’t change, because only a handful of companies have the data or the funds to buy it. It would kill any kind of open-source or low-priced endeavour.

      • Flaky@iusearchlinux.fyi
        1 year ago

        FWIW, Common Crawl - a free/open-source dataset of crawled internet pages - was used by OpenAI for GPT-2 and GPT-3 as well as EleutherAI’s GPT-NeoX. Maybe for GPT-3.5/ChatGPT as well, but they’ve been hush-hush about that.