• WalnutLum@lemmy.ml
    link
    fedilink
    arrow-up
    3
    ·
    1 month ago

    There are VERY FEW fully open LLMs. Most are the equivalent of source-available in licensing and at best, they’re only partially open source because they provide you with the pretrained model.

    To be fully open source they need to publish both the model and the training data. The importance is being “fully reproducible” in order to make the model trustworthy.

    In that vein there’s at least one project that’s turning out great so far:

    https://www.llm360.ai/

    • FaceDeer@fedia.io
      link
      fedilink
      arrow-up
      2
      ·
      30 days ago

      Fortunately, LLMs don’t really need to be fully open source to get almost all of the benefits of open source. From a safety and security perspective it’s fine because the model weights don’t really do anything; all of the actual work is done by the framework code that’s running them, and if you can trust that due to it being open source you’re 99% of the way there. The LLM model just sits there transforming the input text into the output text.

      From a customization standpoint it’s a little worse, but we’re coming up with a lot of neat tricks for retraining and fine-tuning model weights in powerful ways. The most recent bit development I’ve heard of is abliteration, a technique that lets you isolate a particular “feature” of an LLM and either enhance it or remove it. The first big use of it is to modify various “censored” LLMs to remove their ability to refuse to comply with instructions, so that all those “safe” and “responsible” AIs like Goody-2 can turned into something that’s actually useful. A more fun example is MopeyMule, a LLaMA3 model that has had all of his hope and joy abliterated.

      So I’m willing to accept open-weight models as being “nearly as good” as a full-blown open source model. I’d like to see full-blown open source models develop more, sure, but I’m not terribly concerned about having to rely on an open-weight model to make an AI system work for the immediate term.

      • WalnutLum@lemmy.ml
        link
        fedilink
        arrow-up
        1
        ·
        30 days ago

        I suppose the importance of the openness of the training data depends on your view of what a model is doing.

        If you feel like a model is more like a media file that the model loaders are playing back, where the prompt is more of a type of control over how you access this model then yes I suppose from a trustworthiness aspect there’s not much to the model’s training corpus being open

        I see models more in terms of how any other text encoder or serializer would work, if you were, say, manually encoding text. While there is a very low chance of any “malicious code” being executed, the importance is in the fact that you can check the expectations about how your inputs are being encoded against what the provider is telling you.

        As an example attack vector, much like with something like a malicious replacement technique for anything, if I were to download a pre-trained model from what I thought was a reputable source, but was man-in-the middled and provided with a maliciously trained model, suddenly the system I was relying on that uses that model is compromised in terms of the expected text output. Obviously that exact problem could be fixed with some has checking but I hope you see that in some cases even that wouldn’t be enough. (Such as malicious “official” providence)

        As these models become more prevalent, being able to guarantee integrity will become more and more of an issue.