Depending on how much compute you have available, you can look into finetuning models from HuggingFace (e.g. Llama 3, or a smaller Phi model). Look into LoRA, and try to learn how the model you choose calculates the loss.
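To make the loss part concrete, here is a minimal sketch of the next-token cross-entropy that decoder-only models like Llama are typically trained with. It's plain Python so every step is visible. The function name `next_token_loss` and the toy numbers are made up for illustration, and the model you pick may add its own details (padding, label smoothing, etc.), so treat it as a starting point, not the exact implementation.

```python
import math

def next_token_loss(logits, token_ids):
    """Average cross-entropy of predicting token t+1 from position t.

    logits: one list of vocabulary scores per position (hypothetical toy input).
    token_ids: the input token ids, same length as logits.
    """
    losses = []
    # Position t predicts token t+1, so the last position has no target.
    for scores, target in zip(logits[:-1], token_ids[1:]):
        # log of the softmax normaliser, via log-sum-exp for numerical stability
        peak = max(scores)
        log_norm = peak + math.log(sum(math.exp(s - peak) for s in scores))
        # negative log-probability of the correct next token
        losses.append(log_norm - scores[target])
    return sum(losses) / len(losses)

# Toy example: vocabulary of 3 tokens, sequence of 3 tokens.
logits = [[2.0, 0.5, 0.1], [0.1, 0.2, 3.0], [1.0, 1.0, 1.0]]
token_ids = [0, 0, 2]
loss = next_token_loss(logits, token_ids)
print(f"loss = {loss:.4f}")
```

With Hugging Face causal LM classes you normally get this for free by passing `labels` to the model and reading `.loss` from the output, but working through it by hand once makes the number much less mysterious.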
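There are various ways to train, and one common approach (masked language modeling, as in BERT-style models) involves masking the input by replacing random input tokens with the mask token. Decoder-only models such as Llama are instead trained to predict the next token. I won’t go into too much detail here, because it’s a lot to explain, and I suggest you read an article on this (link1 or link2).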
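As a rough sketch of the masking idea: pick random positions, swap in the mask token, and keep the original token as the label only at those positions. `MASK_ID = 103` and the 15% rate are assumptions for illustration (check your tokenizer for the real mask id), and `-100` is used because it is the default index that PyTorch's cross-entropy ignores.

```python
import random

MASK_ID = 103   # hypothetical mask-token id; look it up in your tokenizer
IGNORE = -100   # positions with this label are skipped by the loss

def mask_tokens(token_ids, mask_prob=0.15, rng=None):
    """Return (masked_inputs, labels) for masked language modeling."""
    rng = rng or random.Random()
    inputs, labels = [], []
    for tok in token_ids:
        if rng.random() < mask_prob:
            inputs.append(MASK_ID)   # hide the token from the model
            labels.append(tok)       # ...and ask the model to recover it
        else:
            inputs.append(tok)       # leave the token visible
            labels.append(IGNORE)    # no loss at unmasked positions
    return inputs, labels

inputs, labels = mask_tokens([2023, 2003, 1037, 7099, 6251], rng=random.Random(0))
print(inputs)
print(labels)
```

Real implementations usually add a few twists (for example, sometimes keeping the original token or swapping in a random one instead of the mask), so this is only the core idea.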
You’re right. I read past the “I want to learn ML” and went straight to “do something useful with the data”.
If the goal is to understand how modern LLMs work, it’s also good to read up on RNNs and LSTMs. For this, 3Blue1Brown does an amazing job, and even posted an in-depth video about transformers. I’d watch that next, followed by implementing a simple transformer in PyTorch (perhaps using the existing blocks).
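For the "simple transformer using the existing blocks" step, something along these lines could be a starting point. It stacks PyTorch's built-in `nn.TransformerEncoderLayer` with a causal mask so each position only sees earlier tokens. The class name `TinyTransformerLM` and all the sizes are arbitrary choices of mine, and it's a sketch to poke at, not a tuned architecture.

```python
import torch
import torch.nn as nn

class TinyTransformerLM(nn.Module):
    """Small decoder-style language model built from stock PyTorch blocks."""

    def __init__(self, vocab_size=100, d_model=32, n_heads=4, n_layers=2, max_len=64):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, d_model)  # token embeddings
        self.pos_emb = nn.Embedding(max_len, d_model)     # learned position embeddings
        layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=n_heads,
            dim_feedforward=4 * d_model, batch_first=True,
        )
        self.blocks = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.head = nn.Linear(d_model, vocab_size)        # scores over the vocabulary

    def forward(self, token_ids):
        seq_len = token_ids.size(1)
        positions = torch.arange(seq_len, device=token_ids.device)
        x = self.tok_emb(token_ids) + self.pos_emb(positions)
        # Causal mask: True marks positions that may NOT be attended to (the future).
        causal = torch.triu(
            torch.ones(seq_len, seq_len, dtype=torch.bool, device=token_ids.device),
            diagonal=1,
        )
        x = self.blocks(x, mask=causal)
        return self.head(x)

model = TinyTransformerLM()
tokens = torch.randint(0, 100, (2, 10))  # batch of 2 sequences, 10 tokens each
logits = model(tokens)                   # shape: (batch, seq_len, vocab_size)
print(logits.shape)
```

Once this runs, a good exercise is to replace `nn.TransformerEncoderLayer` with your own attention and feed-forward code and check that the output shapes still match.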
You could argue that it’s important to design everything from scratch first, but it’s easier to go high level first, see how the network behaves, and then attempt to implement it yourself based on the paper. It depends on how comfortable OP is with the topic, though 😁