Create Gemma model variants for a specific language or unique cultural aspect.
Hey folks!
I found this interesting project on Kaggle and decided to take it on. Along the way, I ran into a few things that are usually taken for granted, and I wanted to share them here.
This project aims to fine-tune gemma-2, and I could choose any task to fine-tune it on. The computing resources are questionable, but let's get the setup done first.
This week, I mainly worked on getting the right dataset for my task.
I chose this task -
Fine-tune Gemma to generate engaging stories.
For example,
User - “There was a curious boy named Tim,”
Fine-tuned Gemma - “In the heart of a bustling city, lived a curious boy named Tim and his teddy bear, Einstein. One sunny day, they decided to visit the aviation museum. As they admired the big, powerful machines, Timmy noticed something unusual <the story continues>”
To reach this goal, I need a dataset in a specific format. It should have two columns: the first column should contain one-liners for each story, and the second column should have the full story.
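To make the target format concrete, here is a minimal sketch of the two-column layout I'm aiming for (the field names are my own choice, and the rows reuse the example above):

```python
# Sketch of the dataset layout: parallel lists, one entry per story.
# Column names ("one_liner", "story") are my assumption, not fixed yet.
dataset = {
    "one_liner": [
        "There was a curious boy named Tim,",
    ],
    "story": [
        "In the heart of a bustling city, lived a curious boy named Tim "
        "and his teddy bear, Einstein. ...",
    ],
}

# Each index pairs a prompt (the one-liner) with its target (the full story).
assert len(dataset["one_liner"]) == len(dataset["story"])
```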
The stories have to be creative, engaging, and diverse with compelling characters and vivid plots.
So I started looking for children's stories and found these datasets -
Based on the content and format of the datasets, I chose the second option. I wasn't happy with the stories in the first dataset, and while the third dataset already has one-liners, they sit in a text file where the stories and one-liners are mixed together with no clear separation, making the one-liners hard to extract.
Next big problem -
We have the stories, but what about the one-liners for each of them?
Now, I plan to send each row (story) to a language model with a prompt to "create an intriguing and magical one-liner for it."
I found a platform called Groq that provides API access to various language models, and it's super easy to use. I got an API key for my model (gemma-7b-it), and with a short script I was able to generate one-liners for each of my rows (stories).
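A minimal sketch of that script, using only the standard library. I'm assuming a GROQ_API_KEY environment variable and Groq's OpenAI-compatible chat-completions endpoint; the helper names and prompt wording are my own:

```python
import json
import os
import urllib.request

# Groq exposes an OpenAI-compatible chat-completions endpoint.
GROQ_URL = "https://api.groq.com/openai/v1/chat/completions"


def build_prompt(story: str) -> str:
    """Ask the model for a single intriguing opening line for a story."""
    return (
        "Create an intriguing and magical one-liner for the following "
        f"children's story:\n\n{story}"
    )


def generate_one_liner(story: str, model: str = "gemma-7b-it") -> str:
    """Send one story to the Groq API and return the generated one-liner."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": build_prompt(story)}],
    }
    req = urllib.request.Request(
        GROQ_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {os.environ['GROQ_API_KEY']}",
            "Content-Type": "application/json",
        },
    )
    with urllib.request.urlopen(req, timeout=30) as resp:
        body = json.load(resp)
    # The reply follows the OpenAI response shape.
    return body["choices"][0]["message"]["content"].strip()
```

Looping `generate_one_liner` over the rows produces one one-liner per story.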
I generated one-liners for the first 1k rows and the results were commendable.
Now, my dataset is ready with 1,000 rows. I didn’t do any extra preprocessing because the dataset was already in great shape. I also wanted to keep some randomness in it to help generate more creative responses.
I also pushed this dataset to Hugging Face: One liner to story dataset
Easy implementation, just run this code:

```python
# pip install datasets
from datasets import load_dataset

ds = load_dataset("akshitha-k/oneliner_to_story")
```
Next steps:
1. Find a way to generate one-liners for all the rows.
2. Fine-tune Gemma using this dataset.
3. Evaluate performance.
I will write about it soon. Byee!