Create Gemma model variants for a specific language or unique cultural aspect.
Hey folks!
I found this interesting project on Kaggle and decided to take it on. Along the way, I ran into a few things that are usually taken for granted, and I wanted to share them here.
This project aims to fine-tune gemma-2, and I could choose any task to fine-tune it on. The computing resources are questionable, but let's get the setup done first.
This week, I mainly worked on getting the right dataset for my task.
I chose this task -
Fine-tune Gemma to generate engaging stories.
For example,
User - “There was a curious boy named Tim,”
Fine-tuned Gemma - “In the heart of a bustling city, lived a curious boy named Tim and his teddy bear, Einstein. One sunny day, they decided to visit the aviation museum. As they admired the big, powerful machines, Timmy noticed something unusual <the story continues>”
To reach this goal, I need a dataset in a specific format. It should have two columns: the first column should contain one-liners for each story, and the second column should have the full story.
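To make the target format concrete, here is a minimal sketch of the two-column layout I'm aiming for (the field names are my own choice, and the rows reuse the example above):

```python
# Sketch of the dataset layout: parallel lists, one entry per story.
# Column names ("one_liner", "story") are my assumption, not fixed yet.
dataset = {
    "one_liner": [
        "There was a curious boy named Tim,",
    ],
    "story": [
        "In the heart of a bustling city, lived a curious boy named Tim "
        "and his teddy bear, Einstein. ...",
    ],
}

# Each index pairs a prompt (the one-liner) with its target (the full story).
assert len(dataset["one_liner"]) == len(dataset["story"])
```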
The stories have to be creative, engaging, and diverse with compelling characters and vivid plots.
So I started looking for children's stories and found these datasets -
Based on the content and format of the datasets, I chose the second option. I wasn't happy with the stories in the first dataset, and while the third dataset already has one-liners, they sit in a text file where the stories and one-liners are mixed together with no clear separation, making the one-liners hard to extract.
Next big problem -
We have the stories, but what about the one-liners for each of them?
Now, I plan to send each row (story) to a language model with a prompt to "create an intriguing and magical one-liner for it."
I found a platform called Groq that provides API access to various language models, and it's super easy to use. I got an API key for my model (gemma-7b-it), and with a short script I was able to generate one-liners for each of my rows (stories).
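A minimal sketch of that script, using only the standard library. I'm assuming a GROQ_API_KEY environment variable and Groq's OpenAI-compatible chat-completions endpoint; the helper names and prompt wording are my own:

```python
import json
import os
import urllib.request

# Groq exposes an OpenAI-compatible chat-completions endpoint.
GROQ_URL = "https://api.groq.com/openai/v1/chat/completions"


def build_prompt(story: str) -> str:
    """Ask the model for a single intriguing opening line for a story."""
    return (
        "Create an intriguing and magical one-liner for the following "
        f"children's story:\n\n{story}"
    )


def generate_one_liner(story: str, model: str = "gemma-7b-it") -> str:
    """Send one story to the Groq API and return the generated one-liner."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": build_prompt(story)}],
    }
    req = urllib.request.Request(
        GROQ_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {os.environ['GROQ_API_KEY']}",
            "Content-Type": "application/json",
        },
    )
    with urllib.request.urlopen(req, timeout=30) as resp:
        body = json.load(resp)
    # The reply follows the OpenAI response shape.
    return body["choices"][0]["message"]["content"].strip()
```

Looping `generate_one_liner` over the rows produces one one-liner per story.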
I generated one-liners for the first 1k rows and the results were commendable.
Now, my dataset is ready with 1,000 rows. I didn’t do any extra preprocessing because the dataset was already in great shape. I also wanted to keep some randomness in it to help generate more creative responses.
I also pushed this dataset to Hugging Face: One liner to story dataset
Easy implementation, just run this code:

```python
# pip install datasets
from datasets import load_dataset

ds = load_dataset("akshitha-k/oneliner_to_story")
```
Next steps:
1. Find a way to generate one-liners for all the rows.
2. Fine-tune Gemma using this dataset.
3. Evaluate performance.
I will write about it soon. Byee!