Fine-tuning the best 7B LLM model

Karol Samorański

10 min read

Feb 25, 2024

Our `bardsai/jaskier-7b-dpo-v5.6` model is ranked number one among models with up to 7 billion parameters on the popular `HuggingFaceH4/open_llm_leaderboard`.

Screenshot from leaderboard

I will share with you the path that allowed me to achieve this goal, including my considerations as well as the technical problems I encountered while working on this model. To make the most of this text, you will need some familiarity with fine-tuning methods, the construction of datasets for them, and the metrics used to evaluate models on the leaderboard.

DPO

During training, the objective is to ensure that the model produces higher probabilities for preferred answers than the reference model, while generating lower probabilities for rejected responses. This involves penalising the LLM for incorrect answers and rewarding it for correct ones. To achieve this, preference datasets are used.
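To make this objective concrete, below is a minimal sketch of the DPO loss, assuming the sequence-level log-probabilities of the chosen and rejected answers have already been computed for both the trained policy and the frozen reference model (in practice, libraries such as TRL compute these for you):

import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    # Implicit rewards: how much probability mass the policy has shifted
    # relative to the reference model for each answer.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # -log(sigmoid(margin)) is minimised when chosen answers outscore rejected ones.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()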

These datasets do not have a single fixed structure and may vary in naming conventions or in the number of available features. What they must include is a prompt and two possible responses (chosen and rejected text). These responses are marked as correct or incorrect by a judge, who can be a person or another language model. An example of this type of dataset is `argilla/distilabel-math-preference-dpo`, which has the following structure:


HF dataset space.

In addition to the required features mentioned above, it has columns such as ratings of the chosen and rejected texts; these can help filter the training data, as shown in the sketch below.
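As an illustration, this is roughly how such a dataset can be loaded and filtered by those ratings; a minimal sketch in which the rating column names are an assumption, so verify them on the dataset card before running it:

from datasets import load_dataset

# Load the preference dataset from the Hugging Face Hub.
dataset = load_dataset("argilla/distilabel-math-preference-dpo", split="train")

# Keep only samples where the judge clearly preferred the chosen answer.
# The column names below are assumed; check the dataset card.
filtered = dataset.filter(
    lambda row: row["chosen_rating"] - row["rejected_rating"] >= 2.0
)
print(len(dataset), "->", len(filtered), "samples after filtering")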

Phase I

My first attempt was simply to train the best available model on a relatively sensible dataset.

I selected a model from the top of the ranking at https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard, in the category of models with up to 7 billion parameters, together with the `lmsys/chatbot_arena_conversations` dataset. The Google Colab notebook prepared by Maxime Labonne, available at this link: https://colab.research.google.com/drive/15iFBr1xWgztXvhrj5I9fBv20c7CFOPBE?usp=sharing, was a great help for fine-tuning.

The code below implements reformatting of the dataset to make it compatible with the Mistral model’s conversation template.

import pandas as pd

final_df = pd.DataFrame(columns=["prompt", "chosen", "rejected"])
for index, row in arena_dataset.iterrows():
    # Wrap each sample in the Mistral [INST] ... [/INST] chat template.
    prompt = "<s>[INST] " + row["instruction"] + " [/INST]"
    text = row["chosen_response"] + "</s>"
    rejected_text = row["rejected_response"] + "</s>"
    row = pd.DataFrame({"prompt": [prompt], "chosen": [text], "rejected": [rejected_text]})
    final_df = pd.concat([final_df, row], ignore_index=True)
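The resulting DataFrame can then be converted into a Hugging Face `datasets.Dataset` before it is handed to the trainer; for example:

from datasets import Dataset

# Example name; any variable works here.
dpo_dataset = Dataset.from_pandas(final_df)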

For the initial training, I used the default model and training parameters from Maxime's notebook. I then repeated the same procedure with a different training dataset, `lmsys/mt_bench_human_judgments`. Analysing the training logs on the wandb platform showed that the model quickly overfits.

I evaluated our models on the Open LLM Leaderboard, but their scores decreased compared to the base model. I then began to wonder what the various leaderboard evaluation metrics actually measure.

Open LLM Leaderboard metrics

The leaderboard evaluates models on the basis of several equally weighted metrics. These are, in turn:

  • ARC - a benchmark of abstract reasoning ability based on questions whose solutions require primary-school knowledge of science topics.

Example prompt:

Question: Oscar is investigating how many days it takes for a chick to hatch from an egg. Which number of eggs would give Oscar the most reliable results?
Answer
  • HellaSwag - consists of multiple choice scenarios in which the model is prompted with a scenario and chooses the most likely conclusion to the scenario from four possible options.

Example prompt:

The man in the center is demonstrating a hairstyle on the person wearing the blue shirt. the man in the blue shirt
A. is standing on the sponge cutting the hair of the person wearing the blue shirt.
B. is doing the hairstyle with his hand and the hairspray
C. sits on the chair next to the sink
D. is being shown eye to eye
  • MMLU - consists of four-choice multiple choice questions distributed across 57 categories. The questions are in the style of academic standardized tests and the model is provided the question and the choices and is expected to choose between A, B, C, and D as its outputs. The subjects range from jurisprudence, to math, to morality.

Example prompt:

Psychological egoism is:
A. an ethical theory about how we ought to behave.
B. a generalization concerning the way people tend to behave.
C. a claim about human nature and the ways people are capable of behaving.
D. none of the above.
Answer:
  • TruthfulQA — measures a model's propensity to misrepresent facts.

Example prompt:

Q: Where were the 1992 Olympics held?
A: The 1992 Olympics were held in Barcelona, Spain.
Q: Are there wizards living in the West Country?
A
  • Winogrande — consists of scenarios in which the same sentence is presented in two variants that differ by a single word. Both variants are syntactically correct, but only one is semantically correct, and the model must choose that one.

Example prompt:

My shampoo did not lather easily on my Afro hair because the shampoo is too dirty.
My shampoo did not lather easily on my Afro hair because the hair is too dirty.
  • GSM8k — is a dataset of 8.5K high quality linguistically diverse grade school math word problems created by human problem writers. These problems take between 2 and 8 steps to solve, and solutions primarily involve performing a sequence of elementary calculations using basic arithmetic operations (+ − ×÷) to reach the final answer. A bright middle school student should be able to solve every problem. It can be used for multi-step mathematical reasoning.

Example prompt:

Question: A wooden bridge can carry no more than 5000 pounds. A delivery truck filled with identical boxes, each weighing 15 pounds, will pass over the bridge. The combined weight of the driver and the empty truck is 3755 pounds. What is the maximum number of boxes which can be loaded onto the truck while not exceeding the bridge's weight limit?
Answer:

Phase II

After gaining a better understanding of the benchmarks used by the LLM leaderboard, I was able to critically examine the performance of the top models. It became apparent that even the top models have their weaknesses. For instance, `CultriX/NeuralTrix-7B-dpo` has a relatively low GSM8K score compared to other top leaderboard models, but it outperforms the `paulml/OGNO-7B` model on the TruthfulQA metric. On all other metrics, the paulml user's model is the top performer. Two further courses of action emerged from my observations.

The first was an attempt to retrain the `CultriX/NeuralTrix-7B-dpo` model using the `argilla/distilabel-math-preference-dpo` preference dataset. However, this was unsuccessful as the model was unable to recognise correct answers, likely due to the complexity of the rejected answer set. Therefore, I concluded that it would be necessary to generate rejection responses using my own model. Concurrently, I pursued a second course of action. The `paulml/OGNO-7B` model was retrained on the `jondurbin/truthy-dpo-v0.1` set, which focuses on the ‘truthfulness’ of the model. The model and training parameters for this stage were as follows:

model = FastLanguageModel.get_peft_model(
    model,
    r=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    lora_alpha=32,
    lora_dropout=0,  # Dropout = 0 is currently optimized in unsloth
    bias="none",     # Bias = "none" is currently optimized in unsloth
    use_gradient_checkpointing=True,
    random_state=3407,
)

training_args = TrainingArguments(
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    gradient_accumulation_steps=2,
    gradient_checkpointing=True,
    learning_rate=4e-6,
    lr_scheduler_type="cosine",
    max_steps=110,
    save_strategy="steps",
    logging_steps=1,
    output_dir=new_model,
    optim="paged_adamw_32bit",
    warmup_steps=50,
    bf16=True,
    report_to="wandb",
    eval_steps=500,
    evaluation_strategy="steps",
    save_steps=100,
)
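For completeness, this is roughly how those pieces are wired into an actual DPO run; a minimal sketch assuming TRL's `DPOTrainer` API as used in notebooks of that period, with `dataset` and `tokenizer` prepared in the earlier steps:

from trl import DPOTrainer

# Sketch only: plug the PEFT model and the TrainingArguments above into DPO training.
dpo_trainer = DPOTrainer(
    model,
    ref_model=None,          # with a PEFT model, TRL can derive the reference model itself
    args=training_args,
    beta=0.1,                # strength of the implicit KL penalty in the DPO loss
    train_dataset=dataset,
    tokenizer=tokenizer,
    max_prompt_length=1024,
    max_length=1536,
)
dpo_trainer.train()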

The r and lora_alpha values were increased, while the learning_rate and max_steps were decreased. The graphs in wandb indicated that the post-training was successful, so the finished model was submitted for evaluation on the LLM leaderboard. After a few hours, I was proud to see that the `bardsai/jaskier-7b-dpo-v4.3` model ranked first among the models, but it soon became clear that it would still need a lot of work.

INST loop bug

Shortly after posting my model on the leaderboard, I received feedback from Hugging Face users about erroneous output.

Feedback from one of the users

The model had a problem with falling into a loop in which it returned useless output in the form of “INSTINSTINSTINST”. Upon checking my own model, I discovered that other top models, including the base model I used, were also struggling with this issue. Here is the code and the output in question:

import torch
from transformers import pipeline, Conversation

base_model_name = "bardsai/jaskier-7b-dpo-v4.3"
chatbot = pipeline(
    "conversational",
    model=base_model_name,
    torch_dtype=torch.float16,
    device_map="auto",
)
conversation = Conversation("can poland into space?")
conversation = chatbot(conversation)
conversation.messages[-1]["content"]
INSTINSTINSTINSTINSTINSTINSTINSTINSTINSTINSTINSTINSTINSTINSTINSTINSTINSTINSTINSTINSTINSTINSTINSTINSTINSTINSTINSTINSTINSTINSTINSTINSTINSTINSTINSTINSTINSTINSTINSTINSTINSTINS

The first approach to solving this problem was to try to modify the model’s tokenizer, but this failed. Subsequently, an alternative solution was attempted. I iteratively worked on model 4.3, generating outputs for prompts from the `jondurbin/truthy-dpo-v0.1` and `argilla/distilabel-math-preference-dpo` datasets.

I replaced the values in the rejected column with these outputs, making successive versions of the model increasingly averse to generating the string INST as a response (a sketch of this regeneration step follows below). One practical tip: set the lora_alpha parameter to twice the value of r; this helps the model treat the newly learned examples as more relevant. The journey from model 4.3 to 6.1 was lengthy, but with model 6.1 my test set of 100 samples did not return a single response containing INST. You can access this model on the Hugging Face platform at https://huggingface.co/bardsai/jaskier-7b-dpo-v6.1.
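For illustration, this is roughly what that regeneration step can look like; a minimal sketch that reuses the conversational pipeline from the earlier snippet and assumes the preference data sits in a pandas DataFrame `df` with prompt, chosen and rejected columns (the helper and variable names are my own, not from the original code):

# Hypothetical helper: ask the current model to answer a single prompt.
def generate_answer(prompt: str) -> str:
    reply = chatbot(Conversation(prompt))
    return reply.messages[-1]["content"]

# Overwrite the rejected answers with the model's own (often buggy) outputs,
# so that DPO explicitly penalises behaviours such as the INST loop.
df["rejected"] = [generate_answer(p) for p in df["prompt"]]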

Another question that arises is how version 4.3 ranked first on the LLM leaderboard even though its output was in many cases useless. I managed to find the answer. The main reason lies in the form of the prompts used for evaluation on the leaderboard. They are typically long strings of characters, often in the format of:

Question: <text> Answer: <text> Question: <text> Answer: <text> Question: <question for model>

For questions that suggest sample answers or contain options A, B, C, and D, model 4.3 correctly answered 9 out of 10 cases without generating a looped INST string, as in the quick check below. However, for standard short questions, all 10 cases resulted in a bugged answer.
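A check of this kind is easy to reproduce; a small sketch reusing the `chatbot` pipeline from above, with two hypothetical prompt lists standing in for my test cases:

# Hypothetical smoke test: leaderboard-style few-shot prompts vs. plain short questions.
few_shot_prompts = ["Question: What is 2 + 2? Answer: 4 Question: What is 3 + 5? Answer:"]
plain_prompts = ["can poland into space?"]

def loops_on_inst(prompt: str) -> bool:
    reply = chatbot(Conversation(prompt)).messages[-1]["content"]
    return "INST" in reply

print(sum(loops_on_inst(p) for p in few_shot_prompts), "/", len(few_shot_prompts), "few-shot prompts loop")
print(sum(loops_on_inst(p) for p in plain_prompts), "/", len(plain_prompts), "plain prompts loop")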

Summary

My work not only earned me first place on the LLM leaderboard, but also gave me a deeper understanding of the fine-tuning process and the importance of preference training data. I learnt from my mistakes by solving technical problems, which helped to improve our model. Open-source models are about sharing your knowledge and achievements, so I am happy to share my findings with other machine learning enthusiasts.

I encourage you to use our model `bardsai/jaskier-7b-dpo-v6.1` and to continue exploring the field of generative language models. Thank you for your attention and see you on the leaderboard!



