Embedding model fine-tuning
In this blog post, I show how I fine-tune an embedding model for retrieval-augmented generation (RAG).
Motivation
This post is an updated version of an older post by Philipp Schmid. I recently used his post as a starting point to train my own embedding model for a RAG system, but I faced two major challenges:
- enabling PEFT and LoRA so that training fits on a smaller GPU
- adding custom tokens so that the model understands technical documentation
This blog post will follow these steps:
- reproduce the original blog post
- add PEFT and LoRA
- add caching
- add custom tokens
At each step, we will evaluate the results and compare them to the original ones. A quick sketch of the two main modifications (LoRA and custom tokens) is shown below.
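To make these changes concrete, here is a minimal sketch of the two modifications on top of a `SentenceTransformer` model. The base model name, the LoRA hyper-parameters, and the example tokens are placeholders for illustration, not necessarily the final values:

```python
from peft import LoraConfig, TaskType, get_peft_model
from sentence_transformers import SentenceTransformer

# Placeholder base model (the original post fine-tunes a BGE-style encoder).
model = SentenceTransformer("BAAI/bge-base-en-v1.5")

# 1) Register domain-specific tokens (hypothetical examples) and resize the
#    embedding matrix so the tokenizer stops splitting them into sub-words.
new_tokens = ["<api_endpoint>", "<stack_trace>"]
if model.tokenizer.add_tokens(new_tokens) > 0:
    model[0].auto_model.resize_token_embeddings(len(model.tokenizer))

# 2) Wrap the underlying transformer with a LoRA adapter so that only a small
#    fraction of the weights is trained, which keeps the memory footprint
#    compatible with a smaller GPU.
lora_config = LoraConfig(
    task_type=TaskType.FEATURE_EXTRACTION,
    r=8,
    lora_alpha=32,
    lora_dropout=0.1,
    target_modules=["query", "key", "value"],   # attention projections of a BERT-style encoder
    modules_to_save=["word_embeddings"],        # keep the resized word embeddings trainable for the new tokens
)
model[0].auto_model = get_peft_model(model[0].auto_model, lora_config)
model[0].auto_model.print_trainable_parameters()
```

The rest of the recipe (loss, trainer, evaluation) can stay as in the original post; only the model preparation changes.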
Reproducing the original blog post
// in progress
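While this section is being written, here is a sketch of what the training setup from the original post looks like with the sentence-transformers v3 API. The dataset file, loss configuration and hyper-parameters are written from memory, so treat them as assumptions rather than the exact original values:

```python
from datasets import load_dataset
from sentence_transformers import (
    SentenceTransformer,
    SentenceTransformerTrainer,
    SentenceTransformerTrainingArguments,
)
from sentence_transformers.losses import MatryoshkaLoss, MultipleNegativesRankingLoss

model = SentenceTransformer("BAAI/bge-base-en-v1.5")

# Pairs generated from the documentation corpus, with "anchor" and "positive"
# columns; the file name is a placeholder.
train_dataset = load_dataset("json", data_files="train_dataset.json", split="train")

# Multiple-negatives ranking loss wrapped in a Matryoshka loss, roughly as in
# the original recipe.
inner_loss = MultipleNegativesRankingLoss(model)
loss = MatryoshkaLoss(model, inner_loss, matryoshka_dims=[768, 512, 256, 128, 64])

args = SentenceTransformerTrainingArguments(
    output_dir="bge-finetuned",
    num_train_epochs=4,
    per_device_train_batch_size=32,
    learning_rate=2e-5,
    warmup_ratio=0.1,
)

trainer = SentenceTransformerTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    loss=loss,
)
trainer.train()
```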
Setting `tf32=True` and `bf16=True` in the training arguments raises the following error on pre-Ampere GPUs:
`ValueError: --tf32 requires Ampere or a newer GPU arch, cuda>=11 and torch>=1.7`
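A simple way to avoid this on older cards is to gate both flags on the GPU's compute capability. This is only a sketch, to be adapted to your own training arguments:

```python
import torch
from sentence_transformers import SentenceTransformerTrainingArguments

# TF32 and bf16 both require an Ampere GPU (compute capability >= 8.0);
# on older cards (T4, V100, ...) fall back to fp16 mixed precision.
ampere_or_newer = torch.cuda.is_available() and torch.cuda.get_device_capability()[0] >= 8

args = SentenceTransformerTrainingArguments(
    output_dir="bge-finetuned",
    tf32=ampere_or_newer,
    bf16=ampere_or_newer,
    fp16=not ampere_or_newer,
)
```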
If you encounter `NameError: name 'IterableDataset' is not defined`, just update sentence-transformers to a recent version.
“MultipleNegativesRankingLoss will always consider the first column as the anchor and the second as the positive, regardless of the dataset column names.” So we need to select the columns explicitly: `train_dataset = train_dataset.select_columns(["anchor", "positive"])`
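In practice the generated dataset may also carry extra columns (an id, metadata, ...), so it is worth checking the column names and keeping only the pair, in the right order. The `id` column below is only an example:

```python
# The loss only looks at column positions, not names: keep exactly the
# anchor/positive pair, in that order, before handing the dataset to the trainer.
print(train_dataset.column_names)   # e.g. ['id', 'anchor', 'positive']
train_dataset = train_dataset.select_columns(["anchor", "positive"])
print(train_dataset.column_names)   # ['anchor', 'positive']
```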
References