Embedding model fine-tuning

In this blog post I show you how I fine-tune an embedding model for RAG.

Motivation

This post is an updated version of an older post by Philipp Schmid, which I recently used as a basis to train my own embedding model for a RAG system. However, I faced two major challenges:

  • enabling PEFT and LoRA so that the training fits on a smaller GPU
  • adding custom tokens so that the model understands technical documentation

This blog post will follow these steps:

  1. try to reproduce the initial blog post
  2. add PEFT and LoRA
  3. add caching
  4. add custom tokens

At each step, we will evaluate the results and compare them to the initial ones.

Reproducing the initial blog post

(Work in progress.)

The training arguments set both tf32 and bf16:

    tf32=True,  # use tf32 precision
    bf16=True,  # use bf16 precision

On my GPU, this combination raises the following error:

    ValueError: --tf32 requires Ampere or a newer GPU arch, cuda>=11 and torch>=1.7
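
To keep the script runnable on both older and newer cards, one option is to set the precision flags based on the GPU that is actually present. This is a minimal sketch assuming a sentence-transformers v3 setup; the output_dir value is just a placeholder:

    import torch
    from sentence_transformers import SentenceTransformerTrainingArguments

    # tf32 and bf16 require an Ampere (compute capability >= 8.0) or newer GPU
    has_cuda = torch.cuda.is_available()
    ampere_or_newer = has_cuda and torch.cuda.get_device_capability()[0] >= 8

    args = SentenceTransformerTrainingArguments(
        output_dir="bge-finetuned",             # placeholder output directory
        tf32=ampere_or_newer,                   # only enable tf32 on supported hardware
        bf16=ampere_or_newer,                   # only enable bf16 on supported hardware
        fp16=has_cuda and not ampere_or_newer,  # fall back to fp16 mixed precision on older GPUs
    )

This way the same training script works on older hardware without manually editing the precision flags.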

If you encounter "NameError: name 'IterableDataset' is not defined", just update sentence-transformers.
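
For example, with pip:

    pip install -U sentence-transformers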

Note that "MultipleNegativesRankingLoss will always consider the first column as the anchor and the second as the positive, regardless of the dataset column names." So we need to select the columns explicitly:

    train_dataset = train_dataset.select_columns(["anchor", "positive"])
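
For context, here is a minimal sketch of the dataset and loss setup; the train_dataset.json file name and the BAAI/bge-base-en-v1.5 base model are assumptions on my part, not necessarily what the original post uses:

    from datasets import load_dataset
    from sentence_transformers import SentenceTransformer
    from sentence_transformers.losses import MultipleNegativesRankingLoss

    # hypothetical training file with "anchor", "positive" and extra metadata columns
    train_dataset = load_dataset("json", data_files="train_dataset.json", split="train")

    # keep only the two columns the loss expects, in the order it expects them
    train_dataset = train_dataset.select_columns(["anchor", "positive"])

    model = SentenceTransformer("BAAI/bge-base-en-v1.5")
    loss = MultipleNegativesRankingLoss(model)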

References



