Embedding model fine-tuning

In this blog post I show you how I fine-tune an embedding model for RAG.

Motivation

This post is an updated version of an older post by Philipp Schmid, which I recently used as a basis to train my own embedding model for a RAG system. However, I faced two major challenges:

  • enabling PEFT and LoRA so that the training fits on a smaller GPU
  • adding custom tokens so that the model understands technical documentation

This blog post will follow these steps:

  1. try to reproduce the initial blog post
  2. add PEFT and LoRA
  3. add caching
  4. add custom tokens

At each step, we will evaluate the results and compare them to the initial ones.

Reproducing the initial blog post

(Work in progress.)

The training arguments set both tf32 and bf16:

    tf32=True,  # use tf32 precision
    bf16=True,  # use bf16 precision

On my GPU, this combination raises the following error:

    ValueError: --tf32 requires Ampere or a newer GPU arch, cuda>=11 and torch>=1.7
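
To keep the script runnable on both older and newer cards, one option is to set the precision flags based on the GPU that is actually present. This is a minimal sketch assuming a sentence-transformers v3 setup; the output_dir value is just a placeholder:

    import torch
    from sentence_transformers import SentenceTransformerTrainingArguments

    # tf32 and bf16 require an Ampere (compute capability >= 8.0) or newer GPU
    has_cuda = torch.cuda.is_available()
    ampere_or_newer = has_cuda and torch.cuda.get_device_capability()[0] >= 8

    args = SentenceTransformerTrainingArguments(
        output_dir="bge-finetuned",             # placeholder output directory
        tf32=ampere_or_newer,                   # only enable tf32 on supported hardware
        bf16=ampere_or_newer,                   # only enable bf16 on supported hardware
        fp16=has_cuda and not ampere_or_newer,  # fall back to fp16 mixed precision on older GPUs
    )

This way the same training script works on older hardware without manually editing the precision flags.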

If you encounter "NameError: name 'IterableDataset' is not defined", just update sentence-transformers.
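
For example, with pip:

    pip install -U sentence-transformers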

Note that "MultipleNegativesRankingLoss will always consider the first column as the anchor and the second as the positive, regardless of the dataset column names." So we need to select the columns explicitly:

    train_dataset = train_dataset.select_columns(["anchor", "positive"])
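
For context, here is a minimal sketch of the dataset and loss setup; the train_dataset.json file name and the BAAI/bge-base-en-v1.5 base model are assumptions on my part, not necessarily what the original post uses:

    from datasets import load_dataset
    from sentence_transformers import SentenceTransformer
    from sentence_transformers.losses import MultipleNegativesRankingLoss

    # hypothetical training file with "anchor", "positive" and extra metadata columns
    train_dataset = load_dataset("json", data_files="train_dataset.json", split="train")

    # keep only the two columns the loss expects, in the order it expects them
    train_dataset = train_dataset.select_columns(["anchor", "positive"])

    model = SentenceTransformer("BAAI/bge-base-en-v1.5")
    loss = MultipleNegativesRankingLoss(model)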

References



