Fine-Tuning RNA Language Models to Predict Branch Points
This repository contains several RNA language models fine-tuned to predict branch points within intronic sequences. The models are fine-tuned using the MultiMolecule library and evaluated on an experimental dataset.
The following RNA language models were fine-tuned:
- SpliceBERT
- RNABERT
- RNA-FM
- RNA-MSM
- ERNIE-RNA
- UTR-LM
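For reference, the sketch below shows roughly how one of these backbones can be loaded and queried for per-nucleotide scores through MultiMolecule. The checkpoint name and the `SpliceBertForTokenPrediction` head follow MultiMolecule's HuggingFace-style naming conventions but are assumptions here, not a verbatim excerpt of this repository's code.

```python
# Minimal sketch: per-token branch-point logits from a MultiMolecule backbone.
# Checkpoint and head-class names are assumptions based on MultiMolecule's
# naming conventions; the actual fine-tuned checkpoints may differ.
import torch
from multimolecule import RnaTokenizer, SpliceBertForTokenPrediction

tokenizer = RnaTokenizer.from_pretrained("multimolecule/splicebert")
model = SpliceBertForTokenPrediction.from_pretrained("multimolecule/splicebert")

sequence = "CUGUACUAACGUCAG"  # toy intronic fragment
inputs = tokenizer(sequence, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# One score per token; after fine-tuning, higher scores should mark
# likelier branch-point positions.
print(outputs.logits.squeeze(-1))
```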
The dataset contains 177,980 samples and is an experimental-data-only subset of the dataset used to train BPHunter.
It was split approximately 80/10/10 into train/validation/test sets by chromosome:
- Train: chr1, chr2, chr3, chr4, chr5, chr6, chr7, chr12, chr13, chr14, chr15, chr16, chr17, chr18, chr19, chr20, chr21, chr22, chrX, chrY
- Validation: chr9, chr10
- Test: chr8, chr11
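For illustration, a chromosome-level split like the one above takes only a few lines of pandas; the file name and the `chrom` column are hypothetical stand-ins for the actual dataset schema.

```python
import pandas as pd

# Hypothetical schema: one row per sample, with a `chrom` column (e.g. "chr8").
df = pd.read_csv("bphunter_experimental.csv")

VALID_CHROMS = {"chr9", "chr10"}
TEST_CHROMS = {"chr8", "chr11"}

valid = df[df["chrom"].isin(VALID_CHROMS)]
test = df[df["chrom"].isin(TEST_CHROMS)]
train = df[~df["chrom"].isin(VALID_CHROMS | TEST_CHROMS)]

print(len(train), len(valid), len(test))  # roughly 80/10/10
```

Splitting by chromosome rather than at random helps keep homologous or overlapping intronic sequences from leaking between the train and test sets.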
Training Details
Each model was trained on the full dataset for 3 epochs with a batch size of 16, except for RNA-FM, which required a reduced batch size of 12 due to VRAM limitations. The following hyperparameters were used for RNABERT, RNA-FM, RNA-MSM, and UTR-LM:
- Optimizer: AdamW
- Learning rate: 3e-4
- Weight decay: 0.001
However, SpliceBERT and ERNIE-RNA failed to converge with these settings. To address this, we adjusted the hyperparameters to:
- Learning rate: 2e-5
- Weight decay: 0.01
These adjustments were based on empirical observations during early training. Ideally, comprehensive hyperparameter tuning would be performed for each model to optimize performance, but this was not feasible within the scope of the project due to the high computational cost and training time required.
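Concretely, the two regimes above can be captured in a small per-model configuration. This is an illustrative sketch of how the settings fit together, not the project's actual training script; names like `HPARAMS` and `make_optimizer` are hypothetical.

```python
import torch

EPOCHS = 3  # all models were trained for 3 epochs

# Default regime (RNABERT, RNA-FM, RNA-MSM, UTR-LM) versus the adjusted
# regime that SpliceBERT and ERNIE-RNA needed in order to converge.
HPARAMS = {
    "default":    {"lr": 3e-4, "weight_decay": 0.001, "batch_size": 16},
    "splicebert": {"lr": 2e-5, "weight_decay": 0.01,  "batch_size": 16},
    "ernie-rna":  {"lr": 2e-5, "weight_decay": 0.01,  "batch_size": 16},
    "rna-fm":     {"lr": 3e-4, "weight_decay": 0.001, "batch_size": 12},  # VRAM limit
}


def make_optimizer(model_name: str, model: torch.nn.Module) -> torch.optim.AdamW:
    """Build the AdamW optimizer for a given model under the regimes above."""
    hp = HPARAMS.get(model_name, HPARAMS["default"])
    return torch.optim.AdamW(
        model.parameters(), lr=hp["lr"], weight_decay=hp["weight_decay"]
    )
```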
GitHub
All code used to train and evaluate these models can be found at this link.