Fine-tune LLMs on HPC

Model and Dataset Selection Rationale

| Component | Configuration |
| --- | --- |
| Base Model | `google/gemma-3-4b-pt` (pretrained) |
| Comparison Model | `google/gemma-3-4b-it` (instruction-tuned) |
| Dataset | `timdettmers/openassistant-guanaco` |
| Justification | Using Gemma-3 allows a direct comparison between the base pretrained, our LoRA fine-tuned, and the official instruction-tuned variants. The OpenAssistant Guanaco dataset provides high-quality instruction-following examples. |

Dataset Overview

The timdettmers/openassistant-guanaco dataset is a high-quality instruction-following dataset containing conversational exchanges between humans and AI assistants. It includes diverse question-answer pairs covering topics like creative writing, problem-solving, factual queries, and technical explanations. The dataset is specifically designed to train models to follow instructions and provide helpful, harmless, and honest responses.
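To get a feel for the data, the dataset can be loaded and inspected directly. A minimal sketch, assuming the `datasets` package from the installation step below; the `train[:1%]` slice mirrors the subset-sampling tip used later for fast debugging:

from datasets import load_dataset

# Load a small slice for inspection; the full split is "train".
dataset = load_dataset("timdettmers/openassistant-guanaco", split="train[:1%]")
print(dataset)             # row count and column names
print(dataset[0]["text"])  # each row stores one human/assistant exchange in a single "text" field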

Fine-tuning Benefits

Fine-tuning Gemma-3-4B-PT on this dataset significantly improves the model's ability to:

  • Follow complex instructions: Better understanding of multi-step requests and nuanced prompts
  • Maintain conversational context: Improved coherence in longer dialogues
  • Provide structured responses: Enhanced ability to format answers appropriately
  • Reduce harmful outputs: Training on carefully curated data reduces problematic responses

By using Gemma-3, we can compare three variants:

  1. Base pretrained (google/gemma-3-4b-pt) - Raw language modeling capabilities
  2. Our LoRA fine-tuned - Custom instruction following on Guanaco dataset
  3. Official instruction-tuned (google/gemma-3-4b-it) - Google's instruction tuning baseline

This provides a comprehensive evaluation of fine-tuning effectiveness.


Three-Way Model Comparison

Example Prompt: "Explain quantum computing in simple terms for a beginner"

Base Pretrained Model (google/gemma-3-4b-pt):

Quantum computing uses quantum bits or qubits instead of classical bits. Qubits can be in superposition states and can be entangled. This allows quantum computers to perform certain calculations exponentially faster than classical computers for specific problems like factoring large numbers and simulating quantum systems.

Our LoRA Fine-tuned Model (Gemma-3-4B-PT + LoRA on Guanaco):

Great question! Let me break down quantum computing in simple terms:

**Think of it like this**: Regular computers use bits that are either 0 or 1 (like light switches that are either ON or OFF). Quantum computers use "quantum bits" or "qubits" that can be 0, 1, or both at the same time - imagine a coin that's spinning in the air, being both heads and tails until it lands.

**Why this matters**: This allows quantum computers to:
- Try many solutions to a problem simultaneously
- Solve certain types of problems much faster than regular computers
- Handle complex calculations that would take classical computers thousands of years

**Real applications**: Password breaking, drug discovery, weather prediction, and financial modeling.

The fine-tuned model provides a clearer explanation, with better structure, analogies, and practical context.

Official Instruction-tuned Model (google/gemma-3-4b-it):

Quantum computing is a revolutionary approach to computation that leverages quantum mechanics principles. Here's a beginner-friendly explanation:

**Basic Concept**: While classical computers use bits (0 or 1), quantum computers use quantum bits (qubits) that can exist in multiple states simultaneously through "superposition."

**Key Advantages**:
- Parallel processing of multiple possibilities
- Exponential speedup for specific problem types
- Superior performance in cryptography, optimization, and simulation

**Current Applications**: Drug discovery, financial modeling, cryptography, and artificial intelligence research.

This single-prompt comparison suggests that our custom LoRA fine-tuning can approach the instruction-following quality of the official instruction-tuned variant on examples like this one.
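One way to reproduce such a comparison is to generate from each variant with the same prompt. A minimal sketch, assuming the checkpoints load through the standard causal-LM interface (Gemma-3 checkpoints are multimodal, so the exact loader class may differ); the LoRA variant is loaded with `peft` as shown at the end of this page:

from transformers import AutoModelForCausalLM, AutoTokenizer

prompt = "Explain quantum computing in simple terms for a beginner"

for name in ["google/gemma-3-4b-pt", "google/gemma-3-4b-it"]:
    tokenizer = AutoTokenizer.from_pretrained(name)
    model = AutoModelForCausalLM.from_pretrained(name, device_map="auto")
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output = model.generate(**inputs, max_new_tokens=200)
    print(f"=== {name} ===")
    print(tokenizer.decode(output[0], skip_special_tokens=True))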


System Environment Setup

Singularity Container & Overlay Configuration

| Component | Configuration |
| --- | --- |
| Singularity Image | `/scratch/work/public/singularity/cuda11.2.2-cudnn8-devel-ubuntu20.04.sif` |
| Overlay | Created using `singularity overlay create --size 25000 overlay-25GB-conda.ext3` |
| Conda Path | `/ext3/miniconda3` within the overlay |
| Singularity Shell Command | See below |
singularity shell --nv \
    --overlay /scratch/<NetID>/fine-tune/overlay-25GB-conda.ext3:rw \
    /scratch/work/public/singularity/cuda11.2.2-cudnn8-devel-ubuntu20.04.sif

Python Environment and Dependency Installation

# Download and install Miniconda into the overlay, then activate it
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh -b -p /ext3/miniconda3
source /ext3/miniconda3/bin/activate
pip install torch transformers datasets accelerate peft trl
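After installation, a quick import check confirms the stack is usable. A small sketch; run it inside the container, ideally within a `srun --gres=gpu:1` session so CUDA is visible:

import torch, transformers, peft, trl, datasets

print("torch:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
print("transformers:", transformers.__version__)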

Model Cache Configuration for Hugging Face

To avoid exceeding home directory quotas during large model downloads:

export HF_HOME=/scratch/<NetID>/.cache/huggingface

Ensure this is set both interactively and within sbatch scripts.
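The same setting can also be applied defensively at the top of the training script, before any `transformers` or `datasets` import triggers a download (replace `<NetID>` with your own):

import os

# Must run before transformers/datasets first read HF_HOME.
os.environ.setdefault("HF_HOME", "/scratch/<NetID>/.cache/huggingface")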


Troubleshooting Environment Issues

This section summarizes the environment-related issues encountered while fine-tuning google/gemma-3-4b-pt on the NYU Greene HPC cluster. Each entry lists the symptom, root cause, and resolution, grouped by category.

1. Filesystem and Path Setup Issues

| Problem | Symptom | Cause | Resolution |
| --- | --- | --- | --- |
| Incorrect overlay filename | `No such file: overlay-50GB-500K.ext3.gz` | The filename was assumed rather than verified | Use `ls /scratch/work/public/overlay-fs-ext3/` to verify the correct file: `overlay-50G-10M.ext3.gz` |
| Compressed overlay used directly | `FATAL: while loading overlay images...` | Attempted to use the `.gz` file directly with Singularity | Run `gunzip overlay-50G-10M.ext3.gz` before using the file |
| Overlay missing in working directory | sbatch cannot find the overlay file | Overlay not copied to the training directory | Ensure the overlay file is placed in `/scratch/<NetID>/fine-tune/` where sbatch accesses it |
| Invalid overlay structure | `FATAL: could not create upper dir` | Overlay created via `fallocate` + `mkfs.ext3`, missing the necessary internal structure | Always use `singularity overlay create --size 25000` to create overlays |

2. Container Runtime and Overlay Mounting Errors

| Problem | Symptom | Cause | Resolution |
| --- | --- | --- | --- |
| GPU warning on login node | `WARNING: Could not find any nv files` | `--nv` flag used outside a GPU-enabled session | Ignore the warning, or only use `--nv` within a `srun --gres=gpu:1` session |
| Overlay locked by another process | `overlay in use by another process` | An interactive container shell using the overlay was still active | Run `lsof` or `ps aux` and terminate the blocking process |

3. Python Package Installation and Environment Setup Errors

| Problem | Symptom | Cause | Resolution |
| --- | --- | --- | --- |
| `which pip` returns `Illegal option --` | Unexpected error when checking pip | Uses `/usr/bin/which` instead of the Bash built-in | Use `command -v pip` or simply run `pip --version` |
| `xformers` install fails due to missing torch | `No module named torch` during install | PyTorch not installed before building `xformers` | Install torch first: `pip install torch`, then `pip install xformers` |
| Missing `transformers` in sbatch | `ImportError: No module named transformers` | Conda not activated in the job script | Add `source /ext3/miniconda3/bin/activate` before executing the training script |
| Installed pip packages not found | Training job fails to locate modules | pip used outside the overlay context | Only install packages while the overlay is mounted with `:rw` in an active container session |

4. Disk Quota and Cache Management Issues

| Problem | Symptom | Cause | Resolution |
| --- | --- | --- | --- |
| Quota exceeded on home | `OSError: [Errno 122] Disk quota exceeded: ~/.cache/huggingface` | Default Hugging Face cache path inside `/home` | Set `HF_HOME=/scratch/$USER/.cache/huggingface` |
| Cache redownloading on each sbatch | Hugging Face cache not shared | `HF_HOME` not consistently defined | Persist and reuse the same `HF_HOME` path across runs |

5. Slurm Job Submission and Runtime Failures

| Problem | Symptom | Cause | Resolution |
| --- | --- | --- | --- |
| Invalid Slurm account | `sbatch: Invalid account` | `--account` flag not set or invalid | Use `--account=pr_100_tandon_priority` |
| Conda environment not recognized | `No module named transformers` | Activation missing in sbatch | Add `source /ext3/miniconda3/bin/activate` in sbatch |
| Overlay not found during job | sbatch fails to locate the file | Overlay not placed in the expected directory | Ensure all relevant files are in `/scratch/<NetID>/fine-tune/` or update paths accordingly |

Summary of Recommendations

| Recommendation | Rationale |
| --- | --- |
| Use `singularity overlay create` for overlay creation | Ensures `upper/` and `work/` directories are properly set up |
| Install pip packages only after mounting the overlay | Ensures packages persist and are isolated inside the overlay |
| Activate Conda explicitly in sbatch | Slurm jobs do not inherit interactive shell environments |
| Set `HF_HOME` to `/scratch` | Prevents hitting disk quota limits in home directories |
| Avoid `return_tensors="pt"` in tokenizer mapping | Leads to shape mismatch errors in batched training (see the sketch below) |
| Use subset sampling (e.g., `train[:1%]`) for testing | Minimizes resource consumption and enables fast debugging |
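To illustrate the `return_tensors` recommendation: a tokenization function used with `Dataset.map` should return plain Python lists and leave batching to the data collator. A hedged sketch, assuming the Guanaco `text` column:

from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google/gemma-3-4b-pt")
dataset = load_dataset("timdettmers/openassistant-guanaco", split="train[:1%]")

def tokenize(batch):
    # Return plain lists; the data collator pads and stacks them later.
    # With return_tensors="pt", each example becomes a fixed-size tensor,
    # and examples of different lengths fail to stack in batched training.
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = dataset.map(tokenize, batched=True, remove_columns=dataset.column_names)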

LoRA Configuration Parameters

LoRA (Low-Rank Adaptation) is a technique for efficiently fine-tuning large models at reduced computational cost. It freezes the original model weights and injects small trainable low-rank matrices into selected layers, so only a tiny fraction of the parameters is updated during training.

Learn more in the original paper, "LoRA: Low-Rank Adaptation of Large Language Models" (Hu et al., 2021).

Here are the configuration parameters used for LoRA in this fine-tuning setup:

from peft import LoraConfig, TaskType

peft_config = LoraConfig(
    r=8,                     # rank of the low-rank update matrices
    lora_alpha=16,           # scaling factor applied to the updates
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # attention projections
    lora_dropout=0.05,
    bias="none",             # keep bias terms frozen
    task_type=TaskType.CAUSAL_LM,
)
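Applied to the base model, this configuration adds trainable adapters only to the attention projections. A minimal sketch of the wrapping step (the loader class is an assumption; the multimodal Gemma-3 checkpoint may require a different class):

from peft import get_peft_model
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("google/gemma-3-4b-pt")
model = get_peft_model(model, peft_config)
model.print_trainable_parameters()  # typically well under 1% of all parameters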

sbatch Job Script for Model Training

Training Script: train_gemma3.py

The complete training script is available here. Below are the key configuration snippets:

Model and Dataset Configuration:

# Model and dataset configuration
model_name = "google/gemma-3-4b-pt" # Base pretrained model
dataset_name = "timdettmers/openassistant-guanaco"
output_dir = "./gemma3_output"

LoRA Configuration:

# LoRA configuration
peft_config = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type=TaskType.CAUSAL_LM,
)

Training Arguments:

# Training arguments
training_args = TrainingArguments(
    output_dir=output_dir,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,   # effective batch size of 16
    num_train_epochs=1,
    learning_rate=2e-4,
    fp16=True,                       # mixed precision to fit the 4B model on a single GPU
    logging_steps=10,
    save_steps=50,
    save_total_limit=2,              # keep only the two most recent checkpoints
    remove_unused_columns=False,
    dataloader_pin_memory=False,
)
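These pieces are wired together in a supervised fine-tuning trainer. A minimal sketch; the exact SFTTrainer signature varies across trl versions, and `dataset_text_field` / `max_seq_length` are assumptions for the version installed here:

from datasets import load_dataset
from trl import SFTTrainer

dataset = load_dataset(dataset_name, split="train")

trainer = SFTTrainer(
    model=model_name,            # the trainer can load the model and apply peft_config itself
    args=training_args,
    train_dataset=dataset,
    peft_config=peft_config,
    dataset_text_field="text",   # Guanaco stores each conversation in a "text" column
    max_seq_length=512,
)
trainer.train()
trainer.save_model(output_dir)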

sbatch Script

#!/bin/bash
#SBATCH --job-name=gemma3-finetune
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=4
#SBATCH --mem=40GB
#SBATCH --gres=gpu:1
#SBATCH --time=12:00:00
#SBATCH --output=/scratch/<NetID>/fine-tune/gemma3_train_%j.out
#SBATCH --error=/scratch/<NetID>/fine-tune/gemma3_train_%j.err
#SBATCH --mail-type=END,FAIL
#SBATCH --mail-user=<NetID>@nyu.edu

export HF_HOME=/scratch/<NetID>/.cache/huggingface

singularity exec --nv \
    --overlay /scratch/<NetID>/fine-tune/overlay-25GB-conda.ext3:rw \
    /scratch/work/public/singularity/cuda11.2.2-cudnn8-devel-ubuntu20.04.sif \
    /bin/bash -c "
        source /ext3/miniconda3/bin/activate
        cd /scratch/<NetID>/fine-tune
        python train_gemma3.py
    "

Generated Output Artifacts

| File | Description |
| --- | --- |
| `adapter_model.safetensors` | LoRA adapter weights |
| `adapter_config.json` | Adapter architecture definition |
| `trainer_state.json` | Training metadata |
| `training_args.bin` | Saved training configuration |
| `tokenizer_config.json`, `tokenizer.json` | Tokenizer data |

Location: /scratch/<NetID>/fine-tune/gemma3_output/checkpoint-13/
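To run inference with the adapter, load it on top of the base model with peft. A hedged sketch (the loader class is an assumption for the multimodal Gemma-3 checkpoint; the adapter path matches the artifacts above):

from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base = AutoModelForCausalLM.from_pretrained("google/gemma-3-4b-pt", device_map="auto")
model = PeftModel.from_pretrained(base, "/scratch/<NetID>/fine-tune/gemma3_output/checkpoint-13")
tokenizer = AutoTokenizer.from_pretrained("google/gemma-3-4b-pt")

inputs = tokenizer("Explain quantum computing in simple terms for a beginner", return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=200)
print(tokenizer.decode(output[0], skip_special_tokens=True))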


Training Completion Summary

| Epochs | Steps | Status |
| --- | --- | --- |
| 1 | 13 | Completed successfully |