---
language: en
tags:
- clinical-notes
- contrastive-learning
- sentence-embeddings
- medical-nlp
- clinical-modernbert
- modernbert
library_name: transformers
pipeline_tag: feature-extraction
base_model: Simonlee711/Clinical_ModernBERT
datasets:
- clinical-notes
model-index:
- name: Clinical Contrastive ModernBERT
  results:
  - task:
      type: feature-extraction
      name: Clinical Note Embeddings
    dataset:
      type: clinical-notes
      name: Clinical Notes Dataset
    metrics:
    - type: cosine_similarity
      value: 0.85
      name: Cosine Similarity
    - type: triplet_accuracy
      value: 0.92
      name: Triplet Accuracy
---

# Clinical Contrastive ModernBERT

This is a Clinical ModernBERT model fine-tuned with contrastive learning to produce clinical note embeddings.

## Model Details

- **Base Model**: [Simonlee711/Clinical_ModernBERT](https://huggingface.co/Simonlee711/Clinical_ModernBERT)
- **Architecture**: ModernBERT with contrastive learning head
- **Training Method**: Triplet loss contrastive learning
- **Vocabulary Size**: 50370 tokens
- **Special Tokens**: Includes `[ENTITY]` token (ID: 50368)
- **Max Sequence Length**: 8192 tokens
- **Hidden Size**: 768
- **Layers**: 22
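
These values can be sanity-checked against the published checkpoint; a quick sketch using standard `transformers` config attributes:

```python
from transformers import AutoConfig, AutoTokenizer

repo = "nikhil061307/contrastive-learning-bert-added-token"
config = AutoConfig.from_pretrained(repo)
tokenizer = AutoTokenizer.from_pretrained(repo)

print(config.num_hidden_layers)        # expected: 22
print(config.hidden_size)              # expected: 768
print(config.max_position_embeddings)  # expected: 8192
print(len(tokenizer))                  # expected: 50370
```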

## Special Features

- ✅ **Extended Vocabulary**: Custom tokens for clinical text processing
- ✅ **Entity Masking**: `[ENTITY]` token for anonymizing sensitive information
- ✅ **Contrastive Learning**: Trained to produce semantically meaningful embeddings
- ✅ **Clinical Domain**: Specialized for medical/clinical text understanding
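
As a quick check that the extended vocabulary is wired up, each added token should encode to a single ID rather than being split into subwords:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("nikhil061307/contrastive-learning-bert-added-token")

for tok in ["[ENTITY]", "[EMPTY]"]:
    ids = tokenizer.encode(tok, add_special_tokens=False)
    print(tok, "->", ids)  # expect a single-element list per token
```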

## Performance

The model achieves:
- **Cosine Similarity**: 0.85 on clinical note similarity tasks
- **Triplet Accuracy**: 0.92 on the contrastive-learning validation set (see the sketch below for how this metric is typically computed)
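
Triplet accuracy is commonly the fraction of validation triplets where the anchor lands closer to its positive than to its negative. A minimal sketch of that check, assuming L2-normalized embeddings (as produced by the usage example below):

```python
import torch
import torch.nn.functional as F

def triplet_accuracy(anchor, positive, negative):
    # Inputs: (batch, dim) L2-normalized embeddings, so the row-wise
    # dot product equals the cosine similarity
    pos_sim = (anchor * positive).sum(dim=1)
    neg_sim = (anchor * negative).sum(dim=1)
    return (pos_sim > neg_sim).float().mean().item()

# Smoke test on random unit vectors; random triplets score ~0.5
a = F.normalize(torch.randn(4, 768), dim=1)
p = F.normalize(torch.randn(4, 768), dim=1)
n = F.normalize(torch.randn(4, 768), dim=1)
print(triplet_accuracy(a, p, n))
```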

## Usage

### Basic Usage

```python
from transformers import AutoTokenizer, AutoModel
import torch
import torch.nn.functional as F

# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("nikhil061307/contrastive-learning-bert-added-token")
model = AutoModel.from_pretrained("nikhil061307/contrastive-learning-bert-added-token")

def get_embeddings(text, max_length=512):
    # `text` may be a single string or a list of strings (batched)
    inputs = tokenizer(
        text,
        padding=True,
        truncation=True,
        max_length=max_length,  # the model accepts up to 8192 tokens
        return_tensors='pt'
    )
    
    # Get embeddings
    with torch.no_grad():
        outputs = model(**inputs)
    
    # Mean pooling
    attention_mask = inputs['attention_mask']
    token_embeddings = outputs.last_hidden_state
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    embeddings = torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)
    
    # Normalize (important for contrastive learning models)
    embeddings = F.normalize(embeddings, p=2, dim=1)
    
    return embeddings

# Example usage
clinical_note = "Patient presents with chest pain and shortness of breath. Vital signs stable."
embeddings = get_embeddings(clinical_note)
print(f"Embedding shape: {embeddings.shape}")
```
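
Because the tokenizer call uses `padding=True`, the same helper also accepts a list of notes and embeds them in a single forward pass:

```python
notes = [
    "Patient presents with chest pain and shortness of breath.",
    "Follow-up visit for hypertension management.",
]
batch_embeddings = get_embeddings(notes)
print(batch_embeddings.shape)  # torch.Size([2, 768])
```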

### Entity Masking

```python
# Use [ENTITY] token for anonymization
text_with_entities = "Patient [ENTITY] presents with chest pain."
embeddings = get_embeddings(text_with_entities)

# Check if [ENTITY] token is available
entity_token_id = tokenizer.convert_tokens_to_ids('[ENTITY]')
print(f"[ENTITY] token ID: {entity_token_id}")
```
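
How the `[ENTITY]` placeholder gets into the text is up to the caller. The helper below is a hypothetical sketch (the `mask_entities` name and the character-offset input are assumptions, e.g. offsets produced by a separate NER step), not functionality shipped with this model:

```python
def mask_entities(text, spans):
    # spans: (start, end) character offsets of sensitive mentions;
    # replace from the end so earlier offsets stay valid
    for start, end in sorted(spans, reverse=True):
        text = text[:start] + "[ENTITY]" + text[end:]
    return text

masked = mask_entities("John Doe presents with chest pain.", [(0, 8)])
print(masked)  # [ENTITY] presents with chest pain.
embeddings = get_embeddings(masked)
```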

### Similarity Comparison

```python
def compute_similarity(text1, text2):
    emb1 = get_embeddings(text1)
    emb2 = get_embeddings(text2)
    
    # Cosine similarity
    similarity = torch.cosine_similarity(emb1, emb2)
    return similarity.item()

# Compare clinical notes
note1 = "Patient has acute myocardial infarction."
note2 = "Patient diagnosed with heart attack."
similarity = compute_similarity(note1, note2)
print(f"Similarity: {similarity:.3f}")
```
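
The same embeddings support simple retrieval. Reusing the `get_embeddings` helper from the usage section above, embed a corpus once and rank notes against a query; since the helper L2-normalizes its output, a matrix product yields cosine scores directly:

```python
corpus = [
    "Patient has acute myocardial infarction.",
    "Patient diagnosed with community-acquired pneumonia.",
    "Routine follow-up, no acute complaints.",
]
corpus_emb = get_embeddings(corpus)         # (3, 768), already L2-normalized
query_emb = get_embeddings("heart attack")  # (1, 768)

scores = query_emb @ corpus_emb.T           # cosine similarities, shape (1, 3)
best = scores.argmax(dim=1).item()
print(corpus[best], scores[0, best].item())
```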

## Training Details

This model was fine-tuned using:
- **Loss Function**: Triplet loss with margin (a sketch follows this list)
- **Training Data**: Clinical notes arranged into anchor/positive/negative triplets
- **Optimization**: Contrastive learning, pulling related notes together and pushing unrelated ones apart
- **Special Tokens**: Added `[ENTITY]` and `[EMPTY]` tokens
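
A minimal sketch of this kind of triplet objective, using PyTorch's built-in `TripletMarginWithDistanceLoss` with cosine distance (the margin value and the random stand-in tensors are illustrative assumptions, not the actual training configuration):

```python
import torch
import torch.nn.functional as F

# Cosine distance = 1 - cosine similarity; the loss pushes
# d(anchor, positive) + margin below d(anchor, negative)
triplet_loss = torch.nn.TripletMarginWithDistanceLoss(
    distance_function=lambda a, b: 1.0 - F.cosine_similarity(a, b),
    margin=0.5,  # illustrative; the actual training margin is not published
)

# Stand-ins for pooled note embeddings; in training these would come
# from the encoder and carry gradients back into it
anchor = torch.randn(8, 768, requires_grad=True)
positive = torch.randn(8, 768)
negative = torch.randn(8, 768)

loss = triplet_loss(anchor, positive, negative)
loss.backward()
print(loss.item())
```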

## Files Included

- `tokenizer_config.json`
- `special_tokens_map.json`
- `tokenizer.json`
- `model.safetensors`
- `pytorch_model.bin`
- `training_args.bin`

## Technical Specifications

- **Model Type**: ModernBERT
- **Parameters**: ~149M (22 layers, hidden size 768, matching ModernBERT-base)
- **Precision**: float32
- **Framework**: PyTorch + Transformers
- **Compatibility**: transformers >= 4.48.0 (the first release with ModernBERT support)
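
The parameter count can be verified directly from the loaded weights:

```python
from transformers import AutoModel

model = AutoModel.from_pretrained("nikhil061307/contrastive-learning-bert-added-token")
n_params = sum(p.numel() for p in model.parameters())
print(f"{n_params / 1e6:.1f}M parameters")
```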

## Citation

If you use this model, please cite:

```bibtex
@misc{clinical-contrastive-modernbert,
  title={Clinical Contrastive ModernBERT},
  author={Your Name},
  year={2025},
  url={https://huggingface.co/nikhil061307/contrastive-learning-bert-added-token}
}
```

## License

This model is distributed under the same license as its base model: [Simonlee711/Clinical_ModernBERT](https://huggingface.co/Simonlee711/Clinical_ModernBERT).