nikhil061307/contrastive-learning-bert-added-token

+---
+language: en
+tags:
+- clinical-notes
+- contrastive-learning
+- sentence-embeddings
+- medical-nlp
+- clinical-modernbert
+- modernbert
+library_name: transformers
+pipeline_tag: feature-extraction
+base_model: Simonlee711/Clinical_ModernBERT
+datasets:
+- clinical-notes
+model-index:
+- name: Clinical Contrastive ModernBERT
+  results:
+  - task:
+      type: feature-extraction
+      name: Clinical Note Embeddings
+    dataset:
+      type: clinical-notes
+      name: Clinical Notes Dataset
+    metrics:
+    - type: cosine_similarity
+      value: 0.85
+      name: Cosine Similarity
+    - type: triplet_accuracy
+      value: 0.92
+      name: Triplet Accuracy
+---
+# Clinical Contrastive ModernBERT
+Clinical ModernBERT fine-tuned with triplet-loss contrastive learning to produce embeddings for clinical notes.
+## Model Details
+- **Base Model**: [Simonlee711/Clinical_ModernBERT](https://huggingface.co/Simonlee711/Clinical_ModernBERT)
+- **Architecture**: ModernBERT with contrastive learning head
+- **Training Method**: Triplet loss contrastive learning
+- **Vocabulary Size**: 50370 tokens
+- **Special Tokens**: Includes `[ENTITY]` (ID: 50368); an `[EMPTY]` token was also added during fine-tuning
+- **Max Sequence Length**: 8192 tokens
+- **Hidden Size**: 768
+- **Layers**: 22
+## Special Features
+- ✅ **Extended Vocabulary**: Custom tokens for clinical text processing
+- ✅ **Entity Masking**: `[ENTITY]` token for anonymizing sensitive information
+- ✅ **Contrastive Learning**: Trained to produce semantically meaningful embeddings
+- ✅ **Clinical Domain**: Specialized for medical/clinical text understanding
+## Performance
+The model achieves:
+- **Cosine Similarity**: 0.85 (on clinical note similarity tasks)
+- **Triplet Accuracy**: 0.92 (on contrastive learning validation)
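The model card does not say how triplet accuracy was computed. A common definition, assumed here, is the fraction of (anchor, positive, negative) triplets in which the anchor embedding lies closer to the positive than to the negative. The sketch below uses Euclidean distance and made-up 2-D vectors; the helper names and the data are illustrative only.

```python
import math

def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def triplet_accuracy(triplets):
    """Fraction of (anchor, positive, negative) triplets ranked correctly."""
    correct = sum(
        1
        for anchor, positive, negative in triplets
        if euclidean(anchor, positive) < euclidean(anchor, negative)
    )
    return correct / len(triplets)

# Toy embeddings (hypothetical, not model output).
toy_triplets = [
    ([1.0, 0.0], [0.9, 0.1], [0.0, 1.0]),   # positive is closer
    ([0.0, 1.0], [0.1, 0.9], [1.0, 0.0]),   # positive is closer
    ([1.0, 0.0], [0.0, 1.0], [0.9, 0.1]),   # negative is closer
    ([0.5, 0.5], [0.6, 0.4], [-1.0, 0.0]),  # positive is closer
]

print(triplet_accuracy(toy_triplets))  # 3 of 4 triplets are ranked correctly
```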
+## Usage
+### Basic Usage
+```python
+from transformers import AutoTokenizer, AutoModel
+import torch
+import torch.nn.functional as F
+# Load model and tokenizer
+tokenizer = AutoTokenizer.from_pretrained("nikhil061307/contrastive-learning-bert-added-token")
+model = AutoModel.from_pretrained("nikhil061307/contrastive-learning-bert-added-token")
+def get_embeddings(text, max_length=512):
+    # Tokenize
+    inputs = tokenizer(
+        text,
+        padding=True,
+        truncation=True,
+        max_length=max_length,
+        return_tensors='pt'
+    )
+    # Get embeddings
+    with torch.no_grad():
+        outputs = model(**inputs)
+    # Mean pooling
+    attention_mask = inputs['attention_mask']
+    token_embeddings = outputs.last_hidden_state
+    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
+    embeddings = torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)
+    # Normalize (important for contrastive learning models)
+    embeddings = F.normalize(embeddings, p=2, dim=1)
+    return embeddings
+# Example usage
+clinical_note = "Patient presents with chest pain and shortness of breath. Vital signs stable."
+embeddings = get_embeddings(clinical_note)
+print(f"Embedding shape: {embeddings.shape}")
+```
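To make the pooling step in `get_embeddings` concrete, here is a dependency-free sketch of masked mean pooling followed by L2 normalization on a toy sequence. It mirrors the tensor arithmetic above with plain lists; the function names and the numbers are illustrative assumptions, not part of the model's API.

```python
import math

def masked_mean_pool(token_embeddings, attention_mask):
    """Average only the token vectors whose mask entry is 1."""
    dim = len(token_embeddings[0])
    total = [0.0] * dim
    count = 0
    for vec, keep in zip(token_embeddings, attention_mask):
        if keep:
            total = [t + x for t, x in zip(total, vec)]
            count += 1
    count = max(count, 1e-9)  # same guard against division by zero as torch.clamp
    return [t / count for t in total]

def l2_normalize(vec):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / max(norm, 1e-12) for x in vec]

# Two real tokens and one padding token whose values must not leak in.
tokens = [[3.0, 0.0], [0.0, 4.0], [100.0, 100.0]]
mask = [1, 1, 0]

pooled = masked_mean_pool(tokens, mask)
unit = l2_normalize(pooled)
print(pooled, unit)
```

Because the padding row is masked out, its large values have no effect on the pooled vector, which is why the attention mask is passed through to the pooling step.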
+### Entity Masking
+```python
+# Use [ENTITY] token for anonymization
+text_with_entities = "Patient [ENTITY] presents with chest pain."
+embeddings = get_embeddings(text_with_entities)
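+# Check that [ENTITY] is in the vocabulary (unknown tokens map to unk_token_id)
+entity_token_id = tokenizer.convert_tokens_to_ids('[ENTITY]')
+if entity_token_id == tokenizer.unk_token_id:
+    print("[ENTITY] token is missing from this tokenizer")
+print(f"[ENTITY] token ID: {entity_token_id}")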
+```
+### Similarity Comparison
+```python
+def compute_similarity(text1, text2):
+    emb1 = get_embeddings(text1)
+    emb2 = get_embeddings(text2)
+    # Cosine similarity
+    similarity = torch.cosine_similarity(emb1, emb2)
+    return similarity.item()
+# Compare clinical notes
+note1 = "Patient has acute myocardial infarction."
+note2 = "Patient diagnosed with heart attack."
+similarity = compute_similarity(note1, note2)
+print(f"Similarity: {similarity:.3f}")
+```
+## Training Details
+This model was fine-tuned using:
+- **Loss Function**: Triplet loss with margin 1.0
+- **Training Data**: Clinical notes with positive and negative examples for each anchor
+- **Optimization**: Contrastive fine-tuning with dropout 0.15 and a maximum sequence length of 256 tokens
+- **Special Tokens**: Added `[ENTITY]` and `[EMPTY]` tokens
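For readers unfamiliar with the objective, the following is a minimal sketch of a triplet margin loss. The margin of 1.0 comes from `model_metadata.json`; the choice of Euclidean distance is an assumption, since the repository does not state which distance was used during training, and the vectors are toy values.

```python
import math

def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def triplet_margin_loss(anchor, positive, negative, margin=1.0):
    """max(d(a, p) - d(a, n) + margin, 0): zero once the negative is
    at least `margin` farther from the anchor than the positive."""
    return max(euclidean(anchor, positive) - euclidean(anchor, negative) + margin, 0.0)

anchor = [0.0, 0.0]
positive = [0.0, 0.5]       # distance 0.5 from the anchor
easy_negative = [3.0, 4.0]  # distance 5.0 from the anchor
hard_negative = [0.0, 1.0]  # distance 1.0 from the anchor

easy_loss = triplet_margin_loss(anchor, positive, easy_negative)
hard_loss = triplet_margin_loss(anchor, positive, hard_negative)
print(easy_loss, hard_loss)
```

The easy negative already satisfies the margin and contributes no loss, while the hard negative is only 0.5 farther away than the positive and is penalized for the remaining 0.5.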
+## Files Included
+- `tokenizer_config.json`
+- `special_tokens_map.json`
+- `tokenizer.json`
+- `model.safetensors`
+- `pytorch_model.bin`
+- `training_args.bin`
+## Technical Specifications
+- **Model Type**: ModernBERT
+- **Parameters**: ~137M, estimated from the 546 MB float32 checkpoint (22 layers, hidden size 768)
+- **Precision**: float32
+- **Framework**: PyTorch + Transformers
+- **Compatible**: transformers >= 4.48.0 (the first release with ModernBERT support)
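The parameter count can be sanity-checked against the checkpoint size recorded in the `pytorch_model.bin` LFS pointer (546,437,606 bytes). The sketch below assumes every stored value is a 4-byte float32 and ignores the small serialization overhead, so the result is an upper bound rather than an exact count.

```python
CHECKPOINT_BYTES = 546_437_606  # size field of the pytorch_model.bin LFS pointer
BYTES_PER_PARAM = 4             # float32, per the stated precision

def estimate_param_count(n_bytes, bytes_per_param=BYTES_PER_PARAM):
    """Upper bound on parameters, assuming the file holds only weights."""
    return n_bytes // bytes_per_param

approx_params = estimate_param_count(CHECKPOINT_BYTES)
print(f"~{approx_params / 1e6:.1f}M parameters at most")
```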
+## Citation
+If you use this model, please cite:
+```bibtex
+@misc{clinical-contrastive-modernbert,
+  title={Clinical Contrastive ModernBERT},
+  author={Your Name},
+  year={2025},
+  url={https://huggingface.co/nikhil061307/contrastive-learning-bert-added-token}
+}
+```
+## License
+Follows the same license as the base model: [Simonlee711/Clinical_ModernBERT](https://huggingface.co/Simonlee711/Clinical_ModernBERT)

config.json ADDED

	@@ -0,0 +1,25 @@

+{
+  "architectures": [
+    "ModernBertModel"
+  ],
+  "attention_probs_dropout_prob": 0.1,
+  "classifier_dropout": null,
+  "hidden_act": "gelu",
+  "hidden_dropout_prob": 0.1,
+  "hidden_size": 768,
+  "initializer_range": 0.02,
+  "intermediate_size": 3072,
+  "layer_norm_eps": 1e-12,
+  "max_position_embeddings": 8192,
+  "model_type": "modernbert",
+  "num_attention_heads": 12,
+  "num_hidden_layers": 22,
+  "pad_token_id": 50283,
+  "position_embedding_type": "absolute",
+  "transformers_version": "4.44.0",
+  "type_vocab_size": 1,
+  "use_cache": true,
+  "vocab_size": 50370,
+  "_name_or_path": "Simonlee711/Clinical_ModernBERT",
+  "torch_dtype": "float32"
+}

model_metadata.json ADDED

	@@ -0,0 +1,24 @@

+{
+  "base_model": "Simonlee711/Clinical_ModernBERT",
+  "model_type": "modernbert",
+  "training_type": "contrastive_learning",
+  "vocab_size": 50370,
+  "special_tokens": {
+    "[ENTITY]": 50368,
+    "[PAD]": 50283,
+    "[CLS]": 50281,
+    "[SEP]": 50282
+  },
+  "architecture": {
+    "hidden_size": 768,
+    "num_layers": 22,
+    "num_attention_heads": 12,
+    "max_position_embeddings": 8192
+  },
+  "training_info": {
+    "loss_function": "triplet_loss",
+    "margin": 1.0,
+    "dropout_rate": 0.15,
+    "max_length": 256
+  }
+}
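Since `config.json` and `model_metadata.json` repeat several values, a small cross-check can catch drift between them. The sketch below copies the relevant fields from the two files into inline dictionaries; `check_consistency` is a hypothetical helper written for this illustration, not something shipped with the repository.

```python
# Fields copied from config.json in this commit.
config = {
    "hidden_size": 768,
    "num_hidden_layers": 22,
    "num_attention_heads": 12,
    "max_position_embeddings": 8192,
    "vocab_size": 50370,
    "pad_token_id": 50283,
}

# Fields copied from model_metadata.json in this commit.
metadata = {
    "vocab_size": 50370,
    "special_tokens": {"[ENTITY]": 50368, "[PAD]": 50283, "[CLS]": 50281, "[SEP]": 50282},
    "architecture": {
        "hidden_size": 768,
        "num_layers": 22,
        "num_attention_heads": 12,
        "max_position_embeddings": 8192,
    },
}

def check_consistency(config, metadata):
    """Return a list of human-readable mismatches (empty if none)."""
    arch = metadata["architecture"]
    pairs = [
        ("hidden_size", config["hidden_size"], arch["hidden_size"]),
        ("num_layers", config["num_hidden_layers"], arch["num_layers"]),
        ("num_attention_heads", config["num_attention_heads"], arch["num_attention_heads"]),
        ("max_position_embeddings", config["max_position_embeddings"], arch["max_position_embeddings"]),
        ("vocab_size", config["vocab_size"], metadata["vocab_size"]),
        ("pad_token_id", config["pad_token_id"], metadata["special_tokens"]["[PAD]"]),
    ]
    problems = [f"{name}: {a} != {b}" for name, a, b in pairs if a != b]
    # Every special token ID must be a valid row of the embedding matrix.
    for token, idx in metadata["special_tokens"].items():
        if not 0 <= idx < config["vocab_size"]:
            problems.append(f"{token} ID {idx} is outside the vocabulary")
    # The hidden size must split evenly across attention heads.
    if config["hidden_size"] % config["num_attention_heads"] != 0:
        problems.append("hidden_size is not divisible by num_attention_heads")
    return problems

problems = check_consistency(config, metadata)
print(problems or "config.json and model_metadata.json agree")
```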

pytorch_model.bin ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:eff772813c153dbd56483398e4f3610697a8bc22170d6e19ea57874ed7a3e114
+size 546437606
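The three lines above are a Git LFS pointer, not the weights themselves: they record the hash and byte size of the real file. As a sketch of how a downloaded checkpoint could be checked against the pointer, the code below parses the pointer text and compares size and SHA-256 digest. `parse_lfs_pointer` and `verify_file` are hypothetical helpers, and the local path passed to `verify_file` is whatever location the file was downloaded to.

```python
import hashlib

POINTER_TEXT = """version https://git-lfs.github.com/spec/v1
oid sha256:eff772813c153dbd56483398e4f3610697a8bc22170d6e19ea57874ed7a3e114
size 546437606
"""

def parse_lfs_pointer(text):
    """Split each 'key value' line of a pointer file into a dict."""
    fields = dict(line.split(" ", 1) for line in text.strip().splitlines())
    algo, digest = fields["oid"].split(":", 1)
    return {
        "version": fields["version"],
        "algo": algo,
        "digest": digest,
        "size": int(fields["size"]),
    }

def verify_file(path, pointer, chunk_size=1 << 20):
    """Return True if the file at `path` matches the pointer's size and hash."""
    hasher = hashlib.new(pointer["algo"])
    size = 0
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            hasher.update(chunk)
            size += len(chunk)
    return size == pointer["size"] and hasher.hexdigest() == pointer["digest"]

pointer = parse_lfs_pointer(POINTER_TEXT)
print(pointer["algo"], pointer["size"])
```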