---
license: mit
language:
- la
pipeline_tag: token-classification
tags:
- ner
- named-entity-recognition
- span-ner
- medieval-latin
- custom-code
- pytorch
base_model: FacebookAI/xlm-roberta-large
library_name: pytorch
metrics:
- f1
---

# Medieval Latin Span-NER (Bi-Encoder Architecture)

This repository contains a custom span-based Named Entity Recognition (NER) model designed for Medieval Latin text such as historical charters and legal documents. Unlike standard token-level NER models, it classifies candidate spans directly, which lets it handle overlapping entities and spans of highly variable length. It does so with a custom Bi-Encoder approach that pairs a text encoder with a separate label encoder.

## Model Architecture

The model does not use the standard Hugging Face token-classification head (`AutoModelForTokenClassification`). It is built to capture both short entities and long, descriptive passages (e.g., property boundary descriptions and legal clauses).

Key architectural features include:
1. **Text Encoder:** `FacebookAI/xlm-roberta-large` serves as the primary contextual sequence encoder.
2. **Label Encoder:** `BAAI/bge-m3` is used as a frozen semantic label encoder that maps rich textual label descriptions into a dense semantic space.
3. **Span Representation Layer:** Uses Multi-Head Attention to pool sequence outputs across generated token spans, supplemented by learned span-width embeddings.
4. **Hybrid Loss Engine:** Combines a tamed Dynamic Focal Loss (to suppress inlier majority classes without gradient starvation) and Dice Loss (for boundary smoothing).
5. **Contrastive Learning:** Utilizes an InfoNCE loss branch with Hard Negative Mining (20% ratio) to push semantic representations of spans toward their corresponding label embeddings in the latent space.
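
To make the span representation step more concrete, the following is a minimal sketch of how candidate token spans and their widths might be enumerated before pooling. It is not the repository's implementation: the function name `enumerate_spans`, the `max_width` parameter, and the example tokens are assumptions made for illustration. In a setup like the one described above, the `width` value would typically index the learned span-width embedding table.

```python
from typing import List, Tuple

# (start_token, end_token_inclusive, width_in_tokens)
CandidateSpan = Tuple[int, int, int]


def enumerate_spans(num_tokens: int, max_width: int) -> List[CandidateSpan]:
    """Enumerate every contiguous span of at most `max_width` tokens.

    Hypothetical helper: the real model may cap, filter, or bucket
    span widths differently.
    """
    spans: List[CandidateSpan] = []
    for start in range(num_tokens):
        for width in range(1, max_width + 1):
            end = start + width - 1
            if end >= num_tokens:
                break  # span would run past the end of the sequence
            spans.append((start, end, width))
    return spans


# Illustrative tokens only, not taken from the training data.
tokens = ["Ego", "Heinricus", "comes", "de", "Ortenburg"]
spans = enumerate_spans(len(tokens), max_width=3)

for start, end, width in spans:
    print(width, tokens[start : end + 1])
```

Because spans are enumerated independently of one another, nested or overlapping candidates (for example a `PER` inside an `ACTOR` phrase) can each receive their own label, which a single tag per token cannot express.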

## Evaluation and Ablation Results

The model was evaluated on a custom Medieval Latin dataset using two metrics that capture different failure modes:
* **Overlap F1:** Measures span-level semantic coverage (does the model find the core entity?).
* **Exact F1:** Measures strict boundary precision (does the model correctly identify the exact start and end tokens?).

**Full Model Performance:**
* **Overlap F1:** 83.4%
* **Exact F1:** 67.7%
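
The difference between the two metrics can be illustrated with a small sketch. The matching rules here are assumptions, since the exact evaluation script is not part of this card: an exact match requires identical start, end, and label, while an overlap match requires the same label and at least one shared token. The function name `span_f1` and the toy spans are hypothetical.

```python
from typing import List, Tuple

# (start_token, end_token_inclusive, label)
Span = Tuple[int, int, str]


def _overlaps(a: Span, b: Span) -> bool:
    """Same label and at least one token in common."""
    return a[2] == b[2] and a[0] <= b[1] and b[0] <= a[1]


def span_f1(gold: List[Span], pred: List[Span], exact: bool) -> float:
    """F1 over spans under exact or overlap matching (assumed definitions)."""
    if exact:
        def match(a: Span, b: Span) -> bool:
            return a == b
    else:
        match = _overlaps

    # Precision counts predictions that hit some gold span;
    # recall counts gold spans that are hit by some prediction.
    matched_pred = sum(any(match(p, g) for g in gold) for p in pred)
    matched_gold = sum(any(match(g, p) for p in pred) for g in gold)
    precision = matched_pred / len(pred) if pred else 0.0
    recall = matched_gold / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


gold = [(0, 1, "PER"), (5, 9, "PROP")]
pred = [(0, 1, "PER"), (6, 9, "PROP"), (12, 12, "LOC")]

exact_f1 = span_f1(gold, pred, exact=True)     # PROP boundary is off by one token
overlap_f1 = span_f1(gold, pred, exact=False)  # PROP still counts as found
print(f"exact={exact_f1:.2f} overlap={overlap_f1:.2f}")  # exact=0.40 overlap=0.80
```

A gap between the two scores, as in the reported results, points to boundary errors on spans whose core content the model has already located.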


## Label Dictionary

The model is trained to recognize 19 distinct entity classes relevant to medieval diplomatics:

* `PER`: Individual person name.
* `ACTOR`: Full noun phrase referring to a person (name, title, profession, origin).
* `TITLE`: Social rank, noble title, or ecclesiastical office.
* `REL`: Word or phrase indicating family, kinship, or social relationship.
* `LOC`: Geographical place, settlement, city, or diocese.
* `INS`: Monastery, abbey, church, or religious order.
* `NAT`: Natural landscape feature (river, mountain, forest).
* `EST`: Physical plot of land, estate, farm, or vineyard.
* `PROP`: Detailed boundary description of a property.
* `LEG`: Legal clause declaring rights, conditions, or penalties.
* `TRANS`: Verb or phrase denoting a core transaction or donation.
* `TIM`: Time period, duration, or regnal year.
* `DAT`: Specific calendar date or liturgical feast day.
* `MON`: Money, currency, coin, or monetary value.
* `TAX`: Customary toll, legal tax, or tribute.
* `COM`: Harvested crops, physical goods, or traded animals.
* `NUM`: Number written as a word or Roman numeral.
* `MEA`: Unit of measurement for land, volume, or weight.
* `RELIC`: Holy relic, cross, altar, or sacred object.

## How to Use

Because this model uses a custom architecture, it cannot be loaded using the standard `pipeline()` API. You must download the architecture script (`span_ner_model.py`) alongside the model weights.

### Requirements
```bash
pip install torch transformers huggingface_hub
```
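
One possible way to fetch the architecture script and import it is sketched below. This is an outline under stated assumptions, not the official loading procedure: the repository id is a placeholder, the helper names `fetch_from_hub` and `load_module_from_path` are hypothetical, and the classes defined inside `span_ner_model.py` and the name of the weights file are not documented here, so they are left out.

```python
import importlib.util
from types import ModuleType


def fetch_from_hub(repo_id: str, filename: str) -> str:
    """Download one file from a Hub repository and return its local path."""
    # Imported lazily so the rest of this sketch runs without the package.
    from huggingface_hub import hf_hub_download

    return hf_hub_download(repo_id=repo_id, filename=filename)


def load_module_from_path(path: str, module_name: str = "span_ner_model") -> ModuleType:
    """Import a Python source file from an arbitrary location on disk."""
    spec = importlib.util.spec_from_file_location(module_name, path)
    if spec is None or spec.loader is None:
        raise ImportError(f"Cannot load a module from {path}")
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    return module


# Usage (not executed here; replace the placeholder with the actual repo id):
# script_path = fetch_from_hub("<user>/<repo>", "span_ner_model.py")
# span_ner = load_module_from_path(script_path)
# print(dir(span_ner))  # inspect which classes the script defines
```

Importing the script executes its code, so it is worth reading the file before loading it from a source you do not control.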

### Dataset
[20 Named Entity Recognition (NER) Dataset for Medieval Latin Charters from Monasterium.net](https://zenodo.org/records/19009431)