---
license: mit
language:
- la
pipeline_tag: token-classification
tags:
- ner
- named-entity-recognition
- span-ner
- medieval-latin
- custom-code
- pytorch
base_model: FacebookAI/xlm-roberta-large
library_name: pytorch
metrics:
- f1
---

# Medieval Latin Span-NER (Bi-Encoder Architecture)

This repository contains a custom span-based Named Entity Recognition (NER) model designed specifically for Medieval Latin text, such as historical charters and legal documents. Unlike standard token-level NER models, this architecture handles complex, overlapping entities with highly variable span lengths by using a custom Bi-Encoder approach.

## Model Architecture

The model deviates from standard Hugging Face pipelines (`AutoModelForTokenClassification`) to capture both short entities and long, descriptive boundary definitions (e.g., properties and legal clauses).

Key architectural features include:

1. **Text Encoder:** `FacebookAI/xlm-roberta-large` serves as the primary contextual sequence encoder.
2. **Label Encoder:** `BAAI/bge-m3` is used as a frozen semantic label encoder to map rich textual label descriptions into a dense semantic space.
3. **Span Representation Layer:** Uses Multi-Head Attention to pool sequence outputs across generated token spans, supplemented by learned span-width embeddings.
4. **Hybrid Loss Engine:** Combines a tamed Dynamic Focal Loss (to suppress inlier majority classes without gradient starvation) and Dice Loss (for boundary smoothing).
5. **Contrastive Learning:** Uses an InfoNCE loss branch with Hard Negative Mining (20% ratio) to push semantic representations of spans toward their corresponding label embeddings in the latent space.

## Evaluation and Ablation Results

The model was evaluated on a custom Medieval Latin dataset. The evaluation uses two distinct metrics to capture different failure modes:

* **Overlap F1:** Measures span-level semantic coverage (does the model find the core entity?).
* **Exact F1:** Measures strict boundary precision (does the model correctly identify the exact start and end tokens?).

**Full Model Performance:**

* **Overlap F1:** 83.4%
* **Exact F1:** 67.7%

## Label Dictionary

The model is trained to recognize 19 distinct entity classes relevant to medieval diplomatics:

* `PER`: Individual person name.
* `ACTOR`: Full noun phrase referring to a person (name, title, profession, origin).
* `TITLE`: Social rank, noble title, or ecclesiastical office.
* `REL`: Word or phrase indicating family, kinship, or social relationship.
* `LOC`: Geographical place, settlement, city, or diocese.
* `INS`: Monastery, abbey, church, or religious order.
* `NAT`: Natural landscape feature (river, mountain, forest).
* `EST`: Physical plot of land, estate, farm, or vineyard.
* `PROP`: Detailed boundary description of a property.
* `LEG`: Legal clause declaring rights, conditions, or penalties.
* `TRANS`: Verb or phrase denoting a core transaction or donation.
* `TIM`: Time period, duration, or regnal year.
* `DAT`: Specific calendar date or liturgical feast day.
* `MON`: Money, currency, coin, or monetary value.
* `TAX`: Customary toll, legal tax, or tribute.
* `COM`: Harvested crops, physical goods, or traded animals.
* `NUM`: Number written as a word or Roman numeral.
* `MEA`: Unit of measurement for land, volume, or weight.
* `RELIC`: Holy relic, cross, altar, or sacred object.

## How to Use

Because this model uses a custom architecture, it cannot be loaded using the standard `pipeline()` API. You must download the architecture script (`span_ner_model.py`) alongside the model weights.

### Requirements

```bash
pip install torch transformers huggingface_hub
```

### Dataset

20 Named Entity Recognition (NER) Dataset for Medieval Latin Charters from Monasterium.net - https://zenodo.org/records/19009431
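
## Sketch: Overlap F1 vs. Exact F1

The two evaluation metrics described under "Evaluation and Ablation Results" can be illustrated with a minimal sketch. This is not the evaluation script used for the reported numbers. It assumes spans are `(start, end, label)` tuples with inclusive token indices, that an exact match requires an identical tuple, and that an overlap match requires the same label plus at least one shared token. The helper names (`spans_overlap`, `exact_f1`, `overlap_f1`) are hypothetical.

```python
from typing import Set, Tuple

# Assumed span format: (start_token, end_token_inclusive, label)
Span = Tuple[int, int, str]


def spans_overlap(a: Span, b: Span) -> bool:
    """True if both spans carry the same label and share at least one token."""
    return a[2] == b[2] and a[0] <= b[1] and b[0] <= a[1]


def f1(pred_hits: int, n_pred: int, gold_hits: int, n_gold: int) -> float:
    """Harmonic mean of precision (over predictions) and recall (over gold)."""
    precision = pred_hits / n_pred if n_pred else 0.0
    recall = gold_hits / n_gold if n_gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def exact_f1(gold: Set[Span], pred: Set[Span]) -> float:
    """A prediction counts only if start, end, and label all match."""
    hits = len(gold & pred)
    return f1(hits, len(pred), hits, len(gold))


def overlap_f1(gold: Set[Span], pred: Set[Span]) -> float:
    """A prediction counts if it overlaps any gold span of the same label."""
    pred_hits = sum(any(spans_overlap(p, g) for g in gold) for p in pred)
    gold_hits = sum(any(spans_overlap(g, p) for p in pred) for g in gold)
    return f1(pred_hits, len(pred), gold_hits, len(gold))


# Illustrative data: the PROP prediction misses the first token of the gold span.
gold = {(0, 1, "PER"), (4, 9, "PROP")}
pred = {(0, 1, "PER"), (5, 9, "PROP")}

print(exact_f1(gold, pred))    # 0.5
print(overlap_f1(gold, pred))  # 1.0
```

In this toy case the model finds both entities but clips one boundary, so the overlap score stays perfect while the exact score drops. The gap between the reported Overlap F1 and Exact F1 is consistent with errors of this kind on long spans such as `PROP` and `LEG`, although the source does not break the scores down by class.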
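
## Sketch: Candidate Span Enumeration

The Span Representation Layer listed under "Model Architecture" pools encoder outputs over generated token spans and adds a span-width embedding. The sketch below shows one common way such candidate spans are generated: every contiguous window up to a maximum width. It is an assumption about the general technique, not the code in `span_ner_model.py`; the function name `enumerate_spans`, the `max_width` parameter, and the example sentence are hypothetical.

```python
from typing import List, Tuple


def enumerate_spans(num_tokens: int, max_width: int) -> List[Tuple[int, int, int]]:
    """Return (start, end_inclusive, width) for every span of at most max_width tokens."""
    spans = []
    for start in range(num_tokens):
        for width in range(1, max_width + 1):
            end = start + width - 1
            if end >= num_tokens:
                break  # span would run past the end of the sequence
            spans.append((start, end, width))
    return spans


# Illustrative tokens only; not taken from the training data.
tokens = ["Ego", "Petrus", "abbas", "sancti", "Galli"]

for start, end, width in enumerate_spans(len(tokens), max_width=3):
    # In a span-based model, `width` would index a learned width-embedding table,
    # and tokens[start:end + 1] would be pooled into one span vector.
    print(width, " ".join(tokens[start:end + 1]))
```

The number of candidates grows roughly as `num_tokens * max_width`, which is why span-based models usually cap the width. Long classes such as `PROP` and `LEG` would need a generous cap under this scheme; the cap the model actually uses is not stated in this card.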