tanaos-NER-v1: A small but performant Named Entity Recognition model
This model was created by Tanaos with the Artifex Python library.
This is a multilingual (16+ languages) Named Entity Recognition model based on FacebookAI/roberta-base and fine-tuned on a synthetic dataset. It recognizes entities in text and classifies them into the following 14 categories:
| Entity | Description |
|---|---|
| PERSON | Individual people, fictional characters |
| ORG | Companies, institutions, agencies |
| LOCATION | Geographical areas |
| DATE | Absolute or relative dates, including years, months and/or days |
| TIME | Specific time of the day |
| PERCENT | Percentage expressions |
| NUMBER | Numeric measurements or expressions |
| FACILITY | Buildings, airports, highways, etc. |
| PRODUCT | Objects, vehicles, food, etc. bearing a specific name |
| WORK_OF_ART | Titles of creative works |
| LANGUAGE | Natural or programming languages |
| NORP | National, religious or political groups |
| ADDRESS | Full addresses |
| PHONE_NUMBER | Telephone numbers |
These categories were chosen to cover the named entity types that are commonly useful in NLP applications regardless of domain, so that the model can serve as a general-purpose Named Entity Recognition model across industries and use cases.
How to Use
Via the Artifex library (pip install artifex)
from artifex import Artifex
ner = Artifex().named_entity_recognition
print(ner("John landed in Barcelona at 15:45."))
# >>> [{'entity_group': 'PERSON', 'score': np.float32(0.92174554), 'word': 'John', 'start': 0, 'end': 4}, {'entity_group': 'LOCATION', 'score': np.float32(0.9853817), 'word': ' Barcelona', 'start': 15, 'end': 24}, {'entity_group': 'TIME', 'score': np.float32(0.98645407), 'word': ' 15:45.', 'start': 28, 'end': 34}]
Via the Transformers library
from transformers import pipeline
ner = pipeline(
    task="token-classification",
    model="tanaos/tanaos-NER-v1",
    aggregation_strategy="first",
)
print(ner("John landed in Barcelona at 15:45."))
# >>> [{'entity_group': 'PERSON', 'score': np.float32(0.92174554), 'word': 'John', 'start': 0, 'end': 4}, {'entity_group': 'LOCATION', 'score': np.float32(0.9853817), 'word': ' Barcelona', 'start': 15, 'end': 24}, {'entity_group': 'TIME', 'score': np.float32(0.98645407), 'word': ' 15:45.', 'start': 28, 'end': 34}]
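The `word` field in the output above can carry a leading space (`' Barcelona'`), and the scores are `np.float32` values, which the standard `json` module does not serialize. The following is a minimal post-processing sketch, not part of Artifex or Transformers: the helper names `normalize_entities` and `group_by_label` are hypothetical, and the sample list is copied from the example output above (with scores written as plain floats) so the snippet runs without downloading the model.

```python
from collections import defaultdict


def normalize_entities(text, entities):
    """Rebuild each entity from its character offsets and cast scores to float."""
    cleaned = []
    for ent in entities:
        start, end = ent["start"], ent["end"]
        cleaned.append({
            "label": ent["entity_group"],
            # Slice the input by offsets instead of trusting 'word',
            # which may start with a tokenizer-introduced space.
            "text": text[start:end],
            # np.float32 -> built-in float, so the result is JSON-serializable.
            "score": float(ent["score"]),
            "start": start,
            "end": end,
        })
    return cleaned


def group_by_label(entities):
    """Map each label to the list of surface strings found for it."""
    grouped = defaultdict(list)
    for ent in entities:
        grouped[ent["label"]].append(ent["text"])
    return dict(grouped)


text = "John landed in Barcelona at 15:45."
# Sample output copied from the example above (hypothetical stand-in for ner(text)).
raw = [
    {"entity_group": "PERSON", "score": 0.92174554, "word": "John", "start": 0, "end": 4},
    {"entity_group": "LOCATION", "score": 0.9853817, "word": " Barcelona", "start": 15, "end": 24},
    {"entity_group": "TIME", "score": 0.98645407, "word": " 15:45.", "start": 28, "end": 34},
]

entities = normalize_entities(text, raw)
print(group_by_label(entities))
```

Note that the offsets of the `TIME` entity in the example include the sentence-final period, so the recovered text is `15:45.`; trim trailing punctuation separately if your application needs it.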
Model Description
- Base model: FacebookAI/roberta-base
- Task: Token classification (Named Entity Recognition)
- Languages: Multilingual (16+ languages)
- Fine-tuning data: A synthetic, custom dataset of around 10,000 passages, each containing multiple named entities across 14 categories.
Training Details
This model was trained using the Artifex Python library (pip install artifex) by providing the following instructions and generating 10,000 synthetic training samples:
from artifex import Artifex
ner = Artifex().named_entity_recognition
ner.train(
    named_entities={
        "PERSON": "Individual people, fictional characters",
        "ORG": "Companies, institutions, agencies",
        "LOCATION": "Geographical areas",
        "DATE": "Absolute or relative dates, including years, months and/or days",
        "TIME": "Specific time of the day",
        "PERCENT": "Percentage expressions",
        "NUMBER": "Numeric measurements or expressions",
        "FACILITY": "Buildings, airports, highways, etc.",
        "PRODUCT": "Objects, vehicles, food, etc. bearing a specific name",
        "WORK_OF_ART": "Titles of creative works",
        "LANGUAGE": "Natural or programming languages",
        "NORP": "National, religious or political groups",
        "ADDRESS": "full addresses",
        "PHONE_NUMBER": "telephone numbers",
    },
    domain="general",
    num_samples=10000,
)
Intended Uses
This model is intended to:
- Extract and classify named entities from text in a variety of applications, such as chatbots, information extraction systems, and data analysis tools.
- Be used in multilingual contexts, supporting 16+ languages.
- Serve as a general-purpose NER model applicable across various industries and use cases.
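As one illustration of an information extraction use, the `start` and `end` offsets returned by the model can drive simple text redaction. The sketch below is an assumption-laden example rather than a feature of the model or of Artifex: `redact` is a hypothetical helper, the choice of labels to mask is arbitrary, and the entity list is copied from the usage example in this card so the snippet runs offline.

```python
def redact(text, entities, labels):
    """Replace every entity whose label is in `labels` with a [LABEL] placeholder."""
    redacted = text
    # Work right to left so that replacing one span does not shift
    # the character offsets of the spans that come before it.
    for ent in sorted(entities, key=lambda e: e["start"], reverse=True):
        if ent["entity_group"] in labels:
            placeholder = f"[{ent['entity_group']}]"
            redacted = redacted[:ent["start"]] + placeholder + redacted[ent["end"]:]
    return redacted


text = "John landed in Barcelona at 15:45."
# Sample entities copied from the usage example (hypothetical stand-in for ner(text)).
sample_entities = [
    {"entity_group": "PERSON", "score": 0.92174554, "word": "John", "start": 0, "end": 4},
    {"entity_group": "LOCATION", "score": 0.9853817, "word": " Barcelona", "start": 15, "end": 24},
    {"entity_group": "TIME", "score": 0.98645407, "word": " 15:45.", "start": 28, "end": 34},
]

masked = redact(text, sample_entities, labels={"PERSON", "LOCATION"})
print(masked)
```

The same pattern applies to the `ADDRESS` and `PHONE_NUMBER` categories; how reliably the model detects them in your data should be checked before relying on it for anonymization.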
Not intended for:
- Highly specialized domains requiring custom entity types not covered by the 14 categories in this model.
- Idioms, slang, or very informal text where entity recognition may be less reliable.
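Where recognition may be less reliable, one common mitigation is to discard predictions below a confidence threshold using the `score` field. This is a minimal sketch under stated assumptions: `filter_by_score` is a hypothetical helper, and the 0.95 cutoff is purely illustrative, not a value recommended by the model authors. The sample entities are copied from the usage example in this card.

```python
def filter_by_score(entities, min_score=0.95):
    """Keep only entities whose confidence is at least `min_score`."""
    return [ent for ent in entities if float(ent["score"]) >= min_score]


# Sample entities copied from the usage example (hypothetical stand-in for ner(text)).
predictions = [
    {"entity_group": "PERSON", "score": 0.92174554, "word": "John", "start": 0, "end": 4},
    {"entity_group": "LOCATION", "score": 0.9853817, "word": " Barcelona", "start": 15, "end": 24},
    {"entity_group": "TIME", "score": 0.98645407, "word": " 15:45.", "start": 28, "end": 34},
]

confident = filter_by_score(predictions, min_score=0.95)
print([ent["entity_group"] for ent in confident])
```

A suitable threshold depends on the application; tune it on a labeled sample of your own text rather than reusing the value shown here.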