Commit 7e29fe3 · committed by lbourdois · verified · 1 Parent(s): 64c51d4

Update model card for Kannada

  1. README.md +74 -69
README.md CHANGED
@@ -1,69 +1,74 @@
- ---
- pipeline_tag: sentence-similarity
- language: kan
- license: mit
- tags:
- - trimmed
- library_name: sentence-transformers
- base_model: BAAI/bge-m3
- base_model_relation: quantized
- datasets:
- - Lumberjackk/fineweb-2-trimming
- ---
-
- # bge-m3-kan-16384
-
- This model is a **42.14% smaller** version of [BAAI/bge-m3](https://huggingface.co/BAAI/bge-m3) optimized for Kannada language via vocabulary size reduction using the [trimming](https://huggingface.co/blog/introduction-to-trimming) method.
-
- This trimmed model should perform similarly to the original model with only **16,384 tokens** and a much smaller memory footprint. However, it may not perform well for other languages as tokens not commonly used in Kannada were removed from the vocabulary.
-
- ## Model Statistics
-
- | Metric | Original | Trimmed | Reduction |
- |--------|----------|---------|-----------|
- | **Vocabulary size** | 250,002 tokens | 16,384 tokens | **93.45%** |
- | **Model size** | 567,754,752 params | 328,529,920 params | **42.14%** |
-
-
- ## Mining Dataset Statistics
-
- - **Number of texts used for mining**: 200,000 texts
- - **Dataset**: [Lumberjackk/fineweb-2-trimming](https://huggingface.co/datasets/Lumberjackk/fineweb-2-trimming)
-
- ![image](https://cdn-uploads.huggingface.co/production/uploads/613b0a62a14099d5afed7830/7UlOxvIMVUm--Wexm9yyz.png)
-
- ## Usage
-
- ```python
- from sentence_transformers import SentenceTransformer
- # Download from the 🤗 Hub
- model = SentenceTransformer("lbourdois/bge-m3-kan-16384")
- # Run inference with queries and documents
- query = "My query"
- documents = [
- "Chunk 1",
- "Chunk 2",
- "Chunk 3",
- ]
- query_embeddings = model.encode_query(query)
- document_embeddings = model.encode_document(documents)
- print(query_embeddings.shape, document_embeddings.shape)
- # Compute similarities to determine a ranking
- similarities = model.similarity(query_embeddings, document_embeddings)
- print(similarities)
- ```
-
- ## Citation
-
- #### BGE M3-Embedding
-
- ```bibtex
- @misc{bge-m3,
- title={BGE M3-Embedding: Multi-Lingual, Multi-Functionality, Multi-Granularity Text Embeddings Through Self-Knowledge Distillation},
- author={Jianlv Chen and Shitao Xiao and Peitian Zhang and Kun Luo and Defu Lian and Zheng Liu},
- year={2024},
- eprint={2402.03216},
- archivePrefix={arXiv},
- primaryClass={cs.CL}
- }
- ```
+ ---
+ pipeline_tag: sentence-similarity
+ language: kan
+ license: mit
+ tags:
+ - trimmed
+ library_name: sentence-transformers
+ base_model: BAAI/bge-m3
+ base_model_relation: quantized
+ datasets:
+ - lbourdois/fineweb-2-trimming
+ ---
+
+ # bge-m3-kan-16384
+ This model is a **42.14% smaller** version of [BAAI/bge-m3](https://huggingface.co/BAAI/bge-m3) optimized for Kannada language via vocabulary size reduction using the [trimming](https://huggingface.co/blog/lbourdois/introduction-to-trimming) method.
+ This trimmed model should perform similarly to the original model with only 16,384 tokens and a much smaller memory footprint. However, it may not perform well for other languages as tokens not commonly used in the selected languages were removed from the vocabulary.
+
+ ## Model Statistics
+
+ | Metric | Original | Trimmed | Reduction |
+ |--------|----------|---------|-----------|
+ | **Vocabulary size** | 250,002 tokens | 16,384 tokens | **93.45%** |
+ | **Model size** | 567,754,752 params | 328,529,920 params | **42.14%** |
+
+ ![image](https://raw.githubusercontent.com/lbourdois/blog/refs/heads/master/assets/images/Trimming/bge-m3-16384.png)
+
+ ## Mining Dataset Statistics
+ - **Number of texts used for mining**: 200,000 texts
+ - **Dataset**: [lbourdois/fineweb-2-trimming](https://huggingface.co/datasets/lbourdois/fineweb-2-trimming)
+
+ ## Usage
+
+ ```python
+ from sentence_transformers import SentenceTransformer
+ # Download from the 🤗 Hub
+ model = SentenceTransformer("alphaedge-ai/bge-m3-kan-16384")
+ # Run inference with queries and documents
+ query = "My query in Kannada"
+ documents = [
+ "Chunk in Kannada",
+ "Chunk in Kannada",
+ "Chunk in Kannada",
+ ]
+ query_embeddings = model.encode_query(query)
+ document_embeddings = model.encode_document(documents)
+ print(query_embeddings.shape, document_embeddings.shape)
+ # Compute similarities to determine a ranking
+ similarities = model.similarity(query_embeddings, document_embeddings)
+ print(similarities)
+ ```
+
+ ## Citations
+
+ #### BGE-M3
+ ```
+ @misc{bge-m3,
+ title={BGE M3-Embedding: Multi-Lingual, Multi-Functionality, Multi-Granularity Text Embeddings Through Self-Knowledge Distillation},
+ author={Jianlv Chen and Shitao Xiao and Peitian Zhang and Kun Luo and Defu Lian and Zheng Liu},
+ year={2024},
+ eprint={2402.03216},
+ archivePrefix={arXiv},
+ primaryClass={cs.CL}
+ }
+ ```
+
+ #### Trimming blog post
+ ```
+ @misc{hf_blogpost_trimming,
+ title={Introduction to Trimming},
+ author={Loïck BOURDOIS and Tom AARSEN and Bram VANROY and Christopher AKIKI and Woojun JUNG and Manuel ROMERO and Prithiv SAKTHI},
+ year={2026},
+ url={https://huggingface.co/blog/lbourdois/introduction-to-trimming},
+ }
+ ```
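
Reviewer note: the reduction percentages in the Model Statistics table are unchanged by this commit and can be sanity-checked with a short calculation. The sketch below is not part of the README. It uses only the four figures from the table, plus one assumption that the diff does not state: an embedding width of 1024 for the base model (the name `HIDDEN_SIZE` is introduced here for illustration).

```python
# Figures copied from the "Model Statistics" table in the README diff.
ORIGINAL_VOCAB = 250_002
TRIMMED_VOCAB = 16_384
ORIGINAL_PARAMS = 567_754_752
TRIMMED_PARAMS = 328_529_920

# Assumption (not stated in the diff): each token embedding has 1024 dimensions.
HIDDEN_SIZE = 1024


def reduction_pct(original: int, trimmed: int) -> float:
    """Return the relative reduction from original to trimmed, in percent."""
    return 100 * (original - trimmed) / original


vocab_reduction = reduction_pct(ORIGINAL_VOCAB, TRIMMED_VOCAB)
param_reduction = reduction_pct(ORIGINAL_PARAMS, TRIMMED_PARAMS)

# If only embedding rows were removed, the parameter difference should equal
# the number of removed tokens times the embedding width.
removed_tokens = ORIGINAL_VOCAB - TRIMMED_VOCAB
removed_params = ORIGINAL_PARAMS - TRIMMED_PARAMS
embedding_params_removed = removed_tokens * HIDDEN_SIZE

print(f"Vocabulary reduction: {vocab_reduction:.2f}%")
print(f"Parameter reduction:  {param_reduction:.2f}%")
print(f"Removed params match removed embedding rows: {removed_params == embedding_params_removed}")
```

Both percentages round to the values in the table (93.45% and 42.14%). Under the 1024-dimension assumption, the parameter difference matches the removed embedding rows exactly, which is consistent with the card's description of trimming as a vocabulary size reduction.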