AbhishekBiswas12 committed on
Commit
a5a7df2
·
verified ·
1 Parent(s): 1378226

Update README.md

  1. README.md +224 -77
README.md CHANGED
@@ -18,93 +18,155 @@ pretty_name: Word2Vec Hindi Embeddings
18
  ---
19
  # Word2Vec_hindi
20
 
21
- Welcome to **Word2Vec_hindi**
22
- This project is my attempt at implementing the **Word2Vec model from scratch**, specifically for the **Hindi language**.
23
 
24
- The primary goal of this project is learning by building: understanding how word embeddings work by implementing them myself rather than relying on high-level libraries.
25
 
26
  Feel free to explore the project, experiment with it, and raise issues or suggestions. While I may not implement every suggestion, I genuinely appreciate feedback and ideas.
27
 
28
  ---
29
 
30
- ## Project Status
31
 
32
- This is an evolving project, and I will continue improving it as I deepen my understanding of NLP and representation learning.
33
 
34
  ---
35
 
36
- ## Latest Updates
37
 
38
- - Combined 5 datasets:
39
- 1. kapilverma/hindi-bible
40
- 2. aiswaryaramachandran/hindienglish-corpora
41
- 3. preetviradiya/english-hindi-dataset
42
- 4. vaibhavkumar11/hindi-english-parallel-corpus
43
- 5. disisbig/hindi-wikipedia-articles-172k
44
- - The combined text data has almost 82M tokens. The dataset was broken into a vocabulary of over 500K unique words.
45
- - Now the context window has been increased to 5, which creates 10 instances of **(context, target)** pairs for each instance of a **context** word
46
-
47
  ---
48
 
49
- ## Datasets Used
50
 
51
- 1. **Hindi Bible**
52
- Source: https://www.kaggle.com/datasets/kapilverma/hindi-bible
 
53
 
54
- 2. **Hindi-English Corpora**
55
- Source: https://www.kaggle.com/datasets/aiswaryaramachandran/hindienglish-corpora
 
56
 
57
- 3. **English-Hindi Dataset**
58
- Source: https://www.kaggle.com/datasets/preetviradiya/english-hindi-dataset
 
59
 
60
- 3. **IIT Bombay English-Hindi Translation Dataset**
61
- Source: https://www.kaggle.com/datasets/vaibhavkumar11/hindi-english-parallel-corpus
62
-
63
- 3. **Hindi Wikipedia Articles - 172k**
64
- Source: https://www.kaggle.com/datasets/disisbig/hindi-wikipedia-articles-172k
65
-
 
66
 
67
  ---
68
 
69
- ## Dataset Preprocessing
 
 
70
 
71
- The preprocessing pipeline includes:
 
 
 
 
 
 
72
 
73
- - Concatenating Hindi text from all datasets
74
- - Removing punctuation and noise
75
- - Tokenizing text into words
76
- - Building a vocabulary from the corpus
77
 
78
- ### Key Improvements
79
 
80
- #### 1. Vocabulary Pruning
81
 
82
- Instead of using all unique words, I now keep only words that appear more than 5 times in the dataset. This helps:
83
  - Reduce vocabulary size
84
  - Improve training efficiency
85
- - Remove noisy/rare words
86
 
87
- #### 2. Context Window Update
88
- - Previous window size: 3
89
- - Current window size: 5
90
- This allows the model to:
91
- - Capture broader context
92
- - Learn better semantic relationships
93
 
94
- ## Training Data Generation
 
 
95
 
96
- For each word in a sentence:
97
- - Treat it as the center (context) word
98
- - Select surrounding words within the window as target words
99
 
100
- #### Example
101
  Sentence:
 
102
  ```text
103
  आज सुबह मैंने अपने पुराने दोस्त के साथ बाजार में चाय पी
104
  ```
105
- If the context word is: दोस्त
106
 
107
- With a window size of 5, surrounding words are used to create pairs like:
108
  ```text
109
  [दोस्त, सुबह]
110
  [दोस्त, मैंने]
@@ -117,50 +179,135 @@ With a window size of 5, surrounding words are used to create pairs like:
117
  [दोस्त, चाय]
118
  [दोस्त, पी]
119
  ```
120
- This process is repeated for all words in the corpus to generate training pairs.
121
 
122
- ## Negative Sampling
 
 
 
 
123
 
124
- In addition to positive pairs, I also generate negative samples:
125
 
126
- Random words are selected from the vocabulary
127
- These words do not appear in the context window of the center word
128
- They are paired with the center word as negative examples
129
 
130
- Example:
131
  ```text
132
- [दोस्त, किताब]
133
- [दोस्त, पहाड़]
134
  [दोस्त, कंप्यूटर]
 
 
135
  ```
136
- These pairs represent words that are unlikely to co-occur with the context word.
137
 
138
- ### Why Negative Sampling?
139
 
140
- Negative sampling helps the model:
141
- - Learn to distinguish between relevant and irrelevant word pairs
142
- - Improve embedding quality
143
- - Reduce computational cost compared to full softmax
144
 
145
- This process is repeated for all words in the corpus to generate training data.
146
 
147
- ## Model Overview (Current)
148
- - Architecture: Skip-gram style training
149
- - Embeddings learned using word2vec training
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
150
  - Optimizer: Adagrad
151
  - Loss Function: BCEWithLogitsLoss
152
- - Training uses both positive and negative word pairs
153
 
154
- ## Contributions
155
- This is primarily a learning project, but suggestions and ideas are always welcome.
156
 
157
- ## References
158
  - https://jalammar.github.io/illustrated-word2vec/
159
  - https://medium.com/@manansuri/a-dummys-guide-to-word2vec-456444f3c673
160
  - https://jaketae.github.io/study/word2vec/
161
 
162
- ## Author
 
 
 
163
  ```text
164
  Abhishek Biswas
165
- Software Developer | Interested in AI & Web Development
166
- ```
 
18
  ---
19
  # Word2Vec_hindi
20
 
21
+ Welcome to **Word2Vec_hindi**
 
22
 
23
+ This project is my attempt at implementing the **Word2Vec model completely from scratch**, specifically for the **Hindi language**.
24
+
25
+ The primary goal of this project is learning by building: understanding how word embeddings work internally by implementing the entire pipeline myself instead of relying on high-level NLP libraries.
26
+
27
+ The project currently includes:
28
+ - Dataset collection and preprocessing
29
+ - Vocabulary generation
30
+ - Skip-gram pair generation
31
+ - Negative sampling
32
+ - Custom PyTorch training pipeline
33
+ - Embedding evaluation and visualization
34
 
35
  Feel free to explore the project, experiment with it, and raise issues or suggestions. While I may not implement every suggestion, I genuinely appreciate feedback and ideas.
36
 
37
  ---
38
 
39
+ # Project Status
40
+
41
+ This project has evolved from a small experimental implementation into a large-scale embedding training pipeline.
42
+
43
+ Current progress includes:
44
+ - Training on a corpus containing over **82M Hindi tokens**
45
+ - Generating over **1.5 billion skip-gram training pairs**
46
+ - Training multiple embedding models with dimensions from **300 to 400**
47
+ - Evaluating embeddings using:
48
+ - cosine similarity
49
+ - nearest-neighbor retrieval
50
+ - analogy testing
51
+ - embedding visualization using PCA and t-SNE
52
+
53
+ The current best-performing model:
54
+ - Embedding Size: **350**
55
+ - Training Loss: **~0.38**
56
+ - Validation Loss: **~0.47**
57
 
58
+ The model is now producing meaningful semantic separation between positive and negative word pairs.
59
 
60
  ---
61
 
62
+ # Latest Updates
63
+
64
+ - Combined 5 large Hindi datasets into a single training corpus
65
+ - Final corpus size reached approximately **82M tokens**
66
+ - Vocabulary built from words occurring at least **2 times**
67
+ - Final vocabulary size exceeds **500K unique words**
68
+ - Context window size increased from **3 → 5**
69
+ - Generated approximately:
70
+ - **1.5 billion skip-gram training pairs**
71
+ - **40M validation pairs**
72
+ - **40M testing pairs**
73
+ - Implemented:
74
+ - Skip-gram training
75
+ - Negative sampling
76
+ - BCEWithLogitsLoss training objective
77
+ - Adagrad optimizer
78
+ - Added support for:
79
+ - PCA embedding visualization
80
+ - t-SNE embedding visualization
81
+ - cosine similarity search
82
+ - analogy-based embedding evaluation
83
 
84
  ---
85
 
86
+ # Datasets Used
87
 
88
+ ## 1. Hindi Bible
89
+ Source:
90
+ https://www.kaggle.com/datasets/kapilverma/hindi-bible
91
 
92
+ ## 2. Hindi-English Corpora
93
+ Source:
94
+ https://www.kaggle.com/datasets/aiswaryaramachandran/hindienglish-corpora
95
 
96
+ ## 3. English-Hindi Dataset
97
+ Source:
98
+ https://www.kaggle.com/datasets/preetviradiya/english-hindi-dataset
99
 
100
+ ## 4. IIT Bombay English-Hindi Translation Dataset
101
+ Source:
102
+ https://www.kaggle.com/datasets/vaibhavkumar11/hindi-english-parallel-corpus
103
+
104
+ ## 5. Hindi Wikipedia Articles - 172k
105
+ Source:
106
+ https://www.kaggle.com/datasets/disisbig/hindi-wikipedia-articles-172k
107
 
108
  ---
109
 
110
+ # Dataset Preprocessing
111
+
112
+ The preprocessing pipeline currently includes:
113
 
114
+ - Combining Hindi text from multiple datasets
115
+ - Cleaning punctuation and noisy symbols
116
+ - Tokenizing text into words
117
+ - Building vocabulary mappings
118
+ - Removing extremely rare words
119
+ - Generating skip-gram training pairs
120
+ - Generating negative samples
121
 
122
+ ---
 
 
 
123
 
124
+ # Vocabulary Pruning
125
 
126
+ Instead of keeping every unique token, only words appearing at least **2 times** are retained.
127
 
128
+ This helps:
129
  - Reduce vocabulary size
130
  - Improve training efficiency
131
+ - Remove noisy and corrupted tokens
132
+ - Improve embedding quality
133
+
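A minimal sketch of this pruning step, assuming the corpus is already tokenized into a flat list of words. The function name `build_vocab` and the toy sentence are hypothetical and only illustrate the minimum-count rule; this is not the project's actual code.

```python
from collections import Counter


def build_vocab(tokens, min_count=2):
    """Map every token seen at least `min_count` times to an integer id."""
    counts = Counter(tokens)
    # most_common() orders by descending frequency, so frequent words get small ids.
    kept = [word for word, n in counts.most_common() if n >= min_count]
    return {word: idx for idx, word in enumerate(kept)}


# Toy corpus: only "राम" and "गया" occur twice, so only they survive pruning.
tokens = "राम घर गया और राम बाजार गया".split()
vocab = build_vocab(tokens, min_count=2)
print(vocab)
```

Tokens that fall below the threshold can either be dropped from the corpus or mapped to a shared unknown-word id; which of the two this project does is not stated here.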
134
+ ---
135
+
136
+ # Context Window
137
+
138
+ - Previous context window size: **3**
139
+ - Current context window size: **5**
140
+
141
+ With a window size of 5:
142
+ - each center word can generate up to 10 positive pairs
143
+ - broader semantic context can be captured
144
+ - embeddings learn richer relationships
145
+
146
+ ---
147
 
148
+ # Training Data Generation
 
 
 
 
 
149
 
150
+ For each word:
151
+ - The word is treated as the **center** word
152
+ - Neighboring words within the context window are treated as positive target words
153
 
154
+ ## Example
 
 
155
 
 
156
  Sentence:
157
+
158
  ```text
159
  आज सुबह मैंने अपने पुराने दोस्त के साथ बाजार में चाय पी
160
  ```
 
161
 
162
+ If the center word is:
163
+
164
+ ```text
165
+ दोस्त
166
+ ```
167
+
168
+ Generated positive pairs:
169
+
170
  ```text
171
  [दोस्त, सुबह]
172
  [दोस्त, मैंने]
 
179
  [दोस्त, चाय]
180
  [दोस्त, पी]
181
  ```
 
182
 
183
+ This process is repeated across the entire corpus to generate training pairs.
184
+
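The pair generation described above can be sketched as follows. This assumes a symmetric window counted in tokens (up to 5 words on each side of the center word) and whitespace tokenization; `skipgram_pairs` is a hypothetical helper name. Under this assumption the neighbors of दोस्त run from आज through चाय, which may not match the exact pairs the project's own generator emits.

```python
def skipgram_pairs(tokens, window=5):
    """Yield (center, neighbor) pairs for every token in one sentence."""
    for i, center in enumerate(tokens):
        lo = max(0, i - window)                 # clip the window at the sentence start
        hi = min(len(tokens), i + window + 1)   # and at the sentence end
        for j in range(lo, hi):
            if j != i:                          # a word is never paired with itself
                yield center, tokens[j]


sentence = "आज सुबह मैंने अपने पुराने दोस्त के साथ बाजार में चाय पी".split()
dost_pairs = [p for p in skipgram_pairs(sentence, window=5) if p[0] == "दोस्त"]
for pair in dost_pairs:
    print(pair)
```

Words near the start or end of a sentence produce fewer than 10 pairs because the window is clipped, which is why the count is an upper bound.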
185
+ ---
186
+
187
+ # Negative Sampling
188
 
189
+ In addition to positive pairs, negative samples are generated.
190
 
191
+ Random vocabulary words that do not appear in the context window are paired with the center word.
192
+
193
+ ## Example
194
 
 
195
  ```text
 
 
196
  [दोस्त, कंप्यूटर]
197
+ [दोस्त, पहाड़]
198
+ [दोस्त, विज्ञान]
199
  ```
 
200
 
201
+ These represent unlikely co-occurrences.
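
One way to draw such negatives, as a sketch only: it assumes uniform sampling over the vocabulary and excludes the center word along with its window neighbors. The helper name `sample_negatives` and the tiny vocabulary are hypothetical.

```python
import random


def sample_negatives(center, context_words, vocab_words, k=3, rng=random):
    """Draw k vocabulary words that are neither the center word nor in its window."""
    excluded = set(context_words) | {center}
    candidates = [w for w in vocab_words if w not in excluded]
    if not candidates:
        raise ValueError("no vocabulary word is available as a negative sample")
    return [(center, rng.choice(candidates)) for _ in range(k)]


vocab_words = ["दोस्त", "के", "साथ", "कंप्यूटर", "पहाड़", "विज्ञान"]
negatives = sample_negatives(
    "दोस्त", ["के", "साथ"], vocab_words, k=3, rng=random.Random(0)
)
print(negatives)
```

Filtering the whole vocabulary on every call is fine for a toy example; with a vocabulary of 500K words, drawing a random word and retrying when it lands in the excluded set would avoid rebuilding the candidate list each time.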
202
 
203
+ ---
 
 
 
204
 
205
+ # Why Negative Sampling?
206
 
207
+ Negative sampling helps:
208
+ - Learn meaningful semantic separation
209
+ - Distinguish related vs unrelated words
210
+ - Scale training efficiently to very large vocabularies
211
+ - Avoid the computational cost of full softmax
212
+
213
+ ---
214
+
215
+ # Model Architecture
216
+
217
+ Current training setup:
218
+
219
+ - Architecture: Skip-gram Word2Vec
220
+ - Framework: PyTorch
221
+ - Embedding dimensions tested:
222
+ - 300
223
+ - 350
224
+ - 400
225
+ - Best-performing embedding size so far: **350**
226
  - Optimizer: Adagrad
227
  - Loss Function: BCEWithLogitsLoss
228
+ - Training uses:
229
+ - positive skip-gram pairs
230
+ - negative sampled pairs
231
+
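PyTorch's `BCEWithLogitsLoss` applies a sigmoid to a raw score and takes the binary cross-entropy against a 0/1 label. The sketch below reproduces that per-pair quantity in plain Python so the objective is easy to inspect. It assumes the score of a pair is the dot product of the two word vectors, which is the usual skip-gram choice but is not spelled out in this README; the vectors are made-up toy values.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors, used here as the pair's logit."""
    return sum(a * b for a, b in zip(u, v))


def sigmoid(x):
    """Logistic function, written to avoid overflow for large negative x."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def bce_with_logits(logit, label):
    """Numerically stable binary cross-entropy on a raw score.

    Equals -log(sigmoid(logit)) for label 1 and -log(1 - sigmoid(logit)) for label 0.
    """
    return max(logit, 0.0) - logit * label + math.log1p(math.exp(-abs(logit)))


center = [0.5, 1.0]        # toy center-word vector (hypothetical values)
positive = [1.0, 2.0]      # toy vector of a true neighbor
negative = [-1.0, -1.5]    # toy vector of a sampled negative

pos_logit = dot(center, positive)
neg_logit = dot(center, negative)
print(sigmoid(pos_logit), bce_with_logits(pos_logit, 1.0))
print(sigmoid(neg_logit), bce_with_logits(neg_logit, 0.0))
```

Training pushes the sigmoid output toward 1 for positive pairs and toward 0 for negative pairs. By default `BCEWithLogitsLoss` averages this per-pair loss over the batch.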
232
+ ---
233
+
234
+ # Current Results
235
+
236
+ The model now learns strong separation between positive and negative pairs.
237
 
238
+ Observed predicted probabilities (sigmoid of the model's logit):
239
+ - Positive pairs: ~0.94
240
+ - Negative pairs: ~0.07
241
+
242
+ The embeddings are beginning to capture:
243
+ - semantic similarity
244
+ - contextual relationships
245
+ - syntactic structure
246
+
247
+ ---
248
+
249
+ # Embedding Evaluation
250
+
251
+ Current evaluation methods include:
252
+
253
+ ## 1. Cosine Similarity
254
+
255
+ Used to retrieve semantically similar words.
256
+
257
+ Example goals:
258
+
259
+ ```text
260
+ राजा → रानी, सम्राट, शासक
261
+ ```
262
+
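A small sketch of cosine-similarity retrieval, assuming the trained embeddings are available as a plain dictionary from word to vector. The three-dimensional vectors below are invented for illustration and are not taken from the trained model; `nearest` is a hypothetical helper name.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def nearest(word, embeddings, top_k=3):
    """Return the top_k most similar words to `word`, excluding the word itself."""
    query = embeddings[word]
    scored = [
        (other, cosine(query, vec))
        for other, vec in embeddings.items()
        if other != word
    ]
    return sorted(scored, key=lambda item: item[1], reverse=True)[:top_k]


# Toy embeddings (hypothetical values chosen so the royal words point the same way).
toy = {
    "राजा": [0.9, 0.8, 0.1],
    "रानी": [0.85, 0.75, 0.2],
    "शासक": [0.8, 0.9, 0.0],
    "चाय": [0.0, 0.1, 0.95],
}
neighbors = nearest("राजा", toy, top_k=2)
print(neighbors)
```

With a 500K-word vocabulary, the same ranking is usually computed as one matrix-vector product over normalized embeddings rather than a Python loop.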
263
+ ---
264
+
265
+ ## 2. Analogy Testing
266
+
267
+ Evaluating vector arithmetic relationships such as:
268
+
269
+ ```text
270
+ राजा - पुरुष + महिला ≈ रानी
271
+ ```
272
+
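The analogy test can be sketched as a nearest-neighbor search around the combined vector. The two-dimensional toy vectors are hypothetical and constructed so the arithmetic works out exactly; real embeddings only approximate this. Excluding the three query words from the candidates is a common convention, assumed here.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def analogy(a, b, c, embeddings):
    """Return the word closest to vec(a) - vec(b) + vec(c), excluding a, b and c."""
    target = [
        x - y + z
        for x, y, z in zip(embeddings[a], embeddings[b], embeddings[c])
    ]
    candidates = [w for w in embeddings if w not in {a, b, c}]
    return max(candidates, key=lambda w: cosine(target, embeddings[w]))


# Toy vectors: first axis is "royalty", second axis is "gender" (hypothetical).
toy = {
    "राजा": [1.0, 1.0],
    "रानी": [1.0, -1.0],
    "पुरुष": [0.0, 1.0],
    "महिला": [0.0, -1.0],
    "बाजार": [0.2, 0.1],
}
result = analogy("राजा", "पुरुष", "महिला", toy)
print(result)
```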
273
+ ---
274
+
275
+ ## 3. Embedding Visualization
276
+
277
+ PCA and t-SNE are used to visualize learned word clusters in 2D space.
282
+
283
+ ---
284
+
285
+ # Future Improvements
286
+
287
+ Planned improvements include:
288
+
289
+ - Subsampling extremely frequent words
290
+ - Improved negative sampling strategies
291
+
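For the first planned item, one common formulation is the subsampling rule from the original word2vec paper: a word with relative frequency f is kept with probability sqrt(t / f), capped at 1, for a small threshold t. The sketch below illustrates that rule only; it is not a description of how this project will implement it, and the helper names and toy corpus are hypothetical.

```python
import math
import random
from collections import Counter


def keep_probability(freq, threshold=1e-5):
    """Probability of keeping a word whose relative corpus frequency is `freq`."""
    return min(1.0, math.sqrt(threshold / freq))


def subsample(tokens, threshold=1e-5, rng=random):
    """Randomly drop frequent tokens; words rarer than the threshold are always kept."""
    counts = Counter(tokens)
    total = len(tokens)
    return [
        t for t in tokens
        if rng.random() < keep_probability(counts[t] / total, threshold)
    ]


# Toy corpus with one very frequent word; the large threshold is only for the demo.
tokens = ["और"] * 90 + ["दोस्त"] * 10
kept = subsample(tokens, threshold=0.05, rng=random.Random(0))
print(Counter(kept))
```

The frequent word और is thinned much more aggressively than दोस्त, which reduces the number of uninformative pairs without removing any word from the vocabulary.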
292
+ ---
293
+
294
+ # Contributions
295
+
296
+ This is primarily a learning and research-oriented project, but suggestions, ideas, and feedback are always welcome.
297
+
298
+ ---
299
+
300
+ # References
301
 
 
302
  - https://jalammar.github.io/illustrated-word2vec/
303
  - https://medium.com/@manansuri/a-dummys-guide-to-word2vec-456444f3c673
304
  - https://jaketae.github.io/study/word2vec/
305
 
306
+ ---
307
+
308
+ # Author
309
+
310
  ```text
311
  Abhishek Biswas
312
+ Software Developer | Interested in AI, NLP, and Web Development
313
+ ```