@@ -18,93 +18,155 @@ pretty_name: Word2Vec Hindi Embeddings
 ---
 # Word2Vec_hindi
-Welcome to **Word2Vec_hindi**
-This project is my attempt at implementing the **Word2Vec model from scratch**, specifically for the **Hindi language**.
-The primary goal of this project is learning by building — understanding how word embeddings work by implementing them myself rather than relying on high-level libraries.
 Feel free to explore the project, experiment with it, and raise issues or suggestions. While I may not implement every suggestion, I genuinely appreciate feedback and ideas.
 ---
-## Project Status
-This is an evolving project, and I will continue improving it as I deepen my understanding of NLP and representation learning.
 ---
-## Latest Updates
-- Combined 5 datasets:
-   1. kapilverma/hindi-bible
-   2. aiswaryaramachandran/hindienglish-corpora
-   3. preetviradiya/english-hindi-dataset
-   4. vaibhavkumar11/hindi-english-parallel-corpus
-   5. disisbig/hindi-wikipedia-articles-172k
-- The combined text data has almost 82M tokens. The dataset was broken into a vocabulary of over 500K unique words.
-- Now the context window has been increased to 5, which creates 10 instances of **(context, target)** pairs for each instance of a **context** word
 ---
-## Datasets Used
-1. **Hindi Bible**
-   Source: https://www.kaggle.com/datasets/kapilverma/hindi-bible
-2. **Hindi-English Corpora**
-   Source: https://www.kaggle.com/datasets/aiswaryaramachandran/hindienglish-corpora
-3. **English-Hindi Dataset**
-   Source: https://www.kaggle.com/datasets/preetviradiya/english-hindi-dataset
-3. **IIT Bombay English-Hindi Translation Dataset**
-   Source: https://www.kaggle.com/datasets/vaibhavkumar11/hindi-english-parallel-corpus
-3. **Hindi Wikipedia Articles - 172k**
-   Source: https://www.kaggle.com/datasets/disisbig/hindi-wikipedia-articles-172k
 ---
-## Dataset Preprocessing
-The preprocessing pipeline includes:
-- Concatenating Hindi text from all datasets
-- Removing punctuation and noise
-- Tokenizing text into words
-- Building a vocabulary from the corpus
-### Key Improvements
-#### 1. Vocabulary Pruning
-Instead of using all unique words, I now keep only words that appear more than 5 times in the dataset This helps:
 - Reduce vocabulary size
 - Improve training efficiency
-- Remove noisy/rare words
-#### 2. Context Window Update
-- Previous window size: 3
-- Current window size: 5
-This allows the model to:
-- Capture broader context
-- Learn better semantic relationships
-## Training Data Generation
-For each word in a sentence:
-- Treat it as the center (context) word
-- Select surrounding words within the window as target words
-#### Example
 Sentence:
 ```text
 आज सुबह मैंने अपने पुराने दोस्त के साथ बाजार में चाय पी
 ```
-If the context word is: दोस्त
-With a window size of 5, surrounding words are used to create pairs like:
 ```text
 [दोस्त, सुबह]
 [दोस्त, मैंने]
@@ -117,50 +179,135 @@ With a window size of 5, surrounding words are used to create pairs like:
 [दोस्त, चाय]
 [दोस्त, पी]
 ```
-This process is repeated for all words in the corpus to generate training pairs.
-## Negative Sampling
-In addition to positive pairs, I also generate negative samples:
-Random words are selected from the vocabulary
-These words do not appear in the context window of the center word
-They are paired with the center word as negative examples
-Example:
 ```text
-[दोस्त, किताब]
-[दोस्त, पहाड़]
 [दोस्त, कंप्यूटर]
 ```
-These pairs represent words that are unlikely to co-occur with the context word.
-### Why Negative Sampling?
-Negative sampling helps the model:
-- Learn to distinguish between relevant and irrelevant word pairs
-- Improve embedding quality
-- Reduce computational cost compared to full softmax
-This process is repeated for all words in the corpus to generate training data.
-## Model Overview (Current)
-- Architecture: Skip-gram style training
-- Embeddings learned using word2vec training
 - Optimizer: Adagrad
 - Loss Function: BCEWithLogitsLoss
-- Training uses both positive and negative word pairs
-## Contributions
-This is primarily a learning project, but suggestions and ideas are always welcome.
-## References
 - https://jalammar.github.io/illustrated-word2vec/
 - https://medium.com/@manansuri/a-dummys-guide-to-word2vec-456444f3c673
 - https://jaketae.github.io/study/word2vec/
-## Author
 ```text
 Abhishek Biswas
-Software Developer | Interested in AI & Web Development
-```

 ---
 # Word2Vec_hindi
+Welcome to **Word2Vec_hindi**
+This project is my attempt at implementing the **Word2Vec model completely from scratch**, specifically for the **Hindi language**.
+The primary goal of this project is learning by building — understanding how word embeddings work internally by implementing the entire pipeline myself instead of relying on high-level NLP libraries.
+The project currently includes:
+- Dataset collection and preprocessing
+- Vocabulary generation
+- Skip-gram pair generation
+- Negative sampling
+- Custom PyTorch training pipeline
+- Embedding evaluation and visualization
 Feel free to explore the project, experiment with it, and raise issues or suggestions. While I may not implement every suggestion, I genuinely appreciate feedback and ideas.
 ---
+# Project Status
+This project has evolved from a small experimental implementation into a large-scale embedding training pipeline.
+Current progress includes:
+- Training on a corpus containing over **82M Hindi tokens**
+- Generating over **1.5 billion skip-gram training pairs**
+- Training multiple embedding models with dimensions of **300, 350, and 400**
+- Evaluating embeddings using:
+  - cosine similarity
+  - nearest-neighbor retrieval
+  - analogy testing
+  - embedding visualization using PCA and t-SNE
+The current best-performing model:
+- Embedding Size: **350**
+- Training Loss: **~0.38**
+- Validation Loss: **~0.47**
+The model is now producing meaningful semantic separation between positive and negative word pairs.
 ---
+# Latest Updates
+- Combined 5 large Hindi datasets into a single training corpus
+- Final corpus size reached approximately **82M tokens**
+- Vocabulary built from words occurring at least **2 times**
+- Final vocabulary size exceeds **500K unique words**
+- Context window size increased from **3 → 5**
+- Generated approximately:
+  - **1.5 billion skip-gram training pairs**
+  - **40M validation pairs**
+  - **40M testing pairs**
+- Implemented:
+  - Skip-gram training
+  - Negative sampling
+  - BCEWithLogitsLoss training objective
+  - Adagrad optimizer
+- Added support for:
+  - PCA embedding visualization
+  - t-SNE embedding visualization
+  - cosine similarity search
+  - analogy-based embedding evaluation
 ---
+# Datasets Used
+## 1. Hindi Bible
+Source:
+https://www.kaggle.com/datasets/kapilverma/hindi-bible
+## 2. Hindi-English Corpora
+Source:
+https://www.kaggle.com/datasets/aiswaryaramachandran/hindienglish-corpora
+## 3. English-Hindi Dataset
+Source:
+https://www.kaggle.com/datasets/preetviradiya/english-hindi-dataset
+## 4. IIT Bombay English-Hindi Translation Dataset
+Source:
+https://www.kaggle.com/datasets/vaibhavkumar11/hindi-english-parallel-corpus
+## 5. Hindi Wikipedia Articles - 172k
+Source:
+https://www.kaggle.com/datasets/disisbig/hindi-wikipedia-articles-172k
 ---
+# Dataset Preprocessing
+The preprocessing pipeline currently includes:
+- Combining Hindi text from multiple datasets
+- Cleaning punctuation and noisy symbols
+- Tokenizing text into words
+- Building vocabulary mappings
+- Removing extremely rare words
+- Generating skip-gram training pairs
+- Generating negative samples
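The cleaning, tokenizing, and mapping steps above could look roughly like the sketch below. It assumes that "Hindi text" means characters in the Devanagari Unicode block and that tokens are separated by whitespace; `tokenize` and `build_mappings` are hypothetical names, not the project's actual functions.

```python
import re

# Assumption: keep only characters from the Devanagari block (U+0900 to U+097F).
NON_DEVANAGARI = re.compile(r"[^\u0900-\u097F\s]")
# The danda (।) and double danda (॥) are punctuation that live inside that block.
DANDA = re.compile(r"[\u0964\u0965]")


def tokenize(text):
    """Replace non-Devanagari characters and dandas with spaces, then split."""
    text = NON_DEVANAGARI.sub(" ", text)
    text = DANDA.sub(" ", text)
    return text.split()


def build_mappings(tokens):
    """Give each distinct token an integer id in first-seen order."""
    word_to_id = {}
    for tok in tokens:
        word_to_id.setdefault(tok, len(word_to_id))
    id_to_word = {i: w for w, i in word_to_id.items()}
    return word_to_id, id_to_word


tokens = tokenize("आज सुबह, मैंने (tea) चाय पी।")
word_to_id, id_to_word = build_mappings(tokens)
```

Latin text, digits, and symbols are dropped here, which suits a parallel corpus where the English side must be discarded; a real pipeline may want to keep some of them.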
+---
+# Vocabulary Pruning
+Instead of keeping every unique token, only words appearing at least **2 times** are retained.
+This helps:
 - Reduce vocabulary size
 - Improve training efficiency
+- Remove noisy and corrupted tokens
+- Improve embedding quality
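A minimal sketch of this pruning step, assuming the corpus is already a flat list of tokens. `prune_vocabulary` and `min_count` are illustrative names; only the threshold of 2 comes from the text above.

```python
from collections import Counter


def prune_vocabulary(tokens, min_count=2):
    """Return the set of words seen at least `min_count` times, plus raw counts."""
    counts = Counter(tokens)
    kept = {word for word, count in counts.items() if count >= min_count}
    return kept, counts


corpus = ["चाय", "दोस्त", "चाय", "बाजार", "दोस्त", "xyzzy"]
vocab, counts = prune_vocabulary(corpus)
# Tokens outside the pruned vocabulary are dropped before pair generation.
filtered = [tok for tok in corpus if tok in vocab]
```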
+---
+# Context Window
+- Previous context window size: **3**
+- Current context window size: **5**
+With a window size of 5:
+- each center word can generate up to 10 positive pairs (5 words on each side)
+- broader semantic context can be captured
+- embeddings learn richer relationships
+---
+# Training Data Generation
+For each word:
+- The word is treated as the **center** word
+- Neighboring words within the context window are treated as positive target words
+## Example
 Sentence:
 ```text
 आज सुबह मैंने अपने पुराने दोस्त के साथ बाजार में चाय पी
 ```
+If the center word is:
+```text
+दोस्त
+```
+Generated positive pairs:
 ```text
 [दोस्त, सुबह]
 [दोस्त, मैंने]
 [दोस्त, चाय]
 [दोस्त, पी]
 ```
+This process is repeated across the entire corpus to generate training pairs.
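The pair generation described above can be sketched as follows. This assumes whitespace tokenization and a symmetric window of 5 positions on each side; `skipgram_pairs` is a hypothetical helper, not the project's code.

```python
def skipgram_pairs(tokens, window=5):
    """Yield (center, target) for every target within `window` positions."""
    for i, center in enumerate(tokens):
        start = max(0, i - window)
        stop = min(len(tokens), i + window + 1)
        for j in range(start, stop):
            if j != i:
                yield center, tokens[j]


sentence = "आज सुबह मैंने अपने पुराने दोस्त के साथ बाजार में चाय पी".split()
pairs = [p for p in skipgram_pairs(sentence) if p[0] == "दोस्त"]
print(len(pairs))  # 10
```

Under these assumptions the window around दोस्त spans आज through चाय; the project's own generator may tokenize the sentence or handle the window differently. Words near the start or end of a sentence produce fewer than 10 pairs because the window is clipped.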
+---
+# Negative Sampling
+In addition to positive pairs, negative samples are generated.
+Random vocabulary words that do not appear in the context window are paired with the center word.
+## Example
 ```text
 [दोस्त, कंप्यूटर]
+[दोस्त, पहाड़]
+[दोस्त, विज्ञान]
 ```
+These represent unlikely co-occurrences.
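One way to draw such negatives is rejection sampling: pick a random vocabulary word and keep it only if it is not in the window. The sketch below assumes uniform sampling and a hypothetical `k` negatives per center word; the source does not state either choice.

```python
import random


def sample_negatives(center, context_words, vocab, k=3, rng=None, max_tries=1000):
    """Draw up to k (center, word) pairs whose word is outside the window.

    `vocab` is a list of words. Sampling is uniform and with replacement,
    and gives up after `max_tries` draws so it cannot loop forever.
    """
    rng = rng or random.Random()
    excluded = set(context_words) | {center}
    negatives = []
    for _ in range(max_tries):
        if len(negatives) == k:
            break
        candidate = rng.choice(vocab)
        if candidate not in excluded:
            negatives.append((center, candidate))
    return negatives


vocab = ["दोस्त", "सुबह", "चाय", "पी", "कंप्यूटर", "पहाड़", "विज्ञान", "किताब"]
negs = sample_negatives("दोस्त", ["सुबह", "चाय", "पी"], vocab, rng=random.Random(0))
```

With a vocabulary of several hundred thousand words, a random draw almost never lands inside a 10-word window, so rejections are rare in practice.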
+---
+# Why Negative Sampling?
+Negative sampling helps:
+- Learn meaningful semantic separation
+- Distinguish related vs unrelated words
+- Scale training efficiently to very large vocabularies
+- Avoid the computational cost of full softmax
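The cost difference can be made concrete with the standard skip-gram formulation. The notation is an assumption, not taken from the project: $v_c$ is the center-word vector, $u_w$ the target-side vector of word $w$, $V$ the vocabulary, and $k$ the number of negatives per positive pair.

Full softmax normalizes over every word in the vocabulary:

$$
p(o \mid c) = \frac{\exp(u_o^\top v_c)}{\sum_{w \in V} \exp(u_w^\top v_c)}
$$

Negative sampling replaces this with $k + 1$ binary decisions:

$$
\mathcal{L} = -\log \sigma(u_o^\top v_c) - \sum_{i=1}^{k} \log \sigma(-u_{n_i}^\top v_c)
$$

With more than 500K words in the vocabulary, the softmax denominator needs over 500K dot products per training pair, while the negative-sampling loss needs only $k + 1$.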
+---
+# Model Architecture
+Current training setup:
+- Architecture: Skip-gram Word2Vec
+- Framework: PyTorch
+- Embedding dimensions tested:
+  - 300
+  - 350
+  - 400
+- Best-performing embedding size so far: **350**
 - Optimizer: Adagrad
 - Loss Function: BCEWithLogitsLoss
+- Training uses:
+  - positive skip-gram pairs
+  - negative sampled pairs
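PyTorch's `BCEWithLogitsLoss` applies a sigmoid to a raw score and then takes binary cross-entropy against a 0/1 label. The framework-free sketch below shows that computation for a single pair, assuming the score is the dot product of the center and target embeddings (the usual skip-gram setup; the source does not spell this out). Function names are illustrative.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def bce_with_logits(logit, label):
    """Binary cross-entropy on a raw score, in a numerically stable form.

    Equivalent to -[y*log(sigmoid(x)) + (1-y)*log(1-sigmoid(x))].
    """
    return max(logit, 0.0) - logit * label + math.log1p(math.exp(-abs(logit)))


def pair_loss(center_vec, target_vec, label):
    """Loss for one (center, target) pair; label is 1.0 (positive) or 0.0 (negative)."""
    return bce_with_logits(dot(center_vec, target_vec), label)


positive = pair_loss([1.0, 1.0], [1.0, 1.0], 1.0)  # score 2.0, label 1
negative = pair_loss([1.0, 1.0], [1.0, 1.0], 0.0)  # same score, label 0
```

A high score is cheap when the pair is positive and expensive when it is negative, which is what pushes the two groups apart during training.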
+---
+# Current Results
+The model now learns strong separation between positive and negative pairs.
+Approximate predicted probabilities:
+- Positive pairs: ~0.94
+- Negative pairs: ~0.07
+The embeddings are beginning to capture:
+- semantic similarity
+- contextual relationships
+- syntactic structure
+---
+# Embedding Evaluation
+Current evaluation methods include:
+## 1. Cosine Similarity
+Used to retrieve semantically similar words.
+Example goals:
+```text
+राजा → रानी, सम्राट, शासक
+```
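A minimal sketch of cosine-similarity retrieval. It assumes the embeddings are available as a dict from word to a list of floats; the 2-D vectors are made up for illustration and are not output of the trained model.

```python
import math


def cosine(u, v):
    """Cosine of the angle between u and v (0.0 if either is a zero vector)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u) * sum(b * b for b in v))
    return dot / norm if norm else 0.0


def nearest(word, embeddings, top_k=3):
    """Return the top_k (word, similarity) pairs closest to `word`."""
    query = embeddings[word]
    scored = [
        (other, cosine(query, vec))
        for other, vec in embeddings.items()
        if other != word
    ]
    return sorted(scored, key=lambda item: item[1], reverse=True)[:top_k]


# Toy vectors, invented for this example only.
toy = {"राजा": [1.0, 0.1], "रानी": [0.9, 0.3], "चाय": [0.0, 1.0]}
neighbors = nearest("राजा", toy, top_k=2)
```

For a vocabulary of this project's size, the same idea is usually run as one matrix-vector product over normalized embeddings rather than a Python loop.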
+---
+## 2. Analogy Testing
+Evaluating vector arithmetic relationships such as:
+```text
+राजा - पुरुष + महिला ≈ रानी
+```
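The analogy test can be sketched as: build the query vector `vec(a) - vec(b) + vec(c)`, then return the closest word that is not one of the three inputs. The vectors below are hand-made so the arithmetic works out exactly; real embeddings only approximate this.

```python
import math


def cosine(u, v):
    """Cosine of the angle between u and v (0.0 if either is a zero vector)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u) * sum(b * b for b in v))
    return dot / norm if norm else 0.0


def analogy(a, b, c, embeddings):
    """Return the word closest to vec(a) - vec(b) + vec(c), skipping a, b, c."""
    query = [x - y + z for x, y, z in zip(embeddings[a], embeddings[b], embeddings[c])]
    candidates = (w for w in embeddings if w not in {a, b, c})
    return max(candidates, key=lambda w: cosine(query, embeddings[w]))


# Hand-made 2-D vectors (axis 0: royalty, axis 1: gender), purely illustrative.
toy = {
    "राजा": [1.0, 1.0],
    "रानी": [1.0, -1.0],
    "पुरुष": [0.0, 1.0],
    "महिला": [0.0, -1.0],
    "चाय": [-1.0, 0.0],
}
print(analogy("राजा", "पुरुष", "महिला", toy))  # रानी
```

Excluding the input words matters: without it, the nearest vector to the query is often one of the inputs themselves.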
+---
+## 3. Embedding Visualization
+PCA and t-SNE are used to visualize learned word clusters in 2D space.
+---
+# Future Improvements
+Planned improvements include:
+- Subsampling extremely frequent words
+- Improved negative sampling strategies
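For the planned subsampling step, one common choice is the rule from the original Word2Vec paper: a word with relative frequency `f` is kept with probability `sqrt(t / f)`, capped at 1. The sketch below assumes that rule and the commonly used threshold `t = 1e-5`; neither is confirmed as what this project will adopt.

```python
import math
import random
from collections import Counter


def keep_probability(freq, threshold=1e-5):
    """Probability of keeping a word whose relative frequency is `freq`."""
    return min(1.0, math.sqrt(threshold / freq))


def subsample(tokens, threshold=1e-5, rng=None):
    """Randomly drop frequent tokens; rare tokens are always kept."""
    rng = rng or random.Random()
    counts = Counter(tokens)
    total = len(tokens)
    return [
        tok
        for tok in tokens
        if rng.random() < keep_probability(counts[tok] / total, threshold)
    ]
```

Under this rule a word that makes up 0.1% of the corpus is kept about 10% of the time, while words at or below the threshold frequency are never dropped.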
+---
+# Contributions
+This is primarily a learning and research-oriented project, but suggestions, ideas, and feedback are always welcome.
+---
+# References
 - https://jalammar.github.io/illustrated-word2vec/
 - https://medium.com/@manansuri/a-dummys-guide-to-word2vec-456444f3c673
 - https://jaketae.github.io/study/word2vec/
+---
+# Author
 ```text
 Abhishek Biswas
+Software Developer | Interested in AI, NLP, and Web Development
+```