---
language: hi
tags:
- word2vec
- embeddings
- nlp
- hindi
- skip-gram
datasets:
- hindi-bible
- hindienglish-corpora
- english-hindi-dataset
- hindi-english-parallel-corpus
- hindi-wikipedia-articles-172k
task_categories:
- feature-extraction
pretty_name: Word2Vec Hindi Embeddings
---

# Word2Vec_hindi

Welcome to **Word2Vec_hindi**.

This project is my attempt at implementing the **Word2Vec model completely from scratch**, specifically for the **Hindi language**.

The primary goal of this project is learning by building: understanding how word embeddings work internally by implementing the entire pipeline myself instead of relying on high-level NLP libraries.

The project currently includes:

- Dataset collection and preprocessing
- Vocabulary generation
- Skip-gram pair generation
- Negative sampling
- Custom PyTorch training pipeline
- Embedding evaluation and visualization

Feel free to explore the project, experiment with it, and raise issues or suggestions. While I may not implement every suggestion, I genuinely appreciate feedback and ideas.

---

# Project Status

This project has evolved from a small experimental implementation into a large-scale embedding training pipeline.

Current progress includes:

- Training on a corpus containing over **82M Hindi tokens**
- Generating over **1.5 billion skip-gram training pairs**
- Training multiple embedding models with dimensions ranging from **300 to 400**
- Evaluating embeddings using:
  - cosine similarity
  - nearest-neighbor retrieval
  - analogy testing
  - embedding visualization using PCA and t-SNE

The current best-performing model:

- Embedding Size: **350**
- Training Loss: **~0.38**
- Validation Loss: **~0.47**

The model is now producing meaningful semantic separation between positive and negative word pairs.
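The cosine-similarity and nearest-neighbor checks listed under Project Status can be sketched in a few lines. This is a minimal illustration in plain Python, assuming the trained embeddings are available as a dict mapping each word to a list of floats. The function names and the toy 3-dimensional vectors are hypothetical, not the project's actual code or trained values.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as similar to nothing
    return dot / (norm_a * norm_b)


def nearest_neighbors(query, embeddings, top_k=3):
    """Return the top_k (word, similarity) pairs closest to `query`."""
    query_vec = embeddings[query]
    scored = [
        (word, cosine_similarity(query_vec, vec))
        for word, vec in embeddings.items()
        if word != query  # skip the query word itself
    ]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_k]


# Toy vectors invented for illustration only (not trained values).
toy_embeddings = {
    "राजा": [0.9, 0.8, 0.1],
    "रानी": [0.8, 0.9, 0.1],
    "शासक": [0.9, 0.6, 0.2],
    "चाय": [0.1, 0.0, 0.9],
}

neighbors = nearest_neighbors("राजा", toy_embeddings, top_k=2)
for word, score in neighbors:
    print(word, round(score, 3))
```

With real 350-dimensional embeddings and a 500K-word vocabulary, a vectorized matrix product would be far faster than this loop; the sketch only shows the logic.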

---

# Latest Updates

- Combined 5 large Hindi datasets into a single training corpus
- Final corpus size reached approximately **82M tokens**
- Vocabulary built from words occurring at least **2 times**
- Final vocabulary size exceeds **500K unique words**
- Context window size increased from **3 → 5**
- Generated approximately:
  - **1.5 billion training skip-gram pairs**
  - **40M validation pairs**
  - **40M testing pairs**
- Implemented:
  - Skip-gram training
  - Negative sampling
  - BCEWithLogitsLoss training objective
  - Adagrad optimizer
- Added support for:
  - PCA embedding visualization
  - t-SNE embedding visualization
  - cosine similarity search
  - analogy-based embedding evaluation

---

# Datasets Used

## 1. Hindi Bible

Source: https://www.kaggle.com/datasets/kapilverma/hindi-bible

## 2. Hindi-English Corpora

Source: https://www.kaggle.com/datasets/aiswaryaramachandran/hindienglish-corpora

## 3. English-Hindi Dataset

Source: https://www.kaggle.com/datasets/preetviradiya/english-hindi-dataset

## 4. IIT Bombay English-Hindi Translation Dataset

Source: https://www.kaggle.com/datasets/vaibhavkumar11/hindi-english-parallel-corpus

## 5. Hindi Wikipedia Articles - 172k

Source: https://www.kaggle.com/datasets/disisbig/hindi-wikipedia-articles-172k

---

# Dataset Preprocessing

The preprocessing pipeline currently includes:

- Combining Hindi text from multiple datasets
- Cleaning punctuation and noisy symbols
- Tokenizing text into words
- Building vocabulary mappings
- Removing extremely rare words
- Generating skip-gram training pairs
- Generating negative samples

---

# Vocabulary Pruning

Instead of keeping every unique token, only words appearing at least **2 times** are retained.
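A minimal sketch of this pruning step, assuming the corpus has already been tokenized into a flat list of words. The function name `build_vocab`, the `min_count` parameter, and the frequency-based id ordering are illustrative choices, not necessarily how the project's code does it.

```python
from collections import Counter


def build_vocab(tokens, min_count=2):
    """Keep only words seen at least `min_count` times and assign integer ids."""
    counts = Counter(tokens)
    # Order by descending frequency, then alphabetically, so ids are deterministic.
    kept = sorted(
        (word for word, count in counts.items() if count >= min_count),
        key=lambda word: (-counts[word], word),
    )
    word_to_id = {word: idx for idx, word in enumerate(kept)}
    id_to_word = {idx: word for word, idx in word_to_id.items()}
    return word_to_id, id_to_word, counts


# Tiny toy corpus, for illustration only.
tokens = "चाय पी चाय बाजार में चाय बाजार दोस्त".split()
word_to_id, id_to_word, counts = build_vocab(tokens, min_count=2)
print(word_to_id)
```

Here only चाय (3 occurrences) and बाजार (2 occurrences) survive; the three words seen once are dropped.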

Pruning rare words helps:

- Reduce vocabulary size
- Improve training efficiency
- Remove noisy and corrupted tokens
- Improve embedding quality

---

# Context Window

- Previous context window size: **3**
- Current context window size: **5**

With a window size of 5:

- each center word can generate up to 10 positive pairs (5 on each side)
- broader semantic context can be captured
- embeddings learn richer relationships

---

# Training Data Generation

For each word:

- The word is treated as the **center** word
- Neighboring words within the context window are treated as positive context (target) words

## Example

Sentence:

```text
आज सुबह मैंने अपने पुराने दोस्त के साथ बाजार में चाय पी
```

If the center word is:

```text
दोस्त
```

Generated positive pairs (window size 5):

```text
[दोस्त, आज]
[दोस्त, सुबह]
[दोस्त, मैंने]
[दोस्त, अपने]
[दोस्त, पुराने]
[दोस्त, के]
[दोस्त, साथ]
[दोस्त, बाजार]
[दोस्त, में]
[दोस्त, चाय]
```

The final word, पी, is six positions away from दोस्त, so it falls outside the window and does not form a pair.

This process is repeated across the entire corpus to generate training pairs.

---

# Negative Sampling

In addition to positive pairs, negative samples are generated.

Random vocabulary words that do not appear in the context window are paired with the center word.

## Example

```text
[दोस्त, कंप्यूटर]
[दोस्त, पहाड़]
[दोस्त, विज्ञान]
```

These represent unlikely co-occurrences.

---

# Why Negative Sampling?

Negative sampling helps:

- Learn meaningful semantic separation
- Distinguish related vs unrelated words
- Scale training efficiently to very large vocabularies
- Avoid the computational cost of full softmax

---

# Model Architecture

Current training setup:

- Architecture: Skip-gram Word2Vec
- Framework: PyTorch
- Embedding dimensions tested:
  - 300
  - 350
  - 400
- Best-performing embedding size so far: **350**
- Optimizer: Adagrad
- Loss Function: BCEWithLogitsLoss
- Training uses:
  - positive skip-gram pairs
  - negative sampled pairs

---

# Current Results

The model now learns strong separation between positive and negative pairs.
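To make "separation" concrete, here is a minimal sketch of how a single pair can be scored. It assumes the logit is the dot product of the center and context vectors, as in standard skip-gram with negative sampling, and that the probability is the sigmoid of that logit. It is written in plain Python rather than PyTorch, the helper names and toy vectors are hypothetical, and the project's actual scoring code may differ.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def pair_logit(center_vec, context_vec):
    """Raw score for a (center, context) pair: the dot product."""
    return sum(c * o for c, o in zip(center_vec, context_vec))


def bce_with_logits(logit, label):
    """Binary cross-entropy on a raw logit (label 1.0 = positive, 0.0 = negative).

    Uses the stable form max(x, 0) - x * y + log(1 + exp(-|x|)).
    """
    return max(logit, 0.0) - logit * label + math.log1p(math.exp(-abs(logit)))


# Toy 2-dimensional vectors, invented for illustration only.
center = [1.0, 2.0]
positive_context = [1.0, 1.0]    # points in a similar direction
negative_context = [-1.0, -1.0]  # points in the opposite direction

pos_prob = sigmoid(pair_logit(center, positive_context))
neg_prob = sigmoid(pair_logit(center, negative_context))
pos_loss = bce_with_logits(pair_logit(center, positive_context), 1.0)
neg_loss = bce_with_logits(pair_logit(center, negative_context), 0.0)
print(pos_prob, neg_prob, pos_loss, neg_loss)
```

In this toy case the logits are 3 and -3, giving probabilities of roughly 0.95 and 0.05. Training pushes positive pairs toward 1 and negative pairs toward 0 in the same way.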

Observed probability ranges:

- Positive pairs: ~0.94
- Negative pairs: ~0.07

The embeddings are beginning to capture:

- semantic similarity
- contextual relationships
- syntactic structure

---

# Embedding Evaluation

Current evaluation methods include:

## 1. Cosine Similarity

Used to retrieve semantically similar words.

Example goals:

```text
राजा → रानी, सम्राट, शासक
```

---

## 2. Analogy Testing

Evaluating vector arithmetic relationships such as:

```text
राजा - पुरुष + महिला ≈ रानी
```

---

## 3. Embedding Visualization

Using:

- PCA
- t-SNE

to visualize learned word clusters in 2D space.

---

# Future Improvements

Planned improvements include:

- Subsampling extremely frequent words
- Improved negative sampling strategies

---

# Contributions

This is primarily a learning and research-oriented project, but suggestions, ideas, and feedback are always welcome.

---

# References

- https://jalammar.github.io/illustrated-word2vec/
- https://medium.com/@manansuri/a-dummys-guide-to-word2vec-456444f3c673
- https://jaketae.github.io/study/word2vec/

---

# Author

```text
Abhishek Biswas
Software Developer | Interested in AI, NLP, and Web Development
```