Sanskrit NLP: Classification of Bhagavad Gita vs Yoga Sutras

Kushal Shah
12 min read · Jul 28, 2023

Work done by Srihari K. G. during summer 2023 and mentored by Kushal Shah.

The Bhagavad Gita and the Yoga Sutras are two ancient Sanskrit texts containing spiritual wisdom that has transformed many lives over the centuries. Although these texts have been analyzed by many scholars from a philosophical perspective, they are yet to be analyzed properly from a linguistic perspective. In this blog, we present some basic analysis of these two texts using simple techniques from Natural Language Processing (NLP), which you will hopefully find insightful. This used to be very difficult, since Devanagari text was not easily readable from within a programming language, but in Python 3 it is as easy as reading and analyzing English.

Statistical Features

Each shloka of the Gita, as well as each Yoga Sutra, is referred to here as a verse, and each word within these verses is termed a segment. We begin by computing various statistical features of these verses and segments, e.g. mean, median, mode and variance. As the table below shows, the Bhagavad Gita has 700 verses whereas the Yoga Sutras have only 195. In terms of the number of words or segments, the Bhagavad Gita has 4171, whereas the Yoga Sutras have 428. Looking at character counts, the mean number of characters in a verse of the Bhagavad Gita is 82, whereas in the Yoga Sutras it is 38. However, if we look at each word separately, the mean number of characters in a Yoga Sutra word is 15, much larger than the 10 for the Bhagavad Gita. In summary, the Bhagavad Gita has many more verses, which are also longer on average, but the Yoga Sutras have longer individual words.
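For readers who want to reproduce these numbers, here is a minimal sketch (the file names and the one-verse-per-line format are our assumptions here; the actual dataset layout is in the repository linked at the end):

# Minimal sketch for the verse/segment statistics, assuming each file
# contains one verse per line (file names are placeholders).
import statistics

def verse_stats(path):
    with open(path, encoding='utf-8') as f:
        verses = [line.strip() for line in f if line.strip()]
    segments = [w for v in verses for w in v.split()]
    print(f"{path}: {len(verses)} verses, {len(segments)} segments")
    print(f"  mean chars per verse:   {statistics.mean(len(v) for v in verses):.1f}")
    print(f"  mean chars per segment: {statistics.mean(len(w) for w in segments):.1f}")

verse_stats('bhagavad_gita.txt')   # hypothetical file name
verse_stats('yoga_sutras.txt')     # hypothetical file name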

We now plot a few histograms representing the distribution of the number of characters in both the Bhagavad Gita and the Yoga Sutras. These graphs further support the statistical values in the table above.

Binary Classification using Logistic Regression

Logistic regression is a popular binary classification algorithm that takes feature vectors and predicts the class using the sigmoid function. The biggest advantage of such algorithms is that they are fully explainable, but of course, a limitation is that they are not useful if we do not have such quantitative features available for our data. In the case of Bhagavad Gita vs Yoga Sutras, we have seen in the previous section that there is a clear difference in their statistical measures, and so we could use these as features in our classification algorithm.

We used the length of verses, the number of words, and the counts of vowel signs (मात्रा), vowels (स्वर) and consonants (व्यंजन) as the key features in our classification algorithm. There are a total of 895 verses in our dataset (700 from the Bhagavad Gita and 195 from the Yoga Sutras), which we split in the ratio 55:45 for training and testing. Initially, all these features were considered while training the model, and we obtained a training accuracy of 98.78% and a test accuracy of 99.01%, which is surely very impressive!
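A minimal sketch of this setup with scikit-learn is shown below; the Unicode-range feature extraction is an assumption about the implementation, and gita_verses / sutra_verses are placeholders for lists of verse strings:

# Hedged sketch: logistic regression on the five hand-crafted features,
# with the 55:45 train/test split described above.
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def features(verse):
    matras = sum('\u093e' <= ch <= '\u094c' for ch in verse)      # vowel signs (मात्रा)
    vowels = sum('\u0905' <= ch <= '\u0914' for ch in verse)      # independent vowels (स्वर)
    consonants = sum('\u0915' <= ch <= '\u0939' for ch in verse)  # consonants (व्यंजन)
    return [len(verse), len(verse.split()), matras, vowels, consonants]

X = [features(v) for v in gita_verses + sutra_verses]   # placeholder verse lists
y = [1] * len(gita_verses) + [0] * len(sutra_verses)    # 1 = Gita, 0 = Yoga Sutras
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.45, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print(clf.score(X_train, y_train), clf.score(X_test, y_test))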

Further analysis was done considering all possible combinations of these five features, as sketched below. Except when the input feature set included only the vowel (स्वर) count, both the training and test accuracies were found to be above 95%. When vowel_count was the only input feature, training accuracy was 78% and test accuracy was 77%. This drop for vowels is primarily because most verses contain very few independent vowels, so this count is not a good differentiator. The table below shows the training and test accuracies with different features taken one at a time.
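A hedged sketch of the feature-combination sweep, reusing X, y and the imports from the previous snippet (the feature names are our labels, not necessarily those used in the code repository):

# Train and evaluate on every non-empty subset of the five features.
from itertools import combinations

FEATURE_NAMES = ['verse_length', 'word_count', 'matra_count', 'vowel_count', 'consonant_count']
for k in range(1, 6):
    for subset in combinations(range(5), k):
        Xs = [[row[i] for i in subset] for row in X]
        Xtr, Xte, ytr, yte = train_test_split(Xs, y, test_size=0.45, random_state=0)
        clf = LogisticRegression(max_iter=1000).fit(Xtr, ytr)
        print([FEATURE_NAMES[i] for i in subset], clf.score(Xtr, ytr), clf.score(Xte, yte))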

We have also computed the decision boundary for each feature. What we found is that for an input verse, the prediction will be Bhagavad Gita if the feature value is greater than the threshold given in the table below, and Yoga Sutras otherwise.

Since the threshold for the last feature, the number of vowels (स्वर), is negative, all input verses will be predicted as Bhagavad Gita. This is mainly because the vowel count is too low to be useful as a predictor.
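For a single-feature logistic regression, this threshold can be read off directly from the fitted weights: the sigmoid crosses 0.5 where w·x + b = 0, i.e. at x = -b/w. A minimal sketch, assuming clf was fitted on one feature as above:

# Decision boundary of a one-feature logistic regression.
w = clf.coef_[0][0]
b = clf.intercept_[0]
print(f"predict Bhagavad Gita when feature value > {-b / w:.1f}")  # assuming w > 0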

It is clear that Logistic Regression does a great job of classifying the verses using these simple features, and we next experiment with a probability model based on Markov transitions.

Binary Classification using Markov Transition Probability Model

A frequency distribution of 2 consecutive characters taken at a time was created separately for the verses of the Gita and of the Yoga Sutras, and the probability of each pair's occurrence was then computed. For instance, for the string समवेता, we count the number of times सम, मव, etc. occur in the entire dataset, and then normalize the counts to obtain probabilities.

For any given input verse or segment, the probability of it belonging to each class is calculated by multiplying the values for all consecutive letter pairs, and the class with the higher probability is taken as the prediction. This is the usual Naive Bayes model, but instead of word probabilities we use the probabilities of pairs of letters, primarily because the goal of this work is to uncover deeper patterns in these two texts and not be limited to word occurrences.
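A minimal sketch of this character-pair model is given below; we sum log-probabilities instead of multiplying raw probabilities to avoid numerical underflow, and the unseen-pair penalty is our assumption:

# Hedged sketch of the letter-pair Markov/Naive Bayes classifier.
import math
from collections import Counter

def bigram_logprobs(verses):
    counts = Counter(v[i:i+2] for v in verses for i in range(len(v) - 1))
    total = sum(counts.values())
    return {pair: math.log(c / total) for pair, c in counts.items()}

gita_lp = bigram_logprobs(gita_verses)    # placeholder verse lists, as before
sutra_lp = bigram_logprobs(sutra_verses)

UNSEEN = math.log(1e-8)  # assumed penalty for pairs absent from a class

def classify(text):
    pairs = [text[i:i+2] for i in range(len(text) - 1)]
    g = sum(gita_lp.get(p, UNSEEN) for p in pairs)
    s = sum(sutra_lp.get(p, UNSEEN) for p in pairs)
    return 'Gita' if g > s else 'Yoga Sutras'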

In this approach, we first construct the frequency distribution of letter pairs using the entire dataset, and then use this distribution to predict the class of all the verses and segments. For verses, we get a high accuracy of 99.25%, but for segments we get an accuracy of only 77.76%. Next, we take 70% of the verses for constructing the frequency distribution, and then classify the remaining 30% of verses and their segments using this distribution. Here are the resulting accuracies:

Accuracy with training data (70% of verses): 99.04%
Accuracy with training data (segments from these 70% of verses): 79.67%
Accuracy with test data (the other 30% of verses): 91.82%
Accuracy with test data (segments from the other 30% of verses): 74.82%
Accuracy with the entire dataset (verses): 96.87%
Accuracy with the entire dataset (segments): 78.53%

We can clearly see that the Markov model is able to predict the verses with high accuracy, but performs poorly when we input individual words or segments.

Binary Classification using Artificial Neural Networks

We have seen in the above sections that very simple techniques can give us very good classification accuracy for verses from the Bhagavad Gita vs the Yoga Sutras, primarily because their statistical properties are very different. However, the accuracy drops considerably when we consider individual words (or segments), since their statistical features are not that different between the two texts. We now wish to build an ML model using neural networks to see if it can succeed in classifying shorter segments, where manually identifying the distinguishing features is difficult.

We build a deep learning model in TensorFlow with an LSTM layer and a softmax output, and train it using verses from both texts. The following snippet depicts the structure of the model.

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, SpatialDropout1D, LSTM, Dense, LeakyReLU

model = Sequential()
model.add(Embedding(5000, 120, input_length=x_train.shape[1]))
model.add(SpatialDropout1D(0.4))
model.add(LSTM(500, dropout=0.4, recurrent_dropout=0.4))
model.add(Dense(300))
model.add(LeakyReLU())   # LeakyReLU applied as a layer rather than a string name
model.add(Dense(150))
model.add(LeakyReLU())
model.add(Dense(2, activation='softmax'))

This ANN model consists of an Embedding layer with a vocabulary size of 5000, followed by a dropout layer to prevent overfitting. An LSTM layer is used to learn long-term dependencies. A few dense layers follow, each with a LeakyReLU activation. The output layer is a dense layer with softmax activation, which predicts the class of the input text (which could be a full verse or a sub-string). The optimizer used was Adam.

The training and test data were split 70:30, i.e. 626 verses for training and 269 verses for testing. We played around with the network architecture and other hyperparameters to see what would give the best results. We ran four kinds of experiments with this model: train on verses and test on verses; train on verses and test using only segments (or words); and so on, as shown in the table below. Here "P" (positive) refers to the Bhagavad Gita and "N" (negative) refers to the Yoga Sutras.

We can clearly see that the model performs much better when it is trained on verses rather than just segments. The model trained on verses performs surprisingly well even though the dataset is highly unbalanced. We also performed 10-fold cross-validation to ensure that our results are stable under different choices of training and testing verses; our accuracy values had a standard deviation of only 1.24% over the 10 iterations, indicating very stable results.
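A sketch of how such a cross-validation can be set up; X is the padded token matrix, y the integer labels, Y their one-hot encoding, and build_model() is a hypothetical helper that recreates the architecture above with fresh weights:

# Hedged sketch of 10-fold cross-validation for the ANN model.
import numpy as np
from sklearn.model_selection import StratifiedKFold

accs = []
for train_idx, test_idx in StratifiedKFold(n_splits=10, shuffle=True, random_state=0).split(X, y):
    model = build_model()   # hypothetical: rebuilds the Sequential model above
    model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
    model.fit(X[train_idx], Y[train_idx], epochs=10, batch_size=32, verbose=0)
    accs.append(model.evaluate(X[test_idx], Y[test_idx], verbose=0)[1])
print(np.mean(accs), np.std(accs))   # mean accuracy and its standard deviation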

Artificial Neural Networks using SentencePiece Tokenizer

A further improvement to the ANN model was made by using the SentencePiece tokenizer for tokenizing the sequences and by changing the number of nodes in the layers. The SentencePieceTrainer processed the input data and generated a SentencePiece model, which was then used for tokenization.

Unlike traditional tokenizers that tokenize based on word boundaries, SentencePiece uses a statistical model to learn and generate a vocabulary of subword units. It can tokenize words into smaller subword units, which helps capture more fine-grained linguistic information and handle rare or unseen words effectively.
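A minimal sketch of this training and tokenization step with the sentencepiece library (the file name and vocabulary size are placeholders, not the values used in our experiments):

# Hedged sketch: train a SentencePiece model and tokenize Sanskrit text.
import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input='all_verses.txt',      # hypothetical: one verse per line
    model_prefix='sanskrit_sp',
    vocab_size=2000,             # assumed value
)
sp = spm.SentencePieceProcessor(model_file='sanskrit_sp.model')
print(sp.encode('धर्मक्षेत्रे कुरुक्षेत्रे', out_type=str))   # subword pieces
print(sp.encode('धर्मक्षेत्रे कुरुक्षेत्रे', out_type=int))   # token ids for the network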

The model summary with the new tokenizer is as follows:

The details of the parameters of this model are given below.

In our earlier model with a traditional tokenizer, the accuracy was quite low when training was done using only segments (instead of full verses). With the SentencePiece tokenizer, the accuracy in this case improves significantly, as shown in the table below. For further experiments with the ANN model, we stick to the SentencePiece tokenizer.

Segment by Segment Testing

As mentioned above, the purpose of this study is to develop a classification model that works well even when the input is a substring of a verse. Of course, a simple string-matching algorithm would suffice, since the Gita verses are different from the Yoga Sutra verses, but our objective is to learn the abstract features of the verses in these two texts. One possible application would be the composition of new verses that follow the linguistic structure of the Gita or the Yoga Sutras, which is especially interesting now given the wide availability of generative AI algorithms.

To analyse the behavior of our model on words or segments, we fed it different numbers of consecutive words at a time, i.e. 1, 2, 3 and so on. For instance, let the sample verses be

स तु दीर्घकालनैरंन्तर्यसत्कारासेवितो दृढ़भूमिः

दृष्टानुश्रविकविषयवितृष्णस्य वशीकारसंज्ञा वैराग्यम्

Then the various inputs with 2 words taken at a time would be:

  • स तु
  • तु दीर्घकालनैरंन्तर्यसत्कारासेवितो
  • दीर्घकालनैरंन्तर्यसत्कारासेवितो दृढ़भूमिः
  • दृष्टानुश्रविकविषयवितृष्णस्य वशीकारसंज्ञा
  • वशीकारसंज्ञा वैराग्यम्

Similarly, with 3 words taken at a time, the various inputs would be:

  • स तु दीर्घकालनैरंन्तर्यसत्कारासेवितो
  • तु दीर्घकालनैरंन्तर्यसत्कारासेवितो दृढ़भूमिः
  • दृष्टानुश्रविकविषयवितृष्णस्य वशीकारसंज्ञा वैराग्यम्
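A minimal sketch of this sliding-window construction:

# Sliding window of n consecutive words over a verse.
def word_windows(verse, n):
    words = verse.split()
    return [' '.join(words[i:i+n]) for i in range(len(words) - n + 1)]

for chunk in word_windows('स तु दीर्घकालनैरंन्तर्यसत्कारासेवितो दृढ़भूमिः', 2):
    print(chunk)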

For both the ANN and Markov model, we observe that the accuracy decreases as the number of words in the input decreases. This is expected. What is of real interest is to figure out if there is a threshold below which the performance degrades significantly, and then to see if we can develop other models that can work well even with single words.

For the ANN model, the decrease in accuracy was drastic when fewer than 6 segments or words were fed at a time. But the Markov model was quite resilient even for single words, as shown in the table below.

The Markov model was further improved by examining whether the consecutive letter pairs of an input actually occur in the Gita or the Yoga Sutras at all. It was evident that for many words, a large fraction of the letter pairs were present in neither text. Based on this observation, the model was improved by multiplying in a small scalar penalty whenever such an unseen transition was encountered, instead of letting the probability collapse to zero. Also, instead of taking two characters at a time, a modified model was built that considers transition probabilities over 4 characters at a time, i.e. on average:

1 letter + 1 maatra + 1 letter + 1 maatra.

This modified model in fact gave better predictions on the original dataset, and we achieved an accuracy of around 95% even for single-word inputs!
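The frequency construction for this variant is a one-line change from the pair version; a sketch, reusing the imports and log-probability scoring from the earlier Markov snippet (the exact window and step sizes in our experiments may differ):

# Hedged sketch: 4-character windows instead of pairs, which on average
# cover letter + maatra + letter + maatra.
def fourgram_logprobs(verses):
    counts = Counter(v[i:i+4] for v in verses for i in range(len(v) - 3))
    total = sum(counts.values())
    return {gram: math.log(c / total) for gram, c in counts.items()}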

Randomization and Word Order Permutation

Since our objective is to develop an algorithm that can understand the abstract patterns in the verses of the Bhagavad Gita and the Yoga Sutras, we now study the effects of randomizing the letters in the verses and of permuting the word order. For example, if the sample verse is शब्दज्ञानानुपाती वस्तुशून्यो विकल्पः, then the input to the model after word order permutation would be शब्दज्ञानानुपाती विकल्पः वस्तुशून्यो or विकल्पः शब्दज्ञानानुपाती वस्तुशून्यो, or any other random ordering. This won't have any effect on the Markov model, since it only looks at letter pairs, but there may be some degradation in the performance of the ANN model.
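A word order permutation is just a shuffle of the words of a verse:

# Randomly reorder the words of a verse.
import random

def permute_words(verse):
    words = verse.split()
    random.shuffle(words)   # in-place random reordering
    return ' '.join(words)

print(permute_words('शब्दज्ञानानुपाती वस्तुशून्यो विकल्पः'))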

For randomization of letters, we replace a fraction of the letters and vowel signs (मात्रा) in a verse with other letters and vowel signs sampled from the Sanskrit alphabet. We make sure that letters are replaced by letters and vowel signs by other vowel signs, so as to retain the overall structure of the verse and to ensure that letters and vowel signs don't get jumbled up, which would render the sentence meaningless. This mutation was carried out with fractions ranging from 10% to 100% of the verse. For example, if the verse is धर्मक्षेत्रे कुरुक्षेत्रे समवेता युयुत्सवः। मामकाः पाण्डवाश्चैव किमकुर्वत सञ्जय, then upon 100% mutation the modified input would be रछँपईःञाचेअ ञ़टोखृणिघिछो छबओँईं ऊृगोइूवनि आःतकऽं विमोठकृआहाल छुयङोझंजउ घएौचऐ. We can see in the table below that the ANN model is more resilient to randomization, and that word order permutations have no impact on either model.
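A minimal sketch of this mutation step; the Unicode ranges used to identify consonants and vowel signs are an implementation assumption:

# Replace a given fraction of consonants and vowel signs in kind.
import random

CONSONANTS = [chr(c) for c in range(0x0915, 0x093A)]   # क .. ह
MATRAS     = [chr(c) for c in range(0x093E, 0x094D)]   # ा .. ौ

def mutate(verse, fraction):
    chars = list(verse)
    idxs = [i for i, ch in enumerate(chars) if ch in CONSONANTS or ch in MATRAS]
    for i in random.sample(idxs, int(fraction * len(idxs))):
        pool = CONSONANTS if chars[i] in CONSONANTS else MATRAS
        chars[i] = random.choice(pool)   # consonant -> consonant, matra -> matra
    return ''.join(chars)

print(mutate('धर्मक्षेत्रे कुरुक्षेत्रे समवेता युयुत्सवः', 0.5))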

Finally, a large collection of randomly generated words of different lengths, built from letters of the Sanskrit alphabet with due attention to the vowel signs, was also fed to the models. Interestingly, the ANN model mostly predicted shorter words as Yoga Sutras, while its predictions for longer words were biased towards the Gita. The opposite pattern was seen for the Markov model.

Concluding Remarks

To analyse the misclassified words and verses, we computed the fuzzy distance of the misclassified verses and words from the Yoga Sutras against all the verses and words from the Gita, and vice versa. These fuzzy scores were used to check whether the misclassified words were closely related to verses or segments of the other text. We observed that only a few (four) of the misclassified words had a fuzzy score greater than 90, implying that there is some other reason for the misclassification that we are yet to figure out.

Examples:

ज्योतिष्मति in Yoga Sutra and ज्योतिषामपि in Gita have a score of 90.
वस्तुसाम्ये in Yoga Sutra and संस्मृत्य in Gita have a score of 95.
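Such scores can be computed with any standard fuzzy string-matching library on a 0-100 scale; for instance, with rapidfuzz (the library choice here is our assumption):

# Fuzzy similarity on a 0-100 scale; higher means more similar strings.
from rapidfuzz import fuzz
print(fuzz.ratio('ज्योतिष्मति', 'ज्योतिषामपि'))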

This work is surely just a start, and a lot more needs to be done to properly understand the abstract linguistic features of the Bhagavad Gita and the Yoga Sutras. We hope other Sanskrit and NLP enthusiasts will soon join us on this long journey.

Code: https://github.com/sriharikg2003/NLP-Sanskrit
Dataset: https://github.com/atmabodha/Vedanta_Datasets
