What are tokens and why do we need them?
Computers only understand numbers and the relationships between them. Everything a computer processes must first be converted into bits. Hence the first step in parsing natural language is to convert long sequences of text into manageable chunks. This process of breaking a long sequence of text into smaller units is called tokenization. There are broadly four categories of tokenization:
- Sentence tokenization: Breaking text into individual sentences (or, at a coarser level, sections and paragraphs).
- Word tokenization: Separating sequences of text by whitespace and punctuation.
- Subword tokenization: Dividing words further into more meaningful subwords, e.g., unhappiness → un + happy + ness
- Character tokenization: As the name suggests, this is the lowest level of tokenization where long sequences of text are literally broken down at the character level.
So the process boils down to converting text into tokens and maintaining an inventory of them. The set of all unique tokens makes up the vocabulary. Feel free to use OpenAI’s tokenizer playground to see tokenization in action.
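These levels can be sketched in a few lines of Python. The regexes below are naive stand-ins for real tokenizers, just enough to show the idea:

```python
import re

text = "Tokenization breaks text into units. Unhappiness has subwords."

# Sentence tokenization: split after sentence-final punctuation (naive rule)
sentences = re.split(r"(?<=[.!?])\s+", text)

# Word tokenization: split on whitespace and punctuation
words = re.findall(r"[a-z]+", text.lower())

# Character tokenization: the lowest level
chars = list(words[-1])

# The vocabulary is the set of unique tokens seen so far
vocab = sorted(set(words))

print(sentences)
print(vocab)
```

Real tokenizers handle abbreviations, hyphenation, and non-Latin scripts; this sketch ignores all of that.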
The Tokenization Problem of Sanskrit
Most languages separate words with whitespace. Sanskrit, being an Indo-European language, does it too, loosely — but the written words often don’t correspond to the underlying morphological units, because of a process called sandhi (संधि, joining).
Sandhi is phonological fusion at word boundaries. When two words are spoken or written in sequence, the final sound of the first word and the initial sound of the second word interact according to a set of rules, sometimes merging into a single sound, sometimes transforming each other. The written text reflects this fusion.
A few examples:
| Word 1 | Word 2 | Fused form | Rule |
|---|---|---|---|
| rāma | uvāca | rāmovāca | a + u → o |
| tat | ca | tacca | t + c → cc (assimilation) |
| devī | api | devyapi | ī + a → ya (glide insertion) |
| sat | ānandam | sadānandam | t + ā → dā (voicing) |
| manas | īśa | manasīśa | s + ī → sī (visarga sandhi) |
The fused form is what appears in the written text. A tokenizer splitting on whitespace gets rāmovāca as a single token. It has no way of knowing this is two words, and without that knowledge it cannot look either word up in a dictionary, parse either one morphologically, or understand what the sentence means.
There are dozens of sandhi rules. They interact. They’re partly deterministic (given the sounds, the output is predictable), but splitting — going from rāmovāca back to rāma + uvāca — is hard because it requires knowing which splits are morphologically valid. The same fused string could sometimes result from multiple underlying word pairs.
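To see why splitting is harder than joining, here is a toy reverse-sandhi sketch. The rule table and lexicon are tiny assumptions for illustration, not a real Sanskrit engine:

```python
# Toy reverse-sandhi: undo a couple of vowel-sandhi rules and keep only
# splits whose halves are attested in a (tiny, assumed) lexicon.
REVERSE_RULES = {
    "o": [("a", "u"), ("a", "ū")],  # a + u/ū -> o
    "e": [("a", "i"), ("a", "ī")],  # a + i/ī -> e
}
LEXICON = {"rāma", "uvāca", "tat", "ca"}  # illustrative entries only

def candidate_splits(surface):
    """Propose (left, right) word pairs whose fusion could yield `surface`."""
    out = []
    for i, ch in enumerate(surface):
        for left_final, right_initial in REVERSE_RULES.get(ch, []):
            left = surface[:i] + left_final
            right = right_initial + surface[i + 1:]
            if left in LEXICON and right in LEXICON:
                out.append((left, right))
    return out

print(candidate_splits("rāmovāca"))  # [('rāma', 'uvāca')]
```

The key point is the lexicon check: the rules alone overgenerate, and only morphological knowledge prunes the candidates. Real systems carry dozens of interacting rules and a full morphological analyzer instead of a four-word set.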
The Compound Problem
Sanskrit compounds (samāsa, समास) make the tokenization problem worse. Sanskrit freely concatenates stems into compound words, and these compounds can be long.
The compound types matter for interpretation:
- Tatpuruṣa: determinative compounds — rājakumāra = rāja (king) + kumāra (son) = “king’s son”
- Bahuvrīhi: exocentric compounds that describe something else — kamalalochana (lotus-eyed) doesn’t refer to a lotus or an eye, but to a person with lotus-shaped eyes
- Dvandva: copulative compounds — ahorātra = aha (day) + rātra (night) = “day-and-night”
A single Sanskrit compound can carry what English would express as a full noun phrase. Sanskrit texts routinely produce compounds of 8-10 stems. Decomposing them requires not just finding the morphological boundary points but choosing the semantically coherent parse — because the same character sequence might be segmentable in multiple ways, not all of which are real words.
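The ambiguity can be made concrete with a small word-break search over a toy lexicon (the entries are illustrative, not a real dictionary):

```python
# Word-break over a toy lexicon: enumerate every way to cover the string
# with lexicon entries. Multiple covers = multiple competing parses.
def segmentations(s, lexicon):
    if not s:
        return [[]]
    results = []
    for i in range(1, len(s) + 1):
        head = s[:i]
        if head in lexicon:
            for rest in segmentations(s[i:], lexicon):
                results.append([head] + rest)
    return results

toy_lexicon = {"rāja", "kumāra", "rājakumāra"}
print(segmentations("rājakumāra", toy_lexicon))  # two competing parses
```

Even this trivial lexicon yields two valid covers; picking between them requires semantics, which is exactly where whitespace-based tokenizers give up.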
The Morphological Complexity
Even after you’ve correctly split sandhi and decomposed compounds, each token is still a morphologically complex unit. Sanskrit nouns decline across:
- 8 cases (vibhakti): nominative, accusative, instrumental, dative, ablative, genitive, locative, vocative. This used to scare the heck out of me as a kid — I remember we were required to memorize the vibhakti of some commonly used nouns in Sanskrit. Lol, I still remember (राम, रामौ, रामाः …)
- 3 genders (liṅga): masculine, feminine, neuter
- 3 numbers (vacana): singular, dual, plural
That’s 72 paradigm cells across the system (a given noun, its gender fixed, realizes 24 of them; adjectives, which agree in gender, range over all 72) — and many of the forms look different. A stem like deva (god) appears as devaḥ, devam, devena, devāya, devāt, devasya, deve, deva depending on case, plus another 16 forms for dual and plural.
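The combinatorics are easy to verify in code; the deva- forms below are the singular endings just listed, the rest is pure enumeration:

```python
from itertools import product

# Eight cases x three genders x three numbers = 72 paradigm cells
cases = ["nom", "acc", "instr", "dat", "abl", "gen", "loc", "voc"]
genders = ["masc", "fem", "neut"]
numbers = ["sg", "du", "pl"]
cells = list(product(cases, genders, numbers))
print(len(cells))  # 72

# Singular slice of the masculine a-stem deva- (forms from the text above)
deva_sg = {
    "nom": "devaḥ", "acc": "devam", "instr": "devena", "dat": "devāya",
    "abl": "devāt", "gen": "devasya", "loc": "deve", "voc": "deva",
}
```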
Verbs are similarly complex: 10 tenses, multiple moods, 3 persons, 3 numbers, active and middle voices.
And because the grammar is encoded in endings rather than word order, Sanskrit has essentially free word order. You cannot use positional heuristics. In English, word order is load-bearing: The dog bit the man $\neq$ The man bit the dog. In Sanskrit, the case ending on the noun does that work instead. So all of these mean the same thing:
रामः रावणं हन्ति
रावणं रामः हन्ति
हन्ति रामः रावणं
In transliteration:

rāmaḥ rāvaṇaṃ hanti
rāvaṇaṃ rāmaḥ hanti
hanti rāmaḥ rāvaṇaṃ

Each ordering translates to “Rāma kills Rāvaṇa.” The -ḥ on rāmaḥ marks the subject, the -ṃ on rāvaṇaṃ marks the object. The verb can go anywhere.
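This order-invariance can be sketched in a few lines. The suffix checks below are drastic simplifications tailored to this one sentence, not a real morphological analyzer:

```python
# Role assignment from endings alone, ignoring position entirely.
def analyze(tokens):
    roles = {}
    for t in tokens:
        if t.endswith("ḥ"):    # nominative marker in this example
            roles["subject"] = t
        elif t.endswith("ṃ"):  # accusative marker in this example
            roles["object"] = t
        else:
            roles["verb"] = t
    return roles

orders = [
    ["rāmaḥ", "rāvaṇaṃ", "hanti"],
    ["rāvaṇaṃ", "rāmaḥ", "hanti"],
    ["hanti", "rāmaḥ", "rāvaṇaṃ"],
]
# Every permutation yields the same analysis
print(all(analyze(o) == analyze(orders[0]) for o in orders))  # True
```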
What all this means is that parsing requires full morphological analysis at every token.
The Tools That Emerged
The computational linguistics response to Sanskrit’s complexity has been substantial. Here’s the landscape:
Pre-deep-learning: The Sanskrit Heritage Platform (Gérard Huet, INRIA) built a rule-based morphological analyzer covering the full declension and conjugation system. Oliver Hellwig’s work produced sandhi-aware segmenters trained on gold-standard annotated data. These rule-based systems have very high precision but require extensive hand-crafted linguistic knowledge to maintain and extend.
CLTK — Classical Language Toolkit is the Python-accessible entry point for most researchers now. It wraps multiple back-end analyzers and provides a unified API:
from cltk import NLP
# Initialize the Sanskrit NLP pipeline
# Downloads models on first run (~500MB)
cltk_nlp = NLP(language="san")
# Analyze a verse from the Bhagavad Gita
verse = "karmaṇy evādhikāras te mā phaleṣu kadācana"
doc = cltk_nlp.analyze(text=verse)
for token in doc.tokens:
print(f"{token.string:<20} lemma={token.lemma:<15} pos={token.pos}")
Expected output:

```
karmaṇi              lemma=karman          pos=NOUN
eva                  lemma=eva             pos=PART
adhikāras            lemma=adhikāra        pos=NOUN
te                   lemma=yuṣmad          pos=PRON
mā                   lemma=mā              pos=PART
phaleṣu              lemma=phala           pos=NOUN
kadācana             lemma=kadācana        pos=ADV
```
CLTK covers tokenization, lemmatization, morphological tagging, and sandhi splitting. The quality is good for classical Sanskrit; it degrades on Vedic Sanskrit, which has different morphological patterns and more archaic vocabulary.
IndicBERT (ai4bharat/indic-bert) is a multilingual BERT model pretrained on 12 Indic languages — including Sanskrit, Hindi, Bengali, Tamil, and others — using a combined dataset of ~9 billion tokens. For Sanskrit NLP tasks that involve understanding semantics (rather than morphological parsing), IndicBERT provides strong embeddings out of the box. It handles Devanāgarī script natively.
ByT5-Sanskrit (arXiv:2409.13920) takes a different approach entirely. Standard transformer tokenizers split text into subword units using BPE or SentencePiece — both of which fail badly on Sanskrit, because the relevant morphological units don’t align with high-frequency byte sequences. ByT5-Sanskrit operates at the byte level, representing text as a sequence of raw UTF-8 bytes. This bypasses the tokenization problem entirely. The model learns to work with the raw character stream.
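The byte-level view is easy to inspect directly. Devanāgarī codepoints fall in a UTF-8 range that encodes to three bytes each, so byte sequences are longer than character sequences, but no vocabulary or merge table is needed:

```python
# Byte-level modeling sidesteps subword vocabularies entirely: the input
# is just the UTF-8 byte sequence of the text.
text = "रामः"  # rāmaḥ in Devanāgarī: 4 codepoints
byte_ids = list(text.encode("utf-8"))
print(len(text), len(byte_ids))  # prints: 4 12
```

A byte-level model like ByT5 consumes `byte_ids` (plus a few special tokens) directly, so sandhi boundaries inside a "word" are never hidden behind an unlucky subword merge.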
The results are significant: ByT5-Sanskrit substantially outperforms subword-tokenized baselines on Sanskrit NLP tasks, particularly for morphological analysis, where the structure lives in the characters rather than the words. I will dive into some hands-on exercises with the ByT5 model in the next chapter of this series; in the meantime, I encourage you to play with the ByT5-Sanskrit model at dharmamitra.org
The Pali Comparison
Pali simplified Sanskrit’s morphological system deliberately. The canon was transmitted orally for centuries — by monks who had to memorize thousands of pages of text and reproduce them precisely. A language with fewer sandhi rules, simpler conjugations, and reduced compound depth is a language that’s easier to transmit without error.
| Feature | Sanskrit | Pali |
|---|---|---|
| Cases | 8 | 6 (dropped some instrumental/locative distinctions) |
| Numbers | 3 (dual retained) | 2 (dual dropped) |
| Sandhi rules | 40+ | ~15 (simplified) |
| Verb tenses | 10 | 4 |
| Max compound depth | 10+ stems | 3-4 stems |
The sound changes from Sanskrit to Pali are systematic. Dharma becomes dhamma (consonant cluster simplified). Karma becomes kamma (same pattern). Nirvāṇa becomes nibbāna (the rv cluster becomes bb by assimilation). These aren’t corruptions — they’re regular, predictable transformations of a related language.
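The regularity can be sketched as ordered string rewrites. The three rules below are assumptions fitted to just these example words, not a full historical phonology:

```python
# Ordered rewrites capturing the three correspondences above.
RULES = [
    ("rv", "bb"),  # cluster assimilation: nirvāṇa -> nibbāṇa
    ("rm", "mm"),  # same pattern: dharma -> dhamma, karma -> kamma
    ("ṇ", "n"),    # retroflex nasal levels to dental in these items
]

def to_pali(word):
    for src, dst in RULES:
        word = word.replace(src, dst)
    return word

for w in ["dharma", "karma", "nirvāṇa"]:
    print(w, "->", to_pali(w))
```

That a handful of ordered replacements reproduces the attested Pali forms is the point: these are systematic correspondences, not noise.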
This makes Pali meaningfully easier to parse computationally, though still considerably harder than English. The tokenization problem is reduced. The morphological complexity is lower. A standard modern NLP pipeline can get further with Pali before breaking.
Where the Field Stands Now
In 2025, Sanskrit NLP is a genuinely active research area. The Dharma MITRA model (arXiv 2601.06400) achieves state-of-the-art machine translation for Sanskrit→English and Pali→English, trained on 1.74M parallel pairs. (The name means “friend of Dharma”; in Brahmanical texts, dharma can mean ethics or moral conduct, while in Buddhist teaching it can refer to the Way or path leading to Nirvana.) ByT5-Sanskrit pushes the frontier on morphological analysis. IndicBERT provides strong multilingual embeddings.
The gap between Sanskrit NLP and English NLP is closing — not by making Sanskrit easier, but by building models robust enough to handle its complexity. Byte-level modeling, multilingual pretraining, and large parallel corpora are all doing work here.
The next post in this series applies these tools to the Vedic corpus: what does BERTopic find when you run topic modeling across the Rigveda, Atharvaveda, and Upanishads? What are the Vedas actually about? What makes Vedic Sanskrit different from Classical Sanskrit?