What are tokens and why do we need them?
Computers only understand numbers and the relationships between them. Everything a computer processes must first be converted into bits. Hence the first step in parsing natural language is to convert long sequences of text into manageable chunks. This process of breaking a long sequence of text into smaller units is called tokenization. There are broadly four categories of tokenization:
- Sentence tokenization: Breaking text into individual sentences (or, at a coarser level, sections and paragraphs).
- Word tokenization: Separating sequences of text by whitespace and punctuation.
- Subword tokenization: Dividing words further into more meaningful subwords, e.g., unhappiness → un + happy + ness
- Character tokenization: As the name suggests, this is the lowest level of tokenization where long sequences of text are literally broken down at the character level.
So the process boils down to converting text into tokens and maintaining an inventory of them. The set of all unique tokens makes up the vocabulary. Feel free to use OpenAI’s tokenizer playground to see tokenization in action.
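These levels can be sketched in a few lines of Python. The regexes below are naive stand-ins for real tokenizers, just enough to show the idea:

```python
import re

text = "Tokenization breaks text into units. Unhappiness has subwords."

# Sentence tokenization: split after sentence-final punctuation (naive rule)
sentences = re.split(r"(?<=[.!?])\s+", text)

# Word tokenization: split on whitespace and punctuation
words = re.findall(r"[a-z]+", text.lower())

# Character tokenization: the lowest level
chars = list(words[-1])

# The vocabulary is the set of unique tokens seen so far
vocab = sorted(set(words))

print(sentences)
print(vocab)
```

Real tokenizers handle abbreviations, hyphenation, and non-Latin scripts; this sketch ignores all of that.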
The Tokenization Problem of Sanskrit
Most languages separate words with whitespace. Sanskrit, being an Indo-European language, does it too, loosely — but the written words often don’t correspond to the underlying morphological units, because of a process called sandhi (संधि, joining).
Sandhi is phonological fusion at word boundaries. When two words are spoken or written in sequence, the final sound of the first word and the initial sound of the second word interact according to a set of rules, sometimes merging into a single sound, sometimes transforming each other. The written text reflects this fusion.
A few examples:
| Word 1 | Word 2 | Fused form | Rule |
|---|---|---|---|
| rāma | uvāca | rāmovāca | a + u → o |
| tat | ca | tacca | t + c → cc (assimilation) |
| devī | api | devyapi | ī + a → ya (glide insertion) |
| sat | ānandam | sadānandam | t + ā → dā (voicing) |
| manas | īśa | manasīśa | s + ī → sī (visarga sandhi) |
The fused form is what appears in the written text. A tokenizer splitting on whitespace gets rāmovāca as a single token. It has no way of knowing this is two words, and without that knowledge it cannot look either word up in a dictionary, parse either one morphologically, or understand what the sentence means.
There are dozens of sandhi rules. They interact. They’re partly deterministic (given the sounds, the output is predictable), but splitting — going from rāmovāca back to rāma + uvāca — is hard because it requires knowing which splits are morphologically valid. The same fused string could sometimes result from multiple underlying word pairs.
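To see why splitting is harder than joining, here is a toy reverse-sandhi sketch. The rule table and lexicon are tiny assumptions for illustration, not a real Sanskrit engine:

```python
# Toy reverse-sandhi: undo a couple of vowel-sandhi rules and keep only
# splits whose halves are attested in a (tiny, assumed) lexicon.
REVERSE_RULES = {
    "o": [("a", "u"), ("a", "ū")],  # a + u/ū -> o
    "e": [("a", "i"), ("a", "ī")],  # a + i/ī -> e
}
LEXICON = {"rāma", "uvāca", "tat", "ca"}  # illustrative entries only

def candidate_splits(surface):
    """Propose (left, right) word pairs whose fusion could yield `surface`."""
    out = []
    for i, ch in enumerate(surface):
        for left_final, right_initial in REVERSE_RULES.get(ch, []):
            left = surface[:i] + left_final
            right = right_initial + surface[i + 1:]
            if left in LEXICON and right in LEXICON:
                out.append((left, right))
    return out

print(candidate_splits("rāmovāca"))  # [('rāma', 'uvāca')]
```

The key point is the lexicon check: the rules alone overgenerate, and only morphological knowledge prunes the candidates. Real systems carry dozens of interacting rules and a full morphological analyzer instead of a four-word set.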
The Compound Problem
Sanskrit compounds (samāsa, समास) make the tokenization problem worse. Sanskrit freely concatenates stems into compound words, and these compounds can be long.
The compound types matter for interpretation:
- Tatpuruṣa: determinative compounds — rājakumāra = rāja (king) + kumāra (son) = “king’s son”
- Bahuvrīhi: exocentric compounds that describe something else — kamalalochana (lotus-eyed) doesn’t refer to a lotus or an eye, but to a person with lotus-shaped eyes
- Dvandva: copulative compounds — ahorātra = aha (day) + rātra (night) = “day-and-night”
A single Sanskrit compound can carry what English would express as a full noun phrase. Sanskrit texts routinely produce compounds of 8-10 stems. Decomposing them requires not just finding the morphological boundary points but choosing the semantically coherent parse — because the same character sequence might be segmentable in multiple ways, not all of which are real words.
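The ambiguity can be made concrete with a small word-break search over a toy lexicon (the entries are illustrative, not a real dictionary):

```python
# Word-break over a toy lexicon: enumerate every way to cover the string
# with lexicon entries. Multiple covers = multiple competing parses.
def segmentations(s, lexicon):
    if not s:
        return [[]]
    results = []
    for i in range(1, len(s) + 1):
        head = s[:i]
        if head in lexicon:
            for rest in segmentations(s[i:], lexicon):
                results.append([head] + rest)
    return results

toy_lexicon = {"rāja", "kumāra", "rājakumāra"}
print(segmentations("rājakumāra", toy_lexicon))  # two competing parses
```

Even this trivial lexicon yields two valid covers; picking between them requires semantics, which is exactly where whitespace-based tokenizers give up.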
The Morphological Complexity
Even after you’ve correctly split sandhi and decomposed compounds, each token is still a morphologically complex unit. Sanskrit nouns decline across:
- 8 cases (vibhakti): nominative, accusative, instrumental, dative, ablative, genitive, locative, vocative. This used to scare the heck out of me as a kid — I remember we were required to memorize the vibhakti of some commonly used nouns in Sanskrit. Lol, I still remember (राम, रामौ, रामाः …)
- 3 genders (liṅga): masculine, feminine, neuter
- 3 numbers (vacana): singular, dual, plural
That’s 72 paradigm cells across the system (a given noun, its gender fixed, realizes 24 of them; adjectives, which agree in gender, range over all 72) — and many of the forms look different. A stem like deva (god) appears as devaḥ, devam, devena, devāya, devāt, devasya, deve, deva depending on case, plus another 16 forms for dual and plural.
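The combinatorics are easy to verify in code; the deva- forms below are the singular endings just listed, the rest is pure enumeration:

```python
from itertools import product

# Eight cases x three genders x three numbers = 72 paradigm cells
cases = ["nom", "acc", "instr", "dat", "abl", "gen", "loc", "voc"]
genders = ["masc", "fem", "neut"]
numbers = ["sg", "du", "pl"]
cells = list(product(cases, genders, numbers))
print(len(cells))  # 72

# Singular slice of the masculine a-stem deva- (forms from the text above)
deva_sg = {
    "nom": "devaḥ", "acc": "devam", "instr": "devena", "dat": "devāya",
    "abl": "devāt", "gen": "devasya", "loc": "deve", "voc": "deva",
}
```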
Verbs are similarly complex: 10 tenses, multiple moods, 3 persons, 3 numbers, active and middle voices.
And because the grammar is encoded in endings rather than word order, Sanskrit has essentially free word order. You cannot use positional heuristics. In English, word order is load-bearing: The dog bit the man $\neq$ The man bit the dog. In Sanskrit, the case ending on the noun does that work instead. So all of these mean the same thing:
रामः रावणं हन्ति
रावणं रामः हन्ति
हन्ति रामः रावणं
In transliteration:

rāmaḥ rāvaṇaṃ hanti
rāvaṇaṃ rāmaḥ hanti
hanti rāmaḥ rāvaṇaṃ

Each ordering translates to “Rāma kills Rāvaṇa.” The -ḥ on rāmaḥ marks the subject, the -ṃ on rāvaṇaṃ marks the object. The verb can go anywhere.
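This order-invariance can be sketched in a few lines. The suffix checks below are drastic simplifications tailored to this one sentence, not a real morphological analyzer:

```python
# Role assignment from endings alone, ignoring position entirely.
def analyze(tokens):
    roles = {}
    for t in tokens:
        if t.endswith("ḥ"):    # nominative marker in this example
            roles["subject"] = t
        elif t.endswith("ṃ"):  # accusative marker in this example
            roles["object"] = t
        else:
            roles["verb"] = t
    return roles

orders = [
    ["rāmaḥ", "rāvaṇaṃ", "hanti"],
    ["rāvaṇaṃ", "rāmaḥ", "hanti"],
    ["hanti", "rāmaḥ", "rāvaṇaṃ"],
]
# Every permutation yields the same analysis
print(all(analyze(o) == analyze(orders[0]) for o in orders))  # True
```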
What all this means is that parsing requires full morphological analysis at every token.
The Tools That Emerged
The computational linguistics response to Sanskrit’s complexity has been substantial. Here’s the landscape:
Pre-deep-learning: The Sanskrit Heritage Platform (Gérard Huet, INRIA) built a rule-based morphological analyzer covering the full declension and conjugation system. Oliver Hellwig’s work produced sandhi-aware segmenters trained on gold-standard annotated data. These rule-based systems have very high precision but require extensive hand-crafted linguistic knowledge to maintain and extend.
CLTK — Classical Language Toolkit is the Python-accessible entry point for most researchers now. It wraps multiple back-end analyzers and provides a unified API:
from cltk import NLP
# Initialize the Sanskrit NLP pipeline
# Downloads models on first run (~500MB)
cltk_nlp = NLP(language="san")
# Analyze a verse from the Bhagavad Gita
verse = "karmaṇy evādhikāras te mā phaleṣu kadācana"
doc = cltk_nlp.analyze(text=verse)
for token in doc.tokens:
print(f"{token.string:<20} lemma={token.lemma:<15} pos={token.pos}")
Expected output:

```
karmaṇi              lemma=karman          pos=NOUN
eva                  lemma=eva             pos=PART
adhikāras            lemma=adhikāra        pos=NOUN
te                   lemma=yuṣmad          pos=PRON
mā                   lemma=mā              pos=PART
phaleṣu              lemma=phala           pos=NOUN
kadācana             lemma=kadācana        pos=ADV
```
CLTK covers tokenization, lemmatization, morphological tagging, and sandhi splitting. The quality is good for classical Sanskrit; it degrades on Vedic Sanskrit, which has different morphological patterns and more archaic vocabulary.
IndicBERT (ai4bharat/indic-bert) is a multilingual BERT model pretrained on 12 Indic languages — including Sanskrit, Hindi, Bengali, Tamil, and others — using a combined dataset of ~9 billion tokens. For Sanskrit NLP tasks that involve understanding semantics (rather than morphological parsing), IndicBERT provides strong embeddings out of the box. It handles Devanāgarī script natively.
ByT5-Sanskrit (arXiv:2409.13920) takes a different approach entirely. Standard transformer tokenizers split text into subword units using BPE or SentencePiece — both of which fail badly on Sanskrit, because the relevant morphological units don’t align with high-frequency byte sequences. ByT5-Sanskrit operates at the byte level, representing text as a sequence of raw UTF-8 bytes. This bypasses the tokenization problem entirely. The model learns to work with the raw character stream.
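The byte-level view is easy to inspect directly. Devanāgarī codepoints fall in a UTF-8 range that encodes to three bytes each, so byte sequences are longer than character sequences, but no vocabulary or merge table is needed:

```python
# Byte-level modeling sidesteps subword vocabularies entirely: the input
# is just the UTF-8 byte sequence of the text.
text = "रामः"  # rāmaḥ in Devanāgarī: 4 codepoints
byte_ids = list(text.encode("utf-8"))
print(len(text), len(byte_ids))  # prints: 4 12
```

A byte-level model like ByT5 consumes `byte_ids` (plus a few special tokens) directly, so sandhi boundaries inside a "word" are never hidden behind an unlucky subword merge.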
The results are significant: ByT5-Sanskrit substantially outperforms subword-tokenized baselines on Sanskrit NLP tasks, particularly for morphological analysis, where the structure lives in the characters rather than the words. I will dive into some hands-on exercises with the ByT5 model in the next chapter of this series; in the meantime, I encourage you to play with the ByT5-Sanskrit model at dharmamitra.org
The Pali Comparison
Pali simplified Sanskrit’s morphological system deliberately. The canon was transmitted orally for centuries — by monks who had to memorize thousands of pages of text and reproduce them precisely. A language with fewer sandhi rules, simpler conjugations, and reduced compound depth is a language that’s easier to transmit without error.
| Feature | Sanskrit | Pali |
|---|---|---|
| Cases | 8 | 6 (dropped some instrumental/locative distinctions) |
| Numbers | 3 (dual retained) | 2 (dual dropped) |
| Sandhi rules | 40+ | ~15 (simplified) |
| Verb tenses | 10 | 4 |
| Max compound depth | 10+ stems | 3-4 stems |
The sound changes from Sanskrit to Pali are systematic. Dharma becomes dhamma (consonant cluster simplified). Karma becomes kamma (same pattern). Nirvāṇa becomes nibbāna (the rv cluster becomes bb by assimilation). These aren’t corruptions — they’re regular, predictable transformations of a related language.
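The regularity can be sketched as ordered string rewrites. The three rules below are assumptions fitted to just these example words, not a full historical phonology:

```python
# Ordered rewrites capturing the three correspondences above.
RULES = [
    ("rv", "bb"),  # cluster assimilation: nirvāṇa -> nibbāṇa
    ("rm", "mm"),  # same pattern: dharma -> dhamma, karma -> kamma
    ("ṇ", "n"),    # retroflex nasal levels to dental in these items
]

def to_pali(word):
    for src, dst in RULES:
        word = word.replace(src, dst)
    return word

for w in ["dharma", "karma", "nirvāṇa"]:
    print(w, "->", to_pali(w))
```

That a handful of ordered replacements reproduces the attested Pali forms is the point: these are systematic correspondences, not noise.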
This makes Pali meaningfully easier to parse computationally, though still considerably harder than English. The tokenization problem is reduced. The morphological complexity is lower. A standard modern NLP pipeline can get further with Pali before breaking.
Where the Field Stands Now
In 2025, Sanskrit NLP is a genuinely active research area. The Dharma MITRA model (arXiv 2601.06400) achieves state-of-the-art machine translation for Sanskrit→English and Pali→English, trained on 1.74M parallel pairs. (The name means “friend of Dharma”; in Brahmanical texts, dharma can mean ethics or moral conduct, while in Buddhist teaching it can refer to the Way or path leading to Nirvana.) ByT5-Sanskrit pushes the frontier on morphological analysis. IndicBERT provides strong multilingual embeddings.
The gap between Sanskrit NLP and English NLP is closing — not by making Sanskrit easier, but by building models robust enough to handle its complexity. Byte-level modeling, multilingual pretraining, and large parallel corpora are all doing work here.
The next post in this series applies these tools to the Vedic corpus: what does BERTopic find when you run topic modeling across the Rigveda, Atharvaveda, and Upanishads? What are the Vedas actually about? What makes Vedic Sanskrit different from Classical Sanskrit?