What Gets Embedded, Gets Remembered
5/12/2025
"I've left OpenAI! Already miss everyone on the Training team & my friends ❤️ but very excited to soon announce what's next. Until then, I'll be taking a break to solve OCR for Sanskrit so we can immortalize the classical Indian literary canon in the weights of superintelligence." — Rohan Pandey (@khoomeik)
Rohan Pandey's tweet may read like a typical tech career update. But buried in the second sentence is a larger, more consequential shift in attention: away from building intelligence, toward ensuring intelligence doesn't forget.
The problem he's tackling—OCR for Sanskrit—isn't just a technical bottleneck. It's a bottleneck to memory. Sanskrit texts, especially in handwritten or degraded forms, are difficult for machines to read. Standard OCR systems don't generalize across Indic scripts. Without structured, digitized input, these texts stay inaccessible to modern machine learning models. And models can't learn what they can't see.
This isn't just about superintelligence. It's about what enters the epistemic future—and what gets left behind.
The Philippine problem
I'm not from OpenAI. I'm a recent CS graduate from a public university in the Philippines, working with scraped PDFs, volunteer dictionaries, and low-resource tools. But I think about the same problem every day, just in a different setting.
The Philippines has more than 180 documented languages. At least four have already died. Dozens more are in steep decline. Languages like Dicamay Agta or Tayabas Ayta disappeared within living memory. Most others have no standard orthography, no print culture, and no computational presence.
Colonial education, centralized media, and the pursuit of economic mobility have created a steep hierarchy. English is the language of upward mobility. Filipino, the Tagalog-based national language, is the language of national integration. Everything else is relegated to private use, if it survives at all. Many languages don't.
The result is a predictable sequence: declining transmission, no writing system, no documentation, then disappearance. By the time linguists mark a language as endangered, the infrastructure needed to save it often doesn't exist.
Languages don't need sympathy—they need structure
To preserve a language in the 21st century, you need more than good intentions or cultural pride. You need data infrastructure. This includes:
- Text corpora (in consistent encoding)
- Orthography (ideally standardized)
- Audio recordings with metadata
- Basic NLP tooling: tokenizers, analyzers, maybe even speech recognition
Without these, a language can't be included in language models. And if it's not included, it won't be translated, summarized, or surfaced by AI systems. It won't appear in autocomplete. It won't have a voice in search results. It won't be remembered.
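This isn't theoretical. Several of the ten most spoken Philippine languages have Wikipedia editions, but only a few have an active editing community, and much of the content in the largest ones is bot-generated. Machine translation coverage beyond Filipino and Cebuano is recent and uneven. Most of the remaining languages have no digital footprint beyond scattered religious tracts or PDF scans of colonial grammars.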
If AI systems are how people will access knowledge, then most Philippine languages are already on the margins of collective memory.
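To make the "consistent encoding" item in the checklist above concrete, here is a minimal sketch of a normalization pass for scraped text. It assumes the corpus is already decoded to Unicode strings. The function name `normalize_text` and the choice to map curly apostrophes to ASCII are my own illustrative assumptions, not an established standard for any Philippine orthography.

```python
import unicodedata

# Assumption: curly apostrophes are folded to the ASCII apostrophe.
# A real project should decide this per language and document it.
APOSTROPHES = {"\u2018": "'", "\u2019": "'"}


def normalize_text(text):
    """Return text in NFC form with unified apostrophes and single spaces."""
    # NFC composes base letter + combining mark into one code point where possible.
    text = unicodedata.normalize("NFC", text)
    for src, dst in APOSTROPHES.items():
        text = text.replace(src, dst)
    # Collapse runs of whitespace (including newlines from PDF extraction).
    return " ".join(text.split())


composed = "pi\u00f1a"     # "ñ" as a single code point
decomposed = "pin\u0303a"  # "n" followed by a combining tilde

print(composed == decomposed)                                  # False
print(normalize_text(composed) == normalize_text(decomposed))  # True
```

The two spellings look identical on screen, but until they are normalized a tokenizer or dictionary lookup treats them as different words, which quietly shrinks an already small corpus.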
A local working model
What I'm doing now, slowly, with limited compute:
- Scraping and structuring open-source dictionaries and grammar books from colonial and postcolonial sources
- Extracting affixation, reduplication, and variant spellings to build rule-based tokenizers
- Building basic morphological analyzers for major regional languages
- Creating structured datasets (JSON, CSV, XML) from inconsistent and handwritten sources
- Fine-tuning small language models on Cebuano, Ilocano, and Tagalog using scraped or publicly available corpora
- Generating audio-text pairs from community recordings for future ASR models
- Advocating for government digitization of endangered-language materials before they physically deteriorate
None of this is glamorous. It's slow, archival work. But without it, higher-level language modeling is impossible.
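As a rough illustration of the rule-based analysis mentioned in the list above, here is a small sketch for Tagalog. Everything in it is an assumption made for the example: the five-word `ROOTS` lexicon, the short affix lists, and the `analyze` function are hypothetical and far from complete. It is not the tooling I actually run, only the general shape of the technique: strip a prefix, infix, reduplicated syllable, or suffix, and accept an analysis only if it bottoms out in a known root.

```python
import re

# Hypothetical mini-lexicon; real roots would come from a digitized dictionary.
ROOTS = {"sulat", "bili", "kain", "luto", "basa"}

# A deliberately small, non-exhaustive subset of Tagalog affixes.
PREFIXES = ("mag", "nag", "pag", "ma", "na", "i")
SUFFIXES = ("an", "in")

# -um- and -in- follow the first consonant, or lead a vowel-initial stem.
INFIX_RE = re.compile(r"^([^aeiou]?)(um|in)(.+)$")
# Partial reduplication copies the first (C)V of the stem: su+sulat, lu+luto.
REDUP_RE = re.compile(r"^([^aeiou]?[aeiou])(\1.*)$")


def analyze(word):
    """Return [(label, morpheme), ...] containing a known root, or None."""
    word = word.lower()
    if word in ROOTS:
        return [("root", word)]
    for prefix in PREFIXES:
        if word.startswith(prefix) and len(word) > len(prefix):
            rest = analyze(word[len(prefix):])
            if rest:
                return [("prefix", prefix)] + rest
    match = INFIX_RE.match(word)
    if match:
        rest = analyze(match.group(1) + match.group(3))
        if rest:
            return [("infix", match.group(2))] + rest
    match = REDUP_RE.match(word)
    if match:
        rest = analyze(match.group(2))
        if rest:
            return [("redup", match.group(1))] + rest
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix):
            rest = analyze(word[:-len(suffix)])
            if rest:
                return rest + [("suffix", suffix)]
    return None  # no rule path reaches a known root


for w in ("sumusulat", "nagluluto", "kinain", "sulatan", "lalaki"):
    print(w, analyze(w))
```

The lexicon check is what keeps the rules honest: "lalaki" looks reduplicated on the surface, but "laki" is not in this toy root list, so the sketch returns None instead of inventing an analysis. Irregular forms such as "bilhin" (from "bili" plus "-in") would need extra rules that this version leaves out.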
Optimization penalizes the forgotten
Modern LLMs are optimized for coverage and utility. This isn't a moral failure; it's a statistical one. Languages with small corpora contribute too few examples to meaningfully influence gradient updates. Subword vocabularies are learned from frequency counts, so words from these languages get split into many short fragments during tokenization. They fall outside the training distribution.
This is why we shouldn't wait for OpenAI, Google, or Hugging Face to include us. If we want our languages inside these systems, we need to feed them ourselves. Pretraining data shapes model behavior. If your language isn't in the dataset, it won't be in the model. It's that simple.
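The tokenization effect described above can be reproduced at toy scale. The sketch below is a simplified byte-pair-encoding trainer written from scratch, not the tokenizer of any particular model. The word counts are invented for illustration: four English words with high counts and two Cebuano words that each appear once. The helper names (`train_bpe`, `encode`, `tokens_per_char`) are hypothetical.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` with one merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def train_bpe(word_counts, num_merges):
    """Learn an ordered list of merges from a {word: count} mapping."""
    vocab = {tuple(word): count for word, count in word_counts.items()}
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, count in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += count
        if not pairs:
            break
        # Most frequent pair wins; ties are broken by the pair itself.
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        vocab = {merge_pair(s, best): c for s, c in vocab.items()}
    return merges


def encode(word, merges):
    """Apply the learned merges, in order, to a single word."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


def tokens_per_char(words, merges):
    """Average number of tokens spent per character across `words`."""
    return sum(len(encode(w, merges)) for w in words) / sum(len(w) for w in words)


# Invented counts: a dominant language next to a barely present one.
high = {"the": 1000, "language": 800, "model": 800, "training": 600}
low = {"pinulongan": 1, "kahibalo": 1}

merges = train_bpe({**high, **low}, num_merges=10)
print("high-resource tokens/char:", tokens_per_char(high, merges))
print("low-resource tokens/char:", tokens_per_char(low, merges))
print(encode("language", merges), encode("pinulongan", merges))
```

With a small merge budget, every merge is spent on the frequent words, so the two Cebuano words stay split close to one token per character. At real scale the same mechanism means longer sequences, higher cost, and less learned structure for the same sentence.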
The timeline is short
Language death happens quickly. One generation of children who aren't taught their grandparents' tongue is enough. After that, fluency collapses. Then memory. Then everything else.
What disappears with a language isn't just vocabulary—it's a taxonomy of the world, an ontology, a worldview. And if those don't get digitized, they don't get embedded. If they don't get embedded, they don't get remembered by the systems we're building.
This is why Rohan Pandey is working on OCR for Sanskrit. I want to do the equivalent groundwork for the Aeta languages, Itbayat, Ibanag, and other Philippine languages that are statistically invisible to modern models.
Not because they're sacred. But because they're real. And real things shouldn't be erased by omission.
Follow-up posts in this series
- How to Fine-Tune a Language Model on a Dying Language (coming soon)
- The Minimum Viable Corpus for a Low-Resource Tongue (coming soon)
- A Practical Toolkit for Digitizing Endangered Philippine Languages (coming soon)