BharatGen Expands to 15 Languages, Strengthens India’s Sovereign Generative AI Vision

BharatGen, India’s sovereign generative AI model, is rapidly emerging as a landmark initiative in the country’s artificial intelligence journey, with a strong focus on linguistic diversity, data sovereignty, and indigenous knowledge systems.

Developed as a multimodal large language model system, BharatGen represents India’s own effort to capture and reflect the nation’s rich linguistic and cultural heritage.

In a discussion on BharatGen at the recent India AI Impact Summit 2026, Professor Arnab Bhattacharya, a Computer Science faculty member at the Indian Institute of Technology Kanpur, shared insights into the project’s vision, development roadmap, and long-term ambitions.

As one of the co-founders, he described BharatGen as a critical step toward building sovereign generative AI tailored specifically for India.

BharatGen Expands from 2 to 15 Languages, Targets All 22 Official Languages

BharatGen was initially built for Hindi and English. Today, the model supports 15 languages, and the team is working toward covering all 22 official languages of India.

According to Prof. Bhattacharya, the structural similarities among Indian languages have made this expansion more efficient.

“First of all, while we have 22 official languages that may appear different, there is significant structural unity among them. In computational linguistics, we have observed this through our research. Whether you consider Sanskrit, Tamil, Bengali, or Hindi – the underlying structural aspects of these languages share many similarities.”

He further explained that BharatGen leverages these structural commonalities. “If a system is developed for one language, extending it to another becomes relatively easier – we do not have to start from scratch because of these structural commonalities.”

While BharatGen aims to cover all 22 official languages, the broader ambition extends further. India has over 2,200 languages, scheduled and unscheduled, including tribal languages such as Santali. The long-term vision is to ensure that every citizen can access AI in their own language.

“The vision is clear: every citizen of the country should be able to use AI in their own language.”

India-Centric Datasets Power BharatGen’s Generative AI Model

A defining feature of BharatGen is its India-centric dataset strategy. Unlike global large language models, BharatGen’s training data is rooted in Indian contexts.

“Our dataset is India-centric. Since this is India’s generative AI model, our focus is entirely on India.”

Prof. Bhattacharya highlighted that the development process includes digitizing literature using OCR and incorporating government data, legal data, and newspaper archives.

These datasets are curated specifically to ensure that BharatGen reflects Indian realities, governance systems, and socio-cultural contexts.

This India-focused data approach strengthens BharatGen’s claim as a sovereign generative AI platform built for national priorities.

LegalGen: BharatGen’s Legal AI Component to Support Citizens, Judges, and Lawyers

Within the BharatGen ecosystem, a legal-focused component known as LegalGen is being developed to address challenges in India’s legal system.

Prof. Bhattacharya noted that legal language differs significantly from everyday communication. “Although many of us can speak Hindi or English, legal language is different. Legal language has its own complexity.”

The goal is to enable citizens to access legal consultation through AI-powered systems on their phones, particularly while traveling or when facing urgent legal concerns.

He outlined three key stakeholders in the legal ecosystem: citizens, judges, and lawyers.

“Judges in India are overburdened, and we have a large number of pending cases. AI cannot and should not make judgments — that is a human responsibility. However, AI can assist by summarizing cases and extracting key points. This can help reduce the backlog of pending cases.”

For lawyers and advocates operating within India’s precedent-based legal framework, BharatGen’s LegalGen can identify similar past cases instantly, significantly reducing research time and effort.
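To make the idea of finding similar past cases concrete, here is a minimal sketch of text-similarity ranking. It is not BharatGen’s or LegalGen’s actual method, which the discussion did not describe; the case names, case texts, and the functions `tokenize`, `cosine`, and `rank_cases` are all hypothetical, and a simple bag-of-words cosine similarity stands in for whatever retrieval technique the real system uses.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def rank_cases(query: str, cases: dict[str, str]) -> list[str]:
    """Return case names ordered from most to least similar to the query."""
    q = Counter(tokenize(query))
    scored = [(cosine(q, Counter(tokenize(text))), name) for name, text in cases.items()]
    return [name for _, name in sorted(scored, reverse=True)]


# Hypothetical case summaries, invented purely for illustration.
cases = {
    "case_a": "tenant eviction dispute over unpaid rent",
    "case_b": "trademark infringement by a rival company",
    "case_c": "landlord seeks eviction of tenant for unpaid rent arrears",
}

print(rank_cases("eviction of a tenant for unpaid rent", cases))
```

In this toy example the two tenancy disputes rank above the unrelated trademark case. A production system would need to handle legal terminology, multiple Indian languages, and far larger archives.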

Integrating Indian Knowledge Systems into BharatGen

BharatGen also seeks to integrate India’s ancient knowledge systems into modern generative AI frameworks. Discussions around the Indian Knowledge System (IKS), now recognized under the Ministry of Education, are increasingly influencing AI discourse.

Prof. Bhattacharya drew parallels between modern generative AI and classical Indian texts.

“Artificial Intelligence, particularly generative systems, has parallels in our ancient texts. Take Panini’s Ashtadhyayi, the Sanskrit grammar treatise – it is essentially a generative system. If you examine it carefully, it functions very much like a rule-based AI system.”

He also referenced Pingala’s Chhanda Sutra and Samarangana Sutradhara by King Bhoja as examples of structured logical and mechanical constructs resembling early conceptualizations of automation.

He pointed out that many Indian contributions are often absent from global AI references. For instance, what is widely known as the Pythagorean theorem was described earlier in the Baudhayana Sulba Sutra, and the Fibonacci sequence was known in India as the Virahanka series.
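The link between the Virahanka series and the Fibonacci sequence can be illustrated with a short sketch. In Sanskrit prosody, a short syllable (laghu) lasts one matra and a long syllable (guru) lasts two; counting the syllable patterns that fill a given number of matras yields the same numbers. The function names `count_meters` and `list_meters` below are illustrative choices, not taken from any source.

```python
def count_meters(total_matras: int) -> int:
    """Count short/long syllable patterns whose durations sum to total_matras.

    A short syllable (laghu) lasts 1 matra; a long syllable (guru) lasts 2.
    """
    # ways[k] holds the number of patterns of total duration k.
    ways = [1, 1]  # the empty pattern for 0 matras; one short syllable for 1
    for k in range(2, total_matras + 1):
        # A pattern of duration k ends in a short syllable (k - 1 matras
        # before it) or in a long syllable (k - 2 matras before it).
        ways.append(ways[k - 1] + ways[k - 2])
    return ways[total_matras]


def list_meters(total_matras: int) -> list[str]:
    """Enumerate the patterns themselves, with 'S' for short and 'L' for long."""
    if total_matras == 0:
        return [""]
    if total_matras == 1:
        return ["S"]
    return [p + "S" for p in list_meters(total_matras - 1)] + [
        p + "L" for p in list_meters(total_matras - 2)
    ]


print([count_meters(n) for n in range(1, 9)])  # [1, 2, 3, 5, 8, 13, 21, 34]
print(list_meters(4))  # ['SSSS', 'LSS', 'SLS', 'SSL', 'LL']
```

Each count is the sum of the two before it, which is the recurrence now associated with Fibonacci.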

Since BharatGen is built for India’s advancement, the model aims to integrate and highlight these indigenous references within its knowledge base.

“Sovereign generative AI means bringing forward our traditions, culture, languages, and texts. When we ask questions to BharatGen, the answers should reflect our context and knowledge base.”

Data Sovereignty at the Core of BharatGen’s Vision

Data sovereignty forms a central pillar of BharatGen’s development philosophy. With a population of 1.4 billion, India represents one of the world’s largest and most diverse data ecosystems.

Prof. Bhattacharya emphasized that much of India’s data currently resides on foreign servers. BharatGen aims to build a complete AI ecosystem within India – encompassing servers, training infrastructure, software, engineers, scientists, startups, government stakeholders, media, and users.

“True sovereign generative AI means our data, our servers, our training infrastructure, our software, our engineers, scientists, and users – all within our control. Only then can we ensure complete control and governance. Control is extremely important.”

He also highlighted the importance of capturing India’s oral traditions and folk knowledge. By building speech models and recording dialects across villages, BharatGen aims to preserve and integrate knowledge that was never formally documented.

“This data is uniquely ours and must become part of our sovereign AI system. Work is already underway in this direction.”

A Call to Contribute to BharatGen’s Development

Concluding the discussion, Prof. Bhattacharya issued a call to action for citizens, engineers, data collectors, annotators, and AI professionals.

“I would like to conclude by saying: since this is our country’s model, built for our country, we as Indians must come forward. Whether you are an engineer, a data collector, an annotator – everyone must contribute. If we do not step forward, no one from outside will build this for us.”

BharatGen, often described as India’s own version of ChatGPT, aspires to go beyond replication. It aims not only to serve Indian citizens but also to emerge as a global model for sovereign generative AI grounded in linguistic diversity, cultural context, and data sovereignty.

Author

  • Salil Urunkar

    Salil Urunkar is a senior journalist and the editorial mind behind Sahyadri Startups. With years of experience covering Pune’s entrepreneurial rise, he’s passionate about telling the real stories of founders, disruptors, and game-changers.