Technology Policy - India

India Exploring Indigenous Foundational Model with Low Availability of Data


India’s Ministry of Electronics and Information Technology (MeitY) has been exploring the development of a foundational language model, similar to GPT-3, that can understand and generate text in various Indian languages. Speaking at GIFT City, Gandhinagar, Additional Secretary Abhishek Singh announced that MeitY would issue a call for proposals for building a foundational model, citing the geo-economic competition from the US (OpenAI, Google) and China (DeepSeek). However, the US and China could build their foundational models by leveraging large corpora of data. This poses a significant challenge for India, given the limited availability of high-quality data in Indian languages.

A closer look at the challenges reveals that defining and developing the foundational model is a tough task, since India is home to numerous languages, many of which are not available digitally. Training the model in Indian languages would be an uphill task because it requires a wide variety of data within each language, including dialects and usage in context. Such rich and holistic datasets are yet to be made available, and most Indian languages do not have a rich digital presence. Adding to the problem, most premier institutions, including the IITs and IIMs, have English as their medium of instruction, which reduces the incentive to develop knowledge repositories in Indian languages. Even if the Indian model is built using only English, the availability of diverse, contextual, culturally rich datasets remains a problem.

However, building an Indian foundational model would be an advantage to India. The way DeepSeek responds to some questions makes it clear that its guardrails have been set with the political and social restrictions it operates under in mind. This provides a rationale for India to invest in developing its own foundational model. An alternative to building one is to impose design restrictions on existing models, or perhaps to use retrieval-augmented generation (RAG) combined with layered restrictions that prevent the existing models from responding to certain queries. Further, the Indian market’s dependence on existing foundational models would entrench an oligopoly. Such dominance could also be reflected in knowledge generation, since researchers and firms would rely on these models.

Perhaps the initial investment should go into building datasets. Without a holistic, inclusive dataset in Indian languages, building a foundational model will remain an uphill task.