
As AI adoption accelerates across industries, synthetic data are emerging as a powerful tool for training models, especially when real-world data are scarce or sensitive. They can be a privacy-friendly, cost-effective substitute. While they can mimic patterns, they cannot fully capture the messy, unpredictable edge cases of real life, especially in complex environments like Indian healthcare or regional linguistics. Experts say that while synthetic data are a powerful supplement, they are not a silver bullet.
Synthetic data are artificially generated to replicate the statistical properties of real-world data, serving as a valuable asset in training AI models, Ameeta Roy, Senior Director – Technology and Adoption, Red Hat APAC, explained. Their accuracy and relevance depend on the sophistication of the generation process; high-quality synthetic data can effectively complement real datasets, particularly when access to actual data is limited or poses privacy concerns.
GenAI tools—especially models like Generative Adversarial Networks (GANs), Variational Autoencoders (VAEs) and large language models (LLMs)—are at the core of synthetic data generation today, Jaspreet Bindra, co-founder, AI&Beyond, shared. GANs are popular for creating synthetic images, while LLMs like GPT or LLaMA generate synthetic text, conversations and chat logs.
AI labs in Indian firms like Infosys and startups like Synthetic Data Labs use these tools to simulate customer behaviour, generate realistic transactions or create speech datasets in regional languages. Companies like TCS and Wipro also use synthetic datasets in internal AI experiments, especially where real data access is restricted.
For structured data, tools like Synthetic Data Vault (SDV) are also gaining traction. Domain-specific GenAI, such as Med-PaLM in healthcare, makes it easier to create contextually relevant synthetic datasets for India’s various sectors. However, the effectiveness of these tools depends on their tuning, governance and how well they reflect real-world variability, especially in India’s multilingual and multi-behavioural digital landscape.
Paramdeep Singh, Co-Founder of Shorthills AI, noted that synthetic data can also eliminate some privacy concerns. Collecting real user data, especially in regulated industries like BFSI, healthcare and education, which handle personally identifiable information (PII), can violate privacy regulations. Using synthetic data, however, means no real person’s records are involved, making compliance much easier and lowering the risk of harming someone.
“Synthetic data are best viewed as a supplement rather than a full replacement for real-world data for now. While it helps overcome data scarcity, privacy constraints or regulatory hurdles, it often lacks the messiness and unpredictability of actual user behaviour. Models trained solely on synthetic data may underperform in real-world deployments due to oversimplified assumptions or gaps in behavioural nuance. Supplementing real data with high-quality synthetic datasets improves model robustness and fairness. Ultimately, it is not a complete substitute for human-generated complexity,” said Bindra.
While synthetic data offer flexibility and privacy safeguards, they have critical limitations. These datasets may lack the richness and unpredictability of real-world scenarios, especially in complex environments like Indian traffic for autonomous vehicle simulations or regional language nuances in sentiment analysis. This can lead to models performing well in lab settings but faltering in production.
If poorly generated or too reliant on biased real datasets, synthetic data risk reinforcing existing flaws. Simulating edge cases or rare conditions with high fidelity can be difficult. In sectors like healthcare, for example, synthetic data may not fully capture the nuances of patient demographics or rare medical conditions.
Bindra also highlighted that synthetic data are not always accepted by regulators or auditors, which limits their use in high-stakes areas like finance or pharmaceuticals.
Addressing the limitations of synthetic data, Singh shared, “Synthetic data are a simulation of real-life events and so, lack reality. This is the biggest limitation. Real data may look like a normal distribution, but with anomalies, which are difficult to capture. Synthetic data generally cannot capture these anomalies. Also, while there is a risk of bias in actual data, it is far higher in synthetic data generation. Another limitation is the cost of generating synthetic data; the cost and computational requirements of generating a high-quality, large dataset are very high. Creating models that can simulate these kinds of distributions will be difficult.”
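Singh’s point about anomalies can be illustrated with a minimal sketch in Python. Everything in it is an assumption made for illustration: the "real" dataset is itself simulated (mostly normal transaction amounts plus about 1% of very large outliers), the function names and the threshold are hypothetical, and the generator is deliberately naive, fitting only a mean and a standard deviation. Production tools such as GANs or SDV model far more structure, so this shows the failure mode in its simplest form and is not a benchmark of any real tool.

```python
import random
import statistics


def make_real_data(n=10_000, anomaly_rate=0.01, seed=0):
    """Simulate a stand-in for 'real' data: mostly normal, with rare large outliers.

    All numbers here are illustrative assumptions, not measurements.
    """
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        if rng.random() < anomaly_rate:
            # Rare anomaly: an unusually large amount.
            data.append(rng.uniform(500.0, 1000.0))
        else:
            # Typical value: roughly normal around 100.
            data.append(rng.gauss(100.0, 15.0))
    return data


def fit_and_sample(real, seed=1):
    """Naive generator: fit a single normal distribution and sample from it."""
    mu = statistics.fmean(real)
    sigma = statistics.stdev(real)
    rng = random.Random(seed)
    return [rng.gauss(mu, sigma) for _ in real]


def count_above(values, threshold):
    """Count how many values exceed the anomaly threshold."""
    return sum(1 for v in values if v > threshold)


THRESHOLD = 500.0  # hypothetical cut-off for what counts as an anomaly

real = make_real_data()
synthetic = fit_and_sample(real)

real_anomalies = count_above(real, THRESHOLD)
synthetic_anomalies = count_above(synthetic, THRESHOLD)

print(f"mean (real):      {statistics.fmean(real):.1f}")
print(f"mean (synthetic): {statistics.fmean(synthetic):.1f}")
print(f"anomalies above {THRESHOLD} (real):      {real_anomalies}")
print(f"anomalies above {THRESHOLD} (synthetic): {synthetic_anomalies}")
```

Under these assumptions the two datasets have nearly the same mean, yet the synthetic one contains far fewer values above the threshold than the real one. The summary statistics are reproduced while the rare cases that matter most in production are smoothed away.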
Published on May 9, 2025
This article first appeared on The Hindu Business Line