Will synthetic data derail generative AI’s momentum or be the breakthrough we need?

Image: Data illustration. Credit: Getty Images/Yaroslav Kushta

With the rise of generative AI, synthetic images and text have become commonplace, but are you familiar with synthetic data? As the name implies, the term refers to artificially generated data that stands in for real data. It is used to build solutions in healthcare, finance, the automotive industry, and, most importantly, artificial intelligence.

Synthetic data is such an integral part of the digital revolution that South by Southwest (SXSW) held an AI session titled “Impact of Simulated Data on AI and the Future,” meant to analyze the technology’s ability to bolster and support generative AI, while also evaluating the potential risks.

The panel featured Mike Hollinger, director of product management, enterprise Gen AI software at NVIDIA; Oji Udezue, CPO at Typeform; and Tahir Ekin, Fields Chair in business analytics at Texas State University, all of whom held a broadly positive outlook on the technology.

“For us, it [synthetic data] makes our ability to build the right thing cheaper and better — which is a holy grail,” said Udezue.

Read on for synthetic data’s potential to advance the AI space, its risks, and the experts’ advice on how to proceed.

The advantages

Synthetic data enables users to simulate real-world insights in situations where collecting actual data would be too costly, too time-consuming, or a privacy risk, such as when sensitive financial information is involved.

Its recent surge in popularity is largely due to its growing role in training and refining machine learning and AI models, which has become increasingly crucial amid the rapid development of these models in the past year.

“With ChatGPT, with Gemini, with Claude, with DeepSeek, with any of these models, inside of that model’s training data is most likely a synthetic generation step,” said Hollinger. “This synthetic data is taking parts of that training material, and it’s amplifying it to give different variations so that I could then train the model to give whatever the output is.”

Synthetic data is especially valuable for AI models because effective training requires large, diverse, high-quality datasets, which can be difficult or impractical to obtain. This is particularly true for niche, proprietary, or original datasets that aren’t readily available through public data scraping.

In a report released last week, research firm Gartner identified synthetic data as one of the top data and analytics trends for 2025. Specifically, the report encourages the use of synthetic data to supplement areas where insight is missing or incomplete or to replace sensitive data to prioritize privacy.

The risks

To create synthetic data, complex algorithms take an original dataset and replicate the patterns, structures, and other characteristics found within it. However, as with any other AI output, the result can deviate from the original in ways that have a significant impact.
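As a rough illustration of the idea of learning patterns from an original dataset and sampling new values from them, here is a minimal Python sketch. It is not how any panelist's pipeline works: it assumes a single numeric column, fits a plain normal distribution to it, and draws new values. The `real_amounts` values and the `synthesize` helper are hypothetical and exist only for this example.

```python
from statistics import NormalDist, mean, stdev

# Hypothetical "real" observations (illustrative values, not from the article).
real_amounts = [12.5, 40.0, 33.2, 27.9, 51.3, 19.8, 45.1, 30.4]


def synthesize(real_values, n, seed=0):
    """Fit a normal distribution to real_values and draw n synthetic values."""
    # The fitted model captures only the mean and spread of the original data.
    model = NormalDist.from_samples(real_values)
    return model.samples(n, seed=seed)


synthetic_amounts = synthesize(real_amounts, 5000)

print(f"real:      mean={mean(real_amounts):.2f} stdev={stdev(real_amounts):.2f}")
print(f"synthetic: mean={mean(synthetic_amounts):.2f} stdev={stdev(synthetic_amounts):.2f}")

# Count synthetic values that fall below anything seen in the real data.
below_min = sum(v < min(real_amounts) for v in synthetic_amounts)
print("synthetic values below the real minimum:", below_min)
```

The synthetic column tracks the original mean and spread, but the last line shows one kind of deviation: a simple fitted model will produce values outside the range that was actually observed, which may or may not be realistic for the scenario.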

To illustrate that idea, Hollinger asked how many hours there were in the day of the conference. It was a tricky question because, technically, that Sunday had only 23 hours due to the switch to daylight saving time.

If a sample were drawn from random days throughout the year, it could include a day from a city that observes daylight saving time, when the day was an hour shorter. A synthetic data pipeline built from that sample would carry the anomaly forward and undermine the model’s accuracy.
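One way to read the daylight saving example is as a sampling problem, sketched below in Python. The setup is an assumption made for illustration: a 365-day year with one 23-hour day and one 25-hour day (the indices are placeholders for the spring and fall clock changes), a five-day sample that happens to include the short day, and a deliberately naive `synthesize_year` helper that gives every synthetic day the sample's average length.

```python
from statistics import mean

# Hypothetical year of day lengths (hours) in a city that observes daylight saving time.
day_hours = [24] * 365
day_hours[67] = 23    # assumed spring-forward day: one hour shorter
day_hours[305] = 25   # assumed fall-back day: one hour longer

# A small sample that happens to include the 23-hour day.
sample = [day_hours[67], day_hours[10], day_hours[120], day_hours[200], day_hours[340]]


def synthesize_year(sample_hours, days=365):
    """Naive generator: every synthetic day gets the sample's average length."""
    return [mean(sample_hours)] * days


synthetic_year = synthesize_year(sample)

real_total = sum(day_hours)
synthetic_total = sum(synthetic_year)

print("real hours in year:     ", real_total)
print("synthetic hours in year:", round(synthetic_total, 1))
print("shortfall in hours:     ", round(real_total - synthetic_total, 1))
```

In this toy setup a single unusual day in a small sample shifts every synthetic day, so the generated year comes up roughly 73 hours short of the real one. A larger or better-grounded sample would dilute the effect.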

Consequently, synthetic datasets must be grounded in the real world to avoid these kinds of inconsistencies and to represent the intended scenario as closely as possible. However, even with this safeguard, and after accounting for entropy, accuracy is often difficult to ensure, according to Udezue.

“Humans are unpredictable in unpredictable ways,” said Udezue. “How do you predict the variation for 8 billion people?”

Beyond the technical challenges, one of the biggest hurdles will be earning user trust when synthetic data is the primary source used to inform and create new solutions. Building that trust requires transparency about how synthetic data is generated, validated, and applied, clearly documented through mechanisms such as model cards.

“The trust aspect — from the user perspective, we are utilizing these AI tools, but how do you feel getting into a self-driving car that wasn’t tested on the road but was only tested using simulated data?” said Ekin.

Looking forward

Despite the challenges, the panel remained optimistic about the technology’s role in the future of AI and beyond. The challenges are real and work remains to be done, but synthetic data’s potential to fuel growth across sectors is still great.

“Simulated data, when correctly used, will elevate science, will elevate software, will elevate the industry, but [we] have to get the governance and transparency right, or we won’t be able to take advantage of it properly,” said Udezue.



Source: https://www.zdnet.com/article/will-synthetic-data-derail-generative-ais-momentum-or-be-the-breakthrough-we-need/