Retro-Futurism of Synthetic Data

Author: Olga Megorskaya

Title: CEO, Toloka.ai.
About: Olga Megorskaya is the founder and CEO of Toloka AI, a data partner for AI development, combining machine learning and human insight to deliver high-quality data. She specializes in creating business solutions based on human-AI cooperation and crowdsourcing.

Summary: Explore how synthetic data can shape a more positive technological future by enhancing privacy, security, and sustainability. Learn how it combats AI biases and supports environmental research, all while protecting sensitive information. Dive into the potential of synthetic data to reclaim privacy, create equitable AI systems, and reduce the ecological footprint of data-driven research. Discover the promise of a better future driven by synthetic data.

Humans love the idea of technology driving every aspect of our lives. We were enchanted by what we imagined the future could be when we read books like Asimov’s Foundation or watched The Jetsons. We loved the promise that technology held before it became reality.

Today, many of us feel that the path technology has taken, with its attention economy, shrinking privacy, and hyper-consumerism, is the wrong one. We still believe the future promised by Asimov is attainable, if only we could right our wrongs from the past …

The rise of synthetic data is a manifestation of the data science community’s attempts to right these “wrongs”. Let’s discuss the promise of synthetic datasets and how they might help us reclaim our privacy and move closer to a more positive technological future.

Privacy and Security

Synthetic data offers a powerful solution for enhancing privacy and security in data-driven applications. Financial institutions use synthetic data to develop fraud detection systems without exposing real customer information. This allows them to train machine learning models on realistic financial data while preserving individual privacy.
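As a rough illustration of the idea, the sketch below summarizes a handful of transaction amounts by their mean and standard deviation and then samples new amounts from those statistics alone. The sample values, the function names, and the choice of a normal distribution are assumptions made for this example, not a description of how any particular institution works. Sampling from summary statistics is not a formal privacy guarantee, and production generators are considerably more sophisticated.

```python
import random
import statistics


def fit_marginals(real_amounts):
    """Summarize real transaction amounts by mean and standard deviation only."""
    return statistics.mean(real_amounts), statistics.stdev(real_amounts)


def sample_synthetic_amounts(mean, stdev, n, seed=0):
    """Draw n synthetic amounts from a normal distribution, clipped at zero."""
    rng = random.Random(seed)
    return [max(0.0, rng.gauss(mean, stdev)) for _ in range(n)]


# Hypothetical "real" amounts; in practice these would never leave the institution.
real_amounts = [12.5, 40.0, 7.99, 130.0, 55.25, 23.1, 89.9, 15.0]

mean, stdev = fit_marginals(real_amounts)

# Only the two summary statistics are used to generate the synthetic values.
synthetic_amounts = sample_synthetic_amounts(mean, stdev, n=1000)
```

A model development team would then work with synthetic_amounts, while the real values stay behind the institution's access controls.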

In the healthcare industry, synthetic data enables medical research and testing without compromising patient privacy, which is protected under regulations like HIPAA. Researchers can access realistic medical data, such as imaging diagnostics, to develop new treatments and technologies.

Product development teams can leverage synthetic user data to test new features and optimize user experience, without exposing real customer information to potential bugs or vulnerabilities. This enhances security by allowing rigorous testing without risk.

Synthetic data also enables DevOps teams to validate software performance using large, diverse datasets, without compromising real user data. This helps identify and address issues early in the development process, improving overall application security.

Wherever personal data is sensitive yet might significantly improve user experience, every team has a choice: gather, store, and protect sensitive data, or use synthetic data instead. More and more teams are opting for the second choice.

Combatting Data Bias in AI Systems

One of the major challenges with current AI systems is their tendency to perpetuate or even amplify biases present in the training data. Synthetic data can help balance and diversify training datasets so that underrepresented groups are represented more fairly. This helps create more equitable AI systems, such as:

Hiring tools that do not discriminate based on gender, ethnicity, or other protected characteristics;
Medical diagnostics tools that perform well across diverse patient populations;
Credit scoring models that avoid unfairly denying loans to certain demographic groups.
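One simple way to picture dataset balancing is to top up smaller groups with synthetic records until every group is as large as the largest one. The sketch below does this by interpolating between two randomly chosen real records of the same group, in the spirit of SMOTE-style oversampling. The record layout, the field names, and the balance_groups helper are hypothetical and chosen only for illustration.

```python
import random
from collections import defaultdict


def balance_groups(records, group_key, feature_keys, seed=0):
    """Top up smaller groups with synthetic records until all match the largest."""
    rng = random.Random(seed)
    by_group = defaultdict(list)
    for rec in records:
        by_group[rec[group_key]].append(rec)
    target = max(len(recs) for recs in by_group.values())

    balanced = list(records)
    for group, recs in by_group.items():
        for _ in range(target - len(recs)):
            a, b = rng.choice(recs), rng.choice(recs)
            t = rng.random()  # interpolation weight between the two real records
            synthetic = {group_key: group, "synthetic": True}
            for key in feature_keys:
                synthetic[key] = a[key] + t * (b[key] - a[key])
            balanced.append(synthetic)
    return balanced


# Hypothetical toy data: group A has six records, group B only two.
records = (
    [{"group": "A", "experience": 5.0 + i, "score": 60.0 + i} for i in range(6)]
    + [{"group": "B", "experience": 4.0 + i, "score": 62.0 + i} for i in range(2)]
)

balanced = balance_groups(records, "group", ["experience", "score"])
```

Interpolation keeps each synthetic record inside the range spanned by the real records of its group, so it adds volume rather than genuinely new information. That is one reason balancing alone does not guarantee a fair model.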

Synthetic data also allows testing AI systems for potential biases before deploying them in high-stakes real-world applications. This proactive approach to bias mitigation is crucial for building trustworthy and responsible AI.
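A pre-deployment bias check of this kind can be sketched as follows: generate synthetic applicants whose qualification is drawn from the same distribution for every group, run them through the model under test, and compare acceptance rates per group (a gap often called the demographic parity difference). Both models here are hypothetical stand-ins written for the example, as are the group labels and thresholds.

```python
import random


def make_synthetic_applicants(n_per_group, groups, seed=0):
    """Create applicants whose qualification is drawn identically for every group."""
    rng = random.Random(seed)
    return [
        {"group": g, "qualification": rng.uniform(0, 100)}
        for g in groups
        for _ in range(n_per_group)
    ]


def selection_rates(applicants, decide):
    """Return the fraction of applicants accepted by `decide`, per group."""
    totals, accepted = {}, {}
    for a in applicants:
        totals[a["group"]] = totals.get(a["group"], 0) + 1
        accepted[a["group"]] = accepted.get(a["group"], 0) + int(decide(a))
    return {g: accepted[g] / totals[g] for g in totals}


def parity_gap(rates):
    """Difference between the highest and lowest group acceptance rate."""
    return max(rates.values()) - min(rates.values())


def biased_model(applicant):
    # Hypothetical model under test: group B faces a higher threshold.
    threshold = 50 if applicant["group"] == "A" else 70
    return applicant["qualification"] >= threshold


def fair_model(applicant):
    # Hypothetical model that ignores group membership.
    return applicant["qualification"] >= 50


applicants = make_synthetic_applicants(2000, ["A", "B"])
biased_gap = parity_gap(selection_rates(applicants, biased_model))
fair_gap = parity_gap(selection_rates(applicants, fair_model))
```

Because the synthetic groups are equally qualified by construction, a large gap points at the model rather than at the data. A real audit would use richer synthetic profiles and more than one fairness metric.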

Synthetic Data for Environmental Sustainability

In sectors where data collection is resource-intensive and potentially harmful to the environment, synthetic data offers a valuable alternative. For example, in climate modeling and wildlife conservation efforts, synthetic datasets can be generated to simulate various environmental scenarios without the need for disruptive and costly fieldwork. This allows researchers to explore a broader range of conditions and outcomes while minimizing the environmental impact of data collection.

By generating synthetic data, scientists can model the effects of climate change, test mitigation strategies, and predict the impacts on ecosystems and wildlife populations — all without physically disturbing sensitive natural environments.

Similarly, synthetic data can be used to test and validate new “green” technologies, such as renewable energy systems or pollution-reducing industrial processes, without the need for extensive real-world pilots that could have unintended environmental consequences. The simulated data allows for rigorous testing and optimization of these solutions before deployment.

Overall, the use of synthetic data in environmental applications helps reduce the ecological footprint of data-driven research and innovation, making progress towards more sustainable practices across various industries.

How do you leverage synthetic data for your business?

In 2024, data quality is at the forefront of any machine learning solution. Meta recently released Llama-3, and most of the paper describing the model covers data collection and preprocessing rather than the model itself. While models are increasingly seen as a commodity, data remains the “holy grail” of value. In a controversial move, Meta even included a provision in the terms of use for Llama-3 explicitly stating that if a model is trained on synthetic data generated by Llama-3, that model has to have a name that starts with Llama-3. This attempt underscores a bigger trend: we are out of training data. This means that further tangible progress in artificial intelligence will be centered around synthetic data, be it synthetic data for software testing provided by companies such as Accelario, synthetic data for perception models provided by companies like Anyverse, or synthetic data for supervised fine-tuning (SFT) offered by Toloka.

One must remember that synthetic data is not exempt from limitations. In the end, it is synthetic, and it may fail to represent the complexity and diversity of real-world datasets. While real data carries inherent biases, synthetic data can introduce new ones through the decisions made by its creators, and it may still raise privacy concerns if created from protected data. However, these limitations can be mitigated by designing robust best practices and following them when creating synthetic datasets. Ultimately, synthetic data can help us right our “wrongs,” allowing people to reclaim their privacy, trust that systems are not inherently biased, and lessen their environmental impact. The future of data is synthetic, and I would argue that this is the better future we hope for.