Simulated AI Training Data Supplier Datagen Raises $18.5 Million
Datagen Technologies, the leading developer of simulated, privacy-by-design visual data sets to train deep machine learning models for real-world tasks, has raised $18.5 million led by Viola Ventures, with participation by existing investors including TLV Partners and Spider Capital.
Other notable backers include Michael J. Black of the Max Planck Institute; Gal Chechik, Director of AI at Nvidia; Anthony Goldbloom, CEO and founder of Kaggle; and Trevor Darrell, founder of UC Berkeley’s AI Research Lab. Datagen will use the funding to grow its R&D and expand into new markets.
Datagen stands out in the fast-growing computer vision field by creating visual simulations and recreations of the real world. It positions these as a better alternative to the publicly available datasets that computer vision teams currently use, which often include images of real people and places scraped from the Internet or manually captured in the real world through labor-intensive operations. Datagen’s simulated data creation tools avoid the privacy pitfalls and unconscious biases that arise from these data collection and annotation methods.
According to Gartner, by 2023, 65% of the world’s population will have their personal data covered by modern privacy regulations, compared to 10% in 2020. This promises to make collecting data manually even more complex, especially when seeking to achieve desired diversity and to combat bias.
Datagen creates “human” data, such as body movements and skin textures, but also builds the entire environment by mimicking the statistical patterns found in “real world” data such as lighting and background imagery. It then tests its data against real-world images to ensure accuracy. Controlling the physics of a simulated environment allows machine learning models to be trained more efficiently and at far greater scale, eliminating the current bottleneck of relying on manual collection of real-world imagery. This gives developers data that is free of compliance concerns and that avoids ethnic or gender discrimination. They can use it to build applications without fear of tainting their products with real, personal or sensitive data such as faces or vehicle license plates, issues that have set off alarms with regulators around the world and threaten to slow development of the artificial intelligence industry.
“It’s not just that simulated data is always better than real world data collection, it’s that it addresses problems which are just unsolvable without it,” said Rona Segev, founding partner at TLV and Datagen’s earliest and largest investor. “I think it’s an enabler for the whole AI industry. Without simulated data, the industry will slow,” she said.
The company, which was founded in 2018 to create a platform for fully synthetic, privacy-by-design data sets for AI applications, counts three of the top US tech giants, as well as the AI research arms of several global consumer manufacturing giants, as customers.
“Our customers have full control over all the parameters that go into the data they create,” Datagen Co-founder and CEO Ofir Chakon said. “The real-world implication is that, once deployed, you can be sure it’s going to work well in different domains, with different ethnicities, in different geographic locations or any environment you can imagine.”
Zvika Orron, General Partner at Viola Ventures, said: “Datagen is tapping into a whole new market that could accelerate the use of AI. The potential here is tremendous: We estimate that synthetic data may surpass real data for training and testing. Maybe most importantly, Datagen’s solutions enable the democratization of AI, giving smaller companies, not just tech giants, access to proprietary, high-quality machine learning training data.”