More and more often, I am hearing startups talk about “synthetic data.” I’ve seen my existing startup investments start to use it, and I have seen entire companies formed around it. So, what is it?
Put simply, it is data created by a machine. Now, why would we do that? Imagine that you want to train a machine vision model to identify a Tesla, but you only have 10 pictures of Teslas to train on, so you need a bigger data set to train a better model. One way to get a bigger data set is to go collect thousands more Tesla pictures. Or you could do some simple manipulation to the pictures you already have to create new pictures instead.
For example, maybe you don’t have a picture of a red Tesla. You could photoshop one of your other pictures to make the Tesla red, and add that red Tesla to your data set so your model performs better at classifying Teslas. What most people use synthetic data for is covering different conditions: they take an image and change the lighting, shadows, and so on, so the machine learning model learns what an object looks like from different angles and in different environments.
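To make that concrete, here is a minimal sketch of that kind of augmentation in Python using Pillow and NumPy. The file names, hue offset, and brightness factors are placeholders I made up for illustration, not from any particular tool:

```python
# A minimal augmentation sketch. The input file and the specific
# offsets/factors below are hypothetical, chosen just to show the idea.
import numpy as np
from PIL import Image, ImageEnhance

def shift_hue(img: Image.Image, offset: int) -> Image.Image:
    """Rotate the hue channel, e.g. to turn a blue car red."""
    hsv = np.array(img.convert("HSV"))
    hsv[..., 0] = (hsv[..., 0].astype(int) + offset) % 256  # hue wraps around
    return Image.fromarray(hsv, mode="HSV").convert("RGB")

def simulate_lighting(img: Image.Image, factor: float) -> Image.Image:
    """Darken (<1.0) or brighten (>1.0) to mimic different conditions."""
    return ImageEnhance.Brightness(img).enhance(factor)

original = Image.open("tesla.jpg")  # hypothetical input image
variants = [
    shift_hue(original, 40),           # a new paint color
    simulate_lighting(original, 0.4),  # dusk
    simulate_lighting(original, 1.5),  # harsh daylight
]
for i, v in enumerate(variants):
    v.save(f"tesla_synthetic_{i}.jpg")
```

Libraries like torchvision or albumentations offer these transforms off the shelf; the sketch just shows the mechanics.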
A common use of synthetic data now is to build data sets for autonomous vehicles. You could create an entire machine-generated city, drive around that city obeying traffic laws, and feed that data into the autonomous vehicle model. This allows you to simulate things that may be harder to capture in real life (e.g., a car running a stop sign).
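As a toy illustration, here is a rough Python sketch of scenario generation. Everything in it is hypothetical: a real setup would render these scenes in a 3D simulator (open-source projects like CARLA do this), while this stub just emits labeled scenario records and deliberately over-represents the rare stop-sign event:

```python
# A toy scenario generator; the Scenario fields and event list are invented
# for illustration, standing in for a full driving simulator.
import random
from dataclasses import dataclass

@dataclass
class Scenario:
    weather: str
    time_of_day: str
    event: str  # the behavior we want the model to see examples of

# Over-represent rare events on purpose: that is the whole point of
# simulating what is hard to capture in real life.
EVENTS = ["normal_driving"] * 8 + ["runs_stop_sign", "jaywalking_pedestrian"]

def generate_scenarios(n: int, seed: int = 0) -> list[Scenario]:
    rng = random.Random(seed)
    return [
        Scenario(
            weather=rng.choice(["clear", "rain", "fog"]),
            time_of_day=rng.choice(["day", "dusk", "night"]),
            event=rng.choice(EVENTS),
        )
        for _ in range(n)
    ]

dataset = generate_scenarios(1000)
rare = sum(1 for s in dataset if s.event == "runs_stop_sign")
print(f"{rare} stop-sign violations out of {len(dataset)} scenarios")
```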
Now, synthetic data isn’t always good for a model. In NLP applications, one criticism is that synthetic data sets generated for training are often very simple (because our language generation techniques are still weak compared to other types of AI), so a model trained on that data fails to capture the nuance and vagaries of messy, real human language. But in other situations, like machine vision, synthetic data tends to work really well.
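A quick way to see the criticism: a lot of synthetic NLP data is built from templates, like the toy sketch below (the template and slot values are invented for illustration). Every generated sentence shares one rigid structure, which is exactly the simplicity critics point at:

```python
# A toy template-based text generator; the template and slots are made up.
import itertools

TEMPLATE = "I want to {verb} a {item} to {city}."
SLOTS = {
    "verb": ["ship", "send", "mail"],
    "item": ["package", "letter"],
    "city": ["Boston", "Austin"],
}

synthetic = [
    TEMPLATE.format(verb=v, item=i, city=c)
    for v, i, c in itertools.product(SLOTS["verb"], SLOTS["item"], SLOTS["city"])
]
print(len(synthetic), "sentences, every one with the same rigid structure")
```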
From a business perspective, there are a few ways to think about synthetic data. First of all, can you use it to generate new variations of things in ways that are valid for training? Secondly, can you use it to produce pre-labeled data, so that humans no longer need to label it? And finally, should you create the synthetic data yourself, or not?
My current hypothesis is that synthetic data will mostly be done by a few third-party platforms in a market that develops into an oligopoly. I think a synthetic data workflow will evolve the way software debugging works today (report a bug -> code a fix -> test on a staging environment -> deploy to production and verify). It would look like this: report a model failure (e.g., the model doesn’t detect things well at night) -> use a synthetic data platform to generate new items for the data set that cover that problem (what things look like at night) -> rebuild the model -> test the model -> deploy the new model to production. Someday it will be push-button easy.
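Here is a rough sketch of what that loop could look like in code. Every function name is hypothetical, standing in for a synthetic data platform’s API and a team’s training pipeline; it shows the shape of the workflow, not any real product:

```python
# A hypothetical push-button loop: none of these functions belong to a real
# platform; they stand in for the steps described above.

def generate_synthetic(condition: str, count: int) -> list[dict]:
    # Step 2: ask a synthetic data platform for new examples targeting
    # the failure mode, e.g. night-time variants of existing scenes.
    return [{"condition": condition, "id": i} for i in range(count)]

def rebuild_model(extra_data: list[dict]) -> str:
    # Step 3: retrain with the augmented data set (stubbed out here).
    return f"model-v2 (+{len(extra_data)} synthetic examples)"

def passes_tests(model: str, condition: str) -> bool:
    # Step 4: an eval suite focused on the original failure mode.
    return True  # placeholder verdict

def fix_model_failure(condition: str) -> str:
    # Step 1 is the report itself, e.g. "model detects poorly at night".
    data = generate_synthetic(condition, count=10_000)
    model = rebuild_model(data)
    if not passes_tests(model, condition):
        raise RuntimeError("synthetic data did not fix the failure mode")
    return model  # Step 5: hand off to deployment

print(fix_model_failure("night"))
```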
This means that if I am right, synthetic data business opportunities will come in two flavors. The first is synthetic data for common objects, where there is lots of data already. These platforms will win by being the easiest to use, connecting to the most workflows, and offering the most common options for data generation. The second is cases where generating the synthetic data is hard because of the nature of the problem space and the lack of existing data sets to start from. That will lead to specialized providers who can master specific domains.
If you are working on AI, soon you will need a synthetic data strategy. And if you are a company in the space, please reach out if you are looking for investment.