The Nucleus of Statistical AI: Feature Engineering Practicalities for Machine Learning
Machine Learning and Deep Learning, its manifestation at scale with abundant compute power, represent the summit of statistical Artificial Intelligence. This technology underpins numerous expressions of statistical AI, including Image Recognition, Speech Recognition, and various Natural Language technologies.
Machine Learning deployments are essential to AI because they detect sophisticated patterns and use them to make predictions. The accuracy of those predictions depends on building Machine Learning models that learn from their results in an iterative process that, ideally, continually improves.
Such predictive success largely depends on building dynamic statistical AI models that adapt over time, and on determining the relevant features on which their learning is based. Although model construction is a core facet of Data Science that involves feature engineering, data transformations, training data, and more, the way those models function can be summarized in a couple of sentences.
According to Cambridge Semantics CTO Sean Martin, “When you’re doing Machine Learning you’re looking for a formula or a line that when you put in one or more values, up pops the missing value. That’s the essence of it.”
Visual approaches to model building that involve embedding, graph technologies, and scatter plots are becoming an increasingly viable means of refining the feature generation process behind the formula Martin described, and of creating credible Machine Learning models that underpin statistical AI.
Embedding is a means of vectorizing data (transforming data into a numerical format) to assist in the feature engineering process. Graph technologies are well suited to this aspect of predictive model building because they preserve the relationships in the data while vectorizing it, providing a visual means of finding the line or formula Martin described.
Doing so starts with the model's training data. “You’ve got to convert everything to numbers, then you can eventually plot what will be an [n-dimensional] vector space,” Martin explained. “And, often that requires these mathematical transforms to effectively cluster the data onto these lines.” Graph settings are primed for the transformations necessary for vectorizing data in part because they natively support clustering capabilities.
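To make the "convert everything to numbers" step concrete, here is a minimal sketch in Python. It uses plain one-hot encoding rather than graph embedding, which is a deliberate simplification, and the records, field names, and values are hypothetical.

```python
# Hypothetical records; the field names and values are illustrative only.
records = [
    {"body_type": "sedan", "speed_kph": 30, "airbag_deployed": "yes"},
    {"body_type": "truck", "speed_kph": 55, "airbag_deployed": "no"},
    {"body_type": "sedan", "speed_kph": 80, "airbag_deployed": "yes"},
]


def build_vocabulary(rows, field):
    """Return the sorted distinct values of a categorical field."""
    return sorted({row[field] for row in rows})


def vectorize(row, body_types):
    """Turn one record into a list of numbers (a point in n-dimensional space)."""
    # One slot per known body type: 1.0 for the matching type, 0.0 otherwise.
    one_hot = [1.0 if row["body_type"] == t else 0.0 for t in body_types]
    speed = float(row["speed_kph"])
    airbag = 1.0 if row["airbag_deployed"] == "yes" else 0.0
    return one_hot + [speed, airbag]


body_types = build_vocabulary(records, "body_type")
vectors = [vectorize(r, body_types) for r in records]

for v in vectors:
    print(v)
```

Each record becomes a four-number vector here (two one-hot slots, speed, and the airbag flag), so every record can be plotted or compared as a point in the same space.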
Once organizations have completed the transformations to vectorize their training data, they analyze it with “various processes,” Martin noted. “One of them might be, for example, a linear regression where you’re looking for a line of these different vectors, where they line up.” This technique, combined with a scatter plot of those vectors, is highly influential in identifying the features on which to base adaptive statistical AI models.
“If you’ve got a scattershot plot and you can see effectively what the line is that goes through the…plot; it’s just algebra,” Martin offered. “You can take any point on the line, and you just follow the lines where the X is and you drop down to the Y.” These techniques enable organizations to ascertain that when given specific circumstances or features in their data, they can expect—or predict—a particular outcome.
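The "it's just algebra" point can be sketched with ordinary least squares on a single feature. This is one common way to find such a line, not necessarily the method Martin's team uses, and the data points are made up so that they fall exactly on y = 2x + 1.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one feature: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance of x and y divided by the variance of x gives the slope.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Illustrative points that lie on the line y = 2x + 1.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]

slope, intercept = fit_line(xs, ys)


def predict(x):
    """Read the missing value off the fitted line."""
    return slope * x + intercept


print(slope, intercept, predict(5.0))
```

Once the slope and intercept are known, supplying a new x and reading off y is the "up pops the missing value" step from the quote earlier in the article.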
Labeled Example Data
The feature engineering procedures Martin detailed are critical for training models with labeled examples of the phenomena they’re trying to predict. A deep neural network model for image recognition systems, for example, may need to be trained on all the different features necessary for identifying when a front bumper needs replacement following an automotive collision. Training cognitive statistical AI models with data reflecting labeled examples is necessary for supervised learning applications and is useful in semi-supervised ones, too.
Thus, when vectorizing data to determine relevant features via graph embedding, “if you’re doing something where you’re going to train the [model], you have to have examples, labeled examples, where the predictions are already made,” Martin revealed. In this respect, annotated training data plays an integral role in informing the learning process of adaptive statistical AI models via graph embedding “through training it with examples where you know what the missing value is for a number, and you can test [the model],” Martin commented.
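The train-then-test loop Martin describes can be sketched as follows. A 1-nearest-neighbor classifier stands in for the model purely for brevity, and the feature vectors and the "replace"/"keep" labels are hypothetical.

```python
import math

# Hypothetical labeled examples: (feature vector, known label).
labeled = [
    ([0.90, 0.80], "replace"),
    ([0.80, 0.90], "replace"),
    ([0.10, 0.20], "keep"),
    ([0.20, 0.10], "keep"),
    ([0.85, 0.75], "replace"),
    ([0.15, 0.25], "keep"),
]

# Hold back some labeled examples so the model can be tested on them.
train, test = labeled[:4], labeled[4:]


def predict(vector, examples):
    """Return the label of the closest training example (1-nearest neighbor)."""
    nearest = min(examples, key=lambda ex: math.dist(vector, ex[0]))
    return nearest[1]


# Compare each prediction with the label that is already known.
correct = sum(predict(vec, train) == label for vec, label in test)
accuracy = correct / len(test)
print(accuracy)
```

Because the held-out examples already carry the "missing value," the predictions can be scored directly, which is what makes labeled data usable for testing as well as training.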
Feature engineering techniques involving graph embedding are gaining credence throughout data science for two reasons. They support scenarios in which organizations have labeled training data to teach models to predict certain outcomes, and scenarios in which modelers are simply looking for features that reveal how to predict those outcomes. For the latter, the graph approach is particularly suitable for situations in which “you don’t know what the missing numbers are and you just use math to look for clusters,” Martin mentioned.
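For the unlabeled case, "using math to look for clusters" can be illustrated with a small k-means routine. K-means is only one of many clustering methods and is chosen here as an assumption for illustration; the points and the starting centroids are made up.

```python
import math


def k_means(points, centroids, iterations=10):
    """Plain k-means: assign each point to its nearest centroid, then re-average."""
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            idx = min(range(len(centroids)), key=lambda i: math.dist(p, centroids[i]))
            clusters[idx].append(p)
        # Move each centroid to the mean of its cluster (keep it if the cluster is empty).
        centroids = [
            [sum(dim) / len(cluster) for dim in zip(*cluster)] if cluster else centroid
            for cluster, centroid in zip(clusters, centroids)
        ]
    return centroids, clusters


# Illustrative unlabeled vectors forming two visible groups.
points = [[1.0, 1.0], [1.2, 0.8], [0.8, 1.2], [5.0, 5.0], [5.2, 4.8], [4.8, 5.2]]

# Starting centroids are picked by hand here; real use would choose them more carefully.
centroids, clusters = k_means(points, centroids=[points[0], points[3]])
print(centroids)
```

No labels are supplied anywhere in this sketch; the grouping comes entirely from the distances between vectors.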
In either instance, the ability to perform mathematical transformations, what Martin referred to as “pivots,” is critical to vectorizing data and pinpointing the features supporting both supervised learning and unsupervised learning models. “You start with what looks like human-readable record data and you transform it down to vectors and then you do manipulations of those vectors to see if you can find a plane in which there is a line through which a prediction can be made,” Martin summarized. “Then that becomes your machine learning model.”
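One way to picture such a transformation is a logarithm applied to data that grows exponentially. The sketch below is an assumed example, not Martin's own procedure: it measures how closely the points follow a straight line before and after the transform, using the Pearson correlation coefficient on made-up values.

```python
import math


def pearson(xs, ys):
    """Pearson correlation: close to 1.0 when the points fall on a rising straight line."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Illustrative data following y = 2 * 3**x, which curves upward on a raw plot.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0 * 3.0 ** x for x in xs]

r_before = pearson(xs, ys)

# The transformation: taking log(y) turns the curve into a straight line.
log_ys = [math.log(y) for y in ys]
r_after = pearson(xs, log_ys)

print(r_before, r_after)
```

After the transform the correlation is essentially 1, meaning a line now passes through the points and a simple linear fit can serve as the model in the transformed space.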