Should AI Be Centered on Machine Learning Algorithms or Data?

By Arun Shastri and PKS Prakash On Jan 31, 2022

Some of the most prominent experts in AI, such as Andrew Ng (Adjunct Professor at Stanford, co-founder and head of Google Brain, former Chief Scientist at Baidu, co-founder of Coursera), have begun to argue for a shift from model-centric to data-centric AI. If one thinks of “AI Systems = Data + Code (model/algorithm)”, there is a natural inclination to take the data for granted (treating it as fixed and already managed) and to work on sharpening the algorithms to drive good results. In other words, data serves as fuel for algorithms, which drive insights, which lead to action, and the way to deliver the best results is to fine-tune the engine. Most academic benchmarks hold the data fixed and let teams work on the code.

Questioning these assumptions, Andrew Ng organized a competition asking teams to hold the code as fixed and work on the data.

But what exactly would such a data-centric approach entail?

What choices would it drive?

Before addressing this tradeoff, we offer a few thoughts on the data itself. Data quality and consistency are affected by two kinds of noise: feature (attribute) noise and label noise. Feature noise describes impurities in the observed values of the features (attributes). Label noise occurs when an example is assigned an incorrect label. As a rule of thumb, data scientists have found ways to mitigate feature noise, but label noise remains common and problematic. Label noise may occur when:

Information provided to experts is insufficient to perform reliable labeling
To reduce cost, non-experts assisted by automated labeling frameworks are employed
A certain label is subjective, leading to different interpretations for the same data
Users make errors, for example when customers provide the wrong response
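
One way to make the subjectivity and reliability concerns above concrete is to compare the labels that several annotators gave to the same item. The sketch below is illustrative only: the item ids, the labels, and the 0.75 agreement threshold are hypothetical, and it assumes every item has at least one label.

```python
from collections import Counter


def agreement_report(labels_by_item):
    """Return the majority label and the share of annotators who chose it.

    labels_by_item maps an item id to the list of labels that different
    annotators assigned to it (assumed non-empty for every item).
    """
    report = {}
    for item_id, labels in labels_by_item.items():
        majority_label, votes = Counter(labels).most_common(1)[0]
        report[item_id] = {
            "majority": majority_label,
            "agreement": votes / len(labels),
        }
    return report


def items_to_review(labels_by_item, min_agreement=0.75):
    """List the items whose annotators agree less than min_agreement."""
    report = agreement_report(labels_by_item)
    return sorted(
        item_id
        for item_id, row in report.items()
        if row["agreement"] < min_agreement
    )


# Hypothetical annotations: three annotators per image.
annotations = {
    "img_001": ["defect", "defect", "defect"],
    "img_002": ["defect", "ok", "ok"],
    "img_003": ["ok", "ok", "ok"],
    "img_004": ["defect", "ok", "scratch"],
}

print(items_to_review(annotations))
```

Items with low agreement are natural candidates for clearer labeling instructions or expert review.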

Model-centric AI focuses on handling inconsistencies within the data via the algorithm itself. If the amount of noise is relatively small, many algorithms can handle it. Algorithms can also cleanse the data by looking at outliers and anomalies. Lastly, by better understanding the type of noise (e.g. feature noise as opposed to label noise), we can design specific approaches to mitigate it. Data-centric AI focuses on improving data quality and consistency by developing better data-collection frameworks.
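
As a rough illustration of algorithmic cleansing of feature noise, the sketch below flags outliers with a median-based (modified) z-score. This is one common technique rather than a method prescribed here; the sensor readings are made up, and the 0.6745 scaling constant and 3.5 cutoff are the usual conventions for this score, not tuned values.

```python
from statistics import median


def flag_outliers(values, threshold=3.5):
    """Return the indices of values whose modified z-score exceeds threshold.

    The score uses the median and the median absolute deviation (MAD),
    which are less distorted by extreme values than the mean and
    standard deviation.
    """
    center = median(values)
    mad = median(abs(v - center) for v in values)
    if mad == 0:
        # No spread to measure against, so nothing can be flagged.
        return []
    return [
        index
        for index, v in enumerate(values)
        if 0.6745 * abs(v - center) / mad > threshold
    ]


# Hypothetical feature values with one corrupted reading.
readings = [10.1, 9.8, 10.3, 10.0, 9.9, 55.0, 10.2]
print(flag_outliers(readings))
```

Flagged values can then be dropped, capped, or sent back for correction, depending on how much intervention is possible.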

Selecting the right approach is vital to ensure we are appropriately building AI. Three dimensions should be considered: label noise, actionability, and amount of data.

Label noise

For datasets with low label noise, a model-centric approach is an obvious choice. If noise is introduced “completely at random,” model-centric AI can handle it. If noise might be introduced both “randomly” and “not randomly,” a deeper understanding is required before selecting the right option.
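
The difference between the two kinds of noise can be seen in a small simulation. The sketch below assumes binary labels and made-up flip rates; it compares noise that hits every class equally with noise that depends on the true class, which is one way noise can fail to be completely at random.

```python
import random


def flip_labels(labels, flip_rate, rng):
    """Flip binary labels; flip_rate maps each true class to its flip probability."""
    return [1 - y if rng.random() < flip_rate[y] else y for y in labels]


def error_rate_by_class(clean, noisy):
    """Return, for each true class, the fraction of its labels that were changed."""
    totals = {0: 0, 1: 0}
    errors = {0: 0, 1: 0}
    for truth, observed in zip(clean, noisy):
        totals[truth] += 1
        errors[truth] += truth != observed
    return {cls: errors[cls] / totals[cls] for cls in totals}


rng = random.Random(0)  # fixed seed so the run is repeatable
clean = [1] * 300 + [0] * 700  # hypothetical ground-truth labels

# Completely at random: every label has the same chance of being wrong.
uniform = flip_labels(clean, {0: 0.1, 1: 0.1}, rng)

# Class-dependent: only positive examples are ever mislabeled.
skewed = flip_labels(clean, {0: 0.0, 1: 0.3}, rng)

print(error_rate_by_class(clean, uniform))
print(error_rate_by_class(clean, skewed))
```

In the first case the errors are spread evenly across classes; in the second they concentrate in one class, which is the kind of pattern that calls for a closer look before choosing an approach.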

Actionability

Is intervening to correct the data or rectify the errors too expensive or otherwise infeasible?

For example, recent privacy-policy updates restrict companies’ access to user engagement data; as a result, all email engagement responses from users are marked as open. In such cases, manual intervention to correct the data sets is not feasible, and model-centric approaches are the only option.

On the other hand, if actionability/intervention is feasible, we can collect more samples and/or improve labeling consistency to help build better models. In manufacturing, for example, image-based classification of errors can be made more consistent, so it is worth investing in the ability to intervene and rectify mistakes.

Amount of data

Data size is also an important dimension. If the dataset is small and experimentation for new data collection is expensive, data-centric AI approaches are superior. With big data, hybrid approaches to handling inconsistencies are worth considering.
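
Taken together, the three dimensions can be sketched as a rough triage function. This is only an illustration: the `DataProfile` fields, the function name, and especially the order in which the rules are checked are assumptions made for the example, and real projects will rarely reduce to three yes/no answers.

```python
from dataclasses import dataclass


@dataclass
class DataProfile:
    """Hypothetical yes/no summary of the three dimensions."""
    label_noise_low: bool        # is label noise low (or completely at random)?
    intervention_feasible: bool  # can we afford to correct or re-collect data?
    dataset_small: bool          # is the dataset small rather than "big data"?


def suggest_focus(profile):
    """Suggest where to spend more effort; the rule order is an assumption."""
    if not profile.intervention_feasible:
        # The data cannot be fixed, so the model has to absorb the noise.
        return "model-centric"
    if profile.label_noise_low:
        # Many algorithms tolerate a small amount of noise.
        return "model-centric"
    if profile.dataset_small:
        # Noisy, small, and fixable: improve collection and labeling.
        return "data-centric"
    # Noisy, large, and fixable: combine both kinds of effort.
    return "hybrid"


print(suggest_focus(DataProfile(
    label_noise_low=False,
    intervention_feasible=True,
    dataset_small=True,
)))
```

The output is a starting point for discussion, not a verdict: as the rest of the article argues, both data and models deserve attention.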

So before deciding on data-centric or model-centric approaches, assess the following.

How is the data being generated?
What is the level of human intervention in this process?
What are the volume and velocity of the data?

While developing AI solutions, the focus should be on both data and models. Where you spend more energy may be dictated by your answer to the preceding questions. Clean fuel and a finely tuned engine are both required to maximize performance.
