Why Clean Data is Key to Accurate AI Implementation
By Gary Kotovets, Chief Data & Analytics Officer, Dun & Bradstreet
Generative artificial intelligence (Gen AI) has been a hot topic across industries and organizational business lines for several years. Ever since the public debut of OpenAI’s ChatGPT in late 2022, businesses have been looking for ways to use AI to operate more efficiently and unlock competitive advantages. During this time, there have been various updates to large language models (LLMs), with new releases often promising more advanced capabilities. As many decision makers face pressure to realize returns on their companies’ investments in AI, it’s important to remember that success or failure isn’t solely dependent on cutting-edge LLMs; clean data is essential to obtaining reliable, trustworthy results.
As powerful as AI can be, the data it relies upon can enable or constrain its abilities from an early stage. The saying “garbage in, garbage out” very much applies to the risk of introducing inaccurate or low-quality data into a generative AI tool. AI models use data to learn, improve, and generate output, so ensuring that data is clean, accurate, and complete is critical.
There are several major risks to neglecting data hygiene in AI. First, artificial intelligence tools can generate biased or inaccurate insights. Those who have been paying even a little attention are probably familiar with AI’s more public faux pas. From a search engine suggesting potentially harmful ingredients in recipes to AI chatbots mimicking hateful speech, these incidents have shone a light on AI’s current shortcomings and on what misleading or incorrect source data can do.
Certain failures are relatively easy to spot (don’t add glue to your diet). That won’t always be the case in business, where inaccurate data can end up influencing an LLM’s output in subtle but dangerous ways. Consider what happens if corrupt or poorly formatted payments data makes its way into the AI data supply chain: downstream analyses could be quietly skewed, or, in a worst-case scenario, the LLM could return results that under- or over-report performance. AI needs to deliver accurate results to be useful; there’s no value in being more efficient but wrong.
Poor-quality AI insights may also lead to flawed decision-making by businesses. Managers and executives need to have confidence in what AI is reporting, and while they should be able to rely on AI explainability and transparency, they won’t be able to review core assumptions for every task (again, that would call AI’s efficiency into question). While humans will still make the final call on strategic decisions for some time, supplying them with a faulty AI-powered analysis introduces garbage data into their own deliberations.
Making the wrong choice could lead to loss of business, reputational damage, and even failure to meet regulatory, compliance, and ethical standards. Were such a situation to arise, it might erode a company’s overall trust in the AI models themselves and slow adoption of a technology that may have already drawn significant internal investment. The potential effects of poor data quality on AI are worrisome. Fortunately, following well-established data management practices can go a long way toward supplying AI tools with useful inputs. These practices are among the guiding principles that allowed us to significantly improve time to value when launching ChatD&B™, our advanced Gen AI assistant that delivers trusted AI responses using Dun & Bradstreet’s comprehensive data and analytics.
Companies can begin by improving data integrity to guard against the introduction of messy, unstructured, incomplete, or inaccurate data. Two key processes often guide approaches to data integrity: standardization and data cleansing. In data standardization, businesses mandate consistent formats and definitions across the company. For example, the same data entry practices should be followed throughout the organization, and processes ought to be in place to validate that they are.
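To make that concrete, here is a minimal sketch of what standardization rules can look like in code, assuming a pandas workflow; the field names, the country mapping, and the choice of ISO country codes and ISO 8601 dates are illustrative assumptions, not a prescription.

```python
import pandas as pd

# Hypothetical records arriving from two business units with
# inconsistent name, country, and date formats.
raw = pd.DataFrame({
    "company": ["Acme Corp", "  acme corp", "Globex Inc"],
    "country": ["USA", "United States", "us"],
    "signed_date": ["2024-01-31", "01/31/2024", "January 31, 2024"],
})

# Organization-wide conventions assumed for this sketch:
# trimmed title-case names, ISO 3166 alpha-2 countries, ISO 8601 dates.
COUNTRY_MAP = {"usa": "US", "united states": "US", "us": "US"}

def standardize(df: pd.DataFrame) -> pd.DataFrame:
    out = df.copy()
    out["company"] = out["company"].str.strip().str.title()
    out["country"] = out["country"].str.lower().map(COUNTRY_MAP)
    out["signed_date"] = pd.to_datetime(
        out["signed_date"], format="mixed"  # mixed-format parsing needs pandas >= 2.0
    ).dt.strftime("%Y-%m-%d")
    return out

print(standardize(raw))
```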
Data cleansing is another tactic that helps preserve data integrity even when low-quality data is present. Tools and automated processes exist to identify and correct errors and to remove irrelevant, incorrect, or outdated information before it can be fed into systems like AI. In essence, an agreed-upon set of data creation and management processes, combined with continual cleanup of messy data, forms the foundation of data integrity.
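A brief sketch of automated cleansing along these lines, again assuming pandas and illustrative column names: exact duplicates are dropped, and records missing required fields or carrying impossible values are quarantined for review rather than fed to downstream AI systems.

```python
import pandas as pd

# Illustrative required fields; real schemas will differ.
REQUIRED = ["customer_id", "company", "payment_amount"]

def cleanse(df: pd.DataFrame) -> tuple[pd.DataFrame, pd.DataFrame]:
    """Split records into (clean, quarantined) partitions."""
    df = df.drop_duplicates()

    # Quarantine rows that are missing required fields or carry
    # impossible values, rather than silently passing them downstream.
    missing_required = df[REQUIRED].isna().any(axis=1)
    negative_amount = df["payment_amount"] < 0
    bad = missing_required | negative_amount

    return df[~bad].copy(), df[bad].copy()
```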
Next, maintaining data visibility through data lineage and metrics can help flag issues before they cause damage. Data lineage is analogous to provenance in the art world: businesses need to know the origin of their data, the path it takes through their systems, and where edits or changes occur, in the same way an art dealer wants assurance that a painting is authentic and has previously changed hands lawfully.
While we may think of data quality issues as being introduced by accident, businesses must also guard against bad actors intentionally generating incorrect information for nefarious purposes. A business that tracks data lineage is better positioned to identify points of error, follow how data is transformed, and ensure accurate information is reaching the right tools, versus one that neglects oversight.
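One lightweight way to capture that kind of lineage is sketched below; the dataset, step, and job names are hypothetical, and real deployments typically rely on dedicated lineage or catalog tooling rather than hand-rolled records.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class LineageRecord:
    step: str          # e.g. "standardize_country_codes"
    source: str        # upstream system or file the data came from
    performed_by: str  # pipeline job or service account responsible
    timestamp: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )

@dataclass
class Dataset:
    name: str
    lineage: list[LineageRecord] = field(default_factory=list)

    def record(self, step: str, source: str, performed_by: str) -> None:
        self.lineage.append(LineageRecord(step, source, performed_by))

# Every stage logs its provenance before handing the data on, so a bad
# record can later be traced back to the step and system that produced it.
payments = Dataset(name="payments_q3")
payments.record("ingest", source="erp_export.csv", performed_by="etl_job_42")
payments.record("standardize_dates", source="payments_q3", performed_by="etl_job_42")
```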
Access to performance metrics and dashboards allows organizations to monitor the health of their data pipelines, making it easier to know when something might be wrong. Trusting a black-box approach, where data goes into a system and insights come out, leaves room for errors and can slow the identification, diagnosis, and repair of data quality issues. With strong data lineage and visibility into metrics, businesses can be more confident that their AI tools are providing insights based on clean data.
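As an illustration of the sort of pipeline health metrics a dashboard might surface, the sketch below computes completeness, duplicate rate, and staleness for a batch of records and applies assumed alerting thresholds; both the metrics and the limits are examples, not recommendations.

```python
import pandas as pd

def quality_metrics(df: pd.DataFrame, timestamp_col: str = "updated_at") -> dict:
    """Compute simple health indicators for one batch of pipeline data."""
    now = pd.Timestamp.now(tz="UTC")
    age = now - pd.to_datetime(df[timestamp_col], utc=True)
    return {
        "row_count": len(df),
        "completeness": float(df.notna().mean().mean()),  # share of non-null cells
        "duplicate_rate": float(df.duplicated().mean()),
        "staleness_days": float(age.dt.days.max()),
    }

# Illustrative alerting thresholds; real limits depend on the pipeline.
def check(metrics: dict) -> None:
    assert metrics["completeness"] > 0.95, "too many missing values"
    assert metrics["staleness_days"] < 7, "data is more than a week old"
```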
As businesses integrate generative AI into their day-to-day operations, it’s critical that stakeholders have at least a high-level understanding of how these tools operate. Emphasizing the importance of clean data and building an ecosystem that protects it is an important part of this education. While it’s impossible to know how AI will transform the way we do business in the future, high-quality data will be key to its most useful contributions.