Ensuring AI Accuracy: The Role of Data Quality
By Gary Kotovets, Chief Data & Analytics Officer, Dun & Bradstreet
Generative artificial intelligence (Gen AI) has been at the forefront of many executives’ minds for several years. According to a recent Dun & Bradstreet survey, 88% of businesses polled are implementing Gen AI in their organizations. Unfortunately, 54% of those businesses have concerns over the trustworthiness and quality of data they’re using in Gen AI applications.
The gulf between these numbers shouldn’t be surprising. Aside from the widespread adoption of the internet in the 1990s, it’s difficult to think of another information technology that’s had such a large impact in so short a time. ChatGPT entered the public consciousness in 2022 and marked the first time many people knowingly interacted with a Large Language Model (LLM). While it may seem like Gen AI has been a topic of conversation for many years, we’re still in the early days of understanding its potential in all aspects of life, including business.
One thing that has become apparent in this short time is Gen AI’s reliance on quality data to produce useful results. Despite its staggering computational power, the abilities of Gen AI are tied to the accuracy and completeness of information it can access. In many cases that dataset is expansive, such as when a Gen AI search tool crawls the internet for answers or an enterprise-level business feeds in sales figures from around the world. However, quantity does not equal quality.
For Gen AI to be useful and potentially profitable, companies must have confidence in the accuracy of the data underpinning their efforts. Without clean data, Gen AI systems can generate inaccurate or biased results, drive poor decision-making at the business level, run afoul of regulatory and compliance standards, and destroy trust in the models across the organization. These types of mishaps risk damaging a company’s bottom line and reputation, hardly the impact you want from investing in a new technology.
To guard against the threats low-quality data poses to Gen AI, it’s important to understand common ways it’s generated. First, some information is simply inaccurate, incomplete, or inconsistent from the beginning. Careless data entry could put in motion a series of events that ends with poor quality data being fed to a Gen AI application. Second, data silos can make it problematic for Gen AI systems to access information needed for accurate insights. Data is often stored in different systems and departments, and stakeholders might not even know it’s available for analysis. Unstructured data poses another challenge to accuracy, as it may be difficult for Gen AI systems to understand or place in the proper context. Similarly, supervised Gen AI models can be hobbled by data labelling failures that negatively impact training. Finally, a lack of data transparency and an insufficient view into data lineage can help inaccurate information make its way into Gen AI systems undetected.
Addressing the risks above can help businesses reduce the likelihood that data quality issues will impact their Gen AI applications. Stakeholders should consider the following initiatives as part of their overall efforts to ensure clean data underpins their Gen AI efforts:
Improving data integrity. Addressing messy, unstructured, or incomplete data can shore up the foundation of your Gen AI tools. Standardizing data formats and definitions across departments is key to avoiding situations where systems cannot “talk” to one another or easily compare data points. This means enforcing consistent data entry practices, validating data accuracy, and eliminating duplicates.
Data cleansing tools and processes are also important to ensuring data integrity. These applications often take advantage of Gen AI and machine learning techniques themselves to clean, validate, and structure data to reduce errors and inconsistencies. Since high-quality data is key to guarding against negative outcomes like Gen AI hallucinations or data poisoning, spending the time and money to address data integrity early on may prevent headaches later.
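As a rough illustration of the three practices named above (consistent formats, validation, and duplicate removal), here is a minimal Python sketch. Everything in it is an assumption made for the example: the CustomerRecord fields, the COUNTRY_ALIASES table, the simple email pattern, and the choice of email as the duplicate key are hypothetical and do not describe any particular tool or Dun & Bradstreet process.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class CustomerRecord:
    """Hypothetical standardized record shared across departments."""
    name: str
    email: str
    country: str


# Hypothetical mapping of free-text country entries to one standard code.
COUNTRY_ALIASES = {
    "usa": "US", "united states": "US", "us": "US",
    "uk": "GB", "united kingdom": "GB", "gb": "GB",
}

# Deliberately simple pattern for illustration; real validation is stricter.
EMAIL_PATTERN = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def standardize(raw: dict) -> CustomerRecord:
    """Apply one consistent format to each field."""
    name = " ".join(raw.get("name", "").split()).title()
    email = raw.get("email", "").strip().lower()
    country_key = raw.get("country", "").strip().lower()
    country = COUNTRY_ALIASES.get(country_key, country_key.upper())
    return CustomerRecord(name=name, email=email, country=country)


def is_valid(record: CustomerRecord) -> bool:
    """Reject records with a missing name, malformed email, or bad country code."""
    return (
        bool(record.name)
        and EMAIL_PATTERN.match(record.email) is not None
        and len(record.country) == 2
    )


def clean(raw_records: list) -> tuple:
    """Return (cleaned records, rejected raw rows), dropping duplicate emails."""
    seen_emails = set()
    cleaned, rejected = [], []
    for raw in raw_records:
        record = standardize(raw)
        if not is_valid(record):
            rejected.append(raw)  # keep rejects so someone can review them
            continue
        if record.email in seen_emails:
            continue  # duplicate of a record already kept
        seen_emails.add(record.email)
        cleaned.append(record)
    return cleaned, rejected
```

Rejected rows are returned instead of silently dropped, on the assumption that careless entries are better corrected at the source than hidden from view.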
Gaining a comprehensive view of operations. The old adage, “You don’t know what you don’t know” may seem simplistic, but it rings true when considering data gathered across an organization. Identifying and eliminating data silos and working to build a centralized data infrastructure or data lake means information from marketing, sales, finance, and more can be made available to Gen AI models for cross-functional insights.
Establishing data governance practices and clear lines of responsibility is important to ensuring information is well-managed, organized, and secure across the business. Maintaining clean, useful data isn’t a one-and-done exercise, so stakeholders will need to understand their responsibilities and abide by robust access controls to ensure security.
Your view of the data ecosystem shouldn’t be limited to its current state; business leaders ought to understand what’s missing. Are there useful metrics that aren’t currently being tracked, or limitations that would cause issues down the road? Are the insights you plan to provide aligned with what the C-Suite expects? Aligning the data strategy with business goals is key to making sure leadership sees the impact of Gen AI, whether it’s improving customer service, driving revenue, or reducing costs.
Enabling enterprise visibility through lineage and metrics. It’s not enough to know which data is being collected for what reasons. Companies need to understand data lineage: where data comes from, its progress through various systems, and how it’s being used. This can be accomplished by establishing key performance metrics and dashboards that monitor the health of the data pipeline and flag issues.
The goal of establishing data lineage is to build a process where users have confidence in the Gen AI’s output and the processes it used to get there. Traceability is likely to become even more critical in a world where Gen AI comes under increased scrutiny from the public and governments.
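To make the idea of pipeline health metrics more concrete, the following Python sketch computes two common indicators, completeness and freshness, and records the upstream source alongside them so a flagged issue can be traced back. The DatasetSnapshot structure, the field names, and the 95% and 24-hour thresholds are all assumptions chosen for illustration; a real dashboard would likely track more metrics and a richer lineage record.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass
class DatasetSnapshot:
    """Hypothetical record of one dataset load, including where it came from."""
    name: str
    source: str          # upstream system the data was pulled from
    loaded_at: datetime  # when the data last arrived in the pipeline
    rows: list


def completeness(rows: list, required_fields: list) -> float:
    """Share of rows in which every required field is populated."""
    if not rows:
        return 0.0
    complete = sum(
        1 for row in rows
        if all(row.get(field) not in (None, "") for field in required_fields)
    )
    return complete / len(rows)


def health_report(
    snapshot: DatasetSnapshot,
    required_fields: list,
    now: datetime,
    max_age: timedelta = timedelta(hours=24),  # assumed freshness threshold
    min_completeness: float = 0.95,            # assumed completeness threshold
) -> dict:
    """Summarize dataset health and list any issues worth flagging."""
    score = completeness(snapshot.rows, required_fields)
    age = now - snapshot.loaded_at
    issues = []
    if score < min_completeness:
        issues.append(f"completeness {score:.0%} below {min_completeness:.0%}")
    if age > max_age:
        issues.append("data is stale")
    return {
        "dataset": snapshot.name,
        "source": snapshot.source,
        "completeness": score,
        "age_hours": age.total_seconds() / 3600,
        "issues": issues,
    }


if __name__ == "__main__":
    snapshot = DatasetSnapshot(
        name="regional_sales",
        source="crm_export",
        loaded_at=datetime(2024, 1, 1, tzinfo=timezone.utc),
        rows=[{"id": 1, "revenue": 10}, {"id": 2, "revenue": None}],
    )
    now = datetime(2024, 1, 2, 12, tzinfo=timezone.utc)
    print(health_report(snapshot, ["id", "revenue"], now))
```

Carrying the source name through to the report is the design choice that matters here: when a metric is flagged, users can see which upstream system to investigate.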
There’s no silver bullet for gathering and maintaining high-quality data for Gen AI. Businesses that seek to unlock the full potential of Gen AI tools should take the time to understand the challenges and adopt best practices for building a reliable data pipeline. A pipeline that supplies the information Gen AI needs is what allows it to meet stakeholders’ expectations and contribute to the success of the business.