Walking the Line Between Cool and Creepy: Ethical Use of Open Data While Protecting PII
For years, people have wondered when, and sometimes whether, artificial intelligence will live up to its incredible potential. The technology is finally beginning to change industries and lives. Now implemented in everything from smartphone cameras and self-driving vehicles to manufacturing facilities, AI has racked up numerous high-profile success stories: People now rely on AI to silently optimize photos, perfect their parallel parking, and discover product defects. AI can be cool or creepy, and for now it is on the right side of that line.
At the same time, however, the public is becoming increasingly aware of AI ethics, as researchers and journalists question the sources of the data powering AI innovations and spotlight ways that data is being misused by tech giants. High-profile lawsuits against Meta/Facebook and others revealed that facial and other biometric data have been powering identification engines without user consent. Now companies are deploying AI engines that can identify individual customers from randomized data, potentially exposing individuals' personal identities.
Privacy-threatening developments like these risk pushing AI into creepy territory. More concerning, questions over the provenance of AI systems' data (including its origins and its history of transformation or manipulation) now loom over those systems' output, clouding whether the results are biased or incomplete. For practitioners who want AI to remain cool rather than creepy, strong ethical guardrails are clearly needed for the development of AI solutions.
The importance of transparent, open data
Access to transparent, open data is critical to enabling AI systems to function at scale. It goes without saying that AI systems work best when they're provided with not only significant data volume but also data variety and data veracity; in other words, AI thrives with plenty of accurate data sufficiently differentiated to inform multiple decisions. However, as data sets grow, it can become impossible to determine which data elements a given prediction is based on, resulting in a total lack of explainability.
Data that is not open or explainable cannot be reused. Data with opaque or questionable origins can generate incomplete or biased insights, lead to inaccurate outcomes, and undermine people's fundamental rights, enabling discrimination and even exposure of personally identifiable information (PII). Whereas traditional, small-scale data analysis lets bad data simply be excluded upon discovery, this kind of manual data cleansing can't easily be done with AI-scale data sets.
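To make the PII-exposure risk concrete, here is a minimal sketch of a k-anonymity check, a standard way to estimate how easily supposedly de-identified records can be traced back to individuals. The records, field names, and values are all hypothetical, invented for illustration:

```python
from collections import Counter

def k_anonymity(records, quasi_identifiers):
    """Return the k-anonymity level of a dataset: the size of the
    smallest group of records sharing the same quasi-identifier values.
    A low k means individuals are easier to re-identify."""
    groups = Counter(
        tuple(record[field] for field in quasi_identifiers)
        for record in records
    )
    return min(groups.values())

# Hypothetical "anonymized" records: no names, but the combination of
# ZIP code, age, and gender can still single out one individual.
records = [
    {"zip": "94107", "age": 34, "gender": "F", "purchase": "book"},
    {"zip": "94107", "age": 34, "gender": "F", "purchase": "laptop"},
    {"zip": "10001", "age": 52, "gender": "M", "purchase": "phone"},
]

print(k_anonymity(records, ["zip", "age", "gender"]))  # 1: at least one record is unique
```

A result of k = 1 means at least one person is uniquely identifiable from the quasi-identifiers alone, which is exactly the kind of exposure that pushes AI from cool to creepy.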
Transparency demands an understanding of data's provenance: its lineage as it flows and is transformed before use, and complete traceability from its current state all the way back to its original source. Being transparent about which data sets are used in AI systems also helps to prevent possible rights violations. This is especially important in the era of big data, where volume is sometimes valued over quality.
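The lineage-and-traceability idea above can be sketched in code: each transformation step is recorded alongside a content fingerprint, so a data set can be audited all the way back to its source. This is a minimal illustration, not a production lineage system; the step names, the source URL, and the sample data are assumptions made up for the example:

```python
import hashlib
import json
from dataclasses import dataclass
from datetime import datetime, timezone

@dataclass
class LineageRecord:
    """One step in a dataset's provenance chain: which transformation ran,
    what it consumed, when, and a fingerprint of the data it produced."""
    step: str
    source: str
    content_hash: str
    timestamp: str

def fingerprint(data) -> str:
    """Content hash of the data, so later tampering or drift is detectable."""
    return hashlib.sha256(json.dumps(data, sort_keys=True).encode()).hexdigest()

def record_step(lineage, step, source, data):
    """Append a provenance entry for this step and pass the data through."""
    lineage.append(LineageRecord(
        step=step,
        source=source,
        content_hash=fingerprint(data),
        timestamp=datetime.now(timezone.utc).isoformat(),
    ))
    return data

# Trace a (toy) dataset from its original source through each transformation.
lineage: list[LineageRecord] = []
raw = record_step(lineage, "ingest", "https://example.org/open-dataset", [3, 1, 2])
clean = record_step(lineage, "deduplicate+sort", "ingest", sorted(set(raw)))

for entry in lineage:
    print(entry.step, entry.content_hash[:12])
```

Because every step carries a hash of its output, an auditor can replay the chain and confirm that the data feeding a model really is what its provenance claims.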
Responsible, ethical AI
Companies that want to reap the benefits of AI must do so responsibly, ethically, and transparently. Open data demands incredibly complex technological and societal changes, and it either introduces or intensifies many challenges, especially when it comes to walking the cool/creepy line. Issues around algorithm development, AI bias, and model decay are also becoming commonplace.
Consider the FTC’s recent $1.5-million fine and ruling regarding Weight Watchers International’s Kurbo, a healthy-eating app. Kurbo was gathering personal information from users as young as eight years old to use in developing algorithms and AI models. Beyond the significant fine, the order demanded the destruction of the app’s algorithms, AI models, and illegally harvested data. Due to a growing number of privacy violations, Congress is now considering several bills targeting Big Tech, including new compliance standards for user data collection and algorithms.
At this point, companies using AI need a strong understanding of data architecture, robust governance capabilities, deep domain expertise, tremendous scale, and the ability to integrate and optimize data, in open data formats, to realize the true value that data can offer. Moreover, organizations must understand that ethical, responsible open data practices demand more than just checking compliance boxes. Regulations are effectively a last line of defense against poor practices and should be treated as a starting point; as AI becomes more common and important across industries, companies must exceed compliance expectations to ensure they are good stewards of customers’ data.
Understanding open data
To fully grasp the responsibility of handling open data and succeeding with AI-powered technology, organizations must first understand what open data is, and what it is not.
First, open data means interoperability and following open standards, giving organizations the flexibility to choose data formats that will support their static and changing business needs at any given time. Second, open data is about control: the company’s power to decide what data is integrated and with whom, rather than getting locked into vendors who prefer to dictate and limit access to key data sources.
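The interoperability point can be illustrated with a tiny sketch: data written to an open, text-based standard such as CSV can be produced and consumed by any tool, in any language, with no vendor SDK in between. The field names and values here are hypothetical:

```python
import csv
import io

# Write to an open, text-based format: any tool or language can read
# this back without a proprietary driver or vendor lock-in.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["customer_id", "region", "spend"])
writer.writeheader()
writer.writerows([
    {"customer_id": "c-001", "region": "EMEA", "spend": 120.50},
    {"customer_id": "c-002", "region": "APAC", "spend": 87.00},
])

# Read it back with the same open standard.
buffer.seek(0)
rows = list(csv.DictReader(buffer))
print(rows[0]["region"])  # EMEA
```

The same principle scales up to open columnar formats such as Parquet: the organization, not the vendor, decides which tools get to read its data.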
Open data makes sense financially: It gives companies full control over their ability to manage costs over time. Moreover, open data enables companies to understand the lineage of data and guarantee its complete traceability, so it can be understood, visualized, reused, analyzed, and audited with ease.
Now that we’ve talked about what open data is, let’s discuss what open data is not. Open data is not a proprietary platform or data format, which means it won’t encourage pay-for-play data monopolies within a marketplace. While some unicorns have recently enjoyed lofty valuations based on lock-in strategies, the database lock-in battles of the 1990s taught us that this approach shackles customers and can create antitrust issues in the longer term.
Open data is also not a walled garden that restricts access, discourages experimentation, undermines collaboration, and kills innovation. Walled gardens stifle the vast potential of AI and edge computing innovations. Scaling becomes impossible: there isn’t enough capital to push out the garden’s walls and build an ecosystem that can accommodate the complexity and explosion of data we will continue to see for generations to come.
For true innovation that is ethical and responsible, open data is the only way forward. The alternatives, such as proprietary formats, non-reusable data, and opaque provenance, will lead organizations to dead ends sooner or later.
Once organizations understand open data, they can take a comprehensive approach to ensuring data best practices. This includes exceeding, not just meeting, rigorous information and data security standards such as the PCI Data Security Standard, U.S. DoD DIACAP and DISA Security Technical Implementation Guides, HIPAA security and data privacy rules, FDA 21 CFR Part 11, and Common Criteria.
Looking toward the future with open data
As open data becomes more popular for AI and other use cases, organizations should expect their data responsibilities to increase and data-related regulations to become stricter. Companies can stay ahead of public and regulatory expectations by diligently managing and controlling their data, establishing the internal and public trust they will need to continue innovating and growing.
Collectively, we should work to develop a data governance partnership that ensures privacy and security while enhancing foundational data assets. Doing so will enable us to widely adopt practices that will ultimately work better for everyone, improving data responsibility and driving future data innovation.