
How to Avoid Being Deceived by Misleading Data Security Metrics

Today’s approach to understanding data accuracy rates is flawed. True understanding starts with asking the right questions: questions framed around the outcomes you want, free of legacy assumptions and vendor bias.

For example, a common question I hear from prospects is, “What are your false-positive rates for PII detections?” This simple, well-meaning question is unsound because it concentrates on “precision” and neglects “recall,” making it unbalanced. Precision aims to reduce false positives (noise), while recall focuses on minimizing false negatives (missed risks). Both are important, but in the world of data and cybersecurity, false negatives can have serious consequences that must not be overlooked.
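To make the imbalance concrete, here is a minimal sketch of how precision and recall are computed. The counts are made up for illustration and are not figures from any real scanner:

```python
# Toy illustration: precision vs. recall from detection counts.
# All numbers below are hypothetical.
tp = 90   # files correctly flagged as containing PII
fp = 10   # files flagged but actually clean (false positives / noise)
fn = 40   # files with PII that were missed (false negatives / missed risk)

precision = tp / (tp + fp)   # of what we flagged, how much is real
recall    = tp / (tp + fn)   # of the real PII, how much we caught

print(f"precision = {precision:.2f}")  # 0.90 -- looks great in a vendor demo
print(f"recall    = {recall:.2f}")     # 0.69 -- the missed-risk side of the story
```

A detector tuned only on the false-positive question can score 0.90 precision while silently leaving nearly a third of real PII undetected.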

Vendors often reinforce this flaw of focusing on precision by presenting misleading and oversimplified metrics for comparing themselves against their competitors. The composition, sensitivity, and volume of your data will influence accuracy metrics. Relying solely on a vendor-provided metric without validating it against your organization’s data will not yield expected results.

The simple accuracy question harkens back to a time not long ago when data protection technologies were limited and reducing noise was an essential factor to prevent unwanted operational impacts. With today’s advances in AI, new approaches minimize these accuracy concerns. Therefore, the more critical questions should focus on operational effectiveness. For example, “How is your solution leveraging new advances in AI to improve the operational effectiveness of our data security program?” “How will it provide enhanced visibility into our data risks across all our data?” And “How will it enable us to achieve automated remediations and better outcomes in less time?”

But, in the real world, the questions that surface in RFIs, RFPs, and solution-selection cycles are born from practitioners who, like me, have lived with the pain of yesterday’s solutions. They often just want to avoid that pain this time around and are unaware that there are alternative, more accurate approaches. Unfortunately, rather than using this opportunity to inform customers about better approaches, some vendor salespeople manipulate these questions to falsely elevate themselves above their competition.

The question of false positives uses an outdated method to measure the effectiveness of a modern data protection solution. It’s an attempt to avoid the historical pain of false positive rates from solutions that rely on policy-based detections – i.e., those that employ data classifiers/identifiers, patterns, and the like. All data loss prevention (DLP) and data security posture management (DSPM) solutions developed before 2018 fall into this category, along with most “modern” DSPM solutions that merely repackaged that same legacy, policy-based detection approach.

The advent of modern AI-based approaches that apply deep learning techniques and leverage LLMs and neural network transformer technologies has transformed the landscape for vendors who recognized this shift and embedded it into their core product development strategies. These new approaches provide the opportunity to move from “heavy-lifting” policy-first approaches to a “simply scan” method that uses AI-augmented autonomous clustering and categorization.

Let’s examine both approaches to reveal the heart of the matter.

“Heavy-Lifting” Policy-First Approach

These solutions require customers to initially decide what content is most important to them (think of “critical assets” for each organizational function – HR, Finance, Sales, Legal, Operations, etc.). An often overlooked and certainly underestimated complexity of this approach is that you now depend on your business leaders to identify and describe their most important data. (They often don’t really know or won’t take the time, which results in gaps in the policies, leading to false negatives.) Once the decision is made about what is essential, the next step is to develop policies that describe this data in terms that machines understand. This requires significant effort from staff with specialized skills to build policies replete with regular expression pattern matching (for social security numbers, credit card numbers, medical record numbers, driver’s license numbers, and thousands more), import data dictionaries and identifiers, and configure exact data matches, Luhn checks, proximity estimations, keywords, and so on.
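As a hedged illustration of what this policy-building effort looks like in practice, here is a small sketch of two classic building blocks: a regular expression for US SSN-style patterns and the Luhn checksum commonly used to filter credit card number candidates. These are simplified stand-ins, not any vendor’s actual policy logic:

```python
import re

# A minimal SSN-style pattern (real policies add context, exclusions, etc.)
SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")

def luhn_valid(number: str) -> bool:
    """Return True if the digit string passes the Luhn checksum,
    used to weed out false-positive credit card number candidates."""
    digits = [int(d) for d in number if d.isdigit()]
    if len(digits) < 13:          # too short to be a card number
        return False
    total = 0
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:            # double every second digit from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0

# The classic Visa test number passes; flip one digit and it fails.
print(luhn_valid("4111 1111 1111 1111"))  # True
print(luhn_valid("4111 1111 1111 1112"))  # False
```

Multiply this by thousands of identifiers, dictionaries, and proximity rules, and the “heavy lifting” becomes apparent.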

The result? A lot of effort goes into creating policies to find an organization’s important critical data. Once scanning begins, the false positives also become evident, so policies are fine-tuned to minimize the noise. This noise presents two primary concerns: It overwhelms security analysts trying to spot real incidents among the many alerts; and it annoys your organization’s leaders and employees who may ultimately jeopardize this well-intentioned (and very needed) data security initiative.



The Heart of the Matter

“Policy-first” data scans from security-focused solutions like DLP were never designed to precisely locate, qualify, and categorize all your data. They may scan all your data and even list your data inventory, but only those critical data assets matching your policies will be cataloged and protected, i.e., those files that contain your predetermined sensitive information. Counts will vary by organization and the competency of those specialists who developed the policies, but a generous estimate is that this will represent about 20-30% of the total count of files scanned; probably less.

So when vendors claim they detect PII with 95% accuracy, they are very likely referring to only 20-30% of your data. That is the policy-matched data, not the other 70-80% that is certainly contextually relevant and important to the organization. These vendors lack the full context of this data, which is where the false negatives occur. Achieving 95% accuracy on just 30% of the total data falls far short of most customers’ expectations and ultimately leads to disappointment and potentially serious risks.
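The arithmetic behind that claim is simple. Using illustrative numbers (100,000 files, 30% policy coverage, and the vendor’s claimed 95% accuracy on the covered slice):

```python
# Back-of-envelope math behind the coverage claim (illustrative numbers only).
total_files     = 100_000
policy_coverage = 0.30    # share of files the policies actually describe
vendor_accuracy = 0.95    # claimed accuracy *within* that covered slice

covered   = total_files * policy_coverage     # files the policies can see
caught    = covered * vendor_accuracy         # files correctly handled
effective = caught / total_files              # accuracy across ALL data

print(f"files correctly handled: {caught:,.0f}")
print(f"effective coverage of all data: {effective:.1%}")  # 28.5%
```

A headline “95% accurate” collapses to roughly 28.5% of the total estate once coverage is factored in.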

The New Alternative – “Simply Scan and Understand” Approach

Imagine a world where you bypass the effort, cost, and time delays of interviewing your data stakeholders or creating policies. A world where you simply scan your data and are presented with a complete and organized view of what your data is, cataloged into intuitive categories. These categories enable you to show your stakeholders a clearer understanding of their data, allowing them to make informed decisions about data risks and protection objectives. They also accurately apply friendly grouping names to 99% of your data (like HR policies, employee health records, finance M&A documents, patient discharge forms, conflict minerals regulatory disclosures, intellectual property, offer letters, immigration forms, source code, and many, many more).

Accuracy is built into this new AI-driven approach, particularly when semantic intelligence is employed. Semantic intelligence focuses on contextual understanding, nuances, and subtleties of language to extract the meaning and relationships of your data. LLMs and Transformers are the engines behind this. LLMs are trained on a vast corpus of text and recognize patterns and context. They identify key themes and classify content based on learned knowledge, greatly reducing the need for large-scale policy building.
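As a toy-scale sketch of the idea (not any vendor’s implementation): categorization by nearest-neighbor similarity over document embeddings replaces hand-written policies. The 3-dimensional vectors and category names below are invented for illustration; real systems use high-dimensional LLM/transformer embeddings:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

# Hypothetical category "centroid" embeddings (made-up 3-d stand-ins).
categories = {
    "HR policies":      [0.9, 0.1, 0.0],
    "Finance M&A docs": [0.1, 0.9, 0.1],
    "Source code":      [0.0, 0.1, 0.9],
}

def categorize(doc_embedding):
    """Assign the category whose centroid is nearest in cosine space."""
    return max(categories, key=lambda c: cosine(doc_embedding, categories[c]))

print(categorize([0.8, 0.2, 0.1]))  # HR policies
```

No regexes, dictionaries, or stakeholder interviews are needed to make this assignment; the meaning captured in the embedding does the work.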

Gaining this greater visibility and context into your data assets boosts the confidence to move forward with previously difficult-to-achieve data security initiatives. These modern initiatives include automatically applying file classification labels, remediating overly permissive file access and sharing, archiving outdated and duplicate data, and protecting GenAI rollouts, among others, all with real benefits to the organization.

Conclusion

Asking informed questions will help you explore the possibilities presented by the latest in AI and other technologies. Focus on your desired outcomes, shed narrow legacy thinking, and reject vendor bias to gain deeper understanding of new approaches that deliver vastly improved coverage, balanced precision and recall accuracy, and time-to-value.

About The Author Of This Article

David Garrison is Senior Solutions Engineer at Concentric AI


