AI and Data Security: A New Concern?

Are the data security challenges of today different to those of the past?

Introduction

One of the areas impacting the adoption of Generative AI (substitute Agentic for Generative if you prefer) is that organisations are concerned about the state of data security across their estate, especially in the unstructured areas where AI can arguably provide the most additional value. As a result of this concern, data security has come to the fore. It is (quite rightly) seen as one of the foundational elements that must be in place before the wide-scale adoption of AI within an organisation.

Is this a new challenge?

Early in my career, I focused on developing CRM-style applications with data stored in databases. If you implemented role-based access controls (i.e., didn’t use the SA account to access all the data, which I have seen), the responsibility for “securing” the underlying data effectively shifted to the application admins or the DBAs, who defined the roles and permission sets, allowing me to concentrate on core functionality.
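The kind of role-based control described above can be sketched as a simple permission check. This is a minimal illustration in Python; the role and permission names are hypothetical, not from any real CRM system:

```python
# Minimal sketch of role-based access control (RBAC).
# Role names and permission sets here are illustrative only.

ROLE_PERMISSIONS = {
    "sales_rep":   {"read_own_accounts"},
    "sales_admin": {"read_own_accounts", "read_all_accounts", "edit_accounts"},
    "dba":         {"read_all_accounts", "edit_accounts", "manage_roles"},
}

def can(role: str, permission: str) -> bool:
    """Return True if the given role grants the requested permission."""
    return permission in ROLE_PERMISSIONS.get(role, set())

# The application checks the user's role before touching the data layer,
# rather than connecting as a privileged (SA-style) account.
print(can("sales_rep", "read_all_accounts"))   # False
print(can("sales_admin", "read_all_accounts"))  # True
```

The point is simply that the data layer never sees an over-privileged connection; the mapping from role to permission is defined and maintained by the admins, not the developer.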

However, as I transitioned from being solely a developer and began branching out into other areas of IT, including the knowledge worker field (e.g., SharePoint), data security became increasingly visible and important. I am now starting to see some of the same concerns raised about AI as I saw then.

Legacy Challenges in Data Security

If I go back to when I started my SharePoint journey with SharePoint Portal Server, and follow it through to SharePoint Online with Microsoft 365 (including OneDrive and Teams), there is/was a consistent query that I would regularly hear (especially with the introduction of Delve), which basically boiled down to variations of “Why can I (or user x) see that document?”

Initially, this refrain was usually confined to documents, document libraries or sites that were not secured, typically with the default “share with everyone” setting configured. In these instances, the queries were relatively easy to resolve and explain.

However, this issue became more complex and nuanced as the product matured, with, for example, improvements in search, the introduction of Content Query Web Parts and, especially, Delve.

A sidenote on Delve, for those who are not au fait with SharePoint: Delve was a feature of M365 designed to make content discovery easier for users; for example, you would see documents that your manager or teammates were working on (assuming you had permission to see them).

This feature alone probably raised more of these types of queries than any other.

This was not just a SharePoint issue. I had, on occasion, the opportunity to work on solutions outside of the Microsoft space, including the relatively short-lived Google Search Appliance (https://en.wikipedia.org/wiki/Google_Search_Appliance), a physical Google server that sat on your network and provided enterprise search functionality across your content domain, and, on one memorable occasion, Autonomy, before the HP acquisition and subsequent fallout (see the Google Search Appliance above, but with added bells and whistles).

All of these systems relied on the organisation understanding its data estate and applying a level of information governance and security across every element of the estate that these systems accessed. Without this level of control, employees could access any content available to them, whether explicitly granted or exposed through poor data hygiene.

Does this sound familiar?

Of course, the solution (and I will revert to SharePoint for this) is to apply information management policies and practices across the platform, coupled with user education. This could involve automating site/document library creation and controlling changes to the permission model; not breaking inheritance in document libraries is a good example. Credit to Microsoft: they were aware of this problem and started to release features that would help organisations manage it, such as Information Rights Management (IRM), eDiscovery, Data Loss Prevention (DLP), etc.
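A governance control of this sort can be sketched as a simple audit. The example below is a hypothetical Python sketch: the data shape (a flat export of libraries with an inheritance flag) is an assumption for illustration, not a real SharePoint API, and a real audit would pull this from the platform's admin or reporting interfaces:

```python
# Hypothetical audit over an exported list of document libraries.
# Field names are illustrative; real data would come from the
# platform's admin or reporting APIs.

libraries = [
    {"site": "HR",      "library": "Policies", "inherits_permissions": True},
    {"site": "HR",      "library": "Casework", "inherits_permissions": False},
    {"site": "Finance", "library": "Reports",  "inherits_permissions": False},
]

def broken_inheritance(libs):
    """Return libraries whose permissions no longer inherit from the parent site."""
    return [lib for lib in libs if not lib["inherits_permissions"]]

for lib in broken_inheritance(libraries):
    print(f"Review permissions: {lib['site']}/{lib['library']}")
```

Even a crude report like this turns an invisible permission drift into a reviewable list, which is the essence of the controls described above.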

Correctly applied, these controls would effectively eliminate the unauthorised and inadvertent discovery of data within the platform’s remit.

Indeed, the same holds true today for M365 Copilot and the other Microsoft Copilots rooted in the Microsoft ecosystem. Just substitute Purview, Priva and Entra for IRM, DLP, etc., and you have the foundational components of data security needed to secure the underlying data these Copilots rely on.

So, what is different today?

Whilst I think the core challenges are the same, i.e. Data Discovery, Classification, Controls, Visibility and Response, there are some key differences when discussing AI applications versus traditional enterprise search and discovery.

Complexity and Scale

Even with the notorious SharePoint sprawl issue, the scale and complexity of the data these tools access, and the nature of the prompts and responses, are different. One example I use frequently: in SharePoint, you could perform a search and see content that, based on the file name, description or extract, you had a good idea you should not access, and hopefully that would dissuade you from opening it. However, due to the nature of the responses you receive from Gen AI tools, the response may include content you do not realise you should not have access to, or content presented to you in a way that makes it almost unavoidable.

Adversarial Attacks

Generative AI systems are particularly vulnerable to adversarial attacks, in which malicious attempts are made to manipulate input data to deceive AI models. This is a genuine concern both for externally facing models/interfaces and as an insider risk. I doubt anyone has ever tried this with SharePoint search, although I would be interested to hear whether it has happened.

Ethical and Privacy Concerns

With the more traditional “legacy” platforms, you tended to search for specific types of content, and the output was not “manipulated” by the system; it tended to be returned as the original document. This is not to say that the original content was ethical or complied with privacy rules, etc., but the system was not inferring and creating new content based on this and other training data within the corpus, and I think this is a key distinction.

Prioritisation

This last point is more personal and probably relates to the scale of the systems in question. For example, a focus was placed on insider risk identification, data exfiltration, etc., and there were/are tools that can at least monitor this type of activity without the overhead of a complete data security strategy. As a result, organisations (perhaps naturally) may have de-prioritised the sometimes complex data discovery, classification and governance work required for a robust data security solution/posture.

Conclusion

The core elements of data security for AI systems are, in effect, the same as those organisations have needed for some time. The real differences are the scope of the issue, the size of the risk, and organisations’ realisation that, to take advantage of the shift to AI-based solutions, they have to understand their data (and identity, but more on that later) estate.
