Author:
Igor Kumagin, Cybersecurity expert, Kaspersky
Contents
Introduction
Understanding Data Protection
AI and Data Collection
Challenges of Data Protection in AI
Solutions for Data Protection in AI
Conclusion
Introduction
The appearance of new technology in our lives requires the re-evaluation of some basic concepts that have been established for a long time. A good example is the way people have been taught to grasp the steering wheel of a car. For many years, drivers were instructed to keep their hands at the 11 and 1 o’clock positions. Nowadays, traffic safety administrations in many countries recommend placing hands at the 9 and 3 o’clock positions. This shift occurred due to the introduction of new technology – airbags – that forced people to re-evaluate their behavior and adjust regulations accordingly.
Our world is rapidly changing, as new digital technologies play an increasingly crucial role. In particular, we have been witnessing the active development of artificial intelligence (AI), which is already providing many benefits worldwide. So far, AI technologies have had a meaningful impact on people around the globe, a trend that will definitely expand in the foreseeable future[1]. Like most technological innovations, AI technologies present both vast opportunities and substantial risks. It’s important to mention that artificial intelligence is not ‘good’ or ‘evil’ by design – and its impact on society depends solely on the behavior and intentions of a user or a developer.
Artificial intelligence is based on the AI triad[2] of algorithms, data, and computing power. This triad specifies that computing power is used to execute algorithms that learn from data. Each element is vital to the effectiveness of machine learning systems. Algorithms determine the way these systems process information and make decisions. Data establishes background knowledge and defines how AI systems learn about the world. If a fact is not included in the vast data provided, the system will know nothing about it. Acquiring larger and more representative datasets is extremely important for effectiveness of these systems.
As data is critical for AI, serving as the foundation upon which algorithms learn, many developers depend on publicly available sources to compile the datasets they use for their training. They gather the information either directly from the Internet through web scraping or indirectly by acquiring data from other organizations, often using a combination of these methods. Both approaches require adherence to data protection regulations.
In this article, I will explore the challenges and solutions of data collection for AI. The process of gathering data through web scraping is invisible, and people are often unaware that their publicly available information is being collected and utilized for AI training. Imagine that you are an artist or designer who posts the results of their work on a personal website for public enjoyment or to promote sales of the art. This website can be accessed by people all around the world who may enjoy the art and possibly purchase pieces they like. However, the same website can also be accessed by systems that scrape data for AI training. The artist’s work can then be used by AI to mimic the artist or use their ideas to produce similar works of art. This situation raises the question of whether artists should have the option to restrict the use of their work for AI training.
Nowadays, we can observe the tendency that an increasing number of companies are actively collecting[3] all publicly accessible data to train their intelligence systems. This raises another question: should we re-evaluate our current understanding of copyright and data protection foundations and adapt them to the new emerging AI technology?
Understanding Data Protection
The digital world was built to allow data to circulate freely. Just as money is essential to the financial system, data crucial to the digital world. Similar to money, data should be protected and used in accordance with the rights of individuals whose personal data it pertains to. Personal data protection is enforced by corresponding laws, with General Data Protection Regulation (GDPR) as a prime example of such legislation. The GDPR introduced comprehensive rights for data subjects and stringent obligations for data processors/controllers, fundamentally altering how personal data is handled not only in EU but also around the world. Data subjects in the EU are now equipped with enhanced rights, including the right to access, correct, delete, and transfer their data. Meanwhile, data processors/controllers are subject to stringent requirements to ensure lawfulness and limit the purposes of processing, protect data by design and by default, report data breaches within tight deadlines and more.
Another legislative framework designed to regulate data access and use within the EU is the EU Data Act[4]. This law set out rules on how products connected to the internet, collectively forming a network known as the Internet-of-things (IoT), should handle data they generate. It also establishes general conditions for sharing data between businesses. The law specifies how users of IoT devices can access, use and transfer the data that they co-generate through their IoT devices.
The field of all data protection still immature when compared to the regulation of personal data. Protecting personal data and all other types of data requires different approaches and strategies. The EU Data Act represents the first targeted step in regulating the protection of data that is not always personal. While the article discusses the protection of all data, mechanisms for safeguarding data (not only personal data) from the perspective of data subjects are not as developed as those for regulating personal data. Meanwhile, AI is already actively using such data sets, creating disproportionate and unfair conditions for data subjects.
Data protection extends beyond laws like GDPR, which impose legal obligations on companies collecting and processing personal data. It is primarily defined by technical standards, industry practices, and organizational measures that guide companies to build a comprehensive and secure environments for data processing. In particular, these standards[5] teach companies and organizations how to safeguard data against unauthorized changes, compromise, or loss through set of measures that include the use of a secure software development lifecycle, which ensures the creation of secure services; reliable encryption, which secures data exchanges between the user and the cloud; digital certificates for legitimate and secure server authentication and application updates; separation of data storage and strict access policies; and techniques such as data anonymization and obfuscation and others.
Advancements in AI are transforming how data is collected, stored, and used, thereby setting new challenges for the ICT industry and policy makers to keep up with the emerging technologies and improve data protection laws and ethics (which serve as the foundation for other legislation). I will outline challenges to the ICT industry and policy makers further in this article.
AI and Data Collection
As we learned, data is one of the three key elements of an AI system. Let’s explore how data for AI is acquired. Developing an AI model involves several stages, including collecting and pre-processing training data. The amount of data required for AI training is so vast that it often necessitates acquisition from as many sources as possible. For instance, GPT-4 was trained on roughly 13 trillion tokens, which is roughly 10 trillion words[6]. Most AI developers obtain the data from publicly accessible sources by collecting training data through web scraping from online environments such as blogs, social networks[7], forums, product reviews, and personal websites, which may even contain personal data that individuals have posted there.
Another way of collecting data for AI training involves acquiring data from third-party providers or brokers. This approach mitigates legal risks because companies secure data through licensing agreements. These agreements often involve purchasing rights to use sets of digital content, such as images and videos from platforms like Shutterstock or more specialized collections from entities like Photobucket[8]. Photobucket — once a leading image-hosting website — for example, is in discussions to license its vast library of media to tech companies for AI training, highlighting a lucrative market for old digital content. Another illustration is OpenAI’s collaboration with Stack Overflow, the Q&A forum for software developers, aimed at enhancing its generative AI models’ performance on programming-related tasks[9].
The EU AI Act sets specific requirements [10]for data collection, mandating that high-risk AI systems use high-quality data. These requirements stipulate that data must be accurate and updated constantly, adequately reflect the variety of real-world conditions, and be free of bias complete. They also emphasize the protection of personal information, among other aspects. However, there are no specific provisions on how data should be protected.
Challenges of Data Protection in AI
Data processing has been around for decades, with millions of systems developed to store and process personal information. Over time, the industry has established a set of practices [11]and standards for protecting data at rest and in use, securely transmitting it, and creating safe environments for personal data processing. Many of these practices are applicable to AI system development, but there are also specific risks and threats in AI that require additional measures. While it might seem straightforward to handle a piece of personal information in everyday life, such as ID details or financial information, protecting training data for AI is a completely different challenge as it involves dealing with enormous amounts of data. For instance, the dataset used to train ChatGPT 3.5 comprises 45 terabytes of text data[12]. Protecting data of this size is one of the first challenges of AI system development.
Another challenge with large datasets is ensuring transparency in personal data processing. Under data protection laws such as the GDPR, transparency is a key principle that supports several rights of data subjects including the right to be informed. This right involves providing individuals with information about the way their data is used and the purposes of this processing.
One detail that distinguishes AI systems from traditional ones is that model training requires high-quality data, which ideally should not be modified, or if needs be, should only undergo minimal modifications. Data modification is a key element of data protection, commonly recognized through the practice of data anonymization where possible. Anonymization removes personally identifiable information, which can help mitigate risks in case the data is compromised.
Another, and the most effective, principle of data protection that is also incorporated into laws such as the GDPR is data minimization. Data minimization involves collecting only the data necessary to achieve a specific goal. This approach limits potential damage in the event of a data breach. The data minimization principle might not be applicable to AI system development, as AI models often require large amounts of data to achieve better output results.
In generative AI, user input written in natural language is often presented to the model verbatim. This means that user prompts and any third-party data all co-exist in the same context window (the name of the working memory where the conversation history and context is usually stored during inference), written in the same language with little to no separation. This setup can lead to unintended behavior, as the instructions in the user prompt can alter the unexpected behavior of the conversational app. This a situation is commonly referred to as a prompt injection. The potential consequences this risk include unauthorized access or data breaches.
Solutions for Data Protection in AI
Data protection laws like the GDPR, which is a technology-neutral regulation, establish fundamental principles of data processing in the context of AI development[13]. However, these principles are sufficient for the protection of personal data acquired from internet platforms that, in turn, are GDPR compliant, and do not cover entire dataset that might be used for AI development. The EU AI Act set some additional requirements for data governance, but they only apply to high-risk AI systems.
Let’s explore how data protection might be implemented in practice and what additional measures can be introduced to protect opensource non-personal data.
First and foremost, it’s important to adhere to data protection laws when using information from publicly accessible sources, making sure that an appropriate lawful basis, such as consent, exists for the data processing. It is also crucial to publish a detailed policy outlining how all data is collected, processed and stored.
Additionally, it is essential to integrate security measures throughout the entire AI system development lifecycle. The Development, Security, and Operations (DevSecOps[14]) approach is practically relevant as many of AI systems are hosted on the cloud. By following DevSecOps practices, AI developers can ensure that security requirements are seamlessly integrated into all stages of the development such as design, deployment, and maintenance. This approach also includes implementation of data protection from the start, adopting a ‘data protection by design and by default’ approach.
Finally, developers should consider embracing responsible data collection practices to provide data owners with an option to opt out of the web scrapping of their publicly available data. This option, in particular, will guarantee that creatives are able to publish their data online without worrying that it can be used for AI training purposes. Implementing this option could be as simple as using a robots meta tag on the webpage, which will prevent the data from being used for AI:. This option is consistently advocated in the submissions I prepare for all public consultations on the safe AI during my tenure at the Kaspersky[15].
Conclusion
The field of non-personal data protection is immature compared to the regulation of personal data. Such protection requires specialized approaches and legislative frameworks that do not yet fully exist. At the same time, AI is already actively using this type of non-personal public data, leading to disproportionate and unfair conditions for data subjects. Addressing these issues could involve introducing ethical standards or industry norms as the situation requires immediate action due to the widespread integration of AI into people’s lives.
Different data collection and processing practices elicit varying public perceptions, level of trust in AI technologies. While web scraping may be viewed negatively, purchasing data from digital platforms can create a more positive impression, as individuals voluntarily accept platforms’ agreements that include use of data for AI provisions. This practice of purchasing data represents a significant step towards more responsible data acquisition practices in the AI industry.
One of the most important ethical principles that should be established in the industry is prioritizing the protection of all data. AI developers must demonstrate to their customers and users that they can be trusted with data, which is crucial in user interactions with AI systems. Besides implementing industry-accepted data protection measures, AI developers might consider providing users with an option to opt out of web scrapping of publicly available data by defining a universal robot meta tag for that purpose.
[1] More than half of companies use AI and IoT in their business processes https://www.kaspersky.com/about/press-releases/2024_more-than-half-of-companies-use-ai-and-iot-in-their-business-processes
[2] The AI Triad and What It Means for National Security Strategy https://cset.georgetown.edu/wp-content/uploads/CSET-AI-Triad-Report.pdf
[3] Google Says It’ll Scrape Everything You Post Online for AI https://gizmodo.com/google-says-itll-scrape-everything-you-post-online-for-1850601486
[4] EU Data Act https://digital-strategy.ec.europa.eu/en/policies/data-act
[5] Enhanced Security Requirements for Protecting Controlled Unclassified Information: A Supplement to NIST Special Publication 800-171 https://csrc.nist.gov/pubs/sp/800/172/final
CIS Critical Security Controls https://www.cisecurity.org/controls
[6] GPT-4 Details Have Been Leaked! https://www.kdnuggets.com/2023/07/gpt4-details-leaked.html
[7] X/Twitter has updated its Terms of Service to let it use Posts for AI training https://stackdiary.com/x-can-now-use-posts-for-ai-training-as-per-terms-of-service/
[8] Inside Big Tech’s underground race to buy AI training data https://www.reuters.com/technology/inside-big-techs-underground-race-buy-ai-training-data-2024-04-05/
[9] Stack Overflow signs deal with OpenAI to supply data to its models https://techcrunch.com/2024/05/06/stack-overflow-signs-deal-with-openai-to-supply-data-to-its-models/
[10] EU AI Act, Article 10, Data and Data Governance https://www.euaiact.com/article/10
[11] ISO/IEC 27018:2019 Information technology — Security techniques — Code of practice for protection of personally identifiable information (PII) in public clouds acting as PII processors https://www.iso.org/standard/76559.html
Enhanced Security Requirements for Protecting Controlled Unclassified Information: A Supplement to NIST Special Publication 800-171 https://csrc.nist.gov/pubs/sp/800/172/final
CIS Critical Security Controls https://www.cisecurity.org/controls
[12] ChatGPT in a Nutshell https://medium.com/@lij061703/chatgpt-in-a-nutshell-37ea5ddbc362
[13] Europe: The EU AI Act’s relationship with data protection law: key takeaways https://privacymatters.dlapiper.com/2024/04/europe-the-eu-ai-acts-relationship-with-data-protection-law-key-takeaways/
[14] What is DevSecOps? https://www.redhat.com/en/topics/devops/what-is-devsecops
[15] Intelligenza artificiale: Garante privacy apre un’indagine sulla raccolta di dati personali on line per addestrare gli algoritmi. L’iniziativa è volta a verificare l’adozione di misure di sicurezza da parte di siti pubblici e private https://www.garanteprivacy.it/home/docweb/-/docweb-display/docweb/9952078