    The Case for Chain of Custody Controls

    By big tee tech hub | April 27, 2026 | 15 Mins Read


    If a machine learning model is trained on 50,000 images, an attacker need alter only 50 of them, or 0.1 percent of the training data, to achieve a data poisoning attack. Consider a data curation pipeline involving a drone camera that captures images and stores them on disk (data generation and storage). These images are labeled and split into datasets (data curation), and a machine learning model is then trained on those datasets (model training). This pipeline involves multiple instances where data is at rest or in transit and presumes the involvement of multiple people (perhaps one person to curate the data and another to train the model). Each instance presents an opportunity to alter the data, and each person involved presents a potential insider threat. For example, an on-path attacker could modify the images when they are transferred from the drone to be curated, or, after the data is labeled, the attacker could modify some labels, leaving the images themselves unaltered.

    Data poisoning occurs when an insider or adversary modifies training data to influence the performance or operation of a model. As artificial intelligence (AI) has proliferated, corresponding security mechanisms have not kept up, leaving vulnerabilities, including in the data used to train the model. However, lessons gained from decades of experience in data protection can be applied to AI.

    Organizations without mechanisms to detect or prevent data poisoning are open to an avenue of attack that is difficult to mitigate once it has succeeded. While there is burgeoning research in machine unlearning, which could be used to recover from a data poisoning attack if you know what was poisoned, it is still more effective to retrain the model, itself an extremely expensive task. Since recovery options are meager at best, prevention is the optimal approach. As threat actors increasingly look to influence models and degrade user trust by inducing incorrect behaviors, preventing data poisoning is more important than ever.

    We propose being proactive with chain of custody controls. This is because probabilistic methods to retroactively check whether data was tampered with are becoming less effective. Chain of custody, the documentation of who possesses an object and when, is a concept primarily applied to legal evidence, but it has application to other domains. This post describes data poisoning and proposes cryptographic chain of custody as a mitigating solution.

    Data Poisoning

    Data poisoning is an attack against the machine learning model that powers an AI system. The methodology of this attack is to subtly modify the data or labels used to train the model. An adversary can utilize data poisoning to influence or degrade model performance, leading to bias, overlooked issues, and the introduction of software vulnerabilities.

    As the size of models and datasets exceeds the capability of people to label data, machine learning has moved from supervised learning to semi-supervised learning. In supervised learning, all training data is labeled, whereas in semi-supervised learning, only some of the training data is labeled. The rest of the data supports the training process by enabling the model to capture patterns in the data. LLM training, for example, is generally unsupervised, detecting patterns in the training data that guide the predictive generation process. Regardless, the machine learning training process typically relies on large amounts of data, and only a small fraction of that data need be malicious to achieve a data poisoning attack.
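    To make the scale concrete, the following Python sketch flips 0.1 percent of a 50,000-item label set, the proportion cited above. The `poison_labels` helper and its target label are purely illustrative; this is a toy demonstration of how little data an attacker must touch, not real attack tooling.

```python
import random

def poison_labels(labels, fraction, target_label, seed=0):
    """Return a copy of `labels` with `fraction` of them flipped to `target_label`.

    Toy illustration of scale only: 0.1 percent of 50,000 labels is
    just 50 flips. The helper name and parameters are our own.
    """
    rng = random.Random(seed)
    poisoned = list(labels)
    n_flips = int(len(labels) * fraction)
    for i in rng.sample(range(len(labels)), n_flips):
        poisoned[i] = target_label
    return poisoned, n_flips

labels = ["vehicle"] * 50_000
poisoned, n = poison_labels(labels, 0.001, "background")
# n == 50: only 0.1 percent of the data, yet the training signal is now biased.
```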

    Data curation encompasses “all the processes needed for principled and controlled data creation, maintenance, and management, together with the capacity to add value to data.” It can be an extremely difficult and time-consuming process when humans must review, verify, and label each data item. Given the rapid pace of data development and the scarcity of data journaling software, organizations often struggle to keep accurate logs of data manipulation and access.

    Cryptographic Chain of Custody

    Chain of custody is not a new topic; it is used in the legal realm to provide a paper trail for evidence and records. The documentation and control verification processes used in chain of custody management have made their way into other fields, such as digital forensics and supply chain management. Nonetheless, keeping detailed records of data is only part of the solution.

    In our previous work, AI Hygiene Starts with Models and Data Loaders, we explored the value of traditional cybersecurity methods to secure AI systems. As part of that work, we described how cryptographic methods can be leveraged to provide robustness in the presence of an adversary. Use of checksums and digital signatures are key components of a secure and robust cryptographic chain of custody. When combined with detailed metadata for each data item, cryptographic methods can provide integrity and privacy assurances within the chain of custody process.
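    As a rough illustration of these building blocks, the sketch below computes SHA-256 checksums and authentication tags with Python's standard library. The HMAC stands in for a true digital signature only to keep the example dependency-free; a production system would use an asymmetric scheme (e.g., Ed25519) so that verifiers need not hold the signing secret.

```python
import hashlib
import hmac

def checksum(data: bytes) -> str:
    """SHA-256 checksum: the integrity anchor for a data item."""
    return hashlib.sha256(data).hexdigest()

def sign(record: bytes, key: bytes) -> str:
    """HMAC tag over a record; a stand-in for a real digital signature."""
    return hmac.new(key, record, hashlib.sha256).hexdigest()

def verify(record: bytes, key: bytes, tag: str) -> bool:
    return hmac.compare_digest(sign(record, key), tag)

image = b"...raw drone image bytes..."
key = b"drone-42-signing-secret"   # hypothetical per-device key
tag = sign(image, key)
assert verify(image, key, tag)             # untouched data verifies
assert not verify(image + b"x", key, tag)  # any change is detected
```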

    With auditable records for data transactions, it becomes more difficult for an adversary to modify the data without being noticed, thus making the model training processes robust to data poisoning attacks. How to keep these records depends on the organization, but databases, record retention systems, and transaction logs are common options.

    Items of relevance for chain of custody in a data-intensive system might be features of the data such as

    • domain-relevant metadata
    • file-specific metadata
    • generators or processors performing the action
    • digital signatures for approvals
    • checksums and other integrity verification mechanisms
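    One minimal way to represent these features is a single record type per custody event. The field names below are our own illustrative choices, not a prescribed schema:

```python
from dataclasses import dataclass, asdict
import hashlib
import json

@dataclass(frozen=True)
class CustodyRecord:
    # Fields mirror the bulleted items above; names are illustrative.
    domain_metadata: dict    # e.g., GPS position, altitude, sensor settings
    file_metadata: dict      # e.g., filename, size, capture time
    actor: str               # generator or processor performing the action
    action: str              # e.g., "generate", "transfer", "clean", "annotate"
    checksum: str            # SHA-256 of the data item after this action
    signature: str           # approval signature over the record contents

def record_digest(rec: CustodyRecord) -> str:
    """Canonical digest of a record, suitable for chaining into the next step."""
    blob = json.dumps(asdict(rec), sort_keys=True).encode()
    return hashlib.sha256(blob).hexdigest()
```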

    Notional Data Workflow

    To facilitate our discussion of how chain of custody can be used to protect a machine learning training process from data poisoning attacks, we introduce a notional data workflow in Figure 1. Next, we elaborate on each step of the lifecycle, explaining how cryptographic chain of custody can be applied to assure data provenance. For this walkthrough, we will assume a simple scenario based on a drone that takes photos wherein a photo represents a data item. In this scenario, the data will be used to train a machine learning algorithm for object detection and classification.


    Figure 1: The machine learning process is divided into three phases: data generation and storage, data curation, and model training.

    Cryptographic Chain of Custody on Our Notional Data Workflow

    1. Data Generation and Storage

    Drones, sensors, online transactions, and the downloading of a public dataset are all mechanisms that create data items on which an organization may wish to train a machine learning model. Once a data item has been created, it typically needs to be stored somewhere for future use. Depending on the properties of the data item (e.g., how it will be used in the future and storage available), a data engineer could choose to store it in the cloud, a database, on a filesystem, in a data lake, or in a warehouse.

    Data Generation


    Figure 2: A drone takes pictures for data generation, the first step of the data lifecycle, and notes image metadata.

    The first step of the lifecycle is data generation. As part of our hypothetical system, each drone will have a unique signature that it can use to authenticate every piece of data that it creates. This initial data signing should be done as close as possible to the source and time of data generation. In addition to signing the data generated by the drone system, checksums should be calculated for the image and its metadata so that any future changes to their integrity—as the data is transported from its remote source to the managed repository—can be detected.

    To summarize, at the data generation stage, our tracking manifest individually records the initial image metadata, its checksum, and what platform generated it. The package of all relevant data items is then digitally signed, allowing future stages of our workflow to perform integrity checks.

    Data Storage


    Figure 3: An automated data loader creates a transfer record recording that it transferred the file image.jpg with the specified checksum into a storage location.

    The next step in the lifecycle is data storage, wherein a data item is transferred from its source system and then stored for later use. To do this in an audited and verified manner, we need to track the transfer that occurred, the mechanism or tool used to transfer the data, and the destination of the transfer. After completion, our data loader will sign the record that tracks this transfer. Using the data item and its location to perform integrity checks, this signature can be verified at future stages in the workflow. This guards against tampering as the data is transported from source to the secure repository.
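    A transfer step along these lines might look like the following sketch, where `tracked_transfer` is a hypothetical helper that copies a file, verifies the checksum on both sides, and emits the transfer record (which a real loader would additionally sign):

```python
import hashlib
import shutil
import tempfile
from pathlib import Path

def tracked_transfer(src: Path, dst: Path, loader_id: str) -> dict:
    """Copy a file, verify integrity on both sides, and emit a transfer record.

    Hypothetical helper: a real data loader would also digitally sign
    the returned record.
    """
    before = hashlib.sha256(src.read_bytes()).hexdigest()
    shutil.copy2(src, dst)
    after = hashlib.sha256(dst.read_bytes()).hexdigest()
    if before != after:
        raise RuntimeError("integrity check failed during transfer")
    return {"file": src.name, "tool": loader_id,
            "destination": str(dst), "checksum": after}

# A throwaway file stands in for the drone's image.jpg.
tmp = Path(tempfile.mkdtemp())
(tmp / "image.jpg").write_bytes(b"raw image bytes")
record = tracked_transfer(tmp / "image.jpg", tmp / "stored.jpg", "loader-1")
```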

    2. Data Curation

    Once data has been created and stored for use, it needs to be curated by a data engineer or data processing system to ensure it is in a proper state for machine learning. As part of this process, called “cleaning,” the data is converted from its raw form into a format suitable for machine learning. For example, imagery might be sharpened or denoised, text records may have missing fields imputed, and videos may be broken down into individual frames. Once data has been cleaned, it will be labeled or annotated to assist in the machine learning process. Finally, each data item will be analyzed by a data specialist and assigned to a training or testing dataset for the machine learning process.

    Data Cleaning


    Figure 4: The data engineer’s identity, the history of the data item, and the new checksum are noted.

    Now that our image is in cloud storage, it is ready for any pre-processing that may be necessary before the image is used as part of a machine learning pipeline. For this example, let’s assume that our organization has multiple drones that take imagery at different resolutions; however, the native image size we use in our machine learning pipeline is 640×480 pixels. Therefore, all imagery that will be used in this pipeline must be resized. In our example organization, resizing is manually performed by data engineers using image editing software.

    Critically, we need to ensure that our chain of custody is maintained while preprocessing occurs. This stage of our workflow should ensure that the image being edited, and the location it is loaded from, have not been modified. Because we are keeping detailed records of our actions, all that is necessary to do this is to verify that the data, checksums, and signatures all match the records we created during data generation and storage.

    The cleaned record, as a new image created from the original, is added to our workflow. Just as in our data generation step, we will checksum and sign all relevant data and metadata and then store these in tracking records that can be verified at future stages.
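    A cleaning step under this discipline might be sketched as follows: verify the input against its recorded checksum, apply the transform, and emit a new record linked to the parent by checksum. The `clean_step` helper and the stand-in transform are illustrative assumptions:

```python
import hashlib

def clean_step(raw: bytes, expected_checksum: str, transform) -> dict:
    """Verify the input against its recorded checksum, apply the cleaning
    transform, and emit a record tying the cleaned item to its parent."""
    if hashlib.sha256(raw).hexdigest() != expected_checksum:
        raise ValueError("chain of custody broken: input was modified")
    cleaned = transform(raw)
    return {"parent_checksum": expected_checksum,
            "checksum": hashlib.sha256(cleaned).hexdigest(),
            "data": cleaned}

raw = b"high-res image bytes"
rec = clean_step(raw, hashlib.sha256(raw).hexdigest(),
                 lambda b: b[:8])   # toy stand-in for the 640x480 resize
```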

    Data Annotation


    Figure 5: The data engineer’s identity and data information are noted. Observe that the checksum is the same as in the previous step.

    With our data finalized and ready for use in a machine learning workflow, it next needs to be annotated for use in a supervised learning scenario. Annotation is the part of the data flow where a domain expert creates annotations to establish a ground truth that helps train a machine learning model. The key items we need to track as part of a chain of custody workflow are the image that is being labeled, who labeled the data, and the annotations that were generated. Just as in previous steps, we will add these items to our chain of custody with checksums and signatures. Having the records in the chain of custody log enables us to verify who created the annotations and their integrity when they are used in the future.
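    A minimal annotation record along these lines could look like the sketch below. Note that the image checksum is carried forward unchanged, since annotation creates new metadata without modifying the image itself; the `annotate` helper and its field names are our own:

```python
import hashlib
import json

def annotate(image_checksum: str, annotator: str, annotation: dict) -> dict:
    """Record who annotated an image and what they produced.

    The image checksum is carried forward unchanged, because annotation
    does not touch the image itself.
    """
    blob = json.dumps(annotation, sort_keys=True).encode()
    return {"image_checksum": image_checksum,
            "annotator": annotator,
            "annotation": annotation,
            "annotation_checksum": hashlib.sha256(blob).hexdigest()}

rec = annotate("ab12" * 16, "analyst-7",
               {"label": "vehicle", "bbox": [10, 20, 64, 48]})
```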

    Dataset Creation


    Figure 6: Checksums are added for the set of images and the associated annotations.

    Creating datasets is the penultimate step in our data workflow. Dataset creation is the process of assigning data into a collection. A data engineer performs this task based on criteria such as quality, balanced representation, and task relevance. The data engineer must understand what data should be tracked for chain of custody, and the chain of custody should be updated whenever a dataset is created or modified. Upon creation or modification, a checksum of the dataset and all its attributes, such as the files and annotations for the dataset and any additional metadata associated with all entities, must be calculated. Finally, when complete, this dataset file should be signed by its creator or modifier, signifying that they approve of all the contents of the dataset.
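    One simple way to checksum a whole dataset is to hash the sorted checksums of its members, so the digest is independent of ordering but changes whenever any file or annotation is added, removed, or altered. This construction is an illustrative choice, not the only option (a Merkle tree would also work):

```python
import hashlib

def dataset_checksum(item_checksums, annotation_checksums):
    """Digest over a dataset's members: hashing the sorted member checksums
    makes the result order-independent, yet it changes whenever any file
    or annotation is added, removed, or modified."""
    h = hashlib.sha256()
    for c in sorted(item_checksums) + sorted(annotation_checksums):
        h.update(bytes.fromhex(c))
    return h.hexdigest()

imgs = [hashlib.sha256(b"img-a").hexdigest(),
        hashlib.sha256(b"img-b").hexdigest()]
anns = [hashlib.sha256(b"labels").hexdigest()]
ds = dataset_checksum(imgs, anns)
```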

    Before creating the dataset at all, the chain of custody should be verified for all items in the dataset. This will ensure that a dataset is only composed of valid items and that none have been tampered with since their creation. The data engineer must verify every image and annotation in the dataset to ensure that their chains of custody are intact and complete. Below is a visualization of this verification process for our example Image-low-res.jpg file from our training dataset.


    Figure 7: The checksums for each step of the lifecycle for the data item are validated.

    If the chain of custody checks for any item in the dataset cannot be completed, then the verification process should generate an error, alerting system owners to the problem. This notifies system owners that data has been tampered with and triggers further forensics into the cause of the tampering.


    Figure 8: Checksums for each step of the lifecycle for the data item cannot be validated.

    If all the items contained in the dataset pass validation, then the dataset can be signed and verified as adhering to an unbroken chain of custody from data creation through to addition to a dataset.
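    The verification walk in Figures 7 and 8 can be sketched as a loop that refetches each recorded version of the item and recomputes its checksum; any mismatch breaks the chain. The record layout here is illustrative:

```python
import hashlib

def verify_chain(records, fetch):
    """Walk an item's custody records oldest to newest, refetching each
    recorded version and recomputing its checksum. Any mismatch (or
    missing version) breaks the chain."""
    for rec in records:
        data = fetch(rec["stage"])
        if data is None or hashlib.sha256(data).hexdigest() != rec["checksum"]:
            return False
    return True

# Toy version store for one image across two lifecycle stages.
store = {"generated": b"raw image bytes", "cleaned": b"resized image bytes"}
records = [{"stage": stage, "checksum": hashlib.sha256(data).hexdigest()}
           for stage, data in store.items()]
```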

    3. Model Training and Evaluation

    Following complete curation, the data is suitable for model training. Model training is iterative in that data can be repeatedly loaded and fed into a model-training process where the final product is a machine learning model. This trained model will then be evaluated against a test set to measure the efficacy and generalizability of the model for the task it was trained to perform.

    To assist in performing model training and evaluation in a chain of custody-enabled way, the data loaders for model training and evaluation should also be chain of custody-aware. For this context, chain of custody-aware means that loaded data items will always have their chain of custody rules verified at the outset to ensure there was no tampering of the dataset files, annotations, and the data itself.


    Figure 9: The checksums for each step of the lifecycle for the data item are validated before being fed to a machine learning model.

    If all verification steps succeed, data can then be loaded and used to train a model.
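    A chain of custody-aware data loader can be sketched as a generator that verifies each item's recorded checksum before yielding it to the training loop; the field names are illustrative:

```python
import hashlib

def custody_aware_loader(items):
    """Yield (data, label) pairs only after each item's recorded checksum
    verifies; a single failed check halts loading instead of silently
    training on tampered data."""
    for item in items:
        if hashlib.sha256(item["data"]).hexdigest() != item["checksum"]:
            raise ValueError("custody check failed for " + item["name"])
        yield item["data"], item["label"]

good = {"name": "img1.jpg", "data": b"pixels", "label": "vehicle",
        "checksum": hashlib.sha256(b"pixels").hexdigest()}
batch = list(custody_aware_loader([good]))
```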

    Upon model training completion, the last step in the chain of custody can be completed as part of the model training process. This step involves writing out a verified and signed manifest of all the data on which the model has been trained, in addition to a checksum and signature for the produced model. The data manifest can then be used in conjunction with a model file to have a verified manifest of all the data a model was trained on. Moreover, future invocations of the model can load and verify the chain of custody data before the model is used. A complete chain of custody process will enable system owners to have confidence that the model and the data used to create it are untampered with and are aligned with the organization’s intent.
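    The final manifest step might be sketched as below. Again, an HMAC stands in for a true asymmetric signature purely to keep the example dependency-free:

```python
import hashlib
import hmac
import json

def training_manifest(data_checksums, model_bytes, signing_key):
    """Pair the trained model's checksum with the checksums of everything
    it was trained on, then sign the result."""
    manifest = {"data": sorted(data_checksums),
                "model_checksum": hashlib.sha256(model_bytes).hexdigest()}
    blob = json.dumps(manifest, sort_keys=True).encode()
    manifest["signature"] = hmac.new(signing_key, blob,
                                     hashlib.sha256).hexdigest()
    return manifest

m = training_manifest([hashlib.sha256(b"image bytes").hexdigest()],
                      b"serialized model weights", b"trainer-key")
```

    A later invocation of the model can strip the signature field, recompute the HMAC over the remaining manifest, and compare, verifying both the model and its training data before use.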

    What if We Don’t Use a Chain of Custody Mechanism?

    There are two alternatives to implementing a chain of custody system. The first, as we discussed earlier, is to track detailed statistics about all data and models. That is, every data item input to a model, every model training process, and the model's output must be tracked to ensure each lies within an anticipated distribution. Implementing granular tracking of these statistics has a high overhead because there are few tools to assist with this process. Additionally, these statistics must be continuously recalculated for sufficient monitoring. Furthermore, unlike chain of custody, this check is probabilistic: an attacker can bypass the safeguards with well-crafted inputs, and false positives can frustrate users, reducing their trust in the data verification system.
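    For contrast, a purely statistical safeguard might look like the sketch below: flag any value more than a few standard deviations from its historical mean. Unlike a checksum comparison, this test can both miss well-crafted poisoned inputs and false-positive on legitimate outliers; the helper and threshold are illustrative.

```python
import statistics

def within_expected(value: float, history: list, z_threshold: float = 3.0) -> bool:
    """Probabilistic check: flag values more than z_threshold standard
    deviations from the historical mean. It can miss crafted inputs and
    can false-positive on legitimate outliers."""
    mu = statistics.mean(history)
    sigma = statistics.stdev(history)
    return abs(value - mu) <= z_threshold * sigma

history = [9.0, 10.0, 11.0, 10.0, 9.0]   # e.g., past per-batch label counts
```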

    Fortunately, there are many systems today that can minimize integration overhead. Most modern database systems can be enabled to generate checksums and create audit logs of data item modifications.

    The second option is to do nothing, a choice contingent on risk appetite. For example, a low-impact environment, such as research with no production systems, may choose to forgo chain of custody controls. If other security controls are in place, such as complete isolation of the system environment from the outside world and endpoint protection, then the attack surface is largely minimized. Conversely, a large organization creating production-quality AI models should consider a chain of custody mechanism to prevent data poisoning.

    Looking ahead, we are seeking collaborators to partner with us to advance the state of the art on protecting data in machine learning pipelines. If you are interested, please contact us at info@sei.cmu.edu.


