How to use AWS Managed Blockchain to train a multi-party Machine Learning model

5 min read · May 17, 2021

\ Machine Learning

Machine Learning and Deep Learning are families of algorithms used to find patterns in data, such as the contents of a data lake. If you want to understand a bit more about ML/DL, you can read this and follow this channel on YouTube.

In this post, I'll go through a common problem in ML: how companies use their data to train models. Multi-party training can be tough due to data constraints, regulations, compliance, and privacy.

You can read more about it here.

Imagine that you run an insurance company that holds a lot of personal data and wants to train a model with it, and all your insurance industry partners want to develop this model with you, sharing the costs and the benefits. How can they train the same model as you without sharing their clients’ personal data?

We will address this problem in this post. You can read more about it here, here, and also here, where academic researchers did an awesome job looking for a good solution. I came up with a simpler one, but let me be clear: I’m neither a Machine Learning nor a Blockchain specialist. This post is based on study only; I was not able to put it into practice, but I’m interested in trying with your help!

\ Blockchain & Distributed Ledger

Reading about the issue above, I thought of a solution using Blockchain and went through this article and this project about Learning Chain. These two studies gave me the idea of using AWS Managed Blockchain, which you can read more about here.

Blockchain is a good choice for this because the ledger is immutable, tamper-evident, and distributed, which makes a multi-party system easier to run.
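To make the "immutable" part concrete, here is a minimal sketch of a hash-linked ledger in plain Python. It is not Hyperledger Fabric or AWS Managed Blockchain code; the function names (`block_hash`, `append_block`, `is_valid`) and the block layout are assumptions made up for illustration. Each block stores a digest of a trained model plus the hash of the previous block, so editing any old block is detectable.

```python
import hashlib
import json


def block_hash(index, model_digest, previous_hash):
    """Hash a block's contents together with the previous block's hash."""
    payload = json.dumps(
        {"index": index, "model_digest": model_digest, "previous_hash": previous_hash},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def append_block(chain, model_bytes):
    """Append a block that records the SHA-256 digest of a trained model."""
    previous_hash = chain[-1]["hash"] if chain else "0" * 64
    index = len(chain)
    model_digest = hashlib.sha256(model_bytes).hexdigest()
    chain.append({
        "index": index,
        "model_digest": model_digest,
        "previous_hash": previous_hash,
        "hash": block_hash(index, model_digest, previous_hash),
    })
    return chain


def is_valid(chain):
    """Recompute every hash; an edited block no longer matches its stored hash."""
    previous_hash = "0" * 64
    for block in chain:
        expected = block_hash(block["index"], block["model_digest"], block["previous_hash"])
        if block["previous_hash"] != previous_hash or block["hash"] != expected:
            return False
        previous_hash = block["hash"]
    return True


chain = []
append_block(chain, b"model-v1")
append_block(chain, b"model-v2")
print(is_valid(chain))  # True

# Tampering with an old block is detected on the next validation.
chain[0]["model_digest"] = hashlib.sha256(b"tampered").hexdigest()
print(is_valid(chain))  # False
```

In a real network, every party holds a copy of the ledger, so a tampered copy would also disagree with everyone else's.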

TL;DR: Each company runs its own AWS environment, using SageMaker, Lambda, S3, CodeDeploy, and CodePipeline in a common Machine Learning flow on AWS. Once the model is trained, it is submitted to the blockchain for consensus analysis. If it is good, it is added to the ledger, and every other party member can use the trained model inside their own environment, securely and privately.

Let's talk about the architecture in depth.

\ Architecture

As we said, we will use AWS Managed Blockchain as the orchestrator of our learning chain and SageMaker for the training, so let us see how Machine Learning works in the AWS Well-Architected Framework.

(Image1): This flow runs entirely inside the party member's cloud, isolated from the chain, which keeps the data private.

The Well-Architected Framework sets the flow for the model and the data. At the edge, a Lambda function is exposed to the world through an API Gateway, so that a user interface can analyze the trained model.

(Image2): One level up, here we can see how the data gets to the model dynamically and how the model is sent to the chain after training.

This flow finishes in an S3 bucket that holds the trained model. The upload triggers an AWS Lambda function that connects to a Hyperledger Fabric client. This client is responsible for transforming the model into a valid block and sending the trained model out to the whole chain.
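The S3-to-Lambda step could look roughly like the sketch below. The event layout (`Records[0].s3.bucket.name`, `s3.object.key`) follows the S3 notification format, but everything else is an assumption: `build_model_transaction`, `submit_to_fabric`, and the `party-a` identifier are hypothetical names, and `submit_to_fabric` is only a placeholder for the real Hyperledger Fabric client call, which is not shown here. The sketch also assumes that only a reference to the model goes into the transaction, while the file itself stays in S3.

```python
import hashlib
import json
from urllib.parse import unquote_plus


def build_model_transaction(event, party_id):
    """Turn an S3 'ObjectCreated' event into a transaction payload.

    Assumption: only a reference to the model is sent, not the file itself.
    """
    record = event["Records"][0]["s3"]
    bucket = record["bucket"]["name"]
    key = unquote_plus(record["object"]["key"])  # object keys arrive URL-encoded
    return {
        "party_id": party_id,
        "model_uri": f"s3://{bucket}/{key}",
        "etag": record["object"].get("eTag", ""),
        "size_bytes": record["object"].get("size", 0),
    }


def submit_to_fabric(transaction):
    """Hypothetical placeholder for the Hyperledger Fabric client call."""
    body = json.dumps(transaction, sort_keys=True).encode("utf-8")
    return {"tx_id": hashlib.sha256(body).hexdigest(), "status": "SUBMITTED"}


def handler(event, context):
    """Lambda entry point, invoked by the S3 upload notification."""
    transaction = build_model_transaction(event, party_id="party-a")
    return submit_to_fabric(transaction)


if __name__ == "__main__":
    # Trimmed-down sample event for a local dry run.
    sample_event = {"Records": [{"s3": {
        "bucket": {"name": "trained-models"},
        "object": {"key": "credit/model+v1.tar.gz", "size": 1024, "eTag": "abc123"},
    }}]}
    print(handler(sample_event, None))
```

Keeping the payload builder separate from the submit call makes it possible to test the event parsing locally without any AWS or Fabric access.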

Image2 is the orchestrator of Image1, so the whole of Image1 fits inside the “Train Model” object in Image2.

(Image3): This is the outermost view of our architecture, detailing how each party member connects through the chain.

Once a new block is spread over the chain, all nodes test it to make sure it contains a valid model. A consensus system is triggered, and if the new block is invalid, it is discarded.
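As a rough illustration of that validation step, the sketch below lets each node score a candidate model against its own private validation samples and cast a vote, and the block is accepted on a simple majority. This is a simplification, not how Hyperledger Fabric reaches consensus; the `node_vote` and `reach_consensus` functions, the 0.8 accuracy threshold, and the toy `candidate_model` are all assumptions.

```python
def accuracy(predict, samples):
    """Share of (features, label) samples the model predicts correctly."""
    correct = sum(1 for features, label in samples if predict(features) == label)
    return correct / len(samples)


def node_vote(predict, private_samples, min_accuracy=0.8):
    """A node approves the model if it performs well on the node's own data."""
    return accuracy(predict, private_samples) >= min_accuracy


def reach_consensus(votes):
    """Accept the block only if a strict majority of nodes approved it."""
    return sum(votes) * 2 > len(votes)


def candidate_model(features):
    """Toy model: approve credit when the score is at least 600."""
    return features["score"] >= 600


# Each node validates with data that never leaves its own environment.
node_datasets = {
    "bank-a": [({"score": 700}, True), ({"score": 500}, False)],
    "bank-b": [({"score": 650}, True), ({"score": 580}, False)],
    "bank-c": [({"score": 620}, False), ({"score": 610}, False)],
}

votes = [node_vote(candidate_model, samples) for samples in node_datasets.values()]
print(votes, reach_consensus(votes))  # [True, True, False] True
```

Only the votes cross the network; the validation samples stay with each party.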

In this way, every member of the chain can train and use the model with real, private, and protected user data without the fear of a data leak.

This flow also partially automates the training, as it uses a dynamic model saved on S3 and pulls its data dynamically from a data lake that is fed with wrangled data coming from a stream.

\ Scenario

A real scenario for this solution is the banking industry: as a security-sensitive business, it treats data privacy as the top priority.

Imagine that banks all over the country want to develop a credit insurance model that predicts whether a client is a good fit for a credit product.

How can they train and use the same model, dynamically, without a security breach?

This is a perfect fit for this solution: all the parties in the system can use their own isolated cloud environment to set up, run, and train the model without exposing their own data.

The downside of this solution is that all parties have to sanitize their data to a common pattern. As Machine Learning can be unsupervised, the data does not have to be labeled, but the input data must have the same structure. For example, you do not need to label the “country” field, but every member should sanitize their “country” field to a two-letter pattern, like “BR” or “US”.
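A minimal sketch of that sanitization step, assuming each party maps its own spellings to two-letter codes before training. The `COUNTRY_CODES` table and the `normalize_country` function are hypothetical and cover only a few spellings for illustration.

```python
# Hypothetical lookup table; a real one would cover every spelling a party uses.
COUNTRY_CODES = {
    "brazil": "BR",
    "brasil": "BR",
    "united states": "US",
    "usa": "US",
}


def normalize_country(value):
    """Return a two-letter upper-case country code for a raw field value."""
    cleaned = value.strip().lower()
    if len(cleaned) == 2 and cleaned.isalpha():
        # Already two letters: only fix the casing (the code is not validated).
        return cleaned.upper()
    if cleaned in COUNTRY_CODES:
        return COUNTRY_CODES[cleaned]
    raise ValueError(f"Unknown country value: {value!r}")


records = [{"country": " Brasil "}, {"country": "us"}, {"country": "United States"}]
sanitized = [{**row, "country": normalize_country(row["country"])} for row in records]
print(sanitized)  # [{'country': 'BR'}, {'country': 'US'}, {'country': 'US'}]
```

Raising an error on unknown values, instead of passing them through, keeps malformed rows out of the shared training flow.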

\ Conclusion

This solution is still a case study. I did not have the means to develop this project and put it into production, but in my opinion it could be used to solve the privacy issue in distributed machine learning.

It looks like a really simplistic solution for a very complex problem, and I am probably missing something due to my lack of knowledge. So, if you have some insights, or if you want to join forces with me to get this running in a development environment, feel free to reach out to me.