How to use AWS Managed Blockchain to train a multi-party Machine Learning model
\ Machine Learning
Machine Learning (ML) and Deep Learning (DL) are families of algorithms used to find patterns in data, such as the contents of a data lake. If you want to understand a bit more about ML/DL, you can read this and follow this channel on YouTube.
In this post, I'll go through a common problem in ML: how companies can use their data to train models together. Multi-party training can be tough because of data constraints, regulations, compliance, and privacy.
Imagine that you run an insurance company that holds a lot of personal data and wants to train a model with it, and all your insurance industry partners want to develop this model with you, sharing the costs and the benefits. How can they train the same model as you without sharing their clients' personal data?
We will address this problem in this post. You can read more about it here, here, and also here, where academic researchers did an awesome job looking for a good solution. I came up with a simpler one, but let me make this clear: I'm neither a Machine Learning nor a Blockchain specialist. This post is based on study only; I was not able to put it into practice, but I'm interested in trying with your help!
\ Blockchain & Distributed Ledger
While reading about the issue above, I thought of a solution using Blockchain and went through this article and this project about Learning Chain. These two studies gave me the idea of using AWS Managed Blockchain, which you can read more about here.
Blockchain is a good choice for this because its ledger is immutable, robust, and distributed, which makes a multi-party system easier to build.
TL;DR: Each company will run its own AWS environment, using SageMaker, Lambda, S3, CodeDeploy, and CodePipeline, as in a common Machine Learning flow on AWS. But once the model is trained, it is submitted to the blockchain for a consensus check. If it is good, it is added to the ledger, and every other member can use the trained model inside their own environment, securely, privately, and in isolation.
Let's look at the architecture in depth.
\ Architecture
As mentioned, we will use AWS Managed Blockchain as the orchestrator of our learning chain and SageMaker for the training, so let us see how Machine Learning works in the AWS Well-Architected Framework.
The Well-Architected Framework sets the flow for the model and the data. At the edge, it exposes a Lambda to the outside world through API Gateway, so that a user interface can analyze the trained model.
This flow ends in an S3 bucket that holds the trained model. The upload triggers an AWS Lambda function that connects to a Hyperledger Fabric client. This client is responsible for transforming the model into a valid block and sending the trained model out to the whole chain.
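A minimal sketch of what that Lambda handler could look like, assuming the standard S3 event notification shape. The `submit_to_ledger` function and the `registerModel` chaincode function name are hypothetical placeholders for whatever Fabric client a real deployment would use. Recording the model's S3 location on the ledger, rather than the model bytes themselves, is also an assumption of this sketch.

```python
import json
from urllib.parse import unquote_plus


def build_model_record(event):
    """Extract the uploaded model's location from an S3 event notification."""
    s3_info = event["Records"][0]["s3"]
    return {
        "bucket": s3_info["bucket"]["name"],
        # S3 URL-encodes object keys in event notifications.
        "key": unquote_plus(s3_info["object"]["key"]),
        "etag": s3_info["object"].get("eTag", ""),
    }


def submit_to_ledger(function_name, payload):
    """Hypothetical placeholder: a real deployment would call a Fabric client here."""
    print(f"would invoke chaincode {function_name} with {payload}")
    return {"status": "SUBMITTED", "function": function_name}


def handler(event, context):
    """Lambda entry point, triggered by the model upload to S3."""
    model = build_model_record(event)
    payload = json.dumps(model, sort_keys=True)
    return submit_to_ledger("registerModel", payload)
```

Keeping the event parsing in its own function makes it easy to test locally with a hand-written event before wiring in a real ledger client.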
Image2 is the orchestrator of Image1, so the whole of Image1 fits inside the “Train Model” object in Image2.
Once a new block is broadcast to the chain, every node tests it to make sure it contains a valid model. This triggers the consensus mechanism, and if the new block is invalid, it is discarded.
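One way to picture this validation step is a simple majority vote, sketched below. This is an illustration only, not Hyperledger Fabric's actual consensus protocol: the function names, the scoring rule (a node approves if the candidate model scores at least as well as the current one on that node's private validation data), and the majority quorum are all assumptions of this sketch.

```python
def node_vote(candidate_score, current_score, tolerance=0.0):
    """One node's vote: approve if the candidate is at least as good as the
    current model on this node's private validation data (assumed rule)."""
    return candidate_score >= current_score - tolerance


def reach_consensus(votes, quorum=0.5):
    """Accept the new model only if strictly more than `quorum` of nodes approve."""
    if not votes:
        return False
    return sum(votes) / len(votes) > quorum


# Each tuple is (candidate score, current score) as measured by one node.
scores = [(0.91, 0.88), (0.85, 0.86), (0.90, 0.90)]
votes = [node_vote(candidate, current) for candidate, current in scores]
accepted = reach_consensus(votes)
```

Each node only shares its vote, never its validation data, which is what keeps the private data inside each member's environment.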
In this way, every member of the chain can train and use the model with real, private, and protected user data without fear of data leaks.
This flow also partially automates the training, since it uses a dynamic model saved on S3 and pulls its data dynamically from a data lake that is fed with wrangled data coming from a stream.
\ Scenario
A real scenario for this solution is the banking industry: as a security-sensitive business, it treats data privacy as the top priority.
Imagine that banks all over the country want to develop a credit insurance model that predicts whether a client is a good fit for a credit product.
How can they train and use the same model, dynamically, without a security breach?
This is the perfect setting for this solution: all the parties in the system can use their own isolated cloud environment to set up, run, and train the model without exposing their own data.
The downside of this solution is that all parties must sanitize their data to a common pattern. Since Machine Learning can be unsupervised, the data does not have to be labeled, but the input data must share the same structure. For example, you do not need to specify the field “country”, but every member should sanitize their “country” field to a two-letter pattern, like “BR” or “US”.
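The kind of sanitization described above could be sketched as follows. The alias table and function names are hypothetical and deliberately tiny; a real pipeline would need a complete, agreed-upon mapping shared by every member.

```python
# Hypothetical alias table: every member would have to agree on the same one.
COUNTRY_ALIASES = {
    "brazil": "BR",
    "brasil": "BR",
    "united states": "US",
    "usa": "US",
}


def normalize_country(value):
    """Map a free-form country value to a two-letter uppercase code."""
    cleaned = value.strip()
    # Already a two-letter code: only fix the casing.
    if len(cleaned) == 2 and cleaned.isalpha():
        return cleaned.upper()
    try:
        return COUNTRY_ALIASES[cleaned.lower()]
    except KeyError:
        raise ValueError(f"unknown country value: {value!r}") from None


def sanitize_record(record):
    """Return a copy of the record with its "country" field normalized."""
    return {**record, "country": normalize_country(record["country"])}
```

Raising an error on unknown values, instead of passing them through, keeps malformed records out of the shared training flow.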
\ Conclusion
This solution is still a case study. I did not have the means to build and run this project, but in my opinion it could be used to solve the privacy issue in distributed machine learning.
It looks like a really simplistic solution to a very complex problem, and I am probably missing something due to my lack of knowledge. So, if you have some insights, or if you want to join forces with me to get this running in a development environment, feel free to reach out to me.