How to back up your SMB shares and blob storage from Azure to AWS with DataSync

Introduction

Have you ever been in a situation where you had files on Azure but wanted them on another cloud provider as well? Well, one of our customers had exactly this requirement.

We built a solution for this problem, and since sharing is caring, I’m going to share some insights and the actual code so you can make use of it.

Overview

So let’s start with the requirements we had from the customer:

  • Tertiary backup should be on AWS
  • Data transfer from Azure to AWS must be encrypted
  • Both SMB shares and blob storage from all storage accounts must be backed up
  • Backup should also be restorable in AWS
  • As always: keep costs to a minimum

Key data for this project:

  • Over 100 storage accounts
  • Each storage account has both SMB shares and blob data
  • Total storage size of all data: approx. 90 TB

First approach

The first solution we investigated looked like this:

Diagram of first solution: https://aws.amazon.com/de/blogs/modernizing-with-aws/azure-blob-to-amazon-s3/

It has Lambdas, SNS, and other serverless services in there! Perfect! But after closer examination, some points did not quite fit the requirements. The first point is that this solution only covers blob storage. In other words, we would have to extend it so that it also covers SMB files.

Another point: quite a few components are involved here. If everything works right from the start, great, but if something fails, happy debugging. That alone can drive up the costs considerably.

Looking at the first paragraph of the article, it states that this solution not only transfers data once but is also capable of transferring data continuously.

It’s a nice solution, but not exactly what we need for our purposes. We need something simpler!

If only AWS had a service that could move data from A to B and we didn’t have to worry about anything other than connecting everything. That would be a nice solution. Luckily, there is!

The solution

Ladies and Gentlemen, let me introduce you to – AWS DataSync!

DataSync is exactly the service we needed for the solution. It supports both Blob storage and SMB shares. Plus, we can store everything directly in S3 as is. Easy to use, simple to maintain, and serverless. Just perfect!

Here is the solution as a diagram:

Diagram of the solution with DataSync

Much simpler, isn’t it? But now the exciting question is, how do I implement this? Time for a demo.

Demo with ClickOps

Since we first want a demo to see how the whole thing works, the first step is to build the solution using ClickOps. There are a few steps that need to happen on both Azure and AWS.

I could write them all down or I could simply point you to the instructions from AWS itself that I used. Enjoy:

Once everything is set up, the first transfer can begin. In our test, we used a couple of files totaling approx. 90 GB. However, it is important to mention here that a single one of these files is already 85 GB. Why is this important? I’ll tell you later.

So the result looks like this:

Performance result for the first test

The transfer was successful and all data is now in S3. Perfect! But after looking at these numbers, the first thought was: “Hmpf, that’s kind of slow, even for German transfer rates”.

This might become a massive problem if the transfer takes too long and the next job is already due. But don’t worry, we wouldn’t be evoila if we didn’t have a solution for this!

Fixing the performance problem

When transferring data from cloud to cloud, everyone would probably expect significantly more than just 30 MiB/s. So, what the fricking 🦆 is going on here? Let’s go in search of clues.

We can rule out DataSync right away because it only offers the option of throttling the transfer or running at full power. This means that the only adjustment left to us is the Azure VM that the agent is running on.

Our test machine is the Standard_E4as_v4 with the recommended resources:

  • 32 GiB RAM
  • 4 cores

Theoretically, everything is sufficient, and according to the Azure docs, this machine is capable of pushing 4000 MBit/s through the network.

So what is the problem then? Simply put: Size does matter!

Size does matter

As described above, one file is 85 GB! It turns out the bottleneck in our test was the compression of this 85 GB file. Nothing more. We did another test with much smaller files and achieved speeds of over 100 MiB/s.

From this, I can recommend keeping the files as small as possible. If, for whatever reason, you still run into performance issues, AWS has your back. The following article covers more options to accelerate your data transfer:

How to accelerate your data transfers with AWS DataSync scale-out architectures

Getting it done

Now that all the problems have been solved and the demo works, it’s time to make it usable.

As already mentioned, there are more than 100 storage accounts and each of them has SMB shares as well as blob storage. This means that for each storage account, we would need to create one DataSync task for the blob storage and one for the SMB share. Doing this via ClickOps would be madness, so in this case we naturally use the power of IaC. A rough sketch of what these per-account resources could look like follows below.
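
Here is a minimal Terraform sketch of the per-account source locations and tasks. This is not the actual code from the repository; names such as var.datasync_agent_arn and local.accounts are illustrative, and the S3 destination locations are sketched further down in the S3 section:

locals {
  # Turn the account list into a map so it can be used with for_each.
  accounts = { for sa in var.storage_account_list : sa.name => sa }
}

resource "aws_datasync_location_smb" "smb" {
  for_each        = local.accounts
  agent_arns      = [var.datasync_agent_arn] # DataSync agent VM running in Azure
  server_hostname = each.value.smb.server_hostname
  subdirectory    = each.value.smb.subdirectory
  user            = each.value.smb.user
  password        = each.value.smb.password
}

resource "aws_datasync_location_azure_blob" "blob" {
  for_each            = local.accounts
  agent_arns          = [var.datasync_agent_arn]
  container_url       = each.value.azure_blob.container_url
  authentication_type = "SAS"

  sas_configuration {
    token = each.value.azure_blob.token
  }
}

# One task per storage account and source type, both writing into S3.
resource "aws_datasync_task" "smb_to_s3" {
  for_each                 = local.accounts
  name                     = "${each.key}-smb"
  source_location_arn      = aws_datasync_location_smb.smb[each.key].arn
  destination_location_arn = aws_datasync_location_s3.smb[each.key].arn
}

resource "aws_datasync_task" "blob_to_s3" {
  for_each                 = local.accounts
  name                     = "${each.key}-blob"
  source_location_arn      = aws_datasync_location_azure_blob.blob[each.key].arn
  destination_location_arn = aws_datasync_location_s3.blob[each.key].arn
}

Each task can additionally get a schedule block so the backup runs on a fixed cadence instead of being started manually.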

You can find the code for this project in our GitHub repository evoila/azure-to-aws-datasync.

Let me explain some of the decisions we made and why certain things in the repository were solved the way they were.

List of accounts

We decided to manage the accounts in a simple list. The reason: it lets us get started sooner rather than later.

An example of how such an account is defined can be found in the file terraform/storage_accounts.tfvars.example:

storage_account_list = [{
  name = "my-storage-account"
  smb = {
    server_hostname = "mystorage.file.core.windows.net"
    subdirectory = "/fileshare/"
    user = "user"
    password = "UGxlYXNlRG9udEhhY2tNZUhhY2tlcm1hbg=="
  }
  azure_blob = {
    container_url = "https://mycontainer.blob.core.windows.net/example"
    token = "sp=rl&st=2023-10-19T13:15:31Z&se=2023-10-19T21:15:31Z&spr=https&sv=2022-11-02&sr=c&sig=<string>"
  }
}]

As you can see, credentials are stored in this file. So make sure to keep it a secret!
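
For context, the matching variable declaration could look roughly like this (a sketch, not necessarily identical to the one in the repository). Marking it as sensitive keeps the credentials out of Terraform’s plan output:

variable "storage_account_list" {
  description = "Storage accounts to back up, including SMB credentials and blob SAS tokens."
  type = list(object({
    name = string
    smb = object({
      server_hostname = string
      subdirectory    = string
      user            = string
      password        = string
    })
    azure_blob = object({
      container_url = string
      token         = string
    })
  }))
  sensitive = true
}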

CMK

The CMK is only created initially so that it exists as a Terraform resource. The key material, however, is created and managed by the customer, as the compliance requirements demand. This is why we are ignoring changes to the key on the alias:

resource "aws_kms_alias" "cmk_s3_alias" {
  name          = "alias/${var.cmk_s3_alias}"
  target_key_id = aws_kms_external_key.cmk_s3.arn
  lifecycle {
    ignore_changes = [target_key_id]
  }
}
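
For completeness, the external key itself could be declared roughly like this. This is a sketch under the assumption that the key material is imported outside of Terraform, not necessarily the exact resource from the repository:

resource "aws_kms_external_key" "cmk_s3" {
  description = "CMK for the S3 backup bucket (key material imported by the customer)"

  # The customer imports and rotates the key material outside of Terraform,
  # so we do not manage it here.
  lifecycle {
    ignore_changes = [key_material_base64]
  }
}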

S3

For S3, we have decided to put all files in a single bucket and sort them using unique names and keys.

An example would be:

account1
├── blob
│   └── files
└── smb
    └── files
account2
├── blob
│   └── files
└── smb
    └── files

This way we can manage the bucket policy on a single bucket, and if more fine-grained permissions are required later, they can be granted at the key level.
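
In DataSync terms, this prefix layout simply becomes the subdirectory of the S3 destination locations. Here is a rough sketch; the bucket and role names are illustrative, not the repository’s actual identifiers:

resource "aws_datasync_location_s3" "smb" {
  for_each      = local.accounts
  s3_bucket_arn = aws_s3_bucket.backup.arn
  subdirectory  = "/${each.key}/smb" # e.g. /account1/smb

  s3_config {
    bucket_access_role_arn = aws_iam_role.datasync_s3.arn
  }
}

resource "aws_datasync_location_s3" "blob" {
  for_each      = local.accounts
  s3_bucket_arn = aws_s3_bucket.backup.arn
  subdirectory  = "/${each.key}/blob" # e.g. /account1/blob

  s3_config {
    bucket_access_role_arn = aws_iam_role.datasync_s3.arn
  }
}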

Monitoring

I didn’t include monitoring in the Terraform code because it depends heavily on your scenario. Setting it up is easy though. Trust me. Just create an EventBridge rule for DataSync, that’s it. From there, you can trigger a Lambda, SNS or whatever you like. You can find a great article here on how to set up a notification for a successful or failed task.
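
As a starting point, such a rule with an SNS target could look roughly like the sketch below. The exact detail-type and state values are assumptions on my part, so check them against the AWS documentation for your use case:

resource "aws_sns_topic" "datasync_notifications" {
  name = "datasync-task-notifications"
}

resource "aws_cloudwatch_event_rule" "datasync_task_state" {
  name        = "datasync-task-execution-state"
  description = "Notify when a DataSync task execution finishes"

  # Assumed event pattern; verify detail-type and states in the AWS docs.
  event_pattern = jsonencode({
    source        = ["aws.datasync"]
    "detail-type" = ["DataSync Task Execution State Change"]
    detail = {
      State = ["SUCCESS", "ERROR"]
    }
  })
}

resource "aws_cloudwatch_event_target" "datasync_to_sns" {
  rule = aws_cloudwatch_event_rule.datasync_task_state.name
  arn  = aws_sns_topic.datasync_notifications.arn
}

Keep in mind that the SNS topic also needs a topic policy that allows events.amazonaws.com to publish to it.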

Conclusion

In our journey of transferring files from Azure to AWS, AWS DataSync emerged as the optimal choice. Despite performance hitches, streamlining processes and prioritizing smaller file sizes led to successful transfers.

The Infrastructure as Code solution we developed efficiently creates the tasks for all storage accounts, and we applied strong security measures to ensure centralized, controlled access to the data.

Our solution is open source, and the provided Terraform code is readily adaptable for your projects. If you also need to synchronize data from storage accounts that have both SMB shares and blob storage, simply clone our GitHub repository, update the storage account list with your specifications, and run terraform apply to start the process. This simplifies migrating substantial amounts of data securely between cloud platforms.


This article was originally published on evoila.