What I Learned at Work this Week: AWS Transfer Family
In my experience, a lot of computer programming is about abstraction. When I first started learning, I coded in a repo that already had a testing infrastructure in place so that I could check my work. I didn’t have to write the tests or even understand how they worked, nor did I have to look into the node modules that hid all the non-native logic. As we learn, we encounter new challenges that force us to dig into concepts that we previously hand-waved as “magic.”
Over the past few months, a few of my coworkers have been getting their hands dirty with the infrastructure behind my famous super long Python script. I knew it related to my work but whenever they stopped to explain what they were doing, my eyes would glaze over. This weekend, I’ve committed to trying a little harder to understand a previously abstract concept.
Until about 30 seconds ago, my understanding was that SFTP stood for Secure File Transfer Protocol and, to be fair, it looks like it sort of does. But according to a couple of sources, the acronym is primarily used to abbreviate Secure Shell (SSH) File Transfer Protocol. This doesn’t really change the meaning too much, but it does specify that the files are transferred using the SSH protocol, which provides a particular brand of security and authentication. We’re not going to be examining exactly what makes SFTP special today (we’ll leave that abstract for now), but it’s important to at least understand what an SFTP is.
The script I’m working on pulls data from a DB, parses it, and drops it into a CSV. We need a way to share this data with the clients who requested it, but some of it is very sensitive, so we wouldn’t want to just send it in an email. Instead, we use an SFTP server to house the data in a location that can only be accessed via a secure shell connection. We can create unique credentials that, when used in conjunction with an SSH key, will grant specific users access to specific parts of our SFTP server. But that’s where things get tricky.
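To make the "DB to CSV" step concrete, here is a minimal sketch of the idea. It is not my actual script: the in-memory SQLite database, the `orders` table, and the `export_rows_to_csv` helper are all stand-ins I made up for illustration, and the real job talks to a different database and does far more parsing.

```python
import csv
import io
import sqlite3


def export_rows_to_csv(conn: sqlite3.Connection, query: str) -> str:
    """Run a query and return the result set as CSV text with a header row."""
    cursor = conn.execute(query)
    # cursor.description holds one tuple per column; the name is the first item.
    header = [column[0] for column in cursor.description]
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(cursor)
    return buffer.getvalue()


# Hypothetical in-memory table standing in for the real database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (client TEXT, total REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("acme", 12.5), ("globex", 7.0)],
)

csv_text = export_rows_to_csv(
    conn, "SELECT client, total FROM orders ORDER BY client"
)
print(csv_text)
```

The CSV text this produces is the kind of file that would then be dropped onto the SFTP server for a client to pick up.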
My team used a single configuration file to create accounts for clients and make them accessible. We did this all through the command line either because our SFTP service didn’t provide a UI or we had never gotten around to setting it up. This created a problem because the file was constantly being updated to add new users, but if an errant update was made, it could take down the whole product for all users. As we grew, we edited the file more frequently and the stakes increased because more data was on the line. We needed a more developed solution, so we turned to AWS Transfer Family.
AWS Transfer Family
I’ll start by reiterating that I’m working on understanding this problem and that it’s a Saturday, so I don’t have access to my coworkers to fact-check my description of our situation. It’s definitely possible, if not likely, that we could have solved our problem using functionality of the service we already employed. I would guess we ended up going with AWS because it was easier to set up, faster, more reliable, more affordable, or any combination of the above. With that out of the way, what makes it unique?
Amazon’s big selling point for Transfer Family is that it’s serverless. As we’ve learned, serverless computing gives us greater flexibility in using and paying for a service. The docs point this out by describing Transfer Family as “A fully managed service that scales in real time to meet your needs” and writing that “There are no upfront costs, and you pay only for the use of the service.” This is a big boon for our case, as we’re already addressing a scaling issue. Here’s another note:
Amazon EFS is built to scale on demand to petabytes without disrupting applications, growing and shrinking automatically as you add and remove files. This helps eliminate the need to provision and manage capacity to accommodate growth.
This past Thursday, I had to submit multiple PRs just to alter the memory allocation limits for a certain job I was running, so I view this as another big win. Admittedly I haven’t shopped around with different server managers, so this might be a standard feature by now, but the combination of serverless computing and an easy UI for configuration seems to be a strong selling point for Amazon.
Transfer Family Setup
We’ve established that we want to use Transfer Family, so what exactly do we have to do? Let’s look at the steps laid out in the Getting Started guide:
- Create an Amazon S3 bucket or Amazon EFS file system. EFS stands for Elastic File System. When we saw it in an earlier paragraph, it was describing dynamic storage size. Whether we go with that option or simply create a bucket, we can initialize and manage our storage through the AWS UI. No PR needed.
- Create an IAM role that contains two IAM policies… Here’s another acronym I’ve finally looked up: IAM stands for Identity and Access Management and, according to yet another doc, an IAM policy is a statement, typically in JSON format, that allows a certain level of access to a resource. In other words, this is how we’re going to grant clients access to the files we’re generating for them. We want them to be able to log into the SFTP and perform certain actions, but we don’t want to give everyone a full range of permissions. This is definitely something I want to ask my teammates about because it seems to me that each client that accesses files in Transfer Family would need their own unique IAM permissions, which allow them to read and write to their bucket and their bucket only. I did find a Terraform PR that contained some variables in its configs, but I still think there’s something that I’m missing. In any case, we certainly have to create some IAM roles to make this work.
- (Optional) If you have your own registered domain, associate your registered domain with the server. SFTP servers must have a domain that’s used for logging in (email@example.com). If we want to implement our own registered domain, we can use a Domain Name System (DNS) provider of our choosing. When setting up URL redirects for clients in the past, I’ve used Terraform to generate name server (NS) values that they can plug into their DNS (the client’s DNS, because the client owns the website). Since this is our SFTP, I don’t think we had to do that. As usual, Amazon makes it easy to change a domain without writing a PR. I would guess that my teammates used Route 53 since I didn’t see a Terraform file for a custom domain.
- Create a Transfer Family server and specify the identity provider type used by the service to authenticate your users. Yet again, this can be fully managed through the AWS console. We want to create an SFTP server, so we’d follow this set of instructions. An identity provider is some service that manages credentials/authentication. Our options include the standard AWS provider, AWS Directory Service for Microsoft Active Directory, and a third option for all other providers. We could use the custom setup and write a Lambda to integrate with Okta, for instance, but in our case I believe we stuck with the AWS provider.
- If you are working with a server with a service-managed identity provider, as opposed to a custom identity provider, add one or more users. We just have to add a user so that we can try signing in and check permissions/access. If we’re using a custom identity provider, we’ll have already tested as part of the setup.
- Open a file transfer protocol client and configure the connection to use the endpoint hostname for the server that you want to use. You can get this hostname from the AWS Transfer Family console. The last step is…try it out! Some common file transfer protocol clients are Cyberduck or FileZilla. On a Mac, you can even use OpenSSH to log in through your terminal!
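Going back to the IAM step for a moment: since a policy is just a JSON statement, here is a rough sketch of what a per-client policy might look like if each client were limited to their own folder in a shared S3 bucket. This is my guess at the general shape, not our actual configuration. The bucket name, the client prefix, and the `build_client_policy` helper are all hypothetical; only the policy structure and the S3 action names are standard IAM.

```python
import json


def build_client_policy(bucket: str, client_prefix: str) -> dict:
    """Return an IAM policy document limited to one client's folder in a bucket."""
    prefix = client_prefix.strip("/")
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Let the client list only the keys under their own prefix.
                "Sid": "ListOwnFolder",
                "Effect": "Allow",
                "Action": "s3:ListBucket",
                "Resource": f"arn:aws:s3:::{bucket}",
                "Condition": {
                    "StringLike": {"s3:prefix": [f"{prefix}/*", prefix]}
                },
            },
            {
                # Let the client read, write, and delete objects in that folder.
                "Sid": "ReadWriteOwnFiles",
                "Effect": "Allow",
                "Action": ["s3:GetObject", "s3:PutObject", "s3:DeleteObject"],
                "Resource": f"arn:aws:s3:::{bucket}/{prefix}/*",
            },
        ],
    }


# Hypothetical bucket and client names, for illustration only.
policy = build_client_policy("example-sftp-bucket", "client-a")
print(json.dumps(policy, indent=2))
```

Generating the document from a function like this would mean one template serves every client instead of a hand-written policy per account, which may be what the variables in that Terraform PR were doing. That is still on my list of questions to ask.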
More questions than answers
In this post, I tried to take a weekend to understand some of the work my teammates had done over several weeks. If they were here with me, I’m sure this would have been a breeze, but without any guidance, I’ve ended up with a pretty vague concept of how we implement AWS Transfer Family in my workplace. With that said, here’s the good news: I now understand the service to a level where I can ask educated, specific questions about our implementation. I hope to ask what identity provider we use and whether we set up a custom domain. I definitely want to ask how we manage IAM permissions for unique clients and what we devote to a Lambda vs what we control in the AWS console. Sometimes my blog posts aren’t especially helpful for the general public to read; you’d definitely get more out of going straight to the AWS documentation. Still, they’re always a good exercise in problem solving for myself, and I think it’s useful to put that out into the world so that others can see that learning can sometimes be a struggle too.
- SFTP File Transfer Protocol — get SFTP client & server, SSH Academy
- SSH File Transfer Protocol, Wikipedia
- What is AWS Transfer Family?, AWS Docs
- How AWS Transfer Family Works, AWS Docs
- Managing access controls, AWS Docs
- Working with custom hostnames, AWS Docs
- How domain registration works, AWS Docs
- Create an SFTP-enabled server, AWS Docs
- Identity Providers (IdPs): What They Are and Why You Need One, Okta