
Back up Github repos to S3

Application and infrastructure code is often the result of months or years of combined effort from a team, and costs a large amount of money to create. It makes sense to keep a backup of this digital asset, in case of accidental (or malicious) loss.

Years ago, IT teams would take responsibility for backing up the on-premises SVN/SourceSafe/Mercurial/Git servers to tape, and organise shipping the tapes off-site on a daily basis.

These days, I’m using Github or other hosted SaaS platforms to store code, but that doesn’t absolve me of the responsibility to take backups, since no service is perfect.

For example, in 2017, Gitlab lost some customer data in a widely publicised incident [1].

There are 3rd party services that offer code backup, and if you’re able to use them, they’re probably the best route. However, in some cases procurement may not be possible, or there might simply be no solution in place. That’s how I ended up writing my own backup script.

What’s needed?

  • A script that can download the repos, and upload them to S3.
  • Permissions for the script to access the repos, and write to S3.
  • A backup S3 bucket.
  • A way of running the script every day.

The script

The script uses the Github CLI [2] to list up to 1000 repositories, then uses xargs to execute a gh repo clone command for each of the repos.

Once that’s done, the script uses the AWS CLI to upload the content to S3.

#!/bin/bash

# https://gist.github.com/mohanpedala/1e2ff5661761d3abd0385e8223e16425
set -euxo pipefail

echo "Logging in with personal access token."
export GH_TOKEN=$BACKUP_GITHUB_PAT
gh auth setup-git

echo "Downloading repositories for" $BACKUP_GITHUB_OWNER
gh repo list $BACKUP_GITHUB_OWNER --json "name" --limit 1000 --template '{{range .}}{{ .name }}{{"\n"}}{{end}}' | xargs -L1 -I {} gh repo clone $BACKUP_GITHUB_OWNER/{}

echo "Downloaded repositories..."
find . -maxdepth 1 -type d

echo "Uploading to S3 bucket" $BACKUP_BUCKET_NAME "in region" $BACKUP_AWS_REGION
aws s3 sync --region=$BACKUP_AWS_REGION . s3://$BACKUP_BUCKET_NAME/github.com/$BACKUP_GITHUB_OWNER/`date "+%Y-%m-%d"`/

echo "Complete."

Permissions

To give the script access to download all of the repositories, it uses a Github Personal Access Token [3].

The token needs to be given permissions to read from all repositories.

To write to S3, the script requires that the AWS CLI is configured with access. This can be done using a number of techniques, including setting the various AWS environment variables. The best way to provide access is to create an IAM role in AWS (not a user) that has write access to your backup bucket, and to allow the machine or human user that’s running the script to “assume the role”.

Using a role instead of an IAM user with static credentials avoids relying on the same AWS credentials for months or years, and simplifies administration. There’s a sketch of such a role in the next section.

Backup bucket

Your backup S3 bucket should be configured according to the latest best practice. At the time of writing, the CDK definition would look something like this:

const backupBucket = new s3.Bucket(this, "backupBucket", {
	blockPublicAccess: s3.BlockPublicAccess.BLOCK_ALL,
	enforceSSL: true,
	versioned: true,
	encryption: s3.BucketEncryption.S3_MANAGED,
	intelligentTieringConfigurations: [
		{
			name: "archive",
			archiveAccessTierTime: Duration.days(90),
			deepArchiveAccessTierTime: Duration.days(180),
		},
	],
})
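
The Permissions section calls for an IAM role with write access to this bucket. As a rough sketch, assuming the role lives in the same CDK stack (the construct name is illustrative, and the trusted principal is a placeholder until the Github Actions trust is configured in the next section), it could look like this:

import * as iam from "aws-cdk-lib/aws-iam"

// Placeholder trust: swap this for whatever will actually run the backup
// script - the Github Actions trust is sketched in the next section.
const backupRole = new iam.Role(this, "backupRole", {
	assumedBy: new iam.AccountPrincipal(this.account),
})

// aws s3 sync lists the destination prefix as well as uploading to it,
// so the role gets both read and write access to the bucket.
// The role's ARN is what goes into the BACKUP_AWS_ROLE secret later on.
backupBucket.grantReadWrite(backupRole)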

Running the script every day

Since we’re using Github already, the easiest way to run some code every day is to use Github Actions [4].

Github Actions can run a YAML-based workflow triggered by changes to git repositories, as you might expect, but it can also run code on a schedule.

name: Backup

on:
  workflow_dispatch:
  schedule:
    - cron: '0 0 * * *' # every day at 00:00 UTC

To give the script AWS permissions, I use a Github Action that assumes a role inside AWS (more on this shortly). On the Github Actions side, the id-token: write permission needs to be granted so that the workflow can authenticate with AWS.

permissions:
  id-token: write
  contents: read

I’ve created a Docker image which has all of the script’s dependencies (AWS CLI, Github CLI) pre-installed, along with the script itself, and published it as a public image in Github’s container registry (ghcr.io).

The Github Actions workflow can then be configured to run inside that container.

jobs:
  Backup:
    runs-on: ubuntu-latest
    container: ghcr.io/a-h/githubbackup:main
    name: Backup

The next step is to assume the AWS role that has permissions to write to the backup S3 bucket, using the configure-aws-credentials Github Action [5].

The IAM role needs to be configured to enable it to be assumed by Github; an example is in the documentation [6], and there’s a rough CDK sketch after the workflow step below:

    steps:
      - name: Assume role
        uses: aws-actions/configure-aws-credentials@v1
        with:
          role-to-assume: ${{ secrets.BACKUP_AWS_ROLE }}
          aws-region: ${{ secrets.BACKUP_AWS_REGION }}
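
As a rough idea of how that trust could look in the CDK stack from the previous section, the placeholder principal on the backup role could be replaced with a web identity principal tied to Github’s OIDC provider. The organisation/repository pattern below is a placeholder, and the exact claims to match are described in the documentation [6]:

// Register Github's OIDC identity provider with the AWS account (one per account).
const githubOidc = new iam.OpenIdConnectProvider(this, "githubOidc", {
	url: "https://token.actions.githubusercontent.com",
	clientIds: ["sts.amazonaws.com"],
})

// Use this in place of the AccountPrincipal placeholder on backupRole,
// scoping the trust to workflows in the backup repository.
// "my-org/github-backup" is a placeholder - use your own organisation and repo.
const githubActions = new iam.WebIdentityPrincipal(githubOidc.openIdConnectProviderArn, {
	StringEquals: {
		"token.actions.githubusercontent.com:aud": "sts.amazonaws.com",
	},
	StringLike: {
		"token.actions.githubusercontent.com:sub": "repo:my-org/github-backup:*",
	},
})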

Once the role is assumed, I think it’s a good idea to print the assumed identity to the logs, so you can check that it worked.

      - name: Display assumed role
        run: aws sts get-caller-identity

Finally, the backup-organisation-code script can be run. Note the use of Github Secrets to store the parameters.

      - name: Backup
        shell: bash
        env:
          BACKUP_GITHUB_PAT: ${{ secrets.BACKUP_GITHUB_PAT }}
          BACKUP_GITHUB_OWNER: ${{ secrets.BACKUP_GITHUB_OWNER }}
          BACKUP_AWS_REGION: ${{ secrets.BACKUP_AWS_REGION }}
          BACKUP_BUCKET_NAME: ${{ secrets.BACKUP_BUCKET_NAME }}
        run: backup-organisation-code

You might be surprised to see a Github Personal Access Token in the list. By default, Github Actions only has access to read from the current repository, so the personal access token is used to grant the backup script read access to all of the repos in the organisation.

Summary

It’s fairly straightforward to back up your Github repositories to AWS, and because Github Actions sends automated email alerts on workflow failures, you’ll know when a backup has failed.

It probably took me a day to set this up, but that’s less time than it would have taken to procure a 3rd party service and deal with the security audits.

All of the code, along with an example, is available at [7].