adrianhesketh.com

Migrate from WordPress to self-hosting on AWS

This year, I switched my blog away from wordpress.com (where it had been for years) to hosting it myself on AWS, because I didn’t like the new editor in WordPress (and the lack of Markdown support). It’s also cheaper to host it on AWS, but that wasn’t the main point.

I thought I’d document the process for anyone else who wants to do it.

Export from WordPress

The first thing to do is to export the content from WordPress (https://wordpress.com/support/export/).

You should then have two archive files: a zip containing an XML export of your posts and pages, and a tar file containing your uploaded media.

Converting from WordPress to Hugo

Hugo (https://gohugo.io) is a static site generator: you use it to build a website from templates and content. It’s a lot cheaper to host the output of a generated site than to run something like WordPress, which requires an application server to run code and a database server to store content. Hugo just requires something to serve up the HTML, CSS and JavaScript generated by the build step.

I use Hugo because it’s fast, and easy to both install and use. Hugo builds Markdown files, which is also how I like to write.

To convert my blog from WordPress to Hugo, I’d need to get the content out of my old site. Some of the solutions I found depended on installing a plugin into WordPress, but I was on wordpress.com rather than running my own WordPress installation (plugins can only be installed on self-hosted WordPress installations). However, I found that https://github.com/palaniraja/blog2md could convert the content from the exports.

First, clone the repo:

git clone https://github.com/palaniraja/blog2md
cd blog2md

Install the Node.js dependencies (if you don’t have Node.js installed, you’ll need to install it first).

npm install

The script requires the export to be unzipped:

unzip ~/Downloads/adrianhesketh.wordpress.com-2020-07-16-10_28_08-sopboj4dtvekjo1mrrdyncxok0dhtqou.zip

Then, execute the script:

node index.js w adrianhesketh.wordpress.com-2020-07-16-10_28_05/adrianhesketh.wordpress.2020-07-16.001.xml out 

You should now have a lot of Markdown files in the out directory.

Setting up Hugo

Next, it’s time to set up Hugo. I used the Nix package manager to install it, but you can also use brew on the Mac (brew install hugo) or install a binary from GitHub for other platforms.

cd ../
mkdir newblog
cd newblog
hugo new site .

Setting up a theme

Next, configure a theme, as per the Hugo quickstart instructions (https://gohugo.io/getting-started/quick-start/). For this example, I chose a WordPress-style theme.

git submodule add https://github.com/vimux/mainroad themes/mainroad
echo 'theme = "mainroad"' >> config.toml

Copy the converted content into the structure, under the posts directory.

mkdir ./content/posts
cp -r ../blog2md/out/ ./content/posts

To preview the site locally, you can run:

hugo server

You should now be able to see the content at http://localhost:1313

Conversion problems

Images

If you look at the source HTML, you will find that the images still point at the old server. That’s no good, because when you shut down WordPress, you don’t want to lose them.

To fix this, you’ll need to put the data from the media export into the static directory, and then update the links in the content.

First, unzip the content from the media export.

cd static
tar -xf ../media-export-40546272-from-0-to-3610.tar
cd ..

This puts all of your photos and other media in the right place. If the export points at https://adrianhesketh.files.wordpress.com/2019/01/img_1440.jpg, you’ll want it to point to /2019/01/img_1440.jpg.

To fix this, you’ll need to update the posts. You can use a text editor’s find/replace feature if you like, but I used the goreplace tool (https://github.com/piranha/goreplace) to do it in a single operation.

cd content/posts
goreplace https://adrianhesketh.files.wordpress.com/ --replace "/"
cd ../../
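If you’d rather not install another tool, the same rewrite can be sketched with plain sed. Here the echo stands in for a line from one of the converted posts:

```shell
# Rewrite absolute wordpress.com media URLs to site-relative paths.
# The echoed line stands in for content from a converted post.
echo 'src="https://adrianhesketh.files.wordpress.com/2019/01/img_1440.jpg"' \
  | sed 's|https://adrianhesketh\.files\.wordpress\.com/|/|g'
# → src="/2019/01/img_1440.jpg"
```

Applying it in place across all posts (with find and sed -i) works too, but note that the -i flag behaves differently between GNU and BSD sed.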

Gists

My site contained links to GitHub Gists that WordPress would convert into JavaScript to render the code. I switched these to the Hugo gist shortcode (https://gohugo.io/content-management/shortcodes/#gist) to render the content.

I used this search and replace to make the change.

goreplace 'https://gist.github.com/(.+?)/([a-zA-Z0-9]+)' --replace '\{\{< gist $1 $2 >\}\}'
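To see what that pattern does, here is the same substitution expressed with sed -E on a sample link (the gist ID below is made up, and sed has no lazy quantifiers, so [^/]+ stands in for the (.+?) group):

```shell
# Convert a bare gist URL into a Hugo gist shortcode.
# The username/ID here are purely illustrative.
echo 'https://gist.github.com/a-h/093b2b1b2e33e36ba4b6' \
  | sed -E 's|https://gist\.github\.com/([^/]+)/([a-zA-Z0-9]+)|{{< gist \1 \2 >}}|'
# → {{< gist a-h 093b2b1b2e33e36ba4b6 >}}
```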

Comments

WordPress comments had also been added to the site; I just deleted them out of the posts directory.

Inline HTML

Hugo expects to see Markdown in your posts, not HTML. In a recent upgrade, the Hugo team changed the Markdown renderer to a new one (Goldmark) that ignores any raw HTML added to your posts.

This is most likely not what we want, because we wrote all the HTML and we definitely want it in our output. I can understand why it’s not the default for new projects, but it very much is a breaking change.

To fix that, add the following to config.toml:

[markup]
  [markup.goldmark]
    [markup.goldmark.renderer]
      unsafe = true

This config.toml is also where you can change the base URL, set the site title, and configure themes.
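For reference, a minimal config.toml combining those settings might look like this (the baseURL and title here are just examples; yours will differ):

```toml
baseURL = "https://adrianhesketh.com/"
title = "adrianhesketh.com"
theme = "mainroad"

[markup]
  [markup.goldmark]
    [markup.goldmark.renderer]
      unsafe = true
```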

Safety

You might run into other problems, so it’s definitely worth checking through your site locally.

Hosting with AWS Amplify

I used custom CloudFormation to configure my blog, but when I was writing this, I wondered whether AWS Amplify was actually the right way to go now. I tried it out so you don’t have to.

I already had the AWS CLI set up to use my account, so I didn’t need to configure AWS Amplify at all; I just ran amplify init and told it how to handle everything. The documentation says to use amplify configure, but that tries to get you to create an AWS IAM user, which isn’t required if you already have the AWS CLI set up.

In this example, I told it I was building a JavaScript project, because that’s the closest thing to a Hugo build. For Hugo projects, the default output directory is public, the build command is just hugo, and the local run command is hugo server.
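My answers to the amplify init prompts mapped roughly like this (the exact prompt wording varies between Amplify CLI versions, so treat this as a sketch rather than a transcript):

```
? Distribution Directory Path: public
? Build Command: hugo
? Start Command: hugo server
```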

With the project configured, I can add hosting with amplify add hosting.

What not to do

At the first attempt, I chose CloudFront and S3 hosting because I didn’t think I needed the extra config steps, and I’m happy to run amplify publish myself from the command line.

This was the first of many problems that should lead you NOT to choose this path.

Why CloudFront and S3 is not the way to go

At the end of the deploy, Amplify wrote out the URL https://d2r8sxi1ayjwal.cloudfront.net, but when I visited it I got an error:

<Error>
  <Code>AccessDenied</Code>
  <Message>Access Denied</Message>
  <RequestId>BDEE096F42823A6C</RequestId>
  <HostId>6hWO3pYE4MK3sxd9SJ/AmqwNNnrEfTr78fTO8OcrK6LgSVpKOVMZVP4LHtSNqysNk0dlsR/ejy0=</HostId>
</Error>

I noticed from the URL that it had even redirected me to the S3 bucket itself: https://newblog-20200718211427-hostingbucket-prod.s3.eu-west-2.amazonaws.com/index.html

If this happens to you, don’t panic. Believe it or not, this happens to everyone, and AWS haven’t fixed it. CloudFront distributions are globally distributed but configured in North Virginia (that’s why Lambda@Edge functions must be deployed there), and my S3 bucket was created in a different region. The metadata about the S3 bucket is eventually consistent, and hadn’t made it to North Virginia yet. It can take over an hour, so just go and have a cup of tea. Don’t waste time trying to “fix” it.

The next problem is that default documents within subdirectories don’t work (e.g. https://d2r8sxi1ayjwal.cloudfront.net/posts/serving-web-content-and-redirects-from-the-domain-apex-without-route53-on-aws/index.html works fine, but https://d2r8sxi1ayjwal.cloudfront.net/posts/serving-web-content-and-redirects-from-the-domain-apex-without-route53-on-aws/ doesn’t).

I think the developers might not have noticed, but IIRC, S3’s static website hosting handles index.html files in subdirectories just fine; when you put CloudFront in front of the bucket, though, it talks to the S3 REST API, which doesn’t know anything about index.html files.

To add to this, the default CloudFront distribution settings are not great. Error pages are set to be the home page, and don’t produce the expected HTTP status codes. Not helpful, because if you’re not paying attention, you might not realise you just hit a dead link. Using amplify hosting gives you a chance to change the settings, but not all of them. There is a CloudFormation template in JSON format (oh no) in the amplify directory, but I’m not sure what Amplify does to it or whether it would be safe to edit it by hand.

amplify publish seems to take forever; it looks like it copies the whole set of files up each time instead of executing an S3 sync, which is what my custom script does.

The S3 bucket that it creates will show up in AWS Config checks and any security audits, because versioning isn’t enabled, logging isn’t enabled, encryption isn’t enabled, and there’s no policy in place to force HTTPS access.

All-in-all, it’s a pretty dire experience.

What to do

I decided to rip out the CloudFront hosting and see what the Amplify hosting experience is like. To remove the S3 hosting, I used amplify remove hosting, followed by amplify push to apply the changes.

Then I tried out Amplify hosting, by executing amplify add hosting and amplify publish. Much more successful.

I got a domain (https://prod.d117qm1ig3o2d5.amplifyapp.com), and things worked as expected. The deployment is still slow, because it uploads everything in the site instead of just the changes (as an s3 sync would do), but it’s OK.

Custom domains

Within the Amplify app in the Web console, you can configure custom domains, so that your Website isn’t under amplifyapp.com.

You can go into domains and configure it there.

When you configure a domain, if you bought it through AWS, or if you’ve done a domain transfer to AWS (like I did), then the DNS entries and the TLS configuration will be set up automatically for you. That’s probably the easiest way; otherwise you’ll end up having to deal with DNS at your current provider, to validate the certificate with AWS Certificate Manager and add CNAME records pointing at the amplifyapp domain.

Hope that’s useful for you.

For the experienced…

“Amplify hosting” costs a bit more than S3 and CloudFront would, but as we’ve seen, Amplify isn’t nailing the S3 and CloudFront setup at the moment, so unless you really know your AWS, I’d stick with Amplify hosting and save yourself the time.

However, I wrote my own CloudFormation, and I use a Lambda@Edge function to fill in the missing pieces, so if that’s your thing, you can refer to this:

Lambda@Edge

This has to be deployed to North Virginia.

const path = require('path');

// Old URLs mapped to their new locations and redirect status codes.
const redirects = {
  "/redirect-from/example1": { to: "/target1", statusCode: 301 },
  "/redirect-from/example2": { to: "/target2", statusCode: 302 },
};

exports.handler = async event => {
  const { request } = event.Records[0].cf;
  // If the normalised URI matches a known old URL, issue a redirect.
  const normalisedUri = normalise(request.uri);
  const redirect = redirects[normalisedUri];
  if (redirect) {
    return redirectTo(redirect.to, redirect.statusCode);
  }
  // URIs without a file extension are directory requests, so serve the
  // index.html within them.
  if (!hasExtension(request.uri)) {
    request.uri = trimSlash(request.uri) + "/index.html";
  }
  return request;
};

const trimSlash = uri => hasTrailingSlash(uri) ? uri.slice(0, -1) : uri;
const normalise = uri => trimSlash(uri).toLowerCase();
const hasExtension = uri => path.extname(uri) !== '';
const hasTrailingSlash = uri => uri.endsWith('/');

const redirectTo = (to, statusCode) => ({
  status: statusCode.toString(),
  statusDescription: 'Found',
  headers: {
    location: [{
      key: 'Location',
      value: to,
    }],
  },
});

CloudFormation template

To create the site you’ll need something like this:

---
AWSTemplateFormatVersion: '2010-09-09'
Description: adrianhesketh.com
Parameters:
  DomainName:
    Type: String
    Description: The website domain name.
    Default: adrianhesketh.com
  CloudFrontCertificateArn:
    Type: String
    Description: ARN of the SSL certificate used for the CloudFront distribution (must be in us-east-1).
  WebsiteCloudFrontViewerRequestLambdaFunctionARN:
    Type: String
    Description: ARN of the Lambda@Edge function that does rewriting of URLs (must be in us-east-1). See lambda_at_edge.js

Resources:      
  WebsiteBucket:
    Type: AWS::S3::Bucket
    Properties:
      BucketName: !Ref 'DomainName'

  WebsiteCloudFrontOriginAccessIdentity:
    Type: AWS::CloudFront::CloudFrontOriginAccessIdentity
    Properties:
      CloudFrontOriginAccessIdentityConfig:
        Comment: !Sub 'CloudFront OAI for ${DomainName}'

  WebsiteBucketPolicy:
    Type: AWS::S3::BucketPolicy
    Properties:
      Bucket: !Ref WebsiteBucket
      PolicyDocument:
        Statement:
          -
            Action:
              - s3:GetObject
            Effect: Allow
            Resource: !Join [ "", [ "arn:aws:s3:::", !Ref WebsiteBucket, "/*" ] ]
            Principal:
              CanonicalUser: !GetAtt WebsiteCloudFrontOriginAccessIdentity.S3CanonicalUserId

  WebsiteCloudfront:
    Type: AWS::CloudFront::Distribution
    DependsOn:
    - WebsiteBucket
    Properties:
      DistributionConfig:
        Comment: !Ref 'DomainName'
        Origins:
          - DomainName: !GetAtt WebsiteBucket.DomainName
            Id: website-s3-bucket
            S3OriginConfig:
              OriginAccessIdentity:
                !Join [ "", [ "origin-access-identity/cloudfront/", !Ref WebsiteCloudFrontOriginAccessIdentity ] ]
        Aliases:
          - !Ref 'DomainName'
        DefaultCacheBehavior:
          ViewerProtocolPolicy: redirect-to-https
          TargetOriginId: website-s3-bucket
          Compress: true
          ForwardedValues:
            QueryString: true
          LambdaFunctionAssociations:
            - EventType: viewer-request
              LambdaFunctionARN: !Ref WebsiteCloudFrontViewerRequestLambdaFunctionARN
        ViewerCertificate:
          AcmCertificateArn: !Ref CloudFrontCertificateArn
          MinimumProtocolVersion: TLSv1.2_2018
          SslSupportMethod: sni-only
        Enabled: true
        HttpVersion: http2
        DefaultRootObject: index.html
        IPV6Enabled: true
        CustomErrorResponses:
          - ErrorCode: 403
            ResponseCode: 404
            ResponsePagePath: '/error/index.html'
        PriceClass: PriceClass_100
      Tags:
        -
          Key: Name
          Value: !Ref 'DomainName'

Deployments

I use a simple Makefile.

.PHONY: run build sync-files invalidate-cache deploy

run:
	hugo server

build:
	hugo

sync-files:
	aws s3 sync ./public s3://adrianhesketh.com

invalidate-cache:
	aws cloudfront create-invalidation --distribution-id EE9HA1565U22V --paths /index.html /index.xml /sitemap.xml /css/*

deploy: build sync-files invalidate-cache