AWS DataPipeline: S3 Backup Script Example using AWS CLI Activity

In this tutorial, I will show you how to launch a pipeline via the CLI.

Why the CLI? Because anything using the CLI is AWESOME!

We will launch an AWS CLI Activity that backs up files from S3, compresses them using a timestamp naming convention, and uploads the archive to a backup path in S3.

More on the Background:

  1. We have a shell script located on S3.
  2. Data Pipeline launches a node on which the work will be done.
  3. We then pass an AWS CLI command that downloads the shell script from S3 and executes it.

Our Requirements:

  1. Setting Up DataPipeline
  2. Shell Script on S3
  3. Pipeline Definition

Let's get started:

Bash Script on S3:

Our bash script: dp-backup-scripts.sh

#!/bin/bash

S3_IN="s3://rb-bucket.repo/scripts"
S3_OUT="s3://rb-bucket.repo/backups/scripts"
STAGING_DIR="/tmp/.staging"
STAGING_DUMP="$STAGING_DIR/scripts"
STAGING_OUTPUT="$STAGING_DIR/output"
DEST_FILE="scripts-backup-$(date +%F).tar.gz"

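# Create the staging directories and download the scripts from S3
mkdir -p "$STAGING_DIR"/{scripts,output}
aws s3 cp --recursive "$S3_IN/" "$STAGING_DUMP/"

# Compress the scripts into a date-stamped archive, upload it, and clean up
tar -zcvf "$STAGING_OUTPUT/$DEST_FILE" "$STAGING_DUMP"/*
aws s3 cp "$STAGING_OUTPUT/$DEST_FILE" "$S3_OUT/"
rm -rf "$STAGING_DIR"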
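
Before uploading the script, the naming and compression steps can be tried locally without touching S3. The following is a minimal sketch, not part of the script above: the helper name make_backup_archive is hypothetical, and it uses tar's -C option (so the archive stores relative paths) instead of the glob used in the script.

```shell
#!/bin/bash
# Hypothetical helper: archive SRC_DIR into OUT_DIR using the same
# date-stamped name as dp-backup-scripts.sh, then print the archive path.
make_backup_archive() {
  local src_dir="$1" out_dir="$2"
  local dest_file="scripts-backup-$(date +%F).tar.gz"

  mkdir -p "$out_dir" || return 1
  # -C switches into src_dir first, so paths in the archive are relative
  tar -zcf "$out_dir/$dest_file" -C "$src_dir" . || return 1
  echo "$out_dir/$dest_file"
}

# Usage (local paths are examples only):
#   archive=$(make_backup_archive ./scripts /tmp/backup-test)
#   tar -tzf "$archive"
```

Keep the output directory outside the source directory, otherwise tar would try to include the archive in itself.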

Pipeline Definition:

Create a pipeline definition and save it as, for example, definition.json

{
  "objects": [
    {
      "failureAndRerunMode": "CASCADE",
      "resourceRole": "DataPipelineDefaultResourceRole",
      "role": "DataPipelineDefaultRole",
      "pipelineLogUri": "s3://rb-bucket.repo/logs/",
      "scheduleType": "ONDEMAND",
      "name": "Default",
      "id": "Default"
    },
    {
      "name": "CliActivity",
      "id": "CliActivity",
      "runsOn": {
        "ref": "Ec2Instance"
      },
      "type": "ShellCommandActivity",
      "command": "(sudo yum -y update aws-cli) && (#{myAWSCLICmd})"
    },
    {
      "instanceType": "t1.micro",
      "name": "Ec2Instance",
      "id": "Ec2Instance",
      "type": "Ec2Resource",
      "terminateAfter": "50 Minutes"
    }
  ],
  "parameters": [
    {
      "watermark": "aws [options] <command> <subcommand> [parameters]",
      "description": "AWS CLI command",
      "id": "myAWSCLICmd",
      "type": "String"
    }
  ],
  "values": {
    "myAWSCLICmd": "aws s3 cp s3://rb-bucket.repo/scripts/dp-backup-scripts.sh . && bash dp-backup-scripts.sh"
  }
}

Create your Pipeline:

$ aws datapipeline create-pipeline --name MyPipeline --unique-id MyPipeline

After this command has been executed, you will receive a pipeline id, like below:

{
    "pipelineId": "df-06478032TYTFI2MVO6SD"
}
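
Instead of copying the id by hand, it can be captured in a shell variable. This is a minimal sketch using the CLI's global --query and --output options to select the pipelineId field of the response; the helper name create_pipeline_id is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: create a pipeline and print only its id.
create_pipeline_id() {
  local name="$1"
  # --query selects the pipelineId field; --output text drops the JSON quoting
  aws datapipeline create-pipeline \
    --name "$name" --unique-id "$name" \
    --query pipelineId --output text
}

# Usage:
#   PIPELINE_ID=$(create_pipeline_id MyPipeline)
#   echo "$PIPELINE_ID"
```

The variable can then be passed as the --pipeline-id value in the commands that follow.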

Now that we have our pipeline id, we will associate our Pipeline Definition with our Pipeline Id:

$ aws datapipeline put-pipeline-definition --pipeline-id "df-06478032TYTFI2MVO6SD" --pipeline-definition file://definition.json

If no errors were returned, we can go ahead and activate our pipeline:

$ aws datapipeline activate-pipeline --pipeline-id df-06478032TYTFI2MVO6SD
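
Activation only starts the run, so it helps to check on its progress afterwards. A minimal sketch, assuming the list-runs subcommand is enough for that purpose; the helper name show_pipeline_runs is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: list the runs (and their status) for one pipeline.
show_pipeline_runs() {
  local pipeline_id="$1"
  if [ -z "$pipeline_id" ]; then
    echo "usage: show_pipeline_runs <pipeline-id>" >&2
    return 1
  fi
  aws datapipeline list-runs --pipeline-id "$pipeline_id"
}

# Usage:
#   show_pipeline_runs df-06478032TYTFI2MVO6SD
```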

Log Outputs:

In the pipelineLogUri field of our pipeline definition, we specified the location for our logs. To view information about a run, you can list its log files like below:

$ aws s3 ls s3://rb-bucket/logs/df-01614291LDEV55C2KYR0/
                           PRE EC2ResourceObj/
                           PRE ShellCommandActivityObj/

1. Shell Command Activity Logs:

$ aws s3 ls s3://rb-bucket/logs/df-01614291LDEV55C2KYR0/ShellCommandActivityObj/@ShellCommandActivityObj_2016-07-28T09:07:35/@ShellCommandActivityObj_2016-07-28T09:07:35_Attempt=1/
2016-07-28 09:15:04        734 Activity.log.gz
2016-07-28 09:15:04        180 StdError.gz
2016-07-28 09:15:02      32832 StdOutput.gz

2. EC2 Resource Activity Logs:

$ aws s3 ls s3://rb-bucket/logs/df-01614291LDEV55C2KYR0/EC2ResourceObj/@EC2ResourceObj_2016-07-28T09:07:35/@EC2ResourceObj_2016-07-28T09:07:35_Attempt=1/
2016-07-28 09:13:57      34800 TaskRunner.2016-07-28-09@000000000000000-000000000363386.gz
2016-07-28 09:16:59      18948 TaskRunner.2016-07-28-09@000000000363386-000000000566948.gz
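
The log files are gzip-compressed, so they can be streamed from S3 and decompressed in one step rather than downloaded first. A minimal sketch, relying on `aws s3 cp` accepting "-" as the destination to write to stdout; the helper name show_s3_log is hypothetical, and the paths in the usage comment are placeholders.

```shell
#!/bin/bash
# Hypothetical helper: print a gzip-compressed log object stored in S3.
show_s3_log() {
  local s3_uri="$1"
  # "-" as the destination makes `aws s3 cp` write the object to stdout
  aws s3 cp "$s3_uri" - | gunzip -c
}

# Usage (replace the placeholders with a path from the listings above):
#   show_s3_log "s3://<log-bucket>/<attempt-prefix>/StdOutput.gz"
#   show_s3_log "s3://<log-bucket>/<attempt-prefix>/StdError.gz" | tail -n 20
```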

Let's verify that our files have been backed up to our defined path:

$ aws s3 ls s3://rb-bucket.repo/backups/scripts/
2016-07-28 11:46:33      15446 scripts-backup-2016-07-28.tar.gz
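
The same check can be scripted, for example to alert when today's archive is missing. A minimal sketch, assuming the archive follows the scripts-backup-YYYY-MM-DD.tar.gz naming from the backup script; the helper name backup_exists is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: succeed only if today's backup archive is listed
# under the given S3 prefix.
backup_exists() {
  local s3_prefix="$1"
  local key="scripts-backup-$(date +%F).tar.gz"
  # Strip any trailing slash, list the prefix, and look for today's file name
  aws s3 ls "${s3_prefix%/}/" | grep -F "$key" > /dev/null
}

# Usage:
#   if backup_exists s3://rb-bucket.repo/backups/scripts; then
#     echo "backup found"
#   else
#     echo "backup missing" >&2
#   fi
```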

Other useful commands:

Listing Pipelines:

$ aws datapipeline list-pipelines

Describing Pipelines:

$ aws datapipeline describe-pipelines --pipeline-id <your-pipeline-id>

Deleting a Pipeline:

$ aws datapipeline delete-pipeline --pipeline-id <your-pipeline-id>

Please refer to the AWS Data Pipeline documentation for more information.