In this tutorial, I will show you how to launch an AWS Data Pipeline via the AWS CLI.

Why the CLI? Because anything using the CLI is AWESOME!

We will launch an AWS CLI Activity that backs up files from S3, compresses them with a timestamped naming convention, and uploads the archive to a backup path in S3.

More on the Background:

  1. We have a shell script located on S3.
  2. Data Pipeline launches an EC2 node where the work will be done.
  3. We then pass an AWS CLI command to download the shell script from S3 and execute it (see the sketch below).
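
In shell terms, the work the node does boils down to something like the following (this is the same command we will parameterise in the pipeline definition later on):

aws s3 cp s3://rb-bucket.repo/scripts/dp-backup-scripts.sh . && sh dp-backup-scripts.sh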

Our Requirements:

  1. Setting Up DataPipeline
  2. Shell Script on S3
  3. Pipeline Definition

Let's get started:

Bash Script on S3:

Our bash script: dp-backup-scripts.sh

#!/bin/bash

# Source and destination S3 paths
S3_IN="s3://rb-bucket.repo/scripts"
S3_OUT="s3://rb-bucket.repo/backups/scripts"

# Local staging area for the downloaded scripts and the archive
STAGING_DIR="/tmp/.staging"
STAGING_DUMP="$STAGING_DIR/scripts"
STAGING_OUTPUT="$STAGING_DIR/output"

# Timestamped archive name, e.g. scripts-backup-2016-07-28.tar.gz
DEST_FILE="scripts-backup-$(date +%F).tar.gz"

# Create the staging directories
mkdir -p "$STAGING_DUMP" "$STAGING_OUTPUT"

# Download the scripts from S3 into the staging area
aws s3 cp --recursive "$S3_IN/" "$STAGING_DUMP/"

# Compress them into a timestamped tarball and upload it to the backup path
tar -zcvf "$STAGING_OUTPUT/$DEST_FILE" "$STAGING_DUMP"/*
aws s3 cp "$STAGING_OUTPUT/$DEST_FILE" "$S3_OUT/"

# Clean up the staging area
rm -rf "$STAGING_DIR"
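
The pipeline expects this script to already be sitting in S3 at the path referenced in the pipeline definition below, so upload it first (assuming you saved it locally as dp-backup-scripts.sh):

$ aws s3 cp dp-backup-scripts.sh s3://rb-bucket.repo/scripts/dp-backup-scripts.sh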

Pipeline Definition:

Create a pipeline definition and save it as, for example, definition.json:

{
  "objects": [
    {
      "failureAndRerunMode": "CASCADE",
      "resourceRole": "DataPipelineDefaultResourceRole",
      "role": "DataPipelineDefaultRole",
      "pipelineLogUri": "s3://rb-bucket.repo/logs/",
      "scheduleType": "ONDEMAND",
      "name": "Default",
      "id": "Default"
    },
    {
      "name": "CliActivity",
      "id": "CliActivity",
      "runsOn": {
        "ref": "Ec2Instance"
      },
      "type": "ShellCommandActivity",
      "command": "(sudo yum -y update aws-cli) && (#{myAWSCLICmd})"
    },
    {
      "instanceType": "t1.micro",
      "name": "Ec2Instance",
      "id": "Ec2Instance",
      "type": "Ec2Resource",
      "terminateAfter": "50 Minutes"
    }
  ],
  "parameters": [
    {
      "watermark": "aws [options] <command> <subcommand> [parameters]",
      "description": "AWS CLI command",
      "id": "myAWSCLICmd",
      "type": "String"
    }
  ],
  "values": {
    "myAWSCLICmd": "aws s3 cp s3://rb-bucket.repo/scripts/dp-backup-scripts.sh . && sh dp-backup-scripts.sh"
  }
}
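
A quick note on the parameters and values sections: the #{myAWSCLICmd} placeholder in the activity's command gets substituted with the value defined under values. Alternatively, you can leave the values block out of the JSON and supply the value on the command line when uploading the definition, roughly like this (a sketch using the same paths as the rest of this tutorial):

$ aws datapipeline put-pipeline-definition --pipeline-id <your-pipeline-id> --pipeline-definition file://definition.json --parameter-values myAWSCLICmd="aws s3 cp s3://rb-bucket.repo/scripts/dp-backup-scripts.sh . && sh dp-backup-scripts.sh"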

Create your Pipeline:

$ aws datapipeline create-pipeline --name MyPipeline --unique-id MyPipeline

After this command has been executed, you will receive a pipeline id, like the one below:

{
    "pipelineId": "df-06478032TYTFI2MVO6SD"
}

Now that we have our pipeline id, we will associate our Pipeline Definition with our Pipeline Id:

$ aws datapipeline put-pipeline-definition --pipeline-id "df-06478032TYTFI2MVO6SD" --pipeline-definition file://definition.json
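
A successful call returns a validation summary, which should look roughly like the below (based on the shape of the PutPipelineDefinition response; your warnings may differ):

{
    "validationErrors": [],
    "validationWarnings": [],
    "errored": false
}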

If no errors were returned, we can go ahead and activate our pipeline:

$ aws datapipeline activate-pipeline --pipeline-id df-06478032TYTFI2MVO6SD
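
As our scheduleType is ONDEMAND, activating the pipeline kicks off a run immediately. You can keep an eye on its progress with list-runs:

$ aws datapipeline list-runs --pipeline-id df-06478032TYTFI2MVO6SD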

Log Outputs:

In the pipelineLogUri of our pipeline definition, we specified the location for our logs. To view information about our output, you can retrieve the logs like below:

$ aws s3 ls s3://rb-bucket.repo/logs/df-01614291LDEV55C2KYR0/
                           PRE EC2ResourceObj/
                           PRE ShellCommandActivityObj/

1. Shell Command Activity Logs:

$ aws s3 ls s3://rb-bucket.repo/logs/df-01614291LDEV55C2KYR0/ShellCommandActivityObj/@ShellCommandActivityObj_2016-07-28T09:07:35/@ShellCommandActivityObj_2016-07-28T09:07:35_Attempt=1/
2016-07-28 09:15:04        734 Activity.log.gz
2016-07-28 09:15:04        180 StdError.gz
2016-07-28 09:15:02      32832 StdOutput.gz
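
StdOutput.gz and StdError.gz contain the stdout and stderr of our shell command. To read one of them, copy it down and decompress it, for example:

$ aws s3 cp s3://rb-bucket.repo/logs/df-01614291LDEV55C2KYR0/ShellCommandActivityObj/@ShellCommandActivityObj_2016-07-28T09:07:35/@ShellCommandActivityObj_2016-07-28T09:07:35_Attempt=1/StdOutput.gz .
$ zcat StdOutput.gz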

2. EC2 Resource Activity Logs:

$ aws s3 ls s3://rb-bucket.repo/logs/df-01614291LDEV55C2KYR0/EC2ResourceObj/@EC2ResourceObj_2016-07-28T09:07:35/@EC2ResourceObj_2016-07-28T09:07:35_Attempt=1/
2016-07-28 09:13:57      34800 [email protected]
2016-07-28 09:16:59      18948 [email protected]

Let's verify that our files have been backed up to our defined path:

$ aws s3 ls s3://rb-bucket.repo/backups/scripts/
2016-07-28 11:46:33      15446 scripts-backup-2016-07-28.tar.gz
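
If you want to double check what ended up in the archive, pull it down and list its contents:

$ aws s3 cp s3://rb-bucket.repo/backups/scripts/scripts-backup-2016-07-28.tar.gz .
$ tar -tzf scripts-backup-2016-07-28.tar.gz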

Other useful commands:

Listing Pipelines:

$ aws datapipeline list-pipelines

Describing Pipelines:

$ aws datapipeline describe-pipelines --pipeline-ids <your-pipeline-id>

Deleting a Pipeline:

$ aws datapipeline delete-pipeline --pipeline-id <your-pipeline-id>

Please refer to the AWS Data Pipeline documentation for more information.