AWS DataPipeline: S3 Backup Script Example using AWS CLI Activity
In this tutorial, I will show you how to launch a pipeline via the CLI.
Why the CLI? Because anything using the CLI is AWESOME!
We will launch an AWS CLI Activity that backs up files from S3, compresses them into an archive named with a timestamp, and uploads the archive to a backup path in S3.
More on the Background:
- We have a shell script located on S3.
- Data Pipeline launches a node on which the work will be done.
- We then pass the AWS CLI command to download the shell script from S3, and execute.
Our Requirements:
- Setting Up DataPipeline
- Shell Script on S3
- Pipeline Definition
Let's get started:
Bash Script on S3:
Our bash script: dp-backup-scripts.sh
#!/bin/bash
S3_IN="s3://rb-bucket.repo/scripts"
S3_OUT="s3://rb-bucket.repo/backups/scripts"
STAGING_DIR="/tmp/.staging"
STAGING_DUMP="$STAGING_DIR/scripts"
STAGING_OUTPUT="$STAGING_DIR/output"
DEST_FILE="scripts-backup-$(date +%F).tar.gz"
mkdir -p $STAGING_DIR/{scripts,output}
aws s3 cp --recursive $S3_IN/ $STAGING_DUMP/
tar -zcvf $STAGING_OUTPUT/$DEST_FILE $STAGING_DUMP/*
aws s3 cp $STAGING_OUTPUT/$DEST_FILE $S3_OUT/
rm -rf $STAGING_DIR
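Before involving S3, the staging and archive steps of the script can be tried locally. Below is a minimal sketch, assuming a hypothetical helper named make_backup_archive (it is not part of the script above) that takes a source directory and an output directory. It keeps the same timestamp naming convention, but uses tar -C so that the archive holds relative paths:

```shell
#!/bin/sh
# Hypothetical helper: archive the contents of a directory into a
# date-stamped tarball, mirroring the naming used in dp-backup-scripts.sh.
make_backup_archive() {
  src_dir="$1"   # directory holding the files to back up
  out_dir="$2"   # directory where the archive is written
  dest_file="scripts-backup-$(date +%F).tar.gz"

  mkdir -p "$out_dir" || return 1
  # -C changes into src_dir first, so member names are relative paths
  tar -zcf "$out_dir/$dest_file" -C "$src_dir" . || return 1
  # Print the archive path so that callers can capture it
  echo "$out_dir/$dest_file"
}
```

For example, make_backup_archive /tmp/.staging/scripts /tmp/.staging/output would write the archive into the output directory and print its path.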
Pipeline Definition:
Create a pipeline definition and save it as, for example, definition.json
{
"objects": [
{
"failureAndRerunMode": "CASCADE",
"resourceRole": "DataPipelineDefaultResourceRole",
"role": "DataPipelineDefaultRole",
"pipelineLogUri": "s3://rb-bucket.repo/logs/",
"scheduleType": "ONDEMAND",
"name": "Default",
"id": "Default"
},
{
"name": "CliActivity",
"id": "CliActivity",
"runsOn": {
"ref": "Ec2Instance"
},
"type": "ShellCommandActivity",
"command": "(sudo yum -y update aws-cli) && (#{myAWSCLICmd})"
},
{
"instanceType": "t1.micro",
"name": "Ec2Instance",
"id": "Ec2Instance",
"type": "Ec2Resource",
"terminateAfter": "50 Minutes"
}
],
"parameters": [
{
"watermark": "aws [options] <command> <subcommand> [parameters]",
"description": "AWS CLI command",
"id": "myAWSCLICmd",
"type": "String"
}
],
"values": {
"myAWSCLICmd": "aws s3 cp s3://rb-bucket.repo/scripts/dp-backup-scripts.sh . && sh dp-backup-scripts.sh"
}
}
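Because the command lives in the parameters and values sections, it can be swapped without touching the objects. Below is a minimal sketch, assuming a hypothetical wrapper named put_definition_with_cmd and that the --parameter-values option of put-pipeline-definition accepts key=value pairs as described in the AWS CLI reference. It would be run once the pipeline exists and its id is known:

```shell
#!/bin/sh
# Hypothetical wrapper: upload the definition and override myAWSCLICmd
# for this pipeline, instead of editing the "values" block in the file.
put_definition_with_cmd() {
  pipeline_id="$1"      # id returned by create-pipeline
  definition_file="$2"  # e.g. definition.json
  cli_cmd="$3"          # command substituted for #{myAWSCLICmd}

  aws datapipeline put-pipeline-definition \
    --pipeline-id "$pipeline_id" \
    --pipeline-definition "file://$definition_file" \
    --parameter-values "myAWSCLICmd=$cli_cmd"
}
```

For example, put_definition_with_cmd "$PIPELINE_ID" definition.json "aws s3 ls s3://rb-bucket.repo/scripts/" would make the activity list the scripts instead of running the backup. PIPELINE_ID here is a placeholder for your own pipeline id.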
Create your Pipeline:
$ aws datapipeline create-pipeline --name MyPipeline --unique-id MyPipeline
After this command has been executed, you will receive a pipeline id, like below:
{
"pipelineId": "df-06478032TYTFI2MVO6SD"
}
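To avoid copying the id by hand, it can be captured into a shell variable. Below is a minimal sketch, assuming a hypothetical helper named create_pipeline and that the response carries the id in a pipelineId field. It relies on the global --query and --output options of the AWS CLI:

```shell
#!/bin/sh
# Hypothetical helper: create a pipeline and print only its id.
create_pipeline() {
  name="$1"
  # --query extracts the pipelineId field; --output text drops the JSON quoting
  aws datapipeline create-pipeline \
    --name "$name" \
    --unique-id "$name" \
    --query pipelineId \
    --output text
}
```

Usage would look like PIPELINE_ID=$(create_pipeline MyPipeline), after which "$PIPELINE_ID" can be passed to the commands that follow.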
Now that we have our pipeline id, we will associate our Pipeline Definition with our Pipeline Id:
$ aws datapipeline put-pipeline-definition --pipeline-id "df-06478032TYTFI2MVO6SD" --pipeline-definition file://definition.json
If no errors were returned, we can go ahead and activate our pipeline:
$ aws datapipeline activate-pipeline --pipeline-id df-06478032TYTFI2MVO6SD
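The upload and activation steps can also be chained so that activation only happens when the definition was accepted. Below is a minimal sketch, assuming a hypothetical helper named upload_and_activate and that the put-pipeline-definition response contains a boolean errored field. Check the output of your CLI version before relying on it:

```shell
#!/bin/sh
# Hypothetical helper: upload a definition and activate the pipeline only
# when the upload reports no validation errors.
upload_and_activate() {
  pipeline_id="$1"
  definition_file="$2"

  # Assumption: the response has a boolean "errored" field, printed as
  # True/False (or true/false) when --output text is used.
  errored=$(aws datapipeline put-pipeline-definition \
    --pipeline-id "$pipeline_id" \
    --pipeline-definition "file://$definition_file" \
    --query errored \
    --output text) || return 1

  case "$errored" in
    [Tt]rue)
      echo "definition rejected for $pipeline_id" >&2
      return 1
      ;;
  esac

  aws datapipeline activate-pipeline --pipeline-id "$pipeline_id"
}
```

For example: upload_and_activate df-06478032TYTFI2MVO6SD definition.json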
Log Outputs:
In the pipelineLogUri field of our pipeline definition, we specified the location for our logs. To view information about our output, you can retrieve the logs like below:
$ aws s3 ls s3://rb-bucket/logs/df-01614291LDEV55C2KYR0/
PRE EC2ResourceObj/
PRE ShellCommandActivityObj/
1. Shell Command Activity Logs:
$ aws s3 ls s3://rb-bucket/logs/df-01614291LDEV55C2KYR0/ShellCommandActivityObj/@ShellCommandActivityObj_2016-07-28T09:07:35/@ShellCommandActivityObj_2016-07-28T09:07:35_Attempt=1/
2016-07-28 09:15:04 734 Activity.log.gz
2016-07-28 09:15:04 180 StdError.gz
2016-07-28 09:15:02 32832 StdOutput.gz
2. EC2 Resource Activity Logs:
$ aws s3 ls s3://rb-bucket/logs/df-01614291LDEV55C2KYR0/EC2ResourceObj/@EC2ResourceObj_2016-07-28T09:07:35/@EC2ResourceObj_2016-07-28T09:07:35_Attempt=1/
2016-07-28 09:13:57 34800 TaskRunner.2016-07-28-09@000000000000000-000000000363386.gz
2016-07-28 09:16:59 18948 TaskRunner.2016-07-28-09@000000000363386-000000000566948.gz
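The log objects are gzip-compressed, so they can be read without saving them to disk first. Below is a minimal sketch, assuming a hypothetical helper named show_pipeline_log that is given the full S3 URI of one of the .gz objects listed above:

```shell
#!/bin/sh
# Hypothetical helper: print a gzip-compressed log object from S3.
show_pipeline_log() {
  log_uri="$1"   # full s3:// URI of a .gz log object
  # "-" as the destination makes aws s3 cp write the object to stdout
  aws s3 cp "$log_uri" - | gunzip -c
}
```

Pointing it at the StdOutput.gz or StdError.gz object under the attempt prefix shows what the backup script printed on the node.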
Let's verify that our files have been backed up to our defined path:
$ aws s3 ls s3://rb-bucket.repo/backups/scripts/
2016-07-28 11:46:33 15446 scripts-backup-2016-07-28.tar.gz
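The listing only proves that an archive exists. To see what is inside it, the archive can be streamed through tar. Below is a minimal sketch, assuming a hypothetical helper named list_backup_contents:

```shell
#!/bin/sh
# Hypothetical helper: list the members of a .tar.gz backup stored in S3
# without keeping a local copy.
list_backup_contents() {
  archive_uri="$1"   # full s3:// URI of the backup archive
  # Stream the object to stdout and let tar read the archive from stdin
  aws s3 cp "$archive_uri" - | tar -tzf -
}
```

For example: list_backup_contents s3://rb-bucket.repo/backups/scripts/scripts-backup-2016-07-28.tar.gz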
Other useful commands:
Listing Pipelines:
$ aws datapipeline list-pipelines
Describing Pipelines:
$ aws datapipeline describe-pipelines --pipeline-id <your-pipeline-id>
Deleting a Pipeline:
$ aws datapipeline delete-pipeline --pipeline-id <your-pipeline-id>
Please follow the AWS Datapipeline Documentation for more information.