aws

In this Tutorial we will use the AWS CLI tools to Interact with Amazon Athena.

What is Amazon Athena:

Athena is a Serverless Query Service that allows you to analyze data in Amazon S3 using standard SQL.

Update AWS CLI Tools:

$ pip install pip --user awscli

Create a Bucket in the Region of choice:

At this point in time, us-east-1 is one of the supported regions, so we will create a S3 bucket in this region:

$ aws s3 mb s3://ruan-athena-bucket --region us-west-1

Create Sample Data:

$ echo "ruan,bekker" > data.csv
$ echo "stefan,bester" >> data.csv

Upload Data to our New S3 Bucket:

$ aws s3 cp data.csv s3://ruan-athena-bucket/data/
upload: ./data.csv to s3://ruan-athena-bucket/data/data.csv

Now that we have our data in S3, we will Create our Table in Athena and Read the Data.

I am creating a very, and I mean very small file for the data I will be reading from, that will just act as an example.

Athena really is amazingly fast when you have mass amounts of data, that you would like to query.

Create a Table in Athena:

When the query execution is performed, a query execution id is returned, which we can use to get information from the query that was performed.

$ aws athena start-query-execution --query-string "create external table tbl01 (name STRING, surname STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LOCATION 's3://ruan-athena-bucket/data/';" --result-configuration "OutputLocation=s3://ruan-athena-bucket/output/"
{
    "QueryExecutionId": "95e9611a-299e-4eed-b473-bd81dacc2b47"
}

Get Info of your Query Execution:

$ aws athena get-query-execution --query-execution-id "95e9611a-299e-4eed-b473-bd81dacc2b47"

{
    "QueryExecution": {
        "Status": {
            "SubmissionDateTime": 1496903451.099,
            "State": "SUCCEEDED",
            "CompletionDateTime": 1496903451.918
        },
        "Query": "create external table tbl01 (name STRING, surname STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LOCATION 's3://ruan-athena-bucket/data/'",
        "Statistics": {
            "DataScannedInBytes": 0,
            "EngineExecutionTimeInMillis": 573
        },
        "ResultConfiguration": {
            "OutputLocation": "s3://ruan-athena-bucket/output/95e9611a-299e-4eed-b473-bd81dacc2b47.txt"
        },
        "QueryExecutionId": "95e9611a-299e-4eed-b473-bd81dacc2b47"
    }
}

Get Output Results from our Athena Execution ID:

At this point in time when a table is created, although we have set the S3 Location where our data resides, data has not been read from S3, you can confirm by reading the data from the "OutputLocation", which you will see is blank, or using the cli to confirm:

$ aws athena get-query-results --query-execution-id "95e9611a-299e-4eed-b473-bd81dacc2b47"
{
    "ResultSet": {
        "Rows": [],
        "ResultSetMetadata": {
            "ColumnInfo": []
        }
    }
}

Read data with Athena:

Now we will read the data from our table, and then only will the data be read from S3 that was used in your create table statement:

$ aws athena start-query-execution --query-string "select * from tbl01;" --result-configuration "OutputLocation=s3://ruan-athena-bucket/output/"     
{
    "QueryExecutionId": "440260ff-d887-452f-a7f7-413191cc2f17"
}

Get The Details from the Athena Execution ID:

$ aws athena get-query-execution --query-execution-id "440260ff-d887-452f-a7f7-413191cc2f17"
{
    "QueryExecution": {
        "Status": {
            "SubmissionDateTime": 1496903763.174,
            "State": "SUCCEEDED",
            "CompletionDateTime": 1496903765.293
        },
        "Query": "select * from tbl01",
        "Statistics": {
            "DataScannedInBytes": 26,
            "EngineExecutionTimeInMillis": 1771
        },
        "ResultConfiguration": {
            "OutputLocation": "s3://ruan-athena-bucket/output/440260ff-d887-452f-a7f7-413191cc2f17.csv"
        },
        "QueryExecutionId": "440260ff-d887-452f-a7f7-413191cc2f17"
    }
}

Reading the Results from the OutputLocation:

$ aws s3 cp s3://ruan-athena-bucket/output/440260ff-d887-452f-a7f7-413191cc2f17.csv ./
download: s3://ruan-athena-bucket/output/440260ff-d887-452f-a7f7-413191cc2f17.csv to ./440260ff-d887-452f-a7f7-413191cc2f17.csv
$ cat ./440260ff-d887-452f-a7f7-413191cc2f17.csv
"name","surname"
"ruan","bekker"
"stefan","bester"

Reading the Results using the Query Execution ID via CLI:

$ aws athena get-query-results --query-execution-id "440260ff-d887-452f-a7f7-413191cc2f17"
{
    "ResultSet": {
        "Rows": [
            {
                "Data": [
                    {
                        "VarCharValue": "name"
                    },
                    {
                        "VarCharValue": "surname"
                    }
                ]
            },
            {
                "Data": [
                    {
                        "VarCharValue": "ruan"
                    },
                    {
                        "VarCharValue": "bekker"
                    }
                ]
            },
            {
                "Data": [
                    {
                        "VarCharValue": "stefan"
                    },
                    {
                        "VarCharValue": "bester"
                    }
                ]
            }
        ],
        "ResultSetMetadata": {
            "ColumnInfo": [
                {
                    "Scale": 0,
                    "Name": "name",
                    "Nullable": "UNKNOWN",
                    "TableName": "",
                    "Precision": 1073741824,
                    "Label": "name",
                    "CaseSensitive": true,
                    "SchemaName": "",
                    "Type": "varchar",
                    "CatalogName": "hive"
                },
                {
                    "Scale": 0,
                    "Name": "surname",
                    "Nullable": "UNKNOWN",
                    "TableName": "",
                    "Precision": 1073741824,
                    "Label": "surname",
                    "CaseSensitive": true,
                    "SchemaName": "",
                    "Type": "varchar",
                    "CatalogName": "hive"
                }
            ]
        }
    }
}

Delete the S3 Bucket:

$ aws s3 rm s3://ruan-athena-bucket

For more details on Amazon Athena, have a look at their Documentation