/ AWS

AWS: Import CSV Data from S3 to DynamoDB

When running a AWS EMR Cluster, you can import CSV data that is located on S3 to DynamoDB, using Hive.

Our sample data has the following structure "id", "movie name", "director name", "release status":

# movies.csv
01,movie1,james may,anne rose,released
02,movie2,andy smith,floyd rose,released
03,movie3,george good,ben james,released
04,movie4,mike james,anne goode,released

This CSV file, sits on S3, for example:
"s3://bucketname/path/to/csvdata/"

We will log onto the master node of our EMR Cluster, logon to Hive. Then from there we will create 2 external tables, a dynamodb_tbl and a s3_tbl:

# dynamodb-tbl
hive> CREATE EXTERNAL TABLE ddb_tbl_movies (
    id STRING, 
    movie STRING, 
    actor STRING, 
    director STRING, 
    status STRING ) 
  STORED BY 
  'org.apache.hadoop.hive.dynamodb.DynamoDBStorageHandler' 
  TBLPROPERTIES (
    "dynamodb.table.name" = "moviedb", 
    "dynamodb.column.mapping" = 
   "id:id,movie:movie,actor:actor,director:director,status:status"
  );
# s3-tbl
hive> CREATE EXTERNAL TABLE s3_table_movies (
    id STRING, 
    movie STRING, 
    actor STRING, 
    director STRING, 
    status STRING) 
  ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' 
  LINES TERMINATED BY '\n' 
  LOCATION "s3://bucketname/path/to/csvdata/" 
  TBLPROPERTIES ("serialization.null.format"="");

From here, we can see the mapped data from S3:

hive> select * from s3_table_movies;
OK
01      movie1  james may       anne rose       released
02      movie2  andy smith      floyd rose      released
03      movie3  george good     ben james       released
04      movie4  mike james      anne goode      released

Now, we can see the DynamoDB table is still empty:

hive> select * from ddb_tbl_movies;                                                                                                                                                                  
OK

Now we will insert data from S3 to DynamoDB

hive> INSERT INTO TABLE ddb_tbl_movies select * from s3_table_movies;
Launching Job 1 out of 1
...
MapReduce Total cumulative CPU time: 6 seconds 900 msec
Total MapReduce CPU Time Spent: 6 seconds 900 msec
OK
Time taken: 40.732 seconds

Verify the data:

hive> select * from ddb_tbl_movies;
OK
04      movie4  mike james      anne goode      released
02      movie2  andy smith      floyd rose      released
01      movie1  james may       anne rose       released
03      movie3  george good     ben james       released

Now your data has been imported to DynamoDB