A few times now, I have had to generate massive amounts of sample data to reproduce certain issues and scenarios.
There are some awesome tools online, but most of the time their free versions limit you to roughly 100 items.
At the time of writing, I'm testing a lot of Hadoop and Spark jobs and need a massive amount of data.
I found a Python package called Faker, which is just what I need!
I will show you how to create 100 CSV files, each containing 1 million items:
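**Setting Up:**
Install pip and then Faker (first published on PyPI as `fake-factory`, later renamed to `Faker`):
```language-bash
$ yum install python-setuptools -y
$ easy_install pip
$ pip install Faker
```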
**Create Python App:**
`generate-csv-data.py`:
```language-python
#!/usr/bin/python
from faker import Factory

# CSV header row
print("username,first_name,last_name,email,position,country,last_access_from_ip,mac_address")

def create_names(fake):
    # Print 1,000,000 comma-separated rows of fake data to stdout
    for x in range(1000000):
        print(",".join([
            fake.slug(), fake.first_name(), fake.last_name(), fake.company_email(),
            fake.job(), fake.country(), fake.ipv4(), fake.mac_address()
        ]))

if __name__ == "__main__":
    fake = Factory.create()
    create_names(fake)
```
The code can be found on [GitHub](https://gist.github.com/ruanbekker/d56781f97901c6ba7f04ffab7c3a7da8).
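One caveat with joining the fields by hand: some generated values can themselves contain a comma (Faker job titles, for example, can look like `Engineer, civil`), which would shift the columns in that row. Below is a minimal sketch, assuming Python 3, that uses the standard `csv` module to quote such fields. The `rows_to_csv` helper and the sample row are made up for illustration; in the real script each row would come from the `fake.*()` calls.

```python
import csv
import io

HEADER = ["username", "first_name", "last_name", "email", "position",
          "country", "last_access_from_ip", "mac_address"]

def rows_to_csv(rows, header=HEADER):
    """Return CSV text for the given rows, quoting fields where needed."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return buffer.getvalue()

# Made-up sample row; the job title contains a comma on purpose.
sample_row = ["john-doe", "John", "Doe", "jdoe@example.com",
              "Engineer, civil", "South Africa", "192.0.2.10",
              "00:00:5e:00:53:01"]

csv_text = rows_to_csv([sample_row])
print(csv_text)
```

The field with the comma is wrapped in double quotes, so a CSV reader still sees eight columns for that row.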
**Generating the Data:**
Create the directory where the generated CSV files will reside:
```language-bash
$ mkdir dataset
```
Generate the data:
```language-bash
$ for x in {1..100}; do python generate-csv-data.py > dataset/dataset-$(date | md5sum | awk '{print $1}').csv; done
```
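The shell loop names each file after an MD5 hash of the current date and time. If you would rather have predictable, sortable file names, the same loop can be sketched in Python 3. This is only a sketch under my own assumptions: `write_datasets`, its parameters and the zero-padded `dataset-001.csv` naming are hypothetical, and `make_row` stands in for any function that returns one row as a list.

```python
import csv
import os
import tempfile

def write_datasets(out_dir, make_row, header, files=100, rows_per_file=1000000):
    """Write `files` CSV files into out_dir and return their paths."""
    os.makedirs(out_dir, exist_ok=True)
    paths = []
    for n in range(1, files + 1):
        path = os.path.join(out_dir, "dataset-{:03d}.csv".format(n))
        with open(path, "w", newline="") as handle:
            writer = csv.writer(handle)
            writer.writerow(header)
            for _ in range(rows_per_file):
                writer.writerow(make_row())
        paths.append(path)
    return paths

# Tiny demo with a placeholder row function: 3 files, 2 rows each.
demo_dir = tempfile.mkdtemp()
paths = write_datasets(demo_dir, lambda: ["user", "192.0.2.1"],
                       ["username", "last_access_from_ip"],
                       files=3, rows_per_file=2)
print([os.path.basename(p) for p in paths])
```

With Faker, `make_row` could be something like `lambda: [fake.slug(), fake.first_name(), fake.last_name()]`, using the `fake` object created in the script.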
**Resources:**
* [Documentation](https://pypi.python.org/pypi/fake-factory)
* [Faker Providers](http://fake-factory.readthedocs.io/en/latest/providers.html)