Searching Data with Python and Elasticsearch

In this tutorial I will show you how to get started with Python and Elasticsearch by building a small example that searches for people's names and email addresses based on their job descriptions.

I will show you how to get set up, how to populate Elasticsearch with random data, and the full Python code for the example.

In this tutorial, we will be doing the following:

1. Set up the Elasticsearch library using pip.
2. Basic usage: indexing data to Elasticsearch and reading it back.
3. Set up a Python generator that indexes lots of random data into Elasticsearch using the Faker library.
4. Query data from Elasticsearch via Python using the Elasticsearch library.
5. Work through a scenario where we search for people based on their job description.

Let's get started:

Dependencies:

$ virtualenv venv
$ source venv/bin/activate
$ pip install elasticsearch
$ pip install Faker
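
Once the packages are installed, it can help to confirm that the endpoint is reachable before indexing anything. The following is a minimal sketch, assuming the client's ping() method; the helper name check_connection is only illustrative and is not part of the original setup.

```python
def check_connection(es):
    """Return True when the cluster answers a ping, False otherwise.

    `es` is assumed to be an elasticsearch.Elasticsearch client
    (hypothetical helper; any object with a ping() method works).
    """
    return bool(es.ping())
```

With a real client this would be called as check_connection(Elasticsearch("http://elasticsearch-endpoint:80")).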

Basic Usage on Indexing data to Elasticsearch:

We will create an index called testprofiles with a document type of users, and each document will have a numeric id, like the following:

_id: 1
name: firstname lastname
job_description: job description name
email: email@domain.com

Let's go ahead and index our first document:

$ curl -XPUT http://elasticsearch-endpoint:80/testprofiles/users/1 -d' 
{
	"name": "James Green",
	"job_description": "Database Administrator",
	"email": "james@domain.com"
}'

Add a couple more:

$ curl -XPUT http://elasticsearch-endpoint:80/testprofiles/users/2 -d' 
{
	"name": "Mark Shaw",
	"job_description": "Network Administrator",
	"email": "mark@domain.com"
}'

$ curl -XPUT http://elasticsearch-endpoint:80/testprofiles/users/3 -d' 
{
	"name": "Peter Stanford",
	"job_description": "Java Developer",
	"email": "peter@domain.com"
}'
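
The same three documents could also be indexed from Python instead of curl. This is a minimal sketch, assuming the same es.index() call that the generator script in this tutorial uses; the helper name index_profiles and the PROFILES list are illustrative names, not part of the original example.

```python
# Sample documents from the curl examples, paired with their numeric ids.
PROFILES = [
    (1, {"name": "James Green", "job_description": "Database Administrator", "email": "james@domain.com"}),
    (2, {"name": "Mark Shaw", "job_description": "Network Administrator", "email": "mark@domain.com"}),
    (3, {"name": "Peter Stanford", "job_description": "Java Developer", "email": "peter@domain.com"}),
]


def index_profiles(es, profiles, index="testprofiles", doc_type="users"):
    """Index each (id, body) pair and return the list of client responses.

    `es` is assumed to be an elasticsearch.Elasticsearch client
    (hypothetical helper; any object with a compatible index() method works).
    """
    responses = []
    for doc_id, body in profiles:
        responses.append(
            es.index(index=index, doc_type=doc_type, id=str(doc_id), body=body)
        )
    return responses
```

With a real client this would be called as index_profiles(Elasticsearch("http://elasticsearch-endpoint:80"), PROFILES).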

Reading data from Elasticsearch:

We will perform a GET request against our Elasticsearch endpoint, specifying the document id, which is 1 in this example:

$ curl -XGET http://elasticsearch-endpoint:80/testprofiles/users/1
{
  "_index" : "testprofiles",
  "_type" : "users",
  "_id" : "1",
  "_version" : 1,
  "found" : true,
  "_source" : {
    "name": "James Green",
    "job_description": "Database Administrator",
    "email": "james@domain.com"
  }
}
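
The same lookup can be sketched in Python. This assumes the client's get() method and the response shape shown above, where the stored fields sit under _source; the helper name fetch_profile is hypothetical.

```python
def fetch_profile(es, doc_id, index="testprofiles", doc_type="users"):
    """Return the stored fields (_source) of a single document.

    `es` is assumed to be an elasticsearch.Elasticsearch client. Error
    handling is left out of this sketch, so a missing id surfaces as
    whatever error the client raises.
    """
    res = es.get(index=index, doc_type=doc_type, id=str(doc_id))
    return res["_source"]
```

Calling fetch_profile(es, 1) against the index above would be expected to return the name, job_description and email fields of document 1.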

Set up the Generator for Random Data to Elasticsearch:

Now we will index 1000 documents into Elasticsearch. We will generate random values for name, job and email, and the id will be incremented for each document.

In this example I will name my python file: gen-data.py

from faker import Factory
from elasticsearch import Elasticsearch
import json
import time

ES_HOST = "http://elasticsearch-endpoint:80"
es = Elasticsearch(ES_HOST)


def create_names(fake, count=1000):
    # Ids start at 1 and increase by one for each generated document.
    for number in range(1, count + 1):
        gen_name = fake.name()
        gen_job = fake.job()
        gen_email = fake.email()

        # Index one document per iteration into searchindex/users.
        go = es.index(
            index="searchindex",
            doc_type="users",
            id=str(number),
            body={
                "name": gen_name,
                "job": gen_job,
                "email": gen_email
            }
        )

        # Print the response as JSON, then pause briefly between requests.
        print(json.dumps(go))
        time.sleep(0.01)


if __name__ == '__main__':
    fake = Factory.create()
    create_names(fake)
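
The script above sends one request per document. As an alternative, here is a sketch that swaps in the bulk helper from the elasticsearch package, fed by a Python generator. It is not what the script above does; the function names generate_actions and bulk_load are illustrative, and the fake argument is assumed to be a Faker instance (anything with name(), job() and email() methods works).

```python
def generate_actions(fake, count=1000, index="searchindex", doc_type="users"):
    """Yield one bulk action per generated person, with ids 1..count."""
    for number in range(1, count + 1):
        yield {
            "_index": index,
            "_type": doc_type,
            "_id": str(number),
            "_source": {
                "name": fake.name(),
                "job": fake.job(),
                "email": fake.email(),
            },
        }


def bulk_load(es, fake, count=1000):
    """Send all generated actions to Elasticsearch through the bulk helper."""
    # Imported here so generate_actions can be used without the package.
    from elasticsearch import helpers
    return helpers.bulk(es, generate_actions(fake, count))
```

Because generate_actions is a generator, documents are produced lazily as the helper consumes them rather than being built up in a list first.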

Set up the Search Program:

Next up, we will create the Python file that we will use to query the data in Elasticsearch. We will search on the job description, and the query will return the details of the people whose job matches.

In this example, I want the following output:

  1. Number of documents found
  2. One line per match, consisting of the id number, name and email address

I will be using the filename search.py:

from elasticsearch import Elasticsearch

ES_HOST = "http://elasticsearch-endpoint:80"

if __name__ == '__main__':

    # Python 3: input() replaces Python 2's raw_input().
    query = input("Search for people based on their Job Description: \n")

    es = Elasticsearch(ES_HOST)
    # Full-text match on the job field, returning at most 20 hits.
    res = es.search(index="searchindex", doc_type="users", body={"query": {"match": {"job": query}}}, size=20)
    print("%d documents found:" % res['hits']['total'])
    for doc in res['hits']['hits']:
        print("%s) %s <mailto:%s>" % (doc['_id'], doc['_source']['name'], doc['_source']['email']))
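
To make the two halves of the search program easier to follow, the query body and the output formatting can be pulled into small functions. This is only a sketch with hypothetical helper names (build_job_query, format_hits), and it assumes the response shape used in this tutorial, where hits.total is a plain integer; newer Elasticsearch versions return an object there instead.

```python
def build_job_query(term):
    """Build the request body for a full-text match on the job field."""
    return {"query": {"match": {"job": term}}}


def format_hits(res):
    """Turn a search response into the output lines used in this tutorial."""
    lines = ["%d documents found:" % res["hits"]["total"]]
    for doc in res["hits"]["hits"]:
        src = doc["_source"]
        lines.append("%s) %s <mailto:%s>" % (doc["_id"], src["name"], src["email"]))
    return lines
```

With a real client, the result of es.search(index="searchindex", doc_type="users", body=build_job_query(query), size=20) could be passed straight to format_hits and printed line by line.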

Generating the Data to Elasticsearch:

$ python gen-data.py
{"_type": "users", "created": true, "_shards": {"successful": 1, "failed": 0, "total": 2}, "_version": 1, "_index": "searchindex", "_id": "1"}
{"_type": "users", "created": true, "_shards": {"successful": 1, "failed": 0, "total": 2}, "_version": 1, "_index": "searchindex", "_id": "2"}
{"_type": "users", "created": true, "_shards": {"successful": 1, "failed": 0, "total": 2}, "_version": 1, "_index": "searchindex", "_id": "3"}
{"_type": "users", "created": true, "_shards": {"successful": 1, "failed": 0, "total": 2}, "_version": 1, "_index": "searchindex", "_id": "4"}

Let's get searching!

So now that everything is set up, we would like to retrieve the names and email addresses of everyone who is a translator:

$ python search.py

Search for people based on their Job Description:
translator
4 documents found:
882) Daniel Garcia <mailto:scotthawkins@edwards-smith.com>
950) Andrew Simpson <mailto:nguyensamantha@hotmail.com>
127) David Hebert <mailto:thompsonmeghan@morgan.com>
402) Matthew Lloyd <mailto:carl62@wilson.com>

Now that we have our results, let's verify one of them by querying Elasticsearch for a returned id. Let's take 950:

$ curl 'http://elasticsearch-endpoint:80/searchindex/users/950?pretty=true'
{
  "_index" : "searchindex",
  "_type" : "users",
  "_id" : "950",
  "_version" : 1,
  "found" : true,
  "_source" : {
    "job" : "Translator",
    "name" : "Andrew Simpson",
    "email" : "nguyensamantha@hotmail.com"
  }
}

That was a pretty basic example, but Elasticsearch is very fast at full-text search, and the more data you index into it, the more useful these searches become.

Credit and big thanks go to the following: