Searching Data with Python and ElasticSearch

In this tutorial I will show you how to get started with Python and Elasticsearch by searching for people's names and email addresses based on their job descriptions.

I will show you how to get set up, how to populate Elasticsearch with random data, and the full Python code for the example.

In this tutorial, we will be doing the following:

1. Install the Elasticsearch client library using pip.
2. Cover basic usage: indexing data to Elasticsearch and reading it back.
3. Set up a Python script that generates lots of random data with the Faker library and indexes it to Elasticsearch.
4. Query data from Elasticsearch via Python using the Elasticsearch library.
5. Work through a scenario where we search for people based on their job description.

Let's get started:

Dependencies:

$ virtualenv venv
$ source venv/bin/activate
$ pip install elasticsearch
$ pip install Faker

Basic Usage on Indexing data to Elasticsearch:

We will create an index called testprofiles with a document type of users, and each document will have a numeric id, like the following:

_id: 1
name: firstname lastname
job_description: job description name
email: [email protected]

Let's go ahead and index our first document:

$ curl -XPUT http://elasticsearch-endpoint:80/testprofiles/users/1 -d' 
{
	"name": "James Green",
	"job_description": "Database Administrator",
	"email": "[email protected]"
}'

Add a couple more:

$ curl -XPUT http://elasticsearch-endpoint:80/testprofiles/users/2 -d' 
{
	"name": "Mark Shaw",
	"job_description": "Network Administrator",
	"email": "[email protected]"
}'

$ curl -XPUT http://elasticsearch-endpoint:80/testprofiles/users/3 -d' 
{
	"name": "Peter Stanford",
	"job_description": "Java Developer",
	"email": "[email protected]"
}'

Reading data from Elasticsearch:

We will perform a GET request against our Elasticsearch endpoint, specifying the document id, which is 1 in this example:

$ curl -XGET http://elasticsearch-endpoint:80/testprofiles/users/1

{
  "_index" : "testprofiles",
  "_type" : "users",
  "_id" : "1",
  "_version" : 1,
  "found" : true,
  "_source" : {
    "name": "James Green",
    "job_description": "Database Administrator",
    "email": "[email protected]"
  }
}
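
The same two operations can also be done from Python with the elasticsearch client. This is a minimal sketch, assuming an older client release that still accepts the doc_type argument (as the rest of this tutorial does). The helper name index_and_fetch is made up for illustration, and the client is passed in as a parameter so nothing connects until you call it:

```python
def index_and_fetch(es, doc_id, profile, index="testprofiles", doc_type="users"):
    """Index one profile, then read it back by id.

    `es` is expected to be an elasticsearch.Elasticsearch client
    (assumption: a version that still supports `doc_type`).
    """
    # Equivalent of: curl -XPUT .../testprofiles/users/<id> -d '{...}'
    es.index(index=index, doc_type=doc_type, id=doc_id, body=profile)
    # Equivalent of: curl -XGET .../testprofiles/users/<id>
    res = es.get(index=index, doc_type=doc_type, id=doc_id)
    # The stored document lives under the "_source" key of the response.
    return res["_source"]
```

Called with a client created via Elasticsearch("http://elasticsearch-endpoint:80"), an id and a profile dictionary, it should hand back the same dictionary that appears under _source in the response above.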

Set Up the Generator for Random Data:

Now we will generate 1000 documents and index them to Elasticsearch. We will generate random data for name, email, and job, and the id will be incremented for each document. Note that the script writes to its own index, searchindex, and stores the job description in a field called job.

In this example I will name my Python file gen-data.py:

from faker import Factory
from elasticsearch import Elasticsearch
import json
import time

ES_HOST = "http://elasticsearch-endpoint:80"
es = Elasticsearch(ES_HOST)


def create_names(fake, count=1000):
    # Index `count` random profiles, using 1..count as the document ids.
    for number in range(1, count + 1):
        response = es.index(
            index="searchindex",
            doc_type="users",
            id=str(number),
            body={
                "name": fake.name(),
                "job": fake.job(),
                "email": fake.email()
            }
        )

        # Print the indexing response for each document.
        print(json.dumps(response))
        # Brief pause between requests.
        time.sleep(0.01)


if __name__ == '__main__':
    fake = Factory.create()
    create_names(fake)
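
The script above sends one request per document. As a possible variation, the profiles can be produced lazily by a real Python generator and handed to the client's bulk helper. This is a sketch, not the method used in this tutorial: profile_actions and bulk_index are hypothetical helper names, and fake can be any object exposing name(), job() and email(), such as the Faker instance created above.

```python
def profile_actions(fake, count, index="searchindex", doc_type="users"):
    """Lazily yield one bulk action per random profile (ids start at 1)."""
    for number in range(1, count + 1):
        yield {
            "_index": index,
            "_type": doc_type,  # only meaningful on clusters that still use types
            "_id": str(number),
            "_source": {
                "name": fake.name(),
                "job": fake.job(),
                "email": fake.email(),
            },
        }


def bulk_index(es, fake, count=1000):
    """Send all generated profiles through elasticsearch.helpers.bulk."""
    # Imported here so the generator above works without the client installed.
    from elasticsearch import helpers
    return helpers.bulk(es, profile_actions(fake, count))
```

Because profile_actions is a generator, the documents are created one at a time as the bulk helper consumes them, rather than being built up in a list first.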

Set Up the Search Program:

Next up, we will create the Python file which we will use to query the data from Elasticsearch. We will search on the job description, and the query will return the details of the people who match.

In this example, I want the output to show the number of documents found, followed by one line per match consisting of the id number, name and email address.

I will be using the filename search.py:

from elasticsearch import Elasticsearch

ES_HOST = "http://elasticsearch-endpoint:80"

if __name__ == '__main__':
    # Python 3: input() replaces Python 2's raw_input().
    query = input("Search for people based on their Job Description: \n")

    es = Elasticsearch(ES_HOST)
    # Full-text match on the "job" field, returning at most 20 hits.
    res = es.search(index="searchindex", doc_type="users", body={"query": {"match": {"job": query}}}, size=20)
    print("%d documents found:" % res['hits']['total'])
    for doc in res['hits']['hits']:
        print("%s) %s <mailto:%s>" % (doc['_id'], doc['_source']['name'], doc['_source']['email']))
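
One detail worth guarding against: on Elasticsearch 7 and later, hits.total in the search response is an object such as {"value": 4, "relation": "eq"} rather than a plain number. Below is a small sketch of a formatter that copes with both shapes. format_results is a hypothetical helper, not part of the elasticsearch library, and it works on the response as a plain dictionary:

```python
def format_results(res):
    """Turn a search response dict into the lines printed by search.py."""
    total = res["hits"]["total"]
    # Elasticsearch 7+ wraps the count in {"value": ..., "relation": ...}.
    if isinstance(total, dict):
        total = total["value"]
    lines = ["%d documents found:" % total]
    for doc in res["hits"]["hits"]:
        src = doc["_source"]
        lines.append("%s) %s <mailto:%s>" % (doc["_id"], src["name"], src["email"]))
    return lines
```

Printing each returned line in turn should give the same output format as search.py.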

Generating the Data to Elasticsearch:

$ python gen-data.py

{"_type": "users", "created": true, "_shards": {"successful": 1, "failed": 0, "total": 2}, "_version": 1, "_index": "searchindex", "_id": "1"}
{"_type": "users", "created": true, "_shards": {"successful": 1, "failed": 0, "total": 2}, "_version": 1, "_index": "searchindex", "_id": "2"}
{"_type": "users", "created": true, "_shards": {"successful": 1, "failed": 0, "total": 2}, "_version": 1, "_index": "searchindex", "_id": "3"}
{"_type": "users", "created": true, "_shards": {"successful": 1, "failed": 0, "total": 2}, "_version": 1, "_index": "searchindex", "_id": "4"}

Let's get searching!

So now that everything is set up, we would like to retrieve the names and email addresses of everyone who is a translator:

$ python search.py

Search for people based on their Job Description:
translator

4 documents found:
882) Daniel Garcia <mailto:[email protected]>
950) Andrew Simpson <mailto:[email protected]>
127) David Hebert <mailto:[email protected]>
402) Matthew Lloyd <mailto:[email protected]>

Now that we have our results, let's verify one of them by fetching the document by its id. Let's take 950:

$ curl 'http://elasticsearch-endpoint:80/searchindex/users/950?pretty=true'

{
  "_index" : "searchindex",
  "_type" : "users",
  "_id" : "950",
  "_version" : 1,
  "found" : true,
  "_source" : {
    "job" : "Translator",
    "name" : "Andrew Simpson",
    "email" : "[email protected]"
  }
}

That was a pretty basic example, but it shows how quickly Elasticsearch can search your data, and the more documents you index, the more useful those searches become.

Credit and big thanks go to the following:

Searching Data with Python and ElasticSearch
