
Scraping Websites with Python and Beautiful Soup and Ingesting into Elasticsearch

This is a two-part guide. In this post, we will scrape this website for the Page Title, URL and Tags of each blog post, then ingest that data into Elasticsearch.

Once the data is in Elasticsearch, we will build a search engine to search these posts. The frontend will consist of Python Flask, the Elasticsearch library and HTML, and will be covered in Part 2.

Notice:

Always ensure that you are scraping for the right reasons. In this example I use my own blog as the target, and I won't be scraping the site's content, only the Page Title, URL and Tags, so that we have enough data for our search engine.

Requirements:

For this example I am using Ubuntu 16.04, and our Python script needs a few dependencies installed:

$ apt update && apt upgrade -y
$ apt install python python-dev python-setuptools python-lxml openssl libssl-dev python-pip
$ pip install requests
$ pip install bs4
$ pip install elasticsearch

Python Scraper:

Here is our Python Scraper that will scrape the data from a sitemap.xml and ingest the data into Elasticsearch:

import re
import time
import requests
from bs4 import BeautifulSoup
from elasticsearch import Elasticsearch

es_client = Elasticsearch(['http://10.0.1.10:9200'])
index_name = 'blogs-sysadmins'

# start from a clean index: drop it if it exists (ignore 404), then recreate it
drop_index = es_client.indices.delete(index=index_name, ignore=[400, 404])
create_index = es_client.indices.create(index=index_name, ignore=400)

def urlparser(url):
    # fetch the post and scrape its title
    page = requests.get(url).content
    soup = BeautifulSoup(page, 'lxml')
    title_name = soup.title.string

    # scrape tags from the <meta property="article:tag"> elements
    tag_names = []
    for tag in soup.findAll(attrs={"property": "article:tag"}):
        tag_names.append(tag['content'])

    # payload for elasticsearch
    doc = {
        'date': time.strftime("%Y-%m-%d"),
        'title': title_name,
        'tags': tag_names,
        'url': url
    }

    # ingest payload into elasticsearch, pausing briefly between requests
    res = es_client.index(index=index_name, doc_type="docs", body=doc)
    time.sleep(0.5)

# collect every post URL listed in the sitemap
sitemap_feed = 'https://sysadmins.co.za/sitemap-posts.xml'
page = requests.get(sitemap_feed)
sitemap_index = BeautifulSoup(page.content, 'html.parser')
urls = [element.text for element in sitemap_index.findAll('loc')]

for url in urls:
    urlparser(url)

This scraper grabs every post URL from the sitemap, then loops through each post, scrapes the title and tags, and ingests the result into Elasticsearch.
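
To try the two parsing steps in isolation, here is a minimal sketch that runs the same kind of BeautifulSoup lookups against inline sample data instead of the live site. The sample sitemap, the sample page and the helper names extract_urls and extract_title_and_tags are made up for illustration and are not part of the scraper itself. It uses html.parser for both documents so that lxml is not needed to run it.

```python
from bs4 import BeautifulSoup

# Hypothetical sample data standing in for the real sitemap and a real post page.
SAMPLE_SITEMAP = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/post-one/</loc></url>
  <url><loc>https://example.com/post-two/</loc></url>
</urlset>"""

SAMPLE_POST = """<html><head>
<title>Post One</title>
<meta property="article:tag" content="LXC" />
<meta property="article:tag" content="Hadoop" />
</head><body></body></html>"""


def extract_urls(sitemap_xml):
    """Return the text of every <loc> element in a sitemap document."""
    soup = BeautifulSoup(sitemap_xml, 'html.parser')
    return [loc.text.strip() for loc in soup.find_all('loc')]


def extract_title_and_tags(html):
    """Return the page title and the list of article:tag values, in page order."""
    soup = BeautifulSoup(html, 'html.parser')
    title = soup.title.string
    tags = [meta['content'] for meta in soup.find_all(attrs={'property': 'article:tag'})]
    return title, tags


urls = extract_urls(SAMPLE_SITEMAP)
title, tags = extract_title_and_tags(SAMPLE_POST)
print(urls)
print('%s: %s' % (title, ', '.join(tags)))
```

Keeping the parsing in small functions like these makes it easy to check the selectors against a saved page before pointing the scraper at a real site.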

Running the Python Scraper:

$ python scraper.py

Verify Documents in Elasticsearch:

After the script has run, query Elasticsearch to confirm that the documents were ingested:

$ curl http://10.0.1.10:9200/_cat/indices/blogs-sysadmins?v
health status index           uuid                   pri rep docs.count docs.deleted store.size pri.store.size
green  open   blogs-sysadmins gyHONBcwTmaVjZVRj6dYew   5   1         80            0    289.6kb        144.8kb

As you can see, we have ingested 80 documents. Let's have a look at one of them:

$ curl http://10.0.1.10:9200/blogs-sysadmins/_search?pretty -d '{"size": 1}'
{
  "took" : 5,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 80,
    "max_score" : 1.0,
    "hits" : [
      {
        "_index" : "blogs-sysadmins",
        "_type" : "docs",
        "_id" : "AV6j3bZEzCREvIW7N4yt",
        "_score" : 1.0,
        "_source" : {
          "date" : "2017-09-21",
          "url" : "https://sysadmins.co.za/bash-script-setup-a-3-node-hadoop-cluster-on-lxc-containers/",
          "tags" : [
            "LXC",
            "Hadoop",
            "Scripting",
            "LXD"
          ],
          "title" : "Bash Script setup a 3 Node Hadoop Cluster on LXC Containers"
        }
      }
    ]
  }
}
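
The same search can also be issued from Python. With the client version used in this post, a call along the lines of es_client.search(index='blogs-sysadmins', body={'size': 1}) should return the response as a dictionary. As a rough sketch, assuming the response has the shape shown above, a small helper (hypothetically named summarize_hits) can pull out the fields we care about. The sample dictionary is a trimmed copy of the output above, so no running cluster is needed to try it.

```python
def summarize_hits(response):
    """Return a list of (title, url, tags) tuples from a search response dict."""
    summary = []
    for hit in response['hits']['hits']:
        source = hit['_source']
        # fall back to an empty list if a document has no 'tags' field
        summary.append((source['title'], source['url'], source.get('tags', [])))
    return summary


# Trimmed copy of the search response shown above.
sample_response = {
    'hits': {
        'total': 80,
        'hits': [
            {
                '_index': 'blogs-sysadmins',
                '_type': 'docs',
                '_source': {
                    'date': '2017-09-21',
                    'url': 'https://sysadmins.co.za/bash-script-setup-a-3-node-hadoop-cluster-on-lxc-containers/',
                    'tags': ['LXC', 'Hadoop', 'Scripting', 'LXD'],
                    'title': 'Bash Script setup a 3 Node Hadoop Cluster on LXC Containers',
                },
            }
        ],
    }
}

for title, url, tags in summarize_hits(sample_response):
    print('%s | %s | %s' % (title, url, ', '.join(tags)))
```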

Next Steps:

In my next post, I will guide you through setting up a search user interface that will be our search engine for the blog posts stored in Elasticsearch.