Scraping Websites with Python and Beautiful Soup and Ingesting into Elasticsearch

This is the first post of a two-part guide. In this post we will scrape this website for the page title, URL and tags of each blog post, then ingest that data into Elasticsearch.

Once the data is in Elasticsearch, we will build a search engine to search these posts. The frontend will consist of Python Flask, the Elasticsearch library and HTML, and will be covered in Part 2.

Notice:

Always ensure that you are scraping for the right reasons. In this example I use my own blog as the target, and I won't scrape the site's content, only the page title, URL and tags, so that we have enough data for our search engine.

Requirements:

For this example I am using Ubuntu 16.04, and we need to install some dependencies for our Python script:

$ apt update && apt upgrade -y
$ apt install python python-dev python-setuptools python-lxml openssl libssl-dev python-pip
$ pip install requests
$ pip install bs4
$ pip install elasticsearch

Python Scraper:

Here is our Python scraper, which reads the post URLs from the sitemap, scrapes each post and ingests the data into Elasticsearch:

import time
import requests
from bs4 import BeautifulSoup
from elasticsearch import Elasticsearch

es_client = Elasticsearch(['http://10.0.1.10:9200'])

# start from a clean index: drop it if it exists, then create it again
drop_index = es_client.indices.delete(index='blogs-sysadmins', ignore=[400, 404])
create_index = es_client.indices.create(index='blogs-sysadmins', ignore=400)

def urlparser(url):
    # fetch the post and scrape its title
    page = requests.get(url).content
    soup = BeautifulSoup(page, 'lxml')
    title_name = soup.title.string

    # scrape tags from the <meta property="article:tag"> elements
    desc = soup.findAll(attrs={"property": "article:tag"})
    tag_names = [tag['content'] for tag in desc]

    # payload for elasticsearch
    doc = {
        'date': time.strftime("%Y-%m-%d"),
        'title': title_name,
        'tags': tag_names,
        'url': url
    }

    # ingest payload into elasticsearch
    res = es_client.index(index="blogs-sysadmins", doc_type="docs", body=doc)
    # pause between requests so we don't hammer the site
    time.sleep(0.5)

# collect the post URLs from the sitemap
sitemap_feed = 'https://sysadmins.co.za/sitemap-posts.xml'
page = requests.get(sitemap_feed)
sitemap_index = BeautifulSoup(page.content, 'html.parser')
urls = [element.text for element in sitemap_index.findAll('loc')]

for url in urls:
    urlparser(url)

This scraper grabs all the post URLs from the sitemap, then loops through each post, scrapes its title and tags, and ingests the result into Elasticsearch.
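
To see what the scraper is looking for without touching the network, here is a small offline sketch of the two parsing steps. Note that this is not the scraper itself: it uses Python 3 and the standard library (xml.etree.ElementTree and html.parser) as a stand-in for Beautiful Soup, and the sample sitemap, the sample page and the names sitemap_urls and PageMetaParser are made up for illustration.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Hypothetical sample data standing in for the live sitemap and a post page.
SAMPLE_SITEMAP = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/post-one/</loc></url>
  <url><loc>https://example.com/post-two/</loc></url>
</urlset>"""

SAMPLE_PAGE = """<html><head>
<title>Post One</title>
<meta property="article:tag" content="LXC" />
<meta property="article:tag" content="Hadoop" />
</head><body></body></html>"""


def sitemap_urls(xml_text):
    """Return the text of every <loc> element in a sitemap document."""
    ns = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
    root = ET.fromstring(xml_text)
    return [loc.text.strip() for loc in root.findall(".//sm:loc", ns)]


class PageMetaParser(HTMLParser):
    """Collect the page title and the content of article:tag meta elements."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.tags = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta" and attrs.get("property") == "article:tag":
            self.tags.append(attrs.get("content", ""))

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        # only text that appears inside <title> is kept
        if self._in_title:
            self.title += data


urls = sitemap_urls(SAMPLE_SITEMAP)
parser = PageMetaParser()
parser.feed(SAMPLE_PAGE)
print(urls)
print(parser.title, parser.tags)
```

The title, the tags and the URL are the three fields that end up in the document payload, together with the date of the scrape.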

Running the Python Scraper:

$ python scraper.py

Verify Documents in Elasticsearch:

After you have executed the Python script, have a look at Elasticsearch to confirm that the documents were ingested:

$ curl http://10.0.1.10:9200/_cat/indices/blogs-sysadmins?v
health status index           uuid                   pri rep docs.count docs.deleted store.size pri.store.size
green  open   blogs-sysadmins gyHONBcwTmaVjZVRj6dYew   5   1         80            0    289.6kb        144.8kb

As you can see, we have ingested 80 documents. Let's have a look at one of them:

$ curl http://10.0.1.10:9200/blogs-sysadmins/_search?pretty -d '{"size": 1}'
{
  "took" : 5,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 80,
    "max_score" : 1.0,
    "hits" : [
      {
        "_index" : "blogs-sysadmins",
        "_type" : "docs",
        "_id" : "AV6j3bZEzCREvIW7N4yt",
        "_score" : 1.0,
        "_source" : {
          "date" : "2017-09-21",
          "url" : "https://sysadmins.co.za/bash-script-setup-a-3-node-hadoop-cluster-on-lxc-containers/",
          "tags" : [
            "LXC",
            "Hadoop",
            "Scripting",
            "LXD"
          ],
          "title" : "Bash Script setup a 3 Node Hadoop Cluster on LXC Containers"
        }
      }
    ]
  }
}
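
If you would rather inspect the results from Python, the response can be unpacked like any other dictionary. The sketch below assumes a response with the same shape as the JSON above; the helper name summarize_hits is hypothetical, and the sample dictionary is a trimmed copy of that output rather than a live query result.

```python
def summarize_hits(response):
    """Return a (title, url, tags) tuple for every hit in a search response."""
    rows = []
    for hit in response["hits"]["hits"]:
        source = hit["_source"]
        rows.append((source["title"], source["url"], source.get("tags", [])))
    return rows


# Trimmed copy of the search response shown above.
sample_response = {
    "hits": {
        "total": 80,
        "hits": [
            {
                "_index": "blogs-sysadmins",
                "_type": "docs",
                "_source": {
                    "date": "2017-09-21",
                    "url": "https://sysadmins.co.za/bash-script-setup-a-3-node-hadoop-cluster-on-lxc-containers/",
                    "tags": ["LXC", "Hadoop", "Scripting", "LXD"],
                    "title": "Bash Script setup a 3 Node Hadoop Cluster on LXC Containers",
                },
            }
        ],
    }
}

rows = summarize_hits(sample_response)
for title, url, tags in rows:
    print(title, url, ", ".join(tags))
```

With the Elasticsearch client from the scraper, the same helper should work on the dictionary returned by a search call, assuming the client returns the response in this shape.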

Next Steps:

In my next post, I will guide you through setting up a search user interface that will be our search engine for the blog posts stored in Elasticsearch.
