This will be a 2 post guide, where we will scrape this website for the Page Title, URL and Tags of each blog post, then ingest that data into Elasticsearch - This Post
Once we have our data in Elasticsearch, we will build a Search Engine to search for these posts. The frontend will consist of Python Flask, the Elasticsearch library and HTML, which will be covered in Part 2
Notice:
Always ensure that you are scraping for the right reasons. In this example I will use my own blog site as the target, and I won't be scraping the website's content, only each post's Page Title, URL and Tags, which gives us enough data for our search engine.
Requirements:
For this example I am using Ubuntu 16.04, and we need to install a few dependencies for our Python script:
$ apt update && apt upgrade -y
$ apt install python python-dev python-setuptools python-lxml openssl libssl-dev python-pip
$ pip install requests
$ pip install bs4
$ pip install elasticsearch
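Before running the scraper, it may help to verify that Python can reach your Elasticsearch node. Here is a minimal sketch, assuming Elasticsearch is listening on 10.0.1.10:9200 as in the scraper below:
from elasticsearch import Elasticsearch

es_client = Elasticsearch(['http://10.0.1.10:9200'])

# ping() returns True when the node responds, False otherwise
if es_client.ping():
    print(es_client.info())
else:
    print('Could not reach Elasticsearch on 10.0.1.10:9200')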
Python Scraper:
Here is our Python Scraper that will scrape the data from a sitemap.xml and ingest the data into Elasticsearch:
import time
import requests
from bs4 import BeautifulSoup
from elasticsearch import Elasticsearch

es_client = Elasticsearch(['http://10.0.1.10:9200'])

# recreate the index for a clean run: drop it if it exists, then create it
drop_index = es_client.indices.delete(index='blog-sysadmins', ignore=[400, 404])
create_index = es_client.indices.create(index='blog-sysadmins', ignore=400)

def urlparser(url):
    # scrape the page title
    page = requests.get(url).content
    soup = BeautifulSoup(page, 'lxml')
    title_name = soup.title.string

    # scrape the tags from the article:tag meta properties
    tag_names = []
    desc = soup.findAll(attrs={"property": "article:tag"})
    for tag in desc:
        tag_names.append(tag['content'].encode('utf-8'))

    # payload for elasticsearch
    doc = {
        'date': time.strftime("%Y-%m-%d"),
        'title': title_name,
        'tags': tag_names,
        'url': url
    }

    # ingest payload into elasticsearch
    res = es_client.index(index="blog-sysadmins", doc_type="docs", body=doc)
    time.sleep(0.5)

sitemap_feed = 'https://sysadmins.co.za/sitemap-posts.xml'
page = requests.get(sitemap_feed)
sitemap_index = BeautifulSoup(page.content, 'html.parser')
urls = [element.text for element in sitemap_index.findAll('loc')]

for url in urls:
    urlparser(url)
This scraper grabs all the post URLs from the sitemap.xml, then loops through each post and, using the logic above, ingests the data into Elasticsearch.
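If you want to test the parsing logic before writing anything to Elasticsearch, you can run the scraping part against a single post and print the result. A minimal dry-run sketch, using one of the post URLs that appears in the sample output further down:
import requests
from bs4 import BeautifulSoup

# any post URL from the sitemap will do
url = 'https://sysadmins.co.za/bash-script-setup-a-3-node-hadoop-cluster-on-lxc-containers/'
soup = BeautifulSoup(requests.get(url).content, 'lxml')

# the same selectors the scraper uses for the title and tags
print(soup.title.string)
print([tag['content'] for tag in soup.findAll(attrs={"property": "article:tag"})])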
Running the Python Scraper:
$ python scraper.py
Verify Documents in Elasticsearch:
After you have executed the Python script, have a look at Elasticsearch to confirm that the documents were ingested:
$ curl http://10.0.1.10:9200/_cat/indices/blog-sysadmins?v
health status index uuid pri rep docs.count docs.deleted store.size pri.store.size
green open blog-sysadmins gyHONBcwTmaVjZVRj6dYew 5 1 80 0 289.6kb 144.8kb
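The same check can be done from Python with the client's count() API; a minimal sketch, reusing the client from the scraper:
from elasticsearch import Elasticsearch

es_client = Elasticsearch(['http://10.0.1.10:9200'])

# count() returns a dict like {'count': 80, '_shards': {...}}
print(es_client.count(index='blog-sysadmins')['count'])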
As you can see, 80 documents were ingested. Let's have a look at one of them:
$ curl http://10.0.1.10:9200/blog-sysadmins/_search?pretty -d '{"size": 1}'
{
"took" : 5,
"timed_out" : false,
"_shards" : {
"total" : 5,
"successful" : 5,
"failed" : 0
},
"hits" : {
"total" : 80,
"max_score" : 1.0,
"hits" : [
{
"_index" : "blogs-sysadmins",
"_type" : "docs",
"_id" : "AV6j3bZEzCREvIW7N4yt",
"_score" : 1.0,
"_source" : {
"date" : "2017-09-21",
"url" : "https://sysadmins.co.za/bash-script-setup-a-3-node-hadoop-cluster-on-lxc-containers/",
"tags" : [
"LXC",
"Hadoop",
"Scripting",
"LXD"
],
"title" : "Bash Script setup a 3 Node Hadoop Cluster on LXC Containers"
}
}
]
}
}
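As a small preview of what the search engine in Part 2 will do, you can already run a match query against the tags field from Python. A minimal sketch, assuming the index and client from the scraper, and searching for the Hadoop tag seen in the sample document above:
from elasticsearch import Elasticsearch

es_client = Elasticsearch(['http://10.0.1.10:9200'])

# match query on the tags field; 'Hadoop' is one of the tags we ingested
res = es_client.search(index='blog-sysadmins', body={'query': {'match': {'tags': 'Hadoop'}}})
for hit in res['hits']['hits']:
    print('%s -> %s' % (hit['_source']['title'], hit['_source']['url']))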
Next Steps:
In my next post, I will guide you through the steps of setting up a Search User Interface that will be our search engine for the blog posts stored in Elasticsearch.