I wanted to get metadata from my other blog sysadmins.co.za, such as each post's title, link and tags using the RSS link. I stumbled upon feedparser
, where I will use it to scrape all the posts details from the link and append it to a list, which I can then use to ingest it into a database or something like that.
Installing Dependencies:
Install feedparser and requests:
$ pip install feedparser requests
The Python Code:
I'm not too sure at this point how to get pagination going, so I've set a range to check, and if a status code of 200
is received, it will check if the title is in the list that I defined, if not, it will append it to the list.
At the end of the loop, the script will return the list that was defined, which will provide the info mentioned earlier:
import feedparser
import time
import requests
rss_url = "https://sysadmins.co.za/rss"
posts = []
def get_posts_for_ghost(rss_url):
response = feedparser.parse(rss_url)
for each in response['entries']:
if each['title'] in [x['title'] for x in posts]:
pass
else:
posts.append({
"title": each['title'],
"link": each['links'][0]['href'],
"tags": [x['term'] for x in each['tags']],
"date": time.strftime('%Y-%m-%d', each['published_parsed'])
})
return posts
count = 12
for x in range(count):
if requests.get("{0}/{1}/".format(rss_url, count)).status_code == 200:
print("get succeeded, count at: {}".format(count))
get_posts_for_ghost("{0}/{1}/".format(rss_url, count))
count -= 1
else:
print("got 404, count at: {}".format(count))
count -= 1
get_posts_for_ghost(rss_url)
print(posts)
Running the script:
Running the script will look something like this:
$ python rssfeed.py
got 404, count at: 12
got 404, count at: 11
got 404, count at: 10
get succeeded, count at: 9
get succeeded, count at: 8
get succeeded, count at: 7
get succeeded, count at: 6
get succeeded, count at: 5
get succeeded, count at: 4
get succeeded, count at: 3
get succeeded, count at: 2
get succeeded, count at: 1
[
{
'title': 'Tutorial on DynamoDB using Bash and the AWS CLI Tools to Interact with a Music Dataset',
'link': 'https://sysadmins.co.za/tutorial-on-dynamodb-using-bash-and-the-aws-cli-tools-to-interact-with-a-music-dataset/',
'tags': ['DynamoDB', 'Bash', 'AWS'],
'date': '2018-08-15'
},
{
'title': 'Setup a PPTP VPN on Ubuntu 16',
'link': 'https://sysadmins.co.za/setup-a-pptp-vpn-on-ubuntu-16/',
'tags': ['Networking', 'VPN'],
'date': '2018-06-27'
},
...
Comments