Monday, June 10, 2019

Removing boilerplate from webpages using Python

I am an avid Firefox user and, more often than not, when I visit a webpage containing an article, I end up clicking the "Reader Mode" button in the address bar so that I can remove all the useless noise and focus on the main content the page has to offer.

A news article, like this one shown below

becomes like this after entering the reader mode.
As you can see, all the ads, boilerplate, etc. are gone, and now I can focus on the actual content without straining my eyes to find the stuff I visited the webpage for.


It turns out that one can easily write a Python program (Python being my language of choice for such quick experiments) to do the same thing.

We will use the Python readability library (published on PyPI as readability-lxml) together with requests to achieve this.

Here is the program. It seems to work for me.



import os
import webbrowser

import requests
from readability import Document

# We need a URL that contains a lot of boilerplate. I found one.
# You could substitute any other URL of your choice.
url = 'https://timesofindia.indiatimes.com/city/delhi/delhi-records-all-time-high-of-48-degrees-celsius-heat-wave-to-continue/articleshow/69727572.cms'

# Fetch the webpage using the requests library.
response = requests.get(url)
response.raise_for_status()  # stop early if the server returned an HTTP error

# Build a Document from the readability package. This is what does the magic.
document = Document(response.text)

# document.summary() returns the main content as HTML.
# Let's write it to a file and invoke a web browser to
# see if we really did get what we wanted.
with open('tempwebpage.html', 'w', encoding='utf-8') as f:
    f.write(document.summary())
webbrowser.open('file://' + os.path.realpath('tempwebpage.html'))
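
Since document.summary() gives back HTML, you may sometimes want just the plain text instead, for example to feed it to another script. Below is a minimal sketch of one way to do that using only the standard library's html.parser module. The names html_to_text and _TextExtractor are my own illustrative helpers and are not part of readability, and the lists of tags to skip or treat as line breaks are assumptions you may want to adjust.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects visible text, skipping the contents of <script> and <style>."""

    # Assumption: these tag sets are a rough guess, not an exhaustive list.
    SKIPPED_TAGS = {"script", "style"}
    BLOCK_TAGS = {"p", "div", "br", "li", "h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1
        elif tag in self.BLOCK_TAGS:
            # Start block-level elements on a new line.
            self._chunks.append("\n")

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1
        elif tag in self.BLOCK_TAGS:
            self._chunks.append("\n")

    def handle_data(self, data):
        # Keep text only when we are not inside a skipped element.
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self):
        # Collapse runs of whitespace and drop empty lines.
        raw_lines = "".join(self._chunks).splitlines()
        lines = (" ".join(line.split()) for line in raw_lines)
        return "\n".join(line for line in lines if line)


def html_to_text(html):
    """Return the visible text of an HTML fragment, one block per line."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()
```

With the program above, this could be called as html_to_text(document.summary()) and the result printed or saved to a .txt file.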
