Monday, June 10, 2019

Removing boilerplate from webpages using Python

I am an avid Firefox user and, more often than not, when I visit a webpage containing an article, I end up clicking the "Reader Mode" button in the address bar so that I can strip away the noise and focus on the main content the page has to offer.

A news article like the one shown below

becomes like this after entering reader mode.
As you can see, the ads and other boilerplate are gone, and I can focus on the actual content without straining my eyes to find what I came to the page for.


It turns out that one can easily write a Python program (Python being my language of choice for such quick experiments) to do the same thing.

We will use the Python readability library to do this.

Here is the program. It seems to work for me.