I am an avid Firefox user, and more often than not, when I visit a webpage containing an article, I end up clicking the "Reader Mode" button in the address bar so that I can strip away the useless noise and focus on the main content the page has to offer.
A news article like the one shown below
becomes like this after entering reader mode.
As you can see, the ads, boilerplate, etc. are all gone, and I can focus on the actual content without straining my eyes to find the stuff I visited the webpage for.
It turns out that one can easily write a Python program (Python being my language of choice for such quick experiments) to do the same thing.
We will use the Python readability library (published on PyPI as readability-lxml) to do the extraction.
Here is the program. It seems to work for me.
import os
import webbrowser

import requests
from readability import Document

# We need a URL whose page carries a lot of boilerplate. I found one;
# you could substitute any other URL of your choice.
url = 'https://timesofindia.indiatimes.com/city/delhi/delhi-records-all-time-high-of-48-degrees-celsius-heat-wave-to-continue/articleshow/69727572.cms'

# Fetch the webpage using the requests library.
response = requests.get(url)

# Wrap the HTML in a Document from the readability package. This is what does the magic.
document = Document(response.text)

# document.summary() returns the cleaned-up HTML of the main content.
# Let's write it to a file and invoke a web browser to see
# if we really did get what we wanted.
with open('tempwebpage.html', 'w', encoding='utf-8') as f:
    f.write(document.summary())

webbrowser.open('file://' + os.path.realpath('tempwebpage.html'))
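
If you are curious about what kind of work goes on behind "the magic", here is a toy sketch that uses only the standard library. To be clear, this is not how the readability package works internally; it is a much cruder heuristic of my own for illustration. It assumes that article text lives in `<p>` tags, that anything inside `nav`, `header`, `footer` and similar tags is boilerplate, and that very short paragraphs are noise. The names `ParagraphCollector`, `extract_main_text` and `min_words`, as well as the sample HTML, are all made up for this example.

```python
from html.parser import HTMLParser

# Tags whose text is treated as boilerplate (an assumption of this toy heuristic).
SKIPPED_TAGS = {"script", "style", "nav", "header", "footer", "aside", "form"}


class ParagraphCollector(HTMLParser):
    """Collect the text of every <p> that is not nested inside a skipped tag."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0        # > 0 while we are inside a skipped tag
        self.in_paragraph = False  # True while we are inside a kept <p>
        self.current = []          # text chunks of the paragraph being read
        self.paragraphs = []       # finished paragraphs, in document order

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED_TAGS:
            self.skip_depth += 1
        elif tag == "p" and self.skip_depth == 0:
            self.in_paragraph = True
            self.current = []

    def handle_endtag(self, tag):
        if tag in SKIPPED_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1
        elif tag == "p" and self.in_paragraph:
            # Collapse runs of whitespace into single spaces.
            text = " ".join("".join(self.current).split())
            if text:
                self.paragraphs.append(text)
            self.in_paragraph = False

    def handle_data(self, data):
        if self.in_paragraph and self.skip_depth == 0:
            self.current.append(data)


def extract_main_text(html, min_words=5):
    """Return the paragraphs that look like article text, dropping short ones."""
    collector = ParagraphCollector()
    collector.feed(html)
    collector.close()
    return [p for p in collector.paragraphs if len(p.split()) >= min_words]


# A made-up page: navigation, an ad, two article paragraphs and a footer.
sample = """
<html><body>
<nav><p>Home | City | Sports | Subscribe now for more stories</p></nav>
<div class="ad"><p>Buy now!</p></div>
<article>
<p>Delhi records an all-time high of 48 degrees Celsius.</p>
<p>The heat wave is expected to continue.</p>
</article>
<footer><p>Copyright and terms of use apply to this entire site.</p></footer>
</body></html>
"""

for paragraph in extract_main_text(sample):
    print(paragraph)
```

On the sample above, only the two paragraphs inside `<article>` survive. Real pages are far messier than this, which is exactly why I would rather lean on a library like readability than maintain such rules by hand.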