Extract HTML content using XPath


XPath is widely used to scrape the desired content from an HTML page.

Here I attempt to read an HTML page and extract information from it. For this I am going to use Python and YQL.

YQL is the Yahoo Query Language, “an expressive SQL-like language that lets you query, filter, and join data across Web services. With YQL, apps run faster with fewer lines of code and a smaller network footprint.”

For example, a query like this:

select * from html where url="http://en.wikipedia.org/wiki/David_Guetta"

will return the whole HTML page inside the <results> element of the XML response. We are interested only in the useful data, so we can append an XPath expression to the query.

Suppose I only need to extract the background information about the DJ, which is shown in the box on the right side of the page.

If you view the page source, that information is in a <table> tag with class="infobox vcard". The XPath can be appended to the YQL query as follows:

select * from html where url="http://en.wikipedia.org/wiki/David_Guetta" and xpath='//table[@class="infobox vcard"]'

This query returns XML in which the <results> tag contains the details from the table. The remaining data can then be parsed easily for further use.
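To see what such an expression selects without going through YQL (handy if the service is not reachable), here is a minimal sketch that evaluates the same kind of XPath locally with Python's standard xml.etree.ElementTree module. This is a swapped-in technique, not the YQL route. The markup and the name SAMPLE_PAGE are made up for illustration, and ElementTree supports only a subset of XPath and needs well-formed markup, so the expression is written as a relative path.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed stand-in for the page; a real page is far larger.
SAMPLE_PAGE = """
<html>
  <body>
    <table class="navbox"><tr><td>Navigation</td></tr></table>
    <table class="infobox vcard">
      <tr class=""><th>Birth name</th><td>David Pierre Guetta</td></tr>
    </table>
  </body>
</html>
"""

root = ET.fromstring(SAMPLE_PAGE)

# ElementTree needs a relative path, so '//table' becomes './/table'.
tables = root.findall(".//table[@class='infobox vcard']")

for table in tables:
    # Show which table matched and how many rows it holds.
    print(table.get("class"), len(table.findall("tr")))
```

Only the table whose class attribute is exactly "infobox vcard" is selected; the navigation table is skipped.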

But I am not interested in all the <tr> tags in the table; I am only interested in <tr> tags with class="". I can add another filter to the query:

select * from html where url="http://en.wikipedia.org/wiki/David_Guetta" and xpath='//table[@class="infobox vcard"]/tr[@class=""]'

Now I get results that look something like this:

<tr class="">
    <th scope="row" style="text-align:left;">
        <p>Birth name</p>
    </th>
    <td class="nickname" style="">
        <p>David Pierre Guetta</p>
    </td>
</tr>
<tr class="">
    <th scope="row" style="text-align:left;">
        <p>Born</p>
    </th>
    <td class="" style="">
        <p>7 November 1967 <span style="display:none">(<span class="bday">1967-11-07</span>)</span>
            <span class="noprint">(age&nbsp;43)</span>
            <br/>
            Paris, France</p>
    </td>
</tr>
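Rows shaped like these can be turned into label/value pairs with a few lines of Python. The sketch below is only an illustration: it uses the standard xml.etree.ElementTree module, the names SAMPLE_RESULTS, cell_text and rows_to_dict are hypothetical, and the sample is a simplified copy of the rows above wrapped in a <results> root (the &nbsp; entity is left out because a plain XML parser does not define it).

```python
import xml.etree.ElementTree as ET

# Hypothetical sample shaped like the rows above, wrapped in a <results> root.
SAMPLE_RESULTS = """
<results>
  <tr class="">
    <th scope="row"><p>Birth name</p></th>
    <td class="nickname"><p>David Pierre Guetta</p></td>
  </tr>
  <tr class="">
    <th scope="row"><p>Born</p></th>
    <td class=""><p>7 November 1967 <span class="bday">1967-11-07</span><br/>
Paris, France</p></td>
  </tr>
</results>
"""


def cell_text(cell):
    """Join all text inside a cell and collapse runs of whitespace."""
    return " ".join("".join(cell.itertext()).split())


def rows_to_dict(xml_text):
    """Map each row's <th> label to the text of its <td> cell."""
    info = {}
    for row in ET.fromstring(xml_text).findall("tr"):
        header, value = row.find("th"), row.find("td")
        if header is not None and value is not None:
            info[cell_text(header)] = cell_text(value)
    return info


info = rows_to_dict(SAMPLE_RESULTS)
print(info["Birth name"])
```

A dictionary keeps the lookup simple (info["Born"], info["Birth name"]); if the table repeated a label, a list of pairs would be the safer choice.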

I tried this using Python. If Python is already installed on your computer, installing the yql package is simple:

1) If easy_install is present, just run

On Windows: easy_install yql
On Linux systems: sudo easy_install yql

2) Alternatively, download the tarball and install it manually:
wget http://pypi.python.org/packages/source/y/yql/yql-0.2.tar.gz
tar -xzf yql-0.2.tar.gz
cd yql-0.2
python setup.py install

The YQL guide has more details about YQL.

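My sample program is:

import yql

if __name__ == '__main__':
    # Public() gives access to YQL tables that need no authentication
    y = yql.Public()
    # The Python string is single-quoted, so the quotes around the XPath are escaped
    query = 'select * from html where url="http://en.wikipedia.org/wiki/David_Guetta" and xpath=\'//table[@class="infobox vcard"]/tr[@class=""]\''
    result = y.execute(query)
    print(result.rows)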
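The nested quotes in the query string are easy to get wrong, so one option is to build the statement with a small helper. The function below is a hypothetical convenience of my own, not part of the yql package; it only formats the string and, as a simplifying assumption, rejects values that would break the quoting instead of escaping them.

```python
def build_html_query(url, xpath):
    """Return a YQL statement that selects `xpath` from the page at `url`."""
    # This sketch does no escaping, so reject values that would break the quoting.
    if '"' in url:
        raise ValueError("url must not contain double quotes")
    if "'" in xpath:
        raise ValueError("xpath must not contain single quotes")
    return "select * from html where url=\"{0}\" and xpath='{1}'".format(url, xpath)


query = build_html_query(
    "http://en.wikipedia.org/wiki/David_Guetta",
    '//table[@class="infobox vcard"]/tr[@class=""]',
)
print(query)
```

The resulting string can be passed to y.execute() exactly like the hand-written query in the sample program.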

— Done —

How to install the Python plugin for Eclipse: PyDev tutorial


1)      The plugin I use to develop Python programs in Eclipse is PyDev. It’s a free plugin and can be downloaded from the PyDev site.

2)      Install PyDev by following the instructions.

3)      Once PyDev is installed, open Eclipse.

4)      Go to Window -> Preferences -> PyDev -> Interpreter Python

5)      Click the New button.

6)      Select the path where “python.exe” is located and click OK.

7)      If the path is found successfully, the interpreter details are shown on the screen.

8)      Click OK. Now go to Window -> Open Perspective

9)      Select the PyDev perspective. If it is available, the plugin is configured successfully and you are ready to write your first program.

10)      Go to File -> New -> PyDev Project

11)      After you click Finish, the project appears in the “PyDev Package Explorer”.

12)      Right-click on src -> New -> PyDev Module

13)      Enter the name of the program and click Finish.

14)      The new module opens in the editor with a placeholder “pass” statement.

16)      Instead of “pass”, write

print('Hello World')

17)      Run the program with the play button, just as you would run a normal Java program.

18)      Check the console screen to see “Hello World”
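Step 6 asks for the location of the Python interpreter. If you are not sure where it is installed, one option is to ask Python itself. This is a minimal sketch to run from any Python prompt or script; the variable name interpreter_path is just illustrative.

```python
import sys

# Full path of the interpreter running this script,
# for example the python.exe file on Windows.
interpreter_path = sys.executable
print(interpreter_path)
```

The printed path is the one to select when configuring the interpreter in the PyDev preferences.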