Extract HTML content using XPath


XPath is widely used to scrape the desired content from an HTML page.

Here I attempt to read an HTML page and extract information from it. For this I am going to use Python and YQL.

YQL is the Yahoo Query Language: "an expressive SQL-like language that lets you query, filter, and join data across Web services. With YQL, apps run faster with fewer lines of code and a smaller network footprint."

For example, a query like this:

select * from html where url="http://en.wikipedia.org/wiki/David_Guetta"

will return the whole HTML page inside the <results> element of the XML response. We are interested only in extracting the useful data, so we can append an XPath expression to the query.

Suppose I only need to extract the background information about the DJ, which is shown in the box on the right side of the page.

If you view the page source, that information is in a <table> tag with class="infobox vcard". The XPath can be appended to the YQL query as follows:

select * from html where url="http://en.wikipedia.org/wiki/David_Guetta" and xpath='//table[@class="infobox vcard"]'

This query returns XML in which the <results> element contains the details from the table. The remaining data can then be parsed easily.
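Since the XPath itself contains double quotes while the query wraps it in single quotes, typing the statement by hand is easy to get wrong. Here is a minimal sketch of a helper that assembles the query string. The function name build_html_query is hypothetical (it is not part of YQL or the yql package), and it only builds text; it does not send anything to the service.

```python
def build_html_query(url, xpath=None):
    """Build a YQL statement against the html table (hypothetical helper)."""
    # The URL is wrapped in double quotes, so it must not contain one itself.
    if '"' in url:
        raise ValueError("url must not contain a double quote")
    query = 'select * from html where url="{0}"'.format(url)
    if xpath is not None:
        # The XPath is wrapped in single quotes, so use double quotes inside it.
        if "'" in xpath:
            raise ValueError("xpath must not contain a single quote")
        query += " and xpath='{0}'".format(xpath)
    return query


if __name__ == "__main__":
    print(build_html_query(
        "http://en.wikipedia.org/wiki/David_Guetta",
        '//table[@class="infobox vcard"]',
    ))
```

Rejecting the conflicting quote character outright, rather than trying to escape it, keeps the sketch simple and avoids guessing at escaping rules.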

But I am not interested in every <tr> tag in the table; I am only interested in <tr> tags with class="". I can add another filter to the query:

select * from html where url="http://en.wikipedia.org/wiki/David_Guetta" and xpath='//table[@class="infobox vcard"]/tr[@class=""]'

Now I get results something like this:

        <tr class="">
            <th scope="row" style="text-align:left;">
                <p>Birth name</p>
            </th>
            <td class="nickname" style="">
                <p>David Pierre Guetta</p>
            </td>
        </tr>
        <tr class="">
            <th scope="row" style="text-align:left;">
                <p>Born</p>
            </th>
            <td class="" style="">
                <p>7 November 1967 <span style="display:none">(<span class="bday">1967-11-07</span>)</span>
                    <span class="noprint">(age&nbsp;43)</span>
                    <br/>
Paris, France</p>
            </td>
        </tr>
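The same XPath filter can also be tried locally, without any web service, which helps when checking an expression. The sketch below uses Python's standard xml.etree.ElementTree, which supports only a subset of XPath and needs well-formed XML, so the page here is a small hand-made sample rather than the real Wikipedia markup. The names SAMPLE_PAGE, text_of and infobox_rows are made up for this illustration.

```python
import xml.etree.ElementTree as ET

# Hand-made, well-formed sample standing in for the real page (an assumption).
SAMPLE_PAGE = """
<html><body>
  <table class="navbox">
    <tr class=""><th>Nav</th><td>ignored</td></tr>
  </table>
  <table class="infobox vcard">
    <tr class="">
      <th scope="row"><p>Birth name</p></th>
      <td class="nickname"><p>David Pierre Guetta</p></td>
    </tr>
    <tr class="mergedrow">
      <th scope="row"><p>Other</p></th>
      <td><p>skipped row</p></td>
    </tr>
    <tr class="">
      <th scope="row"><p>Born</p></th>
      <td class=""><p>7 November 1967<br/>Paris, France</p></td>
    </tr>
  </table>
</body></html>
"""


def text_of(elem):
    """Join all text inside an element and collapse runs of whitespace."""
    return " ".join(" ".join(elem.itertext()).split())


def infobox_rows(page):
    """Map each row label (th) to its value (td) for rows with class=""."""
    root = ET.fromstring(page.strip())
    rows = {}
    # Same filter as in the query: infobox table, then <tr> with empty class.
    for tr in root.findall(".//table[@class='infobox vcard']/tr[@class='']"):
        th, td = tr.find("th"), tr.find("td")
        if th is not None and td is not None:
            rows[text_of(th)] = text_of(td)
    return rows


if __name__ == "__main__":
    for label, value in infobox_rows(SAMPLE_PAGE).items():
        print(label + ": " + value)
```

Real pages are rarely well-formed XML, so in practice a more tolerant HTML parser would be needed; the point here is only to show how the two predicates narrow the selection.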

I tried this using Python. If Python is already installed on your computer, then installing the yql package is simple:

1) If easy_install is present, just run

easy_install yql          (on Windows)
sudo easy_install yql     (on Linux systems)

2) Alternatively, download the tarball and install it manually:
wget http://pypi.python.org/packages/source/y/yql/yql-0.2.tar.gz
tar -xzf yql-0.2.tar.gz
cd yql-0.2
python setup.py install

See the YQL guide for more details about YQL.

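My sample program is:

import yql

if __name__ == '__main__':
    # Public() gives access to the public YQL tables; no credentials are passed
    y = yql.Public()
    # The XPath uses double quotes inside, so it is wrapped in escaped single quotes
    query = 'select * from html where url="http://en.wikipedia.org/wiki/David_Guetta" and xpath=\'//table[@class="infobox vcard"]/tr[@class=""]\''
    result = y.execute(query)
    # Each row corresponds to one matching <tr> element
    print(result.rows)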

— Done —