XPath is widely used to scrape the desired content from an HTML page.
Here I am making an attempt to read an HTML page and extract information from it. For this I am going to use Python and YQL.
YQL is the Yahoo Query Language: "an expressive SQL-like language that lets you query, filter, and join data across Web services. With YQL, apps run faster with fewer lines of code and a smaller network footprint."
For example, a query like this:
select * from html where url="http://en.wikipedia.org/wiki/David_Guetta"
will return the whole HTML page inside the <results> element of the XML response. But we are interested only in extracting the useful data, so we can add an XPath expression to the query.
Say I just have to extract the background information about the DJ, which is shown in the box on the right side of the page.
If you view the page source, that info is in a <table> tag with class="infobox vcard". One can append the XPath to the YQL query as follows:
select * from html where url="http://en.wikipedia.org/wiki/David_Guetta" and xpath='//table[@class="infobox vcard"]'
This query will return an XML document in which the <results> tag contains the details from the table. Now one can easily parse the remaining data and use it.
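The quoting in these queries is easy to get wrong, so here is a minimal sketch of composing them in Python. The helper name build_html_query is hypothetical (it is not part of any library); it only builds the query string, with the URL in double quotes and the XPath in single quotes, and sends nothing over the network.

```python
def build_html_query(page_url, xpath=None):
    """Compose a YQL statement for the html table (hypothetical helper)."""
    query = 'select * from html where url="%s"' % page_url
    if xpath:
        # The XPath is wrapped in single quotes, so it may only
        # contain double quotes inside.
        if "'" in xpath:
            raise ValueError("use double quotes inside the XPath expression")
        query += " and xpath='%s'" % xpath
    return query


if __name__ == "__main__":
    print(build_html_query(
        "http://en.wikipedia.org/wiki/David_Guetta",
        '//table[@class="infobox vcard"]',
    ))
```

Keeping the double quotes inside the XPath and the single quotes around it means the whole query can be written as one Python string without any backslash escapes.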
But I am not interested in all the <tr> tags present in the table; I am only interested in the <tr> tags with class="". I can add another filter to the query:
select * from html where url="http://en.wikipedia.org/wiki/David_Guetta" and xpath='//table[@class="infobox vcard"]/tr[@class=""]'
Now I will get details that look something like this:
<tr class=""> <th scope="row" style="text-align:left;"> <p>Birth name</p> </th> <td class="nickname" style=""> <p>David Pierre Guetta</p> </td> </tr> <tr class=""> <th scope="row" style="text-align:left;"> <p>Born</p> </th> <td class="" style=""> <p>7 November 1967 <span style="display:none">(<span class="bday">1967-11-07</span>)</span> <span class="noprint">(age 43)</span> <br/> Paris, France</p> </td> </tr>
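Rows like these can be turned into label/value pairs with Python's standard xml.etree.ElementTree module. The following is only a sketch: the helper name rows_to_dict is hypothetical, and the sample input is a trimmed copy of the two rows above wrapped in a <results> element, which is an assumption about how the rows arrive.

```python
import xml.etree.ElementTree as ET

# Sample input: the two rows shown above, trimmed and wrapped in a
# <results> element (the wrapper is an assumption for this sketch).
SAMPLE = """<results>
<tr class=""><th scope="row"><p>Birth name</p></th>
<td class="nickname"><p>David Pierre Guetta</p></td></tr>
<tr class=""><th scope="row"><p>Born</p></th>
<td class=""><p>7 November 1967 <span style="display:none">(<span class="bday">1967-11-07</span>)</span> <span class="noprint">(age 43)</span> <br/> Paris, France</p></td></tr>
</results>"""


def rows_to_dict(xml_text):
    """Map the text of each row's <th> to the text of its <td>."""
    info = {}
    for tr in ET.fromstring(xml_text).iter("tr"):
        th, td = tr.find("th"), tr.find("td")
        if th is None or td is None:
            continue  # skip rows that are not label/value pairs
        # itertext() walks all nested text; split/join collapses whitespace
        label = " ".join("".join(th.itertext()).split())
        value = " ".join("".join(td.itertext()).split())
        info[label] = value
    return info


if __name__ == "__main__":
    for label, value in rows_to_dict(SAMPLE).items():
        print(label + ": " + value)
```

Collecting the text with itertext() keeps the nested <span> content (such as the bday value) in the result, so the "Born" entry still contains 1967-11-07.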
I tried this using Python. If Python is already installed on your computer, then installing the yql library is simple:
1) If easy_install is present, just run
easy_install yql (on Windows)
sudo easy_install yql (on Linux systems)
2) Or download the tarball and install it manually:
wget http://pypi.python.org/packages/source/y/yql/yql-0.2.tar.gz
tar -xzf yql-0.2.tar.gz
cd yql-0.2
python setup.py install
The YQL guide has more details about YQL.
My sample program is:
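import yql

if __name__ == '__main__':
    # Public() is the client for public tables, so no API keys are needed
    y = yql.Public()
    # Double quotes inside the XPath, escaped single quotes around it
    query = ('select * from html '
             'where url="http://en.wikipedia.org/wiki/David_Guetta" '
             'and xpath=\'//table[@class="infobox vcard"]/tr[@class=""]\'')
    result = y.execute(query)
    print(result.rows)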
— Done —