When working on deep learning or natural language processing consulting projects, one often needs to obtain training data sets from the internet.
Often this involves scraping websites for textual data (most often) or images (less often).
What is scraping?
Web scraping is the use of bots or other programs to automate access to websites and the extraction of data from them. This can save a lot of time for you or for someone else, e.g. co-workers or other people in your organisation.
In one of the earliest phases of the internet, 2000-2005, Perl was a popular programming language for scraping websites (I still remember using its Mechanize module). This has changed in the last decade; I now most often resort to Python for this purpose.
One can use special-purpose libraries for that, such as Beautiful Soup or Scrapy. Scrapy has a lot of logic built in to deal with various parts of scraping; I especially like Scrapy rules, which allow you to control your scraping spiders.
An important part of collecting data from websites is also the parsing. If the pages are structured, the time savings are especially high, as one can define a set of rules once and then scrape 10000 pages with these rules. Scraping websites can often be the only way to get good data sets for deep learning consulting.
A very useful library for parsing XML and HTML files is lxml. lxml is a Pythonic binding for the C libraries libxml2 and libxslt.
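The idea of defining a set of rules once and applying it to many structured pages can be sketched with lxml roughly as follows. This is only a minimal illustration: the rule names, the XPath expressions, the helper function extract and the two sample pages are all made up for this example, and a real scraper would download the pages first.

```python
from lxml import html

# Hypothetical rule set: field name -> XPath expression (assumed page layout).
RULES = {
    "title": "//h1/text()",
    "price": "//span[@class='price']/text()",
}


def extract(page_source, rules):
    """Apply every XPath rule to one page; keep the first match per field."""
    tree = html.fromstring(page_source)
    record = {}
    for field, xpath in rules.items():
        matches = tree.xpath(xpath)
        record[field] = matches[0].strip() if matches else None
    return record


# Two made-up pages that share the same layout; the second has no price.
pages = [
    "<html><body><h1>Book A</h1><span class='price'>10 EUR</span></body></html>",
    "<html><body><h1>Book B</h1></body></html>",
]

records = [extract(page, RULES) for page in pages]
print(records)
```

The same RULES dictionary is reused for every page, so supporting a new field only means adding one entry. Fields that a page does not contain come back as None instead of raising an error.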
Installation of lxml
To install lxml, just use Python’s package manager pip:
pip install lxml
Including lxml in python projects
Examples for including lxml in projects:
from lxml import etree
from lxml import html
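As a rough illustration of when each of the two imports is useful: etree is the general XML API, while html adds a parser that tolerates the imperfect markup found on real websites. The sample strings below are invented for this sketch.

```python
from lxml import etree
from lxml import html

# etree expects well-formed XML and raises an error on malformed input.
xml_root = etree.fromstring("<books><book id='1'>Dune</book></books>")
print(xml_root.tag)           # tag of the root element
print(xml_root[0].get("id"))  # attribute of the first child

# html is lenient: the unclosed <p> tags below are repaired while parsing.
html_root = html.fromstring("<div><p>First<p>Second</div>")
paragraphs = [p.text for p in html_root.iter("p")]
print(paragraphs)
```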
Dealing with the Element class
Element is the key object of the ElementTree API. More information is available at http://effbot.org/zone/element-index.htm. With lxml, a new element is created by calling etree.Element with the tag name as its argument.
To access the name (tag) of an element stored in the variable new:
print(new.tag)
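A minimal sketch of working with Element, assuming the variable is called new as in the text above. The tag names "new" and "item", the attribute and the text are chosen only for illustration.

```python
from lxml import etree

# Create a root element; "new" is simply the tag name chosen for this sketch.
new = etree.Element("new")
print(new.tag)

# Add a child element with an attribute and some text.
child = etree.SubElement(new, "item", id="1")
child.text = "some text"

# Serialize the small tree back to a string.
print(etree.tostring(new, pretty_print=True).decode())
```

etree.tostring returns bytes, which is why the result is decoded before printing.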
Using selectors to find information in files
Two typical types of selectors with lxml:
- xpath (built into lxml)
- cssselect (if you do not have it, install it with pip install cssselect)
Let us look at an example of its use.
from lxml import html
example = '<html><a class="class1">This is some text. </a><a class="class2">This is another text.</a></html>'
tree = html.fromstring(example)
Printing the text of the two a elements selected from this tree will result in:
This is some text. This is another text.