Data Science Consultant Role

Data Science has emerged as one of the most important fields in recent years, capable of providing organizations across many industries with methods to gain insights from their often vast collections of data.

A data science consultant most often uses machine learning or deep learning models to address the problems of their clients.

Typical machine learning approaches include both supervised and unsupervised models. In the supervised approach, the data science consultant has a labelled data set from the domain of application. The data set in this case consists of values for a set of features (input variables) and a target variable that we want to predict.

The purpose of a supervised machine learning model is to learn from instances of the data set, with the goal of predicting the value of the target variable from the feature values of previously unseen examples.

If the target variable is categorical, e.g. whether a customer will or will not buy a product, then we are dealing with a classification problem. If the target variable is continuous, the problem is known as a regression problem.
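
As a minimal sketch of this distinction with scikit-learn (all column names and values below are made up for illustration), the same features can feed either a classifier or a regressor, depending on the type of the target variable:

import pandas as pd
from sklearn.linear_model import LinearRegression, LogisticRegression

# small made-up data set: two features and two possible targets
df = pd.DataFrame({
    "age": [25, 40, 31, 58, 46, 22],
    "income": [28000, 52000, 41000, 75000, 60000, 23000],
    "will_buy": [0, 1, 0, 1, 1, 0],                              # categorical target -> classification
    "yearly_spend": [120.0, 540.0, 310.0, 980.0, 640.0, 90.0],   # continuous target -> regression
})

X = df[["age", "income"]]

clf = LogisticRegression(max_iter=1000).fit(X, df["will_buy"])   # classification
reg = LinearRegression().fit(X, df["yearly_spend"])              # regression

print(clf.predict(X.head(2)))
print(reg.predict(X.head(2)))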

There are various methods available to train a machine learning model for classification and regression problems (a short scikit-learn sketch follows the list):

– linear regression (Lasso, Ridge)

– logistic regression

– decision trees

– random forest

– support vector machines

– gradient boosting machines, e.g. LightGBM and XGBoost

– deep neural networks, convolutional neural networks, Long Short-Term Memory (LSTM) networks
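
A rough sketch of how several of the models listed above can be tried out in scikit-learn, which gives most of them the same fit/score interface (the synthetic data here is only for illustration):

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.svm import SVC

# synthetic classification data, just to have something to fit
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

models = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "decision tree": DecisionTreeClassifier(random_state=0),
    "random forest": RandomForestClassifier(n_estimators=200, random_state=0),
    "gradient boosting": GradientBoostingClassifier(random_state=0),
    "support vector machine": SVC(),
}

for name, model in models.items():
    model.fit(X, y)
    print(name, round(model.score(X, y), 3))  # training accuracy, only a quick sanity check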

Before selecting a particular machine learning model for the assigned problem, an important task of a data science consultant is to perform initial data preparation, feature selection and feature engineering.

Initial exploratory data analysis and feature selection can involve many different steps. We can, for example, compute Pearson correlations between numerical features or compute Cramér's V to measure the association between categorical features.
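
A small sketch of both checks with pandas and scipy (the column names and values are made up): the Pearson correlations come directly from DataFrame.corr, and Cramér's V can be derived from the chi-squared statistic of a contingency table.

import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

df = pd.DataFrame({
    "age": [23, 45, 31, 52, 40, 36, 28, 60],
    "income": [25000, 61000, 39000, 72000, 55000, 47000, 31000, 80000],
    "segment": ["A", "B", "A", "B", "B", "A", "A", "B"],
    "churned": ["no", "yes", "no", "yes", "no", "no", "no", "yes"],
})

# Pearson correlation matrix of the numerical features
print(df[["age", "income"]].corr(method="pearson"))

def cramers_v(x, y):
    """Cramér's V between two categorical series."""
    table = pd.crosstab(x, y)
    chi2, _, _, _ = chi2_contingency(table)
    n = table.to_numpy().sum()
    r, c = table.shape
    return np.sqrt(chi2 / (n * (min(r, c) - 1)))

print(cramers_v(df["segment"], df["churned"]))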

Categorical features can often be divided into nominal and ordinal features. Ordinal features are those where we can define an order between values but cannot compute a distance, e.g. a Euclidean one, between different values. An example of an ordinal feature would be education level: primary school, high school, university and so on.
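
For instance, such an ordinal feature can be encoded with pandas so that the assumed ordering of the levels is preserved (the levels here mirror the education example above):

import pandas as pd

education = pd.Series(["high school", "primary school", "university", "high school"])

levels = ["primary school", "high school", "university"]  # assumed ordering of the levels
education_ordinal = pd.Categorical(education, categories=levels, ordered=True)

print(education_ordinal.codes)  # [1, 0, 2, 1] -> integer codes that respect the order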

Another common analytical step for a data science consultant is to generate kernel density estimate (KDE) plots, which show how the values of a feature are distributed.
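
A minimal sketch of such a plot with seaborn (assuming a recent seaborn version, 0.11 or later, and made-up data):

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
df = pd.DataFrame({"income": rng.normal(50000, 12000, size=1000)})

sns.kdeplot(data=df, x="income")  # KDE of the income feature
plt.title("KDE of income")
plt.show()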

Pairplots, which show how features interact with each other, are also often useful. They are frequently generated for a large number of features at once.
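
A sketch with seaborn's built-in iris sample data set (loading it fetches the data over the internet on first use):

import seaborn as sns
import matplotlib.pyplot as plt

iris = sns.load_dataset("iris")   # sample data shipped via seaborn's online repository
sns.pairplot(iris, hue="species") # pairwise scatter plots plus per-feature distributions
plt.show()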

After the initial exploratory data analysis one can continue with feature engineering. Although there are various definitions for this, by feature engineering we usually mean generating new features. One can generate new features by delving deeper into the domain of the problem. Fields like healthcare, telecom and finance each have their own characteristics and specifics regarding how typical variables affect others, e.g. how the interest rates of the ECB and the Fed influence the behaviour of clients when subscribing to financial products such as annuities.
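
As a simple illustration (with made-up column names from a finance-like setting), new features can be derived directly with pandas:

import pandas as pd

loans = pd.DataFrame({
    "client_id": [1, 2, 3],
    "loan_amount": [10000, 25000, 8000],
    "annual_income": [40000, 60000, 32000],
    "start_date": pd.to_datetime(["2021-03-01", "2020-11-15", "2022-01-10"]),
})

# ratio feature: how large is the loan relative to the client's income
loans["loan_to_income"] = loans["loan_amount"] / loans["annual_income"]

# date-derived features
loans["start_month"] = loans["start_date"].dt.month
loans["start_quarter"] = loans["start_date"].dt.quarter

print(loans)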

Feature engineering can take considerable time, so there has been a lot of development in automated feature engineering, with featuretools being an example of a very useful library that helps generate new features more quickly.
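
A minimal sketch of featuretools' deep feature synthesis, assuming the featuretools 1.x API (EntitySet.add_dataframe, ft.dfs); the data frames and column names are made up, and the exact API has changed between versions, so check the featuretools documentation:

import pandas as pd
import featuretools as ft

customers = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "join_date": pd.to_datetime(["2020-01-01", "2020-02-15", "2020-03-10"]),
})
transactions = pd.DataFrame({
    "transaction_id": [10, 11, 12, 13],
    "customer_id": [1, 2, 1, 3],
    "amount": [20.0, 35.5, 12.0, 50.0],
})

es = ft.EntitySet(id="retail")
es = es.add_dataframe(dataframe_name="customers", dataframe=customers, index="customer_id")
es = es.add_dataframe(dataframe_name="transactions", dataframe=transactions, index="transaction_id")
es = es.add_relationship("customers", "customer_id", "transactions", "customer_id")

# deep feature synthesis: automatically aggregates transactions per customer
feature_matrix, feature_defs = ft.dfs(entityset=es, target_dataframe_name="customers")
print(feature_matrix.head())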

In the initial data analysis, data engineers or machine learning engineers can play a helpful role in preparing the data for the data scientist.

After this first phase, the data science consultant then has to turn to the next one – building and training a machine learning model.
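
The typical starting point of that phase, sketched here with scikit-learn on synthetic data: split the data into training and test sets, fit one of the models from the list above and evaluate it on the held-out part.

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = RandomForestClassifier(n_estimators=300, random_state=42)
model.fit(X_train, y_train)

print("test accuracy:", accuracy_score(y_test, model.predict(X_test)))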


Scraping with lxml library – use cssselect or xpath selectors

When working on deep learning consulting or natural language processing consulting projects, one often needs to obtain training data sets from the internet.

Often this involves scraping websites for either textual data (most often) or images (less often).

What is scraping?

Web scraping is an approach that uses bots or other programs to automate access to and extraction of data from websites. This can save a lot of time for you or for someone else, e.g. co-workers or other people in your organisation.

In one of the earliest phases of the internet, around 2000-2005, a popular programming language for scraping websites was Perl (I still remember using its Mechanize module). This has changed in the last decade; I now most often resort to Python for this purpose.

One can use special-purpose libraries for this, such as Beautiful Soup or Scrapy. Scrapy has a lot of logic built in to deal with the various parts of scraping; I especially like Scrapy rules, which allow you to control how your scraping spiders follow links.
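
As a rough sketch (the domain and URL patterns are placeholders), a Scrapy CrawlSpider with such rules could look like this:

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class ArticleSpider(CrawlSpider):
    name = "articles"
    allowed_domains = ["example.com"]      # placeholder domain
    start_urls = ["https://example.com/"]

    rules = (
        # follow category pages without parsing them
        Rule(LinkExtractor(allow=r"/category/")),
        # parse pages that look like articles with parse_article
        Rule(LinkExtractor(allow=r"/article/"), callback="parse_article"),
    )

    def parse_article(self, response):
        yield {
            "url": response.url,
            "title": response.css("h1::text").get(),
        }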

An important part of collecting data from websites is also the parsing. If the websites are structured, the time savings are especially high, as one can define a set of rules and then scrape 10000 pages with them. Scraping websites can often be the only way to obtain good data sets for deep learning consulting.

A very useful library for parsing XML and HTML files is lxml. Lxml is actually a Pythonic binding for libxml2 and libxslt.

Installation of lxml

To install lxml, just use Python's package manager pip:

pip install lxml

Including lxml in python projects

Examples for including lxml in projects:

from lxml import etree
from lxml import html

Dealing with the Element class

Element is the key object for the ElementTree API. More information is available at http://effbot.org/zone/element-index.htm. Example of creating a new element with lxml:

new = etree.Element('new1')

To access the tag name of the new element:

print(new.tag)
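
Building on that, a child element can be added with etree.SubElement and the whole tree can be serialised back to a string with etree.tostring:

from lxml import etree

new = etree.Element('new1')
child = etree.SubElement(new, 'child1')  # attach a child element to new1
child.text = 'some text'
child.set('id', '1')                     # set an attribute on the child

print(etree.tostring(new, pretty_print=True).decode())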

Using selectors to find information in files

Two typical types of selectors with lxml:

  • cssselect (if you do not have it installed, add it with pip install cssselect)
  • xpath

Let us look at an example of their use.

from lxml import html

example = '<html><a class="class1">This is some text. </a><a class="class2">This is another text.</a></html>'

tree = html.fromstring(example)

print(tree.xpath("//a[@class='class1']/text()")[0])

print(tree.cssselect("a[class='class2']")[0].text)

will result in:

This is some text.
This is another text.
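
Both selector types can also return attribute values rather than text. Continuing with the tree object from the example above:

print(tree.xpath('//a/@class'))                        # ['class1', 'class2']
print([a.get('class') for a in tree.cssselect('a')])   # ['class1', 'class2']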