Website and product categorizations

What Is Website Categorization?

Website categorization is the task of classifying a website into one of a set of predefined categories, which together form a taxonomy. This is usually done with a supervised text classification machine learning model, because in production deployments one often needs to classify a large number of texts.
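As an illustration, a minimal supervised text classifier could be trained with scikit-learn along the lines of the sketch below; the training texts and category labels are placeholders, not a real data set:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Placeholder training data: website texts and their Tier 1 categories
texts = [
    "running shoes and sports jerseys on sale",
    "garden furniture and power drills for your home",
    "smartphones, laptops and other consumer electronics",
]
labels = ["Sporting Goods", "Home & Garden", "Electronics"]

# TF-IDF features plus logistic regression is a common text classification baseline
model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
model.fit(texts, labels)
print(model.predict(["cheap wireless headphones"]))  # predicts one of the placeholder categories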

Typical website categories

How many categories the taxonomy contains depends on the problem. In an ecommerce setting, for example, the top (Tier 1) level of categorization usually has 21 categories:

Apparel & Accessories 226
Home & Garden 115
Sporting Goods 50
Health & Beauty 46
Hardware 37
Electronics 30
Animals & Pet Supplies 25
Office Supplies 19
Food, Beverages & Tobacco 13
Toys & Games 13
Business & Industrial 10
Baby & Toddler 6
Luggage & Bags 6
Arts & Entertainment 4
Software 4
Furniture 4
Religious & Ceremonial 3
Mature 2
Cameras & Optics 2
Media 1
Vehicles & Parts 1

On lower tiers, the Google product taxonomy then has 190+ categories on Tier 2 and 1,000+ categories on Tier 3.

Website categorization is usually available as an API or a tool, so it can easily be integrated into your own products and services.
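For example, calling such an API from Python might look roughly like the following sketch; the endpoint URL, API key and response fields are placeholders rather than a real service:

import requests

# Hypothetical categorization endpoint and API key - placeholders, not a real service
API_URL = "https://api.example.com/v1/categorize"
API_KEY = "your_api_key"

response = requests.get(API_URL, params={"url": "https://www.example-shop.com", "key": API_KEY})
print(response.json())  # e.g. {"category": "Apparel & Accessories", "confidence": 0.92} - illustrative only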

 

Useful resources

Here are a few useful resources and interesting websites.

Links to science journals and tools for keyword and niche research:

https://linktr.ee/nicheskeywords

BittsAnalytics on joy.link webpage:

https://joy.link/bittsanalytics

Article on factorization machines and real-time bidding:
https://wakelet.com/wake/V2lw9jQb3SlDi2YxurdrX

Interesting website with useful tips from developers:

https://devrant.com/users/aidatascientist

Interesting images produced with transfer style deep learning application:

https://ello.co/datascientist1

ETH support and resistance levels

The crypto market is currently in a bear market, with Bitcoin falling from around 60,000 USD to below 40,000 USD and briefly dipping as low as 31,000 USD. One of the major reasons was negative comments from Elon Musk, alongside other factors such as rising concerns about inflation in the United States and the negative slope of the BTC futures curve.

After a steep rise over several months, some correction is, however, not that surprising. After all, the Bitcoin market cap went over 1 trillion USD at the all-time high, which is comparable to some of the biggest companies such as Apple, Amazon, Google and Microsoft. Ethereum (ticker ETH), the second largest cryptocurrency, also attained a market capitalization of almost 500 billion USD, another milestone.

In recent days another factor has been testing BTC and ETH support and resistance levels, namely China's decision to be more stringent in its dealings with cryptocurrency miners. Mining is what keeps Bitcoin transactions being processed all over the world, so any moves against the mining industry are a negative factor.

So it is no wonder that the price was weak on this latest news as well. At BittsAnalytics, ETH support and resistance levels, as well as support and resistance levels for 300+ cryptocurrencies, are computed on a daily basis.
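The exact methodology used at BittsAnalytics is not described here, but a common, simple way to approximate support and resistance levels is to look for local minima and maxima in the daily closing prices, as in this rough sketch:

import pandas as pd

def support_resistance_levels(close: pd.Series, window: int = 5):
    # Rough heuristic: local minima become support levels, local maxima become resistance levels
    supports, resistances = [], []
    for i in range(window, len(close) - window):
        segment = close.iloc[i - window:i + window + 1]
        if close.iloc[i] == segment.min():
            supports.append(close.iloc[i])
        if close.iloc[i] == segment.max():
            resistances.append(close.iloc[i])
    return sorted(set(supports)), sorted(set(resistances))

# Usage (eth_daily_close would be a pandas Series of daily ETH closing prices):
# supports, resistances = support_resistance_levels(eth_daily_close)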

Here is an example screenshot of ETH support and resistance levels:

We can see that Ethereum made such rapid gains in the recent period that there are almost no major support levels above 1,800 USD. This gain came after Bitcoin hit a bit of a price ceiling, so some investors rotated from Bitcoin and Dogecoin into Ethereum.

It will be interesting to watch what happens with Ethereum and Bitcoin over the next few weeks.

Data Science Consultant Role

Data science has emerged as one of the most important fields in recent years, capable of providing organizations across industries with methods to gain insights from their often vast collections of data.

A data science consultant most often uses machine learning or deep learning models to address their clients' problems.

Typical machine learning approaches include both supervised and unsupervised models. In the supervised approach, the data science consultant has labelled data sets available for the domain of application. The data set in this case consists of values for a set of features (input variables) and a target variable that we want to predict.

The purpose of a supervised machine learning model is to learn from instances of the data set, with the goal of predicting the value of the target variable from feature values on previously unseen examples.

If the target variable is categorical, e.g. whether a customer will buy a product or not, then we are talking about a classification problem. If the target variable is continuous, then the problem is known as a regression problem.

There are various methods available to train a machine learning model for classification and regression problems (a short training sketch follows after the list):

– linear regression (Lasso, Ridge)

– logistic regression

– decision trees

– random forest

– support vector machines

– gradient boosting machines, LightGBM and XGBoost

– deep neural nets, convolutional neural networks, Long-Short Term Memory (LSTM)
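As a minimal illustration of the supervised workflow described above, here is a sketch of training a random forest classifier with scikit-learn; the data set and feature names are made up for the example:

import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Placeholder data set: features and a binary target (will the customer buy the product?)
df = pd.DataFrame({
    "age": [25, 40, 31, 52, 46, 23, 36, 58],
    "monthly_income": [2000, 3500, 2800, 5000, 4200, 1800, 3100, 6000],
    "visits_last_month": [1, 4, 2, 7, 5, 0, 3, 8],
    "bought": [0, 1, 0, 1, 1, 0, 0, 1],
})

X = df[["age", "monthly_income", "visits_last_month"]]
y = df["bought"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
print(accuracy_score(y_test, model.predict(X_test)))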

Before selecting a particular machine learning model for the assigned problem, an important task of a data science consultant is to perform initial data preparation, feature selection and feature engineering.

Initial exploratory data analysis and feature selection can involve many different things. We can, for example, compute Pearson correlations between numerical features or compute Cramér's V to determine associations between categorical features.
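As an example, with pandas and scipy these measures can be computed roughly as follows; note that Cramér's V is not built into scipy, so it is derived here from the chi-squared statistic, and the column names are placeholders:

import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

def cramers_v(x: pd.Series, y: pd.Series) -> float:
    # Cramer's V for two categorical features, derived from the chi-squared statistic
    confusion = pd.crosstab(x, y)
    chi2 = chi2_contingency(confusion)[0]
    n = confusion.to_numpy().sum()
    r, k = confusion.shape
    return float(np.sqrt(chi2 / (n * (min(r, k) - 1))))

# Pearson correlation between two numerical features of a DataFrame df (placeholder columns):
# df["age"].corr(df["monthly_income"])
# Cramer's V between two categorical features:
# cramers_v(df["education_level"], df["customer_segment"])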

Categorical features can often be divided into nominal and ordinal features. Ordinal features are those where we can define an order between values but cannot compute a distance (e.g. a Euclidean one) between different values. An example of an ordinal feature would be education level: primary school, high school, university and so on.
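In code, an ordinal feature such as education level can be encoded while preserving its order, for example with an ordered pandas categorical (a small sketch):

import pandas as pd

education = pd.Series(["high school", "primary school", "university", "high school"])
levels = ["primary school", "high school", "university"]  # the order is meaningful

# An ordered categorical keeps the ranking; .cat.codes yields integer codes 0, 1, 2
education_ordinal = education.astype(pd.CategoricalDtype(categories=levels, ordered=True))
print(education_ordinal.cat.codes.tolist())  # [1, 0, 2, 1]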

Another common analytical step for a data science consultant is to generate kernel density estimate (KDE) plots, which show how the values of a feature are distributed.

Pairplots, which show how features interact, are also often useful and are frequently generated for a larger number of features. A short seaborn sketch of both plot types follows.
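Both plot types can be generated with seaborn in a few lines; the DataFrame and column names below are placeholders:

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Small placeholder DataFrame; in practice this would be the project's data set
df = pd.DataFrame({
    "age": [25, 40, 31, 52, 46, 23, 36, 58],
    "monthly_income": [2000, 3500, 2800, 5000, 4200, 1800, 3100, 6000],
    "visits_last_month": [1, 4, 2, 7, 5, 0, 3, 8],
})

sns.kdeplot(data=df, x="monthly_income")  # KDE plot: distribution of a single feature
plt.show()

sns.pairplot(df)  # pairplot: pairwise relationships between all numerical features
plt.show()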

After initial exploratory data analysis one can continue with feature engineering. Although there are various definitions, by feature engineering we usually mean generating new features. One can generate new features by delving deeper into the domain of the problem. Fields such as healthcare, telecom and finance often have their own characteristics and specifics regarding how typical variables affect each other, e.g. the effect of ECB and Fed interest rates on the behavior of clients when subscribing to financial products such as annuities.

Feature engineering can take some time, so there has been a lot of development in the field of automated feature engineering, with featuretools being an example of a very useful tool that helps generate new features more quickly.
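A minimal featuretools sketch might look like the following; note that the featuretools API has changed between releases, so the calls below assume featuretools 1.x, and the tables and columns are placeholders:

import pandas as pd
import featuretools as ft

customers = pd.DataFrame({"customer_id": [1, 2], "signup_year": [2019, 2021]})
transactions = pd.DataFrame({
    "transaction_id": [10, 11, 12],
    "customer_id": [1, 1, 2],
    "amount": [25.0, 40.0, 15.0],
})

es = ft.EntitySet(id="retail")
es = es.add_dataframe(dataframe_name="customers", dataframe=customers, index="customer_id")
es = es.add_dataframe(dataframe_name="transactions", dataframe=transactions, index="transaction_id")
es = es.add_relationship("customers", "customer_id", "transactions", "customer_id")

# Deep feature synthesis automatically generates aggregated features per customer
feature_matrix, feature_defs = ft.dfs(entityset=es, target_dataframe_name="customers",
                                      agg_primitives=["mean", "sum", "count"])
print(feature_matrix.head())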

In the initial data analysis, data engineers or machine learning engineers can play a helpful role in preparing data for the data scientist.

After this first phase, the data science consultant then has to turn to the next one – building and training a machine learning model.


Scraping with lxml library – use cssselect or xpath selectors

During work on deep learning consulting or natural language processing consulting, one often encounters the need to obtain data sets for training from the internet.

Often this involves scraping websites for textual data (most often) or images (less often).

What is scraping?

Web scraping is an approach that uses bots or other programs to automate access to and extraction of data from websites. This can save a lot of time for you or someone else, e.g. co-workers or other people in your organisation.

In one of the earliest phases of the internet, 2000-2005, a popular programming language for scraping websites was Perl (I still remember using its Mechanize module). This has changed in the last decade; I now most often resort to Python for this purpose.

One can use special purpose libraries for this, like Beautiful Soup or Scrapy. Scrapy has a lot of logic built in to deal with various parts of scraping; I especially like Scrapy rules, which allow you to control your scraping spiders.
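A small sketch of a Scrapy CrawlSpider using rules might look like this; the domain, URL pattern and CSS selectors are placeholders for whatever site you are scraping:

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class ArticleSpider(CrawlSpider):
    name = "articles"
    allowed_domains = ["example.com"]          # placeholder domain
    start_urls = ["https://example.com/blog"]  # placeholder start page

    # Rules tell the spider which links to follow and which callback should parse them
    rules = (
        Rule(LinkExtractor(allow=r"/blog/\d+"), callback="parse_article", follow=True),
    )

    def parse_article(self, response):
        yield {
            "title": response.css("h1::text").get(),             # placeholder selector
            "text": " ".join(response.css("p::text").getall()),  # placeholder selector
        }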

An important part of collecting data from websites is also the parsing. If the pages are structured, the time savings are especially high, as one can define a set of rules and then scrape 10,000 pages with them. Scraping websites can often be the only way to get good data sets for deep learning consulting.

A very useful library for parsing XML and HTML files is lxml. lxml is actually a Pythonic binding for libxml2 and libxslt.

Installation of lxml

To install lxml, just use Python's package manager pip: pip install lxml

Including lxml in python projects

Examples of including lxml in projects:

from lxml import etree
from lxml import html

Dealing with element class

Element is the key object for the ElementTree API. More information is available at http://effbot.org/zone/element-index.htm. Example of creating a new element with lxml:

new = etree.Element('new1')

To access the name of the new element: print(new.tag)
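Building on this, a child element can be added and the tree serialized back to a string (a short sketch):

child = etree.SubElement(new, 'child1')
child.text = 'Some text inside the child element.'
print(etree.tostring(new, pretty_print=True).decode())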

Using selectors to find information in files

Two typical types of selectors with lxml:

  • cssselect (if you do not have it installed, install it with pip install cssselect)
  • xpath

Let us look at an example using both.

from lxml import html

example = '<html><a class="class1">This is some text. </a><a class="class2">This is another text.</a></html>'

tree = html.fromstring(example)

print(tree.xpath("//a[@class='class1']/text()")[0])

print(tree.cssselect("a[class='class2']")[0].text)

will result in:

This is some text.
This is another text.