Alibaba Cloud Machine Learning Platform for AI: Image Classification by Caffe


By Garvin Li

The Image Classification by TensorFlow section describes how to use the TensorFlow deep learning framework to classify CIFAR-10 images. This section introduces another deep learning framework, Caffe. With Caffe, you can train an image classification model by editing configuration files.

Make sure that you have already read the Deep Learning section and activated deep learning in Alibaba Cloud Machine Learning Platform for AI (PAI).


This experiment uses the open-source CIFAR-10 dataset, which contains 60,000 images of 32 x 32 pixels. The images are divided into 10 categories: airplanes, automobiles, birds, cats, deer, dogs, frogs, horses, ships, and trucks. The following figure shows the dataset.

The dataset is already stored in JPG format in the public dataset of Alibaba Cloud Machine Learning Platform for AI. Users can directly enter the following paths in the Data Source Path field of the deep learning components:

  • Testing data: oss://
  • Training data: oss://

Enter the path, as shown in the following figure:

Format Conversion

The Caffe deep learning framework currently supports only certain input formats. Therefore, you must first use the format conversion component to convert the JPG images.

  • OSS Path Storing Images and Table Files: set this parameter to the path of the public dataset predefined in Alibaba Cloud Machine Learning Platform for AI.
  • Output OSS Path: user-defined OSS path.

After format conversion, the following files are generated in the output OSS path, including one set of training data and one set of testing data.

Record the corresponding paths for editing the Net file. The following is an example of the data paths:

  • Training data file list: bucket/cifar/train/data_file_list.txt
  • Training data mean file: bucket/cifar/train/data_mean.binaryproto
  • Testing data file list: bucket/cifar/test/data_file_list.txt
  • Testing data mean file: bucket/cifar/test/data_mean.binaryproto

Caffe Configuration Files

Enter the preceding paths in the Net file, as follows:
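A data layer in the Net file typically references the converted file list and the mean file. The sketch below is illustrative only: the exact layer types and fields accepted by PAI's Caffe component should be checked against the Caffe and PAI documentation, and the paths are the example paths recorded above.

```
layer {
  name: "data"
  type: "Data"
  top: "data"
  top: "label"
  include { phase: TRAIN }
  transform_param { mean_file: "bucket/cifar/train/data_mean.binaryproto" }
  data_param {
    source: "bucket/cifar/train/data_file_list.txt"
    batch_size: 64
  }
}
```

A matching layer with `phase: TEST` would point at the testing-data paths.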

Edit the Solver file:
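A minimal solver.prototxt, along the lines of Caffe's own CIFAR-10 examples, looks like the sketch below. The paths and hyperparameter values are illustrative, not taken from the original figure.

```
net: "bucket/cifar/train_val.prototxt"
test_iter: 100
test_interval: 500
base_lr: 0.001
momentum: 0.9
weight_decay: 0.004
lr_policy: "fixed"
display: 100
max_iter: 4000
snapshot: 1000
snapshot_prefix: "bucket/cifar/model/cifar"
solver_mode: GPU
```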

Run the Experiment

  1. Upload the Solver and Net files to OSS, drag and drop the Caffe component to the canvas, and connect the component to the data source.
  2. Set the parameters in the Caffe component, as shown in the following figure. Set the Solver OSS Path to the OSS path of the uploaded Solver file and then click Run.
  3. View the generated image classification model file in the model storage path on OSS. You can use the generated model to classify images.

  4. To view the corresponding logs, refer to Logview in Image Classification by TensorFlow.


How ML differs from Statistics

Classical statistics, as taught in undergraduate and even graduate university courses, starts with descriptive statistics, moves into distribution fitting, and proceeds all the way to complex multivariate analysis, essentially covering hypothesis testing, correlation, regression, factor analysis, and principal component analysis.
Statistics assumes a lot of a priori knowledge about the data and its properties and does not necessarily involve much trial and error or tinkering.

New-age machine learning covers a wide array of techniques and algorithms that learn from the data themselves. Deep learning, supervised learning, and reinforcement learning include very interesting algorithms that learn from wide arrays of data: data becomes the input and the model becomes the output. This happens with little human intervention (except in supervised learning, where labels must be provided). This is the real beauty of ML over conventional statistics. Although new-age ML (covering CNNs, deep learning, and reinforcement learning) draws a lot from statistics, cognitive biology, neuroscience, mathematics, and control theory, most ML applications are very new and have had a large technical and business impact.

Reinforcement learning uses classical optimization functions, and the behaviorism developed in psychology by Skinner comes into play in terms of "reward and punishment": the behavior of an RL algorithm is shaped the same way a child's behavior is shaped by parents. Dynamic programming from classical optimization (operations research) is used along with Bellman's optimality conditions and MDPs (Markov decision processes).

RL lets you start "learning" with minimal domain or problem knowledge. The algorithm has the power to learn and arrive at its own parameters through error conditioning and reward optimization. Algorithms such as temporal difference learning, deep Q-learning, and actor-critic methods (e.g., A3C) give RL truly domain-independent ways to learn in many new domains without requiring domain knowledge.

The ML tribe (the collection of AI scientists, data analysts, ML practitioners, students, professors, and industry professionals) differs from old-school statistics in many ways. Statistics assumes a lot of knowledge about the system; statistical thinking is in many ways top-down, a priori thinking. ML thinking (the broad umbrella of algorithms in RL and deep learning) is inherently a posteriori: it does not assume much and is bottom-up. As Richard Dawkins puts it, "Darwinian thinking is a mindless, purposeless bottom-up process involving R&D, trial and error, and tinkering all the way." ML resembles our own biological evolution, and just as biology evolves, ML algorithms are also evolving. The big advantage is that ML algorithms evolve much faster than gradual, slow biological evolution.

ML works a lot like biological processes seen elsewhere in nature. Sometimes ML does not try to optimize in the classical sense (finding the best possible solution in a large solution space). Instead, it follows a process of sophisticated tinkering, finding one sub-optimal solution and then moving ahead. This process keeps learning continuous and, in many ways, autonomous.

Statistics used to require careful sampling, and meticulously planned data cleaning would often precede a rigorous statistical analysis. ML works with existing data and tries to draw inferences from it.

ML v/s Statistics

One family of ML algorithms, Bayesian inference, couples basic Bayesian probability with state-space generators such as Monte Carlo simulation to create simulated data where data is non-existent or inaccurate. In this way, ML algorithms build a kind of robustness against data quality problems.

Video Classification with Deep Learning

Problem Statement

Imagine that you have a tremendous amount of video and you would like to classify it based on what occurs inside, and of course you don't want to hire people to sit in front of a computer and do the job for you 🙂 That is an option, but a highly expensive and error-prone one.


  • There are both spatial and temporal content to be considered. Yes, a video consists of lots of images being viewed one after the other and each frame has a meaning but the order is important too. Would it be meaningful to reorder the images and view the resulting video? Probably not!
  • Who will process that many frames? Do we need to process each and every frame to make assumptions about the content of a video? What if you only watch every 10th frame?
  • Training effort will be huge! Video classification is not a simple task. Apart from labeling training data, finding the architecture and hyperparameters of an optimum neural network will demand a vast amount of resources.

Attacking the Problem

Ok, we are clear about the problem and challenges. It is now time to think about what can be done. What I will list below is by no means an exhaustive list but will give you enough perspective.

  1. Create a deep neural network with neurons processing each and every pixel of frames as separate features.
  2. Choose a Convolutional Neural Network (CNN) to decrease the number of features to be processed. Nearby pixels do not carry independent characteristics after all.
  3. Utilize a Recurrent Neural Network (RNN) to capture the order between frames for better classification.
  4. Construct a hybrid of a CNN and RNN.

What I Will Demonstrate

I will go with the 4th option above. I do not want to go into the never-ending training and testing cycles of a huge network trying to process every pixel. That is where a CNN comes in. Moreover, I also do not want to lose the temporal information hidden inside the videos.

In terms of the technology stack, I preferred TensorFlow. We will not interact with TensorFlow directly, though, as it will require many many lines of code. That is where Keras comes into the picture.

We will also not build the CNN part from scratch but instead do some transfer learning with an Inception v3 CNN.

Okay, here are the steps we will follow:

  1. Extraction of image frames from videos
  2. Training the top layer of an Inception v3 CNN with the input images
  3. Extraction of a sequence of images from videos with a constant size and equally spaced
  4. Training an LSTM RNN for classifying videos based on the image frames


I obtained the UCF101 dataset, which has 13,320 videos assigned to 101 action categories.

Example Videos

Folders were used for assigning the video files to their respective categories. Each distinct subject within the videos is assigned to a group:

Folder Names Represent Categories

File naming convention for videos is as follows:

  • Sample Name
  • Category
  • Group Number
  • Index number within the Group

Considerations for Training/Test Data Split

  • A group cannot span across datasets
    Videos within the same group were recorded with the same subject. For example, if the category is YoYo, the same person was recorded within the same group. Therefore, using videos from the same group in both datasets would give high accuracy, but that would not be realistic.
  • Regular Expression for Extracting the Group and Index Numbers
  • Folder Name as the Category Name
  • Groups to be Shuffled During Assignment
  • Same Group Assignment for both CNN and RNN Training
    If a group is assigned to one dataset during CNN training and another dataset during RNN training, the results will not be reliable: one of the networks would have seen that input during training while the other uses it for validation.

Inception v3

I will use the Inception v3 CNN for transfer learning, so it is worth giving a brief introduction for those new to the idea.

  • What is an inception network?
    An inception network consists of multiple inception blocks chained together.
  • What is an inception block?
    A single inception block tries to find the perfect combination of CONV blocks with different sizes in addition to a MAX POOL layer.

Inception Block

Inception Network

How I Fed Images into Inception v3

  • Every 25th frame is chosen for input to decrease the amount of data.
  • Extracted frames are written into a directory structure starting with the type of dataset.
  • 95% of the input data goes for training and the rest for validation.
  • All input images are first passed through the CNN up to the top layer. The output dimension of the last layer is 6 x 8 x 2048:

    mixed10 (Concatenate)  (None, 6, 8, 2048)  0  activation_86[0][0], mixed9_1[0][0], concatenate_2[0][0], activation_94[0][0]

  • Data shape for the training data features and labels:
    ((85807, 6, 8, 2048), (85807, 101))
  • Data shape for the validation data features and labels:
    ((7734, 6, 8, 2048), (7734, 101))
  • Training the top part with a Dense layer of 256 neurons, a Dropout layer with 0.5 probability, and a softmax layer:
    from keras import models, layers

    model = models.Sequential()
    model.add(layers.Dense(256, activation='relu', input_dim=6 * 8 * 2048))
    model.add(layers.Dropout(0.5))
    model.add(layers.Dense(101, activation='softmax'))
  • Epoch size of 30 and batch size of 64 were used.
  • Accuracy and loss graphs for training and validation data:

Training and Validation Charts
  • The main objective here is to train the dense layer and not to achieve the highest accuracies possible.
  • Extracting the dense layer out of the trained network and concatenating that to the Inception v3 CNN as the top layer:
    model2 = models.Sequential()
  • Viewing the summary of the final CNN:
    Layer (type)                 Output Shape              Param #
    =================================================================
    inception_v3 (Model)         (None, 6, 8, 2048)        21802784
    _________________________________________________________________
    flatten_1 (Flatten)          (None, 98304)             0
    _________________________________________________________________
    dense_1 (Dense)              (None, 256)               25166080
    =================================================================
    Total params: 46,968,864
    Trainable params: 46,934,432
    Non-trainable params: 34,432
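The parameter counts in the summary can be sanity-checked with a little arithmetic: the flattened Inception output has 6 * 8 * 2048 features, and a dense layer's parameter count is inputs * units + biases.

```python
# Flattened Inception v3 feature map
flat = 6 * 8 * 2048
print(flat)                      # 98304, matching flatten_1

# Dense(256) on top: one weight per input-unit pair, plus one bias per unit
dense_params = flat * 256 + 256
print(dense_params)              # 25166080, matching dense_1

# Total = Inception v3 base + the new dense layer
total = 21802784 + dense_params
print(total)                     # 46968864, matching "Total params"

# Trainable and non-trainable parameters also add up to the total
assert 46934432 + 34432 == total
```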

Custom RNN

Enough details about the CNN. Let's turn our attention to the RNN network now.

  • A fixed sequence length (80) is used.
  • Frames in each sequence are sampled at equal distances across the video.
  • Each image is first fed through the CNN to obtain a feature vector of length 256.
  • Data shape for the training data features and labels:
    ((11283, 80, 256), (11283, 101))
  • Data shape for the validation data features and labels:
    ((1008, 80, 256), (1008, 101))
  • Architecture has an LSTM of 200 neurons and a softmax layer:
    from keras.models import Sequential
    from keras.layers import LSTM, Dense

    model = Sequential()
    model.add(LSTM(200, input_shape=(80, 256)))
    model.add(Dense(101, activation='softmax'))
  • The network can learn the features from the training set very easily, so there is no need to add more layers and further complicate the architecture.
  • Epoch size of 30 and batch size of 64 were used.
  • Accuracy and loss graphs for training and validation data:

Training and Validation Charts
  • The network achieved 100% accuracy on the training set very quickly but got stuck at around 72-73% on the validation set.
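The fixed-length, equally spaced sampling described above can be sketched in plain Python (the function name is mine):

```python
def equally_spaced_indices(total_frames, seq_len=80):
    """Pick seq_len frame indices spread evenly across a clip."""
    if total_frames <= seq_len:
        return list(range(total_frames))  # short clip: keep every frame
    step = (total_frames - 1) / (seq_len - 1)
    return [round(i * step) for i in range(seq_len)]

idx = equally_spaced_indices(400)
print(len(idx), idx[0], idx[-1])   # 80 0 399
```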

My Thoughts

  • Accuracy values for the attempted solutions on the UCF101 dataset are around 70-75%.
  • This suggests that the training data is not rich enough for a model to grasp the essential features of the videos and predict the ones in the validation set.
  • Using different architectures with various neuron sizes also did not improve the validation accuracy, which again suggests that the above statement holds.
  • As mentioned earlier, the frames within a video are usually very similar in content, so increasing the RNN sequence length did not add value either.
  • Although collecting it is a very labor-intensive task, a sufficiently large set of videos with rich content should give satisfying results with this hybrid architecture.

Do You Prefer to Watch Instead of Reading?

Well, here is my YouTube video with a live explanation of this study:

FPGA Research and Development in Nepal

FPGAs are the reconfigurable chip technology that has dominated the electronic hardware design market since the 1990s. FPGA (Field Programmable Gate Array) technology allows hardware engineers to design, test, and implement different logic designs, architectures, and processing systems. The global FPGA market is becoming a multi-billion-dollar industry, according to MarketsandMarkets. The main players in the FPGA market are Xilinx and Altera (acquired by Intel); aside from these two bulls, there are smaller players too, such as Lattice Semiconductor and Microsemi.

PYNQ, an open-source FPGA platform from Xilinx that allows designs to be implemented in Python

Digitronix Nepal has been running its FPGA Research and Development Initiative since 2015, and has worked on electronic hardware design, research, and development since 2013.

Why is Digitronix Nepal initiating FPGA R&D in Nepal?

Nepal, a developing nation, has mostly been a technology-consuming market rather than a design and development center, although there are now many software development companies in Nepal representing the country in the global arena. Electronic design and automation is a wide market globally, but Nepal has not been able to harness this opportunity even a bit. Digitronix Nepal believes that electronic hardware design can generate many opportunities for Nepalese engineers and professionals, which is why we have worked in this field since 2013. As for FPGA research and development, we believe FPGAs are a hardware design platform where designs and intellectual property (so-called IPs) can be marketed directly, so we do not need to manufacture hardware ourselves.

StartUp Scene at New Business Age Magazine, May 2017 (Click for more)

What has happened with the FPGA Research and Development Initiative so far?

Digitronix Nepal has signed MoUs (Memorandums of Understanding) with Nepal's top engineering colleges, including IOE Pulchowk Campus, Kathmandu Engineering College, Himalaya College of Engineering, Kathford Int'l College of Engineering and Management, Sagarmatha Engineering College, National College of Engineering, and Kantipur Engineering College, to create FPGA research and development centers at the respective colleges. Digitronix Nepal also helps these centers obtain state-of-the-art resources: FPGAs and software tools.

MoU between Digitronix Nepal and National College of Engineering

The hardware and software are used in those research and development centers to implement new design methods, develop systems, and research new ideas with FPGAs. Digitronix Nepal has also assisted those centers with technology transfer in this state-of-the-art design environment.

Digitronix Nepal has collaborated with different engineering colleges of Nepal to organize seminars on FPGA technology, FPGA design competitions, and interaction programs for enhancing skills and knowledge in engineering courses and professional companies. Some snapshots of these events are presented below:

News on First FPGA Design Competition, 2016 at Republica
News regarding the Second All Nepal FPGA Design Competition 2017 at Saptahik, Kantipur

Who benefits from this initiative?

This initiative provides the chance to build a skill set in a market-leading technology that can be marketed globally. Engineering faculty and students in electronics, computer, and electrical engineering gain global skills here in Nepal. Enthusiasts from those streams who want to pursue a career in FPGA, VLSI (very-large-scale integration) design, or ASIC (application-specific integrated circuit) design, as well as those planning further study in electronic engineering, computer engineering, computer science, or embedded system design, stand to gain opportunities well beyond what is currently available to them.

So what are the opportunities in the FPGA R&D field?

Globally there are many opportunities in FPGA design, and FPGA design skills also apply heavily to VLSI and ASIC design, so the overall area of opportunity is huge, covering FPGA design, VLSI design and verification, and ASIC design. You can visit Electronics Weekly, Indeed, and many other job portals and freelancing sites (Upwork, Freelancer, Fiverr, etc.).

In Nepal, Digitronix Nepal is offering internships and job opportunities in FPGA research and development. The career objective is implementing computer vision algorithms with neural networks and machine learning on FPGAs, and designing and implementing real-time video processing.

Internship Offering on Machine Learning and Neural Networks at Digitronix Nepal (for more click here)

So what are Digitronix Nepal's services?

Digitronix Nepal is currently working on FPGA-based IP (intellectual property) design for real-time image and video processing. We offer services in FPGA design based on RTL design (Verilog/VHDL), IP design for image and video processing, design and verification, IP migration, PCIe-based design support, PCB design, and more.

Our Projects can be viewed at: Digitronix Nepal’s Project

Digitronix Nepal also provides offline and online training. We have already trained faculty at Kantipur Engineering College and Khwopa College of Engineering, and students of Kathmandu Engineering College, Khwopa College of Engineering, Himalaya College of Engineering, and others.

As for online training, we currently offer six online courses on FPGA design and development.

Digitronix Nepal’s Online Courses at Udemy: Course Link

Thank You for reading this Article!

Feedback, comments, and suggestions are heavily welcomed on Digitronix Nepal's Facebook Page.


In previous posts we've discussed interpreting residual plots when evaluating linear models (here, here, here, and here). But what is a residual? A residual is the vertical distance between a data point and the regression line. In other words, a residual is our prediction error for that data point. A smaller residual indicates a better fit to that data point.

A simple plot will demonstrate:

plot(dist ~ speed, data = cars)
fit <- lm(dist ~ speed, data = cars)
abline(fit)
segments(cars$speed, cars$dist, cars$speed, fitted(fit), col = "yellow")

The yellow highlights are the residuals; shorter lines indicate smaller residuals. Points below the regression line have a negative residual, while those above the line have a positive residual. For an ordinary least squares fit with an intercept, the sum (and therefore the mean) of all residuals is always zero.
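The zero-sum property can be checked numerically. A pure-Python sketch of the same kind of fit the R code performs with lm (the sample numbers here are made up):

```python
# Made-up points standing in for speed/dist pairs
x = [4, 7, 9, 12, 15, 18]
y = [2, 13, 10, 26, 26, 42]

# Closed-form OLS slope and intercept for simple linear regression
n = len(x)
mx, my = sum(x) / n, sum(y) / n
slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sum((a - mx) ** 2 for a in x)
intercept = my - slope * mx

# Residual = observed - predicted; with an intercept, they cancel out
residuals = [b - (intercept + slope * a) for a, b in zip(x, y)]
print(sum(residuals))   # ~0 up to floating-point error
```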

We can evaluate how well the model fits overall by evaluating the Root Mean Squared Error, which is the standard deviation of all of the residuals. We’ll cover that topic next.

Google Digital News Initiative Grant for Developing MorphL: AI-Driven UI

MorphL is a machine learning platform that empowers digital publishers to optimize engagement and conversion rates by means of predicting how users will interact with various UI elements.

The platform will record various UI micro-metrics and it will automatically test different variations to identify the optimum combination that produces the best results. By doing this, it’s like having a 24/7 in-house R&D department keeping the application’s UI always relevant and engaging, allowing digital publishers to focus on what they do best.

MorphL introduces a shift in the mindset and work process of digital publishers: the intrinsic ability for an application to assess how a particular UI element is impacting the engagement/conversion rate and automatically adapt to user behavior.

The Digital News Initiative (DNI) is a partnership between Google and publishers in Europe to support high-quality journalism through technology and innovation. Since 2016, the DNI Innovation Fund has evaluated more than 3,000 applications, carried out 748 interviews, and offered more than €73m in funding to 359 projects in 29 European countries.

The 50,000 EUR grant from Google DNI Fund comes as a confirmation of the platform’s potential to impact the future of UI development in the digital publishing space and that we’re entering a new era of UI development, one that will be impacted by AI (like many other aspects of our lives).

The project will be developed by our team at Appticles (multi-channel mobile publishing platform) in partnership with PressOne (an independent Romanian digital news publication) and we’re going to post updates on our progress right here on Medium, but you can also keep in touch by following us on Twitter & Facebook and star us on GitHub.

Deep learning frameworks and vectorization approaches for sentiment analysis

Introduction and Background

Data preparation

import re

def clean_tweet(tweet_raw):
    # Remove user names
    tweet_clean = re.sub(r'@.*? ', '', tweet_raw)
    tweet_clean = re.sub(r'@\_.*? ', '', tweet_clean)
    # Remove URLs (assuming they end at a space or at the end of the tweet)
    tweet_clean = re.sub(r'http://.*?( |$)', '', tweet_clean)
    # Strip quotes and unescape HTML entities
    tweet_clean = tweet_clean.replace('"', '')
    tweet_clean = tweet_clean.replace("'", '')
    tweet_clean = tweet_clean.replace('&lt;', '<')
    tweet_clean = tweet_clean.replace('&gt;', '>')
    # Collapse double spaces and trim leading whitespace
    tweet_clean = tweet_clean.replace('  ', ' ')
    tweet_clean = re.sub(r'^ +', '', tweet_clean)
    return tweet_clean.lower()
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 52, 4, 125, 171, 15, 8, 2453, 466, 28, 2785, 2, 3, 71, 21, 92, 121, 135, 26, 8, 199]
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 60, 62, 128, 92, 60, 6, 31, 62, 6, 63, 29, 127, 127, 30, 6, 63, 62, 62, 92, 6, 90, 62, 93, 6, 129, 132, 6, 30, 28, 132, 63, 128, 59, 63, 6, 127, 32, 125, 129, 6, 125, 31, 6, 18, 125, 129, 78, 6, 128, 6, 28, 125, 27, 6, 63, 62, 6, 129, 131, 59, 28, 6, 90, 131, 92, 6, 31, 62, 92, 128, 60, 28, 31, 6, 64, 128, 31, 28, 6, 129, 132, 6, 90, 93, 128, 127, 92, 27, 63]
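The arrays above are tweets turned into vocabulary indices, left-padded with zeros to a fixed length. With Keras one would normally use Tokenizer and pad_sequences; a stdlib sketch of the same idea (the helper names below are mine):

```python
from collections import Counter

def build_vocab(texts):
    """Map each word to an index by frequency; 0 is reserved for padding."""
    counts = Counter(w for t in texts for w in t.split())
    return {w: i + 1 for i, (w, _) in enumerate(counts.most_common())}

def vectorize(text, vocab, maxlen=140):
    """Word indices for a text, left-padded with zeros to maxlen."""
    idx = [vocab[w] for w in text.split() if w in vocab]
    return [0] * (maxlen - len(idx)) + idx[-maxlen:]

vocab = build_vocab(["good movie", "bad movie"])
print(vectorize("good movie", vocab, maxlen=6))   # [0, 0, 0, 0, 2, 1]
```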

Model Preparation

model.add(Convolution1D(64, 3, border_mode='same'))
model.add(Convolution1D(32, 3, border_mode='same'))
model.add(Convolution1D(16, 3, border_mode='same'))
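border_mode='same' (padding='same' in current Keras) zero-pads the input so the output length equals the input length. A pure-Python illustration of that shape rule for a single 1-D filter (not Keras's implementation, just the padding behaviour; like Keras, it computes a cross-correlation):

```python
def conv1d_same(xs, kernel):
    """Cross-correlate xs with kernel, zero-padded so len(out) == len(xs)."""
    k = len(kernel)
    pad = k // 2
    padded = [0] * pad + list(xs) + [0] * pad
    return [sum(kernel[j] * padded[i + j] for j in range(k))
            for i in range(len(xs))]

out = conv1d_same([1, 2, 3, 4], [1, 0, -1])
print(len(out), out)   # 4 [-2, -2, -2, 3]
```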



Character Level
Word Level (BoW)


Using Zipf’s Law To Improve Neural Language Models


In this article, I will explain what is Zipf’s Law in the context of Natural Language Processing (NLP) and how knowledge of this distribution has been used to build better neural language models. I assume the reader is familiar with the concept of neural language models.


The code to reproduce the numbers and figures presented in this article can be downloaded from this repository.


The analysis discussed in this article is based on the raw wikitext-103 corpus. This corpus consists of articles extracted from the set of Good and Featured Wikipedia articles and has over 100 million tokens.

Since I lack access to powerful computing resources, I performed this analysis only on the test set. Here's a sample of what the test set looks like:

The test set has the following properties:

  • Vocabulary size: 17,733
  • Number of tokens: 196,095

Here’s a list of the 10 most common words and their counts (number of times they appear in the corpus):

and here’s a list of the 10 least common words and their counts:

Zipf’s Law

In NLP, Zipf's Law is a discrete probability distribution that tells you the probability of encountering a word in a given corpus. The input is the rank of a word (in terms of its frequency), so you can use this distribution to ask questions like:

  • What is the probability of encountering the most common word in a corpus with 100,000 words?
  • What is the probability of encountering the 10th most common word in a corpus of 100,000 words?
  • What is the probability of encountering the least common word in a corpus of 100,000 words?

The probability mass function in Zipf's Law is defined as:

    P(k; s, N) = (1 / k^s) / (sum over n = 1..N of 1 / n^s)

where:

  • k is the rank of the word we are interested in finding out the probability of appearing in the corpus
  • N is the corpus’ vocabulary size
  • s is a parameter of the probability distribution and is set to 1 in the classic version of Zipf’s Law.

As the following derivation shows, Zipf’s Law says that there is a relationship between the probability of a word occurring in a corpus and its rank:

Suppose s = 1, the most common word (rank 1) in your corpus is "the" (word i), and the 10th most common word (rank 10) is "cat" (word j). Also suppose that the probability of the word "the" occurring in your corpus is 0.5. Then, according to Zipf's Law, the probability of the word "cat" occurring in your corpus is (1/10) * 0.5 = 0.05. In other words, for a given corpus, the relationship between the rank of a word and the probability of encountering it follows a power law.
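The rank-probability relationship with s = 1 can be verified directly from the pmf (zipf_pmf is my own helper, not a library function):

```python
def zipf_pmf(k, vocab_size, s=1.0):
    """Probability of the rank-k word under Zipf's Law."""
    norm = sum(1.0 / (r ** s) for r in range(1, vocab_size + 1))
    return (1.0 / (k ** s)) / norm

# With s = 1, the rank-10 word is exactly 10x less likely than the rank-1
# word, regardless of the vocabulary size (the normalizer cancels out).
ratio = zipf_pmf(1, 100_000) / zipf_pmf(10, 100_000)
print(round(ratio, 6))   # 10.0

# The worked example: P("cat") = (1/10) * P("the") = (1/10) * 0.5
print((1 / 10) * 0.5)    # 0.05
```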

Zipf’s Law in Wikitext-103

Do the words in Wikitext-103 follow Zipf’s Law?

Here’s the list 10 most common words in the corpus again but with a couple of additional information:

The "proportion" column is the number of times a particular word appears divided by the total number of words in the corpus, i.e. the empirical probability. The "predicted_proportion" column is the theoretical probability according to Zipf's Law with s = 1. Notice that these two values are very close to each other.

Here’s a log-log scatter plot showing the relationship between the rank of a word and its empirical and theoretical probability:

Since Zipf's Law implies a power-law relationship between the rank of a word and its probability of occurring in a corpus, the theoretical line in the log-log plot is straight. Notice that the empirical line more or less follows the shape of the theoretical line. Visually, this suggests that the word distribution approximately follows Zipf's Law. A better fit may be had if we assume that the relationship follows a broken power law.

So What?

Why does it matter that the word distribution of the corpora you are training a neural language model on follows Zipf’s Law?

Recall that neural language models are trained on large corpora, such that even the most frequent word makes up only a small fraction of the whole. In this example, the most frequent word is "the", but it only constitutes 7% of the words in the wikitext test set (recall that this corpus has 196k words and a vocabulary size of 17k). By Zipf's Law, there will only be around 0.1 * 0.07 * 196,095 = 1,372 training examples for the 10th most common word. The number of examples continues to decrease rapidly by a factor of 10 for the 100th, 1,000th, etc. most common word.

Since neural language models represent each word in a corpus using a high-dimensional vector, these models tend to do well at predicting common words but worse on rare words. This is because rare words have significantly fewer training examples than common words (by Zipf's Law) but are given the same capacity (vector dimension) to represent them in vector space.

Consider representing the words in this corpus using a 300-dimensional vector. By Zipf's Law, you will only have about 0.01 * 0.07 * 196,095 = 137 training examples to train the 100th most common word. This is certainly going to lead to over-fitting, as the capacity to represent this word (300) is a lot larger than the number of training examples (137). Training on a larger corpus is only a partial solution, since it merely shifts the problem to less common words.
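The two example counts follow from the same Zipf arithmetic; a quick check (0.07 is the empirical share of "the" quoted earlier):

```python
tokens = 196_095
p_the = 0.07   # empirical share of "the" in the wikitext test set

# The rank-k word is k times rarer than the rank-1 word under s=1 Zipf's Law
rank10_examples = (1 / 10) * p_the * tokens
rank100_examples = (1 / 100) * p_the * tokens

print(int(rank10_examples))    # 1372
print(int(rank100_examples))   # 137
```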

Adaptive Input Representations

Baevski and Auli (2019) proposed adaptive input representations as a means to fix the problem described in the preceding section. Here’s a diagram from their paper describing the idea:

In words, they start by grouping the vocabulary of their corpus into clusters based on word frequency. The first cluster has an embedding size of d, and each subsequent cluster has its embedding size reduced by a factor of k. For example, when the authors built a language model for wikitext-103, they split the vocabulary into 3 clusters with the following properties:

  • cluster 1: 20k words with embedding size of 1,024
  • cluster 2: 40k words with embedding size of 256
  • cluster 3: 200k words with embedding size of 64

Notice that the embedding size for each cluster is decreased by a factor of k = 4.

Since each cluster outputs a vector with different dimensions, each of them is also associated with a projection matrix that will project the output vectors into a common dimension, d, so that they can be concatenated and fed into a model for further processing e.g. feature extraction, classification, etc. These projection matrices are the green boxes labeled “Linear” in the diagram.

The above setup yields the following benefits:

  • A reduction in the number of parameters to train, which leads to faster training
  • Better generalization on rare words

Here’s the performance comparison between their proposed models (ADP and ADP-T) and other common models:

Here’s how their model performs on predicting rare words:

Notice that ADP-T performs the best across all word frequency bins.

Note: The ‘T’ stands for tied, which means the parameters in the adaptive input section of the model are shared with the adaptive softmax part of the model.


My takeaway from this paper is that although deep learning can work well on imbalanced datasets, explicitly modeling this imbalance in your model architecture can help the model learn even better representations and improve performance.

Let me know what you think in the comments.


Zipf’s Law; Wikipedia. Last accessed on 10 March 2019.

Language Model; Wikipedia. Last accessed on 10 March 2019.

The WikiText Long Term Dependency Language Modeling Dataset; Merity. 2016.

Power Law; Wikipedia. Last accessed on 10 March 2019.

Adaptive Input Representations For Neural Language Modeling; Baevski and Auli. 2019.

Ad data could save your life

Originally published in the November edition of Dialogue Magazine.

When we woke up our computers we gave them superpowers. Now we have to decide how to use them, writes Pete Trainor.

The world is different today than it was yesterday and tomorrow it will be different again. We’ve unleashed a genie from a bottle that will change humanity in the most unpredictable of ways. Which is ironic, because the genie we released has a talent for being able to make almost perfect predictions 99% of the time.

We have given machines the ability to see, to understand, and to interact with us in sophisticated and intricate ways. And this new intelligence is not going to stop or slow down, either. In fact, as the quantities of data we produce continue to grow exponentially, so will our computers’ ability to process and analyze — and learn from — that data. According to the most recent reports, the total amount of data produced around the world was 4.4 zettabytes in 2013 — set to rise enormously to 44 zettabytes by 2020. To put that in perspective, 44 zettabytes is equivalent to 44 trillion gigabytes (about 22 trillion tiny USB sticks). Across the world, businesses collect our data for marketing, purchases and trend analysis. Banks collect our spending and portfolio data. Governments gather data from census information, incident reports, CCTV, medical records and more. With this expanding universe of data, the mind of the machine will only continue to evolve. There’s no escaping the nexus now.

Running alongside this new sea of information collection is a subset of Artificial Intelligence called ‘Machine Learning’, autonomously perusing and, yes, learning, from all that data. Machine learning algorithms don’t even have to be explicitly programmed for every task — they can refine their own behaviour from data, all by themselves.

The philosophical and ethical implications are huge on so many levels.

On the surface, many people believe businesses are only just starting to harness this new technological superpower to optimise themselves. In reality, however, many of them have been using algorithms to make things more efficient since the late 1960s.

In 1967 the “nearest neighbour” algorithm was written, allowing computers to begin recognizing very basic patterns. It was originally used to map routes for travelling salesmen, finding a short trip that visited all the required locations. It soon spread to many other industries.

Then, in 1981, Gerald Dejong introduced the world to Explanation-Based Learning (EBL). With EBL, computers could now analyze a data set and derive a pattern from it all on their own, even discarding what they thought was ‘unimportant’ data.

Machines were able to make their own decisions. A truly astonishing breakthrough and something we take for granted in many services and systems, like banking, still to this day.

The next massive leap forward came just a few years later, in the early 1990s, when Machine Learning shifted from a knowledge-driven approach to a data-driven approach, giving computers the ability to analyze large amounts of data and draw their own conclusions — in other words, to learn — from the results.

The age of the everyday supercomputer had truly begun.

The devil lies in the detail, and it’s always the devil we would rather avoid than converse with. There are things lurking inside the data we generate that many companies would rather avoid or not acknowledge — at least, not publicly. We are not kept in the dark because they are all malicious or evil corporations, but more often because of the huge ethical and legal concerns attached to the data and processes that lie in the shadows.

Let’s say a social network you use every single day is sitting on top of a large set of data generated by tens of millions of people just like you.

The whole system has been designed right from the outset to get you hooked, extracting information such as your location, travel plans, likes and dislikes, status updates (both passive and active). From there, the company can tease out the sentiment of posts, your browsing behaviors, and many other fetishes, habits and quirks. Some of these companies also have permission (that you will grant them access to, in those lengthy terms and conditions forms) to scrape data from other seemingly unrelated apps and services on your phone, too.

One of the social networks you use every day even has a patent to “discreetly take control of the camera on your phone or laptop to analyse your emotions while you browse”.

Using all this information, a company can build highly sophisticated and extremely intricate, explicit models that predict your outcomes and reactions — including your emotional and even physical states.

Most of these models use your ‘actual’ data to predict or extrapolate the value of an unseen, not-yet-recorded data point — in short, they can predict whether you’re going to do something even before you’ve decided to do it.

The machines are literally reading our minds using predictive and prescriptive analytics.

A consequence of giving our data away without much thought or due diligence is that we have never really understood its value and power.

And, unfortunately for us, most of the companies ingesting our behavioural data only use their models to predict what advert might tempt us to click, or what wording for a headline might resonate because of some long forgotten and repressed memory.

All companies bear some responsibility to care for their users’ data, but do they really care for the ‘humans’ generating that data?

That’s the big question.

We’ve spent an awfully long time mapping the user journey or plotting the customer journey when, in reality, every human is on a journey we know nothing about.

Yes, the technical, legal and social barriers are significant. But what about commercial data’s potential to improve people’s health and mental wellbeing?

It’s started to hit home for me even harder over the last few years because I’ve started losing close friends to suicide.

Suicide is the biggest killer of men under 45 in the UK, and one of the leading causes of death in the US.

It’s an epidemic.

Which is why I needed to do something.

“Don’t do things better, do better things” — Pete Trainor

Companies can keep using our data to pad out shiny adverts or they can use that same data and re-tune the algorithms and models to do more — to do better things.

The emerging discipline of computational psychiatry uses the same powerful data analysis, machine learning, and artificial intelligence techniques as commercial entities — but instead of working out how best to keep you on a site, or sell you a product, computational psychiatrists use data to explore the underlying factors behind extreme and unusual conditions that make people vulnerable to self-harm and even suicide.

The SU project: a not-for-profit chatbot that could identify and support vulnerable individuals

The SU project was a piece of artificial intelligence that attempted to detect when people were vulnerable and, in response, actively intervene with appropriate support messages. It worked like an instant messaging platform — SU even knew to talk with people at the times of day they were most at risk of feeling low.

We had no idea the data SU would end up learning from was the exact same data being mined by other tech companies we interact with every single day.

We didn’t invent anything ground-breaking at all, we just gave our algorithm a different purpose.

AI needs agency. And often, it’s not asked to do better things, just to do things better — quicker, cheaper, more efficient.

Companies, then, haven’t moved quite as far from 1967’s ‘nearest neighbour’ as we might like to believe.

For many companies, the subject of suicide prevention is too contentious to provide a marketing benefit worth pursuing. They simply do not have the staff or psychotherapy expertise, internally, to handle this kind of content.

Where the boundaries get blurred and the waters murky is that, to save a single life, you would likely have to monitor us all.


The modern Panopticon.

The idea of me monitoring your data makes you feel uneasy because it feels like a violation of your privacy.

Advertising, however, not so much. We’re used to adverts. Being sold to has been normalized, being saved has not, which is a shame when so many companies benefit from keeping their customers alive.

Saving people is the job of counsellors, not corporates — or something like that. It is unlikely that data mining for good projects like SU would ever be granted universal dispensation since the nature and boundaries of what is ‘good’ remain elusively subjective.

But perhaps it is time for companies who already feed off our data to take up the baton? Practice a sense of enlightenment rather than entitlement?

If the 21st-century public’s willingness to give away access to their activities and privacy through unread T&Cs and cookies is so effective it can fuel multi-billion dollar empires, surely those companies should act on the opportunity to nurture us as well as sell to us? A technological quid pro quo?

Is it possible? Yes. Would it be acceptable? Less clear.

– Pete Trainor is a co-founder of US. A best-selling author, with a background in design and computers, his mission is not just to do things better, but do better things.

Week 4 — Warmth of Image

Snowy Image

Title: Weather Condition Prediction from Image

Team Members: Berk GÜLAY, Samet KALKAN, Mert SÜRÜCÜOĞLU

E-mails Respectively: , ,

Welcome again to our blog post for the 4th week of our project. We salute you from a new and promising week on our side.

This week we could at least see the horizon of our project. We finally formed, arranged, and put to use the entire “Warmth of Image” dataset as we wanted (cropped/standard, test/validation/train splits, eliminated and balanced for each class, combined, etc.). We obtained preliminary results from convolutional neural network models using Keras. The cropped part of our image dataset gave approximately 79% accuracy with convolution layers.

Architecture for the first trial of the CNN: 2 convolution layers and a pooling layer, dropout, 1 fully connected layer for the classification part, and a softmax layer

Used dataset info: 50 x 50 cropped images from each class (cloudy, sunny, rainy, snowy, foggy), approximately 15,000 images in total.

Preliminary result of the CNN: ~79% accuracy on validation data
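A minimal Keras sketch of that first trial. The filter counts, dropout rate, and dense-layer width are assumptions, since the post does not state them; only the layer types, the 50 x 50 input, and the 5 weather classes come from the description above:

```python
from tensorflow import keras
from tensorflow.keras import layers

# 2 convolution layers + a pooling layer, dropout, one fully connected
# layer, and a softmax over the 5 classes, on 50x50 RGB crops.
model = keras.Sequential([
    keras.Input(shape=(50, 50, 3)),
    layers.Conv2D(32, (3, 3), activation="relu"),
    layers.Conv2D(64, (3, 3), activation="relu"),
    layers.MaxPooling2D((2, 2)),
    layers.Dropout(0.25),                    # drop-out method usage
    layers.Flatten(),
    layers.Dense(128, activation="relu"),    # 1 fully connected layer
    layers.Dense(5, activation="softmax"),   # cloudy/sunny/rainy/snowy/foggy
])
model.compile(optimizer="adam",
              loss="categorical_crossentropy",
              metrics=["accuracy"])
```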

Another thing we noticed is how hard snowy-view classification is. We are trying to use intensity histograms (white-pixel density) to separate snowy pictures, but since cloud or sky regions also produce dense white-pixel areas, we could not separate sky segments and snowy areas from each other (check the example images below), even after trying different segmentation algorithms (Otsu, Watershed, etc.) and hand-crafted tools (edge detectors, etc.). We are trying to figure out how to separate the sky area from the snowy parts and determine white-pixel density for the snowy area only.

An example hard case image which we face while trying to segment the snowy area & sky

Hopefully we will figure out how to use the intensity histogram / white-pixel density feature for snowy image detection.
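For illustration, here is one way to compute the white-pixel-density feature with a self-contained NumPy implementation of Otsu’s threshold (the same method scikit-image’s `threshold_otsu` implements). The toy image and its bright band are made up for the demo; as noted above, a real snowy photo would need the sky masked out first:

```python
import numpy as np

def otsu_threshold(gray, nbins=256):
    """Otsu's method: pick the threshold maximizing between-class variance."""
    hist, edges = np.histogram(gray, bins=nbins)
    centers = (edges[:-1] + edges[1:]) / 2
    w0 = np.cumsum(hist)                              # pixels at or below each bin
    w1 = w0[-1] - w0                                  # pixels above each bin
    s0 = np.cumsum(hist * centers)
    m0 = s0 / np.where(w0 == 0, 1, w0)                # mean of the lower class
    m1 = (s0[-1] - s0) / np.where(w1 == 0, 1, w1)     # mean of the upper class
    var_between = w0 * w1 * (m0 - m1) ** 2
    return centers[np.argmax(var_between)]

def white_pixel_density(gray):
    """Fraction of pixels brighter than the Otsu threshold."""
    return float((gray > otsu_threshold(gray)).mean())

# toy 50x50 image: dark ground below, bright band (snow/sky) on top
img = np.zeros((50, 50))
img[:20, :] = 0.9
print(white_pixel_density(img))  # → 0.4 (the bright band covers 40% of pixels)
```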

We also constructed our feature set to classify images by weather condition type. We are planning to use color histogram, brightness, contrast, haze, sharpness, intensity histogram (white-pixel density), and sky-region metrics as image descriptors for the weather condition recognition problem. Moreover, we did not only design our feature set hypothetically; we also researched related work on our problem and found efficient ways to calculate this combination of metrics.

Lastly, we are trying different frameworks and tools for our project (Keras, LIBSVM, scikit-learn, scikit-image, OpenCV, etc.) so as to determine efficient and beneficial methodologies and algorithms for our problem. Related work and other papers on similar recognition problems help a lot in discovering new pathways.

We will keep you informed in the coming weeks, and you are always welcome to contact us about any part of our project. All assistance will be highly appreciated. (The “Warmth of Image” Team)