Alibaba Cloud Machine Learning Platform for AI: Image Classification by Caffe


By Garvin Li

The Image Classification by TensorFlow section introduces how to use the TensorFlow deep learning framework to classify CIFAR-10 images. This section introduces another deep learning framework: Caffe. With Caffe, you can complete image classification model training simply by editing configuration files.

Make sure that you have already read the Deep Learning section and activated deep learning in Alibaba Cloud Machine Learning Platform for AI (PAI).


This experiment uses the CIFAR-10 open-source dataset, containing 60,000 images of 32 x 32 pixels. These images are classified into 10 categories: airplanes, automobiles, birds, cats, deer, dogs, frogs, horses, ships, and trucks. The following figure shows the dataset.

The dataset has already been stored in the public dataset in Alibaba Cloud Machine Learning Platform for AI in JPG format. Machine learning users can directly enter the following paths in the Data Source Path field of deep learning components:

  • Testing data: oss://
  • Training data: oss://

Enter the path, as shown in the following figure:

Format Conversion

The Caffe deep learning framework currently supports only certain data formats. Therefore, you must first use the format conversion component to convert the JPG images.

  • OSS Path Storing Images and Table Files: set this parameter to the path of the public dataset predefined in Alibaba Cloud Machine Learning Platform for AI.
  • Output OSS Path: user-defined OSS path.

After format conversion, the following files are generated in the output OSS path, including a training dataset and a testing dataset.

Record the corresponding paths for editing the Net file. The following is an example of the data paths:

  • Training data_file_list.txt: bucket/cifar/train/data_file_list.txt
  • Training data_mean.binaryproto: bucket/cifar/train/data_mean.binaryproto
  • Testing data_file_list.txt: bucket/cifar/test/data_file_list.txt
  • Testing data_mean.binaryproto: bucket/cifar/test/data_mean.binaryproto

Caffe Configuration Files

Enter the preceding paths in the Net file, as follows:

Edit the Solver file:
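The Solver file is standard Caffe protobuf text. The original figure is not reproduced here, but a minimal CIFAR-10 solver might look like the following sketch; the paths and hyperparameter values are illustrative assumptions, not values taken from this tutorial:

```protobuf
# solver.prototxt -- illustrative values only
net: "bucket/cifar/cifar10_net.prototxt"       # hypothetical path to the Net file
test_iter: 100            # number of test batches per evaluation
test_interval: 500        # test every 500 training iterations
base_lr: 0.001            # starting learning rate
momentum: 0.9
weight_decay: 0.004
lr_policy: "fixed"
max_iter: 60000           # total training iterations
snapshot: 10000           # save a model snapshot every 10000 iterations
snapshot_prefix: "bucket/cifar/model/cifar10"  # hypothetical output prefix
solver_mode: GPU
```

The `net` field should point to the Net file containing the data paths recorded earlier.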

Run the Experiment

  1. Upload the Solver and Net files to OSS, drag and drop the Caffe component to the canvas, and connect the component to the data source.
  2. Set the parameters in the Caffe component, as shown in the following figure. Set the Solver OSS Path to the OSS path of the uploaded Solver file and then click Run.
  3. View the generated image classification model file in the model storage path on OSS. You can use the generated model to classify images.

  4. To view the corresponding log, refer to Logview in Image Classification by TensorFlow.


How to Use Big Data for Better Governance

India has created the biggest data repository in the world. But now it faces the challenge of using this information for better governance.

At the first workshop on “Big Data Analytics in Government”, organized by the National Institute for Smart Government (NISG), it emerged that India owns the largest complex of data, gathered via digitization of records for purposes like IDs, passports and payment of subsidies.


Big data is characterized by its sheer volume, variety, and velocity, and the analytics involves processing that data in the most cost-effective way to draw conclusions for useful application.

According to experts, “There are a number of areas where huge projects have been implemented, like Aadhaar, passports etc. All this has opened up a lot of opportunities to apply this data to improve the citizens’ customer experience, to improve government efficiency, especially in delivery of services and to boost business…to create capacity to serve both domestic and export markets. Big data analytics, which then merges into fields like deep learning, machine learning and artificial intelligence (AI), has tremendous possibilities.”


The importance of big data can be gauged from the fact that 90 per cent of digital information worldwide has been created over the last two years, while processing power increased by 40 per cent between 2010 and 2016. Over the same period, the cost of storing data fell roughly five-fold.


Machine learning: A beginner’s guide to teaching artificial intelligence

2017 is the year of hype for artificial intelligence and machine learning. Countless articles this year have extolled AI’s virtues, explored the automation of menial tasks, and prophesied the end of work altogether. While we’ve all heard about AI’s potential after this year of endless press coverage, few have a basic understanding of how the technology works.

I’m not a computer scientist or AI expert. I’m still very much a student of AI, and I hope writing a brief primer aimed at non-technologists will help me (and maybe you?) understand how artificial intelligence learns a little better.

Machine learning is giving a computer the ability to learn without being explicitly programmed.

So, let’s take a step back to understand explicit programming. Explicit programming is how most of the software you’re familiar with works. Traditionally, software has defined goals, straightforward inputs, and clear outputs. While explicit programming can get very complicated, it is essentially a set of rules. For every action or input, the software follows a set of rules to generate an output.
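As a toy illustration (my own sketch, not from any particular codebase), here is explicit programming at its simplest: a developer hand-writes every rule, and for every input the software follows those rules to produce an output.

```python
# Explicit programming: a developer writes the rules by hand.
# This toy "spam filter" only knows what it was explicitly told.
def classify(subject):
    banned = {"winner", "free", "prize"}  # rules chosen by a developer
    words = subject.lower().split()
    if any(word in banned for word in words):
        return "spam"
    return "ham"

print(classify("You are a WINNER"))  # -> spam
print(classify("Meeting at noon"))   # -> ham
```

Any spam phrased with words outside the hand-written list slips through, which is exactly the limitation discussed below.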

Explicit programming has served us well. It’s what powers the personal computer and the internet. It runs nearly all the software we use daily. Explicit programming is what we think of when we think of coding and computer science.

However, explicit programming has limitations, and those limitations are largely two-fold. First, since explicit programming is all about inputs, rules, and outputs, it relies on a software developer to write rules that apply to every possible input. In situations where there are hundreds or thousands of variable inputs, it becomes impractical to explicitly program rules for all the possible variables.

Second, explicit programming is static and bad at prediction. In most cases, once a software developer has written a piece of code, that code won’t change. It doesn’t have the ability to update itself in response to new variables. This static nature makes explicit programming bad at prediction, since it lacks the ability to fine tune its predictions when it gains new data.

“Tuning predictions in response to new data” is a very nerdy way of saying that explicitly programmed software can’t learn. But since the 1960s, computer scientists have been working on changing that.

Helping a computer to learn requires two main things:

  1. A large dataset so the computer can make hundreds or thousands of mistakes while tuning its predictions
  2. Lots of processing power to assemble and examine the data and run hundreds or thousands of tests

We’re entering the golden age of machine learning because those two things are now readily available. Storage and processing power have increased exponentially over the decades in a phenomenon known as Moore’s Law.

With these two hardware challenges overcome, machine learning now relies on writing algorithms that tell a computer what to look for in a data set, without explicitly coding how to find the answer. The computer then uses trial and error, breeding the successful trials and adding random variation to try new things. Over the course of many tests and variations, the computer learns how to succeed at the task at hand.
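The trial-and-error loop described above can be sketched in a few lines of plain Python (the rule, numbers, and names here are my own illustrative choices): the program is never told the rule y = 3x, yet it discovers it by keeping the trials that score better and adding random variation.

```python
import random

random.seed(0)

# Toy data generated by a hidden rule, y = 3 * x.
# The rule itself is never coded into the learner.
data = [(x, 3 * x) for x in range(-10, 11)]

def loss(a):
    # How wrong is the guess `a`? Sum of squared prediction errors.
    return sum((a * x - y) ** 2 for x, y in data)

# Trial and error: start from a random guess, add random variation,
# and keep ("breed") a trial only when it scores better.
best = random.uniform(-10, 10)
for _ in range(1000):
    trial = best + random.gauss(0, 0.5)
    if loss(trial) < loss(best):
        best = trial

print(round(best, 2))  # converges close to 3.0
```

Real machine learning algorithms are far more sophisticated, but the skeleton is the same: a scoring function, many trials, and variation on the winners.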

It might be easiest to understand how the algorithm works visually. This video of a computer learning to play Super Mario is a great introduction:

Machine learning has tons of applications; some of them are visible, while many of them are hidden.

  • Machine learning helps Amazon decide which products to recommend to you based on your past purchases
  • Facebook uses a branch of machine learning known as computer vision to suggest which friends you should tag in your photos
  • Machine learning helps Siri and Alexa with voice recognition through trial and error
  • Gmail relies on machine learning to identify and filter spam
  • Netflix uses machine learning to predict which shows you’ll like based on the ones you’ve watched in the past

Basically, machine learning is useful anywhere a computer needs to make a prediction or a judgment. It can apply to fraud detection, bioinformatics, advertising, and financial modeling.

Machine learning is only as good as the algorithm it uses and the data set it gets trained on. Because machine learning is so complex, its code is notoriously difficult to debug and write correctly. When a machine learning algorithm fails, it can be difficult to pinpoint exactly why it failed.

Another problem is machine learning inherits the biases that are present in your data set. If your data is imprecise, skewed, or incomplete, the algorithm will still try to make sense of it. Conclusions drawn from too small of a data set are often outright wrong. Insufficient data is one reason why machine learning projects fail to deliver.

More insidious, however, are machine learning projects that inherit racial, class, age, or other biases from the data collectors or algorithm creators. Machine learning is only as good as the data it is trained on, and creating ethical datasets is a subject for an entire blog post in itself.

Over the coming years, AI will certainly make our lives easier by automating menial tasks. It’s possible these tasks could get as complex as self-driving cars and automated checkout at retail stores. While the jobs lost to automation is an important concern, another key is making sure everyone has equal access to automation and is treated fairly in a new automated world. Unfortunately, we haven’t found a way for a machine learning algorithm to teach ethics or morality quite yet.

Creating a Big Data Cloud instance

So you’ve started your Oracle Cloud Free trial and want to start working with big data? Let me be your guide!

SSH Public Key

Getting the “Cloud Storage Container” values right is the tricky part, but the SSH keys are crucial! I just let Oracle create them for me.

If you click “Create a New Key” OpenSSH keys will be generated for you. Subsequently download the zip file.

Windows will unzip them for you, but I like to do it myself to keep my command-line skills sharp 🙂

unzip -d sshkeybundle

Since the keys are OpenSSH, you'll have to convert them to .ppk format if you're using PuTTY. My next post will be on using my favorite SSH client (and so much more!), MobaXterm, to access our instance.


6. Click “Next” and “Create”. If you included your email in the “Notification” box (at step 4) then you’ll get a notification email when the service has been created 🙂

If everything went well you should see something like this indicating that the service is being created!

And if you included your email in the "Create Service" pane (Step 4), you'll get a little email letting you know how awesome you are.

If you have questions, please comment, message me on Twitter, or send me an email. You can also visit the official Oracle docs.

Twitter @genseb7

Geospatial Data Analytics

Comparison of SimplyAnalytics and Industry-Standard GIS Software Applications

SimplyAnalytics (formerly Geographic Research Inc.) is a top performing and easy to use geospatial analysis tool. (Image source: SimplyAnalytics)

With Big Data and Data Analytics becoming commonplace in more developed countries, realising that all data is somehow tied to a geographic location on Earth can reveal otherwise hidden information. This is where analysing data for geospatial patterns comes into play.

For those with little to no geographic background — have no fear — SimplyAnalytics is here. SimplyAnalytics is a web-based analytics tool. To date, it is the leader in equipping educational institutions, non-profit organizations, businesses and government agencies with datasets, data visualization, spatial analysis, and GIS software.

As with geotechnology and geomatics industry-standard GIS software applications, SimplyAnalytics is an excellent and rather unconventional tool which can be applied to a wide variety of decision-making processes. SimplyAnalytics is significantly simpler to use, and quicker at collecting data and generating visual output, than standard GIS packages, which require more expertise and more time to compile the same (or perhaps fewer) results.


In GIS software, the human operating the application must conduct the analysis on their own. Meanwhile, SimplyAnalytics is programmed to automatically offer insights. The amazing part is that SimplyAnalytics will output a report including the fine-grained details, not in hours or days, but in mere seconds!

Maps, charts and reports can be output by SimplyAnalytics in mere seconds to minutes! (Image source: SimplyAnalytics)

How Can Artificial Intelligence Improve Health and Productivity?

Survival instinct is key to the prosperity of the human species. It is perhaps why we as humans have always been a bit hypochondriac deep down. Thus, when people are personally given the responsibility of monitoring their health, they are at a greater advantage.

However, humans have minimal computational capacity for analyzing their own physiological data, which they gather through sensations such as fatigue and pain, to name a few. Moreover, they cannot possibly compare this data with various other data sets. This is why humans are soon going to lose their comparative advantage to AI systems in the task of reporting their physiological condition.

While various experts have already established that Artificial Intelligence can help medical and healthcare professionals carry out diagnoses much more efficiently than before, the role AI can play in personal fitness has only recently come to attention. The AI-driven medical systems industry is projected to grow to $6.6 billion, according to an Accenture report.

Some of the spheres within which AI has been most effectively applied according to a Forbes magazine report are: AI assisted robotic surgery, image analysis, and other administrative and managerial tasks within the health sector. Apart from these, AI has effortlessly assumed the roles of virtual nursing assistants and has been quite successful in aiding clinical judgments.

Since these have shown positive results, all these technologies are to be used by the large scale health industry and will be administered and used by professionals. Meanwhile, the role of patients as passive recipients of treatment will remain the same. Moreover, while these significant changes keep happening in the medical industry, there is another parallel healthcare revolution brewing, and this is aimed at empowering the individual patients.

Many personal fitness gadgets are being developed for the market. When these gadgets are integrated with AI systems, they become capable of providing customized health care solutions for individuals. These will soon offer a far more intense and all-round healthcare option for individuals than any given physician could ever provide.

Wearables and AI: Are you Fitbit Enough?

Wearable fitness gadgets have risen in popularity in spite of skepticism from doctors about their effectiveness at, for instance, counting how many steps an individual has taken. There is no denying, however, that devices like the Apple Watch, Android Wear, and Fitbit are producing vast amounts of data related to the health and lifestyle of their users.

The wearables industry has witnessed its fair share of success stories. Kardia, for example, is a case in point. Kardia was successful in developing a cost-effective EKG wearable which collected vast amounts of EKG data from the users. The collected data is then processed through an algorithm which detects atrial fibrillation. The FDA has in fact cleared Kardia’s system as a valid measure for the detection of this cardiovascular condition.

Propeller Health is another successful AI system which was able to detect asthma attacks based on patient medication data and environmental conditions. There are several other examples of such successes.

Therefore, we see that not only is AI capable of producing positive results in terms of providing effective health care; it also is an inevitable option to process the large volume of health data that is generated by the wearable gadgets and other such platforms.

Apart from the wearables, several online platforms which are providing fitness related services are eventually going to benefit the users. Although they depend on human respondents in certain areas of health care, like providing a personal exercise routine or a diet plan fit for weight loss, these services can be of great help.

Conclusion: Health Care Cut to Size

We have already seen how Google Assistant, Siri, or Alexa have used our personal data to make our digital experience more fulfilling. They have technically used their AI systems to create customized song lists, reading lists, and even shopping lists for each user by training them with vast amounts of relevant data sets.

With the importance and need for AI solutions increasing by the day, MindSync is creating a holistic community of the best minds in AI technologies — data scientists, machine learning experts, etc. This community will be a single touch point for all kinds of industries to submit their requirements. These tasks will be taken up by the community as a challenge and the best talent around the world will work towards fulfilment of the business task. The best task can then be used by the business to transform its processes and create a clear competitive advantage.

Experts are of the opinion that this model can be replicated for the health sector as well. By using personal health data of the individuals, the AI systems will be able to create customized lifestyle plans, diets and exercise routines which do not affect the working hours or the personal routine of the users.

The preferences for food or types of exercises can also be taken into consideration while creating these plans. This data will reduce the bias in reporting health conditions and lifestyle choices, including intimate details like the sexual activities of the individuals.

Therefore, such a comprehensive AI system will increase productivity and well-being of individuals. While large scale AI systems used by the medical industry will improve the longevity of individuals, the personal health care sector will enhance its quality.

SOC Analyst Survey: Initial SIEM Observations

As part of the continued analysis on the SOC Analyst Survey, I thought I'd post a few teasers to help illustrate and communicate the findings I'm coming across in the process. This first one is a simple count of the SIEMs deployed across the 54 organizations represented in the survey. It should be noted that a few organizations had multiple SIEM technologies deployed for various reasons. I'll save the personal observations on that topic for a later post, but what stands out is the number of "search" based SIEM deployments that dominate the top of the leaderboard: Splunk ES, Splunk, and ELK. The final report will have these broken out further by company size, etc.

SIEM Prevalence across 54 Enterprises.

In one fun pivot of the data, I wanted to look for ways to quantify the health or maturity of these deployments using the information available. I excluded the 10 MSSPs represented for a moment to focus on what corporations are doing with their SIEM, and asked the data the simple question: "What would it look like if I cross-referenced SIEM technology with the number of new use cases deployed over the last 30 days?"

Use Cases Deployed by SIEM over the last 30 days

I'm not saying that there is a definite correlation in this data set, but it's worth measuring this type of data more closely over time with greater levels of granularity. There are some obvious factors that lead to skew for us to consider when looking at data like this. 1) It's a snapshot in time. 2) It's representative of 44 companies (I excluded the 10 MSSPs for this one pivot). 3) Some companies reported multiple SIEMs, so some numbers may be artificially inflated, since the survey questions didn't ask per SIEM, only across all SIEMs deployed. 4) We don't know whether the respondents are in an initial build-out phase or at operational maturity, so this snapshot in time may skew results further. Still, I think it's worth investigating and learning how to improve the process. In the future, I can see ways to tighten up the integrity of the data with better questions and ongoing inquiries.

I’ll encourage you to check out this site for updates in the near future. I have some more fun pivots to share with you all very soon. These are some of the more fun ones that are just about ready to be shared:

– Analyst time spent “validating” SIEM events versus time spent “hunting”

– Most Common Event Sources versus Most Desired Event Sources

Finally, I want to express sincere gratitude to those wonderful souls that participated in the survey. Thank you so much for your support!



Engage maximum warp speed in time series analysis with WarpScript

We at Metrics Data Platform work every day with the Warp10 Platform, an open source time series database. You may not know it, because it's not as famous as Prometheus or InfluxDB, but Warp10 is the most powerful and generic solution we know of to store and analyze sensor data. It's the core of Metrics, and many internal teams at OVH use Metrics Data Platform to monitor their infrastructure. As a result, we handle pretty heavy traffic 24/7/365, as you can see below:

Yes, that’s more than 4M datapoints/sec on our frontends.
And more than 5M commits/sec on HBase, our storage layer.

Not only does Warp10 allow us to reach unbelievable scalability, but it also comes with its own language, WarpScript, to manipulate and perform heavy time series analysis. Before digging into the need for a new language, let's talk a bit about the need for time series analysis.

A time series, or sensor data, is simply a sequence of measurements over time. The definition is quite generic, because many things can be represented as a time series:

  • the evolution of the stock exchange or a bank account
  • the number of calls on a webserver
  • the fuel consumption of a car
  • the time to insert a value into a database
  • the time a customer is taking to register on your website
  • the heart rate of a person measured through a smartwatch

From a historical point of view, time series appeared shortly after the creation of the Web, to help engineers monitor networks. The practice quickly expanded to also monitor servers. With the right monitoring system, you can have insights and KPIs about your service:

Analysis of long-term trend

  • How fast is my database growing?
  • How fast is my number of active user accounts growing?

Comparison over time

  • Do my queries run faster with the new version of my library? Is my site slower than last week?

Alerting

  • Trigger alerts based on advanced queries

Displaying data through dashboards

  • Dashboards help answer basic questions on the service, and in particular the 4 indispensable metrics: latency, traffic, errors and service saturation

The possibility of designing retrospectives

  • Our latency is doubling, what’s going on?

Storage, retrieval and analysis of time series cannot be done through standard relational databases. Generally, highly scalable databases are used to handle the data volume. For example, the 300,000 sensors on board an Airbus A380 can generate an average of 16 TB of data per flight. On a smaller scale, a single sensor that takes a measurement every second generates 31.5 million values per year. Handling time series at scale is difficult, because you run into advanced distributed systems issues, such as:

  • ingestion scalability, i.e. how to absorb all the datapoints 24/7
  • query scalability, i.e. how to query in a reasonable amount of time
  • delete capability, i.e. how to handle deletes without stopping ingestion and queries
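The back-of-the-envelope sensor arithmetic above is easy to verify:

```python
# One measurement per second, accumulated over a (non-leap) year:
values_per_year = 60 * 60 * 24 * 365
print(values_per_year)  # 31536000, i.e. roughly 31.5 million
```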

Frustration with existing open source monitoring tools like Nagios and Ganglia is why the giants created their own tools: Google has Borgmon and Facebook has Gorilla, to name just two. They are closed source, but the idea of treating time series data as a data source for generating alerts is now accessible to everyone, thanks to the former Googlers who decided to rewrite Borgmon outside Google.

Now that the time series ecosystem is bigger than ever, here's a short list of what you can find to handle time series data:

  • InfluxDB
  • Kdb+
  • Graphite
  • RRDtool
  • Prometheus
  • OpenTSDB
  • Druid
  • TimescaleDB

Then there's Warp10. The difference is quite simple: Warp10 is a platform, whereas all the time series databases listed above are stores. This is a game changer, for multiple reasons.

Security-first design

Security is mandatory for data access and for sharing job results, but most of the databases above do not handle access security by default. With Warp10, security is handled with crypto tokens similar to Macaroons.

High level analysis capabilities

With a classical time series database, high-level analysis must be done elsewhere, with R, Spark, Flink, Python, or whatever language or framework you want to use. With Warp10, you can just submit your script and voilà!

Server-side calculation

Algorithms are resource heavy. Whether they consume CPU, RAM, disk, or network, you'll hit limitations on your personal computer. Can you really aggregate and analyze one year of data from thousands of sensors on your laptop? Maybe, but what if you're submitting the job from a mobile? To be scalable, analysis must be done server-side.

The Warp10 folks created WarpScript, an extensible stack-oriented programming language that offers more than 800 functions and several high-level frameworks to ease and speed up your data analysis. Simply create scripts containing your data analysis code and submit them to the platform; they will execute close to where the data resides, and you will get the result of the analysis as a JSON object that you can integrate into your application.

Yes, you'll be able to run that awesome query that fetches millions of datapoints and get back only the result. Need all the data, or just the timestamp of a weird datapoint? The result of the script is simply what's left on the stack.

Dataflow language

WarpScript is really easy to code because of the stack design. You push elements onto the stack and consume them. Coding becomes logical: first you fetch your points, then apply some downsampling, then aggregate. These three steps translate into three lines of WarpScript:

  • FETCH pushes the needed Geo Time Series onto the stack
  • BUCKETIZE takes the Geo Time Series from the stack, applies some downsampling, and pushes the result back onto the stack
  • REDUCE takes the Geo Time Series from the stack, aggregates them, and pushes them back onto the stack

Debugging has never been this easy: just use the keyword STOP to see the stack at any moment.

Rich programming capabilities

WarpScript comes with more than 800 functions, ready to use. Things like pattern and outlier detection, rolling averages, FFT, and IDWT are built in.

Geo-Fencing capabilities

Both space (location) and time are considered first class citizens. Complex searches like “find all the sensors active during last Monday in the perimeter delimited by this geo-fencing polygon” can be done without involving expensive joins between separate time series for the same source.

Unified Language

WarpScript can be used in batch mode, or in real-time, because you need both of them in the real world.

Here’s an example of a simple but advanced query:

// Fetching all values
[ $token 'temperature' {} NOW 1 h ] FETCH
// Get max value for each minute
[ SWAP bucketizer.max 0 1 m 0 ] BUCKETIZE
// Round to nearest long
[ SWAP mapper.round 0 0 0 ] MAP
// Reduce the data by keeping the max, grouping by 'buildingID'
[ SWAP [ 'buildingID' ] reducer.max ] REDUCE

Have you guessed the goal? The result displays, for each buildingID, the temperature of the hottest room over the last hour.

You're still here? Good, let's look at a more complex example. Let's say I want to do some pattern recognition. Here's a cosine with increasing amplitude:

I want to detect the green part of the time series, because I know that my service crashes when I have that kind of load. With WarpScript, it takes only two function calls:

  • PATTERNS generates a list of motifs.
  • PATTERNDETECTION runs the list of motifs against all the time series you have.

Here's the code:

// Defining some variables
32 'windowSize' STORE
8 'patternLength' STORE
16 'quantizationScale' STORE
// Generate patterns
$list.of.gts 0 GET
$windowSize $patternLength $quantizationScale PATTERNS
VALUES 'patterns' STORE
// Running the patterns through a list of GTS (Geo Time Series)
$list.of.gts $patterns
$windowSize $patternLength $quantizationScale PATTERNDETECTION

Here’s the result:

As you can see, PATTERNDETECTION works even with the increasing amplitude! You can try this example yourself using Quantum, the official web-based IDE for WarpScript. You need to switch the X-axis scale to Timestamp in order to see the curve.

Thanks for reading! Here's a nice list of additional resources on time series and Warp10:

  • Metrics Data Platform, our product
  • Warp10 official documentation
  • Warp10 tour, similar to “The Go Tour”
  • Presentation of the Warp 10 Time Series Platform at the 42 US school in Fremont
  • Warp10 Google Groups

3 Things about Big Data Your Teachers Wouldn’t Tell You

Big Data is exactly what the name implies: large amounts of data. It comes from various sources, namely satellites, cameras, computer databases, microphones, credit records, and a whole lot of other digital storage units and devices. Almost everything that is recorded digitally can fall under the umbrella of Big Data: bills, purchases, apps, surveys, and the list goes on.

The thing many dislike about big data is that it gathers personal information too — so why is personal information gathered and tracked?

The reasons are many but as far as businesses are concerned, personal data is used to collect and analyze behavioral trends to better market to their customers and target audience and supply them with what they want, when they want it.

Big Data markets around the world are predicted to increase their revenue (software & services) by $61 billion in the next 7 years.

There are 3 main ways businesses and companies collect massive amounts of data from Internet users:

  • Opt-ins & Lead Forms
  • Smartphone Apps
  • Website Cookies

Big Data: Immoral, Illegal, or Good Business Practice?

While collecting such massive amounts of personal information may seem like a privacy issue, nowadays there are many systems in place to give users the right to refuse such tracking and data gathering.


For instance, the General Data Protection Regulation (GDPR), adopted by the European Union, took effect on May 25, 2018.

The regulation ensures that all businesses that want to do business in the EU (whether they are EU-based or not) must give users the right to “Agree” or “Disagree” to being tracked. In addition, EU citizens can require data-collecting apps, devices, and sites to relinquish and delete all of their personal information.
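The consent and deletion rights described above can be pictured as a tiny data-handling sketch. All class, method, and user names here are hypothetical illustrations, not part of any real compliance library:

```python
from dataclasses import dataclass, field

@dataclass
class ConsentStore:
    """Hypothetical in-memory store of per-user tracking consent (GDPR-style)."""
    consents: dict = field(default_factory=dict)   # user_id -> opted in?
    profiles: dict = field(default_factory=dict)   # user_id -> collected events

    def record_consent(self, user_id: str, agreed: bool) -> None:
        # The user explicitly chooses "Agree" or "Disagree".
        self.consents[user_id] = agreed

    def track(self, user_id: str, event: dict) -> bool:
        # Only collect data for users who opted in.
        if not self.consents.get(user_id, False):
            return False
        self.profiles.setdefault(user_id, []).append(event)
        return True

    def erase(self, user_id: str) -> None:
        # Deletion request: remove every trace of the user's personal data.
        self.profiles.pop(user_id, None)
        self.consents.pop(user_id, None)

store = ConsentStore()
store.record_consent("alice", True)
store.record_consent("bob", False)
store.track("alice", {"page": "/pricing"})   # collected
store.track("bob", {"page": "/pricing"})     # ignored: no consent
store.erase("alice")                          # alice's data is deleted
```

Real deployments would persist consent records and propagate deletion to every downstream system, but the opt-in gate and erasure path are the essential moving parts.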

A host of other Big Data security solutions have arisen over the past couple of years in the ripple effect of the GDPR and Facebook’s Cambridge Analytica scandal (2018), which awoke the public at large to the fact that their personal information was not only being collected but shared with third parties.

The following tools and software solutions are but a sampling of the recent efforts businesses are making to keep their customers’ and prospects’ information safe, secure, and out of the wrong hands:

  • Cloud Data Protection (CDP)
  • Big Data Encryption
  • Data Privacy Management
  • Data/Content Subjects Right Management
  • Data Governance Management

Giving end users the right to decline data collection, along with modern cyber-security software and private data storage solutions, has turned a potential privacy violation and ethical conundrum into a viable, moral process that helps businesses run more smoothly and serve their customers in the best possible way.

Why Businesses Need Big Data

Businesses do not just want to collect and store large amounts of information related to their business, they need it!

Aside from being able to analyze how customers interact with a brand, businesses also use Big Data to find weaknesses in their marketing, production, and customer service levels as well as to spot industry trends before they happen.

Even Henry Ford used information from his environment to expand upon industry trends, which helped him improve his already ground-breaking car production assembly line. He got the idea for the division of labor and interchangeable parts by watching disassembly lines operate at meat-packing plants.

The above example would not be considered Big Data by today’s standards but it was data nonetheless — information that was observed, stored, and analyzed to produce better business results.

Big Data, as it is defined today, can help your business in so many ways: The following list gives a brief sample of some of the areas in which Big Data can help improve a business’s overall operation:

  • Time Efficiency: Large amounts of information can be analyzed quickly using machine learning, enabling sound business decisions in less time.
  • Cost Savings: Big Data tools, built largely on real-time and autonomous systems, allow businesses to reduce manpower, especially in the IT department. Resources can be allocated at a much lower cost, and large amounts of information can be stored and processed for much less.
  • Current Market Data/Product Development: Through analyzing Big Data, companies can better grasp recent product trends and either produce such products or modify existing products to move ahead of the competition.
  • Customer Behaviors: Large amounts of customer data can be used to figure out what customers want and when they want it. Therefore, a business can provide better service, support, and marketing to its target audience.
  • Online Reputation: When it comes to protecting a business’s online reputation, Big Data is unmatched. It allows companies to retrieve customer feedback on social media platforms, for instance, and improve any negative reactions through positive outreach.

How to Use Big Data Effectively

It is really not how much information is collected but how that information is used that is the essence of producing better business results.

To really understand how to use large amounts of data to improve customer service, reduce costs, or speed up business processes, a business must first consider what metrics are important to analyze to achieve such results.


Deciding which metrics to focus on, and how much weight to give each one, is simply a matter of knowing which data set is most indicative of your company’s progress.

Of course, this will be different for each company, but there are metrics under the main general business categories that can help turn Big Data into “Wise Data”. These categories and metrics are as follows:

Customer Service

  • Employee Retention
  • Employee Productivity
  • Customer Support Tickets
  • Time to Install

Product Development

  • Status Against Budget
  • Status Against Deadline
  • Team Productivity
  • Recruiting Productivity
  • Bug Resolution

Sales

  • Revenues vs. Budget
  • Sales Force Productivity
  • Renewal Rate
  • Cost to Acquire
  • Time to Acquire
  • Customer Lifetime Value
  • Market Share

Marketing

  • Click Through Rate (CTR)
  • Conversion Rate
  • Cost Per Lead
  • Website Traffic
  • Return On Investment (ROI)
  • Qualified Prospects
  • Competitive Ranking

Finance

  • Burn Rate
  • Net Cash vs. Net Budget
  • Days Receivable
  • Days Payable
  • Months of Cash Remaining
  • Bank Covenant Ratio
When defining your Big Data strategy, make it a company-wide imperative and not just a “special” IT project. In other words, your entire business should use Big Data to either confirm company objectives or refute them.

Five steps that can help accomplish this goal are as follows:

  1. Identify what’s important to the business.
  2. Create goals around company imperatives (step 1).
  3. Prioritize and group goals in order of importance, and focus on them one at a time from top to bottom.
  4. Decide which data sources are best for collecting the information your business needs to measure its chosen goals.
  5. Link the financial value of the company’s data sources to the company’s goals in order to increase the future probabilities of success for measuring those goals.

Always start with the most important goals first!
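The five steps above can be sketched as a prioritized mapping from goals to data sources. Every goal, importance score, and source name below is invented for illustration:

```python
# Steps 1-2: company imperatives and the goals built around them,
# each linked (step 4) to the data sources that will measure it.
goals = [
    {"goal": "reduce churn", "importance": 3, "sources": ["CRM", "support tickets"]},
    {"goal": "grow web leads", "importance": 2, "sources": ["web analytics"]},
    {"goal": "cut cloud spend", "importance": 1, "sources": ["billing exports"]},
]

# Step 3: prioritize, then work top-down, one goal at a time.
goals.sort(key=lambda g: g["importance"], reverse=True)

# Step 5: the current focus is always the top remaining goal and its sources.
top = goals[0]
print(f"Start with: {top['goal']} -> sources: {', '.join(top['sources'])}")
```

The point of the structure is simply that every goal carries its own measurement sources, so "most important first" is a sort, not a debate.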

3 Things You Should Know About Using Big Data for Your Business

To make it easier to figure out how to profit from Big Data, the following summary of what has just been said, plus some extra advice, is broken down into three simple but effective actionable steps.

These 3 steps are as follows:

1. Collecting Data

Figuring out which data to collect and which metrics to analyze was covered in great detail in the section above, but it cannot be emphasized enough.

For any business to reap the full benefits of Big Data it must understand what data it needs and how to use it to increase efficiency. Just collecting data to have a large database is of no use if that information is not relevant to the business whatsoever.

If a business spends a ton of resources collecting unnecessary data, the only result will be higher costs and lower profitability.

2. Extracting Data

After figuring out what data to collect — which data really matters to the business — the next step is the extraction process. This may seem like a daunting task, but it does not have to be if a structured set of guidelines is followed.

The following 9 steps can help any business organize and extract both structured & unstructured data sets for optimal output:

  1. Data Sources: Choose which data sources are absolutely relevant to the metrics you wish to analyze.
  2. Data Presentation: How will the results be used? Figure out a plan for how best to use the data to achieve the desired end result.
  3. Data Storage: The data results and analysis have to be kept in a proper storage unit, such as a cloud-based information store, in order to be put to use. Many factors go into choosing data storage, but the main point is to choose one that can fulfill the requirements of the chosen end result.
  4. Data Lake: Before storing data in a data warehouse, consider storing it in a data lake, where all aspects of the data are left untouched until it is decided how the data will be used. After this decision, the data can be stripped and stored in segmented formats in a larger warehouse.
  5. Storage Preparation: Before the actual storage process, remove unnecessary pieces of information such as stray symbols and extra white space, as these can obscure relevant information and make it harder to locate and retrieve later on.
  6. Data Retrieval: Make full use of natural-language techniques such as part-of-speech tagging and named-entity recognition to easily extract common entity types like “location”, “person”, or “organization”.
  7. Category Analysis: After analyzing the data, choose categories in which to segment related sources and extracted data points, structuring the data for easier accessibility.
  8. Data Classification: Once the data has been stored and categorized, it should then be further segmented with the help of machine-learning software. This will assist in locating close similarities between relevant metrics such as customer behaviors and product trends.
  9. Data Visualization: Once the relevant data has been extracted and analyzed, it should be presented so that concerned parties can easily retrieve it and gain insight from it — for example, in graphical formats that can be displayed on hand-held devices.
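Step 5 (stripping symbols and extra white space before storage) can be sketched with a few standard-library calls. The exact cleaning rules below are an assumption chosen for illustration, not a standard:

```python
import re

def prepare_for_storage(raw: str) -> str:
    """Normalize a raw text field before it is written to storage:
    drop stray symbols, collapse runs of white space, trim the ends."""
    # Keep word characters, white space, and light punctuation; blank the rest.
    no_symbols = re.sub(r"[^\w\s.,@-]", " ", raw)
    # Collapse tabs, newlines, and repeated spaces into single spaces.
    collapsed = re.sub(r"\s+", " ", no_symbols)
    return collapsed.strip()

print(prepare_for_storage("  Order #123 --  ACME\t<sales@acme.com>  "))
# → Order 123 -- ACME sales@acme.com
```

What counts as "unnecessary" depends on the downstream use, so a real pipeline would make the kept-character set a per-field configuration rather than a hard-coded pattern.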

3. Monetizing Data

Scheduling regular meetings to review collected and analyzed data will help in figuring out how to best use such information for monetization purposes. The best way to monetize relevant data is to make sure that it can be implemented into a working strategy or plan that assists in meeting business objectives.


Even though data monetization is an ever-changing and ever-expanding field, particular to each business’s goals and needs during its various business cycles, there are some common monetization uses of Big Data that many businesses employ.

Here are the 8 most common ways businesses monetize their data:

  1. Revenue Leaks: Using Big Data analytics, businesses can identify invoice mistakes and errors, as well as gather information for collection purposes.
  2. Customer Satisfaction: The ability to gauge customer satisfaction through social media analysis and surveys gives businesses the ability to create more brand loyalty by providing their customers what they want, when they want it.
  3. Product Development: Big Data allows organizations to create more products, and thus more revenue streams, by spotting trends and gaps in the market before their competition does.
  4. Fraud Detection: Businesses that use multiple channels and avenues to deliver their products to their target audience can use Big Data to spot price fraud and piracy before they get out of control.
  5. Customer Retention: Information gathered on customer churn can help organizations of all shapes and sizes figure out when customers leave, why they leave, and how best to persuade them to stay.
  6. Profitable Marketing: Customer behavior can be analyzed across multiple data points using Big Data tools, creating better targeting for marketing purposes. The result is a higher marketing ROI.
  7. New Business Models: Companies can use Big Data to discover untapped areas within their market, and even new business ventures, keeping them relevant and on the cutting edge of today’s constantly changing business environment.
  8. Redefining Value: Big Data assists businesses in redefining what value means by gathering, segmenting, and analyzing customer information, product information, and market trends. This helps them provide their customers and prospects with consistent value-based products, services, and information that keep pace with the times.
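The “revenue leaks” item above can be pictured as a toy invoice reconciliation in plain Python. The records and field names are invented for illustration:

```python
# Toy reconciliation: flag invoices whose billed total does not match
# quantity * unit price, or which were delivered but never billed.
invoices = [
    {"id": "INV-1", "qty": 10, "unit_price": 5.0,  "billed": 50.0},
    {"id": "INV-2", "qty": 4,  "unit_price": 25.0, "billed": 90.0},  # under-billed
    {"id": "INV-3", "qty": 7,  "unit_price": 3.0,  "billed": None},  # never billed
]

def find_leaks(records):
    """Return (invoice id, expected amount, billed amount) for each mismatch."""
    leaks = []
    for r in records:
        expected = r["qty"] * r["unit_price"]
        if r["billed"] is None or abs(r["billed"] - expected) > 0.01:
            leaks.append((r["id"], expected, r["billed"]))
    return leaks

for inv_id, expected, billed in find_leaks(invoices):
    print(f"{inv_id}: expected {expected:.2f}, billed {billed}")
```

At real scale the same comparison runs across millions of rows in an analytics store, but the logic — expected versus billed, with a tolerance — is exactly this.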

Can SMBs Have Access to Big Data Analytics Capabilities?

Small and medium businesses usually don’t have the means to run on-site databases with the capacity to store, process, and analyze large amounts of data, which is why Big Data servers hosted in the cloud are a much better option.

Standard servers used by small businesses also lack the right security measures for such large amounts of sensitive information, and they are not built for the heavy collection and processing that turning raw Big Data into a desired outcome demands.

Big Data Cloud Storage & Hosting Benefits

Big Data platforms, which combine cloud storage and dedicated hosting, are specifically designed to deal with unstructured data collected from many different sources in a host of different formats. Extracting such varied pieces of information requires a database that can uncover relationships amid the clutter — a job not fit for the traditional on-site databases most small and midsize businesses use today.

Some of the other major benefits of using online Big Data cloud storage & hosting include:

  • Faster Speed of Processing
  • Increased Security Measures
  • Built-In Analytical Infrastructure
  • Unlimited Storage Space
  • Disaster Recovery (automated data backups)
  • Decreased Costs (no need to build and store data centers on-site to deal with large amounts of unstructured data)
  • Faster Time to Value (management & analytical applications can be built on demand)

Final Words

The digital age has brought many changes to businesses around the world, and Big Data has played one of the major roles in this regard. Big Data analytics makes it possible to accomplish, in real time, feats that would have taken a business months, if not years.

Any business wishing to move forward and thrive in this digital age must learn to use Big Data wisely and utilize storage components that allow ease of access, ease of processing, and ease of collection to maintain data integrity.

Big Data is not going anywhere. As a business owner, if you haven’t already developed a Big Data strategy for your business, you should seriously consider doing so in order to take advantage of all the wonders this technology has to offer.

Jen McKenzie is an independent business consultant from New York. She writes extensively on business, education and human resource topics. When Jennifer is not at her desk working, you can usually find her hiking or taking a road trip with her two dogs. You can reach Jennifer @jenmcknzie