1
INTRODUCTION TO BIG DATA
Unit Structure
1.1 Big Data
1.1.1 Introduction to Big data Platform
1.1.2 Traits of big data
1.1.3 Challenges of conventional systems
1.1.4 Web data
1.1.5 Analytic processes and tools
1.1.6 Analysis vs Reporting
1.1.7 Modern data analytic tools
1.2 Statistical concepts
1.2.1 Sampling distributions
1.2.2 Re-sampling
1.2.3 Statistical Inference
1.2.4 Prediction error
1.3 Data Analysis
1.3.1 Regression modeling
1.4 Analysis of time Series
1.4.1 Linear systems analysis
1.4.2 Nonlinear dynamics
1.4.3 Rule induction
1.5 Neural networks
1.5.1 Learning and Generalization
1.5.2 Competitive Learning
1.5.3 Principal Component Analysis and Neural Networks
1.6 Fuzzy Logic
1.6.1 Extracting Fuzzy Models from Data
1.6.2 Fuzzy Decision Trees
1.6.3 Stochastic Search Methods
Track C Business Intelligence
and Big Data Analytics –II
(Mining Massive Data sets )
1.1 BIG DATA
Big data refers to a collection of huge data sets that include structured,
semi-structured or unstructured data and that cannot be stored and
analyzed by traditional database management systems. The primary source
of big data is the various activities carried out by users on the internet.
The use of the internet is an integral part of our lifestyle, and it is
therefore very common to use various digital platforms on the internet for
day-to-day work. Many people leave a footprint in the form of data through
activities on social media, online shopping websites, online business
transactions, online banking systems, online searching, online education
systems and many others. As a result, data is growing exponentially, and
very advanced technology has emerged to manage this huge amount of data.
1.1.1 Introduction to Big data Platform:
The invention of hand-held digital devices is considered a prime factor in
the growth of internet users. Today, the internet is accessed via
computers, mobile phones, personal digital assistant devices, gaming
stations and digital TV. The Internet is believed to be the
fastest-growing technology.
Figure 1 : Internet usage in 2020
Big data cannot be analyzed by conventional technology, nor can it be
stored by a traditional database management system. The biggest challenge
in working with big data is its exponential growth, which requires very
advanced technology to store it in a form that can be used for analysis.
Various big data platforms enable storing, managing, merging, developing,
deploying, operating and analyzing big data. The big data infrastructure
generally consists of very advanced data storage systems,
high-performance computing servers and big data management technology. A
big data platform normally includes advanced infrastructure that combines
the capabilities of several big data applications, whereas big data
analytics software mainly focuses on supporting analytics for extremely
large data sets. In other words, analytics helps to convert a huge amount
of data into smart data, or high-quality information, which provides
deeper insights for the decision-making process.
Many big data tools are available in the market for big data analytics. A
few are listed here: Apache Hadoop, Cassandra, Datawrapper, MongoDB,
Apache Storm, Tableau, R, CDH (Cloudera Distribution for Hadoop),
Elasticsearch, Kaggle, Hive, Spark, OpenText, Oracle Data Mining, BigML,
CouchDB, Pentaho, Adverity, Xplenty, Apache SAMOA, Lumify, HPCC, Knime,
Talend, RapidMiner, Microsoft Azure, Amazon Web Services, Google
BigQuery, VMware, Google big data, IBM big data, Wavefront, Cloudera
enterprise big data, Oracle Big data analytics, DataTorrent, MapR
converged data platform, Splunk big data analytics, Big object, Opera
solutions signal hub, SAP Big data analytics, Next Pathway, 1010data, GE
industrial internet, SGI big data, Teradata big data analytics, Intel big
data, HP big data, Dell Big data analytics, Cisco big data, Pentaho big
data, Opera solutions big data.
1.1.2 Traits of big data:
Billions of users are connected to the World Wide Web and spend a
significant amount of time on it via mobiles, computers and other devices.
Consequently, large-scale collections of unstructured data exist, and they
keep growing every day. This creates the need for advanced technology that
can support a wide range of data storage, scalable processing and analysis
of this data. In this scenario, big data technologies evolved as a
revolutionary solution to cope with all these challenges.
Big data is defined by five ‘V’ characteristics. The first ‘V’ stands for
the extra-large scale of the data volume. The second ‘V’ stands for the
variety of data, which emphasizes heterogeneous data (structured,
unstructured and semi-structured). The third ‘V’ stands for the velocity
of data, that is, the speed at which data arrives and must be analyzed.
The remaining two are veracity and variability. Figure 2 shows the 5 ‘V’
characteristics of big data.
Figure 2 : 5 ‘V’ characteristics of big data
1 Volume
Big data is defined by five ‘V’ characteristics. The first V stands for
volume: big data is extra-large in scale, and its volume can be measured
in zettabytes.
Unit        Abbreviation   Size
byte        B              8 bits
kilobyte    KB             1,024 bytes (about 10^3 bytes)
megabyte    MB             1,024 KB (about 10^6 bytes)
gigabyte    GB             1,024 MB (about 10^9 bytes)
terabyte    TB             1,024 GB (about 10^12 bytes)
petabyte    PB             1,024 TB (about 10^15 bytes)
exabyte     EB             1,024 PB (about 10^18 bytes)
zettabyte   ZB             1,024 EB (about 10^21 bytes)
yottabyte   YB             1,024 ZB (about 10^24 bytes)
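The unit table can be turned into a small conversion helper. The following
is a minimal sketch in Python; the names UNITS, to_bytes and humanize are
illustrative assumptions, not part of any standard library. It also shows
that the 1,024-based and 1,000-based sizes are close but not equal.

```python
# Unit names in increasing order, as listed in the table above.
UNITS = ["B", "KB", "MB", "GB", "TB", "PB", "EB", "ZB", "YB"]

def to_bytes(value, unit, base=1024):
    """Convert a quantity such as (7, "TB") into a number of bytes."""
    exponent = UNITS.index(unit)
    return value * base ** exponent

def humanize(num_bytes, base=1024):
    """Express a byte count in the largest unit that keeps the value >= 1."""
    value = float(num_bytes)
    for unit in UNITS:
        if value < base or unit == UNITS[-1]:
            return f"{value:.2f} {unit}"
        value /= base

print(humanize(to_bytes(7, "TB")))           # 7.00 TB
print(to_bytes(1, "ZB") / 10**21)            # ratio of binary to decimal zettabyte
```

The ratio printed on the last line is greater than 1, which is why the
table treats the powers of ten only as approximations.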
In real life, millions of users are connected to the World Wide Web and
spend a significant amount of time surfing and on other online activities
with the help of many devices, such as computers, laptops and tablets.
Due to this, data grows at a constant rate, mostly at petabyte scale. The
volume of data was previously measured in terabytes, later in petabytes,
and nowadays it has shifted to zettabytes. Consider some statistics about
today's scenario. Twitter alone has more than 500 million tweets sent
every day and hence generates more than 7 TB of data every day. On
Facebook, approximately 4 petabytes of post- and like-related data are
reported, and it is said to generate 10 TB of data every day. It is also
observed that more than 65 billion messages are sent by people via
WhatsApp. Some online enterprises are also believed to generate terabytes
of data every hour of every day. A new era has begun in the field of
transportation, where 4 TB of data is generated by each connected car. On
the Internet, a huge network of many web servers and web services, 5
billion searches are made from all around the world. These figures give an
idea of how much data we produce, and of how much data will be available
to dig into in the future.
Figure 3 : A day in big data
In other words, a massive amount of data is generated every day and has to
be stored. An organization has to manage storage and processing in real
time, which is the biggest challenge related to big data.
2 Variety
The second ‘V’ stands for the variety of data, which means that big data
can be found as structured, unstructured or semi-structured data. In an
online environment, data comes from different sources and therefore
arrives in a variety of formats. Because today's websites contain text,
media, links and application programs, a variety of data is found in big
data. With conventional data technology, data could be processed only if
it was structured and represented in a two-dimensional table. On the other
hand, the major portion of today's website and social media data consists
of text, images and videos, which are complex and difficult to process.
Text, links, maps, network hierarchies and streaming data are unstructured
and cannot be stored in that 2-D format.
Some data is semi-structured, which means it has more structure than
unstructured data but still cannot be processed with the help of a
relational database. Normally, a tree-like structure such as XML is used
to store semi-structured data. It is also known as the key-value pair
structure. XML and JSON are examples of these kinds of data storage
formats.
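To make the idea of semi-structured data concrete, the sketch below parses
one JSON record and one XML record with Python's standard library. The
records themselves (user name, tags, city) are invented for illustration.
Note that fields can be nested or repeated, which does not fit a fixed
row-and-column table.

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical records describing the same kind of post in two formats.
json_text = '{"user": "asha", "tags": ["sale", "mobile"], "address": {"city": "Mumbai"}}'
xml_text = "<post><user>asha</user><tag>sale</tag><tag>mobile</tag></post>"

# JSON: nested key-value pairs become dictionaries and lists.
record = json.loads(json_text)
city = record["address"]["city"]
tags_json = record["tags"]

# XML: a tree of elements that is navigated by tag name.
root = ET.fromstring(xml_text)
user_xml = root.find("user").text
tags_xml = [t.text for t in root.findall("tag")]

print(city, tags_json, user_xml, tags_xml)
```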
Comparison of structured and unstructured data

Type of data
  Structured data: It is represented as numbers, dates, strings and
  alphanumeric values etc.
  Unstructured data: It may consist of text, images, audio and videos etc.

Storage structure
  Structured data: It can be easily stored in a 2-dimensional structure of
  rows and columns, so it can be stored with Excel or an RDBMS.
  Unstructured data: It can be stored with (NoSQL) non-relational
  structures, Big Table, graph data and many other advanced data
  structures.

Source of data
  Structured data: It is part of major business data stored with ERP
  systems and other MIS systems.
  Unstructured data: It is normally present as a part of online systems
  and web data.

Growth rate
  Structured data: It is increasing at a growth rate of 20-30%.
  Unstructured data: It is increasing at a growth rate of 80-90%.

Analysis process
  Structured data: It is very easy to analyse with an RDBMS and with
  simple algorithms.
  Unstructured data: It is very complicated to preprocess, process and
  analyse. It requires very complex and advanced technology for analysis,
  such as text analysis algorithms, Artificial Intelligence and Neural
  Networks.
Due to all these challenges, many innovations have provided solutions to
process data in various formats such as Big Table, graph data and many
others. NoSQL technology also emerged as a solution to these data
challenges, and it has been adopted by many.
3 Velocity
The third ‘V’ stands for the velocity of the data. Velocity relates to the
speed at which data arrives and has to be stored. Similarly, velocity
relates to how much data is received in a specific period and can be
accommodated in the database. Sometimes velocity is also described as a
measure of the speed at which data moves towards the data repository. For
a conventional system, it is impossible to manage the constant flow of
data that comes from various data streams, such as those connected with
RFID sensors. Moreover, a real-time system must analyze this data in real
time, as the useful life of the data is short. For real-time applications,
batch processing is not a good option, specifically for data streams. A
real-time computing system, which accepts data from many data streams and
computing systems, has to execute queries and identify current trends
based on recent, up-to-date data. The Google Maps traffic analysis system
is this kind of real-time system: it processes a massive amount of current
traffic-related data and provides valuable information in real time.
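The contrast between batch and stream processing can be sketched with a
sliding window that updates a result each time a new reading arrives,
instead of waiting for the whole data set. This is a minimal illustration
only; the class name SlidingAverage and the speed readings are assumptions
and do not describe how any real traffic system works.

```python
from collections import deque

class SlidingAverage:
    """Running average over the most recent `size` readings of a stream."""

    def __init__(self, size):
        self.window = deque(maxlen=size)
        self.total = 0.0

    def add(self, reading):
        # When the window is full, the oldest reading is about to be evicted.
        if len(self.window) == self.window.maxlen:
            self.total -= self.window[0]
        self.window.append(reading)
        self.total += reading
        return self.total / len(self.window)

# Hypothetical vehicle speeds (km/h) arriving one at a time from a sensor.
speeds = [40, 42, 38, 10, 12, 11]
avg = SlidingAverage(size=3)
latest = [avg.add(s) for s in speeds]
print(latest)
```

The average drops as soon as slow readings enter the window, so a slowdown
is visible after a few readings rather than after the next batch run.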
4 Veracity
The next ‘V’ stands for the ‘Veracity’ or ‘Validity’ of the data. Veracity
refers to the trustworthiness and quality of the data used for analysis.
Nowadays, data is available in huge amounts, but its quality is a big
question. Only high-quality data yields meaningful information, and
obtaining such data is a difficult task in an online environment. The
source of the data and its authenticity must be considered at the time of
data preprocessing. Noise, inaccurate data and missing data must be
handled to increase the quality of the data. Validating data is a big
challenge because text data also requires context analysis.
5 Variability
The next ‘V’ stands for the ‘Variability’ or uncertainty of the data.
Variability means that the data changes frequently. Because of the
changing nature of the data, the data processing methods and models also
have to change accordingly. Constant change and innovation in technology
add new things to the Internet, and hence new kinds of data formats and
processing methods become involved. A single general methodology cannot be
applied to all kinds of objects. Consequently, new algorithms and
processing approaches have to be introduced to manage constantly changing
data. Conventional technology focused only on the analysis of historical
data collected over a period of time from the same enterprise system, and
such a system was designed for specific processing requirements concerning
that data only. The Internet and advanced IoT systems can connect many
different systems with different components. This creates a need for
flexible algorithms that work well with a wide variety of data.
1.1.3 Challenges of conventional systems:
Conventional systems are mainly made to manage enterprise-level data and
normally do not focus on gathering data from outside the organization. Due
to this, the data of a conventional system has a predefined and fixed
structure, whereas a big data system has a mix of various kinds of data.
Moreover, the volume of data in a conventional system is limited to
gigabytes or terabytes, whereas a big data system has to store and manage
zettabytes of data with cloud and other advanced data storage systems. A
conventional system can analyse data only with algorithms that are
suitable for structured data. The analysis of structured data may be done
with various functions such as statistical functions and date functions.
Among the statistics software packages on the market today, SPSS is one of
the most commonly used. Statistical methods are most suitable for
quantitative data. In statistics, a wide range of aggregation functions is
available which can be applied to groups. In contrast, such statistical
methods cannot be applied directly to heterogeneous data. Hence, a wide
variety of algorithms is needed to process structured, semi-structured and
unstructured data. The analysis of unstructured data or text data is very
complicated compared to structured data. For example, search engines have
to perform text analysis on web data, which may require keyword
extraction, semantic analysis, similarity matching etc.
Another limitation of the conventional data management system is related
to the storage capacity of data. In a conventional data management system,
data is generated at an hourly or daily rate. The business data can be
stored at a centralized level and shared with all remote devices. The data
has a fixed schema, and it is not possible to change the structure at run
time. The data manipulation functions are predefined, and various data
operations are performed on a regular basis. Consequently, the analysis
process is also designed according to the data. In contrast, big data has
a flexible schema and heterogeneous data. Moreover, big data is generated
at an exponential rate. Due to that, data has to be stored in a flat-file
structure or in a way that can be shared over a wide network. The latest
technological revolution has made this possible with cloud storage and
clustered storage systems. Consequently, the processing method has to
adopt the relevant technology for future analysis. In short, big data
analysis systems should be flexible, scalable and fault tolerant to meet
the needs of the time. They should also allow distributed and parallel
processing to speed up the analysis task.
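One common way to structure distributed, parallel analysis is to split the
data into partitions, process each partition independently (a "map" step)
and then merge the partial results (a "reduce" step). The sketch below
imitates this pattern on a single machine in plain Python; the function
names and the sample sentences are assumptions made for illustration, and
a real cluster framework would run the map step on separate nodes.

```python
from collections import Counter
from functools import reduce

def map_partition(lines):
    """Map step: count the words within one partition of the data."""
    counts = Counter()
    for line in lines:
        counts.update(line.lower().split())
    return counts

def reduce_counts(left, right):
    """Reduce step: merge two partial word counts."""
    return left + right

# Two hypothetical partitions, as if stored on two different machines.
partitions = [
    ["big data needs big storage"],
    ["data arrives fast", "storage must scale"],
]

# Each partition is processed independently of the others.
partials = [map_partition(p) for p in partitions]
totals = reduce(reduce_counts, partials, Counter())
print(totals.most_common(3))
```

Because each partition is handled without reference to the others, adding
more machines lets more partitions be processed at the same time.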
1.1.4 Web data:
The traditional system focuses on data management systems that process
mostly transaction data, such as enterprise resource planning (ERP)
systems and customer relationship management (CRM) systems. The major
source of data for this kind of system is transaction data produced by
various business transactions, which have to be processed via predefined
business methods. On the other side, the Web of data is today's reality
and exists due to the relationships among the data on the internet. Web
data consists of Website data, Domain name data, News data, Web activity
data, Web search data, IP address data, Click Stream data, Sentiment web
data, Web traffic data and Semantic web data. The entire collection of
interrelated data sets on the web is also sometimes referred to as linked
data or the Semantic Web. An example of a Linked Dataset is DBpedia, which
includes Wikipedia data. A significant feature of DBpedia is that it makes
it possible to get the content of Wikipedia in RDF format.
Web analytics is the process of measuring web traffic, web search and web
usage. Many web search engines perform web analysis and help internet
users to search a huge collection of web pages present on the Internet.
The analysis of web data is possible with the use of HTML, XML, RDF, OWL,
SPARQL, etc.
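As a small illustration of measuring web traffic and usage, the sketch
below computes page views, unique visitors and external traffic sources
from a list of click-stream records. The record layout (visitor, page,
referrer) and the values are hypothetical; real click-stream logs carry
many more fields.

```python
from collections import Counter

# Hypothetical click-stream records: (visitor_id, page, referrer)
clicks = [
    ("v1", "/home", "search"),
    ("v2", "/home", "social"),
    ("v1", "/products", "internal"),
    ("v3", "/home", "search"),
    ("v2", "/cart", "internal"),
]

# Web traffic: how many times each page was viewed.
page_views = Counter(page for _, page, _ in clicks)

# Web usage: how many distinct visitors were seen.
unique_visitors = len({visitor for visitor, _, _ in clicks})

# Traffic sources: where visitors came from, ignoring internal navigation.
external = Counter(ref for _, _, ref in clicks if ref != "internal")

print(page_views.most_common(1), unique_visitors, dict(external))
```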
In addition to that, the volume of Web data is constantly increasing.
Along with that, a variety of data sources continuously generate various
kinds of data, which makes web data more complicated and unstructured.
Data on the Internet arises from social media, social networking links,
social media posts, image data, video data, click stream data and many
other activities. Another source of data is surveys, online surveys,
experiments and observations of people. Market survey data, industry
reports, consumer analysis reports, various kinds of business reports and
comparative analysis reports also load tons of different types of data
onto the internet. In this era, due to the presence of GPS and GIS, lots
of location-related data is also generated by mobile devices and other
geospatial systems. Many security systems produce images and videos in
massive amounts with the use of surveillance and other security devices.
Remote sensors, RFID devices, IoT systems and many other real-time
tracking systems also load a massive amount of data. Satellite images and
weather-related data are also an integral part of Internet data.
1.1.5 Analytic processes and tools:
Data analysis is a process that transforms raw data into useful
information. It is used to generate statistics about the data, meaningful
insights and valuable explanations that support data-driven business
decisions. Many software packages and applications perform various data
analysis tasks. It is crucial to choose an appropriate tool from the wide
range of data analytics tools. The selection process may consider many
parameters such as price, robustness, supported data models, learning
curve, scalability, expandability, visualization facility and many others.
Data analysis generally follows well-defined steps. It is very important
to understand the process, along with the data itself, in order to yield
meaningful insights and valuable patterns. Normally, the analytics process
requires the following steps: (1) data collection, (2) data cleaning and
preprocessing, (3) data analysis, (4) visualising the output and (5)
understanding the results.
Data collection: The first step of the data analysis process is to
understand the source of the data, the format of the data and the
collection procedure. Based on all this, the data collection procedure has
to be defined. Nowadays, various data collection tools are also used to
capture data in real time, such as barcode readers, cameras, voice
detecting machines, sensors and automatic weighing machines.
Data cleaning and preprocessing: It is essential to conduct a data
cleaning process to convert raw data into high-quality data. The data
cleaning process may include the removal of duplicate data, the removal of
outliers and the removal of errors. Sometimes, it is also essential to
identify and fill the gaps between data collected from different sources
in order to integrate them into a single database. The data cleaning
process may be carried out manually or by using automated data cleaning
tools. Along with the data cleaning process, it is also essential to
conduct an exploratory analysis of the data. This step helps to understand
the characteristics of the data and the relationships among them.
Sometimes it is essential to find the existing correlation in the data in
order to establish a hypothesis.
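The cleaning operations described above can be sketched in a few lines of
Python. The records, the choice of the median as the fill value and the
"10 times the median" outlier rule are all assumptions chosen for
illustration; real projects pick these rules based on the data and the
domain.

```python
from statistics import median

# Hypothetical raw records with a missing value, a duplicate and an outlier.
raw = [
    {"id": 1, "amount": 120.0},
    {"id": 2, "amount": None},      # missing value
    {"id": 2, "amount": None},      # duplicate record
    {"id": 3, "amount": 95.0},
    {"id": 4, "amount": 110.0},
    {"id": 5, "amount": 9000.0},    # suspicious outlier
]

# 1. Remove duplicates by id, keeping the first occurrence.
seen, unique = set(), []
for row in raw:
    if row["id"] not in seen:
        seen.add(row["id"])
        unique.append(dict(row))

# 2. Fill missing amounts with the median of the known amounts.
known = [r["amount"] for r in unique if r["amount"] is not None]
fill = median(known)
for r in unique:
    if r["amount"] is None:
        r["amount"] = fill

# 3. Drop outliers: values more than 10x the median (arbitrary threshold).
cleaned = [r for r in unique if r["amount"] <= 10 * fill]
print(cleaned)
```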
Data analysis: A data analytic process mainly depends on the goal of the
process and the availability of the data. Many statistical techniques are
used for analysis; a few are listed here: univariate or bivariate
analysis, regression analysis, time series analysis, descriptive analysis
and predictive analysis.
Descriptive analysis identifies the underlying relationships among the
data. This kind of analysis may help to summarize the data, to describe
the data and to determine the next processing step to be carried out.
Predictive analysis helps to estimate future values based on historical
data. This kind of analysis may help to predict market sales based on the
previous year's sales data.
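The difference between the two kinds of analysis can be shown with a toy
sales series. The sketch below computes an average (descriptive) and then
fits a straight line by ordinary least squares to extrapolate one year
ahead (predictive). The sales figures are invented, and a straight-line
trend is only one of many possible predictive models.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical yearly sales figures.
years = [2018, 2019, 2020, 2021]
sales = [100.0, 110.0, 120.0, 130.0]

# Descriptive: summarise what has already happened.
average_sales = sum(sales) / len(sales)

# Predictive: extend the fitted trend to the next year.
slope, intercept = fit_line(years, sales)
forecast_2022 = slope * 2022 + intercept
print(average_sales, forecast_2022)
```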
Visualising the output: Data visualization is as important as the data
analytics process itself. The output of the analysis process must be
clearly presented and understandable. Sometimes data visualization tools
are used to increase the readability of the data, especially when the
volume of data is very large. Google Charts, Infogram and Tableau are
well-known examples of data visualization tools.
Understanding the results: Understanding the final output is a crucial
step. For instance, the output may be misleading or erroneous for several
reasons. In this situation, it is essential to identify the reason behind
it and to determine the correct approach.
1.1.6 Analysis Vs Reporting:
In this digital era, a wealth of information comes into existence thanks
to modern analytics technology. Analysis and reporting are both valuable
for this purpose. The goal of the analysis process is to inspect the data
and transform it into useful information. The goal of the reporting
process is to transform the output of the analytic process into a
presentable format. The main purpose of the analysis process is examining,
interpreting, comparing and predicting the data, whereas the reporting
process mainly focuses on highlighting, organizing, summarizing and
formatting. Sometimes, the visualization of output may be enhanced with
the use of charts, maps, graphs and linking of data.
1.1.7 Modern data analytic tools:
Big data analytics takes the large quantities of data that are generated
and gathered from various sources and converts them into meaningful
information. There are many big data tools, and some of them are in high
demand among data scientists. Some vital big data tools are the following:
No.  Tool              Benefits
1    R                 The R programming language is a common choice of
                       many data scientists today. R is free and available
                       under an open-source license. R is available for
                       different types of hardware and software, e.g.
                       Windows, Unix systems and the Mac. The most
                       attractive feature of R is its extensibility and the
                       integration of a rich library of packages.
2    Python            Python is a very powerful, open-source and
                       easy-to-learn language. It offers statistical and
                       mathematical functions. A few famous libraries are
                       NumPy, SciPy, etc. It is a high-level language with
                       high readability and object-oriented programming
                       functionality.
3    PIG and HIVE      Hadoop includes a distributed file system (HDFS)
                       that allows the storage of data in a distributed
                       manner. The ecosystem consists of many tools. Hadoop
                       MapReduce facilitates the processing of large
                       volumes of data in a parallel and distributed
                       manner. HIVE and PIG are also an integral part of
                       the Hadoop ecosystem. They facilitate processing and
                       analysis. More specifically, HIVE is a data
                       warehouse with HiveQL, which is the query language
                       for large datasets stored in HDFS. PIG runs on a
                       Hadoop cluster and processes and analyzes large
                       datasets using a scripting language.
4    Tableau           Tableau is a very easy-to-learn data visualization
                       tool that converts numeric and textual data into
                       beautiful visuals. It is user friendly, mobile
                       friendly, simple, yet fast. Anyone without knowledge
                       of coding can also use Tableau.
5    Jupyter Notebook  Jupyter Notebook is a free, open-source, web-based
                       data analytics tool. It supports 40+ programming
                       languages, so it is known as a multi-language
                       computing environment. It allows the use of Python's
                       wide variety of packages and visualization tools.
6    Google Data       Google Data Studio is a free data analytics tool
     Studio            that can automatically integrate with other Google
                       applications such as Google Analytics, Google Ads,
                       Google Sheets and Google BigQuery.
1.2 STATISTICAL CONCEPTS
1.2.1 Sampling distributions
In statistics, a sample is an analytic subset of a larger population.
Samples allow researchers to conduct their studies with more manageable
data and in a timely manner. Random samples do not have much bias if they
are large enough, but achieving such a sample may be expensive and time
consuming. In simple random sampling, every entity in the population has
an equal chance of being selected.
What is a Sampling Distribution?
A sampling distribution is a probability distribution of a statistic
obtained from a large number of samples drawn from a specific population.
It is the distribution of frequencies of a range of different outcomes
that could possibly occur for a statistic of a population.
A population may refer to an entire group of people, objects, events,
hospital visits, or measurements. A population can thus be said to be an
aggregate observation of subjects grouped together by a common feature.
A sampling distribution is the distribution of a statistic that is arrived
at through repeated sampling from a larger population.
It describes the range of possible outcomes of a statistic, such as the
mean or mode of some variable, as it truly exists in a population.
The majority of data analyzed by researchers are actually drawn from
samples and not populations.
Understanding Sampling Distribution
The huge amounts of data drawn and used by academicians, statisticians,
researchers, marketers, analysts, etc. are actually samples, not
populations. Consider this example: a medical researcher who wants to
compare the average weight of all babies born in North America from 1995
to 2005 with that of babies born in South America within the same time
period cannot, within a reasonable amount of time, draw the data for the
entire population of over a million childbirths that occurred over the
ten-year time frame. He will instead use the weight of, say, 100 babies in
each continent to draw a conclusion. The weight of the 200 babies used is
the sample, and the average weight calculated is the sample mean.
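A sampling distribution can be approximated by simulation: draw many
samples of the same size from a population and record the mean of each.
The sketch below does this for an invented population of birth weights
(mean 3.3 kg, standard deviation 0.5 kg are assumed values, not real
data) and compares the spread of the sample means with the textbook value
sigma / sqrt(n).

```python
import random
from statistics import mean, stdev

rng = random.Random(0)

# Hypothetical population of 100,000 birth weights in kilograms.
population = [rng.gauss(3.3, 0.5) for _ in range(100_000)]

def sample_mean(n):
    """Mean of one simple random sample of size n (without replacement)."""
    return mean(rng.sample(population, n))

# Repeating the sampling approximates the sampling distribution of the mean.
means = [sample_mean(100) for _ in range(1_000)]

pop_mean = mean(population)
spread_of_means = stdev(means)                  # empirical standard error
expected = stdev(population) / 100 ** 0.5       # sigma / sqrt(n)
print(round(pop_mean, 3), round(mean(means), 3))
print(round(spread_of_means, 3), round(expected, 3))
```

The sample means cluster around the population mean, and their spread is
much smaller than the spread of individual weights.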
A Few Definitions
A sample is a subset of the population.
A population is a collection of all the elements of interest.
The sampled population is the population from which the sample is drawn.
An element is the entity on which data are collected.
A frame is a list of the elements that the sample will be selected from.
1.2.2 Re-sampling
Once we have a data sample, it can be used to estimate the population
parameter. The problem is that we only have a single estimate of the
population parameter. One way to address this is by estimating the
population parameter multiple times from our data sample. This is called
re-sampling.
Statistical re-sampling methods are procedures that describe how to
economically use available data to estimate a population parameter. The
result can be both a more accurate estimate of the parameter (such as
taking the mean of the estimates) and a quantification of the uncertainty
of the estimate (such as adding a confidence interval).
Two commonly used re-sampling methods that you may encounter are k-fold
cross-validation and the bootstrap.
Bootstrap. Samples are drawn from the dataset with replacement (allowing
the same observation to appear more than once in the sample), and the
instances not drawn into the data sample may be used for the test set.
k-fold Cross Validation. A dataset is partitioned into k groups, where
each group is given the opportunity of being used as a held-out test set,
leaving the remaining groups as the training set.
The k-fold cross-validation method specifically lends itself to use in the
evaluation of predictive models that are repeatedly trained on one subset
of the data and evaluated on a second held-out subset of the data.
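The partitioning step of k-fold cross-validation can be sketched as
follows. The helper k_fold_indices is a hypothetical name written for this
illustration (libraries such as scikit-learn provide their own
equivalents); it only produces the index splits and leaves model training
and scoring to the caller.

```python
import random

def k_fold_indices(n_items, k, seed=0):
    """Yield (train_idx, test_idx) pairs; every item is tested exactly once."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds of (almost) equal size.
    folds = [indices[i::k] for i in range(k)]
    for i, test_idx in enumerate(folds):
        train_idx = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield train_idx, test_idx

splits = list(k_fold_indices(n_items=10, k=5))
for train_idx, test_idx in splits:
    # A model would be fitted on train_idx and scored on test_idx here.
    print(sorted(test_idx), len(train_idx))
```

The k scores obtained this way are usually averaged to give a single
estimate of model performance.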
Generally, re-sampling techniques for estimating model performance operate
similarly. Re-sampling methods are easy to learn and easy to apply: they
require no mathematics beyond introductory high-school algebra, whereas
specialized statistical methods may require deep technical skill in order
to select and interpret. They are also applicable in an exceptionally
broad range of subject areas.
A downside of the methods is that they can be computationally very
expensive, requiring tens, hundreds, or even thousands of re-samples in
order to develop a robust estimate of the population parameter.
The key idea is to resample from the original data, either directly or via
a fitted model, to create replicate datasets, from which the variability
of the quantities of interest can be assessed without long-winded and
error-prone analytical calculation. Because this approach involves
repeating the original data analysis procedure with many replicate sets of
data, these are sometimes called computer-intensive methods.
Each new subsample from the original data sample is used to estimate the
population parameter. The sample of estimated population parameters can
then be considered with statistical tools in order to quantify the
expected value and variance, providing measures of the uncertainty of the
estimate. Statistical sampling methods can be used in the selection of a
subsample from the original sample.
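The bootstrap version of this idea can be sketched in a few lines: draw
many resamples with replacement, compute the statistic on each, and
summarize the resulting estimates. The ten data values, the number of
resamples and the use of a simple percentile interval are assumptions made
for illustration.

```python
import random
from statistics import mean, stdev

def bootstrap_means(data, n_resamples=2000, seed=0):
    """Means of resamples drawn with replacement, each the size of `data`."""
    rng = random.Random(seed)
    n = len(data)
    return [mean(rng.choices(data, k=n)) for _ in range(n_resamples)]

# Hypothetical sample of ten measurements.
data = [3.1, 2.9, 3.6, 3.3, 3.8, 2.7, 3.4, 3.0, 3.5, 3.2]
estimates = sorted(bootstrap_means(data))

point_estimate = mean(estimates)     # expected value of the estimates
standard_error = stdev(estimates)    # uncertainty of the estimate

# Percentile interval: the middle 95% of the bootstrap estimates.
lower = estimates[int(0.025 * len(estimates))]
upper = estimates[int(0.975 * len(estimates)) - 1]
print(round(point_estimate, 2), round(standard_error, 2))
print(round(lower, 2), round(upper, 2))
```

Increasing n_resamples makes the summary more stable at the cost of more
computation, which is the trade-off the text describes.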
A key difference is that the process must be repeated multiple times. The
problem with this is that there will be some relationship between the
samples, as observations will be shared across multiple subsamples. This
means that the subsamples and the estimated population parameters are not
strictly independent and identically distributed. This has implications
for statistical tests performed on the sample of estimated population
parameters downstream, i.e. paired statistical tests may be required.
A subset of the samples can be used to fit a model, and the remaining
samples are used to estimate the efficacy of the model. This process is
repeated multiple times and the results are aggregated and summarized. The
difference between techniques usually lies in the method by which
subsamples are chosen.
1.2.3 Statistical Inference
Statistical inference makes propositions about a population, using data
drawn from the population with some form of sampling. Given a hypothesis
about a population, for which we wish to draw inferences, statistical
inference consists of selecting a statistical model of the process that
generates the data and deducing propositions from the model.
"The majority of the problems in statistical inference can be considered to
be problems related to statistical modelling". Sir David Cox has said,
"How [the] translation from subject -matter problem to statistical model is
done is often the most critical part of an analysis".
The conclusion of a statistical inference is a statistical proposition. Some
common forms of statistical proposition are the following:
a point estimate, i.e. a particular value that best approximates some
parameter of interest;
an interval estimate, e.g. a confidence interval (or set estimate), i.e. an
interval constructed using a dataset drawn from a population so that,
under repeated sampling of such datasets, such intervals would contain
the true parameter value with the probability at the stated confidence
level;
a credible interval, i.e. a set of values containing, for example, 95% of
posterior belief;
rejection of a hypothesis;
clustering or classification of data points into groups.
Models and assumptions
Any statistical inference requires some assumptions. A statistical model
is a set of assumptions concerning the generation of the observed data and
similar data. Descriptions of statistical models usually emphasize the role
of population quantities of interest, about which we wish to draw
inference. Descriptive statistics are typically used as a preliminary step
before more formal inferences are drawn.
Paradigms for inference
Different schools of statistical inference have become established. These
schools or "paradigms" are not mutually exclusive, and methods that work
well under one paradigm often have attractive interpretations under other
paradigms.
There are four paradigms:
(i) Classical statistics or error statistics,
(ii) Bayesian statistics,
(iii) Likelihood-based statistics and
(iv) Akaike information criterion based statistics.
The practice of statistics falls broadly into two categories:
(1) Descriptive or
(2) Inferential.
When we are just describing or exploring the observed sample data, we
are doing descriptive statistics. However, we are often also interested in
understanding something that is unobserved in the wider population.
This could be the average blood pressure in a population of pregnant
women, for example, or the true effect of a drug on pregnancy rate, or
whether a new treatment performs better or worse than the standard
treatment. In these situations we have to recognise that we almost always
observe only one sample or do one experiment. If we took another
sample or did another experiment, the result would almost certainly
vary. This means that there is uncertainty in our result: if we based our
conclusion solely on the observed sample data, then another sample or
another experiment might even lead us to draw a different
conclusion!
The purpose of statistical inference is to estimate this sample-to-sample
variation or uncertainty. Understanding how much our results may
differ if we did the study again, or how uncertain our findings are, allows
us to take this uncertainty into account when drawing conclusions. It
allows us to provide a plausible range of values for the true value of
something in the population, such as the mean or the size of an effect, and it
allows us to make statements about whether our study provides evidence
to reject a hypothesis.
Estimating uncertainty:
Almost all of the statistical methods you will come across are based
on the sampling distribution. This is a completely abstract concept. It is the
theoretical distribution of a sample statistic, such as the sample mean, over
infinitely many independent random samples. We typically only do one
experiment or one study, and certainly don't replicate a study so many
times that we could empirically observe the sampling distribution. It is
thus a theoretical concept. However, we can estimate what the sampling
distribution looks like for our sample statistic or point estimate of interest
based on only one sample, one experiment or one study. The spread of
the sampling distribution is captured by its standard deviation, just as the
spread of a sample distribution is captured by its standard deviation.
Do not confuse the sample distribution with the sampling
distribution: one is the distribution of the individual observations that
we observe or measure, and the other is the theoretical distribution of
the sample statistic, which we don't observe.
Likewise, we should not confuse the standard deviation of the sample
distribution with the standard deviation of the sampling distribution. We
call the standard deviation of the sampling distribution the standard error.
This is useful because the standard deviation of the sampling distribution
captures the error due to sampling. It is thus a measure of the precision of
the point estimate or, put another way, a measure of the uncertainty of our
estimate. Since we often want to draw conclusions about something in a
population based on only one study, understanding how our sample
statistics may vary from sample to sample, as captured by the standard
error, is also really useful. The standard error allows us to try to answer
questions such as: what is a plausible range of values for the mean in this
population, given the mean that I have observed in this particular sample?
The standard error is thus integral to statistical inference: it is used for
all of the hypothesis tests and confidence intervals that you are likely to
ever come across.
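As a worked illustration of the standard error, the sketch below estimates it from a single sample and uses it to build an approximate 95% confidence interval for the mean. It assumes the normal approximation (multiplier 1.96); for small samples a t-based multiplier would give a wider interval. The function name and the data values are hypothetical.

```python
import math
import statistics

def mean_confidence_interval(sample, z=1.96):
    """Return (mean, standard error, (low, high)) using a normal approximation."""
    n = len(sample)
    mean = statistics.mean(sample)
    sd = statistics.stdev(sample)    # spread of the sample distribution
    se = sd / math.sqrt(n)           # spread of the sampling distribution of the mean
    return mean, se, (mean - z * se, mean + z * se)

# Illustrative measurements (assumed values).
observations = [2, 4, 4, 4, 5, 5, 7, 9]
mean, se, (low, high) = mean_confidence_interval(observations)
print(f"mean={mean}, standard error={se:.3f}, interval=({low:.2f}, {high:.2f})")
```

Note how the standard deviation describes the individual observations, while the standard error, which shrinks as the sample size grows, describes the precision of the estimated mean.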
1.2.4 Prediction error
A prediction error is the failure of some expected event to occur. When
predictions fail, humans can use metacognitive functions, examining prior
predictions and failures and deciding, for example, whether there are
correlations and trends, such as consistently being unable to foresee
outcomes accurately in particular situations. Applying that type of
knowledge can inform decisions and improve the quality of future
predictions.
Predictive analytics software processes new and historical data to forecast
activity, behavior and trends. The programs apply statistical analysis
techniques, analytical queries and machine learning algorithms to data sets
to create predictive models that quantify the likelihood of a particular
event happening.
Errors are an inescapable element of predictive analytics that should also
be quantified and presented along with any model, often in the form of a
confidence interval that indicates how accurate its predictions are expected
to be. Analysis of prediction errors from similar or previous models can
help determine confidence intervals.
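One simple way to quantify prediction errors is to compare predicted values with the values actually observed. The sketch below computes two common summaries, the mean absolute error (MAE) and the root mean squared error (RMSE). The choice of these two metrics and the function name `prediction_errors` are assumptions made for illustration; the text does not prescribe a particular metric.

```python
import math

def prediction_errors(actual, predicted):
    """Return (MAE, RMSE) for paired actual and predicted values."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    residuals = [a - p for a, p in zip(actual, predicted)]
    mae = sum(abs(r) for r in residuals) / len(residuals)
    rmse = math.sqrt(sum(r * r for r in residuals) / len(residuals))
    return mae, rmse

# Illustrative values (assumed).
mae, rmse = prediction_errors(actual=[3, 5, 7], predicted=[2, 5, 9])
print(f"MAE={mae:.3f}, RMSE={rmse:.3f}")
```

RMSE penalizes large misses more heavily than MAE, so reporting both gives a fuller picture of how a model is expected to err.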
In artificial intelligence (AI), the analysis of prediction errors can help
guide machine learning (ML), similarly to the way it does for human
learning. In reinforcement learning, for example, an agent might use the
goal of minimizing error feedback as a way to improve. Prediction errors,
in that case, might be assigned a negative value and predicted outcomes a
positive value, in which case the AI would be programmed to attempt to
maximize its score. That approach to ML, sometimes known as error-driven
learning, seeks to stimulate learning by approximating the human
drive for mastery.
1.3 DATA ANALYSIS
Regression analysis is a set of statistical processes for estimating the
relationships between a dependent variable and one or more independent
variables. The most common form of regression analysis is linear
regression, in which one finds the line that most closely fits the data
according to a specific mathematical criterion.
For example, the method of ordinary least squares computes the unique
line that minimizes the sum of squared differences between the true data
and that line. For specific mathematical reasons, this allows the researcher
to estimate the conditional expectation of the dependent variable when the
independent variables take on a given set of values. Less common forms
of regression use slightly different procedures to estimate alternative
location parameters or estimate the conditional expectation across a
broader collection of non-linear models.
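For a single independent variable, the ordinary least squares line has a closed form: the slope is the sum of cross-deviations divided by the sum of squared deviations of x, and the intercept makes the line pass through the point of means. The sketch below applies that formula; the function name `fit_line` and the data are illustrative assumptions.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-deviations and sum of squared deviations of x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Illustrative data lying exactly on y = 1 + 2x (assumed values).
slope, intercept = fit_line([1, 2, 3, 4], [3, 5, 7, 9])
predicted_at_5 = intercept + slope * 5   # conditional expectation at x = 5
print(slope, intercept, predicted_at_5)
```

With noisy data the same formula returns the line that minimizes the sum of squared vertical distances to the points.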
Regression analysis is primarily used for two conceptually distinct
purposes.
First, regression analysis is widely used for prediction and forecasting,
where its use has substantial overlap with the field of machine learning.
Second, in some situations regression analysis can be used to infer causal
relationships between the independent and dependent variables.
Importantly, regressions by themselves only reveal relationships between
a dependent variable and a collection of independent variables in a fixed
dataset. To use regressions for prediction or to infer causal relationships,
respectively, a researcher must carefully justify why existing relationships
have predictive power for a new context or why a relationship between
two variables has a causal interpretation.
1.3.1 Regression modeling
Regression is a form of machine learning where we try to predict a
continuous value based on some variables. It is a form of supervised
learning where a model is taught using some features from existing data.
From the existing data the regression model then builds its knowledge
base. Based on this knowledge base the model can later make predictions
for outcomes on new data.
Continuous values are numerical or quantitative values that have to be
predicted and are not from an existing set of labels or categories. There are
lots of examples where regression is heavily used on a daily basis, and
in many cases it has a direct business impact.
Types of Regression Models:
Linear Regression
Logistic Regression
Polynomial Regression
Stepwise Regression
Ridge Regression
Lasso Regression
1.4 ANALYSIS OF TIME SERIES
1.4.1 Linear systems analysis
The CEO of a car manufacturing company is interested in knowing the
approximate sale of cars for the next 2 years. An airline company is eager to
know how many passengers are likely to travel on its flights in the
next 2 months. A manufacturer of perishable sweet items would want to
know how much demand there will be for the next 2 weeks. The head of a supply
chain company wants to know what petrol and diesel prices will be
for the next 2 days. The CFO of an IT company is interested in knowing stock
prices for the next 2 hours.
For everybody in such senior positions, taking decisions is of utmost
importance, and the only resource they have is historical data. Time
series analysis is a technique with which one can forecast the future
based on historical data. In all such scenarios, one can use historical data
and apply time series analysis to the data to create a model which can aid
in getting some idea about the future. It is important to note that the historical
data has to be time-dependent (collected as a function of time).
A univariate time series is one where data is collected for only
one variable at periodic time instances over a period, whereas a
multivariate time series is one in which data for multiple variables is
collected for a certain time period. Recording temperature values every
hour for a week is an example of a univariate time series, whereas
recording temperature, pressure and humidity every hour for a week is an
example of a multivariate time series.
Data collected for the time series can be linear or non-linear. Linear data,
when plotted in the form of a graph, will be sequential in nature: any data
point will be connected to only two other data points, the previous and the
next. On the other hand, non-linear data, when plotted in the form of a graph,
will not result in a straight line.
A. Components of time series data
Any time series may have some inherent properties or components: Trend,
Seasonality, Cyclicity and Irregularity.
(1) Trend is an important component of any time series, which is the
result of the overall long-term effect of environmental factors. A trend may
show a rising or declining effect over a period.
Figure 4.1: Trend component of time series
As seen in graph A, in figure 4.1, there has been an overall increase in the
sale of air conditioners and an overall decrease in the sale of kerosene for
cooking purposes.
(2) Seasonality is the short-term movement in data due to seasonal
factors. E.g. a notable increase in the sale of warm clothes during the
winter season, or a sudden increase in the sale of washing machines
during the rainy season, can be attributed to seasonal fluctuation.
Figure 4.2: Seasonality component of time series
(3) Cyclicity is a pattern observed when the data is collected for a very
long duration, say 40-50 years. This pattern repeats over a period, but the
gap between two occurrences may not be fixed. E.g. recessions occur
time and again, but it is difficult to predict the next occurrence.
Figure 4.3: Cyclicity component of time series
(4) Irregularities / random components are the sudden changes in data
which are unlikely to be repeated. Such a sudden change in data cannot be
predicted from other components like trend, seasonality or cyclicity. These
variations are mostly accidental in nature and may result in changes in
trend, seasonality and cyclicity in the forthcoming period. Natural
calamities are an example of events which may cause irregularities in data. E.g.
the Covid pandemic resulted in a steep increase in the sale of electronic
gadgets such as tablets, laptops and cell phones on account of online
lectures from schools and colleges.
Figure 4.4: Irregularity component of time series
There are certain situations in which data does not change with respect to time;
time series analysis is not applicable to such situations. E.g. if the
average rainfall over 3-4 decades is approximately the same, then
it implies that the time factor has not affected the rainfall, or one can conclude
that rainfall is independent of time. There is no point in applying time
series analysis to such situations.
B. Types of analysis on time series data
Time series analysis can be categorized into Descriptive, Diagnostic,
Predictive and Prescriptive analysis.
Descriptive analysis gives an idea about what happened in the past. It helps
in interpreting the patterns followed by the data. It can be represented
in the form of data visualizations like graphs, charts, dashboards etc.
Variations in the data can be tracked with the help of descriptive analysis.
Diagnostic analysis is like an extension of descriptive analysis, which
helps in explaining the reasons behind variations in the data. This is
often referred to as root-cause analysis. Techniques like data discovery,
data mining and drilling down into data come in handy for this purpose.
Predictive analysis tries to generate a model based on the historical data.
The model captures the basic pattern and trends of the data. The same
model is then applied to predict the future. E.g. based on the sale of
apartments in a city for the last 50 years, a model can help predict the same
for the next 5 years.
Prescriptive analysis takes predictive analysis a step further and helps to
decide what action should be taken. E.g. if a certain level of demand is
predicted for electric vehicles for next year, then production
planning can be prepared accordingly by a company.
It is a prerequisite for many time series forecasting methods that the data is
stationary, i.e. that its mean, variance and covariance do not change
over time. If systematic components like trend, seasonality or cyclicity
are present in the data, it is considered non-stationary. It is necessary to
smoothen such data before it is used for further forecasting. Mean, variance
and covariance values help in deciding whether the data is stationary or
non-stationary. In practice, data with only minor seasonal variation and no
trend component is often treated as approximately stationary.
To illustrate stationary data further, consider plotting
blood pressure against time. It may have minor seasonal variation but
definitely no trend. It will never continuously increase or decrease with
time. Plotted in the form of a graph, blood pressure values will look like a
flat line with no slope. In some medical conditions, there can be
irregularities as well: there can be sudden spikes in blood pressure, and
medical practitioners are definitely interested in finding the root cause
behind such spikes and removing them.
Figure 4.5: Stationary Vs Non-stationary data
Smoothing of data (i.e. converting non-stationary data to stationary) can
be achieved by applying a moving average to the data. The moving average
technique removes the randomness in the data. Consider figure 4.6: the
graph represents monthly sales figures for 3 consecutive years. Though
there is an overall increase in sales, there are variations in between.
Figure 4.6: Monthly sales figures (Stationary and non -stationary)
Apply a 4-period moving average, MV4: take the first 4 data values and
calculate their average, then take the 2nd, 3rd, 4th and 5th data values and
calculate their average, then the 3rd to 6th, and so on. Next, calculate the
centered moving average of every 2 consecutive MV4 values to further smoothen
the rough edges. Plot the line graph of the centered moving average (instead of
the actual data values) against the time points of data collection, i.e. every
month of the 3 years.
The graph will then look as shown in figure 4.6. Except for the last part, the
graph is much smoother. A decomposition procedure helps in understanding
the trend and seasonality factors in a time series. De-trending and removing the
seasonal effect, followed by a step to identify the irregularity-causing factors in
the original data, can prepare the data for applying forecasting models.
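The moving average steps described above can be sketched as follows. The helper names are hypothetical, and the centered moving average is taken here to be the average of each pair of consecutive 4-period averages, which is one common reading of the procedure.

```python
def moving_average(values, window):
    """Average of every run of `window` consecutive values."""
    return [
        sum(values[i:i + window]) / window
        for i in range(len(values) - window + 1)
    ]

def centered_moving_average(values, window=4):
    """Average each pair of consecutive `window`-period moving averages."""
    first_pass = moving_average(values, window)
    return moving_average(first_pass, 2)

# Illustrative monthly figures (assumed values).
sales = [1, 2, 3, 4, 5, 6]
mv4 = moving_average(sales, 4)          # averages of values 1-4, 2-5, 3-6
cma = centered_moving_average(sales)    # smoother series, shorter at both ends
print(mv4, cma)
```

Each pass shortens the series, so a window of 4 followed by centering loses two points at the start and two at the end of the original data.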
The next important task is to forecast based on historical non-stationary data.
Tools built on programming languages like R and Python can be used
for forecasting. Mathematical models can also be used for this
purpose; 2 such models are widely used:
a) Additive model:
Xt = Trend + Seasonal + Irregular
At a party, a cook assumes that on average people will eat 2 rotis and
prepares the food accordingly. But if some people are hungry, maybe
they will eat one extra roti. So one who normally eats 1 roti will eat 2,
and one who normally eats 2 will eat 3. This is 1 extra over the normal
situation, irrespective of what the original number is. In such a case, the
additive model is used for forecasting.
b) Multiplicative model:
Xt = Trend * Seasonal * Irregular
When there is an increase in product prices, it is usually in percentage terms.
E.g. if the price of laptops increases by 5% over the previous year, a certain
model of laptop which cost 50,000 Rs. the previous year will now cost 52,500 Rs.
One which cost 70,000 Rs. the previous year will now cost 73,500 Rs.
So the increase in cost is not fixed but is in terms of percentage, and in such
scenarios the multiplicative model is best suited.
So, we can summarize that the additive model is useful when the seasonal
variation is relatively constant over time.
The multiplicative model is useful when the seasonal variation increases
over time, in proportion to the level of the series.
In the additive model, the seasonal and irregular factors are isolated by
subtracting the centered moving average from the actual values; in the
multiplicative model, the actual values are divided by the centered moving average.
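The difference between the two models can be seen by replaying the roti and laptop examples in code. This is only a sketch: the function names are hypothetical, the irregular component is set to its neutral value (0 for the additive model, 1 for the multiplicative model), and the "seasonal" factor stands in for the extra roti or the 5% price rise.

```python
def additive(trend, seasonal, irregular=0.0):
    """Additive model: Xt = Trend + Seasonal + Irregular."""
    return trend + seasonal + irregular

def multiplicative(trend, seasonal, irregular=1.0):
    """Multiplicative model: Xt = Trend * Seasonal * Irregular."""
    return trend * seasonal * irregular

# Roti example: everyone eats one extra, whatever the normal amount is.
rotis = [additive(normal, 1) for normal in (1, 2)]

# Laptop example: every price rises by the same percentage.
prices = [round(multiplicative(price, 1.05)) for price in (50_000, 70_000)]

print(rotis, prices)
```

In the additive case the adjustment is the same absolute amount at every level, while in the multiplicative case the absolute increase (2,500 versus 3,500 Rs.) grows with the level of the series.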
Exponential smoothing is a feature available in Excel worksheets which
takes care of this entire process. After applying exponential smoothing, the
graph will show actual as well as predicted values of sales, which can be
further extrapolated for forecasting. As we can observe in figure 4.7, the
actual and forecasted values closely match each other. Hence, we can say
that this model is accurate and can be used for forecasting future sales
values. Other data smoothing techniques, like random walk and simple
exponential smoothing, are also available. Once smoothing is done, we
need to right-click on the line chart and add the equation and R2 value to the
graph. After adding a trend line to the graph, one can also forecast for the
future. R2 helps in indicating how good the model is for prediction.
Figure 4.7: Sales values (actual vs. predicted) along with forecast line.
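Outside a spreadsheet, simple exponential smoothing takes only a few lines. The sketch below is a generic version of the technique and makes no claim to reproduce Excel's implementation; the smoothing constant `alpha` and the data values are assumptions chosen for illustration.

```python
def exponential_smoothing(values, alpha):
    """Simple exponential smoothing: s_t = alpha * x_t + (1 - alpha) * s_(t-1)."""
    if not 0 < alpha <= 1:
        raise ValueError("alpha must be in the interval (0, 1]")
    smoothed = [values[0]]                 # start from the first observation
    for x in values[1:]:
        smoothed.append(alpha * x + (1 - alpha) * smoothed[-1])
    return smoothed

# Illustrative sales figures (assumed values).
smoothed_sales = exponential_smoothing([10, 20, 30], alpha=0.5)
print(smoothed_sales)
```

A larger `alpha` makes the smoothed series follow recent observations more closely, while a smaller `alpha` gives a smoother but slower-reacting series.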
1.4.2 Nonlinear dynamics:
In the case of non-linear data, data points are connected to each other in
multiple ways. As shown in figure 4.8, the number of data points and the
degree of their connections with each other may vary. Further, elements can
also be heterogeneous in nature. This is also called the topology of the
system. Consider a pendulum moving with a certain initial state and velocity:
it will follow a certain pattern of movements.
Figure 4.8: Non -linear data
But if the pendulum is bent in between, it will be controlled by 2
equilibriums. The resultant motion will become non-linear. Hypothetically,
imagine the earth is also controlled by another planet, which exerts its
own gravitational force on the earth: the entire structure of the earth’s
orbit will change and may look like 2 connected ovals. This will again
result in non-linear motion. If data for non-linear motion is plotted as a
graph, instead of a sequential straight line, it will look curved, more like a
quadratic equation. Such a change in movement is called chaos. Chaos
theory studies the behavior of dynamical systems that are sensitive to initial
conditions (referred to as the butterfly effect). The motion of each of 2
pendulums, recorded in isolation, is predictable. But when combined, they
reveal non-linear behavior.
Two sound waves, perfectly out of sync with each other, rather than
adding to each other, will cancel each other out. Many human beings
working in tandem may synergize, giving an overall output much higher than
the sum of individual outputs. Non-linear systems may shift to a whole new
regime, even if there is only a small change in the input conditions. Such a
change is called a phase transition.
For a quick comparison between linear and non-linear time series data:
linear data will reveal a straight line when plotted in a graph, whereas
non-linear data will generate a curved graph. Linear data, when
presented as an equation, will be a first degree equation, whereas non-linear
data will be of higher degree, e.g. a quadratic equation. It is crucial to find
out whether data is linear or non-linear before deciding the techniques to use
for forecasting. When represented graphically, non-linear time series data will
generally take one of the following shapes:
Figure 4.9: Graphical representation of non -linear time series data
Structural breaks (outside forces that may cause a sudden and permanent
change in the pattern of the data) play a vital role in the study of non-linear
time series data. Identifying the presence of structural breaks, estimating
their timings and studying the behavior of data before, during and after the
breaks are all necessary while dealing with non-linear data.
The Brock-Dechert-Scheinkman test (denoted as the BDS test) is the most
widely used test for detecting non-linearity of the data. The BDS test gets
its name from its original authors William Brock, Davis Dechert and Jose
Scheinkman, who developed it in 1987. It is generally used indirectly, to test
the alternative hypothesis of non-linearity. The BDS test uses the correlation
function (also called the correlation integral) as the test statistic. In the
case of non-linear data, which is time dependent, the BDS test checks the
dependence of data points in the space where the points are plotted. Naturally,
unlike in linear data, a data point in non-linear data may depend on more than
2 other data points. This is referred to as a spatial dependence check.
The ARIMA (Auto Regressive Integrated Moving Average) model is used
for forecasting time series data whose non-stationarity can be removed by
differencing. An ARIMA model is denoted as
ARIMA (p, d, q), where p is the number of lag observations included in
the model, also called the lag order; d is the number of times that the raw
observations are differenced, also called the degree of differencing; and q
is the size of the moving average window, also called the order of the moving
average. The steps in ARIMA are:
1. Model identification. Use plots and summary statistics to identify
trends, seasonality, and autoregression elements to get an idea of the
amount of differencing and the size of the lag that will be required.
2. Parameter estimation. Use a fitting procedure to find the coefficients of
the regression model.
3. Model checking. Use plots and statistical tests of the residual errors to
determine the amount and type of temporal structure not captured by the
model. The process is repeated until a desirable level of fit is
achieved on the in-sample or out-of-sample observations (e.g. training or
test datasets).
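The role of the differencing parameter d can be illustrated without fitting a full ARIMA model. The sketch below only performs the differencing step used during model identification; the helper name `difference` and the data are assumptions, and estimating the p and q terms would in practice be left to a statistics library.

```python
def difference(series, d=1):
    """Apply first-order differencing d times: x_t - x_(t-1)."""
    for _ in range(d):
        series = [later - earlier for earlier, later in zip(series, series[1:])]
    return series

# Illustrative series with a curved, upward trend (assumed values).
raw = [1, 4, 9, 16, 25]
once = difference(raw, d=1)    # still trending upwards
twice = difference(raw, d=2)   # constant, so no trend remains
print(once, twice)
```

Here two rounds of differencing are needed before the trend disappears, which would suggest d = 2 for this series. Each round also shortens the series by one observation.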
ARIMA includes both autoregression and moving average features. It
needs at least 50, and typically about 100, records to build a proper model.
ARIMA models tend to be unstable, both with respect to changes in
observations and changes in model specification. Because of the large data
requirements and the lack of a convenient updating procedure, ARIMA
becomes a high-cost model.
1.4.3 Rule induction:
Rule induction is the process of deriving if-then rules as a part of data
mining. Rules are the most popular symbolic representation of knowledge.
Rules are not only simple but also natural and in a human-understandable
form. Such decision rules help in discovering inherent relationships
within data sets as well as in using them for business. Consider an
example: if it is 8 pm on Saturday, then there will be a lot of rush in the
restaurants. Predictions from such rules are based on everyday
observation over a long duration. Rules are easier to understand than decision
trees. Consider a scenario which has more than 30-35 decision situations.
A decision tree built on such decision points will not only be a very
large diagram but will be difficult to understand as well. Hence, decision
rules are often preferred over decision trees or other techniques for
classification.
Such rules can be extracted from a decision tree. Rules consist of
attribute-value pairs which can be traced from the root of a decision tree to a
particular node. These rules are mutually exclusive (without conflict /
overlap) and exhaustive (covering all possible scenarios of decision
making).
For deciding the income tax to be paid by a person, the following rules can be
followed (the given example is totally hypothetical and for academic
purposes only).
If a person is a senior citizen and earning in slab 1, then no income tax.
If a person is a senior citizen and earning in slab 2, then 5% income tax.
If a person is salaried, not a senior citizen, earning in slab 1 and gender is Male, then 5% income tax.
If a person is salaried, not a senior citizen, earning in slab 2 and gender is Male, then 10% income tax.
If a person is salaried, not a senior citizen, earning in slab 1 and gender is Female, then no income tax.
If a person is salaried, not a senior citizen, earning in slab 2 and gender is Female, then 5% income tax.
If a person is a business person, not a senior citizen, earning in slab 1 and gender is Male, then 15% income tax.
If a person is a business person, not a senior citizen, earning in slab 1 and gender is Female, then 10% income tax.
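If-then rules of this kind translate directly into code. The sketch below encodes the hypothetical tax rules as a lookup and returns `None` when no rule covers a case. It assumes that the second senior-citizen rule refers to slab 2, since two rules with identical conditions could not be mutually exclusive; the function and key names are illustrative only.

```python
def income_tax_rate(senior, occupation, slab, gender):
    """Return the tax rate in percent, or None if no rule covers the case."""
    if senior:
        # Assumption: slab 1 -> no tax, slab 2 -> 5% for senior citizens.
        return {1: 0, 2: 5}.get(slab)
    rules = {
        ("salaried", 1, "male"): 5,
        ("salaried", 2, "male"): 10,
        ("salaried", 1, "female"): 0,
        ("salaried", 2, "female"): 5,
        ("business", 1, "male"): 15,
        ("business", 1, "female"): 10,
    }
    return rules.get((occupation, slab, gender))

print(income_tax_rate(False, "salaried", 2, "male"))   # covered by a rule
print(income_tax_rate(False, "business", 2, "male"))   # not covered by any rule
```

Because each combination of attribute values appears at most once as a key, the rules cannot conflict. The `None` result for a business person in slab 2 shows that this particular rule set is not exhaustive as listed.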
a. Rule Induction algorithms:
Apart from inducing rules from decision trees, certain algorithms can also
be used for the rule induction process. Training data can be used for deriving
rules. Generally, one rule is learnt at a time by the process of machine
learning. To obtain more rules, further iterations are carried out on the
dataset, one for every new rule.
i. Learn one rule:
This algorithm follows a greedy search technique, where it searches for a rule
which has high accuracy but possibly low coverage, classifying
positive examples of a given class. The strength of this algorithm lies in its
ability to create relations amongst the given attributes under test and cover
the maximum number of data elements for these attributes. Consider a situation
wherein the decision to play a cricket match is based on certain
parameters such as weather, rains, cloudiness, light intensity, temperature,
nature of the grass on the playground and soil quality. Based on the possible
alternatives for all these parameters, the final ruleset is designed.
E.g. rule number 1 can be: if the quality of soil is good, there is no grass on
the ground, light intensity is good, and there is no cloudiness and no rain, the
match will be played.
Another rule can be: if there are heavy rains, then even if there is no grass
and soil quality is good, the match will not be played.
ii. Sequential covering:
This is a widely used algorithm for rule-based classification, for learning
a disjunctive set of rules. In this algorithm, one rule is
discovered using learn one rule. After that, all the data covered by this rule
is removed. Then the same process is repeated in a sequential manner for all
other rules.
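The interplay of learn one rule and sequential covering can be sketched as below. This is a simplified illustration, not a reference implementation: it assumes categorical attributes with string values, scores candidate conditions by plain precision on the covered examples (real learners often use other metrics), and all function names and the cricket data are hypothetical.

```python
def covers(rule, features):
    """A rule is a dict of attribute -> required value."""
    return all(features.get(attr) == value for attr, value in rule.items())

def precision(rule, data):
    """Fraction of examples covered by the rule that are positive."""
    covered = [label for features, label in data if covers(rule, features)]
    return sum(covered) / len(covered) if covered else 0.0

def learn_one_rule(data):
    """Greedily add the attribute test that most improves precision."""
    rule = {}
    while precision(rule, data) < 1.0:
        candidates = {
            (attr, value)
            for features, _ in data if covers(rule, features)
            for attr, value in features.items() if attr not in rule
        }
        if not candidates:
            break
        best = max(sorted(candidates),
                   key=lambda c: precision({**rule, c[0]: c[1]}, data))
        if precision({**rule, best[0]: best[1]}, data) <= precision(rule, data):
            break  # no candidate improves the rule any further
        rule[best[0]] = best[1]
    return rule

def sequential_covering(data):
    """Learn rules one at a time, removing the examples each rule covers."""
    rules, remaining = [], list(data)
    while any(label for _, label in remaining):
        rule = learn_one_rule(remaining)
        if precision(rule, remaining) < 1.0:
            break  # could not find a rule free of negative examples
        rules.append(rule)
        remaining = [(f, l) for f, l in remaining if not covers(rule, f)]
    return rules

# Illustrative data: True means the match is played (assumed values).
matches = [
    ({"rain": "none", "light": "good"}, True),
    ({"rain": "none", "light": "poor"}, False),
    ({"rain": "heavy", "light": "good"}, False),
    ({"rain": "light", "light": "good"}, True),
    ({"rain": "heavy", "light": "poor"}, False),
]
learned = sequential_covering(matches)
print(learned)
```

Each learned rule predicts "match will be played", and together the rules form a disjunction: an example is classified as positive if any one of them covers it.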
iii. FOIL:
First Order Inductive Learning is a rule-based algorithm which is a
natural extension of the Sequential Covering and Learn One Rule
algorithms. FOIL uses the concept of inductive logic, which involves
analyzing and understanding evidence and then using it for prediction.
For example, given the evidence that 80% of youth go to
movies on weekends, and the fact that A is a youth, one can predict that
A will go and watch a movie on a weekend. The algorithm works
iteratively, forming new rules, and for every new rule, all positive and
negative examples already covered are eliminated.
iv. AQ:
Algorithm Quasi-Optimal is a powerful machine learning methodology
aimed at learning symbolic decision rules from a set of examples and
counterexamples (negative examples). AQ starts by assigning classes
(labels) to the input data, so it can be treated as a supervised algorithm. AQ
involves 4 major steps: data preparation, rule learning, postprocessing and
optional testing. AQ is used in two ways, for Theory Formation (TF) and
Pattern Discovery (PD). AQ segregates all ambiguous data / events into 4
categories: Positive, where all ambiguous data is gathered into a class;
Negative, where ambiguous data is eliminated; Eliminate, where
ambiguous data is not used further; and Majority, where ambiguous data is
labeled with the class in which it mostly appears. Further, the algorithm
selects only the most relevant attributes. This avoids unnecessary rule
formation in a highly noisy situation. In the beginning, a general rule is
formed by comparing with positive and negative examples, and the
process is repeated, refining previous rules.
v. CN2:
The CN2 algorithm works best in a noisy environment. It is a classification
technique for inducing simple if-then rules to predict the class to which data
related to an event belongs. There is an inbuilt process for removing empty
columns, removing instances with unknown target values and imputing
missing values with mean values. Two algorithms which are part of CN2, a search
algorithm (which finds the best rules) and a control algorithm (which applies
the criteria for deciding the best rules), work in tandem to induce
rules, as an ordered or unordered set.
vi. RIPPER:
It stands for Repeated Incremental Pruning to Produce Error Reduction.
The RIPPER algorithm is a rule-based classification algorithm. It derives
a set of rules from the training set. It is a widely used rule induction
algorithm. The RIPPER algorithm is used when the dataset is imbalanced
(unequal number of data elements in different classes). For
imbalanced datasets, this algorithm selects the majority class as the default
class. The remaining classes are then considered in order of increasing
frequency: records belonging to the class under consideration are treated as
positive examples and records of all other classes as negative examples.
The Sequential Covering algorithm is used to generate the rules that
discriminate between positive and negative examples. Then RIPPER considers the
next class for deriving the rules. It starts with an empty rule and then keeps
adding the best conjunct (conditions connected by AND) to the antecedent (the
If part). All such conjuncts are evaluated by a metric. When the rule no longer
covers any negative examples, the algorithm stops adding conjuncts.
Once a rule is derived, all positive and negative examples covered by the
rule are eliminated and the rule is added to the rule set.
The accuracy of such a rule induction system can be calculated as the
number of data elements correctly classified by a rule divided by the
total number of data elements covered by that rule. It is possible that
more than one rule applies when uncovering such hidden relationships
in the dataset. In such a case, the rules are prioritized depending on the
requirement. Such prioritization avoids conflicts while triggering the
rules.
b. Conflict resolution techniques:
To avoid multiple rules being triggered at the same time, or a conflict
between rules about the class to which a record belongs, the following
conflict resolution techniques are used.
i. Size ordering – The rule with the maximum number of attributes is
given the highest priority.
ii. Class based ordering – Rules for the most frequent class are given
priority.
iii. Rule based ordering – Rules are arranged into one long priority list
based on some measure of rule quality such as accuracy, coverage
and experts’ opinion.
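The accuracy measure and two of the ordering schemes above can be sketched in a few lines of Python. This is only an illustrative sketch: the `Rule` class, the toy records and the three rules are hypothetical and do not come from any particular rule induction library.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional

Record = Dict[str, str]

@dataclass
class Rule:
    condition: Callable[[Record], bool]  # antecedent (If part)
    label: str                           # consequent (Then part)
    n_attributes: int                    # attributes tested by the antecedent

def accuracy(rule: Rule, data: List[Record]) -> float:
    """Correctly classified records / all records covered by the rule."""
    covered = [r for r in data if rule.condition(r)]
    if not covered:
        return 0.0
    return sum(r["label"] == rule.label for r in covered) / len(covered)

def classify(record: Record, rules: List[Rule], key) -> Optional[str]:
    """Fire the first matching rule after sorting by a priority key."""
    for rule in sorted(rules, key=key, reverse=True):
        if rule.condition(record):
            return rule.label
    return None

# Hypothetical training records.
data = [
    {"age": "youth", "day": "weekend", "label": "movie"},
    {"age": "youth", "day": "weekend", "label": "movie"},
    {"age": "youth", "day": "weekday", "label": "no_movie"},
    {"age": "senior", "day": "weekend", "label": "no_movie"},
    {"age": "senior", "day": "weekday", "label": "no_movie"},
]
rules = [
    Rule(lambda r: r["age"] == "youth" and r["day"] == "weekend", "movie", 2),
    Rule(lambda r: r["day"] == "weekend", "movie", 1),
    Rule(lambda r: r["age"] == "senior", "no_movie", 1),
]

new_record = {"age": "senior", "day": "weekend"}  # triggers two rules
by_size = classify(new_record, rules, key=lambda r: r.n_attributes)
by_quality = classify(new_record, rules, key=lambda r: accuracy(r, data))
```

The new record triggers two conflicting rules, and the two orderings resolve the conflict differently, which is why the choice of technique should follow the requirement.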
1.5. NEURAL NETWORKS:
A neural network (NN) is a computational model which captures and
represents complex input-output relationships. The main motivation for
the development of neural network technology was the idea of building an
artificial system which can perform "intelligent" tasks similar to those
of the human brain. NNs reflect the behavior of the human brain, allowing
computer programs to recognize patterns and solve common artificial
intelligence problems. NNs are also known as Artificial Neural Networks
(ANNs). An NN has many layers, which fall into three categories: an input
layer, one or more hidden layers, and an output layer. A node is also
known as an artificial neuron. Each node connects to other nodes and has
an associated weight and a specific threshold. If the output of a node is
above the threshold value, the node is activated and sends data to the
next layer of the network; otherwise no data is passed on.
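The threshold behaviour of a node can be illustrated with a minimal sketch. The weights, thresholds and the two-layer layout below are hypothetical values chosen only for illustration; real networks normally use smooth activation functions and learn their weights from data.

```python
from typing import List, Tuple

Node = Tuple[List[float], float]  # (weights, threshold)

def node_fires(inputs: List[float], weights: List[float], threshold: float) -> bool:
    """A node is activated when its weighted sum exceeds its threshold."""
    weighted_sum = sum(x * w for x, w in zip(inputs, weights))
    return weighted_sum > threshold

def layer_forward(inputs: List[float], layer: List[Node]) -> List[float]:
    """Each activated node sends 1.0 to the next layer, the others send 0.0."""
    return [1.0 if node_fires(inputs, w, t) else 0.0 for w, t in layer]

# Hypothetical network: 2 inputs -> 2 hidden nodes -> 1 output node.
hidden_layer = [([0.5, 0.5], 0.7), ([1.0, -1.0], 0.2)]
output_layer = [([1.0, 1.0], 0.5)]

hidden = layer_forward([1.0, 1.0], hidden_layer)
output = layer_forward(hidden, output_layer)
```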
NNs rely on training data to learn and improve their accuracy over time.
To get better accuracy, the learning algorithm needs to be tuned. Tasks in
speech recognition and image recognition can be completed within minutes,
whereas manual identification by human experts takes several hours.
Google’s search algorithm is one of the most well-known examples of NNs.
Face recognition and character recognition are not the only problems that
NNs can solve. NNs have been successfully applied to a wide spectrum of
data-intensive applications such as:
Fraud Detection - Detect fake credit card transactions and
automatically refuse such charges.
Process Modeling and Controlling - Create an NN model of a physical
plant for the best automation.
Machine Diagnostics - Detect the failure of a machine and automatically
shut down the machine systems when this problem occurs.
Targeted Marketing and Survey – Get the highest response rate for a
particular marketing campaign.
Quality Control and Maintenance – Identify product defects
based on the recorded data.
Portfolio Management - Allocate the assets in a portfolio for
maximum return with minimum risk.
Medical Diagnosis Application - Help doctors with their diagnosis by
analyzing image data such as MRIs & X-rays.
Financial Forecasting & Credit Rating – Forecast finances from the
available data and calculate the credit rating based on current
financial conditions.
Military Application - Target Recognition - Determine whether any
enemy target is present in the given data.
1.5.1 Learning and Generalization:
Generalization is a key concept in NN training. It specifies how good our
model is at learning from the provided data and applying the learnt
information to other data. When we train an NN, we use some of the data
to train the model and reserve the rest for checking the performance of
the model. Generalization can be explained with an example.
Suppose we are training an NN which should decide whether a given image
is of a dog or not. We have some pictures of dogs, each dog belonging to a
certain breed and having different features such as color, stripes, height
and many more. We have a total of 12 pictures of dogs. We will use 10
pictures for training and the remaining 2 for checking the accuracy of
the model.
Now imagine showing these pictures to a person: we train them with 10
breeds of dogs and, after training, ask the person to identify the dogs in
the testing data. Hopefully the person will answer correctly, since 10
breeds should be enough to understand and identify the unique features of
a dog. This kind of learning is called generalization: learning from some
data and correctly applying the gained knowledge to other data.
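The 10/2 division of the 12 pictures can be sketched as a simple hold-out split. The file names, the `holdout_split` helper and the `accuracy` helper are hypothetical and serve only to show the idea of keeping test data away from training.

```python
import random
from typing import List, Sequence, Tuple

def holdout_split(items: Sequence[str], n_train: int, seed: int = 0) -> Tuple[List[str], List[str]]:
    """Shuffle the items and reserve everything after n_train for testing."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    return shuffled[:n_train], shuffled[n_train:]

def accuracy(predictions: Sequence[str], labels: Sequence[str]) -> float:
    """Fraction of held-out examples that were predicted correctly."""
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)

# Hypothetical file names standing in for the 12 dog pictures.
pictures = [f"dog_{i:02d}.jpg" for i in range(1, 13)]
train_set, test_set = holdout_split(pictures, n_train=10)
```

A model generalizes well when its accuracy on `test_set`, which it never saw during training, is close to its accuracy on `train_set`.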
1.5.2 Competitive Learning:
Competitive learning is a specific form of unsupervised learning in NNs.
This type of learning is done without the supervision of a teacher; it is
an independent learning process. During the training of an NN under
unsupervised learning, similar input vectors are combined to form
clusters. When a new input pattern is applied, the NN gives a response
indicating the class to which the input pattern belongs. There is no
feedback from the environment as to what the desired output should be or
whether the generated result is correct or incorrect. In this type of
learning, the network itself must discover the patterns in the data. This
is the basis of the Competitive Learning Network.
A competitive network is like a single-layer feed-forward network with
feedback connections between the outputs. The connections between the
outputs are of the inhibitory type (shown by dotted lines), which means
the competitors never support one another. The competition takes place
between the output nodes during training. The output unit that has the
highest activation for a given input pattern is declared the winner, and
its weights are moved closer to the input pattern, whereas the rest of
the neurons remain unchanged. This strategy is called winner-take-all,
since only the winning neuron is updated and the others remain as they
are.
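A single winner-take-all training step can be sketched as follows. The dot-product activation, the learning rate of 0.5 and the starting weights are assumptions made for illustration, not values from the text.

```python
from typing import List

def activation(weights: List[float], pattern: List[float]) -> float:
    """Activation of one output node: dot product of weights and input."""
    return sum(w * x for w, x in zip(weights, pattern))

def train_step(weight_rows: List[List[float]], pattern: List[float], lr: float = 0.5) -> int:
    """Pick the node with the highest activation and move only its weights
    a fraction lr of the way towards the input pattern."""
    winner = max(range(len(weight_rows)),
                 key=lambda j: activation(weight_rows[j], pattern))
    weight_rows[winner] = [w + lr * (x - w)
                           for w, x in zip(weight_rows[winner], pattern)]
    return winner

# Two output nodes with hypothetical starting weights.
weights = [[1.0, 0.0], [0.0, 1.0]]
winner = train_step(weights, pattern=[0.9, 0.1])
```

After the step, the winning row has moved towards the pattern while the losing row is untouched, so repeated steps pull each node towards one cluster of inputs.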
1.5.3 Principal Component Analysis and Neural Networks:
Principal Component Analysis (PCA) is an unsupervised learning method
which is generally used to reduce the dimensionality of large datasets,
simplifying a dataset by transforming a large set of variables into a
smaller one while trying to retain most of the information of the
original dataset. PCA reduces data by geometrically projecting it onto
lower dimensions, which are called Principal Components (PCs).
The purpose of this method is to find the best summary of our data using
the least number of principal components. Choosing a principal component
minimizes the distance between the original data and its projected values
on that component; minimizing this distance is equivalent to maximizing
the variance of the projected points. The same is repeated for all the
other principal components.
The basic idea of PCA is to preserve maximal variance of a data set with
a minimal set of linear descriptors. High-dimensional datasets are
projected onto a smaller number of dimensions, maximizing the variance on
the new axes. PCA is a very important statistical analysis tool, and
many researchers are therefore working to improve the algorithm for
better performance and better data interpretation.
Let’s take an example. Suppose we have a training set of 250 images
of “person wearing glasses” and “person not wearing glasses”, with 4096
features per image. If we apply an NN directly to this dataset, training
takes a huge amount of time. If we pre-process the data using PCA, the
dimensions of the dataset reduce from the original (250, 4096) to
(250, 250), so the time required to train the NN on the resulting dataset
reduces drastically without a huge loss in accuracy.
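The projection idea can be sketched in plain Python for the first principal component only. This sketch uses power iteration on the covariance matrix, which is one of several ways to find the direction of maximal variance; the toy data and the helper name are hypothetical, and the method assumes the all-ones start vector is not orthogonal to that direction.

```python
import math
from typing import List, Tuple

def first_principal_component(rows: List[List[float]],
                              iterations: int = 100) -> Tuple[List[float], List[float]]:
    """Return the direction of maximal variance and the 1-D projections."""
    n, d = len(rows), len(rows[0])
    # Center every feature on its mean.
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix (d x d).
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    # Power iteration converges to the eigenvector with the largest eigenvalue.
    v = [1.0] * d
    for _ in range(iterations):
        v = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in v))
        v = [x / norm for x in v]
    # Project each centered row onto the component: d features become 1.
    projections = [sum(c[j] * v[j] for j in range(d)) for c in centered]
    return v, projections

# Hypothetical 2-feature data lying along the line y = x.
data = [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0], [4.0, 4.0]]
component, reduced = first_principal_component(data)
```

In practice a library routine would be used to obtain all components at once; the sketch only shows that each row is replaced by its coordinate along the new axis.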
1.6. FUZZY LOGIC:
The term “Fuzzy Logic” refers to logic that deals with data which is
imprecise or vague. This concept was first introduced in 1965 by Lotfi A.
Zadeh, a Berkeley professor of electronics and computer science, who was
basically a mathematician. He is also called the “Father of Fuzzy Logic”.
He realized that legacy applications were not capable of handling
imprecise data and mainly focused on handling precise (Boolean) data such
as True / False in the form of 1 and 0. In the real world, one more often
has to process unclear data than clear data.
To further elaborate, consider the question “Is the car moving?”. We have
only two answers, YES or NO. It is pretty simple to handle such
situations and transform them into software systems. But imagine a
situation where we are developing an autopilot application for a car.
The software is expected to decide whether to apply the brakes or push
the accelerator based on the existing speed of the car, i.e. whether it
is moving slow or fast. Let’s assume 40 km/hr is the threshold, below
which the car is considered slow and above which it is considered fast.
If a car is moving at 10 km/hr, it is definitely slow, and pushing the
accelerator is the appropriate decision. On the other hand, if a car is
moving at 60 km/hr, it is definitely fast, and applying the brakes is
advisable. But think of a situation where the speed is 39.5 km/hr. As per
traditional logic, the accelerator should be pushed, and as soon as the
speed becomes 40.5 km/hr, the brakes should be applied. In this way, the
car will keep speeding up and suddenly slowing down, and the person
inside the car will keep experiencing continuous jerks.
The only solution for such a situation is to treat the speed of the car
as continuous, imprecise data rather than as a fixed, precise value. Slow
speed can be anywhere between 0 and 40; depending on how close the speed
is to the threshold value, we can say that it is extremely slow, very
slow, a little slow, slow, a little fast, very fast or extremely fast.
Fuzzy logic helps in accepting such continuous data and taking actions
based on such input.
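The difference between the crisp threshold and a fuzzy treatment of speed can be sketched as below. The linear "slow" membership that is fully slow at 20 km/hr and fully fast at 60 km/hr is an assumption chosen for illustration; only the 40 km/hr threshold comes from the example.

```python
def crisp_action(speed: float, threshold: float = 40.0) -> str:
    """Traditional logic: the decision flips abruptly at the threshold."""
    return "accelerate" if speed < threshold else "brake"

def slow_degree(speed: float, full_slow: float = 20.0, full_fast: float = 60.0) -> float:
    """Degree (0 to 1) to which a speed counts as slow (hypothetical shape)."""
    if speed <= full_slow:
        return 1.0
    if speed >= full_fast:
        return 0.0
    return (full_fast - speed) / (full_fast - full_slow)

def fast_degree(speed: float) -> float:
    """Degree to which a speed counts as fast: the complement of slow."""
    return 1.0 - slow_degree(speed)

# Around the threshold the crisp decision flips, while the fuzzy degree
# of slowness changes only slightly.
for speed in (39.5, 40.5):
    print(speed, crisp_action(speed), round(slow_degree(speed), 4))
```

A controller driven by these degrees can change the throttle gradually instead of alternating between full acceleration and braking.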
Figure 6.1 Extracting fuzzy models (rules) from data
a. Architecture of fuzzy logic based software systems
A software system based on fuzzy logic mainly has 4 components: a
Fuzzifier module, a rule base, an inference engine and a De-Fuzzifier
module.
Figure 6.2: Architecture of Fuzzy Logic System
1. Fuzzifier module: A fuzzifier module accepts inputs in the form of
crisp values. These values are then converted to fuzzy data by applying a
membership function. E.g. consider an answer to the question “Is it hot
today?” The respondent, depending upon his / her perception, may answer
differently: very hot / extremely hot / hot / slightly hot / not at all
hot. Instead of a plain YES or NO binary answer, there are varied answers.
Such answers are treated as LP (large positive), MP (medium positive),
S (small), MN (medium negative) and LN (large negative). Such an input is
more human-like and realistic.
Figure 6.3: Fuzzification rules
2. Rule base: This is a collection of rules which are applied to
the fuzzy input received from the fuzzifier module. The rules are in the
form of if-then conditions and respective actions, designed by experts.
Such a set of rules can be further updated to fine-tune the system.
3. Inference engine: This is the most important component of the
system. Based on the inputs received from the Fuzzifier module and the
rule base, the inference engine is responsible for making decisions.
After matching fuzzy inputs and selecting appropriate rules, the inference
engine determines which rules are to be applied for developing control
actions.
4. De-Fuzzification module: This is responsible for converting the output
of the inference engine to crisp values and presenting them to the user.
Further, the user can choose the best option to reduce the error.
De-Fuzzification methods – The Lambda-Cut method, maxima method,
weighted-sum method and centroid method are used for converting fuzzy
values to crisp values, which are in a human-understandable form.
Let’s consider the illustration of designing a fuzzy logic system for a
smart air conditioner. The system can detect temperature through a
thermometer. This crisp value is taken as input for the fuzzy system. The
Fuzzifier module, using a membership function, converts it into a fuzzy
data set. From these fuzzy values, combined with the if-then rule base,
the inference engine generates an output, which is again fuzzy. Using
defuzzifying techniques, the output is converted back into a crisp value,
based on which the air conditioner automatically adjusts its setting.
A fuzzy logic based system is one which can treat the input as a set of
limited approximate values instead of precise values. All the values are
nothing but a matter of degree. Knowledge is interpreted as a collection
of constraints on a set of variables. Any logical system can be converted
into a fuzzy logic based system.
b. Membership function:
A membership function is one which helps in transforming crisp values
to fuzzy sets. It was first put forth by Lotfi A. Zadeh. Such a function
helps in representing all the data in a fuzzy set (both discrete and
continuous). It helps in handling real-world problems with the help of
experts. It is possible to have one or many fuzzy rules with one or many
antecedents and consequents. Consider an If-Then rule: part I (the If
part) is called the antecedent or premise and part II (the Then part) is
called the consequent. A rule may have only 1 antecedent and 1 consequent,
or, for example, 2 antecedents and 1 consequent.
Rules for defining fuzzy values are also fuzzy. In a similar way, it is
possible to have multiple antecedents and multiple rules.
Here is an example of multiple rules with multiple antecedents.
Rule 1: If x is A and y is B Then z is C
Rule 2: If x is A1 or y is B1 Then z is C1.
Parts I and II in the above rules indicate antecedents I and II, whereas
part III indicates the consequent. The consequents of multiple rules in a
rule base can be aggregated to generate the defuzzified output.
The methods of assigning membership values are as follows: 1. Intuition
2. Inference 3. Rank Ordering 4. Angular Fuzzy Sets 5. Neural Networks
6. Genetic Algorithm 7. Inductive Reasoning
One can represent a membership function with the help of a graph. A
membership value always ranges between 0 and 1.
Figure 6.4: Graphical representation of membership function
Membership values lie between 0 and 1. The values with full membership,
i.e. 1, are called core values. The values with non-zero membership are
called support values. The values which have membership greater than zero
but less than full membership are called boundary values. As listed above,
membership values can be assigned based on the intuition of experts,
through inference, by rank ordering, angular fuzzy sets, neural networks,
genetic algorithms, and inductive reasoning.
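The core, support and boundary regions can be illustrated with a trapezoidal membership function. The trapezoid shape and its corner points (10, 20, 30, 40) are hypothetical choices; any membership function that reaches 1 somewhere would serve.

```python
def trapezoid(x: float, a: float, b: float, c: float, d: float) -> float:
    """Membership rises from a to b, stays at 1 from b to c, falls from c to d."""
    if x <= a or x >= d:
        return 0.0
    if b <= x <= c:
        return 1.0
    if x < b:
        return (x - a) / (b - a)
    return (d - x) / (d - c)

def region(mu: float) -> str:
    """Name the region a membership value falls in.
    The support is the union of the core and the boundary."""
    if mu == 1.0:
        return "core"
    if mu > 0.0:
        return "boundary"
    return "outside support"

for x in (5.0, 15.0, 25.0):
    mu = trapezoid(x, 10.0, 20.0, 30.0, 40.0)
    print(x, mu, region(mu))
```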
1.6.1 Extracting Fuzzy Models from Data
A fuzzy rule consists of an antecedent (also called the hypothesis) and a
consequent (also known as the conclusion). Multiple antecedents are
possible in a rule and there can be many rules in a given scenario. Such
an expression with antecedents and consequents, optionally joined by the
conjunctive/disjunctive operators AND and OR, is called an If-Then rule.
A set of such rules is also called the canonical form of a rule base.
When the antecedents are conjunctive in nature, i.e. joined by AND,
the aggregated output is the intersection of all membership values. In
this case, all the conditions joined with AND must be jointly satisfied.
This is called a conjunctive system of rules. It can be represented
mathematically as
$$\mu_x(x) = \min\bigl(\mu_{x_1}(x_1), \mu_{x_2}(x_2), \ldots, \mu_{x_n}(x_n)\bigr)$$
On the other hand, if the antecedents are disjunctive in nature, i.e.
joined by OR, the aggregated output is the union of all membership values.
In this case, at least one of the conditions joined with OR must be
satisfied. This is called a disjunctive system of rules. It can be
mathematically represented as
$$\mu_x(x) = \max\bigl(\mu_{x_1}(x_1), \mu_{x_2}(x_2), \ldots, \mu_{x_n}(x_n)\bigr)$$
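As a small illustration of the two formulas, the sketch below evaluates an AND rule and an OR rule on the same pair of antecedent membership values. The values 0.6 and 0.3 are hypothetical.

```python
from typing import Iterable

def conjunctive(memberships: Iterable[float]) -> float:
    """Antecedents joined by AND: intersection, i.e. the minimum."""
    return min(memberships)

def disjunctive(memberships: Iterable[float]) -> float:
    """Antecedents joined by OR: union, i.e. the maximum."""
    return max(memberships)

mu_x, mu_y = 0.6, 0.3  # hypothetical membership values of two antecedents

and_rule_strength = conjunctive([mu_x, mu_y])  # "If x is A and y is B"
or_rule_strength = disjunctive([mu_x, mu_y])   # "If x is A1 or y is B1"
```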
There are well-researched fuzzy methods that provide well-defined
systems which can be used as inference systems.
A. Mamdani system
Ebrahim Mamdani suggested this method in the year 1975. This method
can accept crisp as well as fuzzy inputs for the purpose of inference.
Consider a set of 2 rules
Rule 1 - If x is A and y is B Then z is C
Rule 2 - If x is A1 or y is B1 Then z is C1
There are two inference methods for the 2-input case in the Mamdani
system
a. Max-Min inference method – Considering Rule 1 and Rule 2 above,
with x=2.5 and y=3 as the inputs, the minimum of the membership values of
the different antecedents is considered. Let µ1 be the membership value
for x=2.5 and µ2 be the membership value for y=3. Then the minimum of µ1
and µ2 is considered for Rule 1. The same procedure is followed for Rule 2
(and up to Rule n if there are any), taking the maximum instead where the
antecedents are joined by OR, and the maximum µ of all these rules is
considered for the final Defuzzification. This method is also called the
truncated membership method.
The area under the marked region is considered for finding the final
crisp value. The appropriate equation for area calculation is used based
on the shape that is formed in the final output graph.
Figure 6.5: Max - Min Mamdani Method
b. Max-product inference method – The same procedure as Max-Min is
followed, except that instead of the minimum membership value for
each rule, the product is considered. For aggregation, the maximum of all
these products is considered for the final Defuzzification. Instead of
the truncated membership method, the scaled membership method is used in
the Max-Product method.
Figure 6.6: Max Product Ma mdani Method
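The difference between truncating (Max-Min) and scaling (Max-Product) the consequent sets can be sketched as follows. The triangular consequent sets, the firing strengths 0.4 and 0.7, and the discretized centroid used for defuzzification are all assumptions made for illustration.

```python
from typing import Callable, List, Tuple

def triangle(z: float, a: float, b: float, c: float) -> float:
    """Triangular membership function with peak at b."""
    if z <= a or z >= c:
        return 0.0
    if z <= b:
        return (z - a) / (b - a)
    return (c - z) / (c - b)

RuleOut = Tuple[float, Callable[[float], float]]  # (firing strength, consequent set)

def aggregate(rules: List[RuleOut], zs: List[float], method: str) -> List[float]:
    """Truncate ("min") or scale ("product") each consequent, then take the max."""
    result = []
    for z in zs:
        if method == "min":
            values = [min(w, mf(z)) for w, mf in rules]
        else:
            values = [w * mf(z) for w, mf in rules]
        result.append(max(values))
    return result

def centroid(zs: List[float], mus: List[float]) -> float:
    """Discrete centroid defuzzification of the aggregated output."""
    return sum(z * m for z, m in zip(zs, mus)) / sum(mus)

# Hypothetical consequents C and C1 with firing strengths 0.4 and 0.7.
rules = [(0.4, lambda z: triangle(z, 0.0, 2.0, 4.0)),
         (0.7, lambda z: triangle(z, 3.0, 5.0, 7.0))]
zs = [i / 10 for i in range(71)]  # z from 0.0 to 7.0

truncated = aggregate(rules, zs, "min")
scaled = aggregate(rules, zs, "product")
crisp_max_min = centroid(zs, truncated)
crisp_max_product = centroid(zs, scaled)
```

Truncation cuts each consequent off at its firing strength, whereas scaling keeps its shape and shrinks its height, so the two methods generally give slightly different crisp outputs.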
B. TSK / Sugeno system (Takagi Sugeno Kang / Sugeno)
This model was suggested in the year 1985. In the Mamdani system,
all the antecedents in an If-Then rule are in fuzzy form and the
consequent is also fuzzy. In the Sugeno system, the consequent is instead
a polynomial function represented as z = f(x, y), which is a crisp
function.
Rule 1 - If x is small and y is small Then z1 = (-x) + y + 1
Rule 2 - If x is small and y is large Then z2 = (-y) + 3
Rule 3 - If x is large and y is small Then z3 = (-x) + 3
Rule 4 - If x is large and y is large Then z4 = (-x) + y + 2
Consider values of x=1.5 and y=2.5
Figure 6.7: Graphical representation of TSK / Sugeno method
The membership values of x=1.5 for small and large are 0.3 and 0.3, and
those of y=2.5 are 0.4 and 0.7 respectively. The final output is the
weighted average of the rule outputs:
$$y^* = \frac{(0.3 \times 2) + (0.3 \times 0.5) + (0.4 \times 1.5) + (0.7 \times 6)}{0.3 + 0.3 + 0.4 + 0.7} = \frac{5.55}{1.7} \approx 3.265$$
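The weighted-average step can be checked with a short sketch. The weights and rule outputs are copied as printed in the worked example; the helper name `weighted_average` is an assumption.

```python
from typing import Sequence

def weighted_average(weights: Sequence[float], outputs: Sequence[float]) -> float:
    """Sugeno output: rule outputs averaged by their firing strengths."""
    return sum(w * z for w, z in zip(weights, outputs)) / sum(weights)

weights = [0.3, 0.3, 0.4, 0.7]  # firing strengths as printed in the example
outputs = [2.0, 0.5, 1.5, 6.0]  # crisp rule outputs as printed in the example

y_star = weighted_average(weights, outputs)  # 5.55 / 1.7, about 3.26
```

Because every rule already yields a crisp number, no area calculation is needed, which is why the Sugeno method is computationally cheaper than Mamdani defuzzification.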
Before discussing the third system, i.e. the Tsukamoto system, let’s have
a brief comparison of the Mamdani and Sugeno systems.
a. In the Mamdani system, the consequent is a fuzzy data set, whereas
in the Sugeno system, the output membership function is either
linear or constant.
b. The Sugeno system is based more on mathematical rules than Mamdani.
c. The Mamdani system is more suitable for human inputs.
d. A Sugeno controller has more adjustable parameters than a Mamdani
system.
e. The Mamdani system is more intuitive and has widespread acceptance,
but the Sugeno method is more computationally efficient.
f. The Sugeno system works better for optimization and adaptive
techniques.
C. Tsukamoto system
In this system, the antecedents as well as the consequent are fuzzy sets,
but the membership function of the consequent is a monotonic function
(also called a shoulder function), whose successive values are
increasing, decreasing or constant. The output of each rule is defined as
a crisp value induced by the membership value coming from the antecedents
of the rule.
Rule 1 – If x is A and y is B Then z is C
Rule 2 – If x is A1 and y is B1 Then z1 is C1
w1, w2 and w3 represent the corresponding weights (based on membership
values) for x, y and z.
Figure 6.8: Graphical representation of Tsukamoto method
Based on Rule 1 and Rule 2, we find the membership values for x and y and
from them the membership value for the output z. This is done using the
maximum or minimum depending on the rule (whether an AND or OR condition
is used in the rule). In the given illustration, the antecedents of both
rules are connected with AND, hence we consider the minimum of the two
membership values. After extending w2 in the first case and w2 in the
second case, the corresponding membership value w3 is obtained, which
gives a crisp output value. The overall output is obtained as the
weighted average of each rule’s output (i.e. w3 and z in both the rules).
Consider values of x=2 and y=5:
$$y^* = \frac{(0.2 \times 0.5) + (0.5 \times 5)}{0.2 + 0.5} = \frac{2.6}{0.7} \approx 3.714 \quad \text{(final output)}$$
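The way a monotonic consequent turns a firing strength into a crisp value can be sketched as below. The linear consequents on [0, 2.5] and [0, 10] are hypothetical shapes picked so that the firing strengths 0.2 and 0.5 reproduce the rule outputs 0.5 and 5 used in the example; the original figure may use different shapes.

```python
from typing import List, Tuple

def crisp_from_strength(w: float, z_low: float, z_high: float) -> float:
    """Invert an increasing linear consequent
    C(z) = (z - z_low) / (z_high - z_low): find the z whose membership is w."""
    return z_low + w * (z_high - z_low)

def tsukamoto_output(rules: List[Tuple[float, float]]) -> float:
    """Weighted average of crisp rule outputs; rules = [(strength, z), ...]."""
    total = sum(w for w, _ in rules)
    return sum(w * z for w, z in rules) / total

# Firing strengths as printed in the example; consequent ranges are assumed.
z1 = crisp_from_strength(0.2, 0.0, 2.5)   # rule 1 output
z2 = crisp_from_strength(0.5, 0.0, 10.0)  # rule 2 output
y_star = tsukamoto_output([(0.2, z1), (0.5, z2)])
```

The inversion works only because a monotonic function maps each membership value to exactly one z.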
The main advantage of the Tsukamoto method is that it bypasses the long
process of Defuzzification, as each rule renders a crisp value and the
overall output can be calculated with the weighted average method.
The major drawback of this method is that it can be applied only when a
monotonic function is used. In all other generic cases the Tsukamoto
method cannot be used.
1.6.2 Fuzzy Decision Trees
A decision tree is a diagrammatic representation of decision rules and
the corresponding outcomes. A decision tree consists of 2 parts, decision
nodes and branches. A decision tree of this kind helps the end user
design a better strategy in a complex situation where there are multiple
decision rules and conditions. Let’s consider the example of deciding the
income tax percentage to be deducted in a given situation. The tax to be
deducted will be decided upon the following conditions
1. Whether the person under consideration is a salaried person or a
business person
2. Age of the person
3. Gender of the person
4. Total amount of earning
To design a tree for this situation, questions are designed in such a way
that there are only two possible answers to each question. A condition is
considered a decision node and the answers are the branches. All possible
conditions and their answers are included in a single tree so that the
end user can easily take a decision. It is important to note that all the
decision rules and possible answers are clear and well-defined.
Now, let’s consider another situation, where person X has been
interviewed by different companies and has received job offers from 4
of them. X has to decide which offer to accept, based on 3 criteria he
has in mind. The criteria are salary, distance from home and growth
opportunities. X is looking for a salary in the range of 35-55 thousand
per month. The distance from home should be between 5 and 30 km. Growth
opportunities are indicated by a number of ticks, where more ticks
indicate more opportunities. Unlike the previous example of tax
calculation, all the above conditions and possible criteria are unclear
and fuzzy in this case.
The following table represents all the criteria and the job
opportunities available to person X, wherein J1, J2, J3 and J4 indicate
job offers and C1, C2, C3 indicate the criteria for selecting a job
offer.
                     Job offers
Criteria             J1      J2      J3      J4
Salary (C1)          40 k    45 k    50 k    60 k
Distance, km (C2)    27      7.5     12      2.5
Growth (C3)          √√      √√√     √       √
After assigning membership function and assigning weights for salary
criteria, we get a continuous line as shown below.
Figure 6.9: Graphical presentation of salary vs. membership values
After following a similar procedure for all the other criteria, we get
the following table of membership values
                     Job offers
Criteria             J1      J2      J3      J4
Salary (C1)          0.25    0.5     0.7     1
Distance (C2)        1       0.9     0.78    0.1
Growth (C3)          0.5     0.8     0.2     0.2
For finding the best possible option, the following equations are used,
where each D is the minimum membership value of one job offer across all
the criteria:
D1 = Min (C1 (J1), C2 (J1), C3 (J1)) = Min (0.25, 1, 0.5) = 0.25
D2 = Min (C1 (J2), C2 (J2), C3 (J2)) = Min (0.5, 0.9, 0.8) = 0.5
D3 = Min (C1 (J3), C2 (J3), C3 (J3)) = Min (0.7, 0.78, 0.2) = 0.2
D4 = Min (C1 (J4), C2 (J4), C3 (J4)) = Min (1, 0.1, 0.2) = 0.1
D (Final) = Max (D1, D2, D3, D4) = Max (0.25, 0.5, 0.2, 0.1) = 0.5
Hence, the best option is the one with the weight 0.5, i.e. option 2.
Job offer J2 is therefore the most advisable for X.
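The max-min selection can be reproduced with a short sketch. The membership values are taken from the table above; the dictionary layout and function name are assumptions made for illustration.

```python
from typing import Dict, Tuple

# Membership values per job offer, copied from the table of criteria.
memberships: Dict[str, Dict[str, float]] = {
    "J1": {"salary": 0.25, "distance": 1.0, "growth": 0.5},
    "J2": {"salary": 0.5, "distance": 0.9, "growth": 0.8},
    "J3": {"salary": 0.7, "distance": 0.78, "growth": 0.2},
    "J4": {"salary": 1.0, "distance": 0.1, "growth": 0.2},
}

def max_min_decision(table: Dict[str, Dict[str, float]]) -> Tuple[str, Dict[str, float]]:
    """Score each offer by its weakest criterion, then pick the best score."""
    scores = {job: min(criteria.values()) for job, criteria in table.items()}
    best = max(scores, key=scores.get)
    return best, scores

best_offer, scores = max_min_decision(memberships)
```

Taking the minimum per offer means an offer is only as good as its worst criterion, so J4 is rejected despite its top salary score.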
A decision tree based on fuzzy values will not have just 2 branches, but
can have multiple branches. Experts’ views matter when designing the
weights given to the values and the membership function.
Figure 6.10: Decision tree (fuzzy values)
1.6.3 Stochastic Search Methods
Since the advent of computers and software systems, they have undergone
a lot of evolution. In recent days, software systems have reached a stage
where one can expect them to imitate human intelligence. Needless to say,
agility and adaptability are among the most prominent features of human
intelligence. By incorporating a changing environment to support decision
making in the most complex systems, machine learning, deep learning and
neural networks have immensely aided the development of appropriate
artificially intelligent software.
An efficient, adaptive, self-learning algorithm for speedy search in a
large database can give an edge over traditional search algorithms.
Previously used deterministic and probabilistic models may not give the
expected intelligent output, nearer to human intelligence. Deterministic
models are experiment based and, with the same set of initial conditions,
generate the same output. Probabilistic models work with a certain degree
of randomness, but fail to work in an inherently highly random
environment. Further, deterministic and probabilistic methods are not
capable of handling time-dependent randomness. Consider the example of
bacterial growth in a controlled environment: in spite of the same set of
initial conditions and environment, the final results may vary.
Predicting stock prices at different points of time is also a highly
unpredictable process and asks for algorithms that can handle the nature
of randomness in such a situation. Modeling efficient supply chain
management from production facilities to warehouses, designing the best
red-yellow-green signal timings in various directions in a traffic
network, deciding the time to administer a drug for its best therapeutic
effects, and the Gaussian movement of particles are some more areas with
a high degree of randomness. Stochastic methods come in handy in such
situations.
What is stochastic search?: Most real-world problems need a
stochastic approach. A stochastic process is a set of random variables
which are time-dependent (time can be discrete, X0, X1, X2, X3, ..., Xn,
or continuous, {Xt}, t >= 0). A certain degree of uncertainty helps in
improving the ability to optimize search processes. The natural world is
full of stochasticity. Most machine learning algorithms are based on
stochastic methods. Games also have a certain level of stochasticity,
such as rolling dice or shuffling cards. Following are some generic steps
for building a stochastic search model:
1. Creating a sample space (Ω), which includes a list of all possible
outcomes,
2. Assigning probabilities to all the elements in the sample space,
3. Identifying the different events of interest,
4. Calculating the probabilities for the events of interest.
Let’s see a common example of this process in action: You are rolling a
dice in a casino. If you roll a six or a one, you win Rs. 1000. The steps
would be:
The sample space includes all possibilities for dice roll outcomes: Ω = {1,
2, 3, 4, 5, 6}.
The probability for any number being rolled is 1/6.
The event of interest is “roll a 6 or roll a 1”.
The probability for “roll a 6 or 1” is 1/6 + 1/6 = 2/6 = 1/3.
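The four steps of the dice example can be written out directly. The sketch below uses exact fractions so that the result matches the hand calculation; the variable and function names are assumptions.

```python
from fractions import Fraction
from typing import Dict, Iterable

# Step 1: sample space of a single dice roll.
sample_space = [1, 2, 3, 4, 5, 6]

# Step 2: every outcome of a fair dice is equally likely.
probabilities: Dict[int, Fraction] = {
    outcome: Fraction(1, len(sample_space)) for outcome in sample_space
}

# Step 3: the event of interest, "roll a 6 or roll a 1".
winning_event = {1, 6}

# Step 4: add up the probabilities of the outcomes in the event.
def event_probability(event: Iterable[int], probs: Dict[int, Fraction]) -> Fraction:
    return sum(probs[outcome] for outcome in event)

p_win = event_probability(winning_event, probabilities)
```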
Implementation of stochastic search is achieved through different
algorithms and techniques. Such techniques are based on exploitation and
exploration principles.
Following are some of the popular techniques for stochastic search:
a. Simulated annealing – The name simulated annealing comes from the
field of metallurgical engineering, in which temperature is brought down
very slowly so that particles can settle down gradually while cooling
(reaching a minimum lattice energy state and thus avoiding any crystal
defects; the final configuration is a solid with superior structural
integrity). A simulation of this annealing process is used in an
algorithm, which can be represented as:
Simulated Annealing ()
Step 1 – Start with any random node and generate a solution
Step 2 – Using any cost function, calculate the cost of the solution
Step 3 – Generate a new solution using a random neighbor
Step 4 – Calculate the cost of the new solution
Step 5 – Compare the new solution cost against the cost of the previous
solution
Step 6 – If the new solution is better than the old one cost-wise, move
to the new solution; otherwise move to it only with a probability that
shrinks as the temperature is lowered. Then proceed to the next iteration.
Step 7 – Keep checking for the termination condition, which may be either
the maximum number of iterations or an optimal solution being reached.
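The steps above can be sketched in Python as follows. The sketch assumes the standard simulated annealing acceptance rule, in which a worse solution is still accepted with probability exp(-delta / temperature); the cooling factor, the quadratic cost function and the neighbor function are hypothetical choices for illustration.

```python
import math
import random
from typing import Callable, Tuple

def simulated_annealing(cost: Callable[[float], float],
                        neighbor: Callable[[float, random.Random], float],
                        start: float,
                        temperature: float = 10.0,
                        cooling: float = 0.95,
                        max_iterations: int = 1000,
                        seed: int = 0) -> Tuple[float, float]:
    rng = random.Random(seed)
    current, current_cost = start, cost(start)        # steps 1 and 2
    best, best_cost = current, current_cost
    for _ in range(max_iterations):                   # step 7: iteration limit
        candidate = neighbor(current, rng)            # step 3
        candidate_cost = cost(candidate)              # step 4
        delta = candidate_cost - current_cost         # step 5
        # Step 6: always accept improvements; accept a worse solution
        # with a probability that shrinks as the temperature drops.
        if delta <= 0 or rng.random() < math.exp(-delta / temperature):
            current, current_cost = candidate, candidate_cost
            if current_cost < best_cost:
                best, best_cost = current, current_cost
        temperature = max(temperature * cooling, 1e-9)  # slow cooling
    return best, best_cost

# Hypothetical problem: minimize (x - 3)^2 starting far from the minimum.
cost = lambda x: (x - 3.0) ** 2
step = lambda x, rng: x + rng.uniform(-1.0, 1.0)
best, best_cost = simulated_annealing(cost, step, start=-10.0)
```

Occasionally accepting worse solutions at high temperature is what lets the search escape local minima before it settles.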
b. Genetic algorithms – The motivation for genetic algorithms comes from
nature and the way it has evolved. Genetic mutation is a common process
that keeps happening in animals as well as trees. In this process, a gene is
replaced by another, for environmental reasons. The evolutionary
fundamentals, when applied for computational purposes, are called
Evolutionary Computing, and one branch of Evolutionary Computing is the
Genetic Algorithm (GA). In a GA there is a population, which consists of
encoded candidate solutions to a given problem; every single solution in
the population is called a chromosome. The population in computational
space is called the genotype, whereas the population in the real world is
called the phenotype. The genotype is the solution encoded from the real
world into computational space; the phenotype is the solution decoded from
computational space back to the real world. From an initial population,
which is either generated at random or seeded using a known heuristic, fit
parent candidates are selected. A fitness function is used to select the fit
parents; it has to be very fast and is expected to measure quantitatively
the fitness of the candidates selected as parents. Crossover and mutation
are carried out to generate new offspring, which in turn replace members of
the original population. This process is repeated until the set number of
iterations is reached or an optimal solution is found. GAs are widely used
in robotic engineering as well as in other search and optimization problems.
The process in a GA can be represented in the form of an algorithm as follows:
Genetic Algorithm ()
Step 1 – Initialize the population
Step 2 – Using the fitness function, check the fitness of the population
Step 3 – Select the parents
Step 4 – Apply crossover with probability P1
Step 5 – Apply mutation with probability P2
Step 6 – Decode the solution to the real world and calculate its fitness
Step 7 – Select the survivors
Step 8 – Find the best one and return it to the real-world population
Step 9 – Repeat steps 3 to 8 until the termination criterion is met
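A compact Python sketch of these steps is given below. It is only one possible reading of the algorithm: the bit-string encoding, tournament selection, single-point crossover, elitist survivor selection and all parameter values are assumptions made for illustration, and the fitness function used in the usage line (the number of 1 bits) is a toy problem.

```python
import random


def genetic_algorithm(fitness, n_bits=20, pop_size=30, p_cross=0.8,
                      p_mut=0.02, generations=60, seed=1):
    """Maximise `fitness` over bit strings; returns (best, history)."""
    rng = random.Random(seed)
    # Step 1: initialise a random population of bit-string chromosomes.
    pop = [[rng.randint(0, 1) for _ in range(n_bits)] for _ in range(pop_size)]
    best = max(pop, key=fitness)
    history = [fitness(best)]

    def pick_parent():
        # Steps 2-3: binary tournament, the fitter of two random members wins.
        a, b = rng.sample(pop, 2)
        return a if fitness(a) >= fitness(b) else b

    for _ in range(generations):                  # Step 9: repeat
        children = []
        while len(children) < pop_size:
            p1, p2 = pick_parent(), pick_parent()
            child = p1[:]
            if rng.random() < p_cross:            # Step 4: crossover (P1)
                cut = rng.randrange(1, n_bits)
                child = p1[:cut] + p2[cut:]
            # Step 5: mutation flips each gene with probability P2.
            child = [1 - g if rng.random() < p_mut else g for g in child]
            children.append(child)
        # Steps 6-7: keep the best chromosome seen so far alive in the next
        # generation (elitism), then re-evaluate the population.
        children[0] = best[:]
        pop = children
        best = max(pop, key=fitness)
        history.append(fitness(best))
    return best, history                          # Step 8


# Toy usage: the fitness of a chromosome is its number of 1 bits.
best, history = genetic_algorithm(sum)
```

Because the best chromosome is copied into every new generation, the best fitness recorded in `history` never decreases from one generation to the next.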
Before moving on to the hill climbing technique for stochastic search
optimization, let's compare Simulated Annealing and Genetic Algorithms.
SA: Comes from metallurgical engineering
Uses a cost function to compare two solutions
Uses only one population space
Keeps comparing one solution with its neighbor to reach the optimal
solution
Widely used in solving combinatorial problems
GA: Comes from the concepts of natural evolution
Uses a best-fit (fitness) function for comparison
Uses 2 population spaces, phenotype and genotype
Keeps combining two solutions to reach the target best offspring
Widely used in robotic applications and production planning
c. Hill climbing – The hill climbing algorithm starts with a random value and
keeps searching for a higher value until it reaches a peak. The values of
neighboring peaks are then compared with each other for better optimization.
TSP (the Traveling Salesman Problem) is an area where the hill-climbing
algorithm is widely used. Hill climbing is a variation of the
generate-and-test method, which helps to decide in which direction to
move in a search space. That direction is decided based on the value of the
cost function. The steps of the hill-climbing algorithm are as follows:
Hill Climbing ()
Step 1 – If the existing state is equal to the target state, then stop
Step 2 – If the existing state is not the target, keep repeating the process of
finding and comparing new states until the target state is achieved or there is
no new operator left to apply
Step 3 – Select a new operator and apply it to the current state
Step 4 – Replace the current state with the new state whenever the new state
is better
Step 5 – Exit the repetitive process the moment the current state becomes the
optimum state.
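A minimal Python sketch of hill climbing follows. It assumes the steepest-ascent variant (all neighbors are examined and the best one is taken), and the value function and neighbor generator are hypothetical toy choices, not part of the original text.

```python
def hill_climb(value, neighbors, start, max_steps=1000):
    """Steepest-ascent hill climbing; stops at the first peak it reaches."""
    current = start
    for _ in range(max_steps):
        # Apply every available operator to get the candidate states.
        candidates = neighbors(current)
        if not candidates:
            break                      # no operator left to apply
        best_next = max(candidates, key=value)
        if value(best_next) <= value(current):
            break                      # no better neighbor: a (local) peak
        current = best_next            # move uphill
    return current


# Toy problem (hypothetical): the single peak of -(x - 7)^2 is at x = 7.
def toy_value(x):
    return -(x - 7) ** 2


def toy_neighbors(x):
    return [x - 1, x + 1]


peak = hill_climb(toy_value, toy_neighbors, start=0)
```

On a landscape with several peaks this procedure can stop at a local peak, which is why restarts from different random values are often used in practice.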
QUESTION BANK
Q-1. What is a sample? Why use a sample?
A-1. A sample is a subset of a large population. Because it is small, it
allows researchers to conduct their study in a timely manner.
Q-2. What is a Sampling Distribution?
A-2. It is a probability distribution of a statistic from a large number of
samples.
Q-3. Give one example of sampling distribution.
A-3. Any live example used by a researcher for analysis.
Q-4. Define following terms.
Sample, Population, Sampled Population, Element and Frame.
A-4. A sample is a subset of the population.
A population is a collection of all the elements of interest.
The sampled population is the population from which the sample is
drawn.
An element is the entity on which data are collected.
A frame is a list of the elements that the sample will be selected from.
Q-5. What is re-sampling?
A-5. A single data sample gives only a single estimate of the population
parameter. To overcome this, we can estimate the population parameter
multiple times from our data sample. This is called re-sampling.
Q-6. What are TWO commonly used re-sampling methods?
A-6. (1). Bootstrap
(2). K-fold Cross Validation
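As a small illustration of both methods, the Python sketch below draws bootstrap resamples (sampling with replacement) to estimate the spread of the sample mean, and splits item indices into k folds for cross validation. The data values, the number of resamples and the interleaved fold assignment are assumptions made only for this example.

```python
import random
import statistics


def bootstrap_means(sample, n_resamples=1000, seed=0):
    """Means of `n_resamples` resamples drawn with replacement."""
    rng = random.Random(seed)
    n = len(sample)
    return [statistics.mean(rng.choices(sample, k=n))
            for _ in range(n_resamples)]


def k_fold_indices(n_items, k):
    """Split indices 0..n_items-1 into k disjoint test folds."""
    return [list(range(fold, n_items, k)) for fold in range(k)]


sample = [12, 15, 9, 20, 14, 11, 17, 13]     # made-up measurements
means = bootstrap_means(sample)
std_error = statistics.stdev(means)          # spread of the estimated mean
folds = k_fold_indices(len(sample), k=4)     # [[0, 4], [1, 5], [2, 6], [3, 7]]
```

In k-fold cross validation each fold serves once as the test set while the remaining folds are used for fitting.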
Q-7. Discuss Statistical Inference.
A-7. Statistical inference makes propositions about a population. It
consists of selecting a statistical model of the process that generates
the data and deducing propositions from the model.
Q-8. Define Prediction error.
A-8. Prediction error is the failure of some expected event to occur. For a
numerical forecast, it is measured as the difference between the
predicted value and the value actually observed.
Q-9. What is Regression Analysis?
A-9. It is a set of statistical processes for estimating relationships between
a dependent variable and one or more independent variables.
Q-10. What are the types of regression models?
A-10. Linear
Logistic
Polynomial
Stepwise
Ridge
Lasso
Q-11. Case study I
A group of estate agents carried out a survey in Mumbai to predict rent and
deposit amounts for apartments in different locations. Rent and deposit
amounts can vary with a variety of factors such as distance from the
railway station, locality of the flat, distance from the airport, nearest
school and mall, amenities and carpet area of the flat. Mr. and Mrs. Y are
looking for an apartment to rent and have approached the group of property
agents. Their criteria for selecting an apartment are proximity to a school
(2-4 km), distance from the nearest railway station (5-10 km), amenities and
locality. The property agents have shortlisted 4 properties for them. To
finalize the best property, create a decision table and tree based on fuzzy
logic. Refer to the following values to prepare the table: P1 to P4 are the
shortlisted properties and C1 to C4 are the criteria.
                                 Properties
Criteria                         P1       P2       P3       P4
School distance (C1)             3.5 km   2 km     4 km     3 km
Railway station distance (C2)    8 km     6.5 km   10 km    5 km
Amenities (C3)                   √√√      √√       √√       √
Locality (C4)                    *        ***      *        **
A-11. Refer to section 6.2 to solve the case. Using a membership
function (1/5 for the school distance criterion; we can plot the graph with
a gap of 5), write the membership values along with the school distances in
tabular form. Repeat this process for all the criteria, then find the
minimum membership value for each shortlisted property. Take the maximum
among these membership values and the corresponding property. This property
is ideal for Mr. and Mrs. Y based on their criteria. Draw a decision tree
based on the membership values for all the criteria.
Q-12. Case study II
In a class of 40, students are graded as poor, average and extraordinary
based on their percentages in the examination. Consider the universe with
percentage values as:
U={0,10,20,30,40,50,60,70,80,90,100} and students with percentage
below 40 are considered as poor, above 40 till 70 percent are average and
above 70 are extraordinary.
Assume 2 subsets A and B
A= {33, 56, 87, 96, 25, 66, 79}
B= {78, 42, 64, 86, 35, 27, 31}
Assign weights (membership values) to all the values in A and B, design a
membership function for the same. Draw graphs and find out count of
core, support and boundary values in subset A and B.
A-12. Consider the universe U, which has values 0 to 100 in steps of 10.
Hence, the membership function can be 1/10. For each member of A
and B, apply the membership function to find its membership value. Then
plot the membership values against the original values of the elements.
Q-13 Multiple choice questions
1. Membership functions are better represented with the help of
a. Tabular form b. Graphical form
c. Mathematical form d. Logical form
2. Which of the following are fuzzy operators?
a. AND b. OR c. NOT d. All of the above
3. How best can we define dry in terms of humidity of the weather?
a. Fuzzy set c. Crisp set
b. Fuzzy and Crisp d. None of the above
4. The value between 0 and 1 to which X is mapped is called the
a. Membership value c. Degree of membership
b. All of the above d. None of the above
5. Fuzzy systems can be implemented with the help of
a. Hardware c. Software
b. Both of the above d. None of the above
6. For a given fuzzy set A, which of the following elements do not
belong to A?
A={(a,0.5) , (b,0.2) , (c,0), (d,1 ) , (e,0.8), (f,0.3)}
a. c b. d c. None of the above d. All of the above
7. _____ is best used to represent fuzzy values in a graph.
a. Square c. Hexagon
b. Triangle d. All of the above
8. A fuzzy system architecture has ____ main components.
a. 2 b.4 c.5 d. None of the above
9. Which of the following logic is the form of Fuzzy logic?
a. Two-valued logic d. Crisp set logic
b. Binary set logic e. Many-valued logic
c. None of these
A-13. Answers in Red color above.
Q-14. What is a membership function used in fuzzy logic? What are the
different techniques for fuzzifying or defuzzifying data?
A-14. Definition and need for fuzzy logic with example. List down
techniques for Defuzzification.
Q-15. Compare Mamdani and Sugeno model with their pros and cons.
A-15. Explain the concept of Stochastic Search methods. Mamdani and
Sugeno method, Advantages and disadvantages of each.
Q-16. Explain the concept of a ‘monotonic function’. Why is it
alternatively called a shoulder function?
A-16. In the Tsukamoto method, the outcome is a polynomial function instead
of a fuzzy value. A monotonic function is one which takes increasing,
decreasing or constant values. After plotting such successive values,
we get a graph of the following nature, which resembles a human
shoulder. Hence it is called a shoulder function.
Q-17. Explain the architecture of a fuzzy system with an appropriate diagram.
A-17. Draw a diagram with the 4 important components of a fuzzy system:
Fuzzifier, Rule Base, Inference System and Defuzzifier. Explain
each and list any 2 techniques each for Fuzzification and
Defuzzification.
Q-18. Compare and contrast Simulated Annealing, Genetic Algorithm and
Random Walk techniques for stochastic search.
A-18. Explain the SA and GA techniques and the steps for each. Write one
application where each can be used. Write the advantages and
disadvantages of each.
Q-19. What is a Neural Network?
Q-20. What is generalization in a Neural Network?
Q-21. List various applications where we can use a Neural Network.
Q-22. What is Competitive Learning in a Neural Network?
Q-23. What is the need for Principal Component Analysis in a Neural
Network?
Q-24. List five characteristics of big data.
A-24. Volume, variety, veracity, variability and velocity are
characteristics of big data.
Q-25. Name a few units of measurement for memory used in today's era of
big data.
A-25. Terabytes, Petabytes, Exabytes and Zettabytes.
Q-26. Write the various steps to carry out in the analysis process in general.
A-26. The following steps have to be carried out in the analysis process:
Data collection, Data cleaning, Data preprocessing, Data analysis,
Visualisation and Representation, Understanding results.
Q-27. Write 2 differences between the analysis and reporting processes.
A-27. (1) The goal of the analysis process is to inspect the data and
transform it into useful, meaningful information, whereas the goal
of the reporting process is to transform the output of the process
into a presentable format. (2) The main purpose of the analysis
process is examining, interpreting, comparing and predicting from
the data, whereas the reporting process mainly focuses on
highlighting, organizing, summarizing and formatting.
Q-28. Write the difference between structured and unstructured data.
A-28. Structured data can be stored in a two-dimensional structure such as
a worksheet. The structure of the data is predefined and fixed.
Unstructured data, in contrast, does not have a fixed data format. It
is volatile in nature.
Q-29. Write examples of structured and unstructured data.
A-29. Structured data - business data stored in an RDBMS, Excel
worksheets. Unstructured data - text data, web data, images.
Q-30. Write any three reasons behind the increasing volume of internet data
in the last few years.
A-30. The reasons behind the increasing volume of internet data are as follows:
(1) Increase in the number of internet users.
(2) Increasing popularity of social media websites and online shopping
websites.
(3) IoT systems usage.
Q-31. Explain the term 'velocity' with reference to big data.
A-31. Velocity measures how fast the data is coming in. In some systems
data arrives in real time, whereas in other systems data arrives in
batches. Depending on the velocity of the data, the data storage
system has to manage the flow of the data.
Q-32. Name any three technologies used for big data analytics.
A-32. The R language, the Python language and the Hadoop ecosystem are
popular technologies used for big data analytics.
Q-33. Differentiate between linear and non-linear time series data.
Q-34. Explain the various inherent components of time-series data, with
suitable examples.
Q-35. Mention and briefly introduce the algorithms available for the rule
induction process.
Q-36. Illustrate the statement “BDS is a litmus test for deciding
non-linearity of time series data”.
Q-37. Compare and contrast Additive and Multiplicative methods.
Q-38. Explain the steps to carry out Exponential Smoothing of time series
data.
Q-39. Discuss the pros and cons of ARIMA for the purpose of forecasting
time series data.
2
MAP REDUCE
Unit Structure
2.0 Objectives
2.1 Introduction
2.2 Distributed File Systems
2.2.1 Physical Organization of Compute Nodes
2.2.2 Large-Scale File System Organization
2.3 Apache Hadoop
2.3.1 Elements of Hadoop Ecosystem
2.4 Map Reduce
2.5 Steps of Map Reduce
2.5.1 The Map Task
2.5.2 Grouping by Key
2.5.3 The Reduce Tasks
2.5.4 Combiners
2.5.5 Details of Map Reduce Execution
2.5.6 Coping with Node Failures
2.6 Algorithms using Map Reduce
2.6.1 Matrix-Vector Multiplication by Map Reduce
2.6.2 If the Vector v Cannot Fit in Main Memory
2.6.3 Relational Algebra Operations
2.6.4 Computing Selections by Map Reduce
2.6.5 Computing Projections by Map Reduce
2.6.6 Union, Intersections and Difference by Map Reduce
2.6.7 Computing Natural Join by Map Reduce
2.7 Extensions to Map Reduce
2.7.1 Workflow Systems
2.7.2 Recursive Extensions to Map Reduce
2.7.3 Pregel
2.8 Common Map Reduce Algorithms
2.8.1 Sorting
2.8.2 Searching
2.8.3 Indexing
2.8.4 TF-IDF
2.9 Summary
2.10 List of References and Bibliography
2.11 Unit End Exercise
2.0 OBJECTIVES
This chapter will help you understand the following concepts:
● The requirement of the big data handling tool.
● The structure and working of the Distributed File System.
● Physical Organization of Compute Nodes
● Large-Scale File System Organization
● The importance of Apache Hadoop, MapReduce and parallel processing
for mining large-scale data.
● The MapReduce Framework and its steps of execution.
● The features and working flow of the MapReduce system.
● The MapReduce execution of Matrix Multiplication algorithm and
relational algebra operations.
● The input and output file format of MapReduce phases.
● The generalized form of MapReduce, a workflow system.
● The recursive extension of MapReduce and handling faults during
execution of MapReduce.
● Designing the MapReduce algorithm for small tasks and large data.
2.1 INTRODUCTION
Modern applications that deliver quick insights or analysis require us to manage
immense amounts of data quickly. In most of these applications, the data is
extremely regular, and there is ample opportunity to exploit parallelism. Some
important examples are:
1. Ranking of Web pages by importance, which involves an iterated matrix-vector
multiplication where the dimension is in the tens of billions.
2. Searches in “friends” networks at social networking sites, which involve
graphs with hundreds of millions of nodes and many billions of edges.
To deal with applications such as these, a new software stack has developed. It
uses a new form of file system, which features much larger units than the disk
blocks in a conventional operating system. This file system also replicates
data to protect against the frequent media failures that occur when data is
distributed over thousands of disks.
Nowadays, many higher-level programming systems are built on top of these file
systems. Central to many of them is MapReduce. Implementations of MapReduce
enable many of the most common calculations on large-scale data to be performed
efficiently on large collections of computers, in a way that is tolerant of
hardware failures during the computation.
MapReduce systems are evolving and extending rapidly. In this chapter we will
discuss distributed file systems, MapReduce, and generalizations of MapReduce,
first to acyclic workflows and then to recursive algorithms. We will discuss some
common MapReduce algorithms as well.
2.2 DISTRIBUTED FILE SYSTEMS
Most computations are performed on a single processor that uses its own main
memory, cache and local disk (a compute node). In such systems the files are
managed by a file management system, which is capable of handling files that
are stored on a single computer or cluster. In the past, parallel-processing
applications ran on special-purpose computers with multiple processors and
specialized hardware. The ever-increasing use of web services has created the
demand to do huge computations independently and instantly on large, extensible
clusters. Compared with special-purpose parallel computers, commodity hardware
is cheap. The availability of cheap and fast hardware has given rise to a new
generation of programming systems with the feature of parallelism. These
systems take advantage of the power of parallelism and at the same time avoid
the reliability problems that arise when the computing hardware consists of
thousands of independent components, any of which could fail at any time.
In this section, we will discuss the characteristics of the computing
installations and the specialized file systems that have been developed to take
advantage of them.
2.2.1 Physical Organisation of Compute Nodes
The parallel-computing architecture, often called cluster computing, consists of
compute nodes that are organised into a number of racks. A rack may contain 8 to
64 compute nodes, connected by a network such as gigabit Ethernet. The racks are
connected with each other through a switch or another level of network. Because
many pairs of nodes may need to communicate across racks, the bandwidth of
inter-rack communication should be greater than the bandwidth of the intra-rack
Ethernet. Figure 2.1 shows the architecture of a large-scale computing system
with multiple racks, each with multiple nodes. In this network, the principal
failure modes are the loss of a single node and the loss of an entire rack. If a
node fails, the system cannot provide the data stored on that node or perform
computations on it. If the connection to a rack fails, the network connecting
its nodes to each other and to the outside world fails.
Figure 2.1: Computing nodes are organized into racks and racks are
interconnected by a switch.
Large computations may take minutes or even hours. If the computation had to be
aborted and restarted every time a single component failed, it might never
complete successfully.
To overcome this problem,
1. Files must be stored redundantly onto multiple nodes.
2. Computations must be divided into tasks and allocated to the multiple
nodes.
2.2.2 Large-Scale File System Organization
To store an enormous file on multiple computers, you need to use a distributed
file system. A Distributed File System (DFS) can handle data stored across
multiple nodes or clusters. Files stored on a distributed file system are rarely
updated. A file is stored on multiple nodes by dividing it into a number of
chunks.
For example, to store a file of 30 TB in a distributed file system whose cluster
nodes each have a capacity of 10 TB, the file needs to be divided into blocks or
chunks. The size of the chunk is defined by the user, for example 64 megabytes
or 128 megabytes.
Fault tolerance is achieved by replicating each chunk, typically three times, at
three different compute nodes on different racks. This also lets us retrieve a
copy of the chunk in case of rack failure. Usually, both the chunk size and the
degree of replication can be decided by the user.
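The arithmetic of the example above, and one simple way of spreading replicas over racks, can be sketched in Python. This is an illustration only: the rack and node names are hypothetical, and the round-robin placement rule is an assumption made for the example, not the policy of GFS or HDFS.

```python
def chunk_count(file_bytes, chunk_bytes):
    """Number of chunks needed; the last chunk may be only partly full."""
    return -(-file_bytes // chunk_bytes)            # ceiling division


def place_replicas(chunk_id, racks, replication=3):
    """Pick `replication` nodes for a chunk, each on a different rack."""
    rack_names = sorted(racks)
    if replication > len(rack_names):
        raise ValueError("need at least as many racks as replicas")
    placement = []
    for i in range(replication):
        # Consecutive rack indices guarantee that the racks are distinct.
        rack = rack_names[(chunk_id + i) % len(rack_names)]
        nodes = racks[rack]
        placement.append((rack, nodes[chunk_id % len(nodes)]))
    return placement


TB, MB = 2 ** 40, 2 ** 20
n_chunks = chunk_count(30 * TB, 64 * MB)            # 491520 chunks of 64 MB

# Hypothetical cluster layout: rack name -> node names.
racks = {"rack-a": ["a1", "a2"], "rack-b": ["b1", "b2"],
         "rack-c": ["c1"], "rack-d": ["d1", "d2"]}
replicas = place_replicas(5, racks)
```

With three replicas per chunk, the 30 TB file occupies three times as many chunk slots across the cluster, which is the storage cost paid for surviving the loss of a node or a whole rack.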
The metadata of the chunks of a file is stored on a name node, which acts as a
master node. The master node is itself replicated, and a directory for the file
system as a whole knows where to find its copies. The directory itself can be
replicated, and all participants using the DFS know where the directory copies
are.
DFS Implementations
There are several distributed file systems of this type. Some that are used in
practice are:
1. The Google File System (GFS), the original of the class.
2. Hadoop Distributed File System (HDFS), an open-source DFS used with
Hadoop, an implementation of MapReduce, and distributed by the Apache
Software Foundation.
3. CloudStore, an open-source DFS originally developed by Kosmix.
2.3 APACHE HADOOP
Apache Hadoop is a collection of open-source utilities that allows us to use a
network of many computers to solve problems involving massive amounts of
data and computation. Hadoop provides a software framework for distributed
data storage and the MapReduce programming model for processing big data.
Hadoop is designed to scale up from a single server to a cluster of thousands of
machines. Each machine in the cluster offers local computation and storage.
Apache Hadoop was originally designed for computer clusters built from
commodity hardware, although it also runs on high-end hardware. The Hadoop
framework distributes an analytical computation on massive data across many
machines, each of which simultaneously operates on its own individual chunk of
data.
For distributed computing, a distributed system shall meet the following
requirements:
1. Fault Tolerance: If any component fails, the entire system should not fail.
The system should degrade gracefully into a lower-performing state. If a
failed component recovers, it should be able to rejoin the system.
2. Recoverability: In case of failure, no data should be lost.
3. Consistency: The final result should not be affected by the failure of any
component.
4. Scalability: Adding more data and more computation should lead to a decline
in performance, not to failure; increasing resources should result in a
proportional increase in capacity.
the following list:
1. Data is distributed immediately when added to the cluster and stored on
multiple nodes. Nodes prefer to process data that is stored locally in order to
minimize traffic across the network.
2. Data is stored in blocks of a fixed size (usually 128 MB) and each block is
duplicated multiple times across the system to provide redundancy and data
safety.
3. A computation is usually referred to as a job; jobs are broken into tasks
where each individual node performs the task on a single block of data. munotes.in
Page 60
Track C Business Intelligence
and Big Data Analytics –II
(Mining Massive Data sets )
60 4. Jobs are written at a high level without concern for network programming,
time, or low -level infrastructure, allowing developers to focus on the data
and computation rather than distributed programming details.
5. The amount of network traffic between nodes should be minimi zed
transparently by the system. Each task should be independent and nodes
should not have to communicate with each other during processing to ensure
that there are no inter -process dependencies that could lead to deadlock.
6. Jobs are fault tolerant usually through task redundancy, such that if a single
node or task fails, the final computation is not incorrect or incomplete.
7. Master programs allocate work to worker nodes such that many worker
nodes can operate in parallel, each on their own portion of the l arger dataset.
These basic concepts, while implemented slightly differ to various Hadoop
systems, drive the core architecture and together ensure that the requirements for
fault tolerance, recoverability, consistency, and scalability are met. These
require ments also ensure that Hadoop is a data management system that behaves
as expected for analytical data processing, which has traditionally been
performed in relational databases or scientific data warehouses.
2.3.1 Elements of Hadoop Ecosystem
The Hadoop ecosystem is a platform that provides various services to solve
big data problems. It includes various Apache products as well as commercial
tools and solutions. The four major elements of the Hadoop ecosystem are Hadoop
Distributed File System (HDFS), MapReduce, YARN and Hadoop Common. The
ecosystem provides tools that are used to load, analyse and maintain data. Some
of the components/tools of the Hadoop ecosystem are as follows:
1. Hadoop Distributed File System (HDFS)
2. Yet Another Resource Negotiator (YARN)
3. MapReduce - Programming based Data Processing
4. Spark for In-Memory data processing
5. Pig and Hive - Query based processing of data services
6. HBase - NoSQL Database
7. Mahout and Spark MLlib - Machine Learning algorithm libraries
8. Solr and Lucene - Searching and Indexing
9. ZooKeeper - Managing Cluster
10. Oozie - Job Scheduling
2.4 MAPREDUCE
Hadoop MapReduce is a software framework. MapReduce is also referred to as a
programming model that performs parallel and distributed processing on massive
datasets. Implementations of MapReduce can be used to manage large-scale
computations in a way that is tolerant of hardware faults.
MapReduce is the processing component of Hadoop and is used to write
applications that process huge amounts of data in parallel on large Hadoop
clusters of commodity hardware. These clusters are scalable, reliable and fault
tolerant.
The term ‘MapReduce’ specifies the two distinct tasks that are to be performed
by Hadoop programs:
1. The Map task, which accepts the data and converts it into another set of
data. Here each individual element of the data is broken down into key-value
pairs.
2. The Reduce task, which takes the output of the Map task as input and
combines it into a smaller set of tuples. So, the Reduce task takes place
after the completion of the Map task.
In brief, a MapReduce computation executes as follows:
1. Each Map task is given one or more chunks from a distributed file
system and turns them into a sequence of key-value pairs. The way
key-value pairs are produced from the input data is determined by the code
written by the user for the Map function.
2. The key-value pairs from each Map task are collected by a master
controller and sorted by key. The keys are divided among all the Reduce
tasks, so all key-value pairs with the same key wind up at the same Reduce
task.
3. The Reduce tasks work on one key at a time, and combine all the values
associated with that key in some way. The manner of combination of
values is determined by the code written by the user for the Reduce
function.
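The three steps above can be imitated on a single machine with a few lines of Python. This is only a sequential sketch of the data flow, assuming everything fits in memory; it is not how Hadoop is implemented, and the function name `map_reduce` and the temperature records are hypothetical.

```python
from collections import defaultdict


def map_reduce(chunks, map_fn, reduce_fn):
    """Sequential imitation of the map -> group by key -> reduce flow."""
    # Step 1: each "Map task" turns its chunk into key-value pairs.
    pairs = []
    for chunk in chunks:
        pairs.extend(map_fn(chunk))
    # Step 2: collect the pairs and group the values by key.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    # Step 3: each key, taken in sorted order, is reduced with the
    # user-supplied function.
    return [reduce_fn(key, groups[key]) for key in sorted(groups)]


# Hypothetical input: two chunks of (city, temperature) records.
chunks = [[("pune", 31), ("mumbai", 33)],
          [("pune", 35), ("mumbai", 30)]]
hottest = map_reduce(chunks,
                     map_fn=lambda chunk: chunk,              # already pairs
                     reduce_fn=lambda city, temps: (city, max(temps)))
```

Only `map_fn` and `reduce_fn` change from one problem to another; the grouping in the middle is the same for every job, which is why the framework can provide it.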
Figure 2.2 shows the schematic of a MapReduce computation.
Figure 2.2: Schematic of a MapReduce Computation
MapReduce programming is useful for gaining valuable insights from data.
The advantages of MapReduce programming are as follows:
1. Simplicity: Developers can write code in a choice of languages, including
Java, C++ and Python.
2. Scalability: Businesses can process petabytes of data stored in the Hadoop
Distributed File System (HDFS).
3. Flexibility: Hadoop enables easier access to multiple sources of data and
multiple types of data.
4. Speed: Due to parallel processing and minimal data movement, Hadoop offers
fast processing of massive amounts of data.
In Hadoop, a MapReduce program is typically written in Java using three classes:
1. Mapper class: The Mapper class performs the Map task of splitting data and
converting it into key-value pairs. These intermediate key-value pairs are
written to the local disk of the node running the mapper rather than to HDFS.
2. Reducer class: The Reducer class fetches the output of the mappers,
processes it and generates the final output in the form of key-value
pairs. The reducer stores this output of the Reduce task in HDFS.
3. Driver class: The Driver class sets up the MapReduce job to run in Hadoop.
With the help of the Mapper and Reducer classes, MapReduce processes the
given input data and generates the output in the form of key-value pairs. During
this process the data undergoes the various MapReduce steps.
2.5 STEPS OF MAP REDUCE
The MapReduce programming model solves a problem through the following
steps.
1. The Map Task
2. Grouping by Keys
3. Reduce Task
4. Combiner
2.5.1 The Map Task
The mapper accepts the user's input file, whose elements can be of any type,
such as tuples or documents. The input is split into a number of chunks, which
are distributed over the network of map nodes. Here, a chunk is a collection of
data elements, and no element is stored across two chunks. Each map node
processes its data and returns a list of key-value pairs.
Technically, all inputs to Map tasks and outputs from Reduce tasks are of the
key-value-pair form, but normally the keys of input elements are not relevant and
we shall tend to ignore them. Insisting on this form for inputs and outputs is
motivated by the desire to allow composition of several MapReduce processes.
A Map function is written to convert input elements to key-value pairs. The types
of keys and values are each arbitrary. Here, the keys are not “keys” in the usual
sense; they do not have to be unique. Rather, a Map task can produce several
key-value pairs with the same key, even from the same element.
Example 2.1 : Let us discuss a map -reduce computation with the standard Word
count example application: counting the number of occurrences for each word in
a collection of documents.
Here, in this example the input file is a repository of documents, and each
document is an element. Here the Map function defines the key value pair with
the document words as keys and the number of occurrences of words as integer
values. The Map task reads a document and splits it into its sequence of words
w1, w2, . . .,w n. After processing the Map task emits a sequence of key -value pairs
where the value is always 1. That is, the output of the Map task for this document
is the sequence of key -value pairs:
(w1,1), (w 2,1),. . ., (w n,1)
A single Map task will typically process many documents, namely all the documents
in one or more chunks. Its output will therefore be more than the sequence for the
one document suggested above. If a word w appears m times among all the documents
assigned to that process, then there will be m key-value pairs (w, 1) among its
output.
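The Map function of Example 2.1 can be sketched in a few lines of Python. This is only an illustration: the function name `map_word_count` is hypothetical, and splitting on whitespace is an assumed, simplified definition of a "word".

```python
def map_word_count(document):
    """Map function: emit the pair (w, 1) for every word occurrence in one document."""
    for word in document.split():  # assumption: words are separated by whitespace
        yield (word, 1)


pairs = list(map_word_count("the cat saw the dog"))
print(pairs)
```

Note that the key "the" is emitted twice, once per occurrence, which illustrates that keys need not be unique.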
2.5.2 Grouping by Keys
Grouping by key is performed by the system in the same way regardless of what the
Map and Reduce tasks do. The master controller process knows the number of Reduce
tasks, r, which is normally specified by the user of the map-reduce system.
The master controller picks a hash function that is applied to keys and produces
a bucket number from 0 to r-1. Each key that is emitted by a Map task is hashed,
and its key-value pair is put in one of r local files. Each file is destined for
one of the Reduce tasks.
After all the Map tasks have completed successfully, the master controller merges
the files from each Map task that are destined for a particular Reduce task and
feeds the merged file to that process as a sequence of key-list-of-value pairs.
For each key k, the input to the Reduce task that handles key k is a pair of the
form (k, [v1, v2, . . ., vn]), where (k, v1), (k, v2), . . ., (k, vn) are all the
key-value pairs with key k coming from all the Map tasks.
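The hashing and merging described above can be sketched as follows. This is a single-process illustration, not the behaviour of any particular MapReduce system: the helper names `bucket_of`, `partition` and `shuffle` are hypothetical, lists stand in for local files, and CRC32 is used only as an assumed example of a deterministic hash function.

```python
import zlib
from collections import defaultdict


def bucket_of(key, r):
    """Hash a string key to a bucket number in the range 0..r-1."""
    return zlib.crc32(key.encode("utf-8")) % r


def partition(map_output, r):
    """Split one Map task's output into r 'local files', one per Reduce task."""
    local_files = [[] for _ in range(r)]
    for key, value in map_output:
        local_files[bucket_of(key, r)].append((key, value))
    return local_files


def shuffle(map_outputs, r):
    """Merge the i-th local file of every Map task and group its pairs by key."""
    reduce_inputs = [defaultdict(list) for _ in range(r)]
    for output in map_outputs:
        for i, local_file in enumerate(partition(output, r)):
            for key, value in local_file:
                reduce_inputs[i][key].append(value)
    return [dict(grouped) for grouped in reduce_inputs]


# Output of two hypothetical Map tasks.
map_outputs = [
    [("a", 1), ("b", 1), ("a", 1)],
    [("b", 1), ("c", 1)],
]
reduce_inputs = shuffle(map_outputs, r=2)
print(reduce_inputs)
```

Because the bucket depends only on the key, all pairs with the same key reach the same Reduce task, whichever Map task produced them.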
2.5.3 The Reduce Task
The Reduce function reads the output of the Map function after grouping, that is,
pairs consisting of a key and its list of associated values, and combines the
list of values for each key in some way. The outputs of all the Reduce tasks are
then merged into a single file.
Example 2.2: Let us continue with the word-count example of Example 2.1. The
Reduce function sums all the values and returns a sequence of (w, m) pairs, where
w is a word that appears at least once in the input documents and m is the total
number of occurrences of w among all those documents.
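A minimal sketch of this Reduce function is shown below. The name `reduce_word_count` and the sample grouped input are assumptions made for illustration.

```python
def reduce_word_count(word, counts):
    """Reduce function: turn (w, [1, 1, ..., 1]) into (w, m) by summing the list."""
    return (word, sum(counts))


# Hypothetical grouped input, as it would arrive from the grouping step.
grouped = {"the": [1, 1], "cat": [1]}
result = [reduce_word_count(w, counts) for w, counts in sorted(grouped.items())]
print(result)
```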
2.5.4 Combiners
Often, the Reduce function is associative and commutative, meaning that the values
can be combined in any order with the same result. The addition performed in
Example 2.2 is an example of an associative and commutative operation: the order
of the numbers in the list of values v1, v2, . . ., vn does not affect the sum.
When the Reduce function is associative and commutative, it is possible to push
some of the work of the Reduce tasks to the Map tasks. For example, in Example
2.1, instead of producing many pairs (w, 1), (w, 1), . . . in a Map task, we could
apply the Reduce function within the Map task before its output is sent for
grouping and aggregation. These key-value pairs would thus be replaced by one pair
with key w and value equal to the sum of all the 1’s in those pairs. That is, the
pairs with key w generated by a single Map task would be combined into a pair
(w, m), where m is the number of times that w appears among the documents handled
by this Map task. Even though the Reduce function is applied within the Map tasks,
grouping and aggregation are still needed, because pairs with the same key may
come from Map tasks at different map nodes.
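The effect of a combiner on the word-count example can be sketched as follows. The function names are hypothetical, and `collections.Counter` is used here simply as a convenient way to sum the 1's locally inside one Map task.

```python
from collections import Counter, defaultdict


def map_with_combiner(documents):
    """One Map task: count words locally and emit one (w, m) pair per distinct word."""
    local_counts = Counter()
    for document in documents:
        for word in document.split():
            local_counts[word] += 1
    return sorted(local_counts.items())


def reduce_partial_counts(word, partial_counts):
    """Reduce: sum the partial counts that arrive from different Map tasks."""
    return (word, sum(partial_counts))


# Two hypothetical Map tasks, each handling its own documents.
task_outputs = [map_with_combiner(["a b a", "b a"]), map_with_combiner(["a c"])]

# Grouping is still required because 'a' is emitted by both Map tasks.
grouped = defaultdict(list)
for output in task_outputs:
    for word, partial in output:
        grouped[word].append(partial)
totals = sorted(reduce_partial_counts(w, p) for w, p in grouped.items())
print(task_outputs[0], totals)
```

The first Map task sends two pairs instead of five, which is the saving in communication that the combiner is meant to provide.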
2.5.5 Details of MapReduce Execution
Figure 2.3 shows how the processes, tasks and files interact while a MapReduce
job executes. Using a library provided by a MapReduce system such as Hadoop, the
user program forks a Master controller process and some number of Worker
processes at different compute nodes. A Worker handles either Map tasks (a Map
worker) or Reduce tasks (a Reduce worker), but not both.
The Master creates some number of Map tasks and some number of Reduce tasks,
these numbers being selected by the user program, and assigns the tasks to Worker
processes. Depending on the size of the input file and the chunk size defined by
the user, the Master creates one Map task for every chunk. Each Map task needs to
create an intermediate file for each Reduce task, so the number of Reduce tasks
is normally kept smaller than the number of Map tasks; otherwise the number of
intermediate files explodes.
Figure 2.3: Overview of the execution of a MapReduce program
The Master node maintains and controls the status of all the Map and Reduce nodes
and keeps track of the execution of their tasks. When a Map or Reduce node
finishes a task, the Master schedules a new task on that node. If the execution
of a task fails at a particular node, the Master reallocates that task to another
node.
Every Map task is assigned one or more chunks of the input files. The Map node
executes the mapper code, written by the user, on these chunks. The Map task
creates an output file for each Reduce task, stores it on the local disk of the
Map node, and sends information about the size and location of this file to the
Master node. The Master node assigns the Reduce tasks to worker nodes and
provides the output files of the Map tasks as input. Each Reduce task executes
code written by the user and writes its output to a file that is stored in a
distributed file system.
2.5.6 Coping with Node Failures
The Master node handles failures of the Map nodes and the Reduce nodes. But what
if the compute node running the Master fails? This one failure brings the entire
process down, and the entire MapReduce job must be restarted.
The Master periodically pings the Worker processes, so it can detect the failure
of a worker node. When a Map worker fails, all the Map tasks that were assigned
to it must be redone on another node, even those that had completed, because
their output, which is destined for the Reduce tasks, resides on the failed node
and is no longer available. The Master must also inform each Reduce task that the
location of its input from that Map task has changed.
Managing the failure of a Reduce worker is simpler. The Master simply sets the
status of its currently executing Reduce tasks to idle. These will be rescheduled
on another Reduce worker later.
Example 2.3: Let us understand the steps of MapReduce with a word count
example. The Word count example reads a text file and counts the total number
of occurrences of each word.
Let us consider the text file sample.txt with the following text.
Bus, Car, Train, Ship, Ship, Train, Bus, Ship, Car
Figure 2.4 shows the steps of MapReduce tasks for the Wordcount example.
Figure 2.4: MapReduce steps of Wordcount example
The Map Task
For Example 2.3, the mapper reads the input text file ‘sample.txt’ and splits it
into 3 chunks: (Bus, Car, Train), (Ship, Ship, Train), and (Bus, Ship, Car). After
splitting, each of these chunks is assigned to a map node. Each map node then
splits its chunk into words and converts each word into a key-value pair by
attaching a count of 1. Here the key is the word and the value is the number of
occurrences represented by that pair.
Map Node 1: (Bus,1), (Car,1), (Train,1)
Map Node 2: (Ship,1), (Ship,1), (Train,1)
Map Node 3: (Bus,1), (Ship,1), (Car,1)
Grouping by Key
In Example 2.3, after the map phase completes, the output of the mappers is
partitioned by key with the help of the sorting and shuffling process. The
partition process sends all tuples with the same key to the same reducer. The sort
and shuffle step acts on these lists of pairs and sends out each unique key with
the list of values associated with it.
Reducer Node 1: (Bus, [1,1])
Reducer Node 2: (Car, [1,1])
Reducer Node 3: (Ship, [1,1,1])
Reducer Node 4: (Train, [1,1])
The Reduce Task
In Example 2.3 the reducer will aggregate the values of intermediate tuples that
are generated in sorting and shuffling step and will generate the list of unique
key-value pairs with the total number of key occurrences by summing the list of
values.
Reducer Node 1: (Bus,2)
Reducer Node 2: (Car,2)
Reducer Node 3: (Ship,3)
Reducer Node 4: (Train,2)
Combiner
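In Example 2.3, the final step reads the key-value pairs generated by the
reducers, merges them into a single set of key-value pairs and writes it to the
output file. Note that this merging of Reduce outputs differs from the combiner
of Section 2.5.4, which pre-aggregates pairs inside a Map task.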
Final Output:
(Bus,2)
(Car,2)
(Ship,3)
(Train,2)
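The whole of Example 2.3 can be reproduced with a short single-process sketch. It assumes that the words in sample.txt are separated by commas (including the first separator) and that each chunk holds three words, as in the description above; the variable names are illustrative only.

```python
from collections import defaultdict

text = "Bus, Car, Train, Ship, Ship, Train, Bus, Ship, Car"
words = [w.strip() for w in text.split(",")]

# Split the input into 3 chunks of 3 words, one chunk per map node.
chunks = [words[i:i + 3] for i in range(0, len(words), 3)]

# Map: every map node emits (word, 1) for each word of its chunk.
map_outputs = [[(word, 1) for word in chunk] for chunk in chunks]

# Grouping by key: collect the values of each key from all map nodes.
grouped = defaultdict(list)
for output in map_outputs:
    for word, one in output:
        grouped[word].append(one)

# Reduce: sum the list of values of each key, then merge into one output.
final_output = sorted((word, sum(ones)) for word, ones in grouped.items())
print(final_output)  # [('Bus', 2), ('Car', 2), ('Ship', 3), ('Train', 2)]
```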
2.6 ALGORITHMS USING MAPREDUCE
MapReduce is widely used for parallel computing tasks such as determining the
product prices that yield the highest profits, prediction, and recommendation
analysis. It allows programmers to run models over different data sets using
advanced statistical and machine learning techniques that help in making
predictions.
MapReduce is not a solution to every problem, and not every problem even needs a
distributed file system for storing its data. For example, we would not expect to
use either a DFS or an implementation of MapReduce for managing online retail
sales, even though a large online retailer such as Amazon.com uses thousands of
compute nodes when processing requests over the Web. The reason is that the
principal operations on Amazon data involve responding to searches for products,
recording sales, and so on. MapReduce is not advisable for processes that involve
relatively little calculation and that need to update the database.
On the other hand, MapReduce is well suited to large computations. For example,
Amazon might use MapReduce to perform analytic queries on large amounts of data,
such as finding the buying patterns of customers who purchase a particular
product.
MapReduce can be used in a variety of applications, including distributed
pattern-based searching, distributed sorting, web link-graph reversal and web
access log statistics. The model has also been adapted to several computing
environments, such as multi-cluster systems, desktop grids, volunteer computing
environments, dynamic cloud environments, mobile environments and
high-performance computing environments.
Google used MapReduce to regenerate its index of the World Wide Web. The original
purpose for which the Google implementation of MapReduce was created was to
execute the very large matrix-vector multiplications that are needed in the
calculation of PageRank. Another important class of operations that can use
MapReduce effectively are the relational-algebra operations.
2.6.1 Matrix-Vector Multiplication by MapReduce
Let us consider a matrix M of size n × n. The element of M in row i and column j
is denoted by mij. Let v be a vector of length n, whose jth element is vj. The
matrix-vector product of M and v is the vector x of length n, whose ith element
xi is given by
$$x_i = \sum_{j=1}^{n} m_{ij} v_j$$
If n is small, say 100, we do not want to use a DFS or MapReduce for this
calculation. The method is useful when n is large. For example, in the ranking of
Web pages by search engines, n is in the tens of billions.
Let us first assume that n is large, but not so large that vector v cannot fit in
main memory and thus be available to every Map task. Note that nothing in the
definition of map-reduce forbids providing the same input to more than one Map
task.
The matrix M and the vector v will each be stored in a file of the Distributed
File System. The elements of the matrix are stored with their row and column
coordinates: the element mij in row i and column j can be represented by the
triple (i, j, mij). In the same way, the jth element of the vector v is referred
to as vj.
The Map Function: Each Map task takes the entire vector v and a chunk of the
matrix M. From each matrix element mij it produces the key-value pair (i, mij vj).
Thus, all terms of the sum that make up the component xi of the matrix-vector
product get the same key.
The Reduce Function: The Reduce task simply sums all the values associated with a
given key i. The result is a pair (i, xi).
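The two functions can be sketched as below. This is an illustration under stated assumptions: matrix elements are given as triples (i, j, mij) with 0-based indices, the vector is an ordinary Python list held in memory, and the names `map_matvec`, `reduce_matvec` and `matvec_mapreduce` are hypothetical.

```python
from collections import defaultdict


def map_matvec(chunk, v):
    """Map: for each matrix element (i, j, m_ij) emit the pair (i, m_ij * v_j)."""
    for i, j, m_ij in chunk:
        yield (i, m_ij * v[j])


def reduce_matvec(i, terms):
    """Reduce: sum all terms that share the key i, giving the pair (i, x_i)."""
    return (i, sum(terms))


def matvec_mapreduce(chunks, v):
    grouped = defaultdict(list)
    for chunk in chunks:  # each chunk would go to one Map task
        for i, term in map_matvec(chunk, v):
            grouped[i].append(term)
    return dict(reduce_matvec(i, terms) for i, terms in grouped.items())


# M = [[1, 2], [3, 4]] stored as triples in two chunks, and v = [5, 6].
chunks = [[(0, 0, 1), (0, 1, 2)], [(1, 0, 3), (1, 1, 4)]]
x = matvec_mapreduce(chunks, [5, 6])
print(x)  # {0: 17, 1: 39}
```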
2.6.2 If the Vector v Cannot Fit in Main Memory
If the vector v is so large that it does not fit in main memory, we can divide
the matrix into vertical stripes of equal width and divide the vector into an
equal number of horizontal stripes of the same height. We use enough stripes that
the portion of the vector in one stripe fits conveniently into the main memory of
a compute node. Figure 2.5 shows a matrix and a vector that are each divided into
five stripes.
Figure 2.5: Division of matrix and vector into five stripes
The ith stripe of the matrix multiplies only components from the ith stripe of the
vector. We can store each stripe of the matrix and of the vector in its own file.
Each Map task is assigned a chunk from one of the stripes of the matrix and gets
the entire corresponding stripe of the vector. The Map and Reduce tasks can then
act exactly as described above for the case where Map tasks get the entire
vector.
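The striping idea can be sketched as follows, again with 0-based triples (i, j, mij). The function name `matvec_striped` and the parameter `width` (the number of matrix columns, and vector components, per stripe) are assumptions for illustration; a real system would keep each stripe in its own file rather than in a Python list.

```python
from collections import defaultdict


def matvec_striped(triples, v, width):
    """Matrix-vector product in which a Map task sees only one stripe of v."""
    # Horizontal stripes of the vector, each holding `width` components.
    v_stripes = [v[start:start + width] for start in range(0, len(v), width)]

    # Vertical stripes of the matrix: column j belongs to stripe j // width.
    m_stripes = defaultdict(list)
    for i, j, m_ij in triples:
        m_stripes[j // width].append((i, j, m_ij))

    grouped = defaultdict(list)
    for s, chunk in m_stripes.items():
        v_part = v_stripes[s]  # the only part of v this Map task needs
        for i, j, m_ij in chunk:
            grouped[i].append(m_ij * v_part[j - s * width])  # Map: (i, m_ij * v_j)

    return {i: sum(terms) for i, terms in grouped.items()}  # Reduce: sum per key i


# A sparse 4 x 4 matrix given by its nonzero elements, and v = [1, 2, 3, 4].
triples = [(0, 0, 1), (0, 3, 2), (1, 1, 3), (2, 2, 4), (3, 0, 5), (3, 3, 6)]
x = matvec_striped(triples, [1, 2, 3, 4], width=2)
print(x)
```

Terms for the same row i may come from different stripes, so the Reduce step still has to sum over all of them.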
2.6.3 Relational Algebra Operations
Database queries perform a number of operations on large-scale data. In many
traditional database applications, the database is large but each query retrieves
only a small amount of data. For example, in a banking application the database
is very large, but a query for the balance of one account touches very little of
it. Such applications do not need MapReduce.
In fact, there are many operations on data that can be described easily in terms
of the common database-query primitives, even if the queries themselves are not
executed within a database management system. Thus, a good starting point for
exploring applications of MapReduce is to consider the standard operations on
relations.
In the relational model, a relation is a table with column headers called
attributes. Rows of the relation are called tuples. The set of attributes of a
relation is called its schema. We often write an expression like
R(A1, A2, . . . , An) to say that the relation name is R and its attributes are
A1, A2, . . . , An.
Example 2.4: Figure 2.6 shows part of the relation Links that describes the
structure of the Web. The relation has two attributes, From and To. In this
relation, a row, or tuple, is a pair of URLs such that there is at least one link
from the first URL to the second. For example, the first row of Figure 2.6 is the
pair (url1, url2), which says that the Web page url1 has a link to the page url2.
Figure 2.6 shows only four tuples, but the relation stored by a typical search
engine has billions of such tuples. Such large relations are stored as files on a
DFS.
Figure 2.6: Relation Links consists of the set of pairs of URLs, such that the first
has one or more links to the second
Relational algebra specifies several standard operations on relations that are
used to implement queries. The queries themselves are usually written in SQL.
Some of the relational-algebra operations are:
1. Selection : Applying a condition C to each tuple in the relation and producing
as output only those tuples that satisfy C. The result of the selection is
denoted σC(R).
2. Projection : For subset S of the attributes of the relation, produce from each
tuple only the components for the attributes in S. The result of the projection
is denoted πS(R).
3. Union, Intersection, and Difference: These set operations apply to the sets
of tuples in two relations that have the same schema.
4. Natural Join: Given two relations, compare each pair of tuples, one from
each relation. If the tuples agree on all the attributes that are common to the
two schemas, then produce a tuple that has components for each of the
attributes in either schema and agrees with the two tuples on each attribute. If
the tuples disagree on one or more shared attributes, then produce nothing
from this pair of tuples. The natural join of relations R and S is denoted R ⋈ S.
While we shall discuss executing only the natural join with map-reduce,
all equijoins can be executed in the same manner.
5. Grouping and Aggregation: For a given relation R, partition its tuples
according to their values in one set of attributes G, called the grouping
attributes. Then, for each group, aggregate the values in certain other
attributes.
The common aggregations are SUM, COUNT, AVG, MIN, and MAX. MIN and MAX require
that the aggregated attribute have a type that can be compared, such as numbers
or strings, while SUM and AVG require a numeric type on which arithmetic can be
performed. The grouping-and-aggregation operation on a relation R is denoted by
γX(R), where X is a list of elements that are either
a) A grouping attribute, or
b) An expression θ(A), where θ is one of the five aggregation operations such
as SUM, and A is an attribute not among the grouping attributes.
The result of this operation is one tuple for each group. That tuple has a
component for each of the grouping attributes, with the value common to tuples
of that group, and a component for each aggregation, with the aggregated value
for that group.
Example 2.5 : For the relation in Figure 2.6, let us try to find the paths of length
two in the Web. That is, we want to find the triples of URLs (u, v, w) such that
there is a link from u to v and a link from v to w.
We need to take the natural join of Links with itself, but we first need to
imagine that it is two relations, with different schemas, so we can describe the
desired connection as a natural join. Thus, imagine that there are two copies of
Links, namely L1(U1, U2) and L2(U2, U3). Now, if we compute L1 ⋈ L2, we shall
have exactly what we want. That is, for each tuple t1 of L1 (i.e., each tuple of
Links) and each tuple t2 of L2 (another tuple of Links, possibly even the same
tuple), see if their U2 components are the same. Note that these components are
the second component of t1 and the first component of t2. If these two components
agree, then produce a tuple for the result, with schema (U1, U2, U3). This tuple
consists of the first component of t1, the second component of t1 (which must
equal the first component of t2), and the second component of t2.
We may not want the entire path of length two, but only the pairs (u, w) of URLs
such that there is at least one path from u to w of length two. If so, we can
project out the middle components by computing πU1,U3(L1 ⋈ L2).
2.6.4 Computing Selections by MapReduce
Selections do not require the full power of map-reduce. They can be done most
conveniently in the Map portion alone, although they could also be done in the
Reduce portion alone. A map-reduce implementation of the selection σC(R) is as
follows.
The Map Function: For each tuple t in R, test if it satisfies C. If so, produce the
key-value pair (t, t). That is, both the key and value are t.
The Reduce Function: The Reduce function is the identity. It simply passes
each key -value pair to the output.
Here, the output is not exactly a relation, since it consists of key-value pairs.
However, a relation can be obtained by using only the value components (or only
the key components) of the output.
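A sketch of selection in this style is given below. The sample `links` relation, the condition, and the function names are hypothetical and serve only to illustrate the Map and Reduce functions just described.

```python
from collections import defaultdict


def map_select(t, condition):
    """Map: emit (t, t) only if the tuple t satisfies the condition C."""
    if condition(t):
        yield (t, t)


def reduce_identity(key, values):
    """Reduce: the identity; every key-value pair is passed to the output."""
    for value in values:
        yield (key, value)


def select(relation, condition):
    grouped = defaultdict(list)
    for t in relation:
        for key, value in map_select(t, condition):
            grouped[key].append(value)
    pairs = [p for key, values in grouped.items() for p in reduce_identity(key, values)]
    return sorted(value for _key, value in pairs)  # keep only the value components


# Hypothetical sample of a Links(From, To) relation.
links = [("url1", "url2"), ("url1", "url3"), ("url2", "url3"), ("url2", "url4")]
from_url1 = select(links, lambda t: t[0] == "url1")
print(from_url1)
```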
2.6.5 Computing Projections by MapReduce
Projection is performed similarly to selection. However, because projection may
cause the same tuple to appear several times, the Reduce function must eliminate
duplicates. We may compute πS(R) as follows.
The Map Function: For each tuple t in R, construct a tuple t′ by eliminating from
t those components whose attributes are not in S. Output the key-value pair
(t′, t′).
The Reduce Function: For each key t′ produced by any of the Map tasks, there will
be one or more key-value pairs (t′, t′). The Reduce function turns
(t′, [t′, t′, . . ., t′]) into (t′, t′), so it produces exactly one pair (t′, t′)
for this key t′.
The Reduce operation is duplicate elimination. This operation is associative and
commutative, so a combiner associated with each Map task can eliminate whatever
duplicates are produced locally. However, the Reduce tasks are still needed to
eliminate two identical tuples coming from different Map tasks.
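Projection with duplicate elimination can be sketched as follows. The schema is passed as a tuple of attribute names so that the attributes in S can be located; this representation, the sample data and the function names are assumptions for illustration.

```python
from collections import defaultdict


def map_project(t, schema, S):
    """Map: keep only the components of t whose attributes are in S; emit (t', t')."""
    t_prime = tuple(t[schema.index(attribute)] for attribute in S)
    yield (t_prime, t_prime)


def reduce_project(t_prime, values):
    """Reduce: turn (t', [t', ..., t']) into the single pair (t', t')."""
    return (t_prime, t_prime)


def project(relation, schema, S):
    grouped = defaultdict(list)
    for t in relation:
        for key, value in map_project(t, schema, S):
            grouped[key].append(value)
    return sorted(reduce_project(key, values)[1] for key, values in grouped.items())


# Hypothetical sample of a Links(From, To) relation.
links = [("url1", "url2"), ("url1", "url3"), ("url2", "url3"), ("url2", "url4")]
sources = project(links, ("From", "To"), ("From",))
print(sources)
```

Four input tuples project onto only two distinct tuples, which shows why the Reduce step is needed.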
2.6.6 Union, Intersection and Difference by MapReduce
Union
Let us consider the union of two relations. Suppose relations R and S have the
same schema. Map tasks will be assigned chunks from either R or S; it doesn’t
matter which. The Map tasks don’t really do anything except pass their input
tuples as key -value pairs to the Reduce tasks. The latter need only eliminate
duplicates as for projection.
The Map Function: Turn each input tuple t into a key -value pair (t, t).
The Reduce Function: Associated with each key t there will be either one or
two values. Produce output (t, t) in either case.
Intersection
To compute the intersection, we can use the same Map function. However, the
Reduce function must produce a tuple only if both relations have the tuple. If
the key t has two values [t, t] associated with it, then the Reduce task for t
should produce (t, t). However, if the value associated with key t is just [t],
then one of R and S is missing t, so we don’t want to produce a tuple for the
intersection. We need to produce a value that indicates “no tuple,” such as the
SQL value NULL. When the result relation is constructed from the output, such a
tuple will be ignored.
The Map Function: Turn each tuple t into a key -value pair (t, t).
The Reduce Function: If key t has value list [t, t], then produce (t, t). Otherwise,
produce (t, NULL).
Difference
The difference R − S requires a bit more thought. A tuple t appears in the output
only if it is in R but not in S. The Map function can pass tuples from R and S
through, but must inform the Reduce function whether each tuple came from R or S.
We shall thus use the name of the relation as the value associated with the key
t. Here is a specification for the two functions.
The Map Function : For a tuple t in R, produce key -value pair (t, R), and for a
tuple t in S, produce key -value pair (t, S). Note that the intent is that the value is
the name of R or S, not the entire relation.
The Reduce Function : For each key t, do the following.
1. If the associated value list is [R], then produce (t, t).
2. If the associated value list is anything else, which could only be [R, S], [S,
R], or [S], produce (t, NULL).
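The three set operations can be sketched together. The sketch assumes that R and S are sets, so that no tuple occurs twice within one relation, and it uses Python's `None` to play the role of NULL; the helper `run` and the other function names are hypothetical.

```python
from collections import defaultdict


def run(map_output, reduce_fn):
    """Group pairs by key, apply the Reduce function, and drop NULL results."""
    grouped = defaultdict(list)
    for key, value in map_output:
        grouped[key].append(value)
    results = (reduce_fn(key, values) for key, values in grouped.items())
    return {key for key, value in results if value is not None}


def union(R, S):
    pairs = [(t, t) for t in R] + [(t, t) for t in S]
    return run(pairs, lambda t, values: (t, t))  # one or two values: always output


def intersection(R, S):
    pairs = [(t, t) for t in R] + [(t, t) for t in S]
    return run(pairs, lambda t, values: (t, t) if len(values) == 2 else (t, None))


def difference(R, S):
    # The value is the NAME of the relation the tuple came from.
    pairs = [(t, "R") for t in R] + [(t, "S") for t in S]
    return run(pairs, lambda t, names: (t, t) if names == ["R"] else (t, None))


R = {(1, 2), (3, 4)}
S = {(3, 4), (5, 6)}
print(union(R, S), intersection(R, S), difference(R, S))
```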
2.6.7 Computing Natural Join by MapReduce
The idea behind implementing natural join via map-reduce can be seen if we look
at the specific case of joining R(A, B) with S(B, C). We must find tuples that
agree on their B components, that is, the second component of tuples of R and the
first component of tuples of S. We shall use the B-value of tuples from either
relation as the key. The value will be the other component and the name of the
relation, so the Reduce function can know where each tuple came from.
The Map Function: For each tuple (a, b) of R, produce the key -value pair (b,(R,
a)). For each tuple (b, c) of S, produce the key -value pair (b,(S, c)).
The Reduce Function: Each key value b will be associated with a list of pairs
that are either of the form (R, a) or (S, c). Construct all pairs consisting of
one with first component R and the other with first component S, say (R, a) and
(S, c). The output for key b is (b, [(a1, b, c1), (a2, b, c2), . . .]), that is, b
associated with the list of tuples that can be formed from an R-tuple and an
S-tuple with a common b value.
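The join of R(A, B) with S(B, C) can be sketched as below. The relation contents and function names are hypothetical, and the strings "R" and "S" stand for the relation names carried in the values.

```python
from collections import defaultdict


def map_join(R, S):
    """Map: key every tuple by its B-value and tag it with its relation name."""
    for a, b in R:
        yield (b, ("R", a))
    for b, c in S:
        yield (b, ("S", c))


def reduce_join(b, tagged_values):
    """Reduce: pair every R-value with every S-value that shares the key b."""
    from_r = [x for name, x in tagged_values if name == "R"]
    from_s = [x for name, x in tagged_values if name == "S"]
    return (b, [(a, b, c) for a in from_r for c in from_s])


def natural_join(R, S):
    grouped = defaultdict(list)
    for b, tagged in map_join(R, S):
        grouped[b].append(tagged)
    result = []
    for b, values in grouped.items():
        result.extend(reduce_join(b, values)[1])
    return sorted(result)


R = [(1, 2), (3, 2), (4, 5)]
S = [(2, 7), (5, 8), (6, 9)]
joined = natural_join(R, S)
print(joined)
```

The S-tuple (6, 9) finds no partner in R, so its key produces an empty list and contributes nothing to the result.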
2.7 EXTENSIONS TO MAPREDUCE
The MapReduce method of computation gave rise to many systems that extend and
modify it. These systems share the following characteristics with MapReduce
systems:
1. Both the extended systems and the MapReduce are built on a distributed file
system.
2. Both of them manage large numbers of tasks, which are nothing but the
instantiations of a small number of user -written functions.
3. Both provide fault tolerance: failures that occur during the execution of a
large job are handled without having to restart that job from the beginning.
In this section, we discuss “workflow” systems, which extend MapReduce. A
workflow system supports acyclic networks of functions, where each function is
implemented by a collection of tasks. Systems such as UC Berkeley’s Spark and
Google’s TensorFlow are built around a workflow system, and many recent machine
learning applications have workflow systems at heart.
2.7.1 Workflow Systems
Experimental systems such as Clustera from the University of Wisconsin and
Hyracks from the University of California at Irvine extend map-reduce from the
simple two-step workflow (the Map function feeds the Reduce function) to any
collection of functions, with an acyclic graph representing the workflow among
the functions. That is, there is an acyclic flow graph whose arcs a → b represent
the fact that function a’s output is an input to function b.
In a workflow system, the data passed from one function to the next is a file of
elements of a single type. If a function has a single input, then that function
is applied to each input element independently, just as the Map and Reduce
functions are applied to their input elements individually. The output of a
function is a file collected from the results of applying the function to each
input element. When a function has inputs from multiple files, elements from each
of the files can be combined in various ways, but the function itself is applied
to combinations of input elements, at most one from each input file.
Figure 2.7: An example of a workflow that is more complex than Map feeding
Reduce
Example 2.6: Figure 2.7 shows a workflow with five functions, f through j. Data
is passed from left to right in such a way that the flow of data is acyclic and
no task needs to provide output before receiving its entire input. For example,
the function h takes its input from a pre-existing file of the distributed file
system. Each output element of h is passed to the functions i and j. The function
i takes the outputs of both f and h as inputs. The output of function j is either
passed to the application that invoked this dataflow or stored in the distributed
file system.
Workflow functions are executed in a manner analogous to Map and Reduce
functions. Each function of a workflow can be executed by many tasks, each of
which is assigned a portion of the input to the function. A master controller
divides the work among the tasks that implement a function, usually by hashing
the input elements to decide on the proper task to receive an element. Like Map
tasks, each task implementing a function f has an output file of data destined
for each of the tasks that implement the successor function(s) of f. The
controller delivers these output files to the successor tasks after the task has
completed its work.
Like MapReduce tasks, workflow tasks have the blocking property: they deliver
output only after they complete. As a result, if a task fails, it has not
delivered output to any of its successors in the flow graph. The master
controller can therefore restart the failed task at another compute node, without
worrying that the output of the restarted task will duplicate output that was
previously passed to some other task.
Some applications of workflow systems are effectively cascades of MapReduce
jobs. For example, in the join of three relations, one MapReduce job joins the
first two relations, and a second MapReduce job joins the third relation with the
result of joining the first two relations.
The advantage of implementing such a cascade as a single workflow is that the
master controller manages the flow of data among tasks, and its replication,
without storing the temporary file that is the output of one MapReduce job in the
distributed file system. By locating tasks at compute nodes that have a copy of
their input, we can avoid much of the communication that would be necessary if we
stored the result of one MapReduce job and then initiated a second MapReduce job.
Hadoop and other MapReduce systems also try to locate Map tasks where a copy of
their input is already present.
Other popular extensions of MapReduce are Spark and Google’s TensorFlow, both of
which have a workflow system at heart.
Spark
Spark uses a workflow system and provides the following advanced features:
1. A more efficient way to cope with failures.
2. A more efficient way of grouping tasks among compute nodes and
scheduling execution of functions.
3. Integration of programming language features such as looping (which
technically takes it out of the acyclic workflow class of systems) and
function libraries.
Spark’s central data abstraction is the Resilient Distributed Dataset (RDD). An
RDD is a file of objects of one type. Examples of RDDs are the files of key-value
pairs used in MapReduce systems and the files that are passed among the functions
of a workflow system, as in Figure 2.7. RDDs are normally broken into chunks that
may be held at different compute nodes. They are “resilient” in the sense that
Spark is able to recover from the loss of any or all chunks of an RDD. Unlike the
key-value-pair abstraction of MapReduce, there is no restriction on the type of
the elements that comprise an RDD.
A Spark program consists of a sequence of steps, each of which typically applies
some function to an RDD to produce another RDD. These operations are called
transformations; commonly used transformations include Map, Flatmap, and Filter.
Spark can also take data from the surrounding file system, such as the Hadoop
Distributed File System, and turn it into an RDD, and it can take an RDD and
either return it to the surrounding file system or produce a result that is
passed back to the application that called the Spark program. These latter
operations are called actions. In Spark, the Reduce operation is an action, not a
transformation.
The Spark implementation differs from Hadoop and other MapReduce implementations
in that it uses lazy evaluation of RDDs and keeps a lineage for each RDD.
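The difference between transformations and actions, and the idea of lazy evaluation with lineage, can be illustrated with a toy class. This is emphatically not the Spark API: `ToyRDD` and its methods are hypothetical, the data lives in one process, and "lineage" is reduced here to a record of the transformations applied, which is enough to recompute the data from its source.

```python
import functools


class ToyRDD:
    """Toy stand-in for an RDD (hypothetical; NOT the Spark API)."""

    def __init__(self, source, lineage=()):
        self._source = source    # zero-argument callable that yields the elements
        self.lineage = lineage   # names of the transformations applied so far

    # Transformations: build a new ToyRDD, compute nothing yet.
    def map(self, f):
        return ToyRDD(lambda: (f(x) for x in self._source()), self.lineage + ("map",))

    def flat_map(self, f):
        return ToyRDD(lambda: (y for x in self._source() for y in f(x)),
                      self.lineage + ("flat_map",))

    def filter(self, keep):
        return ToyRDD(lambda: (x for x in self._source() if keep(x)),
                      self.lineage + ("filter",))

    # Action: forces evaluation and returns a result to the caller.
    def reduce(self, f):
        return functools.reduce(f, self._source())


calls = []

def traced_len(word):
    calls.append(word)  # lets us observe WHEN the function really runs
    return len(word)

lines = ToyRDD(lambda: iter(["big data", "map reduce job"]))
lengths = lines.flat_map(str.split).filter(lambda w: len(w) > 2).map(traced_len)
calls_before_action = list(calls)           # still empty: nothing has been computed
total = lengths.reduce(lambda a, b: a + b)  # the action triggers the whole pipeline
print(calls_before_action, total, lengths.lineage)
```

Because every `ToyRDD` can be recomputed from its source by replaying its transformations, lost data could be rebuilt rather than restored from a stored copy, which is the intuition behind recovery by lineage.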
TensorFlow
TensorFlow is an open-source system developed at Google to support
machine-learning applications. Like Spark, TensorFlow provides a programming
interface in which one writes a sequence of steps. Programs are typically
acyclic, although, as in Spark, it is possible to iterate blocks of code.
One major difference between Spark and TensorFlow is the type of data that is
passed between steps of the program. In place of the RDD, TensorFlow uses
tensors; a tensor is simply a multidimensional matrix.
2.7.2 Recursive Extensions to MapReduce
Many large-scale computations, such as PageRank (used in Google’s search
algorithm), are recursive. PageRank is the computation of the fixed point of a
matrix-vector multiplication, which can be computed under MapReduce systems
by an iterative matrix-vector multiplication algorithm. The iteration typically
continues for an unknown number of steps, each step being a MapReduce job,
until the results of two consecutive iterations are sufficiently close that we
believe convergence has occurred.
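The stopping rule described above can be sketched on a single machine as follows. This is only an illustration under stated assumptions: the function name, the tolerance, the step limit, and the 2 x 2 matrix are hypothetical, and in a real system each pass through the loop would be one MapReduce job rather than a list comprehension.

```python
def iterate_to_fixed_point(matrix, vector, tolerance=1e-9, max_steps=1000):
    """Repeat v <- M v until two consecutive results are sufficiently close."""
    for step in range(1, max_steps + 1):
        # One matrix-vector multiplication (one MapReduce job in a real system).
        new_vector = [sum(m * v for m, v in zip(row, vector)) for row in matrix]
        # Convergence test: largest change in any component.
        if max(abs(a - b) for a, b in zip(new_vector, vector)) < tolerance:
            return new_vector, step
        vector = new_vector
    return vector, max_steps


# Hypothetical matrix whose columns each sum to 1.
M = [[0.9, 0.5],
     [0.1, 0.5]]
fixed_point, steps = iterate_to_fixed_point(M, [1.0, 0.0])
```

For this matrix the iteration settles near (5/6, 1/6), the vector v that satisfies v = Mv with components summing to 1. The number of steps is not known in advance, which is why the loop tests for convergence instead of running a fixed number of jobs.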
Recursions present a problem for failure recovery. Recursive tasks inherently
lack the blocking property necessary for independent restart of failed tasks. It is
impossible for a collection of mutually recursive tasks, each of which has an
output that is input to at least some of the other tasks, to produce output only at
the end of the task. If they all followed that policy, no task would ever receive
any input, and nothing could be accomplished. As a result, some mechanism
other than simple restart of failed tasks must be implemented in a system that
handles recursive workflows (flow graphs that are not acyclic). We shall start by
studying an example of a recursion implemented as a workflow, and then discuss
approaches to dealing with task failures.
Example 2.7: Consider a directed graph whose arcs are represented by the
relation E(X, Y), meaning that there is an arc from node X to node Y. We wish
to compute the paths relation P(X, Y), meaning that there is a path of length 1
or more from node X to node Y. P is the transitive closure of E. A simple
recursive algorithm is:
1. Start with P(X, Y) = E(X, Y).
2. While changes to the relation P occur, add to P all tuples in
$\pi_{X,Y}(P(X, Z) \bowtie P(Z, Y))$.
This expression selects the pairs of nodes X and Y such that, for some node Z,
there is known to be a path from X to Z and a path from Z to Y.
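The two-step algorithm above can be sketched on a single machine as follows. This is a minimal illustration, not a distributed implementation; the function name and the sample graph are hypothetical.

```python
def transitive_closure(edges):
    """Compute P, the transitive closure of the arc relation E."""
    paths = set(edges)                      # step 1: P(X, Y) = E(X, Y)
    while True:
        # Step 2: join P(X, Z) with P(Z, Y) and project onto (X, Y).
        new_paths = {(x, y)
                     for (x, z1) in paths
                     for (z2, y) in paths
                     if z1 == z2}
        if new_paths <= paths:              # no changes to P: stop
            return paths
        paths |= new_paths


# Hypothetical graph: a chain 1 -> 2 -> 3 -> 4.
closure = transitive_closure({(1, 2), (2, 3), (3, 4)})
```

The loop stops as soon as a round of joining adds no new tuple, which corresponds to the condition "while changes to the relation P occur".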
Figure 2.8 shows the organization of recursive tasks for this computation. There
are two kinds of tasks: Join tasks and Dup-elim tasks. There are n Join tasks,
each corresponding to a bucket of a hash function h.
Once discovered, a path tuple P(a, b) becomes input to two Join tasks, those
numbered h(a) and h(b). The job of the ith Join task, when it receives input tuple
P(a, b), is to find certain other tuples seen previously (and stored locally by that
task):
1. Store P(a, b) locally.
2. If h(a) = i then look for tuples P(x, a) and produce output tuple P(x, b).
3. If h(b) = i then look for tuples P(b, y) and produce output tuple P(a, y).
In rare cases, we have h(a) = h(b), so both steps (2) and (3) are executed. But
generally, only one of these needs to be executed for a given tuple.
Figure 2.8 also shows m Dup-elim tasks, each corresponding to a bucket of a hash
function g that takes two arguments. If P(c, d) is an output of some Join task,
it is sent to Dup-elim task j = g(c, d). On receiving this tuple, the jth Dup-elim
task checks that it has not received this tuple before, since its job is duplicate
elimination. If previously received, the tuple is ignored. But if this tuple is new,
it is stored locally and sent to two Join tasks, those numbered h(c) and h(d).
Every Join task has m output files, one for each Dup-elim task. Every Dup-elim
task has n output files, one for each Join task. These files may be distributed
according to any of several strategies. Initially, the E(a, b) tuples representing the
arcs of the graph are distributed to the Dup-elim tasks, with E(a, b) being sent as
P(a, b) to Dup-elim task g(a, b). The master controller waits until each Join task
has processed its entire input for a round. Then, all output files are distributed to
the Dup-elim tasks, which create their own output. That output is distributed to
the Join tasks and becomes their input for the next round.
Figure 2.8: Implementation of transitive closure by a collection of recursive tasks
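The round-based exchange between Join tasks and Dup-elim tasks can be simulated in one process as a rough sketch. The function name, the default values of n and m, and the use of Python's built-in hash for h and g are assumptions made for illustration; a real system would run the tasks on separate compute nodes and exchange files.

```python
def closure_with_tasks(edges, n=3, m=2):
    """Simulate n Join tasks and m Dup-elim tasks computing transitive closure."""
    def h(node):                 # hash function selecting a Join task
        return hash(node) % n

    def g(a, b):                 # two-argument hash selecting a Dup-elim task
        return hash((a, b)) % m

    join_store = [set() for _ in range(n)]   # tuples stored locally by each Join task
    seen = [set() for _ in range(m)]         # tuples stored by each Dup-elim task
    pending = list(edges)                    # E(a, b) is first sent as P(a, b)

    while pending:
        # Dup-elim round: keep only tuples that have not been received before.
        fresh = []
        for (c, d) in pending:
            j = g(c, d)
            if (c, d) not in seen[j]:
                seen[j].add((c, d))
                fresh.append((c, d))
        # Join round: each new tuple goes to Join tasks h(a) and h(b).
        pending = []
        for (a, b) in fresh:
            for i in {h(a), h(b)}:
                join_store[i].add((a, b))                      # 1. store locally
                if h(a) == i:                                  # 2. P(x, a) gives P(x, b)
                    pending.extend((x, b) for (x, y) in join_store[i] if y == a)
                if h(b) == i:                                  # 3. P(b, y) gives P(a, y)
                    pending.extend((a, y) for (x, y) in join_store[i] if x == b)
    return set().union(*seen)


result = closure_with_tasks({(1, 2), (2, 3), (3, 4)})
```

The answer does not depend on the choice of n, m, h, or g; those only decide which task handles which tuple.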
In Example 2.7 it is not necessary to have two kinds of tasks. Instead, Join tasks
could eliminate duplicates as they are received, since they must store their
previously received inputs anyway. However, the arrangement with two kinds of
tasks has an advantage when we must recover from a task failure. If each task
stores all the output files it has ever created, and we place Join tasks on different
racks from the Dup-elim tasks, then we can deal with any single compute node
or single rack failure. That is, a Join task needing to be restarted can get all the
previously generated inputs that it needs from the Dup-elim tasks, and vice versa.
In the specific case of computing transitive closure, it is not necessary to prevent
a restarted task from generating outputs that the original task generated
previously. In the computation of the transitive closure, the rediscovery of a path
does not influence the eventual answer. However, many computations cannot
tolerate a situation where both the original and restarted versions of a task pass
the same output to another task. For example, if the final step of the computation
were an aggregation, say a count of the number of nodes reached by each node in
the graph, then we would get the wrong answer if we counted a path twice.
At least three different approaches have been used to deal with failures while
executing a recursive program:
1. Iterated MapReduce: Write the recursion as repeated execution of a
MapReduce job or of a sequence of MapReduce jobs. In this case, to handle
failures at any step, we can rely on the failure mechanism of the MapReduce
implementation. The very first example of such a system was HaLoop.
2. The Spark Approach: The Spark language includes iterative statements,
such as for-loops, that allow the implementation of recursions. In Spark, failure
management is implemented using the lazy-evaluation and lineage mechanisms.
In addition, the Spark programmer has options to store intermediate states
of the recursion.
3. Bulk-Synchronous Systems: These systems use a graph-based model of
computation. They typically use another resilience approach: periodic
checkpointing. One example of a bulk-synchronous system is Pregel.
2.7.3 Pregel
Another approach to implementing recursive algorithms on a computing
cluster is represented by Google’s Pregel system. This system is the first
example of a graph-based, bulk-synchronous system that processes massive
amounts of data. It views its data as a graph, where each node of the
graph corresponds roughly to a task. Each graph node generates output messages
that are destined for other nodes of the graph, and each graph node processes the
inputs it receives from other nodes.
Example 2.8: Suppose our data is a collection of weighted arcs of a graph, and
we want to find, for each node of the graph, the length of the shortest path to
each of the other nodes. As the algorithm executes, each node a will store a set of
pairs (b, w), where w is the length of the shortest path from node a to node b that
is currently known.
Initially, each graph node a stores the set of pairs (b, w) such that there is an arc
from a to b of weight w. These facts are sent to all other nodes as triples (a, b, w),
with the intended meaning that node a knows about a path of length w to node b.
When node a receives a triple (c, d, w), it must decide whether this fact
implies a shorter path than a already knows about from itself to node d. Node a
looks up its current distance to c; that is, it finds the pair (c, v) stored locally, if
there is one. It also finds the pair (d, u), if there is one. If w + v < u, then the pair
(d, u) is replaced by (d, w + v), and if there is no pair (d, u), then the pair
(d, w + v) is stored at node a. In either of these two cases, the other nodes are
sent the message (a, d, w + v).
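The update rule that node a applies to an incoming triple can be written out as a short sketch. The function name and the dictionary representation of the stored pairs are hypothetical; the sketch also assumes, as the text implies, that a triple is ignored when a knows no path to c.

```python
def process_triple(a, known, triple):
    """Apply the update rule at node a.

    known maps each node b to the length of the shortest path from a to b
    found so far, i.e. it holds the stored pairs (b, w).
    Returns the message (a, d, w + v) to send, or None if nothing changed.
    """
    c, d, w = triple
    if c not in known:               # no pair (c, v): a cannot reach d through c
        return None
    candidate = w + known[c]         # w + v
    if d not in known or candidate < known[d]:
        known[d] = candidate         # store (d, w + v), replacing (d, u) if present
        return (a, d, candidate)
    return None


known = {"c": 2, "d": 10}                                    # hypothetical state of node a
first = process_triple("a", known, ("c", "d", 3))            # 3 + 2 < 10: shorter path found
second = process_triple("a", known, ("c", "d", 4))           # 4 + 2 is not shorter than 5
third = process_triple("a", known, ("c", "e", 1))            # no pair for e yet: store it
```

Each non-None return value is a message that would be sent to the other nodes at the end of the current round.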
In Pregel, computations are organized into supersteps. In one superstep, all
the messages that were received by any of the nodes at the previous superstep
are processed, and then all the messages generated by those nodes are sent to
their destinations. This approach of packaging many messages into one is referred
to as “bulk-synchronous.”
The bulk-synchronous approach reduces the overhead of sending many
messages over the network, which is one of its most important advantages.
Suppose that in Example 2.8 we sent a single new shortest-distance fact to the
relevant node every time one was discovered. The number of messages sent
would be enormous if the graph were large, and it would not be realistic to
implement such an algorithm. However, in a bulk-synchronous system, a task
that has the responsibility for managing many nodes of the graph can bundle
together all the messages being sent by its nodes to any of the nodes being
managed by another task. That choice typically saves orders of magnitude in the
time required to send all the needed messages.
Failure Management in Pregel
In case of a compute-node failure, there is no attempt to restart the failed tasks at
that compute node. Rather, Pregel checkpoints its entire computation after some
of the supersteps. A checkpoint consists of making a copy of the entire state of
each task, so it can be restarted from that point if necessary. If any compute node
fails, the entire job is restarted from the most recent checkpoint.
Although this recovery strategy causes many tasks that have not failed to redo
their work, it is satisfactory in many situations. Recall that the reason
MapReduce systems support restart of only the failed tasks is that we want
assurance that the expected time to complete the entire job in the face of failures
is not too much greater than the time to run the job with no failures. Any
failure-management system will have that property as long as the time to recover
from a failure is much less than the average time between failures. Thus, it is only
necessary that Pregel checkpoints its computation after a number of supersteps
such that the probability of a failure during that number of supersteps is low.
2.8 COMMON MAPREDUCE ALGORITHMS
MapReduce can implement a number of mathematical algorithms. Such
algorithms divide a task into a number of chunks and assign them to distributed
nodes. These distributed nodes act as Map nodes and Reduce nodes, and execute
the map and reduce tasks respectively. Some of the common algorithms are:
1. Sorting
2. Searching
3. Indexing
4. TF-IDF
2.8.1 Sorting
Sorting is one of the basic MapReduce algorithms, used to process and analyse
data. MapReduce implements the sorting algorithm to automatically sort the
output key-value pairs from the mapper by their keys. Sorting methods are
implemented in the mapper class. After the values are tokenized, during the
Shuffle and Sort phase, the mapper class collects the values with matching keys
as a collection. To collect similar intermediate key-value pairs, the Mapper class
uses a comparator class to sort the key-value pairs. The set of intermediate
key-value pairs for a given Reducer is automatically sorted by Hadoop to form
key-values (K2, {V2, V2, …}) before they are presented to the Reducer.
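The effect of the Shuffle and Sort phase can be imitated in a few lines of Python. This is only a sketch of the outcome, namely sorted keys with their values grouped into (K2, {V2, V2, …}) form; it does not use Hadoop, and the function name and sample pairs are hypothetical.

```python
from itertools import groupby
from operator import itemgetter


def shuffle_and_sort(mapper_output):
    """Sort intermediate (key, value) pairs by key and group values per key."""
    ordered = sorted(mapper_output, key=itemgetter(0))      # sort by key only
    return [(key, [value for _, value in group])
            for key, group in groupby(ordered, key=itemgetter(0))]


# Hypothetical mapper output.
grouped = shuffle_and_sort([("b", 1), ("a", 2), ("b", 3)])
```

Each element of the result is what a Reducer would receive: one key together with the list of all values emitted for it.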
2.8.2 Searching
Searching plays an important role in MapReduce algorithms. It helps in the
combiner phase and in the Reducer phase. The following example demonstrates
the working of the searching algorithm.
Example 2.9: The example shows how MapReduce employs a searching
algorithm to find the details of the employee who draws the highest salary in
a given employee dataset.
Let us assume we have employee data in four different files A, B, C, and D. Let
us also assume there are duplicate employee records in all four files because the
employee data was imported from all database tables repeatedly.
Figure 2.9: Data of files A, B, C, and D
The Map phase processes each input file and provides the employee data in
key-value pairs (employee name: salary) as shown in Figure 2.10.
Figure 2.10: Output of Map Process
The combiner phase (searching technique) accepts the input from the Map
phase as key-value pairs of employee name and salary. Using the searching
technique, the combiner checks all the employee salaries to find the
highest-salaried employee in each file. The expected result is shown in Figure 2.11.
Figure 2.11: Output of Combiner
Reducer phase - From each file, the highest-salaried employee has been found. To
avoid redundancy, check all the key-value pairs and eliminate duplicate entries, if
any. The same algorithm is then applied to the four key-value pairs coming from
the four input files. The final output is the employee record with the highest
salary overall.
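The three phases of Example 2.9 can be sketched in plain Python. The employee names and salaries below are hypothetical stand-ins, since the contents of Figures 2.9 to 2.11 are not reproduced here, and the function names are chosen only for illustration.

```python
def map_phase(records):
    """Emit one (employee name, salary) pair per input record."""
    return [(name, salary) for name, salary in records]


def combiner(pairs):
    """Search one file's pairs for the highest-salaried employee."""
    return max(pairs, key=lambda pair: pair[1])


def reducer(candidates):
    """Drop duplicate pairs, then search for the overall highest salary."""
    return max(set(candidates), key=lambda pair: pair[1])


# Hypothetical contents of files A, B, C, and D (with duplicate records).
files = {
    "A": [("asha", 26000), ("binod", 45000)],
    "B": [("binod", 45000), ("chen", 30000)],
    "C": [("chen", 30000), ("divya", 28000)],
    "D": [("asha", 26000), ("divya", 28000)],
}
per_file_best = [combiner(map_phase(records)) for records in files.values()]
highest = reducer(per_file_best)
```

With this data the combiners report binod for files A and B, chen for C, and divya for D; the reducer removes the duplicate entry and returns the overall maximum.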
2.8.3 Indexing
Indexing is normally used to point to a particular piece of data and its address.
MapReduce performs batch indexing on the input files for a particular Mapper.
The indexing technique that is normally used in MapReduce is known
as an inverted index. Search engines like Google and Bing use inverted indexing
techniques. Let us try to understand how indexing works with the help of a
simple example.
Example 2.10: The following text is the input for inverted indexing. Here T[0],
T[1], and T[2] are the file names, and their contents are in double quotes.
T[0] = "it is what it is"
T[1] = "what is it"
T[2] = "it is a banana"
After applying the Indexing algorithm, we get the following output -
"a": {2}
"banana": {2}
"is": {0, 1, 2}
"it": {0, 1, 2}
"what": {0, 1}
Here "a": {2} implies the term "a" appears in the T[2] file. Similarly, "is": {0, 1,
2} implies the term "is" appears in the files T[0], T[1], and T[2].
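Example 2.10 can be reproduced with a short single-machine sketch. The function name is hypothetical, and splitting on whitespace is an assumed tokenization; the inner loop plays the role of the Map step (emit a term with its file number) and the dictionary of sets plays the role of the Reduce step.

```python
from collections import defaultdict


def inverted_index(documents):
    """Map each term to the sorted list of file numbers in which it appears."""
    index = defaultdict(set)
    for doc_id, text in enumerate(documents):
        for term in text.split():        # "Map": emit (term, doc_id)
            index[term].add(doc_id)      # "Reduce": collect file numbers per term
    return {term: sorted(ids) for term, ids in sorted(index.items())}


T = ["it is what it is", "what is it", "it is a banana"]
index = inverted_index(T)
```

For the three files of the example, index["is"] is [0, 1, 2] and index["banana"] is [2], matching the output listed above.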
2.8.4 TF-IDF
TF-IDF, short for Term Frequency - Inverse Document Frequency, is a text
processing algorithm. It is one of the common web analysis algorithms.
Here, the term 'frequency' refers to the number of times a term appears in a
document.
Term Frequency (TF)
It measures how frequently a particular term occurs in a document. It is
calculated as the number of times a term appears in a document divided by the
total number of terms in that document.
TF(the) = (Number of times the term ‘the’ appears in a document) / (Total
number of terms in the document)
Inverse Document Frequency (IDF)
It measures the importance of a term. It is calculated as the logarithm of the
number of documents in the text database divided by the number of documents
in which the specific term appears.
While computing TF, all the terms are considered equally important. That means
TF counts the term frequency even for common words like “is”, “a”, “what”, etc.
Thus, we need to weigh down the frequent terms while scaling up the rare ones,
by computing the following -
IDF(the) = log_e(Total number of documents / Number of documents with term
‘the’ in it).
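The two formulas can be checked with a small sketch that reuses the three files of Example 2.10 as the document collection. The function names are hypothetical, whitespace tokenization is assumed, and idf assumes the term occurs in at least one document (otherwise the division fails).

```python
import math


def tf(term, document):
    """Number of times term appears in document / total number of terms."""
    words = document.split()
    return words.count(term) / len(words)


def idf(term, documents):
    """log_e(total number of documents / number of documents containing term)."""
    containing = sum(1 for doc in documents if term in doc.split())
    return math.log(len(documents) / containing)


def tf_idf(term, document, documents):
    return tf(term, document) * idf(term, documents)


docs = ["it is what it is", "what is it", "it is a banana"]
tf_it = tf("it", docs[0])                        # 2 occurrences out of 5 terms
idf_is = idf("is", docs)                         # appears in all 3 documents
score_banana = tf_idf("banana", docs[2], docs)   # rare term, appears in 1 document
```

The common word "is" receives an IDF of log_e(3/3) = 0, so its TF-IDF score is 0 in every document, while the rare word "banana" scores (1/4) × log_e(3).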
2.9 SUMMARY
● A common architecture, the cluster of compute nodes, is used to process very
large-scale applications.
● The Distributed File System architecture is used to store and process
large data files on distributed nodes.
● The MapReduce framework processes data in parallel on the DFS with the
help of cluster nodes such as the Master node, Map nodes, and Reduce nodes.
● The Map and Reduce functions are problem-specific and need to be designed
by the user.
● The Map and Reduce functions generate output in key-value pair
format. The Map function stores its output in intermediate files, whereas
the Reduce function writes the final output file.
● Apache Hadoop provides an open-source implementation of a Distributed File
System, referred to as HDFS.
● The MapReduce framework is fault tolerant and handles failures of the
Master, Map, and Reduce nodes.
● MapReduce is not suitable for all parallel algorithms. Simple
computations such as matrix-vector multiplication, matrix-matrix
multiplication, and the principal operators of relational algebra can be
implemented in MapReduce.
● MapReduce is generalized by workflow systems, which support any acyclic
collection of functions. Each of these functions can be instantiated by any
number of tasks that are responsible for executing that function on a portion
of the data.
● In recursive workflows, it is not always possible to simply restart a failed
task. Instead, schemes that checkpoint parts of the computation have been
proposed, allowing either the restart of a single task or the restart of all
tasks from a recent checkpoint.
● MapReduce algorithms can be implemented in programming languages such as
Java and Python. MapReduce algorithms are generally written for
large-scale data.
2.11 UNIT END EXERCISE
1. What is a Distributed File System? How is DFS extended in the Hadoop
Distributed File System?
2. What is a Distributed File System? How does the system store large files
on DFS?
3. What is Apache Hadoop? What are the characteristics of a Distributed File
System?
4. What is the Hadoop ecosystem? Discuss the various elements of the
Hadoop ecosystem.
5. What is MapReduce? What are the advantages of MapReduce?
6. Explain the steps of execution of MapReduce.
7. Describe the Map task and Reduce task with an example for each.
8. What is the role of the mapper function and the combiner function in MapReduce?
9. What is the role of a Master node? How does the Master node handle the
failure of a task or a node?
10. Explain the steps of execution for the word count algorithm with an example.
11. Explain the matrix-vector multiplication algorithm with an example.
12. How does the MapReduce algorithm handle a vector of large size?
13. What are relational algebra operations? Explain each operation in brief.
14. How does MapReduce compute the selection and projection operations?
Explain the role of the Map and Reduce tasks and give an example for
each.
15. Explain how MapReduce computes the union, intersection, and natural join
operations.
16. What are the characteristics of the MapReduce system? How is the
MapReduce framework extended to the workflow system?
17. Explain the function of the workflow system with an example.
18. What is the purpose of the workflow system? Discuss any two examples of
workflow systems.
19. What do you mean by a recursive extension of MapReduce? Describe the
process of computing the transitive closure with a number of recursive tasks.
20. Discuss the various approaches to handling the failure of recursive
MapReduce tasks.
21. Describe the bulk-synchronous system Pregel with an example.
22. Discuss any three common MapReduce algorithms.
23. Write a program to implement the matrix-multiplication algorithm using
any one programming language.
❖❖❖❖
3
SHINGLING OF DOCUMENTS
Unit Structure
3.0 Objectives
3.1 Introduction
3.2 Finding Similar Items
3.3 Applications of Near-Neighbor Search
3.4 Shingling of Documents
3.5 Similarity-Preserving Summaries of Sets
3.6 Locality-Sensitive Hashing for Documents
3.7 Distance Measures
3.8 The Theory of Locality-Sensitive Functions
3.9 LSH Families for Other Distance Measures
3.10 Applications of Locality-Sensitive Hashing
3.11 Methods for High Degrees of Similarity
3.0 OBJECTIVES
We will study how to define the distance between sets. To illustrate and
motivate this study, we will focus on using Jaccard distance to measure
the distance between documents. This uses the common “bag of words”
model, which is simplistic, but is sufficient for many applications. We
start with some big questions. This lecture will only begin to answer them.
• Given two homework assignments (reports) how can a computer detect if
one is likely to have been plagiarized from the other without
understanding the content?
• In trying to index webpages, how does Google avoid listing duplicates or
mirrors?
• How does a computer quickly understand emails, for either detecting
spam or placing effective advertisements? (If an ad worked on one email, how
can we determine which others are similar?)
The key to answering these questions will be to convert the data
(homework, webpages, emails) into an object in an abstract space in which we
know how to measure distance, and how to do so efficiently.
3.1 INTRODUCTION
A common task when mining large datasets is finding similar items. For
example, finding similar documents makes it possible to recommend related
ones. Many methods exist for this task; the shingling method and
length-based filtering are two of them.
In the shingling method, substrings (called symbols) are selected from each
document and placed in a set. To find similar documents, the similarities of
the sets associated with them are calculated. In length-based filtering, only
documents of nearly equal length are compared. These methods do not consider
the repetition of symbols. Taking repetition into account allows the length of
documents to be calculated more accurately.
The method suggested here finds similar documents while taking the repetition
of symbols into account. It separates documents into a better form. The main
goal of the method is to find similar documents with fewer comparisons and in
less time.
3.2 FINDING SIMILAR ITEMS
A fundamental data-mining problem is to examine data for “similar”
items. We shall take up applications in Section 3.3, but an example
would be looking at a collection of Web pages and finding
near-duplicate pages. These pages could be plagiarisms, for example, or they
could be mirrors that have almost the same content but differ in
information about the host and about other mirrors.
We begin by phrasing the problem of similarity as one of finding
sets with a relatively large intersection. We show how the problem of
finding textually similar documents can be turned into such a set problem
by the technique known as “shingling.” Then, we introduce a technique
called “minhashing,” which compresses large sets in such a way that
we can still deduce the similarity of the underlying sets from their
compressed versions. Other techniques that work when the required
degree of similarity is very high are covered in Section 3.11.
Another important problem that arises when we search for similar
items of any kind is that there may be far too many pairs of items to
test each pair for their degree of similarity, even if computing the
similarity of any one pair can be made very easy. That concern
motivates a technique called “locality-sensitive hashing,” for focusing
our search on pairs that are most likely to be similar.
Finally, we explore notions of “similarity” that are not expressible as
intersection of sets. This study leads us to consider the theory of
distance measures in arbitrary spaces. It also motivates a general
framework for locality-sensitive hashing that applies for other
definitions of “similarity.”
3.3 APPLICATIONS OF NEAR-NEIGHBOR SEARCH
We shall focus initially on a particular notion of “similarity”: the
similarity of sets by looking at the relative size of their intersection.
This notion of similarity is called “Jaccard similarity,” and will be
introduced in Section 3.3.1. We then examine some of the uses of
finding similar sets. These include finding textually similar documents
and collaborative filtering by finding similar customers and similar
products. In order to turn the problem of textual similarity of
documents into one of set intersection, we use a technique called
“shingling,” which is introduced in Section 3.4.
3.3.1 Jaccard Similarity of Sets
The Jaccard similarity of sets S and T is |S ∩ T| / |S ∪ T|, that is, the
ratio of the size of the intersection of S and T to the size of their
union. We shall denote the Jaccard similarity of S and T by SIM(S, T).
Example 3.1: In Fig. 3.1 we see two sets S and T. There are three
elements in their intersection and a total of eight elements that appear
in S or T or both. Thus, SIM(S, T) = 3/8. ✷
Figure 3.1: Two sets with Jaccard similarity 3/8
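The definition of SIM(S, T) translates directly into code. In the sketch below the function name and the element values are hypothetical; the two sets were chosen only so that, as in Example 3.1, three elements are shared and eight appear in total. Returning 1.0 for two empty sets is an assumed convention, since the ratio is undefined in that case.

```python
def jaccard_similarity(s, t):
    """SIM(S, T) = |S ∩ T| / |S ∪ T| for two collections treated as sets."""
    s, t = set(s), set(t)
    if not s and not t:
        return 1.0                       # assumed convention for two empty sets
    return len(s & t) / len(s | t)


# Hypothetical sets with 3 common elements and 8 elements overall.
S = {1, 2, 3, 4, 5}
T = {3, 4, 5, 6, 7, 8}
similarity = jaccard_similarity(S, T)    # 3 / 8
```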
3.3.2 Similarity of Documents
An important class of problems that Jaccard similarity addresses
well is that of finding textually similar documents in a large corpus
such as the Web or a collection of news articles. We should understand
that the aspect of similarity we are looking at here is character-level
similarity, not “similar meaning,” which requires us to examine the
words in the documents and their uses. That problem is also interesting
but is addressed by other techniques, which we hinted at in Section
1.3.1. However, textual similarity also has important uses. Many of
these involve finding duplicates or near duplicates. First, let us observe
that testing whether two documents are exact duplicates is easy; just
compare the two documents character-by-character, and if they ever
differ then they are not the same. However, in many applications, the
documents are not identical, yet they share large portions of their text.
Here are some examples:
Plagiarism
Finding plagiarized documents tests our ability to find textual
similarity. The plagiarizer may extract only some parts of a document
for his own. He may alter a few words and may alter the order in which
sentences of the original appear. Yet the resulting document may still
contain 50% or more of the original. No simple process of comparing
documents character by character will detect a sophisticated
plagiarism.
Mirror Pages
It is common for important or popular Web sites to be duplicated at a
number of hosts, in order to share the load. The pages of these mirror
sites will be quite similar, but are rarely identical. For instance, they
might each contain information associated with their particular host,
and they might each have links to the other mirror sites but not to
themselves. A related phenomenon is the appropriation of pages from
one class to another. These pages might include class notes,
assignments, and lecture slides. Similar pages might change the name
of the course and the year, and make small changes from year to year. It is
important to be able to detect similar pages of these kinds, because
search engines produce better results if they avoid showing two pages
that are nearly identical within the first page of results.
Articles from the Same Source
It is common for one reporter to write a news article that gets
distributed, say through the Associated Press, to many newspapers,
which then publish the article on their Web sites. Each newspaper
changes the article somewhat. They may cut out paragraphs, or even
add material of their own. They most likely will surround the article by
their own logo, ads, and links to other articles at their site. However,
the core of each newspaper’s page will be the original article. News
aggregators, such as Google News, try to find all versions of such an
article, in order to show only one, and that task requires finding when
two Web pages are textually similar, although not identical.[1]
3.3.3 Collaborative Filtering as a Similar -Sets Problem
Another class of applications where similarity of sets is very important
is called collaborative filtering, a process whereby we recommend to
users items that were liked by other users who have exhibited similar
tastes. We shall investigate collaborative filtering in detail in Section
9.3, but for the moment let us see some common examples.
[1] News aggregation also involves finding articles that are about the
same topic, even though not textually similar. This problem too can
yield to a similarity search, but it requires techniques other than
Jaccard similarity of sets.
On-Line Purchases
Amazon.com has millions of customers and sells millions of items. Its
database records which items have been bought by which customers.
We can say two customers are similar if their sets of purchased items
have a high Jaccard similarity. Likewise, two items that have sets of
purchasers with high Jaccard similarity will be deemed similar. Note
that, while we might expect mirror sites to have Jaccard similarity
above 90%, it is unlikely that any two customers have Jaccard
similarity that high (unless they have purchased only one item). Even a
Jaccard similarity like 20% might be unusual enough to identify
customers with similar tastes. The same observation holds for items;
Jaccard similarities need not be very high to be significant.
Collaborative filtering requires several tools, in addition to finding
similar customers or items, as we discuss in Chapter 9. For example,
two Amazon customers who like science fiction might each buy many
science-fiction books, but only a few of these will be in common.
However, by combining similarity-finding with clustering (Chapter 7),
we might be able to discover that science-fiction books are mutually
similar and put them in one group. Then, we can get a more powerful
notion of customer similarity by asking whether they made purchases
within many of the same groups.
Movie Ratings
Netflix records which movies each of its customers rented, and also the
ratings assigned to those movies by the customers. We can see movies
as similar if they were rented or rated highly by many of the same
customers, and see customers as similar if they rented or rated highly
many of the same movies. The same observations that we made for
Amazon above apply in this situation: similarities need not be high to
be significant, and clustering movies by genre will make things easier.
When our data consists of ratings rather than binary decisions
(bought/did not buy or liked/disliked), we cannot rely simply on sets as
representations of customers or items. Some options are:
1. Ignore low-rated customer/movie pairs; that is, treat these
events as if the customer never watched the movie.
2. When comparing customers, imagine two set elements for
each movie, “liked” and “hated.” If a customer rated a movie highly,
put the “liked” for that movie in the customer’s set. If they gave a low
rating to a movie, put “hated” for that movie in their set. Then, we can
look for high Jaccard similarity among these sets. We can do a similar
trick when comparing movies.
3. If ratings are 1-to-5 stars, put a movie in a customer’s set n
times if they rated the movie n stars. Then, use Jaccard similarity for
bags when measuring the similarity of customers. The Jaccard
similarity for bags B and C is defined by counting an element n times in the
intersection if n is the minimum of the number of times the element appears
in B and C. In the union, we count the element the sum of the number of
times it appears in B and in C.
Example 3.2: The bag-similarity of bags {a, a, a, b} and {a, a, b, b, c}
is 1/3. The intersection counts a twice and b once, so its size is 3. The
size of the union of two bags is always the sum of the sizes of the two
bags, or 9 in this case. Since the highest possible Jaccard similarity for
bags is 1/2, the score of 1/3 indicates the two bags are quite similar, as
should be apparent from an examination of their contents.
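The bag definition above can be checked with a short Python sketch. The function name bag_jaccard is hypothetical, and the standard-library Counter class is used here only as a convenient stand-in for a bag.

```python
from collections import Counter
from fractions import Fraction


def bag_jaccard(b, c):
    """Jaccard similarity for bags, following the definition in the text."""
    bag_b, bag_c = Counter(b), Counter(c)
    # Intersection: each element counted min(count in B, count in C) times.
    intersection = sum((bag_b & bag_c).values())
    # Union: each element counted (count in B) + (count in C) times.
    union = sum(bag_b.values()) + sum(bag_c.values())
    if union == 0:
        raise ValueError("both bags are empty")
    return Fraction(intersection, union)


print(bag_jaccard("aaab", "aabbc"))  # 1/3, as in Example 3.2
```

Two identical bags score 1/2 under this definition, which is why 1/2 is the highest possible value.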
3.3.4 Exercises for Section 3.1
Exercise 3.1.1: Compute the Jaccard similarities of each pair of the
following three sets: {1, 2, 3, 4}, {2, 3, 5, 7}, and {2, 4, 6}.
Exercise 3.1.2: Compute the Jaccard bag similarity of each pair of the
following three bags: {1, 1, 1, 2}, {1, 1, 2, 2, 3}, and {1, 2, 3, 4}.
!! Exercise 3.1.3: Suppose we have a universal set U of n elements, and
we choose two subsets S and T at random, each with m of the n
elements. What is the expected value of the Jaccard similarity of S
and T?
3.4 SHINGLING OF DOCUMENTS
The most effective way to represent documents as sets, for the purpose
of identifying lexically similar documents, is to construct from the
document the set of short strings that appear within it. If we do so, then
documents that share pieces as short as sentences or even phrases will
have many common elements in their sets, even if those sentences
appear in different orders in the two documents. In this section, we
introduce the simplest and most common approach, shingling, as well
as an interesting variation.
3.4.1 k-Shingles
A document is a string of characters. Define a k-shingle for a
document to be any substring of length k found within the document.
Then, we may associate with each document the set of k-shingles that
appear one or more times within that document.
Example 3.3: Suppose our document D is the string abcdabd, and
we pick k = 2. Then the set of 2-shingles for D is {ab, bc, cd, da, bd}.
Note that the substring ab appears twice within D, but appears only
once as a shingle. A variation of shingling produces a bag, rather than a
set, so each shingle would appear in the result as many times as it
appears in the document. However, we shall not use bags of shingles
here.
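As a minimal sketch of the definition, the following Python function (the name k_shingles is hypothetical) collects every length-k substring of a document into a set, so that repeated substrings appear only once.

```python
def k_shingles(document, k):
    """Return the set of all length-k substrings of `document`."""
    if k <= 0:
        raise ValueError("k must be positive")
    return {document[i:i + k] for i in range(len(document) - k + 1)}


print(sorted(k_shingles("abcdabd", 2)))  # ['ab', 'bc', 'bd', 'cd', 'da']
```

A document shorter than k characters has no k-shingles, so the function returns the empty set in that case.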
There are several options regarding how white space (blank, tab,
newline, etc.) is treated. It probably makes sense to replace any
sequence of one or more white-space characters by a single blank. That
way, we distinguish shingles that cover two or more words from those
that do not.
Example 3.4: If we use k = 9, but eliminate whitespace altogether, then
we would see some lexical similarity in the sentences “The plane was
ready for touch down” and “The quarterback scored a touchdown”.
However, if we retain the blanks, then the first has shingles touch dow
and ouch down, while the second has touchdown. If we eliminated the
blanks, then both would have touchdown.
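The effect of the white-space choice can be illustrated with a small Python sketch. The keep_blanks flag and the function name are assumptions made for this illustration; the normalization uses the standard re module.

```python
import re


def k_shingles(text, k, keep_blanks=True):
    """k-shingles after collapsing (or deleting) runs of white space."""
    replacement = " " if keep_blanks else ""
    normalized = re.sub(r"\s+", replacement, text.strip())
    return {normalized[i:i + k] for i in range(len(normalized) - k + 1)}


first = "The plane was ready for touch down"
second = "The quarterback scored a touchdown"

with_blanks = k_shingles(first, 9) & k_shingles(second, 9)
without_blanks = k_shingles(first, 9, False) & k_shingles(second, 9, False)
print(with_blanks)     # set(): no shared 9-shingles when blanks are kept
print(without_blanks)  # {'touchdown'}
```

Keeping a single blank between words is what lets the two sentences of Example 3.4 be told apart.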
3.4.2 Choosing the Shingle Size
We can pick k to be any constant we like. However, if we pick k too
small, then we would expect most sequences of k characters to appear
in most documents. If so, then we could have documents whose
shingle-sets had high Jaccard similarity, yet the documents had none
of the same sentences or even phrases. As an extreme example, if we
use k = 1, most Web pages will have most of the common characters
and few other characters, so almost all Web pages will have high
similarity.
How large k should be depends on how long typical documents are and
how large the set of typical characters is. The important thing to
remember is: k should be picked large enough that the probability of
any given shingle appearing in any given document is low.
Thus, if our corpus of documents is emails, picking k = 5 should
be fine. To see why, suppose that only letters and a general
white-space character appear in emails (although in practice, most of the
printable ASCII characters can be expected to appear
occasionally). If so, then there would be $27^5 = 14{,}348{,}907$ possible
shingles. Since the typical email is much smaller than 14 million
characters long, we would expect k = 5 to work well, and indeed it
does. However, the calculation is a bit more subtle. Surely, more
than 27 characters appear in emails. However, all characters do not
appear with equal probability. Common letters and blanks
dominate, while “z” and other letters that have high point-value in
Scrabble are rare. Thus, even short emails will have many 5-shingles
consisting of common letters, and the chances of unrelated emails
sharing these common shingles are greater than would be implied by
the calculation in the paragraph above. A good rule of thumb is to
imagine that there are only 20 characters and estimate the number of
k-shingles as $20^k$. For large documents, such as research articles, the
choice k = 9 is considered safe.
3.4.3 Hashing Shingles
Instead of using substrings directly as shingles, we can pick a hash
function that maps strings of length k to some number of buckets and
treat the resulting bucket number as the shingle. The set representing a
document is then the set of integers that are bucket numbers of one or
more k-shingles that appear in the document. For instance, we could
construct the set of 9-shingles for a document and then map each of
those 9-shingles to a bucket number in the range 0 to $2^{32} - 1$. Thus, each
shingle is represented by four bytes instead of nine. Not only has the
data been compacted, but we can now manipulate (hashed) shingles by
single-word machine operations.
Notice that we can differentiate documents better if we use 9-shingles
and hash them down to four bytes than if we use 4-shingles, even though
the space used to represent a shingle is the same. The reason was touched
upon in Section 3.4.2. If we use 4-shingles, most sequences of four
bytes are unlikely or impossible to find in typical documents. Thus, the
effective number of different shingles is much less than $2^{32} - 1$. If, as in
Section 3.4.2, we assume only 20 characters are frequent in English
text, then the number of different 4-shingles that are likely to occur is
only $20^4 = 160{,}000$. However, if we use 9-shingles, there are many
more than $2^{32}$ likely shingles. When we hash them down to four bytes,
we can expect almost any sequence of four bytes to be possible, as was
discussed in Section 1.3.2.
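A minimal sketch of hashed shingles follows. The text does not name a particular hash function; CRC-32 from Python's standard zlib module is used here purely as an assumed stand-in because it yields a four-byte value, and the function name hashed_shingles is hypothetical.

```python
import zlib


def hashed_shingles(document, k=9):
    """Set of 32-bit bucket numbers, one per distinct k-shingle."""
    buckets = set()
    for i in range(len(document) - k + 1):
        shingle = document[i:i + k].encode("utf-8")
        # zlib.crc32 returns an unsigned 32-bit integer in Python 3,
        # so every bucket number lies in the range 0 .. 2**32 - 1.
        buckets.add(zlib.crc32(shingle))
    return buckets


print(len(hashed_shingles("The quarterback scored a touchdown")))
```

Two different shingles can occasionally land in the same bucket, so the hashed set may be slightly smaller than the set of shingles it came from.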
3.4.4 Shingles Built from Words
An alternative form of shingle has proved effective for the problem of
identifying similar news articles, mentioned in Section 3.1.2. The
exploitable distinction for this problem is that the news articles are
written in a rather different style than are other elements that typically
appear on the page with the article. News articles, and most prose,
have a lot of stop words (see Section 1.3.1), the most common words
such as “and,” “you,” “to,” and so on. In many applications, we
want to ignore stop words, since they don’t tell us anything useful
about the article, such as its topic.
However, for the problem of finding similar news articles, it was found
that defining a shingle to be a stop word followed by the next two
words, regardless of whether or not they were stop words, formed a
useful set of shingles. The advantage of this approach is that the news
article would then contribute more shingles to the set representing the
Web page than would the surrounding elements. Recall that the goal
of the exercise is to find pages that had the same articles, regardless of
the surrounding elements. By biasing the set of shingles in favor of
the article, pages with the same article and different surrounding
material have higher Jaccard similarity than pages with the same
surrounding material but with a different article.
Example 3.5: An ad might have the simple text “Buy Sudzo.”
However, a news article with the same idea might read something
like “A spokesperson for the Sudzo Corporation revealed today
that studies have shown it is good for people to buy Sudzo
products.” Here, the likely stop words are A, for, the, that, have, it,
is, for, and to, although there is no set number of the most frequent
words that should be considered stop words. The first three shingles
made from a stop word and the next two words following are:
A spokesperson for
for the Sudzo
the Sudzo Corporation
There are nine shingles from the sentence, but none from the “ad.”
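The stop-word shingles of Example 3.5 can be reproduced with the sketch below. The stop-word list is an assumption chosen to match the example (real lists are much longer), and the tokenization by a simple regular expression is likewise only illustrative.

```python
import re

# Assumption: a tiny illustrative stop-word list; real lists are much longer.
STOP_WORDS = {"a", "for", "the", "that", "have", "it", "is", "to"}


def stop_word_shingles(text, stop_words=STOP_WORDS):
    """Shingles made of a stop word followed by the next two words."""
    words = re.findall(r"[A-Za-z]+", text)
    shingles = []
    # Stop at len(words) - 2 so every shingle has two following words.
    for i in range(len(words) - 2):
        if words[i].lower() in stop_words:
            shingles.append(" ".join(words[i:i + 3]))
    return shingles


article = ("A spokesperson for the Sudzo Corporation revealed today that "
           "studies have shown it is good for people to buy Sudzo products.")
print(stop_word_shingles(article)[:3])
print(stop_word_shingles("Buy Sudzo."))  # []
```

With this list the article yields nine shingles and the ad yields none, which is the bias toward article text that the approach relies on.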
3.4.5 Exercises for Section 3.2
Exercise 3.2.1: What are the first ten 3-shingles in the first sentence of
Section 3.4?
Exercise 3.2.2: If we use the stop-word-based shingles of Section
3.4.4, and we take the stop words to be all the words of three or fewer
letters, then what are the shingles in the first sentence of Section 3.4?
Exercise 3.2.3: What is the largest number of k-shingles a document
of n bytes can have? You may assume that the size of the alphabet is
large enough that the number of possible strings of length k is at least
n.
3.5 SIMILARITY-PRESERVING SUMMARIES OF SETS
Sets of shingles are large. Even if we hash them to four bytes each, the
space needed to store a set is still roughly four times the space taken by
the document. If we have millions of documents, it may well not be
possible to store all the shingle-sets in main memory. (Even when the
sets do fit, the number of pairs may be too large to evaluate the
similarity of each pair. We take up the solution to this problem in
Section 3.6.)
Our goal in this section is to replace large sets by much smaller
representations called “signatures.” The important property we need
for signatures is that we can compare the signatures of two sets and
estimate the Jaccard similarity of the underlying sets from the
signatures alone. It is not possible that the
signatures give the exact similarity of the sets they represent, but the
estimates they provide are close, and the larger the signatures the
more accurate the estimates. For example, if we replace the
200,000-byte hashed-shingle sets that derive from 50,000-byte documents by
signatures of 1000 bytes, we can usually get within a few percent.
3.5.1 Matrix Representation of Sets
Before explaining how it is possible to construct small signatures
from large sets, it is helpful to visualize a collection of sets as their
characteristic matrix. The columns of the matrix correspond to the
sets, and the rows correspond to elements of the universal set from
which elements of the sets are drawn. There is a 1 in row r and
column c if the element for row r is a member of the set for column c.
Otherwise the value in position (r, c) is 0.
Element  S1  S2  S3  S4
a         1   0   0   1
b         0   0   1   0
c         0   1   0   1
d         1   0   1   1
e         0   0   1   0
Figure 3.2: A matrix representing four sets
Example 3.6: In Fig. 3.2 is an example of a matrix representing sets
chosen from the universal set {a, b, c, d, e}. Here, S1 = {a, d}, S2 =
{c}, S3 = {b, d, e}, and S4 = {a, c, d}. The top row and leftmost
column are not part of the matrix, but are present only to remind us
what the rows and columns represent.
It is important to remember that the characteristic matrix is unlikely to
be the way the data is stored, but it is useful as a way to visualize the
data. One reason not to store the data as a matrix is that these matrices are
almost always sparse (they have many more 0’s than 1’s) in practice.
It saves space to represent a sparse matrix of 0’s and 1’s by the
positions in which the 1’s appear. Another reason is that the data is
usually stored in some other format for other purposes.
As an example, if rows are products, and columns are customers,
represented by the set of products they bought, then this data would
really appear in a database table of purchases. A tuple in this table
would list the item, the purchaser, and probably other details about the
purchase, such as the date and the credit card used.
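The contrast between the sparse and the dense view can be sketched in Python using the sets of Example 3.6. The names SETS and characteristic_matrix are hypothetical; the dense matrix is built only for display, as the text suggests.

```python
# Sets of Example 3.6, stored sparsely: each column is just the set of
# row labels in which that column has a 1.
SETS = {
    "S1": {"a", "d"},
    "S2": {"c"},
    "S3": {"b", "d", "e"},
    "S4": {"a", "c", "d"},
}


def characteristic_matrix(sets, universe):
    """Dense 0/1 matrix: one row per element, one column per set."""
    names = sorted(sets)
    return [[1 if element in sets[name] else 0 for name in names]
            for element in universe]


for element, row in zip("abcde", characteristic_matrix(SETS, "abcde")):
    print(element, row)
```

The sparse form stores 9 entries here, while the dense matrix stores 20; the gap grows quickly for realistic data.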
3.5.2 Minhashing
The signatures we desire to construct for sets are composed of the
results of a large number of calculations, say several hundred, each of
which is a “minhash” of the characteristic matrix. In this section, we
shall learn how a minhash is computed in principle, and in later
sections we shall see how a good approximation to the minhash is
computed in practice.
To minhash a set represented by a column of the characteristic matrix,
pick a permutation of the rows. The minhash value of any column is
the number of the first row, in the permuted order, in which the
column has a 1.
Example 3.7: Let us suppose we pick the order of rows beadc for the
matrix of Fig. 3.2. This permutation defines a minhash function h that
maps sets to rows. Let us compute the minhash value of set S1
according to h. The first column, which is the column for set S1, has 0
in row b, so we proceed to row e, the second in the permuted order.
There is again a 0 in the column for S1, so we proceed to row a,
where we find a 1. Thus, h(S1) = a.
Element  S1  S2  S3  S4
b         0   0   1   0
e         0   0   1   0
a         1   0   0   1
d         1   0   1   1
c         0   1   0   1
Figure 3.3: A permutation of the rows of Fig. 3.2
Although it is not physically possible to permute very large
characteristic matrices, the minhash function h implicitly reorders the
rows of the matrix of Fig. 3.2 so it becomes the matrix of Fig. 3.3.
In this matrix, we can read off the values of h by scanning from the
top until we come to a 1. Thus, we see that h(S2) = c, h(S3) = b, and
h(S4) = a.
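The minhash values of Example 3.7 can be reproduced with a few lines of Python. This is only a sketch of the in-principle definition with an explicit row order; the names SETS and minhash are hypothetical.

```python
# Columns of Fig. 3.2, each stored as the set of rows where it has a 1.
SETS = {
    "S1": {"a", "d"},
    "S2": {"c"},
    "S3": {"b", "d", "e"},
    "S4": {"a", "c", "d"},
}


def minhash(column, row_order):
    """First row, in the permuted order, in which the column has a 1."""
    for row in row_order:
        if row in column:
            return row
    return None  # an empty set has no 1 in any row


for name in sorted(SETS):
    print(name, minhash(SETS[name], "beadc"))
```

Run on the order beadc, this prints a, c, b, a for S1 through S4, matching the values read off Fig. 3.3.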
3.5.3 Minhashing and Jaccard Similarity
There is a remarkable connection between minhashing and Jaccard
similarity of the sets that are minhashed.
The probability that the minhash function for a random permutation of
rows produces the same value for two sets equals the Jaccard
similarity of those sets.
To see why, we need to picture the columns for those two sets. If we
restrict ourselves to the columns for sets S1 and S2, then rows can be
divided into three classes:
3.5.3.1 Type X rows have 1 in both columns.
3.5.3.2 Type Y rows have 1 in one of the columns and 0 in the other.
3.5.3.3 Type Z rows have 0 in both columns.
Since the matrix is sparse, most rows are of type Z. However, it is
the ratio of the numbers of type X and type Y rows that determines
both SIM(S1, S2) and the probability that h(S1) = h(S2). Let there be
x rows of type X and y rows of type Y. Then SIM(S1, S2) = x/(x +
y). The reason is that x is the size of S1 ∩ S2 and x + y is the size of
S1 ∪ S2.
Now, consider the probability that h(S1) = h(S2). If we imagine the
rows permuted randomly, and we proceed from the top, the probability
that we shall meet a type X row before we meet a type Y row is
x/(x + y). But if the first row from the top other than type Z rows is
a type X row, then surely h(S1) = h(S2). On the other hand, if the
first row other than a type Z row that we meet is a type Y row, then
the set with a 1 gets that row as its minhash value. However, the set
with a 0 in that row surely gets some row further down the permuted
list. Thus, we know h(S1) ≠ h(S2) if we first meet a type Y row. We
conclude the probability that h(S1) = h(S2) is x/(x + y), which is
also the Jaccard similarity of S1 and S2.
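For a universe as small as {a, b, c, d, e}, the argument above can be checked by brute force: enumerate all 120 row orders and count how often two sets receive the same minhash value. The sketch below uses hypothetical function names and exact fractions so that the comparison is not blurred by rounding.

```python
from fractions import Fraction
from itertools import permutations


def minhash(column, row_order):
    """First row, in the given order, in which the column has a 1."""
    for row in row_order:
        if row in column:
            return row
    return None


def jaccard(s, t):
    return Fraction(len(s & t), len(s | t))


def agreement_fraction(s, t, universe):
    """Fraction of all row permutations on which s and t minhash equally."""
    orders = list(permutations(universe))
    agree = sum(minhash(s, order) == minhash(t, order) for order in orders)
    return Fraction(agree, len(orders))


S1, S4 = {"a", "d"}, {"a", "c", "d"}
print(jaccard(S1, S4), agreement_fraction(S1, S4, "abcde"))  # 2/3 2/3
```

Exhaustive enumeration is feasible only for toy examples; it grows as the factorial of the number of rows.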
3.5.4 Minhash Signatures
Again think of a collection of sets represented by their characteristic
matrix M. To represent sets, we pick at random some number n of
permutations of the rows of M. Perhaps 100 permutations or several
hundred permutations will do. Call the minhash functions determined
by these permutations h1, h2, . . . , hn. From the column representing set
S, construct the minhash signature for S, the vector [h1(S), h2(S), . . . ,
hn(S)]. We normally represent this list of hash-values as a column.
Thus, we can form from matrix M a signature matrix, in which the
ith column of M is replaced by the minhash signature for (the set
of) the ith column.
Note that the signature matrix has the same number of columns as M
but only n rows. Even if M is not represented explicitly, but in some
compressed form suitable for a sparse matrix (e.g., by the locations
of its 1’s), it is normal for the signature matrix to be much smaller
than M.
3.5.5 Computing Minhash Signatures
It is not feasible to permute a large characteristic matrix explicitly.
Even picking a random permutation of millions or billions of rows is
time-consuming, and the necessary sorting of the rows would take
even more time. Thus, permuted matrices like that suggested by Fig.
3.3, while conceptually appealing, are not implementable.
Fortunately, it is possible to simulate the effect of a random
permutation by a random hash function that maps row numbers to as
many buckets as there are rows. A hash function that maps integers 0,
1, . . . , k − 1 to bucket numbers 0 through k − 1 typically will map some
pairs of integers to the same bucket and leave other buckets unfilled.
However, the difference is unimportant as long as k is large and there
are not too many collisions. We can maintain the fiction that our hash
function h “permutes” row r to position h(r) in the permuted order.
Thus, instead of picking n random permutations of rows, we pick n
randomly chosen hash functions h1, h2, . . . , hn on the rows. We
construct the signature matrix by considering each row in its given
order. Let SIG(i, c) be the element of the signature matrix for the ith
hash function and column c. Initially, set SIG(i, c) to ∞ for all i and
c. We handle row r by doing the following:
1. Compute h1(r), h2(r), . . . , hn(r).
2. For each column c do the following:
(a) If c has 0 in row r, do nothing.
(b) However, if c has 1 in row r, then for each i = 1, 2, . . . , n set
SIG(i, c) to the smaller of the current value of SIG(i, c) and hi(r).
Row  S1  S2  S3  S4  x + 1 mod 5  3x + 1 mod 5
0     1   0   0   1       1             1
1     0   0   1   0       2             4
2     0   1   0   1       3             2
3     1   0   1   1       4             0
4     0   0   1   0       0             3
Figure 3.4: Hash functions computed for the matrix of Fig. 3.2
Example 3.8: Let us reconsider the characteristic matrix of Fig. 3.2,
which we reproduce with some additional data as Fig. 3.4. We have
replaced the letters naming the rows by integers 0 through 4. We have
also chosen two hash functions: h1(x) = x + 1 mod 5 and h2(x) = 3x + 1
mod 5. The values of these two functions applied to the row numbers
are given in the last two columns of Fig. 3.4. Notice that these simple
hash functions are true permutations of the rows, but a true
permutation is only possible because the number of rows, 5, is a prime.
In general, there will be collisions, where two rows get the same hash
value.
Now, let us simulate the algorithm for computing the signature
matrix.
Initially, this matrix consists of all ∞’s:
      S1  S2  S3  S4
h1     ∞   ∞   ∞   ∞
h2     ∞   ∞   ∞   ∞
First, we consider row 0 of Fig. 3.4. We see that the values of
h1(0) and h2(0) are both 1. The row numbered 0 has 1’s in the
columns for sets S1 and S4, so only these columns of the signature
matrix can change. As 1 is less than ∞, we do in fact change both values
in the columns for S1 and S4. The current estimate of the signature
matrix is thus:
      S1  S2  S3  S4
h1     1   ∞   ∞   1
h2     1   ∞   ∞   1
Now, we move to the row numbered 1 in Fig. 3.4. This row has 1
only in S3, and its hash values are h1(1) = 2 and h2(1) = 4. Thus, we
set SIG(1, 3) to 2 and SIG(2, 3) to 4. All other signature entries
remain as they are because their columns have 0 in the row
numbered 1. The new signature matrix:
      S1  S2  S3  S4
h1     1   ∞   2   1
h2     1   ∞   4   1
The row of Fig. 3.4 numbered 2 has 1’s in the columns for S2 and
S4, and its hash values are h1(2) = 3 and h2(2) = 2. We could
change the values in the signature for S4, but the values in this column
of the signature matrix, [1, 1], are each less than the corresponding hash
values [3, 2]. However, since the column for S2 still has ∞’s, we
replace it by [3, 2], resulting in:
      S1  S2  S3  S4
h1     1   3   2   1
h2     1   2   4   1
Next comes the row numbered 3 in Fig. 3.4. Here, all columns but
S2 have 1, and the hash values are h1(3) = 4 and h2(3) = 0. The
value 4 for h1 exceeds what is already in the signature matrix for all
the columns, so we shall not change any values in the first row of
the signature matrix. However, the value 0 for h2 is less than what is
already present, so we lower SIG(2, 1), SIG(2, 3) and SIG(2, 4) to 0.
Note that we cannot lower SIG(2, 2) because the column for S2 in Fig.
3.4 has 0 in the row we are currently considering. The resulting
signature matrix:
      S1  S2  S3  S4
h1     1   3   2   1
h2     0   2   0   0
Finally, consider the row of Fig. 3.4 numbered 4. h1(4) = 0 and
h2(4) = 3. Since row 4 has 1 only in the column for S3, we only
compare the current signature column for that set, [2, 0], with the hash
values [0, 3]. Since 0 < 2, we change SIG(1, 3) to 0, but since 3 > 0 we
do not change SIG(2, 3). The final signature matrix is:
      S1  S2  S3  S4
h1     1   3   0   1
h2     0   2   0   0
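The row-by-row procedure traced above can be written as a short Python sketch. The data layout (each column stored as the set of row numbers holding a 1) and the names COLUMNS, HASHES, and minhash_signatures are assumptions made for this illustration; the hash functions are those of Example 3.8.

```python
import math

# Rows 0..4 of Fig. 3.4; each column is the set of rows where it has a 1.
COLUMNS = [{0, 3}, {2}, {1, 3, 4}, {0, 2, 3}]  # S1, S2, S3, S4
HASHES = [lambda x: (x + 1) % 5, lambda x: (3 * x + 1) % 5]


def minhash_signatures(columns, hashes, num_rows):
    # SIG(i, c) starts at infinity for every hash function i and column c.
    sig = [[math.inf] * len(columns) for _ in hashes]
    for r in range(num_rows):
        hashed = [h(r) for h in hashes]         # step 1
        for c, column in enumerate(columns):    # step 2
            if r not in column:                 # (a) 0 in row r: do nothing
                continue
            for i, value in enumerate(hashed):  # (b) keep the smaller value
                sig[i][c] = min(sig[i][c], value)
    return sig


print(minhash_signatures(COLUMNS, HASHES, 5))  # [[1, 3, 0, 1], [0, 2, 0, 0]]
```

The rows are visited once, in their stored order, and no permutation of the matrix is ever materialized.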
We can estimate the Jaccard similarities of the underlying sets from
this signature matrix. Notice that columns 1 and 4 are identical, so
we guess that SIM(S1, S4) = 1.0. If we look at Fig. 3.4, we see that
the true Jaccard similarity of S1 and S4 is 2/3. Remember that the
fraction of rows that agree in the signature matrix is only an
estimate of the true Jaccard similarity, and this example is much too
small for the law of large numbers to assure that the estimates are
close. For additional examples, the signature columns for S1 and S3
agree in half the rows (true similarity 1/4), while the signatures of
S1 and S2 estimate 0 as their Jaccard similarity (the correct value).
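The comparison between estimated and true similarities can be tabulated for all six pairs of columns with the sketch below. The signature matrix and the columns are copied from Example 3.8; the function names are hypothetical.

```python
from fractions import Fraction
from itertools import combinations

SIG = [[1, 3, 0, 1], [0, 2, 0, 0]]             # final signature matrix
COLUMNS = [{0, 3}, {2}, {1, 3, 4}, {0, 2, 3}]  # S1..S4 as sets of rows


def estimated_similarity(sig, c1, c2):
    """Fraction of signature rows in which the two columns agree."""
    agree = sum(row[c1] == row[c2] for row in sig)
    return Fraction(agree, len(sig))


def true_similarity(columns, c1, c2):
    union = columns[c1] | columns[c2]
    return Fraction(len(columns[c1] & columns[c2]), len(union))


for c1, c2 in combinations(range(4), 2):
    print(f"S{c1 + 1}, S{c2 + 1}:",
          "estimate", estimated_similarity(SIG, c1, c2),
          "true", true_similarity(COLUMNS, c1, c2))
```

With only two hash functions each estimate can take just the values 0, 1/2, or 1, which is why signatures of a hundred or more rows are used in practice.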
3.5.6 Exercises for Section 3.3
Exercise 3.3.1: Verify the theorem from Section 3.5.3, which relates
the Jaccard similarity to the probability of minhashing to equal
values, for the particular case of Fig. 3.2.
(a) Compute the Jaccard similarity of each of the pairs of columns in
Fig. 3.2.
! (b) Compute, for each pair of columns of that figure, the fraction of
the 120 permutations of the rows that make the two columns hash to
the same value.
Exercise 3.3.2: Using the data from Fig. 3.4, add to the signatures of
the columns the values of the following hash functions:
(a) h3(x) = 2x + 4 mod 5.
(b) h4(x) = 3x − 1 mod 5.
Element  S1  S2  S3  S4
0         0   1   0   1
1         0   1   0   0
2         1   0   0   1
3         0   0   1   0
4         0   0   1   1
5         1   0   0   0
Figure 3.5: Matrix for Exercise 3.3.3
Exercise 3.3.3: In Fig. 3.5 is a matrix with six rows.
(a) Compute the minhash signature for each column if we use the
following three hash functions: h1(x) = 2x + 1 mod 6; h2(x) =
3x + 2 mod 6; h3(x) = 5x + 2 mod 6.
(b) Which of these hash functions are true permutations?
(c) How close are the estimated Jaccard similarities for the six pairs of
columns to the true Jaccard similarities?
! Exercise 3.3.4: Now that we know Jaccard similarity is related to the
probability that two sets minhash to the same value, reconsider
Exercise 3.1.3. Can you use this relationship to simplify the problem of
computing the expected Jaccard similarity of randomly chosen sets?
! Exercise 3.3.5: Prove that if the Jaccard similarity of two columns is
0, then minhashing always gives a correct estimate of the Jaccard
similarity.
!! Exercise 3.3.6: One might expect that we could estimate the Jaccard
similarity of columns without using all possible permutations of
rows. For example, we could only allow cyclic permutations; i.e., start
at a randomly chosen row r, which becomes the first in the order,
followed by rows r + 1, r + 2, and so on, down to the last row, and
then continuing with the first row, second row, and so on, down to
row r − 1. There are only n such permutations if there are n rows.
However, these permutations are not sufficient to estimate the Jaccard
similarity correctly. Give an example of a two-column matrix where
averaging over all the cyclic permutations does not give the Jaccard
similarity.
! Exercise 3.3.7: Suppose we want to use a MapReduce framework to
compute minhash signatures. If the matrix is stored in chunks that
correspond to some columns, then it is quite easy to exploit
parallelism. Each Map task gets some of the columns and all the hash
functions, and computes the minhash signatures of its given columns.
However, suppose the matrix were chunked by rows, so that a Map
task is given the hash functions and a set of rows to work on. Design
Map and Reduce functions to exploit MapReduce with data in this
form.
3.6 LOCALITY-SENSITIVE HASHING FOR DOCUMENTS
Even though we can use minhashing to compress large documents into
small signatures and preserve the expected similarity of any pair of
documents, it still may be impossible to find the pairs with greatest
similarity efficiently. The reason is that the number of pairs of
documents may be too large, even if there are not too many
documents.
Example 3.9: Suppose we have a million documents, and we use
signatures of length 250. Then we use 1000 bytes per document for the
signatures, and the entire data fits in a gigabyte – less than a typical
main memory of a laptop.
However, there are $\binom{1{,}000{,}000}{2}$, or half a trillion, pairs of documents. If
it takes a microsecond to compute the similarity of two signatures,
then it takes almost six days to compute all the similarities on that
laptop.
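The arithmetic of Example 3.9 can be checked directly. The sketch below assumes exactly one microsecond per comparison, as the example does, and ignores all other overheads.

```python
import math

num_docs = 1_000_000
pairs = math.comb(num_docs, 2)  # number of unordered pairs of documents
seconds = pairs * 1e-6          # one microsecond per comparison
days = seconds / 86_400

print(pairs)           # 499999500000, about half a trillion
print(round(days, 2))  # 5.79, i.e. almost six days
```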
If our goal is to compute the similarity of every pair, there is
nothing we can do to reduce the work, although parallelism can reduce
the elapsed time. However, often we want only the most similar pairs
or all pairs that are above some lower bound in similarity. If so, then
we need to focus our attention only on pairs that are likely to be
similar, without investigating every pair. There is a general theory of
how to provide such focus, called locality-sensitive hashing (LSH) or
near-neighbor search. In this section we shall consider a specific form
of LSH, designed for the particular problem we have been studying:
documents, represented by shingle-sets, then minhashed to short
signatures. A later section presents the general theory of
locality-sensitive hashing and a number of applications and related
techniques.
3.6.1 LSH for Minhash Signatures
One general approach to LSH is to “hash” items several times, in such
a way that similar items are more likely to be hashed to the same
bucket than dissimilar items are. We then consider any pair that
hashed to the same bucket for any of the hashings to be a candidate
pair. We check only the candidate pairs for similarity. The hope is that
most of the dissimilar pairs will never hash to the same bucket, and
therefore will never be checked. Those dissimilar pairs that do hash to
the same bucket are false positives; we hope these will be only a small
fraction of all pairs. We also hope that most of the truly similar
pairs will hash to the same bucket under at least one of the hash
functions. Those that do not are false negatives; we hope these will be
only a small fraction of the truly similar pairs.
If we have minhash signatures for the items, an effective way to
choose the hashings is to divide the signature matrix into b bands
consisting of r rows each. For each band, there is a hash function that
takes vectors of r integers (the portion of one column within that band)
and hashes them to some large number of buckets. We can use the
same hash function for all the bands, but we use a separate bucket
array for each band, so columns with the same vector in different
bands will not hash to the same bucket.
Example 3.10: Figure 3.6 shows part of a signature matrix of 12 rows
divided into four bands of three rows each. The second and fourth of
the explicitly shown columns each have the column vector [0, 2, 1] in the
first band, so they will definitely hash to the same bucket in the
hashing for the first band. Thus, regardless of what those columns
look like in the other three bands, this pair of columns will be a
candidate pair. It is possible that other columns, such as the first two
shown explicitly, will also hash to the same bucket according to the
hashing of the first band. However, since their column vectors are
different, [1, 3, 0] and [0, 2, 1], and there are many buckets for each
hashing, we expect the chances of an accidental collision to be very
small. We shall normally assume that two vectors hash to the same
bucket if and only if they are identical.
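A minimal sketch of the banding step follows. It adopts the simplifying assumption stated above, that two vectors share a bucket if and only if they are identical, by using the vector itself (as a tuple) as the bucket key. The function name candidate_pairs and the column-major layout of the signatures are assumptions for this illustration.

```python
from collections import defaultdict
from itertools import combinations


def candidate_pairs(signatures, bands, rows):
    """signatures: list of columns, each a list of bands * rows integers."""
    candidates = set()
    for band in range(bands):
        buckets = defaultdict(list)  # a separate bucket array per band
        for col, signature in enumerate(signatures):
            # Assumption: the tuple itself is the bucket key, so two columns
            # share a bucket exactly when their vectors in this band match.
            key = tuple(signature[band * rows:(band + 1) * rows])
            buckets[key].append(col)
        for cols in buckets.values():
            candidates.update(combinations(cols, 2))
    return candidates


# Band 1 of the five columns shown explicitly in Fig. 3.6.
band1 = [[1, 3, 0], [0, 2, 1], [0, 1, 3], [0, 2, 1], [2, 2, 1]]
print(candidate_pairs(band1, bands=1, rows=3))  # {(1, 3)}
```

Column indices are zero-based here, so the pair (1, 3) is the second and fourth column of the figure.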
Two columns that do not agree in band 1 have three other chances to
become a candidate pair; they might be identical in any one of
these other bands.
band 1   . . .  1 0 0 0 2  . . .
         . . .  3 2 1 2 2  . . .
         . . .  0 1 3 1 1  . . .
band 2
band 3
band 4
Figure 3.6: Dividing a signature matrix into four bands of three rows
per band
However, observe that the more similar two columns are, the more
likely it is that they will be identical in some band. Thus, intuitively
the banding strategy makes similar columns much more likely to be
candidate pairs than dissimilar pairs.
3.6.2 Analysis of the Banding Technique
Suppose we use b bands of r rows each, and suppose that a particular
pair of documents have Jaccard similarity s. Recall from Section 3.5.3
that the probability the minhash signatures for these documents agree
in any one particular row of the signature matrix is s. We can calculate
the probability that these documents (or rather their signatures)
become a candidate pair as follows:
3.6.2.1 The probability that the signatures agree in all rows of one
particular band is $s^r$.
3.6.2.2 The probability that the signatures disagree in at least one
row of a particular band is $1 - s^r$.
3.6.2.3 The probability that the signatures disagree in at least one
row of each of the bands is $(1 - s^r)^b$.
3.6.2.4 The probability that the signatures agree in all the rows of at
least one band, and therefore become a candidate pair, is $1 - (1 - s^r)^b$.
It may not be obvious, but regardless of the chosen constants b and r,
this function has the form of an S-curve, as suggested in Fig. 3.7. The
threshold, that is, the value of similarity s at which the probability
of becoming a candidate is 1/2, is a function of b and r. The
threshold is roughly where the rise is the steepest, and for large b
and r we find that pairs with similarity above the threshold are
very likely to become candidates, while those below the threshold are
unlikely to become candidates – exactly the situation we want.
[Plot: probability of becoming a candidate (vertical axis) against
Jaccard similarity of documents (horizontal axis, from 0 to 1)]
Figure 3.7: The S-curve
An approximation to the threshold is $(1/b)^{1/r}$. For example, if b
= 16 and r = 4, then the threshold is approximately at s = 1/2,
since the 4th root of 1/16 is 1/2.
Example 3.11: Let us consider the case b = 20 and r = 5. That is, we
suppose we have signatures of length 100, divided into twenty bands
of five rows each. Figure 3.8 tabulates some of the values of the
function $1 - (1 - s^5)^{20}$. Notice that the threshold, the value of s at which
the curve has risen halfway, is just slightly more than 0.5. Also notice
that the curve is not exactly the ideal step function that jumps from 0
to 1 at the threshold, but the slope of the curve in the middle is
significant. For example, it rises by more than 0.6 going from s = 0.4
to s = 0.6, so the slope in the middle is greater than 3.
s     $1 - (1 - s^r)^b$
.2    .006
.3    .047
.4    .186
.5    .470
.6    .802
.7    .975
.8    .9996
Figure 3.8: Values of the S-curve for b = 20 and r = 5
For example, at s = 0.8, $1 - (0.8)^5$ is about 0.672. If you raise this
number to the 20th power, you get about 0.00035. Subtracting this
fraction from 1 yields 0.99965. That is, if we consider two documents
with 80% similarity, then in any one band, they have only about a 33%
chance of agreeing in all five rows and thus becoming a candidate pair.
However, there are 20 bands and thus 20 chances to become a
candidate. Only roughly one in 3000 pairs that are as high as 80%
similar will fail to become a candidate pair and thus be a false
negative.
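The entries of Figure 3.8 and the false-negative estimate above can be recomputed in a few lines. This is an illustrative Python sketch only; the names `s_curve`, `values`, and `false_negative_rate` are assumptions made for the example.

```python
def s_curve(s, r=5, b=20):
    """1 - (1 - s^r)^b for b bands of r rows each."""
    return 1.0 - (1.0 - s ** r) ** b


# Recompute the rows of Figure 3.8 for s = 0.2, 0.3, ..., 0.8.
values = [s_curve(s / 10) for s in range(2, 9)]
for s, value in zip(range(2, 9), values):
    print(f"{s / 10:.1f}  {value:.4f}")

# Chance that a pair with 80% similarity is missed by all 20 bands.
false_negative_rate = (1.0 - 0.8 ** 5) ** 20
print(false_negative_rate)
```

The reciprocal of `false_negative_rate` is a little under 3000, in line with the "roughly one in 3000" figure quoted above.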
3.6.3 Combining the Techniques
We can now give an approach to finding the set of candidate pairs for
similar documents and then discovering the truly similar documents
among them. It must be emphasized that this approach can produce
false negatives – pairs of similar documents that are not identified as
such because they never become a candidate pair. There will also be
false positives – candidate pairs that are evaluated, but are found not
to be sufficiently similar.
3.6.3.1 Pick a value of k and construct from each document the set of
k-shingles. Optionally, hash the k-shingles to shorter bucket numbers.
3.6.3.2 Sort the document-shingle pairs to order them by shingle.
3.6.3.3 Pick a length n for the minhash signatures. Feed the sorted list
to the algorithm of Section 3.3.5 to compute the minhash signatures for
all the documents.
3.6.3.4 Choose a threshold t that defines how similar documents have
to be in order for them to be regarded as a desired “similar pair.” Pick
a number of bands b and a number of rows r such that br = n, and
the threshold t is approximately $(1/b)^{1/r}$. If avoidance of false
negatives is important, you may wish to select b and r to produce a
threshold lower than t; if speed is important and you wish to limit false
positives, select b and r to produce a higher threshold.
3.6.3.5 Construct candidate pairs by applying the LSH technique of
Section 3.4.1.
3.6.3.6 Examine each candidate pair’s signatures and determine
whether the fraction of components in which they agree is at least t.
3.6.3.7 Optionally, if the signatures are sufficiently similar, go to the
documents themselves and check that they are truly similar, rather than
documents that, by luck, had similar signatures.
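The steps above can be sketched end to end for a small in-memory collection. This Python sketch makes several assumptions that are not in the text: shingles are hashed with `zlib.crc32`, minhashing uses random linear hash functions of the form (a·v + c) mod `PRIME` in place of row permutations, the sort of step 2 and the algorithm of Section 3.3.5 are not reproduced, and the optional final check against the documents themselves is omitted. All function names are hypothetical.

```python
import random
import zlib
from collections import defaultdict
from itertools import combinations

# Assumed modulus for the illustrative hash family: a prime slightly above
# 2**32, so every 32-bit shingle hash is a valid input.
PRIME = 4294967311


def shingle_set(text, k):
    """Step 1: the k-shingles of text, each hashed to a 32-bit bucket number."""
    return {zlib.crc32(text[i:i + k].encode("utf-8"))
            for i in range(len(text) - k + 1)}


def minhash_signature(shingles, hash_params):
    """Step 3: one minhash value per hash function h(v) = (a*v + c) mod PRIME."""
    return [min((a * v + c) % PRIME for v in shingles) for a, c in hash_params]


def lsh_candidates(signatures, bands, rows):
    """Step 5: pairs whose signatures agree in every row of at least one band."""
    candidates = set()
    for band in range(bands):
        buckets = defaultdict(list)
        for doc_id, signature in enumerate(signatures):
            key = tuple(signature[band * rows:(band + 1) * rows])
            buckets[key].append(doc_id)
        for doc_ids in buckets.values():
            candidates.update(combinations(doc_ids, 2))
    return candidates


def similar_pairs(docs, k=5, bands=20, rows=5, t=0.8, seed=0):
    """Steps 1-6 for a list of documents, each at least k characters long."""
    rng = random.Random(seed)
    n = bands * rows                      # Step 4: b * r = n
    hash_params = [(rng.randrange(1, PRIME), rng.randrange(PRIME))
                   for _ in range(n)]
    signatures = [minhash_signature(shingle_set(d, k), hash_params)
                  for d in docs]
    result = {}
    for i, j in lsh_candidates(signatures, bands, rows):
        agreement = sum(u == v for u, v in zip(signatures[i], signatures[j])) / n
        if agreement >= t:                # Step 6: fraction of agreeing rows
            result[(i, j)] = agreement
    return result


docs = [
    "the quick brown fox jumps over the lazy dog",
    "locality-sensitive hashing avoids comparing all pairs",
    "the quick brown fox jumps over the lazy dog",
]
pairs = similar_pairs(docs)
print((0, 2) in pairs)  # True: identical documents agree in every band
```

With b = 20 and r = 5 the approximate threshold $(1/b)^{1/r}$ is about 0.55, deliberately below t = 0.8, which favors avoiding false negatives as suggested in step 3.6.3.4.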
3.6.4 Exercises for Section 3.4
Exercise 3.4.1 : Evaluate the S-curve $1 - (1 - s^r)^b$ for s = 0.1, 0.2, . . . ,
0.9, for the following values of r and b:
• r = 3 and b = 10.
• r = 6 and b = 20.
• r = 5 and b = 50.
! Exercise 3.4.2 : For each of the (r, b) pairs in Exercise 3.4.1, compute
the threshold, that is, the value of s for which the value of $1 - (1 - s^r)^b$ is
exactly 1/2. How does this value compare with the estimate of $(1/b)^{1/r}$
that was suggested in Section 3.4.2?
! Exercise 3.4.3 : Use the techniques explained in Section 1.3.5 to
approximate the S-curve $1 - (1 - s^r)^b$ when $s^r$ is very small.
! Exercise 3.4.4 : Suppose we wish to implement LSH by MapReduce.
Specifically, assume chunks of the signature matrix consist of
columns, and elements are key-value pairs where the key is the column
number and the value is the signature itself (i.e., a vector of values).
(a) Show how to produce the buckets for all the bands as output of
a single MapReduce process. Hint: Remember that a Map function can
produce several key-value pairs from a single element.
(b) Show how another MapReduce process can convert the
output of (a) to a list of pairs that need to be compared. Specifically,
for each column i, there should be a list of those columns j > i with
which i needs to be compared.
3.7 DISTANCE MEASURES
We now take a short detour to study the general notion of distance
measures. The Jaccard similarity is a measure of how close sets are,
although it is not really a distance measure. That is, the closer sets are,
the higher the Jaccard similarity. Rather, 1 minus the Jaccard similarity
is a distance measure, as we shall see; it is called the Jaccard
distance.
However, Jaccard distance is not the only measure of closeness that
makes sense. We shall examine in this section some other distance
measures that have applications. Then, in Section 3.6 we see how some
of these distance measures also have an LSH technique that allows us
to focus on nearby points without comparing all points. Other
applications of distance measures will appear when we study
clustering in Chapter 7.
3.7.1 Definition of a Distance Measure
Suppose we have a set of points, called a space. A distance measure
on this space is a function d(x, y) that takes two points in the space as
arguments and produces a real number, and satisfies the following
axioms:
3.7.1.1 d(x, y) ≥ 0 (no negative distances).
3.7.1.2 d(x, y) = 0 if and only if x = y (distances are positive,
except for the distance from a point to itself).
3.7.1.3 d(x, y) = d(y, x) (distance is symmetric).
3.7.1.4 d(x, y) ≤ d(x, z) + d(z, y) (the triangle inequality).
The triangle inequality is the most complex condition. It says,
intuitively, that to travel from x to y, we cannot obtain any benefit if
we are forced to travel via some particular third point z. The
triangle-inequality axiom is what makes all distance measures behave as if
distance describes the length of a shortest path from one point to
another.
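The four axioms can be checked by brute force on a finite sample of points. The Python sketch below is illustrative only: `violated_axioms`, `manhattan`, and `squared_euclidean` are names introduced here, and passing the check on a sample does not prove that a function is a distance measure, although failing it does disprove it. Integer coordinates are used so that all comparisons are exact.

```python
from itertools import product


def violated_axioms(d, points):
    """Return the names of the distance-measure axioms that d breaks on
    the given finite sample of points."""
    failures = set()
    for x, y in product(points, repeat=2):
        if d(x, y) < 0:
            failures.add("nonnegativity")
        if (d(x, y) == 0) != (x == y):
            failures.add("zero only for identical points")
        if d(x, y) != d(y, x):
            failures.add("symmetry")
    for x, y, z in product(points, repeat=3):
        if d(x, y) > d(x, z) + d(z, y):
            failures.add("triangle inequality")
    return failures


def manhattan(p, q):
    """Sum of the absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(p, q))


def squared_euclidean(p, q):
    """Sum of squared coordinate differences (no square root taken)."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


grid = [(i, j) for i in range(4) for j in range(4)]
print(violated_axioms(manhattan, grid))          # set()
print(violated_axioms(squared_euclidean, grid))  # {'triangle inequality'}
```

The squared Euclidean function fails because, for example, going from (0, 0) to (2, 0) directly costs 4, while going via (1, 0) costs only 1 + 1 = 2.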
3.7.2 Euclidean Distances
The most familiar distance measure is the one we normally think of as
“distance.” An n-dimensional Euclidean space is one where points
are vectors of n real numbers. The conventional distance measure in
this space, which we shall refer to as the $L_2$-norm, is defined:
$$d([x_1, x_2, \ldots, x_n], [y_1, y_2, \ldots, y_n]) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$$
That is, we square the distance in each dimension, sum the squares,
and take the positive square root.
It is easy to verify the first three requirements for a distance
measure are satisfied. The Euclidean distance between two points
cannot be negative, because the positive square root is intended.
Since all squares of real numbers are nonnegative, any i such that
$x_i \neq y_i$ forces the distance to be strictly positive. On the other
hand, if $x_i = y_i$ for all i, then the distance is clearly 0. Symmetry
follows because $(x_i - y_i)^2 = (y_i - x_i)^2$. The triangle inequality
requires a good deal of algebra to verify. However, it is well
understood to be a property of Euclidean space: the sum of the lengths
of any two sides of a triangle is no less than the length of the third side.
There are other distance measures that have been used for Euclidean
spaces. For any constant r, we can define the $L_r$-norm to be the distance
measure d defined by:
$$d([x_1, x_2, \ldots, x_n], [y_1, y_2, \ldots, y_n]) = \left(\sum_{i=1}^{n} |x_i - y_i|^r\right)^{1/r}$$
The case r = 2 is the usual $L_2$-norm just mentioned. Another common
distance measure is the $L_1$-norm, or Manhattan distance. There, the
distance between two points is the sum of the magnitudes of the
differences in each dimension. It is called “Manhattan distance”
because it is the distance one would have to travel between points if
one were constrained to travel along grid lines, as on the streets of a
city such as Manhattan.
Another interesting distance measure is the $L_\infty$-norm, which is the
limit as r approaches infinity of the $L_r$-norm. As r gets larger, only the
dimension with the largest difference matters, so formally, the
$L_\infty$-norm is defined as the maximum of $|x_i - y_i|$ over all dimensions i.
Example 3.12 : Consider the two-dimensional Euclidean space (the
customary plane) and the points (2, 7) and (6, 4). The $L_2$-norm gives
a distance of $\sqrt{(2 - 6)^2 + (7 - 4)^2} = \sqrt{4^2 + 3^2} = 5$. The
$L_1$-norm gives a distance of |2 − 6| + |7 − 4| = 4 + 3 = 7. The
$L_\infty$-norm gives a distance of max(|2 − 6|, |7 − 4|) = max(4, 3) = 4.
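The three norms of Example 3.12 are instances of one formula, which the following Python sketch evaluates. The function name `lr_distance` and the convention of passing `math.inf` for the $L_\infty$-norm are assumptions made for this illustration.

```python
import math


def lr_distance(x, y, r):
    """L_r distance between two equal-length vectors; r may be math.inf."""
    diffs = [abs(a - b) for a, b in zip(x, y)]
    if r == math.inf:
        return max(diffs)
    return sum(d ** r for d in diffs) ** (1.0 / r)


p, q = (2, 7), (6, 4)
print(lr_distance(p, q, 2))         # 5.0
print(lr_distance(p, q, 1))         # 7.0
print(lr_distance(p, q, math.inf))  # 4
print(lr_distance(p, q, 50))        # very close to 4, illustrating the limit
```

The last line shows the limiting behavior described above: for large r the largest coordinate difference dominates the sum.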
3.7.3 Jaccard Distance
As mentioned at the beginning of the section, we define the Jaccard
distance of sets by d(x, y) = 1 − SIM(x, y). That is, the Jaccard distance
is 1 minus the ratio of the sizes of the intersection and union of sets x
and y. We must verify that this function is a distance measure.
3.7.3.1 d(x, y) is nonnegative because the size of the intersection cannot
exceed the size of the union.
3.7.3.2 d(x, y) = 0 if x = y, because x ∪ x = x ∩ x = x. However, if
x ≠ y, then the size of x ∩ y is strictly less than the size of x ∪ y, so
d(x, y) is strictly positive.
3.7.3.3 d(x, y) = d(y, x) because both union and intersection are
symmetric; i.e., x ∪ y = y ∪ x and x ∩ y = y ∩ x.
3.7.3.4 For the triangle inequality, recall from Section 3.3.3 that
SIM(x, y) is the probability a random minhash function maps x and y
to the same value. Thus, the Jaccard distance d(x, y) is the probability
that a random minhash function does not send x and y to the same
value. We can therefore translate the condition d(x, y) ≤ d(x, z) + d(z, y)
to the statement that if h is a random minhash function, then the
probability that h(x) ≠ h(y) is no greater than the sum of the
probability that h(x) ≠ h(z) and the probability that h(z) ≠ h(y).
However, this statement is true because whenever h(x) ≠ h(y), at
least one of h(x) and h(y) must be different from h(z). They could
not both be h(z), because then h(x) and h(y) would be the same.
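As a concrete illustration, the Jaccard distance can be computed directly from Python sets. This is a minimal sketch; the function name `jaccard_distance` and the choice to treat two empty sets as being at distance 0 are assumptions made here, not statements from the text.

```python
def jaccard_distance(x, y):
    """1 - |x ∩ y| / |x ∪ y| for two sets x and y."""
    union = x | y
    if not union:
        # Assumed convention: two empty sets are identical, so distance 0.
        return 0.0
    return 1.0 - len(x & y) / len(union)


x = {"a", "b", "c"}
y = {"b", "c", "d", "e"}
z = {"c", "d"}
print(jaccard_distance(x, y))  # intersection of size 2, union of size 5
# One instance of the triangle inequality d(x, y) <= d(x, z) + d(z, y).
print(jaccard_distance(x, y) <= jaccard_distance(x, z) + jaccard_distance(z, y))
```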
3.7.4 Cosine Distance
The cosine distance makes sense in spaces that have dimensions,
including Euclidean spaces and discrete versions of Euclidean spaces,
such as spaces where points are vectors with integer components or
Boolean (0 or 1) components. In such a space, points may be thought
of as directions. We do not distinguish between a vector and a
multiple of that vector. Then the cosine distance between two points is
the angle that the vectors to those points make. This angle will be in
the range 0 to 180 degrees, regardless of how many dimensions the
space has.
We can calculate the cosine distance by first computing the cosine of
the angle, and then applying the arc-cosine function to translate to an
angle in the 0-180 degree range. Given two vectors x and y, the cosine
of the angle between them is the dot product x.y divided by the product
of the $L_2$-norms of x and y (i.e., their Euclidean distances from the
origin). Recall that the dot product of vectors
$[x_1, x_2, \ldots, x_n].[y_1, y_2, \ldots, y_n]$ is $\sum_{i=1}^{n} x_i y_i$.
Example 3.13 : Let our two vectors be x = [1, 2, −1] and y = [2, 1, 1].
The dot product x.y is 1 × 2 + 2 × 1 + (−1) × 1 = 3. The $L_2$-norm of
both vectors is $\sqrt{6}$. For example, x has $L_2$-norm
$\sqrt{1^2 + 2^2 + (-1)^2} = \sqrt{6}$. Thus, the cosine of the angle
between x and y is $3/(\sqrt{6}\sqrt{6})$ or 1/2. The angle whose cosine
is 1/2 is 60 degrees, so that is the cosine distance between x and y.
We must show that the cosine distance is indeed a distance measure.
We have defined it so the values are in the range 0 to 180, so no
negative distances are possible. Two vectors have angle 0 if and only if
they are the same direction. Symmetry is obvious: the angle between x
and y is the same as the angle between y and x. The triangle inequality
is best argued by physical reasoning. One way to rotate from x to y is
to rotate to z and thence to y. The sum of those two rotations cannot
be less than the rotation directly from x to y.
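The calculation of Example 3.13 can be written as a short function. The Python sketch below is illustrative; the name `cosine_distance_degrees` is an assumption, both vectors are assumed to be nonzero, and the clamping step is a defensive measure against floating-point rounding rather than part of the definition.

```python
import math


def cosine_distance_degrees(x, y):
    """Angle between nonzero vectors x and y, in degrees (0 to 180)."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    cosine = dot / (norm_x * norm_y)
    # Guard against rounding that pushes the cosine slightly outside [-1, 1].
    cosine = max(-1.0, min(1.0, cosine))
    return math.degrees(math.acos(cosine))


print(cosine_distance_degrees([1, 2, -1], [2, 1, 1]))  # approximately 60
print(cosine_distance_degrees([1, 2], [3, 6]))         # approximately 0
```

The second call illustrates that a vector and a positive multiple of it are treated as the same direction, so their cosine distance is 0 up to rounding.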
3.7.5 Edit Distance
This distance makes sense when points are strings. The distance
between two strings $x = x_1 x_2 \cdots x_n$ and $y = y_1 y_2 \cdots y_m$ is the smallest
number of insertions and deletions of single characters that will
convert x to y.
Example 3.14 : The edit distance between the strings x = abcde
and y = acfdeg is 3. To convert x to y:
3.7.5.1 Delete b.
3.7.5.2 Insert f after c.
3.7.5.3 Insert g after e.
No sequence of fewer than three insertions and/or deletions will
convert x to y. Thus, d(x, y) = 3.
Another way to define and calculate the edit distance d(x, y) is to
compute a longest common subsequence (LCS) of x and y. An LCS
of x and y is a string that is constructed by deleting positions from x
and y, and that is as long as any string that can be constructed that
way. The edit distance d(x, y) can be calculated as the length of x
plus the length of y minus twice the length of their LCS.
Example 3.15 : The strings x = abcde and y = acfdeg from
Example 3.14 have a unique LCS, which is acde. We can be sure it
is the longest possible, because it contains every symbol appearing in
both x and y. Fortunately, these common symbols appear in the same
order in both strings, so we are able to use them all in an LCS. Note
that the length of x is 5, the length of y is 6, and the length of their
LCS is 4. The edit distance is thus 5 + 6 − 2 × 4 = 3, which agrees with
the direct calculation in Example 3.14.
For another example, consider x = aba and y = bab. Their edit
distance is 2. For example, we can convert x to y by deleting the first
a and then inserting b at the end. There are two LCS’s: ab and ba.
Each can be obtained by deleting one symbol from each string. As
must be the case for multiple LCS’s of the same pair of strings, both
LCS’s have the same length. Therefore, we may compute the edit
distance as 3 + 3 − 2 × 2 = 2.
Edit distance is a distance measure. Surely no edit distance can be
negative, and only two identical strings have an edit distance of 0. To
see that edit distance is symmetric, note that a sequence of insertions
and deletions can be reversed, with each insertion becoming a deletion,
and vice versa. The triangle inequality is also straightforward. One
way to turn a string s into a string t is to turn s into some string u
and then turn u into t. Thus, the number of edits made going from s
to u, plus the number of edits made going from u to t cannot be less
than the smallest number of edits that will turn s into t.
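The LCS-based formula lends itself to a short dynamic-programming computation. The following Python sketch is one standard way to compute the LCS length, not necessarily the method the text has in mind; the function names `lcs_length` and `edit_distance` are assumptions.

```python
def lcs_length(x, y):
    """Length of a longest common subsequence of strings x and y."""
    prev = [0] * (len(y) + 1)          # LCS lengths for the previous row
    for cx in x:
        curr = [0]
        for j, cy in enumerate(y, start=1):
            if cx == cy:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[len(y)]


def edit_distance(x, y):
    """Insertion/deletion edit distance: |x| + |y| - 2 * |LCS(x, y)|."""
    return len(x) + len(y) - 2 * lcs_length(x, y)


print(edit_distance("abcde", "acfdeg"))  # 3, as in Example 3.14
print(edit_distance("aba", "bab"))       # 2
```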
3.7.6 Hamming Distance
Given a space of vectors, we define the Hamming distance between
two vectors to be the number of components in which they differ. It
should be obvious that Hamming distance is a distance measure.
Clearly the Hamming distance cannot be negative, and if it is zero,
then the vectors are identical. The distance does not depend on
which of two vectors we consider first. The triangle inequality should
also be evident. If x and z differ in m components, and z and y
differ in n components, then x and y cannot differ in more than m + n
components. Most commonly, Hamming distance is used when the
vectors are Boolean; they consist of 0’s and 1’s only. However, in
principle, the vectors can have components from any set.
Example 3.16 : The Hamming distance between the vectors 10101 and
11110 is 3. That is, these vectors differ in the second, fourth, and fifth
components, while they agree in the first and third components.
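Example 3.16 can be checked with a one-line count of differing positions. In this Python sketch the function name `hamming_distance` is an assumption, and vectors are represented as strings of characters purely for convenience.

```python
def hamming_distance(x, y):
    """Number of positions in which equal-length sequences x and y differ."""
    if len(x) != len(y):
        raise ValueError("Hamming distance needs vectors of equal length")
    return sum(a != b for a, b in zip(x, y))


print(hamming_distance("10101", "11110"))  # 3
```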
3.7.7 Exercises for Section 3.5
! Exercise 3.5.1 : On the space of nonnegative integers, which of the
following functions are distance measures? If so, prove it; if not, prove
that it fails to satisfy one or more of the axioms.
(a) max(x, y) = the larger of x and y.
(b) diff(x, y) = |x − y| (the absolute magnitude of the difference
between x and y).
(c) sum(x, y) = x + y.
Non-Euclidean Spaces
Notice that several of the distance measures introduced in this
section apply to spaces that are not Euclidean. A property of Euclidean
spaces that we shall find important when we take up
clustering in Chapter 7 is that the average of points in a
Euclidean space always exists and is a point in the space.
However, consider the space of sets for which we defined the
Jaccard distance. The notion of the “average” of two sets
makes no sense. Likewise, the space of strings, where we can
use the edit distance, does not let us take the “average” of
strings.
Vector spaces, for which we suggested the cosine distance,
may or may not be Euclidean. If the components of the vectors
can be any real numbers, then the space is Euclidean.
However, if we restrict components to be integers, then the
space is not Euclidean. Notice that, for instance, we cannot find
an average of the vectors [1, 2] and [3, 1] in the space of vectors
with two integer components, although if we treated them as
members of the two-dimensional Euclidean space, then we
could say that their average was [2.0, 1.5].
Exercise 3.5.2 : Find the $L_1$ and $L_2$ distances between the points (5, 6,
7) and (8, 2, 4).
!! Exercise 3.5.3 : Prove that if i and j are any positive integers, and
i < j, then the $L_i$ norm between any two points is greater than the $L_j$
norm between those same two points.
Exercise 3.5.4 : Find the Jaccard distances between the following
pairs of sets:
(a) {1, 2, 3, 4} and {2, 3, 4, 5}.
(b) {1, 2, 3} and {4, 5, 6}.
Exercise 3.5.5 : Compute the cosines of the angles between each of the
following pairs of vectors.5
(a) (3, −1, 2) and (−2, 3, 1).
(b) (1, 2, 3) and (2, 4, 6).
(c) (5, 0, −4) and (−1, −6, 2).
(d) (0, 1, 1, 0, 1, 1) and (0, 0, 1, 0, 0, 0).
! Exercise 3.5.6 : Prove that the cosine distance between any two vectors
of 0’s and 1’s, of the same length, is at most 90 degrees.
Exercise 3.5.7 : Find the edit distances (using only insertions and
deletions) between the following pairs of strings.
(a) abcdef and bdaefc.
(b) abccdabc and acbdcab.
(c) abcdef and baedfc.
! Exercise 3.5.8 : There are a number of other notions of edit distance
available. For instance, we can allow, in addition to insertions and
deletions, the following operations:
i. Mutation, where one symbol is replaced by another symbol. Note
that a mutation can always be performed by an insertion followed by a
deletion, but if we allow mutations, then this change counts for only 1,
not 2, when computing the edit distance.
ii. Transposition, where two adjacent symbols have their positions
swapped. Like a mutation, we can simulate a transposition by one
insertion followed by one deletion, but here we count only 1 for these
two steps.
Repeat Exercise 3.5.7 if edit distance is defined to be the number of
insertions, deletions, mutations, and transpositions needed to transform
one string into another.
! Exercise 3.5.9 : Prove that the edit distance discussed in Exercise
3.5.8 is indeed a distance measure.
Exercise 3.5.10 : Find the Hamming distances between each pair of the
following vectors: 000000, 110011, 010101, and 011100.
5 Note that what we are asking for is not precisely the cosine
distance, but from the cosine of an angle, you can compute the
angle itself, perhaps with the aid of a table or library function.
3.8 THE THEORY OF LOCALITY-SENSITIVE
FUNCTIONS
The LSH technique developed in Section 3.4 is one example of a
family of functions (the minhash functions) that can be combined (by
the banding technique) to distinguish strongly between pairs at a low
distance from pairs at a high distance. The steepness of the S-curve in
Fig. 3.7 reflects how effectively we can avoid false positives and false
negatives among the candidate pairs.
Now, we shall explore other families of functions, besides the minhash
functions, that can serve to produce candidate pairs efficiently. These
functions can apply to the space of sets and the Jaccard distance, or to
another space and/or another distance measure. There are three
conditions that we need for a family of functions:
1. They must be more likely to make close pairs be candidate pairs
than distant pairs. We make this notion precise in Section 3.6.1.
2. They must be statistically independent, in the sense that it is
possible to estimate the probability that two or more functions will
all give a certain response by the product rule for independent
events.
3. They must be efficient, in two ways:
(a) They must be able to identify candidate pairs in time much less
than the time it takes to look at all pairs. For example, minhash
functions have this capability, since we can hash sets to minhash
values in time proportional to the size of the data, rather than the
square of the number of sets in the data. Since sets with common
values are colocated in a bucket, we have implicitly produced the
candidate pairs for a single minhash function in time much less than
the number of pairs of sets.
(b) They must be combinable to build functions that are better at
avoiding false positives and negatives, and the combined functions
must also take time that is much less than the number of pairs. For
example, the banding technique of Section 3.4.1 takes single minhash
functions, which satisfy condition 3a but do not, by themselves, have
the S-curve behavior we want, and produces from a number of minhash
functions a combined function that has the S-curve shape.
Our first step is to define “locality-sensitive functions” generally.
We then see how the idea can be applied in several applications.
Finally, we discuss how to apply the theory to arbitrary data with
either a cosine distance or a Euclidean distance measure.
3.8.1 Locality-Sensitive Functions
For the purposes of this section, we shall consider functions that take
two items and render a decision about whether these items should be
a candidate pair.
In many cases, the function f will “hash” items, and the decision will
be based on whether or not the result is equal. Because it is
convenient to use the notation f(x) = f(y) to mean that f(x, y) is “yes;
make x and y a candidate pair,” we shall use f(x) = f(y) as a
shorthand with this meaning. We also use f(x) ≠ f(y) to mean “do
not make x and y a candidate pair unless some other function
concludes we should do so.”
A collection of functions of this form will be called a family of
functions. For example, the family of minhash functions, each based on
one of the possible permutations of rows of a characteristic matrix,
forms a family.
Let d1 < d2 be two distances according to some distance measure d.
A family F of functions is said to be (d1, d2, p1, p2)-sensitive if for
every f in F:
3.8.1.1 If d(x, y) ≤ d1, then the probability that f(x) = f(y) is at least
p1.
3.8.1.2 If d(x, y) ≥ d2, then the probability that f(x) = f(y) is at
most p2.
Figure 3.9: Behavior of a (d1, d2, p1, p2)-sensitive function
(horizontal axis: distance, with d1 and d2 marked; vertical axis:
probability of being declared a candidate, with p1 and p2 marked)
Figure 3.9 illustrates what we expect about the probability that a given
function in a (d1, d2, p1, p2)-sensitive family will declare two items to
be a candidate pair. Notice that we say nothing about what happens
when the distance between the items is strictly between d1 and d2, but
we can make d1 and d2 as close as we wish. The penalty is that
typically p1 and p2 are then close as well. As we shall see, it is possible
to drive p1 and p2 apart while keeping d1 and d2 fixed.
3.8.2 Locality-Sensitive Families for Jaccard Distance
For the moment, we have only one way to find a family of
locality-sensitive functions: use the family of minhash functions, and
assume that the distance measure is the Jaccard distance. As before,
we interpret a minhash function h to make x and y a candidate pair
if and only if h(x) = h(y).
The family of minhash functions is a (d1, d2, 1 − d1, 1 − d2)-sensitive
family for any d1 and d2, where 0 ≤ d1 < d2 ≤ 1.
The reason is that if d(x, y) ≤ d1, where d is the Jaccard distance,
then SIM(x, y) = 1 − d(x, y) ≥ 1 − d1. But we know that the
Jaccard similarity of x and y is equal to the probability that a
minhash function will hash x and y to the same value. A similar
argument applies to d2 or any distance.
Example 3.17 : We could let d1 = 0.3 and d2 = 0.6. Then we can
assert that the family of minhash functions is a (0.3, 0.6, 0.7, 0.4)-sensitive
family. That is, if the Jaccard distance between x and y is at
most 0.3 (i.e., SIM(x, y) ≥ 0.7) then there is at least a 0.7 chance
that a minhash function will send x and y to the same value, and if the
Jaccard distance between x and y is at least 0.6 (i.e., SIM(x, y) ≤ 0.4),
then there is at most a 0.4 chance that x and y will be sent to the
same value. Note that we could make the same assertion with another
choice of d1 and d2; only d1 < d2 is required.
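The fact underlying this sensitivity claim, that a random minhash function agrees on x and y with probability exactly SIM(x, y), can be verified exhaustively for a tiny universe. The Python sketch below enumerates every permutation of a five-element universe; the names `minhash` and `agreement_probability` and the example sets are assumptions chosen for illustration, and the approach is only feasible for very small universes.

```python
from fractions import Fraction
from itertools import permutations


def minhash(s, order):
    """Position of the first element of `order` that belongs to set s."""
    return next(pos for pos, element in enumerate(order) if element in s)


def agreement_probability(x, y, universe):
    """Exact probability that a uniformly random permutation of `universe`
    gives the nonempty sets x and y the same minhash value."""
    orders = list(permutations(universe))
    agree = sum(minhash(x, order) == minhash(y, order) for order in orders)
    return Fraction(agree, len(orders))


universe = range(5)
x, y = {0, 1, 2}, {1, 2, 3}
similarity = Fraction(len(x & y), len(x | y))
print(agreement_probability(x, y, universe), similarity)  # 1/2 1/2
```

Here the Jaccard distance is 1/2, so the agreement probability of 1/2 equals 1 minus the distance, as the sensitivity statement requires.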
3.8.3 Amplifying a Locality-Sensitive Family
Suppose we are given a (d1, d2, p1, p2)-sensitive family F. We can
construct a new family F′ by the AND-construction on F, which is
defined as follows. Each member of F′ consists of r members of F
for some fixed r. If f is in F′, and f is constructed from the set
$\{f_1, f_2, \ldots, f_r\}$ of members of F, we say f(x) = f(y) if and
only if $f_i(x) = f_i(y)$ for all i = 1, 2, . . . , r. Notice that this
construction mirrors the effect of the r rows in a single band: the band
makes x and y a candidate pair if every one of the r rows in the band
says that x and y are equal (and therefore a candidate pair according
to that row).
Since the members of F are independently chosen to make a member
of F′, we can assert that F′ is a $(d_1, d_2, (p_1)^r, (p_2)^r)$-sensitive
family. That is, for any p, if p is the probability that a member of F
will declare (x, y) to be a candidate pair, then the probability that a
member of F′ will so declare is $p^r$.
There is another construction, which we call the OR-construction, that
turns a (d1, d2, p1, p2)-sensitive family F into a
$(d_1, d_2, 1 - (1 - p_1)^b, 1 - (1 - p_2)^b)$-sensitive family F′. Each
member f of F′ is constructed from b members of F, say
$f_1, f_2, \ldots, f_b$. We define f(x) = f(y) if and only if
$f_i(x) = f_i(y)$ for one or more values of i. The OR-construction
mirrors the effect of combining several bands: x and y become a
candidate pair if any band makes them a candidate pair.
If p is the probability that a member of F will declare (x, y) to be a
candidate pair, then 1 − p is the probability it will not so declare.
$(1 - p)^b$ is the probability that none of $f_1, f_2, \ldots, f_b$ will
declare (x, y) a candidate pair, and $1 - (1 - p)^b$ is the probability
that at least one $f_i$ will declare (x, y) a candidate pair, and
therefore that f will declare (x, y) to be a candidate pair.
Notice that the AND-construction lowers all probabilities, but if we
choose F and r judiciously, we can make the small probability p2 get
very close to 0, while the higher probability p1 stays significantly
away from 0. Similarly, the OR-construction makes all probabilities
rise, but by choosing F and b judiciously, we can make the larger
probability approach 1 while the smaller probability remains bounded
away from 1. We can cascade AND- and OR-constructions in any order to
make the low probability close to 0 and the high probability close to
1. Of course the more constructions we use, and the higher the values
of r and b that we pick, the larger the number of functions from the
original family that we are forced to use. Thus, the better the final
family of functions is, the longer it takes to apply the functions from
this family.
Example 3.18 : Suppose we start with a family F. We use the
AND-construction with r = 4 to produce a family F1. We then apply
the OR-construction to F1 with b = 4 to produce a third family F2.
Note that the members of F2 each are built from 16 members of F, and
the situation is analogous to starting with 16 minhash functions and
treating them as four bands of four rows each.
p     $1 - (1 - p^4)^4$
0.2   0.0064
0.3   0.0320
0.4   0.0985
0.5   0.2275
0.6   0.4260
0.7   0.6666
0.8   0.8785
0.9   0.9860
Figure 3.10: Effect of the 4-way AND-construction followed by the
4-way OR-construction
The 4-way AND-construction converts any probability p into $p^4$.
When we follow it by the 4-way OR-construction, that probability
is further converted into $1 - (1 - p^4)^4$. Some values of this
transformation are indicated in Fig. 3.10. This function is an S-curve,
staying low for a while, then rising steeply (although not too steeply;
the slope never gets much higher than 2), and then leveling off at high
values. Like any S-curve, it has a fixed point, the value of p that is
left unchanged when we apply the function of the S-curve. In this
case, the fixed point is the value of p for which $p = 1 - (1 - p^4)^4$.
We can see that the fixed point is somewhere between 0.7 and 0.8. Below
that value, probabilities are decreased, and above it they are increased.
Thus, if we pick a high probability above the fixed point and a low
probability below it, we shall have the desired effect that the low
probability is decreased and the high probability is increased. Suppose
F is the minhash functions, regarded as a (0.2, 0.6, 0.8, 0.4)-sensitive
family. Then F2, the family constructed by a 4-way AND
followed by a 4-way OR, is a (0.2, 0.6, 0.8785, 0.0985)-sensitive
family, as we can read from the rows for 0.8 and 0.4 in Fig. 3.10. By
replacing F by F2, we have reduced both the false-negative and
false-positive rates, at the cost of making application of the functions
take 16 times as long.
p     $(1 - (1 - p)^4)^4$
0.1   0.0140
0.2   0.1215
0.3   0.3334
0.4   0.5740
0.5   0.7725
0.6   0.9015
0.7   0.9680
0.8   0.9936
Figure 3.11: Effect of the 4-way OR-construction followed by the
4-way AND-construction
Example 3.19 : For the same cost, we can apply a 4-way
OR-construction followed by a 4-way AND-construction. Figure 3.11
gives the transformation on probabilities implied by this
construction. For instance, suppose that F is a (0.2, 0.6, 0.8, 0.4)-sensitive
family. Then the constructed family is a (0.2, 0.6, 0.9936, 0.5740)-sensitive
family. This choice is not necessarily the best.
Although the higher probability has moved much closer to 1, the lower
probability has also risen, increasing the number of false positives.
Example 3.20 : We can cascade constructions as much as we like. For
example, we could use the construction of Example 3.18 on the
family of minhash functions and then use the construction of Example
3.19 on the resulting family. The constructed family would then have
functions each built from 256 minhash functions. It would, for instance,
transform a (0.2, 0.8, 0.8, 0.2)-sensitive family into a (0.2, 0.8,
0.9991285, 0.0000004)-sensitive family.
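The probabilities quoted in Examples 3.18 through 3.20 follow from composing two simple functions, which the Python sketch below does. The function names are assumptions made for illustration, and the bisection routine used to locate the fixed point of the AND-then-OR curve is a generic numerical method chosen here, not something prescribed by the text.

```python
def and_construction(p, r):
    """Probability after an r-way AND-construction."""
    return p ** r


def or_construction(p, b):
    """Probability after a b-way OR-construction."""
    return 1.0 - (1.0 - p) ** b


def and_then_or(p, r=4, b=4):
    """The construction of Example 3.18."""
    return or_construction(and_construction(p, r), b)


def or_then_and(p, b=4, r=4):
    """The construction of Example 3.19."""
    return and_construction(or_construction(p, b), r)


def cascade(p):
    """Example 3.20: Example 3.18's construction, then Example 3.19's."""
    return or_then_and(and_then_or(p))


def fixed_point(f, low, high, iterations=60):
    """Bisection for p with f(p) = p, assuming f(low) < low and f(high) > high."""
    for _ in range(iterations):
        mid = (low + high) / 2
        if f(mid) < mid:
            low = mid
        else:
            high = mid
    return (low + high) / 2


print(round(and_then_or(0.8), 4), round(and_then_or(0.4), 4))  # 0.8785 0.0985
print(round(or_then_and(0.8), 4), round(or_then_and(0.4), 4))  # 0.9936 0.574
print(cascade(0.8), cascade(0.2))
fp = fixed_point(and_then_or, 0.7, 0.8)
print(fp)  # lies between 0.7 and 0.8
```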
3.8.4 Exercises for Section 3.6
Exercise 3.6.1 : What is the effect on probability of starting with the
family of minhash functions and applying:
(a) A 2-way AND construction followed by a 3-way OR
construction.
(b) A 3-way OR construction followed by a 2-way AND
construction.
(c) A 2-way AND construction followed by a 2-way OR construction,
followed by a 2-way AND construction.
(d) A 2-way OR construction followed by a 2-way AND construction,
followed by a 2-way OR construction followed by a 2-way AND
construction.
Exercise 3.6.2 : Find the fixed points for each of the functions
constructed in Exercise 3.6.1.
! Exercise 3.6.3 : Any function of probability p, such as that of Fig.
3.10, has a slope given by the derivative of the function. The maximum
slope is where that derivative is a maximum. Find the value of p that
gives a maximum slope for the S-curves given by Fig. 3.10 and Fig.
3.11. What are the values of these maximum slopes?
!! Exercise 3.6.4 : Generalize Exercise 3.6.3 to give, as a function of r
and b, the point of maximum slope and the value of that slope, for
families of functions defined from the minhash functions by:
(a) An r-way AND construction followed by a b-way OR
construction.
(b) A b-way OR construction followed by an r-way AND
construction.
3.9 LSH FAMILIES FOR OTHER DISTANCE
MEASURES
There is no guarantee that a distance measure has a locality-sensitive
family of hash functions. So far, we have only seen such families for
the Jaccard distance. In this section, we shall show how to construct
locality-sensitive families for Hamming distance, the cosine distance,
and for the normal Euclidean distance.
3.9.1 LSH Families for Hamming Distance
It is quite simple to build a locality-sensitive family of functions for
the Hamming distance. Suppose we have a space of d-dimensional
vectors, and h(x, y) denotes the Hamming distance between vectors x
and y. If we take any one position of the vectors, say the ith
position, we can define the function fi(x) to be the ith bit of vector
x. Then fi(x) = fi(y) if and only if vectors x and y agree in the ith
position. Then the probability that fi(x) = fi(y) for a randomly
chosen i is exactly 1 − h(x, y)/d; i.e., it is the fraction of positions in
which x and y agree.
This situation is almost exactly like the one we encountered for
minhashing.
Thus, the family F consisting of the functions {f1, f2, . . . , fd} is
a (d1, d2, 1 − d1/d, 1 − d2/d)-sensitive family of hash functions, for any d1
< d2. There are only two differences between this family and the
family of minhash functions.
1. While Jaccard distance runs from 0 to 1, the Hamming
distance on a vector space of dimension d runs from 0 to d. It is
therefore necessary to scale the distances by dividing by d, to turn
them into probabilities.
2. While there is essentially an unlimited supply of minhash
functions, the size of the family F for Hamming distance is only d.
The first point is of no consequence; it only requires that we divide by
d at appropriate times. The second point is more serious. If d is
relatively small, then we are limited in the number of functions that
can be composed using the AND and OR constructions, thereby
limiting how steep we can make the S-curve.
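The Hamming-distance family can be sketched in a few lines of Python. This is only an illustration under our own assumptions: vectors are lists of bits, and the helper names (hamming_distance, make_hamming_family, collision_probability) are invented for the example rather than taken from the text.

```python
def hamming_distance(x, y):
    """Number of positions in which bit-vectors x and y differ."""
    return sum(a != b for a, b in zip(x, y))


def make_hamming_family(d):
    """Return the d functions f_0 .. f_{d-1}; f_i(x) is the ith bit of x."""
    return [lambda x, i=i: x[i] for i in range(d)]


def collision_probability(x, y):
    """Exact probability that a uniformly chosen f_i has f_i(x) == f_i(y)."""
    family = make_hamming_family(len(x))
    agree = sum(f(x) == f(y) for f in family)
    return agree / len(family)


x = [0, 1, 1, 0, 1, 0, 0, 1]
y = [0, 1, 0, 0, 1, 1, 0, 1]
d = len(x)

# Both quantities are the fraction of positions in which x and y agree.
print(collision_probability(x, y), 1 - hamming_distance(x, y) / d)
```

Because the family has only d members, the collision probability can be computed exactly by enumeration, which is also why the supply of functions for the AND and OR constructions is limited.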
3.9.2 Random Hyperplanes and the Cosine Distance
Recall from Section 3.5.4 that the cosine distance between two vectors
is the angle between the vectors. For instance, we see in Fig. 3.12
two vectors x and y that make an angle θ between them. Note that
these vectors may be in a space of many dimensions, but they always
define a plane, and the angle between them is measured in this plane.
Figure 3.12 is a “top-view” of the plane containing x and y.
Figure 3.12: Two vectors make an angle θ
Suppose we pick a hyperplane through the origin. This hyperplane
intersects the plane of x and y in a line. Figure 3.12 suggests two
possible hyperplanes, one whose intersection is the dashed line and the
other’s intersection is the dotted line. To pick a random hyperplane,
we actually pick the normal vector to the hyperplane, say v. The
hyperplane is then the set of points whose dot product with v is 0.
First, consider a vector v that is normal to the hyperplane whose
projection is represented by the dashed line in Fig. 3.12; that is, x and
y are on different sides of the hyperplane. Then the dot products v.x
and v.y will have different signs. If we assume, for instance, that v is a
vector whose projection onto the plane of x and y is above the dashed
line in Fig. 3.12, then v.x is positive, while v.y is negative. The
normal vector v instead might extend in the opposite direction, below
the dashed line. In that case v.x is negative and v.y is positive, but the
signs are still different.
On the other hand, the randomly chosen vector v could be normal to a
hyperplane like the dotted line in Fig. 3.12. In that case, both v.x
and v.y have the same sign. If the projection of v extends to the right,
then both dot products are positive, while if v extends to the left, then
both are negative.
What is the probability that the randomly chosen vector is normal
to a hyperplane that looks like the dashed line rather than the dotted
line? All angles for the line that is the intersection of the random
hyperplane and the plane of x and y are equally likely. Thus, the
hyperplane will look like the dashed line with probability θ/180 and
will look like the dotted line otherwise.
Thus, each hash function f in our locality-sensitive family F is
built from a randomly chosen vector vf. Given two vectors x and
y, say f(x) = f(y) if and only if the dot products vf.x and vf.y
have the same sign. Then F is a locality-sensitive family for the
cosine distance. The parameters are essentially the same as for the
Jaccard-distance family described in Section 3.6.2, except the scale of
distances is 0–180 rather than 0–1. That is, F is a (d1, d2, (180 −
d1)/180, (180 − d2)/180)-sensitive family of hash functions. From this
basis, we can amplify the family as we wish, just as for the
minhash-based family.
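The claim that two vectors at angle θ agree with probability (180 − θ)/180 can be checked empirically. The following is a minimal simulation sketch, not a definitive implementation. It assumes that a random normal vector is drawn with independent Gaussian components (one common way to get a uniformly random direction); the function names, the trial count and the seed are our own choices.

```python
import math
import random


def random_hyperplane_hash(dim, rng):
    """Return f(x) = sign of v.x for a random normal vector v.

    Gaussian components are an assumption of this sketch; they give a
    direction that is uniformly distributed over all directions.
    """
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    return lambda x: sum(vi * xi for vi, xi in zip(v, x)) >= 0


def angle_degrees(x, y):
    """Angle between x and y, i.e., their cosine distance in degrees."""
    dot = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return math.degrees(math.acos(max(-1.0, min(1.0, dot / norm))))


def estimate_agreement(x, y, trials=20000, seed=0):
    """Fraction of random hyperplane hashes for which f(x) == f(y)."""
    rng = random.Random(seed)
    same = 0
    for _ in range(trials):
        f = random_hyperplane_hash(len(x), rng)
        same += f(x) == f(y)
    return same / trials


x = [1.0, 0.0, 0.0]
y = [0.5, math.sqrt(3) / 2, 0.0]  # 60 degrees away from x
theta = angle_degrees(x, y)
print(theta, estimate_agreement(x, y), (180 - theta) / 180)
```

The estimate should be close to 2/3 for this pair, since the vectors are 60 degrees apart.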
3.9.3 Sketches
Instead of choosing a random vector from all possible vectors, it turns
out to be sufficiently random if we restrict our choice to vectors
whose components are +1 and −1. The dot product of any vector x
with a vector v of +1’s and −1’s is formed by adding the
components of x where v is +1 and then subtracting the other
components of x – those where v is −1.
If we pick a collection of random vectors, say v1, v2, . . . , vn, then we
can apply them to an arbitrary vector x by computing v1.x, v2.x, . . . ,
vn.x and then replacing any positive value by +1 and any negative
value by −1. The result is called the sketch of x. You can handle 0’s
arbitrarily, e.g., by choosing a result +1 or −1 at random. Since there is
only a tiny probability of a zero dot product, the choice has
essentially no effect.
Example 3.21 : Suppose our space consists of 4-dimensional vectors,
and we pick three random vectors: v1 = [+1, −1, +1, +1], v2 = [−1,
+1, −1, +1], and v3 = [+1, +1, −1, −1]. For the vector x = [3, 4, 5, 6],
the sketch is [+1, +1, −1].
That is, v1.x = 3−4+5+6 = 10. Since the result is positive, the first
component of the sketch is +1. Similarly, v2.x = 2 and v3.x = −4, so
the second component of the sketch is +1 and the third component
is −1.
Consider the vector y = [4, 3, 2, 1]. We can similarly compute its
sketch to be [+1, −1, +1]. Since the sketches for x and y agree in 1/3
of the positions, we estimate that the angle between them is 120
degrees. That is, a randomly chosen hyperplane is twice as likely to
look like the dashed line in Fig. 3.12 as like the dotted line.
The above conclusion turns out to be quite wrong. We can calculate
the cosine of the angle between x and y to be x.y, which is
6 × 1 + 5 × 2 + 4 × 3 + 3 × 4 = 40
divided by the magnitudes of the two vectors. These magnitudes are
√(6² + 5² + 4² + 3²) = 9.274
and √(1² + 2² + 3² + 4²) = 5.477. Thus, the cosine of the angle
between x and y is 0.7875, and this angle is about 38 degrees.
However, if you look at all 16 different vectors v of length 4 that
have +1 and −1 as components, you find that there are only four of
these whose dot products with x and y have a different sign,
namely v2, v3, and their complements [+1, −1, +1, −1] and [−1, −1,
+1, +1]. Thus, had we picked all sixteen of these vectors to form a
sketch, the estimate of the angle would have been 180/4 = 45 degrees.
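The arithmetic of Example 3.21 is easy to reproduce. The sketch below recomputes both sketches and the two angle estimates. The function names are ours, and mapping a zero dot product to +1 is just one of the arbitrary choices the text permits.

```python
from itertools import product


def dot(v, x):
    return sum(a * b for a, b in zip(v, x))


def sketch(x, vectors):
    """+1 for a positive dot product, -1 for a negative one.

    A zero dot product is mapped to +1 here; the text allows any choice.
    """
    return [1 if dot(v, x) >= 0 else -1 for v in vectors]


def estimated_angle(sx, sy):
    """180 degrees times the fraction of sketch positions that disagree."""
    disagree = sum(a != b for a, b in zip(sx, sy))
    return 180.0 * disagree / len(sx)


x, y = [3, 4, 5, 6], [4, 3, 2, 1]
three = [[+1, -1, +1, +1], [-1, +1, -1, +1], [+1, +1, -1, -1]]
all16 = [list(v) for v in product([+1, -1], repeat=4)]

# Sketches of x and y under the three vectors of Example 3.21.
print(sketch(x, three), sketch(y, three))
# Angle estimate from three vectors, then from all sixteen.
print(estimated_angle(sketch(x, three), sketch(y, three)))
print(estimated_angle(sketch(x, all16), sketch(y, all16)))
```

With only three random vectors the estimate is coarse; using all sixteen vectors brings it much closer to the true angle of about 38 degrees.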
3.9.4 LSH Families for Euclidean Distance
Now, let us turn to the Euclidean distance (Section 3.5.2), and see if
we can develop a locality-sensitive family of hash functions for this
distance. We shall start with a 2-dimensional Euclidean space. Each
hash function f in our family F will be associated with a randomly
chosen line in this space. Pick a constant a and divide the line into
segments of length a, as suggested by Fig. 3.13, where the “random”
line has been oriented to be horizontal.
The segments of the line are the buckets into which function f hashes
points. A point is hashed to the bucket in which its projection onto the
line lies. If the distance d between two points is small compared with
a, then there is a good chance the two points hash to the same bucket,
and thus the hash function f will declare the two points equal. For
example, if d = a/2, then there is at least a 50% chance the two points
will fall in the same bucket. In fact, if the angle θ between the
randomly chosen line and the line connecting the points is large, then
there is an even greater chance that the two points will fall in the same
bucket. For instance, if θ is 90 degrees, then the two points are
certain to fall in the same bucket.
However, suppose d is larger than a. In order for there to be any
chance of the two points falling in the same bucket, we need d cos θ ≤ a.
The diagram of Fig. 3.13 suggests why this requirement holds. Note
that even if d cos θ ≪ a it is still not certain that the two points will
fall in the same bucket. However, we can guarantee the following. If
d ≥ 2a, then there is no more than a 1/3 chance the two points fall in the
same bucket. The reason is that d cos θ ≤ a then requires cos θ ≤ 1/2,
and for cos θ to be at most 1/2, we need to have θ in the range 60 to 90
degrees. If θ is in the range 0 to 60 degrees, then cos θ is more than
1/2. But since θ is the smaller angle between two randomly chosen
lines in the plane, θ is twice as likely to be between 0 and 60 as it is
to be between 60 and 90.
Figure 3.13: Two points at distance d ≫ a have a small chance of being
hashed to the same bucket
We conclude that the family F just described forms a (a/2, 2a, 1/2,
1/3)-sensitive family of hash functions. That is, for distances up to
a/2 the probability is at least 1/2 that two points at that distance will
fall in the same bucket, while for distances at least 2a the probability
points at that distance will fall in the same bucket is at most 1/3. We
can amplify this family as we like, just as for the other examples of
locality-sensitive hash functions we have discussed.
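The two bounds of the (a/2, 2a, 1/2, 1/3)-sensitive family can be checked with a small simulation. This is only a sketch under stated assumptions: the random line passes through the origin at a uniformly random angle, and a random offset in [0, a) models the random placement of the bucket boundaries. The function names, trial count and seed are ours.

```python
import math
import random


def bucket(point, theta, a, offset):
    """Index of the length-a segment that contains the projection of a
    2-D point onto the line through the origin at angle theta (radians)."""
    projection = point[0] * math.cos(theta) + point[1] * math.sin(theta)
    return math.floor((projection + offset) / a)


def same_bucket_rate(d, a, trials=20000, seed=1):
    """Monte Carlo estimate of the chance that two points at distance d
    are hashed to the same bucket."""
    rng = random.Random(seed)
    p, q = (0.0, 0.0), (d, 0.0)
    hits = 0
    for _ in range(trials):
        theta = rng.uniform(0.0, math.pi)  # random direction of the line
        offset = rng.uniform(0.0, a)       # random position of the boundaries
        hits += bucket(p, theta, a, offset) == bucket(q, theta, a, offset)
    return hits / trials


a = 1.0
print(same_bucket_rate(a / 2, a))  # should be at least 1/2
print(same_bucket_rate(2 * a, a))  # should be at most 1/3
```

The simulated rates are typically well inside the two bounds, which is consistent with the text: 1/2 and 1/3 are guarantees, not exact probabilities.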
3.9.5 More LSH Families for Euclidean Spaces
There is something unsatisfying about the family of hash functions
developed in Section 3.9.4. First, the technique was only described for
two-dimensional Euclidean spaces. What happens if our data is
points in a space with many dimensions? Second, for Jaccard and
cosine distances, we were able to develop locality-sensitive families
for any pair of distances d1 and d2 as long as d1 < d2. In Section 3.9.4
we appear to need the stronger condition 4d1 ≤ d2.
However, we claim that there is a locality-sensitive family of hash
functions for any d1 < d2 and for any number of dimensions. The
family’s hash functions still derive from random lines through the
space and a bucket size a that partitions the line. We still hash
points by projecting them onto the line. Given that d1 < d2, we may
not know what the probability p1 is that two points at distance d1
hash to the same bucket, but we can be certain that it is greater than
p2, the probability that two points at distance d2 hash to the same
bucket. The reason is that this probability surely grows as the
distance shrinks. Thus, even if we cannot calculate p1 and p2 easily,
we know that there is a (d1, d2, p1, p2)-sensitive family of hash
functions for any d1 < d2 and any given number of dimensions.
Using the amplification techniques of Section 3.6.3, we can then adjust
the two probabilities to surround any particular value we like, and to
be as far apart as we like. Of course, the further apart we want the
probabilities to be, the larger the number of basic hash functions in F
we must use.
3.9.6 Exercises for Section 3.7
Exercise 3.7.1 : Suppose we construct the basic family of six
locality-sensitive functions for vectors of length six. For each pair of the
vectors 000000, 110011, 010101, and 011100, which of the six
functions makes them candidates?
Exercise 3.7.2 : Let us compute sketches using the following four
“random” vectors:
v1 = [+1, +1, +1, −1] v2 = [+1, +1, −1, +1]
v3 = [+1, −1, +1, +1] v4 = [−1, +1, +1, +1]
Compute the sketches of the following vectors.
(a) [2, 3, 4, 5].
(b) [−2, 3, −4, 5].
(c) [2, −3, 4, −5].
For each pair, what is the estimated angle between them, according to
the sketches? What are the true angles?
Exercise 3.7.3 : Suppose we form sketches by using all sixteen of the
vectors of length 4, whose components are each +1 or −1. Compute
the sketches of the three vectors in Exercise 3.7.2. How do the estimates
of the angles between each pair compare with the true angles?
Exercise 3.7.4 : Suppose we form sketches using the four vectors from
Exercise 3.7.2.
! (a) What are the constraints on a, b, c, and d that will cause the
sketch of the vector [a, b, c, d] to be [+1, +1, +1, +1]?
!! (b) Consider two vectors [a, b, c, d] and [e, f, g, h]. What are the
conditions on a, b, . . . , h that will make the sketches of these two
vectors be the same?
Exercise 3.7.5 : Suppose we have points in a 3-dimensional Euclidean
space: p1 = (1, 2, 3), p2 = (0, 2, 4), and p3 = (4, 3, 2). Consider the
three hash functions defined by the three axes (to make our
calculations very easy). Let buckets be of length a, with one bucket
the interval [0, a) (i.e., the set of points x such that 0 ≤ x < a), the next
[a, 2a), the previous one [−a, 0), and so on.
(a) For each of the three lines, assign each of the points to
buckets, assuming a = 1.
(b) Repeat part (a), assuming a = 2.
(c) What are the candidate pairs for the cases a = 1 and a = 2?
(d) For each pair of points, for what values of a will that pair be a
candidate pair?
3.10 APPLICATIONS OF LOCALITY-SENSITIVE
HASHING
In this section, we shall explore three examples of how LSH is used
in practice. In each case, the techniques we have learned must be
modified to meet certain constraints of the problem. The three
subjects we cover are:
1. Entity Resolution : This term refers to matching data records that
refer to the same real-world entity, e.g., the same person. The principal
problem addressed here is that the similarity of records does not match
exactly either the similar-sets or similar-vectors models of similarity
on which the theory is built.
2. Matching Fingerprints : It is possible to represent fingerprints as
sets. However, we shall explore a different family of locality-sensitive
hash functions from the one we get by minhashing.
3. Matching Newspaper Articles : Here, we consider a different notion
of shingling that focuses attention on the core article in an on-line
newspaper’s Web page, ignoring all the extraneous material such as
ads and newspaper-specific material.
3.10.1 Entity Resolution
It is common to have several data sets available, and to know that they
refer to some of the same entities. For example, several different
bibliographic sources provide information about many of the same
books or papers. In the general case, we have records describing
entities of some type, such as people or books. The records may all
have the same format, or they may have different formats, with
different kinds of information.
There are many reasons why information about an entity may vary,
even if the field in question is supposed to be the same. For example,
names may be expressed differently in different records because of
misspellings, absence of a middle initial, use of a nickname, and
many other reasons. For instance, “Bob S. Jomes” and “Robert Jones
Jr.” may or may not be the same person. If records come from
different sources, the fields may differ as well. One source’s records
may have an “age” field, while another does not. The second source
might have a “date of birth” field, or it may have no information at all
about when a person was born.
3.10.2 An Entity-Resolution Example
We shall examine a real example of how LSH was used to deal with an
entity-resolution problem. Company A was engaged by Company B to
solicit customers for B. Company B would pay A a yearly fee, as
long as the customer maintained their subscription. They later
quarreled and disagreed over how many customers A had provided to
B. Each had about 1,000,000 records, some of which described the
same people; those were the customers A had provided to B. The
records had different data fields, but unfortunately none of those fields
was “this is a customer that A had provided to B.” Thus, the
problem was to match records from the two sets to see if a pair
represented the same person.
Each record had fields for the name, address, and phone number of the
person. However, the values in these fields could differ for many
reasons. Not only were there the misspellings and other naming
differences mentioned in Section 3.10.1, but there were other
opportunities to disagree as well. A customer might give their home
phone to A and their cell phone to B. Or they might move, and tell B
but not A (because they no longer had need for a relationship with A).
Area codes of phones sometimes change.
The strategy for identifying records involved scoring the differences in
three fields: name, address, and phone. To create a score describing the
likelihood that two records, one from A and the other from B,
described the same person, 100 points were assigned to each of the
three fields, so records with exact matches in all three fields got a score
of 300. However, there were deductions for mismatches in each of the
three fields. As a first approximation, edit-distance (Section 3.5.5) was
used, but the penalty grew quadratically with the distance. Then,
certain publicly available tables were used to reduce the penalty in
appropriate situations. For example, “Bill” and “William” were treated as
if they differed in only one letter, even though their edit-distance is 5.
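The scoring idea can be illustrated with a short sketch. The text does not give the actual penalty formula or the contents of the nickname tables, so everything specific here is an assumption made for illustration: the constant penalty_per_unit, the NICKNAMES table, and the function names. The edit distance counts insertions and deletions only, which is the definition consistent with the distance of 5 quoted for “Bill” and “William.”

```python
def edit_distance(s, t):
    """Insert/delete edit distance: len(s) + len(t) - 2 * LCS(s, t)."""
    prev = [0] * (len(t) + 1)
    for a in s:
        cur = [0]
        for j, b in enumerate(t, start=1):
            cur.append(prev[j - 1] + 1 if a == b else max(prev[j], cur[j - 1]))
        prev = cur
    return len(s) + len(t) - 2 * prev[-1]


# Hypothetical stand-in for the "publicly available tables" of nicknames.
NICKNAMES = {("Bill", "William")}


def field_score(u, v, penalty_per_unit=4):
    """100 points minus a penalty that grows with the square of the distance.

    penalty_per_unit is an invented constant, not a value from the text.
    """
    if (u, v) in NICKNAMES or (v, u) in NICKNAMES:
        dist = 1  # treated as differing in only one letter
    else:
        dist = edit_distance(u, v)
    return max(0, 100 - penalty_per_unit * dist * dist)


def record_score(r1, r2):
    """Sum of the three field scores; 300 means an exact match everywhere."""
    return sum(field_score(r1[k], r2[k]) for k in ("name", "address", "phone"))


a_rec = {"name": "Bill", "address": "12 Main St", "phone": "555-0100"}
b_rec = {"name": "William", "address": "12 Main St", "phone": "555-0100"}
print(edit_distance("Bill", "William"), record_score(a_rec, b_rec))
```

Any real system would have to tune the penalty and the tables against known matches; the quadratic shape is the only part taken from the description above.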
However, it is not feasible to score all one trillion pairs of records.
Thus, a simple LSH was used to focus on likely candidates. Three
“hash functions” were used. The first sent records to the same bucket
only if they had identical names; the second did the same but for
identical addresses, and the third did the same for phone numbers. In
practice, there was no hashing; rather the records were sorted by name,
so records with identical names would appear consecutively and get
scored for overall similarity of the name, address, and phone. Then the
records were sorted by address, and those with the same
address were scored. Finally, the records were sorted a third time by
phone, and records with identical phones were scored.
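The three sorting passes can be sketched as follows. This is an illustration only: records are assumed to be dictionaries with “name,” “address,” and “phone” keys, and the sample records are invented. Only pairs that agree exactly on at least one field are produced, and only those pairs would then be scored.

```python
from itertools import groupby


def candidate_pairs(records_a, records_b, fields=("name", "address", "phone")):
    """Pairs (i, j) of A- and B-record indices that agree exactly on at
    least one field, found by sorting on each field in turn."""
    candidates = set()
    for field in fields:
        tagged = [(r[field], "A", i) for i, r in enumerate(records_a)]
        tagged += [(r[field], "B", j) for j, r in enumerate(records_b)]
        tagged.sort()  # records with identical values become consecutive
        for _, group in groupby(tagged, key=lambda t: t[0]):
            group = list(group)
            a_ids = [i for _, src, i in group if src == "A"]
            b_ids = [j for _, src, j in group if src == "B"]
            candidates.update((i, j) for i in a_ids for j in b_ids)
    return candidates


# Invented sample data for illustration.
records_a = [
    {"name": "Ann Lee", "address": "1 Elm St", "phone": "555-0100"},
    {"name": "Raj Patel", "address": "9 Oak Ave", "phone": "555-0111"},
]
records_b = [
    {"name": "Ann Lee", "address": "22 Pine Rd", "phone": "555-0199"},
    {"name": "R. Patel", "address": "9 Oak Ave", "phone": "555-0111"},
    {"name": "Zoe Kim", "address": "3 Ash Ct", "phone": "555-0123"},
]
print(sorted(candidate_pairs(records_a, records_b)))
```

Sorting costs O(n log n) per field, which is what makes the approach workable when scoring all pairs is not.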
This approach missed any record pair that truly represented the same
person but in which none of the three fields matched exactly. Since the
goal was to prove in a court of law that the persons were the same, it is
unlikely that such a pair would have been accepted by a judge as
sufficiently similar anyway.
3.10.3 Validating Record Matches
What remains is to determine how high a score indicates that two
records truly represent the same individual. In the example at hand,
there was an easy way to make that decision, and the technique can be
applied in many similar situations. It was decided to look at the
creation-dates for the records at hand, and to assume that 90 days was
an absolute maximum delay between the time the service was bought
at Company A and registered at B. Thus, a proposed match between
two records that were chosen at random, subject only to the constraint
that the date on the B-record was between 0 and 90 days after the date
on the A-record, would have an average delay of 45 days.
It was found that of the pairs with a perfect 300 score, the average
delay was 10 days. If you assume that 300-score pairs are surely correct
matches, then you can look at the pool of pairs with any given score s,
and compute the average delay of those pairs. Suppose that the
average delay is x, and the fraction of true matches among those pairs
with score s is f. Then x = 10f + 45(1 − f), or x = 45 − 35f. Solving
for f, we find that the fraction of the pairs with score s that are truly
matches is (45 − x)/35.
When Are Record Matches Good Enough?
While every case will be different, it may be of interest to
know how the experiment of Section 3.10.3 turned out on
the data of Section 3.10.2. For scores down to 185, the
value of x was very close to 10; i.e., these scores
indicated that the likelihood of the records representing
the same person was essentially 1. Note that a score of
185 in this example represents a situation where one
field is the same (as would have to be the case, or the
records would never even be scored), one field was
completely different, and the third field had a small
discrepancy. Moreover, for scores as low as 115, the value
of x was noticeably less than 45, meaning that some of
these pairs did represent the same person.
The same trick can be used whenever:
1. There is a scoring system used to evaluate the likelihood that
two records represent the same entity, and
2. There is some field, not used in the scoring, from which
we can derive a measure that differs, on average, for true pairs and
false pairs.
For instance, suppose there were a “height” field recorded by both
companies A and B in our running example. We can compute the
average difference in height for pairs of random records, and we can
compute the average difference in height for records that have a perfect
score (and thus surely represent the same entities). For a given score s,
we can evaluate the average height difference of the pairs with that score
and estimate the probability of the records representing the same
entity. That is, if h0 is the average height difference for the perfect
matches, h1 is the average height difference for random pairs, and h is
the average height difference for pairs of score s, then the fraction of
good pairs with score s is (h1 − h)/(h1 − h0).
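Both the delay formula (45 − x)/35 and the height formula (h1 − h)/(h1 − h0) are instances of the same linear interpolation. A minimal sketch, with a function name and parameter names of our own choosing:

```python
def fraction_true_matches(observed, perfect_avg, random_avg):
    """Fraction f that solves observed = f * perfect_avg + (1 - f) * random_avg."""
    return (random_avg - observed) / (random_avg - perfect_avg)


# Delay example: perfect matches average 10 days, random pairs average 45 days.
for x in (10, 20, 45):
    print(x, fraction_true_matches(x, perfect_avg=10, random_avg=45))
```

An observed average equal to the perfect-match average gives f = 1, and an observed average equal to the random-pair average gives f = 0, as the formulas require.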
3.10.4 Matching Fingerprints
When fingerprints are matched by computer, the usual representation
is not an image, but a set of locations in which minutiae are
located. A minutia, in the context of fingerprint descriptions, is a place
where something unusual happens, such as two ridges merging or a
ridge ending. If we place a grid over a fingerprint, we can represent the
fingerprint by the set of grid squares in which minutiae are located.
Ideally, before overlaying the grid, fingerprints are normalized for size
and orientation, so that if we took two images of the same finger, we
would find minutiae lying in exactly the same grid squares. We
shall not consider here the best ways to normalize images. Let us
assume that some combination of techniques, including choice of grid
size and placing a minutia in several adjacent grid squares if it lies close
to the border of the squares, enables us to assume that grid squares
from two images of the same finger have a significantly higher
probability of agreeing in the presence or absence of a minutia than if
they were from images of different fingers.
Thus, fingerprints can be represented by sets of grid squares – those
where their minutiae are located – and compared like any sets, using
the Jaccard similarity or distance. There are two versions of
fingerprint comparison, however.
The many-one problem is the one we typically expect. A fingerprint
has been found on a gun, and we want to compare it with all the
fingerprints in a large database, to see which one matches.
The many-many version of the problem is to take the entire database,
and see if there are any pairs that represent the same individual.
While the many-many version matches the model that we have been
following for finding similar items, the same technology can be used to
speed up the many-one problem.
3.10.5 A LSH Family for Fingerprint Matching
We could minhash the sets that represent a fingerprint, and use the
standard LSH technique from Section 3.4. However, since the sets
are chosen from a relatively small set of grid points (perhaps 1000),
the need to minhash them into more succinct signatures is not clear.
We shall study here another form of locality-sensitive hashing that
works well for data of the type we are discussing. Suppose for an
example that the probability of finding a minutia in a random grid
square of a random fingerprint is 20%. Also, assume that if two
fingerprints come from the same finger, and one has a minutia in a
given grid square, then the probability that the other does too is 80%.
We can define a locality-sensitive family of hash functions as
follows. Each function f in this family F is defined by three grid
squares. Function f says “yes” for two fingerprints if both have
minutiae in all three grid squares, and otherwise f says “no.” Put
another way, we may imagine that f sends to a single bucket all
fingerprints that have minutiae in all three of f’s grid points, and
sends each other fingerprint to a bucket of its own. In what follows,
we shall refer to the first of these buckets as “the” bucket for f and
ignore the buckets that are required to be singletons.
If we want to solve the many-one problem, we can use many functions
from the family F and precompute their buckets of fingerprints to
which they answer “yes.” Then, given a new fingerprint that we want
to match, we determine which of these buckets it belongs to and
compare it with all the fingerprints found in any of those buckets. To
solve the many-many problem, we compute the buckets for each of the
functions and compare all fingerprints in each of the buckets.
Let us consider how many functions we need to get a reasonable
probability of catching a match, without having to compare the
fingerprint on the gun with each of the millions of fingerprints in the
database. First, the probability that two fingerprints from different
fingers would be in the bucket for a function f in F is (0.2)⁶ =
0.000064. The reason is that they will both go into the bucket only if
they each have a minutia in each of the three grid points associated
with f, and the probability of each of those six independent events is
0.2.
Now, consider the probability that two fingerprints from the same
finger wind up in the bucket for f. The probability that the first
fingerprint has minutiae in each of the three squares belonging to f is
(0.2)³ = 0.008. However, if it does, then the probability is (0.8)³ =
0.512 that the other fingerprint will as well. Thus, if the
fingerprints are from the same finger, there is a 0.008 × 0.512 =
0.004096 probability that they will both be in the bucket of f. That
is not much; it is about one in 250. However, if we use many
functions from F, but not too many, then we can get a good
probability of matching fingerprints from the same finger while not
having too many false positives – fingerprints that must be
considered but do not match.
Example 3.22 : For a specific example, let us suppose that we use 1024
functions chosen randomly from F. Next, we shall construct a
new family F1 by performing a 1024-way OR on F. Then the
probability that F1 will put fingerprints from the same finger
together in at least one bucket is
1 − (1 − 0.004096)¹⁰²⁴ = 0.985. On the other hand, the probability
that two fingerprints from different fingers will be placed in the
same bucket is 1 − (1 − 0.000064)¹⁰²⁴ = 0.063. That is, we get
about 1.5% false negatives and about 6.3% false positives.
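The numbers in Example 3.22 follow directly from the 20% and 80% assumptions and can be recomputed in a few lines. The constant and function names below are ours; the probabilities are those stated in the text.

```python
P_MINUTIA = 0.2  # chance of a minutia in a random grid square
P_REPEAT = 0.8   # chance a second print of the same finger has it too
SQUARES = 3      # grid squares per hash function

# Probability that one function f puts both prints in its bucket.
p_same = (P_MINUTIA ** SQUARES) * (P_REPEAT ** SQUARES)  # same finger
p_diff = P_MINUTIA ** (2 * SQUARES)                       # different fingers


def or_construction(p, b):
    """Probability that at least one of b independent functions says yes."""
    return 1 - (1 - p) ** b


detect = or_construction(p_same, 1024)
false_pos = or_construction(p_diff, 1024)
print(round(p_same, 6), round(p_diff, 6))
print(round(detect, 3), round(false_pos, 3))
```

Changing the argument 1024 shows the trade-off described next: more functions raise the detection probability only slightly while the false-positive rate keeps growing.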
The result of Example 3.22 is not the best we can do. While it offers
only a 1.5% chance that we shall fail to identify the fingerprint on the
gun, it does force us to look at 6.3% of the entire database. Increasing
the number of functions from F will increase the number of false
positives, with only a small benefit of reducing the number of false
negatives below 1.5%. On the other hand, we can also use the AND
construction, and in so doing, we can greatly reduce the probability
of a false positive, while making only a small increase in the
false-negative rate. For instance, we could take 2048 functions from F in
two groups of 1024. Construct the buckets for each of the functions.
Then, given a fingerprint P on the gun:
1. Find the buckets from the first group in which P belongs, and
take the union of these buckets.
2. Do the same for the second group.
3. Take the intersection of the two unions.
4. Compare P only with those fingerprints in the intersection.
Note that we still have to take unions and intersections of large sets of
fingerprints, but we compare only a small fraction of those. It is the
comparison of fingerprints that takes the bulk of the time; in steps
(1) and (2) fingerprints can be represented by their integer indices in
the database.
If we use this scheme, the probability of detecting a matching
fingerprint is (0.985)² = 0.970; that is, we get about 3% false negatives.
However, the probability of a false positive is (0.063)² = 0.00397.
That is, we only have to examine about 1/250th of the database.
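The four-step lookup can be sketched with ordinary sets. This is a toy illustration under our own assumptions: a fingerprint is a set of grid-square indices, each hash function is a triple of grid squares, and the two groups contain one hand-picked function each instead of 1024 random ones. Database entries are referred to by their integer indices, as the text suggests.

```python
def build_buckets(functions, database):
    """For each function, the set of database indices whose fingerprints
    have minutiae in all three of the function's grid squares."""
    return [{i for i, fp in enumerate(database) if fp.issuperset(f)}
            for f in functions]


def union_of_buckets(probe, functions, buckets):
    """Steps 1 and 2: union of the buckets of the functions that say yes
    for the probe fingerprint."""
    result = set()
    for f, b in zip(functions, buckets):
        if probe.issuperset(f):
            result |= b
    return result


def candidates(probe, group1, buckets1, group2, buckets2):
    """Steps 3 and 4: only fingerprints in the intersection get compared."""
    return (union_of_buckets(probe, group1, buckets1)
            & union_of_buckets(probe, group2, buckets2))


# Invented toy database of fingerprints (sets of grid squares).
database = [{0, 1, 2, 3, 4, 5}, {0, 1, 2}, {3, 4, 5}, {6, 7, 8}]
group1, group2 = [(0, 1, 2)], [(3, 4, 5)]
buckets1 = build_buckets(group1, database)
buckets2 = build_buckets(group2, database)

probe = {0, 1, 2, 3, 4, 5, 9}
print(candidates(probe, group1, buckets1, group2, buckets2))

# Probabilities for the 2-way AND of two 1024-way ORs.
detect = (1 - (1 - 0.004096) ** 1024) ** 2
false_pos = (1 - (1 - 0.000064) ** 1024) ** 2
print(round(detect, 3), round(false_pos, 4))
```

The buckets are precomputed once; each query then needs only set unions and one intersection before any expensive fingerprint comparison is made.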
3.10.6 Similar News Articles
Our last case study concerns the problem of organizing a large
repository of on-line news articles by grouping together Web pages
that were derived from the same basic text. It is common for
organizations like The Associated Press to produce a news item and
distribute it to many newspapers. Each newspaper puts the story in its
on-line edition, but surrounds it by information that is special to that
newspaper, such as the name and address of the newspaper, links to
related articles, and links to ads. In addition, it is common for the
newspaper to modify the article, perhaps by leaving off the last few
paragraphs or even deleting text from the middle. As a result, the same
news article can appear quite different at the Web sites of different
newspapers.
The problem looks very much like the one that was suggested in
Section 3.4: find documents whose shingles have a high Jaccard
similarity. Note that this problem is different from the problem of
finding news articles that tell about the same events. The latter problem
requires other techniques, typically examining the set of important
words in the documents (a concept we discussed briefly in Section
1.3.1) and clustering them to group together different articles about the
same topic.
However, an interesting variation on the theme of shingling was found
to be more effective for data of the type described. The problem is that
shingling as we described it in Section 3.2 treats all parts of a
document equally. However, we wish to ignore parts of the document,
such as ads or the headlines of other articles to which the newspaper
added a link, that are not part of the news article. It turns out that there
is a noticeable difference between text that appears in prose and text
that appears in ads or headlines. Prose has a much greater frequency of
stop words, the very frequent words such as “the” or “and.” The total
number of words that are considered stop words varies with the
application, but it is common to use a list of several hundred of the
most frequent words.
Example 3.23 : A typical ad might say simply “Buy Sudzo.” On the
other hand, a prose version of the same thought that might appear in
an article is “I recommend that you buy Sudzo for your laundry.” In
the latter sentence, it would be normal to treat “I,” “that,” “you,”
“for,” and “your” as stop words.
munotes.in
Page 131
Shingling of Documents
Suppose we define a shingle to be a stop word followed by the next
two words. Then the ad “Buy Sudzo” from Example 3.23 has no
shingles and would not be reflected in the representation of the Web
page containing that ad. On the other hand, the sentence from Example
3.23 would be represented by five shingles: “I recommend that,” “that
you buy,” “you buy Sudzo,” “for your laundry,” and “your laundry x,”
where x is whatever word follows that sentence.
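The stop-word shingling just described can be sketched in a few lines of Python. This is only an illustration: the function name stop_word_shingles, the tiny STOP_WORDS set, and the simple word-splitting rule are assumptions made here, and a real application would use a list of several hundred stop words.

```python
import re

# Hypothetical, deliberately tiny stop-word list for illustration only.
STOP_WORDS = {"i", "that", "you", "for", "your", "the", "and"}

def stop_word_shingles(text, stop_words=STOP_WORDS, following=2):
    """Return the set of shingles: a stop word plus the next `following` words.

    A stop word that is too close to the end of the text to have
    `following` successors produces no shingle.
    """
    words = re.findall(r"[A-Za-z']+", text)
    shingles = set()
    for i, word in enumerate(words):
        if word.lower() in stop_words and i + following < len(words):
            shingles.add(" ".join(words[i:i + following + 1]))
    return shingles

if __name__ == "__main__":
    print(stop_word_shingles("Buy Sudzo."))
    print(stop_word_shingles(
        "I recommend that you buy Sudzo for your laundry. Then"))
```

With this word list, the ad yields no shingles, while the prose sentence yields the five shingles of the example, with the word after the sentence playing the role of x.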
Suppose we have two Web pages, each of which consists of half news
text and half ads or other material that has a low density of stop words.
If the news text is the same but the surrounding material is different,
then we would expect that a large fraction of the shingles of the two
pages would be the same. They might have a Jaccard similarity of
75%. However, if the surrounding material is the same but the news
content is different, then the number of common shingles would be
small, perhaps 25%. If we were to use the conventional shingling,
where shingles are (say) sequences of 10 consecutive characters, we
would expect the two documents to share half their shingles (i.e., a
Jaccard similarity of 1/3, since the shared half is counted once in a
union one and a half times the size of either document's shingle set),
regardless of whether it was the news or the surrounding material that
they shared.
3.10.7 Exercises for Section 3.8
Exercise 3.8.1 : Suppose we are trying to perform entity resolution
among bibliographic references, and we score pairs of references based
on the similarities of their titles, list of authors, and place of
publication. Suppose also that all references include a year of
publication, and this year is equally likely to be any of the ten most
recent years. Further, suppose that we discover that among the pairs of
references with a perfect score, there is an average difference in the
publication year of 0.1 (see footnote 6). Suppose that the pairs of
references with a certain score s are found to have an average
difference in their publication dates of 2. What is the fraction of pairs
with score s that truly represent the same publication? Note : Do not
make the mistake of assuming the average difference in publication date
between random pairs is 5 or 5.5. You need to calculate it exactly, and
you have enough information to do so.
Exercise 3.8.2 : Suppose we use the family F of functions described in
Section 3.8.5, where there is a 20% chance of a minutia in a grid
square, an 80% chance of a second copy of a fingerprint having a
minutia in a grid square where the first copy does, and each function in
F being formed from three grid squares. In Example 3.22, we
constructed family F1 by using the OR construction on 1024 members
of F. Suppose we instead used family F2 that is a 2048-way OR of
members of F.
(a) Compute the rates of false positives and false negatives for F2.
(b) How do these rates compare with what we get if we organize the
same 2048 functions into a 2-way AND of members of F1, as was
discussed at the end of Section 3.8.5?
Exercise 3.8.3 : Suppose fingerprints have the same statistics outlined
in Exercise 3.8.2, but we use a base family of functions F′ defined
like F, but using only two randomly chosen grid squares. Construct
another set of functions F′1 from F′ by taking the n-way OR of
functions from F′. What, as a function of n, are the false positive and
false negative rates for F′1?
Exercise 3.8.4 : Suppose we use the functions F1 from Example 3.22,
but we want to solve the many-many problem.
(a) If two fingerprints are from the same finger, what is the probability
that they will not be compared (i.e., what is the false negative
rate)?
(b) What fraction of the fingerprints from different fingers will be
compared (i.e., what is the false positive rate)?
! Exercise 3.8.5 : Assume we have the set of functions F as in
Exercise 3.8.2, and we construct a new set of functions F3 by an
n-way OR of functions in F. For what value of n is the sum of the
false positive and false negative rates minimized?
6 We might expect the average to be 0, but in practice, errors in
publication year do occur.
3.11 METHODS FOR HIGH DEGREES OF
SIMILARITY
LSH-based methods appear most effective when the degree of
similarity we accept is relatively low. When we want to find sets that
are almost identical, there are other methods that can be faster.
Moreover, these methods are exact, in that they find every pair of items
with the desired degree of similarity. There are no false negatives, as
there can be with LSH.
3.11.1 Finding Identical Items
The extreme case is finding identical items, for example, Web pages
that are identical, character-for-character. It is straightforward to
compare two documents and tell whether they are identical, but we
still must avoid having to compare every pair of documents. Our first
thought would be to hash documents based on their first few
characters, and compare only those documents that fell into the same
bucket. That scheme should work well, unless all the documents begin
with the same characters, such as an HTML header.
Our second thought would be to use a hash function that examines the
entire document. That would work, and if we use enough buckets, it
would be very rare that two documents went into the same bucket, yet
were not identical. The downside of this approach is that we must
examine every character of every document. If we limit our
examination to a small number of characters, then we never have to
examine a document that is unique and falls into a bucket of its own.
A better approach is to pick some fixed random positions for all
documents, and make the hash function depend only on these. This
way, we can avoid a problem where there is a common prefix for all
or most documents, yet we need not examine entire documents unless
they fall into a bucket with another document. One problem with
selecting fixed positions is that if some documents are short, they may
not have some of the selected positions. However, if we are looking for
highly similar documents, we never need to compare two documents
that differ significantly in their length. We exploit this idea in
Section 3.11.3.
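The fixed-random-positions idea can be sketched as follows. The names choose_positions, position_hash, and identical_groups, the use of SHA-1, and the decision to skip positions that lie beyond the end of a short document are all assumptions made for this illustration, not part of the method as stated.

```python
import hashlib
import random
from collections import defaultdict

def choose_positions(num_positions, max_offset, seed=0):
    """Pick fixed random character offsets that are shared by all documents."""
    rng = random.Random(seed)
    return sorted(rng.sample(range(max_offset), num_positions))

def position_hash(doc, positions):
    """Hash only the characters at the chosen offsets.

    Offsets past the end of a short document are skipped (an assumption
    of this sketch).
    """
    sampled = "".join(doc[p] for p in positions if p < len(doc))
    return hashlib.sha1(sampled.encode("utf-8")).hexdigest()

def identical_groups(docs, positions):
    """Bucket documents by position hash, then compare whole documents
    only inside buckets that hold more than one document."""
    buckets = defaultdict(list)
    for doc_id, doc in docs.items():
        buckets[position_hash(doc, positions)].append(doc_id)
    groups = []
    for ids in buckets.values():
        if len(ids) < 2:
            continue  # alone in its bucket: never examined in full
        by_content = defaultdict(list)
        for doc_id in ids:
            by_content[docs[doc_id]].append(doc_id)
        groups.extend(sorted(g) for g in by_content.values() if len(g) > 1)
    return groups

if __name__ == "__main__":
    docs = {
        "a": "<html>hello world",
        "b": "<html>hello world",
        "c": "<html>other text!",
    }
    print(identical_groups(docs, choose_positions(4, 16, seed=1)))
```

Identical documents always receive the same hash, so the bucketing step introduces no false negatives; the full comparison inside each bucket removes any false positives.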
3.11.2 Representing Sets as Strings
Now, let us focus on the harder problem of finding, in a large
collection of sets, all pairs that have a high Jaccard similarity, say at
least 0.9. We can represent a set by sorting the elements of the
universal set in some fixed order, and representing any set by listing
its elements in this order. The list is essentially a string of
“characters,” where the characters are the elements of the universal set.
These strings are unusual, however, in that:
1. No character appears more than once in a string, and
2. If two characters appear in two different strings, then they
appear in the same order in both strings.
Example 3.24 : Suppose the universal set consists of the 26 lower-case
letters, and we use the normal alphabetical order. Then the set {d, a, b}
is represented by the string abd.
In what follows, we shall assume all strings represent sets in the
manner just described. Thus, we shall talk about the Jaccard similarity
of strings, when strictly speaking we mean the similarity of the sets
that the strings represent. Also, we shall talk of the length of a string,
as a surrogate for the number of elements in the set that the string
represents.
Note that the documents discussed in Section 3.11.1 do not exactly
match this model, even though we can see documents as strings. To
fit the model, we would shingle the documents, assign an order to the
shingles, and represent each document by its list of shingles in the
selected order.
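As a small illustration of this representation, the sketch below turns a set into its ordered "string" and computes the Jaccard similarity of two such strings. The helper names set_to_string and jaccard are hypothetical, and treating two empty sets as having similarity 1 is a convention chosen here.

```python
def set_to_string(elements, order):
    """Represent a set as the list of its elements in a fixed universal order.

    `order` maps every element of the universal set to its rank.
    """
    return sorted(set(elements), key=order.__getitem__)

def jaccard(s, t):
    """Jaccard similarity of the sets that two such strings represent."""
    a, b = set(s), set(t)
    if not a and not b:
        return 1.0  # convention assumed for this sketch
    return len(a & b) / len(a | b)

if __name__ == "__main__":
    alphabet = "abcdefghijklmnopqrstuvwxyz"
    order = {ch: rank for rank, ch in enumerate(alphabet)}
    print("".join(set_to_string({"d", "a", "b"}, order)))
    print(jaccard("abd", "abcd"))
```

Run on the set of Example 3.24, the first call produces the string abd.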
3.11.3 Length-Based Filtering
The simplest way to exploit the string representation of Section 3.11.2 is
to sort the strings by length. Then, each string s is compared with
those strings t that follow s in the list, but are not too long. Suppose
the lower bound on Jaccard similarity between two strings is J. For
any string x, denote its length by Lx. Note that Ls ≤ Lt. The
intersection of the sets represented by s and t cannot have more than Ls
members, while their union has at least Lt members. Thus, the
Jaccard similarity of s and t, which we denote SIM(s, t), is at most
Ls/Lt. That is, in order for s and t to require comparison, it must be
that J ≤ Ls/Lt, or equivalently, Lt ≤ Ls/J.
Example 3.25 : Suppose that s is a string of length 9, and we are
looking for strings with at least 0.9 Jaccard similarity. Then we have
only to compare s with strings following it in the length-based sorted
order that have length at most 9/0.9 = 10. That is, we compare s with
those strings of length 9 that follow it in order, and all strings of length
10. We have no need to compare s with any other string.
Suppose the length of s were 8 instead. Then s would be compared
with following strings of length up to 8/0.9 = 8.89. That is, a string
of length 9 would be too long to have a Jaccard similarity of 0.9 with s,
so we only have to compare s with the strings that have length 8 but
follow it in the sorted order.
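The length filter can be sketched as a generator of the pairs that still need a full comparison. The function name length_filtered_pairs is hypothetical, and converting J to an exact fraction is a choice made here so that borderline cases such as 10 × 0.9 versus 9 are not decided by floating-point rounding.

```python
from fractions import Fraction

def length_filtered_pairs(strings, J):
    """Yield index pairs (i, k) into `strings` that survive the length filter.

    A pair of strings s, t with Ls <= Lt survives only if Lt <= Ls / J,
    which is tested here as Lt * J <= Ls.
    """
    j_exact = Fraction(J).limit_denominator(1_000_000)
    # Indices of the strings, sorted by string length (stable sort).
    order = sorted(range(len(strings)), key=lambda i: len(strings[i]))
    for pos, i in enumerate(order):
        ls = len(strings[i])
        for k in order[pos + 1:]:
            lt = len(strings[k])
            if lt * j_exact > ls:
                break  # t is too long, and so is every later string
            yield i, k

if __name__ == "__main__":
    strings = ["abcdefgh", "abcdefghi", "abcdefghj",
               "abcdefghij", "abcdefghijk"]
    print(list(length_filtered_pairs(strings, 0.9)))
```

In this run the string of length 8 is compared with nothing, and each string of length 9 is paired only with strings of length 9 or 10, matching Example 3.25.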
3.11.4 Prefix Indexing
In addition to length, there are several other features of strings that can
be exploited to limit the number of comparisons that must be made
to identify all pairs of similar strings. The simplest of these options is
to create an index for each symbol; recall a symbol of a string is any
one of the elements of the universal set. For each string s, we select a
prefix of s consisting of the first p symbols of s. How large p must be
depends on Ls and J, the lower bound on Jaccard similarity. We
add string s to the index for each of its first p symbols. In effect, the
index for each symbol becomes a bucket of strings that must be
compared. We must be certain that any other string t such that
SIM(s, t) ≥ J will have at least one symbol in its prefix that also
appears in the prefix of s.
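A minimal sketch of such an index is given below. The passage above does not yet fix the value of p; the sketch assumes the usual prefix-filtering bound p = ⌊(1 − J)Ls⌋ + 1, which should be read as an assumption of this example. The names prefix_length, build_prefix_index, and candidate_pairs are hypothetical, and each string is assumed to be already sorted in the universal order.

```python
import math
from collections import defaultdict
from fractions import Fraction

def prefix_length(length, J):
    """Assumed bound p = floor((1 - J) * Ls) + 1, computed exactly."""
    j_exact = Fraction(J).limit_denominator(1_000_000)
    return math.floor((1 - j_exact) * length) + 1

def build_prefix_index(strings, J):
    """Map each symbol to the ids of strings having it in their prefix."""
    index = defaultdict(list)
    for sid, s in enumerate(strings):
        for symbol in s[:prefix_length(len(s), J)]:
            index[symbol].append(sid)
    return index

def candidate_pairs(strings, J):
    """Pairs of string ids that share at least one index bucket."""
    pairs = set()
    for bucket in build_prefix_index(strings, J).values():
        for a in range(len(bucket)):
            for b in range(a + 1, len(bucket)):
                pairs.add((bucket[a], bucket[b]))
    return pairs

if __name__ == "__main__":
    strings = ["abcdefghij", "bcdefghij", "xyz"]
    print(candidate_pairs(strings, 0.9))
```

Only pairs that share a bucket are candidates; they still have to be compared in full, since sharing a prefix symbol does not by itself guarantee similarity J.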

A Better Ordering for Symbols
Instead of using the obvious order for elements of the universal
set, e.g., lexicographic order for shingles, we can order
symbols rarest first. That is, determine how many times each
element appears in the collection of sets, and order them by
this count, lowest first. The advantage of doing so is that the
symbols in prefixes will tend to be rare. Thus, they will cause
that string to be placed in index buckets that have relatively
few members. Then, when we need to examine a string for
possible matches, we shall find few other strings that are
candidates for comparison.

Suppose not; rather SIM(s, t) ≥ J, but t has none of the first p symbols
of s. Then the highest Jaccard similarity that s and t can have occurs