Introduction to Statistics

Welcome to ML for Analytics! In today’s post, we will talk about statistics for data science. Let’s begin our discussion with an important quote:

Data by itself is useless. Data is only useful if you apply it.

-Todd Park

Truly said. Data is everywhere around us. The tweets we write, the videos we like, the posts we share, the images we click on: everything generates heaps of data. Today’s world is filled with data, and if this data is used wisely, we can draw important conclusions from it.

But how?

Well, businesses look for patterns in the data they collect. These patterns can then be used to draw important inferences, which can generate large profits for them. Data science is the art of analyzing the data generated by businesses to find meaningful patterns, which can be interpreted to make inferences and generate profit.

In doing all this, data practitioners rely on an important “grammar of science”, which is called “statistics”.

What is statistics?

Statistics can be thought of as the science of collecting, tabulating and analysing data in order to draw inferences from it. To understand more about it, consider the following statements:

According to the U.S. Census Bureau (March 2009), the average one-way travel time is 25.3 minutes.
As of 2017, the Gross Domestic Product of India was 2,439.01 billion USD. (Source: www.statista.com)
As of 2017, the life expectancy in India was 68.33 years. (Source: www.statista.com)

In the above statements, the numerical figures are statistics. Statistical interpretation is carried out on a given data set. A data set comprises data elements, variables and observations.

Data elements are the entities on which data is collected. For example, in a data set of mutual funds, each mutual fund is an element.
Variables are the characteristics recorded for each data element; they usually appear as the column names of the data set. For example, the fund type or the Morningstar rank in the mutual fund data set can be taken as variables.
Observations are the rows of the data set. Each observation is the set of values recorded for one data element.
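
To make these three terms concrete, here is a minimal sketch of a tiny mutual fund data set in plain Python. The fund names, fund types and rank values are invented purely for illustration and are not real data.

```python
# Hypothetical mutual fund data set; all names and values are made up.
funds = [
    {"fund_name": "Fund A", "fund_type": "Domestic equity", "morningstar_rank": 3},
    {"fund_name": "Fund B", "fund_type": "International equity", "morningstar_rank": 4},
    {"fund_name": "Fund C", "fund_type": "Fixed income", "morningstar_rank": 2},
]

# Data elements: the entities the data is about (one per fund).
elements = [row["fund_name"] for row in funds]

# Variables: the characteristics recorded for every element (the columns).
variables = [name for name in funds[0] if name != "fund_name"]

# Observation: everything recorded for a single element (one row).
observation = funds[1]

print("Elements:", elements)
print("Variables:", variables)
print("Observation for Fund B:", observation)
```

Each dictionary plays the role of one row, so the number of observations is simply len(funds).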

Measurement scales

Having talked about the data set, let’s now talk about the different scales of measurement. Measurement scales are of two types:

Qualitative measurement scales
Quantitative measurement scales

Qualitative measurement scales are of two types:

Nominal: comprises categorical data which cannot be ordered. For example, citrus fruits and sweet fruits are both categories of fruit, but trying to rank one category above the other holds no meaning!
Ordinal: comprises categorical data which can be ordered. For example, when asked about the quality of food in a restaurant, a customer can say excellent, good or poor. If we arrange these answers in decreasing order of quality, excellent comes first, good comes second and poor comes last. Hence, ordering can be carried out in this case.
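
The difference between the two qualitative scales can be sketched in a few lines of Python. The sample answers and the numeric ranks assigned to each rating are assumptions made for this example; the point is only that nominal categories can be counted, while ordinal categories can also be sorted once their order is stated.

```python
from collections import Counter

# Nominal: categories can be counted, but they have no inherent order.
fruit_types = ["citrus", "sweet", "citrus", "sweet", "sweet"]
fruit_counts = Counter(fruit_types)
print(fruit_counts)

# Ordinal: categories have an order, which we state explicitly.
# The numeric ranks below only encode position, not a measured quantity.
quality_order = {"poor": 0, "good": 1, "excellent": 2}
ratings = ["good", "poor", "excellent", "good"]

# Sort the ratings in decreasing order of quality.
best_first = sorted(ratings, key=quality_order.get, reverse=True)
print(best_first)
```

Note that Python would happily sort the fruit categories alphabetically, but that order would say nothing about the fruits themselves.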

Quantitative measurement scales are also of two types:

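Interval data: This type of numeric data has meaningful differences between values but doesn’t hold a true zero, so ratios between values carry no meaning. For example, temperature in degrees Celsius is interval data: 0 °C does not mean “no temperature”, and 20 °C is not twice as hot as 10 °C.
Ratio data: This type of numeric data holds a true zero point, so both differences and ratios are meaningful. For example, temperature on the Kelvin scale, or a monthly salary, where zero means no income at all.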
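
To see why a true zero matters, the sketch below compares the same two temperatures on the Celsius scale, which has no true zero, and on the Kelvin scale, which does. The two temperature values are arbitrary choices for this example.

```python
def celsius_to_kelvin(celsius):
    """Convert a temperature from degrees Celsius to Kelvin."""
    return celsius + 273.15

c_low, c_high = 10.0, 20.0
k_low, k_high = celsius_to_kelvin(c_low), celsius_to_kelvin(c_high)

# Differences agree on both scales, so intervals are meaningful on each.
diff_c = c_high - c_low
diff_k = k_high - k_low

# Ratios disagree: only the scale with a true zero gives a meaningful ratio.
ratio_c = c_high / c_low  # looks like "twice as hot", but is an artifact of the scale
ratio_k = k_high / k_low  # the physically meaningful ratio, only slightly above 1

print(f"Difference: {diff_c:.2f} (Celsius) vs {diff_k:.2f} (Kelvin)")
print(f"Ratio: {ratio_c:.3f} (Celsius) vs {ratio_k:.3f} (Kelvin)")
```

The difference is the same on both scales, while the ratio changes completely, which is exactly what separates interval data from ratio data.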

Samples and population

Now, let’s talk about samples and populations.

Population

A population is the set of all elements that are of interest to us. For example, in a survey related to exit polls, all citizens above 18 years of age are of interest, and hence they form our population.

Sample

A sample is a subset of the population. It would be very difficult to ask every person above 18 years of age about their preferred candidate. So, it is wiser to pick a random person every 5 minutes and ask about their preference. The set of people surveyed in this way becomes the sample. A random sample is usually the closest practical approximation of the population.

Sample survey

An important point to note here is that a survey carried out on the entire population is termed a census, while a survey carried out on a sample is called a sample survey. Sample statistics are estimates of the population parameters (characteristics).
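
The idea that a sample statistic estimates a population parameter can be sketched with a small simulation. Everything here is synthetic: the population size, the 55% preference for candidate "A", the sample size and the random seed are assumptions chosen for illustration, not real polling data.

```python
import random

random.seed(42)  # fixed seed so the sketch is repeatable

# Synthetic population: each voter prefers candidate "A" or "B".
population = ["A" if random.random() < 0.55 else "B" for _ in range(100_000)]

# Population parameter: the share of ALL voters who prefer A.
# Measuring this directly is what a census would do.
population_share = population.count("A") / len(population)

# Sample statistic: the share who prefer A among 1,000 randomly chosen voters.
# This is what a sample survey measures.
sample = random.sample(population, k=1_000)
sample_share = sample.count("A") / len(sample)

print(f"Population share preferring A: {population_share:.3f}")
print(f"Sample share preferring A:     {sample_share:.3f}")
```

The two shares will generally be close but not identical. That gap is the price of asking 1,000 people instead of 100,000, and it tends to shrink as the random sample grows.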

So, guys, stay tuned for more informative tutorials on business analytics. In the next tutorial, we will talk about descriptive statistics. For more updates and news related to this blog as well as to data science, machine learning and data visualization, please follow our Facebook page.
