Sunday, May 9, 2021

Introduction to Data Sciences

 

https://docs.google.com/presentation/d/18rOWO8lfiRj8yaF7_DOrZxUcKh4LmytlAv6W1-1wnmk/edit?usp=drivesdk

Wednesday, January 20, 2021

Python for MBA's - Multiple bar charts using Python

 

import numpy as np
import matplotlib.pyplot as plt

# Agricultural production: three series with four observations each
Agricultureproduction = [[30, 25, 50, 20],
                         [40, 23, 51, 17],
                         [35, 22, 45, 19]]

X = np.arange(4)
fig = plt.figure()
ax = fig.add_axes([0, 0, 1, 1])
ax.bar(X + 0.00, Agricultureproduction[0], color='b', width=0.25)
ax.bar(X + 0.25, Agricultureproduction[1], color='g', width=0.25)
ax.bar(X + 0.50, Agricultureproduction[2], color='r', width=0.25)
plt.show()


Python for MBA's - Barplot using Python

 

import matplotlib.pyplot as plt

fig = plt.figure()
ax = fig.add_axes([0, 0, 1, 1])

# GDP figures for each year
GDP = [8.2, 8.4, 7.2, 6.5, 8.5, 9, 9.2]
Year = ['2010', '2011', '2012', '2013', '2014', '2015', '2016']

ax.bar(Year, GDP)
plt.show()


Calculating linear regression using R

 

> # calculating the impact of digital marketing campaign cost on sales
> campaign<-c(24,26,28,29)
> sales<-c(134,145,167,172)
> relation<-lm(sales~campaign)
> print(relation)

Call:
lm(formula = sales ~ campaign)

Coefficients:
(Intercept)     campaign  
   -60.4068       8.0339  

> print(summary(relation))

Call:
lm(formula = sales ~ campaign)

Residuals:
      1       2       3       4 
 1.5932 -3.4746  2.4576 -0.5763 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) -60.4068    22.6185  -2.671   0.1163
campaign      8.0339     0.8434   9.526   0.0108

Residual standard error: 3.239 on 2 degrees of freedom
Multiple R-squared:  0.9784,	Adjusted R-squared:  0.9676 
F-statistic: 90.74 on 1 and 2 DF,  p-value: 0.01084
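
For readers following the Python posts, the same regression can be cross-checked in Python. A minimal sketch with scipy (my tool choice here; numpy.polyfit would also work):

from scipy.stats import linregress

campaign = [24, 26, 28, 29]
sales = [134, 145, 167, 172]

# Ordinary least squares fit of sales on campaign cost
result = linregress(campaign, sales)
print(result.slope, result.intercept)   # ~8.03 and ~-60.41, matching the R output
print(result.rvalue**2, result.pvalue)  # R-squared and the two-sided p-value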

Friday, July 3, 2020

Support Vector machines

Machine learning has redefined the data science world. Today, we have a huge number of tools and techniques to classify and cluster data. One of the important classification techniques is the support vector machine (SVM). The support vector machine is a supervised learning method used in classification and regression analysis. The method was developed at Bell Labs.
The fundamental question is: what are support vectors?
  • Support vectors are the data points closest to the hyperplane. These data points determine the position and orientation of the hyperplane.
  • Removing a support vector alters the position of the hyperplane.
Obviously, the next question is: what is a hyperplane?
A hyperplane is the decision boundary that classifies the data points; the classes are determined by which side of the hyperplane a data point falls on.
The number of features determines the dimension of the hyperplane. If the number of features is two, the hyperplane is a line. If the number of features or inputs is three, the hyperplane is a two-dimensional plane. When the number of features or inputs exceeds three, the shape of the hyperplane is difficult to imagine.

The second important question a data scientist has to answer is: what is the margin in a support vector machine?

When the data points lie very close together, it is difficult to classify them. Therefore, the distance between the hyperplane and the nearest data points is measured: the greater this distance, the better the support vector machine. Maximizing the distance to the nearest data points is called maximizing the margin. Now that we know what a hyperplane is, let us discuss how a support vector machine is built. The first step is to plot each data point in n-dimensional space, where n is the number of features or inputs.
The second step is to create a hyperplane. The biggest challenge is selecting the right hyperplane. The rule of thumb says to select the hyperplane that segregates the two classes best, but multiple separating hyperplanes may exist; in that situation, the margin between the hyperplane and the data points has to be taken into consideration.

There may also be situations where the data contain outliers: some values lie far outside the rest of the data, and one has to handle such outliers before settling on a hyperplane.

What are the advantages of support vector machines?
First, support vector machines are effective in high-dimensional spaces. They work well even when the number of dimensions is greater than the sample size.

The drawbacks of support vector machines are that all input data must be labeled and that they do not directly provide probability estimates.

What are the applications of support vector machines?
First, data scientists can perform text and hypertext categorization. Second, image classification, and finally, handwritten character recognition. A minimal classification sketch follows.
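
Here is a minimal classification sketch in Python with scikit-learn (my tool choice; the six data points and their class labels are invented for illustration):

import numpy as np
from sklearn import svm

# Two features per point, so the separating hyperplane is a line
X = np.array([[1, 2], [2, 3], [3, 3], [6, 5], [7, 8], [8, 8]])
y = np.array([0, 0, 0, 1, 1, 1])  # class labels: SVMs need labeled data

clf = svm.SVC(kernel='linear')  # a linear kernel keeps the hyperplane interpretable
clf.fit(X, y)

print(clf.support_vectors_)   # the data points that fix the hyperplane
print(clf.predict([[4, 4]]))  # classify a new observation
w = clf.coef_[0]
print(2 / np.linalg.norm(w))  # for a linear kernel, the margin width is 2/||w||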

Monday, June 22, 2020

Data wrangling: Definitions, tools and techniques.

Today, I will be discussing data wrangling, also called data munging.
What is data wrangling?
It is the process of cleaning, restructuring, and enriching the data.
There are different views of data wrangling.
In the first view, the process begins with discovering the data using data mapping, and the data is structured using tabulations. Further, the researcher cleans the data of outliers and missing values. Added to this, if the data are insufficient for the analysis, they are enriched by adding more data. Additionally, the data are validated using normality tests. After this, the data are published in databases.
Coming to the second view, the process begins with pre-processing the data, then standardizing, cleaning, consolidating, matching, and filtering the data.
One can search for datasets using Google Dataset Search. Many tools have emerged for data wrangling:
  1. Google Cloud Dataprep
  2. Microsoft Excel Power Query
  3. Data Wrangler
  4. csvkit
  5. OpenRefine
  6. R data cleansing packages, etc.
In a nutshell, what are the data wrangling techniques a data scientist has to know? (A short pandas sketch follows the list.)
  • Missing value analysis
  • Outlier analysis
  • Transformations
  • Visual binning
  • Multicollinearity analysis
  • Principal component analysis
  • Dummy variable analysis
  • Singular value decomposition
  • Linear discriminant analysis
  • Multidimensional scaling
  • t-SNE, and
  • Independent component analysis.
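
Below is a minimal data wrangling sketch using pandas (my choice of tool; any of those listed above would do). The small sales table is invented for illustration, and the two-standard-deviation outlier cutoff is a judgment call.

import numpy as np
import pandas as pd

# Invented sample data with one missing value and one extreme value
df = pd.DataFrame({'sales': [134, 145, np.nan, 167, 172, 900],
                   'region': ['N', 'S', 'S', 'E', 'W', 'N']})

# Missing value analysis: count the gaps, then fill them with the median
print(df['sales'].isna().sum())
df['sales'] = df['sales'].fillna(df['sales'].median())

# Outlier analysis: drop values more than two standard deviations from the mean
z = (df['sales'] - df['sales'].mean()) / df['sales'].std()
df = df[z.abs() <= 2]

# Transformations: log-scale the skewed sales figures
df['log_sales'] = np.log(df['sales'])

# Dummy variable analysis: one-hot encode the categorical column
df = pd.get_dummies(df, columns=['region'])
print(df.head())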

Thursday, June 11, 2020

Descriptive statistics tools

Descriptive statistics tools include the following; a short Python sketch appears after the list.

1. Mean
2. Median
3. Mode
4. Skewness
5. Kurtosis
6. Normal plots
7. TURF analysis
8. Cross tabulation
9. Test of linearity
10. Tests of normality
11. Standard deviation
12. Variance
13. Maximum value
14. Minimum value
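
Most of these measures are one line each in Python with pandas and scipy (my tool choice); the sketch below reuses the GDP series from the bar plot example above.

import pandas as pd
from scipy import stats

gdp = pd.Series([8.2, 8.4, 7.2, 6.5, 8.5, 9, 9.2])

print(gdp.mean())            # mean
print(gdp.median())          # median
print(gdp.mode())            # mode
print(gdp.skew())            # skewness
print(gdp.kurt())            # kurtosis
print(gdp.std())             # standard deviation
print(gdp.var())             # variance
print(gdp.max(), gdp.min())  # maximum and minimum values
print(stats.shapiro(gdp))    # one test of normality (Shapiro-Wilk)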

Features of SPSS: Taking business decisions effectively

Features of SPSS:

1. Data visualization features help organizations understand data behavior.
2. Regression, neural networks, and time series analysis help researchers forecast demand for the business.
3. Correlation analysis finds the relationship between two variables, such as cost and sales.
4. Parametric and non-parametric tests support conclusions on business research hypotheses.
5. Market segmentation is done using cluster analysis.
6. Conjoint analysis is applied to new product development.
7. Multidimensional scaling is applied to positioning.
8. Statistical quality control charts in SPSS are widely implemented for total quality management.

Wednesday, June 10, 2020

Multivariate analysis: definition and objectives

Multivariate Analysis 

Definition

Multivariate analysis is a set of statistical techniques used by a researcher to test hypotheses about multiple variables measured on the sampling unit, or sampling units, of an experiment or research design.

Examples:
1. Sampling unit: learners
Variables: grades in mathematics, statistics, big data analytics, and database management.
2. Sampling unit: patients
Variables: heart rate, body mass index, weight, and height.

Learning objectives:

After studying this chapter, the learner will be able to:
1. Know the purposes, assumptions, and limitations of multivariate techniques.
2. Identify appropriate multivariate techniques for data analysis.
3. Interpret software output to gain meaningful insights.


Objectives of multivariate analysis:

1. To understand the relationships between several dependent variables and several independent variables.
2. To identify the data structures of multiple variables.
3. To classify and categorize the data.
4. To reduce the data; a short principal component analysis sketch follows the list.
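
To make the data reduction objective concrete, here is a minimal principal component analysis sketch using scikit-learn (my tool choice); the learner grades are invented, following the sampling unit example above.

import numpy as np
from sklearn.decomposition import PCA

# Rows are learners; columns are grades in mathematics, statistics,
# big data analytics, and database management
grades = np.array([[78, 82, 75, 80],
                   [65, 70, 60, 68],
                   [90, 88, 85, 91],
                   [55, 60, 58, 52],
                   [72, 75, 70, 74]])

pca = PCA(n_components=2)             # reduce four variables to two components
scores = pca.fit_transform(grades)
print(pca.explained_variance_ratio_)  # share of variance each component retains
print(scores)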

Multivariate data analysis questions for examinations


  1. Discuss the importance of factor analysis in data reduction.
  2. What is the difference between varimax and equimax rotation in factor analysis?
  3. Explain the rotated component matrix in factor analysis.
  4. What do you mean by eigenvalue?
  5. Define communalities.
  6. Explain principal component analysis.
  7. Discuss the use of the maximum likelihood function in factor analysis.
  8. Explain the multivariate normal distribution.
  9. Discuss tests of covariance matrices.
  10. Explain the importance of discriminant analysis.
  11. Elaborate on the applications of canonical correlation.
  12. Explain multiple regression with an example.
  13. Discuss the application of cluster analysis in segmentation.
  14. Distinguish between hierarchical clustering and two-step clustering.
  15. Explain K-means clustering.
  16. Write a note on MANOVA.
  17. What is the general linear model, and how is it different from the generalized linear model?
  18. How is Wilks' lambda used in multivariate analysis?
  19. What do you mean by bootstrapping?
  20. Explain latent structure discovery.
  21. List any five tools of data mining.
  22. Distinguish between OLS and PLS regression.
  23. Write a note on SIMCA.

Monday, June 8, 2020

Python for MBA's: Chi-square test

Chi-square test using Google Colab



Goodness-of-fit test

Problem 1:

Perform the chi-square test for the following observed frequencies:
14, 16, 12, 15, 17

Coding


from scipy.stats import chisquare

# Tests whether the observed frequencies depart from a uniform distribution
chisquare([14, 16, 12, 15, 17])
 

Output

Power_divergenceResult(statistic=1.0, pvalue=0.9097959895689501)
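
The example above is a goodness-of-fit test on a single set of frequencies. For a chi-square test of independence between two categorical variables, scipy's chi2_contingency can be used; the 2x2 contingency table below is invented for illustration.

from scipy.stats import chi2_contingency

# Invented 2x2 contingency table: rows = gender, columns = product preference
table = [[25, 15],
         [18, 22]]

stat, p, dof, expected = chi2_contingency(table)
print(stat, p, dof)  # a small p-value would suggest the variables are related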
 

Thursday, June 4, 2020

Facial Analysis without coding


R for MBA's - Chi-square test (goodness of fit)


Coding



> x<-c(1567,1233,1456,1678,1456,1111,1895)
> chisq.test(x)

Output


 Chi-squared test for given probabilities

data:  x

X-squared = 280.87, df = 6, p-value < 2.2e-16

Wednesday, May 27, 2020

R for MBA's - Hypothesis testing - 't' test

One sample

Problem 1:


The prices of the stock of M/S Nanjund Agri Industries over the last nine days are given below:
71, 74, 82, 67, 35, 79, 48, 57, 62
Test the hypothesis that the mean stock price of M/S Nanjund Agri differs significantly from zero.

Coding:

> x<-c(71,74,82,67,35,79,48,57,62)
> t.test(x)

Output

One Sample t-test

data:  x
t = 12.581, df = 8, p-value = 1.495e-06
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
 52.17808 75.59970
sample estimates:
mean of x
 63.88889 

Two samples


Problem 2:

Test the hypothesis to ascertain whether mean sales differ significantly from mean profit. The sales of Ajay Enterprises for the previous seven years are given below:
135, 132, 234, 213, 245, 267, 156
Similarly, the profit of the firm for the corresponding period is as follows:
21, 20, 34, 32, 36, 38, 19

Solution:
Coding:
> Sales<-c(135,132,234,213,245,267,156)
> Profit<-c(21,20,34,32,36,38,19)
> t.test(Sales,Profit)
Output:

	Welch Two Sample t-test

data:  Sales and Profit
t = 7.9421, df = 6.2632, p-value = 0.0001705
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 117.3586 220.3557
sample estimates:
mean of x mean of y 
197.42857  28.57143 