Sunday, May 9, 2021

Introduction to Data Sciences

 

https://docs.google.com/presentation/d/18rOWO8lfiRj8yaF7_DOrZxUcKh4LmytlAv6W1-1wnmk/edit?usp=drivesdk

Wednesday, January 20, 2021

Python for MBA's- Multiple bar charts using Python

 

import numpy as np

import matplotlib.pyplot as plt

Agricultureproduction = [[30, 25, 50, 20],

[40, 23, 51, 17],

[35, 22, 45, 19]]

X = np.arange(4)

fig = plt.figure()

ax = fig.add_axes([0,0,1,1])

ax.bar(X + 0.00, Agricultureproduction[0], color = 'b', width = 0.25)

ax.bar(X + 0.25, Agricultureproduction[1], color = 'g', width = 0.25)

ax.bar(X + 0.50, Agricultureproduction[2], color = 'r', width = 0.25)

plt.show()


Python for MBA's- Barplot using Python

 

import numpy as np

import matplotlib.pyplot as plt

fig=plt.figure()

ax=fig.add_axes([0,0,1,1])

GDP=[8.2,8.4,7.2,6.5,8.5,9,9.2]

Year=['2010','2011','2012','2013','2014','2015','2016']

ax.bar(Year,GDP)

plt.show()


calculating linear regression using R

 

> # calculating the impact of digital marketing campaign cost on sales
> campaign<-c(24,26,28,29)
> sales<-c(134,145,167,172)
> relation<-lm(sales~campaign)
> print(relation)

Call:
lm(formula = x ~ y)

Coefficients:
(Intercept)            y  
    33.8312      -0.1052  

> print(summary(relation))

Call:
lm(formula = x ~ y)

Residuals:
      1       2       3       4       5 
 -9.568  23.642   3.747  -6.148 -11.674 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  33.8312    16.3441   2.070    0.130
y            -0.1052     0.6855  -0.154    0.888

Residual standard error: 16.72 on 3 degrees of freedom
Multiple R-squared:  0.007795,	Adjusted R-squared:  -0.3229 
F-statistic: 0.02357 on 1 and 3 DF,  p-value: 0.8877
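
As a cross-check on the four campaign and sales values entered in the R session above, the least-squares slope and intercept can be worked out by hand. The sketch below is in Python and assumes sales is the response and campaign cost the predictor, which is how the opening comment describes the problem. The helper name `fit_line` is made up for this illustration.

```python
def fit_line(x, y):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of squared deviations of x, and sum of cross-products of x and y
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


campaign = [24, 26, 28, 29]
sales = [134, 145, 167, 172]

intercept, slope = fit_line(campaign, sales)
print(f"intercept = {intercept:.3f}, slope = {slope:.3f}")
```

For these four observations the slope comes out at about 8.03, so in this small sample each extra unit of campaign cost goes with roughly eight more units of sales.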

Friday, July 3, 2020

Support Vector machines

Machine learning has redefined the data science world. Today, we have a huge number of tools and techniques to classify and cluster data. One of the important classification techniques is the support vector machine (SVM). The support vector machine is a supervised learning method used for classification and regression analysis. The method was developed at Bell Labs.
The fundamental question is: what are support vectors?
  • Support vectors are the data points closest to the hyperplane. These data points influence the position and orientation of the hyperplane.
  • Removing a support vector alters the position of the hyperplane.
Obviously, the next question is: what is a hyperplane?
Hyperplanes are the decision boundaries that classify data points.
Classes are determined by which side of the hyperplane a data point falls on.
The number of features determines the dimension of the hyperplane. If the number of features is two, the hyperplane is a line. If the number of features is three, the hyperplane is a two-dimensional plane. When the number of features exceeds three, the hyperplane is difficult to imagine.

The second important question that a data scientist has to answer is: what is the margin in a support vector machine?

When data points from different classes lie very close together, it is difficult to classify them. Therefore the distance between the hyperplane and the nearest data points (the support vectors) is measured. The larger this distance, the better the support vector machine. This distance is called the margin, and the SVM tries to maximize it. Now that we know what a hyperplane is, let us discuss how a support vector machine is built. First, each data point is plotted in n-dimensional space, where n is the number of features or inputs.
Second, a hyperplane is created. The biggest challenge is selecting the right hyperplane. The rule of thumb is to select the hyperplane that best segregates the two classes. When several hyperplanes separate the classes, the margins between the data points and each hyperplane have to be taken into consideration, and the hyperplane with the largest margin is preferred.
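
The idea of comparing margins can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the six points, the two candidate lines, and the helper names `signed_value`, `separates` and `margin` are all invented for the example, and a real SVM searches over all hyperplanes instead of comparing two hand-picked ones.

```python
import math


def signed_value(point, w, b):
    """Value of w.x + b; its sign tells which side of the hyperplane the point is on."""
    return sum(wi * xi for wi, xi in zip(w, point)) + b


def separates(class_a, class_b, w, b):
    """True if every class A point is on the positive side and every class B point on the negative side."""
    return (all(signed_value(p, w, b) > 0 for p in class_a)
            and all(signed_value(p, w, b) < 0 for p in class_b))


def margin(points, w, b):
    """Distance from the hyperplane w.x + b = 0 to the nearest data point."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return min(abs(signed_value(p, w, b)) / norm for p in points)


# Hypothetical two-feature data, so each hyperplane is a line
class_a = [(3.0, 3.0), (4.0, 3.0), (4.0, 4.0)]
class_b = [(1.0, 1.0), (0.0, 1.0), (1.0, 0.0)]

# Two candidate hyperplanes, written as (w, b) for w.x + b = 0
candidates = {
    "x + y = 4": ((1.0, 1.0), -4.0),
    "x = 2.5": ((1.0, 0.0), -2.5),
}

margins = {}
for name, (w, b) in candidates.items():
    if separates(class_a, class_b, w, b):
        margins[name] = margin(class_a + class_b, w, b)

best = max(margins, key=margins.get)
print(margins)
print("preferred hyperplane:", best)
```

Both lines separate the two classes, but the line x + y = 4 keeps the nearest points farther away, so it is the one the margin rule prefers.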

There may be situations where the data contain outliers. Some values lie far outside the range of the rest, and one has to eliminate such outliers before taking decisions.
What are the advantages of support vector machines?
First, support vector machines are effective in high-dimensional spaces. They work well even when the number of dimensions is greater than the sample size.

The drawbacks of support vector machines are that all input data must be labeled and that they do not directly provide probability estimates.

What are the applications of support vector machines?
First, data scientists can use them for text and hypertext categorization.
Second, image classification, and finally, handwritten character recognition.

Monday, June 22, 2020

Data wrangling: Definitions, tools and techniques.

Today, I will discuss data wrangling, also called data munging.
What is data wrangling?
It is the process of cleaning, restructuring, and enriching data.
There are different views of data wrangling.
In the first view, the process begins with discovering the data using data mapping, and the data is then structured using tabulations.
Next, the researcher cleans the data of outliers and missing values. If the data is insufficient for analysis, it is enriched by adding more data. The data is then validated using a normality test. After this, the data is published in databases.
In the second view, the process consists of pre-processing, standardizing, cleaning, consolidating, matching, and filtering the data.
One can search for data using Google Dataset Search. Many tools have emerged for data wrangling:
  1. Google Cloud Dataprep
  2. Microsoft Excel Power Query
  3. Data Wrangler
  4. csvkit
  5. OpenRefine
  6. R data cleansing packages, etc.
In a nutshell, the data wrangling techniques a data scientist has to know are:
  • Missing value analysis
  • Outlier analysis
  • Transformations
  • Visual binning
  • Multicollinearity
  • Principal component analysis
  • Dummy variable analysis
  • Singular value decomposition
  • Linear discriminant analysis
  • Multidimensional scaling
  • t-SNE and
  • Independent component analysis.
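
The first two techniques in the list, missing value analysis and outlier analysis, can be sketched with Python's standard library. The readings below are invented for the illustration, and filling gaps with the median and flagging outliers with the 1.5 × IQR rule are just one common choice among several.

```python
import statistics

# Hypothetical raw readings; None marks a missing value
raw = [12.0, 15.0, None, 14.0, 13.0, 95.0, None, 16.0]

# Missing value analysis: count the gaps, then fill them with the median
missing_count = sum(value is None for value in raw)
observed = [value for value in raw if value is not None]
median = statistics.median(observed)
filled = [median if value is None else value for value in raw]

# Outlier analysis: flag values more than 1.5 * IQR beyond the quartiles
q1, _, q3 = statistics.quantiles(filled, n=4, method="inclusive")
iqr = q3 - q1
low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = [value for value in filled if not low <= value <= high]
cleaned = [value for value in filled if low <= value <= high]

print("missing values:", missing_count)
print("outliers:", outliers)
print("cleaned data:", cleaned)
```

Here two values are missing and the reading of 95.0 is flagged as an outlier, leaving seven cleaned values.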

Thursday, June 11, 2020

Descriptive statistics tools

Descriptive statistics tools include

1. Mean
2. Median
3. Mode
4. Skewness
5. Kurtosis
6. Normal plots
7. TURF analysis
8. Cross tabulation
9. Test of linearity
10. Tests of normality
11. Standard deviation
12. Variance
13. Maximum value
14. Minimum value
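
Several of these measures can be computed with Python's built-in `statistics` module. The sketch below uses the seven values that appear in the R quartile post on this blog. The standard deviation and variance are the sample versions (dividing by n - 1), which is also what R's `sd()` and `var()` report.

```python
import statistics

values = [97, 92, 104, 189, 156, 156, 125]

summary = {
    "mean": statistics.mean(values),
    "median": statistics.median(values),
    "mode": statistics.mode(values),
    "standard deviation": statistics.stdev(values),  # sample SD (n - 1)
    "variance": statistics.variance(values),         # sample variance (n - 1)
    "minimum": min(values),
    "maximum": max(values),
}

for name, value in summary.items():
    print(f"{name}: {value:.3f}")
```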

Features of SPSS: Taking business decisions effectively

Features of SPSS:

1. Data visualization features help organizations understand data behavior.
2. Regression, neural networks, and time series analysis help researchers forecast demand for the business.
3. Correlation analysis finds the relationship between two variables, such as cost and sales.
4. Parametric and non-parametric tests support conclusions on the hypotheses of business research.
5. Market segmentation is done using cluster analysis.
6. Conjoint analysis is applied to new product development.
7. Multidimensional scaling is applied to positioning.
8. Statistical quality control charts in SPSS are widely used for total quality management.

Wednesday, June 10, 2020

Multivariate analysis definition, and objectives

Multivariate Analysis 

Definition

Multivariate analysis is a set of statistical techniques used by a researcher to test hypotheses about multiple variables measured on the sampling unit or units of an experiment or research design.

Examples:
1. Sampling unit: Learners
Variables: Grades in mathematics, statistics, big data analytics, database management.
2. Sampling unit: Patient
Variables: Heart rate, body mass index, weight, height

Learning Objectives:

After studying this chapter, the learner will be able to:
1. Know the purposes, assumptions, and limitations of multivariate techniques.
2. Identify appropriate multivariate techniques for data analysis.
3. Interpret the output of software to gain meaningful insights.


Objectives of Multivariate analysis:

1. To understand the relationship between several dependent variables and several independent variables.
2. To identify the data structures of multiple variables.
3. To classify and categorize the data.
4. To reduce the data.

Multivariate data analysis questions for examinations


  1. Discuss the importance of factor analysis in data reduction.
  2. What is the difference between varimax and equimax rotation in factor analysis?
  3. Explain the rotated component matrix in factor analysis.
  4. What do you mean by eigenvalue?
  5. Define communalities.
  6. Explain principal component analysis.
  7. Discuss the use of the maximum likelihood function in factor analysis.
  8. Explain the multivariate normal distribution.
  9. Discuss tests of covariance matrices.
  10. Explain the importance of discriminant analysis.
  11. Elaborate on the application of canonical correlation.
  12. Explain multiple regression with an example.
  13. Discuss the application of cluster analysis in segmentation.
  14. Distinguish between hierarchical clustering and two-step clustering.
  15. Explain K-means clustering.
  16. Write a note on MANOVA.
  17. What is the general linear model, and how is it different from the generalized linear model?
  18. How is Wilks' lambda used in multivariate analysis?
  19. What do you mean by bootstrapping?
  20. Explain latent structure discovery.
  21. List any five tools of data mining.
  22. Distinguish between OLS and PLS regression.
  23. Write a note on SIMCA.

Monday, June 8, 2020

Python for MBA's: Chi-square test

Chi-square test using Google Colab



Independent test

Problem 1:

Perform the chi-square test for the following values
14,16,12,15,17

coding


import numpy as np
 
from scipy.stats import chisquare
 
chisquare([14,16,12,15,17])
 

output

Power_divergenceResult(statistic=1.0, pvalue=0.9097959895689501)
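
To see where the statistic of 1.0 comes from, the same calculation can be done by hand. When `chisquare` is called without expected frequencies it treats every category as equally likely, so each expected count is simply the average of the observed counts. The sketch below reproduces the statistic only, not the p-value.

```python
observed = [14, 16, 12, 15, 17]

# With no expected frequencies supplied, each category is expected
# to hold an equal share of the total count.
expected = sum(observed) / len(observed)

# Chi-square statistic: sum of (observed - expected)^2 / expected
statistic = sum((o - expected) ** 2 / expected for o in observed)
degrees_of_freedom = len(observed) - 1

print(f"expected count per category = {expected}")
print(f"chi-square statistic = {statistic:.4f}, df = {degrees_of_freedom}")
```

The expected count is 14.8 per category and the squared deviations also add up to 14.8, which is why the statistic is exactly 1.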
 

Saturday, June 6, 2020

R for MBA's- Calculating quartiles/quantiles using R

Coding


> x<-c(97,92,104,189,156,156,125)


> quantile(x)


Output


   0%      25%       50%       75%     100% 
 92.0     100.5     125.0      156.0    189.0 
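
For readers working in Python, a rough equivalent is sketched below using the standard `statistics` module. The `method="inclusive"` option is chosen here because it reproduces the R output above for these values. The default (`"exclusive"`) method would give different quartiles.

```python
import statistics

x = [97, 92, 104, 189, 156, 156, 125]

# n=4 asks for the three cut points that split the data into quarters
q1, q2, q3 = statistics.quantiles(x, n=4, method="inclusive")

# Minimum, quartiles and maximum, matching the 0%..100% columns in R
five_numbers = [min(x), q1, q2, q3, max(x)]
print(five_numbers)
```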

R for MBA's: Calculating Range of numbers using R

 Coding: 

> x<-c(97,92,104,189,156,156,125)

> range(x)


Output


[1]  92 189

R for MBA's- Calculating Standard Deviation and Variance using R


Coding and output

> x<-c(97,92,104,189,156,156,125)

> sd(x)

Standard deviation: [1] 36.64112

> var(x)

 variance: [1] 1342.571

Thursday, June 4, 2020

Facial Analysis without coding



R for MBA's- Chi-square test (Independent)


R for MBA's - Chi-square test (independent)


coding



> x<-c(1567,1233,1456,1678,1456,1111,1895)

> chisq.test(x)

Output


 Chi-squared test for given probabilities

data:  x

X-squared = 280.87, df = 6, p-value < 2.2e-16

Wednesday, May 27, 2020

R for MBA's- Hypothesis testing - T test

R for MBA's- Hypothesis testing  - 't' test

one sample

Problem 1:


The prices of the stock of M/S Nanjund Agri Industries over the last nine days are given below:
71,74,82,67,35,79,48,57,62
Test the hypothesis that the mean stock price of M/S Nanjund Agri differs significantly from zero.

Coding:

> x<-c(71,74,82,67,35,79,48,57,62)
> t.test(x)

Output

One Sample t-test

data:  x
t = 12.581, df = 8, p-value = 1.495e-06

alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
 52.17808 75.59970
sample estimates:
mean of x
 63.88889 
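
The t value in this output can be reproduced by hand: it is the sample mean divided by its standard error. The Python sketch below assumes a hypothesized mean of 0, which is what R's `t.test(x)` uses when no `mu` is supplied. It computes the statistic and degrees of freedom only, not the p-value.

```python
import math
import statistics

prices = [71, 74, 82, 67, 35, 79, 48, 57, 62]
hypothesized_mean = 0  # default null value in R's t.test(x)

n = len(prices)
mean = statistics.mean(prices)

# Standard error of the mean: sample standard deviation / sqrt(n)
standard_error = statistics.stdev(prices) / math.sqrt(n)

t_statistic = (mean - hypothesized_mean) / standard_error
degrees_of_freedom = n - 1

print(f"mean = {mean:.5f}")
print(f"t = {t_statistic:.3f}, df = {degrees_of_freedom}")
```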

Two samples


Problem 2:

Test the hypothesis to ascertain the significance of sales on profit. The sales of Ajay Enterprises for the previous seven years are given below.
135,132,234,213,245,267,156
Similarly, the profit of the firm for the corresponding period is as follows.
21,20,34,32,36,38,19

Solution:
Coding:
> Sales<-c(135,132,234,213,245,267,156)
> Profit<-c(21,20,34,32,36,38,19)
> t.test(Sales,Profit)
Output:
        Welch Two Sample t-test
data:  Sales and Profit
t = 7.9421, df = 6.2632, p-value = 0.0001705
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 117.3586 220.3557
sample estimates:
 mean of Sales mean of Profit
     197.42857       28.57143
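
The fractional degrees of freedom in this output come from the Welch correction, which does not assume the two samples have equal variances. A minimal Python sketch of the calculation follows. The function name `welch_t` is made up for this illustration, and it returns only the t statistic and the degrees of freedom.

```python
import math
import statistics

sales = [135, 132, 234, 213, 245, 267, 156]
profit = [21, 20, 34, 32, 36, 38, 19]


def welch_t(a, b):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    # Squared standard error of each sample mean
    va = statistics.variance(a) / len(a)
    vb = statistics.variance(b) / len(b)
    t = (statistics.mean(a) - statistics.mean(b)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df


t_statistic, degrees_of_freedom = welch_t(sales, profit)
print(f"t = {t_statistic:.4f}, df = {degrees_of_freedom:.4f}")
```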