Monday, June 22, 2020

Data wrangling: Definitions, tools and techniques.

Today I will discuss data wrangling, also called data munging.
What is data wrangling?
It is the process of cleaning, restructuring, and enriching data.
There are two common views of the data wrangling process.
In the first view, the process begins with discovering the data through data mapping, after which the data is structured using tabulation.
Next, the researcher cleans the data by handling outliers and missing values. If the data is insufficient for analysis, it is enriched by adding more data. The data is then validated using a normality test and finally published to a database.
In the second view, the process consists of pre-processing, standardizing, cleaning, consolidating, matching, and filtering the data.
Datasets can be found using Google Dataset Search. Many tools have emerged for data wrangling:
  1. Google Cloud Dataprep
  2. Microsoft Excel Power Query
  3. Data Wrangler
  4. csvkit
  5. OpenRefine
  6. R data cleaning packages, etc.
In a nutshell, these are the data wrangling techniques a data scientist should know:
  • Missing value analysis 
  • Outlier analysis 
  • Transformations 
  • Visual binning 
  • Multicollinearity analysis
  • Principal component analysis
  • Dummy variable analysis
  • Singular value decomposition
  • Linear discriminant analysis
  • Multidimensional scaling
  • t-SNE, and
  • Independent component analysis.
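
A few of the techniques above can be sketched in plain Python. This is only a minimal illustration, not a full workflow: the sales figures and region labels are made-up values, and the 1.5 × IQR rule and median imputation are assumed choices among several common ones.

```python
import math
import statistics

# Hypothetical monthly sales figures (made up); None marks a missing value.
sales = [120.0, 135.0, None, 128.0, 910.0, 131.0, None, 125.0]

# 1. Missing value analysis: count the gaps, then impute with the median.
observed = [v for v in sales if v is not None]
n_missing = len(sales) - len(observed)
median = statistics.median(observed)
imputed = [median if v is None else v for v in sales]

# 2. Outlier analysis: flag values more than 1.5 * IQR beyond the quartiles.
q1, _, q3 = statistics.quantiles(observed, n=4, method="inclusive")
iqr = q3 - q1
low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = [v for v in imputed if v < low or v > high]

# 3. Transformation: natural log to compress the long right tail.
log_sales = [math.log(v) for v in imputed]

# 4. Dummy variable coding for a categorical column (hypothetical regions).
regions = ["north", "south", "south", "east"]
levels = sorted(set(regions))
dummies = [[int(r == level) for level in levels] for r in regions]

print("missing:", n_missing, "median:", median, "outliers:", outliers)
print(levels, dummies)
```

In practice a library such as pandas would usually handle these steps, but the logic is the same.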

Thursday, June 11, 2020

Descriptive statistics tools

Descriptive statistics tools include

1. Mean
2. Median
3. Mode
4. Skewness
5. Kurtosis
6. Normal plots
7. TURF analysis
8. Cross tabulation
9. Test of linearity
10. Tests of normality
11. Standard deviation
12. Variance
13. Maximum value
14. Minimum value
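
Several of these measures can be computed with Python's standard statistics module. The sketch below reuses the sample from the R posts on this blog. The skewness and kurtosis use the simple moment-based (population) formulas, which is an assumption on my part; statistical packages usually apply sample-size corrections, so their values can differ slightly.

```python
import statistics

x = [97, 92, 104, 189, 156, 156, 125]  # same sample as in the R posts

n = len(x)
mean = statistics.mean(x)
median = statistics.median(x)
mode = statistics.mode(x)
variance = statistics.variance(x)  # sample variance (n - 1 denominator)
std_dev = statistics.stdev(x)      # sample standard deviation
minimum, maximum = min(x), max(x)

# Central moments for moment-based skewness and excess kurtosis.
m2 = sum((v - mean) ** 2 for v in x) / n
m3 = sum((v - mean) ** 3 for v in x) / n
m4 = sum((v - mean) ** 4 for v in x) / n
skewness = m3 / m2 ** 1.5
excess_kurtosis = m4 / m2 ** 2 - 3

print("mean:", round(mean, 3), "median:", median, "mode:", mode)
print("sd:", round(std_dev, 5), "variance:", round(variance, 3))
print("min:", minimum, "max:", maximum)
print("skewness:", round(skewness, 3), "excess kurtosis:", round(excess_kurtosis, 3))
```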

Features of SPSS: Making business decisions effectively

Features of SPSS:

1. Data visualization features help organizations understand how their data behaves.
2. Regression, neural networks, and time series analysis help researchers forecast business demand.
3. Correlation analysis measures the relationship between two variables, such as cost and sales.
4. Parametric and non-parametric tests support conclusions about business research hypotheses.
5. Cluster analysis is used for market segmentation.
6. Conjoint analysis is applied in new product development.
7. Multidimensional scaling is applied in product positioning.
8. Statistical quality control charts in SPSS are widely used in total quality management.

Wednesday, June 10, 2020

Multivariate analysis definition, and objectives

Multivariate Analysis 

Definition

Multivariate analysis is a set of statistical techniques that a researcher uses to test hypotheses about multiple variables measured on the sampling unit or units of an experiment or research design.

Examples:
1. Sampling unit: Learners
Variables: grades in mathematics, statistics, big data analytics, database management.
2. Sampling unit: Patient
Variables: heart rate, body mass index, weight, height

Learning Objectives:

After studying this chapter, the learner will be able to:
1. Know the purposes, assumptions, and limitations of multivariate techniques.
2. Identify appropriate multivariate techniques for data analysis.
3. Interpret the output of software to gain meaningful insights.


Objectives of Multivariate analysis:

1. To understand the relationship between several dependent variables and several independent variables.
2. To identify the data structure underlying multiple variables.
3. To classify and categorize the data.
4. To reduce the data to a smaller number of dimensions.
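
As a first step toward the objective of understanding relationships among several variables, one can look at a correlation matrix. The sketch below is only an illustration: the grades are hypothetical values invented for this example, and the pearson helper is written by hand rather than taken from any package.

```python
import math

# Hypothetical grades for five learners (illustrative values only).
grades = {
    "mathematics": [72, 85, 90, 64, 78],
    "statistics": [70, 88, 93, 60, 75],
    "database_management": [81, 65, 70, 88, 74],
}


def pearson(a, b):
    """Pearson correlation coefficient between two equal-length lists."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    cross = sum((p - mean_a) * (q - mean_b) for p, q in zip(a, b))
    ss_a = sum((p - mean_a) ** 2 for p in a)
    ss_b = sum((q - mean_b) ** 2 for q in b)
    return cross / math.sqrt(ss_a * ss_b)


names = list(grades)
# Correlation of every variable with every other variable.
matrix = {r: {c: pearson(grades[r], grades[c]) for c in names} for r in names}

for r in names:
    print(r, [round(matrix[r][c], 2) for c in names])
```

Strongly correlated variables are candidates for data reduction techniques such as factor analysis or principal component analysis.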

Multivariate data analysis questions for examinations


  1. Discuss the importance of factor analysis in data reduction.
  2. What is the difference between varimax and equimax rotation in factor analysis?
  3. Explain the rotated component matrix in factor analysis.
  4. What do you mean by an eigenvalue?
  5. Define communalities.
  6. Explain principal component analysis.
  7. Discuss the use of the maximum likelihood function in factor analysis.
  8. Explain the multivariate normal distribution.
  9. Discuss tests of covariance matrices.
  10. Explain the importance of discriminant analysis.
  11. Elaborate on the applications of canonical correlation.
  12. Explain multiple regression with an example.
  13. Discuss the application of cluster analysis in segmentation.
  14. Distinguish between hierarchical clustering and two-step clustering.
  15. Explain k-means clustering.
  16. Write a note on MANOVA.
  17. What is the general linear model, and how is it different from the generalized linear model?
  18. How is Wilks' lambda used in multivariate analysis?
  19. What do you mean by bootstrapping?
  20. Explain latent structure discovery.
  21. List any five tools of data mining.
  22. Distinguish between OLS and PLS regression.
  23. Write a note on SIMCA.

Monday, June 8, 2020

Python for MBA's: Chi-square test

Chi-square test using Google Colab



Goodness-of-fit test (one-way chi-square)

Problem 1:

Perform the chi-square test for the following values
14,16,12,15,17

Coding


from scipy.stats import chisquare

# One-way chi-square test on the observed frequencies;
# expected frequencies default to equal counts per category
chisquare([14,16,12,15,17])
 

Output

Power_divergenceResult(statistic=1.0, pvalue=0.9097959895689501)
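
To see where these numbers come from, the statistic can be worked out by hand. The sketch below assumes equal expected frequencies, which is what chisquare uses when none are given. The p-value uses a closed form that holds only when the degrees of freedom are even, as they are here (df = 4).

```python
import math

observed = [14, 16, 12, 15, 17]

# Under the null hypothesis every category has the same expected count.
expected = sum(observed) / len(observed)

# Chi-square statistic: sum of (O - E)^2 / E over the categories.
statistic = sum((o - expected) ** 2 / expected for o in observed)
df = len(observed) - 1

# For even df, P(X > x) = exp(-x/2) * sum_{k=0}^{df/2 - 1} (x/2)^k / k!
half = statistic / 2
p_value = math.exp(-half) * sum(
    half ** k / math.factorial(k) for k in range(df // 2)
)

print(round(statistic, 4), df, round(p_value, 4))
```

Since the p-value is far above 0.05, there is no evidence that the five frequencies differ from equal counts.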
 

Saturday, June 6, 2020

R for MBA's: Calculating quartiles/quantiles using R

Coding


> x<-c(97,92,104,189,156,156,125)


> quantile(x)


Output


   0%   25%   50%   75%  100% 
 92.0 100.5 125.0 156.0 189.0 
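
For readers who work in Python, the same quartiles can be sketched with the standard statistics module. The method="inclusive" option is chosen here because it uses the same interpolation as R's default quantile type; other methods can give different cut points on small samples.

```python
import statistics

x = [97, 92, 104, 189, 156, 156, 125]

# Three cut points that split the sorted data into four equal parts.
quartiles = statistics.quantiles(x, n=4, method="inclusive")

# Add the minimum and maximum to mirror the 0% and 100% columns in R.
five_number = [min(x), *quartiles, max(x)]

print(five_number)  # [92, 100.5, 125.0, 156.0, 189]
```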

R for MBA's: Calculating Range of numbers using R

Coding

> x<-c(97,92,104,189,156,156,125)

> range(x)


Output


[1]  92 189

R for MBA's: Calculating Standard Deviation and Variance using R


Coding and output

> x<-c(97,92,104,189,156,156,125)

> sd(x)

Standard deviation: [1] 36.64112

> var(x)

Variance: [1] 1342.571
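
Note that sd() and var() in R give the sample versions, which divide by n - 1 rather than n. A minimal Python sketch of the arithmetic, written out by hand so the denominator is visible:

```python
import math

x = [97, 92, 104, 189, 156, 156, 125]

n = len(x)
mean = sum(x) / n

# Sum of squared deviations from the mean.
sum_sq = sum((v - mean) ** 2 for v in x)

sample_var = sum_sq / (n - 1)   # what var(x) reports in R
sample_sd = math.sqrt(sample_var)  # what sd(x) reports in R
population_var = sum_sq / n     # smaller, divides by n instead

print(round(sample_sd, 5), round(sample_var, 3), round(population_var, 3))
```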

Thursday, June 4, 2020

Facial Analysis without coding



R for MBA's: Chi-square test (goodness of fit)


Coding

> x<-c(1567,1233,1456,1678,1456,1111,1895)

> chisq.test(x)

Output


 Chi-squared test for given probabilities

data:  x

X-squared = 280.87, df = 6, p-value < 2.2e-16