Comparing free statistical software
Handling missing data
This page shows some output from Epi Info, MicrOsiris and WinIDAMS.
For comparison, there is output from Excel, using a version of the
data set with no missing values. The output covers correlations and
regression.
I did this in November 2006 using the most recent versions of the
software at that time. I used a version of PD-Plus, available on my
data page. I updated this page in March 2012 to include Instat and
PSPP, and in April 2016 to include JASP.
The main finding is that all programs give the same results, with one
exception: see note 7, where Instat operates differently from the
other programs in correlation. It uses casewise deletion, while the
other programs use pairwise deletion.
MicrOsiris http://www.microsiris.com/
Epi Info http://wwwn.cdc.gov/epiinfo/
WinIDAMS http://portal.unesco.org/ci/en/ev.php-URL_ID=2070&URL_DO=DO_TOPIC&URL_SECTION=201.html
PSPP http://www.gnu.org/software/pspp/
JASP https://jasp-stats.org/
Special case
Instat http://www.reading.ac.uk/ssc/resourcepage/instat.php
(see note 7)
Data for this analysis
This data set, with blanks for missing values:
http://gsociology.icaap.org/data/PD_data_cia.csv
listed here:
http://gsociology.icaap.org/dataupload.html
This data set, with -999 or -9 for missing values (for WinIDAMS):
http://gsociology.icaap.org/methods/PD_data_cia_nine_comma_w3.csv
saved as this WinIDAMS data set and dictionary:
http://gsociology.icaap.org/methods/PD_data_cia_nine_comma_w3.dat
http://gsociology.icaap.org/methods/PD_data_cia_nine_comma_w3.dic
This version, with no missing values and only the variables used in
the regressions below:
http://gsociology.icaap.org/methods/PD_data_cia_stat4u_nomiss.csv
I include this last data set because I used it with Excel and Stat4U
to compare the results with the other programs.
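The two conventions used in these files (blank cells, or a numeric code such as -999 or -9) can be reduced to a single "missing" marker when the data are read. Here is a minimal Python sketch of that idea. It is not what any of the programs on this page do internally, and the sample rows, column layout and helper names (`parse_cell`, `read_rows`) are made up for illustration.

```python
import csv
import io

# Codes treated as missing. -999 and -9 follow the WinIDAMS files
# described above; the empty string covers the files with blanks.
MISSING_CODES = {"", "-999", "-9"}

def parse_cell(text):
    """Return a float, or None when the cell is blank or holds a missing code."""
    text = text.strip()
    if text in MISSING_CODES:
        return None
    return float(text)

# Hypothetical sample rows, not taken from the real data set.
SAMPLE = """country,gini,literacy
A,40.5,99
B,,85.2
C,-999,-9
"""

def read_rows(handle):
    """Read a CSV with a header row; the first column is kept as text."""
    reader = csv.reader(handle)
    header = next(reader)
    rows = []
    for record in reader:
        row = {header[0]: record[0]}
        for name, cell in zip(header[1:], record[1:]):
            row[name] = parse_cell(cell)
        rows.append(row)
    return rows

rows = read_rows(io.StringIO(SAMPLE))
for row in rows:
    print(row)
```

One caveat with this sketch: a variable could legitimately take the value -9, so a fuller version would keep one missing code per column, which is what the WinIDAMS dictionary does.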
NOTES
1. MicrOsiris, Instat and Epi Info read files with blanks for
missing.
2. For WinIDAMS, each variable has to have a 'missing' indicator. I
used -999 or -9, and these have to be clearly defined in the file
definition. See the .dic file listed above.
3. MicrOsiris reads .csv files. Epi Info and Instat read Excel or
.csv files.
4. When using MicrOsiris,
a. import the .csv file, then call up commands.
b. for blanks, MicrOsiris assigns 1.5 and 1.6 billion, but
automatically recognises these values as missing.
c. the data dictionary shows 0 decimal places, but if the data
actually contain a decimal point, as in 1.23, the number is read as
1.23. The number of decimal places in the data dictionary is the
implied number, used only when a value has no decimal point.
5. Epi Info doesn't do correlation (at least in the version I used in
2008). You need to run a regression with 2 variables to get the
correlation coefficient.
6. Regression: The basic output for Epi Info and MicrOsiris seems to
match the last step of the stepwise regression in WinIDAMS.
7. An old version of Instat that I used gives the same results,
except that it seems to operate differently from the other programs
in correlation. For correlation, Instat deletes all cases with any
missing values (casewise deletion). All the other programs do
pairwise deletion: they compute the correlation for one pair of
variables at a time and exclude only the cases with missing values
for that pair. I'm not sure whether the same applies to the current
version.
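The difference between pairwise and casewise deletion described in note 7 can be shown with a few lines of Python. This is only a sketch on made-up toy data (three hypothetical columns x, y and z, with None marking a missing value), not the data set used on this page. The function names are mine.

```python
import math

def pearson(xs, ys):
    """Pearson correlation of two equal-length lists with no missing values."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def correlate(data, a, b, casewise):
    """Correlate columns a and b of data; None marks a missing value.

    casewise=True  drops a row if ANY column in data is missing.
    casewise=False drops a row only if a or b is missing (pairwise).
    Returns (r, number of rows used).
    """
    columns_to_check = list(data) if casewise else [a, b]
    kept = [
        i for i in range(len(data[a]))
        if all(data[name][i] is not None for name in columns_to_check)
    ]
    xs = [data[a][i] for i in kept]
    ys = [data[b][i] for i in kept]
    return pearson(xs, ys), len(kept)

# Hypothetical toy data: x and y are complete, z has two missing values.
toy = {
    "x": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "y": [2.0, 1.0, 4.0, 3.0, 6.0, 5.0],
    "z": [5.0, None, 7.0, None, 9.0, 1.0],
}

r_pair, n_pair = correlate(toy, "x", "y", casewise=False)
r_case, n_case = correlate(toy, "x", "y", casewise=True)
print("pairwise:", n_pair, "cases, r =", round(r_pair, 4))
print("casewise:", n_case, "cases, r =", round(r_case, 4))
```

Even though x and y have no missing values, the casewise run uses fewer cases because of the gaps in z, so the x-y correlation changes. That is the same pattern as in the output on this page, where the Instat coefficients differ from the pairwise ones.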
Using these variables:
gini (inequality), phone_kpop (phone lines per 1,000 population),
c-arable (land cultivated for crops like wheat, maize, and rice
that are replanted after each harvest), gdpcap (GDP per capita), IMR
(infant mortality rate) and literacy (literacy rate).
Pairwise deletion of cases.
*****************
MicrOsiris
*****************
                       V10         V15       V16       V19       V36
                      gini  phone_kpop  c-arable    gdpcap       IMR
phone_kpop V15     -0.4101
c-arable   V16     -0.4310      0.0236
gdpcap     V19     -0.3597      0.7845    0.0014
IMR        V36      0.3555     -0.6597   -0.1165   -0.5196
literacy   V37     -0.2679      0.5955    0.1046    0.4671   -0.7191
*****************
WinIDAMS
*****************
Use this setup:
http://gsociology.icaap.org/methods/PD_data_cia_nine_comma_w3.set
           VAR          10        15        16        19        28
phone_kpop  15     -0.4101
c-arable    16     -0.4310    0.0236
gdpcap      19     -0.3597    0.7845    0.0014
IMR         28      0.3555   -0.6597   -0.1165   -0.5196
literacy    29     -0.2676    0.5951    0.1051    0.4671   -0.7195
(V10 is gini)
The correlation coefficients are slightly different because I had to
format the data set slightly differently, e.g., with a different
number of decimal places.
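To see how the number of decimal places alone can move a correlation coefficient slightly, here is a small Python sketch. The values are hypothetical, chosen only to illustrate the effect. It is not a reconstruction of the actual formatting difference between the two files.

```python
import math

def pearson(xs, ys):
    """Pearson correlation of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical values stored with three decimal places.
x = [1.234, 2.345, 3.456, 4.567, 5.678]
y = [2.11, 3.92, 6.07, 7.88, 10.13]

# The same x values written out with one decimal place.
x_one_decimal = [round(value, 1) for value in x]

r_full = pearson(x, y)
r_rounded = pearson(x_one_decimal, y)
print("three decimals:", r_full)
print("one decimal:   ", r_rounded)
print("difference:    ", abs(r_full - r_rounded))
```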
*****************
PSPP
*****************
Correlations
|----------|-------------------|----|----------|--------|------|----|--------|
|          |                   |gini|phone_kpop|c_arable|gdpcap|IMR |literacy|
|----------+-------------------|----+----------+--------+------+----+--------|
|gini      |Pearson Correlation|1.00|      -.41|    -.43|  -.36| .36|    -.27|
|          |Sig. (2-tailed)    |    |       .00|     .00|   .00| .00|     .00|
|          |N                  | 122|       121|     122|   122| 122|     120|
|----------+-------------------|----+----------+--------+------+----+--------|
|phone_kpop|Pearson Correlation|-.41|      1.00|     .02|   .78|-.66|     .60|
|          |Sig. (2-tailed)    | .00|          |     .72|   .00| .00|     .00|
|          |N                  | 121|       228|     225|   227| 220|     212|
|----------+-------------------|----+----------+--------+------+----+--------|
|c_arable  |Pearson Correlation|-.43|       .02|    1.00|   .00|-.12|     .10|
|          |Sig. (2-tailed)    | .00|       .72|        |   .98| .08|     .13|
|          |N                  | 122|       225|     232|   227| 221|     215|
|----------+-------------------|----+----------+--------+------+----+--------|
|gdpcap    |Pearson Correlation|-.36|       .78|     .00|  1.00|-.52|     .47|
|          |Sig. (2-tailed)    | .00|       .00|     .98|      | .00|     .00|
|          |N                  | 122|       227|     227|   230| 223|     215|
|----------+-------------------|----+----------+--------+------+----+--------|
|IMR       |Pearson Correlation| .36|      -.66|    -.12|  -.52|1.00|    -.72|
|          |Sig. (2-tailed)    | .00|       .00|     .08|   .00|    |     .00|
|          |N                  | 122|       220|     221|   223| 223|     211|
|----------+-------------------|----+----------+--------+------+----+--------|
|literacy  |Pearson Correlation|-.27|       .60|     .10|   .47|-.72|    1.00|
|          |Sig. (2-tailed)    | .00|       .00|     .13|   .00| .00|        |
|          |N                  | 120|       212|     215|   215| 211|     215|
|----------|-------------------|----|----------|--------|------|----|--------|
*****************
JASP
*****************
Correlation Matrix
              gini  phone_kpop  c-arable    gdpcap       IMR  literacy
gini             —      -0.410    -0.431    -0.360     0.355    -0.268
phone_kpop                   —     0.024     0.785    -0.660     0.596
c-arable                               —     0.001    -0.116     0.105
gdpcap                                           —    -0.520     0.467
IMR                                                        —    -0.719
literacy                                                             —
Casewise deletion of cases.
*****************
Instat
*****************
            gini   phone_k   c_arabl    gdpcap       IMR   literac
gini      1.0000
phone_k  -0.4159    1.0000
c_arabl  -0.4388    0.1867    1.0000
gdpcap   -0.3681    0.9229    0.0961    1.0000
IMR       0.3580   -0.7194   -0.2125   -0.6610    1.0000
literac  -0.2720    0.6118    0.1418    0.5440   -0.7388    1.0000
*****************
Excel
*****************
Excel doesn't do casewise deletion. I created a comparison data set
with no missing values, to compare results with Instat.
                    gini    phone_kpop      c-arable        gdpcap           IMR  literacy
gini                   1
phone_kpop  -0.415920344             1
c-arable    -0.438828794   0.186734057             1
gdpcap      -0.368095742   0.922881434   0.096105248             1
IMR           0.35798348  -0.719421668  -0.212493601  -0.660980617             1
literacy    -0.271988365   0.611754978   0.141757381   0.543983666  -0.738766709         1
Predicting gini
(inequality) from phone_kpop (phone lines per 1,000 population),
c-arable (land cultivated for crops like wheat, maize, and rice
that are replanted after each harvest), climate and North (degrees
from the equator).
*****************
Epi Info
*****************
Linear Regression
Variable     Coefficient  Std Error    F-test   P-Value
c_arable          -0.161      0.057    7.8573  0.006030
climate           -1.190      1.149    1.0717  0.302945
North             -0.217      0.033   43.7577  0.000000
phone_kpop        -0.004      0.004    0.7065  0.402507
CONSTANT          51.095      2.212  533.3821  0.000000
Correlation Coefficient: r^2= 0.52
Source        df  Sum of Squares  Mean Square  F-statistic
Regression     4        6447.825     1611.956       28.315
Residuals    106        6034.414       56.928
Total        110       12482.239
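The summary figures that the programs report (R-squared, adjusted R-squared, the F ratio and the standard error of estimate) can all be re-derived from the sums of squares and degrees of freedom in the Epi Info table. Here is a small Python sketch of that arithmetic, using only numbers printed above. The variable names are mine.

```python
import math

# Figures copied from the Epi Info ANOVA table above.
ss_regression = 6447.825
ss_residual = 6034.414
df_regression = 4
df_residual = 106

ss_total = ss_regression + ss_residual
df_total = df_regression + df_residual

# Mean squares are sums of squares divided by their degrees of freedom.
ms_regression = ss_regression / df_regression
ms_residual = ss_residual / df_residual

r_squared = ss_regression / ss_total
# Adjusted R-squared compares residual and total variance per degree of freedom.
adj_r_squared = 1 - ms_residual / (ss_total / df_total)
f_ratio = ms_regression / ms_residual
std_error_of_estimate = math.sqrt(ms_residual)

print("total sum of squares:      ", ss_total)
print("R-squared:                 ", r_squared)
print("adjusted R-squared:        ", adj_r_squared)
print("F ratio:                   ", f_ratio)
print("standard error of estimate:", std_error_of_estimate)
```

The results can be compared with the 0.5166 (adjusted 0.4983), 28.315 and 7.545 that the other programs print for the same model.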
*****************
MicrOsiris
*****************
Total case count: 111
STANDARD REGRESSION
THE DEPENDENT VARIABLE IS V1: gini
STANDARD ERROR OF ESTIMATE              7.55
F-RATIO FOR THE REGRESSION              28.315   PROBABILITY 0.00
MULTIPLE CORRELATION COEFFICIENT        0.7187   ADJUSTED 0.7059
FRACTION OF EXPLAINED VARIANCE          0.5166   ADJUSTED 0.4983
DETERMINANT OF THE CORRELATION MATRIX   0.43109
RESIDUAL DEGREES OF FREEDOM (N-K-1)     106
CONSTANT TERM                           51.095   STD. ERROR 2.21236
VARIABLE  NAME                   B      SIGMA(B)          BETA   SIGMA(BETA)
V15       phone_kpop  -0.37364E-02   0.44452E-02  -0.70633E-01   0.84032E-01
V16       c-arable    -0.16091       0.57403E-01  -0.21265       0.75862E-01
V20       climate     -1.1896        1.1492       -0.88728E-01   0.85710E-01
V29       North       -0.21671       0.32760E-01  -0.52881       0.79941E-01
MicrOsiris   Nov 15, 2006   REGRESSION   2
                      PARTIAL   PART  MARGINAL                 COVARIANCE
VARIABLE  NAME              R      R      RSQD  T-RATIO(PROB)       RATIO
V15       phone_kpop   -0.081  0.057    0.0032  0.8406 (.407)       0.354
V16       c-arable     -0.263  0.189    0.0358  2.8031 (.006)       0.207
V20       climate      -0.100  0.070    0.0049  1.0352 (.303)       0.379
V29       North        -0.541  0.447    0.1996  6.6150 (.000)       0.286
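Epi Info prints an F-test for each coefficient, while MicrOsiris prints a T-ratio. For a single coefficient the two are related by F = t squared, so the columns can be checked against each other. Here is a short Python sketch of that check, using the values printed above. The helper name and the tolerance are my own choices, the tolerance allowing for rounding in the printed figures.

```python
# Coefficient -> (Epi Info F-test, MicrOsiris T-RATIO), copied from the outputs above.
reported = {
    "phone_kpop": (0.7065, 0.8406),
    "c_arable": (7.8573, 2.8031),
    "climate": (1.0717, 1.0352),
    "North": (43.7577, 6.6150),
}

def squared_t_matches_f(f_value, t_value, relative_tolerance=0.001):
    """True when t squared equals F, allowing for rounding in the printed output."""
    return abs(t_value ** 2 - f_value) <= relative_tolerance * f_value

checks = {
    name: squared_t_matches_f(f_value, t_value)
    for name, (f_value, t_value) in reported.items()
}

for name, (f_value, t_value) in reported.items():
    print(name, "F =", f_value, "t^2 =", round(t_value ** 2, 4), "match:", checks[name])
```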
********************
WinIDAMS
********************
Using this setup:
http://gsociology.icaap.org/methods/pd_cia_giniregress.set
This is the last step:
Step no 4
Variable entered                          15  phone_kpop
F-level                                   0.706
T-level                                   0.841
Standard error of estimate                7.545
F ratio for the regression                28.315
Multiple correlation coefficient          0.71872   adjusted 0.70592
Fraction of explained variance (RSQD)     0.51656   adjusted 0.49832
Determinant of the correlation matrix     0.43109
Residual degrees of freedom (N-p-1)       106
Constant term                             51.095
                                                  Partial    Marg
Var. no.        B  Sigma(B)     Beta  Sigma(Beta)    RSQD    RSQD  T-ratio  Cov. ratio  Variable name
15        -0.0037    0.0044  -0.0706       0.0840  0.0066  0.0032   0.8405      0.3541  phone_kpop
16        -0.1609    0.0574  -0.2126       0.0759  0.0690  0.0358   2.8031      0.2075  c-arable
20        -1.1896    1.1492  -0.0887       0.0857  0.0100  0.0049   1.0352      0.3792  climate
26        -0.2167    0.0328  -0.5288       0.0799  0.2922  0.1996   6.6150      0.2863  North
**************** Listing of marginal R-squares for all potential predictors ***
                                              Categorical variables (all codes)
Step no.  Var. no.  Variable name  Marg rsqd  Marg RSQD  T-ratio                 Previously in (*)
4         15        phone_kpop        0.0032                                     *
4         16        c-arable          0.0358                                     *
4         20        climate           0.0049                                     *
4         26        North             0.1996                                     *
********************
Instat
********************
ANOVA for regression of gini on phone_k c_arable climate North
-------------------------------------------------------------------
Source        df        SS        MS   F value   Prob>F
-------------------------------------------------------------------
Regression     4   6447.83      1612     28.32   0.0000
Residual     106   6034.41    56.928
-------------------------------------------------------------------
Total        110   12482.2
-------------------------------------------------------------------
124 missing or zero-weighted cases
R-squared = 0.5166 (adjusted = 0.4983)
REGRESSION COEFFICIENTS
Y-variate: gini
-----------------------------------------------------------------------------------------
Param.    Estimate       SE       t  Prob>|t|   95% CI
-----------------------------------------------------------------------------------------
Const       51.095    2.212   23.10    0.0000     46.71     55.48
phone_k   -0.00374   0.0044   -0.84    0.4025   -0.0125    0.0051
c_arabl   -0.16091   0.0574   -2.80    0.0060   -0.2747   -0.0471
climate    -1.1896    1.149   -1.04    0.3029    -3.468     1.089
North     -0.21671   0.0328   -6.61    0.0000   -0.2817   -0.1518
-----------------------------------------------------------------------------------------
Note that I had to search to find these regression coefficients.
Instat makes them available only after you have run the regression.
********************
PSPP
********************
Model Summary
|---|--------|-----------------|--------------------------|
| R |R Square|Adjusted R Square|Std. Error of the Estimate|
|---|--------|-----------------|--------------------------|
|.72|     .52|              .50|                      7.55|
|---|--------|-----------------|--------------------------|
ANOVA
|----------|--------------|---|-----------|-----|------------|
|          |Sum of Squares| df|Mean Square|  F  |Significance|
|----------|--------------|---|-----------|-----|------------|
|Regression|       6447.83|  4|    1611.96|28.32|         .00|
|Residual  |       6034.41|106|      56.93|     |            |
|Total     |      12482.24|110|           |     |            |
|----------|--------------|---|-----------|-----|------------|
Coefficients
|----------|-----|----------|----|-----|------------|
|          |  B  |Std. Error|Beta|  t  |Significance|
|----------|-----|----------|----|-----|------------|
|(Constant)|51.09|      2.21| .00|23.10|         .00|
|phone_kpop|  .00|       .00|-.07| -.84|         .40|
|c_arable  | -.16|       .06|-.21|-2.80|         .01|
|climate   |-1.19|      1.15|-.09|-1.04|         .30|
|North     | -.22|       .03|-.53|-6.61|         .00|
|----------|-----|----------|----|-----|------------|
********************
JASP
********************
Linear Regression
Model      R     R²  Adjusted R²   RMSE
1      0.719  0.517        0.498  7.545

Model              Sum of Squares   df  Mean Square      F       p
1      Regression            6448    4      1611.96  28.32  < .001
       Residual              6034  106        56.93
       Total                12482  110

Model              Unstandardized  Standard Error  Standardized  t       p
1      intercept           51.095           2.212                .  < .001
       phone_kpop          -0.004           0.004        -0.071  .   0.402
       c-arable            -0.161           0.057        -0.213  .   0.006
       climate             -1.190           1.149        -0.089  .   0.303
       North               -0.217           0.033        -0.529  .  < .001
*****************
Instat
*****************
Simple Models - Normal Distribution, Two Samples
TINt 'phone_kpop' 'c_arable';test 0
Normal model, two samples
Column           phone_kpop  c_arable
Sample size             228       232
Minimum             0.16917         0
Maximum              1385.1     62.11
Range                  1385     62.11
Mean                 244.91    13.447
Std. deviation       241.57    13.028
Pooled standard deviation = 170.319
Difference between means = 231.47  s.e. of difference = 15.883 with 458 d.f.
95% confidence interval for the difference between means 200.25 to 262.68
t value testing mean difference=0 is 14.57
Significance level is 0.0000 (0.00%) for 2 sided test
*****************
Excel
*****************
t-Test: Two-Sample Assuming Equal Variances
                               phone_kpop     c-arable
Mean                          244.9124484  13.44711207
Variance                      58355.36494  169.7312336
Observations                          228          232
Pooled Variance               29008.46235
Hypothesized Mean Difference            0
df                                    458
t Stat                        14.57323966
P(T<=t) one-tail              4.25214E-40
t Critical one-tail           1.648187415
P(T<=t) two-tail              8.50428E-40
t Critical two-tail           1.965157018
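The pooled variance, the standard error of the difference and the t statistic in the Instat and Excel outputs can be reproduced from the means, variances and sample sizes alone. Here is a short Python sketch of the equal-variance two-sample t-test, using the summary figures from the Excel output above. The function name is mine.

```python
import math

def pooled_t(mean1, var1, n1, mean2, var2, n2):
    """Equal-variance two-sample t-test computed from summary statistics.

    Returns (pooled variance, s.e. of the difference, t statistic, df).
    """
    df = n1 + n2 - 2
    # Weight each sample variance by its own degrees of freedom.
    pooled_var = ((n1 - 1) * var1 + (n2 - 1) * var2) / df
    se_diff = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    t_stat = (mean1 - mean2) / se_diff
    return pooled_var, se_diff, t_stat, df

# Means, variances and observation counts copied from the Excel output above.
pooled_var, se_diff, t_stat, df = pooled_t(
    244.9124484, 58355.36494, 228,   # phone_kpop
    13.44711207, 169.7312336, 232,   # c-arable
)

print("pooled variance:       ", pooled_var)
print("s.e. of the difference:", se_diff)
print("t statistic:           ", t_stat)
print("degrees of freedom:    ", df)
```

Each variable keeps all of its own non-missing cases here (228 and 232), which is why the sample sizes differ.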
last validated 12/25/08