Comparing free statistical software
Handling missing data

This page shows some output from Epi Info, MicrOsiris and WinIDAMS. For comparison, there is output from Stat4U and Excel, using a version of the data set with no missing values. The output covers correlations and regression. I did this in November 2006 using the most recent versions of the software at that time. I used a version of PD-Plus, available on my data page.

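The main finding is that all programs give the same results, apart from very small differences in the WinIDAMS correlations that come from how its data set had to be formatted.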


MicrOsiris  http://www.microsiris.com/
Epi Info   http://www.cdc.gov/epiinfo/index.htm
Stat4U, replaced by openstat   http://www.statpages.org/miller/openstat/   which I haven't reviewed yet.
WinIDAMS   http://www.unesco.org/webworld/idams/ 


Using this data set with blanks for missing:
http://gsociology.icaap.org/data/PD_data_cia.csv
listed here   http://gsociology.icaap.org/dataupload.html

and this data set with -999 or -9 for missing (for WinIDAMS)
http://gsociology.icaap.org/methods/PD_data_cia_nine_comma_w3.csv
saved as this WinIDAMS data set and dictionary:
http://gsociology.icaap.org/methods/PD_data_cia_nine_comma_w3.dat
http://gsociology.icaap.org/methods/PD_data_cia_nine_comma_w3.dic

and this version, with no missing values, and only including the variables used in the regressions below:
http://gsociology.icaap.org/methods/PD_data_cia_stat4u_nomiss.csv

I include this data set because I used it with Excel and Stat4U to compare the results with the other programs.
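
The two conventions above (blanks for missing, or a code such as -999 or -9) can be brought to a common form before analysis. The following is a minimal sketch in Python, which is not one of the programs compared on this page. The function names `parse_cell` and `read_numeric_csv` are made up for illustration, the sample rows are invented and are not taken from PD_data_cia.csv, and the sketch assumes every column is numeric (a column of country names would need separate handling).

```python
import csv
import io

def parse_cell(text, missing_codes=()):
    """Return a float, or None when the cell is blank or holds a missing code."""
    text = (text or "").strip()
    if text == "":
        return None
    value = float(text)
    if value in missing_codes:
        return None
    return value

def read_numeric_csv(handle, missing_codes=()):
    """Read a CSV with a header row into a list of {column: float-or-None} dicts."""
    reader = csv.DictReader(handle)
    return [
        {name: parse_cell(cell, missing_codes) for name, cell in row.items()}
        for row in reader
    ]

# Invented rows for illustration only (assumption: all columns are numeric).
blank_style = "gini,phone_kpop\n40.0,120\n,35\n55.5,\n"
coded_style = "gini,phone_kpop\n40.0,120\n-999,35\n55.5,-9\n"

rows_blank = read_numeric_csv(io.StringIO(blank_style))
rows_coded = read_numeric_csv(io.StringIO(coded_style), missing_codes=(-999.0, -9.0))

# Both conventions end up as the same records once missing values are None.
print(rows_blank == rows_coded)
```

Treating the missing codes as a per-call argument mirrors the point made below that some programs want the code declared explicitly; a real file might need different codes for different variables.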


1. MicrOsiris and Epi Info read files with blanks for missing values. Stat4U needs a code for missing values, like -9 or -9.99.
For Stat4U, all variables can share the same missing-value code, e.g., -9.99.

2. For WinIDAMS, each variable has to have its own 'missing' code. I used -999 or -9, and these have to be clearly defined in the file definition. See the .dic file listed above.

3. I can get Stat4U to do correlations when there are missing data, but I'm having problems getting it to do regression when there are missing data. I've been told by the author that it's because my data set has some unusual characteristics, so other folks may have better luck.

4. MicrOsiris and Stat4U use a .csv file, and Epi Info can read Excel or .csv files.

5. When using MicrOsiris, import the .csv file, then call up commands.

6. Epi Info doesn't do correlation. You need to use regression with 2 variables to get the correlation coefficient.

7. MicrOsiris gives the same correlations as Stat4U. Stat4U and MicrOsiris also give significance levels and t-tests.

8. Regression: The basic output for Epi Info and MicrOsiris seems to be the same as the block entry regression in Stat4U. The stepwise regression for WinIDAMS also seems to give the same output.
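
The workaround in point 6 (getting a correlation out of a two-variable regression) rests on a standard identity: in a regression with one predictor, R squared is the square of the Pearson correlation, and the correlation takes the sign of the slope. Here is a small Python sketch of that idea. It is not Epi Info code, the function names are made up, and the numbers are invented for illustration.

```python
import math

def simple_regression(x, y):
    """Least squares of y on a single predictor x.

    Returns (slope, intercept, r_squared).
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_squared = (sxy * sxy) / (sxx * syy)
    return slope, intercept, r_squared

def correlation_from_regression(slope, r_squared):
    """Pearson r is the square root of R squared, carrying the sign of the slope."""
    return math.copysign(math.sqrt(r_squared), slope)

# Invented numbers for illustration only.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [9.0, 7.0, 8.0, 4.0, 2.0]

slope, intercept, r_squared = simple_regression(x, y)
r = correlation_from_regression(slope, r_squared)
print(round(slope, 3), round(r_squared, 3), round(r, 3))
```

So if a program reports only r^2 and the coefficient for a two-variable regression, taking the square root and attaching the coefficient's sign recovers the correlation.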



Just Correlations

Using these variables: gini (inequality), phone_kpop (phone lines per 1,000 population), c-arable (land cultivated for crops like wheat, maize, and rice that are replanted after each harvest), gdp per capita, infant mortality rate and literacy rate.

Pairwise deletion of cases.
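
Pairwise deletion means each correlation is computed from every case that has values for both variables in that pair, so different cells of the matrix can be based on different numbers of cases. A rough Python sketch follows. It is not how any of the programs below implement it, the function name is made up, and the values are invented.

```python
import math

def pairwise_correlation(x, y):
    """Pearson r over the cases where both x and y are present (None = missing).

    Returns (r, n), where n is the number of complete pairs actually used.
    """
    pairs = [(a, b) for a, b in zip(x, y) if a is not None and b is not None]
    n = len(pairs)
    mean_x = sum(a for a, _ in pairs) / n
    mean_y = sum(b for _, b in pairs) / n
    sxx = sum((a - mean_x) ** 2 for a, _ in pairs)
    syy = sum((b - mean_y) ** 2 for _, b in pairs)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in pairs)
    return sxy / math.sqrt(sxx * syy), n

# Invented values for illustration only; None marks a missing value.
var_a = [1.0, 2.0, None, 4.0, 5.0]
var_b = [2.0, None, 6.0, 8.0, 10.0]
var_c = [5.0, 3.0, 4.0, None, 1.0]

r_ab, n_ab = pairwise_correlation(var_a, var_b)
r_ac, n_ac = pairwise_correlation(var_a, var_c)

# For contrast: listwise deletion keeps only cases complete on all variables.
complete_cases = sum(1 for row in zip(var_a, var_b, var_c) if None not in row)
print(n_ab, n_ac, complete_cases)
```

In this toy example each pair keeps three cases, while listwise deletion would keep only two, which is why the two approaches can give different correlations on the same file.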

*****************
STAT4U
*****************

             Correlations
Variables       gini  phone_kpop    c-arable      gdpcap         IMR    literacy 

      gini      1.000     
phone_kpop     -0.410       1.000   
  c-arable     -0.431       0.024       1.000
    gdpcap     -0.360       0.785       0.001       1.000
       IMR      0.355      -0.660      -0.116      -0.520       1.000 
  literacy     -0.268       0.596       0.105       0.467      -0.719       0.000





*****************
MicrOsiris
*****************

                      V10        V15        V16        V19        V36
                     gini phone_kpop   c-arable     gdpcap        IMR
phone_kpop V15    -0.4101
c-arable   V16    -0.4310     0.0236
gdpcap     V19    -0.3597     0.7845     0.0014
IMR        V36     0.3555    -0.6597    -0.1165    -0.5196
literacy   V37    -0.2679     0.5955     0.1046     0.4671    -0.7191
 



*****************
WinIDAMS
*****************
Using this setup:
http://gsociology.icaap.org/methods/PD_data_cia_nine_comma_w3.set

                           VAR     10       15       16       19       28

 phone_kpop                 15  -0.4101
 c-arable                   16  -0.4310   0.0236
 gdpcap                     19  -0.3597   0.7845   0.0014
 IMR                        28   0.3555  -0.6597  -0.1165  -0.5196
 literacy                   29  -0.2676   0.5951   0.1051   0.4671  -0.7195
 
(V10 is gini)
The correlation coefficients are slightly different because I had to format the data set slightly differently, e.g., a different number of decimal places.



Regressions

Predicting gini (inequality) from phone_kpop (phone lines per 1,000 population), c-arable (land cultivated for crops like wheat, maize, and rice that are replanted after each harvest), climate and North (degrees from the equator).

*****************
Epi Info
*****************

Linear Regression


Variable     Coefficient   Std Error   F-test     P-Value
c_arable     -0.161        0.057       7.8573     0.006030
climate      -1.190        1.149       1.0717     0.302945
North        -0.217        0.033       43.7577    0.000000
phone_kpop   -0.004        0.004       0.7065     0.402507
CONSTANT     51.095        2.212       533.3821   0.000000


Correlation Coefficient: r^2= 0.52


Source       df    Sum of Squares   Mean Square   F-statistic
Regression   4     6447.825         1611.956      28.315
Residuals    106   6034.414         56.928
Total        110   12482.239
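
The summary statistics that the programs report can all be recomputed from the sums of squares and degrees of freedom in the Epi Info table above, which is a handy way to check that the programs agree. A short Python sketch of that arithmetic, using the standard formulas (the variable names are mine):

```python
import math

# Values copied from the Epi Info ANOVA table above.
ss_regression = 6447.825
ss_residual = 6034.414
ss_total = 12482.239
df_regression = 4
df_residual = 106
df_total = 110

ms_regression = ss_regression / df_regression
ms_residual = ss_residual / df_residual

f_statistic = ms_regression / ms_residual            # reported as 28.315
r_squared = ss_regression / ss_total                 # reported as 0.52 (0.5166 elsewhere)
adjusted_r_squared = 1 - (1 - r_squared) * df_total / df_residual  # 0.4983 elsewhere
multiple_r = math.sqrt(r_squared)                    # 0.7187 in MicrOsiris and WinIDAMS
std_error_of_estimate = math.sqrt(ms_residual)       # 7.55 in MicrOsiris and Stat4U

print(round(f_statistic, 3), round(r_squared, 4), round(adjusted_r_squared, 4))
print(round(multiple_r, 4), round(std_error_of_estimate, 3))
```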



*****************
MicrOsiris
*****************

Total case count:       111
 
STANDARD REGRESSION
 
THE DEPENDENT VARIABLE IS V1: gini
 
     STANDARD ERROR OF ESTIMATE                7.55
     F-RATIO FOR THE REGRESSION              28.315    PROBABILITY  0.00
     MULTIPLE CORRELATION COEFFICIENT        0.7187    ADJUSTED   0.7059
     FRACTION OF EXPLAINED VARIANCE          0.5166    ADJUSTED   0.4983
     DETERMINANT OF THE CORRELATION MATRIX  0.43109
     RESIDUAL DEGREES OF FREEDOM (N-K-1)        106
 
     CONSTANT TERM    51.095                           STD. ERROR   2.21236
 
 VARIABLE     NAME                   B         SIGMA(B)      BETA       SIGMA(BETA)
 
   V15  phone_kpop              -0.37364E-02  0.44452E-02 -0.70633E-01  0.84032E-01
   V16  c-arable                -0.16091      0.57403E-01 -0.21265      0.75862E-01
   V20  climate                  -1.1896       1.1492     -0.88728E-01  0.85710E-01
   V29  North                   -0.21671      0.32760E-01 -0.52881      0.79941E-01
MicrOsiris
Nov 15, 2006                                                                                                    REGRESSION    2
 
 
                               PARTIAL  PART  MARGINAL               COVARIANCE
 VARIABLE     NAME                R       R     RSQD    T-RATIO(PROB)   RATIO
 
   V15  phone_kpop              -0.081  0.057  0.0032   0.8406 (.407)   0.354
   V16  c-arable                -0.263  0.189  0.0358   2.8031 (.006)   0.207
   V20  climate                 -0.100  0.070  0.0049   1.0352 (.303)   0.379
   V29  North                   -0.541  0.447  0.1996   6.6150 (.000)   0.286


********************
WinIDAMS
********************

Using this setup:
http://gsociology.icaap.org/methods/pd_cia_giniregress.set
This is the last step:

  Step no   4

      Variable entered     15     phone_kpop             

      F-level            0.706
      T-level            0.841

          Standard error of estimate                 7.545   
          F ratio for the regression                28.315
          Multiple correlation coefficient         0.71872          adjusted        0.70592
          Fraction of explained variance (RSQD)    0.51656          adjusted        0.49832
          Determinant of the correlation matrix    0.43109   
          Residual degrees of freedom (N-p-1)          106
          Constant term                             51.095   


                                                            Partial
  Var. no.        B       Sigma(B)     Beta    Sigma(Beta)   RSQD     Marg RSQD  T-ratio  Cov. ratio  Variable name
    15         -0.0037     0.0044    -0.0706     0.0840     0.0066     0.0032     0.8405     0.3541   phone_kpop             
    16         -0.1609     0.0574    -0.2126     0.0759     0.0690     0.0358     2.8031     0.2075   c-arable               
    20         -1.1896     1.1492    -0.0887     0.0857     0.0100     0.0049     1.0352     0.3792   climate                
    26         -0.2167     0.0328    -0.5288     0.0799     0.2922     0.1996     6.6150     0.2863   North                  

 **************** Listing of marginal R-squares for all potential predictors ***


    Step no.     Var. no.     Variable name              Marg rsqd     Categorical variables (all codes)        Previously in (*)
                                                                             Marg RSQD         T-ratio

        4          15      phone_kpop                      0.0032                                                       *
        4          16      c-arable                        0.0358                                                       *
        4          20      climate                         0.0049                                                       *
        4          26      North                           0.1996                                                       *







********************
Stat4U block entry
********************
Using the data set with no missing values.

Dependent variable: gini

Variable       Beta      B         Std.Err.  t         Prob.>t   VIF       TOL
phone_kpop    -0.071    -0.004     0.004    -0.841     0.402     1.548     0.646
  c-arable    -0.213    -0.161     0.057    -2.803     0.006     1.262     0.792
   climate    -0.089    -1.190     1.149    -1.035     0.303     1.611     0.621
     North    -0.529    -0.217     0.033    -6.615     0.000     1.401     0.714
 Intercept     0.000    51.095     2.212    23.095     0.000

SOURCE      DF        SS        MS        F      Prob.>F
Regression  4   6447.825  1611.956    28.315     0.0000
Residual  106   6034.414    56.928
Total     110  12482.239

R2 = 0.5166, F =    28.32, D.F. = 4 106, Prob>F = 0.0000
Adjusted R2 = 0.4983

Standard Error of Estimate =     7.55
F = 28.315 with probability =  0.000
Block 1 met entry requirements
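
Two relationships help when lining up this output with the others. Epi Info reports an F-test for each coefficient where Stat4U reports t, and for a single coefficient F is the square of t. Also, VIF is the reciprocal of TOL. A quick Python check using the numbers printed on this page (small mismatches are expected because the printed values are rounded):

```python
# t values from the Stat4U table above; F-tests from the Epi Info table earlier on this page.
stat4u_t = {"phone_kpop": -0.841, "c-arable": -2.803, "climate": -1.035, "North": -6.615}
epi_info_f = {"phone_kpop": 0.7065, "c-arable": 7.8573, "climate": 1.0717, "North": 43.7577}

# VIF and TOL columns from the Stat4U table above.
stat4u_vif = {"phone_kpop": 1.548, "c-arable": 1.262, "climate": 1.611, "North": 1.401}
stat4u_tol = {"phone_kpop": 0.646, "c-arable": 0.792, "climate": 0.621, "North": 0.714}

for name in stat4u_t:
    t_squared = stat4u_t[name] ** 2
    vif_from_tol = 1.0 / stat4u_tol[name]
    print(
        f"{name:<11} t^2 = {t_squared:8.4f}  F = {epi_info_f[name]:8.4f}   "
        f"1/TOL = {vif_from_tol:6.3f}  VIF = {stat4u_vif[name]:6.3f}"
    )
```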

********************
Excel regression
********************
Using the data set with no missing values.

Regression Statistics
Multiple R          0.718721
R Square            0.51656
Adjusted R Square   0.498317
Standard Error      7.545093
Observations        111

ANOVA
             df    SS         MS         F          Significance F
Regression   4     6447.825   1611.956   28.31549   5.29E-16
Residual     106   6034.414   56.92843
Total        110   12482.24

               Coefficients   Standard Error   t Stat     P-value    Lower 95%   Upper 95%   Lower 95.0%   Upper 95.0%
Intercept      51.09461       2.212361         23.09507   3.65E-43   46.70839    55.48084    46.70839      55.48084
X Variable 1   -0.00374       0.004445         -0.84056   0.402489   -0.01255    0.005077    -0.01255      0.005077
X Variable 2   -0.16091       0.057403         -2.80309   0.00602    -0.27471    -0.0471     -0.27471      -0.0471
X Variable 3   -1.18963       1.149161         -1.03522   0.302922   -3.46796    1.08869     -3.46796      1.08869
X Variable 4   -0.21671       0.03276          -6.61496   1.56E-09   -0.28166    -0.15176    -0.28166      -0.15176

(Judging by the matching coefficients in the other outputs, X Variable 1 to 4 are phone_kpop, c-arable, climate and North.)




last validated 1/13/08