
Sunday, February 18, 2018

Lowess smooth in logistic regression using R, based on Applied Logistic Regression, 3rd ed., Hosmer

To determine linearity in the logit for a continuous variable, one can use a lowess-smoothed scatterplot, as described in Applied Logistic Regression, 3rd Edition, chapter 4.2.1, Methods to Examine the Scale of a Continuous Covariate in the Logit.

In the book this is illustrated using the Scale_Example data from the book and Stata. I don't have Stata, but this can be replicated with good agreement in R.

For this, I will use the logitloess function described here.



The data sets used in the book can be obtained here: http://wiley.mpstechnologies.com/wiley/BOBContent/searchLPBobContent.do

# read in Scale_Example.txt that contains data for the example
mydata = read.table("~/Applied_Logistic_Regression_by_Hosmer_Lemeshow_3th_ed_2013/Scale Example/Scale_Example.txt")

# separate data into x and y
y = mydata[,1]
x = mydata[,2]

# plot lowess graph
logitloess(x, y)

The resulting plot is very similar to the one provided in the book (Figure 4.1). It is not exactly the same, which could be due to differences in the implementation of the lowess/logitloess methods between R and Stata. Nevertheless, the conclusions from the plot obtained in R are the same as those in the book.
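Since the original logitloess R source is only linked, not shown, here is a minimal Python sketch of the same idea. It uses a simple moving-average smoother as a stand-in for lowess; the function name, window size, and data below are my own assumptions, not from the book:

```python
import math

def logit(p, eps=1e-6):
    p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
    return math.log(p / (1 - p))

def logitloess_sketch(x, y, window=5):
    """Moving-average stand-in for lowess: smooth the 0/1 outcomes
    over x-sorted neighbours, then transform to the logit scale."""
    pairs = sorted(zip(x, y))
    xs = [p[0] for p in pairs]
    ys = [p[1] for p in pairs]
    half = window // 2
    smoothed = []
    for i in range(len(ys)):
        lo, hi = max(0, i - half), min(len(ys), i + half + 1)
        smoothed.append(sum(ys[lo:hi]) / (hi - lo))
    return xs, [logit(p) for p in smoothed]

# toy binary outcome that switches from 0 to 1 as x grows
xs, ls = logitloess_sketch([1, 2, 3, 4, 5, 6, 7, 8],
                           [0, 0, 0, 0, 1, 1, 1, 1])
```

Plotting ls against xs then shows whether the covariate is roughly linear in the logit; a real lowess smoother weights neighbours by distance instead of averaging them equally.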

Thursday, November 26, 2009

Paired t-test

I always forget what the P value must be to reject or not reject the null hypothesis in a paired t-test. So let's explain with an example.

H0 - null hypothesis - there is no significant difference between method A and B
H1 - alternative hypothesis - there is a difference (two-tail test)

For example, with a 5% (a=0.05) level of significance:

Based on the above results I can say that: since P=0.009103483 and this is lower than a=0.05 (P<0.05), I can claim that:

I'm 95% (a=0.05) sure that there is a significant difference between A and B, because P<0.05.


On the other hand, we can have:

In this case P=0.649507752, hence I can claim that:

I'm 95% (a=0.05) sure that there is no significant difference between A and B, because P>0.05.

The above paired t-test was performed in Excel.
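The same test can be sketched in plain Python. The data are hypothetical (the Excel numbers from the post are not reproduced here), and the critical value t(0.975, df=4) = 2.776 is taken from a t table, since the Python standard library has no t quantile function:

```python
import math
from statistics import mean, stdev

# hypothetical paired measurements from methods A and B
A = [10, 12, 9, 11, 13]
B = [8, 11, 7, 10, 10]

d = [a - b for a, b in zip(A, B)]        # paired differences
n = len(d)
t = mean(d) / (stdev(d) / math.sqrt(n))  # paired t statistic

t_crit = 2.776                 # t(0.975, df = n-1 = 4), from a t table
reject_H0 = abs(t) > t_crit    # True -> significant difference at a=0.05
```

Here t is about 4.81, which exceeds 2.776, so H0 would be rejected; with a P-value function (Excel's TTEST, scipy, etc.) the same decision follows from P<0.05.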

Monday, November 09, 2009

Effect size of matched and unmatched data in Matlab

"In statistics, an effect size is a measure of the strength of the relationship between two variables in a statistical population, or a sample-based estimate of that quantity" from here.

The script below calculates the effect size for matched (paired) and unmatched (unpaired) data as described in [1].

So, if the C and E sets are matched, then the effect size d is 1.265. If the C and E sets are unmatched, then the effect size d is also 1.265. The two sizes are equal for this example only.
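The original Matlab script is not shown above, so here is a plain-Python sketch of the between-groups effect size from [1]: Cohen's d with a pooled standard deviation. The C and E data are hypothetical, chosen so that d comes out near the 1.265 mentioned above:

```python
import math
from statistics import mean, variance

# hypothetical control (C) and experimental (E) groups
C = [1, 2, 3, 4, 5]
E = [3, 4, 5, 6, 7]

nC, nE = len(C), len(E)
# pooled standard deviation of the two groups
sp = math.sqrt(((nC - 1) * variance(C) + (nE - 1) * variance(E))
               / (nC + nE - 2))
d = (mean(E) - mean(C)) / sp  # effect size, ~1.265 for these data
```

A central point of [1] is that for matched designs the effect size should still be based on the original group standard deviations, not on a conversion of the paired t statistic, which would overstate it.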

References

[1] Dunlop, W. P., Cortina, J. M., Vaslow, J. B., & Burke, M. J. (1996).
Meta-analysis of experiments with matched groups or repeated measures designs. Psychological Methods, 1, 170-177.

Tuesday, June 02, 2009

Paired t-test assumption of normality

It is important to remember that the paired t-test is valid if the differences of the pairs of variables are normally distributed. The variables themselves do not need to be normally distributed, as the paired t-test only operates on the differences. This observation is based on [Bland2000, p. 161 and 260].

Normality of a sample can be evaluated using Q-Q plots [Bland2000, ch. 7.5]. However, the problem with these plots is that they are not objective: to one person a sample looks normally distributed, to another only approximately so, and to a third not at all. For that reason quantitative methods are better. One such method is the Shapiro-Wilk test; another is the Kolmogorov-Smirnov test. For small samples (up to about 50) Shapiro-Wilk is generally more accurate [Marques2003, p. 157].

If the sample is found to be non-normal, the Wilcoxon signed-rank test (also called the Wilcoxon matched-pairs test) can be used [Bland2000, ch. 12.3].

All the above tests can be performed easily in SPSS.
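For illustration, the points of a normal Q-Q plot can also be computed by hand. A minimal Python sketch follows; the plotting positions (i - 0.5)/n are one common convention, not the only one, and the sample data are made up:

```python
from statistics import NormalDist

def qq_points(sample):
    """Pair each sorted observation with the standard-normal
    quantile of its plotting position (i - 0.5)/n."""
    xs = sorted(sample)
    n = len(xs)
    nd = NormalDist()  # standard normal
    theo = [nd.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    return list(zip(theo, xs))  # (theoretical, observed) pairs

# for a roughly normal sample the points fall near a straight line
pts = qq_points([4.1, 5.0, 4.7, 5.3, 4.9, 5.1, 4.6, 5.2])
```

Plotting the observed values against the theoretical quantiles gives the Q-Q plot; marked curvature or S-shapes suggest non-normality, which is exactly the subjectivity problem the formal tests avoid.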

References
Bland, Martin. An Introduction to Medical Statistics, 3rd ed., Oxford University Press, 2000.
Marques de Sá, J. P. Applied Statistics: Using SPSS, STATISTICA, and MATLAB, Springer, 2003.

Tuesday, January 27, 2009

Matlab: making Bland and Altman plots

To do the Bland and Altman plot, we have to compute the differences between the instruments and the mean of both instruments for all the paired values.

For instance:
A=[749 583 740 235 735 971 867 86 366 369]
B=[685 598 789 368 206 87 772 206 388 552]
blandAltmanPlot(A,B);



function blandAltmanPlot(A,B)
%reference: Y H Chan, Biostatistics 104:
%Correlational Analysis,
%Singapore Med J 2003 Vol 44(12) : 614-619

meanAB=(A+B)./2;
difff=A-B;
meanDiff=mean(difff);
stdDiff=std(difff);

meanp2D=meanDiff+2*stdDiff;
meanm2D=meanDiff-2*stdDiff;
n=length(difff);
minD=min(meanAB)-0.1;
maxD=max(meanAB)+0.1;

figure;
plot(meanAB,difff,'.k')
hold on;
plot([minD; maxD],ones(1,2)*meanp2D,'--k');
text(minD+0.01,meanp2D+0.01,'Mean + 2*SD');
plot([minD; maxD],ones(1,2)*meanm2D,'--k');
text(minD+0.01,meanm2D+0.01,'Mean - 2*SD');
plot([minD; maxD],ones(1,2)*meanDiff,'--k');
xlim([minD maxD]);
xlabel('(A+B)/2');
ylabel('A-B');

The Excel spreadsheet with the Bland and Altman plot example is here. Both programs calculate the same statistics.

Excel: Confidence interval

You can use either the t or the z distribution to create a confidence interval around the sample mean. Use t when the sample size is small (n<30).

Hence, at the 95% confidence level (a=0.05), our interval is mean±interval, i.e. 182.4±20.719 and 45±4.6924, respectively.
For illustration, the z-statistic interval is also shown (19.4 and 4.39, respectively). As can be seen, using z instead of t makes the interval smaller.

If one wants, for example, a 99% confidence level (a=0.01), one must use TINV(0.01,n-1) for the t distribution and CONFIDENCE(0.01,STDEV,n) for the z distribution.
In our case, with 0.01 we get a t-distribution interval of 28.3207 for the first example and 6.4141 for the second. The z intervals are 25.50 and 5.77, respectively.

Briefly: the interval using the z distribution is narrower than the one using the t distribution.

The same but in R:
>X1 <- scan()
1: 205 179 185 210 128 145 177 117 221 159 205 128 165 180 198 158 132 283 269 204
> a <- mean(X1)
> s <- sd(X1)
> n <- length(X1)
> errZ <- qnorm(0.975)*s/sqrt(n)
> errT <- qt(0.975,df=n-1)*s/sqrt(n)
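The z part of the same calculation can also be done in plain Python. The t interval is omitted because the Python standard library has no t quantile function (it would need scipy or a table):

```python
import math
from statistics import NormalDist, mean, stdev

X1 = [205, 179, 185, 210, 128, 145, 177, 117, 221, 159,
      205, 128, 165, 180, 198, 158, 132, 283, 269, 204]

n = len(X1)
a = mean(X1)                     # sample mean, 182.4
s = stdev(X1)                    # sample standard deviation (n-1 formula)
z = NormalDist().inv_cdf(0.975)  # two-sided 95% z critical value, ~1.96
errZ = z * s / math.sqrt(n)      # half-width of the z interval, ~19.4
```

This reproduces the 182.4 ± 19.4 z interval from the first example above.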


Sunday, January 25, 2009

Matlab: Simple linear regression analysis

Simple linear regression analysis is used to predict the values of one variable (the dependent variable, Y) based on the values of another variable (the independent variable, X). In simple linear regression analysis, the relationship between the two variables is represented by a straight line (the prediction line, Y = aX + b) fitted to the scatter plot of those variables.

First, let's define our working example. Assume that we want to examine the relationship between store size (in thousands of square feet; the X variable) and annual sales (in millions of dollars; the Y variable). We have a sample of 14 stores with the following data:
X= [1.7 1.6 2.8 5.6 1.3 2.2 1.3 1.1 3.2 1.5 5.2 4.6 5.8 3.0];
Y= [3.7 3.9 6.7 9.5 3.4 5.6 3.7 2.7 5.5 2.9 10.7 7.6 11.8 4.1];

Determining prediction line (Y=aX+b)

The prediction line is Y = aX + b, where a is the slope of the fitted line and b is its intercept with the Y axis. To determine these parameters we can use Matlab/Octave code:
p=polyfit(X,Y,1);
a=p(1); %slope
b=p(2); %intercept

In our case a=1.67 and b=0.9645; hence the line is: Y = 1.67X + 0.9645.
What does this mean? The slope means that for each increase of 1 unit in X, the mean value of Y is estimated to increase by 1.67 units. In our case, if the size of a store increases by 1 thousand square feet, mean annual sales are estimated to increase by 1.67 million dollars. The intercept, on the other hand, represents the mean value of Y when X equals 0. In our case the intercept has no practical meaning because the size of a store cannot be 0.
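The same fit can be cross-checked in plain Python with the textbook least-squares formulas (this is only a check of the polyfit result, not part of the original post):

```python
from statistics import mean

X = [1.7, 1.6, 2.8, 5.6, 1.3, 2.2, 1.3, 1.1, 3.2, 1.5, 5.2, 4.6, 5.8, 3.0]
Y = [3.7, 3.9, 6.7, 9.5, 3.4, 5.6, 3.7, 2.7, 5.5, 2.9, 10.7, 7.6, 11.8, 4.1]

mx, my = mean(X), mean(Y)
Sxx = sum((x - mx) ** 2 for x in X)                     # sum of squares of X
Sxy = sum((x - mx) * (y - my) for x, y in zip(X, Y))    # cross products

a = Sxy / Sxx    # slope, ~1.67
b = my - a * mx  # intercept, ~0.9645
```

The slope and intercept agree with the polyfit values above to rounding.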

The scatter diagram and fitted regression line are shown below:


Standard error of the estimate - Syx


Prediction using the simple regression line is not perfect; hence, a statistic that measures the variability of the actual Y values around the predicted Y values must be developed, in the same way as the standard deviation was developed to measure the variability of values around the mean. The standard deviation around the prediction line is called the standard error of the estimate, Syx.
Calculation of Syx in Matlab/Octave:
x=X';y=Y'; %from row vector to column vector
n=length(x);
Sxx=sum((x-mean(x)).^2);
Syy=sum((y-mean(y)).^2);
Sxy=sum((x-mean(x)).*(y-mean(y)));
SSE=Syy-a*Sxy;
S2yx=SSE/(n-2);

%Standard error of estimate:
Syx=sqrt(SSE/(n-2));

In our case we get Syx = 0.9664. The standard error of the estimate is measured in the same units as the dependent variable (Y). In our case the dependent variable Y is annual sales, so our standard error of the estimate equals 0.9664 million dollars. We can express this value on the scatter plot as below:


Test whether there is a significant linear relationship

In this section we want to test whether there is a significant (with 95% confidence; hence alpha=0.05) linear relationship between X and Y. To do this we can perform a t-test for the slope. We state the null and alternative hypotheses as follows:
H0 : A = 0 (there is no linear relationship)
H1 : A ≠ 0 (there is a linear relationship)
where A is the slope of the relationship in the population of X and Y. Remember that we only have a sample of size 14 from the population; we perform the regression analysis on the sample, trying to infer the relationship in the population of X and Y.
The t statistic equals:
t = (a - A) / Sb, where Sb = sqrt(S2yx / Sxx).
We calculate t and the P-value and check whether the P-value is below alpha=0.05; hence, whether we can reject our null hypothesis.
%we use data from the earlier code above
Sb=sqrt(S2yx/Sxx);
A=0;
t=(a-A)/Sb;
Pvalue=2*(1-tcdf(abs(t),n-2));
% in older Octave (2.1) use t_cdf!!!
Hence t=10.64 and P-value=1.8227e-07, which is lower than alpha=0.05, so we reject the null hypothesis and conclude, with 95% confidence, that there is a significant linear relationship between mean annual sales and store size.
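The t statistic for the slope can be reproduced in plain Python. The P-value needs a t CDF, which is not in the standard library, so only t is computed and compared against the tabulated critical value t(0.975, df=12) = 2.179:

```python
import math
from statistics import mean

X = [1.7, 1.6, 2.8, 5.6, 1.3, 2.2, 1.3, 1.1, 3.2, 1.5, 5.2, 4.6, 5.8, 3.0]
Y = [3.7, 3.9, 6.7, 9.5, 3.4, 5.6, 3.7, 2.7, 5.5, 2.9, 10.7, 7.6, 11.8, 4.1]

n = len(X)
mx, my = mean(X), mean(Y)
Sxx = sum((x - mx) ** 2 for x in X)
Syy = sum((y - my) ** 2 for y in Y)
Sxy = sum((x - mx) * (y - my) for x, y in zip(X, Y))

a = Sxy / Sxx                # slope
SSE = Syy - a * Sxy          # residual sum of squares
S2yx = SSE / (n - 2)         # variance of the estimate
Sb = math.sqrt(S2yx / Sxx)   # standard error of the slope
t = a / Sb                   # t statistic under H0: A = 0, ~10.64

t_crit = 2.179               # t(0.975, df = n-2 = 12), from a t table
reject_H0 = abs(t) > t_crit
```

Since 10.64 far exceeds 2.179, the null hypothesis is rejected, matching the conclusion above.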

The confidence interval estimate

Regression analysis can be used to predict values of Y based on X. In our case we want to predict annual sales based on store size. The simplest way is a point estimate using our line equation, Yy = a*Xx + b = 1.67Xx + 0.9645, where Yy is our prediction and Xx is the value of X we make the prediction for. For example, if one asks what the estimated mean annual sales are for a store of 4000 square feet (Xx=4.0), we calculate
Yy = 1.67*4.0 + 0.9645 = 7.644
Hence, the predicted mean annual sales for a store of 4000 square feet are 7.644 million dollars.
This kind of prediction is called point prediction, and we know that our regression line is not perfect, as measured by the standard error of the estimate (Syx). To be more statistically correct, we should create a confidence interval for our predictions.
Confidence interval estimate for the mean of Y:


alpha=0.05;
Xx=4.0; %predict Y for X=4
Yy=b+a.*(Xx); %Y predicted from regression line
SSX=sum((X-mean(X)).^2);
h=1/n + ((Xx-mean(X)).^2)/SSX;
tn2=tinv(1-alpha/2,n-2); %t critical value
interval=tn2.*Syx.*sqrt(h);
Hence the interval is 0.673.
For example, if we want to know the confidence interval of the mean annual sales for a store of 4000 square feet (Xx=4.0), after the calculations we get 7.644 ± 0.673. In other words, mean annual sales are between 6.971 and 8.317 (million dollars) for the population of stores with 4000 square feet. If we calculate the interval for all the X values in our sample, we can show it graphically:
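The same interval can be reproduced in plain Python; as before, t(0.975, df=12) = 2.179 is taken from a t table, since the standard library has no t quantile function:

```python
import math
from statistics import mean

X = [1.7, 1.6, 2.8, 5.6, 1.3, 2.2, 1.3, 1.1, 3.2, 1.5, 5.2, 4.6, 5.8, 3.0]
Y = [3.7, 3.9, 6.7, 9.5, 3.4, 5.6, 3.7, 2.7, 5.5, 2.9, 10.7, 7.6, 11.8, 4.1]

n = len(X)
mx, my = mean(X), mean(Y)
Sxx = sum((x - mx) ** 2 for x in X)
Syy = sum((y - my) ** 2 for y in Y)
Sxy = sum((x - mx) * (y - my) for x, y in zip(X, Y))
a = Sxy / Sxx
b = my - a * mx
Syx = math.sqrt((Syy - a * Sxy) / (n - 2))  # standard error of the estimate

Xx = 4.0                          # predict the mean of Y at X = 4
Yy = b + a * Xx                   # point prediction, ~7.644
h = 1 / n + (Xx - mx) ** 2 / Sxx  # leverage of the prediction point
t_crit = 2.179                    # t(0.975, df = 12), from a t table
interval = t_crit * Syx * math.sqrt(h)  # half-width, ~0.673
```

This gives 7.644 ± 0.673, matching the calculation above.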


That is everything I have to say about this topic for now; the topic is much longer and more complex, but for most situations this is a good start.

Sunday, April 29, 2007

Coefficient of variation

The coefficient of variation is a dimensionless number that allows comparison of the variation of populations with significantly different mean values. It is defined as:
cv=std(X)/mean(X);
For example if:
X= [1.7 1.6 2.8 5.6 1.3 2.2 1.3 1.1 3.2 1.5 5.2 4.6 5.8 3.0];
then cv=0.5846.
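The same calculation in Python (statistics.stdev uses the n-1 sample formula, matching Matlab's std):

```python
from statistics import mean, stdev

X = [1.7, 1.6, 2.8, 5.6, 1.3, 2.2, 1.3, 1.1, 3.2, 1.5, 5.2, 4.6, 5.8, 3.0]
cv = stdev(X) / mean(X)  # coefficient of variation, ~0.5846
```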

P-value and t-critical values in Matlab and Octave

To calculate t-critical values and P-values for the t statistic, the following functions in Matlab/Octave do it:
t_critical= tinv(1-alpha/2,n-2);

where alpha is the significance level (e.g. alpha=0.05 for a 95% confidence level);
Pval=2*(1-tcdf(abs(t),n-2))

where t is our calculated value of the t statistic.

With regard to Octave, it depends which version you use. You may find that instead of tinv and tcdf there are t_inv and t_cdf functions.