Statistics for Beginners: Make Sense of Basic Concepts and Methods of Statistics and Data Analysis

The main thrust of the site is to explain various topics in statistical analysis, such as Chapter 1: Towards Statistical Thinking for Decision Making and Chapter 2: Exponential Density Function, F-Density Function, and Gamma Density Function, along with the basic concepts and methods of statistical analysis for processes and products.

A classic example of an experimental study is the Hawthorne study, which examined changes to the working environment at the Hawthorne plant of the Western Electric Company. The researchers were interested in determining whether increased illumination would increase the productivity of the assembly line workers. The researchers first measured the productivity in the plant, then modified the illumination in an area of the plant and checked if the changes in illumination affected productivity. It turned out that productivity indeed improved under the experimental conditions.


However, the study is heavily criticized today for errors in experimental procedures, specifically for the lack of a control group and blindness. The Hawthorne effect refers to the finding that an outcome (in this case, worker productivity) changed due to the observation itself. Those in the Hawthorne study became more productive not because the lighting was changed but because they were being observed. An example of an observational study is one that explores the association between smoking and lung cancer. This type of study typically uses a survey to collect observations about the area of interest and then performs statistical analysis.

In this case, the researchers would collect observations of both smokers and non-smokers, perhaps through a cohort study, and then look for the number of cases of lung cancer in each group. Various attempts have been made to produce a taxonomy of levels of measurement. The psychophysicist Stanley Smith Stevens defined nominal, ordinal, interval, and ratio scales.

What are Basic Statistics

Nominal measurements do not have meaningful rank order among values, and permit any one-to-one transformation. Ordinal measurements have imprecise differences between consecutive values, but have a meaningful order to those values, and permit any order-preserving transformation. Interval measurements have meaningful distances between measurements defined, but the zero value is arbitrary (as in the case with longitude and temperature measurements in Celsius or Fahrenheit), and permit any linear transformation.

Ratio measurements have both a meaningful zero value and the distances between different measurements defined, and permit any rescaling transformation. Because variables conforming only to nominal or ordinal measurements cannot be reasonably measured numerically, sometimes they are grouped together as categorical variables, whereas ratio and interval measurements are grouped together as quantitative variables, which can be either discrete or continuous, due to their numerical nature.



Such distinctions can often be loosely correlated with data type in computer science, in that dichotomous categorical variables may be represented with the Boolean data type, polytomous categorical variables with arbitrarily assigned integers in the integral data type, and continuous variables with the real data type involving floating point computation. But the mapping of computer science data types to statistical data types depends on which categorization of the latter is being implemented.
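
As a loose illustration of that mapping, here is a minimal Python sketch; the variable names and category codes are made up for the example and are not from the original text.

```python
# Hypothetical variables illustrating the statistical-to-computer data type mapping.
smoker = True                              # dichotomous categorical -> Boolean
eye_color_codes = {"brown": 0, "blue": 1, "green": 2}
eye_color = eye_color_codes["blue"]        # polytomous categorical -> arbitrarily assigned integer
height_cm = 172.4                          # continuous quantitative -> floating point

print(type(smoker), type(eye_color), type(height_cm))
```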

Other categorizations have been proposed. For example, Mosteller and Tukey [18] distinguished grades, ranks, counted fractions, counts, amounts, and balances. Nelder [19] described continuous counts, continuous ratios, count ratios, and categorical modes of data. See also Chrisman [20] and van den Berg. The issue of whether or not it is appropriate to apply different kinds of statistical methods to data obtained from different kinds of measurement procedures is complicated by issues concerning the transformation of variables and the precise interpretation of research questions.

"Whether or not a transformation is sensible to contemplate depends on the question one is trying to answer" (Hand). Consider independent, identically distributed (IID) random variables with a given probability distribution. A statistic is a random variable that is a function of the random sample, but not a function of unknown parameters; the probability distribution of the statistic, though, may have unknown parameters. An estimator is a statistic used to estimate a function of the unknown parameter. Commonly used estimators include the sample mean, the unbiased sample variance and the sample covariance.
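
For instance, the commonly used estimators just named can be computed directly. This is a minimal sketch, assuming numpy is available; the two data arrays are invented for illustration.

```python
import numpy as np

sample = np.array([2.1, 3.4, 2.9, 4.0, 3.2, 2.7])
other = np.array([1.0, 1.8, 1.4, 2.3, 1.9, 1.1])

mean = sample.mean()                        # sample mean
var_unbiased = sample.var(ddof=1)           # unbiased sample variance (divides by n - 1)
cov = np.cov(sample, other, ddof=1)[0, 1]   # sample covariance of the two variables

print(mean, var_unbiased, cov)
```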

A random variable that is a function of the random sample and of the unknown parameter, but whose probability distribution does not depend on the unknown parameter, is called a pivotal quantity or pivot. Widely used pivots include the z-score, the chi-square statistic and Student's t-value. Between two estimators of a given parameter, the one with lower mean squared error is said to be more efficient.

Furthermore, an estimator is said to be unbiased if its expected value is equal to the true value of the unknown parameter being estimated, and asymptotically unbiased if its expected value converges in the limit to the true value of that parameter. Other desirable properties for estimators include UMVUE estimators, which have the lowest variance for all possible values of the parameter to be estimated (this is usually an easier property to verify than efficiency), and consistent estimators, which converge in probability to the true value of that parameter.
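
The following small simulation is an illustration added here, not something from the original text: assuming numpy, it compares the biased variance estimator (divide by n) with the unbiased one (divide by n − 1) against a known true variance.

```python
import numpy as np

rng = np.random.default_rng(0)
true_var = 4.0                      # variance of the simulated population
n, reps = 10, 100_000

samples = rng.normal(loc=0.0, scale=true_var ** 0.5, size=(reps, n))
biased = samples.var(axis=1, ddof=0).mean()     # tends to underestimate the true variance
unbiased = samples.var(axis=1, ddof=1).mean()   # averages out close to the true variance

print(f"true={true_var}, biased avg={biased:.3f}, unbiased avg={unbiased:.3f}")
```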

This still leaves the question of how to obtain estimators in a given situation and carry out the computation; several methods have been proposed. Interpretation of statistical information can often involve the development of a null hypothesis, which is usually (but not necessarily) that no relationship exists among variables or that no change occurred over time.

The best illustration for a novice is the predicament encountered in a criminal trial. The null hypothesis, H₀, asserts that the defendant is innocent, whereas the alternative hypothesis, H₁, asserts that the defendant is guilty. The indictment comes because of suspicion of guilt.

The H₀ (status quo) stands in opposition to H₁ and is maintained unless H₁ is supported by evidence "beyond a reasonable doubt". However, "failure to reject H₀" in this case does not imply innocence, but merely that the evidence was insufficient to convict. So the jury does not necessarily accept H₀ but fails to reject H₀. While one cannot "prove" a null hypothesis, one can test how close it is to being true with a power test, which tests for type II errors.

What statisticians call an alternative hypothesis is simply a hypothesis that contradicts the null hypothesis. Working from a null hypothesis, two basic forms of error are recognized: type I errors, in which a true null hypothesis is rejected (a "false positive"), and type II errors, in which a false null hypothesis is not rejected (a "false negative"). Standard deviation refers to the extent to which individual observations in a sample differ from a central value, such as the sample or population mean, while standard error refers to an estimate of the difference between the sample mean and the population mean.
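
To make the standard deviation versus standard error distinction concrete, here is a minimal sketch assuming numpy; the data are invented for illustration.

```python
import numpy as np

sample = np.array([4.2, 5.1, 4.8, 5.5, 4.9, 5.0, 4.6, 5.3])

sd = sample.std(ddof=1)            # sample standard deviation: spread of individual values
se = sd / np.sqrt(len(sample))     # standard error of the mean: uncertainty in the sample mean

print(f"standard deviation = {sd:.3f}, standard error of the mean = {se:.3f}")
```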

A statistical error is the amount by which an observation differs from its expected value; a residual is the amount by which an observation differs from the value the estimator of the expected value assumes on a given sample (also called a prediction). Mean squared error is used for obtaining efficient estimators, a widely used class of estimators.

Root mean square error is simply the square root of mean squared error. Many statistical methods seek to minimize the residual sum of squares, and these are called "methods of least squares", in contrast to least absolute deviations. The latter gives equal weight to small and big errors, while the former gives more weight to large errors.
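
The quantities above are easy to compute by hand. The sketch below assumes numpy and uses toy numbers; it shows the residual sum of squares, mean squared error, root mean squared error, and the least-absolute-deviations counterpart for the same residuals.

```python
import numpy as np

observed = np.array([3.0, 5.0, 7.0, 9.0])
predicted = np.array([2.8, 5.4, 6.5, 9.6])
residuals = observed - predicted

rss = np.sum(residuals ** 2)      # residual sum of squares (what least squares minimizes)
mse = rss / len(residuals)        # mean squared error
rmse = np.sqrt(mse)               # root mean squared error
sad = np.sum(np.abs(residuals))   # sum of absolute deviations (least absolute deviations criterion)

print(rss, mse, rmse, sad)
```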

Residual sum of squares is also differentiable, which provides a handy property for doing regression. Least squares applied to linear regression is called the ordinary least squares method, and least squares applied to nonlinear regression is called non-linear least squares. Also, in a linear regression model, the non-deterministic part of the model is called the error term, disturbance, or more simply noise. Both linear regression and non-linear regression are addressed in polynomial least squares, which also describes the variance in a prediction of the dependent variable (y axis) as a function of the independent variable (x axis) and the deviations (errors, noise, disturbances) from the estimated (fitted) curve.
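
As a rough illustration of ordinary least squares, here is a short numpy sketch on made-up data: it fits y = b0 + b1·x by minimizing the residual sum of squares.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

b1, b0 = np.polyfit(x, y, deg=1)   # slope and intercept of the least squares line
fitted = b0 + b1 * x
residuals = y - fitted             # the estimated "noise" around the fitted line

print(f"intercept={b0:.3f}, slope={b1:.3f}, RSS={np.sum(residuals**2):.3f}")
```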

Most studies only sample part of a population, so results don't fully represent the whole population. Any estimates obtained from the sample only approximate the population value. Confidence intervals allow statisticians to express how closely the sample estimate matches the true value in the whole population: formally, a 95% confidence interval is constructed so that, under repeated sampling, 95% of the intervals produced would contain the true value. This does not mean that the probability that the true value lies in a particular computed interval is 95%. From the frequentist perspective, such a claim does not even make sense, as the true value is not a random variable: either the true value is or is not within the given interval. One approach that does yield an interval that can be interpreted as having a given probability of containing the true value is to use a credible interval from Bayesian statistics. In principle, confidence intervals can be symmetrical or asymmetrical.

An interval can be asymmetrical because it works as a lower or upper bound for a parameter (a left-sided or right-sided interval), but it can also be asymmetrical because the two-sided interval is built violating symmetry around the estimate. Sometimes the bounds for a confidence interval are reached asymptotically, and these are used to approximate the true bounds. Interpretation often comes down to the level of statistical significance applied to the numbers, and often refers to the probability of a value accurately rejecting the null hypothesis (sometimes referred to as the p-value).
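
As one common example of a symmetrical interval, the sketch below computes a two-sided 95% t-based confidence interval for a mean. It assumes numpy and scipy are available, and the data are invented.

```python
import numpy as np
from scipy import stats

sample = np.array([10.2, 9.8, 10.5, 10.1, 9.9, 10.4, 10.0, 10.3])
n = len(sample)
mean = sample.mean()
se = sample.std(ddof=1) / np.sqrt(n)          # standard error of the mean

t_crit = stats.t.ppf(0.975, df=n - 1)          # critical value for a 95% two-sided interval
lower, upper = mean - t_crit * se, mean + t_crit * se

print(f"95% CI: ({lower:.3f}, {upper:.3f})")
```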

The standard approach [23] is to test a null hypothesis against an alternative hypothesis. A critical region is the set of values of the estimator that leads to refuting the null hypothesis. The probability of type I error is therefore the probability that the estimator belongs to the critical region given that the null hypothesis is true (statistical significance), and the probability of type II error is the probability that the estimator does not belong to the critical region given that the alternative hypothesis is true.

The statistical power of a test is the probability that it correctly rejects the null hypothesis when the null hypothesis is false. Referring to statistical significance does not necessarily mean that the overall result is significant in real world terms. For example, in a large study of a drug it may be shown that the drug has a statistically significant but very small beneficial effect, such that the drug is unlikely to help the patient noticeably.
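
Power can be computed in closed form for simple tests. The sketch below uses one standard formula, for a one-sided one-sample z-test with known sigma, and assumes scipy; the effect size, sigma, sample size, and alpha are made-up inputs rather than values from the text.

```python
from scipy.stats import norm
import numpy as np

alpha = 0.05      # significance level (type I error rate)
sigma = 1.0       # known population standard deviation (an assumption of the z-test)
n = 25            # sample size
delta = 0.5       # true difference from the null value under the alternative

z_alpha = norm.ppf(1 - alpha)                            # critical value of the test
power = 1 - norm.cdf(z_alpha - delta * np.sqrt(n) / sigma)

print(f"power = {power:.3f}")   # probability of correctly rejecting H0
```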

Although in principle the acceptable level of statistical significance may be subject to debate, the p-value is the smallest significance level that allows the test to reject the null hypothesis. This is logically equivalent to saying that the p-value is the probability, assuming the null hypothesis is true, of observing a result at least as extreme as the test statistic.
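
For a concrete p-value, here is a small one-sample t-test sketch with scipy on invented data; the reported p-value is exactly the probability described above, computed from the t-distribution.

```python
import numpy as np
from scipy import stats

sample = np.array([5.3, 5.1, 5.6, 5.4, 5.2, 5.5, 5.0, 5.7])
t_stat, p_value = stats.ttest_1samp(sample, popmean=5.0)   # two-sided test of H0: mean = 5.0

print(f"t = {t_stat:.3f}, p-value = {p_value:.4f}")
```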

Therefore, the smaller the p-value, the lower the probability of committing a type I error. Some problems are usually associated with this framework (see criticism of hypothesis testing). Well-known statistical tests and procedures include Student's t-test, the chi-squared test, analysis of variance (ANOVA), and regression analysis. Misuse of statistics can produce subtle but serious errors in description and interpretation: subtle in the sense that even experienced professionals make such errors, and serious in the sense that they can lead to devastating decision errors. For instance, social policy, medical practice, and the reliability of structures like bridges all rely on the proper use of statistics.

Even when statistical techniques are correctly applied, the results can be difficult to interpret for those lacking expertise. The statistical significance of a trend in the data—which measures the extent to which a trend could be caused by random variation in the sample—may or may not agree with an intuitive sense of its significance.

Definition of Degrees of Freedom

The set of basic statistical skills and skepticism that people need to deal with information in their everyday lives properly is referred to as statistical literacy. There is a general perception that statistical knowledge is all-too-frequently intentionally misused by finding ways to interpret only the data that are favorable to the presenter.

Misuse of statistics can be both inadvertent and intentional, and the book How to Lie with Statistics [28] outlines a range of considerations. In an attempt to shed light on the use and misuse of statistics, reviews of statistical techniques used in particular fields are conducted (e.g., Warne, Lazo, Ramos, and Ritter).

Ways to avoid misuse of statistics include using proper diagrams and avoiding bias; without such care, people may often believe that something is true even if it is not well represented. To assist in the understanding of statistics, Huff proposed a series of questions to be asked in each case. The concept of correlation is particularly noteworthy for the potential confusion it can cause. Statistical analysis of a data set often reveals that two variables (properties of the population under consideration) tend to vary together, as if they were connected.

For example, a study of annual income that also looks at age of death might find that poor people tend to have shorter lives than affluent people. The two variables are said to be correlated; however, they may or may not be the cause of one another. The correlation phenomenon could be caused by a third, previously unconsidered phenomenon, called a lurking variable or confounding variable. For this reason, there is no way to immediately infer the existence of a causal relationship between the two variables. See Correlation does not imply causation.

Some scholars pinpoint the origin of statistics to the 17th century, with the publication of Natural and Political Observations upon the Bills of Mortality by John Graunt.

The scope of the discipline of statistics broadened in the early 19th century to include the collection and analysis of data in general. Today, statistics is widely employed in government, business, and natural and social sciences.

What are Basic Statistics?

Its mathematical foundations were laid in the 17th century with the development of probability theory by Gerolamo Cardano, Blaise Pascal and Pierre de Fermat. Mathematical probability theory arose from the study of games of chance, although the concept of probability was already examined in medieval law and by philosophers such as Juan Caramuel. The modern field of statistics emerged in the late 19th and early 20th century in three stages.

The first wave, at the turn of the 20th century, was led by Francis Galton and Karl Pearson. Galton's contributions included introducing the concepts of standard deviation, correlation, regression analysis and the application of these methods to the study of the variety of human characteristics such as height, weight, and eyelash length, among others. Ronald Fisher coined the term null hypothesis during the Lady tasting tea experiment, which "is never proved or established, but is possibly disproved, in the course of experimentation". The second wave, in the 1910s and 20s, was initiated by William Gosset and reached its culmination in the insights of Ronald Fisher, who wrote the textbooks that were to define the academic discipline in universities around the world.

Fisher's most important publications were his seminal paper The Correlation between Relatives on the Supposition of Mendelian Inheritance, which was the first to use the statistical term variance, his classic work Statistical Methods for Research Workers and his The Design of Experiments, [44][45][46][47] where he developed rigorous design of experiments models. He originated the concepts of sufficiency, ancillary statistics, Fisher's linear discriminator and Fisher information.

Edwards has remarked that Fisher's principle is "probably the most celebrated argument in evolutionary biology". The final wave, which mainly saw the refinement and expansion of earlier developments, emerged from the collaborative work between Egon Pearson and Jerzy Neyman in the 1930s. They introduced the concepts of "Type II" error, power of a test and confidence intervals. Jerzy Neyman showed that stratified random sampling was in general a better method of estimation than purposive (quota) sampling.


Today, statistical methods are applied in all fields that involve decision making, for making accurate inferences from a collated body of data and for making decisions in the face of uncertainty based on statistical methodology. The use of modern computers has expedited large-scale statistical computations, and has also made possible new methods that are impractical to perform manually. Statistics continues to be an area of active research, for example on the problem of how to analyze Big data. Applied statistics comprises descriptive statistics and the application of inferential statistics.

Mathematical statistics includes not only the manipulation of probability distributions necessary for deriving results related to methods of estimation and inference, but also various aspects of computational statistics and the design of experiments. Statistical tools are also necessary for the data analysis that underlies machine learning and data mining. Statistics is applicable to a wide variety of academic disciplines, including natural and social sciences, government, and business. Statistical consultants can help organizations and companies that don't have in-house expertise relevant to their particular questions.

The rapid and sustained increases in computing power starting from the second half of the 20th century have had a substantial impact on the practice of statistical science. Early statistical models were almost always from the class of linear models, but powerful computers, coupled with suitable numerical algorithms, caused an increased interest in nonlinear models such as neural networks as well as the creation of new types, such as generalized linear models and multilevel models.

Increased computing power has also led to the growing popularity of computationally intensive methods based on resampling, such as permutation tests and the bootstrap, while techniques such as Gibbs sampling have made use of Bayesian models more feasible. The computer revolution has implications for the future of statistics, with new emphasis on "experimental" and "empirical" statistics. A large number of both general and special purpose statistical software packages are now available.

Traditionally, statistics was concerned with drawing inferences using a semi-standardized methodology that was "required learning" in most sciences. What was once considered a dry subject, taken in many fields as a degree requirement, is now viewed enthusiastically. Statistical techniques are used in a wide range of scientific and social research, and some fields of inquiry use applied statistics so extensively that they have specialized terminology, as in biostatistics and econometrics.

In addition, there are particular types of statistical analysis that have developed their own specialised terminology and methodology. Statistics is also a key tool in business and manufacturing. It is used to understand measurement system variability, to control processes (as in statistical process control, or SPC), to summarize data, and to make data-driven decisions. In these roles, it is a key tool, and perhaps the only reliable tool. When a test result is statistically significant, the evidence in your sample is strong enough to reject the null hypothesis at the population level.

The graph below shows the t-distribution for several different degrees of freedom. Because the degrees of freedom are so closely related to sample size, you can see the effect of sample size: as the degrees of freedom decrease, the t-distribution has thicker tails. This property accounts for the greater uncertainty associated with small sample sizes. To dig into t-tests, read my post about How t-Tests Work, where I show how the different t-tests calculate t-values and use t-distributions to calculate p-values. The F-test, likewise, uses the F-distribution, which is defined by degrees of freedom; however, you calculate the DF for an F-distribution differently.
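
In place of the graph, here is a hedged numerical sketch of the same idea, assuming scipy; the degrees-of-freedom values are chosen only for illustration.

```python
from scipy import stats

for df in (3, 10, 30):
    # probability of a t-value more extreme than +/- 2 under each distribution
    tail_prob = 2 * stats.t.sf(2.0, df=df)
    print(f"df={df:>2}: P(|t| > 2) = {tail_prob:.4f}")
# Smaller df -> larger tail probability, reflecting the extra uncertainty of small samples.
```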

For more on interpretation, read my post How to Interpret P-values Correctly. The chi-square test of independence determines whether there is a statistically significant relationship between categorical variables. A categorical variable has values that you can put into a countable number of distinct groups based on a characteristic. For a categorical variable, you can assign categories, but the categories have no natural order. If the variable has a natural order, it is an ordinal variable. Categorical variables are also called qualitative variables or attribute variables.

For example, college major is a categorical variable that can have values such as psychology, political science, engineering, biology, etc. Just like other hypothesis tests, this test incorporates degrees of freedom. For a table with r rows and c columns, the general rule for calculating the degrees of freedom for a chi-square test is (r - 1)(c - 1). However, we can create tables to understand it more intuitively. The degrees of freedom for a chi-square test of independence is the number of cells in the table that can vary before you can calculate all the other cells.

In a chi-square table, the cells represent the observed frequency for each combination of categorical variables. The constraints are the totals in the margins. For example, in a 2 x 2 table, after you enter one value in the table, you can calculate the remaining cells. In the table above, I entered the bold 15, and then I can calculate the remaining three values in parentheses. Therefore, this table has 1 DF. The table below illustrates the example that I use in my post about the chi-square test of independence.

In that post, I determine whether there is a statistically significant relationship between uniform color and deaths on the original Star Trek TV series. In the table, one categorical variable is shirt color, which can be blue, gold, or red. The other categorical variable is status, which can be dead or alive. After I entered the two bolded values, I can calculate all the remaining cells.

Consequently, this table has 2 DF. Read my post, Chi-Square Test of Independence and an Example, to see how this test works and how to interpret the results using the Star Trek example. Like the t-distribution, the chi-square distribution is a family of distributions where the degrees of freedom define the shape. Chi-square tests use this distribution to calculate p-values. The graph below displays several chi-square distributions.
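
As an illustrative check of the (r - 1)(c - 1) rule, here is a chi-square test of independence with scipy on a made-up 2 x 3 table of counts (not the Star Trek data); chi2_contingency reports the 2 degrees of freedom along with the test statistic and p-value.

```python
import numpy as np
from scipy.stats import chi2_contingency

observed = np.array([[20, 15, 25],
                     [30, 25, 35]])   # hypothetical 2 x 3 contingency table

chi2, p_value, dof, expected = chi2_contingency(observed)
print(f"chi-square = {chi2:.3f}, df = {dof}, p-value = {p_value:.4f}")
```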

In a regression model, each term is an estimated parameter that uses one degree of freedom. In the regression output below, you can see how each term requires a DF. There are 28 observations and two predictor variables. (Predictor variables are also known as independent variables, x-variables, and input variables. A predictor variable explains changes in the response; typically, you want to determine how changes in one or more predictors are associated with changes in the response. For example, in a plant growth study, the predictors might be the amount of fertilizer applied, the soil moisture, and the amount of sunlight.)

The remaining 26 degrees of freedom are displayed in Error. The error degrees of freedom are the independent pieces of information that are available for estimating your regression coefficients. Regression coefficients are estimates of the unknown population parameters and describe the relationship between a predictor variable and the response.

In linear regression, coefficients are the values that multiply the predictor values. In a regression equation such as y = b0 + b1x, the sign of each coefficient indicates the direction of the relationship between a predictor variable and the response variable: a positive sign indicates that as the predictor variable increases, the response variable also increases. For precise coefficient estimates and powerful hypothesis tests in regression, you must have many error degrees of freedom. This equates to having many observations for each model term.
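
The degrees-of-freedom bookkeeping can be sketched directly. The example below assumes numpy and uses toy data (not the dataset from the output described above): under the usual convention, the error DF equals the number of observations minus the number of estimated coefficients (the intercept plus one per predictor).

```python
import numpy as np

rng = np.random.default_rng(1)
n, k = 30, 2                                           # illustrative sample size and predictor count
X = rng.normal(size=(n, k))
y = 3.0 + X @ np.array([1.5, -2.0]) + rng.normal(scale=0.5, size=n)

design = np.column_stack([np.ones(n), X])              # intercept column plus predictors
coefs, resid, rank, sv = np.linalg.lstsq(design, y, rcond=None)

error_df = n - design.shape[1]                         # 30 observations - 3 coefficients = 27 here
print(f"coefficients = {coefs.round(3)}, error DF = {error_df}")
```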

Independent Information and Constraints on Values

As you add terms to the model, the error degrees of freedom decrease. You have fewer pieces of information available to estimate the coefficients. This situation reduces the precision of the estimates and the power of the tests. For more information about the problems that occur when you use too many degrees of freedom and how many observations you need, read my blog post about overfitting your model. Even though they might seem murky, degrees of freedom are essential to any statistical analysis!

In a nutshell, DF define the amount of information you have relative to the number of properties that you want to estimate. For instance, take a look at the chi-square examples. I used Minitab software for the graphs. The topic clarity is in a very good format, but please explain this through R programming so that we can feel confident while predicting. My blog is designed to teach statistical concepts, analyses, interpretation, etc., rather than teaching a specific software package.

The software package can supply the documentation that describes how to obtain the specific results that you need. Is the Minitab software you are using free or paid? If it is free, please provide me its link... thank you. Hi Akhilesh, Minitab is not free. A statistic is a piece of information based on data, for example the crime rate, median income, mean height, etc. A test statistic is a statistic that summarizes the sample data and is used in hypothesis testing to determine whether the results are statistically significant. The hypothesis test takes all of the sample data, reduces it to a single value, and then calculates probabilities based on that value to determine significance.

For more information about how test statistics work, read my posts about t-values and F-values. Both of those are test statistics. We are also confused about the difference between sample size and degrees of freedom. Sample size is the number of data points in your study. Degrees of freedom are often closely related to sample size yet are never quite the same. The relationship between sample size and degrees of freedom depends on the specific test.

Hypothesis tests actually use the degrees of freedom in the calculations for statistical significance. Typically, DF define the probability distribution of the test statistic. Jim, thanks for the core areas in stats that you always explain. I downloaded it. Are you referring to the PSPP software? If so, I believe the correct file for Windows is psppdailybits-setup. That is a file you can run to install the program. Thanks Jim, I have probably found the first person with such clear basics.

Hope to learn much more with you. Hi Eajaz, thanks so much for the kind words! You made my day because I strive to find ways to teach statistics using easy-to-understand language! The table just provided the degrees of freedom for 30 and for the next larger value; which one shall I choose? In the t-distribution, after you get past about 30 df, the differences between the t-values for different probabilities become minuscule (see the sketch below for an illustration). For a test statistic, this is equivalent to picking the DF that is associated with a larger absolute value of the statistic, and that means choosing a lower DF.
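
The sketch below is an illustrative check of that claim, assuming scipy; the df values are hypothetical, chosen only to show how little the critical t-value changes past roughly 30 df.

```python
from scipy import stats

for df in (30, 35, 40, 60):
    t_crit = stats.t.ppf(0.975, df=df)    # two-sided 5% critical value
    print(f"df={df:>3}: t critical = {t_crit:.4f}")
# The values differ only around the second decimal place, which is why choosing the
# lower tabled df (the slightly more conservative option) is a reasonable rule.
```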

The choice you make should require stronger evidence rather than weaker evidence to be statistically significant. But you raise an excellent point. In some cases, such as how I described 39 DF for the t-distribution, the difference is minute; you have to go out three decimal places to see a difference. To see why, read my post about correctly interpreting p-values. Near the end of that post, I discuss strength of evidence.

In a nutshell, I would not consider a result with a p-value just under the significance threshold to be meaningfully different from one just over it; in either case, both results are fairly weak evidence to build a case on. Changing the DF affects these borderline cases. However, I do agree with the approach of choosing the DF that requires stronger evidence to produce statistically significant results. If you have to make a choice, make a choice in the direction of requiring stronger evidence.

    That approach indicates choosing the lower DF. Thanks for raising this issue! It was good to think through this! Hi Dr Jim, I really appreciate your reply. It is really a great one. Some articles do mentioned that we shall use interpolation method to find the t-value if it is not given in the table. But none discuss like what you have explained which can convince us to use the lower DF instead of using the standard rounding rules.