Introduction to statistics: variables and frequency distributions

Slides from Unisr.it introducing statistics: variables and frequency distributions. The PDF, intended for university mathematics courses, gives a clear overview of the fundamental concepts of statistics, including examples and exercises with solutions for self-assessment.

Lecture 1

Topics:

• Brief introduction

• Types of variables (Lecture notes 1; Book chapters: 1.2-1.3)

• Frequencies, relative frequencies and cumulative relative frequencies (Lecture notes 1; Book chapter: 2.3)

P. Rancoita

rancoita.paolamaria@unisr.it

What is statistics?

Statistics (1)

Statistics is a discipline of study dealing with scientific methods for the collection, analysis, interpretation, and presentation of data, as well as with methods for drawing valid conclusions on the basis of such analysis.
Statistical methodologies can be classified into two groups:
- Descriptive statistics: it seeks only to describe and summarize the data of a sample.
- Inferential statistics: it consists of techniques for reaching conclusions about a population based upon information contained in a sample. Inferential statistics is based on probability theory.

Statistics (2)

[Diagram: Population (parameters) -> Sampling -> Sample (estimates of the parameters), which is summarized by Descriptive statistics; Statistical inference (generalization of the results) leads from the Sample back to the Population.]

Example of population and sample

Example. We select 100 lung cancer patients from the national cancer registry (CR), in order to estimate the mean number of cigarettes smoked per day by a lung cancer patient, before the diagnosis of the disease.

target population = all lung cancer patients
parameter = mean number of cigarettes smoked per day before the diagnosis
sample = 100 lung cancer patients from the CR
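
As a minimal sketch of the estimation step in this example, the snippet below computes a sample mean, which plays the role of the estimate of the population parameter. The values in `cigarettes_per_day` are hypothetical placeholders invented for illustration (and only ten of them, rather than 100).

```python
from statistics import mean

# Hypothetical cigarettes-per-day values for sampled patients (illustration only;
# a real sample from the registry would contain 100 observations).
cigarettes_per_day = [20, 15, 30, 10, 25, 20, 40, 5, 20, 15]

# The sample mean is the estimate of the population parameter
# (mean number of cigarettes smoked per day before the diagnosis).
sample_mean = mean(cigarettes_per_day)
print(f"n = {len(cigarettes_per_day)}, estimated mean = {sample_mean}")
```

The parameter itself stays unknown; only its estimate is computed from the sample.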

Importance of statistics in medicine

Some examples

establishing risk factors for a disease or other health events (for prevention or for increasing the understanding of the phenomenon);
establishing prognostic factors for a disease (so that, for example, different treatment strategies can be adopted depending on them);
assessing the benefits of new therapies;
comparing the benefits of competing therapies.

The role of statistics in a scientific study

Statistics is necessary (or must be accounted for) in every phase of a study:

the design of the study;
the data collection;
the statistical analysis of the collected data;
the interpretation of the results of the analysis.

About 50% of the literature is thought to have some shortcomings from a statistical point of view (Ercan et al., Eur. J. Gen. Med. 2007).

Study design (1)

A correct design of the study allows:

to draw conclusions about the population based on the results obtained in the sample;
a correct interpretation of the results with respect to the aim of the study.

Study design (2)

Examples of common errors

In a case-control study (which compares diseased and healthy subjects), the selected healthy subjects do not have characteristics (i.e. prognostic factors different from the one under study) similar to the patients. Thus, the two groups are not comparable and it is not possible to assess the effect of interest excluding confounding effects.

Example. When studying a disease for which age is a prognostic factor, the two groups are not comparable if they do not have the same age distribution (e.g. one group contains a higher number of young subjects than the other).
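
One simple way to look at whether two groups have a similar age distribution is to compare the relative frequency of each age class in the two groups. The sketch below does this; the age classes and the two small groups are hypothetical and chosen only to illustrate a clear imbalance.

```python
from collections import Counter

def relative_frequencies(values):
    """Return the share of observations falling in each category."""
    counts = Counter(values)
    n = len(values)
    return {category: count / n for category, count in counts.items()}

# Hypothetical age classes of diseased (cases) and healthy (controls) subjects.
cases = ["<40", "40-60", ">60", ">60", ">60", "40-60", ">60", ">60"]
controls = ["<40", "<40", "<40", "40-60", "<40", "40-60", ">60", "<40"]

freq_cases = relative_frequencies(cases)
freq_controls = relative_frequencies(controls)

for age_class in ["<40", "40-60", ">60"]:
    print(f"{age_class:>6}: cases {freq_cases[age_class]:.3f}, "
          f"controls {freq_controls[age_class]:.3f}")
```

In this made-up data the controls are mostly young and the cases mostly old, so any difference between the groups could be due to age rather than to the factor under study.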

Study design (3)

In a study about a specific (target) population, the selected sample is not representative of that population. Thus, the results cannot be generalized to the target population.

Example. In "Emotional category data on images from the International Affective Picture System" (Mikels et al., 2005), the authors wanted to identify which images were able to elicit a particular emotion more than others. They used samples of students with a mean age of 18-19 years; thus, their findings are not generalizable to older individuals.

Data collection (1)

Precise data collection is the basis of good research.

Examples of possible errors

Variables are poorly measured.
The data of a variable are recorded with different units of measurement (for the different patients), without specifying them.
Data are poorly written in the medical record of the patient, thus leading to errors when recording the information in the electronic database.

Data collection (2)

Example. Even a single poorly measured or poorly reported data point may completely alter the result of the analysis (as in the right-hand plot).

[Figure: two scatter plots, each with both axes running from 0 to 120.]
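
The effect described in this example can be sketched numerically. The snippet below computes the Pearson correlation coefficient for a small hypothetical dataset, first as recorded correctly and then with one additional erroneous point. The data, the added point (120, 120), and the choice of correlation as "the analysis" are all assumptions made for illustration; the slide does not state which analysis its plots refer to.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Hypothetical measurements with no clear linear trend.
x = [10, 20, 30, 40, 50, 60]
y = [35, 20, 40, 25, 35, 25]

# The same data plus a single badly measured or mis-recorded observation.
x_err = x + [120]
y_err = y + [120]

r_clean = pearson_r(x, y)
r_with_error = pearson_r(x_err, y_err)
print(f"r without the erroneous point: {r_clean:.2f}")
print(f"r with the erroneous point:    {r_with_error:.2f}")
```

With these made-up numbers, one extreme point is enough to turn a weak correlation into an apparently strong one.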

Statistical analysis (1)

Any kind of statistical analysis makes assumptions about the data.
- Before performing any analysis, it is necessary to verify if the corresponding assumptions are met.

Example. Several statistical methods assume that the observations are all independent. In some studies, measurements are taken before and after the treatment in order to assess its efficacy. But data referring to the same subject are dependent, thus appropriate methods need to be applied for the analysis.
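
A minimal sketch of why the dependence matters: with before/after measurements on the same subjects, one common approach (an assumption here, since the slide does not name a specific method) is to analyze the within-subject differences. The measurements below are hypothetical. The snippet compares the standard error of the mean difference computed from the paired differences with the one obtained by wrongly treating the two sets of measurements as independent groups.

```python
from math import sqrt
from statistics import mean, stdev

# Hypothetical measurements on the same six subjects, before and after treatment.
before = [150, 142, 160, 155, 148, 165]
after = [144, 138, 152, 150, 141, 158]

# Paired view: one difference per subject.
differences = [a - b for a, b in zip(after, before)]
n = len(differences)
mean_diff = mean(differences)
se_paired = stdev(differences) / sqrt(n)

# Inappropriate here: ignores that each pair comes from the same subject.
se_independent = sqrt(stdev(before) ** 2 / n + stdev(after) ** 2 / n)

print(f"mean difference = {mean_diff:.2f}")
print(f"standard error (paired) = {se_paired:.2f}")
print(f"standard error (treated as independent) = {se_independent:.2f}")
```

In this made-up data the between-subject variability dominates, so ignoring the pairing inflates the standard error considerably.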

The results of the analysis must be reported in a correct and precise manner in order to avoid misinterpretations.

Statistical analysis (2)

Example. We want to represent the weights of a group of patients together with their mean.

Wrong solution: A graph like the ones below may give a misleading interpretation of the data, especially if the weights show a particular trend with respect to the order of the patients.

[Figure: two plots of Weight (Kg), from 50 to 80, against Patient, each with the Mean marked.]

Statistical analysis (3)

Correct solution: A graph like the one below gives a better idea of the weights that are most represented in the sample.

[Figure: histogram of Weight (Kg), from 40 to 90, with Frequency (0.0 to 3.0) on the vertical axis and the Mean marked.]
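
The idea behind such a graph can be sketched by counting how many weights fall into each class, independently of the order of the patients. The weights below are hypothetical, and the 10 kg class width is an arbitrary choice made for this illustration.

```python
from collections import Counter
from statistics import mean

# Hypothetical weights (kg) of ten patients.
weights_kg = [52, 58, 61, 63, 66, 68, 71, 74, 77, 85]

def weight_class(weight, width=10):
    """Return the class [lower, upper) that contains the given weight."""
    lower = (weight // width) * width
    return (lower, lower + width)

# Frequency of each class: how many patients fall into it.
frequencies = Counter(weight_class(w) for w in weights_kg)

# Text rendering of the histogram, one row per class.
for (lower, upper), count in sorted(frequencies.items()):
    print(f"[{lower}, {upper}) kg: {'#' * count} ({count})")
print(f"mean = {mean(weights_kg):.1f} kg")
```

The counts show at a glance which weight classes are most represented, which a plot of weight against patient number does not.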

Interpretation of the results

For a correct interpretation of the results, it is necessary to account for:

the exact meaning of the statistical analysis that was employed;
the representativeness of the sample with respect to the target population.

Example of misinterpretation

When a statistical analysis shows a significant association between two variables, the interpretation of this association as causality is beyond the meaning of the standard statistical analysis and can be supported only by clinical/biological knowledge of the phenomenon.

Interpretation of the results (2)

Example (possible misinterpretation of association results). When analyzing data about coronary heart disease (CHD), an association is usually found between heavy coffee drinking and CHD mortality. Nevertheless, the real risk factor for CHD is heavy smoking (which is also associated with CHD mortality).

Heavy coffee drinking is associated with CHD mortality (although it is not the cause), because often heavy smokers are also heavy coffee drinkers.

[Diagram relating Cigarette Smoking (Confounding Factor), Coffee Drinking, and CHD Mortality.]
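
A small numerical sketch can make the confounding mechanism concrete. All counts below are hypothetical and were constructed so that CHD mortality depends only on smoking, while heavy smokers are more often heavy coffee drinkers. The function `risk` is a helper introduced for this illustration.

```python
# Hypothetical counts: (smoker, coffee_drinker) -> (CHD deaths, subjects).
# Within each smoking group the risk is identical for drinkers and non-drinkers.
counts = {
    (True, True): (80, 800),    # heavy smokers, heavy coffee drinkers
    (True, False): (20, 200),   # heavy smokers, not heavy coffee drinkers
    (False, True): (4, 200),    # non-smokers, heavy coffee drinkers
    (False, False): (16, 800),  # non-smokers, not heavy coffee drinkers
}

def risk(coffee, smoker=None):
    """CHD mortality risk by coffee status, overall or within one smoking group."""
    cells = [
        value
        for (is_smoker, is_coffee), value in counts.items()
        if is_coffee == coffee and (smoker is None or is_smoker == smoker)
    ]
    deaths = sum(d for d, _ in cells)
    total = sum(t for _, t in cells)
    return deaths / total

# Ignoring smoking: coffee drinkers appear to be at higher risk.
print(f"overall: coffee {risk(True):.3f} vs no coffee {risk(False):.3f}")

# Within each smoking group the apparent association disappears.
for smoker in (True, False):
    print(f"smoker={smoker}: coffee {risk(True, smoker):.3f} "
          f"vs no coffee {risk(False, smoker):.3f}")
```

Looking at the association separately within smokers and non-smokers is one simple way to see that, in this constructed example, coffee drinking carries no additional risk.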

Population and sample

population = collection of subjects or objects of interest (target) that share common observable characteristics
unit = any individual or element of the target population
⇓ we select
sample = subset of the (target) population which is representative of the entire population
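
As an illustration of selecting a sample from a population, the sketch below draws a simple random sample of unit identifiers. Simple random sampling is only one possible way to aim for a representative sample and is an assumption here; the population size, the sample size, and the seed are likewise hypothetical.

```python
import random

# Hypothetical target population: identifiers of 5000 units.
population_ids = list(range(1, 5001))

# Seeded generator so that the illustration is reproducible.
rng = random.Random(0)

# Simple random sample without replacement: every unit has the same
# chance of being selected, and no unit is selected twice.
sample_ids = rng.sample(population_ids, k=100)

print(f"population size = {len(population_ids)}, sample size = {len(sample_ids)}")
```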

Variables and data

Definition. A variable is any kind of observable characteristic that can vary among the units of a population.

Example. Examples of variables are: sex and age.

Definition. A parameter of the population is a numerical characteristic related to a variable of the (target) population.

Example. Examples of parameters related to the previous variables are: the percentage of females and the mean age.

Definition. A datum (plural: data) is the observed value of a variable for one particular unit of the sample.

Example. In a study, the reported values of sex and age of the patients are data.
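
The three notions can be put side by side in a short sketch: each record holds the data of one unit, the keys are the variables, and the quantities computed at the end are sample estimates of the corresponding population parameters. The five records are hypothetical.

```python
from statistics import mean

# Each record holds the data (observed values) of one unit of the sample
# for the variables "sex" and "age". All values are hypothetical.
sample = [
    {"sex": "female", "age": 34},
    {"sex": "male", "age": 51},
    {"sex": "female", "age": 47},
    {"sex": "male", "age": 29},
    {"sex": "female", "age": 62},
]

# Sample estimates of two population parameters:
# the percentage of females and the mean age.
pct_female = 100 * sum(unit["sex"] == "female" for unit in sample) / len(sample)
mean_age = mean(unit["age"] for unit in sample)

print(f"percentage of females = {pct_female:.1f}%")
print(f"mean age = {mean_age:.1f} years")
```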

Types of variables

categorical (or qualitative)
- nominal
- ordinal
numerical (or quantitative)
- discrete
- continuous

Variables: categorical vs numerical (1)

Definition. A variable is called categorical (or qualitative) if its values denote membership in a category/group; that is, its values represent a particular quality of the units of the population.

The possible categories of a variable must be mutually exclusive; that is, a unit cannot belong to more than one category.

Example. The variable sex is categorical, since its values are: male and female.

Variables: categorical vs numerical (2)

Definition. A variable is called numerical (or quantitative) if its values represent quantities that can be measured or counted.

Example. The variable age is numerical.

Remark. A numerical variable can be transformed into a categorical one by dividing the interval of all its possible values into subintervals, which then define the categories of the new variable.

Example. Age can be divided into three classes: < 30, between 30 and 60 (both inclusive), > 60. The resulting variable is categorical (and, in particular, ordinal).
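
The transformation in this example can be sketched as a small function that maps each age to one of the three classes. The function name `age_class`, the class labels, and the list of ages are assumptions made for illustration.

```python
def age_class(age):
    """Map a numerical age to one of three ordered classes."""
    if age < 30:
        return "< 30"
    if age <= 60:  # 30 and 60 both belong to the middle class
        return "30-60"
    return "> 60"

# Hypothetical ages, including the boundary values 30 and 60.
ages = [18, 30, 45, 60, 61, 29]
classes = [age_class(a) for a in ages]

for a, c in zip(ages, classes):
    print(f"age {a} -> class {c}")
```

Every age falls into exactly one class, so the resulting categories are mutually exclusive.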

Variables: categorical vs numerical (3)

Remark. In a database or in a case report form (CRF), the categories of categorical variables are often labeled with numbers. Therefore, it is necessary to understand the "real meaning" of the labels (numbers), in order to define the type of variable.

Example. The values of the level of satisfaction can be denoted as: 1 (= low), 2 (= medium) and 3 (= high).

Although the values of the variable are labeled with numbers, they represent three categories (low, medium, high) and thus the variable is categorical (and not numerical).
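
A short sketch of how such numeric labels might be handled in practice: the codes are mapped back to their categories before any summary is computed, so that the variable is treated as categorical. The recorded codes below are hypothetical.

```python
from collections import Counter

# Meaning of the numeric labels used in the database.
SATISFACTION_LABELS = {1: "low", 2: "medium", 3: "high"}

# Hypothetical recorded codes for six units.
coded = [1, 3, 3, 2, 1, 3]

# Decode the labels, then summarize by counting categories.
decoded = [SATISFACTION_LABELS[code] for code in coded]
counts = Counter(decoded)

for level in SATISFACTION_LABELS.values():
    print(f"{level}: {counts[level]}")
```

Counting categories is a meaningful summary here, whereas an arithmetic mean of the codes would depend on the arbitrary choice of the numbers 1, 2 and 3.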

Types of variables: nominal vs ordinal

categorical (or qualitative)
- nominal
- ordinal
numerical (or quantitative)
- discrete
- continuous

Categorical variables: nominal vs ordinal

Definition. A categorical variable is called ordinal if its values (or categories) have an intrinsic (and not simply "aesthetic") order. Otherwise, the variable is called nominal. A nominal variable that assumes only two values is called dichotomic.

Example (1). The level of satisfaction (which assumes the values: low, medium, high) is an ordinal variable. The categories of the variable can be ordered in the following way: low - medium - high.

Example (2). The presence of fever (which assumes the values: no/yes) is a nominal (dichotomic) variable. In fact, it is not possible to order the values no and yes.
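
The practical difference between the two types can be sketched as follows: for an ordinal variable the categories can be sorted, so a middle (median) category exists, while for a nominal variable only counting is meaningful. The answers and fever values below are hypothetical.

```python
from collections import Counter

# Ordinal variable: the intrinsic order of the categories is stated explicitly.
SATISFACTION_ORDER = ["low", "medium", "high"]
rank = {level: position for position, level in enumerate(SATISFACTION_ORDER)}

answers = ["high", "low", "medium", "high", "low"]  # hypothetical data
sorted_answers = sorted(answers, key=lambda level: rank[level])
median_level = sorted_answers[len(sorted_answers) // 2]  # odd number of answers

# Nominal (dichotomic) variable: no order, so only frequencies are used.
fever = ["no", "yes", "no", "no", "yes"]  # hypothetical data
fever_counts = Counter(fever)
most_common_fever = fever_counts.most_common(1)[0][0]

print(f"sorted answers: {sorted_answers}")
print(f"median satisfaction level: {median_level}")
print(f"fever counts: {dict(fever_counts)}, most frequent: {most_common_fever}")
```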
