ECD - Exploring Categorical Data Lesson
What is the favorite ice cream flavor of high school students? Does it make a difference what grade they are in? These questions involve two categorical variables: flavor of ice cream and grade in school. Most likely the data were gathered by asking students directly or by using a survey. Assuming that students answered honestly and careful records were kept, the data should answer the questions. Of course, we need a new data display for this type of variable. A table with well-defined labels on the rows and columns is exactly what is needed.
To find out how this will work, please watch this presentation for an introduction to Categorical Data. You will get to "see" the answers to the questions posed and MUCH MORE. Remember to advance through the presentation using the forward button and take notes as needed.
Please view the video below.
Exploring Categorical Data Review
CATEGORICAL DATA AND TWO WAY TABLES
Some variables, like sex, race, and occupation, are inherently categorical. Other categorical variables are created by grouping values of a quantitative variable into classes. Published data, such as those in magazines or online articles, are often reported in grouped form to save space. To analyze categorical data, we use the counts or percents of individuals that fall into the various categories. Raw data are often presented in a two-way table because it describes two categorical variables: one is the row variable and one is the column variable (similar to a matrix design). The entries in the table are the counts. Later we will learn that tables can take on other sizes depending on the number of categories involved in the variables. Tables are always a great way to sort and organize lots of data.
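To make the idea of a two-way table concrete, here is a minimal Python sketch that tallies raw survey answers into counts. The responses, the grade and flavor labels, and the function name two_way_table are all made up for illustration; they are not data from this lesson.

```python
from collections import Counter

# Hypothetical survey answers: one (grade, favorite flavor) pair per student.
responses = [
    ("9th", "chocolate"), ("9th", "vanilla"), ("9th", "chocolate"),
    ("10th", "vanilla"), ("10th", "strawberry"), ("10th", "chocolate"),
    ("11th", "vanilla"), ("11th", "vanilla"),
]

def two_way_table(records):
    """Return (row labels, column labels, table of counts)."""
    counts = Counter(records)
    # Keep the categories in the order they first appear in the data.
    rows = list(dict.fromkeys(r for r, _ in records))
    cols = list(dict.fromkeys(c for _, c in records))
    # A pair that never occurs gets a count of 0.
    table = [[counts[(r, c)] for c in cols] for r in rows]
    return rows, cols, table

grades, flavors, table = two_way_table(responses)

# Print the table with grade as the row variable and flavor as the column variable.
print("grade", *flavors)
for grade, row in zip(grades, table):
    print(grade, *row)
```

Each entry in the printed table is a count of students, which matches how the entries of a two-way table are described above.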
First look at the distribution of each variable separately. The distribution of a categorical variable just says how often each outcome occurred. If the totals are not already given, your first order of business is to fill them in. Create a "Total" column at the right of the table that contains the total for each row; these row totals give the distribution of the row variable. Do the same for the columns: create a "Total" row at the bottom, which gives the distribution of the column variable. The distributions in the total row and total column are called "marginal distributions" because they appear in the margins of the table. They are usually expressed as percents of the grand total. Because of round-off error, the percents sometimes do not add up to exactly 100%. We can use a bar graph or pie chart to display a marginal distribution. A two-way table contains a great deal of information in compact form, and making that information clear almost always requires finding percents.
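The steps for finding marginal distributions can be sketched in a few lines of Python. The counts below are hypothetical (150 imaginary students), and the variable names are chosen only for this example.

```python
# Hypothetical two-way table of counts: rows are grades, columns are flavors.
flavors = ["chocolate", "vanilla", "strawberry"]
counts = {
    "9th":  [30, 20, 10],
    "10th": [20, 20, 10],
    "11th": [10, 20, 10],
}

# "Total" column: one total per row (grade).
row_totals = {grade: sum(row) for grade, row in counts.items()}

# "Total" row: one total per column (flavor).
col_totals = [sum(col) for col in zip(*counts.values())]

# Grand total: every student counted once.
grand_total = sum(row_totals.values())

# Marginal distributions as percents of the grand total, rounded to one decimal.
grade_marginal = {g: round(100 * t / grand_total, 1) for g, t in row_totals.items()}
flavor_marginal = {f: round(100 * t / grand_total, 1) for f, t in zip(flavors, col_totals)}

print("Row totals:", row_totals)
print("Column totals:", dict(zip(flavors, col_totals)))
print("Marginal distribution of grade (%):", grade_marginal)
print("Marginal distribution of flavor (%):", flavor_marginal)
```

Because the percents are rounded, a marginal distribution can add up to slightly more or less than 100%. That is round-off error, not a mistake in the counts.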
To describe relationships between or among categorical variables, calculate appropriate percents from the counts given. Although graphs are NOT as useful for describing categorical variables as they are for quantitative variables, a graph still helps an audience grasp the data quickly. Although bar graphs look a bit like histograms, their details and uses are different. A histogram shows the distribution of the values of a quantitative variable with numerically scaled axes while a bar graph compares the sizes of different items that are non-numerical. The horizontal axis of a bar graph need not have any measurement scale but may simply identify the items by name. The vertical axis would represent the count or percent.
A conditional distribution gives the percents for the entries in one row or one column, with that row or column describing a specific condition. Each entry is divided by the total for its own row (or its own column), so the percents within a single conditional distribution add up to 100%. Although we did not work with this, statistical software can speed the task of finding each entry in a two-way table as a percent of its column. It can also calculate row percents and totals. You will be doing all that by hand. The conditional distributions can be displayed as side-by-side bar graphs or even side-by-side pie charts. Side-by-side segmented bar graphs make it easy to see whether the two variables appear to be independent.
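As a check on hand calculations, the following Python sketch finds the conditional distribution of flavor for each grade, that is, each entry as a percent of its row total. The counts are hypothetical and the function name conditional_by_row is made up for this example.

```python
# Hypothetical two-way table of counts: rows are grades, columns are flavors.
flavors = ["chocolate", "vanilla", "strawberry"]
counts = {
    "9th":  [30, 20, 10],
    "10th": [20, 20, 10],
    "11th": [10, 20, 10],
}

def conditional_by_row(table):
    """Express each entry as a percent of its own row total."""
    result = {}
    for label, row in table.items():
        total = sum(row)
        result[label] = [round(100 * n / total, 1) for n in row]
    return result

flavor_given_grade = conditional_by_row(counts)

for grade, percents in flavor_given_grade.items():
    print(grade, dict(zip(flavors, percents)))
```

If flavor preference did not depend on grade, the three rows of percents would look about the same. In these made-up counts they differ, which would suggest an association between grade and favorite flavor.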
As is the case with quantitative variables, the effects of lurking variables can change or even reverse relationships between two categorical variables. Surprises can await the unsuspecting user of data. Comparison of several groups where data are combined to form a single data set can be very misleading. Simpson's Paradox refers to the reversal of the direction of a comparison or an association when data from several groups are combined to form a single group. It is an extreme form of how lurking variables can be misleading and would be worth researching.
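A small numerical illustration may help show how such a reversal can happen. The Python sketch below uses invented success counts for two hypothetical treatments, A and B, split by a lurking variable (how difficult the case is). None of these numbers come from a real study.

```python
# Hypothetical (successes, total) counts for two treatments,
# split by a lurking variable: how difficult the case is.
results = {
    "A": {"easy": (9, 10),   "hard": (30, 100)},
    "B": {"easy": (80, 100), "hard": (2, 10)},
}

def success_rate(successes, total):
    """Success rate as a percent."""
    return 100 * successes / total

def combined_rate(groups):
    """Success rate after the groups are combined into one."""
    successes = sum(s for s, _ in groups.values())
    total = sum(n for _, n in groups.values())
    return success_rate(successes, total)

# Compare the treatments within each group.
for group in ("easy", "hard"):
    a = success_rate(*results["A"][group])
    b = success_rate(*results["B"][group])
    print(f"{group} cases: A = {a:.1f}%, B = {b:.1f}%")

# Compare the treatments after combining the groups.
a_all = combined_rate(results["A"])
b_all = combined_rate(results["B"])
print(f"combined: A = {a_all:.1f}%, B = {b_all:.1f}%")
```

In this made-up example A has the higher success rate for easy cases and for hard cases, yet B has the higher rate once the groups are combined. The reversal happens because A was used mostly on hard cases while B was used mostly on easy ones.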