DAT - Pattern Recognition (Lesson)

Pattern Recognition

Introduction

Penrose pattern depiction

Pattern recognition is an important concept in machine learning: it is the recognition of data that is "like" rather than exact in all cases. Pattern recognition algorithms look for similarities in data, the best fit, rather than the exact answer. The image to the left is called a Penrose tiling; its pattern continues outward in rings around the center. Note that the pattern provides a three-dimensional effect. The same shapes and groupings repeat throughout this example.

Look at the picture to the left.  What patterns do you see there?  

The image is created from red, navy, and grey pentagons, a green five-leg star, a blue three-leg star, and a yellow star.

  1. Every green star has 5 red pentagons surrounding it, forming a red flower.
  2. Every navy pentagon has grey pentagons surrounding it, forming a grey flower.
  3. Every grey flower has 5 balls surrounding it.
  4. Every ball is made up of 2 grey pentagons, 2 red pentagons, 2 yellow squares (which also look like rhombuses or diamonds), and a 3-leg star part.
  5. The patterns repeat, and the circle can be enlarged by adding another ring built with the same pattern rules defined above.

Patterns and Words

A simple pattern is finding the odd numbers in a list of numbers. Odd numbers are numbers that are not divisible by 2, and that rule is the pattern. Take the list 1, 2, 3, 4, 5, 6, 7, ... where ... means continuing on in the same manner forever. Picking out 1, 3, 5, 7, ... means continuing with the same pattern, the odd numbers.

Another simple pattern is finding the even numbers in a list of numbers.  Note that the even numbers are numbers that are divisible by 2, a pattern.  

Mathematically and in computer science, each of the above patterns can be represented with a variable, say x, set to the first value. Every value after it is then x + 2. Alternatively, when the list does not contain only sequential evens or odds, use modulus to pick the values out: test x % 2 = 0, where % is the modulus (remainder) operator. Every x value tested is even if the test is true and odd if it is false.
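Both approaches described above can be sketched in Python. This is a minimal illustration, not part of the original lesson; the function names `is_even` and `step_by_two` are made up for the example. Note that Python writes the equality test as `==`.

```python
def is_even(x):
    """Return True when x leaves a remainder of zero after dividing by 2."""
    return x % 2 == 0


def step_by_two(first, count):
    """Start at the first value and keep adding 2 (the x + 2 pattern)."""
    return [first + 2 * i for i in range(count)]


numbers = [1, 2, 3, 4, 5, 6, 7]

# Modulus approach: test every value in the list.
odds = [x for x in numbers if not is_even(x)]
evens = [x for x in numbers if is_even(x)]

print(odds)               # [1, 3, 5, 7]
print(evens)              # [2, 4, 6]
print(step_by_two(1, 4))  # [1, 3, 5, 7]
```

The modulus test works on any list, in any order, while the x + 2 approach only generates the pattern from a known starting value.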

In the table below is the meaning of some words that are used to describe data.  

Vocabulary

| Word | Meaning |
| --- | --- |
| Average | The total number of items over a time period divided by the number of time periods |
| Generally | Describes over 50% for opinion-type (qualitative) data |
| Majority | Over 50% |
| Many | Over 50% when used with qualitative data |
| Mean | The average of the data over a period of time |
| Median | The middle value when the data is in order |
| Mode | In a list of numbers, the number that occurs the most; if several numbers tie for the most occurrences (more than once each), all of them are modes |
| Most | Over 50% in a qualitative study |
| Outlier | A data element, or a few data elements, that don't follow the pattern seen in the data. Outliers may be removed, with a comment stating why they were excluded |
| Range | The ending value minus the beginning value when the numbers are in order |
| Qualitative Data | Opinion or other non-numeric evaluation of data |
| Quantitative Data | Numeric data |
| Vast Majority | Over 70% |
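The numeric terms in the vocabulary (mean, median, mode, and range) can be computed with Python's standard `statistics` module. The sketch below uses a small made-up list of values, not data from this lesson.

```python
import statistics

# Hypothetical quantitative data, invented for illustration.
data = [4, 8, 6, 5, 3, 8, 9, 5]

mean = statistics.mean(data)        # add up all values, divide by how many
median = statistics.median(data)    # middle value once the data is in order
modes = statistics.multimode(data)  # every value tied for most occurrences

ordered = sorted(data)
value_range = ordered[-1] - ordered[0]  # ending value minus beginning value

print("mean:", mean)
print("median:", median)        # 5.5
print("modes:", sorted(modes))  # [5, 8]
print("range:", value_range)    # 6
```

Because 5 and 8 both occur twice, this list has two modes. The median is 5.5 because the list has an even number of values, so the two middle values (5 and 6) are averaged.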

Increasing Data

This is easily visualized in math with an increasing line. As x gets larger, y gets larger. Remember, x could be the number of items and y the cost of buying those items. Climbing or increasing data has a positive slope from left to right. Increasing is a pattern in which each subsequent value is higher than the one before, or in which the values generally rise.
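The two senses of "increasing" described above (every value higher than the one before, or values that rise in general) can be checked in code. This is a sketch under assumptions: the rule used for "generally increasing" (more upward steps than downward steps, and a last value above the first) is one simple choice made for this example, not a standard definition, and the data is invented.

```python
def is_strictly_increasing(values):
    """True when each value is higher than the one before it."""
    return all(b > a for a, b in zip(values, values[1:]))


def is_generally_increasing(values):
    """True when more steps go up than down and the data ends higher
    than it started (an assumed rule, chosen for this illustration)."""
    ups = sum(1 for a, b in zip(values, values[1:]) if b > a)
    downs = sum(1 for a, b in zip(values, values[1:]) if b < a)
    return ups > downs and values[-1] > values[0]


# Hypothetical data: cost of buying 1, 2, 3, 4, 5 items.
costs = [2, 4, 6, 8, 10]
# Hypothetical noisy measurements that rise overall.
noisy = [2, 5, 4, 7, 6, 9]

print(is_strictly_increasing(costs))   # True
print(is_strictly_increasing(noisy))   # False
print(is_generally_increasing(noisy))  # True
```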

Here is another example, with large data plotted in a scatter plot by a computer. Note that the data points are increasing as the temperature rises. A line of best fit can be calculated to estimate where other data points would lie, or to provide a best fit approximation for any data missing at various temperatures. What was just described is called an analysis of the data. With large data sets, visualizations are hard to produce without the aid of computer technology.
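One common way to calculate a line of best fit is ordinary least squares. The lesson does not say which method produced the line in the graph, so treat the method here as an assumption. The temperature and respiration values are invented for illustration and are not read from the scatter graph.

```python
def line_of_best_fit(xs, ys):
    """Return (slope, intercept) of the least-squares line y = slope*x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # How x and y vary together, divided by how much x varies on its own.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical measurements: temperature (x) and soil respiration (y).
temperatures = [5, 10, 15, 20, 25]
respiration = [1.1, 1.9, 3.2, 3.8, 5.0]

slope, intercept = line_of_best_fit(temperatures, respiration)

# Estimate a value at a temperature that was never measured.
estimate_at_12 = slope * 12 + intercept

print(round(slope, 3), round(intercept, 3))  # 0.194 0.09
print(round(estimate_at_12, 3))              # 2.418
```

The positive slope confirms that the data is increasing, and the fitted line fills in an estimate for the missing temperature of 12.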

Scatter graph showing the relationship between soil respiration and temperature.

Decreasing Data

This is easily visualized in math with a decreasing line. As x gets larger, y gets smaller. The variable x could be the temperature, or x could be the number of items, with the cost of each item falling because the more you buy, the less each item costs. Compared with the increasing graph above, the line would be higher on the left and lower on the right. The same holds for a scatter plot of data around the line: the dots fall in the same direction as the line. So a decreasing, or declining, graph falls from left to right and has a negative slope.

Decreasing Data Chart

Look at the graph here, which is decreasing. Though it is curved and not a straight line, the Healthcare-Education graph shows the y values lowering as education, the x value, increases. Without further information, what is being compared is not strongly apparent, but we can discern that the graph is decreasing (falling y values from left to right). Rather than rising from left to right as an increasing graph does, a declining or decreasing graph falls from left to right with a negative slope.
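The sign of the slope tells increasing data from decreasing data. The sketch below computes the slope between the first and last points, which is a simplification chosen for this example, and uses invented prices to show the "more you buy, the less each item costs" case.

```python
def slope(x1, y1, x2, y2):
    """Rise over run between two points."""
    return (y2 - y1) / (x2 - x1)


def describe_trend(points):
    """Classify data by the slope of the line from the first point to the last."""
    (x1, y1), (x2, y2) = points[0], points[-1]
    m = slope(x1, y1, x2, y2)
    if m > 0:
        return "increasing"
    if m < 0:
        return "decreasing"
    return "flat"


# Hypothetical data: (number of items bought, cost per item).
unit_prices = [(1, 5.00), (10, 4.50), (50, 4.00), (100, 3.25)]

print(describe_trend(unit_prices))  # decreasing
```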

Look at the example of data in the bar and line graph below. Note that this graph shows many items, including a stacked bar representing ads and a line graph with the number of members. Can you identify some characteristics, patterns, and possible areas to look for impact? There are questions here to be asked based on the data. Can you identify some? With computer technology to assimilate the data, visualizations can pinpoint possible causes and where to look for effects, or prompt hypotheses for further investigation.

Cyworld

What does this chart represent?

  1. The light green bars are digital item sales.
  2. The purple bars are ads.
  3. The line graph indicates the number of members.
  4. Critical mass is marked (its meaning is unknown here, but may be explained in a data document that could accompany the graph), and memberships started climbing after the merger.
  5. Digital sales after October 2004 are mostly flat (they dropped slightly, then came back to approximately even by February 2006).
    1. Now that the data is visual and the time frame is set, what caused the change can be analyzed.
    2. What else came into the market?
    3. Something has had an impact, as the digital sales stopped rising steadily.
  6. Ad sales grew initially and are now declining. This data pattern again raises questions.
    1. Is the company getting ads only from a fixed section of the market?
    2. Are the ads only extra income?
  7. What is the business plan for the ads? The digital sales do not appear to be affected by the ads, as shown by the visualization of the data. The company can analyze its plan. Were the ads supposed to have increased digital sales?
  8. The constant increase is in the number of members. Why did members continue increasing after ads started, but the ads and digital sales did not?
  9. With further data analysis, the sales could be plotted against the types of ads to see how much of the digital sales comes from the ads the company is running.

Note that if you were to take the average, the mean, of the membership over the whole time period, it would be pulled lower by the small initial numbers if all of them were used. Remember that the mean is found by adding up all of the numbers at the time slots and dividing by the number of time slots.
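The effect of small early values on the mean can be seen with a few lines of Python. The membership counts below are invented for illustration and are not read from the chart.

```python
import statistics

# Hypothetical membership counts (in thousands) at eight time slots.
members = [1, 2, 3, 5, 40, 60, 80, 100]

overall_mean = statistics.mean(members)     # all time slots included
recent_mean = statistics.mean(members[4:])  # only the last four time slots

print(overall_mean)  # 36.375
print(recent_mean)   # 70
```

Including the four small early values drops the mean from 70 to 36.375, so an analysis should state which time slots were used.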

What of the data for Flickr below? What stories does the data tell us when we analyze this large data set illustrated with computer technology as an image? What patterns do you see?

Flickr Data

What does this chart tell us?

  1. Picture uploads reached their height in the second quarter of the year 2013 (a quarter is 3 months of the year).
  2. Beginning with the year 2007, the number of pictures uploaded drops at the end of each year and the beginning of the next, in a similar pattern.
  3. Why the strong climb and steep drop in picture uploads in the year 2013?
  4. The years 2014 and 2015 appear to give similar patterns of picture uploads throughout the year. Why?
  5. A similar pattern of data is also found in the years 2009 and 2010. Is there a cause shared with 2014 and 2015, apart from the fact that more uploads occurred in 2014 and 2015?
  6. The number of uploads appears to peak in the middle of the year, decreasing at the end of the year and increasing again starting at the beginning of the next year. Could this be summer vacations?
  7. What could answer these various questions? This is the reason for mapping the Big Data. It provides us with starting points for further analysis that might not be evident due to the sheer number of items to process.

As we note, visually seeing large data mapped using computer technology allows for questioning of causes when patterns or anomalies are found in the data.

Pie Charts

Let's look at another type of data representation, pie charts. Pie charts are circular and represent each piece of data in comparison to the other data, usually as a percentage of the pie, where the whole pie is 100%. In this example, bar charts of the same data are also used to show another method of data visualization. Note which one makes the data easier to understand.

Pie Charts

Note that although we can see that the pie charts are slightly different, without precise percentage labels it is hard to compare the data. Pie charts should contain percentages or values in order to provide comparison data. The bar chart interpretation of the pie chart with numeric values provides a better analysis.

In the pie charts, the green slice 3 shifts somewhat from chart to chart, which hints at larger and smaller values for the two slices to its right (1 & 2) and the two colors to its left (4 & 5). Looking at the reds, note that in the pie charts the red grows from left to right, and the yellows decrease from right to left if you look carefully. This would be qualitative opinion data, as there is no numeric quantity to make it quantitative data.

The bar chart verifies the opinion that you formed, and it is quicker and easier to read for a pattern: increasing color values, essentially flat or horizontal color values, and decreasing color values. With the pie charts, more time was needed to decipher the patterns across all of the colors.

This is an example of using appropriate charts to convey to others the patterns that you wish them to see. Different charts convey data differently. Using different charts on the same data may also identify different types of questions to be asked about the data for analysis.
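The point made earlier, that pie charts need percentages or values to be comparable, comes down to simple arithmetic: each slice is its value divided by the total, times 100. The sketch below uses invented slice values; the labels 1 to 5 simply mirror the slice numbers mentioned in the discussion.

```python
def to_percentages(values):
    """Convert raw slice values into percentages of the whole pie (100%)."""
    total = sum(values.values())
    return {label: round(100 * value / total, 1) for label, value in values.items()}


# Hypothetical raw values for five slices of one pie chart.
chart = {"1": 34, "2": 36, "3": 40, "4": 44, "5": 46}

percentages = to_percentages(chart)
for label, pct in percentages.items():
    print(f"slice {label}: {pct}%")
```

With the percentages printed next to each slice, small differences such as 17.0% versus 18.0% become readable instead of being judged by eye.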

Here are two pie charts with data that can be analyzed from the actual chart.

Pie Charts from Wikimedia showing browser usage.

What patterns and questions do you see in these charts?

Outlier Data

Outliers may go against the overall pattern. An outlier is an occasional anomaly that may or may not be significant, depending on the data collection method, the options for the data collection, and other factors. Analysis of the data should account for why an outlier would occur in a research project. If outliers exist in large numbers, then the collection of data may need to be re-examined, or the collection process revised and the collection of data started over.

Below is statistical data for which the researcher is attempting to decide where the "line of best fit" would go. Note the red dot that is far from what appears to be the consensus of the data, where the lines are tried. This is an outlier. There might be an explanation for why this data anomaly occurred, depending on how the data was collected.
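A computer can flag candidate outliers for a person to review. The lesson does not name a method, so the sketch below assumes one widely used rule of thumb, the 1.5 × IQR rule: values far below the first quartile or far above the third quartile are flagged. The readings are invented for illustration.

```python
import statistics


def find_outliers(values, k=1.5):
    """Flag values more than k * IQR outside the middle half of the data."""
    q1, _, q3 = statistics.quantiles(values, n=4)  # quartile cut points
    iqr = q3 - q1                                  # spread of the middle 50%
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


# Hypothetical lab readings; one value is far from the rest.
readings = [10.1, 9.8, 10.3, 10.0, 9.9, 10.2, 14.7]

print(find_outliers(readings))  # [14.7]
```

Flagging a value is only the starting point. Whether to keep or exclude it still depends on how the data was collected, and any exclusion should be recorded with its reason.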

Outlier_statistics.svg CC.png

For an outlier, think of this: if you are doing a lab in Chemistry and collecting the data 5 times, is your accuracy going to increase as you get comfortable with the procedures? The first test may be the one where you were "experimenting" while learning the directions. So, when collecting results, you could explain why that data is off. On one of the subsequent tests, maybe you were a little too comfortable, or something happened so that you did not pay full attention to the experiment, and you got an odd result. This is why, as in programming, you keep records of testing: the records provide insight into differences and allow those differences to be analyzed.

A "line of best fit" is a line that shows how the data is trending based on being as close as possible to the majority of the data.

Reading Assignment

Review Chapter 1 of Blown to Bits and read Chapter 2 to understand how the data being collected is affecting society in various areas. These interesting stories of big data use will help you reason about how technology is affecting society. This understanding will be important when you relate the beneficial and harmful effects of a computer innovation in your upcoming College Board Tasks. Understanding that change is not always positive is important, and with this understanding will come ways to help mitigate the negative effects.

IMAGES FROM THE PUBLIC DOMAIN  AND USED ACCORDING TO TERMS OF USE.