Ryerson Computer Science CPS844 Data Mining


Terminology

Data mining

Automatic searching for useful patterns in large amounts of data

Approaches

  • Classify
  • Associate
  • Cluster

Model

A simplified representation of the data

Machine learning

A program that improves its performance at a task with experience

Instance-based learning: just memorize all the possible sets of values

ML has some overlap with data mining, but DM involves more preprocessing, while ML uses processed data

Weka

Explorer

Preprocess

  • Open file -> data

Classify

  • Choose
    • Prism
    • Test options
      • Don't test on the training set
      • Cross validation
        • Folds == number of models trained
        • N folds means that the training data is divided into N portions and the algorithm is trained with N-1 parts as the training data, while the remaining part is used to test the resulting model (see the sketch after this list)
      • Stratified
  • In Weka, the class attribute is the last column -- it's the 'output'
  • Tree pruning
    • Reduces the tree size: an unpruned tree might be overfit, so it may not perform well on new data
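
A minimal sketch of N-fold cross-validation in plain Python (not Weka's implementation; the instance format and the train/predict functions are assumptions):

```python
import random

def cross_validate(instances, train, predict, n_folds=10):
    """Estimate accuracy with N-fold cross-validation.

    instances: list of (attributes, label) pairs (hypothetical format)
    train:     function(train_set) -> model
    predict:   function(model, attributes) -> label
    """
    data = instances[:]
    random.shuffle(data)                      # note: not stratified
    folds = [data[i::n_folds] for i in range(n_folds)]
    accuracies = []
    for i in range(n_folds):
        test_set = folds[i]
        train_set = [x for j, f in enumerate(folds) if j != i for x in f]
        model = train(train_set)              # one model per fold, N models in total
        correct = sum(1 for attrs, label in test_set
                      if predict(model, attrs) == label)
        accuracies.append(correct / len(test_set))
    return sum(accuracies) / n_folds          # average accuracy over the N folds
```

Stratified cross-validation additionally keeps the class proportions roughly the same in every fold.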

Visualize

Why not just use the training set?

With real-world data, we likely won't be able to produce a perfect model anyway, so using less data to build the model gives us a more realistic picture of what to expect on new data

Confusion matrix

Tells how many instances were correctly classified and how many were incorrectly classified. Ideally, the diagonal of the matrix should hold all the values (sort of like an identity matrix)
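
A small sketch of how a confusion matrix is built from actual vs. predicted labels (the example labels are made up):

```python
from collections import Counter

def confusion_matrix(actual, predicted, classes):
    """Rows = actual class, columns = predicted class."""
    counts = Counter(zip(actual, predicted))
    return [[counts[(a, p)] for p in classes] for a in classes]

actual    = ["yes", "yes", "no", "no", "yes"]
predicted = ["yes", "no",  "no", "yes", "yes"]
for row in confusion_matrix(actual, predicted, ["yes", "no"]):
    print(row)
# [2, 1]   <- actual "yes": 2 classified correctly, 1 misclassified
# [1, 1]   <- actual "no":  1 misclassified, 1 classified correctly
```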

Classification

ZeroR (0R)

No rules... just predict the majority class value.

OneR (1R)

Choose a single attribute for classification; choose the one whose rules make the fewest errors
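
A rough 1R sketch in Python, assuming instances are dicts of nominal attribute values plus a class attribute:

```python
from collections import Counter, defaultdict

def one_r(instances, class_attr):
    """1R: for each attribute, predict the majority class per value;
    keep the attribute whose rules make the fewest errors."""
    attrs = [a for a in instances[0] if a != class_attr]
    best = None                                   # (attribute, errors, rules)
    for attr in attrs:
        by_value = defaultdict(Counter)
        for inst in instances:
            by_value[inst[attr]][inst[class_attr]] += 1
        rules = {v: cnt.most_common(1)[0][0] for v, cnt in by_value.items()}
        errors = sum(1 for inst in instances
                     if rules[inst[attr]] != inst[class_attr])
        if best is None or errors < best[1]:
            best = (attr, errors, rules)
    return best
```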

Discretization

Divide the numeric values into discrete ranges (bins)
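
For example, equal-width binning is one simple way to discretize (a sketch; the temperature values are illustrative):

```python
def equal_width_bins(values, n_bins=3):
    """Equal-width discretization: split the numeric range into n_bins intervals."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    return [min(int((v - lo) / width), n_bins - 1) for v in values]

print(equal_width_bins([64, 65, 68, 69, 70, 71, 72, 75, 80, 85], n_bins=3))
# [0, 0, 0, 0, 0, 1, 1, 1, 2, 2]  e.g. cool / mild / hot
```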

Naive Bayes

  • table of values
  • Laplace estimator adds n (usually 1) to every count

Prism Method

  • If ? then class = value1
  • Repeat until all the instances of the given class are used by rules

Naive Bayes: applies Bayes' theorem, assuming the variables are independent

Temperature table

  • If an attribute is unknown simply omit its value
  • Removing redundant columns improves naive bayes predictions
  • Likelihood and probability are not the same
    • $\Pr(\text{yes}) = \frac{L(\text{yes})}{L(\text{yes}) + L(\text{no})}$
  • To get rid of possible zeros, just add 1 to each attribute-value count for each class (the Laplace estimator). This avoids multiplying by zero (see the sketch after this list)
  • Removing an attribute-- decrease likelihood fraction by 1??
  • Note how to calculate likelihoods and probabilities
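
A minimal Naive Bayes sketch with a Laplace estimator, assuming dict-style nominal instances (not the Weka implementation):

```python
from collections import Counter, defaultdict

def train_nb(instances, class_attr, k=1):
    """Count attribute-value frequencies per class; k is the Laplace estimator."""
    class_counts = Counter(inst[class_attr] for inst in instances)
    value_counts = defaultdict(Counter)          # (attr, class) -> value counts
    attr_values = defaultdict(set)               # attr -> set of observed values
    for inst in instances:
        for attr, value in inst.items():
            if attr != class_attr:
                value_counts[(attr, inst[class_attr])][value] += 1
                attr_values[attr].add(value)
    return class_counts, value_counts, attr_values, k

def classify_nb(model, new_inst):
    class_counts, value_counts, attr_values, k = model
    total = sum(class_counts.values())
    likelihoods = {}
    for cls, n_cls in class_counts.items():
        like = n_cls / total
        for attr, value in new_inst.items():     # unknown attributes: simply omit them
            like *= ((value_counts[(attr, cls)][value] + k)
                     / (n_cls + k * len(attr_values[attr])))
        likelihoods[cls] = like
    z = sum(likelihoods.values())
    return {cls: l / z for cls, l in likelihoods.items()}   # Pr(class) = L / sum of Ls
```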

Document classification

  • Ground truth - the manual classification of documents

(ID3) Choosing the first attribute to branch on

Best = the attribute with the highest information gain

  • See info() calculations
    • When a proportion in the formula is zero, that term contributes zero ($0 \log_2 0$ is treated as 0)
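
A sketch of the info() (entropy) and information-gain calculations, with zero counts contributing zero:

```python
from collections import Counter
from math import log2

def info(labels):
    """Entropy of a list of class labels; a zero count contributes zero."""
    total = len(labels)
    return -sum((n / total) * log2(n / total)
                for n in Counter(labels).values() if n > 0)

def info_gain(instances, attr, class_attr):
    """Gain = info(whole set) - weighted info of the subsets after splitting on attr."""
    labels = [inst[class_attr] for inst in instances]
    value_counts = Counter(inst[attr] for inst in instances)
    remainder = sum((count / len(instances))
                    * info([i[class_attr] for i in instances if i[attr] == value])
                    for value, count in value_counts.items())
    return info(labels) - remainder
```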

Sometimes decision trees can be rewritten using simpler rules

Some trees become too complex because of the subtree replication problem

PRISM a simple covering algorithm

Start with the classes, try to generate a rule set for each

  • For each class
    • While there are instances with that class remaining
      • Go through each attribute/value pair and find the one with the highest accuracy
        • Repeat till the rule is perfect (a code sketch follows below)

If you have a tie for accuracy, choose the one that covers the bigger number of instances

If ? Then class value

  • Fill out the ? field
    • If astigmatism is true and ? then class = value
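
A rough PRISM sketch following the steps above (nominal attributes and dict-style instances are assumptions):

```python
def prism(instances, class_attr):
    """Covering algorithm sketch: one rule set per class."""
    rules = []
    for cls in {inst[class_attr] for inst in instances}:
        remaining = list(instances)
        while any(inst[class_attr] == cls for inst in remaining):
            conditions = {}                   # the "?" part of "if ? then class = cls"
            covered = remaining
            while any(inst[class_attr] != cls for inst in covered):
                best = None                   # (accuracy, coverage, attr, value)
                for attr in covered[0]:
                    if attr == class_attr or attr in conditions:
                        continue
                    for value in {inst[attr] for inst in covered}:
                        subset = [i for i in covered if i[attr] == value]
                        correct = sum(1 for i in subset if i[class_attr] == cls)
                        cand = (correct / len(subset), correct, attr, value)
                        if best is None or cand[:2] > best[:2]:  # tie: bigger coverage wins
                            best = cand
                if best is None:              # no attribute left to add; accept rule as is
                    break
                conditions[best[2]] = best[3]
                covered = [i for i in covered if i[best[2]] == best[3]]
            rules.append((cls, conditions))
            remaining = [i for i in remaining
                         if not all(i.get(a) == v for a, v in conditions.items())]
    return rules
```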

Association rules

  • Item = attribute-value pair
  • Coverage (support) - how often the rule applies (frequency of occurrence)
  • Use hashing for efficiency when checking if the previous item sets contain this item...?
  • If ABC -> D
  • If X then Y; the LHS (X) is the antecedent, the RHS (Y) is the consequent (see the coverage/confidence sketch below). See p/t notes
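
A small sketch for computing a rule's coverage and confidence, assuming instances are dicts of attribute-value pairs:

```python
def coverage_and_confidence(instances, antecedent, consequent):
    """antecedent / consequent: dicts of attribute -> value (the LHS and RHS items)."""
    matches_lhs = [i for i in instances
                   if all(i.get(a) == v for a, v in antecedent.items())]
    matches_both = [i for i in matches_lhs
                    if all(i.get(a) == v for a, v in consequent.items())]
    coverage = len(matches_both)              # instances the rule predicts correctly
    confidence = len(matches_both) / len(matches_lhs) if matches_lhs else 0.0
    return coverage, confidence
```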

Numerical Regression

Linear model

Numeric

  • Straight line
  • Linear regression
    • Sum of squared errors
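
A one-attribute least-squares sketch (the straight line that minimizes the sum of squared errors):

```python
def linear_regression(xs, ys):
    """Fit y = w0 + w1*x by minimizing the sum of squared errors (one attribute)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    w1 = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
          / sum((x - mean_x) ** 2 for x in xs))
    w0 = mean_y - w1 * mean_x
    return w0, w1

w0, w1 = linear_regression([1, 2, 3, 4], [2.1, 3.9, 6.2, 7.8])
print(w0, w1)   # roughly 0.15 and 1.94
```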

Nominal (class)

Multi-response

  • Attributes are all numeric, output is nonnumeric class
    • Can convert the class to numeric (1, 0) values and run one regression per class
  • Membership function

Logistic regression

Linearly separable

Can put a plane between the data points of different classes

Linear Classification using the Perceptron

if $w \cdot a > 0$ predict class A else predict class B
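
A minimal perceptron sketch; the bias is folded into $w$ by prepending a constant 1 to each instance (an assumption, to match the $w \cdot a > 0$ test):

```python
def train_perceptron(instances, epochs=100):
    """instances: list of (attribute vector, label), label +1 (class A) or -1 (class B)."""
    w = [0.0] * (len(instances[0][0]) + 1)
    for _ in range(epochs):
        updated = False
        for a, label in instances:
            a = [1.0] + list(a)               # constant 1 makes the bias part of w
            pred = 1 if sum(wi * ai for wi, ai in zip(w, a)) > 0 else -1
            if pred != label:                 # misclassified: move w toward/away from a
                w = [wi + label * ai for wi, ai in zip(w, a)]
                updated = True
        if not updated:                       # converged (data is linearly separable)
            break
    return w
```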

Instance based learning

From wikipedia: instead of performing explicit generalization, compares new problem instances with instances seen in training, which have been stored in memory. Instance-based learning is a kind of lazy learning.

Instance based learning involves storing all of the instances of the training set. Given a new instance, a, predict the class according to its nearest neighbour

Ways to find nearest neighbour:

  • KD Tree
    1. Creating the tree from the instances
      1. Find the median point on the most widely distributed attribute, draw a line through it (this becomes a parent node)
      2. Switch directions and draw lines through the other direction on either sides of the first line
      3. Repeat until each box contains at most one instance; these instances are the leaf nodes
    2. Finding the nearest neighbour for a new instance
      1. A "star" represents a new instance - move it down the tree from the root toward where it is known to be on the attribute plane
      2. When you've reached the last leaf in the direction of the star location, draw a circle around the star that is of the radius of the distance to that last leaf
      3. Walk back up the tree and down adjacent branches as appropriate to decide if there is a closer neighbour based on the radius
  • Euclidean distance (see the sketch below)
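
A brute-force nearest-neighbour sketch using Euclidean distance; a KD tree returns the same neighbour but avoids scanning every stored instance:

```python
from math import dist   # Euclidean distance (Python 3.8+)

def nearest_neighbour(training, new_point):
    """training: list of (point, class). Return the class of the closest stored instance."""
    closest_point, closest_class = min(training, key=lambda pc: dist(pc[0], new_point))
    return closest_class

training = [((1, 1), "A"), ((2, 2), "A"), ((8, 9), "B")]
print(nearest_neighbour(training, (7, 8)))   # "B"
```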

Clustering

K-Means
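
A minimal k-means sketch (Lloyd's algorithm): assign each point to its nearest centroid, recompute each centroid as the mean of its cluster, repeat until stable. Names and data format are assumptions.

```python
import random
from math import dist

def k_means(points, k, max_iters=100):
    """points: list of numeric tuples. Returns (centroids, clusters)."""
    centroids = random.sample(points, k)
    for _ in range(max_iters):
        clusters = [[] for _ in range(k)]
        for p in points:                      # assign each point to its nearest centroid
            nearest = min(range(k), key=lambda i: dist(p, centroids[i]))
            clusters[nearest].append(p)
        new_centroids = [
            tuple(sum(c) / len(cluster) for c in zip(*cluster)) if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
        if new_centroids == centroids:        # assignments are stable; stop
            break
        centroids = new_centroids
    return centroids, clusters
```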