Data Mining with Clementine (2004-06-15)


I attended an SPSS web seminar about their Clementine program, which performs data mining. The talk was oriented toward business applications, but it still had some interesting general insights. The speaker started with the claim that projects that incorporated data mining technologies had a much greater return on investment than other projects (www.spss.com/dk/IDC%20Predictive%20Analytics%20and%20ROI%20Report.pdf).

Data mining is not really new. Any time people use information about the world to draw conclusions, you can argue that they are data mining. Typically, though, the term is reserved for situations where the number of observations is large.

Data mining is used to

predict category membership or a numeric value,
group or cluster things together that have similar characteristics,
associate events that occur together or in a sequence,
find outliers that don't fit ordinary patterns or expected behavior.

The first two bullets represent supervised learning and unsupervised learning, and when I have time I want to document some of the approaches used for supervised and unsupervised learning. But for now, these web pages are painfully incomplete.

Finding outliers is an interesting approach that I had not devoted much thought to. Perhaps the outliers are observations that merit additional scrutiny. For example, in some applications, outliers may be potentially fraudulent cases.
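To make the idea of flagging outliers for additional scrutiny concrete, here is a minimal sketch in Python. It uses Tukey's fences (values more than 1.5 interquartile ranges beyond the quartiles), which is one common rule of thumb and not necessarily what Clementine does. The function name find_outliers and the claim amounts are made up for illustration.

```python
from statistics import quantiles


def find_outliers(values, k=1.5):
    """Return the values that fall outside Tukey's fences.

    The fences are Q1 - k*IQR and Q3 + k*IQR, where IQR = Q3 - Q1.
    This is a rule of thumb, not a test of fraud or error.
    """
    q1, _, q3 = quantiles(values, n=4)  # quartiles of the data
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


# Hypothetical claim amounts; one is far larger than the rest.
claims = [120, 135, 110, 140, 125, 130, 115, 2500]
suspicious = find_outliers(claims)
print(suspicious)
```

A flagged value is only a candidate for a closer look. Whether it is fraudulent, a data entry error, or simply unusual still has to be decided by someone who knows the application.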

CRISP-DM is the process model that SPSS uses to organize a data mining project (www.crisp-dm.org). The steps in CRISP-DM are

Business understanding
Data understanding
Data preparation
Modeling
Evaluation
Deployment

There are several feedback loops in this model. For example, data understanding is involved in a feedback loop with business understanding. Modeling is involved in a feedback loop with data preparation. Evaluation, of course, feeds back to business understanding.
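One way to picture the steps and their feedback loops is as a small directed graph. The sketch below is only an illustration of the description above. The names PHASES, FEEDBACK, and next_phases are my own and are not part of CRISP-DM or of any SPSS software.

```python
# The six CRISP-DM steps, in their nominal order.
PHASES = [
    "Business understanding",
    "Data understanding",
    "Data preparation",
    "Modeling",
    "Evaluation",
    "Deployment",
]

# Steps that can send you back to an earlier step.
FEEDBACK = {
    "Data understanding": ["Business understanding"],
    "Modeling": ["Data preparation"],
    "Evaluation": ["Business understanding"],
}


def next_phases(phase):
    """Return the steps reachable from `phase`: forward one step, or back."""
    i = PHASES.index(phase)
    forward = PHASES[i + 1 : i + 2]  # empty list for the last step
    return forward + FEEDBACK.get(phase, [])


print(next_phases("Evaluation"))
```

Seen this way, evaluation leads either on to deployment or back to business understanding, and deployment is the only step with nowhere further to go.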

Children's Memorial Hospital used Clementine in research on treatments for pediatric brain tumors, and the hospital and SPSS were recognized by the Computerworld Honors Foundation for that work (www.spss.com/press/template_view.cfm?PR_ID=636).

A nice feature of Clementine is that models generated by the software can be exported as C code or XML, which allows you to automate the delivery of data mining solutions to other computer platforms or to the web. Clementine also includes a module for mining data from text fields and a module for extracting events from web logs.

The speaker mentioned a new model (CARMA -- Continuous Association Rule Mining Algorithm) which allows interactive pruning of rules in a decision tree (http://control.cs.berkeley.edu/carma.html).
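I did not see the details of CARMA, so the following is not CARMA itself. It is a minimal sketch of the two quantities that association rule methods in general are built on: the support of a set of items and the confidence of a rule. The function names and the shopping baskets are invented for illustration.

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    return sum(items <= t for t in transactions) / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Estimate P(consequent | antecedent) for the rule antecedent -> consequent."""
    base = support(transactions, antecedent)
    if base == 0:
        return 0.0  # the antecedent never occurs, so the rule has no evidence
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / base


# Hypothetical market baskets, one set of items per transaction.
baskets = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

s = support(baskets, {"bread", "milk"})
c = confidence(baskets, {"bread"}, {"milk"})
print(s, c)
```

Pruning rules, interactively or otherwise, typically amounts to discarding rules whose support or confidence falls below a threshold that the analyst chooses.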

In a one-hour presentation, you can't get a good feel for how the software works. It looks like a good, comprehensive package that is easy to use. There are a lot of competing products out there, of course. One of the more intriguing competitors is Weka, an open-source system for data mining. The main site for Weka is at the University of Waikato in Hamilton, New Zealand.

Since Weka is open source, it is popular in data mining classes at universities, where you can't ask the students to go out and buy a thousand-dollar software program (the price of college textbooks is already bad enough).

A good book about Weka is Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations, by Ian H. Witten and Eibe Frank (ISBN: 1558605525).

I have not had a chance to work with either Clementine or Weka.