Java Data Mining Framework

Usage example - clustering

What is clustering? A simplified definition: it is the process of finding groups of similar items. Items within a group are similar to each other and differ from items in other groups. Keep in mind that data mining algorithms can find those groups, but naming them, based on the clustering results, is up to us.

There are many clustering algorithms available. One of them, the k-means algorithm, is already implemented in JDMF. You can find implementation details in the JavaDocs. For now, it is enough to know that this algorithm operates on sets of points with n coordinates. The distance between points is calculated using the Euclidean metric. Clusters are formed from points that are close to each other and distant from the remaining points.
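
To make the Euclidean metric concrete, here is a minimal sketch in plain Java. It is not taken from JDMF; the class name EuclideanDistanceSketch and the method distance are hypothetical names used only for illustration.

```java
final class EuclideanDistanceSketch {

    // Euclidean distance between two points with the same number of
    // coordinates: the square root of the sum of squared differences.
    // (Illustrative helper, not part of the JDMF API.)
    static double distance( double[] a, double[] b ) {
        if ( a.length != b.length ) {
            throw new IllegalArgumentException(
                "points must have the same number of coordinates" );
        }

        double sumOfSquares = 0.0;

        for ( int i = 0; i < a.length; i++ ) {
            double difference = a[ i ] - b[ i ];
            sumOfSquares += difference * difference;
        }

        return Math.sqrt( sumOfSquares );
    }

    public static void main( String[] args ) {
        // two sample points with two coordinates each
        double[] first = { 10, 13 };
        double[] second = { 7, 17 };

        System.out.println( distance( first, second ) ); // prints 5.0
    }
}
```

The same method works for any number of coordinates, as long as both points have the same length.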

How many clusters will be formed? The decision is up to you. You must predict the number of clusters and then verify the results. If they don't satisfy you, try another number. In general, don't stick with a single number, because other predictions may also give interesting results. The starting conditions are partially randomized, so results may differ between runs; run the algorithm several times for each prediction.
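
The effect of randomized starting conditions can be illustrated with a small, self-contained k-means sketch. This is a textbook variant written for this page, not JDMF's implementation: the class KMeansRestartSketch, its methods, the random choice of starting centres and the cap of 100 iterations are all assumptions. It returns the sum of squared distances between points and their cluster centres, so that different predictions and different random seeds can be compared.

```java
import java.util.Arrays;
import java.util.Random;

final class KMeansRestartSketch {

    static double squaredDistance( double[] a, double[] b ) {
        double sum = 0.0;
        for ( int i = 0; i < a.length; i++ ) {
            sum += ( a[ i ] - b[ i ] ) * ( a[ i ] - b[ i ] );
        }
        return sum;
    }

    static int nearestCentre( double[] point, double[][] centres ) {
        int nearest = 0;
        for ( int c = 1; c < centres.length; c++ ) {
            if ( squaredDistance( point, centres[ c ] )
                    < squaredDistance( point, centres[ nearest ] ) ) {
                nearest = c;
            }
        }
        return nearest;
    }

    // Runs plain k-means from a seeded random start and returns the sum of
    // squared distances between each point and the centre of its cluster.
    static double run( double[][] points, int k, long seed ) {
        Random random = new Random( seed );
        double[][] centres = new double[ k ][];
        for ( int c = 0; c < k; c++ ) {
            // randomized start: each centre begins at a randomly chosen point
            centres[ c ] = points[ random.nextInt( points.length ) ].clone();
        }

        int[] assignment = new int[ points.length ];
        Arrays.fill( assignment, -1 );

        for ( int iteration = 0; iteration < 100; iteration++ ) {
            // assignment step: attach every point to its nearest centre
            boolean changed = false;
            for ( int p = 0; p < points.length; p++ ) {
                int nearest = nearestCentre( points[ p ], centres );
                if ( nearest != assignment[ p ] ) {
                    assignment[ p ] = nearest;
                    changed = true;
                }
            }
            if ( !changed ) {
                break; // no point switched clusters, so we are done
            }

            // update step: move each non-empty cluster's centre to the mean
            for ( int c = 0; c < k; c++ ) {
                double[] sum = new double[ points[ 0 ].length ];
                int count = 0;
                for ( int p = 0; p < points.length; p++ ) {
                    if ( assignment[ p ] == c ) {
                        for ( int d = 0; d < sum.length; d++ ) {
                            sum[ d ] += points[ p ][ d ];
                        }
                        count++;
                    }
                }
                if ( count > 0 ) {
                    for ( int d = 0; d < sum.length; d++ ) {
                        centres[ c ][ d ] = sum[ d ] / count;
                    }
                }
            }
        }

        double cost = 0.0;
        for ( int p = 0; p < points.length; p++ ) {
            cost += squaredDistance( points[ p ], centres[ assignment[ p ] ] );
        }
        return cost;
    }

    public static void main( String[] args ) {
        // a few sample points with two coordinates each
        double[][] points = {
            { 10, 13 }, { 1, 20 }, { 5, 22 }, { 38, 9 }, { 30, 3 },
            { 32, 5 }, { 40, 45 }, { 49, 35 }, { 44, 48 }
        };

        // try several predictions, and several random starts for each one
        for ( int k = 2; k <= 4; k++ ) {
            for ( long seed = 1; seed <= 3; seed++ ) {
                System.out.println( "k=" + k + ", seed=" + seed
                    + ", cost=" + run( points, k, seed ) );
            }
        }
    }
}
```

A lower cost means tighter clusters for a given number of clusters. Costs for different numbers of clusters are not directly comparable, because the cost tends to shrink as the number of clusters grows.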

Example data

The algorithm needs some data to operate on, so let's prepare an example data set. It consists of two attributes with 50 numerical values each. Values at the same position in both attributes form pairs. We will try to find clusters of similar pairs. Let's take a look at our data set:

First attribute Second attribute
10 13
1 20
5 22
7 14
1 28
8 24
16 17
3 17
14 21
12 27
8 12
9 26
22 14
38 9
30 3
32 5
36 12
25 10
37 7
29 13
24 11
31 8
33 9
27 7
35 30
40 45
49 35
44 48
46 36
36 47
41 31
35 39
39 44
38 42
48 34
47 32
11 2
2 48
32 15
4 22
47 33
5 29
7 17
9 8
12 32
45 26
39 6
22 31
29 41
19 37

This data set should form three clusters. This is not a guess: the data set deliberately contains three groups of points close to each other. Let's verify our prediction.
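
Each row of the table above is one point: the value of the first attribute is paired with the value of the second attribute at the same position. The following sketch shows that pairing in plain Java. The class PairingSketch and the method toPoints are hypothetical names and are not part of JDMF.

```java
import java.util.Arrays;

final class PairingSketch {

    // Combines two equally long value columns into an array of 2D points.
    // (Illustrative helper, not part of the JDMF API.)
    static double[][] toPoints( double[] first, double[] second ) {
        if ( first.length != second.length ) {
            throw new IllegalArgumentException(
                "both attributes must have the same number of values" );
        }

        double[][] points = new double[ first.length ][];

        for ( int i = 0; i < first.length; i++ ) {
            points[ i ] = new double[] { first[ i ], second[ i ] };
        }

        return points;
    }

    public static void main( String[] args ) {
        // the first four rows of the example data set
        double[] first = { 10, 1, 5, 7 };
        double[] second = { 13, 20, 22, 14 };

        for ( double[] point : toPoints( first, second ) ) {
            System.out.println( Arrays.toString( point ) );
        }
    }
}
```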

Preparing data for the algorithm

There are a few ways to do this. For most of the algorithms, the best way is to use an existing implementation of net.sf.jdmf.data.sources.DataSource or write your own implementation. The data source can then be converted to input data using net.sf.jdmf.data.input.InputDataBuilder. Another way is to extend net.sf.jdmf.data.input.InputData or its subclass, though this method should be used only in special cases, such as mocking input data.

Providing data for clustering algorithms is a bit problematic at the moment. You need to extend net.sf.jdmf.data.input.clustering.ClusteringInputData or use net.sf.jdmf.data.input.InputDataBuilder and convert the result by hand. Our example data set will be prepared using the first method:


package net.sf.jdmf.data.input;

import net.sf.jdmf.data.input.attribute.Attribute;
import net.sf.jdmf.data.input.clustering.ClusteringInputData;

public class ExampleClusteringInputData extends ClusteringInputData {
    public ExampleClusteringInputData() {
        super();
        
        prepareAttributes();
    }

    private void prepareAttributes() {
        Attribute firstAttribute = prepareFirstAttribute();
        Attribute secondAttribute = prepareSecondAttribute();
        
        addAttribute( firstAttribute );
        addAttribute( secondAttribute );
    }

    private Attribute prepareFirstAttribute() {
        Attribute firstAttribute = new Attribute();
        firstAttribute.setName( "first" );
        
        double[] attributeValues = { 
            10, 1, 5, 7, 1, 8, 16, 3, 14, 12, 8, 9, 22, 38, 30, 32, 36, 25, 37,
            29, 24, 31, 33, 27, 35, 40, 49, 44, 46, 36, 41, 35, 39, 38, 48, 47,
            11, 2, 32, 4, 47, 5, 7, 9, 12, 45, 39, 22, 29, 19
        };
        
        for ( int i = 0; i < attributeValues.length; i++ ) {
            firstAttribute.addValue( attributeValues[ i ] );
        }
        
        return firstAttribute;
    }
    
    private Attribute prepareSecondAttribute() {
        Attribute secondAttribute = new Attribute();
        secondAttribute.setName( "second" );
        
        double[] attributeValues = { 
            13, 20, 22, 14, 28, 24, 17, 17, 21, 27, 12, 26, 14, 9, 3, 5, 12, 10,
            7, 13, 11, 8, 9, 7, 30, 45, 35, 48, 36, 47, 31, 39, 44, 42, 34, 32,
            2, 48, 15, 22, 33, 29, 17, 8, 32, 26, 6, 31, 41, 37
        };
        
        for ( int i = 0; i < attributeValues.length; i++ ) {
            secondAttribute.addValue( attributeValues[ i ] );
        }
        
        return secondAttribute;
    }
}

Using the algorithm

Here is example code that uses the k-means algorithm to find clusters in our data set:


package net.sf.jdmf.algorithms.clustering;

import java.io.FileWriter;

import net.sf.jdmf.data.input.ExampleClusteringInputData;
import net.sf.jdmf.data.input.clustering.ClusteringInputData;
import net.sf.jdmf.data.output.clustering.ClusteringDataMiningModel;
import net.sf.jdmf.visualization.clustering.ChartGenerator;
import net.sf.jdmf.visualization.export.OutputDataExporter;

import org.apache.batik.transcoder.TranscoderInput;
import org.apache.batik.transcoder.TranscoderOutput;
import org.apache.batik.transcoder.svg2svg.SVGTranscoder;
import org.jfree.chart.ChartFrame;
import org.jfree.chart.JFreeChart;
import org.w3c.dom.svg.SVGDocument;

public class ClusteringExample {
    
    public static void main( String[] args ) throws Exception {
        KMeansAlgorithm algorithm = new KMeansAlgorithm();
        
        ClusteringInputData inputData = new ExampleClusteringInputData();
        // predicted number of clusters
        inputData.setNumberOfClusters( 3 );

        // analyze input data and produce a model
        ClusteringDataMiningModel dataMiningModel 
            = (ClusteringDataMiningModel) algorithm.analyze( inputData );

        ChartGenerator chartGenerator = new ChartGenerator();
        
        // visualize the clusters formed (2D only)
        JFreeChart xyChart = chartGenerator.generateXYChart( 
            dataMiningModel.getClusters(), 0, "first", 1, "second" );
        
        ChartFrame chartFrame = new ChartFrame( "Clustering example", xyChart );
        chartFrame.pack();
        chartFrame.setVisible( true );
        
        // show the percentage of points falling into each cluster
        JFreeChart pieChart = chartGenerator.generatePieChart( 
            dataMiningModel.getClusters() );
        
        ChartFrame anotherChartFrame 
            = new ChartFrame( "Clustering example", pieChart );
        anotherChartFrame.pack();
        anotherChartFrame.setVisible( true );
        
        // finally, export both charts to SVG using Apache Batik
        OutputDataExporter exporter = new OutputDataExporter();
        
        SVGDocument svgXYChart = exporter.exportChartToSVG( xyChart, 600, 450 );

        SVGDocument svgPieChart 
            = exporter.exportChartToSVG( pieChart, 600, 450 );
        
        SVGTranscoder svgTranscoder = new SVGTranscoder();
        
        TranscoderInput xyInput = new TranscoderInput( svgXYChart );
        TranscoderInput pieInput = new TranscoderInput( svgPieChart );
        
        // close both writers when done, so the output is flushed to disk
        try ( FileWriter xyWriter = new FileWriter( "xy-chart.svg" );
              FileWriter pieWriter = new FileWriter( "pie-chart.svg" ) ) {
            TranscoderOutput xyOutput = new TranscoderOutput( xyWriter );
            TranscoderOutput pieOutput = new TranscoderOutput( pieWriter );
            
            svgTranscoder.transcode( xyInput, xyOutput );
            svgTranscoder.transcode( pieInput, pieOutput );
        }
    }
}
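
The pie chart in the example shows the percentage of points falling into each cluster. As a rough illustration of that quantity, here is a sketch that computes it in plain Java from a list of cluster indices. It does not use or describe the internals of ChartGenerator; the class ClusterShareSketch, the method percentages and the sample assignment are hypothetical.

```java
final class ClusterShareSketch {

    // Returns, for each cluster index, the percentage of points assigned
    // to that cluster. (Illustrative helper, not part of the JDMF API.)
    static double[] percentages( int[] assignment, int numberOfClusters ) {
        if ( assignment.length == 0 ) {
            throw new IllegalArgumentException( "no points assigned" );
        }

        int[] counts = new int[ numberOfClusters ];

        for ( int cluster : assignment ) {
            counts[ cluster ]++;
        }

        double[] shares = new double[ numberOfClusters ];

        for ( int c = 0; c < numberOfClusters; c++ ) {
            shares[ c ] = 100.0 * counts[ c ] / assignment.length;
        }

        return shares;
    }

    public static void main( String[] args ) {
        // hypothetical assignment of eight points to three clusters
        int[] assignment = { 0, 0, 1, 2, 2, 2, 1, 0 };
        double[] shares = percentages( assignment, 3 );

        for ( int c = 0; c < shares.length; c++ ) {
            System.out.println( "cluster " + c + ": " + shares[ c ] + "%" );
        }
    }
}
```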

Results


XY chart [PNG]

Pie chart [PNG]

XY chart [SVG]

Pie chart [SVG]

To be continued...
