-*- mode: text -*-

+----------------------------------------------------------------------+
| This archive contains a simple implementation of the Conditional    |
| Mutual Information Maximization for feature selection.              |
+----------------------------------------------------------------------+
| Written by François Fleuret                                         |
| Contact the author for comments & bug reports                       |
| Copyright (C) 2004 EPFL                                             |
+----------------------------------------------------------------------+

$Id: README,v 1.3 2007-08-23 08:36:50 fleuret Exp $

0/ INTRODUCTION

  The CMIM feature selection scheme is designed to select a small
  number of binary features among a very large set, in a two-class
  classification context. It consists of picking features one after
  another: each new feature is chosen to maximize the minimum, taken
  over the features already picked, of the conditional mutual
  information between it and the class to predict given the picked
  feature. Such a criterion picks features which are both
  individually informative and pairwise weakly dependent. CMIM
  stands for Conditional Mutual Information Maximization. See

    Fast Binary Feature Selection with Conditional Mutual Information
    Francois Fleuret
    JMLR 5 (Nov): 1531--1555, 2004
    http://www.jmlr.org/papers/volume5/fleuret04a/fleuret04a.pdf

  A minimal sketch of this selection loop is given in the appendix at
  the end of this file.

1/ INSTALLATION

  To compile and test, just type 'make test'.

  This small test consists of generating a sample set for a toy
  problem and comparing CMIM, MIM and a random feature selection,
  each combined with the naive Bayesian learner. The two populations
  of the toy problem live in the [0, 1]^2 square: the positive
  population lies inside the disc x^2 + y^2 < 1/4, and the negative
  population is everything else. Look at create_samples.cc for more
  details. The features are the responses of linear classifiers
  generated at random.

2/ DATA FILE FORMAT

  Each data file, either for training or for testing, starts with the
  number of samples and the number of features. Each sample is then
  described by two lines: one with the values of the features (0/1)
  and one with the value of the class to predict (0/1). Check the
  train.dat and test.dat files generated by create_samples for an
  example.

  The test file has the same format, and the true class is used to
  estimate the error rates. During testing, the response of the naive
  Bayes classifier before thresholding is saved in a result file (the
  third parameter of the --test option).
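  For instance, a data set of three samples with five features each
  could look as follows. This miniature is made up for illustration
  and assumes that the two counts share the first line and that all
  values are whitespace-separated; the train.dat and test.dat files
  produced by create_samples are the authoritative reference.

    3 5
    0 1 1 0 1
    1
    0 0 1 0 1
    0
    1 1 0 0 0
    1

  A loader for such files could be sketched as below. This is not
  code from the archive, and the names in it are made up:

    #include <cstdio>
    #include <cstdlib>
    #include <vector>

    int main(int argc, char **argv) {
      if(argc < 2) {
        fprintf(stderr, "Usage: %s <data file>\n", argv[0]);
        exit(1);
      }

      FILE *file = fopen(argv[1], "r");
      if(!file) {
        fprintf(stderr, "Can not open %s.\n", argv[1]);
        exit(1);
      }

      int nb_samples, nb_features;
      if(fscanf(file, "%d %d", &nb_samples, &nb_features) != 2) exit(1);

      // features[s * nb_features + f] is the f-th feature of the s-th
      // sample, labels[s] is the class of the s-th sample
      std::vector<int> features(nb_samples * nb_features);
      std::vector<int> labels(nb_samples);

      for(int s = 0; s < nb_samples; s++) {
        for(int f = 0; f < nb_features; f++)
          fscanf(file, "%d", &features[s * nb_features + f]);
        fscanf(file, "%d", &labels[s]);
      }

      printf("Read %d samples with %d features each.\n",
             nb_samples, nb_features);

      fclose(file);
    }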
3/ OPTIONS

  --silent

  Switches off all output to stdout.

  --feature-selection

  Selects the feature selection method.

  --classifier

  Selects the classifier type.

  --error

  Chooses which error to minimize during the bias estimation for the
  CMIM + naive Bayesian combination:

    standard = P(f(X) = 0, Y = 1) + P(f(X) = 1, Y = 0)

    ber = (P(f(X) = 0 | Y = 1) + P(f(X) = 1 | Y = 0)) / 2

  --nb-features

  Sets the number of features to select.

  --cross-validation

  Performs cross-validation.

  --train

  Builds a classifier and saves it to disk.

  --test

  Loads a classifier and tests it on a data set.

4/ LICENCE

  This program is free software; you can redistribute it and/or modify
  it under the terms of the GNU General Public License version 3 as
  published by the Free Software Foundation.

  This program is distributed in the hope that it will be useful, but
  WITHOUT ANY WARRANTY; without even the implied warranty of
  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
  General Public License for more details.
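5/ APPENDIX: A SKETCH OF THE SELECTION LOOP

  For illustration only, here is a minimal, self-contained C++ sketch
  of the greedy selection described in the introduction. This is not
  the code from cmim.cc, which is organized differently and is far
  faster thanks to the lazy evaluation of the scores described in the
  paper, and the names below (cond_mutual_info, mutual_info,
  cmim_select) are made up:

    #include <cmath>
    #include <vector>

    // Empirical conditional mutual information I(Y ; U | V) for 0/1
    // valued variables, estimated from nb samples. The logarithm is
    // natural, which does not change the argmax.
    double cond_mutual_info(int nb, const int *u, const int *v,
                            const int *y) {
      double n[2][2][2] = {}; // n[a][b][c] counts samples with
                              // V = a, U = b, Y = c
      for(int s = 0; s < nb; s++) n[v[s]][u[s]][y[s]] += 1.0;
      double result = 0.0;
      for(int a = 0; a < 2; a++) {
        double nv = n[a][0][0] + n[a][0][1] + n[a][1][0] + n[a][1][1];
        if(nv == 0) continue;
        for(int b = 0; b < 2; b++) for(int c = 0; c < 2; c++) {
          double nuv = n[a][b][0] + n[a][b][1];
          double nyv = n[a][0][c] + n[a][1][c];
          if(n[a][b][c] > 0)
            result += (n[a][b][c] / nb)
              * std::log(n[a][b][c] * nv / (nuv * nyv));
        }
      }
      return result;
    }

    // Plain mutual information I(Y ; U), computed as I(Y ; U | V)
    // with a constant V
    double mutual_info(int nb, const int *u, const int *y) {
      std::vector<int> zero(nb, 0);
      return cond_mutual_info(nb, u, &zero[0], y);
    }

    // Greedy CMIM: picks nb_to_pick features out of nb_features.
    // x[f] points to the nb values (0/1) of feature f, y to the nb
    // class values (0/1), picked receives the selected indexes.
    void cmim_select(int nb, int nb_features, int **x, const int *y,
                     int nb_to_pick, int *picked) {
      // score[f] is the minimum, over the already picked features v,
      // of I(Y ; X_f | X_v), initialized with I(Y ; X_f)
      std::vector<double> score(nb_features);
      for(int f = 0; f < nb_features; f++)
        score[f] = mutual_info(nb, x[f], y);
      for(int k = 0; k < nb_to_pick; k++) {
        int best = 0;
        for(int f = 1; f < nb_features; f++)
          if(score[f] > score[best]) best = f;
        picked[k] = best;
        score[best] = -1.0; // ensures it is never picked again
        for(int f = 0; f < nb_features; f++) {
          double s = cond_mutual_info(nb, x[f], x[best], y);
          if(s < score[f]) score[f] = s;
        }
      }
    }

  Maintaining for every candidate feature the minimum of its
  conditional mutual information with the class given each picked
  feature makes every greedy step a simple argmax followed by an
  update of these minima; the lazy evaluation of the paper speeds
  this up further by avoiding most of the updates.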