Clustering
Clustering is a common computational technique for data analysis in the life sciences. It partitions data into groups whose members share similar characteristics.
Software
TransClust
TransClust (available for Win/Lin/Mac) [1] is high-throughput clustering software based on Weighted Transitive Graph Projection. Its main advantage over other approaches is that its underlying model directly reflects the hidden transitive substructures that are typical of, for example, biomedical data sets. Compared with other clustering methods, the density parameter (the threshold) can be chosen as intuitively as for k-means.
TransClust can be used:
- Within Cytoscape as a plugin (see tutorial)
- On the developer web-server (link and Tutorial)
- From the terminal, with a graphical user interface (java -jar -Xmx2G -Xss100M TransClust.jar -gui)
- In the terminal as a standalone command line application with the required parameters. Adding -gui to the java command opens the graphical interface instead.
The help page is reproduced here to help you build a valid command:
java -jar [java virtual machine options] TransClust.jar [-key value]
e.g. java -jar -Xmx2G -Xss100M TransClust.jar -i cost_matrix_dir -o clusters.cls
Note: If the input is large and/or complex then the virtual machine
options must be set.
Any values that include spaces must be surrounded by quotation marks
'"'.
{ } denotes the value choices, [ ] means that the value is a list,
and ' ' surrounds a description of the value.
Further note that the keys are not case sensitive, but the class
names of the respective implementations are!
## COMPULSORY OPTIONS
One of the following must be entered.
-key value
-i {'inputdir', 'costmatrixfile.cm'}
Input file or directory.
-o {'output.file', 'output.conf'}
Output file for the clustering results or the generated config
file.
OR
-gui {[OTHER OPTIONS]}
Start the program with the graphical user interface. It is
also possible to initialise the gui with the OTHER OPTIONS
defined below!
OR
-help {}
Show this help manual.
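The compulsory options above can be assembled programmatically. The following is a minimal sketch, not part of TransClust itself: the helper names build_command and render are hypothetical, and the JVM options and file names are taken from the example at the top of the help page. It builds the argument list for a plain clustering run and renders it for display, applying the rule that values containing spaces must be wrapped in double quotation marks.

```python
def build_command(input_path, output_path,
                  jvm_options=("-Xmx2G", "-Xss100M"),
                  jar="TransClust.jar"):
    """Return the argument list for a plain clustering run (-i and -o).

    The helper itself is hypothetical; only the flags come from the help page.
    """
    return ["java", "-jar", *jvm_options, jar,
            "-i", input_path, "-o", output_path]


def render(argv):
    """Join arguments for display, double-quoting values that contain whitespace."""
    return " ".join(f'"{arg}"' if any(ch.isspace() for ch in arg) else arg
                    for arg in argv)


# Same command as the example in the help page.
print(render(build_command("cost_matrix_dir", "clusters.cls")))
# A directory name with spaces gets quoted.
print(render(build_command("my cost matrices", "clusters.cls")))
```

If the command is launched from Python, the unrendered list can be passed directly to subprocess.run; no quoting is needed in that case because no shell parses the arguments.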
## OTHER OPTIONS
These are optional. All parameters that are not specified here are
first taken from the input config file if stated, otherwise from
the default config file that comes with this program. IMPORTANT:
The given input parameter values override any values written in
the config files.
-key value (default value)
EXTRA (not defined in the config file)
-verbose {} ()
Write a short summary of the program results to the standard
output (console).
-cf {true, false} (false)
Use config file (true) or hard coded standard options (false).
-config {'config.conf'}
A config file with the program parameters in the correct
format (see documentation for details).
-mode {0,1} (0)
Determines the mode in which the program should be started
0 Default clustering mode: clustering of given input
and writing the clusters to the output file.
1 General training mode: trains a set of data (cost
matrices) and writes the generated parameters in
the output file.
-info {'file.info'}
A summary of what functions the program carried out.
This file includes information such as the date, the
input and output files, which mode the program was
carried out in, and which processes were done using
which implementations.
-log {ALL,FINEST,FINER,FINE,CONFIG,INFO,WARNING,SEVERE,OFF} (OFF)
Defines the level of logging from the most sensitive level
to completely off.
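The precedence rule stated at the start of OTHER OPTIONS (command line values first, then the input config file, then the default config file) can be illustrated with a small lookup chain. This is only a sketch of the rule, not TransClust's own code: the function name resolve_parameters and the dictionary representation of a config file are assumptions, since the config file format is not shown here. The default values used in the example are the ones listed in this help page.

```python
from collections import ChainMap


def resolve_parameters(command_line, input_config=None, default_config=None):
    """Merge three parameter sources; earlier sources win over later ones.

    Hypothetical helper that mirrors the documented precedence:
    command line > input config file > default config file.
    """
    return dict(ChainMap(command_line, input_config or {}, default_config or {}))


defaults = {"t": 3, "ld": 3, "fi": 100}   # defaults listed in the help page
from_input_config = {"ld": 2}             # assumed content of an input config
from_command_line = {"t": 4}              # e.g. "-t 4" given on the command line

print(resolve_parameters(from_command_line, from_input_config, defaults))
```

Here -t comes from the command line, -ld from the input config file, and -fi falls back to the default.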
GENERAL
-l ['layouterClass'] (FORCEnDLayouter)
A List of class names of layouter implementations. These
implementations are then used for the layouting phase in the
order they are given. Each name should be separated by a ","
(comma). E.g. FORCEnDLayouter,ACCLayouter or for just one
layouter, then only e.g. FORCEnDLayouter.
Implemented Classes {FORCEnDLayouter, ACCLayouter}
-g {'geometricClustererClass'} (SingleLinkageClusterer)
The class name of the geometric clustering implementation.
Implemented Classes {SingleLinkageClusterer, KmeansClusterer}
-p {'postProcessorClass'} (PP_DivideAndReclusterRecursively)
The class name of the post processing implementation. Write
'none' if post-processing should NOT be carried out.
Implemented Classes {PP_RearrangeAndMergeBest, PP_DivideAndRecluster,
PP_DivideAndReclusterRecursively}
-e {ICCEdgesImplementation} (CC2DArray)
The class name of the implementation of the ICCEdges interface
describing the datastructure for the costs between objects.
Implemented Classes {FORCEnDLayouter, ACCLayouter}
-t {1,...,max no. CPUs} (3)
Turn the use of multiple threads on and give the maximum no.
of parallel threads (do not give a number greater than the
number of CPUs your system has).
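Two of the GENERAL options have constraints that are easy to check before launching a run: -l takes a comma-separated list of case-sensitive class names, and -t should not exceed the number of CPUs. The sketch below is a hypothetical helper (general_options is not part of TransClust) that validates the layouter names against the implemented classes listed above and caps the thread count at the CPU count reported by Python.

```python
import os

# Layouter classes listed in the help page; names are case sensitive.
IMPLEMENTED_LAYOUTERS = {"FORCEnDLayouter", "ACCLayouter"}


def general_options(layouters, threads, cpu_count=None):
    """Return the -l and -t arguments for the given layouters and thread count.

    cpu_count is an optional override (useful for testing); by default the
    value reported by os.cpu_count() is used, falling back to 1 if unknown.
    """
    if not layouters:
        raise ValueError("at least one layouter class is required")
    unknown = [name for name in layouters if name not in IMPLEMENTED_LAYOUTERS]
    if unknown:
        raise ValueError(f"unknown layouter class(es): {unknown}")
    if threads < 1:
        raise ValueError("threads must be at least 1")
    available = cpu_count or os.cpu_count() or 1
    return ["-l", ",".join(layouters), "-t", str(min(threads, available))]


print(general_options(["FORCEnDLayouter", "ACCLayouter"], threads=2))
```

The layouters run in the order given, so the order of the input list is preserved in the joined value.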
GENERAL LAYOUT
-ld {2,...,n} (3)
The dimension in which the layouters should run. NOTE:
Because of runtime reasons, ACCLayouter only makes sense for
dimensions 2 and 3.
-lp {'parameterTrainingClass'} (ParameterTraining_SE)
The class name of the parameter training implementation. Write
'none' if parameter training should NOT be carried out.
Implemented Classes {ParameterTraining_SE}
-lps {2,...,n} (15)
Number of parameter configurations for each generation in the
parameter training.
-lpn {1,...,n} (3)
The number of generations that should be used for parameter
training.
FORCEnDLayouter
-fa {'double'} (100.0)
The value for the attraction factor.
-fr {'double'} (100.0)
The value for the repulsion factor.
-fi {'integer'} (100)
Number of iterations.
-ft {'float'} (100.0)
The cooling temperature value for the convergence of the
layout.
ACCLayouter
-aix {'integer'} (10000)
The multiplication factor for the number of iterations.
(Iterations = number of items * factor)
-agx {'integer'} (25)
Multiplication factor for the grid size. (Places on the grid
= number of items * factor)
-asx {'integer'} (15)
Multiplication factor for the maximum step size. Please choose
this smaller than the multiplication factor for the grid size.
-at {'antTypeClass'} (MemoryAnt)
The class name of the type of ant to be used. ('SimpleAnt',
'JumpingAnt', 'JumpingAntWithIncreasingViewSize' or 'MemoryAnt')
-akp {'double'} (0.15)
kp value, the higher this value the higher the probability
to pick up items.
-akd {'double'} (0.2)
kd value, the higher this value the higher the probability
to drop items.
-an {'integer'} (1)
Number of ants.
-am {'integer'} (50)
Memory size: The number of items that the ant remembers.
-aa {'double'} (1.0)
The value of the factor alpha for the neighbourhood function.
(Scales the dissimilarities)
-as {'integer'} (20)
The maximum step size.
-av {'integer'} (2)
The maximum view size. Only used with
JumpingAntsWithIncreasingViewField and MemoryAnts.
-az {'double'} (1.0)
Normalisation threshold.
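The ACCLayouter multiplication factors scale with the size of the input, so it can help to work out the resulting numbers before a run. The sketch below applies the two formulas given above (iterations = number of items * -aix, places on the grid = number of items * -agx) and checks the advice that -asx should be smaller than -agx. The function name acc_layout_sizes is hypothetical, and how -asx translates into an actual step size is not stated in the help page, so it is only validated here, not converted.

```python
def acc_layout_sizes(n_items, aix=10000, agx=25, asx=15):
    """Compute ACCLayouter sizes from the multiplication factors.

    Defaults are the ones listed in the help page. Hypothetical helper,
    not part of TransClust.
    """
    if asx >= agx:
        raise ValueError("-asx should be smaller than -agx")
    return {
        "iterations": n_items * aix,   # Iterations = number of items * factor
        "grid_places": n_items * agx,  # Places on the grid = number of items * factor
    }


print(acc_layout_sizes(100))
```

With the default factors, 100 items give 1,000,000 iterations on a grid with 2,500 places.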
GEOMETRIC CLUSTERING
SingleLinkageClusterer
-sm {'double'} (0.01)
The minimum distance.
-sx {'double'} (5.0)
The maximum distance to look at.
-ss {'double'} (0.01)
The step size.
-sf {'double'} (0.01)
The step size factor.
KmeansClusterer
-km {'integer'} (30)
The maximum k value that is allowed. This means the maximum
number of clusters that the input can be divided into.
-ki {'integer'} (1)
Maximum number of different initial starting point combinations
(for one k) that k-means uses.
OverlappingClustering
-fuzzy {'double'} (false)
Fuzzy threshold to compute overlapping clustering (give to
activate fuzzy clustering) - disabled by default.
-fb {'double'} (lowest observed value)
Fallback value used to create cost matrices.
No tutorial is provided for command line usage, but training data is available from the Description page.
References:
1. Tobias Wittkop, Dorothea Emig, Sita Lange, Sven Rahmann, Mario Albrecht, John H Morris, Sebastian Böcker, Jens Stoye, Jan Baumbach. Partitioning biological data with transitivity clustering. Nat Methods. 2010;7(6):419-20. [PubMed:20508635]