Clustering Advanced Options

This endpoint lets the user select the algorithm used for clustering.

  1. Algorithms that require a "number of clusters" input: KMeans, MiniBatchKMeans, spectral clustering, agglomerative clustering. If the number of clusters is 2 or more, clustering is run with that number of clusters. If the number of clusters is set to 0 or 1, silhouette analysis is run to recommend a number of clusters.

  2. Algorithms that do not require a "number of clusters" input: affinity propagation, mean shift, DBSCAN, HDBSCAN.

  3. It is recommended to first explore and visualize the dataset before selecting an algorithm, especially for large datasets.
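The split between the two algorithm groups can be sketched on the client side as follows. This is a minimal illustration, not the service's implementation: `cluster_count_mode` and its return labels are hypothetical names, the algorithm names are the ones listed for the clusteringAlgorithm field, and the 0 to 10 range is taken from the n_clusters rows of the request table.

```python
# Algorithm names as listed for the clusteringAlgorithm field.
WITH_CLUSTER_COUNT = {
    "KMeans", "MiniBatchKMeans", "SpectralClustering", "AgglomerativeClustering",
}
WITHOUT_CLUSTER_COUNT = {"AffinityPropagation", "MeanShift", "DBSCAN", "HDBSCAN"}


def cluster_count_mode(algorithm, n_clusters=None):
    """Hypothetical helper: predict how n_clusters is treated for an algorithm.

    Returns "not_applicable", "silhouette" or "fixed".
    """
    if algorithm in WITHOUT_CLUSTER_COUNT:
        return "not_applicable"  # these algorithms take no cluster count
    if algorithm not in WITH_CLUSTER_COUNT:
        raise ValueError(f"unknown algorithm: {algorithm!r}")
    if isinstance(n_clusters, bool) or not isinstance(n_clusters, int):
        raise ValueError("n_clusters must be an integer")
    if not 0 <= n_clusters <= 10:
        raise ValueError("n_clusters must be between 0 and 10")
    # 0 or 1 asks for silhouette analysis; 2 or more is used as given.
    return "silhouette" if n_clusters < 2 else "fixed"


print(cluster_count_mode("KMeans", 3))              # fixed
print(cluster_count_mode("SpectralClustering", 1))  # silhouette
print(cluster_count_mode("DBSCAN"))                 # not_applicable
```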

Request Body

| Name | Data Type | Description | Mandatory | Sample Value | List of possible values | Notes |
| --- | --- | --- | --- | --- | --- | --- |
| customers | Array of Dictionary | Array containing each customer’s information. The fields differ per project, based on what is available from the client’s DB. | The array is mandatory. Within each customer’s dictionary, the only mandatory field is customerId | "customers": [ { "nationality": "Indian", "gender": "male", "age": 19, "race": "Canadian", "platformCountry": "IN", "personalIncome": 8806, "savingsRatio": 0.49, "customerId": "273a3980-2458-11e9-b0a0-9d372de0f001" }, ... { "nationality": "Chinese", "gender": "female", "age": 50, "race": "Indian", "platformCountry": "SG", "personalIncome": 4043, "savingsRatio": 0.61, "customerId": "273a3980-2458-11e9-b0a0-9d372de0f120" } ] | | The set of key-value pairs in each dictionary varies. Input validation for these key-value pairs is defined and set by the customer. |
| clusteringAlgorithm | String | Select an algorithm for clustering | Y | KMeans | "KMeans", "MiniBatchKMeans", "SpectralClustering", "AgglomerativeClustering", "AffinityPropagation", "MeanShift", "DBSCAN", "HDBSCAN" | |
| baseFeaturesForCluster | Array | The clustering algorithm uses the selected features to segment the groups | Y | "base_features_for_cluster": ["age", "gender", "personalIncome", "savingsRatio", "platformCountry", "nationality"] | | When the array is empty, all features found in the customer input data are used. Each value in the array must match a key found in the customer input data. |
| showScatterPlot | Boolean | Whether to return the scatter plot x and y values for every point in each cluster | Y | true | true, false | |
| featuresForScatterPlot | Array | The selected features are used to plot the scatter graph | Y | ["age", "personalIncome", "savingsRatio"] | Length of the array is at least 2 or exactly 0 | Each value in the array must match a key found in the customer input data. When the length is exactly 2, the output contains a scatterPlot key-value pair. When the length is > 2, the dimensionalityReductionMethod value is used to reduce the features to a set of principal variables, which are then used to plot the scatter graph. When the length is exactly 0, all numeric features are used in the scatter plot. |
| dimensionalityReductionMethod | String | Dimensionality reduction is the process of reducing the dimension of the feature set to obtain a set of principal variables | N | "pca" | "pca", "tsne", null | Used only when the length of featuresForScatterPlot is >= 3; otherwise it is ignored. If the value is null or the field is not included in the input payload, "pca" is used. |
| parameters | Dictionary | Contains the set of parameters for the chosen clusteringAlgorithm | Y | "parameters": { "KMeans": {"n_clusters": 3}, "MiniBatchKMeans": {"n_clusters": 3}, "SpectralClustering": { "n_clusters": 3, "affinity": "nearest_neighbors", "assign_labels": "kmeans" }, "AgglomerativeClustering": { "n_clusters": 3, "linkage": "average", "affinity": "manhattan" }, "AffinityPropagation": { "damping": 0.8 }, "MeanShift": {"quantile": 0.7}, "DBSCAN": {"min_samples": 2, "eps": 0.3 }, "HDBSCAN": {"min_samples": 2, "min_cluster_size": 3} } | Possible nameOfAlgo values: "KMeans", "MiniBatchKMeans", "SpectralClustering", "AgglomerativeClustering", "AffinityPropagation", "MeanShift", "DBSCAN", "HDBSCAN" | The nameOfAlgo value in the input must match the clusteringAlgorithm input |
| KMeans | Dictionary | Contains the hyperparameters of KMeans | N | "KMeans": {"n_clusters": 3} | | |
| n_clusters (in KMeans) | Integer | Number of clusters to group the dataset into | Y | "n_clusters": 3 | >=0, <=10 | |
| MiniBatchKMeans | Dictionary | Contains the hyperparameters of MiniBatchKMeans | N | "MiniBatchKMeans": {"n_clusters": 2} | | |
| n_clusters (in MiniBatchKMeans) | Integer | Number of clusters to group the dataset into | Y | "n_clusters": 3 | >=0, <=10 | |
| SpectralClustering | Dictionary | Contains the hyperparameters of SpectralClustering | N | "SpectralClustering": { "n_clusters": 1, "affinity": "nearest_neighbors", "assign_labels": "kmeans" } | | |
| n_clusters (in SpectralClustering) | Integer | Number of clusters to group the dataset into | Y | "n_clusters": 3 | >=0, <=10 | |
| assign_labels (in SpectralClustering) | String | The strategy used to assign labels in the embedding space | Y | "assign_labels": "kmeans" | "kmeans", "discretize" | |
| affinity (in SpectralClustering) | String | Method used to construct the affinity matrix | Y | "affinity": "nearest_neighbors" | "nearest_neighbors", "rbf" | |
| AgglomerativeClustering | Dictionary | Contains the hyperparameters of AgglomerativeClustering | N | "AgglomerativeClustering": { "n_clusters": 1, "linkage": "average", "affinity": "manhattan" } | | |
| n_clusters (in AgglomerativeClustering) | Integer | Number of clusters to group the dataset into | Y | "n_clusters": 0 | >=0, <=10 | |
| affinity (in AgglomerativeClustering) | String | Metric used to compute the linkage | Y | "affinity": "euclidean" | "euclidean", "l1", "l2", "manhattan" | |
| linkage (in AgglomerativeClustering) | String | Which linkage criterion to use | Y | "linkage": "average" | "average", "ward", "complete", "single" | If "ward" is chosen, only the "euclidean" affinity can be used |
| AffinityPropagation | Dictionary | Contains the hyperparameters of AffinityPropagation | N | "AffinityPropagation": {"damping": 0.9} | | |
| damping (in AffinityPropagation) | Float | The extent to which the current value is maintained relative to incoming values (weighted 1 - damping) | Y | "damping": 0.6 | >=0.5, <1 | |
| MeanShift | Dictionary | Contains the hyperparameters of MeanShift | N | "MeanShift": {"quantile": 0.2} | | |
| quantile (in MeanShift) | Float | Quantile used to estimate the bandwidth (0.5 means that the median of all pairwise distances is used) | Y | "quantile": 0.2 | >=0, <=1 | |
| DBSCAN | Dictionary | Contains the hyperparameters of DBSCAN | N | "DBSCAN": { "min_samples": 1, "eps": 0.3 } | | |
| min_samples (in DBSCAN) | Integer | The number of samples (or total weight) in a neighborhood for a point to be considered a core point. This includes the point itself. | Y | "min_samples": 10 | >=1 | |
| eps (in DBSCAN) | Float | The maximum distance between two samples for one to be considered in the neighborhood of the other | Y | "eps": 0.2 | >0 | |
| HDBSCAN | Dictionary | Contains the hyperparameters of HDBSCAN | N | "HDBSCAN": { "min_samples": 10, "min_cluster_size": 5 } | | |
| min_samples (in HDBSCAN) | Integer | The number of samples in a neighborhood for a point to be considered a core point | Y | "min_samples": 10 | >=1 | |
| min_cluster_size (in HDBSCAN) | Integer | The minimum size of clusters | Y | "min_cluster_size": 5 | >=2 | |
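As an illustration of how the request fields fit together, the sketch below assembles a payload and checks a few of the documented constraints before serializing it. `build_payload` and its argument names are hypothetical. The key `baseFeaturesForCluster` follows the field name in the table, although the sample value shows `base_features_for_cluster`, so the exact key the service expects is an assumption. No endpoint URL is given in this section, so the sketch stops at producing the JSON body.

```python
import json

ALGORITHMS = {
    "KMeans", "MiniBatchKMeans", "SpectralClustering", "AgglomerativeClustering",
    "AffinityPropagation", "MeanShift", "DBSCAN", "HDBSCAN",
}
REDUCTION_METHODS = {"pca", "tsne", None}


def build_payload(customers, algorithm, hyperparameters, base_features=(),
                  scatter_features=(), show_scatter_plot=True,
                  reduction_method=None):
    """Hypothetical helper: assemble and sanity-check a request body."""
    if algorithm not in ALGORITHMS:
        raise ValueError(f"unknown clusteringAlgorithm: {algorithm!r}")
    for customer in customers:
        if "customerId" not in customer:
            raise ValueError("every customer needs a customerId")
    # Feature names must match keys found in the customer input data.
    known_fields = set().union(*(c.keys() for c in customers))
    for name in list(base_features) + list(scatter_features):
        if name not in known_fields:
            raise ValueError(f"feature not found in customers: {name!r}")
    # featuresForScatterPlot has at least 2 entries or exactly 0.
    if len(scatter_features) == 1:
        raise ValueError("featuresForScatterPlot needs 0 or at least 2 entries")
    if reduction_method not in REDUCTION_METHODS:
        raise ValueError(f"unknown reduction method: {reduction_method!r}")
    payload = {
        "customers": list(customers),
        "clusteringAlgorithm": algorithm,
        # Assumption: camelCase key, as in the Name column.
        "baseFeaturesForCluster": list(base_features),
        "showScatterPlot": show_scatter_plot,
        "featuresForScatterPlot": list(scatter_features),
        # The key inside parameters must match clusteringAlgorithm.
        "parameters": {algorithm: dict(hyperparameters)},
    }
    if reduction_method is not None:  # omitted or null falls back to "pca"
        payload["dimensionalityReductionMethod"] = reduction_method
    return payload


customers = [
    {"nationality": "Indian", "gender": "male", "age": 19,
     "platformCountry": "IN", "personalIncome": 8806, "savingsRatio": 0.49,
     "customerId": "273a3980-2458-11e9-b0a0-9d372de0f001"},
    {"nationality": "Chinese", "gender": "female", "age": 50,
     "platformCountry": "SG", "personalIncome": 4043, "savingsRatio": 0.61,
     "customerId": "273a3980-2458-11e9-b0a0-9d372de0f120"},
]
payload = build_payload(
    customers, "KMeans", {"n_clusters": 3},
    base_features=["age", "personalIncome", "savingsRatio"],
    scatter_features=["age", "personalIncome", "savingsRatio"],
)
body = json.dumps(payload)  # JSON text to send as the request body
```

Building `parameters` from the chosen algorithm name keeps the two fields in sync, which is the constraint stated in the Notes column for parameters.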

Response Body

| Name | Data Type | Description | Sample Value | No. of decimals | Notes |
| --- | --- | --- | --- | --- | --- |
| clusteringSummary | Dictionary | Contains 5 values: averageSilhouetteScore, customerId2Cluster, featuresForClustering, numberOfClusters, and removedFeatures | | | |
| averageSilhouetteScore | Number | Measures how similar an object is to its own cluster compared to other clusters. The silhouette score ranges from -1 to +1, where a high value indicates that the object is well matched to its own cluster and poorly matched to neighboring clusters | 0.50472 | 5 | |
| customerId2Cluster | Array of Dictionary | An array of the customer IDs and their corresponding clusters. Each dictionary contains 2 values: clusterNumber and customerId | "customerId2Cluster": [ { "clusterNumber": 4, "customerId": "273a3980-2458-11e9-b0a0-9d372de0f001" }, { "clusterNumber": 2, "customerId": "273a3980-2458-11e9-b0a0-9d372de0f002" }, …] | | |
| clusterNumber | Number | The cluster number that a particular customer ID falls into | 4 | 0 | |
| customerId | String | Customer ID | "273a3980-2458-11e9-b0a0-9d372de0f001" | | |
| featuresForClustering | Dictionary | Contains lists of the categorical and numerical features that were used in the clustering | "featuresForClustering": { "categoricFeatures": [ … ], "numericFeatures": [ … ]} | | |
| categoricFeatures | Array | Contains all categorical features used in clustering | [ "gender_female", "platformCountry_AU", "platformCountry_CA", "platformCountry_GB", "platformCountry_ID", "platformCountry_IN", "platformCountry_NZ", "platformCountry_SG", "platformCountry_US", "nationality_Chinese", "nationality_Singaporean" ] | | |
| numericFeatures | Array | Contains all numerical features used in clustering | ["personalIncome", "savingsRatio"] | | |
| numberOfClusters | Number | Number of groups segmented by the clustering algorithm | 9 | 0 | |
| removedFeatures | Dictionary | Contains lists of the categorical and numerical features that were removed | "removedFeatures": { "categoricFeatures": [ "gender_male", "nationality_American", "nationality_Australian", "nationality_British", "nationality_Canadian", "nationality_Indian", "nationality_Indonesian", "nationality_New Zealand" ], "numericFeatures": [] } | | |
| scatterPlot | Dictionary | Contains 3 values: component, methodName and plot | | | |
| component | Array of Dictionary | Contains the variance of the numerical features for both principal components | "components": [ { "explainVariance": 0.35104, "feature": [ { "name": "age", "value": -0.41682 }, { "name": "personalIncome", "value": 0.73593 }, { "name": "savingsRatio", "value": 0.53354 } ], "name": "PC1" }, { "explainVariance": 0.33639, "feature": [ { "name": "age", "value": -0.65268 }, { "name": "personalIncome", "value": -0.65084 }, { "name": "savingsRatio", "value": 0.38783 } ], "name": "PC2" } ] | | Only shown when methodName is "pca" |
| explainVariance | Number | The fraction of variance explained by a principal component, that is, the ratio between the variance of that principal component and the total variance | 0.33639 | 5 | |
| name | String | Name of the principal component | "PC1" | | Possible values: "PC1" and "PC2" |
| feature | Array of Dictionary | Contains the name of each feature and the importance value of that feature in the particular principal component | "feature": [ { "name": "age", "value": -0.65268 }, … ] | | 5 dp for value |
| methodName | String | Method used to create the scatter plot | "pca" | | Possible values: "pca", "tsne", and "2d" |
| plot | Array of Dictionary | Contains the x and y value for each clusterNumber | "plot": [ { "clusterNumber": 6, "xValue": 0.36019, "yValue": 0.38658 }, …] | | |
| xValue | Number | x coordinate | 0.36019 | 5 | |
| yValue | Number | y coordinate | 0.38658 | 5 | |
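A response with these fields could be consumed along the following lines. This is a sketch under assumptions: `customers_by_cluster` and `dominant_features` are hypothetical helper names, clusteringSummary and scatterPlot are assumed to be top-level keys of the response, and because the table names the field component while its sample value uses "components", the code accepts either key. The sample dictionary is abbreviated from the sample values in the table.

```python
from collections import defaultdict


def customers_by_cluster(response):
    """Group customer IDs by their assigned cluster number."""
    groups = defaultdict(list)
    for entry in response["clusteringSummary"]["customerId2Cluster"]:
        groups[entry["clusterNumber"]].append(entry["customerId"])
    return dict(groups)


def dominant_features(response):
    """For a "pca" plot, map each component to its largest-magnitude feature."""
    scatter = response.get("scatterPlot") or {}
    if scatter.get("methodName") != "pca":
        return {}  # components are only present for "pca"
    # Assumption: the key may be "components" or "component".
    components = scatter.get("components") or scatter.get("component") or []
    result = {}
    for component in components:
        top = max(component["feature"], key=lambda f: abs(f["value"]))
        result[component["name"]] = top["name"]
    return result


sample = {
    "clusteringSummary": {
        "customerId2Cluster": [
            {"clusterNumber": 4,
             "customerId": "273a3980-2458-11e9-b0a0-9d372de0f001"},
            {"clusterNumber": 2,
             "customerId": "273a3980-2458-11e9-b0a0-9d372de0f002"},
        ],
    },
    "scatterPlot": {
        "methodName": "pca",
        "components": [
            {"name": "PC1", "explainVariance": 0.35104, "feature": [
                {"name": "age", "value": -0.41682},
                {"name": "personalIncome", "value": 0.73593},
                {"name": "savingsRatio", "value": 0.53354}]},
            {"name": "PC2", "explainVariance": 0.33639, "feature": [
                {"name": "age", "value": -0.65268},
                {"name": "personalIncome", "value": -0.65084},
                {"name": "savingsRatio", "value": 0.38783}]},
        ],
    },
}

print(customers_by_cluster(sample))
print(dominant_features(sample))  # {'PC1': 'personalIncome', 'PC2': 'age'}
```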