Clustering

This endpoint processes the data set, clusters the customers, and visualizes the results.

  • Data processing: features with a single value, or with as many distinct values as there are samples ('customerId' excluded), are removed. Where features are highly correlated (correlation above the 0.8 threshold), one of them is removed. Categorical data is then one-hot encoded, and numeric data is scaled.

  • Clustering: users can select the features used to train the clustering model (if no features are selected, all features are included) and the linkage criterion (single, complete, average, or ward; the default is ward). Users can also input the number of clusters (an integer from 0 to 10 inclusive). If 0 or 1 is given, the endpoint runs a silhouette analysis to get a recommended number of clusters, and shows the details of the clustering results.

  • Finally, users are able to select features for visualization.

    1. If two features are selected, a scatter plot based on the clusters will be shown.

    2. If more than two features are selected, users can choose a dimensionality reduction method:

    • PCA: The selected features are passed through PCA to obtain two principal components, which are used as the x and y axes for visualization.

    • TSNE: t-SNE has a wide variety of tuning parameters. The default values are used for all of them.

    • Note: For both methods, if the sample size is 3000 or fewer, the whole dataset is used for the plot. If the sample size is more than 3000, 3000 samples are selected by stratified sampling. Because t-SNE is very expensive to run, if the number of features is very high (>50), PCA is applied first to reduce the data to 50 components.

    • PCA vs TSNE:
      (1) t-SNE is computationally expensive.
      (2) PCA is a mathematical technique, whereas t-SNE is a probabilistic one.
      (3) Linear dimensionality reduction algorithms such as PCA concentrate on placing dissimilar data points far apart in the lower-dimensional representation. To represent high-dimensional data that lies on a low-dimensional, non-linear manifold, however, it is essential that similar data points are placed close together, which is what t-SNE aims for.
      (4) Different t-SNE runs with the same hyperparameters may produce different results, so several plots should be compared before making any assessment.
      (5) Because PCA is a linear algorithm, it cannot capture complex non-linear (for example polynomial) relationships between features, whereas t-SNE is designed to capture exactly that.
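
The clustering step described above can be sketched as follows. This is a minimal illustration, not the endpoint's implementation: it assumes scikit-learn's `AgglomerativeClustering` and `silhouette_score` as stand-ins (the page does not name the library used), assumes the features are already encoded and scaled, and assumes that the silhouette analysis tries every cluster count from 2 to 10. The function name `cluster_customers` and its parameters are hypothetical.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import silhouette_score


def cluster_customers(X, number_of_clusters=0, linkage_method="ward", max_clusters=10):
    """Return (labels, k, average silhouette score) for a scaled feature matrix X."""
    if number_of_clusters in (0, 1):
        # Assumption: the recommendation searches k = 2 .. max_clusters.
        candidates = range(2, max_clusters + 1)
    else:
        candidates = [number_of_clusters]

    best = None
    for k in candidates:
        model = AgglomerativeClustering(n_clusters=k, linkage=linkage_method)
        labels = model.fit_predict(X)
        score = silhouette_score(X, labels)
        # Keep the candidate with the highest average silhouette score.
        if best is None or score > best[2]:
            best = (labels, k, score)
    return best


# Toy data (not from the API): three well-separated groups of 30 points each.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 10.0], [20.0, 0.0]])
X = np.vstack([c + rng.normal(scale=0.5, size=(30, 2)) for c in centers])

labels, k, score = cluster_customers(X, number_of_clusters=0)
print("recommended clusters:", k, "average silhouette:", round(float(score), 5))
```

Passing a value between 2 and 10 skips the search and clusters with that count directly, which mirrors the behaviour described for numberOfClusters.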

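The 3000-sample cap used for plotting can be illustrated with a small standard-library sketch. The page does not say which variable the sampling is stratified on; the version below assumes the strata are the cluster labels and allocates samples in proportion to cluster size. The helper name `stratified_sample` and the `seed` parameter are hypothetical.

```python
import random
from collections import defaultdict


def stratified_sample(labels, max_samples=3000, seed=0):
    """Return sorted row indices, sampling each cluster in proportion to its size."""
    n = len(labels)
    if n <= max_samples:
        # Small datasets are plotted in full.
        return list(range(n))

    by_cluster = defaultdict(list)
    for index, label in enumerate(labels):
        by_cluster[label].append(index)

    rng = random.Random(seed)
    chosen = []
    for members in by_cluster.values():
        # Proportional allocation, keeping at least one point per cluster.
        quota = max(1, round(len(members) * max_samples / n))
        chosen.extend(rng.sample(members, min(quota, len(members))))

    # Rounding can overshoot by a few rows; trim at random to respect the cap.
    if len(chosen) > max_samples:
        chosen = rng.sample(chosen, max_samples)
    return sorted(chosen)


# Toy labels (not from the API): 10000 customers in three clusters.
labels = [0] * 6000 + [1] * 3000 + [2] * 1000
picked = stratified_sample(labels)
print(len(picked))
```

Because of rounding, the result can contain slightly fewer than `max_samples` rows for some cluster sizes; the sketch only guarantees that the cap is never exceeded.
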
Request Body

| Name | Data Type | Description | Mandatory | Sample Value | List of possible values | Notes |
| --- | --- | --- | --- | --- | --- | --- |
| customers | Array of Dictionary | Contains the customers' dataset | Y | "customers": [ { "nationality": "Indian", "gender": "male", "age": 19, "race": "Canadian", "platformCountry": "IN", "personalIncome": 8806, "savingsRatio": 0.49, "customerId": "273a3980-2458-11e9-b0a0-9d372de0f001" }, ... { "nationality": "Chinese", "gender": "female", "age": 50, "race": "Indian", "platformCountry": "SG", "personalIncome": 4043, "savingsRatio": 0.61, "customerId": "273a3980-2458-11e9-b0a0-9d372de0f120" } ] | | The set of key-value pairs in the dictionaries is not fixed and varies between datasets. The input validation of the key-value pairs is defined and set by the customer. |
| numberOfClusters | Integer | Number of groups to segment the dataset into | Y | 10 | 0-10, including 0 and 10 | When the value of numberOfClusters is 0 or 1, the algorithm finds the optimal number of groups. |
| linkageMethod | String | The linkage criterion determines which distance to use between sets of observations. The algorithm merges the pairs of clusters that minimize the linkage criterion. | Y | "average" | 'single', 'complete', 'ward', 'average' | |
| baseFeaturesForCluster | Array | The clustering algorithm uses the selected features to segment the groups | Y | "base_features_for_cluster": ["age", "gender", "personalIncome", "savingsRatio", "platformCountry", "nationality"] | | When the array is empty, all features found in the customer input data are used. The values in the array must match keys found in the customer input data. |
| showDendrogram | Boolean | Show dendrogram output value information | Y | true | true or false | |
| showScatterPlot | Boolean | Show scatter plot output value information | Y | true | true or false | |
| featuresForScatterPlot | Array | The selected features are used to plot the scatter graph | Y | ["age", "personalIncome", "savingsRatio"] | Null, or length of array is at least 2 | The values in the array must match keys found in the customer input data. When the length of featuresForScatterPlot is exactly 2, there is a scatterPlot key-value pair in the output. When the length of featuresForScatterPlot is > 2, the dimensionalityReductionMethod value is used to reduce the number of features to a set of principal variables, which are used to plot the scatter graph. When the length of featuresForScatterPlot is exactly 0, all numeric features are used in the scatter plot. |
| dimensionalityReductionMethod | String | Dimensionality reduction is the process of reducing the dimension of the feature set to obtain a set of principal variables | Y | "pca" | "pca", "tsne", null | The dimensionalityReductionMethod value is only used when the length of featuresForScatterPlot is >= 3. Otherwise it is not used. |
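
As an illustration of how the fields above fit together, here is a hedged sketch that assembles and checks a request body before it is sent. The helper `build_clustering_request` is hypothetical, the body keys follow the Name column (the sample value for baseFeaturesForCluster shows a snake_case key, so the exact key the service expects should be confirmed), and rejecting only a one-element featuresForScatterPlot is one reading of the table. Sending the request (URL, bearer token) is left out because those details are not given here.

```python
import json

LINKAGE_METHODS = {"single", "complete", "ward", "average"}
REDUCTION_METHODS = {"pca", "tsne", None}


def build_clustering_request(customers, number_of_clusters=0, linkage_method="ward",
                             base_features=(), scatter_features=None,
                             reduction_method="pca"):
    """Check the arguments against the request table and return the body as a dict."""
    if not isinstance(number_of_clusters, int) or not 0 <= number_of_clusters <= 10:
        raise ValueError("numberOfClusters must be an integer from 0 to 10")
    if linkage_method not in LINKAGE_METHODS:
        raise ValueError(f"unknown linkageMethod: {linkage_method!r}")
    if reduction_method not in REDUCTION_METHODS:
        raise ValueError(f"unknown dimensionalityReductionMethod: {reduction_method!r}")
    if scatter_features is not None and len(scatter_features) == 1:
        raise ValueError("featuresForScatterPlot needs at least 2 features")

    # Feature names must match keys found in the customer records.
    known = set().union(*(c.keys() for c in customers)) - {"customerId"}
    unknown = (set(base_features) | set(scatter_features or ())) - known
    if unknown:
        raise ValueError(f"features not found in customers: {sorted(unknown)}")

    return {
        "customers": customers,
        "numberOfClusters": number_of_clusters,
        "linkageMethod": linkage_method,
        "baseFeaturesForCluster": list(base_features),
        "showDendrogram": True,
        "showScatterPlot": True,
        "featuresForScatterPlot": None if scatter_features is None else list(scatter_features),
        "dimensionalityReductionMethod": reduction_method,
    }


# The two customer records shown in the sample value above.
customers = [
    {"nationality": "Indian", "gender": "male", "age": 19, "race": "Canadian",
     "platformCountry": "IN", "personalIncome": 8806, "savingsRatio": 0.49,
     "customerId": "273a3980-2458-11e9-b0a0-9d372de0f001"},
    {"nationality": "Chinese", "gender": "female", "age": 50, "race": "Indian",
     "platformCountry": "SG", "personalIncome": 4043, "savingsRatio": 0.61,
     "customerId": "273a3980-2458-11e9-b0a0-9d372de0f120"},
]
body = build_clustering_request(
    customers,
    number_of_clusters=0,
    linkage_method="average",
    scatter_features=["age", "personalIncome", "savingsRatio"],
)
payload = json.dumps(body)  # JSON string ready to be sent as the request body
```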

Response Body

| Name | Data Type | Description | Sample Value | No. of decimals | Notes |
| --- | --- | --- | --- | --- | --- |
| clusteringSummary | Dictionary | Contains 5 values: averageSilhouetteScore, customerId2Cluster, featuresForClustering, numberOfClusters, and removedFeatures | | | |
| averageSilhouetteScore | Number | Measure of how similar an object is to its own cluster compared to other clusters. The silhouette score ranges from -1 to +1, where a high value indicates that the object is well matched to its own cluster and poorly matched to neighboring clusters | 0.21876 | 5 | |
| clusterInfo | Dictionary | Information on each cluster: its count, as well as the 25th, 50th, and 75th percentile values for numerical features. | Output structure template: { "clusterInfo": [ { "clusterNumber": 0, "count": 106, "numericFeaturesInfo": { "percentile": [ { "features": [ … ], "percentileValue": 25 }, { "features": [ … ], "percentileValue": 50 }, { "features": [ … ], "percentileValue": 75 } ] } }, { … }, { … } ] } | | |
| clusterNumber | Integer | Describes which cluster a particular customer ID falls into | 0 | 0 | |
| count | Integer | The size of the cluster | 63 | 0 | |
| numericFeaturesInfo | Dictionary | Contains the 25th, 50th, and 75th percentile values for the numerical features used for clustering. The key "value" holds the value of that particular feature at the nth percentile. The key "percentileValue" gives the percentile n | "numericFeaturesInfo": { "percentile": [ { "features": [ { "name": "age", "value": 38 }, { "name": "savingsRatio", "value": 0.16 }, { "name": "personalIncome", "value": 4519.5 } ], "percentileValue": 25 }, … ] } | 5 dp for key "value" | In the sample value shown here, the value of age at the 25th percentile is 38, the value of savingsRatio at the 25th percentile is 0.16, and the value of personalIncome at the 25th percentile is 4519.5. |
| customerId2Cluster | Array of Dictionary | An array of the customer IDs and their corresponding clusters. Each dictionary contains 2 values: clusterNumber and customerId | "customerId2Cluster": [ { "clusterNumber": 0, "customerId": "273a3980-2458-11e9-b0a0-9d372de0f001" }, { "clusterNumber": 0, "customerId": "273a3980-2458-11e9-b0a0-9d372de0f002" }, … ] | | |
| featuresForClustering | Dictionary | Contains lists of the categorical and numerical features that were used in the clustering. | "featuresForClustering": { "categoricFeatures": [ … ], "numericFeatures": [ … ] } | | |
| categoricFeatures | Array | Contains all categorical features used in clustering | "categoricFeatures": [ "race_American", "race_Australian", "race_British", "race_Canadian", "race_Chinese", "race_Indian", "race_Indonesian", "race_New Zealand", "race_Singaporean", "gender_female" ] | | |
| numericFeatures | Array | Contains all numerical features used in clustering | "numericFeatures": [ "age", "personalIncome", "savingsRatio" ] | | |
| numberOfClusters | Integer | Number of groups produced by the clustering algorithm | 4 | | |
| removedFeatures | Dictionary | Contains lists of the categorical and numerical features that were removed. | "removedFeatures": { "categoricFeatures": ["gender_male"], "numericFeatures": [] } | | |
| dendrogramData | Dictionary | Contains the information needed to plot the dendrogram | | | |
| scatterPlot | Dictionary | Contains 3 values: component, methodName and plot | | | |
| component | Array of Dictionary | Contains variance for numerical features for both principal components | "components": [ { "explainVariance": 0.35104, "feature": [ { "name": "age", "value": -0.41682 }, { "name": "personalIncome", "value": 0.73593 }, { "name": "savingsRatio", "value": 0.53354 } ], "name": "PC1" }, { "explainVariance": 0.33639, "feature": [ { "name": "age", "value": -0.65268 }, { "name": "personalIncome", "value": -0.65084 }, { "name": "savingsRatio", "value": 0.38783 } ], "name": "PC2" } ] | | Only shown when methodName is "pca" |
| explainVariance | Number | The fraction of variance explained by a principal component: the ratio between the variance of that principal component and the total variance | 0.33639 | 5 | |
| name | String | Name of the principal component | "PC1" | | Possible values: "PC1" and "PC2" |
| feature | Array of Dictionary | Contains the name of each feature and its importance value in the particular principal component | "feature": [ { "name": "age", "value": -0.65268 }, … ] | 5 dp for key "value" | |
| methodName | String | Method used to create the scatter plot | "pca" | | Possible values: "pca", "tsne", and "2d" |
| plot | Array of Dictionary | Contains the x and y value for each clusterNumber | "plot": [ { "clusterNumber": 0, "xValue": 0.36019, "yValue": 0.38658 }, … ] | | |
| xValue | Number | x coordinate value | 0.36019 | 5 | |
| yValue | Number | y coordinate value | 0.38658 | 5 | |
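
To show how the response fields might be consumed, the sketch below groups customer IDs by cluster and flattens numericFeaturesInfo into a lookup keyed by percentile. It assumes customerId2Cluster is nested under clusteringSummary, as the table describes; the helper names `customers_by_cluster` and `percentile_table` are hypothetical, and the sample data is copied from the sample values above.

```python
from collections import defaultdict


def customers_by_cluster(response):
    """Map each clusterNumber to the list of customerIds assigned to it."""
    groups = defaultdict(list)
    for entry in response["clusteringSummary"]["customerId2Cluster"]:
        groups[entry["clusterNumber"]].append(entry["customerId"])
    return dict(groups)


def percentile_table(numeric_features_info):
    """Flatten numericFeaturesInfo into {percentile: {feature name: value}}."""
    return {
        block["percentileValue"]: {f["name"]: f["value"] for f in block["features"]}
        for block in numeric_features_info["percentile"]
    }


# Fragments taken from the sample values in the table above.
response = {
    "clusteringSummary": {
        "numberOfClusters": 4,
        "customerId2Cluster": [
            {"clusterNumber": 0, "customerId": "273a3980-2458-11e9-b0a0-9d372de0f001"},
            {"clusterNumber": 0, "customerId": "273a3980-2458-11e9-b0a0-9d372de0f002"},
        ],
    }
}
numeric_features_info = {
    "percentile": [
        {
            "features": [
                {"name": "age", "value": 38},
                {"name": "savingsRatio", "value": 0.16},
                {"name": "personalIncome", "value": 4519.5},
            ],
            "percentileValue": 25,
        }
    ]
}

groups = customers_by_cluster(response)
table = percentile_table(numeric_features_info)
print(groups[0])
print(table[25]["age"])
```
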
Authentication
Bearer