This endpoint processes the data set, clusters the customers, and visualizes the results.
Data processing: features with a single value, or with as many distinct values as there are samples ('customerId' excluded), are removed, and one feature of each highly correlated pair (correlation above 0.8) is removed. Categorical data is then one-hot encoded, and numeric data is scaled.
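The feature-removal rules above can be sketched in plain Python as follows. This is a minimal illustration, not the endpoint's actual implementation: the function names, the column-dictionary input, the use of the absolute Pearson correlation, and the choice to keep the first feature of a correlated pair are all assumptions. One-hot encoding and scaling are left out.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long numeric lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def is_numeric(values):
    return all(isinstance(v, (int, float)) for v in values)

def filter_features(columns, id_column="customerId", threshold=0.8):
    """Hypothetical helper: drop uninformative and highly correlated features."""
    n_rows = len(columns[id_column])
    kept, removed = {}, []
    for name, values in columns.items():
        if name == id_column:
            kept[name] = values  # the ID column is exempt from the rules
            continue
        n_unique = len(set(values))
        # Rule 1: a single value, or one distinct value per sample.
        if n_unique == 1 or n_unique == n_rows:
            removed.append(name)
            continue
        # Rule 2 (assumption): drop the later feature of a correlated pair.
        if is_numeric(values) and any(
            abs(pearson(values, other)) > threshold
            for key, other in kept.items()
            if key != id_column and is_numeric(other)
        ):
            removed.append(name)
            continue
        kept[name] = values
    return kept, removed

columns = {
    "customerId": ["c1", "c2", "c3", "c4", "c5"],
    "country": ["SG", "SG", "SG", "SG", "SG"],   # single value
    "nickname": ["a", "b", "c", "d", "e"],       # one value per sample
    "age": [20, 30, 30, 40, 50],
    "ageInMonths": [240, 360, 360, 480, 600],    # perfectly correlated with age
    "savingsRatio": [0.3, 0.1, 0.4, 0.1, 0.2],
}
kept, removed = filter_features(columns)
```

In this toy input, 'country', 'nickname', and 'ageInMonths' end up in the removed list, while 'age' and 'savingsRatio' are kept.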
Clustering: users can select the features used to train the clustering model (if no features are selected, all features are included) and the linkage criterion (single, complete, average, or ward; the default is ward). Users can also specify the number of clusters (an integer from 0 to 10). If 0 or 1 is given, the endpoint runs a silhouette analysis to obtain a recommended number of clusters and shows the details of the clustering results.
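A silhouette analysis of this kind could look roughly like the sketch below, which assumes scikit-learn and NumPy are available. The helper name, the candidate range of 2 to 10 clusters, and the synthetic data are assumptions for illustration; the endpoint's own procedure may differ.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import silhouette_score

def recommend_n_clusters(X, linkage="ward", candidates=range(2, 11)):
    """Hypothetical helper: pick the cluster count with the best average silhouette."""
    scores = {}
    for k in candidates:
        model = AgglomerativeClustering(n_clusters=k, linkage=linkage)
        labels = model.fit_predict(X)
        scores[k] = silhouette_score(X, labels)
    best_k = max(scores, key=scores.get)
    return best_k, scores

# Synthetic, already scaled data: three well-separated groups of 30 points.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
X = np.vstack([c + rng.normal(scale=0.5, size=(30, 2)) for c in centers])

best_k, scores = recommend_n_clusters(X)
```

With groups this clearly separated, the highest average silhouette score is reached at three clusters.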
Finally, users are able to select features for visualization.
If two features are selected, a scatter plot based on the clusters will be shown.
If more than two features are selected, users can choose a dimensionality reduction method:
PCA: The selected features are projected by PCA onto two principal components, which are used as the x and y axes of the plot.
t-SNE: t-SNE has a wide variety of tuning parameters; the default values are used for all of them.
Note: For both methods, if the sample size is less than 3000, the whole data set is plotted. If the sample size is more than 3000, 3000 samples are selected by stratified sampling. Because t-SNE is very expensive to run, if the number of features is very high (> 50), PCA is applied first to reduce the data to 50 principal components.
PCA vs t-SNE: (1) t-SNE is computationally expensive. (2) PCA is a deterministic mathematical technique, whereas t-SNE is a probabilistic one. (3) Linear dimensionality reduction algorithms such as PCA concentrate on placing dissimilar data points far apart in the lower-dimensional representation. To represent high-dimensional data that lies on a low-dimensional, non-linear manifold, however, similar data points must be placed close together. (4) Different t-SNE runs with the same hyperparameters may produce different results, so multiple plots should be examined before making any assessment. (5) Because PCA is a linear algorithm, it cannot capture complex non-linear relationships between features, whereas t-SNE is designed to capture exactly that.
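The stratified sampling mentioned in the note can be sketched as below. The text does not say which variable defines the strata, so this example assumes the cluster labels do; the function name, the fixed seed, and the treatment of exactly 3000 samples (plotted in full here) are also assumptions.

```python
import random
from collections import defaultdict

def stratified_sample(labels, max_samples=3000, seed=0):
    """Hypothetical helper: return row indices sampled in proportion to cluster size."""
    n = len(labels)
    if n <= max_samples:
        return list(range(n))  # small data sets are plotted in full
    by_cluster = defaultdict(list)
    for idx, label in enumerate(labels):
        by_cluster[label].append(idx)
    rng = random.Random(seed)
    chosen = []
    for members in by_cluster.values():
        # Each cluster contributes its proportional share, at least one row.
        # Rounding means the total can differ slightly from max_samples.
        quota = max(1, round(max_samples * len(members) / n))
        chosen.extend(rng.sample(members, min(quota, len(members))))
    return sorted(chosen)

labels = [0] * 6000 + [1] * 3000 + [2] * 1000
sample = stratified_sample(labels)
counts = {c: sum(1 for i in sample if labels[i] == c) for c in (0, 1, 2)}
```

For 10,000 rows split 60/30/10 across three clusters, the sample keeps the same proportions: 1800, 900, and 300 rows.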
When the array is empty, all features found in the customer input data are used. Each value in the array must match a key of the key-value pairs found in the customer input data.
Show dendrogram output value information
true or false
Show scatter plot output value information
true or false
The selected features will be used to plot the scatter graph
["age", "personalIncome", "savingsRatio"]
Null, or length of array is at least 2
Each value in the array must match a key of the key-value pairs found in the customer input data. When the length of featuresForScatterPlot is exactly 2, the output contains a scatterPlot key-value pair. When the length of featuresForScatterPlot is greater than 2, the dimensionalityReductionMethod value is used to reduce the features to a set of principal variables, which are used to plot the scatter graph. When the length of featuresForScatterPlot is exactly 0, all numeric features are used in the scatter plot.
Dimensionality reduction is the process of reducing the dimension of the feature set to obtain a set of principal variables
"pca", "tsne", null
The dimensionality_reduction_method value is used only when the length of features_for_scatter_plot is >= 3. Otherwise, the dimensionality_reduction_method value is ignored.
No. of decimals
Contains 5 values: averageSilhouetteScore, customerId2Cluster, featuresForClustering, numberOfClusters, and removedFeatures
A measure of how similar an object is to its own cluster compared with other clusters. The silhouette score ranges from -1 to +1; a high value indicates that the object is well matched to its own cluster and poorly matched to neighboring clusters.
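As a small worked illustration of the standard silhouette definition, the sketch below computes s(i) = (b - a) / max(a, b) for each point, where a is the mean distance to the other members of the point's own cluster and b is the lowest mean distance to any other cluster. It uses one-dimensional toy data and a hypothetical function name, and it assumes at least two clusters and no duplicate points.

```python
def silhouette_values(points, labels):
    """Silhouette value of every 1-D point, using absolute difference as distance."""
    values = []
    for i, (p, own) in enumerate(zip(points, labels)):
        # Distances from point i to every other point, grouped by cluster.
        dists = {}
        for j, (q, label) in enumerate(zip(points, labels)):
            if j != i:
                dists.setdefault(label, []).append(abs(p - q))
        if own not in dists:
            values.append(0.0)  # common convention for a single-member cluster
            continue
        a = sum(dists[own]) / len(dists[own])
        b = min(sum(d) / len(d) for label, d in dists.items() if label != own)
        values.append((b - a) / max(a, b))
    return values

points = [0.0, 1.0, 10.0, 11.0]
labels = [0, 0, 1, 1]
scores = silhouette_values(points, labels)
average = sum(scores) / len(scores)
```

For the first point, a = 1 and b = 10.5, so s = 9.5 / 10.5, roughly 0.90. The average over all four points is also close to 0.90, reflecting two tight, well-separated clusters.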
Information of each cluster: their respective count, as well as 25th, 50th, and 75th percentile values for numerical features.
Describes which cluster each customer ID falls into.
The size of the cluster.
Contains the 25th, 50th, and 75th percentile values for the numerical features used for clustering. The key "value" holds the value of that feature at the nth percentile. The key "percentileValue" holds n, the percentile itself.
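A percentile summary for one feature could be built as in the sketch below. Only the key names "value" and "percentileValue" come from the description above; the list-of-records layout, the function name, and the linear ("inclusive") interpolation method are assumptions.

```python
from statistics import quantiles

def percentile_summary(values):
    """Hypothetical helper: 25th, 50th and 75th percentiles as a list of records."""
    # n=4 yields the three quartile cut points; "inclusive" interpolates
    # linearly between the observed minimum and maximum.
    q25, q50, q75 = quantiles(values, n=4, method="inclusive")
    return [
        {"percentileValue": p, "value": v}
        for p, v in ((25, q25), (50, q50), (75, q75))
    ]

summary = percentile_summary([1, 2, 3, 4, 5])
```

For the values 1 to 5, the three records carry the values 2.0, 3.0, and 4.0.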