This endpoint processes the dataset, clusters the customers, and visualizes the results.
-
Data processing: features with a single unique value, or with as many unique values as there are samples ('customerId' excluded), are removed. For each pair of highly correlated features (correlation > 0.8), one of the two is removed. Categorical data is then one-hot encoded, and numeric data is scaled.
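
The feature-removal rules above could be sketched roughly as follows. This is a minimal illustration in plain Python, not the endpoint's actual code: the function names `pearson`, `is_numeric`, and `select_features` are hypothetical, and it assumes that correlation is measured as absolute Pearson correlation on numeric features only and that the later feature of a correlated pair is the one dropped. The all-unique rule is applied to every feature, as the text reads. One-hot encoding and scaling are not shown.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length, non-constant numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def is_numeric(value):
    """True for ints and floats, but not for booleans."""
    return isinstance(value, (int, float)) and not isinstance(value, bool)


def select_features(customers, id_key="customerId", threshold=0.8):
    """Return (kept, removed) feature names; assumes every row has the same keys."""
    n_samples = len(customers)
    columns = {
        name: [row[name] for row in customers]
        for name in customers[0]
        if name != id_key
    }

    kept, removed = [], []
    for name, values in columns.items():
        n_unique = len(set(values))
        # Drop constant features and features with one distinct value per sample.
        if n_unique == 1 or n_unique == n_samples:
            removed.append(name)
        else:
            kept.append(name)

    # Assumption: only numeric features are checked, using |r| > threshold,
    # and the later feature of each correlated pair is removed.
    numeric = [name for name in kept if all(is_numeric(v) for v in columns[name])]
    for i, first in enumerate(numeric):
        if first in removed:
            continue
        for second in numeric[i + 1:]:
            if second in removed:
                continue
            if abs(pearson(columns[first], columns[second])) > threshold:
                removed.append(second)
                kept.remove(second)
    return kept, removed
```

Calling `select_features(customers)` on the `customers` array of a request returns the names that would move on to encoding and scaling, plus the names that were dropped.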
-
For the clustering, users can select the features used to train the clustering model (if no features are selected, all features are included) and the linkage criterion (single, complete, average, or ward; the default is ward). Users can also specify the number of clusters (an integer from 0 to 10, inclusive). If 0 or 1 is given, the endpoint runs a silhouette analysis to obtain a recommended number of clusters and shows the details of the clustering results.
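
As an illustration of how a silhouette analysis can recommend a number of clusters, here is a minimal sketch. It assumes the clustering is agglomerative (hierarchical) and uses scikit-learn's `AgglomerativeClustering` and `silhouette_score`; the document does not name the library, and the function names and the candidate range of 2 to 10 clusters are assumptions made for the example.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import silhouette_score


def recommend_number_of_clusters(X, linkage="ward", max_clusters=10):
    """Return (best_k, scores), where scores maps k to the mean silhouette score."""
    X = np.asarray(X, dtype=float)
    # The silhouette score is only defined for 2 <= k <= n_samples - 1.
    upper = min(max_clusters, len(X) - 1)
    scores = {}
    for k in range(2, upper + 1):
        model = AgglomerativeClustering(n_clusters=k, linkage=linkage)
        labels = model.fit_predict(X)
        scores[k] = silhouette_score(X, labels)
    best_k = max(scores, key=scores.get)
    return best_k, scores


def cluster_customers(X, number_of_clusters=0, linkage="ward"):
    """Cluster X; a request for 0 or 1 clusters triggers the silhouette analysis."""
    if number_of_clusters in (0, 1):
        number_of_clusters, _ = recommend_number_of_clusters(X, linkage)
    model = AgglomerativeClustering(n_clusters=number_of_clusters, linkage=linkage)
    return model.fit_predict(X)
```

Here `X` stands for the preprocessed feature matrix (one-hot encoded and scaled), and the returned labels correspond to the cluster number assigned to each customer.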
-
Finally, users are able to select features for visualization.
-
If two features are selected, a scatter plot based on the clusters will be shown.
-
If more than two features are selected, users can select a dimensionality reduction method:
-
PCA: The selected features are passed through PCA to obtain two principal components, which are used as the x and y axes for visualization.
-
TSNE: t-SNE has a wide variety of tuning parameters. The default values are used for these parameters.
-
Note: For both methods, if the sample size is less than 3000, the whole dataset is used for the plot. If the sample size is more than 3000, 3000 samples are selected by stratified sampling. Because t-SNE is very expensive to run, if the number of features is very high (> 50), PCA is first applied to reduce the data to its 50 most important components.
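
The sampling step in the note above could look something like the sketch below. It assumes the strata are the cluster labels and that each cluster is sampled in proportion to its size; the document does not say which variable is used for stratification, and `stratified_sample` is a hypothetical name. The PCA step applied before t-SNE is not shown.

```python
import random
from collections import defaultdict


def stratified_sample(cluster_labels, max_samples=3000, seed=0):
    """Return sorted row indices, sampled proportionally from each cluster."""
    n = len(cluster_labels)
    if n <= max_samples:
        # Small datasets are plotted in full.
        return list(range(n))

    by_cluster = defaultdict(list)
    for index, label in enumerate(cluster_labels):
        by_cluster[label].append(index)

    rng = random.Random(seed)
    chosen = []
    for indices in by_cluster.values():
        # Proportional allocation with at least one row per cluster; because of
        # rounding, the total can differ from max_samples by a few rows.
        quota = max(1, round(max_samples * len(indices) / n))
        chosen.extend(rng.sample(indices, min(quota, len(indices))))
    return sorted(chosen)
```

The returned indices can then be used to pick the rows that are passed to PCA or t-SNE, so that every cluster stays visible in the plot.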
-
PCA vs TSNE: (1) t-SNE is computationally expensive. (2) PCA is a mathematical technique, whereas t-SNE is a probabilistic one. (3) Linear dimensionality reduction algorithms, like PCA, concentrate on placing dissimilar data points far apart in a lower-dimensional representation. To represent high-dimensional data on a low-dimensional, non-linear manifold, however, similar data points must be placed close together, which is what t-SNE aims to do. (4) Different t-SNE runs with the same hyperparameters may produce different results, so multiple plots should be examined before making any assessment. (5) Because PCA is a linear algorithm, it cannot capture complex non-linear (for example, polynomial) relationships between features, whereas t-SNE is designed to capture them.
-
Request Body
Name | DataType | Description | Mandatory | Sample Value | List of possible values | Notes |
---|---|---|---|---|---|---|
customers | Array of Dictionary | Contains the customer dataset | Y | "customers": [ { "nationality": "Indian", "gender": "male", "age": 19, "race": "Canadian", "platformCountry": "IN", "personalIncome": 8806, "savingsRatio": 0.49, "customerId": "273a3980-2458-11e9-b0a0-9d372de0f001" }, ... { "nationality": "Chinese", "gender": "female", "age": 50, "race": "Indian", "platformCountry": "SG", "personalIncome": 4043, "savingsRatio": 0.61, "customerId": "273a3980-2458-11e9-b0a0-9d372de0f120" } ] | The set of key-value pairs in each dictionary varies from dataset to dataset. Input validation of these key-value pairs is defined and set by the customer | |
numberOfClusters | Integer | Number of groups to segment the dataset into | Y | 10 | 0-10, including 0 and 10 | When the value of numberOfClusters is 0 or 1, the algorithm will find the optimal number of groups |
linkageMethod | String | The linkage criterion determines which distance to use between sets of observations. The algorithm merges the pairs of clusters that minimize the linkage criterion. | Y | "average" | 'single', 'complete', 'ward', 'average' | |
baseFeaturesForCluster | Array | The clustering algorithm will use the selected features to segment the groups | Y | "baseFeaturesForCluster": ["age", "gender", "personalIncome", "savingsRatio", "platformCountry", "nationality"] | When the array is empty, all features found in the customer input data will be used. The values in the array must match keys found in the customer input data | |
showDendrogram | Boolean | Show dendrogram output value information | Y | true | true or false | |
showScatterPlot | Boolean | Show scatter plot output value information | Y | true | true or false | |
featuresForScatterPlot | Array | The selected features will be used to plot the scatter graph | Y | ["age", "personalIncome", "savingsRatio"] | Null, or an array of length at least 2 | The values in the array must match keys found in the customer input data. When the length of featuresForScatterPlot is exactly 2, the output contains a scatterPlot key-value pair. When the length of featuresForScatterPlot is > 2, the dimensionalityReductionMethod value is used to reduce the features to a set of principal variables, which are used to plot the scatter graph. When the length of featuresForScatterPlot is exactly 0, all numeric features are used in the scatter plot |
dimensionalityReductionMethod | String | Dimensionality reduction is the process of reducing the dimension of the feature set to obtain a set of principal variables | Y | "pca" | "pca", "tsne", null | The dimensionalityReductionMethod value is only used when the length of featuresForScatterPlot is >= 3. Otherwise, it is not used |
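
To make the constraints in the table above concrete, the following sketch checks a request body on the client side before it is sent. It is only an illustration of the documented rules, not the endpoint's own validation: `validate_request` and its error messages are hypothetical, and a `featuresForScatterPlot` array of length 1 is treated as invalid based on a reading of the "at least 2" rule.

```python
ALLOWED_LINKAGE = {"single", "complete", "average", "ward"}
ALLOWED_REDUCTION = {"pca", "tsne"}


def validate_request(body):
    """Return a list of problems found in a request body (empty list = looks valid)."""
    customers = body.get("customers")
    if not isinstance(customers, list) or not customers:
        return ["customers must be a non-empty array of dictionaries"]
    known_features = set().union(*(row.keys() for row in customers))

    problems = []
    k = body.get("numberOfClusters")
    if not isinstance(k, int) or isinstance(k, bool) or not 0 <= k <= 10:
        problems.append("numberOfClusters must be an integer from 0 to 10")

    if body.get("linkageMethod") not in ALLOWED_LINKAGE:
        problems.append("linkageMethod must be single, complete, average, or ward")

    # Feature names must exist as keys in the customer input data.
    for key in ("baseFeaturesForCluster", "featuresForScatterPlot"):
        for name in body.get(key) or []:
            if name not in known_features:
                problems.append(f"{key}: unknown feature {name!r}")

    plot_features = body.get("featuresForScatterPlot") or []
    if len(plot_features) == 1:
        problems.append("featuresForScatterPlot needs at least 2 features")
    method = body.get("dimensionalityReductionMethod")
    if len(plot_features) > 2 and method not in ALLOWED_REDUCTION:
        problems.append("dimensionalityReductionMethod must be 'pca' or 'tsne'")
    return problems
```

An empty list means the body satisfies the documented rules; any returned string points at the field to fix before calling the endpoint.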
Response Body
Name | DataType | Description | Sample Value | No. of decimals | Notes |
---|---|---|---|---|---|
clusteringSummary | Dictionary | Contains 5 values: averageSilhouetteScore, customerId2Cluster, featuresForClustering, numberOfClusters, and removedFeatures | |||
averageSilhouetteScore | Number | Measure of how similar an object is to its own cluster compared to other clusters. The silhouette score ranges from -1 to +1, where a high value indicates that the object is well matched to its own cluster and poorly matched to neighboring clusters | 0.21876 | 5 | |
clusterInfo | Dictionary | Information about each cluster: its count, as well as the 25th, 50th, and 75th percentile values of its numerical features. | Output structure template: { "clusterInfo": [ { "clusterNumber": 0, "count": 106, "numericFeaturesInfo": { "percentile": [ { "features": [ … ], "percentileValue": 25 }, { "features": [ … ], "percentileValue": 50 }, { "features": [ … ], "percentileValue": 75 } ] } }, { … }, { … } ] } | ||
clusterNumber | Integer | The cluster number that a particular customer ID falls into | 0 | 0 | |
count | Integer | The size of the cluster | 63 | 0 | |
numericFeaturesInfo | Dictionary | Contains the 25th, 50th, and 75th percentile values for the numerical features used for clustering. The key "value" holds the value of that particular feature at the nth percentile. The key "percentileValue" gives n | "numericFeaturesInfo": { "percentile": [ { "features": [ { "name": "age", "value": 38 }, { "name": "savingsRatio", "value": 0.16 }, { "name": "personalIncome", "value": 4519.5 } ], "percentileValue": 25 }, … ] } | 5 dp for key "value" | In the sample value shown here, the value of age at the 25th percentile is 38, the value of savingsRatio at the 25th percentile is 0.16, and the value of personalIncome at the 25th percentile is 4519.5. |
customerId2Cluster | Array of Dictionary | An array of the customer IDs and their corresponding clusters. Each dictionary contains 2 values: clusterNumber and customerId | "customerId2Cluster": [ { "clusterNumber": 0, "customerId": "273a3980-2458-11e9-b0a0-9d372de0f001" }, { "clusterNumber": 0, "customerId": "273a3980-2458-11e9-b0a0-9d372de0f002" }, …] | ||
featuresForClustering | Dictionary | Contains lists of categorical and numerical features that were used in the clustering. | "featuresForClustering": { "categoricFeatures": [ … ], "numericFeatures": [ … ]} | ||
categoricFeatures | Array | Contains all categorical features used in clustering | "categoricFeatures": [ "race_American", "race_Australian", "race_British", "race_Canadian", "race_Chinese", "race_Indian", "race_Indonesian", "race_New Zealand", "race_Singaporean", "gender_female" ] | ||
numericFeatures | Array | Contains all numerical features used in clustering | "numericFeatures": [ "age", "personalIncome", "savingsRatio" ] | ||
numberOfClusters | Integer | Number of groups that are segmented by the clustering algorithm | 4 | ||
removedFeatures | Dictionary | Contains lists of categorical and numerical features that were removed. | "removedFeatures": { "categoricFeatures": ["gender_male"], "numericFeatures": [] } | ||
dendrogramData | Dictionary | Contains the information needed to plot the dendrogram | |||
scatterPlot | Dictionary | Contains 3 values: component, methodName and plot | |||
component | Array of Dictionary | Contains variance for numerical features for both principal components | "components": [ { "explainVariance": 0.35104, "feature": [ { "name": "age", "value": -0.41682 }, { "name": "personalIncome", "value": 0.73593 }, { "name": "savingsRatio", "value": 0.53354 } ], "name": "PC1" }, { "explainVariance": 0.33639, "feature": [ { "name": "age", "value": -0.65268 }, { "name": "personalIncome", "value": -0.65084 }, { "name": "savingsRatio", "value": 0.38783 } ], "name": "PC2" } ] | | Only shown when methodName is "pca" |
explainVariance | Number | The fraction of variance explained by a principal component is the ratio between the variance of that principal component and the total variance | 0.33639 | 5 | |
name | String | Name of the principal component | "PC1" | | Possible values: "PC1" and "PC2" |
feature | Array of Dictionary | Contains the name of each feature and its importance value in that particular principal component | "feature": [ { "name": "age", "value": -0.65268 }, … ] | 5 dp for key "value" | |
methodName | String | Method used to create the scatter plot | "pca" | | Possible values: "pca", "tsne", and "2d" |
plot | Array of Dictionary | Contains the x and y value for each clusterNumber | "plot": [ { "clusterNumber": 0, "xValue": 0.36019, "yValue": 0.38658 }, …] | ||
xValue | Number | x value coordinate | 0.36019 | 5 | |
yValue | Number | y value coordinate | 0.38658 | 5 |
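
For readers who want to see where values shaped like the `component` and `plot` entries above could come from, here is a minimal PCA sketch using NumPy's SVD. It is an assumption-laden illustration rather than the endpoint's implementation: `pca_components` is a hypothetical name, the input is assumed to be the already scaled numeric features, and the "value" of each feature is taken to be its loading on the principal axis (the sample values above have unit length per component, which is consistent with that reading).

```python
import numpy as np


def pca_components(X, feature_names, n_components=2):
    """Return (components, coordinates) for the first principal components."""
    X = np.asarray(X, dtype=float)
    centered = X - X.mean(axis=0)
    # Rows of vt are the principal axes; the squared singular values are
    # proportional to the variance captured along each axis.
    _, singular_values, vt = np.linalg.svd(centered, full_matrices=False)
    variance_ratio = singular_values**2 / np.sum(singular_values**2)

    components = []
    for i in range(n_components):
        components.append({
            "explainVariance": round(float(variance_ratio[i]), 5),
            "feature": [
                {"name": name, "value": round(float(weight), 5)}
                for name, weight in zip(feature_names, vt[i])
            ],
            "name": f"PC{i + 1}",
        })

    # Project every sample onto the principal axes to get x and y coordinates.
    coordinates = centered @ vt[:n_components].T
    return components, coordinates
```

The first column of `coordinates` would play the role of xValue and the second of yValue. The sign of each principal axis is arbitrary, so a plot may appear mirrored compared with another PCA implementation.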