POST https://api-lib.bambu.life/api/customerSegmentation/v2/clusteringAdvancedOptions
This endpoint lets the user select a clustering algorithm and supply its parameters.
- Algorithms that require a "number of clusters" input: K-Means, MiniBatchKMeans, spectral clustering, agglomerative clustering. If the number of clusters is >= 2, clustering is run with the given number of clusters. If the number of clusters is set to 0 or 1, silhouette analysis is run to recommend a number of clusters.
- Algorithms that do not require a "number of clusters" input: affinity propagation, MeanShift, DBSCAN, HDBSCAN.
- It is recommended to explore and visualize the dataset before selecting an algorithm, especially for large datasets.
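The rule above for the number of clusters can be sketched in Python as follows. This is only an illustration of the documented behaviour, not the service's implementation: the helper name `describe_cluster_count` is hypothetical, the algorithm names follow the clusteringAlgorithm values listed in the request body, and the 0 to 10 range comes from the n_clusters rows of the request table.

```python
from typing import Optional

# Algorithms that take n_clusters, and those that do not (names as used by clusteringAlgorithm).
NEEDS_N_CLUSTERS = {"KMeans", "MiniBatchKMeans", "SpectralClustering", "AgglomerativeClustering"}
NO_N_CLUSTERS = {"AffinityPropagation", "MeanShift", "DBSCAN", "HDBSCAN"}


def describe_cluster_count(algorithm: str, n_clusters: Optional[int] = None) -> str:
    """Hypothetical helper: say how the number of clusters will be chosen."""
    if algorithm in NO_N_CLUSTERS:
        return "determined by the algorithm"
    if algorithm not in NEEDS_N_CLUSTERS:
        raise ValueError(f"unknown clusteringAlgorithm: {algorithm}")
    if n_clusters is None or not 0 <= n_clusters <= 10:
        raise ValueError("n_clusters must be an integer between 0 and 10")
    if n_clusters >= 2:
        return f"fixed at {n_clusters} clusters"
    # 0 or 1: the service runs silhouette analysis to recommend a number.
    return "recommended by silhouette analysis"


print(describe_cluster_count("KMeans", 3))
print(describe_cluster_count("SpectralClustering", 1))
print(describe_cluster_count("DBSCAN"))
```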
Request Body
Name | DataType | Description | Mandatory | Sample Value | List of possible values | Notes |
---|---|---|---|---|---|---|
customers | Array of Dictionary | Array containing each customer’s information. The fields differ per project, depending on what is available from the client’s DB. | The array is mandatory. Within each customer’s dictionary, the only mandatory field is customerId | "customers": [ { "nationality": "Indian", "gender": "male", "age": 19, "race": "Canadian", "platformCountry": "IN", "personalIncome": 8806, "savingsRatio": 0.49, "customerId": "273a3980-2458-11e9-b0a0-9d372de0f001" }, ... { "nationality": "Chinese", "gender": "female", "age": 50, "race": "Indian", "platformCountry": "SG", "personalIncome": 4043, "savingsRatio": 0.61, "customerId": "273a3980-2458-11e9-b0a0-9d372de0f120" } ] | The set of key-value pairs in each dictionary varies from project to project. Input validation for these key-value pairs is defined and set by the customer | |
clusteringAlgorithm | String | Select an algorithm for clustering | Y | KMeans | Possible values: "KMeans", "MiniBatchKMeans", "SpectralClustering", "AgglomerativeClustering", "AffinityPropagation", "MeanShift", "DBSCAN", "HDBSCAN" | |
baseFeaturesForCluster | Array | The features the clustering algorithm will use to segment the customers into groups | Y | "baseFeaturesForCluster": ["age", "gender", "personalIncome", "savingsRatio", "platformCountry", "nationality"] | When the array is empty, all features found in the customer input data are used. Each value in the array must match a key found in the customer input data | |
showScatterPlot | Boolean | Whether to show the scatter plot x and y values for every point in each cluster | Y | true | true, false | |
featuresForScatterPlot | Array | The features used to plot the scatter graph | Y | ["age", "personalIncome", "savingsRatio"] | The length of the array must be at least 2 or exactly 0 | Each value in the array must match a key found in the customer input data. When the length of featuresForScatterPlot is exactly 2, the output contains a scatterPlot key-value pair. When the length is > 2, the dimensionalityReductionMethod value is used to reduce the features to a set of principal variables, which are then used to plot the scatter graph. When the length is exactly 0, all numeric features are used in the scatter plot |
dimensionalityReductionMethod | String | Dimensionality reduction is the process of reducing the dimension of the feature set to obtain a set of principal variables | N | "pca" | "pca", "tsne", null | Used only when the length of featuresForScatterPlot is >= 3; otherwise the value is ignored. If dimensionalityReductionMethod is null or is not included in the input payload, "pca" is used |
parameters | Dictionary | Contains the set of parameters for the chosen clusteringAlgorithm, keyed by algorithm name (nameOfAlgo) | Y | "parameters": { "KMeans": {"n_clusters": 3}, "MiniBatchKMeans": {"n_clusters": 3}, "SpectralClustering": { "n_clusters": 3, "affinity": "nearest_neighbors", "assign_labels": "kmeans" }, "AgglomerativeClustering": { "n_clusters": 3, "linkage": "average", "affinity": "manhattan" }, "AffinityPropagation": { "damping": 0.8 }, "MeanShift": {"quantile": 0.7}, "DBSCAN": {"min_samples": 2, "eps": 0.3 }, "HDBSCAN": {"min_samples": 2, "min_cluster_size": 3} } | Possible nameOfAlgo values: "KMeans", "MiniBatchKMeans", "SpectralClustering", "AgglomerativeClustering", "AffinityPropagation", "MeanShift", "DBSCAN", "HDBSCAN" | The nameOfAlgo key in the input must match the clusteringAlgorithm input |
KMeans | Dictionary | Contains the hyperparameters of KMeans | N | "KMeans": {"n_clusters": 3} | ||
n_clusters | Integer | Number of clusters to group the dataset into | Y | "n_clusters": 3 | >=0, <=10 | |
MiniBatchKMeans | Dictionary | Contains the hyperparameters of MiniBatchKMeans | N | "MiniBatchKMeans": {"n_clusters": 2} | ||
n_clusters | Integer | Number of clusters to group the dataset into | Y | "n_clusters": 3 | >=0, <=10 | |
SpectralClustering | Dictionary | Contains the hyperparameters of SpectralClustering | N | "SpectralClustering": { "n_clusters": 1, "affinity": "nearest_neighbors", "assign_labels": "kmeans" } | ||
n_clusters | Integer | Number of clusters to group the dataset into | Y | "n_clusters": 3 | >=0, <=10 | |
assign_labels | String | The strategy used to assign labels in the embedding space | Y | "assign_labels": "kmeans" | "kmeans", "discretize" | |
affinity | String | Method used to construct the affinity matrix | Y | "affinity": "nearest_neighbors" | "nearest_neighbors", "rbf" | |
AgglomerativeClustering | Dictionary | Contains the hyperparameters of AgglomerativeClustering | N | "AgglomerativeClustering": { "n_clusters": 1, "linkage": "average", "affinity": "manhattan" } | ||
n_clusters | Integer | Number of clusters to group the dataset into | Y | "n_clusters": 0 | >=0, <=10 | |
affinity | String | Metric used to compute the linkage | Y | "affinity": "euclidean" | "euclidean", "l1", "l2", "manhattan" | |
linkage | String | The linkage criterion to use | Y | "linkage": "average" | "average", "ward", "complete", "single" (if "ward" is chosen, affinity must be "euclidean") | |
AffinityPropagation | Dictionary | Contains the hyperparameters of AffinityPropagation | N | "AffinityPropagation": {"damping": 0.9} | ||
damping | Float | The extent to which the current value is maintained relative to incoming values (weighted 1 - damping) | Y | "damping": 0.6 | >=0.5, <1 | |
MeanShift | Dictionary | Contains the hyperparameters of MeanShift | N | "MeanShift": {"quantile": 0.2} | ||
quantile | Float | Quantile used to estimate the bandwidth (0.5 means that the median of all pairwise distances is used) | Y | "quantile": 0.2 | >=0, <=1 | |
DBSCAN | Dictionary | Contains the hyperparameters of DBSCAN | N | "DBSCAN": { "min_samples": 1, "eps": 0.3 } | ||
min_samples | Integer | The number of samples (or total weight) in a neighborhood for a point to be considered a core point. This includes the point itself | Y | "min_samples": 10 | >=1 | |
eps | Float | The maximum distance between two samples for one to be considered in the neighborhood of the other | Y | "eps": 0.2 | >0 | |
HDBSCAN | Dictionary | Contains the hyperparameters of HDBSCAN | N | "HDBSCAN": { "min_samples": 10, "min_cluster_size": 5 } | ||
min_samples | Integer | The number of samples in a neighborhood for a point to be considered a core point | Y | "min_samples": 10 | >=1 | |
min_cluster_size | Integer | The minimum size of a cluster | Y | "min_cluster_size": 5 | >=2 | |
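
As an illustration of how the request fields above fit together, here is a minimal Python sketch that assembles a payload and wraps it in a POST request. The helper names `build_payload` and `build_request` are hypothetical, the JSON keys follow the Name column of the table, and the `Content-Type` header is an assumption. Authentication is not described in this section, so it is left out, and the request is built but not sent.

```python
import json
import urllib.request

URL = "https://api-lib.bambu.life/api/customerSegmentation/v2/clusteringAdvancedOptions"


def build_payload(customers, algorithm, algorithm_params, base_features=(),
                  scatter_features=(), show_scatter_plot=True, reduction_method=None):
    """Hypothetical helper: assemble the request body described in the table."""
    # customerId is the only mandatory field in each customer dictionary.
    if any("customerId" not in customer for customer in customers):
        raise ValueError("every customer needs a customerId")
    # featuresForScatterPlot must have exactly 0 or at least 2 entries.
    if len(scatter_features) == 1:
        raise ValueError("featuresForScatterPlot needs 0 or at least 2 entries")
    payload = {
        "customers": list(customers),
        "clusteringAlgorithm": algorithm,
        "baseFeaturesForCluster": list(base_features),
        "showScatterPlot": show_scatter_plot,
        "featuresForScatterPlot": list(scatter_features),
        # The key inside parameters must match clusteringAlgorithm.
        "parameters": {algorithm: dict(algorithm_params)},
    }
    # Optional field: when omitted, the service falls back to "pca".
    if reduction_method is not None:
        payload["dimensionalityReductionMethod"] = reduction_method
    return payload


def build_request(payload):
    """Wrap the payload in a POST request (assumes a JSON body; nothing is sent)."""
    return urllib.request.Request(
        URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


customers = [
    {"customerId": "273a3980-2458-11e9-b0a0-9d372de0f001", "age": 19,
     "personalIncome": 8806, "savingsRatio": 0.49},
    {"customerId": "273a3980-2458-11e9-b0a0-9d372de0f120", "age": 50,
     "personalIncome": 4043, "savingsRatio": 0.61},
]
payload = build_payload(customers, "KMeans", {"n_clusters": 3},
                        scatter_features=["age", "personalIncome", "savingsRatio"])
request = build_request(payload)
# To actually send it: urllib.request.urlopen(request)
```

Keeping payload construction separate from sending makes the validation rules easy to check locally before any call is made.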
Response Body
Name | DataType | Description | Sample Value | No. of decimals | Notes |
---|---|---|---|---|---|
clusteringSummary | Dictionary | Contains 5 values: averageSilhouetteScore, customerId2Cluster, featuresForClustering, numberOfClusters, and removedFeatures | |||
averageSilhouetteScore | Number | Measure of how similar an object is to its own cluster compared to other clusters. The silhouette score ranges from -1 to +1, where a high value indicates that the object is well matched to its own cluster and poorly matched to neighboring clusters | 0.50472 | 5 | |
customerId2Cluster | Array of Dictionary | An array of customer IDs and their corresponding clusters. Each dictionary contains 2 values: clusterNumber and customerId | "customerId2Cluster": [ { "clusterNumber": 4, "customerId": "273a3980-2458-11e9-b0a0-9d372de0f001" }, { "clusterNumber": 2, "customerId": "273a3980-2458-11e9-b0a0-9d372de0f002" }, …] | ||
clusterNumber | Number | The number of the cluster that the customer ID falls into | 4 | 0 | |
customerId | String | Customer ID | "273a3980-2458-11e9-b0a0-9d372de0f001" | ||
featuresForClustering | Dictionary | Contains lists of the categorical and numerical features that were used in the clustering | "featuresForClustering": { "categoricFeatures": [ … ], "numericFeatures": [ … ]} | ||
categoricFeatures | Array | Contains all categorical features used in clustering | [ "gender_female", "platformCountry_AU", "platformCountry_CA", "platformCountry_GB", "platformCountry_ID", "platformCountry_IN", "platformCountry_NZ", "platformCountry_SG", "platformCountry_US", "nationality_Chinese", "nationality_Singaporean" ] | ||
numericFeatures | Array | Contains all numerical features used in clustering | ["personalIncome", "savingsRatio"] | ||
numberOfClusters | Number | Number of groups that are segmented by the clustering algorithm | 9 | 0 | |
removedFeatures | Dictionary | Contains lists of categorical and numerical features that were removed | "removedFeatures": { "categoricFeatures": [ "gender_male", "nationality_American", "nationality_Australian", "nationality_British", "nationality_Canadian", "nationality_Indian", "nationality_Indonesian", "nationality_New Zealand" ], "numericFeatures": [] } | ||
scatterPlot | Dictionary | Contains 3 values: component, methodName and plot | |||
component | Array of Dictionary | Contains the explained variance and the per-feature values for each of the two principal components | "components": [ { "explainVariance": 0.35104, "feature": [ { "name": "age", "value": -0.41682 }, { "name": "personalIncome", "value": 0.73593 }, { "name": "savingsRatio", "value": 0.53354 } ], "name": "PC1" }, { "explainVariance": 0.33639, "feature": [ { "name": "age", "value": -0.65268 }, { "name": "personalIncome", "value": -0.65084 }, { "name": "savingsRatio", "value": 0.38783 } ], "name": "PC2" } ] | | Only shown when methodName is "pca" |
explainVariance | Number | Fraction of the total variance explained by the principal component, i.e. the ratio between the variance of that principal component and the total variance | 0.33639 | 5 | |
name | String | Name of the principal component | "PC1" | | Possible values: "PC1" and "PC2" |
feature | Array of Dictionary | Contains the name of each feature and its importance value in the particular principal component | "feature": [ { "name": "age", "value": -0.65268 }, … ] | 5 dp for value | |
methodName | String | Method used to create the scatter plot | "pca" | | Possible values: "pca", "tsne", and "2d" |
plot | Array of Dictionary | Contains the x and y values of each point, together with its clusterNumber | "plot": [ { "clusterNumber": 6, "xValue": 0.36019, "yValue": 0.38658 }, …] | ||
xValue | Number | x coordinate of the point | 0.36019 | 5 | |
yValue | Number | y coordinate of the point | 0.38658 | 5 | |
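
To show how the response fields above might be consumed, the following Python sketch groups customer IDs and plot points by cluster. The helper names are hypothetical. The sample response is abridged and assembled from the sample values in the table, and it assumes that customerId2Cluster sits inside clusteringSummary (as the clusteringSummary row states) and that scatterPlot is a top-level key, which this section does not state explicitly.

```python
from collections import defaultdict


def customers_by_cluster(response):
    """Hypothetical helper: map each clusterNumber to its list of customer IDs."""
    groups = defaultdict(list)
    for entry in response["clusteringSummary"]["customerId2Cluster"]:
        groups[entry["clusterNumber"]].append(entry["customerId"])
    return dict(groups)


def plot_points_by_cluster(response):
    """Hypothetical helper: map each clusterNumber to its (x, y) plot points."""
    scatter = response.get("scatterPlot")
    if not scatter:
        # No scatterPlot block in the response.
        return {}
    points = defaultdict(list)
    for point in scatter["plot"]:
        points[point["clusterNumber"]].append((point["xValue"], point["yValue"]))
    return dict(points)


# Abridged sample response built from the sample values in the table.
sample_response = {
    "clusteringSummary": {
        "averageSilhouetteScore": 0.50472,
        "customerId2Cluster": [
            {"clusterNumber": 4, "customerId": "273a3980-2458-11e9-b0a0-9d372de0f001"},
            {"clusterNumber": 2, "customerId": "273a3980-2458-11e9-b0a0-9d372de0f002"},
        ],
        "numberOfClusters": 9,
    },
    "scatterPlot": {
        "methodName": "pca",
        "plot": [{"clusterNumber": 6, "xValue": 0.36019, "yValue": 0.38658}],
    },
}

groups = customers_by_cluster(sample_response)
points = plot_points_by_cluster(sample_response)
print(groups)
print(points)
```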