- This endpoint preprocesses the data and outputs X_train, X_test, y_train and y_test for use in subsequent endpoints.
- It mainly includes two classes:
- Feature-selector: A class that performs feature selection for machine learning or data preprocessing. It provides four methods to identify features for removal: a) find columns with a missing percentage greater than a specified threshold (default = 60%); b) find columns with a single unique value; c) find collinear variables with a correlation greater than a specified correlation coefficient (default = 0.8); d) find columns whose number of values is 95% of the sample size.
- Pre_processor: A class that performs the data split, scaling, LDA and skewness correction for numerical data, and SKBest chi-square selection of categorical features.
Request Body
Name | Datatype | Description | Mandatory | Sample Values | Notes |
---|---|---|---|---|---|
clientId | String | Identifies the client; information is stored under this unique ID | Yes | ||
customers | Array of Dictionary | Array containing each customer’s personal information. The fields differ per project, depending on what we get from the client’s DB | The array is mandatory. In each customer’s dictionary, the only mandatory field is customerId | "nationality", "gender", "age", "race", "platformCountry", "personalIncome", "savingsRatio", "agentId", "customerId" values of each customer. | The set of key-value pairs in each dictionary varies between projects. Input validation for these key-value pairs is defined and set by the customer |
goals | Array of Dictionary | Array containing each customer’s goal information. The fields differ per project, depending on what we get from the client’s DB | The array is mandatory. In each goal’s dictionary, the only mandatory fields are customerId and goalType | "customerId", "goalType", "status", "goalName", "goalValue", "goalStartDate", "goalEndDate", "goalPriority", "initialInvestment", "contributionAmount", "contributionFrequency", "lastContributionDate", "riskProfileId", "modelPortfolioId", "id", "createdAt", "modifiedAt", "createdBy", "modifiedBy". | The set of key-value pairs in each dictionary varies between projects. Input validation for these key-value pairs is defined and set by the customer |
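
As an illustration of the mandatory fields listed in the table, the sketch below builds a small request body and checks it. The helper `validate_request` and every value in `example_request` are hypothetical; the real input validation is project-specific and may be stricter.

```python
def validate_request(body):
    """Return a list of problems; an empty list means the mandatory fields are present."""
    errors = []
    client_id = body.get("clientId")
    if not isinstance(client_id, str) or not client_id:
        errors.append("clientId must be a non-empty string")

    # Mandatory keys per array, as listed in the request table.
    required = {"customers": ("customerId",), "goals": ("customerId", "goalType")}
    for section, keys in required.items():
        items = body.get(section)
        if not isinstance(items, list):
            errors.append(f"{section} must be an array")
            continue
        for index, item in enumerate(items):
            for key in keys:
                if not isinstance(item, dict) or key not in item:
                    errors.append(f"{section}[{index}] is missing {key}")
    return errors


# Hypothetical payload; the optional fields differ per project.
example_request = {
    "clientId": "client-001",
    "customers": [
        {"customerId": "c1", "gender": "male", "age": 30, "personalIncome": 5200.0},
    ],
    "goals": [
        {"customerId": "c1", "goalType": "retirement", "goalValue": 100000},
    ],
}
problems = validate_request(example_request)
```

A payload that omits `customerId` from a customer entry, or leaves out the `goals` array altogether, would come back with one message per missing item.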
Response Body
Name | Datatype | Description | Sample Values | No. of decimals | Notes |
---|---|---|---|---|---|
categoricFeaturesForModel | Array | Array of categorical features included in training | ["gender_male", ] | ||
categoricImportance | Array of dictionaries | Importance of each categorical feature | { "name": "gender_male", "value": 1.38217 } | 5 | |
name | String | Name of the categorical feature | "gender_male" | ||
value | Number | Importance value of the feature | | 5 | |
categorical_feature placeholder | Number | | | 0 | |
collinear | Dictionary | Contains corr_features, corr_values, and drop_features | | 5 | |
corr_features | Dictionary | Features that are highly correlated with drop_features and are kept for training | {"0": "gender_male"} | ||
corr_values | Dictionary | Correlation values | {"0": -1} | ||
drop_features | Dictionary | Features that are highly correlated with corr_features and are dropped | {"0": "gender_female"} | ||
missingFractionOfFeatures | Array of dictionaries | Fraction of missing values for each feature | { "name": "gender", "value": 0 } | ||
name | String | Name of the feature | "gender" | ||
value | Number | Missing fraction of the feature | | 5 | |
feature placeholder | Dictionary | ||||
numericFeaturesForModel | Array | Array of numerical features included in training | numericFeaturesForModel: ["age", "personalIncome"] | ||
numericalImportance | Array of dictionaries | Importance of each numerical feature | { "name": "personalIncome", "value": 6.51383 } | ||
name | String | Name of the feature | |||
value | Number | Importance value of the feature | | 5 | |
numerical_feature placeholder | Number | | | 5 | |
skewAfterTransform | Array of dictionaries | Skewness of each feature after transformation | "skewAfterTransform": [ { "name": "age", "value": 0.04826 },… ] | ||
name | String | Name of the feature | |||
value | Number | Skewness of the feature after transformation | | 5 | |
numerical_feature placeholder | Number | | | 5 | |
skewB4Transform | Array of dictionaries | Skewness of each feature before transformation | "skewB4Transform": [ { "name": "age", "value": 0.58215 }, … ] | 5 | PowerTransformer is applied to features with a skewness > 0.5 or < -0.5 |
name | String | Name of the feature | |||
value | Number | Skewness of the feature before transformation | | 5 | |
numerical_feature placeholder | Number | | | 5 | |
uniqueValueInCustomers | Dictionary | Contains the number of unique values in each feature, grouped under categoricFeatures and numericFeatures | |||
categoricFeatures | Array of dictionaries | Number of unique values in each categorical feature | { "name": "nationality", "value": 1 }, { "name": "race", "value": 1 }, | ||
name | String | Name of the feature in customers | |||
value | Integer | Number of unique values for the feature | | 0 | |
numericFeatures | Number of unique values in each numeric feature | ||||
categoric_features placeholder | Number | | | 0 | |
numericFeatures | Array of dictionaries | Number of unique values in each numeric feature | [ { "name": "age", "value": 82 }, { "name": "personalIncome", "value": 976 } ] | ||
numerical_feature placeholder | Number | | | 0 | |
uniqueValueInGoals | Array of dictionaries | Number of unique goal types | [ { "name": "goalType", "value": 9 } ] | ||
name | String | Name of goal | |||
value | Integer | Number of unique goals | | 0 | |
prediction_key_name placeholder | Number | | | 0 | |
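
The note on skewB4Transform says that PowerTransformer is applied when a feature's skewness is above 0.5 or below -0.5. The sketch below shows how such a check could produce entries in the skewB4Transform shape, rounded to 5 decimals. It is an illustration only: the helper names and sample values are hypothetical, and the skewness formula (biased Fisher-Pearson, m3 / m2^1.5) is an assumption, since the page does not state which estimator the endpoint uses.

```python
from statistics import fmean


def skewness(values):
    """Biased Fisher-Pearson skewness, m3 / m2**1.5 (assumed estimator)."""
    mean = fmean(values)
    m2 = fmean([(v - mean) ** 2 for v in values])
    m3 = fmean([(v - mean) ** 3 for v in values])
    return m3 / m2 ** 1.5 if m2 else 0.0


def skew_report(columns, threshold=0.5, decimals=5):
    """Return (entries shaped like skewB4Transform, names of features needing a transform)."""
    report, to_transform = [], []
    for name, values in columns.items():
        value = skewness(values)
        report.append({"name": name, "value": round(value, decimals)})
        if abs(value) > threshold:  # skewness > 0.5 or < -0.5
            to_transform.append(name)
    return report, to_transform


# Hypothetical numeric columns: one symmetric, one right-skewed.
columns = {
    "age": [1, 2, 3, 4, 5],
    "personalIncome": [1, 1, 1, 2, 10],
}
report, to_transform = skew_report(columns)
```

Here the symmetric `age` column stays below the threshold, while the right-skewed `personalIncome` column is the one that would be passed on for transformation.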