{ "questions":[ { "question":"Alice is examining the probability mass function charts of multiple categorical variables for her model. Which of these mass functions exhibits the greatest entropy?", "role":"DATA_SCIENTIST", "difficulty":"BEGINNER", "element":"BASICS", "topic":"ALGEBRA", "question_family":"GENERIC", "choices":{ "A":"", "B":"", "C":"", "D":"", "E":"" }, "correct_answer":"B", "chatgpt_question_part_one":"Alice is examining the probability ma...", "chatgpt_question_part_two":"Alice is examining the probability ma...", "answer_type":"MULTIPLE_CHOICE", "is_answer_correct":false }, { "question":"An Application Scorecard is built to measure the probability of default of a consumer loan when the consumer applies for the loan. A Data Scientist uses the past application data and outcomes to build the scorecard. A binary logistic regression algorithm is used to train and validate the model. After the scorecard is built, a cut-off score should be determined to produce a threshold score for accepting or rejecting applications. Each cut-off gives a different misclassification matrix.\n\nAssume that the average loss-given-default per loan is USD 4,000 and the average profit per successfully closed loan is USD 1,000. Which of the following misclassification matrices, as measured on a test set, gives the maximum profit and hence the optimal cut-off? 
\n(Notes: 1) Values in the matrix correspond to probabilities, 2) Good: successful closure, Bad: defaulted).", "role":"DATA_SCIENTIST", "difficulty":"BEGINNER", "element":"BASICS", "topic":"ALGEBRA", "question_family":"GENERIC", "choices":{ "A":"| | Predicted(Good) | Predicted(Bad) |\n| :-------- |:-------------: |:------------: |\n| **Actual(Good)**| 0.45 | 0.05|\n| **Actual(Bad)**| 0.15 | 0.35|", "B":"| | Predicted(Good) | Predicted(Bad) |\n| :-------- |:-------------: |:------------: |\n| **Actual(Good)**| 0.425 | 0.075|\n| **Actual(Bad)**| 0.125 | 0.375|", "C":"| | Predicted(Good) | Predicted(Bad) |\n| :-------- |:-------------: |:------------: |\n| **Actual(Good)**| 0.35 | 0.15|\n| **Actual(Bad)**| 0.075 | 0.425|", "D":"| | Predicted(Good) | Predicted(Bad) |\n| :-------- |:-------------: |:------------: |\n| **Actual(Good)**| 0.4 | 0.1|\n| **Actual(Bad)**| 0.1 | 0.4|", "E":"| | Predicted(Good) | Predicted(Bad) |\n| :-------- |:-------------: |:------------: |\n| **Actual(Good)**| 0.3 | 0.2|\n| **Actual(Bad)**| 0.05 | 0.45|" }, "correct_answer":"E", "chatgpt_question_part_one":"An Application Scorecard is built to ...", "chatgpt_question_part_two":"An Application Scorecard is built to ...", "answer_type":"MULTIPLE_CHOICE", "is_answer_correct":false }, { "question":"Assume that the sigmoid function is used as the activation function at a neural network node applied to an m-dimensional input vector to produce the activation $z$, i.e.\n\n$\\begin{array}{cl}\nz &=\\sigma({\\mathbf w} \\cdot {\\mathbf x}) = \\sigma(w_0 + \\sum_{i=1}^{m} w_i x_i) \\\\\n\\sigma(x) &= \\frac{1}{1 + e^{-x}}\n\\end{array}$\n\nWhat is the error gradient for the weight $w_j, j > 0$?\n(Note: $\\cal L$ below represents the loss function)", "role":"DATA_SCIENTIST", "difficulty":"BEGINNER", "element":"CALCULUS", "topic":"DERIVATIVES", "question_family":"GENERIC", "choices":{ "A":"$z(1-z)x_j \\frac{\\partial {\\cal L}}{\\partial z}$", "B":"$z x_j \\frac{\\partial {\\cal L}}{\\partial z}$", 
"C":"$\\frac{e^{-w_jx_j}}{{(1 + e^{-{\\mathbf w} \\cdot {\\mathbf x}})}^2} \\frac{\\partial {\\cal L}}{\\partial z}$", "D":"$\\frac{e^{-{\\mathbf w} \\cdot {\\mathbf x}}}{{(1 + e^{-{\\mathbf w} \\cdot {\\mathbf x}})}^2} \\frac{\\partial {\\cal L}}{\\partial z}$", "E":"$z^2 x_j \\frac{\\partial {\\cal L}}{\\partial z}$" }, "correct_answer":"A", "chatgpt_question_part_one":"Assume that the sigmoid function is u...", "chatgpt_question_part_two":"Assume that the sigmoid function is u...", "answer_type":"MULTIPLE_CHOICE", "is_answer_correct":false }, { "question":"What is the variance of the random variable $U(-1,1)$, i.e. the uniform random variable over the interval $[-1, 1]$?", "role":"DATA_SCIENTIST", "difficulty":"BEGINNER", "element":"CALCULUS", "topic":"INTEGRALS", "question_family":"GENERIC", "choices":{ "A":"1/12", "B":"1/3", "C":"1/4", "D":"1/2", "E":"1" }, "correct_answer":"B", "chatgpt_question_part_one":"What is the variance of the random va...", "chatgpt_question_part_two":"What is the variance of the random va...", "answer_type":"MULTIPLE_CHOICE", "is_answer_correct":false }, { "question":"In a multivariate linear regression problem with independent variables $\\{X_1, X_2\\}$, and the dependent variable $Y$, the sample variance-covariance matrix is given as follows:\n\n$\\Sigma = \\left[\n\\begin{array}{ccc}\n1.0 & -0.5 & 0.4 \\\\\n-0.5 & 1.0 & 0.2 \\\\\n0.4 & 0.2 & 1.0 \\\\\n\\end{array} \\right]$\n\nWhere $\\Sigma_{ij} = Cov(X_i, X_j)$ for $i, j \\in \\{1, 2\\}$\nInputs and target are normalized to have zero - mean and unit - variance.\n\nThe linear specification assumes:\n$Y \\equiv \\beta_0 + \\beta_1 X_1 + \\beta_2 X_2 + \\epsilon, \\space \\space \\epsilon \\sim {\\cal{N}}(0,\\sigma^2)$\n\nWhat is the ordinary least squares(OLS) estimate for the coefficient vector\n$\\hat{\\beta} = \\{\\hat{\\beta_0}, \\hat{\\beta_1}, \\hat{\\beta_2}\\}$?\n", "role":"DATA_SCIENTIST", "difficulty":"BEGINNER", "element":"LINEAR_ALGEBRA", "topic":"GAUSS_JORDAN", 
"question_family":"GENERIC", "choices":{ "A":"$\\hat{\\beta} = \\{0.0, 4/3, 5/9\\}$", "B":"$\\hat{\\beta} = \\{0.0, 4/9, 2/3\\}$", "C":"$\\hat{\\beta} = \\{1.0, 2/7, 8/15\\}$", "D":"$\\hat{\\beta} = \\{0.0, 2/3, 8/15\\}$", "E":"$\\hat{\\beta} = \\{1.0, 2/3, 4/9\\}$" }, "correct_answer":"D", "chatgpt_question_part_one":"In a multivariate linear regression p...", "chatgpt_question_part_two":"In a multivariate linear regression p...", "answer_type":"MULTIPLE_CHOICE", "is_answer_correct":false }, { "question":"Consider the 50 sample points randomly created by the equation:\n\n$Y = X + \\epsilon$\n$X \\sim U(-1, 1)$\n$\\epsilon \\sim {\\cal N}(0, 0.09)$\n\nWhich of the following scatter plots correspond to this sample set after the linear transformation represented by the matrix $A$ below is applied to it?\nA = $\\begin{bmatrix}\n0.5 & -\\frac{\\sqrt{3}}{2} \\\\ \n\\frac{\\sqrt{3}}{2} & 0.5 \\\\ \n\\end{bmatrix}$\n", "role":"DATA_SCIENTIST", "difficulty":"BEGINNER", "element":"LINEAR_ALGEBRA", "topic":"MATRIX_COMPUTATIONS", "question_family":"GENERIC", "choices":{ "A":"", "B":"", "C":"", "D":"", "E":"" }, "correct_answer":"C", "chatgpt_question_part_one":"Consider the 50 sample points randoml...", "chatgpt_question_part_two":"Consider the 50 sample points randoml...", "answer_type":"MULTIPLE_CHOICE", "is_answer_correct":false }, { "question":"Which of the following is NOT an example of a greedy algorithm?", "role":"DATA_SCIENTIST", "difficulty":"BEGINNER", "element":"DATA_AND_ALGORITHMS", "topic":"ALGO_FUNDAMENTALS", "question_family":"GENERIC", "choices":{ "A":"Finding the best input feature for splitting a node in a classification and regression tree.", "B":"Finding the values of weights in an artificial neural network by using stochastic gradient descent.", "C":"Using the nearest neighbor algorithm for solving the travelling salesman problem.", "D":"Using the branch and bound algorithm to find the subset of variables that gives the minimum validation loss in a binary 
logistic regression problem.", "E":"Using the stepwise forward selection algorithm to select the features in a multivariate linear regression problem." }, "correct_answer":"D", "chatgpt_question_part_one":"Which of the following is NOT an exam...", "chatgpt_question_part_two":"Which of the following is NOT an exam...", "answer_type":"MULTIPLE_CHOICE", "is_answer_correct":false }, { "question":"Alice has written the following two functions for computing the Fibonacci sequence:\n\n$\\texttt{def fibonacci\\_1(n: int):}$\n$\\texttt{\\hskip{0.22in} if n == 0:}$\n$\\texttt{\\hskip{0.44in} return 0}$\n$\\texttt{\\hskip{0.22in} elif n == 1:}$\n$\\texttt{\\hskip{0.44in} return 1}$\n$\\texttt{\\hskip{0.22in} else:}$\n$\\texttt{\\hskip{0.44in} return fibonacci\\_1(n - 1) + fibonacci\\_1(n - 2)}$\n\n$\\texttt{def fibonacci\\_2(n: int):}$\n$\\texttt{\\hskip{0.22in} if n == 0:}$\n$\\texttt{\\hskip{0.44in} return 0}$\n$\\texttt{\\hskip{0.22in} elif n == 1:}$\n$\\texttt{\\hskip{0.44in} return 1}$\n$\\texttt{\\hskip{0.22in} else:}$\n$\\texttt{\\hskip{0.44in} prev = 0}$\n$\\texttt{\\hskip{0.44in} curr = 1}$\n$\\texttt{\\hskip{0.44in} output = 0}$\n$\\texttt{\\hskip{0.44in} for i in range(1, n):}$\n$\\texttt{\\hskip{0.66in} output = curr + prev}$\n$\\texttt{\\hskip{0.66in} prev = curr}$\n$\\texttt{\\hskip{0.66in} curr = output}$\n$\\texttt{\\hskip{0.44in} return output}$\n\nWhich of the following statements is true regarding these two functions?", "role":"DATA_SCIENTIST", "difficulty":"BEGINNER", "element":"DATA_AND_ALGORITHMS", "topic":"ALGO_NUMERICS", "question_family":"GENERIC", "choices":{ "A":"fibonacci_1(.) function is slower than fibonacci_2(.)", "B":"fibonacci_2(9) = 21", "C":"fibonacci_1(.) function is faster than fibonacci_2(.) 
since it uses recursion", "D":"The space (memory) complexity of both algorithms is the same", "E":"The time complexity of both algorithms is the same" }, "correct_answer":"A", "chatgpt_question_part_one":"Alice has written the following two f...", "chatgpt_question_part_two":"Alice has written the following two f...", "answer_type":"MULTIPLE_CHOICE", "is_answer_correct":false }, { "question":"In a clinical study to evaluate the effectiveness of a breast cancer screening test, the following data was collected: Out of every 1,000 women, 10 have breast cancer. Among these 10 women with breast cancer, 9 receive a positive test result. Out of the 990 women who do not have breast cancer, 89 falsely receive a positive test result. What is the likelihood that a woman actually has breast cancer if she tests positive?", "role":"DATA_SCIENTIST", "difficulty":"BEGINNER", "element":"PROBABILITY", "topic":"INDEPENDENCE", "question_family":"GENERIC", "choices":{ "A":"9/10", "B":"9/98", "C":"9/99", "D":"98/1000", "E":"10/1000" }, "correct_answer":"B", "chatgpt_question_part_one":"In a clinical study to evaluate the e...", "chatgpt_question_part_two":"In a clinical study to evaluate the e...", "answer_type":"MULTIPLE_CHOICE", "is_answer_correct":false }, { "question":"Which of the following is not a valid cumulative distribution function of a continuous random variable? ", "role":"DATA_SCIENTIST", "difficulty":"BEGINNER", "element":"PROBABILITY", "topic":"RANDOM_VARIABLES", "question_family":"GENERIC", "choices":{ "A":"$\\frac{1}{1 + e^{-x}}$", "B":"$0.5\\tanh(x) + 0.5$", "C":"$\\min(\\max(0,x),1)$", "D":"$\\frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}$", "E":"$\\min(e^{x},1)$" }, "correct_answer":"D", "chatgpt_question_part_one":"Which of the following is not a valid...", "chatgpt_question_part_two":"Which of the following is not a valid...", "answer_type":"MULTIPLE_CHOICE", "is_answer_correct":false }, { "question":"Bob is doing exploratory data analysis before building a fraud detection model. 
The following table shows the distribution of positives and negatives in each age group:\n\n| Age group | Positives | Negatives | \n|:------------- |:------------: |:------------: |\n| Unknown | 25 | 200 |\n| $\\leq 25$ | 10 | 800 |\n| $[25, 35)$ | 40 | 1625|\n| $[35, 45)$ | 50 | 1250 |\n| $[45, 55)$ | 25 | 1600 |\n| $\\geq 55$ | 5 | 550 |\n\nWhich of the following statements is incorrect?", "role":"DATA_SCIENTIST", "difficulty":"BEGINNER", "element":"DESCRIPTIVE_STATISTICS", "topic":"ASSOCIATION", "question_family":"GENERIC", "choices":{ "A":"The fraud event is statistically dependent on the age variable.", "B":"The probability of fraud in the Unknown category is roughly half of the probability of fraud in $[25, 35)$ range", "C":"The statistical dependency between the fraud event and the age variable is nonlinear.", "D":"The probability of fraud in $[25, 35)$ range is roughly twice the probability of fraud in $\\leq 25$ range", "E":"The probability of fraud in $[35, 45)$ range is larger than the probability of the fraud in the whole population." 
}, "correct_answer":"B", "chatgpt_question_part_one":"Bob is doing exploratory data analysi...", "chatgpt_question_part_two":"Bob is doing exploratory data analysi...", "answer_type":"MULTIPLE_CHOICE", "is_answer_correct":false }, { "question":"The following scatter plot illustrates the joint distribution of the age and monthly income variables:\n\n![img]()\nWhich of the following describes the statistical relationship between the two variables?", "role":"DATA_SCIENTIST", "difficulty":"BEGINNER", "element":"DESCRIPTIVE_STATISTICS", "topic":"ASSOCIATION", "question_family":"GENERIC", "choices":{ "A":"No statistical dependency", "B":"Linear and strong", "C":"Linear and weak", "D":"Non-linear and strong", "E":"Non-linear and weak" }, "correct_answer":"D", "chatgpt_question_part_one":"The following scatter plot illustrate...", "chatgpt_question_part_two":"The following scatter plot illustrate...", "answer_type":"MULTIPLE_CHOICE", "is_answer_correct":false }, { "question":"Alice is running a Champion/Challenger test of two models she has built for an e-mail campaign. 120 out of 5000 customers have responded positively to the campaign based on the champion model, and 150 out of 6000 customers have responded positively to the campaign based on the challenger model. Let $p_1$ be the true proportion of customers who would respond positively to the campaign driven by the champion model, and $p_2$ be the true proportion of customers who would respond positively to the campaign driven by the challenger model. 
To decide whether to replace the champion model with the challenger model, how should Alice state the null and alternative hypotheses for testing?", "role":"DATA_SCIENTIST", "difficulty":"BEGINNER", "element":"INFERENTIAL_STATISTICS", "topic":"HYPOTHESIS_TESTING", "question_family":"GENERIC", "choices":{ "A":"$H_0: p_1 \\leq p_2; H_a: p_1 \\neq p_2$", "B":"$H_0: p_1 \\geq p_2; H_a: p_1 < p_2$", "C":"$H_0: p_1 \\leq p_2; H_a: p_1 > p_2$", "D":"$H_0: p_1 = p_2; H_a: p_1 \\neq p_2$", "E":"$H_0: p_1 = p_2; H_a: p_1 < p_2$" }, "correct_answer":"B", "chatgpt_question_part_one":"Alice is doing a Champion/Challenger ...", "chatgpt_question_part_two":"Alice is doing a Champion/Challenger ...", "answer_type":"MULTIPLE_CHOICE", "is_answer_correct":false }, { "question":"Alice has income and age variables in her dataset of 100 samples. Both variables are scale variables. She is trying to understand the bivariate dependency between income and age. She does the following computations: \n1. She first runs a simple linear regression of income on age and finds that the p-value of age is 0.30 and the R-Squared is 0.05\n2. In the second run, she applies a power transformation to age and regresses income on this transformed variable. 
P-value for the transformed age is 0.01, and R-Squared is 0.92\n\nWhich of the following could be inferred?", "role":"DATA_SCIENTIST", "difficulty":"BEGINNER", "element":"INFERENTIAL_STATISTICS", "topic":"REGRESSION", "question_family":"GENERIC", "choices":{ "A":"The dependency between income and age is linear", "B":"The differences are due to sampling", "C":"An inference could not be made about the bivariate dependency of income on age", "D":"The dependency between income and age is non-linear", "E":"R-squared is not a proper measure for measuring the dependency" }, "correct_answer":"D", "chatgpt_question_part_one":"Alice has income and age variables in...", "chatgpt_question_part_two":"Alice has income and age variables in...", "answer_type":"MULTIPLE_CHOICE", "is_answer_correct":false }, { "question":"Which of the following problems could not be solved without explicitly labelled data?", "role":"DATA_SCIENTIST", "difficulty":"BEGINNER", "element":"UNSUPERVISED_LEARNING", "topic":"CLUSTERING", "question_family":"GENERIC", "choices":{ "A":"Classifying an object in an image.", "B":"Detecting anomalies in a time-series data.", "C":"Predicting the next word in a sentence.", "D":"Recommending a product to an online customer in an e-commerce site.", "E":"Predicting the next frame in a video." 
}, "correct_answer":"A", "chatgpt_question_part_one":"Which of the following problems could...", "chatgpt_question_part_two":"Which of the following problems could...", "answer_type":"MULTIPLE_CHOICE", "is_answer_correct":false }, { "question":"A term frequency-inverse document frequency matrix (tf-idf matrix) is built using unigrams and bigrams from the following three documents with the respective texts: \n\n* Document-1: \"Machine Learning is great\"\n* Document-2: \"Deep Learning is a subfield of Machine Learning\"\n* Document-3: \"CART is a Machine Learning algorithm\"\n\nWhat are the dimensions of the resulting tf-idf matrix?", "role":"DATA_SCIENTIST", "difficulty":"BEGINNER", "element":"UNSUPERVISED_LEARNING", "topic":"DIMENSIONALITY_REDUCTION", "question_family":"GENERIC", "choices":{ "A":"(3, 10)", "B":"(10, 11)", "C":"(3, 21)", "D":"(3, 11)", "E":"(18, 24)" }, "correct_answer":"C", "chatgpt_question_part_one":"A term frequency-inverse document fre...", "chatgpt_question_part_two":"A term frequency-inverse document fre...", "answer_type":"MULTIPLE_CHOICE", "is_answer_correct":false }, { "question":"A binary classification model is built to predict the churn risk of post-paid subscribers at a large telco firm. After the model is applied to a test dataset, the fitted probability density functions of scores of churners and stayers are as follows: \n\n$f_{Stayers}(x) = e^{-x}, x \\geq 0$\n$f_{Churners}(x) = 4 e^{-4 (2 - x)}, x \\leq 2$\n\nThe odds ratio of stayers to churners is given as $4:1$.\nWhat is the decision boundary that minimizes the misclassification rate? 
(Note: log refers to the natural logarithm).", "role":"DATA_SCIENTIST", "difficulty":"BEGINNER", "element":"SUPERVISED_LEARNING", "topic":"DECISION_THEORY", "question_family":"GENERIC", "choices":{ "A":"$x > 2$", "B":"$x > 1.6$", "C":"$x > \\log{(8/5)}$", "D":"$x > 1$", "E":"$x > \\log{4}$" }, "correct_answer":"B", "chatgpt_question_part_one":"A binary classification model is buil...", "chatgpt_question_part_two":"A binary classification model is buil...", "answer_type":"MULTIPLE_CHOICE", "is_answer_correct":false }, { "question":"Bob has built a multivariate linear regression model for predicting a continuous variable $Y$. The results are shown below:\n| Variable | Coefficient | Std Dev | T | VIF |\n| :---: | :---: | :---: | :---: | :---: |\n| CONSTANT | -3.616924 | 0.015572 | 7.753867 | - |\n|$X_1$ | 0.704137 | 0.004011 | 175.547107 | 2.362389|\n|$X_2$ | 0.001083 | 0.009686 | 0.111851 | 5.612303|\n|$X_3$ | 0.060543 | 0.01016 | 5.959245 | 6.951038|\n|$X_4$ | 0.177235 | 0.015572 | 11.381624 | 10.492878|\n|$X_5$ | 0.016725 | 0.024115 | 0.693554 | 21.290732|\n|$X_6$ | 0.11419 | 0.011562 | 9.876636 | 5.440218|\n|$X_7$ | 0.182835 | 0.020648 | 8.854888 | 2.452144|\n\nStandard error of regression is 0.3223, and the regression $R^2$ is 0.72.\nWhich of the following statements could not be inferred from these results?", "role":"DATA_SCIENTIST", "difficulty":"BEGINNER", "element":"SUPERVISED_LEARNING", "topic":"LINEAR_MODELS", "question_family":"GENERIC", "choices":{ "A":"For some input variables, it is not possible to reject the hypothesis that they have no impact on the output.", "B":"For certain input variables, the $R^2$ value calculated from a regression of that variable against the others is above 0.9.", "C":"It is predicted that the output increases by 0.704137 units for each one unit increase in $X_1$ on average.", "D":"72% of the variability in the output is accounted for by the variables listed in the table.", "E":"The positive values of the coefficients are due 
to positive correlations between each input variable and the output." }, "correct_answer":"E", "chatgpt_question_part_one":"Bob has built a multivariate linear r...", "chatgpt_question_part_two":"Bob has built a multivariate linear r...", "answer_type":"MULTIPLE_CHOICE", "is_answer_correct":false }, { "question":"Which of the following statements is correct for a binary classification model whose area under the ROC (Receiver Operating Characteristic) curve is 0.85?", "role":"DATA_SCIENTIST", "difficulty":"BEGINNER", "element":"SUPERVISED_LEARNING", "topic":"MODEL_ASSESSMENT", "question_family":"GENERIC", "choices":{ "A":"The probability that an actual positive case will be predicted as positive is 0.85.", "B":"The probability that a randomly chosen positive case has a higher model score than the score of a randomly chosen negative case is 0.85.", "C":"The probability that an actual positive case will be predicted as positive OR an actual negative case will be predicted as negative is 0.85.", "D":"The probability that an actual negative case will be predicted as negative is 0.85.", "E":"The misclassification rate is 0.85." 
}, "correct_answer":"B", "chatgpt_question_part_one":"Which of the following statements is ...", "chatgpt_question_part_two":"Which of the following statements is ...", "answer_type":"MULTIPLE_CHOICE", "is_answer_correct":false }, { "question":"Which of the following is not affected by pruning a classification and regression tree?\n", "role":"DATA_SCIENTIST", "difficulty":"BEGINNER", "element":"SUPERVISED_LEARNING", "topic":"TREE_ALGORITHMS", "question_family":"GENERIC", "choices":{ "A":"Number of leaf nodes", "B":"Misclassification rate", "C":"Maximum depth of the tree", "D":"Scores produced on the test set", "E":"Split variables at each node" }, "correct_answer":"E", "chatgpt_question_part_one":"Which of the following is not affecte...", "chatgpt_question_part_two":"Which of the following is not affecte...", "answer_type":"MULTIPLE_CHOICE", "is_answer_correct":false }, { "question":"Which of the following techniques could be used to extract useful features from zip-code data in a supervised manner?\nI. Integer encoding\nII. Classification and Regression Trees\nIII. Embedding in an artificial neural network\nIV. Word2Vec\nV. Hidden Markov Modelling", "role":"DATA_SCIENTIST", "difficulty":"BEGINNER", "element":"DATA_PREPARATION", "topic":"FEATURE_ENGINEERING", "question_family":"GENERIC", "choices":{ "A":"II and IV", "B":"I, IV and V", "C":"II and III", "D":"II and IV", "E":"III and V" }, "correct_answer":"C", "chatgpt_question_part_one":"Which of the following techniques cou...", "chatgpt_question_part_two":"Which of the following techniques cou...", "answer_type":"MULTIPLE_CHOICE", "is_answer_correct":false }, { "question":"Bob is building a fraud model to detect fraudulent credit applications at a bank. He has created more than 3,000 variables sourced from the bank's data warehouse. He does not have any constraints on the computational power to be employed, and the model selection criterion is based only on the model performance on a pre-determined test set. 
Which of the following feature selection methods represents the best solution?\n", "role":"DATA_SCIENTIST", "difficulty":"BEGINNER", "element":"DATA_PREPARATION", "topic":"FEATURE_SELECTION", "question_family":"GENERIC", "choices":{ "A":"Use information gain to compute the importance of each variable and select the top 100 most important variables as the inputs to a supervised learning algorithm in the modelling stage.", "B":"Use ROC value to compute the importance of each variable and select the top 100 most important variables as the inputs to a supervised learning algorithm in the modelling stage.", "C":"Use a binary logistic regression algorithm and forward selection method to build a model. Use the variables that entered the model as the inputs to a supervised learning algorithm in the modelling stage.", "D":"Use a binary logistic regression algorithm and $L_1$ regularization to build the model. Use the variables that entered the model as the inputs to a supervised learning algorithm in the modelling stage.", "E":"Do not perform any feature selection and use a Gradient Boosting Machines algorithm to build the model. " }, "correct_answer":"E", "chatgpt_question_part_one":"Bob is building a fraud model to dete...", "chatgpt_question_part_two":"Bob is building a fraud model to dete...", "answer_type":"MULTIPLE_CHOICE", "is_answer_correct":false }, { "question":"Alice is working on a regression problem in order to predict a target variable in terms of a set of input variables. 
She has calculated the standard deviation of all variables:\n\n| Variable name | Standard Deviation |\n| :--- | :---: |\n| Target variable | 367,362 |\n| Input-1 | 0.93 |\n| Input-2 | 0.77 |\n| Input-3 | 918.44 |\n| Input-4 | 41,420.51 |\n| Input-5 | 0.54 |\n| Input-6 | 0.09 |\n| Input-7 | 0.0 |\n| Input-8 | 0.0 |\n| Input-9 | 0.0 |\n| Input-10 | 0.0 |\n| Input-11 | 442.58 |\n| Input-12 | 29.37 |\n| Input-13 | 401.68 |\n| Input-14 | 53.51 |\nWhich of the following statements is true based on this information?", "role":"DATA_SCIENTIST", "difficulty":"BEGINNER", "element":"EDA", "topic":"UNIVARIATE_NUMERICAL", "question_family":"GENERIC", "choices":{ "A":"At least 10 input variables have no correlation with the target variable", "B":"At least 8 input variables have no correlation with the target variable", "C":"At least 6 input variables have no correlation with the target variable", "D":"At least 4 input variables have no correlation with the target variable", "E":"None of the variables has zero correlation with the target variable" }, "correct_answer":"D", "chatgpt_question_part_one":"Alice is working on a regression prob...", "chatgpt_question_part_two":"Alice is working on a regression prob...", "answer_type":"MULTIPLE_CHOICE", "is_answer_correct":false }, { "question":"Alice is examining the correlations between total sales and various independent variables such as current day's weather, day of the week, and daily marketing expenditures. 
She has obtained the following insights from her exploratory data analysis:\n\n| Statistics | Value |\n| :--- | :---: |\n| Number of variables | 8 |\n| Number of observations | 293 |\n| Number of missing cells | 66 |\n| Number of duplicate rows | 10 |\n\n\n| Variable Type | Number of Variables |\n| :--- | :--- |\n| Numerical | 6 |\n| Categorical | 2 |\n\nAlice is conducting her analysis using Python and intends to perform linear regression following her exploratory data analysis (EDA).\nWhich of the following statements is accurate?", "role":"DATA_SCIENTIST", "difficulty":"BEGINNER", "element":"EDA", "topic":"MULTIVARIATE_NUMERICAL", "question_family":"GENERIC", "choices":{ "A":"The data might be ready to use directly for linear regression.", "B":"Once duplicate rows are removed, the data is prepared for regression.", "C":"Missing values need to be imputed or rows containing missing values should be eliminated.", "D":"Both duplicate rows should be removed and missing values should either be imputed or the rows with missing values should be dropped.", "E":"None of the above" }, "correct_answer":"D", "chatgpt_question_part_one":"Alice is examining the correlations b...", "chatgpt_question_part_two":"Alice is examining the correlations b...", "answer_type":"MULTIPLE_CHOICE", "is_answer_correct":false }, { "question":"Consider the \"census\" table shown in the input data grid. What is the maximum value of the average age per country, rounded to two decimal places?\n\n(You can use the SQL Editor provided to solve the question). ", "role":"DATA_SCIENTIST", "difficulty":"BEGINNER", "element":"SQL", "topic":"AGGREGATIONS", "question_family":"SQL", "choices":{ }, "correct_answer":"49.38", "chatgpt_question_part_one":"Consider the \"census\" table shown in ...", "chatgpt_question_part_two":"Consider the \"census\" table shown in ...", "answer_type":"RETURN_VALUE", "is_answer_correct":false }, { "question":"Consider the \"census\" table shown in the input data grid. 
Its columns and their measurement types are given below:\n1. RECORD_ID: Numerical\n2. AGE: Numerical\n3. WORKCLASS: Categorical\n4. FNLWGHT: Numerical\n5. EDUCATION: Categorical\n6. EDUCATION_NUM: Numerical\n7. MARITAL_STATUS: Categorical\n8. OCCUPATION: Categorical\n9. RELATIONSHIP: Categorical\n10. RACE: Categorical\n11. SEX: Categorical\n12. CAPITAL_GAIN: Numerical\n13. CAPITAL_LOSS: Numerical\n14. HOURS_PER_WEEK: Numerical\n15. COUNTRY: Categorical\n16. PROXY: Categorical\n17. TARGET: Numerical\n\nWhat is the average capital gain for records whose sex is equal to \"Female\" and native country is either Cuba or Portugal? Give the answer rounded to a single decimal place.\n\n(You can use the SQL Editor provided to solve the question). ", "role":"DATA_SCIENTIST", "difficulty":"BEGINNER", "element":"SQL", "topic":"AGGREGATIONS", "question_family":"SQL", "choices":{ }, "correct_answer":"142.6", "chatgpt_question_part_one":"Consider the \"census\" table shown in ...", "chatgpt_question_part_two":"Consider the \"census\" table shown in ...", "answer_type":"RETURN_VALUE", "is_answer_correct":false }, { "question":"Alice has developed multiple binary classification models using various supervised learning algorithms. She has formatted each model as an SQL statement for scoring data directly within the database to enhance performance. The final table to be scored includes 30 million records. 
Which of these algorithms results in the model with the highest runtime cost?", "role":"DATA_SCIENTIST", "difficulty":"BEGINNER", "element":"SQL", "topic":"JOINS", "question_family":"GENERIC", "choices":{ "A":"Binary Logistic Regression", "B":"Gradient Boosting Machines", "C":"Feedforward Neural Networks", "D":"Classification and Regression Trees", "E":"k-Nearest Neighbors (k-NN)" }, "correct_answer":"E", "chatgpt_question_part_one":"Alice has developed multiple binary c...", "chatgpt_question_part_two":"Alice has developed multiple binary c...", "answer_type":"MULTIPLE_CHOICE", "is_answer_correct":false }, { "question":"Which type of database would you primarily use to store and manage data for an online bookstore's inventory with a fixed set of attributes?", "role":"DATA_SCIENTIST", "difficulty":"BEGINNER", "element":"DATABASES", "topic":"DATABASE_FEATURES", "question_family":"GENERIC", "choices":{ "A":"Document-based NoSQL", "B":"SQL", "C":"Key-Value Store", "D":"Graph database", "E":"Time-series database" }, "correct_answer":"B", "chatgpt_question_part_one":"Which type of database would you prim...", "chatgpt_question_part_two":"Which type of database would you prim...", "answer_type":"MULTIPLE_CHOICE", "is_answer_correct":false }, { "question":"Alice is building a binary classification model on a development sample. She allocates 70% of the data for training/validation and 30% of the data for testing. Which one of the following code snippets is correct for this sampling process? 
Note: X includes all input variables, y refers to the target variable.", "role":"DATA_SCIENTIST", "difficulty":"BEGINNER", "element":"ML_LIBRARIES", "topic":"SCIKIT_LEARN", "question_family":"GENERIC", "choices":{ "A":"$\\texttt{from sklearn.model\\_selection import train\\_test\\_split}$\n$\\texttt{X\\_train, X\\_test, y\\_train, y\\_test = train\\_test\\_split(X, y, test\\_size=0.30, random\\_state=40)}$", "B":"$\\texttt{from sklearn.model\\_selection import train\\_test\\_split}$\n$\\texttt{y\\_train, y\\_test = train\\_test\\_split(X, y, test\\_size=0.30, random\\_state=40)}$", "C":"$\\texttt{from sklearn.model\\_selection import train\\_test\\_split}$\n$\\texttt{X\\_train, X\\_test = train\\_test\\_split(X, y, test\\_size=0.30, random\\_state=40)}$", "D":"$\\texttt{from sklearn.model\\_selection import train\\_test\\_split}$\n$\\texttt{X\\_train, y\\_train = train\\_test\\_split(X, y, test\\_size=0.30, random\\_state=40)}$", "E":"$\\texttt{from sklearn.model\\_selection import train\\_test\\_split}$\n$\\texttt{X\\_train, X\\_test, y\\_train, y\\_test = train\\_test\\_split(X, y, test\\_size=0.70, random\\_state=40)}$" }, "correct_answer":"A", "chatgpt_question_part_one":"Alice is building a binary classifica...", "chatgpt_question_part_two":"Alice is building a binary classifica...", "answer_type":"MULTIPLE_CHOICE", "is_answer_correct":false }, { "question":"Bob is employing the Ridge Regression technique to solve a regression problem. He has implemented the following Python code using the scikit-learn library:\n\n$\\texttt{cv = RepeatedKFold(n\\_splits=5, n\\_repeats=3, random\\_state=1)}$\n$\\texttt{model = RidgeCV(alphas=arange(0, 1, 0.01), cv=cv, scoring='neg\\_mean\\_absolute\\_error')}$\n$\\texttt{model.fit(X, y)}$\n$\\texttt{print('alpha: \\%f' \\% model.alpha\\_)}$\n\nAs the result of the print command, he has obtained \"alpha\" as 0.99. Which of the following statements regarding the above code snippet are true? 
(Note: X corresponds to input variables, and y represents the target variable)\n\nI- The test dataset is -randomly sampled- 10% of the whole dataset\nII- The objective of the code is to find the best \"alpha\" parameter by hyper-parameter optimization over the training set\nIII- The objective of the code is to find the best \"alpha\" parameter by hyper-parameter optimization over the validation set by using cross validation\nIV- The code uses 5-fold cross validation, by repeating it 3 times\nV- The performance metric used for the hyper-parameter optimization is R-Squared", "role":"DATA_SCIENTIST", "difficulty":"BEGINNER", "element":"ML_LIBRARIES", "topic":"SCIKIT_LEARN", "question_family":"GENERIC", "choices":{ "A":"III and IV", "B":"I and III", "C":"III, IV and V", "D":"II and III", "E":"II, III and V" }, "correct_answer":"A", "chatgpt_question_part_one":"Bob is employing the Ridge Regression...", "chatgpt_question_part_two":"Bob is employing the Ridge Regression...", "answer_type":"MULTIPLE_CHOICE", "is_answer_correct":false }, { "question":"An outlier is a data point that is \"abnormally distant\" from the other values sampled from a distribution. According to Tukey's outlier definition, the upper boundary of non-outliers ends at the 75th percentile plus $1.5\\times IQR$. Similarly, the bottom boundary for the non-outliers ends at the 25th percentile minus $1.5\\times IQR$. Any data point outside this interval is an outlier. Put another way, if we define the first and third quartiles as Q1 and Q3, respectively, then an outlier is any data point outside the interval $[Q1-1.5\\times IQR, Q3+1.5\\times IQR]$. Note that the interquartile range ($IQR$) is defined as ($Q3-Q1$). \n\nConsider the census table shown in the data grid. How many outliers can be found in the HOURS_PER_WEEK column?\n\n(You can use the Python Editor provided to solve the question. The sample data is already loaded into a dataframe named \"df\". 
You can also use the abbreviation \"pd\" for the Pandas library). ", "role":"DATA_SCIENTIST", "difficulty":"BEGINNER", "element":"PYTHON_DATA_SCIENCE", "topic":"PANDAS", "question_family":"PYTHON", "choices":{ }, "correct_answer":"9008", "chatgpt_question_part_one":"An outlier is a data point that is \"a...", "chatgpt_question_part_two":"An outlier is a data point that is \"a...", "answer_type":"RETURN_VALUE", "is_answer_correct":false }, { "question":"\"customer_transaction\" dataset contains some information about the orders made by a customer:\n1. ORDER_ID (Primary key): Unique order identifier for each order.\n2. CUSTOMER_ID: Customer identifier.\n3. ORDER_PURCHASE_TIMESTAMP: Time when the order is processed.\n4. ORDER_DELIVERED_TIMESTAMP: Time when the order is delivered.\n5. QUANTITY: Number of distinct products in each order.\n\nWhat is the CUSTOMER_ID of the customer who has the highest total in the QUANTITY column in July 2018? \n\n(You can use the Python Editor provided to solve the question. The sample data is already loaded into a dataframe named \"df\". You can also use the abbreviation \"pd\" for the Pandas library).", "role":"DATA_SCIENTIST", "difficulty":"BEGINNER", "element":"PYTHON_DATA_SCIENCE", "topic":"PANDAS", "question_family":"PYTHON", "choices":{ }, "correct_answer":"7459", "chatgpt_question_part_one":"\"customer_transaction\" dataset contai...", "chatgpt_question_part_two":"\"customer_transaction\" dataset contai...", "answer_type":"RETURN_VALUE", "is_answer_correct":false }, { "question":"The forward propagation equation for the ReLU activation function is given as\n${\\mathbf y} = \\max({\\mathbf x}, 0)$, where ${\\mathbf x}$ is the input tensor, and ${\\mathbf y}$ is the activation tensor. Which of the following functions expressed in terms of numpy arrays corresponds to the backpropagation equation? 
(Note: ${\\mathbf d}$ is the derivative tensor, and ${\\mathbf x}$ is the input tensor)", "role":"DATA_SCIENTIST", "difficulty":"BEGINNER", "element":"PYTHON_DATA_SCIENCE", "topic":"NUMPY", "question_family":"GENERIC", "choices":{ "A":"$\\texttt{import numpy as np}$\n$\\texttt{def backprop\\_relu(d, x):}$\n$\\texttt{\\hskip{0.22in} return d * (x > 0)}$\n", "B":"$\\texttt{import numpy as np}$\n$\\texttt{def backprop\\_relu(d, x):}$\n$\\texttt{\\hskip{0.22in} return np.maximum(d, 0) * x}$\n", "C":"$\\texttt{import numpy as np}$\n$\\texttt{def backprop\\_relu(d, x):}$\n$\\texttt{\\hskip{0.22in} np.maximum(x, 0) * d}$\n", "D":"$\\texttt{import numpy as np}$\n$\\texttt{def backprop\\_relu(d, x):}$\n$\\texttt{\\hskip{0.22in} return x * (d > 0)}$\n", "E":"$\\texttt{import numpy as np}$\n$\\texttt{def backprop\\_relu(d, x):}$\n$\\texttt{\\hskip{0.22in} np.max(x, 0) * d}$\n" }, "correct_answer":"A", "chatgpt_question_part_one":"The forward propagation equation for ...", "chatgpt_question_part_two":"The forward propagation equation for ...", "answer_type":"RETURN_VALUE", "is_answer_correct":false }, { "question":"Which one of the following statements is not true about Generative Adversarial Networks?", "role":"DATA_SCIENTIST", "difficulty":"BEGINNER", "element":"DEEP_LEARNING", "topic":"APPLICATIONS", "question_family":"GENERIC", "choices":{ "A":"It is an unsupervised learning algorithm", "B":"Two networks are trained concurrently for GAN training", "C":"No labelling of data is needed", "D":"It is applicable only to computer vision problems", "E":"It can be used for text-to-image synthesis" }, "correct_answer":"D", "chatgpt_question_part_one":"Which one of the following statements...", "chatgpt_question_part_two":"Which one of the following statements...", "answer_type":"MULTIPLE_CHOICE", "is_answer_correct":false }, { "question":"Which of the following is not an optimization technique for finding the best hyper-parameters for training a deep learning network?\n", 
"role":"DATA_SCIENTIST", "difficulty":"BEGINNER", "element":"DEEP_LEARNING", "topic":"PRACTICE", "question_family":"GENERIC", "choices":{ "A":"Grid Search", "B":"Bayesian Optimization", "C":"Linear Programming", "D":"Random Search", "E":"Evolutionary Optimization" }, "correct_answer":"C", "chatgpt_question_part_one":"Which of the following is not an opti...", "chatgpt_question_part_two":"Which of the following is not an opti...", "answer_type":"MULTIPLE_CHOICE", "is_answer_correct":false }, { "question":"A social media company wants to understand the impact of some design changes on click-through rates in their application. They will conduct A/B test experiments by measuring the daily conversion rates for control and test groups for a time period. Which of the following parameters should be specified to decide on the minimum number of days for the experiment, i.e. sample size for the experiment?\n\n1. Mean of conversion rate in control group\n2. Mean of conversion rate in test group\n3. Standard deviation of conversion in control group\n4. Standard deviation of conversion in test group\n5. Type-I error (alpha)\n6. Power of the test (1 - beta)\n", "role":"DATA_SCIENTIST", "difficulty":"BEGINNER", "element":"SAMPLING", "topic":"DESIGN_OF_EXPERIMENTS", "question_family":"GENERIC", "choices":{ "A":"1, 2, 3, and 4", "B":"1 and 2", "C":"1, 2, 5, and 6", "D":"3, 4, 5, and 6", "E":"All of the parameters are needed" }, "correct_answer":"E", "chatgpt_question_part_one":"A social media company wants to under...", "chatgpt_question_part_two":"A social media company wants to under...", "answer_type":"MULTIPLE_CHOICE", "is_answer_correct":false }, { "question":"Alice is building a binary classification model where the number of positive cases is very small, and the dataset is heavily unbalanced with respect to the distribution of target variable. 
Which of the following methods should she employ to get the most accurate model?", "role":"DATA_SCIENTIST", "difficulty":"BEGINNER", "element":"SAMPLING", "topic":"STATISTICAL_SAMPLING", "question_family":"GENERIC", "choices":{ "A":"Undersampling negative cases", "B":"Oversampling positive cases with resampling", "C":"Adjusting class weights", "D":"Oversampling positive cases using SMOTE algorithm", "E":"Case weighting" }, "correct_answer":"D", "chatgpt_question_part_one":"Alice is building a binary classifica...", "chatgpt_question_part_two":"Alice is building a binary classifica...", "answer_type":"MULTIPLE_CHOICE", "is_answer_correct":false }, { "question":"Consider the following time series:\n$y(t) = y(t \u2212 1) + 2 + \\epsilon(t)$\n$\\epsilon(t) \\sim \\mathcal{N}(0, 12)$ and i.i.d.\n$y(0) \\sim \\mathcal{N}(10, 64)$ and uncorrelated with $\\epsilon(t), t \\geq 0$\n\nWhat is the approximate value of $P(y(3) > 36)$, where $P$ stands for probability?\n(Note the $z$-values $z_{0.25} \\sim 0.599$, $z_{0.5} \\sim 0.692$, $z_{1.0} \\sim 0.84$, $z_{2.0} \\sim 0.977$, $z_{2.5} \\sim 0.994$)", "role":"DATA_SCIENTIST", "difficulty":"BEGINNER", "element":"TIME_SERIES_ANALYSIS", "topic":"MODELLING", "question_family":"GENERIC", "choices":{ "A":"0.95", "B":"0.4", "C":"0.16", "D":"0.02", "E":"0.01" }, "correct_answer":"D", "chatgpt_question_part_one":"Consider the following time series:...", "chatgpt_question_part_two":"Consider the following time series:...", "answer_type":"MULTIPLE_CHOICE", "is_answer_correct":false }, { "question":"A firm is providing bakery products (bread, cupcakes etc) to its customers. Alice is trying to predict the daily demand for bread. She has daily sales numbers for the last year. She plans to use AR algorithm for modelling. Which of the following performance metrics is more appropriate for this project? 
", "role":"DATA_SCIENTIST", "difficulty":"BEGINNER", "element":"TIME_SERIES_ANALYSIS", "topic":"MODELLING", "question_family":"GENERIC", "choices":{ "A":"Training data mean absolute percentage error", "B":"Test data mean absolute percentage error", "C":"Training data Akaike Information Criterion", "D":"Test data Akaike Information Criterion", "E":"Test data R-Square value" }, "correct_answer":"E", "chatgpt_question_part_one":"A firm is providing bakery products (...", "chatgpt_question_part_two":"A firm is providing bakery products (...", "answer_type":"MULTIPLE_CHOICE", "is_answer_correct":false } ] }