# Sitemap

A list of all the posts and pages found on the site. For the robots out there, an XML version is available for digesting as well.

## CV

This is a page not in the main menu.

## White Noise Time Series

Published:

A white noise series has the following properties:

• Mean equals zero
• Standard deviation is constant
• Autocorrelation at every lag (lag > 0) is close to zero (each autocorrelation lies within the bounds, showing no statistically significant difference from zero)
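
As a quick sketch (not from the original post; assumes numpy), we can generate Gaussian white noise and check these properties empirically:

```python
import numpy as np

rng = np.random.default_rng(42)
x = rng.normal(loc=0.0, scale=1.0, size=10_000)  # white noise series

# Properties 1 & 2: mean close to zero, constant standard deviation
print(round(x.mean(), 3), round(x.std(), 3))

# Property 3: autocorrelation at lag 1 should be close to zero
lag1 = np.corrcoef(x[:-1], x[1:])[0, 1]
print(round(lag1, 4))
```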

## Kruskal-Wallis Test Statistic Formula Derivation When No Tied Values Exist

Published:

In the previous post, I mentioned that the general formula of the H statistic is the following (Source: Wikipedia - Kruskal–Wallis one-way analysis of variance):

## Hypothesis Testing with the Kruskal-Wallis Test

Published:

The Kruskal-Wallis test is a non-parametric statistical test that is used to evaluate whether the medians of two or more groups are different. Since the test is non-parametric, it doesn’t assume that the data comes from a particular distribution.
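
As an illustrative sketch with hypothetical sample data (the post itself may use different data), scipy's `kruskal` runs the test directly:

```python
from scipy.stats import kruskal

# Three small samples (made-up data) whose medians we want to compare
group_a = [2.9, 3.0, 2.5, 2.6, 3.2]
group_b = [3.8, 2.7, 4.0, 2.4]
group_c = [2.8, 3.4, 3.7, 2.2, 2.0]

h_stat, p_value = kruskal(group_a, group_b, group_c)
# Reject H0 (equal medians) at alpha = 0.05 only if p_value < 0.05
print(h_stat, p_value)
```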

## Gradient Boosting Algorithm for Classification Problem

Published:

In the previous post I explained how the Gradient Boosting algorithm works for a regression problem.

## Local Interpretable Model-Agnostic Explanations (LIME)

Published:

LIME is a Python library used to explain predictions from any machine learning classifier.

## Parzen Window Density Estimation - Kernel and Bandwidth

Published:

Imagine that you have some data `x1, x2, x3, ..., xn` originating from an unknown continuous distribution `f`. You’d like to estimate `f`.
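
A minimal sketch of the idea with a Gaussian kernel (illustrative only; the function name `parzen_kde` and the bandwidth value are my own choices):

```python
import numpy as np

def parzen_kde(x_query, data, bandwidth):
    """Estimate f(x_query) as the average of Gaussian kernels centred on each data point."""
    u = (x_query - data) / bandwidth
    kernels = np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)
    return kernels.mean() / bandwidth

rng = np.random.default_rng(0)
data = rng.normal(0, 1, 1000)       # samples from the "unknown" distribution f
est = parzen_kde(0.0, data, bandwidth=0.3)
true = 1 / np.sqrt(2 * np.pi)       # true N(0, 1) density at x = 0
print(est, true)
```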

## Tree Parzen Estimator in Bayesian Optimization for Hyperparameter Tuning

Published:

One of the techniques in hyperparameter tuning is called Bayesian Optimization. It selects the next hyperparameter to evaluate based on the previous trials.

## Gradient Boosting Algorithm for Regression Problem

Published:

In this post, we’re going to look at how the Gradient Boosting algorithm works in a regression problem.
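
As a quick illustration of the core idea (not the post's code; assumes scikit-learn and synthetic data), each new tree in gradient boosting fits the residual errors of the ensemble built so far:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

# Synthetic regression data (illustrative only)
X, y = make_regression(n_samples=500, n_features=4, noise=5.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Each of the 100 shallow trees corrects the residuals of its predecessors
model = GradientBoostingRegressor(n_estimators=100, learning_rate=0.1, random_state=0)
model.fit(X_tr, y_tr)
print(model.score(X_te, y_te))  # R^2 on held-out data
```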

## Permanent and Temporary External Table in BigQuery

Published:

In BigQuery, an external data source is a data source that we can query directly even though the data is not stored in BigQuery’s storage. We can query the data source simply by creating an external table that refers to it, instead of loading the data into BigQuery.

## Machine Learning on BigQuery ML

Published:

BigQuery Machine Learning (BQML) is a new feature in BigQuery that lets data analysts train, evaluate, and predict with machine learning models with minimal coding.

## ML on Kubeflow - Part 3 (End): Model Serving

Published:

In the previous post we looked at how to train an ML model on a Kubeflow cluster. With the trained model in hand, it’s time to serve requests.

## ML on Kubeflow - Part 2: Training on the Cluster

Published:

You can find the TensorFlow code in the `model.py` file in the examples repository. After training is complete, the model will be stored in a GCS bucket.

## ML on Kubeflow - Part 1: Creating a Kubeflow Cluster

Published:

Distributing machine learning (ML) workloads across multiple worker nodes is critical when the datasets grow larger and the ML models become more complex over time. Unfortunately, distributing ML workloads might add complexity to the DevOps part of the ML system as we’ll need to deal with lots of computing nodes.

## Optimizing BigQuery Queries - Part 3 (End)

Published:

This is the last part of the series on how to optimize BigQuery queries.

## Optimizing BigQuery Queries - Part 2

Published:

This is the second part of the series on how to optimize BigQuery queries.

## Optimizing BigQuery Queries - Part 1

Published:

Query optimization is usually performed to reduce query execution time or cost.

## Building a Streaming Data Pipeline with Cloud Pub/Sub & Cloud Dataflow

Published:

In this post, we’re going to look at how to build a streaming data pipeline with Cloud Pub/Sub and Cloud Dataflow.

## Creating Percentile Table with a Specified Increment in MongoDB

Published:

Here’s the scenario.

## Creating Workflows in Google Cloud Composer

Published:

A workflow can simply be defined as a sequence of tasks performed to accomplish a goal.

## Running Spark Jobs on Google Cloud Dataproc

Published:

In this post, we’re going to look at how to migrate Spark jobs to Google Cloud Dataproc.

Published:

Google Cloud SQL is a fully managed database service that makes it easy to set up, maintain, manage, and administer your relational MySQL, PostgreSQL, and SQL Server databases on Google Cloud Platform.

Published:

• BigQuery is Google’s fully managed, NoOps, low cost analytics database.
• With BigQuery, the users can query terabytes and terabytes of data without having any infrastructure to manage or needing a database administrator.
• BigQuery uses SQL and can take advantage of the pay-as-you-go model.
• BigQuery allows you to focus on analyzing data to find meaningful insights.

## Using SparkSQL in Metabase

Published:

Basically, Metabase’s SparkSQL only allows users to access data in the Hive warehouse. In other words, the data must be in Hive table format in order to be loaded.

## Setting Up Database in Hive Environment

Published:

In this post, we’re going to look at how to set up a database along with the tables in Hive.

## Kalman Filter for Dynamic State & Multiple Measurements

Published:

In the previous post, we discussed the implementation of the Kalman filter for a static state (the true values of the object’s states are constant over time). In addition, the Kalman filter algorithm was applied to estimate a single true value.

## Kalman Filter for Static State & Single Measurement

Published:

The Kalman filter is an iterative mathematical process applied to consecutive data inputs to quickly estimate the true value (position, velocity, weight, temperature, etc.) of the object being measured when the measured values contain random error or uncertainty.
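
A minimal sketch of that iterative process for a static state and a single measured value (the numbers are illustrative, not the post's example):

```python
# Estimate a constant true value (say, a temperature) from noisy measurements.
measurements = [75.0, 71.0, 70.0, 74.0]

estimate = 68.0     # initial guess
est_error = 2.0     # error in the estimate
meas_error = 4.0    # error in each measurement (assumed constant)

for z in measurements:
    kalman_gain = est_error / (est_error + meas_error)
    estimate = estimate + kalman_gain * (z - estimate)  # pull estimate toward z
    est_error = (1 - kalman_gain) * est_error           # estimate error shrinks
    print(round(estimate, 2), round(kalman_gain, 3))
```

Each iteration the gain (and hence the estimate's error) shrinks, so later noisy measurements move the estimate less.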

## Sample Size Matters for Mean Difference Testing

Published:

It’s quite bothersome to read a publication that only provides a “statistically significant” result without telling much about the analysis done prior to conducting the experiment.

## Data Dredging (p-hacking)

Published:

“If you torture the data long enough, it will confess to anything” - Ronald Coase.

## Moment Generating Function

Published:

As the name suggests, moment generating function (MGF) provides a function that generates moments, such as `E[X]`, `E[X^2]`, `E[X^3]`, and so forth.
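
In notation, the MGF and the way it generates moments can be written as:

```latex
M_X(t) = E\left[e^{tX}\right], \qquad
\left.\frac{d^n}{dt^n} M_X(t)\right|_{t=0} = E[X^n]
```

Differentiating the MGF n times and evaluating at t = 0 yields the n-th moment.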

## The Investigation of Skewness & Kurtosis in Spark (Scala)

Published:

Applying central moment functions in Spark might be tricky, especially for skewness and kurtosis.

## One-sample Z-test with p-value Approach

Published:

A one-sample z-test is used to examine whether the difference between a population mean and a certain value is significant.
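
A quick sketch with made-up numbers (assumes scipy for the normal CDF):

```python
from math import sqrt
from scipy.stats import norm

# Hypothetical example: H0 says the population mean is 100 (known sigma = 15)
sample_mean, mu0, sigma, n = 104.0, 100.0, 15.0, 50

z = (sample_mean - mu0) / (sigma / sqrt(n))
p_value = 2 * (1 - norm.cdf(abs(z)))  # two-tailed p-value
print(z, p_value)
```

Compare the p-value with the chosen significance level (e.g. 0.05) to decide whether to reject H0.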

## Maximum Likelihood Estimation - Normal Distribution

Published:

In the previous post, I covered the basic concept of maximum likelihood estimation (MLE). Please visit that post if you need a refresher.

## Maximum Likelihood Estimation

Published:

If in the probability context we state that `P(x1, x2, ..., xn | params)` means the probability of getting a set of observations `x1`, `x2`, …, and `xn` given the distribution parameters, then in the likelihood context we get the following.
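
Assuming independent observations, the likelihood swaps the roles: the data are fixed and the parameters vary:

```latex
L(\theta \mid x_1, \ldots, x_n) = P(x_1, x_2, \ldots, x_n \mid \theta)
  = \prod_{i=1}^{n} P(x_i \mid \theta)
```

MLE then picks the θ that maximises this product (in practice, its logarithm).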

## Kullback-Leibler Divergence for Empirical Probability Distributions in Spark

Published:

In the previous post, I covered the basic concept of the two-sample Kolmogorov-Smirnov (KS) test and its implementation in Spark (Scala API).

## Accessing Resources with Extra Classpath as spark-submit Config

Published:

A few days ago I came across a case where a module needs access to the `resources` directory.

## Two-sample Kolmogorov-Smirnov Test for Empirical Distributions in Spark

Published:

The Kolmogorov-Smirnov (KS) test is a non-parametric test for the equality of probability distributions.
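
A minimal illustration with scipy's `ks_2samp` on synthetic samples (not the Spark implementation the post covers):

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
same = ks_2samp(rng.normal(0, 1, 500), rng.normal(0, 1, 500))  # same distribution
diff = ks_2samp(rng.normal(0, 1, 500), rng.normal(1, 1, 500))  # shifted mean

print(same.pvalue)  # large: no evidence the distributions differ
print(diff.pvalue)  # tiny: the distributions clearly differ
```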

## Running Local Mode Spark with Logging via spark-submit

Published:

Below is a script for running Spark via `spark-submit` (local mode) that utilizes logging.

## How to Solve This Extreme Algebra Problem?

Published:

Here we’re gonna look at how to solve the following algebra problem.

## Ramanujan’s Nested Cube Roots Proof

Published:

The theorem of nested cube roots (Ramanujan) states the following.

## Vieta Triple Roots: 2019 American Invitational Mathematics Examination (AIME) I Problem 10

Published:

Let’s take a look at the problem statement.

## Retrieving Rows with Duplicate Values on the Columns of Interest in Spark

Published:

There are several ways of removing duplicate rows in Spark. Two of them are by using `distinct()` and `dropDuplicates()`. The former lets us remove rows with the same values on all the columns. Meanwhile, the latter lets us remove rows with the same values on multiple selected columns.

## The Legendary Question Six IMO 1988

Published:

The final problem of the International Mathematical Olympiad (IMO) 1988 is considered to be the most difficult problem in the contest.

## The Most Beautiful Equation

Published:

Euler’s formula is stated as the following.

## Euler’s Pi for the Sum of Inverse Squares Proof

Published:

Given an infinite series of inverse squares of the natural numbers, what is the sum?
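
The answer is π²/6 (the Basel problem); a quick numerical sanity check of the partial sums:

```python
from math import pi

# Sum 1/k^2 for k = 1 .. 10^6 and compare against pi^2 / 6
partial_sum = sum(1 / k**2 for k in range(1, 1_000_001))
print(partial_sum, pi**2 / 6)
```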

## Vieta’s Infinite Products Representation for Pi: Show Me the Proof!

Published:

Last time I wrote about the infinite product representation for pi that is regarded as Wallis’ product for pi.

## Proof of Wallis Product for Pi with Euler’s Infinite Product for Sine

Published:

In the previous post I showed how to demonstrate the Wallis product for pi starting from the integration of powers of sine.

## Wallis Product for Pi with Integration: Want the Proof!

Published:

Wallis’ infinite product for pi states the following.
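
The product says that pi/2 equals the infinite product of (2n/(2n-1)) * (2n/(2n+1)) over n = 1, 2, 3, ...; a quick numerical check of the statement:

```python
from math import pi

product = 1.0
for n in range(1, 100_001):
    product *= (2 * n) / (2 * n - 1) * (2 * n) / (2 * n + 1)

print(2 * product, pi)  # 2 * product converges (slowly) to pi
```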

## Apache Griffin for Data Validation: Yay & Nay

Published:

In the previous post, I mentioned several points I observed about Griffin during my exploration.

## Data Quality with Apache Griffin Overview

Published:

A few days back I was exploring a big data quality tool called Griffin. There are lots of DQ tools out there, such as Deequ, Target’s data validator, TensorFlow data validator, PySpark Owl, and Great Expectations. There’s another one called Cerberus; however, it doesn’t natively support large-scale data.

## Standard Error of Mean Estimate Derivation

Published:

Suppose we conduct K experiments on a kind of measurement. In each experiment, we take N observations. In other words, we’ll have N × K data points at the end.
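
A quick simulation of that setup (assumes numpy; the values of K, N, and sigma are arbitrary): the spread of the K sample means should match sigma / sqrt(N), the standard error of the mean:

```python
import numpy as np

rng = np.random.default_rng(1)
K, N, sigma = 2000, 100, 5.0   # K experiments, N observations each

# K x N draws -> one sample mean per experiment
means = rng.normal(0.0, sigma, size=(K, N)).mean(axis=1)

empirical_se = means.std()
theoretical_se = sigma / np.sqrt(N)
print(empirical_se, theoretical_se)
```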

## Monotonic Binning for Weight of Evidence (WoE) Encoding

Published:

I was experimenting with the weight of evidence (WoE) encoding for continuous data. The preparation is quite different from that for categorical data in terms of binning characteristics.

## Multicollinearity - Large Estimating Betas Variance (Part 2)

Published:

In the previous post, I discussed how collinearity affects the computation of the beta estimators.

## Multicollinearity - A Bit of Maths Behind Why It is a Problem (Part 1)

Published:

In simple terms, we could define collinearity as a condition where two variables are highly correlated (positively / negatively). When there are more than two variables, it’s sometimes referred to as multicollinearity.

## Weight of Evidence & Information Value for Attributes Relevance Analysis with PySpark

Published:

WoE & information value (IV) are used as a framework for attribute relevance analysis. WoE and IV can be utilised independently since each of them plays a different role.

## Tackling Covariate Shift in ML Using ML

Published:

In the previous post I described a simple way of estimating the density ratio of two probability distributions. I decided to create a Python package that provides such functionality.

## Density Ratio Estimation with Probabilistic Classification for Handling Covariate Shift

Published:

In the previous post I shared how to detect covariate shift with a simple model-based technique. After learning that the data distribution has changed, what can we do to address the issue?

## Covariate Shift Detection with Machine Learning Based Approach

Published:

Covariate shift happens when the distribution of the train data differs from the distribution of the test data. Take a look at the following probability equation.

## Adding Strictly Increasing ID to Spark Dataframes

Published:

Recently I was exploring ways of adding a unique row ID column to a dataframe. The requirement is simple: “the row ID should strictly increase with difference of one and the data order is not modified”.

## Incremental Query for Large Streaming Data Operation

Published:

In the previous post, I wrote about how to perform a pandas groupBy operation on a large dataset in a streaming way. The main problem being addressed is memory consumption, since the data size might be extremely large.

## Streaming GroupBy for Large Datasets with Pandas

Published:

I came across an article about how to perform a `groupBy` operation on a large dataset. Long story short, the author proposes an approach called streaming groupBy, where the dataset is divided into chunks and the `groupBy` operation is applied to each chunk. The approach is implemented with pandas.
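
A toy sketch of that chunked idea (the column names and chunk size are hypothetical):

```python
import pandas as pd

# Hypothetical data: aggregate `value` per `key` without grouping all rows at once
df = pd.DataFrame({"key": list("ababab"), "value": [1, 2, 3, 4, 5, 6]})

partials = []
for start in range(0, len(df), 2):              # process 2-row chunks
    chunk = df.iloc[start:start + 2]
    partials.append(chunk.groupby("key")["value"].sum())

# Combine the per-chunk partial sums into the final aggregate
result = pd.concat(partials).groupby(level=0).sum()
print(result.to_dict())  # {'a': 9, 'b': 12}
```

This works because sum is decomposable: per-chunk sums can be re-aggregated without revisiting the raw rows.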

## Distributed LIME with PySpark UDF vs MMLSpark

Published:

In the previous post, I wrote about how to make LIME run in pseudo-distributed mode with PySpark UDF.

## Pseudo-distributed LIME via PySpark UDF

Published:

The initial question that popped up in my mind was how to make LIME perform faster. This is useful when the data to explain is large.

## Multiple Workers in a Single Node Configuration for Spark Standalone Cluster

Published:

A few days back I tried to set up a Spark standalone cluster on my own machine with the following specification: two workers (balanced cores) within a single node.

## The Three-Headed Hound of the Underworld (Kerberos)

Published:

Kerberos is simply a “ticket-based” authentication protocol. It enhances the security approach used by password-based authentication protocols. Since there might be a possibility for eavesdroppers to capture the password, Kerberos mitigates this by leveraging a ticket (how it is generated is explained below) that ideally should only be known by the client and the service.

## Submitting and Polling Spark Job Status with Apache Livy

Published:

Livy offers a REST interface for interacting with a Spark cluster. It provides two general approaches for job submission and monitoring.

## User Sessions Addition Error When Submitting Spark Job to Apache Livy via Local Mode

Published:

A few days back I tried to submit a Spark job to a Livy server deployed via local mode. The procedure was straightforward since the only thing to do was to specify the job file along with the configuration parameters (like what we do when using `spark-submit` directly).

## Handling Dot Character in Spark Dataframe Column Name (Partial Solution)

Published:

A few days ago I came across a case where I needed to define a dataframe’s column name with a special character, namely a dot (‘.’). Take a look at the following schema example.

## Creating Nested Columns in PySpark Dataframe

Published:

A nested column is basically just a column with one or more sub-columns. Take a look at the following example.

## Drawing ROC Curve Without Applying the Formulas

Published:

One of the evaluation metrics that is often optimised is ROC-AUC. In this post, we’re going to discuss how an ROC curve is created.

## Finding the Best Threshold that Maximizes Accuracy from ROC & PR Curve

Published:

The problem is simple: how to find the best threshold from an ROC or PR curve that maximises a certain binary classification metric?
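
A minimal sketch for the ROC case (assumes scikit-learn; the toy labels and scores are made up). Accuracy at each ROC point can be recovered from TPR and FPR, then maximised; the PR case is analogous:

```python
import numpy as np
from sklearn.metrics import roc_curve

y_true  = np.array([0, 0, 0, 0, 1, 1, 1, 1])
y_score = np.array([0.1, 0.3, 0.35, 0.8, 0.4, 0.6, 0.7, 0.9])

fpr, tpr, thresholds = roc_curve(y_true, y_score)
n_pos = y_true.sum()
n_neg = len(y_true) - n_pos

# Accuracy at each ROC point: (TP + TN) / total = (TPR*P + (1-FPR)*N) / total
accuracy = (tpr * n_pos + (1 - fpr) * n_neg) / len(y_true)
best = thresholds[accuracy.argmax()]
print(best, accuracy.max())
```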

## Obfuscation Modes in PyArmor

Published:

I think one of the unique features of PyArmor is that it lets users configure how the code is obfuscated.

## Airflow Feature Improvement: Spark Driver Status Polling Support for YARN, Mesos & K8S

Published:

According to the code base, the driver status tracking feature is only implemented for the standalone cluster manager. However, based on this reference, we could also poll the driver status for Mesos and Kubernetes (cluster deploy mode). Additionally, such a feature is also possible for YARN.

## Bug on Airflow When Polling Spark Job Status Deployed with Cluster Mode

Published:

I was thinking of the following case.

## Obfuscating Python Scripts with PyArmor

Published:

Basically, code obfuscation is a technique used to modify the source code so that it becomes difficult to understand but remains fully functional. The main objective is to protect intellectual property and prevent hackers from reverse engineering proprietary source code.

## Airflow Executor & Friends: How Actually Does the Executor Run the Task?

Published:

A few days ago I did a small experiment with Airflow. To be precise, scheduling Airflow to run a Spark job via `spark-submit` on a standalone cluster. I briefly covered how to create a DAG and Operators in the previous post.

## Setting Up & Debugging Airflow On Local Machine

Published:

Airflow is basically a workflow management system. When we’re talking about a “workflow”, we’re referring to a sequence of tasks that needs to be performed to accomplish a certain goal. A simple example would be an ordinary ETL job: fetching data from data sources, transforming the data into certain formats in accordance with the requirements, and then storing the transformed data in a data warehouse.

## Making H2O Cluster Information Show Plausible Results for Total & Allowed Cores

Published:

H2O provides a platform for building machine learning models in a scalable way. By focusing on scalability, it leverages the concept of cluster computing and therefore enables engineers to perform big data analytics.

## Spark Structured Streaming with Parquet Stream Source & Multiple Stream Queries

Published:

Whenever we call `dataframe.writeStream.start()` in structured streaming, Spark creates a new stream that reads from a data source (specified by `dataframe.readStream`). The data passed through the stream is then processed (if needed) and written to a sink at a certain location.

## Spark History Server: Setting Up & How It Works

Published:

Application monitoring is critically important, especially when we encounter performance issues. In Spark, one way to monitor a Spark application is via Spark UI. The problem is, this Spark UI can only be accessed when the application is running.

Published:

I used kafka-python v.1.4.7 as the client.

Published:

In the previous article, Kafka Consumer Awareness of New Topic Partitions, I wrote about partition balancing by Kafka consumers. In other words, I wanted to see whether Kafka consumers are aware of new topic partitions.

## CAP Theorem

Published:

“Consistency, Availability, and Partition Tolerance” - choose two.

## Apache Kafka: Consumer Awareness of New Topic Partitions

Published:

Just wanted to confirm whether the Kafka consumers were aware of new topic’s partitions.

## Apache Spark [PART 33]: Making mapPartitions Accept Partition Functions with More Than One Argument

Published:

There might be a case where we need to perform a certain operation on each data partition. One of the most common examples is the use of mapPartitions. Sometimes, such an operation requires a more complicated procedure. This, in the end, makes the method executing the operation need more than one parameter.

## Apache Spark [PART 32]: Structured Streaming Checkpointing with Parquet Stream Source

Published:

I was curious about what checkpoint files in Spark structured streaming look like. To introduce the basic concept, checkpointing simply records the progress information of a streaming process. These checkpoint files are usually used for failure recovery. A more detailed explanation can be found here.

## Apache Spark [PART 31]: F.col() Behavior With Non-Existing Referred Columns on Dataframe Operations

Published:

I came across an odd use case when applying F.col() to certain dataframe operations on PySpark v2.4.0.

## Apache Spark [PART 30]: Machine Learning Model Re-train Mechanism via YARN Cluster Mode

Published:

Deploying a machine learning (ML) model to a production system is not the end of the whole AI engineering process. The deployed model might be obsolete over a period of time.

## Apache Spark [PART 29]: Multiple Extra Java Options for Spark Submit Config Parameter

Published:

There’s a case where we need to pass multiple extra Java options as one of the configurations for the Spark driver and executors. Here’s an example:

## Apache Spark [PART 28]: Accessing a Kerberized HDFS Cluster

Published:

A Spark application deployed to a cluster might need to access an HDFS cluster. To establish a secure connection, one may want to utilize a network authentication protocol, such as Kerberos. Using Kerberos might add a little bit of complexity to the connecting process. In this article I’m going to show you one of the cases my team and I encountered recently.

## Apache Spark [PART 27]: Crosstab Does Not Yield the Same Result for Different Column Data Types

Published:

I encountered an issue when applying the crosstab function in PySpark to pretty big data. And I think this should be considered a pretty big issue.

## Apache Spark [PART 26]: Failure When Overwriting A Parquet File Might Result in Data Loss

Published:

There are several critical issues that can arise when using Spark. One of them relates to data loss when a failure occurs.

## Final Approach of Refactoring ORMs & Repositories for a Better Attributes Management

Published:

I’ve already written three posts (including this one) related to refactoring ORM and repository modules for the sake of a better attributes management.

## Updated Approach for a Less-Hassle ORM Attributes Management

Published:

In the previous article I wrote about how I refactored the attributes management approach for Object Relational Mapper (ORM) use case. You can find the article here.

## Setting Up & Connecting to PostgreSQL (from Host) via Docker

Published:

A brief note on how to set up PostgreSQL via Docker and create tables in a database.

## Improving Attributes Management for Object Relational Mapper (ORM) Use Case

Published:

Let’s take a simple data management scenario.

## Apache Spark [PART 25]: Resolving Attributes Data Inconsistency with Union By Name

Published:

If you read my previous article titled Apache Spark [PART 21]: Union Operation After Left-anti Join Might Result in Inconsistent Attributes Data, it was shown that the attributes data was inconsistent when combining two dataframes after an inner join. According to the article, the solution is really simple. We just need to reorder the attributes by using the select command. Here’s a simple example.

## Apache Spark [PART 24]: A Little Bit Complicated Cumulative Sum

Published:

Suppose you have a dataframe consisting of several columns, such as the followings:

• A: group indicator -> call it A value
• B: has two different categories (b0 and b1) -> call it B value
• C: let’s assume it contains integers -> call it C value
• D: date and time -> call it D value
• E: timestamp -> call it E value

## Installing and Executing Rust Code via Docker

Published:

The best way to try new technologies without having clutter? Docker.

## Apache Spark [PART 23]: Sigma Operation in Spark’s Dataframe

Published:

Have you ever encountered a case where you need to compute the sum of a certain one-item operation? Consider the following example.

## [MATHS] The Monty Hall Problem Using Conditional Probability

Published:

The Monty Hall Problem can be stated as the following:
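
Briefly: a car sits behind one of three doors, you pick a door, the host (who knows where the car is) opens another door hiding a goat, and you may switch. A quick Monte Carlo check of the well-known answer, that switching wins about 2/3 of the time:

```python
import random

random.seed(0)

def play(switch, trials=100_000):
    wins = 0
    for _ in range(trials):
        car = random.randrange(3)
        pick = random.randrange(3)
        # Host opens a goat door that is neither the pick nor the car
        opened = next(d for d in range(3) if d != pick and d != car)
        if switch:
            # Switch to the remaining unopened door
            pick = next(d for d in range(3) if d != pick and d != opened)
        wins += pick == car
    return wins / trials

print(play(switch=False))  # close to 1/3
print(play(switch=True))   # close to 2/3
```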

## Apache Spark [PART 22]: Modifying the Code Profiler to Use Custom sort_stats Sorters

Published:

Code profiling is simply used to assess code performance, including its functions and the sub-functions within functions. One of its obvious usages is code optimisation, where a developer wants to improve the code’s efficiency by searching for the bottlenecks in the code.

## [MATHS] The Infinite Hotel Paradox by David Hilbert

Published:

Recently I watched a YouTube video about the infinite hotel paradox, which was introduced in the 1920s by the German mathematician David Hilbert. In case you’re curious about the video, just search YouTube for “The Infinite Hotel Paradox”.

## Apache Spark [PART 21]: Union Operation After Left-anti Join Might Result in Inconsistent Attributes Data

Published:

Unioning two dataframes after joining them with left_anti? Well, it seems like a straightforward approach. However, recently I encountered a case where the join operation might shift the location of the join key in the resulting dataframe. This, unfortunately, makes the merged dataframe inconsistent in terms of the data in each attribute.

## [Maths] Infinitely Many Prime Numbers by Euclid

Published:

To me, prime numbers are really interesting in terms of their position as the building blocks of other numbers. According to the Fundamental Theorem of Arithmetic, every positive integer N can be written as a product of P1, P2, P3, …, and Pk where Pi are all prime numbers.

## [Maths] Riemann Hypothesis and One Question in My Mind

Published:

Yesterday I came across an interesting maths paper discussing the Riemann hypothesis. Regarding the concept itself, there’s lots of maths, but I think I enjoyed the reading. Frankly speaking, although mathematics is one of my favourite subjects, I’ve rarely played with it (esp. pure maths) since I got acquainted with the AI and big data engineering world. Now I think it’s just fine to play with it again. Just for fun.

## Mastering Spark [PART 20]: Resolving Reference Column Ambiguity After Self-Joining by Deep Copying the Dataframes

Published:

I encountered an intriguing result when joining a dataframe with itself (self-join). As you might already know, one of the problems that occurs when doing a self-join relates to duplicated column names. Because of this duplication, there’s an ambiguity when we perform operations that require us to provide the column names.

## Mastering Spark [PART 19]: The Number of Partitions After Unioning Two or More Dataframes

Published:

An intriguing question popped into my mind: after unioning several dataframes, how many partitions will the resulting dataframe have?

## Mastering Spark [PART 18]: Ensuring Dataframe Partitions After Equi-joining (Inner)

Published:

The problem is really simple. After equi-joining (inner) two dataframes, a certain operation is applied to each partition. Precisely, such an operation can be accomplished by the following code:

## Mastering Spark [PART 17]: Repartitioning Input Data Stream

Published:

Recently I played with a simple Spark Streaming application. Precisely, I investigated the behavior of repartitioning on different levels of input data streams. For instance, suppose we have two input data streams, linesDStream and wordsDStream. The question is: is the repartitioning result different if I repartition after linesDStream versus after wordsDStream?

## Mastering Spark [PART 16]: How to Check the Size of a Dataframe?

Published:

Have you ever wondered how to discover the size of a dataframe? Perhaps it doesn’t sound like a fancy thing to know, yet there are certain cases that require pre-knowledge of the size of our dataframe. One of them is when we want to apply a broadcast operation. As you might already know, broadcasting requires the dataframe to be small enough to fit in memory in each executor. This implicitly means that we should know the size of the dataframe beforehand in order for broadcasting to be applied successfully. Just FYI, broadcasting lets us configure the maximum size of a dataframe that can be pushed into each executor. Precisely, this maximum size can be configured via `spark.conf.set("spark.sql.autoBroadcastJoinThreshold", MAX_SIZE)`.

## Mastering Spark [PART 15]: Optimizing Join on Skewed Dataframes

Published:

Joining two dataframes might not be an easy task when one of them has skewed data. Skewed data simply means that a few elements appear a lot more often than the others.

## Mastering Spark [PART 14]: Effects of Shuffling on RDDs and Dataframes Partitioning

Published:

In Spark, data shuffling simply means data movement. On a single machine with multiple partitions, data shuffling means that data moves from one partition to another. On multiple machines, data shuffling covers two kinds of movement: data moving from one partition (A) to another partition (B) within the same machine (M1), and data moving from partition B to another partition (C) on a different machine (M2). Data in partition C might be moved again to another partition on yet another machine (M3).

## Mastering Spark [PART 13]: Speeding Up Window Function by Repartitioning the Dataframe First

Published:

The concept of window functions in Spark is pretty interesting. One of their primary usages is calculating cumulative values. Here’s a simple example.

## Mastering Spark [PART 12]: Speeding Up Parquet Write

Published:

Parquet is a file format with a columnar style. Columnar style means that the data is stored by column rather than row by row. Here’s a simple example.

## Mastering Spark [PART 11]: Too Lazy to Process the Whole Dataframe

Published:

One of the characteristics of Spark that makes me interested in exploring this framework further is its lazy evaluation approach. Simply put, Spark won’t execute a transformation until an action is called. I think it’s logical: when we only specify the transformation plan and don’t ask it to execute the plan, why should it force itself to do the computation on the data? In addition, by implementing this lazy evaluation approach, Spark might be able to optimize the logical plan. The manual effort of making the query more efficient might be reduced significantly. Cool, right?

## Ensembled Learning Using Voting Classifier from Scikit-Learn vs The Current Approach

Published:

I’ve been trying to speed up the ensemble model’s prediction performance. I actually mentioned this (the current approach) in my previous post.

## Mastering Spark [PART 10]: Lightning Fast Pandas UDF

Published:

Spark UDFs (user-defined functions) are simply functions created to overcome speed performance problems when processing a dataframe. They’re useful when your Python functions are too slow to process a dataframe at a large scale. When you use a plain Python function, it processes the dataframe one row at a time, meaning the process is executed sequentially. Meanwhile, if you use a Spark UDF, Spark distributes the dataframe and the Spark UDF to the provided executors; hence, the dataframe processing is executed in parallel. For more information about Spark UDFs, please take a look at this post.

## Mastering Spark [PART 09]: An Optimized Approach for Multiple Dataframe Columns Operation

Published:

I came across an interesting problem when playing with ensemble learning. For those who don’t know, ensemble learning is simply a machine learning approach that combines several weak classifiers to derive the final result. One of the simplest examples is the random forest algorithm. In a random forest, each tree learns different parts (features and data points) of the dataset. When predicting a new data point, each tree gives a vote for its class of choice. The final class is the one voted for by the majority of trees.

## Mastering Spark [PART 08]: A Brief Report on GroupBy Operation After Dataframe Repartitioning

Published:

A few days ago I did a little exploration of Spark’s groupBy behavior. Specifically, I wanted to see whether the order of the data was preserved when applying groupBy on a repartitioned dataframe.

## Implementing Balanced Random Forest via imblearn

Published:

Have you ever heard of the imblearn package? Based on its name, I think people familiar with machine learning will presume it’s a package created specifically for tackling the problem of imbalanced data. If you dig a little deeper, you’ll find its GitHub repository here. And yes, once again, it’s a Python package for working with imbalanced data.

## Mastering Spark [PART 07]: Custom Partitioner for Repartitioning in Spark

Published:

A statement I encountered a few days ago: “Avoid using Resilient Distributed Datasets (RDDs) and use Dataframes/Datasets (DFs/DTs) instead, especially in production.”

## Mastering Spark [PART 06]: List of Spark Machine Learning Models & Non-overwritten Prediction Columns

Published:

I was implementing a paper related to balanced random forest (BRF). Just FYI, a BRF consists of decision trees where each tree receives instances with a 1:1 ratio of minority to majority class. A BRF also uses m randomly selected features to determine the best split.
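The 1:1 per-tree sampling can be sketched like this. This is an illustrative sketch of the idea (bootstrap the minority class, down-sample the majority class to match), not the paper's exact procedure.

```python
import random

def brf_tree_sample(minority, majority, seed=0):
    """Balanced 1:1 sample for one BRF tree: a bootstrap of the
    minority class plus an equally sized draw from the majority class."""
    rng = random.Random(seed)
    n = len(minority)
    boot_minority = [rng.choice(minority) for _ in range(n)]  # with replacement
    down_majority = rng.sample(majority, n)                   # without replacement
    return boot_minority + down_majority

minority = list(range(5))           # 5 minority instances
majority = list(range(100, 200))    # 100 majority instances
sample = brf_tree_sample(minority, majority)
assert len(sample) == 10                                 # 5 + 5
assert sum(1 for x in sample if x >= 100) == 5           # exactly half majority
```

Each tree gets a different seed in practice, so the trees see different balanced views of the data.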

## Mastering Spark [PART 05]: A Little Experiment on Dataframe Repartitioning

Published:

Spark has two ways of changing a dataframe’s partitioning. The first one is coalesce, while the second one is repartition.

## Mastering Spark [PART 04]: Accumulator

Published:

A few days ago I conducted a little experiment on Spark’s RDD operations. One of them was the foreach operation (an action). Simply put, this operation applies a given function to each row in the RDD. Here’s a simple example:
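Since running Spark itself is out of scope here, the foreach-plus-accumulator pattern can be mimicked in plain Python. The `Accumulator` class and `foreach` function below are toy stand-ins for Spark's, just to show the shape of the pattern: foreach runs purely for side effects, and workers only ever add to the accumulator while the driver reads the final value.

```python
class Accumulator:
    """Toy stand-in for Spark's accumulator: tasks may only add,
    the driver reads the final value."""
    def __init__(self, value=0):
        self.value = value

    def add(self, x):
        self.value += x

def foreach(rdd, fn):
    # Spark's foreach applies fn to each row purely for side effects
    # and returns nothing.
    for row in rdd:
        fn(row)

acc = Accumulator()
rdd = [1, 2, 3, 4]          # stand-in for a parallelized collection
foreach(rdd, lambda x: acc.add(x))
assert acc.value == 10
```

In real Spark the rows live on different executors, which is exactly why a shared, add-only accumulator is needed instead of a plain driver-side variable.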

## Questions on Balanced Random Forest & Its Implementation

Published:

I came across a research paper related to balanced random forest for imbalanced data. For the sake of clarity, the following is the algorithm of BRF taken from the paper:

## Mastering Spark [PART 03]: RDD to DF Gave a StopIteration Exception

Published:

I made a silly mistake a few days ago - well, yes.

## Mastering Kafka [PART 01]: WTF is Kafka? A High-level Overview

Published:

Basically, you can think of Kafka as a messaging system. When an application sends a message to another application, it needs to specify how to send that message. The most obvious use case for a messaging system, in my opinion, is dealing with big data. For instance, a sender application shares a large amount of data that needs to be processed by a receiver application, but the receiver processes data more slowly than the sender sends it. Consequently, the receiver might be overloaded, since it can’t accept new messages while it’s still processing. And even with distributed receivers, we’d still have to tell the sender which receiver node to send each message to.
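The buffering role a broker plays in between can be sketched with an in-process queue — a toy stand-in for a Kafka topic, not Kafka's API: the producer enqueues at its own pace and the consumer drains at its own pace, so neither needs to know about the other's speed.

```python
from queue import Queue

broker = Queue()  # toy stand-in for a Kafka topic

# Fast producer: publishes everything up front, never waits on the consumer.
for i in range(5):
    broker.put(f"message-{i}")

# Slow consumer: drains whenever it's ready, in the order produced.
received = []
while not broker.empty():
    received.append(broker.get())

assert received == [f"message-{i}" for i in range(5)]
```

Kafka adds what an in-process queue can't: durable storage of the message log, so many independent consumers can read at their own pace, and partitioning, so the sender never has to pick a receiver node itself.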

## [ Back to Basics - 01 ] Synthetic Minority Over-sampling Technique (SMOTE)

Published:

First article in 2019.

## When GOD Granted That Opportunity: Part 1

Published:

On November 15th, 2018, I promised myself I would write down my journey of accomplishing one of my dreams. This post is the fulfillment of that promise.

## Python-like to Algorithm Specification

Published:

An interesting paper: http://www.phontron.com/paper/oda15ase.pdf


## Examples of Buffer Overflow Attack

Published:

In the earlier section we learned a bit about the buffer overflow technique. The primary concept is flooding the stack frame with input that exceeds the buffer limit so that we can manipulate any data saved on the stack frame. With this technique, an attacker can change the return address so as to call any function they want, change the contents of variables so that a function executes different code, or change a function’s return value.

## What is Buffer Overflow?

Published:

Buffer overflow is a code-exploitation technique that takes advantage of a buffer’s weakness. A buffer, by the way, is a block of memory set aside for saving data.

## Book Summary: How to Win Friends and Influence People — PART 01

Published:

In this article I’ll summarize a book I find quite interesting: How to Win Friends and Influence People, written by Dale Carnegie.

The book consists of several main parts, each containing key principles that serve as building blocks for that part. So let’s start with part 1.

## Smart But Dumb

Published:

This article is a brief summary of an article I found on Medium. The title of the original article is How Come You Know Everything But Do Nothing, written by Anna Asaieva.

## The New One Minute Manager

Published:

I read an interesting management book today titled The One Minute Manager, written by Ken Blanchard, PhD and Spencer Johnson, MD. The book is quite short and the content is straightforward. It tells an easy-to-read story.

## Journey from Novice to Expert

Published:

I read an interesting book titled Pragmatic Thinking and Learning: Refactor Your Wetware by Andy Hunt.

## Do Not Judge People For Being Lazy

Published:

Taking a break is sometimes considered lazy behaviour by today’s society.

## Should You Trust Your First Impression?

Published:

Should you trust your first impression? In my opinion, it depends.

## Johari Window

Published:

To put it simply, the Johari Window is a diagram showing the relationship between a person and others.

## Bad Luck Turns Into Good Luck

Published:

We, as humans, sometimes encounter events that make us aware that everything happens for a reason. Something that happens at one exact moment might later change someone’s life completely.

## Solomon Paradox and Wiser Human Beings

Published:

King Solomon, the third leader of the Jewish Kingdom, is considered a paragon of wisdom. People travelled long distances just to ask for his counsel. However, it’s known that his personal life and character were not in line with the wisdom he showed to other people. This is something of a paradox.