Sitemap

A list of all the posts and pages found on the site. For you robots out there, an XML version is available for digesting as well.

Pages

Posts

Maximum Likelihood Estimation

3 minute read

Published:

If in the probability context we state that P(x1, x2, ..., xn | params) means the probability of getting a set of observations x1, x2, …, and xn given the distribution parameters, then in the likelihood context we get the following.

Retrieving Rows with Duplicate Values on the Columns of Interest in Spark

5 minute read

Published:

There are several ways of removing duplicate rows in Spark. Two of them are distinct() and dropDuplicates(). The former lets us remove rows with the same values on all the columns. Meanwhile, the latter lets us remove rows with the same values on multiple selected columns.

The Legendary Question Six IMO 1988

9 minute read

Published:

The final problem of the International Mathematical Olympiad (IMO) 1988 is considered to be the most difficult problem in the contest.

Data Quality with Apache Griffin Overview

4 minute read

Published:

A few days back I was exploring a big data quality tool called Griffin. There are lots of DQ tools out there, such as Deequ, Target’s data validator, TensorFlow data validator, PySpark Owl, and Great Expectations. There’s another one called Cerberus; however, it doesn’t natively support large-scale data.

Standard Error of Mean Estimate Derivation

4 minute read

Published:

Suppose we conduct K experiments on a kind of measurement. In each experiment, we take N observations. In other words, we’ll have N * K data points at the end.
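The setup above can be sketched as a small simulation. This is only an illustration under stated assumptions: the observations are drawn from a normal distribution, and the names `simulate_means`, `K`, `N`, and `SIGMA` are hypothetical, not taken from the post. It compares the spread of the K sample means with the textbook value sigma / sqrt(N).

```python
import random
import statistics


def simulate_means(k, n, mu=0.0, sigma=1.0, seed=0):
    """Run k experiments of n observations each and return the k sample means."""
    rng = random.Random(seed)  # fixed seed so the run is reproducible
    means = []
    for _ in range(k):
        observations = [rng.gauss(mu, sigma) for _ in range(n)]
        means.append(statistics.fmean(observations))
    return means


# Assumed values, chosen only for illustration.
K, N, SIGMA = 2000, 25, 2.0

means = simulate_means(K, N, sigma=SIGMA)

# Spread of the sample means across the K experiments.
empirical_se = statistics.stdev(means)

# Textbook standard error of the mean for N observations.
theoretical_se = SIGMA / N ** 0.5

print(f"empirical: {empirical_se:.3f}, theoretical: {theoretical_se:.3f}")
```

With enough experiments, the two numbers should be close to each other.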

Tackling Covariate Shift in ML Using ML

2 minute read

Published:

In the previous post I mentioned a simple way of estimating the density ratio of two probability distributions. I decided to create a Python package that provides such functionality.

Adding Strictly Increasing ID to Spark Dataframes

3 minute read

Published:

Recently I was exploring ways of adding a unique row ID column to a dataframe. The requirement is simple: “the row ID should strictly increase with a difference of one and the data order is not modified”.

Incremental Query for Large Streaming Data Operation

4 minute read

Published:

In the previous post, I wrote about how to perform a pandas groupBy operation on a large dataset in a streaming way. The main problem being addressed is keeping memory consumption low, since the data size might be extremely large.

Streaming GroupBy for Large Datasets with Pandas

7 minute read

Published:

I came across an article about how to perform a groupBy operation on a large dataset. Long story short, the author proposes an approach called streaming groupBy, where the dataset is divided into chunks and the groupBy operation is applied to each chunk. This approach is implemented with pandas.
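The chunking idea can be sketched in plain Python. This is a minimal stand-in, not the pandas implementation the article describes: the helper names `iter_chunks` and `streaming_groupby_sum` are hypothetical, and the sketch assumes a sum aggregation, whose per-chunk results can simply be added together.

```python
from collections import defaultdict
from itertools import islice


def iter_chunks(rows, chunk_size):
    """Yield successive lists of at most chunk_size rows from any iterable."""
    iterator = iter(rows)
    while True:
        chunk = list(islice(iterator, chunk_size))
        if not chunk:
            return
        yield chunk


def streaming_groupby_sum(rows, chunk_size):
    """Group (key, value) rows by key and sum the values, one chunk at a time."""
    totals = defaultdict(int)
    for chunk in iter_chunks(rows, chunk_size):
        # Aggregate within the chunk first, so only one chunk is held in memory.
        partial = defaultdict(int)
        for key, value in chunk:
            partial[key] += value
        # Merge the chunk-level result into the running totals.
        for key, subtotal in partial.items():
            totals[key] += subtotal
    return dict(totals)


rows = [("a", 1), ("b", 2), ("a", 3), ("c", 4), ("b", 5)]
result = streaming_groupby_sum(rows, chunk_size=2)
print(result)
```

Aggregations that cannot be combined this way (a median, for example) would need a different merge step.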

Pseudo-distributed LIME via PySpark UDF

2 minute read

Published:

The initial question that popped up in my mind was how to make LIME perform faster. This should be useful when the data to explain is big enough.

The Three-Headed Hound of the Underworld (Kerberos)

6 minute read

Published:

Kerberos is simply a “ticket-based” authentication protocol. It improves on the security of password-based authentication protocols. Since eavesdroppers might intercept the password, Kerberos mitigates this by leveraging a ticket (how it is generated is explained below) that ideally should only be known by the client and the service.

Obfuscation Modes in PyArmor

3 minute read

Published:

I think one of the unique features provided by PyArmor is that it lets users configure how the code is obfuscated.

Obfuscating Python Scripts with PyArmor

11 minute read

Published:

Basically, code obfuscation is a technique used to modify source code so that it becomes difficult to understand but remains fully functional. The main objective is to protect intellectual property and prevent hackers from reverse engineering proprietary source code.

Setting Up & Debugging Airflow On Local Machine

5 minute read

Published:

Airflow is basically a workflow management system. When we’re talking about “workflow”, we’re referring to a sequence of tasks that need to be performed to accomplish a certain goal. A simple example would be an ordinary ETL job, such as fetching data from data sources, transforming the data into formats that meet the requirements, and then storing the transformed data in a data warehouse.

Spark History Server: Setting Up & How It Works

2 minute read

Published:

Application monitoring is critically important, especially when we encounter performance issues. In Spark, one way to monitor an application is via the Spark UI. The problem is that the Spark UI can only be accessed while the application is running.

CAP Theorem

3 minute read

Published:

“Consistency, Availability, and Partition Tolerance” - choose two.

Apache Spark [PART 28]: Accessing a Kerberized HDFS Cluster

1 minute read

Published:

A Spark application deployed to a cluster might need to access an HDFS cluster. To establish a secure connection, one may want to utilize a network authentication protocol, such as Kerberos. Using Kerberos might add a little bit of complexity to the connecting process. In this article I’m going to show you one of the cases encountered by my team and me recently.

Apache Spark [PART 25]: Resolving Attributes Data Inconsistency with Union By Name

1 minute read

Published:

If you read my previous article titled Apache Spark [PART 21]: Union Operation After Left-anti Join Might Result in Inconsistent Attributes Data, it was shown that the attributes data was inconsistent when combining two data frames after inner-join. According to the article, the solution is really simple. We just need to reorder the attributes by using the select command. Here’s a simple example.

Apache Spark [PART 24]: A Little Bit Complicated Cumulative Sum

4 minute read

Published:

Suppose you have a dataframe consisting of several columns, such as the following:

  • A: group indicator -> call it A value
  • B: has two different categories (b0 and b1) -> call it B value
  • C: let’s assume it contains integers -> call it C value
  • D: date and time -> call it D value
  • E: timestamp -> call it E value

Apache Cassandra: Begins with Docker

2 minute read

Published:

This article is about how to install Cassandra and play with several of its query languages. To accomplish that, I’m going to utilize Docker.

[MATHS] The Infinite Hotel Paradox by David Hilbert

2 minute read

Published:

Recently I watched a YouTube video about the infinite hotel paradox, which was introduced in the 1920s by a German mathematician, David Hilbert. In case you’re curious about the video, just search on YouTube using the keyword “The Infinite Hotel Paradox”.

[Maths] Infinitely Many Prime Numbers by Euclid

1 minute read

Published:

To me, prime numbers are really interesting in terms of their position as the building blocks of other numbers. According to the Fundamental Theorem of Arithmetic, every integer N greater than 1 can be written as a product of P1, P2, P3, …, and Pk where Pi are all prime numbers.
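As a small illustration of that statement, here is a sketch that factors an integer into primes by trial division. The function name `prime_factors` is hypothetical and trial division is just one simple way to find the factors; it is not taken from the post.

```python
import math


def prime_factors(n):
    """Return the prime factors of n (with repetition) in non-decreasing order."""
    if n < 2:
        raise ValueError("n must be an integer greater than 1")
    factors = []
    divisor = 2
    # Any composite n has a prime factor no larger than sqrt(n).
    while divisor * divisor <= n:
        while n % divisor == 0:
            factors.append(divisor)
            n //= divisor
        divisor += 1
    # Whatever remains above 1 is itself prime.
    if n > 1:
        factors.append(n)
    return factors


factors = prime_factors(360)
print(factors)             # [2, 2, 2, 3, 3, 5]
print(math.prod(factors))  # 360
```

Multiplying the returned factors back together recovers the original number.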

[Maths] Riemann Hypothesis and One Question in My Mind

7 minute read

Published:

Yesterday I came across an interesting maths paper discussing the Riemann hypothesis. Regarding the concept itself, there’s lots of maths but I think I enjoyed the reading. Frankly speaking, although mathematics is one of my favourite subjects, I’ve rarely played with it (esp. pure maths) since I got acquainted with the world of AI and big data engineering. Now I think it’s just fine to play with it again. Just for fun.

Mastering Spark [PART 17]: Repartitioning Input Data Stream

3 minute read

Published:

Recently I played with a simple Spark Streaming application. Precisely, I investigated the behavior of repartitioning at different levels of input data streams. For instance, we have two input data streams, such as linesDStream and wordsDStream. The question is, is the repartitioning result different if I repartition after linesDStream and after wordsDStream?

Mastering Spark [PART 16]: How to Check the Size of a Dataframe?

1 minute read

Published:

Have you ever wondered how the size of a dataframe can be discovered? Perhaps it doesn’t sound like such a fancy thing to know, yet I think there are certain cases that require us to know the size of our dataframe in advance. One of them is when we want to apply a broadcast operation. As you might already know, broadcasting requires the dataframe to be small enough to fit in memory in each executor. This implicitly means that we should know the size of the dataframe beforehand in order for broadcasting to be applied successfully. Just FYI, Spark lets us configure the maximum size of a dataframe that can be pushed into each executor. Precisely, this maximum size can be configured via spark.conf.set("spark.sql.autoBroadcastJoinThreshold", MAX_SIZE).

Mastering Spark [PART 14]: Effects of Shuffling on RDDs and Dataframes Partitioning

8 minute read

Published:

In Spark, data shuffling simply means data movement. On a single machine with multiple partitions, data shuffling means that data moves from one partition to another. Across multiple machines, data shuffling can involve two kinds of movement. In the first, data moves from one partition (A) to another partition (B) within the same machine (M1), while in the second, data moves from partition B to another partition (C) on a different machine (M2). Data in partition C might then be moved again to a partition on yet another machine (M3).

Mastering Spark [PART 11]: Too Lazy to Process the Whole Dataframe

2 minute read

Published:

One of the characteristics of Spark that makes me interested in exploring this framework further is its lazy evaluation approach. Simply put, Spark won’t execute the transformations until an action is called. I think it’s logical: when we only specify the transformation plan and don’t ask Spark to execute it, why should it force itself to do the computation on the data? In addition, by implementing this lazy evaluation approach, Spark might be able to optimize the logical plan. The work of manually making the query more efficient might be reduced significantly. Cool, right?

Mastering Spark [PART 10]: Lightning Fast Pandas UDF

5 minute read

Published:

Spark user-defined functions (UDFs) are simply functions created to overcome performance problems when you want to process a dataframe. They’re useful when your Python functions are too slow to process a dataframe at large scale. When you use a Python function, it processes the dataframe one row at a time, meaning that the processing is executed sequentially. Meanwhile, if you use a Spark UDF, Spark distributes the dataframe and the UDF to the provided executors. Hence, the dataframe processing is executed in parallel. For more information about Spark UDFs, please take a look at this post.

Mastering Spark [PART 09]: An Optimized Approach for Multiple Dataframe Columns Operation

8 minute read

Published:

I came across an interesting problem when playing with ensemble learning. For those who don’t know about ensemble learning, it’s simply a machine learning approach that combines several weak classifiers to derive the final result. One of the simplest examples is the random forest algorithm. In a random forest, each tree learns different parts (features and data points) of the dataset. When predicting a new data point, each tree gives a vote for its class of choice. The final class is the one that receives the majority of the trees’ votes.
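The voting step described above can be sketched in a few lines. This is only an illustration: the function name `majority_vote` and the example labels are hypothetical, and the list of votes stands in for the per-tree predictions that a real random forest would produce.

```python
from collections import Counter


def majority_vote(votes):
    """Return the class label that received the most votes.

    Ties are resolved in favour of the label that appeared first.
    """
    if not votes:
        raise ValueError("at least one vote is required")
    label, _count = Counter(votes).most_common(1)[0]
    return label


# Hypothetical predictions from five trees for a single data point.
tree_votes = ["spam", "ham", "spam", "spam", "ham"]
final_class = majority_vote(tree_votes)
print(final_class)  # spam
```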

Implementing Balanced Random Forest via imblearn

3 minute read

Published:

Have you ever heard of the imblearn package? Based on its name, I think people who are familiar with machine learning are going to presume that it’s a package specifically created for tackling the problem of imbalanced data. If you do a deeper search, you’re going to find its GitHub repository here. And yes, once again, it’s a Python package for playing with imbalanced data.

Mastering Spark [PART 04]: Accumulator

1 minute read

Published:

A few days ago I conducted a little experiment on Spark’s RDD operations. One of them was the foreach operation (which is an action). Simply put, this operation is applied to each row in the RDD, and the kind of operation applied is specified via a certain function. Here’s a simple example:

Mastering Kafka [PART 01]: WTF is Kafka? A High-level Overview

7 minute read

Published:

Basically, you can think of Kafka as a messaging system. When an application sends a message to another application, one thing they need to do is to specify how to send the message. The most obvious use case for a messaging system, in my opinion, is when we’re dealing with big data. For instance, a sender application shares a large amount of data that needs to be processed by a receiver application. However, the processing rate of the receiver is lower than the sending rate. Consequently, the receiver might be overloaded, since it’s unable to receive more messages while the processing is running. Even if we use distributed receivers, we still have to tell the sender which receiver node it should send the message to.

When GOD Granted That Opportunity: Part 1

8 minute read

Published:

On November 15th, 2018, I promised myself I would write down my journey of accomplishing one of my dreams. This post is the realization of that word.

Level 2. Hi~

14 minute read

Published:

Primary purpose:

Buffer Lab

5 minute read

Published:

Purpose:

Examples of Buffer Overflow Attack

4 minute read

Published:

In the earlier section we learnt a bit about the buffer overflow technique. The primary concept is flooding the stack frame with input exceeding the buffer limit so that we can manipulate any data saved on the stack frame. Some things that can be done using this technique are changing the return address so that attackers can call any function they want, changing the content of variables so that the function executes the corresponding code, or changing the return value of a function.

Stack Frame

5 minute read

Published:

To discuss the stack frame, we’ll look at it from the assembly language point of view.

What is Buffer Overflow?

1 minute read

Published:

Buffer overflow is a code exploitation technique that takes advantage of a buffer’s weakness. A buffer is a block of memory used for storing data.

nontechnical

Book Summary: How to Win Friends and Influence People — PART 01

7 minute read

Published:

In this article I’ll write a summary of a book that I find quite interesting. The title of the book is How to Win Friends and Influence People, written by Dale Carnegie.

The book consists of several main parts in which there are some key principles that become the building blocks of the main part. So let’s start with part 1.

The New One Minute Manager

5 minute read

Published:

I read an interesting management book today titled The One Minute Manager, written by Ken Blanchard, PhD and Spencer Johnson, MD. The book is quite short and the content is straightforward. It really provides an easy-to-read story.

Journey from Novice to Expert

7 minute read

Published:

I read an interesting book titled Pragmatic Thinking and Learning: Refactor Your Wetware by Andy Hunt.

Johari Window

less than 1 minute read

Published:

To explain it simply, Johari Window is a diagram showing relationships between a person and others.

Bad Luck Turns Into Good Luck

1 minute read

Published:

We, as humans, sometimes encounter events that lead us to realize that everything happens for a reason. Something that happens at the exact time might later change someone’s life completely.

Solomon Paradox and Wiser Human Beings

2 minute read

Published:

King Solomon, the third leader of the Jewish Kingdom, is considered the paragon of wisdom. People travelled a long way just to ask for his counsel. However, it’s known that his personal life and character were not in line with the wisdom he showed to other people. This is somewhat of a paradox.

The Greatest Salesman in the World by Og Mandino

5 minute read

Published:

One of the books that I read in the first week of this year was The Greatest Salesman in the World by Og Mandino. Basically, the primary content is about the fundamental principles of being a great salesman.

publications

research

talks

PyCon Indonesia 2017: Part-of-Speech Tagger for Bahasa Indonesia Using Hidden Markov Model and Viterbi Algorithm

Published:

Each word in a sentence has its own word class. In Natural Language Processing, the word class is also known as part of speech (POS). Some examples of word classes are noun, verb, adverb, adjective, and so on. The word class denotes the role of a word in a sentence. Moreover, the sequence of word classes builds the structure of a sentence. For instance, a sentence may have a general structure such as the sequence of noun, verb, and noun.

PyCon Philippines 2018: Introduction to the Natural Language Processing with Python

Published:

Primarily, this talk is intended for NLP enthusiasts who would like to grasp the basic concepts of NLP and the way of implementing NLP tasks using Python. Therefore, this tutorial does not only cover the concepts of NLP, but also provides the practical side by using the Python programming language. Python can do the basic tasks of NLP, but it cannot handle the standard tasks of NLP. Therefore, a special library called the Natural Language Toolkit (NLTK) is needed to solve this problem. This library provides functions, wrappers, and sample corpora that can be used to support NLP research. By the end of this talk, hopefully the audience will have gained some insight into the basic concepts of NLP. Moreover, they will be able to utilize NLTK to assist their work.

PyCon Italy 2018: Part-of-Speech Tagger for Bahasa Indonesia Using Hidden Markov Model and Viterbi Algorithm

Published:

Each word in a sentence has its own word class. In Natural Language Processing, the word class is also known as part of speech (POS). Some examples of word classes are noun, verb, adverb, adjective, and so on. The word class denotes the role of a word in a sentence. Moreover, the sequence of word classes builds the structure of a sentence. For instance, a sentence may have a general structure such as the sequence of noun, verb, and noun.