Improving Attributes Management for Object Relational Mapper (ORM) Use Case


Let’s take a simple data management scenario.

You have a dataframe with a very large number of columns. Two modules manage those columns, and a third module needs them for further operations. Specifically, these modules are an ORM class, a configuration file, and a repository class.

The first module (the ORM) is obvious: it's simply a class whose attributes represent the column names of the dataframe (or of the corresponding table in the database). The second one (the configuration file) stores all the columns owned by the dataframe; anyone who wants to add new columns or rename existing ones needs to work on this file. Meanwhile, the last one (the repository) uses the columns for database-related operations, such as upserting data into tables.

For the sake of clarity, let’s take a simple example.

File: TableORM.py

from sqlalchemy import Column, Integer, String
from sqlalchemy.ext.declarative import declarative_base

Base = declarative_base()


class TableORM(Base):
	__tablename__ = "my_table"

	field_a = Column(String, primary_key=True)
	field_b = Column(String)
	field_c = Column(Integer)
	field_d = Column(String)
	# ... many more columns ...
	field_final = Column(String)

File: TableConfig.py

def get_all_columns():
	return [
		"field_a", "field_b", "field_c", "field_d",
		# ... many more columns ...
		"field_final",
	]

File: TableRepository.py

import abc

from pandas import DataFrame


class TableRepository(object):
	__metaclass__ = abc.ABCMeta

	def __init__(self):
		self._columns = [
			"field_a", "field_b", "field_c",
			# ... many more columns ...
			"field_final",
		]

	@abc.abstractmethod
	def upsert(self, df: DataFrame):
		pass

Here, if someone wants to add a new column called 'field_new' and rename the column 'field_a' to 'field_a_renamed', these changes must be applied to the list in TableConfig.py and to self._columns in TableRepository.py. In other words, the get_all_columns method should look like the following:

def get_all_columns():
	return [
		"field_a_renamed", "field_b", "field_c", "field_d",
		# ... many more columns ...
		"field_final", "field_new",
	]

File: TableRepository.py

import abc

from pandas import DataFrame


class TableRepository(object):
	__metaclass__ = abc.ABCMeta

	def __init__(self):
		self._columns = [
			"field_a_renamed", "field_b", "field_c",
			# ... many more columns ...
			"field_final", "field_new",
		]

	@abc.abstractmethod
	def upsert(self, df: DataFrame):
		pass

The problem is that, since the dataframe is stored in a database, we also need to update the columns in the ORM class. This ensures data consistency and prevents data loss.
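For completeness, here is a sketch of what TableORM.py would look like after that change (the column type of field_new is an assumption made purely for illustration):

class TableORM(Base):
	__tablename__ = "my_table"

	field_a_renamed = Column(String, primary_key=True)
	field_b = Column(String)
	field_c = Column(Integer)
	field_d = Column(String)
	# ... many more columns ...
	field_final = Column(String)
	field_new = Column(String)  # assumed type, for illustration only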

Well, it's not a big problem when the dataframe only has a handful of columns. But what if the dataframe has a vast number of columns? When adding lots of new columns, we need to manually type out the variable names along with their Column properties. Likewise, modifying lots of existing columns (their variable names or Column properties) is a tedious and error-prone task. In simpler terms, we have to modify several modules whenever the column data changes.

So, how can we simplify this kind of data management?

Well, I recently ran into this exact scenario and decided to refactor the column management approach. Long story short, here are the modified modules.

File: TableORM.py

from sqlalchemy import Column, String, Table
from sqlalchemy.ext.declarative import declarative_base

Base = declarative_base()


class TableORM(Base):
	__tablename__ = "my_table"
	__table__ = Table(__tablename__, Base.metadata, Column("field_a", String, primary_key=True))

File: TableConfig.py

(No modification needed.)

def get_all_columns():
	return [
		"field_a", "field_b", "field_c", "field_d",
		# ... many more columns ...
		"field_final",
	]

File: TableRepository.py

import abc

from pandas import DataFrame
from sqlalchemy import Column, Table

from TableConfig import get_all_columns
from TableORM import Base, TableORM

# DATA_SCHEMA maps each column name to a data type label, and
# ORM_SCHEMA_MAPPING maps that label to a SQLAlchemy column type;
# both are assumed to be defined elsewhere (they are not shown here).


class TableRepository(object):
	__metaclass__ = abc.ABCMeta

	def __init__(self):
		self._columns = get_all_columns()

		orm_columns = []
		for column in self._columns:
			if column != "field_a":
				column_type = ORM_SCHEMA_MAPPING[DATA_SCHEMA[column]]
				orm_columns.append(Column(column, column_type))

		# rebuild the ORM's table definition so it contains every configured column
		TableORM.__table__ = Table("my_table", Base.metadata, *orm_columns, extend_existing=True)

	@abc.abstractmethod
	def upsert(self, df: DataFrame):
		pass

As you can see, I added a conditional check (if column != "field_a"). This was needed because we can't define the same column name with different Column properties: the column defined in TableORM.py is a primary key, while the one that would be built in TableRepository.py is not.
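DATA_SCHEMA and ORM_SCHEMA_MAPPING aren't shown above, so here is a minimal sketch of what they could look like; the type labels and entries are assumptions for illustration only:

# Hypothetical sketch: DATA_SCHEMA maps a column name to a data type label,
# and ORM_SCHEMA_MAPPING maps that label to a SQLAlchemy column type.
from sqlalchemy import Integer, String

DATA_SCHEMA = {
	"field_b": "string",
	"field_c": "integer",
	"field_d": "string",
	# ... many more columns ...
	"field_final": "string",
}

ORM_SCHEMA_MAPPING = {
	"string": String,
	"integer": Integer,
}

# With these, ORM_SCHEMA_MAPPING[DATA_SCHEMA["field_c"]] yields Integer,
# which the repository then wraps in Column("field_c", Integer).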

Perhaps you're asking: why didn't I remove the primary key from TableORM.py and add an extra check in TableRepository.py to determine whether a column should have the primary key property?

I tried that, and I ended up with an error message telling me that a primary key is required in the ORM class.

In addition, I think it's better to define columns with specific properties directly in the ORM class; it minimises the extra checks needed in TableRepository.py.
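To make that concrete, handling the primary key inside the repository would require something like the following extra check (PRIMARY_KEYS is a hypothetical name, shown only to illustrate the idea):

# Hypothetical sketch of the extra check I wanted to avoid: the repository
# would have to know which columns are primary keys.
PRIMARY_KEYS = {"field_a"}

orm_columns = []
for column in get_all_columns():
	column_type = ORM_SCHEMA_MAPPING[DATA_SCHEMA[column]]
	is_primary_key = column in PRIMARY_KEYS
	orm_columns.append(Column(column, column_type, primary_key=is_primary_key))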

Another problem occurred, actually.

The __init__ method in TableRepository.py above is not the first version of the modification. Initially, I implemented it using a slightly different approach, like the following.

File: TableRepository.py

import abc

from pandas import DataFrame

from TableConfig import get_all_columns
from TableORM import TableORM


class TableRepository(object):
	__metaclass__ = abc.ABCMeta

	def __init__(self):
		self._columns = get_all_columns()

		for column in self._columns:
			if column != "field_a":
				column_type = ORM_SCHEMA_MAPPING[DATA_SCHEMA[column]]

				# assign the attribute named `column` on the ORM class to column_type
				setattr(TableORM, column, column_type)

	@abc.abstractmethod
	def upsert(self, df: DataFrame):
		pass

The above approach used a different version of TableORM.py, like the following.

File: TableORM.py

from sqlalchemy import Column, String
from sqlalchemy.ext.declarative import declarative_base

Base = declarative_base()


class TableORM(Base):
	__tablename__ = "my_table"

	field_a = Column(String, primary_key=True)

The two versions of the __init__ method rely on different implementations of ORM_SCHEMA_MAPPING.
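That other implementation isn't shown either, so here is one possibility, purely as an illustration: the setattr-based version would need the mapping to yield ready-made Column objects rather than bare types (a fresh Column per call, because a single Column instance can only be attached to one table):

# Hypothetical sketch for the setattr-based version: each lookup produces a
# new Column object instead of a bare SQLAlchemy type.
from sqlalchemy import Column, Integer, String

ORM_SCHEMA_MAPPING = {
	"string": lambda: Column(String),
	"integer": lambda: Column(Integer),
}

# The loop in __init__ would then call the factory before assigning it:
# setattr(TableORM, column, ORM_SCHEMA_MAPPING[DATA_SCHEMA[column]]())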

Using setattr is a straightforward approach. However, I encountered an issue where not all of the added or modified columns were committed to the database table.

Based on that issue, I did a quick recheck of the code. By inspecting TableORM.__table__.columns, I found that the table didn't have all of the columns specified earlier.
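The check itself is a one-liner; something like this prints whatever columns the mapped table actually ended up with:

# List the column names SQLAlchemy currently knows about for the ORM's table.
print([column.name for column in TableORM.__table__.columns])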

I think that's a good place to end this article.

Thank you for reading.