Designing with dataclasses
Assumed audience: Python programmers who aren’t in the habit of writing classes.
Python
dictionaries are
available without an import and extremely flexible, which means many Python programmers
default to representing data as a dict. Here’s why and when you should use
dataclasses
instead.
Note: I’m using
dataclasshere since it’s in the standard library. If you’re already using a similar 3rd-party library like the excellent attrs the advice here still applies, just replace uses ofdataclasswith that library.
What is a dataclass?
If you’re already familiar with dataclasses, skip ahead to the next section.
dataclass is a class
decorator which automatically
generates
special methods
like __init__ and __eq__, making for more concise class definitions. For instance,
this class declaration:
class Order:
def __init__(self, item_id: str, customer_id: str, amount: int):
self.item_id = item_id
self.customer_id = customer_id
self.amount = amount
def __eq__(self, other):
return (
self.item_id == other.item_id
and self.customer_id == other.customer_id
and self.amount == other.amount
)
can be replaced with:
from dataclasses import dataclass
@dataclass
class Order:
item_id: str
customer_id: str
amount: int
Why use a dataclass instead a dict?
Data classes have a few distinct advantages over dictionaries.
Readability
First, a dataclass can be more readable than a dict. When you see a dataclass like
Order, reading its definition tells you which fields it contains 1. On the other
hand, items can be added or removed from a dict at various points in the code, which
means you have to potentially read through much more code to know the shape of the data.
While this can be avoided with discipline (for instance, you can avoid inserting new
items into a dict after it’s instantiated), dataclass helps enforce this discipline
automatically.
Error checking & debugging
Representing data as a dataclass also makes debugging faster. For example, using the
same Order class as before, if you forgot to provide customer_id when instantiating,
it raises an error with the exact line where you forgot to provide the customer_id:
order = Order(item_id="i1435", amount=10)
----> 1 Order(item_id="i2345", amount=10)
TypeError: Order.__init__() missing 1 required positional argument: 'customer_id'.
However, if we represented the same data as a dict, this would not raise an error:
order = {
"item_id": "i1435",
"amount": 10,
}
If "customer_id" is accessed somewhere downstream,
customer = order["customer_id"]
you get a KeyError: 'customer_id' and you’re left backtracking through the code to
find where you forgot to add 'customer_id'.
Dataclasses also work well with type checkers like mypy. Since they encourage annotating each field with types, code using dataclasses can be type checked with very little extra effort.
When should you use a dataclass instead of a dict?
Leveraging dataclasses’ strengths requires knowing the structure of your data ahead of
time. So, lean towards using a dataclass when your data has a fixed structure known at
design time and access fields by hardcoded names throughout the codebase.
On the other hand, you should still use a dict if you want to loop over the keys
and/or values (dicts provide several facilities that make this convenient), especially
if the values are of a homogeneous type (for instance, if all the values in the dict
are floats), or if you aren’t accessing values by hardcoded names.
Case study
Let’s see how these heuristics apply in a larger program.
We have a function, upload_directory, which uploads a directory of text files to S3.
Each file’s object key in S3 will be {id}/{start_timestamp}/{session_name}. The data
used for this key is stored on the first line of each file in this format:
# id=53,started_at=2021-01-02T11:30:00Z,session_name=daring_foolion
import os
import boto3
def upload_directory(directory, s3_bucket):
headers_by_file = _get_headers(directory)
metadata_by_file = _parse_headers(headers_by_file)
s3_key_by_file = _build_s3_keys(metadata_by_file)
_upload_to_s3(s3_bucket, s3_key_by_file)
def _get_headers(directory):
headers = {} # (1)
for file_name in os.listdir(directory):
file_path = os.path.join(directory, file_name)
with open(file_path, "r") as f:
headers[file_path] = f.readline()
return headers
def _parse_headers(headers):
metadata_by_file = {}
for file_path, header in headers.items(): # (2)
header = header.removeprefix("# ")
pairs = header.split(",")s
metadata = {} # (3)
for key_value in pairs:
key, value = key_value.split("=")
metadata[key] = value
metadata_by_file[file_path] = metadata
return metadata_by_file
def _build_s3_keys(metadata_by_file):
object_keys = {}
for filepath, metadata in metadata_by_file.items():
recorder = metadata["id"] # (3)
started_at = metadata["started_at"]
session_name = metadata["session_name"]
object_keys[filepath] = f"{recorder}/{session_name}_{started_at}"
return object_keys
def _upload_to_s3(s3_bucket, s3_key_by_file):
s3_client = boto3.client("s3")
for filepath, s3_key in s3_key_by_file.items():
s3_client.upload_file(filepath, s3_bucket, s3_key)
The use of a dict for headers in (1) is appropriate: we don’t access or set any of
its items through hard-coded key names, and we loop over all the headers downstream in
parse_headers() (2). However, the dict in (3) fails our heuristics: we access items
through hard-coded key names downstream in _build_s3_keys() (4).
Here’s the same script after re-writing (3) to use a dataclass:
import os
from dataclasses import dataclass
import boto3
def upload_directory(directory, s3_bucket):
headers_by_file = _get_headers(directory)
metadata_by_file = _parse_headers(headers_by_file)
s3_key_by_file = _build_s3_keys(metadata_by_file)
_upload_to_s3(s3_bucket, s3_key_by_file)
def _get_headers(directory):
headers = {}
for file_name in os.listdir(directory):
file_path = os.path.join(directory, file_name)
with open(file_path, "r") as f:
headers[file_path] = f.readline()
return headers
@dataclass
class RecordingMetadata:
recorder_id: int
started_at: str
session_name: str
def _parse_headers(headers_by_file):
metadata_by_file = {}
for file_path, header in headers_by_file.items():
header = header.removeprefix("# ")
pairs = header.split(",")
metadata = {}
for key_value in pairs:
key, value = key_value.split("=")
metadata[key] = value
metadata_by_file[file_path] = RecordingMetadata(
recorder_id=metadata["id"],
started_at=metadata["started_at"],
session_name=metadata["session_name"],
)
return metadata_by_file
def _build_s3_keys(metadata_by_file):
object_keys = {}
for filepath, metadata in metadata_by_file.items():
object_keys[filepath] = (
f"{metadata.recorder_id}/{metadata.session_name}_{metadata.started_at}"
)
return object_keys
def _upload_to_s3(s3_bucket, s3_key_by_file):
s3_client = boto3.client("s3")
for filepath, s3_key in s3_key_by_file.items():
s3_client.upload_file(filepath, s3_bucket, s3_key)
The readability benefits are more obvious with type hints:
def upload_directory(directory: os.PathLike, s3_bucket: str):
headers_by_file = _get_headers(directory)
metadata_by_file = _parse_headers(headers_by_file)
s3_key_by_file = _build_s3_keys(metadata_by_file)
_upload_to_s3(s3_bucket, s3_key_by_file)
@dataclass
class RecordingMetadata:
recorder_id: int
started_at: str
session_name: str
def _get_headers(directory: os.PathLike) -> dict[str, str]:
headers = {}
for file_name in os.listdir(directory):
file_path = os.path.join(directory, file_name)
with open(file_path, "r") as f:
headers[file_path] = f.readline()
return headers
def _parse_headers(headers: dict[str, str]) -> dict[str, RecordingMetadata]:
metadata_by_file = {}
for file_path, header in headers.items():
header = header.removeprefix("# ")
pairs = header.split(",")
metadata = {}
for key_value in pairs:
key, value = key_value.split("=")
metadata[key] = value
metadata_by_file[file_path] = RecordingMetadata(
recorder_id=metadata["id"],
started_at=metadata["started_at"],
session_name=metadata["session_name"],
)
return metadata_by_file
def _build_s3_keys(metadata_by_file: dict[str, RecordingMetadata]) -> dict[str, str]:
object_keys = {}
for filepath, metadata in metadata_by_file.items():
object_keys[filepath] = (
f"{metadata.recorder_id}/{metadata.session_name}_{metadata.started_at}"
)
return object_keys
def _upload_to_s3(s3_bucket: str, s3_key_by_file: dict[str, str]):
s3_client = boto3.client("s3")
for filepath, s3_key in s3_key_by_file.items():
s3_client.upload_file(filepath, s3_bucket, s3_key)
When should you break these rules?
As always, there are cases where it’s OK to break the rules a little.
One of them is calling functions that takes a dict as a parameter, or returns one.
This is common when serializing or de-serializing data, like in the standard library’s
json module. If you’re building the data
in the same function where it’s used, it’s OK to just use a dict, even if there are
hard-coded keys.
Another one is performance. While accessing a dataclass attribute is only slightly
slower than accessing a key in a dict, instantiating a dataclass is ~5x slower than
creating a dict2. So, if you’re instantiating tens of thousands of dataclasses and
you’ve determined it’s a bottleneck, you can use dicts instead.
In both cases, if you’re using a type checker, you can annotate your code with TypedDicts to regain some readability and error checking.
-
This is not a guarantee - Python is very flexible, and most object attributes can be added or changed at any time. For instance, unless
slots=Trueis passed to@dataclass, you can assign attributes not defined in the original dataclass.slots=Truealso makes the class more memory-efficient! ↩︎