Python/Pydantic Pitfalls
Someone please throw me a rope
I'm going to be focusing on pydantic
in this post, since that's what I know
best, but reading through this discussion and having
glanced at some of the other serialization frameworks, they seem to have similar
problems or otherwise look awful to use. If you're trying to use FastAPI, you're
locked into pydantic anyway. We'll be trying to interact with the
following JSON schema, and let's say we need to support PersonId
(just the
ID), Person
(rest of the data without the ID), and PersonWithId
(data plus
ID, shown here) in the code for dictionary reasons:
{
  "title": "PersonWithId",
  "type": "object",
  "properties": {
    "firstName": {
      "title": "Firstname",
      "type": "string"
    },
    "lastName": {
      "title": "Lastname",
      "type": "string"
    },
    "age": {
      "title": "Age",
      "type": "integer"
    },
    "id": {
      "title": "Id",
      "type": "string",
      "format": "uuid"
    }
  },
  "required": [
    "firstName",
    "age",
    "id"
  ]
}
This seems pretty straightforward, so let's try defining some models:
class PersonId(BaseModel):
    # We have to do this because `id` is taken in the global namespace
    which: UUID = Field(..., alias="id")

class Person(BaseModel):
    first_name: str = Field(..., alias="firstName")
    last_name: Optional[str] = Field(..., alias="lastName")
    age: int = Field(...)

class PersonWithId(PersonId, Person):
    pass
Cool, now some code to construct a value and convert it to JSON:
person_id = PersonId(which=uuid4())
person = Person(first_name="Charles", age=23)
person_with_id = PersonWithId(**person.dict(), **person_id.dict())
print(person_with_id.json())
Time to show the programming world what language is boss:
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/home/charles/science/python/pydantic-sucks/pydantic_sucks/main.py", line 28, in main
    person_id = PersonId(which=uuid4())
  File "pydantic/main.py", line 331, in pydantic.main.BaseModel.__init__
pydantic.error_wrappers.ValidationError: 1 validation error for PersonId
id
  field required (type=value_error.missing)
Oh, nevermind, I guess. That's weird though, I set which=uuid4()
like right
there, what do you mean it's not present? Apparently,
you must explicitly tell pydantic
that you'd like to be able to populate
fields by their name. What!? Let's spam Config
classes everywhere to fix it:
class PersonId(BaseModel):
    which: UUID = Field(..., alias="id")
+    class Config:
+        allow_population_by_field_name = True
+
class Person(BaseModel):
    first_name: str = Field(..., alias="firstName")
    last_name: Optional[str] = Field(..., alias="lastName")
    age: int = Field(...)
+    class Config:
+        allow_population_by_field_name = True
+
class PersonWithId(PersonId, Person):
    pass
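To see what that flag is actually buying you, here's a stdlib-only toy — emphatically not pydantic's real implementation, just a sketch of the observable behavior: with the flag off, the constructor only recognizes the alias; with it on, either name works.

```python
# Toy sketch of pydantic-v1-style alias handling; not pydantic's actual code.
class ToyPersonId:
    ALIASES = {"which": "id"}  # field name -> alias

    def __init__(self, allow_population_by_field_name=False, **data):
        for name, alias in self.ALIASES.items():
            if alias in data:
                # The alias is always accepted.
                setattr(self, name, data[alias])
            elif allow_population_by_field_name and name in data:
                # The actual field name is only accepted with the flag on.
                setattr(self, name, data[name])
            else:
                raise ValueError(f"{alias}: field required")

ToyPersonId(id="abc")                                          # always fine
ToyPersonId(allow_population_by_field_name=True, which="abc")  # needs the flag
```

Without the flag, `ToyPersonId(which="abc")` fails exactly the way our real model did above: the constructor never even looks for the field's own name.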
Let's try running the code again:
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/home/charles/science/python/pydantic-sucks/pydantic_sucks/main.py", line 29, in main
    person = Person(first_name="Charles", age=23)
  File "pydantic/main.py", line 331, in pydantic.main.BaseModel.__init__
pydantic.error_wrappers.ValidationError: 1 validation error for Person
lastName
  field required (type=value_error.missing)
Great. Why is this happening? I specified the type for last_name
to be
Optional[T]
and the docs say it's aware of that type.
Scouring the docs for a while, we learn that
pydantic
actually distinguishes between "missing field" and "field is set to
None
/null
" for some reason. Whatever, let's fix it:
class Person(BaseModel):
    first_name: str = Field(..., alias="firstName")
-    last_name: Optional[str] = Field(..., alias="lastName")
+    last_name: Optional[str] = Field(None, alias="lastName")
    age: int = Field(...)
And run our code again:
{
  "first_name": "Charles",
  "last_name": null,
  "age": 23,
  "which": "f49be13a-63f1-4ef6-b8f7-b948a32836ed"
}
Hooray, no more errors! Except wait, this doesn't look like our JSON schema at all! Why are they being serialized by their field names instead of the aliases? Y'know, the aliases I added for the express purpose of appearing in the JSON because the JSON field names are (mostly) (stylistically and sometimes semantically) invalid Python identifiers? The docs say you have to explicitly ask for them to be serialized by their aliases. Why though? If this isn't the default, what are the aliases even for? Whatever, let's try it:
-print(person_with_id.json())
+print(person_with_id.json(by_alias=True))
And the output:
{
  "firstName": "Charles",
  "lastName": null,
  "age": 23,
  "id": "41501697-8ed7-4d2c-8bd0-47aea0d5cd92"
}
We've finally done it. I should be excited to be getting this working but at
this point I'm actually just exhausted. Plus I've still got some questions.
There are still two calls to some_model.dict()
that don't take
by_alias=True
, but the code appears to work anyway. How am I supposed to
remember when it's required to add by_alias=True
and when it would break
things if I did add it? I think the blame for this issue lies partly with Python
itself, for baking the concept of a constructor into the language and for
allowing everything to be represented as dicts. (More on that later.) Also, what does pyright
think
about our code now? Let's find out:
error: No parameter named "which" (reportGeneralTypeIssues)
error: Argument missing for parameter "id" (reportGeneralTypeIssues)
error: No parameter named "first_name" (reportGeneralTypeIssues)
error: Arguments missing for parameters "firstName", "lastName" (reportGeneralTypeIssues)
A little digging suggests that pyright
is now expecting you to initialize the
fields based on their aliases instead of their actual names. Again, this seems
extremely backwards. How can I fix this without using (stylistically or
semantically) illegal identifiers? This discussion and
this comment specifically suggest that pyright
is
thinking too hard about what identifier a field name should be populated by. To
be fair to it, expressing that a field must be initialized by exactly one of two
names is a very weird constraint for a typechecker to handle in a generalized
fashion. Oh, was it unclear that you can still populate the fields by both names?
Because yeah, you can do that. So anyway, how do I appease pyright
? It seems like the
only way to do so is to add # type: ignore
everywhere. Remember, doing # pylint: disable=invalid-name
is not an option because some JSON field names are stylistically or semantically
invalid Python names, such as id or anything using hyphens
for word separators. So, let's clutter the place with sad comments:
-person_id = PersonId(which=uuid4())
-person = Person(first_name="Charles", age=23)
+person_id = PersonId(which=uuid4()) # type: ignore
+person = Person(first_name="Charles", age=23) # type: ignore
Now pyright
, pylint
, and the Python interpreter (in the case of semantically
illegal names) are all satisfied and don't give us any issues. But at what cost?
This code now typechecks perfectly fine but will detonate at runtime:
-person = Person(first_name="Charles", age=23) # type: ignore
+person = Person(first_name="Charles", age="lol") # type: ignore
And this code typechecks fine and, somewhat surprisingly, works at runtime too. This is not very Parse, Don't Validate and thus is almost certainly prone to causing problems in the future:
-person = Person(first_name="Charles", age=23) # type: ignore
+person = Person(first_name="Charles", age="23") # type: ignore
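That lenient coercion behaves essentially like Python's `int()` constructor (which is presumably why `"23"` slips through while `"lol"` blows up); a quick stdlib illustration of the same asymmetry:

```python
# Lenient int handling a la pydantic v1 is roughly int() applied to the input.
print(int("23"))   # a numeric string quietly becomes the integer 23

try:
    int("lol")     # a non-numeric string only fails at runtime
except ValueError as e:
    print("rejected:", e)
```

Nothing in the type system distinguishes these two calls; both pass the typechecker and only diverge when executed.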
This code typechecks and runs fine, but omits the extraneous value in the output that we potentially wanted to show up there:
-person = Person(first_name="Charles", age=23) # type: ignore
+person = Person(first_name="Charles", age=23, fingers=10) # type: ignore
We can at least configure pydantic
to give us errors about that situation so
that a test suite, assuming there is one and assuming it has good coverage, can
catch this before we deploy anything by adding extra = Extra.forbid
to all of
your Config
interior classes (or whatever they're called), which results in
errors like this:
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/home/charles/science/python/pydantic-sucks/pydantic_sucks/main.py", line 31, in main
    person = Person(first_name="Charles", age=23, fingers=10)  # type: ignore
  File "pydantic/main.py", line 331, in pydantic.main.BaseModel.__init__
pydantic.error_wrappers.ValidationError: 1 validation error for Person
fingers
  extra fields not permitted (type=value_error.extra)
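For contrast, even a plain stdlib dataclass rejects unknown constructor arguments out of the box, with no configuration required — though still only at call time, not at typecheck time:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class PlainPerson:
    first_name: str
    age: int
    last_name: Optional[str] = None

try:
    # An unexpected keyword argument is a TypeError, no opt-in needed.
    PlainPerson(first_name="Charles", age=23, fingers=10)
except TypeError as e:
    print("rejected:", e)
```

Dataclasses don't validate types at runtime, so this is not a pydantic replacement, but it shows that "silently swallow extra fields" is a choice pydantic made, not a Python inevitability.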
Speaking of errors, what happens if our code fails to serialize or deserialize? Maybe our code constructed a value that didn't pass a validator or is the wrong type or the incoming data was malformed or some other such problem. In that case, we have to deal with the pitfalls of Python's error handling facilities, which I've talked more about in a previous post.
Speaking of deserialization, what about doing that with pydantic? You get at
least all the same problems as above, and at this point I don't really want to
talk about pydantic anymore, except to make the point again that the way aliases
work in pydantic (and similar libraries) is extremely backwards and the
default behavior is really bad. I hope at this point I've demonstrated clearly
why, but it's worth repeating just to be sure.
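If aliases were applied consistently in both directions by default, the mental model would be a single name map used for every conversion; here's a stdlib sketch of that saner default (my own hypothetical helper, not a pydantic API):

```python
import json

# One name map, applied every time -- no by_alias flag to forget.
ALIASES = {"first_name": "firstName", "last_name": "lastName"}

def to_wire(fields: dict) -> str:
    """Serialize, always emitting aliases."""
    return json.dumps({ALIASES.get(k, k): v for k, v in fields.items()})

def from_wire(payload: str) -> dict:
    """Deserialize, always mapping aliases back to field names."""
    back = {alias: name for name, alias in ALIASES.items()}
    return {back.get(k, k): v for k, v in json.loads(payload).items()}

print(to_wire({"first_name": "Charles", "age": 23}))
```

Because the map is baked in, a round-trip can't accidentally use field names on one side and aliases on the other.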
Let's talk about something else
What if we wanted to do this in Rust? Let's define our models:
#[derive(Serialize, Deserialize)]
struct PersonId(Uuid);

#[derive(Serialize, Deserialize)]
struct Person {
    #[serde(rename = "firstName")]
    first_name: String,
    #[serde(rename = "lastName")]
    last_name: Option<String>,
    age: u8,
}

#[derive(Serialize, Deserialize)]
struct PersonWithId {
    id: PersonId,
    #[serde(flatten)]
    person: Person,
}
And some code to construct a value and serialize it:
let id = PersonId(Uuid::new_v4());
let person = Person {
    first_name: "Charles".to_owned(),
    last_name: None,
    age: 23,
};
let person_with_id = PersonWithId { id, person };
let json = serde_json::to_string(&person_with_id)?;
println!("{json}");
Now let's see what happens when we run it:
{
  "id": "ecacbd24-2c68-45a0-aaf0-ebee4390c16a",
  "firstName": "Charles",
  "lastName": null,
  "age": 23
}
Why did — oh this isn't Python; it actually worked first try. Sorry, force of habit. Anyway, why does this work, and why is it so much safer and more intuitive?
- `#[serde(rename = "whatever")]` does what you think it would, unlike pydantic's `alias`, and no extra configuration is required beyond setting the new name we want. (serde also allows setting different names for serialization and deserialization, which can be handy when, for example, you want to use the same model to deserialize from your database and serialize for your web API.)
- There are no weird typechecking issues, because there are no constructors getting modified at runtime to accept multiple names for the same things.
- `PersonWithId` doesn't require a roundtrip of its components through a `HashMap` (aka `dict`), and because of that and point 1, there's no fear of whether you should have added `by_alias=True`.
- It also means any inability to map fields correctly will cause a compile-time error instead of a runtime one.
- Any attempt to add a new field in the `let person = Person { ... };` section will cause a compile error, because the struct does not define that new field. You can achieve the same for deserialization by annotating the container with `#[serde(deny_unknown_fields)]`.
- Any place where a runtime error can occur is made clear by the `Result` type; in this case we're simply handling it with the `?` operator.
- There's no weird coercion happening: trying to set `age: "23"` without explicitly converting it from `&str` to `u8` via `parse()` or such would fail to compile instead of potentially accepting bad values, which is much more in the Parse, Don't Validate spirit.
Conclusion
I don't really know how to end this article without either saying something overly sassy about Python or holding my head in my hands in sadness. I hope Python's situation improves, or maybe there's a better library out there that solves all of these problems somehow (you'd still have to deal with constructors) that I just haven't seen yet. In the meantime, I'm going to continue using Rust when I can and Python only when I have to. I guess I went the sadness route.
Addendum
For FastAPI in particular, it looks like it sets by_alias=True
by default for
returning responses built from pydantic
objects. This is notably not the same
as pydantic
's default, which is yet another violation of the Principle of
Least Astonishment in the Python ecosystem. I also see this
merged PR suggesting that you cannot disable this behavior
without either causing inconsistencies in the schema presented in the generated
OpenAPI spec or requiring you to implement a dummy model for the express purpose
of showing up in the OpenAPI spec with the correct field names, which is another
POLA violation and seems awful to maintain.
A large cause of problems with complex pydantic
code is the following pattern:
# NewThing inherits from BaseModel and OldThing, and adds its own extra fields
new_thing = NewThing(**old_thing.dict(), **other_thing.dict())
# Or
new_thing = NewThing(**old_thing.dict(), **{"new_field": new_value})
# Or
new_thing = NewThing(**old_thing.dict(), new_field=new_value)
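The failure modes of this pattern are easy to reproduce with nothing but plain functions and dicts (hypothetical names, just to show the mechanics): any overlap between the unpacked sources, and any missing field, only surfaces at call time.

```python
def new_thing(old_field, new_field):
    # Stand-in for a constructor that takes every field as a keyword argument.
    return (old_field, new_field)

old = {"old_field": 1}

# Overlapping keys between sources: no static warning, TypeError at runtime.
try:
    new_thing(**old, **{"old_field": 2, "new_field": 3})
except TypeError as e:
    print("overlap:", e)

# A field that nobody supplies is likewise only caught at runtime.
try:
    new_thing(**{})
except TypeError as e:
    print("missing:", e)
```

Because the dicts' contents are invisible to the typechecker, neither problem can be flagged before the program runs.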
Unlike Rust, Python has an explicit concept of constructors and also supports
inheritance. These two features prove themselves to be problems here, because
their existence tricks people into using them. pydantic makes this mistake by
requiring object composition to be facilitated through inheritance, and
Python leaves users no choice but to go through constructors to build these
values. The result is that, in any of the above examples, it is impossible to
statically verify that every field of NewThing gets set properly by its
constructor's arguments, so all of them are suspect for causing runtime errors.
This could be fixed by listing out every individual field of NewThing
,
including those inherited from OldThing
, and explicitly matching up every one
of those fields with the new value from the example constructors' arguments. The
massive downside is the maintenance burden: now you have to keep every instance
of this type of conversion in sync, which is fallible since accidentally
omitting Optional[T]
fields will run and typecheck fine even though they
should have been set to some value from other_thing
, for example. Rust and
serde
solve all of these issues by simply storing all of old_thing
inside
NewThing
via #[serde(flatten)]
, alongside other_thing
or any other new
fields NewThing
needs to have, as demonstrated above.
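For what it's worth, the composition approach translates to Python too; here's a dataclass sketch (my own construction, not a pydantic feature) that stores the old thing whole and only flattens at serialization time, so no field can be silently dropped in a dict round-trip:

```python
from dataclasses import dataclass, asdict

@dataclass
class OldThing:
    name: str
    age: int

@dataclass
class NewThing:
    old: OldThing   # stored whole -- no constructor-argument matching to drift
    new_field: str

    def to_dict(self) -> dict:
        # Flatten on the way out, in the spirit of #[serde(flatten)].
        return {**asdict(self.old), "new_field": self.new_field}

nt = NewThing(old=OldThing(name="Charles", age=23), new_field="x")
print(nt.to_dict())
```

Adding a field to `OldThing` now requires no changes at any `NewThing` construction site, because the whole value is passed along rather than re-spelled field by field.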
Python and some libraries (like pydantic
) treat names with
leading underscores specially, which may have adverse effects on the
serializability of fields with certain names. I know Go also uses a similar
method (casing) to convey privacy information, and I wonder if Go has similar
issues at the intersection of those and serialization. Again, Rust/serde does
not exhibit this issue, because naming has no real effect (a single leading
underscore can silence unused-code warnings, but that's it) and #[serde(rename = "whatever")]
is painless and straightforward to use.
FAQ
- Q: I haven't heard of anyone else having this problem; are you just bad at programming?

  A: You know, I haven't either. Maybe I am. *shrug*

- Q: Why do serde-like frameworks combine the model definition and its configuration in the first place?

  A: Because if they didn't, you'd wind up with a worse version of the `by_alias=True` problem: it would be possible to serialize something with the wrong configuration, instead of the configuration simply being baked in so it applies every time, transparently to you. Maybe it wouldn't be so bad if all you were doing was changing field names, but there's a lot of stuff you can do with serde container and field attributes, pydantic validators, and manually implementing serde traits for complicated things. In serde's case, you'd also take a performance hit, because it would have to check a separate source for what to do with every single model in the tree, instead of what it currently does, which is generate the correctly-configured code at compile time.