Python/Pydantic Pitfalls

Someone please throw me a rope


I'm going to focus on pydantic in this post, since that's what I know best; reading through this discussion and glancing at some of the other serialization frameworks, they seem to have similar problems or otherwise look awful to use. If you're trying to use FastAPI, you're locked into pydantic anyway. We'll be trying to interact with the following JSON schema, and let's say we need to support PersonId (just the ID), Person (the rest of the data, without the ID), and PersonWithId (the data plus the ID, shown here) in the code for dictionary reasons:

{
  "title": "PersonWithId",
  "type": "object",
  "properties": {
    "firstName": {
      "title": "Firstname",
      "type": "string"
    },
    "lastName": {
      "title": "Lastname",
      "type": "string"
    },
    "age": {
      "title": "Age",
      "type": "integer"
    },
    "id": {
      "title": "Id",
      "type": "string",
      "format": "uuid"
    }
  },
  "required": [
    "firstName",
    "age",
    "id"
  ]
}

This seems pretty straightforward, so let's try defining some models:

class PersonId(BaseModel):
    # We have to do this because `id` would shadow the builtin
    which: UUID = Field(..., alias="id")


class Person(BaseModel):
    first_name: str = Field(..., alias="firstName")
    last_name: Optional[str] = Field(..., alias="lastName")
    age: int = Field(...)


class PersonWithId(PersonId, Person):
    pass

Cool, now some code to construct a value and convert it to JSON:

person_id = PersonId(which=uuid4())
person = Person(first_name="Charles", age=23)
person_with_id = PersonWithId(**person.dict(), **person_id.dict())

print(person_with_id.json())

Time to show the programming world what language is boss:

Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/home/charles/science/python/pydantic-sucks/pydantic_sucks/main.py", line 28, in main
    person_id = PersonId(which=uuid4())
  File "pydantic/main.py", line 331, in pydantic.main.BaseModel.__init__
pydantic.error_wrappers.ValidationError: 1 validation error for PersonId
id
  field required (type=value_error.missing)

Oh, never mind, I guess. That's weird though; I set which=uuid4() like right there, what do you mean it's not present? Apparently, you must explicitly tell pydantic that you'd like to be able to populate fields by their names rather than only by their aliases. What!? Let's spam Config classes everywhere to fix it:

 class PersonId(BaseModel):
     which: UUID = Field(..., alias="id")

+    class Config:
+        allow_population_by_field_name = True
+

 class Person(BaseModel):
     first_name: str = Field(..., alias="firstName")
     last_name: Optional[str] = Field(..., alias="lastName")
     age: int = Field(...)

+    class Config:
+        allow_population_by_field_name = True
+

 class PersonWithId(PersonId, Person):
     pass

Let's try running the code again:

Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/home/charles/science/python/pydantic-sucks/pydantic_sucks/main.py", line 29, in main
    person = Person(first_name="Charles", age=23)
  File "pydantic/main.py", line 331, in pydantic.main.BaseModel.__init__
pydantic.error_wrappers.ValidationError: 1 validation error for Person
lastName
  field required (type=value_error.missing)

Great. Why is this happening? I specified the type for last_name to be Optional[T] and the docs say it's aware of that type. Scouring the docs for a while, we learn that pydantic actually distinguishes between "missing field" and "field is set to None/null" for some reason. Whatever, let's fix it:

 class Person(BaseModel):
     first_name: str = Field(..., alias="firstName")
-    last_name: Optional[str] = Field(..., alias="lastName")
+    last_name: Optional[str] = Field(None, alias="lastName")
     age: int = Field(...)

And run our code again:

{
  "first_name": "Charles",
  "last_name": null,
  "age": 23,
  "which": "f49be13a-63f1-4ef6-b8f7-b948a32836ed"
}

Hooray, no more errors! Except wait, this doesn't look like our JSON schema at all! Why are they being serialized by their field names instead of their aliases? Y'know, the aliases I added for the express purpose of appearing in the JSON, because the JSON field names are (mostly) invalid Python identifiers, stylistically and sometimes semantically? The docs say you have to explicitly ask for them to be serialized by their aliases. Why though? If this isn't the default, what are the aliases even for? Whatever, let's try it:

-print(person_with_id.json())
+print(person_with_id.json(by_alias=True))

And the output:

{
  "firstName": "Charles",
  "lastName": null,
  "age": 23,
  "id": "41501697-8ed7-4d2c-8bd0-47aea0d5cd92"
}

We've finally done it. I should be excited to be getting this working but at this point I'm actually just exhausted. Plus I've still got some questions. There are still two calls to some_model.dict() that don't take by_alias=True, but the code appears to work anyway. How am I supposed to remember when it's required to add by_alias=True and when it would break things if I did add it? I think this issue is on Python itself, for baking the concept of a constructor into the language and for allowing everything to be represented as dicts. (More on that later). Also, what does pyright think about our code now? Let's find out:

error: No parameter named "which" (reportGeneralTypeIssues)
error: Argument missing for parameter "id" (reportGeneralTypeIssues)
error: No parameter named "first_name" (reportGeneralTypeIssues)
error: Arguments missing for parameters "firstName", "lastName" (reportGeneralTypeIssues)

A little digging suggests that pyright is now expecting you to initialize the fields based on their aliases instead of their actual names. Again, this seems extremely backwards. How can I fix this without using (stylistically or semantically) illegal identifiers? This discussion and this comment specifically suggest that pyright is thinking too hard about what identifier a field name should be populated by. To be fair to it, it's very weird for a typechecker to have a constraint that a field is initialized by exactly one of two names in a generalized fashion. Oh was it unclear that you can still populate the fields by both names? Because yeah, you can do that. So anyway how do I appease pyright? It seems like the only way to do so is to add # type: ignore everywhere. Remember, doing # pylint: disable=invalid-name is not an option because of JSON field names that are semantically invalid Python names, such as id or anything using hyphens for word separators. So, let's clutter the place with sad comments:

-person_id = PersonId(which=uuid4())
-person = Person(first_name="Charles", age=23)
+person_id = PersonId(which=uuid4())  # type: ignore
+person = Person(first_name="Charles", age=23)  # type: ignore

Now pyright, pylint, and the Python interpreter (in the case of semantically illegal names) are all satisfied and don't give us any issues. But at what cost? This code now typechecks perfectly fine but will detonate at runtime:

-person = Person(first_name="Charles", age=23)  # type: ignore
+person = Person(first_name="Charles", age="lol")  # type: ignore

And this code typechecks fine and, somewhat surprisingly, works at runtime too. This is not very Parse, Don't Validate and thus is almost certainly prone to causing problems in the future:

-person = Person(first_name="Charles", age=23)  # type: ignore
+person = Person(first_name="Charles", age="23")  # type: ignore

This code typechecks and runs fine, but omits the extraneous value in the output that we potentially wanted to show up there:

-person = Person(first_name="Charles", age=23)  # type: ignore
+person = Person(first_name="Charles", age=23, fingers=10)  # type: ignore

We can at least configure pydantic to give us errors about that situation, so that a test suite (assuming there is one, and assuming it has good coverage) can catch this before we deploy anything, by adding extra = Extra.forbid to all of your Config interior classes (or whatever they're called). That results in errors like this:

Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/home/charles/science/python/pydantic-sucks/pydantic_sucks/main.py", line 31, in main
    person = Person(first_name="Charles", age=23, fingers=10)  # type: ignore
  File "pydantic/main.py", line 331, in pydantic.main.BaseModel.__init__
pydantic.error_wrappers.ValidationError: 1 validation error for Person
fingers
  extra fields not permitted (type=value_error.extra)
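
For reference, that Config change looks like this (pydantic v1 spelling; v2 renames these knobs):

```python
from pydantic import BaseModel, Extra, Field, ValidationError

class Person(BaseModel):
    first_name: str = Field(..., alias="firstName")
    age: int = Field(...)

    class Config:
        allow_population_by_field_name = True
        extra = Extra.forbid  # reject unknown fields at construction

error = None
try:
    Person(first_name="Charles", age=23, fingers=10)  # type: ignore
except ValidationError as exc:
    error = exc

print(error)  # mentions that extra fields are not permitted
```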

Speaking of errors, what happens if our code fails to serialize or deserialize? Maybe our code constructed a value that didn't pass a validator or is the wrong type or the incoming data was malformed or some other such problem. In that case, we have to deal with the pitfalls of Python's error handling facilities, which I've talked more about in a previous post.

Speaking of deserialization, what about doing that with pydantic? You get at least all the same problems as above, and at this point I don't really want to talk about pydantic anymore. Except to make the point again that the way aliases work in pydantic (and similar libraries) is extremely backwards and the default behavior is really bad; I hope at this point I've demonstrated clearly why this is the case, but it's worth repeating just to be sure.

Let's talk about something else

What if we wanted to do this in Rust? Let's define our models:

#[derive(Serialize, Deserialize)]
struct PersonId(Uuid);

#[derive(Serialize, Deserialize)]
struct Person {
    #[serde(rename = "firstName")]
    first_name: String,

    #[serde(rename = "lastName")]
    last_name: Option<String>,
    age: u8,
}

#[derive(Serialize, Deserialize)]
struct PersonWithId {
    id: PersonId,

    #[serde(flatten)]
    person: Person,
}

And some code to construct a value and serialize it:

let id = PersonId(Uuid::new_v4());

let person = Person {
    first_name: "Charles".to_owned(),
    last_name: None,
    age: 23,
};

let person_with_id = PersonWithId {
    id,
    person,
};

let json = serde_json::to_string(&person_with_id)?;

println!("{json}");

Now let's see what happens when we run it:

{
  "id": "ecacbd24-2c68-45a0-aaf0-ebee4390c16a",
  "firstName": "Charles",
  "lastName": null,
  "age": 23
}

Why did — oh this isn't Python; it actually worked first try. Sorry, force of habit. Anyway, why does this work, and why is it so much safer and more intuitive?

  1. #[serde(rename = "whatever")] does what you think it would, unlike pydantic's alias, and no extra configuration is required beyond setting the new name we want. (serde also allows setting different names for serialization and deserialization, which can be handy when, for example, you want to use the same model to deserialize from your database and serialize for your web API)

  2. There are no weird typechecking issues because there are no constructors getting modified at runtime to accept multiple names for the same things

  3. PersonWithId doesn't require a roundtrip of its components through a HashMap (aka dict) and because of that and point 1, there's no fear of whether you should have added by_alias=True

  4. It also means any inability to map fields correctly will cause a compile time error instead of a runtime one

  5. Any attempt to add a new field in the let person = Person { ... }; section will cause a compile error because the struct does not yet define that new field. You can achieve the same for deserialization by annotating the container with #[serde(deny_unknown_fields)]

  6. Any place where a runtime error can occur is made clear by the Result type; in this case we're simply propagating it with the ? operator

  7. There's no weird coercion happening, trying to set age: "23" without explicitly converting it from &str to u8 via parse() or such would fail to compile instead of potentially accepting bad values, which is much more in the Parse, Don't Validate spirit

Conclusion

I don't really know how to end this article without either saying something overly sassy about Python or holding my head in my hands in sadness. I hope Python's situation improves, or maybe there's a better library out there that solves all of these problems somehow (you'd still have to deal with constructors) that I just haven't seen yet. In the meantime, I'm going to continue using Rust when I can and Python only when I have to. I guess I went the sadness route.

Addendum

For FastAPI in particular, it looks like it sets by_alias=True by default for returning responses built from pydantic objects. This is notably not the same as pydantic's default, which is yet another violation of the Principle of Least Astonishment in the Python ecosystem. I also see this merged PR suggesting that you cannot disable this behavior without either causing inconsistencies in the schema presented in the generated OpenAPI spec or requiring you to implement a dummy model for the express purpose of showing up in the OpenAPI spec with the correct field names, which is another POLA violation and seems awful to maintain.

A large cause of problems with complex pydantic code is the following pattern:

# NewThing inherits from BaseModel and OldThing, and adds its own extra fields

new_thing = NewThing(**old_thing.dict(), **other_thing.dict())

# Or

new_thing = NewThing(**old_thing.dict(), **{"new_field": new_value})

# Or

new_thing = NewThing(**old_thing.dict(), new_field=new_value)

Unlike Rust, Python has an explicit concept of constructors and also supports inheritance. These two features prove themselves to be problems here, because their existence tricks people into using them. pydantic makes this mistake by requiring object composition to be facilitated through inheritance, and Python gives people no choice but to use constructors to build these values. The result is that, in any of the above examples, it is impossible to statically verify that every field of NewThing gets set properly by its constructor's arguments, so all of them are suspect for causing runtime errors.

This could be fixed by listing out every individual field of NewThing, including those inherited from OldThing, and explicitly matching up every one of those fields with the new value from the example constructors' arguments. The massive downside is the maintenance burden: now you have to keep every instance of this type of conversion in sync, which is fallible since accidentally omitting Optional[T] fields will run and typecheck fine even though they should have been set to some value from other_thing, for example. Rust and serde solve all of these issues by simply storing all of old_thing inside NewThing via #[serde(flatten)], alongside other_thing or any other new fields NewThing needs to have, as demonstrated above.
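
Spelled out in Python, that workaround looks something like this (OldThing and NewThing here are hypothetical stand-ins, pydantic v1 style):

```python
from typing import Optional
from pydantic import BaseModel

class OldThing(BaseModel):
    name: str
    note: Optional[str] = None

class NewThing(OldThing):
    new_field: int

old_thing = OldThing(name="widget", note="fragile")

# Match every inherited field by hand instead of splatting a dict.
# Forgetting `note` here would still typecheck and run, silently
# dropping its value -- which is exactly the maintenance hazard.
new_thing = NewThing(
    name=old_thing.name,
    note=old_thing.note,
    new_field=42,
)
print(new_thing.note)  # fragile
```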

Python and some libraries (like pydantic) treat names with leading underscores specially, which may have adverse effects on the serializability of fields with certain names. I know Go also uses a similar method (casing) to convey privacy information, and I wonder if Go has similar issues at the intersection of those and serialization. Again, Rust/serde does not exhibit this issue because naming has no real effect (single leading underscore can silence unused-code warnings but that's it) and #[serde(rename = "whatever")] is painless and straightforward to use.
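
A quick illustration of that underscore special-casing (pydantic v1 behavior; v2 instead treats these as private attributes, with the same effect on serialization):

```python
from pydantic import BaseModel

class Secretive(BaseModel):
    visible: str = "yes"
    _hidden = "nope"  # leading underscore: not treated as a model field

s = Secretive()
print(s.dict())  # {'visible': 'yes'} -- _hidden never serializes
```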

FAQ

  • Q: I haven't heard of anyone else having this problem, are you just bad at programming?

    A: You know, I haven't either. Maybe I am. *shrug*

  • Q: Why do serde-like frameworks combine the model definition and its configuration in the first place?

    A: Because if they didn't, you'd wind up with a worse version of the by_alias=True problem: keeping the configuration separate makes it possible to serialize a model with the wrong configuration, instead of baking it in so that it applies every time, transparently to you. Maybe it wouldn't be so bad if all you were doing was changing field names, but there's a lot of stuff you can do with serde container and field attributes, pydantic validators, and manually implementing serde traits for complicated things. In serde's case, you'd also take a performance hit, because it would have to consult a separate source for what to do with every single model in the tree, instead of what it currently does: generating the correctly-configured code at compile time.