Parse, don't validate — Elixir edition

January 2, 2020

Rationale

For historical reasons, web stacks today often rely on something called a ‘validation’ step for data integrity. This ‘validation step’ checks that incoming data meets particular criteria before being persisted in the app. We’d like to present a completely different approach, inspired by the clean code methodology, DDD, and continuing the line of thought presented in “Parse, don’t validate”.

Why are we trying to convince you to abandon persistence-related validations (and all validations in general)?

Here’s why:

If you’re validating your data just before you persist it, it means that all the layers above the persistence layer have already been exposed to potentially malicious or malformed payloads. Parts of your application may have already acted on these bad payloads, too. This is bad for business domain coherence and very bad for security.
If you’re tying your data validation to the particular method of its storage, you’re unnecessarily coupling what should be pure business logic with the arbitrary and non-business-related requirements of your DB technology. This also means that you’re in uncharted territory when you want diversity and granularity in your storage mechanisms. You might have to use different validation passes provided by different libraries, depending on the backend you’re using for storage.
If you’re even thinking about validation as a discrete step, you’ve already implicitly admitted that the two points above are acceptable for you. You’ve admitted that your application cannot trust the data it operates on. We want to convince you to abandon the concept of validation completely and take control of your models by making them correct-by-construction. Once that is achieved, your applications will be able to trust their own data, because invalid data will not even be expressible.

The program

Here’s a fictional piece of code that we’ll be modifying to demonstrate how dropping validation can improve your software. We need to take an address from a web form and persist it, then use it to send a parcel.

with {:ok, street} <- Access.fetch(params, :street),
     {:ok, city} <- Access.fetch(params, :city),
     {:ok, postal_code} <- Access.fetch(params, :postal_code),
     address = %Address{street: street, city: city, postal_code: postal_code} do
  if Address.valid?(address) do
    :ok = Repo.Address.insert(address)
    :ok = ParcelService.send_parcel(address)
  else
    log_error("invalid_address, #{inspect address}")
    return_error_to_user("address is invalid")
  end
else
  error_from_with ->
    return_error_to_user(error_from_with)
end

The code above has some problems. It has three exit points, two of which represent errors encountered when assembling the data. It creates an Address and performs some actions only when the Address is valid, but it also performs some actions on the address if it’s not valid. Even logging invalid data could potentially lead to a breach.

We will prevent the app from using invalid addresses, because it simply won’t be possible to construct them.
We will make it clear which errors are the result of faulty input data (parse errors), and which errors are caused by malfunctions in other modules.

The root cause of the problem:

All of the issues in the code snippet above come from the fact that the Address struct can contain any information – it is just a container for 3 string fields. Also: after the address is persisted we no longer know if it’s valid or not. If an Address was returned from this function, the upstream receiver would need to validate it again every time they wanted to use it. This exposes the fact that we’re not encoding validity in any way.
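To make the root cause concrete, here is a minimal sketch of such an unguarded struct. The module name NaiveAddress is hypothetical and only stands in for the original Address; the point is that nothing checks what goes into the fields.

```elixir
# Hypothetical stand-in for the original, unguarded Address struct.
defmodule NaiveAddress do
  defstruct [:street, :city, :postal_code]
end

# struct!/2 only checks that the keys exist, never what the values are.
# The literal %NaiveAddress{...} syntax is just as permissive.
bogus = struct!(NaiveAddress, street: nil, city: 42, postal_code: :not_even_a_string)

# The struct tag says "address", but the contents are meaningless.
IO.inspect(bogus.__struct__ == NaiveAddress)
IO.inspect(bogus.city)
```

Any function that receives such a value has to re-check it, which is exactly the repeated validation described above.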

Solving the problem with a smart constructor

There’s just one thing we need to do in order to solve the issues mentioned above. We need to ensure that an Address struct can be created in a controlled way, so that after it’s created we can be certain that it represents a valid address.

We start by deciding that our struct can only be created in the module where it is defined. By convention, this happens in a function called new or create that returns {:ok, %Address{}} if creation was successful (so the data must be valid) or {:error, term()} otherwise.

To restrict other modules from creating instances of Address, we’re going to use a special type annotation in the Address module:

@opaque t :: %__MODULE__{street: String.t(),
                         city: String.t(),
                         postal_code: String.t()}

With our type defined like this, Dialyzer will flag any code that tries to construct or deconstruct instances of %Address{} outside the Address module.

This doesn’t come without a price: Dialyzer will also prevent us from accessing struct fields using the dot notation. Still, we believe it’s worth it. Here’s a hand-wavy implementation of the Address module, including the crucial new function:

defmodule Address do
  defstruct [:street, :city, :postal_code]

  @opaque t :: %__MODULE__{street: String.t(),
                           city: String.t(),
                           postal_code: String.t()}

  @spec new(map()) :: {:ok, t()} | {:error, atom()}
  def new(params) do
    with ... business_logic ... do
      {:ok, %__MODULE__{street: valid_street,
                        city: valid_city,
                        postal_code: valid_postal_code}}
    else
      # any clause that fails to match ends up here
      _ -> {:error, :validation_failed}
    end
  end
end
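For readers who want something they can run, here is one possible way to fill in the hand-wavy business logic. It is a sketch under stated assumptions, not the rules of any real system: the module name SketchAddress, the helper fetch_non_empty/2, the postal_code/1 accessor, and the rule set itself (non-empty strings, five-digit postal code) are all made up for illustration.

```elixir
defmodule SketchAddress do
  # Hypothetical rules, chosen only for illustration: every field must be a
  # non-empty string, and the postal code must be exactly five digits.
  @enforce_keys [:street, :city, :postal_code]
  defstruct [:street, :city, :postal_code]

  @opaque t :: %__MODULE__{street: String.t(),
                           city: String.t(),
                           postal_code: String.t()}

  @spec new(map()) :: {:ok, t()} | {:error, atom()}
  def new(params) do
    with {:ok, street} <- fetch_non_empty(params, :street),
         {:ok, city} <- fetch_non_empty(params, :city),
         {:ok, postal_code} <- fetch_non_empty(params, :postal_code),
         true <- postal_code =~ ~r/\A\d{5}\z/ do
      {:ok, %__MODULE__{street: street, city: city, postal_code: postal_code}}
    else
      # Any failed clause collapses into a single error here.
      _ -> {:error, :validation_failed}
    end
  end

  # Accessor, because callers should not reach into an opaque struct.
  @spec postal_code(t()) :: String.t()
  def postal_code(%__MODULE__{postal_code: postal_code}), do: postal_code

  # Returns {:ok, value} only for non-empty binaries stored under `key`.
  defp fetch_non_empty(params, key) do
    case Map.fetch(params, key) do
      {:ok, value} when is_binary(value) and value != "" -> {:ok, value}
      _ -> :error
    end
  end
end

{:ok, valid_address} =
  SketchAddress.new(%{street: "1 Sunset Blvd.", city: "Los Angeles", postal_code: "90046"})

IO.inspect(SketchAddress.postal_code(valid_address))
IO.inspect(SketchAddress.new(%{city: "Los Angeles", postal_code: "90046"}))
```

The only place a SketchAddress struct is ever built is the success branch of new/1, so holding one is evidence that the checks passed.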

With this function, we can now refactor our original code to eliminate the possibility of invalid Addresses leaking out into the wider application:

case Address.new(params) do
  {:ok, valid_address} ->
    :ok = Repo.Address.insert(valid_address)
    :ok = ParcelService.send_parcel(valid_address)
  {:error, validation_error} ->
    log_error(validation_error)
    return_error_to_user(validation_error)
end

Notice how there isn’t even a variable named address that could be mistaken for valid data. We know that the address inside the {:ok, _} tuple is a valid one, and we make it explicit in the naming.

Now, there is no chance that someone will accidentally use an invalid Address, as in log_error below:

if Address.valid?(address) do
  :ok = Repo.Address.insert(address)
  :ok = ParcelService.send_parcel(address)
else
  log_error("invalid_address, #{inspect address}")
  return_error_to_user("address is invalid")
end

Trying to instantiate an Address from outside the Address module angers the Dialyzer:

lib/example.ex:3: The specification for 'Elixir.MyApp':illegal_construction/0 has an opaque subtype
          'Elixir.Address':t() which is violated by the success typing
          () -> #{'__struct__' := 'Elixir.Address', 'city' := ... }
done (warnings were emitted)

Now, let’s see how it’s done.

Under the hood

The trick is to stop thinking about validating data in any way, and to frame creation of new instances of data as a parsing problem. Unstructured data comes in, and it either satisfies our parsers, yielding a successful parse result {:ok, result_type}, or something fails, and that something becomes our unsuccessful result: {:error, the_error}.
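The parsing view can be sketched by hand in a few lines. The following is an illustration only, not the implementation of any library: the module ParseSketch, its functions string/0, field/3 and fields/2, and the shape of the error tuples are all assumptions made for this example.

```elixir
defmodule ParseSketch do
  # A parser here is any one-argument function that returns
  # {:ok, value} or {:error, reason}.

  # Builds a parser that accepts binaries only.
  def string do
    fn
      input when is_binary(input) -> {:ok, input}
      _other -> {:error, :not_a_string}
    end
  end

  # Runs `parser` on the value stored under `key` in the `input` map.
  def field(input, key, parser) when is_map(input) do
    case Map.fetch(input, key) do
      {:ok, raw} ->
        case parser.(raw) do
          {:ok, value} -> {:ok, value}
          {:error, reason} -> {:error, {:failed_to_parse_field, key, reason}}
        end

      :error ->
        {:error, {:field_not_found_in_input, key}}
    end
  end

  # Parses every listed field, stopping at the first failure.
  def fields(input, specs) do
    Enum.reduce_while(specs, {:ok, %{}}, fn {key, parser}, {:ok, acc} ->
      case field(input, key, parser) do
        {:ok, value} -> {:cont, {:ok, Map.put(acc, key, value)}}
        {:error, _reason} = error -> {:halt, error}
      end
    end)
  end
end

specs = [street: ParseSketch.string(), postal_code: ParseSketch.string()]

IO.inspect(ParseSketch.fields(%{street: "1 Sunset Blvd.", postal_code: 9000}, specs))
# prints {:error, {:failed_to_parse_field, :postal_code, :not_a_string}}
```

Because every parser has the same shape, small parsers compose into larger ones, and a failure anywhere carries its own description of what went wrong.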

Our little data utility library has a built-in parser combinator for parsing maps into structs. The syntax should be self-explanatory:

defmodule Address do
  defstruct [:street, :city, :postal_code]
  @opaque t :: %__MODULE__{}

  # Errors are structured values rather than bare atoms, hence term().
  @spec new(map()) :: {:ok, t()} | {:error, term()}
  def new(params) do
    Data.Constructor.struct(
      [
        {:street, Data.Parser.BuiltIn.string()},
        {:city, Data.Parser.BuiltIn.string()},
        {:postal_code, Data.Parser.BuiltIn.string()}
      ],
      __MODULE__,
      params
    )
  end
end

Now, our function either successfully parses input maps according to our specs, or returns errors containing descriptions of where parsing failed:

iex(2)> Address.new(%{street: "1 Sunset Blvd.",
                      city: "Los Angeles",
                      postal_code: "90046"})
{:ok,
 %Address{city: "Los Angeles", postal_code: "90046", street: "1 Sunset Blvd."}}

iex(3)> Address.new(%{city: "Los Angeles", postal_code: "90046"})
{:error,
 %Error.DomainError{
   caused_by: :nothing,
   details: %{
     field: :street,
     input: %{city: "Los Angeles", postal_code: "90046"}
   },
   reason: :field_not_found_in_input
 }}

iex(4)> Address.new(%{street: "1 Sunset Blvd.",
                      city: "Los Angeles",
                      postal_code: 9000})
{:error,
 %Error.DomainError{
   caused_by: {:just,
    %Error.DomainError{caused_by: :nothing, details: %{}, reason: :not_a_string}},
   details: %{
     field: :postal_code,
     input: %{city: "Los Angeles", postal_code: 9000, street: "1 Sunset Blvd."}
   },
   reason: :failed_to_parse_field
 }}
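Nested errors like the ones above can be flattened before they are logged or shown to a user. The sketch below uses a stand-in struct, SketchError, that only mirrors the three fields visible in the printed output (reason, details, caused_by); it is not the library's error module, and the functions reasons/1 and message/1 are hypothetical helpers.

```elixir
defmodule SketchError do
  # Stand-in mirroring the shape of the errors printed above.
  defstruct reason: nil, details: %{}, caused_by: :nothing

  # Collects the reasons from the outermost error down to the root cause.
  def reasons(%__MODULE__{reason: reason, caused_by: :nothing}), do: [reason]

  def reasons(%__MODULE__{reason: reason, caused_by: {:just, inner}}),
    do: [reason | reasons(inner)]

  # Builds a short message that names the field but leaves out the raw input.
  def message(%__MODULE__{details: details} = error) do
    field = Map.get(details, :field, :unknown)
    path = error |> reasons() |> Enum.map_join(" <- ", &Atom.to_string/1)
    "#{field}: #{path}"
  end
end

inner = struct!(SketchError, reason: :not_a_string)

outer =
  struct!(SketchError,
    reason: :failed_to_parse_field,
    details: %{field: :postal_code, input: %{postal_code: 9000}},
    caused_by: {:just, inner}
  )

IO.puts(SketchError.message(outer))
# prints postal_code: failed_to_parse_field <- not_a_string
```

Leaving the input out of the message is a deliberate choice here: the structured error still carries it for debugging, but the rejected payload does not end up in logs by default.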

Data.Constructor.struct gives us a toolset for creating smart constructors, and ditching the idea of validations completely. However, we skimmed over the details of our ‘internal’ parsers, responsible for the fields in Address structs.

In our next post, we’ll go into these parsers in more depth and demonstrate that we can take domain modeling to a higher level, by modeling the component parts of Addresses as well. There’s no reason they should stay as String.ts.