I have been working with probability and machine learning lately, particularly with fitting distributions to datasets. Fitting data is covered in a conventional pre-college curriculum, but I had only ever done it when all of my data was complete. More recently I ran into a problem.

One fundamental feature in autonomous driving is the distance to the car in front of you. This tends to be a real number, something hopefully a bit bigger than a few meters when travelling at high speeds and potentially quite large when traffic is low. If you have a set of sensors on your car, like radars or lidar, you can only pick up cars up to a certain distance away from yourself. What do you do if there is no car in front of you? Set the distance to infinity?

This is a good example of censored data: a feature whose readings must fall within a given range, where you know when a value lands outside that range but not where it landed. The other types are truncated and missing data. Data is truncated when the only readings you have fall within a certain range, and you do not even know about the occurrences outside it. Missing data occurs when a reading was missed or corrupted, or is for some other reason unavailable. A good example is the velocity of the car in front of you.

So how does one handle fitting distributions to such features?

The answer is a surprisingly straightforward application of Bayes’ theorem.

Consider first a toy problem, obtained from “Information Theory, Inference, and Learning Algorithms” by David MacKay:

Unstable particles are emitted from a source and decay at a distance \(x\), a real number that has an exponential probability distribution with rate \(\lambda\). \(N\) decays are observed at locations \(\{x_1,\ldots,x_N\}\). What is \(\lambda\)?

Solving this for the case with perfect data provides us some insight.

The probability distribution for a single sample point, given \(\lambda\), is:

$$P(x|\lambda) = \lambda e^{-\lambda x} $$

from the definition for the exponential probability distribution.

Applying *Bayes’ theorem*:

$$P(\lambda|\{x_1,\ldots,x_N\}) = \frac{P(\{x_n\}|\lambda)\,P(\lambda)}{P(\{x_n\})}$$

$$\propto \lambda^N \exp\left( -\lambda \sum_{n=1}^N x_n \right) P(\lambda)$$

We can see that simply by conditioning on the data available and setting a prior, we can determine the posterior distribution over \(\lambda\). From here we can do what we wish, such as picking the most probable value of \(\lambda\).
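To make this concrete, here is a minimal sketch in Python (NumPy only; the synthetic data, `lam_true`, and the sample size are my own illustrative choices, not part of the original problem). With a flat prior, the posterior peak is the maximum-likelihood estimate, which for the exponential has the closed form \(\hat{\lambda} = N / \sum_n x_n\):

```python
import numpy as np

rng = np.random.default_rng(0)
lam_true = 2.0
# NumPy parameterises the exponential by its mean, i.e. 1/lambda
x = rng.exponential(scale=1.0 / lam_true, size=10_000)

# With a flat prior P(lambda), the posterior peak solves
# d/dlam [N log(lam) - lam * sum(x)] = 0  =>  lam_hat = N / sum(x)
lam_hat = len(x) / x.sum()
print(lam_hat)  # should land close to lam_true
```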

Suppose, however, that the data is truncated: we only get readings for particle decays between \(x_\text{min}\) and \(x_\text{max}\). Fitting in the same way as before will bias the estimate, because the model would assign probability to readings we could never have observed. Let us start again, following the same steps.

The probability distribution for a single sample point, potentially truncated, given \(\lambda\), is:

$$P(x|\lambda) = \begin{cases} \lambda e^{-\lambda x}/Z(\lambda) & x \in [x_\text{min}, x_\text{max}] \\ 0 & \text{otherwise} \end{cases}$$

where $$Z(\lambda) = \int_{x_\text{min}}^{x_\text{max}} \lambda e^{-\lambda x}\, dx = e^{-\lambda x_\text{min}} - e^{-\lambda x_\text{max}}$$

We then apply *Bayes’ theorem*:

$$P(\lambda|\{x_1,\ldots,x_N\}) = \frac{P(\{x_n\}|\lambda)\,P(\lambda)}{P(\{x_n\})}$$

$$\propto \left(\frac{\lambda}{Z(\lambda)}\right)^N \exp\left( -\lambda \sum_{n=1}^N x_n \right) P(\lambda)$$

This is *very* similar, and was quite easy to determine.
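As a sketch of fitting under truncation (again with synthetic data and an arbitrary window of my choosing): the log-likelihood \(N\log\lambda - \lambda\sum_n x_n - N\log Z(\lambda)\) has no closed-form maximum, so the snippet below grid-searches over \(\lambda\). A naive fit that ignores the truncation is included for contrast.

```python
import numpy as np

rng = np.random.default_rng(1)
lam_true = 2.0
x_all = rng.exponential(scale=1.0 / lam_true, size=50_000)

# Keep only the readings inside the (hypothetical) truncation window
x_min, x_max = 0.2, 1.5
x = x_all[(x_all >= x_min) & (x_all <= x_max)]

# Naive fit: pretends the truncated sample is a complete one (biased)
lam_naive = len(x) / x.sum()

# Truncated fit: maximise N*log(lam/Z(lam)) - lam*sum(x) over a grid
lams = np.linspace(0.1, 10.0, 2_000)
Z = np.exp(-lams * x_min) - np.exp(-lams * x_max)
log_like = len(x) * (np.log(lams) - np.log(Z)) - lams * x.sum()
lam_hat = lams[np.argmax(log_like)]
```

The naive estimate lands well away from the true rate, while the truncated fit recovers it.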

Without going into detail, we can derive the model for censored data. Suppose \(x\) is censored at \(x_\text{max}\): we get a reading for every decay, but any decay beyond \(x_\text{max}\) is only known to have exceeded it.

$$P(x|\lambda) = \begin{cases} \lambda e^{-\lambda x} & \text{for } x < x_\text{max} \\ Z'(\lambda)\,\delta(x - x_\text{max}) & \text{otherwise} \end{cases}$$

where \(Z(\lambda)\) is the probability of \(x\) being uncensored, \(\delta\) is the Dirac delta distribution, and censored readings are recorded as a point mass at \(x_\text{max}\).

$$Z(\lambda) = \int_{0}^{x_\text{max}} \lambda e^{-\lambda x}\, dx = 1 - e^{-\lambda x_\text{max}}$$

$$Z'(\lambda) = 1 - Z(\lambda) = e^{-\lambda x_\text{max}}$$
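The censored model is even friendlier to fit: setting the derivative of the log-likelihood \(N_\text{obs}\log\lambda - \lambda\sum_\text{obs} x_n + N_\text{cens}\log Z'(\lambda)\) to zero gives the closed form \(\hat{\lambda} = N_\text{obs} / \left(\sum_\text{obs} x_n + N_\text{cens}\, x_\text{max}\right)\). A minimal sketch, again on synthetic data with an arbitrary sensor limit of my choosing:

```python
import numpy as np

rng = np.random.default_rng(2)
lam_true = 2.0
x_all = rng.exponential(scale=1.0 / lam_true, size=20_000)

x_max = 1.0  # hypothetical sensor limit
observed = x_all[x_all <= x_max]
n_cens = int((x_all > x_max).sum())  # readings only known to exceed x_max

# log L = N_obs*log(lam) - lam*sum(observed) + n_cens*log(Z'(lam))
#       = N_obs*log(lam) - lam*(sum(observed) + n_cens*x_max)
# so each censored reading simply contributes x_max of "exposure":
lam_hat = len(observed) / (observed.sum() + n_cens * x_max)
```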

The final case is missing data. Here you know when a reading is missing, but you have no information about its value. Provided readings are missing at random, such a feature is fitted with the original method, using only the observed values.
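In code, that really is just the complete-data fit applied to the values you kept. A sketch (the 10% dropout rate is an arbitrary choice of mine):

```python
import numpy as np

rng = np.random.default_rng(3)
lam_true = 2.0
x = rng.exponential(scale=1.0 / lam_true, size=10_000)
x[rng.random(x.size) < 0.1] = np.nan  # ~10% of readings lost at random

obs = x[~np.isnan(x)]           # drop the missing readings...
lam_hat = len(obs) / obs.sum()  # ...and fit exactly as before
```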