# Contact tracing to assess transmission of COVID-19 from people at high risk on public transport

### Datasets

#### Bus validation

Most bus passengers in Fortaleza (approximately 94%) pay for their tickets with a smart card. Each time a passenger validates their card on a bus, a validation record is created. Fortaleza City Hall compiled and made available an anonymized dataset of bus validations with the following fields: Citizen ID (a hash code), Vehicle ID (another hash code), the date and time of the validation (check-in), and the estimated journey time. The dataset runs from March to December 2020 and contains 107,488,528 validation records from 1,426,569 distinct passengers.
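As a minimal sketch of how one validation record could be represented, the snippet below models the four fields listed above and reproduces the two summary counts (records and distinct passengers) on toy data. The class name `Validation`, the field names, the use of minutes for the journey time, and the sample values are all assumptions made for illustration; they do not come from the dataset.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Validation:
    citizen_id: str          # anonymized passenger hash
    vehicle_id: str          # anonymized vehicle hash
    check_in: datetime       # date and time of the validation
    journey_minutes: float   # estimated journey time (unit is an assumption)

    @property
    def check_out(self) -> datetime:
        """Estimated end of the journey: check-in plus journey time."""
        return self.check_in + timedelta(minutes=self.journey_minutes)


def summarize(records):
    """Return (number of validation records, number of distinct passengers)."""
    records = list(records)
    return len(records), len({r.citizen_id for r in records})


# Toy records, invented for illustration only.
records = [
    Validation("a1", "v9", datetime(2020, 3, 2, 7, 30), 25),
    Validation("b2", "v9", datetime(2020, 3, 2, 7, 40), 10),
    Validation("a1", "v3", datetime(2020, 3, 2, 18, 5), 30),
]
n_records, n_passengers = summarize(records)
```

Deriving `check_out` from the check-in time and the estimated journey time gives each trip an interval on a given vehicle, which is the kind of quantity a contact analysis would work with.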

#### Confirmed cases and deaths of COVID-19

The COVID-19 confirmed cases and deaths dataset is an anonymized list of everyone diagnosed with the disease in Fortaleza from March to December 2020. These data were also processed and made available by Fortaleza City Hall. The dataset has the following columns: Citizen ID (the same hash code used in the bus validation dataset), OS date, confirmed death flag, death date, and a healthcare worker indicator. During the period covered by the data, there were 85,553 confirmed cases (5,960 among healthcare workers) and 3,075 confirmed deaths (227 among healthcare workers). These figures differ slightly from those in official documents [34] because only cases with a recorded OS date were considered. Note that healthcare workers here include not only front-line professionals but also people whose jobs are related to the health field. Finally, we found that 9,032 people (721 healthcare workers) were diagnosed with COVID-19 and used their smart cards on buses at least once from March to December 2020.
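Because both datasets share the same Citizen ID hash, the count of confirmed cases who also rode buses can be obtained by intersecting the two sets of IDs. The sketch below illustrates the idea on toy data; the function name `cases_on_buses`, the input shapes, and the sample IDs are hypothetical.

```python
def cases_on_buses(cases, validation_ids):
    """Find confirmed cases that appear in the bus validation records.

    cases: dict mapping citizen_id -> True if the person is a healthcare worker
    validation_ids: iterable of citizen_id values, one per validation record
    Returns (all linked cases, linked healthcare workers) as sets of IDs.
    """
    riders = set(validation_ids)  # distinct passengers seen on buses
    linked = {cid for cid in cases if cid in riders}
    linked_hcw = {cid for cid in linked if cases[cid]}
    return linked, linked_hcw


# Toy data, invented for illustration only.
cases = {"a1": False, "b2": True, "c3": False}
validation_ids = ["a1", "a1", "b2", "z9"]
linked, linked_hcw = cases_on_buses(cases, validation_ids)
```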

### Iterated ensemble Kalman filter

We use the Iterated Ensemble Kalman Filter (IEnKF) framework [16, 41, 42, 43] to infer the parameters of the compartmental model and the initial subpopulations. The algorithm compares the model predictions \(F(\cdot)\), obtained by numerical integration of Eqs. (3)–(7) of the main text, with a set of \(T\) observations \(\mathcal{O}_1, \ldots, \mathcal{O}_T\) taken at discrete times \(t_1, \ldots, t_T\) within an observation window (see Fig. S4 of the Supplementary Information). The inference starts from an initial state vector \(X^{(0)} = \left\{S, E, I_r, I_u, R, D_r\right\}^{(0)}\) (see Table 1) and an initial parameter vector \(\theta^{(0)} = \left\{\beta, \mu, \sigma, \gamma, \alpha, \phi\right\}^{(0)}\) (see Table 1). Uncertainties are assigned to these vectors through the variance matrices \(\sigma_X\) and \(\sigma_\theta\), respectively.

For each iteration \(m\), a set of \(P\) "particles" is generated such that each particle has its initial state at time \(t_0\) drawn from a multivariate normal distribution with mean \(X^{(m-1)}\) and covariance \(a^{(m-1)}\sigma_X\), where \(0 < a < 1\) is a "cooling factor". The initial state vector for particle \(i\) is denoted \(X(t_0, i) \sim \mathcal{N}(X^{(m-1)}, a^{(m-1)}\sigma_X)\). These states are also used to define \(X_F(t_0, i)\), the posterior distribution at time \(t_0\). Similarly, each particle \(i\) has an initial parameter vector \(\theta(t_0, i) \sim \mathcal{N}(\theta^{(m-1)}, b^{(m-1)}\sigma_\theta)\), where \(0 < b < 1\) is another cooling factor. The inference proceeds by numerically integrating the model from these initial conditions, so that the predicted state vector for each particle \(i\) at time \(t_n\) is obtained from the prior distribution, \(X_P(t_n, i) = F(X_F(t_{n-1}, i), \theta(t_{n-1}, i))\). Based on these predictions, a weight \(W(t_n, i)\) is assigned to each particle \(i\), such that

$$\begin{aligned} W(t_n, i) = \exp\left(-\frac{\left|\mathcal{O}(t_n, i) - \mathcal{O}_n\right|}{\Theta}\right), \end{aligned} \tag{11}$$

where \(\mathcal{O}(t_n, i)\) is the predicted value of the observed quantities at time \(t_n\) for particle \(i\), and \(\Theta\) is a "temperature". In our case, \(\mathcal{O}(t_n, i)\) is the prediction of the cumulative number of reported deaths, \(D_r(t_n, i)\), evaluated daily. The filtering step favors the particles with the greatest weights: particle \(i\) is kept with probability \(\mathcal{P} = W(t_n, i) / \sum_j W(t_n, j)\). The states of the filtered particles define the posterior distribution at time \(t_n\), \(X_F(t_n, i) = X_P(t_n, i_{best})\), where \(i_{best}\) is the set of indices of the filtered particles [42]. The parameter vector is updated at time \(t_n\) using \(\theta(t_n, i) \sim \mathcal{N}(\theta(t_{n-1}, i_{best}), b^{(m-1)}\sigma_\theta)\). This filtering process continues until all observations \(\mathcal{O}_1, \ldots, \mathcal{O}_N\) have been compared. The iterative process then continues by defining the initial state vector \(X^{(m)}\) and parameter vector \(\theta^{(m)}\) for the next iteration over the same observation window. The next parameter vector is given by [41]:

$$\begin{aligned} \theta^{(m)} = \theta^{(m-1)} + V(t_1) \sum_{n=1}^{N} V^{-1}(t_n) \left(\bar{\theta}(t_n) - \bar{\theta}(t_{n-1})\right), \end{aligned} \tag{12}$$

where \(\bar{\theta}(t_n)\) is the sample mean of \(\theta(t_n, i_{best})\) and \(V(t_n)\) is its variance [41, 42]. The next state vector is given by the sample mean,

$$\begin{aligned} X^{(m)} = \frac{1}{P} \sum_{j_{best}=1}^{P} X(t_0, j_{best}). \end{aligned} \tag{13}$$
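The weighting, filtering, and update steps of Eqs. (11)–(13) can be sketched for a scalar observation, a scalar parameter, and a scalar state as follows. This is an illustrative sketch, not the authors' implementation. It assumes that filtering is done by multinomial resampling with replacement, and that the variance in Eq. (12) is the sample variance of the filtered parameter values. All function names and sample numbers are hypothetical.

```python
import math
import random


def weights(predicted, observed, temperature):
    """Eq. (11): W_i = exp(-|O(t_n, i) - O_n| / Theta) for each particle i."""
    return [math.exp(-abs(p - observed) / temperature) for p in predicted]


def resample(ws, rng):
    """Draw P particle indices with probability W_i / sum_j W_j.

    Multinomial resampling with replacement is an assumption of this sketch.
    """
    return rng.choices(range(len(ws)), weights=ws, k=len(ws))


def update_parameter(theta_prev, means, variances):
    """Eq. (12) for a scalar parameter.

    means[n] and variances[n] are the sample mean and variance of the
    filtered parameter at t_n; index 0 stands for t_0 (variances[0] is unused).
    """
    total = sum((means[n] - means[n - 1]) / variances[n]
                for n in range(1, len(means)))
    return theta_prev + variances[1] * total


def update_state(initial_states):
    """Eq. (13): sample mean of the filtered particles' initial states."""
    return sum(initial_states) / len(initial_states)


# Toy numbers, invented for illustration only.
rng = random.Random(0)
w = weights([10.0, 12.0, 30.0], observed=11.0, temperature=2.0)
idx = resample(w, rng)
theta_next = update_parameter(0.5, means=[0.5, 0.6, 0.65],
                              variances=[1.0, 0.04, 0.02])
```

In the toy example, the third particle predicts a value far from the observation, so its weight is much smaller than the other two and it is rarely selected by `resample`.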

After each iteration \(m\), the initial state vector \(X^{(m)}\) and parameter vector \(\theta^{(m)}\) are used to compute the evolution of the model over the whole observation window \(t_1, \ldots, t_T\). The performance of the inferred model is assessed by evaluating the error

$$\begin{aligned} \varepsilon^{(m)} = \frac{1}{T} \sum_{n=1}^{T} \left|\mathcal{O}^{(m)}_n - \mathcal{O}_n\right|^2. \end{aligned} \tag{14}$$

The iteration continues until \(\left|\varepsilon^{(m)} - \varepsilon^{(m-1)}\right| < \varepsilon_{max}\), where the threshold used here is \(\varepsilon_{max} = 0.01\). The goodness of fit is also checked by calculating the Pearson coefficient, \(R^2\), between the cumulative number of deaths from the integrated model and the corresponding observations. We obtain \(R^2 > 0.96\) for all observation windows. Table S1 of the Supplementary Information lists the inferred parameters for all windows. Figure S1 of the Supplementary Information shows the epidemiological curves obtained from inference with the SEIIR compartmental model. As illustrated, the model agrees well with the observations for the evolution of cumulative deaths.
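The error of Eq. (14), the stopping rule, and the goodness-of-fit check can be sketched as below. The sketch assumes that the stopping rule is that the change in error between successive iterations falls below \(\varepsilon_{max}\), and that \(R^2\) is the squared Pearson correlation coefficient. The function names and sample values are hypothetical.

```python
def fit_error(model, observed):
    """Eq. (14): mean squared difference between model and observations."""
    return sum((m - o) ** 2 for m, o in zip(model, observed)) / len(observed)


def converged(err, err_prev, eps_max=0.01):
    """Stopping rule (assumed form): |eps^(m) - eps^(m-1)| < eps_max."""
    return abs(err - err_prev) < eps_max


def pearson_r2(x, y):
    """Squared Pearson correlation coefficient between two series."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy * sxy / (sxx * syy)


# Toy cumulative-death series, invented for illustration only.
observed = [1.0, 3.0, 6.0, 10.0]
model = [1.2, 2.9, 6.3, 9.8]
err = fit_error(model, observed)
r2 = pearson_r2(model, observed)
```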

### Ethical statements

This study was approved by the Institutional Review Board (IRB) of the Universidade de Fortaleza (UNIFOR). All methods were carried out in accordance with relevant guidelines and regulations. Two datasets were used, with approval and consent obtained from the City Hall of Fortaleza, Ceará, Brazil. The first is a list of confirmed COVID-19 cases and deaths in Fortaleza, and the second consists of bus validation records from passenger smart cards. Both were collected from March to December 2020.