---
title: "Epidemiology with pkmapr"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Epidemiological Analysis with pkmapr}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  eval = FALSE
)
```

```{r setup}
library(pkmapr)
library(spdep)
library(dplyr)
```

This vignette shows a full spatial epidemiology workflow using
`pk_neighbors()` to build spatial weights, test for spatial autocorrelation,
and identify local clusters across Pakistan's administrative districts.

## Build spatial weights

`pk_neighbors()` returns three elements in a named list:

- `w$nb`: the `spdep` neighbours list
- `w$listw`: row-standardised spatial weights, ready for `spdep`
- `w$data`: the `sf` object used to build the weights (see note below)

```r
districts <- get_districts()
w <- pk_neighbors(districts)
```

> **Always use `w$data` for subsequent analysis and mapping**, not the original
> `districts` object. This ensures row alignment between the spatial weights
> and your data, particularly when units have been excluded via the `disputed`
> argument.

## Disputed unit handling

Gilgit-Baltistan (GB) and Azad Jammu & Kashmir (AJK) are present in the
OCHA/HDX boundary data and share polygon edges with KP and Punjab. Under the
default (`disputed = "include"`) they participate in the neighbour graph as
normal administrative units.

The `disputed` argument lets analysts explicitly control whether these units
enter the spatial weights structure, which is useful for sensitivity analysis
when results may be influenced by their inclusion:

```r
# Default: all units included
w_all <- pk_neighbors(districts)

# Exclude both GB and AJK
w_excl <- pk_neighbors(districts, disputed = "exclude_both")

# Exclude only Gilgit-Baltistan
w_no_gb <- pk_neighbors(districts, disputed = "exclude_gb")

# Exclude only Azad Jammu & Kashmir
w_no_ajk <- pk_neighbors(districts, disputed = "exclude_ajk")
```

When any units are excluded, `w$data` contains the subsetted `sf` object with
those units removed. Always pass `w$data` to downstream functions:

```r
w_excl <- pk_neighbors(districts, disputed = "exclude_both")

# Use w_excl$data — not the original districts object
moran.test(w_excl$data$area_km2, w_excl$listw)
```

## Global Moran's I (synthetic data)

The example below generates spatially autocorrelated synthetic data to
demonstrate the workflow. Replace `districts$synthetic_rate` with your own
variable — disease incidence rates, prevalence estimates, and so on.

```r
set.seed(2023)

n     <- nrow(w$data)
rho   <- 0.6
W_mat <- listw2mat(w$listw)

epsilon               <- rnorm(n, mean = 50, sd = 10)
y                     <- as.numeric(solve(diag(n) - rho * W_mat) %*% epsilon)
w$data$synthetic_rate <- y

# Test for spatial autocorrelation
moran_result <- moran.test(w$data$synthetic_rate, w$listw)
print(moran_result)
```

A significant Moran's I indicates that districts with similar rates tend to
cluster spatially rather than being arranged at random.

## Local Indicators of Spatial Association (LISA)

LISA identifies local clusters (High-High, Low-Low) and spatial outliers:

```r
lisa <- localmoran(w$data$synthetic_rate, w$listw)

w$data$lisa_i <- lisa[, 1]   # local I statistic
w$data$lisa_p <- lisa[, 5]   # p-value

w$data <- w$data |>
  mutate(
    cluster = case_when(
      lisa_p > 0.05                                          ~ "Not significant",
      synthetic_rate > mean(synthetic_rate) & lisa_i > 0    ~ "High-High",
      synthetic_rate < mean(synthetic_rate) & lisa_i > 0    ~ "Low-Low",
      lisa_i < 0                                             ~ "Outlier",
      TRUE                                                   ~ "Other"
    )
  )

ggplot2::ggplot(w$data) +
  ggplot2::geom_sf(ggplot2::aes(fill = cluster)) +
  ggplot2::scale_fill_manual(values = c(
    "High-High"       = "#d7191c",
    "Low-Low"         = "#2c7bb6",
    "Outlier"         = "#fdae61",
    "Not significant" = "#f0f0f0",
    "Other"           = "#cccccc"
  )) +
  ggplot2::theme_void() +
  ggplot2::labs(title = "LISA Cluster Map", fill = "Cluster type")
```

## Hotspot detection (Getis-Ord Gi\*)

Getis-Ord Gi\* identifies statistically significant hotspots and coldspots
based on the concentration of high or low values:

```r
gi_star <- localG(w$data$synthetic_rate, w$listw)

w$data$gi_star <- as.numeric(gi_star)
w$data$hotspot <- case_when(
  w$data$gi_star >  1.96 ~ "Hotspot",
  w$data$gi_star < -1.96 ~ "Coldspot",
  TRUE                   ~ "Not significant"
)

pk_map(w$data, fill = "hotspot", title = "Hotspots (p < 0.05)")
```

## Sensitivity analysis: disputed units

To assess how much your results depend on the inclusion of GB and AJK, compare
Moran's I under each treatment:

```r
w_all  <- pk_neighbors(districts)
w_excl <- pk_neighbors(districts, disputed = "exclude_both")

# Attach the same variable to each subsetted dataset before comparing
moran_all  <- moran.test(w_all$data$area_km2,  w_all$listw)
moran_excl <- moran.test(w_excl$data$area_km2, w_excl$listw)

cat("Moran's I (all units):     ", round(moran_all$estimate[1],  4), "\n")
cat("Moran's I (GB + AJK excl):", round(moran_excl$estimate[1], 4), "\n")
```

If results differ materially, report both and document your choice of
boundary treatment explicitly in your methods section.

## Complete workflow

```r
# 1. Get data
districts <- get_districts()

# 2. Build spatial weights
w <- pk_neighbors(districts)
# or, for sensitivity: pk_neighbors(districts, disputed = "exclude_both")

# 3. Attach your case data — always join to w$data, not districts
# w$data <- pk_join(w$data, your_case_data, by = "district_code")

# 4. Calculate rates
# w$data <- w$data |>
#   mutate(rate = cases / population * 100000)

# 5. Test for spatial autocorrelation
# moran.test(w$data$rate, w$listw)

# 6. Identify clusters
# lisa <- localmoran(w$data$rate, w$listw)
# w$data$lisa_cluster <- attr(lisa, "quadr")$mean

# 7. Map results
# pk_map(w$data, fill = "lisa_cluster")
```

## References

- Bivand, R. S., Pebesma, E., & Gomez-Rubio, V. (2013). *Applied spatial data
  analysis with R*. Springer.
- OCHA Pakistan. (2023). *COD-AB Administrative Boundaries*.
  <https://data.humdata.org/dataset/cod-ab-pak>