Throughout, we will consider different geographic units, corresponding to various collections of counties, and other unified geographical areas corresponding to different parts of the landscape.
<- c(
core_counties "Fulton", "DeKalb", "Gwinnett", "Cobb", "Clayton"
)
<- c(
arc_counties "Fulton", "DeKalb", "Gwinnett", "Cobb", "Clayton", "Forsyth",
"Cherokee", "Douglas", "Fayette", "Henry", "Rockdale"
)
<- c(
atlanta_msa_1 'Fulton', 'Gwinnett',
'Cobb', 'DeKalb', 'Clayton',
'Cherokee', 'Forsyth',
'Henry', 'Paulding', 'Coweta',
'Douglas', 'Fayette', 'Carroll',
'Newton', 'Bartow', 'Walton',
'Rockdale'
)
<- c(
atlanta_msa_2 'Barrow', 'Spalding', 'Pickens',
'Haralson', 'Dawson', 'Butts',
'Meriwether', 'Morgan',
'Pike', 'Lamar', 'Jasper', 'Heard'
)
<- c(atlanta_msa_1, atlanta_msa_2) atlanta_msa
Throughout, we will frequently use ACS data; for ACS 1 year estimates, we use data from 2009 to 2019 (we note that 2020 data has not been released as a result of the COVID pandemic adversely affecting data collection), and for ACS 5 year estimates, we use data in the range from the 2006-to-2010 estimates to the 2016-to-2020.
<- lst(
years_acs1 2009, 2010, 2011, 2012, 2013, 2014,
2015, 2016, 2017, 2018, 2019
)
<- lst(
years_acs5 2010, 2011, 2012, 2013, 2014,
2015, 2016, 2017, 2018, 2019, 2020
)
Here, we will consider the I-285 ringroad as defining the “perimeter” of the “main” part of Atlanta, and will consider anything inside of this as being “inside the perimeter” (or ITP), and anything outside as “outside the perimeter” (or OTP). The shape file for the ringroad is available online.
<- sf::st_read("../data/georgia-highways.geojson") highways
## Reading layer `Expressways_Georgia' from data source `/mnt/c/Users/adavi/Documents/comp-journal/ajc-satellite/data/georgia-highways.geojson' using driver `GeoJSON'
## Simple feature collection with 44 features and 10 fields
## Geometry type: MULTILINESTRING
## Dimension: XY
## Bounding box: xmin: -85.55382 ymin: 30.62627 xmax: -81.09842 ymax: 34.98588
## Geodetic CRS: WGS 84
<- highways %>%
it_285 filter(str_detect(ROAD_NAME, "285")) %>%
select(ROAD_NAME, geometry)
<- it_285[c(1, 3), ]
it_285 <- st_union(st_convex_hull(it_285)) itp
As is visible from the plot below, the ringroad does not necessarily line up nicely with the county divisions within Georgia: the perimeter is highlighted in red, with the regular counties highlighted in black. However, the census tracts (in grey) do line up more nicely with the perimeter, albeit not exactly.
<- sf::st_read("../ga-counties/Counties_Georgia.shp") %>%
arc_counties_shp filter(NAME10 %in% arc_counties) %>%
select(NAME10, COUNTYFP10, geometry)
## Reading layer `Counties_Georgia' from data source `/mnt/c/Users/adavi/Documents/comp-journal/ajc-satellite/ga-counties/Counties_Georgia.shp' using driver `ESRI Shapefile'
## Simple feature collection with 159 features and 20 fields
## Geometry type: MULTIPOLYGON
## Dimension: XY
## Bounding box: xmin: -85.60517 ymin: 30.35576 xmax: -80.75143 ymax: 35.00067
## Geodetic CRS: WGS 84
<- sf::st_read("../ga-tracts-2019/tl_2019_13_tract.shp") %>%
arc_tracts_shp filter(COUNTYFP %in% arc_counties_shp$COUNTYFP10)
## Reading layer `tl_2019_13_tract' from data source `/mnt/c/Users/adavi/Documents/comp-journal/ajc-satellite/ga-tracts-2019/tl_2019_13_tract.shp' using driver `ESRI Shapefile'
## Simple feature collection with 1969 features and 12 fields
## Geometry type: MULTIPOLYGON
## Dimension: XY
## Bounding box: xmin: -85.60516 ymin: 30.35576 xmax: -80.78296 ymax: 35.0013
## Geodetic CRS: NAD83
<- st_intersection(highways, arc_counties_shp)
highway_arc
<- sf::st_read("../ga-counties/Counties_Georgia.shp") %>%
atlanta_msa_shp filter(NAME10 %in% atlanta_msa) %>%
select(NAME10, COUNTYFP10, geometry)
## Reading layer `Counties_Georgia' from data source `/mnt/c/Users/adavi/Documents/comp-journal/ajc-satellite/ga-counties/Counties_Georgia.shp' using driver `ESRI Shapefile'
## Simple feature collection with 159 features and 20 fields
## Geometry type: MULTIPOLYGON
## Dimension: XY
## Bounding box: xmin: -85.60517 ymin: 30.35576 xmax: -80.75143 ymax: 35.00067
## Geodetic CRS: WGS 84
sf_use_s2(FALSE)
<- st_intersection(
highway_arc_msa ::st_buffer(atlanta_msa_shp, dist = 0)
highways, sf
)
ggplot() +
geom_sf(data = arc_tracts_shp, colour = "grey", fill = NA) +
geom_sf(data = arc_counties_shp, colour = "black", fill = NA) +
geom_sf(data = itp, colour = "red", fill = NA)
Ultimately, we want to try and assign each census tract to being ITP or OTP. Firstly, to handle this, we count the number of tracts (where we use the 2020 Census Tracts) which intersect with the “perimeter” itself.
<- st_intersects(
ics st_transform(st_boundary(itp), crs = 4269),
arc_tracts_shp
)
print(paste(
"Number of census tracts intersecting the interstate:",
length(ics[[1]])
))
## [1] "Number of census tracts intersecting the interstate: 57"
As we can see, a small but non-zero number fall inside of this category; we will count these tracts as being inside of the perimeter.
<- arc_tracts_shp %>%
arc_tracts_shp select(NAME, COUNTYFP, TRACTCE, geometry) %>%
mutate(
tp = st_intersects(
st_transform(st_boundary(itp), crs = 4269),
sparse = FALSE
geometry, 1, ],
)[itp = st_within(
geometry,st_transform(itp, crs = 4269), sparse = FALSE
1],
)[, itp = tp | itp,
otp = !itp
%>% select(-tp)
)
%>%
arc_tracts_shp st_drop_geometry() %>%
summarize(
itp = sum(itp), otp = sum(otp)
)
<- arc_tracts_shp %>%
arc_tracts_shp2 pivot_longer(
cols = c(itp, otp),
names_to = "class",
values_to = "in_class"
%>%
) filter(in_class == TRUE) %>%
select(-in_class)
ggplot() +
geom_sf(
data = arc_tracts_shp2, colour = "grey",
mapping = aes(fill = class)
+
) geom_sf(data = arc_counties_shp, colour = "black", fill = NA) +
geom_sf(data = itp, colour = "red", fill = NA) +
coord_sf(xlim = c(-84.6, -84), ylim = c(33.4, 34), expand = TRUE)
ACS 1-year data from 2009 to 2019, showing the number of households (with error estimates) which lie in particular income bands across the ARC counties.
df_income_distribution
write_csv(
df_income_distribution,"data-files/income_distribution_by_year_county.csv"
)
To answer this, we summarize over all the counties the number of households in each income bracket, along with the total number, and do so for each year. The margin of errors give 90% confidence intervals of each estimate; to get the margin of error for a sum of variables, we sum the square of the margin of error across all the variables, and then take the square root. For this analysis, we look at all the counties which belong to the Altanta Regional Commission.
From the plots below, we can see that the majority of the shift in the income distribution has been a large increase in the number of households earning between $150,000 and $199,999 and more than $200,000 (by almost 100,000 from 2009 to 2019 in both groups), with a similar drop of almost 100,000 households earning less than $50,000.
<- df_income_distribution %>%
df_income_dist_summary select(year, county, variable, estimate, moe) %>%
group_by(year, variable) %>%
summarize(
estimate = sum(estimate),
moe = sqrt(sum(moe**2))
%>%
) ungroup() %>%
mutate(
variable = factor(variable, ordered = TRUE,
levels = income_var_labels)
%>%
) filter(variable != "Total")
df_income_dist_summary
write_csv(
df_income_dist_summary,"data-files/income_distribution_by_year.csv"
)
require(scales)
ggplot(df_income_dist_summary, aes(x = variable, y = estimate)) +
facet_wrap(~ year) +
geom_col() +
geom_errorbar(aes(ymin = estimate - moe, ymax = estimate + moe)) +
labs(
x = "Income band", y = "Number of households",
title = "Distribution of Household Income in the Atlanta area Between 2009 and 2019"
+ scale_y_continuous(labels = comma) +
) coord_flip()
%>%
df_income_dist_summary mutate(year = as.Date(paste(year, 1, 1, sep = "-"))) %>%
ggplot(aes(
x = year, y = estimate, col = variable, fill = variable,
ymin = estimate - moe, ymax = estimate + moe)
+ geom_point(show.legend = FALSE) + geom_line() +
) geom_ribbon(alpha = 0.2, show.legend = FALSE) +
labs(
x = "Year", y = "Number of households",
col = "Income band"
+
) theme(legend.position = "bottom")
<- df_income_distribution %>%
df_income_prop_summary select(year, county, variable, estimate, moe) %>%
group_by(year, variable) %>%
summarize(
estimate = sum(estimate),
moe = sqrt(sum(moe**2))
%>%
) ungroup() %>%
mutate(
variable = factor(variable, ordered = TRUE, levels = income_var_labels)
%>%
) filter(variable != "Total") %>%
group_by(year) %>%
mutate(
moe = moe / sum(estimate),
estimate = estimate / sum(estimate),
)
df_income_prop_summary
ggplot(df_income_prop_summary, aes(x = variable, y = estimate)) +
facet_wrap(~ year) +
geom_col() +
geom_errorbar(aes(ymin = estimate - moe, ymax = estimate + moe)) +
labs(
x = "Income band", y = "Proportion of households",
title = "Distribution of Household Income in the Atlanta area Between 2009 and 2019"
+ scale_y_continuous(labels = comma) + coord_flip() )
Here we focus particularly on the change between the years in 2009 and 2019.
<- c("#E69F00", "#56B4E9", "#009E73", "#F0E442", "#0072B2", "#D55E00", "#CC79A7")
cbPalette
%>%
df_income_dist_summary filter(year %in% c(2009, 2019)) %>%
mutate(year = factor(year)) %>%
ggplot(aes(x = variable, y = estimate, fill = year)) +
geom_col(position = "dodge") +
geom_errorbar(
aes(ymin = estimate - moe, ymax = estimate + moe),
position = "dodge"
+ scale_fill_manual(values = cbPalette) +
) geom_text(
aes(label = year, y = 10000),
position = position_dodge(width = 1),
size = 5, hjust = 'left',
col = "white"
+
) labs(
y = "Number of households", x = "Income band",
title = "Distribution of Household Income in the Atlanta area (2009 and 2019)", fill = "Year",
subtitle = "Data from the 1-year ACS from 2009 and 2019"
+ scale_y_continuous(labels = comma) +
) coord_flip() + theme(legend.position = "none")
ggsave("img-files/income_distribution_by_year.pdf")
%>%
df_income_distribution select(year, county, variable, estimate, moe) %>%
filter(variable != "Total") %>%
mutate(
variable = factor(variable, ordered = TRUE, levels = income_var_labels)
%>%
) ggplot(aes(x = year, y = estimate, fill = forcats::fct_rev(variable))) +
facet_wrap(~ county, scales = "free") +
geom_col() + coord_flip() +
labs(
x = "Year", y = "Number of Households", fill = "Income Band",
title = "Distribution of Household Income across Counties and Year",
subtitle = "Data from the 1-year ACS from 2009 to 2019"
+
) theme(legend.position = "bottom")
From the plot below, we can see that the number of households inside of the perimeter earning less than $50,000 has decreased by approximately 25000, whereas that outside the perimeter has stayed roughly constant.
<- map_dfr(
df_income_tracts lst(2009, 2019),
~ get_acs(
geography = "tract",
state = "GA",
variables = income_var_list,
year = .x,
county = arc_counties,
survey = "acs5",
geometry = FALSE
),.id = "year") %>%
mutate(
tract = gsub("^(.*?),.*", "\\1", NAME),
county = gsub("^(.*?), (.*?),.*", "\\2", NAME)
%>%
) select(-NAME) %>%
relocate(tract, county, .after = GEOID)
<- map_dfr(
df_tracts lst(2009, 2019),
~ get_acs(
geography = "tract",
state = "GA",
variables = "B19001_001",
year = .x,
county = arc_counties,
survey = "acs5",
geometry = TRUE
),.id = "year") %>%
select(year, GEOID, geometry) %>%
mutate(
tp = st_intersects(
st_transform(st_boundary(itp), crs = 4269),
sparse = FALSE
geometry, 1, ],
)[itp = st_within(
geometry,st_transform(itp, crs = 4269), sparse = FALSE
1],
)[, itp = tp | itp,
otp = !itp
%>% select(-tp) %>%
) pivot_longer(
cols = c(itp, otp),
names_to = "class",
values_to = "in_class"
%>%
) filter(in_class == TRUE) %>%
select(-in_class)
<- df_tracts %>% st_drop_geometry()
df_tracts_nogeom
<- df_income_tracts %>%
df_income_by_perimeter filter(variable != "Total") %>%
left_join(df_tracts_nogeom) %>%
select(year, class, variable, estimate, moe) %>%
mutate(variable = fct_collapse(variable,
"Less than $50,000" = c(
"Less than $10,000", "$10,000 to $14,999",
"$15,000 to $19,999", "$20,000 to $24,999",
"$25,000 to $29,999", "$30,000 to $34,999",
"$35,000 to $39,999", "$40,000 to $44,999",
"$45,000 to $49,999"
),"$50,000 to $74,999" = c("$50,000 to $59,999", "$60,000 to $74,999"),
"$75,000 to $99,999" = c("$75,000 to $99,999"),
"$100,000 to $149,999" = c(
"$100,000 to $124,999", "$125,000 to $149,999"
),"$150,000 to $199,999" = c("$150,000 to $199,999"),
"$200,000 or more" = c("$200,000 or more")
%>%
)) group_by(year, class, variable) %>%
summarize(
estimate = sum(estimate),
moe = sqrt(sum(moe**2))
%>%
) mutate(
variable = factor(variable,
ordered = TRUE, levels = income_var_labels
),class = recode_factor(factor(class),
itp = "Inside the perimeter", otp = "Outside the perimeter"
)
)
df_income_by_perimeterwrite_csv(
df_income_by_perimeter,"data-files/income_distribution_inside_outside_perimeter.csv"
)
%>%
df_income_by_perimeter ggplot(aes(x = variable, y = estimate, fill = year)) +
geom_col(position = "dodge") +
geom_errorbar(
aes(ymin = estimate - moe, ymax = estimate + moe),
position = "dodge"
+ scale_fill_manual(values = cbPalette) +
) geom_text(
aes(label = year, y = 10000),
position = position_dodge(width = 1),
size = 5, hjust = 'left',
col = "#000000"
+
) facet_grid(~ class, scales = "free") +
labs(
x = "Income band", y = "Number of households", fill = "Year",
title = "Distribution of household income in the Atlanta inside and outside the perimeter",
subtitle = "Data from the 1-year ACS from 2009 and 2019"
+
) coord_flip() +
theme(legend.position = "none")
To compare the income distribution across tracts over time, for each census tract, we will look at the proportion of the population in the different income bands, and then compare these between 2010 and 2019. (These time periods were chosen so that there are no issues in switching between different census tracts.) To do so, we will compute the total variation distance between the two distributions; formally, if we have \(K\) different possible values, and we have two sets of proportions labelled \(p(i)\) and \(q(i)\) (where \(i\) can take on \(1\), \(2\), …, or \(K\)), then we compute the value \[ \frac{1}{2} \times \Big( |p(1) - q(1)| + |p(2) - q(2)| + \cdots + |p(K) - q(K)| \Big), \] and use this as a measure of how different \(p\) and \(q\) are. If this distance is zero, then the two distributions are identical; the closer the distance is to one, the greater the difference between the distributions. To try and observe what is happening in relation to some other geographical structures, the major highways beyond that of the interstate have also been added to the plots below as dashed black lines (with the county boundaries given as solid black lines, and the census tract boundaries as solid grey lines).
<- map_dfr(
df_income_tracts lst(2010, 2019),
~ get_acs(
geography = "tract",
state = "GA",
variables = income_var_list,
year = .x,
county = arc_counties,
survey = "acs5",
geometry = FALSE
),.id = "year") %>%
mutate(
tract = gsub("^(.*?),.*", "\\1", NAME),
county = gsub("^(.*?), (.*?),.*", "\\2", NAME)
%>%
) select(-NAME) %>%
relocate(tract, county, .after = GEOID)
<- map_dfr(
df_tracts lst(2010),
~ get_acs(
geography = "tract",
state = "GA",
variables = "B19001_001",
year = .x,
county = arc_counties,
survey = "acs5",
geometry = TRUE
),.id = "year") %>%
select(GEOID, geometry) %>%
mutate(
tp = st_intersects(
st_transform(st_boundary(itp), crs = 4269),
sparse = FALSE
geometry, 1, ],
)[itp = st_within(
geometry,st_transform(itp, crs = 4269), sparse = FALSE
1],
)[, itp = tp | itp,
otp = !itp
%>% select(-tp) %>%
) pivot_longer(
cols = c(itp, otp),
names_to = "class",
values_to = "in_class"
%>%
) filter(in_class == TRUE) %>%
select(-in_class)
<- df_income_tracts %>%
df_total_households_tracts filter(variable == 'Total') %>%
select(year, GEOID, estimate, moe)
<- df_total_households_tracts %>%
df_total_households_tracts2 filter(year == '2019') %>% select(GEOID, estimate)
<- df_income_tracts %>%
df_income_change_long filter(variable != "Total") %>%
select(year, GEOID, variable, estimate) %>%
mutate(variable = fct_collapse(variable,
"Less than $50,000" = c(
"Less than $10,000", "$10,000 to $14,999",
"$15,000 to $19,999", "$20,000 to $24,999",
"$25,000 to $29,999", "$30,000 to $34,999",
"$35,000 to $39,999", "$40,000 to $44,999",
"$45,000 to $49,999"
),"$50,000 to $74,999" = c("$50,000 to $59,999", "$60,000 to $74,999"),
"$75,000 to $99,999" = c("$75,000 to $99,999"),
"$100,000 to $149,999" = c(
"$100,000 to $124,999", "$125,000 to $149,999"
),"$150,000 to $199,999" = c("$150,000 to $199,999"),
"$200,000 or more" = c("$200,000 or more")
%>%
)) group_by(year, GEOID, variable) %>%
summarize(estimate = sum(estimate)) %>%
ungroup() %>%
mutate(
variable = factor(variable, ordered = TRUE, levels = income_var_labels)
)
<- df_income_change_long %>%
df_income_change_tracts group_by(year, GEOID) %>%
mutate(estimate = estimate / sum(estimate)) %>%
pivot_wider(
names_from = year,
names_prefix = "estimate_",
values_from = estimate
%>%
) mutate(
diff = abs(estimate_2019 - estimate_2010)
%>%
) group_by(GEOID) %>%
summarize(income_dist_change = 0.5*sum(diff)) %>%
left_join(df_tracts)
<- df_income_change_long %>%
df_income_change_overall group_by(year) %>%
mutate(estimate = estimate / sum(estimate)) %>%
pivot_wider(
names_from = year,
names_prefix = "estimate_",
values_from = estimate
%>%
) mutate(
diff = abs(estimate_2010 - estimate_2019)
%>%
) summarize(income_dist_change = 0.5 * sum(diff))
plot(
$estimate, df_income_change_tracts$income_dist_change,
df_total_households_tracts2col = factor(df_tracts$class),
xlab = "Number of households in census tract in 2019",
ylab = "Total variation of income distribution change in census tract",
main = "Number of households vs income change across census tracts",
sub = "(Horizontal red line corresponds to level of income change across all Atlanta)"
) abline(h = as.numeric(df_income_change_overall), col = "red", lwd = 3)
%>%
df_income_change_tracts mutate(
class = recode_factor(factor(class),
itp = "Inside the perimeter", otp = "Outside the perimeter"
%>%
)) select(-geometry) %>%
ggplot(aes(x = income_dist_change, after_stat(density))) +
geom_histogram() + facet_wrap( ~ class) +
geom_vline(
xintercept = as.numeric(df_income_change_overall),
col = "red"
+
) labs(
x = "Total variation distance between income distributions in 2010 and 2019",
title = "Histogram of total variation distances between income distributions across census tracts",
y = "Density of census tracts in class",
subtitle = "(Vertical red line represents the overall distribution change across Atlanta, vertical blue line represents this plus the IQR overall all"
)
ggplot() +
geom_sf(
data = df_income_change_tracts %>%
mutate(income_dist_change = pmin(income_dist_change, 0.31)),
aes(geometry = geometry, fill = income_dist_change),
colour = "grey"
+ scale_fill_viridis(
) labels = c(0, 0.1, 0.2, "0.3+")
+
) geom_sf(data = arc_counties_shp, colour = "black", fill = NA) +
geom_sf(data = highway_arc, colour = "#000000", lty = 3, fill = NA) +
theme(legend.position = "bottom") +
labs(
fill = "Total variation of income distributions",
title = "Change in income distribution by census tract",
subtitle = "(Data from the 5 year 2006-2010 and 2015-2019 ACS)"
)
This seems pretty abstract, and so instead we’ll look at some “extreme groups”:
and see where the numbers of these households/proportions have been increasing and decreasing in number the largest.
<- df_income_tracts %>%
df_rdincome_tracts filter(variable != "Total") %>%
select(year, GEOID, variable, estimate) %>%
mutate(variable = fct_collapse(variable,
"Less than $50,000" = c(
"Less than $10,000", "$10,000 to $14,999",
"$15,000 to $19,999", "$20,000 to $24,999",
"$25,000 to $29,999", "$30,000 to $34,999",
"$35,000 to $39,999", "$40,000 to $44,999",
"$45,000 to $49,999"
),"$50,000 to $74,999" = c("$50,000 to $59,999", "$60,000 to $74,999"),
"$75,000 to $99,999" = c("$75,000 to $99,999"),
"$100,000 to $149,999" = c(
"$100,000 to $124,999", "$125,000 to $149,999"
),"$150,000 to $199,999" = c("$150,000 to $199,999"),
"$200,000 or more" = c("$200,000 or more")
%>%
)) mutate(
variable = factor(variable, ordered = TRUE, levels = income_var_labels)
%>%
) group_by(year, GEOID, variable) %>%
summarize(estimate = sum(estimate)) %>%
left_join(df_tracts)
<- ggplot() +
p2010 geom_sf(
data = df_rdincome_tracts %>% filter(year == '2010'),
aes(geometry = geometry, fill = estimate),
colour = "grey", lwd = 0.25
+ scale_fill_viridis(option = "viridis", trans = 'sqrt') +
) facet_wrap(~ variable) +
geom_sf(data = arc_counties_shp, colour = "black", fill = NA) +
geom_sf(data = highway_arc, colour = "#f100e5", fill = NA) +
theme(legend.position = "bottom") +
labs(
fill = "Number of households",
title = "Number of households by income band and tract",
subtitle = "(Data using 5-year estimates from the 2006-2010 ACS)"
+
) theme(legend.key.width = unit(2, 'cm'))
<- ggplot() +
p2019 geom_sf(
data = df_rdincome_tracts %>% filter(year == '2019'),
aes(geometry = geometry, fill = estimate),
colour = "grey", lwd = 0.25
+ scale_fill_viridis(option = "viridis", trans = 'log10') +
) facet_wrap(~ variable) +
geom_sf(data = arc_counties_shp, colour = "black", fill = NA) +
geom_sf(data = highway_arc, colour = "#f100e5", fill = NA) +
theme(legend.position = "bottom") +
labs(
fill = "Number of households",
title = "Number of households by income band and tract",
subtitle = "(Data using 5-year estimates from the 2015-2019 ACS)"
+
) theme(legend.key.width = unit(2, 'cm'))
<- df_income_tracts %>%
df_rdincome_change_tracts filter(variable != "Total") %>%
select(year, GEOID, variable, estimate) %>%
mutate(variable = fct_collapse(variable,
"Less than $50,000" = c(
"Less than $10,000", "$10,000 to $14,999",
"$15,000 to $19,999", "$20,000 to $24,999",
"$25,000 to $29,999", "$30,000 to $34,999",
"$35,000 to $39,999", "$40,000 to $44,999",
"$45,000 to $49,999"
),"$50,000 to $74,999" = c("$50,000 to $59,999", "$60,000 to $74,999"),
"$75,000 to $99,999" = c("$75,000 to $99,999"),
"$100,000 to $149,999" = c(
"$100,000 to $124,999", "$125,000 to $149,999"
),"$150,000 to $199,999" = c("$150,000 to $199,999"),
"$200,000 or more" = c("$200,000 or more")
%>%
)) mutate(
variable = factor(variable, ordered = TRUE, levels = income_var_labels)
%>%
) group_by(year, GEOID, variable) %>%
summarize(estimate = sum(estimate)) %>%
pivot_wider(
names_from = "year",
names_prefix = "estimate_",
values_from = "estimate"
%>%
) mutate(change = estimate_2019 - estimate_2010) %>%
left_join(df_tracts)
%>%
df_rdincome_change_tracts ggplot(aes(x = change)) + geom_histogram() +
facet_wrap(~ variable) +
labs(
x = "Change in number of households",
y = "Number of census tracts",
title = "Change in number of households in different income bands across census tracts",
subtitle = "(Data from the 5-year ACS 2006-2010 and 2015-2019)"
)
<- df_income_tracts %>%
df_rdincome_prop_tracts filter(variable != "Total") %>%
select(year, GEOID, variable, estimate) %>%
mutate(variable = fct_collapse(variable,
"Less than $50,000" = c(
"Less than $10,000", "$10,000 to $14,999",
"$15,000 to $19,999", "$20,000 to $24,999",
"$25,000 to $29,999", "$30,000 to $34,999",
"$35,000 to $39,999", "$40,000 to $44,999",
"$45,000 to $49,999"
),"$50,000 to $74,999" = c("$50,000 to $59,999", "$60,000 to $74,999"),
"$75,000 to $99,999" = c("$75,000 to $99,999"),
"$100,000 to $149,999" = c(
"$100,000 to $124,999", "$125,000 to $149,999"
),"$150,000 to $199,999" = c("$150,000 to $199,999"),
"$200,000 or more" = c("$200,000 or more")
%>%
)) mutate(
variable = factor(variable, ordered = TRUE, levels = income_var_labels)
%>%
) group_by(year, GEOID, variable) %>%
summarize(estimate = sum(estimate)) %>%
group_by(year, GEOID) %>%
mutate(estimate = estimate / sum(estimate)) %>%
left_join(df_tracts)
<- ggplot() +
pincome2010 geom_sf(
data = df_rdincome_prop_tracts %>% filter(year == '2010'),
aes(geometry = geometry, fill = estimate),
colour = "grey", lwd = 0.25
+ scale_fill_viridis(option = "viridis", trans = 'sqrt') +
) facet_wrap(~ variable) +
geom_sf(data = arc_counties_shp, colour = "black", fill = NA) +
geom_sf(data = highway_arc, colour = "#f100e5", fill = NA) +
theme(legend.position = "bottom") +
labs(
fill = "Proportion of households",
title = "Proportion of households by income band in census tracts",
subtitle = "(Data using 5-year estimates from the 2006-2010 ACS)"
+
) theme(legend.key.width = unit(2, 'cm'))
<- ggplot() +
pincome2019 geom_sf(
data = df_rdincome_prop_tracts %>% filter(year == '2019'),
aes(geometry = geometry, fill = estimate),
colour = "grey", lwd = 0.25
+ scale_fill_viridis(option = "viridis", trans = 'log10') +
) facet_wrap(~ variable) +
geom_sf(data = arc_counties_shp, colour = "black", fill = NA) +
geom_sf(data = highway_arc, colour = "#f100e5", fill = NA) +
theme(legend.position = "bottom") +
labs(
fill = "Proportion of households",
title = "Proportion of households by income band in census tracts",
subtitle = "(Data using 5-year estimates from the 2015-2019 ACS)"
+
) theme(legend.key.width = unit(2, 'cm'))
<- df_income_tracts %>%
df_rdincome_prop_change_tracts filter(variable != "Total") %>%
select(year, GEOID, variable, estimate) %>%
mutate(variable = fct_collapse(variable,
"Less than $50,000" = c(
"Less than $10,000", "$10,000 to $14,999",
"$15,000 to $19,999", "$20,000 to $24,999",
"$25,000 to $29,999", "$30,000 to $34,999",
"$35,000 to $39,999", "$40,000 to $44,999",
"$45,000 to $49,999"
),"$50,000 to $74,999" = c("$50,000 to $59,999", "$60,000 to $74,999"),
"$75,000 to $99,999" = c("$75,000 to $99,999"),
"$100,000 to $149,999" = c(
"$100,000 to $124,999", "$125,000 to $149,999"
),"$150,000 to $199,999" = c("$150,000 to $199,999"),
"$200,000 or more" = c("$200,000 or more")
%>%
)) mutate(
variable = factor(variable, ordered = TRUE, levels = income_var_labels)
%>%
) group_by(year, GEOID, variable) %>%
summarize(estimate = sum(estimate)) %>%
group_by(year, GEOID) %>%
mutate(estimate = estimate / sum(estimate)) %>%
ungroup() %>%
pivot_wider(
names_from = "year",
names_prefix = "estimate_",
values_from = "estimate"
%>%
) mutate(change = estimate_2019 - estimate_2010) %>%
left_join(df_tracts)
%>%
df_rdincome_prop_change_tracts ggplot(aes(x = change)) + geom_histogram() +
facet_wrap(~ variable)
<- ggplot() +
p1 geom_sf(
data = df_rdincome_prop_change_tracts %>%
mutate(change = pminmax(change, -0.2, 0.2)),
aes(geometry = geometry, fill = change),
colour = "grey", lwd = 0.25
+ scale_fill_viridis(
) option = "turbo",
labels = c("<= -0.2", -0.1, 0, 0.1, "0.2 <=")
+
) facet_wrap(~ variable) +
geom_sf(data = arc_counties_shp, colour = "black", fill = NA) +
geom_sf(data = highway_arc, colour = "black", lty = 3, fill = NA) +
theme(legend.position = "bottom") +
labs(
fill = "Change in proportion of households",
title = "Change in proportion of households in income band by tract",
subtitle = "(Data using 5-year estimates from the 2006-2010 ACS and 2015-2019 ACS)"
+
) theme(legend.key.width = unit(2, 'cm'))
p1
ggsave("img-files/income_changes_tract_income_prop.pdf")
<- ggplot() +
p2 geom_sf(
data = df_rdincome_prop_change_tracts %>%
mutate(change = pminmax(change, -0.2, 0.2)) %>%
filter(variable %in% c("Less than $50,000", "$200,000 or more")),
aes(geometry = geometry, fill = change),
colour = "grey", lwd = 0.25
+
) scale_fill_viridis(
option = "turbo",
labels = c("<= -0.2", -0.1, 0, 0.1, "0.2 <=")
+
) facet_wrap(~ variable) +
geom_sf(data = arc_counties_shp, colour = "black", fill = NA) +
geom_sf(data = highway_arc, colour = "black", lty = 3, fill = NA) +
theme(legend.position = "bottom") +
labs(
fill = "Change in proportion of households",
title = "Change in proportion of households in income band by tract",
subtitle = "(Data using 5-year estimates from the 2006-2010 ACS and 2015-2019 ACS)"
+
) theme(legend.key.width = unit(2, 'cm'))
p2
ggsave("img-files/income_changes_tract_income_prop_highlow.pdf")
Let’s zoom inside the perimeter to see what is happening here a bit more carefully.
+ coord_sf(xlim = c(-84.53, -84.2), ylim = c(33.6, 33.95), expand = FALSE) p1
ggsave("img-files/income_changes_tract_income_prop_itp.pdf")
+ coord_sf(xlim = c(-84.53, -84.2), ylim = c(33.6, 33.95), expand = FALSE) p2
ggsave("img-files/income_changes_tract_income_prop_highlow_itp.pdf")
Let’s also just look at the actual numbers of households, rather than the change in proportions over time. (This should let us see if the effects we are seeing are due to chance, or are of evidence of an actual pattern).
<- df_income_tracts %>%
df_rdincome_total_change_tracts filter(variable != "Total") %>%
select(year, GEOID, variable, estimate, moe) %>%
mutate(variable = fct_collapse(variable,
"Less than $50,000" = c(
"Less than $10,000", "$10,000 to $14,999",
"$15,000 to $19,999", "$20,000 to $24,999",
"$25,000 to $29,999", "$30,000 to $34,999",
"$35,000 to $39,999", "$40,000 to $44,999",
"$45,000 to $49,999"
),"$50,000 to $74,999" = c("$50,000 to $59,999", "$60,000 to $74,999"),
"$75,000 to $99,999" = c("$75,000 to $99,999"),
"$100,000 to $149,999" = c(
"$100,000 to $124,999", "$125,000 to $149,999"
),"$150,000 to $199,999" = c("$150,000 to $199,999"),
"$200,000 or more" = c("$200,000 or more")
%>%
)) mutate(
variable = factor(variable, ordered = TRUE, levels = income_var_labels)
%>%
) group_by(year, GEOID, variable) %>%
summarize(
estimate = sum(estimate),
moe = sqrt(sum(moe**2))
%>%
) ungroup() %>%
pivot_wider(
names_from = "year",
values_from = c("estimate", "moe")
%>%
) mutate(
change = estimate_2019 - estimate_2010,
ratio = estimate_2019 / estimate_2010,
log_ratio_moe = sqrt((moe_2019/estimate_2019)**2 + (moe_2010/estimate_2010)**2)
%>%
) left_join(df_tracts)
<- ggplot() +
p1_total geom_sf(
data = df_rdincome_total_change_tracts %>%
mutate(change = pminmax(change, -500, 500)),
aes(geometry = geometry, fill = change),
colour = "grey", lwd = 0.25
+ scale_fill_viridis(
) option = "turbo",
labels = c("<= -500", -250, 0, 250, "500 <=")
+
) facet_wrap(~ variable) +
geom_sf(data = arc_counties_shp, colour = "black", fill = NA) +
geom_sf(data = highway_arc, colour = "black", lty = 3, fill = NA) +
theme(legend.position = "bottom") +
labs(
fill = "Change in number of households",
title = "Change in number of households in income band by tract",
subtitle = "(Data using 5-year estimates from the 2006-2010 ACS and 2015-2019 ACS)"
+
) theme(legend.key.width = unit(2, 'cm'))
p1_total
ggsave("img-files/income_changes_tract_income_net_change.pdf")
<- ggplot() +
p2_total geom_sf(
data = df_rdincome_total_change_tracts %>%
mutate(change = pminmax(change, -500, 500)) %>%
filter(variable %in% c("Less than $50,000", "$200,000 or more")),
aes(geometry = geometry, fill = change),
colour = "grey", lwd = 0.25
+ scale_fill_viridis(
) option = "turbo",
labels = c("<= -500", -250, 0, 250, "500 <=")
+
) facet_wrap(~ variable) +
geom_sf(data = arc_counties_shp, colour = "black", fill = NA) +
geom_sf(data = highway_arc, colour = "black", lty = 3, fill = NA) +
theme(legend.position = "bottom") +
labs(
fill = "Change in number of households",
title = "Change in number of households in income band by tract",
subtitle = "(Data using 5-year estimates from the 2006-2010 ACS and 2015-2019 ACS)"
+
) theme(legend.key.width = unit(2, 'cm'))
p2_total
ggsave("img-files/income_changes_tract_income_net_change_highlow.pdf")
+ coord_sf(xlim = c(-84.53, -84.2), ylim = c(33.6, 33.95), expand = FALSE) p1_total
ggsave("img-files/income_changes_tract_income_net_change_itp.pdf")
+ coord_sf(xlim = c(-84.53, -84.2), ylim = c(33.6, 33.95), expand = FALSE) p2_total
ggsave("img-files/income_changes_tract_income_net_change_highlow_itp.pdf")
<- paste0("B07010_0", str_pad(1:66, 2, pad = "0"))
income_variables
<- v11 %>%
income_variables_df filter(name %in% income_variables) %>%
select(name, label) %>%
separate(
label, c('estimate', 'total', 'moved', 'income', 'band'),
sep = "!!"
%>%
) drop_na() %>%
select(name, moved, band) %>%
filter(moved != "Moved from abroad") %>%
mutate(
in_out_state = fct_collapse(
as_factor(moved),
"Same house 1 year ago" = "Same house 1 year ago",
"Moved in-state" = c(
"Moved within same county",
"Moved from different county within same state",
"Moved from different state"
),"Moved from out-of-state" = c("Moved from different state")
),band = fct_collapse(
as_factor(band),
"$1 to $14,999" = c(
"$1 to $9,999 or loss", "$10,000 to $14,999"
),"$15,000 to $34,999" = c(
"$15,000 to $24,999", "$25,000 to $34,999"
),"$35,000 to $49,999" = "$35,000 to $49,999",
"$50,000 to $74,999" = c(
"$50,000 to $64,999", "$65,000 to $74,999"
),"$75,000 or more" = "$75,000 or more"
)
)
<- map_dfr(
df_income_in
years_acs1,~ get_acs(
geography = "county",
state = "GA",
variables = income_variables,
year = .x,
county = arc_counties,
survey = "acs1",
geometry = FALSE
),.id = "year") %>%
mutate(county = gsub("^(.*?),.*", "\\1", NAME)) %>%
select(-NAME) %>%
relocate(county, .after = GEOID)
%>%
df_income_in left_join(income_variables_df, by = c("variable" = "name")) %>%
mutate(year = as.Date(paste(year, 1, 1, sep = "-"))) %>%
drop_na() %>%
group_by(year, in_out_state, band) %>%
summarize(
estimate = sum(estimate),
moe = sqrt(sum(moe**2))
%>%
) filter(str_detect(in_out_state, "Moved")) %>%
ggplot(aes(
x = year, y = estimate, ymin = estimate - moe,
ymax = estimate + moe, col = band, fill = band)
+ geom_line() +
) geom_point() +
geom_ribbon(alpha = 0.15, show.legend = FALSE) +
theme(legend.position = "bottom") +
facet_wrap( ~ in_out_state, scales = "free") +
scale_x_date(date_labels = "%Y", date_breaks = "1 year") +
labs(
x = "Year", y = "Number of Individuals",
col = "Individual Income",
title = "Distribution of income of individuals who moved into Atlanta from out of state",
subtitle = "Data from 1-year ACS from 2009 to 2019"
+
) guides(fill = "none")
ggsave("img-files/income_out_of_state.pdf")
<- map_dfr(
df_income_in_5yr
years_acs5,~ get_acs(
geography = "county",
state = "GA",
variables = income_variables,
year = .x,
county = arc_counties,
survey = "acs5",
geometry = FALSE
),.id = "year") %>%
mutate(county = gsub("^(.*?),.*", "\\1", NAME)) %>%
select(-NAME) %>%
relocate(county, .after = GEOID)
%>%
df_income_in_5yr left_join(income_variables_df, by = c("variable" = "name")) %>%
mutate(year = as.Date(paste(year, 1, 1, sep = "-"))) %>%
drop_na() %>%
group_by(year, in_out_state, band) %>%
summarize(
estimate = sum(estimate),
moe = sqrt(sum(moe**2))
%>%
) filter(in_out_state == "Moved from out-of-state") %>%
ggplot(aes(
x = year, y = estimate, ymin = estimate - moe,
ymax = estimate + moe, col = band, fill = band)
+ geom_line() +
) geom_point() +
geom_ribbon(alpha = 0.15, show.legend = FALSE) +
theme(legend.position = "bottom") +
scale_x_date(date_labels = "%Y", date_breaks = "1 year") +
labs(
x = "Year", y = "Number of Individuals",
col = "Individual Income",
title = "Distribution of income of individuals who moved into Atlanta from out of state",
subtitle = "Data from 5-year ACS from 2005-2009 to 2016-2020 (years not directly comparable)"
+
) guides(fill = "none")
# df_income_change_corr <- df_income_tracts %>%
# filter(variable != "Total") %>%
# select(year, GEOID, variable, estimate, moe) %>%
# mutate(variable = fct_collapse(variable,
# "Less than $75,000" = c(
# "Less than $10,000", "$10,000 to $14,999",
# "$15,000 to $19,999", "$20,000 to $24,999",
# "$25,000 to $29,999", "$30,000 to $34,999",
# "$35,000 to $39,999", "$40,000 to $44,999",
# "$45,000 to $49,999", "$50,000 to $59,999",
# "$60,000 to $74,999"
# ),
# "$75,000 to $99,999" = c("$75,000 to $99,999"),
# "$100,000 to $149,999" = c(
# "$100,000 to $124,999", "$125,000 to $149,999"
# ),
# "$150,000 or more" = c(
# "$150,000 to $199,999", "$200,000 or more"
# )
# )) %>%
# mutate(
# variable = factor(
# variable, ordered = TRUE,
# levels = c(
# "Less than $75,000", "$75,000 to $99,999",
# "$100,000 to $149,999", "$150,000 or more"
# )
# )
# ) %>%
# group_by(year, GEOID, variable) %>%
# summarize(
# estimate = sum(estimate),
# moe = sqrt(sum(moe**2))
# ) %>%
# ungroup() %>%
# pivot_wider(
# names_from = "year",
# values_from = c("estimate", "moe")
# ) %>%
# mutate(
# change = estimate_2019 - estimate_2010,
# ratio = estimate_2019 / estimate_2010,
# log_ratio_moe = sqrt((moe_2019/estimate_2019)**2 + (moe_2010/estimate_2010)**2)
# ) %>%
# left_join(df_tracts) %>%
# select(GEOID, variable, ratio, log_ratio_moe) %>%
# filter(variable %in% c("Less than $75,000", "$150,000 or more")) %>%
# mutate(variable = fct_recode(
# variable, "lt_75k" = "Less than $75,000", "gt_150k" = "$150,000 or more"
# )) %>%
# pivot_wider(
# names_from = variable,
# values_from = c("ratio", "log_ratio_moe")
# )
# require(deming)
# dr <- deming(
# log(ratio_lt_75k) ~ log(ratio_gt_150k),
# data = df_income_change_corr %>% filter(
# ratio_gt_150k < 20, ratio_gt_150k > 0.01
# ),
# ystd = log_ratio_moe_lt_75k, xstd = log_ratio_moe_gt_150k
# )
# dr
In order to try and examine more carefully where people in the lower income band are moving, we look at the change in households over the entire Atlanta Metropoliton Statistical Area in the $75,000 or less income bracket. We note that this roughly corresponds to the current living wage) estimate for a household in Atlanta, assuming a two adult household with two children and one adult working.
<- map_dfr(
df_tracts_msa lst(2010),
~ get_acs(
geography = "tract",
state = "GA",
variables = "B19001_001",
year = .x,
county = atlanta_msa,
survey = "acs5",
geometry = TRUE
),.id = "year") %>%
select(GEOID, geometry) %>%
mutate(
tp = st_intersects(
st_transform(st_boundary(itp), crs = 4269),
sparse = FALSE
geometry, 1, ],
)[itp = st_within(
geometry,st_transform(itp, crs = 4269), sparse = FALSE
1],
)[, itp = tp | itp,
otp = !itp
%>% select(-tp) %>%
) pivot_longer(
cols = c(itp, otp),
names_to = "class",
values_to = "in_class"
%>%
) filter(in_class == TRUE) %>%
select(-in_class)
<- map_dfr(
df_income_tracts_msa lst(2010, 2019),
~ get_acs(
geography = "tract",
state = "GA",
variables = income_var_list,
year = .x,
county = atlanta_msa,
survey = "acs5",
geometry = FALSE
),.id = "year") %>%
mutate(
tract = gsub("^(.*?),.*", "\\1", NAME),
county = gsub("^(.*?), (.*?),.*", "\\2", NAME)
%>%
) select(-NAME) %>%
relocate(tract, county, .after = GEOID) %>%
filter(variable != "Total") %>%
select(year, GEOID, variable, estimate) %>%
mutate(variable = fct_collapse(variable,
"Less than $50,000" = c(
"Less than $10,000", "$10,000 to $14,999",
"$15,000 to $19,999", "$20,000 to $24,999",
"$25,000 to $29,999", "$30,000 to $34,999",
"$35,000 to $39,999", "$40,000 to $44,999",
"$45,000 to $49,999"
),"$50,000 to $74,999" = c("$50,000 to $59,999", "$60,000 to $74,999"),
"$75,000 to $99,999" = c("$75,000 to $99,999"),
"$100,000 to $149,999" = c(
"$100,000 to $124,999", "$125,000 to $149,999"
),"$150,000 to $199,999" = c("$150,000 to $199,999"),
"$200,000 or more" = c("$200,000 or more")
%>%
)) mutate(
variable = factor(variable, ordered = TRUE, levels = income_var_labels)
%>%
) group_by(year, GEOID, variable) %>%
summarize(estimate = sum(estimate)) %>%
pivot_wider(
names_from = "year",
names_prefix = "estimate_",
values_from = "estimate"
%>%
) mutate(change = estimate_2019 - estimate_2010) %>%
left_join(df_tracts_msa)
<- ggplot() +
p_wider geom_sf(
data = df_income_tracts_msa %>%
filter(variable == "Less than $50,000") %>%
mutate(change = pminmax(change, -500, 500)),
aes(geometry = geometry, fill = change),
colour = "grey", lwd = 0.25
+ scale_fill_viridis(
) option = "turbo",
labels = c("<= -500", -250, 0, 250, "500 <=")
+ geom_sf(data = atlanta_msa_shp, colour = "black", fill = NA) +
) geom_sf(data = highway_arc_msa, colour = "black", lty = 3, fill = NA) +
theme(legend.position = "bottom") +
labs(
fill = "Change in number of households earning less than $50,000",
title = "Change in number of households in income band by tract",
subtitle = "(Data using 5-year estimates from the 2006-2010 ACS and 2015-2019 ACS)"
+
) theme(legend.key.width = unit(2, 'cm'))
p_wider
From the data and graph below, we can see that prior to the recession, the majority of the amount paid per month in rent was less than $1,249 a month, whereas now a considerable number of units require at least this much to pay. That said, this is not accounting directly for changes in inflation; however, when looking at the median and lower quartile monthly rent, these are typically around 20% and 10% higher than what they were ten years ago, even when adjusting for inflation. The same occurs in most of the counties for the upper quartile of monthly rent.
<- load_variables(2020, "acs5", cache = TRUE)
v20
<- paste0("B25063_0", str_pad(1:24, 2, pad = "0"))
rent_variables_2010
<- v11 %>%
rent_var_labels_2010 filter(name %in% rent_variables_2010) %>%
select(label) %>% as_vector()
<- sapply(
rent_var_labels_2010 str_split(rent_var_labels_2010, "!!"), tail, 1
)
<- setNames(
rent_var_list_2010 as.list(rent_variables_2010), rent_var_labels_2010
)
<- map_dfr(
df_rent_distribution_2010 lst(2010),
~ get_acs(
geography = "county",
state = "GA",
variables = rent_var_list_2010,
year = .x,
county = arc_counties,
survey = "acs5",
geometry = FALSE
),.id = "year") %>%
mutate(
county = gsub("^(.*?),.*", "\\1", NAME)
%>% select(-NAME) %>% relocate(county, .after = GEOID)
)
<- paste0("B25063_0", str_pad(1:27, 2, pad = "0"))
rent_variables_2020
<- v20 %>%
rent_var_labels_2020 filter(name %in% rent_variables_2020) %>%
select(label) %>% as_vector()
<- sapply(
rent_var_labels_2020 str_split(rent_var_labels_2020, "!!"), tail, 1
)
<- setNames(
rent_var_list_2020 as.list(rent_variables_2020), rent_var_labels_2020
)
<- map_dfr(
df_rent_distribution_2020 lst(2020),
~ get_acs(
geography = "county",
state = "GA",
variables = rent_var_list_2020,
year = .x,
county = arc_counties,
survey = "acs5",
geometry = FALSE
),.id = "year") %>%
mutate(
county = gsub("^(.*?),.*", "\\1", NAME)
%>% select(-NAME) %>% relocate(county, .after = GEOID)
)
<- bind_rows(
df_rent_distribution
df_rent_distribution_2010, df_rent_distribution_2020
)
<- c(
collapsed_factors "Less than $750", "$750 to $999", "$1,000 to $1,249",
"$1,250 to $1,499", "$1,500 to $1,999", "$2,000 or more"
)
<- df_rent_distribution %>%
df_rent_dist_summary mutate(variable = fct_collapse(
variable,"Less than $750" = c(
"Less than $100", "$100 to $149", "$150 to $199", "$200 to $249",
"$250 to $299", "$300 to $349", "$350 to $399",
"$400 to $449", "$450 to $499", "$500 to $549",
"$550 to $599", "$600 to $649", "$650 to $699",
"$700 to $749"
),"$750 to $999" = c(
"$750 to $799", "$800 to $899", "$900 to $999"
),"$1,000 to $1,249" = c(
"$1,000 to $1,249"
),"$1,250 to $1,499" = c(
"$1,250 to $1,499"
),"$1,500 to $1,999" = c(
"$1,500 to $1,999"
),"$2,000 or more" = c(
"$2,000 or more", "$2,000 to $2,499",
"$2,500 to $2,999", "$3,000 to $3,499",
"$3,500 or more"
)%>% group_by(year, variable) %>%
)) summarize(
estimate = sum(estimate), moe = sqrt(sum(moe**2))
%>%
) filter(variable %in% collapsed_factors) %>%
mutate(
variable = factor(variable,
ordered = "true", levels = collapsed_factors)
)
df_rent_dist_summary
%>%
df_rent_dist_summary ggplot(aes(x = variable, y = estimate, fill = year)) +
geom_col(position = "dodge") +
geom_errorbar(
aes(ymin = estimate - moe, ymax = estimate + moe),
position = "dodge"
+ scale_fill_manual(values = cbPalette) +
) geom_text(
aes(label = year, y = 10000),
position = position_dodge(width = 1),
size = 5, hjust = 'left',
col = "black"
+
) labs(
x = "Gross Rent", y = "Number of renter-occupied housing units", fill = "Year",
title = "Distribution of monthly gross rent in the Atlanta area from 2010 to 2019",
subtitle = "Data from the 5-year ACS from 2006-2010 to 2016-2020"
+
) coord_flip() +
theme(legend.position = "none")
<- map_dfr(
df_rent_lower_quartile lst(2010, 2020),
~ get_acs(
geography = "county",
state = "GA",
variables = "B25057_001",
year = .x,
county = arc_counties,
survey = "acs5",
geometry = FALSE
),.id = "year") %>%
mutate(
county = gsub("^(.*?),.*", "\\1", NAME)
%>% select(-NAME) %>% relocate(county, .after = GEOID) %>%
) select(county, year, estimate) %>%
pivot_wider(
names_from = year, values_from = estimate,
names_prefix = "rent_"
%>%
) mutate(
rent_2010_in_2020_dollars = rent_2010 * 1.19,
ratio_adjusting_for_inflation = rent_2020 / (rent_2010 * 1.19)
)
df_rent_lower_quartile
<- map_dfr(
df_rent_median_quartile lst(2010, 2020),
~ get_acs(
geography = "county",
state = "GA",
variables = "B25058_001",
year = .x,
county = arc_counties,
survey = "acs5",
geometry = FALSE
),.id = "year") %>%
mutate(
county = gsub("^(.*?),.*", "\\1", NAME)
%>% select(-NAME) %>% relocate(county, .after = GEOID) %>%
) select(county, year, estimate) %>%
pivot_wider(
names_from = year, values_from = estimate,
names_prefix = "rent_"
%>%
) mutate(
rent_2010_in_2020_dollars = rent_2010 * 1.19,
ratio_adjusting_for_inflation = rent_2020 / (rent_2010 * 1.19)
)
df_rent_median_quartile
<- map_dfr(
df_rent_upper_quartile lst(2010, 2020),
~ get_acs(
geography = "county",
state = "GA",
variables = "B25059_001",
year = .x,
county = arc_counties,
survey = "acs5",
geometry = FALSE
),.id = "year") %>%
mutate(
county = gsub("^(.*?),.*", "\\1", NAME)
%>%
) select(-NAME) %>%
relocate(county, .after = GEOID) %>%
select(county, year, estimate) %>%
pivot_wider(
names_from = year, values_from = estimate,
names_prefix = "rent_"
%>%
) mutate(
rent_2010_in_2020_dollars = rent_2010 * 1.19,
ratio_adjusting_for_inflation = rent_2020 / (rent_2010 * 1.19)
)
df_rent_upper_quartile
<- map_dfr(
df_income_quintiles lst(2010, 2019),
~ get_acs(
geography = "tract",
state = "GA",
variables = c(
first_qui = "B19080_001",
second_qui = "B19080_002",
third_qui = "B19080_003",
fourth_qui = "B19080_004",
top_5per = "B19080_005"
),year = .x,
county = arc_counties,
survey = "acs5",
geometry = FALSE
),.id = "year") %>% mutate(
tract = gsub("^(.*?),.*", "\\1", NAME),
county = gsub("^(.*?), (.*?),.*", "\\2", NAME)
%>%
) select(-NAME) %>%
relocate(tract, county, .after = GEOID) %>%
select(year, GEOID, variable, estimate, moe) %>%
mutate(
inflation_factor = ifelse(year == "2019", 1, 1.175),
estimate = estimate * inflation_factor,
moe = moe * inflation_factor
%>%
) select(-inflation_factor)
<- df_income_quintiles %>%
df_income_quintiles_change filter(variable %in% c("first_qui", "top_5per")) %>%
pivot_wider(
names_from = year,
values_from = c(estimate, moe)
%>%
) mutate(
estimate_diff = estimate_2019 - estimate_2010,
moe_diff = sqrt(moe_2019**2 + moe_2010**2)
%>%
) select(GEOID, variable, estimate_diff, moe_diff) %>%
pivot_wider(
names_from = variable,
values_from = c(estimate_diff, moe_diff)
)
%>%
df_income_quintiles_change ggplot(aes(x = estimate_diff_top_5per)) +
geom_histogram() +
labs(
x = "Adjusted difference in 95% percentile of income between 2010 and 2019",
subtitle = "Spike occurs due to max income being capped at a particular level"
)
The Gini coefficient is a measure of income inequality frequently used, giving a number between 0 and 1 (with larger values corresponding to greater inequality). We examine the change in the estimate Gini coefficient in each census tract, comparing the periods between 2006-2010 and 2015-2019, and when doing so, we find that there has been a statistically significant increase in the Gini coefficient generally across tracts, with the Gini coefficient increasing on average by 0.014 across tracts.
<- map_dfr(
df_gini lst(2010, 2019),
~ get_acs(
geography = "tract",
state = "GA",
variables = c(
gini = "B19083_001"
),year = .x,
county = arc_counties,
survey = "acs5",
geometry = FALSE
),.id = "year") %>% mutate(
tract = gsub("^(.*?),.*", "\\1", NAME),
county = gsub("^(.*?), (.*?),.*", "\\2", NAME)
%>%
) select(-NAME) %>%
relocate(tract, county, .after = GEOID) %>%
select(year, GEOID, estimate, moe)
<- df_gini %>%
df_gini_wide pivot_wider(
values_from = c(estimate, moe),
names_from = year,
names_prefix = "gini_",
names_sep = "_"
%>%
) mutate(
diff = estimate_gini_2019 - estimate_gini_2010,
diff_moe = sqrt(moe_gini_2019**2 + moe_gini_2010**2),
diff_sd = diff_moe / 1.645,
z_score = diff / diff_sd,
ratio = estimate_gini_2019 / estimate_gini_2010,
log_ratio_moe = sqrt( (moe_gini_2010 / estimate_gini_2010)**2 + (moe_gini_2019 / estimate_gini_2019)**2 ),
log_ratio_sd = log_ratio_moe / 1.645
)
<- mean(df_gini_wide$diff, na.rm = TRUE)
mean_est <- sd(df_gini_wide$diff, na.rm = TRUE)
sd1 <- sqrt(mean(df_gini_wide$diff_sd**2, na.rm = TRUE))
sd2 <- sum(!is.na(df_gini_wide$diff))
samp_n
<- sqrt(samp_n)*mean_est/sqrt(sd1**2 + sd2**2)
t_stat <- pt(t_stat, df = samp_n, lower.tail = FALSE)
p_val print(paste("p-value of hypothesis test:", p_val))
## [1] "p-value of hypothesis test: 0.000000173458002202825"
ggplot() +
geom_sf(
data = df_tracts %>% left_join(df_gini_wide),
aes(geometry = geometry, fill = estimate_gini_2019),
colour = "grey", lwd = 0.25
+ scale_fill_viridis(
) option = "magma"
+
) geom_sf(data = arc_counties_shp, colour = "black", fill = NA) +
geom_sf(data = highway_arc, colour = "black", lty = 3, fill = NA) +
theme(legend.position = "bottom") +
labs(
fill = "Gini coefficient",
title = "Gini coefficient estimates over 2015-2019 across Atlanta",
subtitle = "(Data using ACS 2015-2019 estimates)"
+
) theme(legend.key.width = unit(2, 'cm'))
<- read_csv("data-files/ncld_class_labels.csv") %>%
land_use_labels select(layer_vals, labels) %>%
add_row(layer_vals = 0, labels = "Roads")
%>% print(n = Inf) land_use_labels
## # A tibble: 21 × 2
## layer_vals labels
## <dbl> <chr>
## 1 11 Open water
## 2 12 Perennial ice/snow
## 3 21 Developed, open space
## 4 22 Developed, low intensity
## 5 23 Developed, medium intensity
## 6 24 Developed high intensity
## 7 31 Barren land (rock/sand/clay)
## 8 41 Deciduous forest
## 9 42 Evergreen forest
## 10 43 Mixed forest
## 11 51 Dwarf scrub
## 12 52 Shrub/scrub
## 13 71 Grassland/herbaceous
## 14 72 Sedge/herbaceous
## 15 73 Lichens
## 16 74 Moss
## 17 81 Pasture/hay
## 18 82 Cultivated crops
## 19 90 Woody wetlands
## 20 95 Emergent herbaceous wetlands
## 21 0 Roads
<- read_csv("data-files/ncld_histogram_data.csv")
land_use_df
<- land_use_df %>%
land_use_df complete(GEOID, year, class) %>%
mutate(value = replace_na(value, 0))
<- land_use_df %>%
land_use_change filter(year %in% c(2008, 2019)) %>%
group_by(GEOID, year) %>%
mutate(value = value / sum(value)) %>%
ungroup() %>%
pivot_wider(
names_from = year,
values_from = value,
names_prefix = "value_"
%>%
) mutate(abs_diff = abs(value_2019 - value_2008)) %>%
group_by(GEOID) %>%
summarize(tv_dist = 0.5*sum(abs_diff))
%>%
land_use_change ggplot(aes(x = tv_dist)) + geom_histogram() +
labs(
x = "Total variation change in land use proportions of tract",
y = "Number of census tracts",
subtitle = "Examining 2008 vs 2019"
)
<- land_use_change %>%
land_use_change_shp mutate(GEOID = as.character(GEOID)) %>%
left_join(df_tracts)
<- ggplot() +
land_use_plot geom_sf(
data = land_use_change_shp %>%
mutate(tv_dist = pminmax(tv_dist, 0, 0.05)),
aes(geometry = geometry, fill = tv_dist),
colour = "grey", lwd = 0.25
+ scale_fill_viridis(
) option = "viridis",
labels = c("0", "0.01", "0.02", "0.03", "0.04", "0.05+")
+
) geom_sf(data = arc_counties_shp, colour = "black", fill = NA) +
geom_sf(data = highway_arc, colour = "black", lty = 3, fill = NA) +
theme(legend.position = "bottom") +
labs(
fill = "Overall change in land type",
title = "Total variation between type of land use between 2008 and 2019",
subtitle = "(Data using NCLD National Land Cover Database)"
+
) theme(legend.key.width = unit(2, 'cm'))
land_use_plot
ggsave("img-files/land_use_change.pdf")
<- ggplot() +
land_use_plot_outliers geom_sf(
data = land_use_change_shp,
aes(geometry = geometry, fill = tv_dist),
colour = "grey", lwd = 0.25
+ scale_fill_viridis(
) option = "viridis"
+
) geom_sf(data = arc_counties_shp, colour = "black", fill = NA) +
geom_sf(data = highway_arc, colour = "black", lty = 3, fill = NA) +
theme(legend.position = "bottom") +
labs(
fill = "Overall change in land type\n(Not truncated)",
title = "Total variation between type of land use between 2008 and 2019",
subtitle = "(Data using NCLD National Land Cover Database)"
+
) theme(legend.key.width = unit(2, 'cm'))
land_use_plot_outliers
+
land_use_plot coord_sf(xlim = c(-84.53, -84.2), ylim = c(33.6, 33.95), expand = FALSE)
ggsave("img-files/land_use_change_itp.pdf")
<- land_use_df %>%
land_use_change_2001 filter(year %in% c(2001, 2019)) %>%
group_by(GEOID, year) %>%
mutate(value = value / sum(value)) %>%
ungroup() %>%
pivot_wider(
names_from = year,
values_from = value,
names_prefix = "value_"
%>%
) mutate(abs_diff = abs(value_2019 - value_2001)) %>%
group_by(GEOID) %>%
summarize(tv_dist = 0.5 * sum(abs_diff))
%>%
land_use_change_2001 ggplot(aes(x = tv_dist)) + geom_histogram() +
labs(
x = "Total variation change in land use proportions of tract",
y = "Number of census tracts",
subtitle = "Examining 2001 vs 2019"
)
<- land_use_change_2001 %>%
land_use_change_shp_2001 mutate(GEOID = as.character(GEOID)) %>%
left_join(df_tracts)
<- ggplot() +
land_use_plot geom_sf(
data = land_use_change_shp_2001 %>%
mutate(tv_dist = pminmax(tv_dist, 0, 0.15)),
aes(geometry = geometry, fill = tv_dist),
colour = "grey", lwd = 0.25
+ scale_fill_viridis(
) option = "viridis",
labels = c("0", "0.05", "0.10", "0.15+")
+
) geom_sf(data = arc_counties_shp, colour = "black", fill = NA) +
geom_sf(data = highway_arc, colour = "black", lty = 3, fill = NA) +
theme(legend.position = "bottom") +
labs(
fill = "Overall change in land type",
title = "Total variation between type of land use between 2001 and 2019",
subtitle = "(Data using NCLD National Land Cover Database)"
+
) theme(legend.key.width = unit(2, 'cm'))
land_use_plot
ggsave("img-files/land_use_change_2001.pdf")
<- ggplot() +
land_use_plot_outliers geom_sf(
data = land_use_change_shp_2001,
aes(geometry = geometry, fill = pminmax(tv_dist, 0, 0.3)),
colour = "grey", lwd = 0.25
+ scale_fill_viridis(
) option = "viridis"
+
) geom_sf(data = arc_counties_shp, colour = "black", fill = NA) +
geom_sf(data = highway_arc, colour = "black", lty = 3, fill = NA) +
theme(legend.position = "bottom") +
labs(
fill = "Overall change in land type\n(Truncated at 0.3)",
title = "Total variation between type of land use between 2001 and 2019",
subtitle = "(Data using NCLD National Land Cover Database)"
+
) theme(legend.key.width = unit(2, 'cm'))
land_use_plot_outliers
+
land_use_plot coord_sf(xlim = c(-84.53, -84.2), ylim = c(33.6, 33.95), expand = FALSE)
ggsave("img-files/land_use_change_itp_2001.pdf")
As we can see, looking at the most frequent “class” (i.e, the way in which the land has been used the most) tends to not change much across the two time periods, although there has been a growth in the more developed areas.
<- land_use_df %>%
land_use_df_summary mutate(class = fct_collapse(
factor(class), "Roads" = "0",
"Developed, open space" = "21",
"Developed, low intensity" = "22",
"Developed, medium intensity" = "23",
"Developed, high intensity" = "24",
"Undeveloped / greenland" = c(
"11", "12", "31", "41", "42",
"43", "51", "52", "71", "72", "73", "74",
"81", "82", "90", "95"
)%>%
)) group_by(GEOID, year, class) %>%
summarize(value = sum(value)) %>%
filter(year %in% c(2008, 2019)) %>%
group_by(GEOID, year) %>%
mutate(value = value / sum(value))
%>%
land_use_df_summary pivot_wider(
names_from = year,
values_from = value,
names_prefix = "value_"
%>%
) mutate(value_diff = value_2019 - value_2008) %>%
group_by(GEOID) %>%
summarize(tv_diff = 0.5*sum(abs(value_diff))) %>%
ggplot(aes(x = tv_diff)) + geom_histogram() +
labs(
x = "Total variation difference between proportion of land use by tract from 2008 to 2019",
title = "Change in land-use (using reduced classes)"
)
%>%
land_use_df_summary group_by(GEOID, year) %>%
mutate(max_value = max(value)) %>%
ungroup() %>%
filter(value == max_value) %>%
select(year, class) %>%
group_by(year, class) %>%
summarize(count = n()) %>%
pivot_wider(
values_from = count,
names_from = year
)
We can also see that both inside and outside of the perimeter, there is generally a trend towards more developed areas.
%>%
land_use_df mutate(class = fct_collapse(
factor(class), "Roads" = "0",
"Developed, open space" = "21",
"Developed, low intensity" = "22",
"Developed, medium intensity" = "23",
"Developed, high intensity" = "24",
"Undeveloped / greenland" = c(
"11", "12", "31", "41", "42",
"43", "51", "52", "71", "72", "73", "74",
"81", "82", "90", "95"
)%>%
)) mutate(GEOID = as.character(GEOID)) %>%
left_join(
%>%
df_tracts select(GEOID, class) %>%
rename(peri_class = class) %>%
st_drop_geometry()
%>%
) group_by(year, class, peri_class) %>%
summarize(value = sum(value)) %>%
ungroup() %>%
pivot_wider(
values_from = value,
names_from = peri_class
%>%
) group_by(year) %>%
mutate(itp = itp / sum(itp), otp = otp / sum(otp)) %>%
filter(year %in% c(2008, 2019)) %>%
pivot_wider(
names_from = year,
values_from = c(itp, otp)
%>% filter(class != "Roads") )
To interpret the types of land better, we fit a K-mediods clustering algorithm to the vector of proportions of land use, after having reduced the number of classes slightly - for example, we our purposes we do not need three different types of forest class. To decide on the number of clusters, we used a gap statistic to decide on an appriopate number of clusters, which gave the choice of 8 clusters. The clusters were labelled according to the mediods of the clusters; we then plot the cluster assignments by tract across 2001 and 2019. Overall, 287 of the 783 areas considering changed in their type of use.
<- land_use_df %>%
land_use_cluster filter(year %in% c(2001, 2019)) %>%
filter(class != 0) %>%
mutate(class = fct_collapse(factor(class),
"1X" = c("11", "12"),
"21" = "21",
"22" = "22",
"23" = "23",
"24" = "24",
"31" = "31",
"4X" = c("41", "42", "43"),
"5X" = c("51", "52"),
"7X" = c("71", "72", "73", "74"),
"8X" = c("81", "82"),
"9X" = c("90", "95")
%>%
)) group_by(GEOID, year, class) %>%
summarize(value = sum(value)) %>%
group_by(GEOID, year) %>%
mutate(value = value / sum(value)) %>%
ungroup() %>%
pivot_wider(
names_from = class,
values_from = value,
names_prefix = "class_"
)
<- clusGap(
cg 3:13], clara, samples = 100,
land_use_cluster[, K.max = 15, B = 100
)
print(cg)
<- clara(
land_clusters 3:13], 8, metric = "manhattan",
land_use_cluster[, samples = 100
)
print(land_clusters)
write_csv(
as_tibble(land_clusters$medoids),
"data-files/cluster_mediods.csv"
)
write_csv(
as_tibble(land_clusters$clustering),
"data-files/cluster-assignments.csv"
)
<- read_csv(
cluster_mediods "data-files/cluster_mediods.csv"
)
<- read_csv(
cluster_assignments "data-files/cluster-assignments.csv"
)
cluster_mediods
%>% print(n = Inf) land_use_labels
## # A tibble: 21 × 2
## layer_vals labels
## <dbl> <chr>
## 1 11 Open water
## 2 12 Perennial ice/snow
## 3 21 Developed, open space
## 4 22 Developed, low intensity
## 5 23 Developed, medium intensity
## 6 24 Developed high intensity
## 7 31 Barren land (rock/sand/clay)
## 8 41 Deciduous forest
## 9 42 Evergreen forest
## 10 43 Mixed forest
## 11 51 Dwarf scrub
## 12 52 Shrub/scrub
## 13 71 Grassland/herbaceous
## 14 72 Sedge/herbaceous
## 15 73 Lichens
## 16 74 Moss
## 17 81 Pasture/hay
## 18 82 Cultivated crops
## 19 90 Woody wetlands
## 20 95 Emergent herbaceous wetlands
## 21 0 Roads
%>%
land_use_cluster cbind(cluster_assignments) %>%
mutate(cluster = value) %>%
select(GEOID, year, cluster) %>%
pivot_wider(
names_from = year,
values_from = cluster,
names_prefix = "cluster_"
%>%
) summarize(changed_tracts = sum(cluster_2001 != cluster_2019))
<- land_use_cluster %>%
land_use_cluster_classes cbind(cluster_assignments) %>%
mutate(cluster = value, GEOID = as.character(GEOID)) %>%
mutate(
cluster = fct_recode(factor(cluster),
"Mostly forest / farmland" = "1",
"Mostly forest / some developed open + low intensity space" = "2",
"Mostly developed open + low intensity / some forest" = "3",
"Mostly developed space (all intensities) / some forest" = "4",
"Mostly developed open + low intensity, little medium/high" = "5",
"Mostly developed open + low intensity, some medium/high" = "6",
"Mostly developed space (all intensities)" = "7",
"Mostly medium to high intensity developed space" = "8",
%>%
)) select(GEOID, year, cluster)
<- c(
pal "#4EB265", "#CAE0AB",
"#F7F056", "#EE8026",
"#7BAFDE", "#1965B0",
"#882E72", "#DC050C"
)
ggplot() +
geom_sf(
data = land_use_cluster_classes %>% left_join(df_tracts),
aes(geometry = geometry, fill = cluster),
colour = "grey", lwd = 0.25
+
) geom_sf(data = arc_counties_shp, colour = "black", fill = NA) +
geom_sf(data = highway_arc, colour = "black", lty = 3, fill = NA) +
facet_wrap( ~ year) + scale_fill_manual(values = pal) +
theme(
legend.position = "bottom",
legend.direction = "vertical"
+ guides(fill = guide_legend(ncol = 2)) +
) labs(
fill = "Type of land",
title = "Classification of type of land use in 2001 and 2019",
subtitle = "(Data using NCLD National Land Cover Database)"
)
ggsave("img-files/land_use_change_2001_2019_clusters.pdf")
Suppose we make noisy observations \(((X_i, Y_i))_{i =1, \ldots, N}\) of some underlying quantities \(( (\mu_{X, i}, \mu_{Y, i}) )_{i = 1, \ldots, N}\) with known measurement standard errors \(\sigma_{X, i}\) and \(\sigma_{Y, i}\) respectively, and that the observations are made independently of each other. This is an assumption which is usually satisified by ACS data, where we make observations of multiple variables across different census tracts, with the noise arising from sampling uncertainity in not being able to survey all households. Here, \(i\) acts as the index for census tracts, and \(\mu_{X, i}\) and \(\mu_{Y, i}\) represent some underlying variables (e.g median income and the Gini index in a census tract) which we are interested in).
In such a scenario, it is typical to assume that conditional on the \(\mu_{X, i}\) and \(\mu_{Y, i}\) we have that \[ \begin{pmatrix} X_i \\ Y_i \end{pmatrix} \sim N\Big( \begin{pmatrix}
\mu_{X, i} \\ \mu_{Y, i} \end{pmatrix}, \begin{pmatrix} \sigma_{X, i}^2 & 0
\\ 0 & \sigma_{Y, i}^2 \end{pmatrix} \Big). \] Making the additional assumption that \[ \begin{pmatrix} \mu_{X, i} \\ \mu_{Y, i} \end{pmatrix} \sim
N\Big( \begin{pmatrix} \mu_{X} \\ \mu_{Y} \end{pmatrix}, \begin{pmatrix} \tau_X^2 & \rho \tau_X \tau_Y \\ \rho \tau_X \tau_Y & \tau_Y^2 \end{pmatrix} \Big)
\] independently across the units \(i\), then we are interested in making inference about \(\rho\). To do so, we make use of the fact that under the above assumptions, the marginal distribution of the \((X_i, Y_i)\) is \[ \begin{pmatrix} X_i \\ Y_i \end{pmatrix} \sim
N\Big( \begin{pmatrix} \mu_{X} \\ \mu_{Y} \end{pmatrix}, \begin{pmatrix}
\sigma_{X, i}^2 + \tau_X^2 & \rho \tau_X \tau_Y \\ \rho \tau_X \tau_Y & \sigma_{Y, i}^2 + \tau_Y^2
\end{pmatrix} \Big).
\] As a result, under the assumption that \(N^{-2} \sum_{i=1}^N (\sigma_{X, i}^4 + \sigma_{Y, i}^4) \to 0\) as \(N \to \infty\), a consistent estimator for \(\rho\) is given by \[\frac{ N^{-1} \sum_{i=1}^N (X_i - \bar{X})(Y_i - \bar{Y}) }{ \big( N^{-1} \sum_{i=1}^N [ (X_i - \bar{X})^2 - \sigma_{X, i}^2 ] \big)^{1/2} \big( N^{-1} \sum_{i=1}^N [ (Y_i - \bar{Y})^2 - \sigma_{Y, i}^2 ] \big)^{1/2} }\] where we write \(\bar{X} = N^{-1} \sum_{i=1}^N X_i\) and \(\bar{Y} = N^{-1} \sum_{i=1}^N Y_i\). In order to perform hypothesis testing, a bootstrap of the observed data \((X_i, Y_i, \sigma_{X, i}, \sigma_{Y, i})\) is used to approximate the sampling distribution of \(\rho\). We note that in finite samples, one of the square root terms may be negative; in such a scenario, we return the value of \(\rho\) as a NA
; if this occurs frequently enough, then we will just consider the data too noisy to make reasonable inference for the presence of a correlation.
Suppose we have observations \(X\) and \(Y\) whose expected values are \(\mu_X\) and \(\mu_Y\), and whose standard errors are \(\sigma_X\) and \(\sigma_Y\) respectively, and moreover suppose that \(X\) and \(Y\) are independent. In such a scenario, the estimate of \(\log(X/Y)\) is approximated by \(\log( \mu_X / \mu_Y )\) and the standard error is approximated by \[ \Big( \frac{ \sigma_X^2}{ \mu_X^2} + \frac{ \sigma_Y^2}{\mu_Y^2} \Big)^{1/2}.\] Looking at the ratio of quantities is suitable when we are interesting in looking at quantities which may be on different scales naturally, such as e.g the number of households in an census tract in different income bands, where census tracts may vary significantly in the number of households, but we expect the overall ratios between quantities to be similar.