Massachusetts Building Analysis Dashboard

Interactive Visualization of Building Inventory Data with Soil Analysis

NSI-Enhanced USA Structures Dataset

Dashboard Overview

Comprehensive analysis of Massachusetts building inventory from NSI-Enhanced USA Structures Dataset

Total Buildings (Cleaned)
Average Year Built
Avg Area (sqm)
Identified Clusters

About This Dashboard

This interactive dashboard analyzes building data from the NSI-Enhanced USA Structures Dataset for Massachusetts. The analysis includes clustering patterns, temporal distributions, material characteristics, and soil properties of buildings across different time periods. All visualizations use color-blind friendly palettes and are fully interactive. Developed by Lang Shao (Fall 2025) and Tanvi Agarwal (Spring 2026) under the supervision of Prof. Demi Fang of the Structural Futures Lab. Data visualizations may not be suitable for distribution at this time and should include attribution. If you have any questions, please contact us.


MA Building Hierarchical Distribution

Multi-level breakdown: Occupancy → Area → Height → Year → Drainage

Construction Year → Occupancy → Material → Foundation → Soil

Base: Year → Occupancy. Toggle columns to the right.


Occupancy Class Hierarchy

Breakdown of Occupancy Classes (OCC_CLS) into Primary Occupancy types (PRIM_OCC).

OCC_CLS → NSI occtype matches

Each link sums the number of NSI points in polygons whose OCC_CLS equals the left-hand class. Counts are pooled per class (RES pool, COM pool, ...); points in other classes do not affect this pool.

Notes on NSI Damage Categories vs. Our Sankey Labels

The NSI technical documentation states that certain occtypes are folded into broader ‘damage categories’: AGR and REL are counted under Commercial, while GOV and EDU are counted under Public. In this Sankey, we intentionally retain the original occtype labels and do not re-bucket them into those damage-category umbrellas (e.g., REL is not folded into Commercial).

Occupancy Homogeneity Score (MIX_SC) Distribution

Distribution of buildings based on the homogeneity of NSI point types within their footprint.

MIX_SC Categories Explained

Same Type Only (NaN in data): All NSI points inside the building polygon are of the same primary type as the building itself.
1 Conflict Type (MIX_SC1): No NSI points of the same type as the building, and all conflicting points are of a single different type.
Same & Different Types (MIX_SC2): The building contains NSI points of its own type plus one or more conflicting types.
>1 Conflict Types (MIX_SC3): The building contains no NSI points of the same type as the building, and has two or more different conflicting types.

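The four categories above can be expressed as a small decision rule. The following is a minimal sketch, not the dashboard's actual code: the function name mix_sc_category and the list-of-type-codes input are assumptions for illustration, and treating a footprint with no NSI points like "Same Type Only" is also an assumption.

```python
def mix_sc_category(building_type, nsi_types):
    """Return the MIX_SC label for one building.

    building_type: primary type of the building polygon (e.g. "RES").
    nsi_types: types of the NSI points inside the footprint.
    Returns None for "Same Type Only" (stored as NaN in the data) and,
    as an assumption of this sketch, for footprints with no NSI points.
    """
    conflicts = {t for t in nsi_types if t != building_type}
    has_same = building_type in nsi_types
    if not conflicts:
        return None            # Same Type Only
    if has_same:
        return "MIX_SC2"       # Same & Different Types
    if len(conflicts) == 1:
        return "MIX_SC1"       # 1 Conflict Type
    return "MIX_SC3"           # >1 Conflict Types
```

For example, a residential footprint containing only COM and IND points would fall into MIX_SC3 under this rule.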
Data Pipelines & Processing Pipeline

Understanding the data sources, predictions, cleaning, and distribution

Data Pipeline Overview

This section visualizes the journey of forging our powerful, multi-layered dataset from three distinct sources. We began with the USA Structures building inventory(MA only*) as our foundational layer. This base was then systematically enriched, first by incorporating structural characteristics('Year Built', 'Foundation Type', etc.) from the National Structure Inventory(NSI), and second, by adding crucial geotechnical context from the Web Soil Survey. The following diagrams visualize these complex joins, data cleaning procedures, and the final composition of the dataset...

NSI-Enhanced USA Structures Dataset Composition

Click on any data source to explore its contributed columns

Stage 1: Spatial Join to Create NSI Enhanced Version

USA Structures (Base)

2,091,488 Records

38 Columns

+

NSI Data Points

2,095,529 Records

15 Columns Added

Operation: Advanced Multi-Stage Spatial Join

An enhanced, multi-stage process was implemented to accurately enrich building footprints with NSI point data. This updated methodology features flexible handling of mixed-use properties, a precise nearest-neighbor buffer match, and systematic occupancy conflict detection to ensure data quality.

  • Strategy 1: Intelligent Single-Family Matching
    • For buildings classified as 'Single Family', the process now flexibly considers both residential (RES) and commercial (COM) NSI points inside. This accommodates mixed-use scenarios like in-home businesses.
    • If one point is found, a direct one-to-one match is made.
    • If multiple points are found, their attributes are aggregated to create a composite profile, replacing the previous centroid-based selection. This robustly handles properties with multiple distinct units (e.g., a house with a separate commercial unit).
  • Strategy 2: Standard Aggregation for Other Buildings
    • For all other building types (multi-family, commercial, etc.), all NSI points falling within the footprint are used.
    • Their attributes are aggregated to create a comprehensive profile for the building:
      • Value & Area (`structure_value`, `nsi_sqft`): Summed to get a total.
      • Stories (`nsi_num_story`): The maximum value is taken.
      • Characteristics (`year_built`, `material_type`): The statistical mode (most frequent value) is used.
  • Strategy 3: Nearest Neighbor Buffer Match
    • For NSI points that remain unmatched, this strategy finds the single nearest building polygon within a 5-meter radius.
    • This ensures each point is uniquely assigned to its closest building, correcting for minor spatial inaccuracies. A single building can "absorb" multiple nearby points via this method.
    • A configurable option also allows buildings already matched in earlier stages to absorb additional nearby points, capturing features like adjacent garages or utility structures.
  • Extra Feature: Systematic Occupancy Conflict Detection
    • Throughout the process, the script will actively compare the land use category of the NSI point (e.g., 'Commercial') against the category of the building polygon it falls into (e.g., 'Residential').
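The aggregation rules in Strategy 2 (sum, maximum, mode) can be sketched in a few lines. This is an illustrative sketch only: the dictionary-per-point layout and the function names are assumptions, and breaking mode ties in favor of the first value seen is a choice made here, not something the pipeline description specifies.

```python
from collections import Counter

def most_frequent(values):
    """Statistical mode; ties go to the value seen first (an assumption)."""
    values = [v for v in values if v is not None]
    return Counter(values).most_common(1)[0][0] if values else None

def aggregate_nsi_points(points):
    """Combine the NSI points inside one footprint into a single profile."""
    return {
        # Value and area are summed to get a total.
        "structure_value": sum(p["structure_value"] for p in points),
        "nsi_sqft": sum(p["nsi_sqft"] for p in points),
        # Stories take the maximum value.
        "nsi_num_story": max(p["nsi_num_story"] for p in points),
        # Characteristics take the most frequent value.
        "year_built": most_frequent(p["year_built"] for p in points),
        "material_type": most_frequent(p["material_type"] for p in points),
    }
```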
Stage Details & Unmatched Points (Click for detail)

Result: NSI Enhanced Structures v1

2,091,488 Records

53 Columns (38 Base + 15 from NSI)

A Left Join was performed, so all original buildings were retained. Unmatched buildings (405,037 in total) have NaN values for the NSI columns (year_built, foundation_type, etc.).

Stage 1.5: 'Unclassified' buildings from USA Structures were re-defined using NSI point data

How we re-label “Unclassified” using OCC_DICT

  • Vote by counts in OCC_DICT (e.g., RES: 0, COM: 8, IND: 1, GOV: 0, EDU: 0 ... → Commercial).
  • REL is counted as Assembly, following the USA Structures PRIM_OCC column (e.g., RES: 1, IND: 1, REL: 2 → Assembly).
  • If all RES/COM/IND/GOV/EDU/AGR/REL are 0 → keep Unclassified.
  • If there is a tie in the vote (e.g., RES: 1, COM: 1), the building will remain Unclassified.

This relabeling occurs in the data cleaning step before any downstream charts/tables, so all occupancy analyses reflect the updated OCC_CLS.
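The voting rules above can be sketched as follows. This is a hedged illustration rather than the actual cleaning script: the function name relabel_unclassified is hypothetical, and the "Agriculture" label for AGR is an assumption (the other labels follow the examples given above).

```python
# Class labels per NSI code. REL maps to Assembly as described above;
# the "Agriculture" label for AGR is an assumption of this sketch.
LABELS = {
    "RES": "Residential", "COM": "Commercial", "IND": "Industrial",
    "GOV": "Government", "EDU": "Education", "AGR": "Agriculture",
    "REL": "Assembly",
}

def relabel_unclassified(occ_dict):
    """Pick a new OCC_CLS from NSI point counts, or keep 'Unclassified'."""
    votes = {}
    for code, count in occ_dict.items():
        label = LABELS.get(code)
        if label and count > 0:
            votes[label] = votes.get(label, 0) + count
    if not votes:
        return "Unclassified"          # all counts are zero
    top = max(votes.values())
    winners = [label for label, n in votes.items() if n == top]
    # A tie for first place leaves the building Unclassified.
    return winners[0] if len(winners) == 1 else "Unclassified"
```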

Unclassified Reclassification Summary

How many Unclassified records were re-labeled into each class

Total Unclassified (Before)
With OCC_DICT
Changed
Kept Unclassified
Tie-Breaker Situations (Kept as Unclassified) - Click to expand

Stage 2: Building the Enhanced Soil Layer & Final Join

NSI-Enhanced USA Structures Dataset v1

2,091,488 Records (Polygons)

53 Columns

Input: GPKG (Preserves footprints)
+

Web Soil Survey (WSS) Data

Sources: gsmsoilmu_a_ma.shp

Tables: comp.txt, chorizon.txt

12 Columns

Operation: Soil Enrichment & Area-Weighted Spatial Join

  • Part A: Preparing the Enhanced Soil Layer
    • 1. Component Filtering: Reads comp.txt and selects only the single Dominant Component (highest percentage) for each Map Unit.
    • 2. Horizon Filtering: Reads chorizon.txt and selects only Topsoil properties (depth < 10cm) to capture engineering characteristics relevant to foundations.
    • 3. Merge: Attributes are merged onto the Soil Shapefile to create a single, simplified soil layer.
  • Part B: Spatial Intersection (EPSG:26986)
    • Data is projected to Mass State Plane (Meters) for accurate area measurement.
    • A Polygon-on-Polygon intersection (predicate='intersects') is performed between buildings and soil layers.
  • Part C: Area-Weighted Conflict Resolution
    • Problem: Some buildings straddle the boundary between two or more soil map units.
    • Solution: The script calculates the exact Overlap Area for every match. If a building touches multiple soil units, it is assigned to the one with the largest intersection area.
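The largest-overlap rule in Part C can be illustrated with a simplified sketch. To stay self-contained it uses axis-aligned rectangles (xmin, ymin, xmax, ymax) in place of real building and soil polygons, so the geometry here is an assumption for illustration only; the function names are hypothetical, and the "Unmatched" label mirrors the cleanup note for buildings outside soil coverage.

```python
def overlap_area(a, b):
    """Intersection area of two boxes given as (xmin, ymin, xmax, ymax)."""
    width = min(a[2], b[2]) - max(a[0], b[0])
    height = min(a[3], b[3]) - max(a[1], b[1])
    return max(width, 0) * max(height, 0)

def dominant_soil_unit(building, soil_units):
    """Return the key of the soil unit with the largest overlap.

    soil_units maps a map-unit key to its box. A building that overlaps
    no soil unit is labeled "Unmatched".
    """
    best_key, best_area = "Unmatched", 0.0
    for mukey, box in soil_units.items():
        area = overlap_area(building, box)
        if area > best_area:
            best_key, best_area = mukey, area
    return best_key
```

With a 10 x 10 building that overlaps unit A by 30 square units and unit B by 70, the building is assigned to B.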

Result: NSI-Enhanced USA Structures Dataset v2

2,091,488 Records

65 Total Columns (53 + 12 from Soil)

Preserved Geometry: Polygons

Cleanup: IDs renamed to soil_mukey/soil_cokey. Buildings outside soil map coverage are labeled as "Unmatched" (retained via Left Join).

Stage 2.5: CLF-Based Foundation Classification

NSI-Enhanced USA Structures Dataset v2

Input: GPKG (Polygons)

Target Column: foundation_type

65 Columns

+

CLF Categorization

Carbon Leadership Forum

Target Column: str_fdn_type

Operation: Dictionary Mapping & Type Cleaning

Mapping specific NSI foundation codes to broader str_fdn_type for standardized Carbon Analysis.

NSI Codes (Original) | General Type (Mapped)
C, B, S, W, F (Crawl, Basement, Slab, Wall, Fill) | Shallow Foundation
P, I (Pier, Pile) | Deep Foundation < 50' (15m)
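The mapping in this table amounts to a dictionary lookup. A minimal sketch, assuming foundation codes arrive as single-letter strings; the function name to_general_fnd_type and the choice to return None for missing or unknown codes are assumptions for illustration.

```python
SHALLOW = "Shallow Foundation"
DEEP = "Deep Foundation < 50' (15m)"

# NSI foundation code -> general type, as listed in the table above.
NSI_TO_GENERAL = {
    "C": SHALLOW, "B": SHALLOW, "S": SHALLOW, "W": SHALLOW, "F": SHALLOW,
    "P": DEEP, "I": DEEP,
}

def to_general_fnd_type(code):
    """Map one NSI foundation code; None for missing or unknown codes."""
    if code is None:
        return None
    return NSI_TO_GENERAL.get(str(code).strip().upper())
```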

Result: NSI-Enhanced USA Structures Dataset v2.5

2,091,488 Records (Polygons Preserved)

+1 Column: general_fnd_type

66 Columns

Data Cleaning: All object columns converted to String to ensure GPKG stability and prevent "Error adding field" issues.

Stage 3: Enriching with Demolition Permit Data

NSI-Enhanced USA Structures Dataset v2.5

Input: GPKG (Polygons)

66 Columns

Includes general_fnd_type
+

Boston Approved Permit Dataset

Source: tmpbtz4x7bc.csv

Filter: EXTDEM, INTDEM, RAZE

3 Key Columns

Operation: Spatial Join (Polygons Preserved)

  • 1. Priority-Based Deduplication: Filters permits and selects the "best" record per address.
    Priority Rule: Closed/Completed > Open > Most Recent Date.
  • 2. CRS Alignment (EPSG:2249): Both datasets are projected to MA State Plane (Meters) for precise distance calculation.
  • 3. Nearest Neighbor Join (5m Radius): Uses sjoin_nearest to find the single closest permit within 5 meters (from polygon edge).
    Matching Statistics (Total: 5,018):
    Exact Matches (In Polygon): 4,922 (98.1%)
    Buffer Matches (<5m): 96 (1.9%)
  • 4. Non-Destructive Merge: New attributes (`DEMOLITION_TYPE`, `DATE`, `STATUS`) are joined back using Index Alignment. This guarantees zero data loss and perfectly preserves original Polygon geometry.
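Step 1 (priority-based deduplication) can be sketched as below. This reflects one reading of the priority rule, namely status first and most recent date as the tie-breaker. The field names address, status, and date, and the function name, are hypothetical and do not come from the permit dataset.

```python
from datetime import date

# Lower rank wins; any other status ranks last (an assumption).
STATUS_RANK = {"closed": 0, "completed": 0, "open": 1}

def best_permit_per_address(permits):
    """Keep one permit per address: best status, then most recent date."""
    best = {}
    for permit in permits:
        key = (
            STATUS_RANK.get(permit["status"].lower(), 2),
            -permit["date"].toordinal(),   # newer dates sort first
        )
        address = permit["address"]
        if address not in best or key < best[address][0]:
            best[address] = (key, permit)
    return {address: permit for address, (key, permit) in best.items()}
```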

Result: NSI-Enhanced USA Structures Dataset v3

2,091,488 Records

69 Total Columns (66 + 3 from Permits)

Geometry: MultiPolygon (Unchanged)

Validation: Original row count (2,091,488) perfectly preserved.
Final Format: GPKG (Layer: structures_demolition)

Stage 4: MassGIS Parcel Integration & Temporal Fusion Strategy

NSI-Enhanced USA Structures Dataset v3

Input: GPKG (Polygons)

2,091,488 Records

69 Columns

Contains original NSI Year Data
+

MassGIS Parcels L3 Data

Source: Parquet (Chunked)

Unique Parcels: 2,623,246

1 Column

Processed via Dask LocalCluster

Operation: High-Performance Spatial Join & Logic-Based Fusion

  • 1. Centroid-Based Spatial Indexing: To optimize performance and accuracy, building geometries were converted to Centroids before performing a `within` spatial join against the MassGIS Parcel polygons.
  • 2. Temporal Conflict Resolution (Latest Year Heuristic):
    Handling parcels with multiple build years:
    • When a single building matched multiple parcel records (potential duplicates or subdivisions), the system prioritized the most recent construction year.
    • Method: Data was sorted by [BUILD_ID, YEAR_BUILT] in [Ascending, Descending] order, retaining only the top record.
  • 3. Data Cleaning: Outliers were removed by filtering massgis_yr_built to the valid range of 1630-2025 (1,709 invalid values detected and removed).
  • 4. Source Prioritization Strategy:

    The decision logic for merging MassGIS and NSI year data is illustrated below:

    Source Tracking Workflow Logic
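Steps 2 to 4 can be sketched roughly as follows. The workflow diagram is not reproduced in this text, so the priority order used here (a valid MassGIS year first, then NSI, then no year) is inferred from the fusion statistics and should be read as an assumption, as should the source labels and the function names.

```python
MIN_YEAR, MAX_YEAR = 1630, 2025

def latest_year_per_building(rows):
    """rows: (BUILD_ID, YEAR_BUILT) pairs; keep the most recent year.

    Mirrors sorting by BUILD_ID ascending and YEAR_BUILT descending,
    then retaining the top record per building.
    """
    latest = {}
    for build_id, year in sorted(rows, key=lambda r: (r[0], -r[1])):
        latest.setdefault(build_id, year)
    return latest

def fuse_year(massgis_year, nsi_year):
    """Return (year, source) for one building. Labels are hypothetical."""
    if massgis_year is not None and MIN_YEAR <= massgis_year <= MAX_YEAR:
        return massgis_year, "MassGIS"
    if nsi_year is not None:
        return nsi_year, "NSI"
    return None, "None"
```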

Data Fusion Statistics

97.79% Sourced from MassGIS (2,045,196 buildings)
0.66% Filled by NSI (13,720 buildings)
1.55% No Year Data (32,572 buildings)

Result: Final Integrated Dataset

2,091,488 Records (Polygons Preserved)

+3 Columns: massgis_yr_built, nsi_yr_built, yr_built_belong

72 Columns Total

Output: ma_structures_FINAL_with_YR_SOURCE.gpkg

Data Sources vs. Final Result

Compare the original data sources (NSI vs MassGIS) or view the final cleaned distribution.

GeoPackage to JSON: Cleaning Process

How the .gpkg data is filtered, cleaned, and preprocessed into a couple of .json files

NSI Methodology Explained

The National Structure Inventory (NSI) sources key building attributes, such as year built and construction material, primarily from the commercial data provider Lightbox. Where values are missing from the Lightbox data, the NSI fills the gaps with a random imputation methodology based on HAZUS tables, which improves the dataset's overall completeness. The diagram below shows the fill rate of attributes obtained directly from Lightbox; the remaining values may have been imputed from HAZUS tables.

NSI Data Sources & Predictions

How building material and foundation type data are obtained

Data Source Information

Lightbox provides 2,542,265 total MA building data records. Building material data is available for 1,208,023 records (47.52% coverage), and foundation type data for 54,497 records (2.14% coverage). Missing values are predicted using HAZUS methodology.

Removed Data Analysis

Explore buildings removed during data cleaning, categorized by missing features, geography, size, and year.

Why This Matters

This section evaluates whether data removal introduces bias — such as removing buildings disproportionately from certain cities, time periods, sizes, or materials.

Total Removed Buildings
Most Common Removal Reason
Avg Year (Removed)
Avg Size (sqm)

Year Analysis

City Analysis

Size Distribution

Occupancy Analysis

Material & Foundation Distribution

Material & Foundation Removal Rates

Soil Properties and Risk Analysis

Comprehensive analysis of soil conditions and their impact on building infrastructure

High Risk Buildings
Avg Water Table (cm)
Poor Drainage Sites
Flood Risk Buildings

Soil Data Categories

Drainage Classes: Well drained, Moderately well drained, Somewhat excessively drained, Poorly drained, Very poorly drained, Excessively drained
Flooding Frequency: Low, Moderate, High
Engineering Properties: <= 0.17 Favorable; > 0.17 and <= 0.24 Fair; > 0.24 and <= 0.32 Poor; > 0.32 Very poor
Soil Component: Various soil types identified by compname field
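The Engineering Properties thresholds listed above translate into a simple binning function. A minimal sketch, assuming the property is available as a plain number; the function name rate_engineering_value is hypothetical.

```python
def rate_engineering_value(value):
    """Bin a soil engineering value using the thresholds listed above."""
    if value <= 0.17:
        return "Favorable"
    if value <= 0.24:
        return "Fair"
    if value <= 0.32:
        return "Poor"
    return "Very poor"
```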

Risk Assessment Methodology

High-risk buildings are identified based on poor drainage conditions (Poorly drained or Very poorly drained) and/or frequent flooding risk (Occasional or Frequent). These conditions can impact foundation stability, basement flooding potential, and overall structural integrity over time. Buildings in high-risk zones may require additional maintenance and waterproofing measures.
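The high-risk rule described above can be written as a single predicate. This is a sketch under stated assumptions: the function name is hypothetical, and the inputs are assumed to be the drainage class and flooding frequency strings exactly as named in this section.

```python
POOR_DRAINAGE = {"Poorly drained", "Very poorly drained"}
FLOOD_PRONE = {"Occasional", "Frequent"}

def is_high_risk(drainage_class, flooding_frequency):
    """True if either the drainage or the flooding condition is met."""
    return drainage_class in POOR_DRAINAGE or flooding_frequency in FLOOD_PRONE
```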

Clustering Analysis

K-means clustering results based on building area, year built, and occupancy class (using a random sample for visualization)

Geographic Distribution of Clusters

Visualizing how the K-means clusters identified above are distributed spatially.

Temporal Distribution (1630 - 2025)

Building construction patterns over four centuries


Multi-Dimensional Occupancy Clustering Analysis

Advanced clustering analysis with dynamic feature selection for true multi-dimensional clustering

Dynamic Clustering Features

Base Dimensions (4D): Year Built, Footprint Area (SQMETERS), Height (HEIGHT_USED — measured HEIGHT when available, otherwise PRED_HEIGHT), Occupancy Class
+ Material Type (5D): Adds material type as a clustering dimension
+ Foundation Type (5D): Adds foundation type as a clustering dimension
+ Both (6D): Includes all dimensions for comprehensive clustering
Real-time Reclustering: Each toggle change triggers new clustering calculations based on selected features
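How the toggles change the clustering dimensions can be sketched as below. The column names year_built, material_type, and foundation_type, and both function names, are assumptions for illustration; only SQMETERS, HEIGHT_USED, HEIGHT, PRED_HEIGHT, and OCC_CLS appear in the description above.

```python
BASE_FEATURES = ["year_built", "SQMETERS", "HEIGHT_USED", "OCC_CLS"]

def height_used(height, pred_height):
    """Measured HEIGHT when available, otherwise PRED_HEIGHT."""
    return height if height is not None else pred_height

def active_features(use_material=False, use_foundation=False):
    """Return the feature list for the current toggle state (4D to 6D)."""
    features = list(BASE_FEATURES)
    if use_material:
        features.append("material_type")
    if use_foundation:
        features.append("foundation_type")
    return features
```

Each toggle change would then rerun the clustering on the columns returned by active_features.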

Current View: Balanced Sample - Shows equal representation of all occupancy classes for better pattern visibility
Active Clustering Dimensions: Year, Area, Occupancy (3D)
Clustering Status: Using pre-computed base clustering

Building Materials & Foundation Analysis

Correlation between material types and foundation types - Click on any cell to see occupancy breakdown

Material & Foundation Type Codes

Material Types: M = Masonry, W = Wood, H = Manufactured, S = Steel, C = Concrete
Foundation Types: C = Crawl Space, B = Basement, S = Slab, P = Pier, I = Pile, F = Fill, W = Solid Wall
👉 Click on any cell in either heatmap to see the occupancy class distribution for that combination

Material Usage Trends Over Time

Normalized percentage of material types for new construction in each decade.
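The per-decade normalization can be sketched as follows. This is an illustrative sketch only: the (year, material) pair layout and the function name are assumptions, not the dashboard's data format.

```python
from collections import Counter, defaultdict

def material_share_by_decade(buildings):
    """buildings: (year_built, material_type) pairs.

    Returns {decade: {material: percent}}, where the percentages within
    each decade sum to 100.
    """
    counts = defaultdict(Counter)
    for year, material in buildings:
        counts[year // 10 * 10][material] += 1
    shares = {}
    for decade, by_material in counts.items():
        total = sum(by_material.values())
        shares[decade] = {m: 100 * n / total for m, n in by_material.items()}
    return shares
```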

Boston's Historic Shoreline and Filled Land

Visualizing buildings constructed on land reclaimed since 1630.

The Filling of Boston

The map of Boston has changed dramatically since its founding in 1630. Much of what is now considered central Boston was once tidal flats and marshes. Through extensive land reclamation projects over centuries, areas like Back Bay, the South End, and parts of Downtown were created from fill. This historic map shows the original 1630 shoreline, and the interactive map below displays modern buildings that now stand on this reclaimed land.

Historic Shoreline Map (c. 1630)

Historic Map of Boston Shoreline

Buildings on Reclaimed Land

An interactive map of structures located on areas that were filled after 1630.

Boston Foundation Type Analysis by Building Height

Comprehensive analysis of foundation types on Original vs. Filled Land across height bins.

Total Boston Buildings
Shoreline (Filled Land)
Original Land
With Foundation Type

Methodology & Data Processing

CLF Foundation Type: CLF (Carbon Leadership Forum) is a non-profit organization that provides building embodied carbon data. The buildings_metadata.xlsx contains structural and foundation information for buildings across North America. Foundation types are grouped into CLF categories: Shallow foundation, Deep foundation < 50' (15m), Deep foundation > 50' (15m), and Other Foundation System.

Shoreline Detection: Buildings are classified using the 1630 historic shoreline.

Height Binning: Buildings are categorized into 5 bins based on height.


Section 1: Original Land vs Shoreline Land Comparison

Original Land
Shoreline Land

Section 2: Height bin comparison within same land type

Bin 1 Comparison
Bin 2 Comparison

Section 3: Complete Data Overview (Click to Expand)


Section 4: CLF Metadata Height vs Foundation Analysis

Data source: CLF buildings_metadata.xlsx (covers all of North America)

Height Bin Mapping (CLF → Our Bins):
CLF Height Bin | Our Height Bin
0-7.5 m | 0-24 ft
7.6-15 m | 24-72 ft
15.1-22.5 m | 24-72 ft
22.6-30 m | 72-147 ft
31-45 m | 147+ ft
46-60 m | 147+ ft
61-90 m | 147+ ft
Over 90 m | 147+ ft

Bin 1
Bin 2

Cost Analysis

Explore structural cost patterns across building size, occupancy, and materials

Cost Metrics Overview

This section analyzes building structural value (structure_value) and cost intensity relative to Gross Floor Area (GFA). Visualizations reveal how cost scales by occupancy and construction material.
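The two metrics in this section can be sketched with the standard library alone. This is a minimal sketch, assuming cost intensity means structure_value divided by GFA and that the regression is an ordinary least-squares fit in log10 space; the function names are hypothetical.

```python
import math

def cost_intensity(structure_value, gfa):
    """Structure value per unit of gross floor area."""
    return structure_value / gfa

def fit_log_log(gfa_values, structure_values):
    """Least-squares fit of log10(value) = intercept + slope * log10(GFA)."""
    xs = [math.log10(g) for g in gfa_values]
    ys = [math.log10(v) for v in structure_values]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept
```

A slope below 1 would indicate that structure value grows more slowly than floor area, meaning cost intensity falls as buildings get larger.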

Log-Log Regression: Structure Value ~ GFA

Cost Intensity by Occupancy Class

Cost Intensity by Material Type

Interactive Data Explorer

Explore the data with custom filters and advanced visualizations (*Based on 75,000 randomly sampled records from the 1.7M-record cleaned dataset)


Tips for Interactive Explorer

• 3D Scatter: Rotate with mouse, zoom with scroll wheel
• Sunburst: Click segments to zoom in, click center to zoom out
• Parallel Coordinates: Drag axes to reorder, brush to filter
• All charts: Hover for details, double-click to reset view

CLF Data Analysis

Analysis of the Carbon Leadership Forum dataset for Massachusetts

CLF Data Preprocessing

This dataset is drawn from the new-construction Massachusetts projects in the CLF building metadata and processed to be compatible with the NSI-Enhanced USA Structures Dataset. Key transformations include:

Occupancy Classification (OCC_CLS)

Detailed CLF building uses were mapped to NSI-Enhanced USA Structures Dataset categories. This mapping is primarily based on the definitions in the USA Structures dataset's PRIM_OCC column.

CLF Building Use | Mapped NSI Category (OCC_CLS)
Multifamily (5 or more units) | Residential
Lodging | Residential
Office | Commercial
Mercantile | Commercial
Food Service | Commercial
Laboratory | Commercial
Healthcare | Commercial
Parking | Commercial
Public Order and Safety | Government
Warehouse and Storage | Industrial
Industrial | Industrial
Public Assembly | Assembly
Religious Worship | Assembly
Transportation Hub | Assembly
Education | Education
Other | Utility and Misc

Material Type Encoding (material_type)

CLF structural systems were mapped to single-letter codes. This mapping was inferred by combining several CLF columns: str_prim_horiz_sys, str_prim_vert_sys, str_lat_sys, and str_sec_vert_sys.

CLF Structural System | Mapped Code (material_type)
Steel | S
Concrete | C
Steel/Concrete | S
Steel/Masonry | S
Wood: Mass Timber | W
Wood: Light-frame | W
Other | H
Code key: M = Masonry, W = Wood, H = Manufactured, S = Steel, C = Concrete (as used in the NSI-Enhanced USA Structures Dataset)

Other Key Transformations

  • bldg_compl_year was mapped to year_built
  • bldg_cfa was mapped to Est GFA sqmeters
  • str_fdn_type was mapped to general_fnd_type
  • Height Standardization: Text descriptions (e.g., "10-12 m") were converted to numeric averages (e.g., 11 in the HEIGHT column, which is in meters).
  • Data Cleaning: 2 records with missing floor area data (Est GFA sqmeters) were removed.
  • In total, 16 CLF projects are analyzed.
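The height standardization step can be sketched with a regular expression. This is a hedged illustration: the function name is hypothetical, and returning None for text that does not contain exactly two numbers (for example an open-ended description) is an assumption, since the handling of such cases is not described above.

```python
import re

def height_text_to_meters(text):
    """Convert a range such as "10-12 m" to its numeric average (11.0).

    Returns None when the text does not contain exactly two numbers.
    """
    numbers = [float(n) for n in re.findall(r"\d+(?:\.\d+)?", text)]
    if len(numbers) != 2:
        return None
    return sum(numbers) / 2
```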

Scatter Plot CLF MA Data Explorer

Compare GFA, Total Mass, and GWP, colored by Occupancy Class.

CLF Heatmap Analysis

Correlation between foundation types and structural systems.

Mapped Material Type vs. Foundation Type

Original Structural System vs. Foundation Type

GFA Distribution: Main Dataset vs. CLF Dataset

Comparison of Est GFA (sqm) by Occupancy Class. Boxes represent the main dataset (75,000 randomly sampled records from the 1.7M-record cleaned dataset); 'x' markers represent the CLF dataset.
