OpenFEMA Guide to Working with Large Data Sets

Several data sets offered by OpenFEMA are very large, requiring a significant amount of time to download and exceeding the capabilities of common spreadsheet applications such as Microsoft Excel. The purpose of this guide is to provide recommendations and techniques for working with OpenFEMA’s large data files. Specifically, this guide focuses on:

  1. Options for downloading large data sets and performing data refreshes
  2. Suggestions for viewing, storing, manipulating, and analyzing large data sets

FEMA does not endorse any of the products mentioned in this guide and makes no claim as to the assurance of their efficacy or to the security of the products. They are used here for illustrative purposes only.

Several supporting appendices are referenced throughout this guide:

  • Appendix A – OpenFEMA API Data Paging
  • Appendix B – Checking for Data Updates
  • Appendix C – Operating System Tools for Viewing and File Manipulation
  • Appendix D – OpenFEMA Tips and Tricks

Downloading Data

For very large data sets, OpenFEMA limits the download format type to CSV. While JSON downloads are available from the API, be aware that because the JSON format adds structure, the file size will increase. Although the full source file is uncompressed, data is compressed with GZIP during transmission and uncompressed by your browser automatically. Even so, depending on available download speed, it may take several hours for a download to complete.
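As a rough sketch, a full-file CSV download can also be scripted, for example in Python with the requests library. The URL below follows the pattern of the CSV links published on data set pages and is illustrative only; confirm the exact link on the page for the data set you need.

import requests

# Stream the full-file CSV download to disk in 1 MB chunks so the entire file
# never has to fit in memory. The requests library decompresses gzip-encoded
# responses automatically.
url = "https://www.fema.gov/api/open/v2/DisasterDeclarationsSummaries.csv"

with requests.get(url, stream=True, timeout=300) as response:
    response.raise_for_status()
    with open("DisasterDeclarationsSummaries.csv", "wb") as fh:
        for chunk in response.iter_content(chunk_size=1024 * 1024):
            fh.write(chunk)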

API Use versus Full File Download

  • If the purpose of your download is to perform historical analysis where the specific fields needed are unknown and frequent updates are not important, a full download from the CSV link on the data set page is recommended.
  • If, as above, full data is needed, but updates are also required, an initial full download can be performed, with updates provided via the API. The API returns JSON data and will require a slightly different processing mechanism. Additionally, if more than 1,000 records exist in the update, paging is required. (See Appendix A – OpenFEMA API Data Paging)
  • If a subset of data is desired (either limited fields or filtered records), consider retrieving data from the API rather than a full CSV download. Specific fields can be selected and results filtered directly as part of the API call using the $select and $filter parameters, and the $format and $filename parameters permit downloading just the requested fields and filtered records as a file. An example use case would be an app or website reporting on specific, limited data. Keep in mind that the JSON file format will be larger than the same data in a CSV format. A minimal example of such a subset request follows this list. (See Appendix B – Checking for Data Updates.)
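As a minimal sketch of a subset request in Python (using the requests library; the fields shown are illustrative examples from the DisasterDeclarationsSummaries data set):

import requests

# Select only the fields needed, filter to a single state, and request CSV
# output. See the data set documentation page for the full field list.
# Note: the API's per-call record limit still applies; page through larger
# results as described in Appendix A.
url = (
    "https://www.fema.gov/api/open/v2/DisasterDeclarationsSummaries"
    "?$select=disasterNumber,declarationDate,declarationTitle"
    "&$filter=state%20eq%20%27VA%27"
    "&$format=csv"
)

response = requests.get(url, timeout=60)
response.raise_for_status()

with open("va_declarations.csv", "w", encoding="utf-8") as fh:
    fh.write(response.text)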

Using Data

Once data has been successfully downloaded, viewing, manipulating, and analyzing it can still be a challenge. A spreadsheet program such as Microsoft Excel has row and data size limits; large data sets will exceed them and will not open without data loss. Common text editors such as Notepad++ and Sublime Text have file size limits (around 2 GB), again preventing files from being opened, searched, or edited. Using different tools, extracting subsets, and/or aggregating detailed data are good approaches for making analysis easier.

Pre-Processing, Viewing, and Editing

  • Use built-in operating system (OS) tools. Linux and Mac operating systems (including the Windows Subsystem for Linux or a tool such as Cygwin) have many built-in commands for viewing, manipulating, and even modifying files. It may be possible to efficiently pre-process data so that intermediate results can be used in a spreadsheet program. (See Appendix C – Operating System Tools for Viewing and File Manipulation.)
  • Dedicated tools built to manage large files can also be used. Editors such as EditPad Lite, 010 Editor, and UltraEdit can read, search, and, to a limited extent, edit large text-based files.
  • Some integrated development environments, such as Microsoft VS Code, have extensions (e.g., PreviewCSV) that can view very large files. Such extensions work by loading portions of a file at a time rather than reading the entire file into memory.

Storage

  • Import downloaded data into a database. Database tools are built to handle much more data than can be managed within a spreadsheet. SQLite is a good local database solution. Server-based Relational Database Management Systems (RDBMS) such as PostgreSQL and MariaDB are also good storage solutions for large data sets but may require additional configuration.
  • Non-CSV (e.g., XML, JSON) OpenFEMA data sets can be stored as text (and sometimes queried) within an RDBMS. Microsoft SQL Server and PostgreSQL offer direct support for XML storage and manipulation. These databases, as well as SQLite and MySQL, support JSON objects. NoSQL or document databases such as MongoDB, Cassandra, and CouchDB can also store and manipulate non-tabular data natively.
  • Convert and store large CSV files in formats more appropriate to big-data analysis, such as Apache Parquet, Avro, and ORC. These formats are self-describing, language-independent, and optimized for storage and processing with big-data tools/frameworks/platforms such as Hadoop, Apache Spark, R, and Databricks. However, they are not human readable and require programming or other tools to manipulate. A brief conversion sketch follows this list.
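As a brief, hedged illustration of the conversion step above (assuming the pandas and pyarrow packages are installed, the CSV fits in memory, and the file names shown are placeholders), a downloaded CSV can be converted to Parquet in a few lines of Python:

import pandas as pd

# File names are placeholders for a previously downloaded OpenFEMA CSV file.
# to_parquet() requires the pyarrow (or fastparquet) package. For files too
# large to fit in memory, process the CSV in chunks or use a big-data tool
# such as Apache Spark instead.
df = pd.read_csv("DisasterDeclarationsSummaries.csv", low_memory=False)
df.to_parquet("DisasterDeclarationsSummaries.parquet", index=False)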

Analysis

  • Querying, aggregation, and visualization within a very large spreadsheet can be incredibly slow. Tools like Microsoft Power Query are intuitive, connect to many data sources, and enhance query and transformation activities, but can still take time on large data. Database tools listed above are more performant at querying and aggregating data.
  • Many proprietary tools, such as KNIME and DataMiner, are designed to work with larger data sets. Open-source tools such as RStudio can also connect to a variety of large data sources for visualization purposes.
  • Utilize a programming language such as R, Python, Julia, or PHP to access, manipulate, and analyze the data – either in a raw CSV format or by directly accessing data within a database.

Summary

Working with large data sets can be challenging with commonly used tools. These challenges can be mitigated with the following strategies: 

  1. Download only what you need.
  2. To refresh data, attempt to perform updates rather than a full re-download.
  3. Pre-process (filter and aggregate) prior to analysis using more standard tools.
  4. Use tools more appropriate for working with large amounts of data—including programming languages.
  5. Seek additional resources from the web regarding data science, data wrangling, and data analysis tools used to work with large amounts of data.

Appendix A – OpenFEMA API Data Paging 

For performance reasons, only 1,000 records are returned per API endpoint call. If more than 1,000 records exist, it is necessary to page through the data using the $skip and $inlinecount parameters to retrieve every record. The metadata header returned as part of the data set JSON response will only display the full record count if the $inlinecount parameter is used; otherwise, it will have a value of 0. Write code containing a loop that continues making API calls, incrementing the $skip parameter each time, until the number of records retrieved equals the total record count. See the URI commands section of the OpenFEMA Documentation for additional information regarding these parameters.

As an example, the following API calls will retrieve Disaster Declaration data for the state of Virginia. More than 2,100 records exist. The first call will return the first 1,000 disasters, while the second will return the next 1,000 records. See the Developer Resources page or the OpenFEMA GitHub Samples Repository for working examples.

https://www.fema.gov/api/open/v2/DisasterDeclarationsSummaries?$filter=state%20eq%20%27VA%27&$inlinecount=allpages&$skip=0

https://www.fema.gov/api/open/v2/DisasterDeclarationsSummaries?$filter=state%20eq%20%27VA%27&$inlinecount=allpages&$skip=1000
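A minimal paging loop in Python (using the requests library) might look like the following sketch; error handling and rate limiting are omitted:

import requests

base_url = (
    "https://www.fema.gov/api/open/v2/DisasterDeclarationsSummaries"
    "?$filter=state%20eq%20%27VA%27&$inlinecount=allpages"
)
page_size = 1000  # default number of records returned per call
records = []
skip = 0

while True:
    response = requests.get(f"{base_url}&$skip={skip}", timeout=60)
    response.raise_for_status()
    payload = response.json()

    # The entity name in the JSON response matches the data set name.
    records.extend(payload["DisasterDeclarationsSummaries"])

    total = payload["metadata"]["count"]  # populated because $inlinecount=allpages
    skip += page_size
    if skip >= total:
        break

print(f"Retrieved {len(records)} of {total} records")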

Appendix B – Checking for Data Updates

Data updates are possible for situations where a full set of historical data is required as well as any data added since the initial download. For the OpenFEMA API, it is important to note the distinction between the dataset being updated or refreshed and the actual data or records being updated.

Dataset Updates

The OpenFEMA data store will either reload or update a specific dataset according to a refresh interval based on the data owner’s recommendations, the source system’s speed of data processing, the size of the dataset, and the complexity of retrieving the data. The refresh interval for each dataset varies and can be found on the associated data set documentation page.

Whether you are refreshing the entire data set or just trying to add/update changed records since the last update, your refresh interval should not be more frequent than the data set refresh interval. Further, it would be prudent to check the data set update status prior to executing your own refresh. There may be situations when the OpenFEMA data store is unable to refresh from the source data.

A special metadata API endpoint exists to describe each OpenFEMA dataset: DataSets. The “lastDataSetRefresh” field may be used to determine the date and time an individual dataset was refreshed from the FEMA source system. A call to this endpoint is faster than querying the data set for the most recently updated record. For example, the following call will return the API endpoint version and the last date/time the DisasterDeclarationsSummaries data was refreshed from the source data. Note that this value does not indicate whether the actual data has been modified, just that the OpenFEMA data store has been refreshed. Also, be mindful of the dataset version you are working with, as multiple versions, each returning different data, may exist for a limited period.

https://www.fema.gov/api/open/v1/DataSets?$select=version,lastDataSetRefresh&$filter=name%20eq%20%27DisasterDeclarationsSummaries%27
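As a rough sketch (in Python with the requests library, and assuming the response entity is named DataSets to match the endpoint), the refresh timestamp can be checked programmatically before triggering your own update:

import requests

# Query the DataSets metadata endpoint for the refresh timestamp rather than
# scanning the data set itself for the most recently updated record.
url = (
    "https://www.fema.gov/api/open/v1/DataSets"
    "?$select=version,lastDataSetRefresh"
    "&$filter=name%20eq%20%27DisasterDeclarationsSummaries%27"
)

response = requests.get(url, timeout=60)
response.raise_for_status()

# One entry is returned per available version of the data set. Compare
# lastDataSetRefresh to the value saved from your previous refresh; only
# re-download or pull updates if it has changed.
for entry in response.json()["DataSets"]:
    print(entry["version"], entry["lastDataSetRefresh"])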

Data or Record Updates

For some datasets, it is possible to retrieve updated data, meaning it is possible to identify a record by the date it was changed within an OpenFEMA dataset. If a dataset contains a field called “lastRefresh”, OpenFEMA is receiving and adding updates to the record set, and lastRefresh represents the date when the record was added or updated in the dataset. Data sets that do not contain this field are refreshed by performing a full reload on all the data; there is no way to tell when an individual record was added or changed. Note that the lastRefresh date does not represent when the data was modified in the source system, just the date when it changed in the OpenFEMA dataset. Exceptions may exist; such exceptions will be documented on the appropriate dataset page.

For cases where this date exists, it is possible to query recent data. For example, the following call will return all the records added after the date specified. If you wanted to return a rolling 30-day list, adjust the date specified to be 30 days prior to the current date:

https://www.fema.gov/api/open/v2/DisasterDeclarationsSummaries?$filter=lastRefresh%20gt%20%272022-07-27T04:00:00.000z%27&$inlinecount=allpages
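As a minimal sketch (in Python with the requests library), the 30-day cutoff can be computed at run time and substituted into the filter; note that more than 1,000 returned records would still require paging as described in Appendix A:

from datetime import datetime, timedelta, timezone
import requests

# Build a lastRefresh filter for the last 30 days, formatted to match the
# ISO-style timestamp used in the example URL above.
cutoff = datetime.now(timezone.utc) - timedelta(days=30)
cutoff_str = cutoff.strftime("%Y-%m-%dT%H:%M:%S.000z")

url = (
    "https://www.fema.gov/api/open/v2/DisasterDeclarationsSummaries"
    f"?$filter=lastRefresh%20gt%20%27{cutoff_str}%27&$inlinecount=allpages"
)

response = requests.get(url, timeout=60)
response.raise_for_status()
print(response.json()["metadata"]["count"], "records refreshed in the last 30 days")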

Occasionally files that are normally updated must be fully reloaded. When this occurs, the lastRefresh date for every record is updated. Please see the Special Dataset Fields section of the OpenFEMA API Documentation page for more information.

Appendix C – Operating System Tools for Viewing and File Manipulation

Most of these free tools are available in the Linux and Mac operating systems as well as the Windows Subsystem for Linux. They may have to be downloaded separately for Windows 10. While many more tools are available, these commands seemed the most useful for working with OpenFEMA data at the time this guide was written.

Getting Data

  • wget – network downloader tool
  • curl – transfer data from a URL

File Viewing and Searching

  • head – output the first part of a file to standard output
  • tail – output the last part of a file to standard output
  • grep – find and output lines that match a pattern

Analysis Tools

  • shuf – generate random permutations from an input
  • sort – sort lines of text
  • uniq – report or eliminate duplicate/repeated lines of text
  • wc – output newline, word, and byte counts for a specified file

File Manipulation

  • cat – concatenate files and print to standard output
  • split – split a file into sections based on lines or bytes
  • csplit – split a file into sections based on context/pattern matching
  • sed – stream editor for finding and replacing data
  • awk – a programming language for pattern scanning and text processing
  • jq – a command-line JSON processor
  • xmllint – a command-line XML processor

Examples:

  • Download data in a JSON format from the OpenFEMA API using wget and curl respectively:
wget --output-document=output.txt 'https://www.fema.gov/api/open/v2/DisasterDeclarationsSummaries?$filter=disasterNumber%20eq%201491'

curl 'https://www.fema.gov/api/open/v2/DisasterDeclarationsSummaries?$filter=disasterNumber%20eq%201491' >> output.txt 
  • Count the number of disasters for calendar year 2020:
curl -s 'https://www.fema.gov/api/open/v2/FemaWebDisasterDeclarations?$inlinecount=allpages&$select=disasterNumber&$filter=declarationDate%20gt%20%272019-12-31%27' | jq '.metadata.count' 
  • List the Total Obligated Public Assistance amount for each disaster and Public Assistance project for calendar year 2020:
curl -s 'https://www.fema.gov/api/open/v1/PublicAssistanceFundedProjectsDetails?$top=0&$inlinecount=allpages&$select=disasterNumber,pwNumber,obligatedDate,totalObligated&$orderby=disasterNumber&$filter=obligatedDate%20gt%20%272019-12-31%27' | jq '.PublicAssistanceFundedProjectsDetails[].totalObligated'
  • Extract identifier and date from a previously downloaded IPAWS file, sort, count unique records, and save those that are duplicates:
grep -oP "<identifier>(.*)</sent>" ipaws_historical_download.csv | sort | uniq -cd > output.txt

Appendix D – OpenFEMA Tips and Tricks

See the OpenFEMA Documentation page for API usage and URI options.

Counting Results

If only a count is required rather than the full data, the returned dataset metadata value “count” may be used. Although the OpenFEMA API cannot yet aggregate results as part of a query, in some cases the “count” attribute can reduce the amount of data returned and eliminate the need for you to aggregate data yourself.

Consider the following example, which retrieves a count of the number of disasters for calendar year 2020. Although Linux tools are used in this example, many programming languages can perform the same actions.

curl -s 'https://www.fema.gov/api/open/v2/FemaWebDisasterDeclarations?$inlinecount=allpages&$select=disasterNumber&$filter=declarationDate%20gt%20%272019-12-31%27' | jq '.metadata.count'
  1. The utility “curl” is used to request data from the FemaWebDisasterDeclarations data set.
  2. The “$inlinecount” URI option is used to request a count of all records. If “allpages” is not specified, the count will default to 0.
  3. Only a minimal amount of data is returned (in this case, the disaster number); the returned data itself is not actually used.
  4. A “$filter” is included to limit data to the calendar year 2020.
  5. The utility “jq” is used to extract the metadata attribute containing the count of records found.

Fields Containing Encoded Data

The JSON response for some data sets contains encoded data, such that a separate tool or method must be used to view the data in its original format. For example, the “originalMessage” element of the IPAWS Archived Messages data set contains XML content with escape sequences that preserve special characters (commas, double quotes). The following example in the Python programming language will properly print the decoded field to the console; the data will appear in its raw, unmodified state:

import json

# Use a raw string so the backslashes in the Windows path are not treated as escape sequences.
with open(r'c:\my_ipaws_data_in_a_file.json') as fh:
    data = json.load(fh)

print(data['IpawsArchivedAlerts'][0]['originalMessage'])