
NASA collects all kinds of data. Some of it comes from satellites orbiting the planet. Some of it travels from instruments floating through deep space. Over the years, these efforts have built up a massive collection: images, measurements, signals, scans. It is a goldmine of information, but getting to it, and making sense of it, is not always simple.
For many scientists, the trouble starts with the basics. A file might not say when it was recorded, what tool gathered it, or what the numbers mean. Without that information, even experienced researchers can get stuck.
With AI systems, the challenges are even more complex. Machines can learn from patterns, but they still need structure. If the data is vague or missing key labels, a model cannot do much with it, or it is forced to connect dots that are simply too far apart. As a result, some of the most valuable data goes overlooked, or the output cannot be trusted.
NASA has developed new tools to address the problem. These include automated metadata pipelines that process and standardize information about the agency’s vast datasets.
These automated pipelines clean up and clarify the metadata, which is the information about the data itself. Once that layer is solid, datasets become easier to find, easier to sort, and more useful to both humans and machines. The goal is to make this improved metadata available on familiar platforms like Data.gov, GeoPlatform, and NASA’s own data portals. The hope is that this shift will support faster research and better results across a wide range of projects.
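To make that concrete, here is a minimal sketch of the kind of normalization such a pipeline performs. The field names and date formats are illustrative assumptions, not NASA's actual schema:

```python
from datetime import datetime, timezone

def normalize_record(raw: dict) -> dict:
    """Standardize the basic fields a search portal needs."""
    return {
        "title": (raw.get("title") or "").strip() or "UNTITLED",
        "instrument": raw.get("instrument") or "unknown",   # what tool gathered it
        "collected_at": to_iso8601(raw.get("date")),        # when it was recorded
        "units": raw.get("units") or "unspecified",         # what the numbers mean
    }

def to_iso8601(value):
    """Coerce assorted date formats to a single ISO 8601 UTC timestamp."""
    if not value:
        return None
    for fmt in ("%Y-%m-%dT%H:%M:%S", "%Y-%m-%d", "%m/%d/%Y", "%Y%j"):
        try:
            return datetime.strptime(value, fmt).replace(
                tzinfo=timezone.utc).isoformat()
        except ValueError:
            continue
    return None  # flag for human review rather than guess
```

Small as it looks, this is the layer that lets both a researcher and a machine-learning model answer the basic questions, when, what, and in which units, without digging through the raw files.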
Part of this effort is about opening access beyond NASA’s usual networks. Not everyone looking for data is familiar with internal tools or technical systems. That challenge is part of the reason these pipelines exist. “In NASA Earth science, we do have our own online catalog, called the Common Metadata Repository (CMR), that is particularly geared towards our NASA user community,” said Newman.
“CMR works great in this case, but people outside of our immediate community might not have the familiarity and specific knowledge required to get the data they need. More general portals, such as Data.gov, are a natural place for them to go for government data, so it’s important that we have a presence there.”
NASA’s new metadata pipelines are an attempt to make those datasets easier to find and easier to understand. The first phase of the effort centers on more than 10,000 public data collections, covering over 1.8 billion individual science records. These are being reformatted and aligned with open standards so they can be shared through platforms like Data.gov and GeoPlatform, where researchers outside NASA are more likely to search. The shift also helps AI systems: when the structure is clear and consistent, models can interpret the data and apply it without making unnecessary assumptions.
Improving structure is only part of the process. NASA is also looking closely at the quality of the metadata itself. That work is handled through the ARC project, short for Analysis and Review of CMR. The goal is to make sure records are not just formatted properly, but also accurate, complete, and consistent. By reviewing and strengthening these records, ARC helps ensure that what shows up in search results is not only visible, but also reliable enough to be used with confidence.
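NASA has not published ARC’s checks at this level of detail, but a completeness-and-consistency review over metadata records might look roughly like the following sketch (the field names are hypothetical):

```python
REQUIRED_FIELDS = ("title", "abstract", "temporal_extent",
                   "spatial_extent", "data_format", "doi")

def review_record(record: dict) -> list:
    """Return the quality issues found in a single metadata record."""
    issues = []
    # Completeness: every required field must be present and non-empty.
    for field in REQUIRED_FIELDS:
        if not record.get(field):
            issues.append("missing or empty field: " + field)
    # Consistency: a temporal extent must not end before it starts.
    extent = record.get("temporal_extent") or {}
    start, end = extent.get("start"), extent.get("end")
    if start and end and end < start:
        issues.append("temporal extent ends before it starts")
    return issues
```

Records that come back with no issues are ready to surface in search; the rest are routed back for correction.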
Translating NASA’s internal metadata into formats that work across public platforms takes detailed and technical work. That effort is being led by Kaylin Bugbee, a data manager with NASA’s Office of the Chief Science Data Officer. She helps run the Science Discovery Engine, a system that supports open access to NASA’s research tools, data, and software.
Bugbee and her team are building a process that gathers metadata from across the agency and maps it to the formats used by platforms like Data.gov. It is a careful, step-by-step workflow that needs to match NASA’s unique terms with more universal standards. “We’re in the process of testing out each step of the way and continuing to improve the metadata mapping so that it works well with the portals,” Bugbee said.
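Data.gov harvests catalogs described in the DCAT-US schema, so one step in such a workflow is a field-by-field translation. The sketch below uses simplified stand-ins for CMR’s collection fields (the real UMM metadata model is far richer) and is an assumption about the shape of the mapping, not the team’s actual code:

```python
import json

def cmr_to_dcat(cmr: dict) -> dict:
    """Translate a simplified CMR-style collection record into a
    DCAT-US dataset entry of the kind Data.gov harvests."""
    return {
        "@type": "dcat:Dataset",
        "title": cmr["EntryTitle"],
        "description": cmr.get("Abstract", ""),
        "identifier": cmr["ConceptId"],
        # DCAT-US wants flat keyword strings; CMR nests its science
        # keywords in a hierarchy, so flatten to the leaf terms.
        "keyword": [kw["Term"] for kw in cmr.get("ScienceKeywords", [])],
        "publisher": {"@type": "org:Organization", "name": "NASA"},
        "accessLevel": "public",
    }

sample = {"EntryTitle": "Example Collection", "ConceptId": "C000000-TEST",
          "Abstract": "A placeholder record.",
          "ScienceKeywords": [{"Term": "LAND SURFACE"}]}
print(json.dumps(cmr_to_dcat(sample), indent=2))
```

The hard part, as Bugbee suggests, is not the plumbing but the vocabulary: deciding which NASA-specific term lands in which general-purpose slot, then testing whether the result actually reads correctly on the other end.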
NASA is also working to improve access to its geospatial data. Some of these datasets are used by other agencies for tasks like mapping, transportation, and emergency planning; they are known as National Geospatial Data Assets, or NGDAs.
Bugbee’s team is building a system that connects these files to GeoPlatform.gov, with links that send users straight to NASA’s Earthdata Search. The process builds on metadata NASA already has, which saves time and reduces the need to start from scratch. The team began with MODIS and ASTER products from the Terra platform and will expand from there. The goal is to make these datasets easier to access while keeping the structure clear and consistent across platforms that serve both public and scientific users.
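The deep links themselves can be simple. Assuming each portal entry carries the collection’s CMR concept ID, and that Earthdata Search’s "p" parameter scopes a search to one collection (both assumptions on my part), link generation might look like this:

```python
from urllib.parse import urlencode

EARTHDATA_SEARCH = "https://search.earthdata.nasa.gov/search/granules"

def earthdata_link(concept_id: str) -> str:
    """Build a link that drops the user into Earthdata Search,
    scoped to one collection via its CMR concept ID."""
    return EARTHDATA_SEARCH + "?" + urlencode({"p": concept_id})

# A placeholder concept ID, not a real MODIS collection:
print(earthdata_link("C0000000000-LPDAAC_ECS"))
```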