Tuesday 29 March 2011

Dust storms test the limits of spatial scientific computing

Many of the systems studied by scientists evolve through time and space, like diseases spreading through a population, weather fronts rolling across a continent, and solar flares churning the atmosphere. A paper, part of a special series on what's termed "spatial computing," argues that understanding these processes will necessarily require computing systems to have a structure that reflects the corresponding spatial and temporal limits. According to its authors, the spatial nature of the systems being studied is reflected in everything from data gathering to its processing and the visualization of the results. So, unless we engineer the computing systems to handle spatial issues, we're not going to have very good results.
It would be easy to dismiss spatial computing as a buzzword, except the authors provide concrete examples that demonstrate spatial issues do play a major role in scientific computing. Unfortunately, they also demonstrate that these issues play several roles, some largely unrelated to each other; as a result, the spatial tag ends up getting applied to several unrelated problems, which dilutes the message to an extent.
Still, each of the individual problems, which the authors illustrate with the example of forecasting dust storms, makes for a compelling description of spatial issues in scientific computing.
Given our current knowledge of dust storms, there are over 100 variables that have to be recorded, including wind speed, relative humidity, and temperature. This data is both dynamic (it can change rapidly with time) and spatial, in that even neighboring areas may experience very different conditions. The data itself is recorded by a variety of instruments, with different locations and capabilities. Even within the US, these are run by different government agencies; just finding all of the available stations could be a challenge.
Integrating these readings into a single, coherent whole is a nightmare; response times can be anywhere from a second to hours, and the different capabilities of the stations require some extensive processing before the data can be used. To tackle these issues, the authors set up a system of local servers to process the data on-site, and created a central portal to provide rapid access to their data; in short, they eliminated the spatial component of data access while retaining the spatial nature of the data itself. The end result was a set of data from 200 stations that could be accessed within a second from anywhere in the US.
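To make that architecture concrete, here's a minimal Python sketch of the two-tier idea. It is entirely hypothetical and not the authors' system: a "local server" step that normalizes a station's native readings into a common schema, and a central "portal" that holds the pre-processed data so a query never has to reach back to the stations themselves. All names and fields are illustrative.

from dataclasses import dataclass
from datetime import datetime, timezone

@dataclass
class Reading:
    station_id: str
    lat: float
    lon: float
    time: datetime
    variable: str      # e.g. "wind_speed", "relative_humidity"
    value: float
    unit: str

def normalize_local(raw_rows, station_id, lat, lon):
    """Run at the local server: convert a station's native format into Readings."""
    for row in raw_rows:
        yield Reading(
            station_id=station_id,
            lat=lat,
            lon=lon,
            time=datetime.fromtimestamp(row["ts"], tz=timezone.utc),
            variable=row["name"],
            value=float(row["val"]),
            unit=row["unit"],
        )

class Portal:
    """Central portal: keeps only already-normalized readings in memory,
    so a query is a fast local lookup rather than a round-trip to each station."""
    def __init__(self):
        self._readings: list[Reading] = []

    def ingest(self, readings):
        self._readings.extend(readings)

    def query(self, variable, bbox):
        """Return readings for one variable inside a (lat_min, lat_max, lon_min, lon_max) box."""
        lat_min, lat_max, lon_min, lon_max = bbox
        return [r for r in self._readings
                if r.variable == variable
                and lat_min <= r.lat <= lat_max
                and lon_min <= r.lon <= lon_max]

# Hypothetical usage: one station's raw row is normalized locally, then served centrally.
portal = Portal()
portal.ingest(normalize_local(
    [{"ts": 1301400000, "name": "wind_speed", "val": 7.2, "unit": "m/s"}],
    station_id="AZ-042", lat=33.4, lon=-112.0))
print(portal.query("wind_speed", bbox=(30.0, 35.0, -115.0, -110.0)))

The point is simply that the expensive, station-specific work happens near the instruments, while the portal answers spatial queries from data it already holds.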
That data can then be used for dust storm forecasting, where other spatial issues come into play. When the authors started, their algorithm took a full day to perform a three-day forecast when using a grid of 10km squares. Obviously, that's not especially useful, and the clear solution was to parallelize the code so that each grid square could be assigned to an individual core.
Still, the spatial nature of the problem limited the benefits of parallel code. Each grid square influences its nearest neighbors, so the simulation runs into a lot of dependencies that require communication between the parallel processes. You can speed things up considerably by placing neighboring grid squares on cores of the same node, where communication is cheap; balancing all of this requires that the spatial arrangement of the grid be considered when dividing up the tasks. But that approach quickly runs into limits in terms of how many cores a single node has. The authors found significant gains up to about 20 cores (that system could do a five-day forecast in three hours), but things tailed off after that.
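To see why the layout of the work matters, here's a rough, self-contained illustration (my own sketch, not the paper's code). It compares how many neighbor-to-neighbor updates have to cross a worker boundary when the grid is split into contiguous strips versus dealt out round-robin; the grid size and worker count are arbitrary.

import numpy as np

N = 120          # grid is N x N squares (illustrative)
WORKERS = 8      # number of cores/processes (illustrative)

def cross_boundary_edges(owner):
    """Count neighbor pairs that live on different workers (i.e. need communication)."""
    horiz = np.count_nonzero(owner[:, :-1] != owner[:, 1:])
    vert = np.count_nonzero(owner[:-1, :] != owner[1:, :])
    return horiz + vert

rows = np.arange(N).reshape(N, 1)
cols = np.arange(N).reshape(1, N)

# Spatial decomposition: contiguous horizontal strips, one per worker.
strips = np.broadcast_to(rows * WORKERS // N, (N, N))

# Naive decomposition: deal cells out round-robin with no regard for position.
round_robin = (rows * N + cols) % WORKERS

print("cross-worker edges, strip decomposition:", cross_boundary_edges(strips))
print("cross-worker edges, round-robin:        ", cross_boundary_edges(round_robin))

With contiguous strips, only the thin band of cells along each strip boundary needs to talk to another worker; the round-robin split forces almost every update across a boundary. That communication cost is the kind of overhead that eventually swamps the gains from adding more cores.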
A similar issue comes into play when you try to visualize the results. Physically adjacent grid squares are likely to be accessed at the same time, so it's best to keep them close together in memory and to process them on the same node. But again, adding threads increases performance only up to a point, and the nodes had a tendency to run out of RAM.
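The memory-locality point can be made with a similarly artificial example. In a plain row-major array, the grid square directly below the current one lives a whole row away in memory, while a tiled layout keeps most neighbors inside the same small block; the tiling scheme below is a generic sketch, not anything described in the paper.

N = 1024      # grid dimension (illustrative)
TILE = 32     # tile edge length (illustrative)

def row_major_index(r, c):
    """Index when the grid is stored one full row after another."""
    return r * N + c

def tiled_index(r, c):
    """Index when the grid is stored tile-by-tile, each tile row-major internally."""
    tiles_per_row = N // TILE
    tile_id = (r // TILE) * tiles_per_row + (c // TILE)
    return tile_id * TILE * TILE + (r % TILE) * TILE + (c % TILE)

r, c = 100, 200
for name, idx in (("row-major", row_major_index), ("tiled", tiled_index)):
    gap = abs(idx(r + 1, c) - idx(r, c))   # memory distance to the cell just below
    print(f"{name:10s} distance to vertical neighbor: {gap}")

The row-major layout puts vertical neighbors 1024 entries apart, while the tiled layout usually keeps them within 32, which is the sort of locality that helps when adjacent squares are drawn together.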
Ultimately, the need for spatial resolution pushed the problem up against some hard limits. It was possible to create and visualize a forecast on a grid of four-kilometer squares, but trying to go down to three simply failed, and the only upgrade that helped was faster processors; more RAM and a faster network had only a minimal impact.
In the end, the authors make a good case that the spatial properties of a scientific dataset require a degree of consideration when designing methods for analyzing that data. But the problems lumped under the umbrella of spatial computing seem to include everything from differences in data-gathering equipment down to getting threads to execute on the same processing node. These are very different problems, and they're handled by correspondingly different approaches. The authors' work also clearly demonstrates that, even when spatial considerations are taken into account, doing so has its limits.
