A k-nearest-neighbor simulator for daily precipitation and other weather variables

Balaji Rajagopalan
Lamont-Doherty Earth Observatory, Columbia University, Palisades, New York

Upmanu Lall
Utah Water Research Laboratory, Utah State University, Logan
Abstract. A multivariate, nonparametric time series simulation method is provided to generate random sequences of daily weather variables that "honor" the statistical properties of the historical data of the same weather variables at the site. A vector of weather variables (solar radiation, maximum temperature, minimum temperature, average dew point temperature, average wind speed, and precipitation) on a day of interest is resampled from the historical data by conditioning on the vector of the same variables (feature vector) on the preceding day. The resampling is done from the k nearest neighbors in state space of the feature vector using a weight function. This approach is equivalent to a nonparametric approximation of a multivariate, lag 1 Markov process. It does not require prior assumptions as to the form of the joint probability density function of the variables. An application of the resampling scheme with 30 years of daily weather data at Salt Lake City, Utah, is provided. Results are compared with those from the application of a multivariate autoregressive model similar to that of Richardson [1981].
1. Introduction
Crop yields and hydrological processes such as runoff and erosion are driven by weather variations. Recognizing the inherent variability in climate, it is often desirable to assess management scenarios for a number of likely weather sequences. Stochastic models are useful for simulating scenarios that are representative of the data. While there is a substantial literature for rainfall simulation and for other variables one at a time, only a few multivariate weather simulation models have been developed.

An objective of the work presented here was to generate daily weather sequences as inputs to the Water Erosion Prediction Project (WEPP) of the U.S. Department of Agriculture (USDA). Six variables (solar radiation (SRAD), maximum temperature (TMX), minimum temperature (TMN), average wind speed (WSPD), average dew point temperature (DPT), and precipitation (P)) that are of interest to WEPP were considered to represent the daily weather state. Generally, a statistical method for generating daily weather sequences needs to consider the statistical dependence or correlation of the weather variables with each other on the same day, as well as their "persistence," i.e., dependence on the weather state on previous days. Solar radiation, dew point temperature, and maximum temperature are likely to be lower on rainy days than on dry days, while the wind speed and minimum temperature may be higher on rainy days than on dry days. Consequently, precipitation is chosen as the driving variable in a number of existing models. Typically [see Jones et al., 1972; Nicks and Harp, 1980; Richardson, 1981; Rajagopalan et al., 1997], daily precipitation is generated independently, and the other variables are generated by conditioning on precipitation events (i.e., whether a day is wet or dry). A precipitation occurrence and amount model (e.g., a two-state Markov model, with exponentially distributed rainfall amounts) is used to generate the sequence of dry and wet days and precipitation amount. The other variables are simulated using a lag 1 multivariate, autoregressive model with exogenous precipitation input (MAR-1).

The work of Rajagopalan et al. [1997] differed from the earlier work. They used kernel density estimation to specify the univariate and multivariate probability densities needed for describing the stochastic processes of interest. Precipitation was generated independently from a nonparametric wet/dry spell model [Lall et al., 1996], and the other variables on a given day were generated by conditioning on the precipitation magnitude (rather than just the precipitation state) for the day and on the previous day's values for the weather variables.

The precipitation amount on a rainy day may also depend on the wind, the temperature, and the humidity as measured by the dew point temperature. Consequently, there is reason to consider dependence of the daily weather process on more than just precipitation, as has traditionally been done. Young [1994], in a model similar in spirit to the one presented here, considers such dependence. In the approach adopted in this paper, precipitation is simulated along with the other variables, thereby capturing the mutual dependence of all six weather variables. The simulation strategy used is a direct resampling of the data using a conditional bootstrap based on nearest-neighbor probability density estimation. This approach does not require the specification and estimation of the parameters of a parametric model (e.g., normal or lognormal) for the joint or conditional probability density of the variables.

A brief review of traditional methods for simulating weather variables is first provided. The general framework for the resampling strategy proposed here is presented next. The k-nearest-neighbor (k-NN) bootstrap algorithm is outlined. An application of the method to data from Salt Lake City is then presented. Comparisons of the simulations from the k-NN bootstrap and from a more traditional autoregressive simulation model are provided.

WATER RESOURCES RESEARCH, VOL. 35, NO. 10, PAGES 3089-3101, OCTOBER 1999
Copyright 1999 by the American Geophysical Union. Paper number 1999WR900028. 0043-1397/99/1999WR900028$09.00
2. Background
The general structure of some traditional methods [see Jones et al., 1972; Bruhn et al., 1980; Nicks and Harp, 1980; Lane and Nearing, 1989; Richardson, 1981] for simulating daily weather is discussed in this section. Precipitation is first generated independently, and the other variables are conditioned on the generated state of precipitation (i.e., rain or no rain on the day). The other variables are generated either from independent statistical distributions fitted separately to each of the variables for each of the two precipitation states (i.e., rain, no rain) or from independently or jointly fitted autoregressive models of order 1 (AR-1).

Usually, the year is divided into periods (seasons), and moments (mean, standard deviation, and skew) are calculated for each variable for each period for each precipitation state. The seasonal moments are used to fit probability distributions or models. Homogeneity of the process in each season is assumed. Jones et al. [1972], Bruhn et al. [1980], Nicks and Harp [1980], and Lane and Nearing [1989] divide the year into 14-day or 1-month periods. Richardson [1981] smoothed the means and standard deviations of each period and each precipitation state using Fourier series. The smoothed daily values of the means and standard deviations are subsequently used for deseasonalization.

Daily precipitation occurrence in these models is presumed to follow a first-order Markov chain, with the daily precipitation amount generated from an assumed probability distribution (such as gamma, exponential, truncated normal, etc.) fitted to the historical daily amounts for each period. One approach to generate the other variables is to fit distributions independently for each variable for each period and for each precipitation state, under the assumption that each variable is conditionally independent and identically distributed (i.i.d.). This approach and its variants are used by Jones et al. [1972], Bruhn et al. [1980], and Lane and Nearing [1989]. In Lane and Nearing's model CLIGEN each variable is assumed to be an independent Gaussian variable for each month, with parameters dependent on the precipitation state transition (e.g., wet to wet, dry to wet, etc.). This approach does not consider the dependence between the variables and the serial dependence for each variable.

Nicks and Harp [1980] considered serial dependence of weather variables. They fit autoregressive models of order 1 (AR-1) independently to each variable for each period. Richardson [1981], who used a multivariate autoregressive model of order 1 (MAR-1), added the consideration of dependence across variables. These models suffer from the drawback of assuming the data to be normally distributed. As a result, only linear dependence between variables and precipitation states from one day to the next can be reproduced.

These approaches have four main drawbacks. First, since precipitation is exogenously provided, lag 0 and lag 1 correlations of the variables are often not properly reproduced. Second, the choice of a probability distribution function is often subjective and is rarely formally tested on a site-by-site basis. Third, there is reliance on an implicit Gaussian framework (e.g., AR or MAR) which preserves only linear dependence and poses problems for bounded variables. Fourth, the fitted models have limited portability, in the sense that procedures/distributions used at one site may not be best at other sites. Transformations of variables can be used to justify the Gaussian AR or MAR framework. However, it is difficult to develop appropriate transformations in the setting considered here and preserve the proper statistical relationships in the untransformed space. All six of the variables considered here are in some sense bounded.
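As a concrete sketch of the occurrence-and-amount structure described above, the following simulates a two-state first-order Markov chain for wet/dry occurrence with exponentially distributed wet-day amounts. The transition probabilities and mean amount are illustrative placeholders, not values fitted to any site:

```python
import numpy as np

rng = np.random.default_rng(42)

# Illustrative parameters; in practice these are estimated for each
# season (and site) from the historical record.
p_wet_given_wet = 0.6   # P(wet tomorrow | wet today)
p_wet_given_dry = 0.2   # P(wet tomorrow | dry today)
mean_wet_amount = 5.0   # mean precipitation on wet days (mm)

def simulate_precip(n_days, state0=0):
    """Two-state first-order Markov chain for occurrence,
    exponential distribution for wet-day amounts."""
    amounts = np.zeros(n_days)
    state = state0  # 0 = dry, 1 = wet
    for t in range(n_days):
        p_wet = p_wet_given_wet if state == 1 else p_wet_given_dry
        state = int(rng.random() < p_wet)
        if state == 1:
            amounts[t] = rng.exponential(mean_wet_amount)
    return amounts

precip = simulate_precip(10000)
```

In the traditional schemes, the remaining variables would then be generated conditionally on the wet/dry state of each simulated day, e.g., from per-state fitted distributions or an AR-1/MAR-1 model.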
Katz [1996] observes that the Richardson model (1) does not preserve the lag 1 autocorrelation of the weather variables that are conditioned on precipitation amount, (2) underestimates the observed variance of monthly values of the weather variables, and (3) because of its conditional form (conditioning on precipitation state), leads to effects unanticipated by the user as model parameters are varied. He notes that these problems are endemic to this class of models and provides ways by which the unconditional distributions of the weather variables in such a model can be derived and examined. The model of Rajagopalan et al. [1997] circumvents some of these problems; since the nonparametric density estimation does not require the transformation of the variables, wet and dry spell statistics are explicitly preserved, and nonlinear relations between the variables are approximated. However, it does not address the problems introduced by having an exogenous precipitation simulator. The kernel density estimation procedures also do not adapt the degree of density smoothing to the state space as well as the k-NN density estimates employed here.

A multivariate chain model for simulating daily minimum and maximum temperatures and precipitation was presented by Young [1994]. This model is similar to the model presented here in that a k-NN strategy is employed to select a day at random from the historical data set as a simulation for the three variables for the next day. Young uses multiple discriminant analysis to identify patterns in the three-dimensional data. The k nearest neighbors of the current day in terms of these patterns are identified, one of them is randomly selected, and its "next" day's values are adopted as the simulation for the current day's successor. Seasonal variations are not considered, and the number of nearest neighbors is selected by comparing the autocorrelograms of the simulated variables with those of the corresponding historical variables. The number of nearest neighbors selected (three to five) by this criterion is quite small. Young demonstrates the superiority of the approach over a first-order Markov chain model for the three variables in terms of a variety of statistics. His model preserves most notably the cross correlation between temperature and precipitation and the wet/dry spell statistics. He also notes some biases (e.g., reduced persistence and underestimation of the fraction of dry months) in the sequences simulated by his method. The work presented here is philosophically similar to the model of Young, but it differs in operational details. A connection to the Markov process, nonparametric density estimation, and nonlinear dynamical systems literature is also provided.

All the techniques discussed in this section focused on "short-range" statistical properties. It is known that such models will not likely reproduce the variance and related statistical attributes at longer aggregation periods (e.g., the interannual variance and dependence of seasonal precipitation). The model presented in this paper does not explicitly address this concern either.

Figures 1 and 2 show the pairwise scatterplots of the six variables for wet and dry days, respectively, for season 1 (January–March) of the 1961–1991 data from Salt Lake City. The line in each scatterplot is a locally weighted scatterplot smooth (LOWESS: a moving-window weighted local regression from Cleveland [1979]). We observe that the pairwise relationships between the variables can (1) be nonlinear and (2) differ for wet and dry days. There is also evidence (bottom row of Figure 1) for the dependence of the precipitation amount on some of the other variables (notably dew point temperature). This indicates that a strategy that directly includes precipitation in the set to be simulated may be better than one in which precipitation is generated exogenously to the other variables. Heteroskedasticity (nonconstant variance of errors from the smooth in each frame) is also observed. Transforms of individual variables are often used to develop cross-dependence relations that are approximately linear, with relatively uniform scatter about the regression line. Given the varying "curvature" of the mean response and scatter in the pairwise relationships, it is not obvious that a useful set of univariate transformations that can address the multivariate dependence is feasible. The likely utility of a scheme that recognizes these factors and approximates the behavior locally in some sense is obvious.
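The LOWESS lines in Figures 1 and 2 are moving-window weighted local regressions. A simplified sketch (tricube weights and a local linear fit, omitting the robustness iterations of Cleveland [1979]) is:

```python
import numpy as np

def lowess(x, y, frac=0.5):
    """Simplified locally weighted scatterplot smooth: for each point,
    fit a weighted least-squares line to its nearest neighbors in x
    using tricube weights, and evaluate the line at that point."""
    n = len(x)
    k = max(2, int(frac * n))               # points per local fit
    order = np.argsort(x)
    x, y = x[order], y[order]
    fitted = np.empty(n)
    for i in range(n):
        d = np.abs(x - x[i])
        idx = np.argsort(d)[:k]             # k nearest points in x
        h = d[idx].max() or 1.0             # local bandwidth
        w = (1.0 - (d[idx] / h) ** 3) ** 3  # tricube weights
        coef = np.polyfit(x[idx], y[idx], 1, w=np.sqrt(w))
        fitted[i] = np.polyval(coef, x[i])
    return x, fitted
```

Plotting the fitted values against the sorted x gives the smooth mean-response line; the spread of residuals around it is what reveals the heteroskedasticity noted above.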
3. Multivariate Markov Model and Bootstrap
Let us denote the time series of length n of the daily values of the six variables by x_t, t = 1, ..., n. For now, assume that seasonality has been taken care of in some fashion and that we are interested in resampling daily values x_t, focusing only on dependence on m past values, i.e., x_{t-1}, x_{t-2}, ..., x_{t-m}. The process x_t is thus considered to be an m-dependent multivariate Markov process. Synthetic sequences from such a model can be simulated if we specify the conditional distribution function F(x_t | x_{t-1}, x_{t-2}, ..., x_{t-m}). The models discussed in section 2 belong to this general framework, with m = 1 and with the conditional distribution function F(x_t | x_{t-1}) described using parametric functions (Gaussian distributions for all variables except precipitation). The primary difference in this paper is that we implicitly use a nonparametric density estimate to resample from F(x_t | x_{t-1}).

The bootstrap [Efron, 1979] is a technique that prescribes a data-resampling strategy using the random mechanism that generated the data. Its applications for estimating confidence intervals and parameter uncertainty are well known [see Härdle and Bowman, 1988; Tasker, 1987; Woo, 1989; Zucchini and Adamson, 1989]. Usually, the bootstrap resamples with replacement from the empirical distribution function F_n(x) of independent, identically distributed data x_i, i = 1, ..., n. This is equivalent to resampling the observations x_i with a probability of 1/n. An algorithm for bootstrapping time series considering Markovian dependence was developed by Lall and Sharma [1996], who applied it to univariate, monthly streamflow data. This algorithm was motivated by nonparametric approaches to time series analysis using nearest-neighbor density and regression estimators of Yakowitz [1973, 1979, 1985, 1993]. We shall briefly motivate this algorithm in the context of the present work.

Figure 1. Pairwise scatterplot of SRAD, TMX, TMN, WSPD, DPT, and P for wet days, for season 1 at Salt Lake City. The lines in each section are the locally weighted scatterplot smoother (LOWESS) smooths.

The Markov chain model for precipitation occurrence usually considers two states (wet and dry) and transition probabilities p_ij for transitions from state i to state j in the next time period. This is a nonparametric model, with an intuitively appealing structure. It has been noted [Lall et al., 1996; Rajagopalan et al., 1996] that it may be desirable to have more than two states in such models to recognize the role of precipitation magnitude. Increasing the number of states can provide a better stepwise approximation to the conditional distribution function F(P_t | P_{t-1}) of the associated Markov process for rainfall.

One can extend this thinking to the other five variables as well. Let us say that we partition each of these variables into p states and consider a Markov chain model for all the variables. For the multivariate problem in six variables, there are a total of p^6 states at each time step. Thus even for the rather coarse description of the process for p = 2 one needs to compute transition probabilities from 64 states to 64 states at the next time step. Clearly, the sample sizes needed to reliably estimate transition probabilities under this framework would be very large. As the number of states considered increases, the situation becomes rapidly intractable (p = 5 yields 15,625 states, and p = 10 gives 10^6 states). This is the well-known curse of dimensionality. Conceptually, we shall retain the nonparametric flavor of the Markov chain approach, but we shall strive to approximate the conditional distribution function F(x_t | x_{t-1}) in a more adaptive manner using nearest-neighbor density estimators.

We motivate this idea through Figure 3, where we show a plot between successive values for a synthetic, univariate time series. Note that while the correlation between x_t and x_{t-1} is zero, x_t depends directly on x_{t-1}, with no random terms. Four states equally spaced between 0 and 1 for a Markov chain representation are considered. Consider resampling an x_t, given that x_{t-1} corresponds to the whisker in the window marked as A. If we had observed this value of x_{t-1} several times, we could directly apply the bootstrap and resample directly from the successors (i.e., x_t values corresponding to each such occurrence) to these observations. Since we do not have such information, assuming that the conditional distribution function F(x_t | x_{t-1}) is smooth (i.e., differentiable with bounded derivatives) in a neighborhood of the point of interest, we can "borrow" the successors of neighboring values of x_{t-1} for the purpose. The windows A and B were based on 10 neighbors of the marked point. We can see that these moving windows are quite effective in capturing the local attributes of the transitions from x_{t-1} to x_t. For the situation corresponding to window A, if we had used the four-state Markov chain model, all we would know is that 0.75 ≤ x_t ≤ 1 with probability 1 for all values in the range 0.25 ≤ x_{t-1} ≤ 0.5. Asymptotically, i.e., as the sample size tends to infinity, the size of the neighborhood dictated by a given number of neighbors k will shrink, and the approximation of the underlying conditional distribution function will improve.

Figure 2. Pairwise scatterplot of SRAD, TMX, TMN, WSPD, and DPT for dry days for season 1 at Salt Lake City. The lines in each section are the LOWESS smooths.

In the multivariate setting, neighbors of the conditioning point correspond to data patterns that are similar to the pattern at the conditioning point. For a day with no rain that is warm, with little wind and no humidity, neighbors established by calculating the vector distance between the observations will be similar days. The values for the weather variables for the next day will be sampled as a vector from a historically similar day. Clearly, there is some utility to giving a higher probability to a day that is more similar to the conditioning day than the other "neighbors." Using a weight function that decays smoothly with distance can reduce the sensitivity to the number of nearest neighbors used for resampling. A weight function applied to the nearest neighbors that is natural in a certain sense and the choice of the number of nearest neighbors to use are discussed in some detail by Lall and Sharma [1996].
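The windowed transition sampling of Figure 3 can be sketched for the univariate case as follows. The series comes from the same deterministic map; the series length, k = 10, and the uniform resampling within the window (rather than a decreasing weight function) are illustrative simplifications:

```python
import numpy as np

rng = np.random.default_rng(1)

# "Historical" series from the deterministic map x_{t+1} = 1 - 4*(x_t - 0.5)^2
n = 2000
x = np.empty(n)
x[0] = 0.3
for t in range(n - 1):
    x[t + 1] = 1.0 - 4.0 * (x[t] - 0.5) ** 2

def knn_bootstrap_step(current, series, k=10):
    """Resample a successor for `current` from the successors of its
    k nearest neighbors in the historical series."""
    d = np.abs(series[:-1] - current)   # distance from every historical x_t
    neighbors = np.argsort(d)[:k]       # indices of the k nearest x_t
    pick = rng.choice(neighbors)        # uniform kernel, for simplicity
    return series[pick + 1]             # that neighbor's observed successor

# Generate a synthetic trajectory by repeated conditional resampling.
sim = [x[-1]]
for _ in range(500):
    sim.append(knn_bootstrap_step(sim[-1], x, k=10))
sim = np.asarray(sim)
```

Because successors are borrowed only from a small neighborhood of the conditioning value, the simulated series approximately follows the (here nonlinear) transition structure without that structure ever being estimated parametrically.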
4. The k-NN Resampling Algorithm

The k-NN conditional resampling scheme is described in this section. All six daily weather variables (including precipitation) are considered simultaneously as members of a daily weather vector. Denote the vector time series of weather variables by x_t, t = 1, ..., n, and assume for now that we have decided on a dependence structure, i.e., which and how many lags the future values will depend on and the number of nearest neighbors k to use. We shall call this conditioning set a "feature vector" and the simulated or forecasted vector the "successor." The strategy is to find the historical nearest neighbors of the current feature vector and to resample from their successors. Rather than resampling uniformly from the k successors, we use a discrete resampling kernel that is monotonically decreasing, is data adaptive, adapts automatically to the dimension of the feature vector and to boundaries of the sample space, and has an attractive probabilistic interpretation consistent with the nearest-neighbor method. Also presume for now that the data have been deseasonalized or that a treatment for seasonality is available that does not affect the algorithm presented in section 4.1. We deseasonalize the time series of each of the variables by removing the calendar day's mean and dividing by the calendar day's standard deviation computed over the historical record. The x_t referred to are deseasonalized variates. The final results presented are obtained by multiplying the daily values generated by the standard deviation for that date and adding the mean for that date.

We now present an annotated algorithm for resampling weather variables, adopted here, that considers day-to-day dependence between the variables. This algorithm is applied for a given season (e.g., 3 months, 1 month) and is initialized by the x_t values for the last day of the previous season.
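The calendar-day standardization just described can be sketched as follows; the 30-year record here is synthetic, standing in for one variable of the Salt Lake City data:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for one variable (e.g., TMX): n_years x 365 days,
# a smooth seasonal cycle plus noise.
n_years = 30
doy = np.arange(365)
seasonal_mean = 15.0 + 10.0 * np.sin(2.0 * np.pi * doy / 365.0)
data = seasonal_mean + rng.normal(0.0, 3.0, size=(n_years, 365))

# Calendar-day mean and standard deviation over the historical record.
day_mean = data.mean(axis=0)
day_std = data.std(axis=0, ddof=1)

# Deseasonalized variates x_t used by the resampling algorithm.
z = (data - day_mean) / day_std

# Simulated values are returned to the original units by inverting
# the transform for the calendar date on which they fall.
restored = z * day_std + day_mean
```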
4.1. Flow Chart for Resampling
The key steps in the algorithm are (1) identifying a current conditioning vector of the six weather variables, (2) determining its k nearest neighbors in state space, (3) identifying, for each of these k nearest neighbors, a successor vector comprising the next day's values for the six variables, (4) resampling one of these vectors to represent the next day's weather using a kernel or weight function, and (5) repeating this process.

1. Define the composition of the feature vector D_t of dimension d:

D_t = x_{t-1}

Here we have chosen to use the vector of the (six) deseasonalized variables of interest on the previous day as the feature vector. One could add, if desired, other information, such as the value of an atmospheric flow index (e.g., the Southern Oscillation Index) on the same day or averaged over the past month, and/or additional lags (e.g., D_t = [x_{t-1}, x_{t-2}, ..., x_{t-L}], where L is the number of terms in the model). Katz and Parlange [1995] fit stochastic models for daily precipitation conditional on a monthly index of large-scale atmospheric circulation.

2. Denote the current feature vector as D_i and determine its k nearest neighbors among the historical state vectors D_m using the weighted Euclidean distance

r_im = [ Σ_{j=1}^{d} w_j (v_ij - v_mj)^2 ]^{1/2}    (1)

where v_ij is the jth component of D_i and the w_j are weights. Here we chose the weights w_j as "scaling" weights (e.g., 1/s_j), where s_j is some measure of scale, such as the standard deviation or range, of v_j. The weighted Euclidean distance may also be computed as r_im^2 = (v_i - v_m)^T Σ^{-1} (v_i - v_m), where Σ is the covariance matrix of D and v_i and v_m represent the values of D at points i and m. The weights w_j may thus be specified a priori, as is done here, or they may be chosen to provide the best forecast for a particular successor in a least squares sense [see Yakowitz and Karlsson, 1987]. The latter would be the desirable method, but it adds substantially to the
Figure 3. A plot of x_{t+1} versus x_t for the time series generated from the model x_{t+1} = 1 - 4(x_t - 0.5)^2. The state space for x is discretized into four states as shown. Also shown are windows A and B with whiskers located over selected values of x_t. These windows represent a k nearest neighborhood of the corresponding x_t. In general, these windows will not be symmetric about the x_t of interest, and their width varies depending on the relative sampling density of x_t. Note how one can think of state transition probabilities using these windows in much the same way as with the multistate Markov chain. However, the nearest-neighbor windows point directly to the region in which transitions are possible. A value of x_{t+1} conditional to point A or B can be bootstrapped by appropriately sampling, with replacement, one of the values of x_{t+1} that falls in the corresponding window. (From Lall and Sharma [1996].)
