Recipes/Menus

9 pages
4 views

DSP Architecture for Wireless Sensor Nodes Using VLSI Technique

Please download to get full document.

View again

of 9
All materials on our website are shared by users. If you have any questions about copyright issues, please report us to resolve them. We are always happy to assist you.
Share
Description
Radio communication exhibits the highest energy consumption in wireless sensor nodes. Given their limited energy supply from batteries or scavenging, these nodes must trade data communication for on-the-node computation. Currently, they are designed
Transcript
  [Ragumadhavan ,  3(2): February, 2014] ISSN: 2277-9655 Impact Factor: 1.852   [883-891] IJESRT   INTERNATIONAL JOURNAL OF ENGINEERING SCIENCES & RESEARCH TECHNOLOGY DSP Architecture for Wireless Sensor Nodes Using VLSI Technique R.Ragumadhavan Assistant Professor, Department of ECE, PSNA College Of Engineering And Technology, India ragufriends@gmail.com    Abstract — Radio communication exhibits the highest energy   consumption in wireless sensor nodes. Given their limited energy   supply from batteries or scavenging, these nodes must trade data   communication for on-the-node computation. Currently, they are   designed around off-the-shelf low-power microcontrollers. But   by employing a more appropriate processing element, the energy   consumption can be significantly reduced. This paper describes   the design and implementation of the newly proposed folded-tree   architecture for on-the-node data processing in wireless sensor   networks, using parallel prefix operations and data locality in   hardware. Measurements of the silicon implementation show   an improvement of 10–20 × in terms of energy as compared to   traditional modern micro-controllers found in sensor nodes.    keywords —Digital processor, parallel prefix, wireless sensor network (WSN).   I.   I NTRODUCTION WIRELESS sensor network (WSN) applications range   from medical monitoring to environmental sensing,   industrial inspection, and military surveillance. WSN nodes   essentially consist of sensors, a radio, and a microcontroller   combined with a limited power supply, e.g., battery or energy   scavenging. Since radio transmissions are very expensive   in terms of energy, they must be kept to a minimum in   order to extend node lifetime. The ratio of communication-to-   computation energy cost can range from 100 to 3000 [1]. So   data communication must be traded for on-the-node processing   which in turn can convert the many sensor readings into   a few useful data values. The data-driven nature of WSN   applications requires a specific data processing approach.   Previously, we have shown how parallel prefix computations   can be a common denominator of many WSN data processing   algorithms [2]. The goal of this paper is to design an ultra-   low-energy WSN digital signal processor by further exploiting this and other characteristics unique to WSNs. First, Section II lists these specific characteristics of WSN-   related on-the-node processing. Then, Section III covers the   proposed approach to exploit these properties. Section IV   elaborates on the programming and usage of the resulting   folded tree architecture. Section V discusses the application-   specific integrated circuit (ASIC) implementation of the design   while Section VI measures its performance and Section VII illustrates the usefulness to WSNs with four relevant case-   studies. Finally, the work is concluded in Section VIII. II.   C HARACTERISTICS OF WSN S AND R ELATED R EQUIREMENTS FOR P ROCESSING Several specific characteristics, unique to WSNs, need to   be considered when designing a data processor architecture   for WSNs.  Data-Driven: WSN applications are all about sensing data   in an environment and translating this into useful information   for the end-user. So virtually all WSN applications are char-   acterized by local processing of the sensed data [3].  Many-to-Few: Since radio transmissions are very expensive   in terms of energy, they must be kept to a minimum in order   to extend node lifetime. Data communication must be traded   for on-the-node computation to save energy, so many sensor   readings can be reduced to a few useful data values.  Application-Specific: A “one-size-fits-all” solution does not exist since a general purpose processor is far too power hungry   for the sensor node’s limited energy budget. ASICs, on the   other hand, are more energy efficient but lack the flexibility   to facilitate many different applications. Apart from the above characteristics of WSNs, two key requirements for improving existing processing and control   architectures can be identified.  Minimize Memory Access: Modern micro-controllers (MCU) are based on the principles of a divide-and-conquer   strategy of ultra-fast processors on the one hand and arbitrary   complex programs on the other hand [4]. But due to this   generic approach, algorithms are deemed to spend up to 40–   60% of the time in accessing memory [5], making it a bottle-   neck [6]. In addition, the lack of task-specific operations leads   to inefficient execution, which results in longer algorithms and   significant memory book keeping. Combine Data Flow and Control Flow Principles: To man-   age the data stream (to/from data memory) and the instruction   stream (from program memory) in the core functional unit,   two approaches exist. Under control flow, the data stream is   a consequence of the instruction stream, while under data   flow the instruction stream is a consequence of the data   stream. A traditional processor architecture is a control flow   machine, with programs that execute sequentially as a stream   of instructions. In contrast, a data flow program identifies the   data dependencies, which enable the processor to more or less   choose the order of execution. The latter approach has been   hugely successful in specialized high-throughput applications,   such as multimedia and graphics processing. This paper shows   how a combination of both approaches can lead to a significant   improvement over traditional WSN data processing solutions.    [883-891] Fig. 1. A binary tree (left, 7 PEs) is functionally equivalent to the novel   folded tree topology (right, 4 PEs) used in this architecture. Fig. 2. Addition with propagate-generate (PG) logic. III.   P ROPOSED A PPROACH  A.   WSN Applications and On-The-Node Data Aggregation  Notwithstanding the seemingly vast nature of WSN applica-   tions, a set of basic building blocks for on-the-node processing   can be identified. Common on-the-node operations performed   on input data collected directly from the node’s sensors or   through in-the-network aggregation include filtering, fitting,   sorting, and searching [7]. We published earlier [2] that these   types of algorithms can be expressed in terms of parallel prefix   operations as a common denominator. Prefix operations can be calculated in a number of ways [8],   but we chose the binary tree approach [9] because its flow   matches the desired on-the-node data aggregation. This can   be visualized as a binary tree of processing elements (PEs)   across which input data flows from the leaves to the root   (Fig. 1, left). This topology will form the fixed part of our   approach, but in order to serve multiple applications, flexibility   is also required. The tree-based data flow will, therefore, be   executed on a data path of programmable PEs, which provides   this flexibility together with the parallel prefix concept.  B.   Parallel Prefix Operations  In the digital design world, prefix operations are best   known for their application in the class of carry look-ahead   adders [10]. The addition of two inputs  A and  B in this case   consists of three stages (Fig. 2): a bitwise propagate-generate   (PG) logic stage, a group PG logic stage, and a sum-stage. The outputs of the bitwise PG stage ( P i =  A i ⊕  B i and   G i =  A i ·  B i ) are fed as (  P i , G i  ) -pairs to the group PG logic   stage, which implements the following expression : (  P i , G i  ) ∘ (  P i + 1 , G i + 1  ) = (  P i · P i + 1 , G i + P i · G i + 1  ) (1) It can be shown this ∘ -operator has an identify element  I = (  1 , 0  ) and is associative. Fig. 3. Example of a prefix calculation with sum-operator using Blelloch’s   generic approach in a trunk- and twig-phase. For example, the binary numbers  A = “1001” and  B =   “0101” are added together. The bitwise PG logic of LSB-first   noted  A = { 1001 } and  B = { 1010 } returns the PG-pairs for   these values, namely, (  P , G  ) = { (  0 , 1  ) ; (  0 , 0  ) ; (  1 , 0  ) ; (  1 , 0  ) } .   Using these pairs as input for the group PG-network, defined   by the ∘ -operator from (1) to calculate the prefix operation,   results in the carry-array G = { 1 , 0 , 0 , 0 } [i.e., the second   element of each resulting pair from (1)]. In fact, it contains all the carries of the addition, hence the name carry look-   ahead. Combined with the corresponding propagate values P i , this yields the sum S = { 0111 } , which corresponds to “1110.” The group PG logic is an example of a parallel prefix com- putation with the given ∘ -operator. The output of this parallel-   prefix PG-network is called the all-prefix set defined next. Given a binary closed and associative operator ∘ with   identity element  I and an ordered set of n elements   [ a 0 , a 1 , a 2 , ..., a n − 1 ] , the reduced-prefix set is the ordered   set [  I  , a 0 , (  a 0 ∘ a 1  ), ... , (  a 0 ∘ a 1 ∘·· ·∘ a n − 2  ) ] , while the all-prefix   set is the ordered set [ a 0 , (  a 0 ∘ a 1  ), ... , (  a 0 ∘ a 1 ∘ ··· ∘ a n − 1  ) ] ,   of which the last element ( a 0 ∘ a 1 ∘ ··· ∘ a n − 1 ) is called the   prefix element.  For example, if ∘ is a simple addition, then the prefix   element of the ordered set [ 3 , 1 , 2 , 0 , 4 , 1 , 1 , 3 ] is L, a i = 15.   Blelloch’s procedure [9] to calculate the prefix-operations on a binary tree requires two phases (Fig. 3). In the trunk-phase,   the left value L is saved locally as Lsave and it is added to   the right value R, which is passed on toward the root. This   continues until the parallel-prefix element 15 is found at the   root. Note that each time, a store-and-calculate operation is   executed. Then the twig-phase starts, during which data moves   in the opposite direction, from the root to the leaves. Now   the incoming value, beginning with the sum identity element   0 at the root, is passed to the left child, while it is also added   to the previously saved Lsave and passed to the right child.   In the end, the reduced-prefix set is found at the leaves. An example application of the parallel-prefix operation with the sum operator (prefix-sum) is filtering an array so that all   elements that do not meet certain criteria are filtered out. This   is accomplished by first deriving a “keep”-array, holding “1”   if an element matches the criteria and “0” if it should be left   out. Calculating the prefix-sum of this array will return the   amount as well as the position of the to-be-kept elements of the   input array. The result array simply takes an element from the   input array if the corresponding keep-array element is “1” and   copies it to the position found in the corresponding element    [883-891] of the prefix-sum-array. To further illustrate this, suppose the   criterion is to only keep odd elements in the array and throw   away all even elements. This criterion can be formulated as   keep (   x  ) = (   x mod 2  ) . The rest is calculated as follows: input = [ 2 , 3 , 8 , 7 , 6 , 2 , 1 , 5 ] keep = [ 0 , 1 , 0 , 1 , 0 , 0 , 1 , 1 ] prefix = [ 0 , 1 , 1 , 2 , 2 , 2 , 3 , 4 ]   result = [ 3 , 7 , 1 , 5 ] .   The keep-array provides the result of the criterion. Then   the parallel-prefix with sum-operator is calculated, which   results in the prefix-array. Its last element indicates how many   elements are to be kept (i.e., 4). Whenever the keep-array   holds a “1,” the corresponding input-element is copied in   the result-array at the index given by the corresponding   prefix-element (i.e., 3 to position 1, 7 to position 2, etc.). This   is a very generic approach that can be used in combination   with more complex criteria as well. Other possible applications that relate to WSNs include peak detection, polynomial evaluation for model-fitting, lexically   compare strings, add multi-precision numbers, delete marked   elements from arrays, and quick sort [3], [11]. Some of these   will be used as a case-study later in Section VII. C.   Folded Tree  However, a straightforward binary tree implementation of    Blelloch’s approach as shown in Fig. 3 costs a significant   amount of area as n inputs require  p = n − 1 PEs. To reduce area and power, pipelining can be traded for throughput [8]. With a classic binary tree, as soon as a layer of PEs finishes   processing, the results are passed on and new calculations can   already recommence independently. The idea presented here is to fold the tree back onto itself to maximally reuse the PEs. In doing so,  p becomes   proportional to n  /  2 and the area is cut in half. Note that also   the interconnect is reduced. On the other hand, throughput   decreases by a factor of log 2 (  n  ) but since the sample rate   of different physical phenomena relevant for WSNs does not   exceed 100 kHz [12], this leaves enough room for this tradeoff    to be made. This newly proposed folded tree topology is   depicted in Fig. 1 on the right, which is functionally equivalent   to the binary tree on the left. IV.   P ROGRAMMING AND U SING THE F OLDED T REE Now it will be shown how Blelloch’s generic approach for   an arbitrary parallel prefix operator can be programmed to   run on the folded tree. As an example, the sum-operator is   used to implement a parallel-prefix sum operation on a 4-PE   folded tree. First, the trunk-phase is considered. At the top of Fig. 4, a folded tree with four PEs is drawn of which PE3 and PE4   are hatched differently. The functional equivalent binary tree   in the center again shows how data moves from leaves to root   during the trunk-phase. It is annotated with the letters L and   R to indicate the left and right input value of inputs A and B.   In accordance with Blelloch’s approach, L is saved as Lsave Fig. 4. Implications of using a folded tree (four4-PE folded tree shown at the   top): some PEs must keep multiple Lsave’s (center). Bottom: the trunk-phase   program code of the prefix-sum algorithm on a 4-PE folded tree. and the sum L + R is passed. Note that these annotations are   not global, meaning that annotations with the same name do   not necessarily share the same actual value. To see exactly how the folded tree functionally becomes a   binary tree, all nodes of the binary tree (center of Fig. 4) are   assigned numbers that correspond to the PE (1 through 4),   which will act like that node at that stage. As can be seen,   PE1 and PE2 are only used once, PE3 is used twice and PE4   is used three times. This corresponds to a decreasing number   of active PEs while progressing from stage to stage. The first   stage has all four PEs active. The second stage has two active   PEs: PE3 and PE4. The third and last stage has only one   active PE: PE4. More importantly, it can also be seen that PE3   and PE4 have to store multiple Lsave values. PE4 must keep   three: Lsave0 through Lsave2, while PE3 keeps two: Lsave0   and Lsave1. PE1 and PE2 each only keep one: Lsave0. This   has implications toward the code implementation of the trunk-   phase on the folded tree as shown next. The PE program for the prefix-sum trunk-phase is given at the bottom of Fig. 4. The description column shows how data   is stored or moves, while the actual operation is given in the   last column. The write/read register files (RF) columns show   how incoming data is saved/retrieved in local RF, e.g.,  X @0 bY    means  X is saved at address 0 bY , while 0 bY @  X loads the   value at 0 bY into  X . Details of the PE data path (Fig. 8)   and the trigger handshaking, which can make PEs wait for   new input data (indicated by T ), are given in Section V. The   trunk-phase PE program here has three instructions, which are   identical, apart from the different RF addresses that are used.   Due to the fact that multiple Lsave’s have to be stored, each   stage will have its own RF address to store and retrieve them.    [883-891] Fig. 5. Annotated twig-phase graph of 4-PE folded tree. This is why PE4 (active for 3 stages) needs three instructions   (lines 0 - 2 ), PE3 (active for 2 stages) needs two instructions   (lines 0 - 1 ) and PE1 and PE2 (active in first stage only) need   one instruction (line 0 ). This basically means that the folding   of the tree is traded for the unrolling of the program code. Now, the twig-phase is considered using Fig. 5. The tree   operates in the opposite direction, so an incoming value (anno-   tated as S) enters the PE through its O port [see Fig. 4(top)].   Following Blelloch’s approach, S is passed to the left and the sum S + Lsave is passed to the right. Note that here as well none of these annotations are global. The way the PEs are activated during the twig-phase again influences how the   programming of the folded tree must happen. To explain this,   Fig. 6 shows each stage of the twig-phase (as shown in Fig. 5)   separately to better see how each PE is activated during the   twig-phase and for how many stages. The annotations on the   graph wires (circled numbers) relate to the instruction lines   of the program code shown in Fig. 7, which will also be   discussed. Fig. 6 (top) shows that PE4 is active during all three stages   of the twig-phase. First, an incoming value (in this case the   identity element S2) is passed to the left. Then it is added to   the previously (from the trunk-phase) stored Lsave2 value and   passed to the right. PE4-instruction 1 will both pass the sum   Lsave2 + S2 = S1 to the right ( = itself) and pass this S1 also the left toward PE3. The same applies for the next instruction 2 . The last instruction 3 passes the sum Lsave0 + S0. Looking at the PE3 activity [Fig. 6 (center)], it is only active in the second and third stage of the twig-phase. It is indeed   only triggered after the first stage when PE4 passes S2 to   the left. The first PE3-instruction 0 passes S2 to PE1, and   instruction 1 adds this to the saved Lsave1, passing this sum   T1 to PE2. The same procedure is repeated for the incoming   S1 from PE4 to PE3, which is passed to its left (instruction 2 ),   while the sum Lsave0+S1 is passed to its right (instruction 3 ).   In fact, two pairs of instructions can be identified, that exhibit   the same behavior in terms of its outputs: the instruction-pair 0 and 1 and the instruction-pair 2 and 3 . Two things are   different however. First, the used register addresses (e.g., to Fig. 6. Activity of different PEs during the stages of the twig-phase for a   4-PE folded tree: PE4 (top), PE3 (center), PE1 and PE2 (bottom). store Lsave values) are different. Second, the first pair stores   incoming values S0 and S1 from PE4, while the second pair   does not store anything. These differences due to the folding,   again lead to unrolled program code for PE3. Last, PE1 and PE2 activity are shown at the bottom of Fig. 6. They each execute two instructions. First, the incoming   value is passed to the left, followed by passing the sum of this   value with Lsave0 to the right. The program code for both is   shown in the bottom two tables of Fig. 7. V.   H ARDWARE I MPLEMENTATION Fig. 8(a) gives a schematic overview of the implemented   folded tree design. The ASIC comprises of eight identical   16-bit PEs (Fig. 10), each consisting of a data path with    [883-891] (a) (b) Fig. 8. Schematic diagram of design overview. (a) Top-level view (here with   four PEs shown). (b) Detail of one PE data path . Fig. 9. View of the lab measurement setup for the folded tree IC. Fig. 7. Program of the twig-phase of the prefix sum algorithm for a 4-PE   folded tree. programmable controller and 16 × 36 bit instruction memory.   They are interconnected through the request-acknowledge   handshaking trigger bus and the bidirectional data bus in the folded way (cf. Fig. 1). Handshaking triggers activate the   PEs only when new data is available and in such a way that   they functionally become a binary tree in both directions of    the trunk- and twig-phase. Within each data path [Fig. 8(b)],   muxes select external data, stored data or the previous result   as the next input for the data path. The data path contains an   algorithmic logical unit (ALU) with four-word deep register   files (RF-A and RF-B) at the inputs A and B for operand   isolation. These RFs comprise the distributed data memory   of the whole system, with a combined capacity of 4 kB.   They are the only clocked elements within the data path. As   data flows through the tree, it is constantly kept local to its   designated operation. This is one of the goals of this paper,   which effectively removes the von Neumann bottleneck and saves power. The design targets 20–80-MHz operation at 1.2   V. It was fabricated in 130-nm standard cell CMOS. A PE takes six (down-phase) or seven (up-phase) cycles to   process one 36-bit instruction, which can be divided into three   stages. 1)   Preparation, which acknowledges the data and starts the   core when input triggers are received (1 cycle). 2)   Execution, which performs the load-execute-jump stages to do the calculations and fetch the next instruction   pointer (4 cycles). 3)   Transfer, which forwards the result by triggering the next PE in the folded tree path on a request-acknowledge basis (1–2 cycle). This is tailored toward executing the key store-and-calculate   operation of the parallel prefix algorithm on a tree as described   earlier in Section III-B. Combined with the flexibility to   program the PEs using any combination of operators available   in their data path, the folded tree has the freedom to run a   variety of parallel-prefix applications [11]. VI.   E XPERIMENTAL V ALIDATION The measurement setup (Fig. 9) of the chip uses digital   interfacing over universal serial bus (USB) to access the data   I/O, programming, and debug facilities. The data bus [see
We Need Your Support
Thank you for visiting our website and your interest in our free products and services. We are nonprofit website to share and download documents. To the running of this website, we need your help to support us.

Thanks to everyone for your continued support.

No, Thanks
SAVE OUR EARTH

We need your sign to support Project to invent "SMART AND CONTROLLABLE REFLECTIVE BALLOONS" to cover the Sun and Save Our Earth.

More details...

Sign Now!

We are very appreciated for your Prompt Action!

x