ECE/CS 757: Advanced Computer Architecture II

Spring Semester 2007

News

Special Office hours 2-3 PM Wed. May 9

Homework 4 due April 24

MPP readings ready – Apr. 14

Posted SIMD readings/slides Apr. 9

Updated cluster slides Apr. 9

Notes and readings for Clusters ready – March 26

Final version Server MP lecture notes – March 26

Server MP lecture notes available; March 20

Exam posted; due March 20 before class

HW 3 solutions posted

Last installment of Client MP slides; March 7

First installment of Client MP slides

Appended final installment to memory slides Feb. 26

Homework 3 due Mar. 6

Project assignment given Feb. 20

Appended third installment of memory slides Feb. 19

Appended second installment of memory slides (to first installment) Feb. 14

Added assigned readings for transactional memory

Added first installment of Memory slides

Added reading material for Memory Feb. 11

Added lecture slides for cores Feb. 5

Web page posted Jan. 19, 2007

Instructor Information

Prof. James E. Smith
Office:             2359 Engineering Hall
Office hours:   2:30-3:30PM on lecture days
Office phone:   265-5737
Email:     jes@ece.wisc.edu (put "757" somewhere in the subject line)

Course Information

· Required Course Material

Instructor book draft and papers available on the Web

· Reference Course Material

David Culler and J. P. Singh with Anoop Gupta
Parallel Computer Architecture: A Hardware/Software Approach
Morgan Kaufmann Publishers, 1998.

· Homework

Assignments are due by 5:00 P.M. on the due date. No late assignments will be accepted.

· Project

The project is to do some original research in a group of three students. For example, you could examine a modest extension to a paper studied in class or re-validate the data in some paper by writing your own simulator.

Midterm exam I will be in class
Midterm exam II will be in class on May 10

· Grading

Homework     10%
Midterm I       30%
Project            30%
Midterm II      30%

Calendar

1. Introduction (1 lecture+)

Date: Jan. 23

Reading: Chapter 1; Amdahl, Olukotun et al.

2. MP Software and ISA (3 lectures)

Dates: Jan. 25, 30, Feb. 1

Reading: Chapter 2; Lamport; Muys; MPI tutorial

3. Cores (3 lectures)

Dates: Feb. 6, 8, 13

Reading: Chapter 3; Borkenhagen et al.; D. Marr et al.; B. Sinharoy et al.; P. Kongetira, et al.

4. Memory (3 lectures)

Dates: Feb. 15, 20, 22

Reading: Chapter 4; Natarajan et al.; Adve and Gharachorloo; Hammond et al.; Hammond et al.; Rajwar et al.; Saha et al.

5. Client MPs (3 lectures)

Dates: Feb. 27, March 1, 6, 8

Reading: Kumar ISCA05, Suh, Petoumenos, Kumar Computer, Chishti, Mendelson

Review Mar. 8

Exam I, Mar. 8-18

6. Server MPs (Marty Talk) (4 lectures)

Dates: March 15, 20, 22, 27

Reading: Nesbit et al., Barroso et al., Marty and Hill, Beckmann et al., Keltcher et al., Kota and Oehler, Briggs et al.

7. Clusters (2 lectures)

Dates: March 29, April 10

Reading: Kronenberg, Barroso, Desai, King

8. SIMD systems (2 lectures)

Dates: April 12, 17

Reading: Padua and Wolfe, Cray, Hillis and Steele

9. Massively Parallel MPs (3 lectures)

Dates: April 19, 24, 26

Reading: Singh et al., Ni and McKinley, Hillis and Tucker, Dally et al., Cray Research, Scott

10. Dataflow (1 lecture)

Dates: May 1

Reading: Lee & Hurson, Swanson et al.

11. Special Purpose Systems (1 lecture)

Dates: May 3

Reading: Kahle et al., Day & Hofstee

Review: May 8

Exam II: May 10, in class

Homeworks

1. Read one of the unassigned papers (12, 14, 21, 22) from the MP Software and ISA readings and write a one-page summary. Due Jan. 30.

2. Read one of the unassigned papers (31, 33, 34, 36, 37, 38) from the Cores readings and write a one-page summary. Due Feb. 15.

3. Homework 3 problems – Due Mar. 6.

Homework 3 solutions

4. Homework 4 problems – Due Apr. 24.

Homework 4 solutions

Papers

Introduction

1. J. E. Smith, draft Chapter 1

2. G. Amdahl, “The Validity of the Single Processor Approach to Achieving Large Scale Computing Capabilities,” Spring Joint Computer Conference, 1967, pp. 483-485.

3. K. Olukotun, et al., “The Case for a Single-Chip Multiprocessor,” ASPLOS-7, October 1996.

4. E. Berndt et al., “Price and Quality of Desktop and Mobile Personal Computers: A Quarter Century of Progress,” National Bureau of Economic Research Summer Institute 2000, Nov. 2000.

5. S. Eyerman, et al. “A Mechanistic Model for Superscalar Processors,” 2007.

6. A. Hartstein et al., “On the Nature of Cache Miss Behavior: Is It √2 ?” Journal of Instruction Level Parallelism, July 2006, pp. 1-21.

7. A. Hartstein et al., “Optimal Memory Hierarchy”, unpublished paper, 2007.

8. I. Tuomi, “The Lives and Death of Moore’s Law,” First Monday, Nov. 2002.

MP Software and ISA

9. J. E. Smith, draft Chapter 2

10. A. H. Karp, "Programming for Parallelism," IEEE Computer, pp. 43-57, May 1987.

11. A. Muys, A Pthreads Tutorial.

12. P. E. McKenney, Selecting Locking Designs for Parallel Programs PLoPD-II, 1996.

13. Leslie Lamport, How to Make a Multiprocessor Computer that Correctly Executes Multiprocess Programs, IEEE Trans. on Computers, September 1979, pp. 690-691.

14. H. Sutter and J. Larus, Software and the Concurrency Revolution, ACM Queue, September 2005.

15. J. Larus, Software Challenges in Nanoscale Technology, talk at CRA workshop, Dec. 2005.

16. Maui High Performance Computing Center, SP Parallel Programming Workshop -- Message Passing Interface, 2003.

17. MPI Forum, MPI-2: Extensions to the Message Passing Interface, 2003.

18. B. Barney, Posix Threads Programming, Lawrence Livermore National Laboratory.

19. S. Oaks and H. Wong, Advanced Synchronization in Java Threads, excerpt from Java Threads, O’Reilly Publishers.

20. D. B. Skillicorn, J. M. D. Hill and W. F. McColl, “Questions and answers about BSP”, Journal of Scientific Programming, Fall 1997.

21. M. Dubois, C. Scheurich, and F. A. Briggs, “Synchronization, Coherence, and Event Ordering in Multiprocessors”, IEEE Computer, vol. 21, pp. 9-21, February 1988.

22. Sarita V. Adve and Kourosh Gharachorloo, “Shared Memory Consistency Models: A Tutorial,” IEEE Computer, 29(12):66-76, December 1996.

23. C. Blundell, et al., “Deconstructing Transactional Semantics: The Subtleties of Atomicity,” Fourth Annual Workshop on Duplicating, Deconstructing, and Debunking, June 2005.

24. L. Hammond, et al., “Programming with Transactional Coherence and Consistency (TCC)”, Proc. ASPLOS, October 2004.

Cores

25. J. E. Smith and G. S. Sohi, “The Microarchitecture of Superscalar Processors”, Proceedings of the IEEE, Dec. 1995, pp. 1609 – 1624.

26. B. J. Smith, “Architecture and Applications of the HEP Multiprocessor Computer System”, SPIE Real Time Signal Processing IV, 1981, pp. 241-248.

27. J. Emer, "EV8: The Post-Ultimate Alpha", keynote talk, PACT 2001.

28. D. Marr et al., "Hyper-Threading Technology Architecture and Microarchitecture," Intel Technology Journal, Feb. 2002.

29. J. M. Borkenhagen et al., "A Multithreaded PowerPC Processor for Commercial Servers," IBM Journal of Research and Development, 2000.

30. P. Kongetira, et al., Niagara: A 32-way Multithreaded SPARC Processor, IEEE Micro, March/April 2005, pages 21-29.

31. Tullsen et al., “Exploiting Choice: Instruction Fetch and Issue on an Implementable Simultaneous Multithreading Processor,” Proc. 23rd International Symposium on Computer Architecture, May 1996, pp. 191-202.

32. B. Sinharoy et al., “Power5 System Microarchitecture,” IBM Journal of Research and Development, July 2005, pp. 505-521.

33. F. J. Cazorla, et al., “Dynamically Controlled Resource Allocation in SMT Processors,” Micro 2004, pp. 171-182.

34. H. McGhan, “Niagara 2 Opens the Floodgates,” Microprocessor Report, Nov. 2006.

35. S. Eyerman, et al. “A Mechanistic Model for Superscalar Processors,” 2007.

36. S. E. Raasch and S. K. Reinhardt, “The Impact of Resource Partitioning on SMT Processors,” PACT 2003, pp. 15-25.

37. F. J. Cazorla, et al., “Predictable Performance in SMT Processors: Synergy Between the OS and SMTs,” IEEE Trans. Comp, July 2006, pp. 785 - 799.

38. K. Luo, J. Gummaraju, M. Franklin, “Balancing throughput and fairness in SMT processors,” 2001 ISPASS, Nov. 2001, pp. 164-171.

39. J. E. Thornton, “Parallel Operation in the CDC 6600,” AFIPS Proc. FJCC, pt. 2 vol. 26, 1964, pp. 33-40.

Memories

40. J. Archibald, “Cache coherence protocols: Evaluation using a multiprocessor simulation model,” ACM Trans. Comp. Systems, pp. 273-298, 1986.

41. P. Sweazey and A. J. Smith, A Class of Compatible Cache Consistency Protocols and their Support by the IEEE Futurebus, Proc. Thirteenth International Symposium on Computer Architecture, June 1986.

42. A. Gupta and Wolf-Dietrich Weber, Cache Invalidation Patterns in Shared-Memory Multiprocessors, IEEE Transactions on Computers, July 1992.

43. A. Gupta, W.-D. Weber, and T. Mowry, Reducing Memory and Traffic Requirements for Scalable Directory-Based Cache Coherence Schemes, Proc. 1990 Int. Conf. on Parallel Processing, pp. 1:312-321, 1990.

44. L. Hammond, et al., "Transactional Memory Coherence and Consistency," Proc. International Symposium on Computer Architecture, June 2004.

45. L. Hammond, et al., "Programming with Transactional Coherence and Consistency (TCC)", Proc. ASPLOS, October 2004.

46. R. Rajwar, et al., “Virtualizing Transactional Memory,” Proc. International Symposium on Computer Architecture, June 2005.

47. B. Saha, A.-R. Adl-Tabatabai, Q. Jacobson, “Architectural Support for Software Transactional Memory,” 39th Int. Symp. on Microarchitecture, Dec. 2006, pp. 185-196.

48. M. R. Marty and M. D. Hill, “Coherence Ordering for Ring-based Chip Multiprocessors,” 39th Int. Symp. on Microarchitecture, Dec. 2006, pp. 309-320.

49. Leslie Lamport, “How to Make a Multiprocessor Computer that Correctly Executes Multiprocess Programs,” IEEE Trans. on Computers, September 1979, pp. 690-691.

50. Sarita V. Adve and Kourosh Gharachorloo, Shared Memory Consistency Models: A Tutorial, IEEE Computer, 29(12):66-76, December 1996.

51. B. Jacob and D. Wang, “DRAM: Architectures, Interfaces, and Systems,” DRAM Tutorial, 2002 Int. Symp. on Computer Architecture, June 2002.

52. S. Rixner, “Memory Controller Optimizations for Web Servers,” 37th Int. Symp. on Microarchitecture, Dec. 2004, pp. 355 – 366.

53. S. Rixner et al., “Memory Access Scheduling,” 27th Int. Symp. on Comp. Arch., June 2000, pp. 128 – 138.

54. V. Cuppu, B. Jacob, T. Mudge, “A performance comparison of contemporary DRAM architectures,” 26th Int. Symp. Computer Architecture, May 1999, pp. 222-233.

55. I. Hur and C. Lin, “Adaptive History-Based Memory Schedulers for Modern Processors,” IEEE Micro, Jan.-Feb. 2006, pp. 22-29.

56. C. Natarajan, B. Christenson, and F. Briggs, “A Study of Performance Impact of Memory Controller Features in Multi-Processor Server Environment,” Proc. of the 3rd Workshop on Memory Perf. Issues, June 2004 pp. 80-87.

57. C. P. Thacker, L. C. Stewart, and E. H. Satterthwaite Jr., “Firefly: A Multiprocessor Workstation,” IEEE Transactions on Computers, Aug. 1988, pp. 909-920.

58. R. A. Atkinson and E. M. McCreight, “The Dragon Processor,” Proc. 2nd Int. Conf. on Architectural Support for Programming Languages and Operating Systems, October 1987, pp. 65-69.

59. M. D. Hill, “Multiprocessors Should Support Simple Memory Consistency Models,” IEEE Computer, Aug. 1998, pp. 28-34.

60. K. Gharachorloo et al., Specifying System Requirements for Memory Consistency Models, Computer Sciences Technical Report #1199, University of Wisconsin, Madison, December 1993. Also available as Technical Report #CSL-TR-93-594, Stanford University.

61. Saha, B., A.-R. Adl-Tabatabai, Q. Jacobson, “Architectural Support for Software Transactional Memory,” 39th Int. Symp. on Microarchitecture, Dec. 2006, pp. 185-196.

62. Adl-Tabatabai, A., et al., “Compiler and Runtime Support for Efficient Software Transactional Memory,” PLDI 2006, pp. 26-37.

Client SMPs

63. Balakrishnan, S., et al., “The Impact of Performance Asymmetry in Emerging Multicore Architectures,” 32nd Int’l Symp. on Computer Architecture, pp. 506-517, June 2005.

64. Chang, J, G. Sohi, “Cooperative Caching for Chip Multiprocessors”, ISCA-33, pp. 264-276, June 2006.

65. Cheng, L., et al., “Interconnect-Aware Coherence Protocols for Chip Multiprocessors”, ISCA-33, pp. 339-351, June 2006.

66. Chishti, Z., M. D. Powell, and T. N. Vijaykumar, “Optimizing Replication, Communication, and Capacity Allocation in CMPs,” 32nd Int’l Symp. on Computer Architecture, pp. 357-361, June 2005.

67. Creeger, M., “Multicore CPUs for the Masses”, ACM Queue, pp. 64-65, Sept. 2005.

68. Huh, J., D. Burger, and S. W. Keckler. Exploring the design space of future CMPs. In Proc. of the Int’l Conf. on Parallel Architectures and Compilation Techniques, Sep. 2001.

69. Kim, S., D. Chandra, and Y. Solihin, “Fair Cache Sharing and Partitioning in a Chip Multiprocessor Architecture,” PACT 2004, pp. 111-122.

70. Kumar, R., et al., “Heterogeneous Chip Multiprocessors”, IEEE Computer, pp. 32-38, Nov. 2005.

71. Kumar, R., V. Zyuban, and D. M. Tullsen, “Interconnections in Multicore Architectures: Understanding Mechanisms, Overheads and Scaling,” 32nd Int’l Symp. on Computer Architecture, June 2005.

72. Li, Y., et al., “CMP Design Space Exploration Subject to Physical Constraints”, HPCA-12, pp. 17-28, Feb. 2006.

73. Liu, C., et al., “Organizing the Last Line of Defense Before Hitting the Memory Wall for CMPs”, HPCA-10, pp. 176-185, Feb. 2004.

74. Mendelson, A., et al., “CMP Implementation in Systems Based on the Intel® Core™ Duo Processor”, Intel Technology Journal, pp. 99-107, May 15, 2006.

75. Petoumenos, P. et al., “Modeling Cache Sharing on Chip Multiprocessor Architectures,” Int. Workshop on Performance Characterization, 2006.

76. Speight, E., et al., “Adaptive Mechanisms and Policies for Managing Cache Hierarchies in Chip Multiprocessors,” 32nd Int’l Symp. on Computer Architecture, pp. 346-356, June 2005.

77. Suh, G. E., L. Rudolph, and S. Devadas, “Dynamic Partitioning of Shared Cache Memory,” The Journal of Supercomputing, pp. 7–26, 2004.

Server SMPs

78. K. Nesbit, J. Laudon, and J. E. Smith, “Virtual Private Caches,” 34th Int. Symposium on Computer Architecture; to appear June 2007.

79. M. Marty, and M. Hill, “Coherence Ordering for Ring-based Chip Multiprocessors”, MICRO-39, Dec. 2006.*

80. Beckmann, B., M. Marty, and D. Wood, “ASR: Adaptive Selective Replication for CMP Caches”, MICRO-39, Dec. 2006.*

81. Beckmann, B. and D. Wood, “Managing Wire Delay in Large Chip-Multiprocessor Caches”, MICRO-37, pp. 319 – 330, Dec. 2004.

82. Chandra, D., et al., Predicting Inter-Thread Cache Contention on a Chip Multi-processor Architecture, HPCA-11, pp. 340-351, Feb. 2005.

83. Keltcher, C.N., McGrath, K.J., Ahmed, A., and Conway, P., “The AMD Opteron processor for multiprocessor servers”, IEEE Micro, 2003.*

84. McNairy, C., Bhatia, R., “Montecito: a Dual-Core, Dual-Thread Itanium Processor,” IEEE Micro, 2005.

85. Spracklen, L., Abraham, S.G., “Chip Multithreading: Opportunities and Challenges”, HPCA 2005.

86. J. Tendler et al., IBM Power4 System Architecture, IBM Whitepaper, 2001.

87. P. Kongetira, et al., Niagara: A 32-way Multithreaded SPARC Processor, IEEE Micro, March/April 2005, pages 21-29.

88. J. D. Davis, et al., Maximizing CMT Throughput with Mediocre Cores, In Proceedings of the 14th International Conference on Parallel Architectures and Compilation Techniques, Sep. 2005, pages 51-62.

89. S. Chaudhry et al., “High-Performance Throughput Computing,” IEEE Micro, pp. 32-45, May-June 2005.

90. R. Kota, R. Oehler, “Horus: Large-Scale Symmetric Multiprocessing for Opteron Systems”, IEEE Micro, pp. 30-40, March-April 2005.*

91. L. Barroso et al., Piranha: A Scalable Architecture Based on Single Chip Multiprocessing, Proc. 27th International Symposium on Computer Architecture, June 2000.*

92. F. Briggs, et al., “Intel 870: a building block for cost-effective, scalable servers,” IEEE Micro, March-April 2002, pp. 36-47.*

93. A. Charlesworth, et al., Gigaplane XB -- Extending the Ultra Enterprise Family, Hot Interconnects V, July 1997.

Clusters

94. N. Kronenberg et al., VAXclusters: A Closely-Coupled Distributed System, ACM Transactions on Computer Systems, May 1986, pp. 130-146.

95. Luiz Andre Barroso, Jeffrey Dean, Urs Holzle, Web Search For a Planet: The Google Cluster Architecture, IEEE Micro, 23(2):22-28, March-April 2003.

96. D. Desai, et al., BladeCenter System Overview, IBM Journal of R and D, Nov. 2005, pp. 809-821.

97. G. King, Cluster Architectures and S/390 Parallel Sysplex Scalability, IBM Systems Journal, 1997.

98. R. Martin et al., Effects of Communication Latency, Overhead, and Bandwidth in a Cluster Architecture, 24th Int. Symp. on Computer Architecture, June 1997, pp. 85-97.

99. M. O'Keefe, Shared File Systems and Fibre Channel, 6th NASA Conf. on Mass Storage Technologies, March 1998.

SIMD systems

100. D. A. Padua and M. J. Wolfe, "Advanced Compiler Optimizations for Supercomputers," Communications of the ACM, pp. 1184-1201, December 1986.

101. Cray Research, Inc., Cray-1 Hardware Reference Manual (first three chapters), 1977

102. W. Daniel Hillis and Guy L. Steele, Data Parallel Algorithms, Communications of the ACM, December 1986, pp. 1170-1183.

103. L. W. Tucker and G. G. Robertson, "Architecture and Applications of the Connection Machine," IEEE Computer, pp. 26-38, August 1988.

104. D. J. Kuck and R. A. Stokes, The Burroughs Scientific Processor (BSP), IEEE Trans. on Computers, May 1982, pp. 363-373.

Massively Parallel MPs

105. J. P. Singh, J. L. Hennessy, A. Gupta, "Scaling Parallel Programs for Multiprocessors: Methodology and Examples," IEEE Computer, pp. 42-50, July 1993.

106. L. M. Ni and P. K. McKinley, "A Survey of Wormhole Routing Techniques in Direct Networks," IEEE Computer, pp. 62-76, Feb. 1993.

107. W. Daniel Hillis, Lewis W. Tucker, "The CM-5 Connection Machine: A Scalable Supercomputer," CACM, pp. 30-40, Nov. 1993.

108. W. J. Dally, et al., "The Message-Driven Processor: A Multicomputer Processing Node with Efficient Mechanisms," IEEE Micro, April 1992, pp. 23-37.

109. Cray Research, Inc., CRAY T3D System Architecture Overview, Feb. 25, 1994.

110. Steven L. Scott, Synchronization and Communication in the T3E Multiprocessor, Proceedings of International Conference on Architectural Support for Programming Languages and Operating Systems, pages 26-36, October 1996.

111. S. L. Scott and G. M. Thorson, "The T3E Network: Adaptive Routing in a High Performance 3D Torus," HOT Interconnects IV, Aug. 1996.

112. J. Kim et al., Microarchitecture of a High Radix Router, Int. Symp. Comp. Arch., pp. 423-431, 2005.

Dataflow

113. B. Lee, A. R. Hurson, "Dataflow Architectures and Multithreading", IEEE Computer pp. 27-39, Aug. 1994.

114. S. Swanson, et al., "WaveScalar," 36th Annual International Symposium on Microarchitecture (MICRO-36), December 2003.

Special Purpose Systems

115. J. A. Kahle, et al., "Introduction to the Cell Multiprocessor", IBM Journal of Research and Development, pp. 589-604, July/Sept. 2005.

116. M. Day and P. Hofstee, "Hardware and Software Architectures for the Cell Broadband Engine Processor," Codes+ISSS Conference, Sept. 2005.

Project

Project Assignment

Miscellaneous Links

· Previous Exam 2. Note that this was a closed-book exam that covered different material.