
CMSC 411 news from various sources

Yes, some may be old news.

Yet, old news may be good news.

IBM Roadrunner petaflop computer


Super computer sets record, By John Markoff, Published: June 9, 2008



SAN FRANCISCO: An American military super computer, assembled from
components originally designed for video game machines, has reached a
long-sought-after computing milestone by processing more than 1.026
quadrillion calculations per second.

The new machine is more than twice as fast as the previous fastest
super computer, the IBM BlueGene/L, which is based at Lawrence
Livermore National Laboratory in California.

The new $133 million super computer, called Roadrunner in a reference
to the state bird of New Mexico, was devised and built by engineers
and scientists at IBM and Los Alamos National Laboratory, based in
Los Alamos, New Mexico. It will be used principally to solve
classified military problems to ensure that the nation's stockpile of
nuclear weapons will continue to work correctly as they age. The
Roadrunner will simulate the behavior of the weapons in the first
fraction of a second during an explosion.

Before it is placed in a classified environment, it will also be used
to explore scientific problems like climate change. The greater speed
of the Roadrunner will make it possible for scientists to test global
climate models with higher accuracy.

To put the performance of the machine in perspective, Thomas
D'Agostino, the administrator of the National Nuclear Security
Administration, said that if all six billion people on earth used
hand calculators and performed calculations 24 hours a day and seven
days a week, it would take them 46 years to do what the Roadrunner
can in one day.
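D'Agostino's comparison is easy to sanity-check. The sketch below works
backward from the article's figures to the calculator rate each person
would have to sustain; only the 1.026 petaflop and six billion figures
come from the article, the per-person rate is derived rather than stated.

  /* Rough check of the "46 years of hand calculators" comparison. */
  #include <stdio.h>

  int main(void)
  {
      double roadrunner = 1.026e15;               /* calculations/second */
      double one_day    = roadrunner * 86400.0;   /* Roadrunner, one day */

      double people         = 6.0e9;
      double seconds_46yr   = 46.0 * 365.25 * 86400.0;
      double person_seconds = people * seconds_46yr;

      printf("Roadrunner in one day: %.3e calculations\n", one_day);
      printf("Implied rate per person: %.1f calculations/second\n",
             one_day / person_seconds);           /* about 10 */
      return 0;
  }

The implied rate, roughly ten calculations per second per person, is
optimistic for anyone with a hand calculator, so if anything the analogy
understates the machine.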

The machine is an unusual blend of chips used in consumer products
and advanced parallel computing technologies. The lessons that
computer scientists learn by making it calculate even faster are seen
as essential to the future of both personal and mobile consumer
computing.

The high-performance computing goal, known as a petaflop (one
thousand trillion calculations per second), has long been viewed as
a crucial milestone by military, technical and scientific
organizations in the United States, as well as a growing group
including Japan, China and the European Union. All view super
computing technology as a symbol of national economic
competitiveness.

By running programs that find a solution in hours or even less time,
compared with as long as three months on older generations of
computers, petaflop machines like Roadrunner have the potential to
fundamentally alter science and engineering, super computer experts
say. Researchers can ask questions and receive answers virtually
interactively and can perform experiments that would previously have
been impractical.

"This is equivalent to the four-minute mile of super computing," said
Jack Dongarra, a computer scientist at the University of Tennessee
who for several decades has tracked the performance of the fastest
computers.

Each new super computing generation has brought scientists a step
closer to faithfully simulating physical reality. It has also
produced software and hardware technologies that have rapidly spilled
out into the rest of the computer industry for consumer and business
products.

Technology is flowing in the opposite direction as well.
Consumer-oriented computing began dominating research and development
spending on technology shortly after the cold war ended in the late
1980s, and that trend is evident in the design of the world's fastest
computers.

The Roadrunner is based on a radical design that includes 12,960
chips that are an improved version of an IBM Cell microprocessor, a
parallel processing chip originally created for Sony's PlayStation 3
video-game machine. The Sony chips are used as accelerators, or
turbochargers, for portions of calculations.

The Roadrunner also includes a smaller number of more conventional
Opteron processors, made by Advanced Micro Devices, which are already
widely used in corporate servers. In addition, the Roadrunner will
operate exclusively on the Fedora Linux operating system from Red Hat.

"Roadrunner tells us about what will happen in the next decade," said
Horst Simon, associate laboratory director for computer science at
the Lawrence Berkeley National Laboratory. "Technology is coming from
the consumer electronics market and the innovation is happening first
in terms of cell phones and embedded electronics."

The innovations flowing from this generation of high-speed computers
will most likely result from the way computer scientists manage the
complexity of the system's hardware.

Roadrunner, which consumes roughly three megawatts of power, or about
the power required by a large suburban shopping center, requires
three separate programming tools because it has three types of
processors. Programmers have to figure out how to keep all of the
116,640 processor cores in the machine occupied simultaneously in
order for it to run effectively.
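That core count squares with the chip count given earlier if the 116,640
figure counts only the Cell cores: each improved Cell pairs one
general-purpose core with eight accelerator cores, and the Opteron cores
come on top of that. A quick cross-check (an editorial assumption, not
arithmetic from the article):

  /* Cross-check: 12,960 Cell chips x 9 cores each (1 PPE + 8 SPEs).
     Which cores the 116,640 figure counts is assumed here. */
  #include <stdio.h>

  int main(void)
  {
      int cell_chips     = 12960;
      int cores_per_cell = 9;      /* 1 PPE + 8 SPE accelerator cores */

      printf("%d\n", cell_chips * cores_per_cell);   /* prints 116640 */
      return 0;
  }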

"We've proved some skeptics wrong," said Michael Anastasio, a
physicist who is director of the Los Alamos National Laboratory.
"This gives us a window into a whole new way of computing. We can
look at phenomena we have never seen before."

Solving that programming problem is important because in just a few
years personal computers will have microprocessor chips with dozens
or even hundreds of processor cores. The industry is now hunting for
new techniques for making use of the new computing power. Some
experts, however, are skeptical that the most powerful super
computers will provide useful examples.

"If Chevy wins the Daytona 500, they try to convince you the Chevy
Malibu you're driving will benefit from this," said Steve Wallach, a
super computer designer who is chief scientist of Convey Computer, a
start-up firm based in Richardson, Texas.

Those who work with weapons might not have much to offer the video
gamers of the world, he suggested.

Many executives and scientists see Roadrunner as an example of the
resurgence of the United States in super computing.

Although American companies had dominated the field since its
inception in the 1960s, in 2002 the Japanese Earth Simulator briefly
claimed the title of the world's fastest by executing more than 35
trillion mathematical calculations per second. Two years later, a
super computer created by IBM reclaimed the speed record for the
United States. The Japanese challenge, however, led Congress and the
Bush administration to reinvest in high-performance computing.

"It's a sign that we are maintaining our position," said Peter
Ungaro, chief executive of Cray, a maker of super computers. He
noted, however, that "the real competitiveness is based on the
discoveries that are based on the machines."

Having surpassed the petaflop barrier, IBM is already looking toward
the next generation of super computing. "You do these record-setting
things because you know that in the end we will push on to the next
generation and the one who is there first will be the leader," said
Nicholas Donofrio, an IBM executive vice president.

By breaking the petaflop barrier sooner than had been generally
expected, the United States' super computer industry has been able to
sustain a pace of continuous performance increases, improving a
thousandfold in processing power in 11 years. The next thousandfold
goal is the exaflop, which is a quintillion calculations per second,
followed by the zettaflop, the yottaflop and the xeraflop.

Previously, Aiming for a Petaflop
eWeek (05/19/08) Vol. 25, No. 16, P. 20; Ferguson, Scott

IBM's $100 million Roadrunner supercomputer will likely offer a
sustained performance of 1 petaflop through a combination of new and
commodity technology, says IBM engineer Donald Grice. Roadrunner will
employ Cell processors developed by IBM, Sony, and Toshiba as
accelerators for the parts of applications that carry a heavy
computing load, while standard x86 Opteron processors from Advanced
Micro Devices will handle the standard computing. In addition, the
Roadrunner will operate exclusively on the Fedora Linux operating
system from Red Hat. This hybrid architecture lets IBM build a system
that consumes less power while boosting its ability to scale to a
petaflop and higher. Illuminata analyst Gordon Haff says the project
reflects a trend within high performance computing to migrate to a
heterogeneous framework that combines chips to accelerate different
tasks. "The really hard part here is the software, because we don't
have very good programming models to handle heterogeneous processing,
but the supercomputers, high-performance computing folks,
particularly at the extreme high end, are much more tolerant of
programming difficulty given the type of performance they need," he
says. Roadrunner will be installed at the U.S. Department of Energy's
Los Alamos National Laboratory.


Petascale Computers: The Next Supercomputing Wave
IT News Australia (11/29/07) Tay, Liz

Academics are focusing their attention on petascale computers that
can perform 1 quadrillion, or 1 million billion, operations per
second, almost 10 times faster than today's fastest supercomputers.
Petascale computing is expected to create solutions to global
challenges such as environmental sustainability, disease prevention,
and disaster recovery. "Petascale Computing: Algorithms and
Applications," by Georgia Tech computing professor David A. Bader,
was recently released, becoming the first published collection on
petascale techniques for computational science and engineering. Bader
says the past 50 years has seen a fundamental change in the
scientific method, with computation joining theory and
experimentation as a means for scientific discovery. "Computational
science enables us to investigate phenomena where economics or
constraints preclude experimentation, evaluate complex models and
manage massive data volumes, model processes across interdisciplinary
boundaries, and transform business and engineering practices," Bader
says. However, petascale computing will also create new challenges in
designing algorithms and applications. "Several areas are important
for this task: scalable algorithm design for massive concurrency,
computational science and engineering applications, petascale tools,
programming methodologies, performance analyses, and scientific
visualization," Bader says. He expects to see the first peak
petascale systems in 2008, with sustained petascale systems following
shortly behind.

Replacing leader Blue Gene

Mind-Blowing Blue Gene/P Revs Up at Argonne
Pioneer Press (04/23/08) Jaworski, Jim

About 100 local legislators, Argonne scientists, and federal
representatives gathered in a warehouse at Argonne National
Laboratory for the dedication of the Argonne Leadership Computing
Facility. At the dedication ceremony, the crowd was able to watch a
simulation of a supernova, a complex simulation made possible by the
facility's Blue Gene/P, one of the fastest supercomputers in the
world. The Blue Gene/P, built by IBM and Argonne, runs at 445
teraflops, and is housed in the same location as its predecessor, the
Blue Gene/L, which runs at 5.7 teraflops. The new supercomputer was
built in part as a response to Japan's Earth Simulator, which was
seven times faster than any other computer in the world when it was
unveiled in 2002. "It's always helpful to have a competitor," says
U.S. Rep. Judy Biggert (R-Ill.). "Just look at what Sputnik did for
space." The Argonne Leadership Computing Facility will house 20
scientific projects and award 111 million computing hours to research
projects from around the world. The simulation of the supernova was
one of the computer's first challenges. Other research efforts at
ALCF will include projects such as testing new airline components to
reduce the need for physical tests, and medical research that
provides scientists a more detailed understanding of how diseases and
illnesses develop in the body.


NASA 245 million pixel display


NASA Builds World's Largest Display
Government Computer News (03/27/08) Jackson, Joab

NASA's Ames Research Center is expanding the first Hyperwall, the
world's largest high-resolution display, to a display made of 128 LCD
monitors arranged in an 8-by-16 matrix, which will be capable of
generating 245 million pixels. Hyperwall-II will be the largest
display for unclassified material. Ames will use Hyperwall-II to
visualize enormous amounts of data generated from satellites and
simulations from Columbia, its 10,240-processor supercomputer. "It
can look at it while you are doing your calculations," says Rupak
Biswas, chief of advanced supercomputing at Ames, speaking at the
High Performance Computer and Communications Conference. One gigantic
image can be displayed on Hyperwall-II, or more than one on multiple
screens. The display will be powered by a 128-node computational
cluster that is capable of 74 trillion floating-point operations per
second. Hyperwall-II will also make use of 1,024 Opteron processors
from Advanced Micro Devices, and have 128 graphical display units and
450 terabytes of storage.
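The 245 million pixel figure is consistent with the 8-by-16 matrix if
each panel is a standard 1600 x 1200 LCD; the per-panel resolution is an
assumption here, since the article gives only the totals.

  /* Back-of-the-envelope check of the 245-million-pixel figure.
     The 1600x1200 per-panel resolution is assumed, not stated above. */
  #include <stdio.h>

  int main(void)
  {
      long long panels    = 8LL * 16LL;        /* 8-by-16 matrix = 128   */
      long long per_panel = 1600LL * 1200LL;   /* assumed LCD resolution */

      printf("%lld pixels\n", panels * per_panel);   /* 245,760,000 */
      return 0;
  }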


Intel integrated memory controller




Yet to be seen: Will Intel shed its other dinosaur,
the North Bridge and South Bridge concept, in order to
achieve integrated I/O?




Silicon Nanophotonic Waveguide


How long will it be before your computer is really and truly outdated?
Who will be the first to have their own supercomputer?
 
"IBM researchers reached a significant milestone in the quest to send
information between the "brains" on a chip using pulses of light through
silicon instead of electrical signals on copper wires.
The breakthrough -- a significant advancement in the field of
"Silicon Nanophotonics" -- uses pulses of light rather than electrical
wires to transmit information between different processors on a single chip,
significantly reducing cost, energy and heat while increasing communications
bandwidth between the cores more than a hundred times over wired chips.
The new technology aims to enable a power-efficient method to connect
hundreds or thousands of cores together on a tiny chip by eliminating
the wires required to connect them. Using light instead of wires to send
information between the cores can be as much as 100 times faster and use
10 times less power than wires, potentially allowing hundreds of cores
to be connected together on a single chip, transforming today's large super
computers into tomorrow's tiny chips while consuming significantly less power.
 
 
IBM's optical modulator performs the function of converting a digital
electrical signal carried on a wire, into a series of light pulses,
carried on a silicon nanophotonic waveguide. First, an input laser beam
(marked by red color) is delivered to the optical modulator.
The optical modulator (black box with IBM logo) is basically a very fast
"shutter" which controls whether the input laser is blocked or transmitted
to the output waveguide. When a digital electrical pulse
(a "1" bit marked by yellow) arrives from the left at the modulator,
a short pulse of light is allowed to pass through at the optical output
on the right. When there is no electrical pulse at the modulator (a "0" bit),
the modulator blocks light from passing through at the optical output.
In this way, the device "modulates" the intensity of the input laser beam,
and the modulator converts a stream of digital bits ("1"s and "0"s)
from electrical signals into light pulses. December 05, 2007"

http://www.flixxy.com/optical-computing.htm
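The modulator described above is doing on-off keying: each electrical
bit gates the laser, so a "1" comes out as a light pulse and a "0" comes
out as darkness. A toy model of that mapping (purely illustrative, and
nothing to do with IBM's actual device physics):

  /* Toy model of the modulator's on-off keying: a '1' bit lets the input
     laser through as a pulse, a '0' bit blocks it. Illustrative only. */
  #include <stdio.h>
  #include <string.h>

  int main(void)
  {
      const char *bits  = "101100";   /* electrical input stream       */
      double      laser = 1.0;        /* normalized input laser power  */

      for (size_t i = 0; i < strlen(bits); i++) {
          double out = (bits[i] == '1') ? laser : 0.0;
          printf("bit %c -> optical output %.1f\n", bits[i], out);
      }
      return 0;
  }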


Parallel programming: a necessity

With us again today is James Reinders, a senior engineer at Intel.

DDJ: James, how is programming for a few cores different from
programming for a few hundred cores?

JR: As we go from a few cores to hundreds, two things happen:

1. Scaling is everything, and single-core performance is truly
uninteresting in comparison; and

2. Shared memory becomes tougher and tougher to count on, or
disappears altogether.

For programmers, the shift to "Think Parallel" is not complete until
we truly focus on scaling in our designs instead of performance on a
single core. A program which scales poorly, perhaps because it divides
work up crudely, can hobble along for a few cores. However, running a
program on hundreds of cores will reveal the difference between
hobbling and running. Henry Ford learned a lot about automobile design
while doing race cars before he settled on making cars for the masses.
Automobiles which ran under optimal conditions at slower speeds did
not truly shake out a design the way less optimal, high-speed racing
conditions did. Likewise, a programmer will find
designing programs for hundreds of cores to be a challenge.
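Reinders does not name it here, but Amdahl's law makes the
hobbling-versus-running point concrete: a serial fraction that is barely
visible at four cores dominates at hundreds. A minimal sketch, with an
arbitrary 5% serial fraction:

  /* Amdahl's law: speedup = 1 / (s + (1 - s)/p) for serial fraction s on
     p cores. Shows why poor scaling hides at low core counts. */
  #include <stdio.h>

  static double amdahl(double s, int p)
  {
      return 1.0 / (s + (1.0 - s) / p);
  }

  int main(void)
  {
      double s   = 0.05;                       /* assumed 5% serial work */
      int    p[] = { 4, 16, 64, 256, 1024 };

      for (int i = 0; i < 5; i++)
          printf("%4d cores: %5.1fx speedup (ideal %dx)\n",
                 p[i], amdahl(s, p[i]), p[i]);
      return 0;
  }

At 4 cores the program reaches about 3.5x of an ideal 4x, close enough to
look healthy; at 1024 cores it tops out near 20x.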

I think we already know more than we think. It is obvious to think of
supercomputer programming, usually scientific in nature, as having
figured out how their programs can run in parallel. But, let me
suggest that Web 2.0 is highly parallel -- and is a model which helps
with the second issue in moving to hundreds of cores.

Going from a few cores to many cores means several changes in the
hardware which impact software a great deal. The biggest change is in
memory, because with a few cores you can assume every core has equal
access to memory. It turns out having equal access to memory
simplifies many things for programmers; many ugly things do not need
to be worried about. The first step away from complete bliss is when,
instead of equal access (UMA), you move to access that is unequal but
still globally available (NUMA). In really large computers, memory is
usually broken up (distributed) and is simply not globally available
to all processors. This is why, on distributed memory machines,
programming is usually done with messages instead of shared memory.
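On a distributed-memory machine of the kind described here, "messages"
usually means MPI. A minimal sketch, not tied to any particular machine,
of one process sending a value to another because the two share no memory:

  /* Minimal MPI message passing: rank 0 sends a value to rank 1, since
     the two ranks share no memory. Build with mpicc, run with 2 ranks. */
  #include <mpi.h>
  #include <stdio.h>

  int main(int argc, char **argv)
  {
      int rank, value;

      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);

      if (rank == 0) {
          value = 42;
          MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
      } else if (rank == 1) {
          MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                   MPI_STATUS_IGNORE);
          printf("rank 1 received %d by message, not shared memory\n",
                 value);
      }

      MPI_Finalize();
      return 0;
  }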

Programs can easily move from UMA to NUMA, the only real issue is
performance -- and there will be countless tricks in very complex
hardware to help mask the need for tuning. There will, nevertheless,
be plenty of opportunity for programmers to tune for NUMA the same way
we tune for caches today. The gigantic leap, it would seem, is to
distributed memory. I have many thoughts on how that will happen, but
that is a long ways off -- sort of. We see it already in web computing
-- Web 2.0, if you will, is a distributed programming model without
shared memory -- all using messages (HTML, XML, etc.)

So maybe message passing of the supercomputer world has already met
its replacement for the masses: Web 2.0 protocols.

DDJ: Are compilers ready to take advantage of these multi-core CPUs?

JR: Compilers are great at exploiting parallelism and terrible at
discovering it. When people ask the question you did of me, I find
they are usually wondering about automatic compilers which take my
program of today, and magically find parallelism and produce great
multi-core binaries. That is simply not going to happen. Every decent
compiler will have some capability to discover parallelism
automatically, but it will simply not be enough.

The best explanation I can give is this: it is an issue of algorithm
redesign. We don't expect a compiler to read in a bubble sort function
and compile it into a quick sort function. That would be roughly the
same as reading most serial programs and compiling into a parallel
program.

The key is to find the right balance of how to have the programmer
express the right amount of the algorithm and the parallelism, so the
compiler can take it the rest of the way. Compilers have done a great
job exploiting SIMD parallelism for programs written using vectors or
other syntaxes designed to make the parallelism accessible enough that
the compiler does not have much difficulty discovering it. In such
cases, compilers do a great job exploiting MMX, SSE, SSE2, etc.
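The kind of code a compiler can handle looks like the loop below: no
aliasing, no loop-carried dependence, so the iterations are visibly
independent and the compiler can map them onto SSE/SSE2 vector
instructions on its own. A minimal sketch:

  /* A loop simple enough to auto-vectorize: restrict rules out aliasing
     and the iterations are independent, so an optimizing compiler
     (e.g., gcc -O3) can emit SSE code for it. */
  void saxpy(int n, float a, const float *restrict x, float *restrict y)
  {
      for (int i = 0; i < n; i++)
          y[i] = a * x[i] + y[i];     /* each iteration independent */
  }

The same computation written over a linked list, or with pointers the
compiler cannot prove distinct, defeats the auto-vectorizer even though
the arithmetic is identical.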

The race is on to find the right balance of programming practices and
compiler technology. While the current languages are not quite enough,
we've seen small additions like OpenMP yield big results for a class of
applications. I think most programming will evolve to use small
changes which open up the compiler to seeing the parallelism. Some
people advocate whole new programming languages, which allow much more
parallelism to be expressed explicitly. This is swinging the pendulum
too far for most programmers, and I have my doubts that any one
solution is general purpose enough for widespread usage.
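OpenMP is a good example of such a small addition: one pragma tells the
compiler where the parallelism is, and the rest of the program is left
untouched. A minimal sketch:

  /* One added pragma divides the loop across cores; the reduction clause
     combines the per-core partial sums. Build with -fopenmp (gcc/clang). */
  #include <stdio.h>

  int main(void)
  {
      const int n   = 1000000;
      double    sum = 0.0;

      #pragma omp parallel for reduction(+:sum)
      for (int i = 0; i < n; i++)
          sum += 1.0 / (i + 1.0);     /* arbitrary work per iteration */

      printf("sum = %f\n", sum);
      return 0;
  }

Compiled without -fopenmp, the pragma is simply ignored and the same
source runs serially, which is part of what makes such small additions
attractive.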

DDJ: Earlier in our conversation, I gave you an I.O.U. for the beer
you asked for. Will you share with readers what Prof. Norman R. Scott
told you and why you blame it for making you so confident in the
future of computing?

JR: Okay, but since I've been living in the Pacific Northwest some
time you need to know that I'm not likely to drink just any beer.

In 1987, my favorite college professor was retiring from teaching at
the University of Michigan. He told us that when he started in
electronics that he would build circuits with vacuum tubes. He would
carefully tune the circuit for that vacuum tube. He thought it was
wonderful. But, if the vacuum tube blew he could get a new one but
would have to retune the circuit to the new vacuum tube because they
were never quite the same. Now this amazed us, because most of us
helped our dads buy "standard" replacement vacuum tubes at the corner
drug store for our televisions when we were kids. So the idea of
vacuum tubes not being standard and interchangeable seemed super old
to us, because even standard vacuum tubes were becoming rare specialty
items at Radio Shack (but also perfected to have lifetime guarantees).

Next, Prof. Scott noted that Intel had announced a million transistor
processor recently. He liked to call that VLSII (Very Large Scale
Integration Indeed!).

Now for the punchline: He said his career spanned from inconsistent
vacuum tubes to a million transistors integrated on a die the size of a
fingertip. He asked if we thought technology (or our industry) was
moving FASTER or SLOWER than during his career? We all said "faster!"
So he asked: "Where will the industry be when your careers end since
we will start with a million transistors on a chip?"

I've always thought that was the scariest thing I ever heard. It
reminds me still to work to keep up -- lest I be left behind (or run
over).

So, when people tell me that the challenge before us is huge and never
before seen -- and therefore insurmountable -- I'm not likely to be
convinced.

You can blame Prof. Scott for my confidence that we'll figure it out.
I don't think a million vacuum-tube equivalents on a fingertip
seemed like anything other than fantasy to him when he started his
career -- and now we have a thousand times that. So I'm not impressed
when people say we cannot figure out how to use a few hundred cores. I
don't think this way because I work at Intel, I think this way in no
small part because of Prof. Scott. Now, I might work at Intel because
I think this way. And I'm okay with that. But let's give credit to
Prof. Scott, not Intel for why I think what I do.

DDJ: James, thanks for taking time over the past few weeks for this
most interesting conversation.

JR: You're welcome.

Articles are edited to fit the purpose of this page.
All copyrights belong to the original source.

Other links

Last updated 8/23/08