
CMSC 411 news from various sources

Yes, some may be old news.

Yet, old news may be good news.

AMD 12-core and Intel 8-core multicore processors


AMD Unveils 12-Core Opterons with HP, Dell, Acer
Chip maker AMD officially launches its new 12-core Opteron processor
with help from Hewlett-Packard, Dell and even Acer, which is now
looking to expand into the server market. AMD's new Opteron chip comes
along as Intel is expected to release its "Nehalem EX" Xeon chip
later this week. Both processors target the data center.

These are typically multiple chips in one package, with special signals
that allow them to act as a single multicore chip.
Possibly better memory bandwidth and cache coherence?
3/30/2010

web page

Intel Core i7-980X six-core 3.3GHz with overclocking

single chip:

hardwaresecrets.com

Facebook usage and storage data


facebook.txt

IBM Roadrunner petaflop computer


Supercomputer sets record, by John Markoff, published June 9, 2008
(In 2009 Jaguar ran at 1.7 petaflops, 1.7*10^15 floating-point operations per second)



SAN FRANCISCO: An American military supercomputer, assembled from
components originally designed for video game machines, has reached a
long-sought-after computing milestone by processing more than 1.026
quadrillion calculations per second.

The new machine is more than twice as fast as the previous fastest
supercomputer, the IBM BlueGene/L, which is based at Lawrence
Livermore National Laboratory in California.

The new $133 million supercomputer, called Roadrunner in a reference
to the state bird of New Mexico, was devised and built by engineers
and scientists at IBM and Los Alamos National Laboratory, based in
Los Alamos, New Mexico. It will be used principally to solve
classified military problems to ensure that the nation's stockpile of
nuclear weapons will continue to work correctly as they age. The
Roadrunner will simulate the behavior of the weapons in the first
fraction of a second during an explosion.

Before it is placed in a classified environment, it will also be used
to explore scientific problems like climate change. The greater speed
of the Roadrunner will make it possible for scientists to test global
climate models with higher accuracy.

To put the performance of the machine in perspective, Thomas
D'Agostino, the administrator of the National Nuclear Security
Administration, said that if all six billion people on earth used
hand calculators and performed calculations 24 hours a day and seven
days a week, it would take them 46 years to do what the Roadrunner
can in one day.

The machine is an unusual blend of chips used in consumer products
and advanced parallel computing technologies. The lessons that
computer scientists learn by making it calculate even faster are seen
as essential to the future of both personal and mobile consumer
computing.

The high-performance computing goal, known as a petaflop, one
thousand trillion calculations per second, has long been viewed as
a crucial milestone by military, technical and scientific
organizations in the United States, as well as a growing group
including Japan, China and the European Union. All view
supercomputing technology as a symbol of national economic
competitiveness.

The Roadrunner is based on a radical design that includes 12,960
chips that are an improved version of an IBM Cell microprocessor, a
parallel processing chip originally created for Sony's PlayStation 3
video-game machine. The Sony chips are used as accelerators, or
turbochargers, for portions of calculations.

The Roadrunner also includes a smaller number of more conventional
Opteron processors, made by Advanced Micro Devices, which are already
widely used in corporate servers. In addition, the Roadrunner will
operate exclusively on the Fedora Linux operating system from Red Hat.


NASA 245 million pixel display


From technews@hq.acm.org Mon Mar 31 16:06:01 2008

NASA Builds World's Largest Display
Government Computer News (03/27/08) Jackson, Joab

NASA's Ames Research Center is expanding the first Hyperwall, the
world's largest high-resolution display, to a display made of 128 LCD
monitors arranged in an 8-by-16 matrix, which will be capable of
generating 245 million pixels. Hyperwall-II will be the largest
display for unclassified material. Ames will use Hyperwall-II to
visualize enormous amounts of data generated from satellites and
simulations from Columbia, its 10,240-processor supercomputer. "You
can look at it while you are doing your calculations," says Rupak
Biswas, chief of advanced supercomputing at Ames, speaking at the
High Performance Computer and Communications Conference. One gigantic
image can be displayed on Hyperwall-II, or more than one on multiple
screens. The display will be powered by a 128-node computational
cluster that is capable of 74 trillion floating-point operations per
second. Hyperwall-II will also make use of 1,024 Opteron processors
from Advanced Micro Devices, and have 128 graphical display units and
450 terabytes of storage.


Intel integrated memory controller




Yet to be seen: Will Intel shed its other dinosaur,
the North Bridge and South Bridge concept, in order to
achieve integrated IO?




Silicon Nanophotonic Waveguide


How long will it be before your computer is really and truly outdated??
Who will be the first to have their own supercomputer?
 
"IBM researchers reached a significant milestone in the quest to send
information between the "brains" on a chip using pulses of light through
silicon instead of electrical signals on copper wires.
The breakthrough -- a significant advancement in the field of
"Silicon Nanophotonics" -- uses pulses of light rather than electrical
wires to transmit information between different processors on a single chip,
significantly reducing cost, energy and heat while increasing communications
bandwidth between the cores more than a hundred times over wired chips.
The new technology aims to enable a power-efficient method to connect
hundreds or thousands of cores together on a tiny chip by eliminating
the wires required to connect them. Using light instead of wires to send
information between the cores can be as much as 100 times faster and use
10 times less power than wires, potentially allowing hundreds of cores
to be connected together on a single chip, transforming today's large
supercomputers into tomorrow's tiny chips while consuming significantly less power.
  
IBM's optical modulator performs the function of converting a digital
electrical signal carried on a wire, into a series of light pulses,
carried on a silicon nanophotonic waveguide. First, an input laser beam
(marked by red color) is delivered to the optical modulator.
The optical modulator (black box with IBM logo) is basically a very fast
"shutter" which controls whether the input laser is blocked or transmitted
to the output waveguide. When a digital electrical pulse
(a "1" bit marked by yellow) arrives from the left at the modulator,
a short pulse of light is allowed to pass through at the optical output
on the right. When there is no electrical pulse at the modulator (a "0" bit),
the modulator blocks light from passing through at the optical output.
In this way, the device "modulates" the intensity of the input laser beam,
and the modulator converts a stream of digital bits ("1"s and "0"s)
from electrical signals into light pulses. December 05, 2007"

http://www.flixxy.com/optical-computing.htm
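
The modulator logic described above is simply on-off keying. As a toy
sketch (mine, not IBM's, and only the logic, not the device physics):
a "1" bit opens the shutter and lets the laser pass, a "0" bit blocks it.

    /* Toy model of the optical modulator's logic: on-off keying.
     * A '1' bit lets the input laser through at full intensity,
     * a '0' bit blocks it.  Illustration only. */
    #include <stdio.h>

    int main(void)
    {
        const char *bits = "101100";   /* example electrical bit stream */
        const double laser_in = 1.0;   /* normalized input laser intensity */

        for (const char *p = bits; *p; p++) {
            double light_out = (*p == '1') ? laser_in : 0.0;  /* shutter open or closed */
            printf("bit %c -> optical output %.1f\n", *p, light_out);
        }
        return 0;
    }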


Seagate crams 329 gigabits of data per square inch


http://ct.zdnet.com/clicks?t=73361625-e808f46de0195a86f73d2cce955257f9-bf&brand=ZDNET&s=5

Seagate has announced that it is shipping the densest 3.5 inch desktop
hard drive available - cramming an incredible 329 gigabits per square inch.

The new drive, the Barracuda 7200.12, offers 1TB of storage on two
platters, and the high density is achieved by using Perpendicular
Magnetic Recording technology. Seagate hopes to add more platters 
later this year in order to boost capacity even further.

The Barracuda 7200.12 is a 7,200RPM drive that has a 3Gbps serial ATA
(SATA) interface that offers a sustained transfer rate of up to 160MB/s
and a burst speed of 3Gbps.

Prior to the Barracuda 7200.12, the Seagate drive with the greatest
density was the Barracuda 7200.11, which offered 1.5TB of storage across
four platters.
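
A quick back-of-the-envelope check on those numbers (my arithmetic,
using only the figures quoted above):

    /* Arithmetic sketch for the figures quoted above:
     * 1 TB on two platters, 160 MB/s sustained transfer rate. */
    #include <stdio.h>

    int main(void)
    {
        const double drive_bytes   = 1e12;   /* 1 TB, decimal, as drives are rated */
        const int    platters      = 2;
        const double sustained_bps = 160e6;  /* 160 MB/s sustained transfer rate */

        printf("capacity per platter: %.0f GB\n", drive_bytes / platters / 1e9);
        printf("time to stream the whole drive: %.1f hours\n",
               drive_bytes / sustained_bps / 3600.0);
        return 0;
    }

At the quoted sustained rate, reading the entire 1TB drive takes
roughly 1.7 hours.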

Too many cores, beyond a certain number, can be worse


More Chip Cores Can Mean Slower Supercomputing, Sandia Simulation Shows.
Sandia National Laboratories (01/13/09) Singer, Neal

Simulations at Sandia National Laboratories have shown that increasing
the number of processor cores on individual chips may actually worsen
the performance of many complex applications. The Sandia researchers
simulated key algorithms for deriving knowledge from large data sets,
which revealed a significant increase in speed when switching from
two to four multicores, an insignificant increase from four to eight
multicores, and a decrease in speed when using more than eight
multicores. The researchers found that 16 multicores were barely able
to perform as well as two multicores, and using more than 16
multicores caused a sharp decline as additional cores were added. The
drop in performance is caused by a lack of memory bandwidth and a
contention between processors over the memory bus available to each
processor. The lack of immediate access to individualized memory
caches slows the process down once the number of cores exceeds eight,
according to the simulation of high-performance computing by Sandia
researchers Richard Murphy, Arun Rodrigues, and Megan Vance. "The
bottleneck now is getting the data off the chip to or from memory or
the network," Rodrigues says. The challenge of boosting chip
performance while limiting power consumption and excessive heat
continues to vex researchers. Sandia and Oak Ridge National
Laboratory researchers are attempting to solve the problem using
message-passing programs. Their joint effort, the Institute for
Advanced Architectures, is working toward exaflop computing and may
help solve the multichip problem.
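
The bandwidth wall is easy to see on an ordinary multicore PC. Below
is a minimal sketch (mine, not Sandia's; assumes gcc with OpenMP, and
the array size and thread counts are arbitrary) of a memory-bound
triad loop: past a few threads the memory bus saturates and adding
threads stops helping, which is the same bottleneck Rodrigues describes.

    /* Memory-bandwidth scaling sketch: a STREAM-style triad that is
     * limited by memory traffic, not arithmetic.
     * Compile with: gcc -O2 -fopenmp triad.c -o triad                */
    #include <stdio.h>
    #include <stdlib.h>
    #include <omp.h>

    #define N (32 * 1024 * 1024)   /* 32M doubles, ~256 MB per array: far larger than any cache */

    int main(void)
    {
        double *a = malloc(N * sizeof *a);
        double *b = malloc(N * sizeof *b);
        double *c = malloc(N * sizeof *c);
        if (!a || !b || !c) { fprintf(stderr, "out of memory\n"); return 1; }

        for (long i = 0; i < N; i++) { b[i] = 1.0; c[i] = 2.0; }

        for (int threads = 1; threads <= 16; threads *= 2) {
            omp_set_num_threads(threads);
            double t0 = omp_get_wtime();
            #pragma omp parallel for
            for (long i = 0; i < N; i++)
                a[i] = b[i] + 3.0 * c[i];      /* 2 loads + 1 store per element */
            double t1 = omp_get_wtime();
            /* Effective bandwidth: 3 doubles (24 bytes) moved per element. */
            printf("%2d threads: %.3f s, ~%.1f GB/s\n",
                   threads, t1 - t0, 24.0 * N / (t1 - t0) / 1e9);
        }
        free(a); free(b); free(c);
        return 0;
    }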

MRI at nano scale


Microscope Has 100 Million Times Finer Resolution Than Current MRI

[Figure caption: An artistic view of the magnetic tip (blue) interacting
with the virus particles at the end of the cantilever.]
Scientists at IBM Research, in collaboration with the Center for 
Probing the Nanoscale at Stanford University, have demonstrated 
magnetic resonance imaging (MRI) with volume resolution 100 million 
times finer than conventional MRI. This signals a significant step 
forward in tools for molecular biology and nanotechnology by offering 
the ability to study complex 3D structures at the nanoscale.

By extending MRI to such fine resolution, the scientists have created 
a microscope that may ultimately be powerful enough to unravel the 
structure and interactions of proteins, paving the way for new 
advances in personalized healthcare and targeted medicine.

This advancement was enabled by a technique called magnetic resonance 
force microscopy (MRFM), which relies on detecting ultrasmall 
magnetic forces. In addition to its high resolution, the imaging 
technique is chemically specific, can "see" below surfaces and, 
unlike electron microscopy, is non-destructive to sensitive 
biological materials.

The researchers use MRFM to detect tiny magnetic forces as the sample 
sits on a microscopic cantilever - essentially a tiny sliver of 
silicon shaped like a diving board. Laser interferometry tracks the 
motion of the cantilever, which vibrates slightly as magnetic spins 
in the hydrogen atoms of the sample interact with a nearby nanoscopic 
magnetic tip. The tip is scanned in three dimensions and the 
cantilever vibrations are analyzed to create a 3D image.

Parallel Programming a necessity

With us again today is James Reinders, a senior engineer at Intel.

DDJ: James, how is programming for a few cores different from
programming for a few hundred cores?

JR: As we go from a few cores to hundreds, two things happen:

1. Scaling is everything and single-core performance is truly
uninteresting in comparison; and

2. Shared memory becomes tougher and tougher to count on, or
disappears altogether.

For programmers, the shift to "Think Parallel" is not complete until
we truly focus on scaling in our designs instead of performance on a
single core. A program which scales poorly, perhaps because it divides
work up crudely, can hobble along for a few cores. However, running a
program on hundreds of cores will reveal the difference between
hobbling and running. Henry Ford learned a lot about automobile design
while doing race cars before he settled on making cars for the masses.
Automobiles which ran under optimal conditions at slower speeds did
not truly shake out a design the way less optimal, high-speed racing
conditions did. Likewise, a programmer will find
designing programs for hundreds of cores to be a challenge.

I think we already know more than we think. It is obvious to think of
supercomputer programming, usually scientific in nature, as having
figured out how to run programs in parallel. But let me
suggest that Web 2.0 is highly parallel -- and is a model which helps
with the second issue in moving to hundreds of cores.

Going from a few cores to many cores means several changes in the
hardware which impact software a great deal. The biggest change is in
memory, because with a few cores you can assume every core has equal
access to memory. It turns out having equal access to memory
simplifies many things for programmers; many ugly things do not need
to be worried about. The first step away from complete bliss is when,
instead of equal access (UMA), you move to access that is unequal but
still globally available (NUMA). In really large computers, memory is
usually broken up (distributed) and is simply not globally available
to all processors. This is why programming on distributed memory
machines is usually done with messages instead of shared memory.

Programs can easily move from UMA to NUMA; the only real issue is
performance -- and there will be countless tricks in very complex
hardware to help mask the need for tuning. There will, nevertheless,
be plenty of opportunity for programmers to tune for NUMA the same way
we tune for caches today. The gigantic leap, it would seem, is to
distributed memory. I have many thoughts on how that will happen, but
that is a long way off -- sort of. We see it already in web computing
-- Web 2.0, if you will, is a distributed programming model without
shared memory -- all using messages (HTML, XML, etc.).

So maybe message passing of the supercomputer world has already met
its replacement for the masses: Web 2.0 protocols.
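
For readers who have not seen distributed-memory code, here is a
minimal message-passing sketch in MPI (my example; the interview only
mentions messages in general and Web 2.0 protocols): two processes
with no shared memory cooperate purely by sending and receiving.

    /* Message-passing sketch: rank 0 sends a value to rank 1.
     * Build and run with an MPI installation, e.g.
     *   mpicc ping.c -o ping && mpirun -np 2 ./ping          */
    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {
            double payload = 3.14;
            MPI_Send(&payload, 1, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            double payload;
            MPI_Recv(&payload, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            printf("rank 1 received %.2f from rank 0\n", payload);
        }

        MPI_Finalize();
        return 0;
    }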

DDJ: Are compilers ready to take advantage of these multi-core CPUs?

JR: Compilers are great at exploiting parallelism and terrible at
discovering it. When people ask the question you did of me, I find
they are usually wondering about automatic compilers which take my
program of today, and magically find parallelism and produce great
multi-core binaries. That is simply not going to happen. Every decent
compiler will have some capability to discover parallelism
automatically, but it will simply not be enough.

The best explanation I can give is this: it is an issue of algorithm
redesign. We don't expect a compiler to read in a bubble sort function
and compile it into a quick sort function. That would be roughly the
same as reading most serial programs and compiling them into parallel
programs.

The key is to find the right balance of how to have the programmer
express the right amount of the algorithm and the parallelism, so the
compiler can take it the rest of the way. Compilers have done a great
job exploiting SIMD parallelism for programs written using vectors or
other syntaxes designed to make the parallelism accessible enough that
the compiler does not have too much difficulty discovering it. In such
cases, compilers do a great job exploiting MMX, SSE, SSE2, etc.
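
As a concrete illustration (my sketch, not from the interview): a
simple loop like the one below, written with no loop-carried
dependence, is the kind of code a compiler can turn into SSE/SSE2
instructions on its own.

    /* Auto-vectorization sketch.  With e.g. "gcc -O3 -msse2"
     * (add "-fopt-info-vec" to see the vectorization report),
     * the compiler can process two doubles per SSE2 instruction,
     * because the iterations are independent. */
    void saxpy_like(double *restrict y, const double *restrict x,
                    double a, int n)
    {
        for (int i = 0; i < n; i++)
            y[i] = a * x[i] + y[i];   /* no loop-carried dependence */
    }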

The race is on to find the right balance of programming practices and
compiler technology. While the current languages are not quite enough,
we've seen small additions like OpenMP yield big results for a class of
applications. I think most programming will evolve to use small
changes which open up the compiler to seeing the parallelism. Some
people advocate whole new programming languages, which allow much more
parallelism to be expressed explicitly. This is swinging the pendulum
too far for most programmers, and I have my doubts that any one solution
is general purpose enough for widespread usage.
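
To show how small an OpenMP "small addition" can be, here is a hedged
sketch of my own: one pragma added to an otherwise serial loop lets
the compiler and runtime split the iterations across cores.

    /* OpenMP sketch: the serial loop becomes parallel with one added
     * line.  Compile with an OpenMP-aware compiler, e.g. gcc -fopenmp. */
    void scale_array(double *a, int n, double factor)
    {
        #pragma omp parallel for       /* the single "small addition" */
        for (int i = 0; i < n; i++)
            a[i] *= factor;
    }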

DDJ: Earlier in our conversation, I gave you an I.O.U. for the beer
you asked for. Will you share with readers what Prof. Norman R. Scott
told you, and why you blame it for making you so confident in the
future of computing?

JR: Okay, but since I've been living in the Pacific Northwest some
time you need to know that I'm not likely to drink just any beer.

In 1987, my favorite college professor was retiring from teaching at
the University of Michigan. He told us that when he started in
electronics he would build circuits with vacuum tubes. He would
carefully tune the circuit for that vacuum tube. He thought it was
wonderful. But if the vacuum tube blew, he could get a new one but
would have to retune the circuit to the new vacuum tube, because they
were never quite the same. Now this amazed us, because most of us
helped our dads buy "standard" replacement vacuum tubes at the corner
drug store for our televisions when we were kids. So the idea of
vacuum tubes not being standard and interchangeable seemed super old
to us, because even standard vacuum tubes were becoming rare specialty
items at Radio Shack (but also perfected to have lifetime guarantees).

Next, Prof. Scott noted that Intel had recently announced a
million-transistor processor. He liked to call that VLSII (Very Large Scale
Integration Indeed!).

Now for the punchline: He said his career spanned inconsistent vacuum
tubes to a million transistors integrated on a die the size of a
fingertip. He asked if we thought technology (or our industry) was
moving FASTER or SLOWER than during his career? We all said "faster!"
So he asked: "Where will the industry be when your careers end since
we will start with a million transistors on a chip?"

I've always thought that was the scariest thing I ever heard. It
reminds me still to work to keep up -- lest I be left behind (or run
over).

So, when people tell me that the challenge before us is huge and never
before seen -- and therefore insurmountable -- I'm not likely to be
convinced.

You can blame Prof. Scott for my confidence that we'll figure it out.
I don't think a million vacuum-tube equivalents on a fingertip
seemed like anything other than fantasy to him when he started his
career -- and now we have a thousand times that. So I'm not impressed
when people say we cannot figure out how to use a few hundred cores. I
don't think this way because I work at Intel, I think this way in no
small part because of Prof. Scott. Now, I might work at Intel because
I think this way. And I'm okay with that. But let's give credit to
Prof. Scott, not Intel for why I think what I do.

DDJ: James, thanks for taking time over the past few weeks for this
most interesting conversation.

JR: You're welcome.

Articles are edited to fit the purpose of this page.
All copyrights belong to the original source.

Other links

Last updated 2/14/10