|
The goal is strong scaling of a single, millisecond-scale molecular dynamics (MD) simulation: one long trajectory rather than many short ones. This is a harder problem, but it is often essential. Why a millisecond? It is the time scale at which many biologically interesting things start to happen.
Dr. Salmon explained the interactions between proteins and the binding of drugs to their molecular targets. The laboratory developed a drug that targets a specific cancer without damaging healthy cells. The speaker showed an illustration of the required speed-up: a calculation that took two weeks at D. E. Shaw Research can now be done in ten minutes.
What will it take to simulate a millisecond? Can it be done with a machine bought off the shelf? The lab needs 10,000 ns/day. The challenges are simply doing that much computation and keeping the computing elements busy.
The approach is to design a specialized machine, Anton, with an enormously parallel architecture. Dr. Salmon gave an example of molecular dynamics. Time is divided into discrete steps; at each step the forces are calculated, and the process is iterated time and time again. Non-bonded calculations account for most of the work.
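The step-by-step iteration described above can be sketched in a few lines. This is only a minimal illustration, assuming a velocity Verlet integrator and a toy one-dimensional harmonic force; the talk summary does not say which integrator or force field is used, and the names `harmonic_force`, `velocity_verlet`, and `total_energy` are hypothetical.

```python
# Minimal MD time-stepping loop (velocity Verlet) for a toy 1-D system.
# The harmonic force and all parameter values are illustrative assumptions.

def harmonic_force(x, k=1.0):
    """Toy force: a spring pulling each particle toward the origin."""
    return [-k * xi for xi in x]

def velocity_verlet(x, v, mass, dt, n_steps, force=harmonic_force):
    """Advance positions x and velocities v by n_steps steps of size dt."""
    f = force(x)
    for _ in range(n_steps):
        # Half-step velocity update using the current forces.
        v = [vi + 0.5 * dt * fi / mass for vi, fi in zip(v, f)]
        # Full-step position update.
        x = [xi + dt * vi for xi, vi in zip(x, v)]
        # Recompute forces at the new positions (the costly part in real MD).
        f = force(x)
        # Second half-step velocity update using the new forces.
        v = [vi + 0.5 * dt * fi / mass for vi, fi in zip(v, f)]
    return x, v

def total_energy(x, v, mass, k=1.0):
    """Kinetic plus harmonic potential energy, used as a sanity check."""
    kinetic = sum(0.5 * mass * vi * vi for vi in v)
    potential = sum(0.5 * k * xi * xi for xi in x)
    return kinetic + potential

if __name__ == "__main__":
    x0, v0 = [1.0, -0.5], [0.0, 0.2]
    x1, v1 = velocity_verlet(x0, v0, mass=1.0, dt=0.01, n_steps=1000)
    print(total_energy(x0, v0, 1.0), total_energy(x1, v1, 1.0))
```

In a real simulation the force evaluation inside the loop dominates the cost, which is why the non-bonded calculations are the natural target for specialized hardware.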
Dr. Salmon described the algorithms used for this process. Ewald methods decompose the electrostatics. The typical way to parallelize is to partition space into boxes. There are two-dimensional home and neutral territory methods. The speaker showed a picture of scaling with the traditional versus the non-traditional method.
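As a rough illustration of partitioning space into boxes, the sketch below assigns particles in a periodic cube to the cells of a regular grid, so that each box could be handed to one processor. It shows only the conventional box decomposition, not the neutral territory method, and the names `box_index` and `partition` are hypothetical.

```python
# Assign particles to boxes of a regular grid (simple spatial decomposition).
from collections import defaultdict

def box_index(position, box_length, boxes_per_side):
    """Map a 3-D position in a periodic cube to integer box coordinates."""
    return tuple(int(c // box_length) % boxes_per_side for c in position)

def partition(positions, cube_length, boxes_per_side):
    """Return a dict mapping box coordinates to the particle indices inside."""
    box_length = cube_length / boxes_per_side
    boxes = defaultdict(list)
    for i, pos in enumerate(positions):
        boxes[box_index(pos, box_length, boxes_per_side)].append(i)
    return dict(boxes)

if __name__ == "__main__":
    particles = [(0.5, 0.5, 0.5), (9.5, 0.5, 0.5), (5.0, 5.0, 5.0)]
    print(partition(particles, cube_length=10.0, boxes_per_side=2))
```

Each processor then needs the particles in its own box plus those in neighbouring boxes that lie within the interaction range, and that import traffic is what the different decomposition methods try to reduce.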
Anton will execute the calculations. For comparison, the performance of existing systems, including Desmond on a commodity cluster, is as follows:
- GROMACS on single processor - 1 processor core - about 1 ns/day
- MDGRAPE-3 - 12 ASICs - 3.3 ns/day
- Desmond on cluster - 512 processor cores - 132 ns/day
- Desmond on cluster - 512 processor cores - 280 ns/day
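To put these figures in perspective, the short calculation below converts each quoted throughput, plus the 10,000 ns/day target, into the wall-clock time needed for one millisecond of simulated time. The labels and the helper `days_to_simulate` are illustrative; the rates are the ones quoted in this summary.

```python
# Wall-clock time needed to simulate one millisecond at each quoted rate.
NS_PER_MS = 1_000_000  # 1 ms = 10^6 ns

rates_ns_per_day = {
    "GROMACS, single processor": 1.0,
    "MDGRAPE-3, 12 ASICs": 3.3,
    "Desmond, 512 cores (first figure)": 132.0,
    "Desmond, 512 cores (second figure)": 280.0,
    "Stated target": 10_000.0,
}

def days_to_simulate(ns_total, ns_per_day):
    """Days of wall-clock time to cover ns_total of simulated time."""
    return ns_total / ns_per_day

if __name__ == "__main__":
    for label, rate in rates_ns_per_day.items():
        days = days_to_simulate(NS_PER_MS, rate)
        print(f"{label:36s} {days:12.1f} days ({days / 365.25:9.2f} years)")
```

At the target rate a millisecond takes 100 days, while at 280 ns/day it would take a little under ten years.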
The lab has designed its own Anton ASIC; there are 4 ASICs per board and 512 ASICs in total. What makes Anton fast? There is a high-throughput interaction subsystem with extremely high computational density for specific, application-dependent operations. The communication subsystem is high-performance and highly integrated. There is also a flexible subsystem, and Amdahl's Law requires high performance here too.
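The point about Amdahl's Law can be made concrete with a small calculation. The sketch below uses the standard formula for overall speed-up when only part of the work is accelerated; the 90% fraction and the acceleration factors are hypothetical values chosen for illustration, not numbers from the talk.

```python
# Amdahl's Law: overall speed-up when only part of the work is accelerated.

def amdahl_speedup(accelerated_fraction, acceleration):
    """Speed-up when accelerated_fraction of the work runs acceleration times faster."""
    remaining = 1.0 - accelerated_fraction
    return 1.0 / (remaining + accelerated_fraction / acceleration)

if __name__ == "__main__":
    # Hypothetical split: 90% of the time in the accelerated interactions.
    for acceleration in (10, 100, 1000):
        print(acceleration, round(amdahl_speedup(0.90, acceleration), 2))
```

With a 90% accelerated fraction the overall speed-up can never reach 10, however fast the interaction hardware is, which is why the remaining general-purpose work also has to run quickly.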
The bandwidth is 42 Gigabits/second per link, 250 Gigabits/second per node, and 5 Terabits/second across the cross section. The latency is 50 ns per hop. There is unification across network layers 2-7.
The computational density is achieved with pairwise point interaction modules: 32 per chip x 28 stages deep x 800 MHz.
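A back-of-the-envelope reading of these numbers follows. It assumes that each module delivers one pairwise interaction per clock cycle once its pipeline is full; the summary does not state this, so the resulting rates are an estimate under that assumption rather than a reported figure. The 512-chip count is the one quoted for the full machine.

```python
# Rough peak interaction rate, ASSUMING one result per module per clock cycle.
MODULES_PER_CHIP = 32   # pairwise point interaction modules per chip
PIPELINE_DEPTH = 28     # stages; sets the fill latency, not the steady rate
CLOCK_HZ = 800e6        # 800 MHz
CHIPS = 512             # ASICs in the full machine

per_chip = MODULES_PER_CHIP * CLOCK_HZ           # interactions/second per chip
machine = per_chip * CHIPS                       # interactions/second overall
fill_latency_ns = PIPELINE_DEPTH / CLOCK_HZ * 1e9  # time to fill one pipeline

if __name__ == "__main__":
    print(f"per chip: {per_chip:.3e} interactions/s")
    print(f"machine:  {machine:.3e} interactions/s")
    print(f"pipeline fill latency: {fill_latency_ns:.1f} ns")
```

Under that assumption the pipeline depth does not change the steady-state rate; it only adds a fill time of 28 cycles before the first result appears.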
The process is as follows:
1. import tower particles
2. import plate particles
3. create direct product
4. select pairs
5. evaluate function
6. accumulate plate forces
7. accumulate tower forces
8. export plate forces
9. export tower forces
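The nine steps above can be mirrored in software as a rough sketch. The version below is a plain sequential illustration, not a description of how the hardware works: the inverse-square repulsion in `pair_force`, the cutoff-based pair selection, and the name `interact` are all assumptions made for the example.

```python
# Sequential sketch of the tower/plate interaction steps listed above.
import itertools
import math

def pair_force(d, r2):
    """Toy force on particle i due to j (inverse-square repulsion).
    d is the vector from j to i and r2 its squared length. Illustrative only."""
    r = math.sqrt(r2)
    return [c / (r2 * r) for c in d]

def interact(tower, plate, cutoff):
    # Steps 1-2: tower and plate particles arrive as lists of 3-D positions.
    tower_forces = [[0.0, 0.0, 0.0] for _ in tower]
    plate_forces = [[0.0, 0.0, 0.0] for _ in plate]
    # Step 3: form the direct product of the two sets.
    for (i, ri), (j, rj) in itertools.product(enumerate(tower), enumerate(plate)):
        d = [a - b for a, b in zip(ri, rj)]
        r2 = sum(c * c for c in d)
        # Step 4: select pairs within the cutoff (and skip coincident points).
        if r2 == 0 or r2 > cutoff * cutoff:
            continue
        # Step 5: evaluate the pairwise function.
        f = pair_force(d, r2)
        # Steps 6-7: accumulate equal and opposite forces on plate and tower.
        for k in range(3):
            plate_forces[j][k] -= f[k]
            tower_forces[i][k] += f[k]
    # Steps 8-9: export the accumulated plate and tower forces.
    return plate_forces, tower_forces

if __name__ == "__main__":
    tower = [(0.0, 0.0, 0.0)]
    plate = [(1.0, 0.0, 0.0), (5.0, 0.0, 0.0)]
    print(interact(tower, plate, cutoff=2.0))
```

Each selected pair is evaluated once and contributes to both particles, so the forces on the tower and plate sets are accumulated from the same pass over the direct product.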
The flexible subsystem for general-purpose computation consists of four processing slices. The Tensilica cores control the floor. There is 32 kilobytes of memory and the researchers hardly ever touch the DRAM. There is a racetrack station and a correction pipeline to undo the operation that is not right.
Dr. Salmon also showed an example of protein folding, that of the villin headpiece. It folds in a millisecond in the lab. In simulation the protein does not always fold, so there is more work to be done here.
|