LOPSA-MADISON: Running the World's Largest MRTG

2008-01-03 12:30
America/Chicago

Please join the Madison chapter of The League of Professional
System Administrators (LOPSA) for our January talk and discussion:

When: Thursday, January 3, 2008
pizza at 6:30 (pizza provided by TDS Telecom)
talk at 7:00
 
Where: TDS Telecom, City Center West

RM 2510
525 Junction Road
Madison, WI 53717

[Park in the ramp in visitor stalls on the first level or on the top
floor. Enter the building near the front fountain in the North Tower,
which is opposite the Quizno's and not facing Junction Rd.]

Application Buffer-Cache Management for Performance: Running the World's Largest MRTG

Dave Plonka, University of Wisconsin

Dave is a co-author of this paper, which won the Best Paper award at LISA '07. He will give a presentation based on the paper and lead a discussion about it.

Abstract

An operating system's readahead and buffer-cache behaviors can significantly impact application performance; most often these better performance, but occasionally they worsen it. To avoid unintended I/O latencies, many database systems sidestep these OS features by minimizing or eliminating application file I/O. However, network traffic measurement applications are commonly built instead atop a high-performance file-based database: the Round Robin Database (RRD) Tool. While RRD is successful, experience has led the network operations community to believe that its scalability is limited to tens of thousands of, or perhaps one hundred thousand, RRD files on a single system, keeping it from being used to measure the largest managed networks today. We identify the bottleneck responsible for that experience and present two approaches to overcome it.

In this paper, we provide a method and tools to expose the readahead and buffer-cache behaviors that are otherwise hidden from the user. We apply our method to a very large network traffic measurement system that experiences scalability problems and determine the performance bottleneck to be unnecessary disk reads, and page faults, due to the default readahead behavior. We develop both a simulation and an analytical model of the performance-limiting page fault rate for RRD file updates. We develop and evaluate two approaches that alleviate this problem: application advice to disable readahead and application-level caching. We demonstrate their effectiveness by configuring and operating the world's largest Multi-Router Traffic Grapher (MRTG), with approximately 320,000 RRD files, and over half a million data points measured every five minutes. Conservatively, our techniques approximately triple the capacity of very large MRTG and other RRD-based measurement systems.
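For readers unfamiliar with "application advice to disable readahead," the following is a minimal sketch of the general idea, not the authors' actual change to RRDtool. It assumes a Linux or other POSIX system where Python exposes os.posix_fadvise; the function name update_record and the fixed-offset update pattern are hypothetical and chosen only for illustration.

```python
import os


def update_record(path, offset, payload):
    """Overwrite len(payload) bytes at `offset` and return the bytes replaced.

    Hypothetical helper: before touching the file, it advises the kernel
    that access will be random, so that a small update need not pull in
    neighboring pages through readahead.
    """
    fd = os.open(path, os.O_RDWR)
    try:
        # offset=0 and length=0 apply the advice to the whole file.
        # The hasattr guard is for platforms without posix_fadvise.
        if hasattr(os, "posix_fadvise"):
            os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_RANDOM)
        # Read the old value, then write the new one at the same offset.
        old = os.pread(fd, len(payload), offset)
        os.pwrite(fd, payload, offset)
    finally:
        os.close(fd)
    return old
```

The advice is only a hint: the kernel may ignore it, and its effect on page-fault rates depends on the operating system and workload.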

Location
TDS Telecom, City Center West
525 Junction Road
Madison, WI, 53717
United States