kernel weekly news – 25.06.2011

Posted: June 25, 2011 in kernel

Hello world, and welcome to the last edition of June!
In this week’s issue we have…

-…Paul E. McKenney, who has an rcu/urgent pull request
banishing RCU kthreads in the RCU_BOOST=n case,
Grant Likely has gpio and spi fixes for 3.0-rc3,
Jiri Kosina has updates for hid (mostly fixes, actually),
Alex Elder has xfs fixes (“The first is a simple fix that
flips the sign of an error return value.

The second fixes a bug having to do with the order in which log and
data (or realtime) device cache flushes are issued. This could lead
to data corruption when an external log was used. Previously, we
avoided the problem by disabling barriers and issuing a warning
about that at mount time, but the recent barrier/FUA rework makes
that not work properly any more. Fixing the flush ordering problem
also nicely eliminates the need for a block of code.”) and
Takashi Iwai has a lot of smallish fixes for the sound tree.

-John W. Linville updates wireless, Dave Jones has a small pull
request for cpufreq (one of the fixes is powernow-k8-related),
Dan Magenheimer updates the xen tree with a small request and
Dave Airlie also has a pull request composed of one fix for
the drm tree.

-Guenter Roeck has hwmon fixes, Andres Salomon has firmware updates
in a pull request pertaining to OLPC libertas blobs, Ingo Molnar
has rcu, ipi and timer fixes, Jonas Bonn sends out an RFR (request
for review) related to the OpenRISC architecture; here it is:

 This is a port of Linux to the OpenRISC 1000 architecture.

The OpenRISC architecture was conceived with the idea of creating a CPU
with an open specification and freely licensed implementations thereof.
The OR1200 implementation of the OpenRISC 1000 architecture is LGPL licensed,
runs on FPGA's from a broad number of vendors, and is currently being
used in a number of successful industrial projects.

A product of OpenCores.org, development of the OpenRISC architecture has
unfortunately languished for a while buried amongst a multitude of other
projects.  Recently, however, a renewed effort to lift the CPU project out
of its doldrums was initiated by the OpenCores.org management and a community
of participants consisting of both commercial and independent contributors
has rapidly taken shape around this effort.  The project now lives (at
least temporarily) at http://openrisc.net where it can get the attention
it needs.  The active development community around this architecture has
grown from roughly 5 to 25 developers in the last year and we are looking
forward to seeing what may emerge as the community grows and this CPU
architecture is allowed to be developed truly in the open.

We have been tracking upstream Linux with our port since 2.6.35.  Tracking
upstream has been mostly painless as the changes that require architecture
modifications are pretty easy to spot with git.  That said, we want to be
"upstream!"

So here's our code.  The following patches implement support for the
OpenRISC architecture on Linux.  The patch series is broken into functional
units that, hopefully, will facilitate review.

There are a couple of blemishes here and there that we are aware of and
that we aim to clean up in short order.  That said, the important thing is
that the userspace-facing ABI has settled and will not be changing again,
so any necessary cleanups can just as easily be made after the tree is
merged upstream as before.

We are excited by the prospect of getting our work reviewed by our peers and
we are excited by this opportunity to bring the OpenRISC architecture to the
attention of a larger audience.  Thanks for taking the time to consider this
port for inclusion.

The branch 'for-upstream' of the following git repository has the patch
series in this thread:

git://openrisc.net/jonas/linux

Web view for that branch at:

http://git.openrisc.net/cgit.cgi/jonas/linux/log/?h=for-upstream

Notes
-----

See the file README.openrisc for information on getting a toolchain and
simulator for building and running this code.

We currently have only a uClibc port to this architecture and this does
not fully support the reduced set of generic syscalls.  In our upstream
submission we have included only the reduced set of syscalls, while we
will be carrying an out-of-tree patch that enables Linux to work with our
uClibc port until we are able to sort this out.  This shouldn't, however,
need to further delay beginning the code review.

There's a short TODO.openrisc file, as well.  Some of the items listed
there could potentially be done before any pull request for the architecture
is sent.

There are also a fair number of places where we still have old test code
lying about inside #ifdefs and comments.  The intention is to get all that
cleaned up before the final pull request, but I didn't want to allow that
to delay beginning the review process any further. 

Avi Kivity has kvm updates (fairly serious fixes, as the author says) and
Dmitry Torokhov has input updates for -rc3 (fixes).

-Konrad Rzeszutek Wilk of Oracle fame has Xen updates (bugfixes for 3.0-rc3),
Theodore Ts’o has a pull request for ext4, Al Viro is back with more vfs/fs
patches, J. Bruce Fields has nfsd bugfixes and David Miller has the
following fixes:

 1) IPVS namespace exit causes crash in conntrack, fix from
   Hans Schillstrom.

2) ieee802154_nl_fill_phy() memory leak fix from Jesper Juhl.

3) Fix IRQ autoprobing regression in 3c503 driver, from
   Ondrej Zary.

4) Fix oops in mwifiex driver when probing setting using
   ethtool, from Yogesh Ashok Powar.

5) Netfilter NAT code adjusts sequence numbers one too many times
   over loopback, fix from Julian Anastasov.

6) Bridge multicast code sets ->mrouters_only on wrong SKB, fix
   from Fernando Luis Vazquez Cao.

7) Rik van Riel reports a regression of using netpoll over bridge
   slave devices.  What's happening now is that once we have a device
   become a slave, we cannot allow it to have netpoll run over it.

   The situations that care about this (virtualization) should run
   the netconsole instance over the bridge device, but that only
   works if all slave devices support polling.  The exception
   that makes this difficult is the TUN driver.

   Fortunately, adding netpoll support to TUN is entirely trivial
   because all of its receive events are synchronously triggered.

   Fix from Neil Horman, tested by Rik van Riel.

8) VLAN code invokes OPS without checking if the underlying device
   supports the offload feature, fix from Antoine Reversat.

9) Memory leak fix in bfin_mac driver, from Sonic Zhang.

10) RFS steering doesn't happen on the first packet of a passive TCP
    flow due to a missing sock_rps_record_flow() call in both ipv4
    and ipv6.  Fix from Eric Dumazet.

11) Module ref leak fixes in farsync and gigaset drivers, from
    Pavel Shved.

12) inet_diag byte code audit code is buggy and can cause loops as
    well as unaligned accesses.  Fix from Eric Dumazet.

13) Fix regression in multicast route lookups caused by the conversion
    to return error pointers, from Eric Dumazet. 

-The announcement for Linux 3.0-rc4:

 Mostly the usual small driver one- (or few-) liners, and some bigger
changes to drm (and md). But also two new smallish drivers
(net/usb/kalmia.c, and the ADP8870 backlight driver). Some filesystem
fixes (btrfs, cifs, afs, xfs, nfsd).

And a couple of performance regressions: rcu doesn't need threads (and
avoiding them fixes a performance problem under certain loads) and the
conversion from spinlocks to mutexes for the anon_vma locking ended up
causing a scalability issue that required fixing. 

-Paul Turner announces a 16-piece patchset for CFS (v7):

 optimizations/tweaks:
- no need to reschedule on an enqueue_throttle
- bandwidth is reclaimed at time of dequeue rather than put_prev_entity, this
prevents us losing small slices of bandwidth to load-balance movement.

quota/period handling:
- runtime expiration now handles sched_clock wrap
- bandwidth now reclaimed at time of dequeue rather than put_prev_entity, this
  was resulting in load-balance stranding small amounts of bandwidth
  previously.
- logic for handling the bandwidth timer is now better unified with idle state 
  accounting, races with period expiration during hrtimer tear-down resolved
- fixed wake-up into a new quota period waiting for timer to replenish
  bandwidth.

misc:
- fixed stats not being accumulated for unthrottled periods [thanks H. Sato]
- fixed nr_running corruption in enqueue/dequeue_task fair  [thanks H. Sato]
- consistent specification changed to max(child bandwidth) <= parent
  bandwidth, sysctl controlling this behavior was nuked
- throttling not enabled until both throttle and unthrottle mechanisms are in
  place.
- bunch of minor cleanups per list discussion 
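
The "runtime expiration now handles sched_clock wrap" item above refers to a common problem: comparing two timestamps with a plain `<` breaks once the clock counter wraps. The following is a minimal sketch of the usual wrap-safe comparison, in the same spirit as the kernel's time_after() macros. It is not taken from the patchset; the helper names clock_before() and runtime_expired() are hypothetical and exist only for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper, not from the patchset: returns true when
 * timestamp `a` lies before timestamp `b` on a 64-bit clock that may
 * wrap. Valid as long as the two values are less than 2^63 ticks apart.
 */
static bool clock_before(uint64_t a, uint64_t b)
{
    /* Unsigned subtraction wraps modulo 2^64; the sign of the result,
     * read as a signed value, says which side of `b` the value `a` is on. */
    return (int64_t)(a - b) < 0;
}

/* Hypothetical check: has runtime that expires at `expires` gone stale? */
static bool runtime_expired(uint64_t now, uint64_t expires)
{
    return !clock_before(now, expires);
}
```

With this form, a `now` just below the wrap point is still treated as earlier than an `expires` value just past it, which a direct `now < expires` comparison would get wrong.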

-Roland Dreier has a pull request pertaining to infiniband (few fixes),
Trond Myklebust has a bugfix pull request for nfs, Eric W. Biederman
has nsfd fixes, Chetan Loke has a patchset for net-next-af-packet,
explained as follows:

 Changes from v1:

1) v1 was based on 2.6.38.9. v2 is rebased to net-next.
2) Aligned bdqc members, pr_err to WARN, sob email      (Joe Perches)
3) Added tp_padding                                     (Eric Dumazet)
4) Nuked useless ;) white space                         (Stephen H)
5) Use __u types in headers                             (Ben Hutchings)
6) Added field for creating private area                (Chetan Loke)

This patch attempts to:
1) Improve network capture visibility by increasing packet density
2) Assist in analyzing multiple (aggregated) capture ports.

Benefits:
  B1) ~15-20% reduction in cpu-usage.
  B2) ~20% increase in packet capture rate.
  B3) ~2x  increase in packet density.
  B4) Port aggregation analysis.
  B5) Non static frame size to capture entire packet payload.

With the current af_packet->rx::mmap based approach, the element size
in the block needs to be statically configured. Nothing wrong with this
config/implementation. But the traffic profile cannot be known in advance.
And so it would be nice if that configuration wasn't static. Normally,
one would configure the element-size to be '2048' so that you can at least
capture the entire 'MTU-size'. But if the traffic profile varies then we
would end up either i) wasting memory or ii) getting a sliced frame.
In other words the packet density will be much less in the first case.
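
To make the density argument concrete, here is a rough arithmetic sketch comparing how many packets fit in one block with fixed-size slots versus packing each packet back to back. The function names, the 32-byte per-packet header and the 16-byte alignment are assumptions chosen for illustration, not values from the patch, and the result is not meant to reproduce the ~2x figure quoted above.

```c
#include <assert.h>
#include <stddef.h>

/* Packets that fit in a block when every slot has the same fixed size. */
static size_t fixed_slot_capacity(size_t block_size, size_t slot_size)
{
    assert(slot_size > 0);
    return block_size / slot_size;
}

/*
 * Packets that fit when each one occupies only its own length plus a
 * per-packet header, rounded up to a multiple of `align` bytes.
 * `hdr_len` and `align` are illustrative parameters.
 */
static size_t packed_capacity(size_t block_size, size_t pkt_len,
                              size_t hdr_len, size_t align)
{
    assert(align > 0);
    size_t used = ((pkt_len + hdr_len + align - 1) / align) * align;
    return block_size / used;
}
```

For a 1MB block, fixed_slot_capacity(1 << 20, 2048) gives 512 packets regardless of their real size, while packed_capacity(1 << 20, 64, 32, 16) gives 10922 for 64-byte packets and packed_capacity(1 << 20, 1500, 32, 16) gives 682 for near-MTU packets.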

--------------------
Performance results:
--------------------

Tpacket config(same on Physical/Virtual setup):
64 blocks(1MB block size)

**************
Physical setup
**************

pktgen: 64 byte traffic.

1G Intel
driver: igb
version: 2.1.0-k2
firmware-version: 3.19-0


Tpacket          V1            V3
capture-rate     600K pps      720K pps
cpu usage        70%           53%
Drop-rate        7-10%         ~1%

**********************
Virtual Machine setup:
**********************

pktgen: 64 byte traffic, 40M packets (clone_skb)

Worker VMs(FC12):
3 VMs:VM0 .. VM2, each sending 40M packets.

probe-VM(FC15): 1-vCPU/512MB memory
running patched kernel


Tpacket          V1                  V3
capture-rate     700-800K pps        1M pps
cpu usage        50%                 ~30%
Drop-rate        9-10%               <1%


Plus, in the VM setup, V3 sees/captures around 5-10% more traffic than V1/V2.

------------
Enhancement:
------------
E1) Enhanced tpacket_rcv so that it can dump/copy the packets one after another.
E2) Also implemented basic timeout mechanism to close 'a' current block.
    That way, user-space won't be blocked forever on an idle link.
    This is a much needed feature while monitoring multiple ports.
    Look at 3) below.
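
The two enhancements can be pictured as two conditions under which the current block is closed and handed to user space: the next packet no longer fits, or the block has been open longer than a timeout. The sketch below is only an illustration of that idea. The struct, its field names and the rule that an empty block is never closed by the timer are all assumptions, not the patch's actual data structures or behaviour.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical block state; field names are illustrative only. */
struct capture_block {
    uint32_t capacity;   /* bytes available in the block */
    uint32_t used;       /* bytes filled so far */
    uint64_t opened_ms;  /* time at which the block was opened */
};

/* True if a packet of `pkt_len` bytes can be copied right after the
 * packets already in the block. */
static bool block_has_room(const struct capture_block *b, uint32_t pkt_len)
{
    assert(b->used <= b->capacity);
    return pkt_len <= b->capacity - b->used;
}

/* True if the block holds at least one packet and has been open for
 * `timeout_ms` or longer, so the reader is not left waiting on an
 * idle link. */
static bool block_timed_out(const struct capture_block *b,
                            uint64_t now_ms, uint64_t timeout_ms)
{
    return b->used > 0 && now_ms - b->opened_ms >= timeout_ms;
}
```

A capture loop would close the block when block_has_room() returns false for the incoming packet, and a periodic timer would close it when block_timed_out() returns true.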

-------------------------------
Why is such enhancement needed?
-------------------------------
1) Well, spin-waiting/polling on a per-packet basis to see if it's ready
   to be consumed does not scale while monitoring multiple ports.
   poll() is not performance friendly either.
2) Also, typically a user-space packet capture interface hands multiple
   packets to another user-space protocol-decoder.

   ----------------
   protocol-decoder
          T2
   ----------------
    =============
    ship pkts
    =============
           ^
           |
           v
   -----------------
   pkt-capture logic
           T1
   -----------------
   ================
     nic/sock IF
   ================
           ^
           |
           V

T1 and T2 are user-space threads. If the hand-off between T1 and T2
happens on a per-pkt basis then the solution does NOT scale.

However, one can argue that T1 can coalesce packets and then pass off a
single chunk to T2. But T1's packet consumption granularity is still at
an individual packet level and that is something that needs to be
addressed to avoid excessive polling.


3) Port aggregation analysis:
   Multiple ports are viewed/analyzed as one logical pipe.
   Example:
   3.1) up-stream    path can be tapped in eth1
   3.2) down-stream  path can be tapped in eth2
   3.3) Network TAP splits Rx/Tx paths and then feeds to eth1,eth2.

   If both eth1,eth2 need to be viewed as one logical channel,
   then that implies we need to timesort the packets as they come across
   eth1,eth2.

   3.4) But the following issues further complicate the problem:
        3.4.1) What if one stream is bursty and the other is flowing
               at line rate?
        3.4.2) How long do we wait before we can actually make a
               decision in the app-space and bail out from the spin-wait?

   Solution:
   3.5) Once we receive a block from multiple ports, we can compare
        the timestamps from the block-descriptor and then easily time sort
        the packets and feed them to the decoders.
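
The time-sorting step described in 3.5 amounts to merging two streams that are each already in timestamp order. The sketch below shows that merge at packet granularity for simplicity, whereas the email compares block-descriptor timestamps first. The struct pkt layout and the function name merge_by_time() are hypothetical and are not part of the patch or of lolpcap.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical packet record holding only what the merge needs. */
struct pkt {
    uint64_t ts_ns;  /* capture timestamp in nanoseconds */
    int      port;   /* 1 for eth1, 2 for eth2 */
};

/*
 * Merge two arrays that are each already sorted by timestamp into
 * `out`, which must have room for na + nb entries. Ties go to stream
 * `a`, so the result is stable. Returns the number of packets written.
 */
static size_t merge_by_time(const struct pkt *a, size_t na,
                            const struct pkt *b, size_t nb,
                            struct pkt *out)
{
    size_t i = 0, j = 0, k = 0;

    assert(out != NULL);
    while (i < na && j < nb)
        out[k++] = (b[j].ts_ns < a[i].ts_ns) ? b[j++] : a[i++];
    while (i < na)
        out[k++] = a[i++];
    while (j < nb)
        out[k++] = b[j++];
    return k;
}
```

The merged array can then be fed to the decoders as one logical channel. A bursty stream simply contributes more consecutive entries, so no per-packet spin-wait is needed to interleave the two ports.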

PS: The actual patch is ~744 lines of code. The remaining ~220 lines are code comments.

sample user space code:
git://lolpcap.git.sourceforge.net/gitroot/lolpcap/lolpcap 

-Amerigo Wang has a five-piece patchset for notifiers, Dave Kleikamp
has jfs fixes ready (not much, a few bugs addressed), Rafael J. Wysocki
has power management issues fixed in a new pull request, Lars-Peter Clausen
issues a 4-piece patchset adding support for Analog Devices ADAV801 and ADAV803,
Paul Gortmaker starts the 2.6.34 longterm (247 patches), Jesse
Barnes has pci fixes (addressing a few minor issues) and Jeff Garzik
has 4 libata fixes.

-Jesper Juhl has a 37-piece patchset removing unneeded #includes
of linux/version.h, Greg Kroah-Hartman announces kernels
2.6.39.2, 2.6.32.42 and 2.6.33.15, Nicholas A. Bellinger has
target updates for -rc5 (“Here are the target updates for v3.0-rc5
that have been marinating in lio-core-2.6.git for a few weeks now,
and are ready for you to pull.

It includes an OOPs bugfix for task management exception handling in target
core, a bugfix for incorrect TMR LUN lookup in tcm_fc fixed by Kiran,
a series from Dan to address issues with ERR_PTR dereferencing + strlen usage +
possible deadlock fix, and Roland’s spin_lock_irq() -> irqsave() change
required for certain cases in HW target mode operation in order to prevent
re-enabling interrupts that have already been disabled with a separate
irqsave() call.”), Dave Airlie has four fixes for radeon drm, Jens Axboe
has block fixes for -rc4 (“A small collection of fixes for the current cycle.
Most of these are stable material as well.

– Fix a long standing race around the ioc lookup cache in cfq-iosched.
Very elusive bug, as the window is really small. Seemed to reproduce
most easily on single CPU systems.

– Add REQ_SECURE to the shared bio/rq mask.

– Bad types in throtl_log() prints.

– A small series of continued fixes for disk event notification and bdev
claiming from Tejun.”) and….

-…this ends this week’s kernel news, may you have a great time!
