kernel weekly news – 23.07.2011

Posted: July 23, 2011 in kernel

Hello everyone and welcome!

-We start this week with Dave Airlie updating some PCI IDs
for Radeon cards (pull request). Guenter Roeck has hwmon
fixes (git), Ryusuke Konishi has a nilfs2 fix in a pull
request (doc), and Ben Greear has a 12-piece patchset on
NFS and IP binding, explained thus:

 This patch series allows binding the nfs and rpc logic to a specific local
IP address.

Features/Benefits:

*  Allow multiple unique mounts to the same server using unique
   source IP addresses on client system.

   Routing rules can then be created based on the source IP address for
   advanced routing of the NFS traffic.

   This could allow someone to use two 1G interfaces to access a
   server with 10G connectivity and allow aggregate 2Gbps transfer,
   for instance.

   This is also useful for load testing NFS servers as well as the
   client-side NFS logic (note the bugs & fixes found while testing
   this code!)

*  Allow using a specific IP address on multi-homed system.  This could
   increase security in some cases and in general gives the user more
   control over how the services are configured.

This code has been tested under load on 3.0 and a similar version has
been tested under load on 2.6.38 and previous kernels.

Full implementation of this feature requires patches to mount.nfs,
which have been agreed to be accepted if the kernel patches are accepted. 

Steven Rostedt also has a 20-piece update (patchset) for perf/tracing.
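
Ben Greear's series does the binding inside the kernel's NFS and RPC code, but the underlying idea of choosing the local source address before connecting can be illustrated in userspace. The following is a minimal Python sketch, not taken from the patchset; the helper name connect_from is hypothetical.

```python
import socket


def connect_from(source_ip, dest, timeout=5.0):
    """Open a TCP connection to dest (host, port) from a chosen local IP.

    Hypothetical helper for illustration only; it is not part of the
    NFS patch series, which does the equivalent inside the kernel.
    """
    # Port 0 lets the kernel pick an ephemeral source port, while the
    # source IP is pinned so routing rules keyed on it can apply.
    return socket.create_connection(
        dest, timeout=timeout, source_address=(source_ip, 0)
    )
```

Calling connect_from("192.0.2.10", ("192.0.2.1", 2049)) would open a connection from the given local address, assuming that address is configured on the client; both addresses here are documentation placeholders.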

-Small fixes/pull requests: Mauro Carvalho Chehab – v4l, David Miller –
networking and sparc, Rafael J. Wysocki – mips, Len Brown – acpi for
-rc7, and Hans Verkuil has v4l-dvb for 3.1.

-Jonas Bonn posts version 3 of the OpenRISC architecture patch series:

 Here's v3 of the OpenRISC architecture patch series.

The major changes since version 2 are:

i)  cleanup of the ptrace code

I've removed the single stepping code for now as I want to clean that up
separately.  Will resubmit that in 3.2.

Have implemented exporting of thread state to userspace via the GETREGSET
mechanism.  This allows us to keep the pt_regs structure opaque and allows
us to freely change the layout of the registers on the stack, which is
something that we will want to do in order to get better cache behaviour.

ii) rewrite of dma_alloc_consistent

iii) clean up device tree code in response to feedback for the v2 patch set

We only have a single example DTS file in the tree now, for the simulator,
that goes along with the defconfig file.


Aside from those three things, there are mostly cleanups coming from code
review and from checkpatch.

This tree should now be ready for inclusion in 3.1, but any feedback that
lets us be that much better is of course appreciated.

The code is available in the 'for-upstream' branch of the git repo at:

git://openrisc.net/jonas/linux

Note that there are 4 patches at the base of the branch changing asm-generic
behaviour and reviewed separately.  Note also, that the code in the branch
depends on the devicetree/next branch and one patch from Rusty's patch set
for 3.1.

Thanks,
Jonas 

-Seth Forshee has a patch fixing a hibernation problem on the
Toshiba NB505 netbook:

 The following patch is in response to a consistently reproducible
failure to freeze tasks prior to restoring a hibernation image on a
Toshiba NB505 netbook. This machine has a built-in USB card reader.
Since the usb-stor-scan task is freezable but the code in
quiesce_and_remove_host() that waits for scanning to complete is not,
khubd can fail to freeze when processing the disconnect for the card
reader.

It seems that both should either be freezable or not freezable. Since
there doesn't currently seem to be any freezable way to wait on a
completion, I started with the simpler approach of making usb-stor-scan
non-freezable. If it would be preferable to make both freezable I can
take that approach instead.

Thanks,
Seth 

-David Miller has another round of networking fixes (pull request):

 A few last-minute stragglers.  The Tulip debug message thing, in
particular, is a really annoying regression for people who have
that hardware.

1) pr_*() conversion of tulip driver turned some commented out messages
   into pr_debug() which spams the log, just kill them off.  From Joe
   Perches.

2) PPPOE connections are keyed on MAC address, so we have to flush all
   connections on a device when the MAC address changes since until we
   renegotiate with the new MAC address the remote end won't see any
   of our packets.

3) linux/sdla.h has a kernel function declaration in the userspace
   visible area.  In fact this function hasn't been in the kernel for
   years so just remove it outright.  From WANG Cong.

Please pull, thanks a lot. 

-Vivien Didelot has a patchset proposing support for the
Technologic Systems TS-5500 Single Board Computers.
Marek Szyprowski also has a patchset regarding CMA
(Contiguous Memory Allocator), explained as follows:

 This is yet another round of Contiguous Memory Allocator patches. Now I
focused mainly on the integration of CMA to DMA mapping subsystem on ARM
architecture. In this version I've tried to solve the issue of the
aliasing in coherent memory mapping that was present in earlier versions
of DMA mapping framework.

The proposed solution should be considered as a proof-of-concept. Right
now it doesn't support GFP_ATOMIC allocations. Support for them is on my
TODO list and will be implemented on top of the "ARM: DMA: steal memory
for DMA coherent mappings" patch by Russell King.

A few words for those who see CMA for the first time:

   The Contiguous Memory Allocator (CMA) makes it possible for device
   drivers to allocate big contiguous chunks of memory after the system
   has booted. 

   The main difference from the similar frameworks is the fact that CMA
   allows to transparently reuse memory region reserved for the big
   chunk allocation as a system memory, so no memory is wasted when no
   big chunk is allocated. Once the alloc request is issued, the
   framework will migrate system pages to create a required big chunk of
   physically contiguous memory.

   For more information you can refer to nice LWN articles: 
   http://lwn.net/Articles/447405/ and http://lwn.net/Articles/450286/
   as well as links to previous versions of the CMA framework.

   The CMA framework has been initially developed by Michal Nazarewicz
   at Samsung Poland R&D Center. Since version 9, I've taken over the
   development, because Michal has left the company.

The current version of CMA is a set of helper functions for DMA mapping
framework that handles allocation of contiguous memory blocks. The
difference between this patchset and Kamezawa's alloc_contig_pages()
are:

1. alloc_contig_pages() requires MAX_ORDER alignment of allocations
   which may be unsuitable for embedded systems where a few MiBs are
   required.

   Lack of the requirement on the alignment means that several threads
   might try to access the same pageblock/page.  To prevent this from
   happening CMA uses a mutex so that only one allocating/releasing
   function may run at one point.

2. CMA may use its own migratetype (MIGRATE_CMA) which behaves
   similarly to ZONE_MOVABLE but can be put in arbitrary places.

   This is required for us since we need to define two disjoint memory
   ranges inside system RAM.  (ie. in two memory banks (do not confuse
   with nodes)).

3. alloc_contig_pages() scans memory in search for range that could be
   migrated.  CMA on the other hand maintains its own allocator to
   decide where to allocate memory for device drivers and then tries
   to migrate pages from that part if needed.  This is not strictly
   required but I somehow feel it might be faster.

The integration with ARM DMA-mapping subsystem is done on 2 levels.
During early boot memory reserved for contiguous areas are remapped with
2-level page tables. This enables us to change cache attributes of the
individual pages from such area on request. Then, DMA mapping subsystem
is updated to use dma_alloc_from_contiguous() call instead of
alloc_pages() and perform page attributes remapping.

The current version has been tested on Samsung S5PC110 based Goni machine
and s5p-fimc V4L2 driver. The driver itself uses videobuf2 dma-contig
memory allocator, which in turn relies on dma_alloc_coherent() from
DMA-mapping subsystem. By integrating CMA with DMA-mapping we managed to
get this driver working with CMA without any single change required in
the driver or videobuf2-dma-contig allocator.

TODO:
- implement GFP_ATOMIC allocations
- implement support for contiguous memory areas placed in HIGHMEM zone

Best regards 
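
The core idea in the CMA description above, letting the system use a reserved region for movable pages and migrating them out only when a driver asks for a big contiguous chunk, can be pictured with a toy model. The Python sketch below is purely illustrative and is not kernel code: the page states, the function name alloc_contiguous and the policy of always carving the chunk from the start of the region are all assumptions, and it models a single allocation only.

```python
FREE, MOVABLE, DEVICE = ".", "m", "D"


def alloc_contiguous(pages, region, count):
    """Toy model of a CMA-style allocation (not kernel code).

    pages  : list of page states, mutated in place
    region : range of page indexes reserved for big allocations
    count  : number of contiguous pages the driver asks for
    """
    if count > len(region):
        raise MemoryError("request larger than the reserved region")
    chunk = range(region.start, region.start + count)
    # Free pages outside the chunk are candidate migration targets.
    targets = [i for i, p in enumerate(pages) if p == FREE and i not in chunk]
    movers = [i for i in chunk if pages[i] == MOVABLE]
    if len(movers) > len(targets):
        raise MemoryError("nowhere to migrate system pages")
    for src, dst in zip(movers, targets):
        pages[dst] = MOVABLE      # "migrate" the system page out of the way
    for i in chunk:
        pages[i] = DEVICE         # the chunk now belongs to the device
    return chunk


pages = list("mm..m...")          # reserved region is pages 0..3
alloc_contiguous(pages, range(0, 4), 3)
print("".join(pages))
```

In this run the two system pages that sat in the reserved region end up in free slots outside the chunk, and the first three pages become one contiguous device buffer; no memory was idle before the request arrived.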

-H. Peter Anvin has an x86 pull request for -final,
Ingo Molnar also has a pull request with lots of
scheduler fixes, and Linus Torvalds has a note regarding
3.0 and the 3.1 merge window
(“As everybody knows by now, not only did I do an -rc7 last week instead
of releasing 3.0 (due to some worries about the RCU code), but I ended
up also not doing the 3.0 on Monday because of a pathname lookup bug
and then some _more_ RCU issues.

Anyway, those should all be resolved and the fixes merged now, and I’m
not really all that nervous about the pathname lookup issue – I think
that got nailed, and the patch for that was literally just moving a
single line (and adding a comment).

The RCU issues worries me a bit, but everything says it’s all good,
and the biggest issues were with the new RCU_BOOST feature that really
neither defaults to on, nor is suggested right now. So I think we’re
ok, and I’m planning on doing 3.0 tomorrow.

That said, I do have one observation, and two requests:

The observation: with the upcoming version number change, the official
‘git’ repository is now (and has been for a week, but people probably
didn’t notice) just

git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

ie the “-2.6” thing is gone. However, the old name continues to work,
just so that nothing breaks. You don’t need to really change anything,
but I thought I’d just point it out.

The two requests are pretty simple:

– Please do spend the time doing some last-minute checking of the
current git tree. The 3.0 release hasn’t really had a lot of
fundamental changes, and the pathname bug Hugh found wasn’t even a
regression – it’s been there a while and was in no way a 3.0 thing. So
I don’t really expect any huge issues, but just for psychological
reasons it would be nice to not have even a whiff of the traditional
“.0 problems” .

– Because the 3.0 release ends up being so delayed from my original
plan, that is now pushing the merge window solidly into my summer
vacation. Originally, I’d have vacationed after -rc1. With the extra
week, the merge window pushed a couple of days into my vacation – not
enough to worry about. And now, it’s solidly “the second week of the
merge window, Linus plans to spend much of the week under water”. So
please try to send your merge window pull requests *early*.

So that second request is basically aiming to not have to extend the
merge window. I won’t have WiFi under water, and the computer I will
have there won’t be doing git merges. But if the load is light on
the second week, I may be able to keep up despite being on vacation.
Thanks to Intel, I will at least have a very capable laptop that can
compile kernels in minutes rather than hours, and where
‘allyesconfig’s are actually reasonable targets for me.

If I end up not being able to do a good job while on vacation, I’ll
obviously have to extend the merge window, but I’m basically hoping
that we simply won’t need to go there. I just need some help from you
guys for that to work.

Linus”) and
Paul Turner has an 18-piece CFS bandwidth control patchset, explained as follows:

 Hi all,

Please find attached the incremental v7.2 for bandwidth control.

This release follows a fairly intensive period of scraping cycles across
various configurations.  Unfortunately we seem to be currently taking an IPC
hit for jump_labels (despite a savings in branches/instr. ret) which despite
fairly extensive digging I don't have a good explanation for.  The emitted
assembly /looks/ ok, but cycles/wall time is consistently higher across several
platforms.

As such I've demoted the jumppatch to [RFT] while these details are worked
out.  But there's no point in holding up the rest of the series any more.

[ Please find the specific discussion related to the above attached to patch 
17/18. ]

So -- without jump labels -- the current performance looks like:

                            instructions            cycles                  branches         
---------------------------------------------------------------------------------------------
clovertown [!BWC]           843695716               965744453               151224759        
+unconstrained              845934117 (+0.27)       974222228 (+0.88)       152715407 (+0.99)
+10000000000/1000:          855102086 (+1.35)       978728348 (+1.34)       154495984 (+2.16)
+10000000000/1000000:       853981660 (+1.22)       976344561 (+1.10)       154287243 (+2.03)

barcelona [!BWC]            810514902               761071312               145351489        
+unconstrained              820573353 (+1.24)       748178486 (-1.69)       148161233 (+1.93)
+10000000000/1000:          827963132 (+2.15)       757829815 (-0.43)       149611950 (+2.93)
+10000000000/1000000:       827701516 (+2.12)       753575001 (-0.98)       149568284 (+2.90)

westmere [!BWC]             792513879               702882443               143267136        
+unconstrained              802533191 (+1.26)       694415157 (-1.20)       146071233 (+1.96)
+10000000000/1000:          809861594 (+2.19)       701781996 (-0.16)       147520953 (+2.97)
+10000000000/1000000:       809752541 (+2.18)       705278419 (+0.34)       147502154 (+2.96)

Under the workload:
  mkdir -p /cgroup/cpu/test
  echo $$ > /dev/cgroup/cpu/test (only cpu,cpuacct mounted)
  (W1) taskset -c 0 perf stat --repeat 50 -e instructions,cycles,branches bash -c "for ((i=0;i<5;i++)); do $(dirname $0)/pipe-test 20000; done"

This may seem a strange work-load but it works around some bizarro overheads
currently introduced by perf.  Comparing for example with:
  (W2)taskset -c 0 perf stat --repeat 50 -e instructions,cycles,branches bash -c "$(dirname $0)/pipe-test 100000;true"
  (W3)taskset -c 0 perf stat --repeat 50 -e instructions,cycles,branches bash -c "$(dirname $0)/pipe-test 100000;"


We see: 
 (W1)  westmere [!BWC]             792513879               702882443               143267136             0.197246943  
 (W2)  westmere [!BWC]             912241728               772576786               165734252             0.214923134  
 (W3)  westmere [!BWC]             904349725               882084726               162577399             0.748506065  

vs an 'ideal' total exec time of (approximately):
$ time taskset -c 0 ./pipe-test 100000
 real    0m0.198s
 user    0m0.007s
 sys     0m0.095s

The overhead in W2 is explained by that invoking pipe-test directly, one of
the siblings is becoming the perf_ctx parent, invoking lots of pain every time
we switch.  I do not have a reasonable explanation as to why (W1) is so much
cheaper than (W2), I stumbled across it by accident when I was trying some
combinations to reduce the -to- variance. 
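
The parenthesised figures in the performance table above appear to be percentage changes relative to each platform's [!BWC] baseline; recomputing the clovertown "+unconstrained" row with that formula reproduces the posted values. A small Python check follows (the function name pct_change is an assumption made for illustration):

```python
def pct_change(baseline, value):
    """Relative change of value against baseline, in percent."""
    return (value - baseline) / baseline * 100.0


# clovertown, [!BWC] baseline vs. +unconstrained, from the table above
baseline = {"instructions": 843695716, "cycles": 965744453, "branches": 151224759}
unconstrained = {"instructions": 845934117, "cycles": 974222228, "branches": 152715407}

for counter in baseline:
    delta = pct_change(baseline[counter], unconstrained[counter])
    print(f"{counter}: {delta:+.2f}")
```

This prints +0.27, +0.88 and +0.99 for instructions, cycles and branches, matching the table. The same formula gives the negative cycle deltas on barcelona and westmere, where the bandwidth-control kernel used fewer cycles than the baseline.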

-There is a mail on LKML called “Linux 3.0 release”, authored by
someone named Linus Torvalds, which says:

 So there it is. Gone are the 2.6. days, and 3.0 is out.

This obviously also opens the merge window for the next kernel, which
will be 3.1. The stable team will take the third digit, so 3.0.1 will
be the first stable release based on 3.0.

As already mentioned several times, there are no special landmark
features or incompatibilities related to the version number change,
it's simply a way to drop an inconvenient numbering system in honor of
twenty years of Linux. In fact, the 3.0 merge window was calmer than
most, and apart from some excitement from RCU I'd have called it
really smooth. Which is not to say that there may not be bugs, but if
anything, there are hopefully fewer than usual, rather than the normal
".0" problems.

And as I already mentioned yesterday, I'm hoping the 3.1 merge window
will be calm too, because due to the delays the latter half of the
merge window will fall into my vacation time. I briefly considered
simply waiting two extra weeks, but quite frankly, that wouldn't
really have solved anything (it would have made the merge window
instead fall into LinuxCon and my divemaster weekends).

So I'm going to try to keep to the normal two-week merge window, but
if it ends up being too busy for me to keep up, I may end up extending
the window just so that I can merge everything. However, even if that
happens, that will *not* mean that I will accept big pull requests for
longer, it just means that I may end up delaying things to catch up
with timely merge requests.

That said, judging by past experience, the summer merge windows often
tend to be quieter, so maybe I worry needlessly. Much of Europe is
starting to go on vacation, and parts of the US are being fried to a
crisp, so maybe 3.1 will be calm too.

Anyway, what has changed since -rc7 is mainly some RCU interactions
with the scheduler, and the RCU problems should hopefully be behind
us. The pathname lookup race is also fixed. There's a few DRI fixes
(i915 modesetting, and some Radeon fixes), and Al walked through some
more esoteric VFS d_lock issues. Other than that it's really pretty
small and random.

The shortlog from -rc7 is appended, the bigger "everything since
2.6.39" list is obviously unmanageable.

                                Linus 

-Chris Ball has MMC updates for Linux 3.1-rc1 (pull request),
Konrad Rzeszutek Wilk has a few Xen fixes/cleanups, Benjamin
Herrenschmidt has a powerpc pull request (“This branch contains
some rework and consolidation of the code
to establish the mapping between device-tree nodes for PCI
devices (if they exist) and the corresponding Linux struct device.

It moves it all to generic code in a way that is a lot cleaner
than any of the previous implementations. It specifically allows
me to get rid of two subtly different ways of doing the same
thing I had in powerpc between 32-bit and 64-bit, and updates
microblaze and x86 to use that common code as well.

This has been in -next for a while with no complaints so far
and is completely orthogonal to the powerpc changes I will send
you in a couple of days (I want to wait for some other trees
to go in first to address a couple of known collisions).

Cheers,
Ben.”), Pekka Enberg announces changes for the SLAB subsystem
for 3.1-rc0 (pull request) and Artem Bityutskiy has UBI/UBIFS changes
for 3.1.

-That is all for this week’s edition! Enjoy, and have a good weekend!
