Count like an Egyptian, the Anderer data dictionary

Panglozz

September 1, 2006

The Sontag-Heise presentation at SCO Forum 2003 featured a slide of a "spectral analysis" of the Berkeley Packet Filter. Blake Stowell forwarded this PowerPoint presentation without restriction to journalist Bob McMillan of IDG, who in turn made it available to Bruce Perens for analysis.
http://perens.com/SCO/SCOSlideShow.html

The original unmodified PowerPoint file is available at http://brucep.webfarmhosting.com/SCOsource_Briefing_II.2.ppt

The spectral analysis graphs are embedded Excel charts, which means they can be opened and the underlying spreadsheet inspected using Microsoft's software.

The spreadsheet reveals itself as a 435x11 array of integers without identifying column or row labels. Other charts, unused in the PowerPoint, are scattered around the data table. The data array represents two source files: net/core/filter.c from a Linux 2.4 version, and the supposedly corresponding file from some SCO Unix. The Linux file is represented by rows 1-216 in the spreadsheet and the Unix file by rows 217-435.
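The row layout described above can be sketched in a few lines of Python. This is only an illustration of the split: the array here is a placeholder filled with zeros, not the real spreadsheet data, and the names split_array, LINUX_LAST_ROW and so on are made up for the example.

```python
# Dimensions as described in the post: 435 rows by 11 columns,
# rows 1-216 for the Linux file and rows 217-435 for the Unix file.
TOTAL_ROWS = 435
COLUMNS = 11
LINUX_LAST_ROW = 216  # 1-based number of the last Linux row


def split_array(rows, linux_last_row=LINUX_LAST_ROW):
    """Split the unlabeled array into its Linux and Unix halves."""
    linux_rows = rows[:linux_last_row]   # spreadsheet rows 1..216
    unix_rows = rows[linux_last_row:]    # spreadsheet rows 217..435
    return linux_rows, unix_rows


# Placeholder data only (all zeros); the real values are in the spreadsheet.
placeholder = [[0] * COLUMNS for _ in range(TOTAL_ROWS)]
linux_rows, unix_rows = split_array(placeholder)
print(len(linux_rows), len(unix_rows))
```

Note that the two halves are not the same length: 216 rows for the Linux file and 219 for the Unix file.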

The other charts in the worksheet were used as graphics in the Anderer patent specification (Sept. 2003), which is a print-out of another PowerPoint.

One central idea of the Anderer code comparator was that it would recode each line (statement) of a source file as a number vector, with each line's concepts and/or variables represented by their frequency of occurrence globally in the analysis pool.
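A minimal Python sketch of that idea follows. It is not the Anderer tool, whose internals are not public: the tokenizer (words and single punctuation characters) and the function names are assumptions made for the example. It only shows what "replace each token with its global frequency in the pool" could look like.

```python
import re
from collections import Counter

# Hypothetical tokenizer: runs of word characters, or single punctuation marks.
TOKEN_RE = re.compile(r"\w+|[^\w\s]")


def tokenize(line):
    return TOKEN_RE.findall(line)


def frequency_encode(pool):
    """Recode every line of every file as a vector of global token counts.

    pool maps a file name to its list of source lines.
    """
    # Count each token across the whole analysis pool, not per file.
    counts = Counter(
        tok for lines in pool.values() for line in lines for tok in tokenize(line)
    )
    # Replace each token by its pool-wide count, keeping line order.
    return {
        name: [[counts[tok] for tok in tokenize(line)] for line in lines]
        for name, lines in pool.items()
    }


# Toy pool, made up for the example.
pool = {"a.c": ["if (x)", "else x"], "b.c": ["if (y)"]}
encoded = frequency_encode(pool)
print(encoded["a.c"])
print(encoded["b.c"])
```

Under a scheme like this, any two lines built from the same common tokens get the same vector regardless of where the code came from, so similar vectors say little about origin.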

Access to the data array and the net/core/filter.c file lets one correlate specific code with the secret Anderer numbers. For instance, If{ == 572858, Continue; == 18396, Else == 122425, and so forth. (It is unexplained why If and Else differ in frequency by more than a factor of four.)
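The arithmetic behind that last remark, as a small Python sketch. The three values are the ones quoted above; reading them as frequencies is the interpretation used in this post, and the names known_codes and frequency_ratio are made up for the example.

```python
# The three token/number pairs quoted in the post.
known_codes = {
    "If{": 572858,
    "Continue;": 18396,
    "Else": 122425,
}


def frequency_ratio(codes, a, b):
    """Ratio of two codes, treating each as a frequency of occurrence."""
    return codes[a] / codes[b]


if_else = frequency_ratio(known_codes, "If{", "Else")
if_continue = frequency_ratio(known_codes, "If{", "Continue;")
print(round(if_else, 2), round(if_continue, 2))
```

The If/Else ratio works out to a little under 4.7, and If/Continue to a little over 31.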

A secret number substitution code is really just silly. More significant, I believe, is the fact that the source spreadsheet was residing on a SCO-owned computer for the author to access while he was constructing this PowerPoint (the metadata happily provides an exact path). The absence of row and column headings likely means the author was familiar with the data structure of the code comparator output.

SCO has previously sought to exclude the MIT deep divers from evidence on the grounds that their explorations were a contractor's unused, and hence protected, work product. The spreadsheet supports the notion that SCO understood and used the data internally, as well as presenting it publicly.

1:20:07 PM

SCO understanding, was Re: Count like an Egyptian, the Anderer data dictionary

September 1, 2006

hamjudo2000

"The spreadsheet supports the notion that SCO understood and used the data internally, as well as presenting it publicly."

Looking at it from another direction: SCO did not understand the source of the files they were using to generate the data, so the data is worthless.  GIGO, Garbage In, Garbage Out.  So the data SCO was using had no meaning, and SCO did not understand it at all.  All SCO could see was how similar the files were based on some metric, not the lack of significance of that similarity.

Did SCO do the "spectral analysis" before or after talking about millions of lines of matching code?

It appears that the metric they were using was not particularly useful.  It seems it may have led to thousands of false positives.  Or, if Blepp was not exaggerating, there were millions of lines of false positives.

5:36:01 PM

Re: SCO understanding, was Re: Count like an Egyptian, the Anderer data dictionary

Panglozz

September 1, 2006

We are wading close to a pool, SCO's reliance on Anderer's evidence, that is much shrouded.

When Laura DiDio reported on the MIT deep divers in June 2003, she said:
6/13/03 DiDio
"SCO hired three separate teams of code experts, including a group from the Massachusetts Institute of Technology. According to SCO, these groups all found code in Linux that purportedly originated in the Unix System V kernel and not BSD."

The key point here is the positive assertion that the origin of the code was checked, and BSD excluded, and SysV identified. This indicates that far from being a data dump, a filter for origin was a portion of the investigation.

Enderle provides another tidbit:
7/7/03 Enderle
"I saw what appeared to be a word-for-word copy of about every third line of code in the central module of the Linux kernel," said Enderle of Giga Information Group, who viewed the alleged code violations two weeks ago. "The lines of code contained typos, misspellings and even copyright disclaimers. It appeared to constitute a violation of the license."

This indicates that, for some texts, the copyright block and other comment text were compared.

Perens and others have speculated that the Berkeley Packet Filter match was found after the copyright and other comment blocks were stripped from the files; otherwise the non-protectable nature of BPF would have been immediately apparent.

The Anderer code comparator is described as stripping comments prior to numerical coding, so this may have affected analysis of this file.
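To make the effect concrete, here is a rough Python sketch of comment stripping for C source. It is not the Anderer implementation: the regex is a simplification that ignores comment markers inside string literals, the function name is made up, and the sample text is a placeholder rather than the real file header.

```python
import re

# Matches /* ... */ block comments (possibly multi-line) and // line comments.
# Simplification: comment markers inside string literals are not handled.
COMMENT_RE = re.compile(r"/\*.*?\*/|//[^\n]*", re.S)


def strip_comments(source):
    """Return the non-blank code lines left after removing C comments."""
    without_comments = COMMENT_RE.sub("", source)
    return [line.strip() for line in without_comments.splitlines() if line.strip()]


# Placeholder input; the header text is invented for the example.
sample = "/* Copyright notice (placeholder) */\nint x; // counter\nreturn x;\n"
code_lines = strip_comments(sample)
print(code_lines)
```

Once a step like this has run, the copyright block is gone before any comparison happens, so nothing downstream can flag where the file came from.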

The first MIT reference was by Sontag in April 2003, McBride and Broughton mentioned the effort in May 2003, and stock analysts began reporting in June.

Objections to the method that became more apparent after the SCO Forum were anticipated and answered by SCO prior to public release. These include BSD unprotectable elements (DiDio) and copyright inspection (Enderle). No one pulled the wool over SCO's eyes. They were not innocents led to the slaughter by a cunning Anderer; they knew what the objections would be and tried to deny these were problems.

6:07:17 PM

Source: Investor Village SCO Board [ http://www.investorvillage.com/smbd.asp?mb=1911 ]