Sorry for the length of this consolidated reply. Tl;dr version: The difference between text and binary is a matter of intent, which is not always subjective. We insist on beginnings because we are individuals, though because we are many individuals, any point can be called central. When markup is old enough, we call it plain text.
On Sun, Nov 24, 2013 at 9:25 AM, Costello, Roger L. <costello@m...> wrote:
I should have said "not interpreted [by someone]" rather than "not interpretable", since every binary file is interpretable as a stream of characters (in at least some encoding) if you don't care about its meaning. That, plus the fact that most early encodings were small and simple, is what allowed the computing world to muddle along for so long without clearly distinguishing text from binary, except in the matter of line ends (see below).
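A minimal sketch in Python, purely for illustration: any byte sequence whatsoever decodes losslessly under a forgiving encoding such as Latin-1, which maps every byte value to a character, so "reading binary as text" always succeeds if you don't care about meaning.

```python
# Any sequence of bytes can be "read as text" if you pick a forgiving
# encoding: Latin-1 maps every byte value 0-255 to a character, so the
# decode below can never fail -- it just ignores the bytes' meaning.
blob = bytes([0x00, 0x89, 0x50, 0x4E, 0x47, 0xFF])  # arbitrary "binary" data

as_text = blob.decode("latin-1")
assert len(as_text) == len(blob)          # one character per byte, always
assert as_text.encode("latin-1") == blob  # and the round trip is lossless

# The same bytes are NOT valid UTF-8, which is pickier:
try:
    blob.decode("utf-8")
except UnicodeDecodeError:
    pass  # 0x89 cannot appear there in a valid UTF-8 sequence
```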
That too wasn't very well worded: it would be like calling a cat a mammal in a context in which we divide animals of interest into cats and mammals: obviously "mammals" is short for "other mammals".
On Sun, Nov 24, 2013 at 10:39 AM, Steve Newcomb <srn@c...> wrote:

> As a practical matter, is there *any* difference between text and binary [...]
As a practical matter, is there any difference between words and pictures other than the necessity (in most cases) of dividing text into lines? Of course there is. There are a vast number of issues with text, of which encoding and line division are the simplest and easiest of all to solve. But you know that.
> The problem with John's definition is that it begs the question, "What [...]

Indeed, "interpretability" was a bad choice of words on my part, as I noted above.

> Regular expressions, for example, are quite useful for detecting [...]

I don't know what you mean by "purely numeric". Numbers are an abstract concept. They can be represented by numerals (e.g. "1234567"), in which case they are text. Or they can be represented in any of a vast number of binary formats, which I cannot exemplify in this email because it is textual.
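The numeral-versus-binary point can be illustrated in Python; the 4-byte little-endian packing below is just one of the many possible binary representations, chosen for the example.

```python
import struct

n = 1234567

# As text: a string of numeral characters, one byte each in ASCII/UTF-8.
textual = str(n)
assert textual == "1234567"
assert textual.encode("ascii") == b"\x31\x32\x33\x34\x35\x36\x37"

# As binary: the same value packed into a 4-byte little-endian integer --
# one of many possible binary formats.
binary = struct.pack("<i", n)
assert binary == b"\x87\xd6\x12\x00"
assert struct.unpack("<i", binary)[0] == n
```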
> Charles Goldfarb used to say, "If there are bugs in a text-processing [...]

I think it would refer to encoding now.

> [...] but the absurd [...]

Historically, it descends from the difference between the Model 33 Teletype and the Model 37. On the former, CR only returned the printing head to the left margin and LF was required to feed the paper upwards; on the latter, LF did both jobs. Bell Labs people had access to the spiffy Model 37, so their OS employed LF alone, whereas folks in the outside world mostly had Model 33s (I cut my teeth on one), so the DEC OSes used CR+LF. That legacy passed through CP/M to MS-DOS and Windows.

> It's sort of [...]
The Russian Empire did that on purpose to make it hard to invade them using their own train tracks. The Unix Empire wanted simplicity of internal processing (so only one character for a newline), whereas the DEC Empire wanted simplicity of I/O (so that text shipped directly to a teletype would Just Work).
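A small Python sketch of how software copes with the two conventions today: normalize on input, as Unix tools and Python's universal-newline handling do.

```python
# Coping with the CR+LF vs. LF split by normalizing on input.
dos_text = b"line one\r\nline two\r\nline three\r\n"   # Model 33 / DEC / Windows style
unix_text = dos_text.replace(b"\r\n", b"\n")            # Model 37 / Unix style

assert unix_text == b"line one\nline two\nline three\n"
# splitlines() accepts either convention, so the logical content is the same:
assert dos_text.splitlines() == unix_text.splitlines()
```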
On Sun, Nov 24, 2013 at 10:49 AM, David Lee <dlee@c...> wrote:

> I had a recent argument/discussion with a co-worker about if it is accurate (or useful) to consider UTF8 "Text" ...

Only if an actual ^Z character appeared in the UTF-8, which would be no different from it appearing in a pure ASCII file. UTF-8 does not introduce spurious 0x1A bytes into the binary representation.
You might make a better case that UTF-16 is "not text" for such purposes.

> I suggest there is an "intent" or "desire" to categorize <gasp> "files" as "Text" or "Binary" but in reality the distinction is nearly or completely impossible to make accurately and without overlap.

Yes, it is a distinction of intention: it is text if you intend to treat it so, and binary if you don't.

> Simple test case: [...] If you cannot answer this definitively I suggest you cannot answer the general case definitively.

That is equivalent to saying all intent is subjective, but there are a variety of definitions (as always) of objective intent. The differences between murder and manslaughter (culpable homicide), between assault and accident, lie in intent. But when prosecuting people for murder, we rely on objective evidence to establish their intent; we don't just ask them.
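Both claims are easy to check in Python; U+261A (a pointing-hand symbol) is just an arbitrary character I picked because its UTF-16 encoding happens to contain an 0x1A byte.

```python
s = "plain text, no control characters"
assert 0x1A not in s.encode("utf-8")   # UTF-8 never invents a ^Z byte

# UTF-16, by contrast, can yield 0x1A bytes from perfectly ordinary
# characters: U+261A encodes in UTF-16BE as the two bytes 26 1A.
hand = "\u261a"
assert 0x1A in hand.encode("utf-16-be")
assert 0x1A not in hand.encode("utf-8")  # its UTF-8 form is E2 98 9A
```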
On Sun, Nov 24, 2013 at 11:00 AM, Dimitre Novatchev <dnovatchev@g...> wrote:
That is so: for example, the character & represents the word "and" in the English language. In other languages, it may represent the words "y" or "et" or "und" or "и". In Chinese, which is probably what you are thinking of, some characters represent whole words, such as 一, which represents the word "yī", meaning "one". But in the typical case Chinese characters represent meaningful syllables. Thus the word for China, 中国, contains two characters, both because it represents the two syllables "zhōng" and "guó", and because it represents the two semantic units "middle" and "country".
Why "middle"? Because the Chinese saw themselves as the only civilized people, in the middle of the barbarians to the north, south, east, and west, thus conceptually in the middle of the world.
Similarly, we text folk treat all non-text formats as barbarian, though there are thousands of them, all very different from one another. There is a cross-link here to another current thread, for indeed any point on the Earth's surface may be considered the middle of the world, just as any element in a network of semantic relationships may be treated as the root element, and any piece of information is the center of all knowledge, for there are conceptual links leading everywhere. Physically, we all have exactly one point of view (in the literal sense), but there are seven billion available points of view, and potentially many more. That, I think, is why human beings, being embodied intellects and not bodiless angels, seem to always insist on having a unique starting point in our conceptual networks.
> Therefore, we need to be cautious [...]

Letters are a prototypical sort of character, but by no means the only kind, and some characters are very different from letters. Yet Unicode does not hesitate to call Chinese characters "letters". Indeed, the great majority of letters in Unicode 6.3 are Chinese characters: 77421 (85%) are, and only 14104 (15%) are not.
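This classification is easy to confirm with Python's unicodedata module (bearing in mind that its data tracks whatever Unicode version the interpreter ships with, not necessarily 6.3).

```python
import unicodedata

# Unicode classifies Chinese characters as letters: their general
# category is "Lo" (Letter, other), just as "a" is "Ll" and "A" is "Lu".
assert unicodedata.category("中") == "Lo"
assert unicodedata.category("国") == "Lo"
assert unicodedata.category("a") == "Ll"
assert unicodedata.category("&") == "Po"   # the ampersand is punctuation, not a letter
```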
> And we could have languages where the "character" is something so [...]

Music and mathematics are textual, but they are not plain text: they are fancy text, because of their irreducibly two-dimensional character. Fortunately, we have a well-established invention, about a thousand years old, for reducing fancy text to plain text: it's called "markup".
Horizontal and vertical whitespace are the oldest kinds of markup, and they are so well established that we rarely think of them as such and treat them as plain text, at least up to a point. But it wasn't always so. This paragraph from Wikipedia's article on "scriptio continua" (writing without whitespace) shows how things used to be:
> Before the advent of the codex (book), Latin and Greek script was written on scrolls. Reading continuous script on a scroll was more akin to reading a musical score than reading text. The reader would typically already have memorized the text through an instructor, had memorized where the breaks were, and the reader almost always read aloud, usually to an audience in a kind of reading performance, using the text as a cue sheet. Organizing the text to make it more rapidly ingested (through punctuation) was not needed.

Indeed, punctuation was also originally a kind of markup now also taken into the text. So are symbols, though they are more akin to entity markup than to element markup. I could go on forever, but that's the nature of knowledge: see above. I've already pruned several digressions.
GMail doesn't have rotating .sigs, but you can see mine at http://www.ccil.org/~cowan/signatures
