My academic research is in the area of compression, specifically problems in prefix coding that can be considered extensions to Huffman coding (and the related Hu-Tucker algorithm), with an emphasis on utilities other than expected codeword length. Applications include low-bandwidth channels, security, search, queueing, diagnostic testing, and decision trees. I have written a lay introduction to this topic here, and a longer one here. Talk slides of my presentations and other works not available online are available upon request.

© COPYRIGHT NOTICE: Many of the documents below have been submitted by their author(s) to scholarly journals or conferences (as indicated); others contain summaries of and extensions to this work. The manuscripts are put online to facilitate non-commercial dissemination of scientific work. These manuscripts are copyrighted by the authors or the journals in which they were published. You may copy a manuscript for scholarly, non-commercial purposes, such as research or instruction, provided that you agree to respect these copyrights.
"Another Personal View of Generalized Huffman Coding": [PDF]
This note is a high-level introduction to my research, written for those who want an overview of this body of work, rather than just an abstract of one particular paper. As such, most mathematical and algorithmic details are omitted. The first part, intended for a general audience, explains the ideas behind lossless coding, specifically Huffman coding and variants thereof. The second part, intended for a more technical audience, summarizes the results of my dissertation, papers, correspondences, and extended abstracts. This note was written in September 2008 and revised in November 2013, its title inspired by that of Mordecai Golin's September 2001 talk, "A Personal View of Generalized Huffman Encoding."
Coding for General Penalties: [PDF]
Given the probability mass function of a random variable, Huffman coding finds a corresponding prefix-free binary code that minimizes the expected codeword length. However, there are many practical situations in which the constraints and goals may be different from the linear penalty of minimizing expected length. For example, we may wish to avoid long codewords in some situations, since these may cause an inordinate wait for transmission of unlikely events. Thus we may desire a non-linear penalty in the codeword lengths. Such penalties arise naturally in group testing for diagnostic systems, queueing, remote exploration, and military intelligence.
In this thesis, we examine families of coding problems with non-linear penalties. These include generalizations of Huffman coding suitable for communication and analysis, extending instances of such source coding found in the literature. Alternative penalty measures include exponential average and maximum pointwise redundancy; we find an interesting framework for solutions of these problems, a framework also encompassing expected length and other natural penalties. An efficient algorithm will be presented for general additive convex penalties, penalties of the form Σ_i f(l_i, p_i), where l_i denotes the length of the ith codeword, p_i denotes the corresponding probability, and f is convex and increasing in l_i. This includes several previously proposed penalties, among them a general problem proposed by Campbell. For Campbell's problem, the optimization may be performed using quadratic time and linear space. This algorithm may also be generalized to an even broader range of penalties. Finally, we consider infinite-alphabet coding, a problem not fully understood for even the penalty of expected length.
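As an aside, the exponential-average penalty mentioned above is known to admit a Huffman-like greedy solution in which the two smallest weights w1 and w2 are merged into a·(w1 + w2) rather than w1 + w2. A minimal Python sketch of that rule (function and variable names are mine, not from the thesis):

```python
import heapq
from itertools import count

def exponential_huffman(probs, a):
    """Code lengths minimizing log_a(sum_i p_i * a**l_i) for a > 0, a != 1,
    via the generalized Huffman rule: merge the two smallest weights
    w1, w2 into a*(w1 + w2)."""
    tie = count()                      # tiebreaker so the heap never compares lists
    heap = [(p, next(tie), [i]) for i, p in enumerate(probs)]
    heapq.heapify(heap)
    lengths = [0] * len(probs)
    while len(heap) > 1:
        w1, _, s1 = heapq.heappop(heap)
        w2, _, s2 = heapq.heappop(heap)
        for i in s1 + s2:              # every symbol under the merge gains one bit
            lengths[i] += 1
        heapq.heappush(heap, (a * (w1 + w2), next(tie), s1 + s2))
    return lengths
```

As a approaches 1 this reduces to ordinary Huffman coding; a < 1 corresponds to the single-shot transmission setting and a > 1 to the buffer-overflow setting discussed elsewhere on this page.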
"Redundancy-Related Bounds on Generalized Huffman Codes":
This paper presents new lower and upper bounds for the compression rate of binary prefix codes optimized over memoryless sources according to various nonlinear codeword length objectives. Like the most well-known redundancy bounds for minimum average redundancy coding — Huffman coding — these are in terms of a form of entropy and/or the probability of an input symbol, often the most probable one. The bounds here, some of which are tight, improve on known bounds of the form L ∈ [H,H+1), where H is some form of entropy in bits (or, in the case of redundancy objectives, 0) and L is the length objective, also in bits. The objectives explored here include exponential-average length, maximum pointwise redundancy, and exponential-average pointwise redundancy (also called dth exponential redundancy). The first of these relates to various problems involving queueing, uncertainty, and lossless communications; the second relates to problems involving Shannon coding and universal modeling. For these two objectives we also explore the related problem of the necessary and sufficient conditions for the shortest codeword of a code being a specific length.
"Alphabetic Coding with Exponential Costs": [PDF]
An alphabetic binary tree formulation applies to problems in which an outcome needs to be determined via alphabetically ordered search prior to the termination of some window of opportunity. Rather than finding a decision tree minimizing Σ_{i=1}^n w(i) l(i), this variant involves maximizing Σ_{i=1}^n w(i) a^{l(i)} for a given a ∈ (0,1). This note introduces a dynamic programming algorithm that finds the optimal solution in polynomial time and space, and shows that methods traditionally used to improve the speed of optimizations in related problems, such as the Hu-Tucker procedure, fail for this problem. This note thus also introduces two approximation algorithms which can find a suboptimal solution in linear time (for one) or O(n log n) time (for the other), with associated coding redundancy bounds.
"Optimal Prefix Codes for Infinite Alphabets with Nonlinear Costs": [PDF]
Let P = {p(i)} be a measure of strictly positive probabilities on the set of nonnegative integers. Although the countable number of inputs prevents usage of the Huffman algorithm, there are nontrivial P for which known methods find a source code that is optimal in the sense of minimizing expected codeword length. For some applications, however, a source code should instead minimize one of a family of nonlinear objective functions, β-exponential means, those of the form log_a Σ_i p(i) a^{n(i)}, where n(i) is the length of the ith codeword and a is a positive constant. Applications of such minimizations include a novel problem of maximizing the chance of message receipt in single-shot communications (a<1) and a previously known problem of minimizing the chance of buffer overflow in a queueing system (a>1). This paper introduces methods for finding codes optimal for such exponential means. One method applies to geometric distributions, while another applies to distributions with lighter tails. The latter algorithm is applied to Poisson distributions and both are extended to alphabetic codes, as well as to minimizing maximum pointwise redundancy. The aforementioned application of minimizing the chance of buffer overflow is also considered.
"Source Coding for Quasiarithmetic Penalties": [PDF]
Whereas Huffman coding finds a prefix code minimizing mean codeword length for a given finite-item probability distribution, quasiarithmetic or quasilinear coding problems have the goal of minimizing a generalized mean of the form φ^{-1}(Σ_i p_i φ(l_i)), where l_i denotes the length of the ith codeword, p_i denotes the corresponding probability, and φ is a monotonically increasing cost function. Such problems, proposed by Campbell, have a number of diverse applications. Several cost functions are shown here to yield quasiarithmetic problems with simple redundancy bounds in terms of a generalized entropy. A related property, also shown here, involves the existence of optimal codes: For "well-behaved" cost functions, optimal codes always exist for (possibly infinite-alphabet) sources having finite generalized entropy. An algorithm is introduced for finding binary codes optimal for convex cost functions. This algorithm, which can be extended to other minimization utilities, can be performed using quadratic time and linear space. This reduces the computational complexity of a problem involving minimum delay in a queue, allows combinations of previously considered problems to be optimized, and greatly expands the set of problems solvable in quadratic time and linear space.
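Campbell's generalized mean is straightforward to state in code. A small sketch (helper names are mine), showing that φ(l) = l recovers expected length while φ(l) = a^l yields the exponential average log_a Σ_i p_i a^{l_i}:

```python
import math

def quasiarithmetic_penalty(probs, lengths, phi, phi_inv):
    """Campbell's generalized mean: phi_inv(sum_i p_i * phi(l_i))."""
    return phi_inv(sum(p * phi(l) for p, l in zip(probs, lengths)))

probs, lengths = [0.5, 0.25, 0.25], [1, 2, 2]

# phi(l) = l: the ordinary expected codeword length
mean_len = quasiarithmetic_penalty(probs, lengths, lambda l: l, lambda x: x)

# phi(l) = 2**l: the base-2 exponential average
exp_len = quasiarithmetic_penalty(probs, lengths,
                                  lambda l: 2.0 ** l, math.log2)
```

Since φ(l) = 2^l is convex, Jensen's inequality guarantees the exponential average (here log2(3) ≈ 1.585) is at least the expected length (here 1.5), reflecting a heavier penalty on long codewords.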
"A General Framework for Codes Involving Redundancy Minimization": [PDF]
The original Shannon code and the optimal Huffman code are the best-known methods of prefix coding, and both guarantee codes within one bit of entropy. However, the methods by which these codes are constructed are quite different, the former method being algebraic and the latter method using a bottom-up greedy algorithm. In addition, although the Huffman code is optimal, unlike Shannon's code, it has no guarantee that a given codeword will have a length within a constant of the associated input symbol's self-information. A code which minimizes maximum pointwise redundancy is a strict improvement on Shannon's code in that no codeword in Shannon's code is shorter than a code minimizing maximum pointwise redundancy; indeed, Drmota and Szpankowski showed that such a code can be found via an O(n log n)-time generalization of Shannon coding. Such a code can also be found via a linear-time variant of Huffman coding introduced here, and this coding method establishes a continuum between minimum maximum pointwise redundancy coding and conventional Huffman coding. Codes obtained via this method are strict improvements on generalized Shannon codes and also minimize other desirable factors among codes minimizing maximum pointwise redundancy. This ties together not only Huffman and Shannon coding, but also provides a greater framework encompassing problems previously proposed by Campbell and by Nath.
Published version (modified above for clarity): A framework with two scalar parameters is introduced for various problems of finding a prefix code minimizing a coding penalty function. The framework encompasses problems previously proposed by Huffman, Campbell, Nath, and Drmota and Szpankowski, shedding light on the relationships among these problems. In particular, Nath's range of problems can be seen as bridging the minimum average redundancy problem of Huffman with the minimum maximum pointwise redundancy problem of Drmota and Szpankowski. Using this framework, two linear-time Huffman-like algorithms are devised for the minimum maximum pointwise redundancy problem, the only one in the framework not previously solved with a Huffman-like algorithm. Both algorithms provide solutions common to this problem and a subrange of Nath's problems, the second algorithm being distinguished by its ability to find the minimum variance solution among all solutions common to the minimum maximum pointwise redundancy and Nath problems. Simple redundancy bounds are also presented.
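For concreteness, the pointwise redundancy of the ith codeword is l_i + log2 p_i, the amount by which its length exceeds the symbol's self-information; the objective above is the maximum of this quantity over i. A small illustrative computation (notation mine):

```python
import math

def max_pointwise_redundancy(probs, lengths):
    """max_i (l_i + log2(p_i)): the worst-case excess of a codeword's
    length over its symbol's self-information -log2(p_i)."""
    return max(l + math.log2(p) for p, l in zip(probs, lengths))

# Shannon lengths ceil(-log2 p) keep the maximum below one bit;
# here a Huffman-style code does strictly better on the same source.
shannon = max_pointwise_redundancy([0.5, 0.3, 0.2], [1, 2, 3])
better = max_pointwise_redundancy([0.5, 0.3, 0.2], [1, 2, 2])
```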
"On the Redundancy of Huffman Codes with Exponential Objectives": [PDF]
We present new lower and upper bounds for the compression rate of binary prefix codes over memoryless sources optimized according to two related exponential codeword length objectives. The objectives explored here are exponential-average length and exponential-average redundancy. The first of these relates to various problems involving queueing, uncertainty, and lossless communications. It can be reduced to the second, which has properties more amenable to analysis. These bounds, some of which are tight, are in terms of a form of entropy and/or the probability of an input symbol, improving on recently discovered bounds of similar form. We also observe related properties of optimal codes over the exponential-average redundancy utility.
"Efficient Implementation of the Generalized Tunstall Code Generation Algorithm": [PDF]
A method is presented for constructing a Tunstall code in time linear in the number of output items. This improves on the state of the art for non-Bernoulli sources, including Markov sources, for which the prior approach used a suboptimal generalization of Tunstall's algorithm proposed by Savari and analytically examined by Tabus and Rissanen. In general, if n is the total number of output leaves across all Tunstall trees, s is the number of trees (states), and D is the number of leaves of each internal node, then this method takes O((1+(log s)/D) n) time and O(n) space.
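For background, the classic single-state (memoryless) Tunstall construction repeatedly expands the most probable leaf of the parse tree until the dictionary is full, which a heap implements in roughly O(n log n) time; a sketch is below (names mine). The linear-time construction of the paper itself is not reproduced here.

```python
import heapq
from itertools import count

def tunstall(probs, num_words):
    """Tunstall dictionary for a memoryless source: repeatedly expand the
    most probable leaf; stop when a further expansion would exceed
    num_words leaves (each expansion adds D - 1 leaves)."""
    D = len(probs)
    tie = count()
    heap = [(-1.0, next(tie), ())]     # (negated probability, tiebreak, source word)
    while len(heap) + D - 1 <= num_words:
        p, _, word = heapq.heappop(heap)
        for sym, ps in enumerate(probs):
            heapq.heappush(heap, (p * ps, next(tie), word + (sym,)))
    return sorted(word for _, _, word in heap)
```

Each dictionary entry is a source string; a Tunstall (variable-to-fixed) code then assigns every string a fixed-length output word.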
"Tight Bounds on Minimum Maximum Pointwise Redundancy": [PDF]
This paper presents new lower and upper bounds for the optimal compression of binary prefix codes in terms of the most probable input symbol, where compression efficiency is determined by the nonlinear codeword length objective of minimizing maximum pointwise redundancy. This objective relates to both universal modeling and Shannon coding, and these bounds are tight throughout the interval. The upper bounds also apply to a related objective, that of dth exponential redundancy.
"Prefix Codes for Power Laws": [PDF]
In prefix coding over an infinite alphabet, methods that consider specific distributions generally consider those that decline more quickly than a power law (e.g., a geometric distribution for Golomb coding). Particular power-law distributions, however, model many random variables encountered in practice. Estimates of expected number of bits per input symbol approximate compression performance of such random variables and can thus be used in comparing such methods. This paper introduces a family of prefix codes with an eye towards near-optimal coding of known distributions, precisely estimating compression performance for well-known probability distributions using these new codes and using previously known prefix codes. One application of these near-optimal codes is an improved representation of rational numbers.
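As a previously known point of comparison (not one of the new codes in the paper), the Elias gamma code maps each integer i ≥ 1 to 2⌊log2 i⌋ + 1 bits, which roughly tracks the self-information of the power law p(i) ∝ 1/i²:

```python
def elias_gamma(i):
    """Elias gamma codeword for i >= 1: a unary prefix giving the bit
    length, followed by the binary expansion of i minus its leading 1."""
    assert i >= 1
    b = bin(i)[2:]                 # binary expansion, leading '1' included
    return '0' * (len(b) - 1) + b
```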
"Reserved-length Prefix Coding": [PDF]
Huffman coding finds an optimal prefix code for a given probability mass function. Consider situations in which one wishes to find an optimal code with the restriction that all codewords have lengths that lie in a user-specified set of lengths (or, equivalently, no codewords have lengths that lie in a complementary set). This paper introduces a polynomial-time dynamic programming algorithm that finds optimal codes for this reserved-length prefix coding problem. This has applications to quickly encoding and decoding lossless codes. In addition, one modification of the approach solves any quasiarithmetic prefix coding problem, while another finds optimal codes restricted to the set of codes with g codeword lengths for user-specified g (e.g., g=2). For small enough g, a sublinear-time constant-space approach is even more efficient.
"D-ary Bounded-Length Huffman Coding": [PDF]
Efficient optimal prefix coding has long been accomplished via the Huffman algorithm. However, there is still room for improvement and exploration regarding variants of the Huffman problem. Length-limited Huffman coding, useful for many practical applications, is one such variant, in which codes are restricted to the set of codes in which none of the n codewords is longer than a given length, l_max. Binary length-limited coding can be done in O(n(l_max - l_min)) time and O(n) space via the widely used Package-Merge algorithm. In this paper the Package-Merge approach is generalized without increasing complexity in order to introduce a minimum codeword length, l_min, to allow for objective functions other than the minimization of expected codeword length, and to be applicable to both binary and nonbinary codes; nonbinary codes were previously addressed using a slower dynamic programming approach. These extensions have various applications — including faster decompression — and can be used to solve the problem of finding an optimal code with limited fringe, that is, finding the best code among codes with a maximum difference between the longest and shortest codewords. The previously proposed method for solving this problem was nonpolynomial time, whereas solving this using the novel algorithm requires only O(n(l_max - l_min)^2) time and O(n) space.
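The binary, expected-length core of Package-Merge can be sketched compactly in its "coin collector" form; the paper's extensions (minimum codeword length, nonbinary alphabets, other objectives) are omitted, and this unoptimized version tracks symbol multisets explicitly:

```python
from collections import Counter

def package_merge(probs, lmax):
    """Binary length-limited code lengths via Package-Merge: minimize
    sum_i p_i * l_i subject to l_i <= lmax.  Each symbol contributes one
    'coin' per level; pairs are packaged and merged level by level, and
    the cheapest n - 1 final packages determine the lengths."""
    n = len(probs)
    assert 2 ** lmax >= n, "no prefix code fits within lmax"
    coins = [(p, Counter({i: 1})) for i, p in enumerate(probs)]
    items = []
    for _ in range(lmax):                      # levels lmax down to 1
        items = sorted(coins + items, key=lambda t: t[0])
        items = [(a[0] + b[0], a[1] + b[1])    # package adjacent pairs
                 for a, b in zip(items[::2], items[1::2])]
    counts = sum((c for _, c in items[:n - 1]), Counter())
    return [counts[i] for i in range(n)]
```

With a loose enough limit the result coincides with unconstrained Huffman lengths; with a tight limit the longest codewords are shortened at the expense of the shorter ones.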
"Infinite-Alphabet Prefix Codes Optimal for β-Exponential Penalties": [PDF]
Let P = {p(i)} be a measure of strictly positive probabilities on the set of nonnegative integers. Although the countable number of inputs prevents usage of the Huffman algorithm, there are nontrivial P for which known methods find a source code that is optimal in the sense of minimizing expected codeword length. For some applications, however, a source code should instead minimize one of a family of nonlinear objective functions, β-exponential means, those of the form log_a Σ_i p(i) a^{n(i)}, where n(i) is the length of the ith codeword and a is a positive constant. Applications of such minimizations include a problem of maximizing the chance of message receipt in single-shot communications (a<1) and a problem of minimizing the chance of buffer overflow in a queueing system (a>1). This paper introduces methods for finding codes optimal for such exponential means. One method applies to geometric distributions, while another applies to distributions with lighter tails. The latter algorithm is applied to Poisson distributions. Both are extended to minimizing maximum pointwise redundancy.
"On Conditional Branches in Optimal Decision Trees": [PDF]
The decision tree is one of the most fundamental programming abstractions. A commonly used type of decision tree is the alphabetic binary tree, which uses (without loss of generality) "less than" versus "greater than or equal to" tests in order to determine one of n outcome events. The process of finding an optimal alphabetic binary tree for a known probability distribution on outcome events usually has the underlying assumption that the cost (time) per decision is uniform and thus independent of the outcome of the decision. This assumption, however, is incorrect in the case of software to be optimized for a given microprocessor, e.g., in compiling switch statements or in fine-tuning program bottlenecks. The operation of the microprocessor generally means that the cost for the more likely decision outcome can or will be less — often far less — than the less likely decision outcome. Here we formulate a variety of O(n^3)-time O(n^2)-space dynamic programming algorithms to solve such optimal binary decision tree problems, optimizing for the behavior of processors with predictive branch capabilities, both static and dynamic. In the static case, we use existing results to arrive at entropy-based performance bounds. Solutions to this formulation are often faster in practice than "optimal" decision trees as formulated in the literature, and, for small problems, are easily worth the extra complexity in finding the better solution. This can be applied in fast implementation of decoding Huffman codes.
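A simplified illustration of the static-prediction setting: assume each comparison costs c_hit when the (statically predicted) more probable branch is taken and c_miss otherwise, with c_hit ≤ c_miss, and predict the heavier side at every node. An O(n^3)-time interval dynamic program in the spirit of (but much less general than) the formulations in the paper:

```python
from functools import lru_cache

def optimal_tree_cost(probs, c_hit, c_miss):
    """Minimum expected cost of an alphabetic decision tree when the
    statically predicted (more probable) branch costs c_hit per visit
    and the mispredicted branch costs c_miss (c_hit <= c_miss)."""
    prefix = [0.0]
    for p in probs:                    # prefix sums for O(1) interval weights
        prefix.append(prefix[-1] + p)

    @lru_cache(maxsize=None)
    def cost(i, j):                    # optimal cost over outcomes i..j-1
        if j - i <= 1:
            return 0.0
        best = float('inf')
        for k in range(i + 1, j):      # root test: "x < k"?
            wl = prefix[k] - prefix[i]
            wr = prefix[j] - prefix[k]
            node = c_hit * max(wl, wr) + c_miss * min(wl, wr)
            best = min(best, node + cost(i, k) + cost(k, j))
        return best

    return cost(0, len(probs))
```

With c_hit = c_miss this degenerates to the classical uniform-cost alphabetic tree problem.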
"Rényi to Rényi — Source Coding under Siege": [PDF]
A novel lossless source coding paradigm applies to problems of unreliable lossless channels with low bitrate, in which a vital message needs to be transmitted prior to termination of communications. This paradigm can be applied to Alfréd Rényi's secondhand account of an ancient siege in which a spy was sent to scout the enemy but was captured. After escaping, the spy returned to his base in no condition to speak and unable to write. His commander asked him questions that he could answer by nodding or shaking his head, and the fortress was defended with this information. Rényi told this story with reference to traditional lossless source coding, in which the objective is minimization of expected codeword length. The goal of maximizing probability of survival in the siege scenario is distinct from yet related to this traditional objective. Rather than finding a code minimizing expected codeword length Σ_i p(i) l(i), this variant involves maximizing Σ_i p(i) θ^{l(i)} for a known θ ∈ (0,1). When there are no restrictions on codewords, this problem can be solved using a known generalization of Huffman coding. The optimal solution has coding bounds which are functions of Rényi entropy; in addition to known bounds, new bounds are derived here. The alphabetically constrained version of this problem has applications in search trees and diagnostic testing. A novel dynamic programming algorithm — based upon the oldest known algorithm for the traditional alphabetic problem — optimizes this problem in O(n^3) time and O(n^2) space, whereas two novel approximation algorithms can find a suboptimal solution faster: one in linear time, the other in O(n log n). Coding bounds for the alphabetic version of this problem are also presented.
"Source Coding for General Penalties": [PDF] [published one-page abstract]
Given a probability vector, Huffman coding finds a corresponding prefix-free binary code that minimizes the expected codeword length. In this paper we explore situations in which the goals are different from those in Huffman coding. We generalize an efficient algorithm for finding length-limited codes to an efficient algorithm for finding optimal codes for any of a family of penalties proposed by Campbell. This algorithm, which also applies to a larger family of problems, may be performed using quadratic time and linear space.
"Designing Audio Aura": [PDF]
In this paper, we describe the process behind the design of Audio Aura. The goal of Audio Aura is to provide serendipitous information, via background auditory cues, that is tied to people's physical actions in the workplace. We used scenarios to explore issues in serendipitous information such as privacy and work practice. Our sound design was guided by a number of strategies for creating peripheral sounds grouped in cohesive ecologies. Faced with a physical and software infrastructure under development in a laboratory distant from our sound studio, we prototyped different sonic landscapes in VRML worlds. In our infrastructure design, we made a number of trade-offs in our use of legacy systems and our client-server design.
"On Conditional Branches in Optimal Search Trees":
Decision trees are one of the most fundamental programming abstractions, and a commonly used type of decision tree is the alphabetic binary tree, which uses (without loss of generality) "less than" versus "greater than or equal to" tests in order to determine one of n outcome events. The process of finding an optimal alphabetic binary tree for a known probability distribution on outcome events usually has the underlying assumption that the cost per decision is uniform and thus independent of the outcome of the decision. Algorithms without this assumption generally use one cost if the decision result is "less than" and another cost if it is "greater than." In practice, neither assumption is accurate for software to be optimized for a given microprocessor. The operation of the microprocessor generally means that the cost for the more likely decision outcome can or will be less — often far less — than the less likely decision outcome. This problem and generalizations thereof are applicable to hard coding static decision tree instances in software, e.g., for compiling switch statements or for fine-tuning program bottlenecks. O(n^3)-time O(n^2)-space dynamic programming algorithms can solve such an optimal binary decision tree problem, optimizing for the behavior of processors with predictive branch capabilities, both static and dynamic. Solutions to this formulation are often faster in practice than "optimal" decision trees as formulated in the literature. Different decision paradigms can sometimes yield even better performance.
"Twenty (or so) Questions":
The game of Twenty Questions has long been used to illustrate binary source coding. Recently, a physical device has been developed that mimics the process of playing Twenty Questions, with the device supplying the questions and the user providing the answers. However, this game differs from Twenty Questions in two ways: answers need not be only "yes" and "no," and the device asks at least 20 and at most 25 questions. The nonbinary variation on source coding is well known and understood, but not with such bounds on codeword length. An upper bound on the related property of fringe, the difference between the lengths of the longest and the shortest codewords, has been considered, but no polynomial-time algorithm currently finds optimal fringe-limited codes. An O(n(l_max - l_min))-time O(n)-space Package-Merge-based algorithm is presented here for D-ary (binary and nonbinary) source coding with all n codeword lengths (numbers of questions) bounded to be within the interval [l_min, l_max]. This algorithm minimizes average codeword length or, more generally, any other quasiarithmetic convex coding penalty. In the case of minimizing average codeword length, time complexity can often be improved via an alternative graph-based reduction. This has, as a special case, a method for nonbinary length-limited Huffman coding, which was previously solved via dynamic programming with O(n^2 l_max log D) time and O(n^2 log D) space. These algorithms can also be used to efficiently find a code that is optimal given a limit on fringe.
Abstract for Hong Kong / Cyprus talks: [announcement]
Since Shannon introduced information theory, we have had entropy bounds for the expected codeword length of optimal lossless fixed-to-variable-length ("Huffman") binary codes. The lower bound is entropy, while the upper bound is one bit greater, resulting in unit-sized bounds. Since then, tighter bounds have been developed, bounds which use both entropy and the probability of the most probable input symbol. I will present new lower and upper bounds for codes optimal according to various nonlinear codeword length objectives, also in terms of a form of entropy and/or the probability of the most probable input symbol. These bounds improve on known unit-sized bounds for these nonlinear objectives. The objectives I explore include exponential-average length, maximum pointwise redundancy, and exponential-average pointwise redundancy (also called dth exponential redundancy). These relate to queueing and single-shot communications, Shannon coding and universal modeling (worst-case minimax redundancy), and bridging the maximum pointwise redundancy problem with Huffman coding, respectively. A generalized form of Huffman coding known to find optimal codes for these objectives helps yield these bounds, some of which are tight. Related properties to such bounds are the necessary and sufficient conditions for the shortest codeword being a specific length.
Google talk: [video]
Decision trees are one of the most fundamental programming abstractions, and a commonly used type of decision tree is the alphabetic binary tree, which uses (without loss of generality) "less than"/"greater than or equal to" tests in order to determine an outcome event. The process of finding an optimal alphabetic binary tree for a known probability distribution on outcome events usually has the underlying assumption that the cost per decision is uniform and thus independent of the outcome of the decision. Algorithms without this assumption generally use one cost if the decision result is "less than" and another cost if it is "greater than." In practice, neither assumption is accurate for software to be optimized for a given microprocessor. The operation of the microprocessor generally means that the cost for the more likely decision outcome will be less — often far less — than the less likely decision outcome. This problem and generalizations thereof are applicable to hard coding static decision tree instances in software, e.g., for compiling switch statements or for fine-tuning program bottlenecks. An O(n^3)-time O(n^2)-space dynamic programming algorithm can solve this optimal binary decision tree problem, and this approach has many generalizations that optimize for the behavior of processors with predictive branch capabilities, both static and dynamic. Solutions to this formulation are often faster in practice than "optimal" decision trees as formulated in the literature. Different decision paradigms can sometimes yield even better performance.
USC talk:
Huffman coding finds an optimal binary code for a random variable with a known probability mass function, in the sense that optimality is achieved by a code with minimum expected codeword length. However, there are many practical situations in which the constraints and costs are different from this strictly linear case. When delay is important, network transmission time may involve higher-order statistics. When a message is crucial, we may wish to maximize the probability of successful receipt.
In this talk, source coding problems with nonlinear costs will be examined. Many of these problems can be solved using algorithmic extensions of Huffman coding or limited length Huffman coding. These include the use of Golomb codes and other codes with a unary component for nonlinear costs over infinite alphabets. Information-theoretical extensions of source coding theory reveal properties of the resulting codes.
UCLA talk: [announcement]
Huffman coding finds an optimal binary code for a random variable with a known probability mass function, in the sense that optimality is achieved by a code with minimum expected codeword length. However, there are many practical situations in which the constraints and costs are different from this strictly linear case. When delay is important, network transmission time may involve higher-order statistics. When a message is crucial, we may wish to maximize the probability of successful receipt.
Source coding problems with nonlinear penalties will be examined. Many of these problems can be solved using extensions of Huffman coding or limited length Huffman coding; these extensions improve upon previous algorithms. Examination of the properties of optimal codes with nonlinear costs yields several analogues to the relation between Huffman coding and Shannon entropy.
CalTech talk: [announcement]
Given the probability mass function of a random variable, Huffman's well-known algorithm finds a corresponding prefix-free binary code that minimizes the expected codeword length. However, there are many practical situations in which the constraints and goals are different from those of Huffman coding. Minimization of expected length, a linear penalty, is not always called for. For example, we may wish to avoid long codewords in some situations, at the price of using additional bits for shorter ones, since the longer ones may cause an inordinate wait for the transmission of unlikely events. Nonlinear penalties arise naturally in group testing for diagnostic systems, queueing, and military intelligence.
Families of coding problems with nonlinear penalties will be examined; these may be considered generalizations of Huffman coding. These generalizations include alternative penalty measures such as exponential average and maximum pointwise redundancy. A framework encompassing these and other natural penalties will be introduced. In addition, an efficient algorithm will be presented for a general convex problem proposed by Campbell, an algorithm with quadratic time and linear space complexity, extending and improving upon previous results. We will also explore properties and applications of nonlinear penalties.
© 1998-2016 Michael B. Baer, 1998-2003 Stanford University, calbear @ stanfordalumni·org