librsync  2.0.2
TODO.md
* Fix symbol names:

  * Rename all symbols that are intended to be private to `rs__`

  * Rename those that don't match either prefix.

* We have a few functions to do with reading a netint, stashing
  it somewhere, then moving into a different state. Is it worth
  writing generic functions for that, or would it be too confusing?

* Duplicate block handling. Currently duplicate blocks are included in
  the signature, but we only put the first duplicate block in the
  hashtable, so the delta only includes references to the first block.
  This can result in sub-optimal copy commands, breaking single large
  copies with duplicate blocks into multiple copies referencing the
  earlier copy of the block. However, this could also make patching use
  the disk cache more effectively. This solution is probably fine,
  particularly given how small copy instructions are, but there might be
  solutions for improving copy commands for long runs of duplicate blocks.

* Optimisations and code cleanups:

  scoop.c: Scoop needs a major refactor. Perhaps the API needs
  tweaking?

  rsync.h: should rs_buffers_s and rs_buffers_t be a single typedef?

  * Just how useful is rs_job_drive anyway?

  mdfour.c: This code has a different API from the RSA code in libmd
  and is coupled with librsync in unhealthy ways (trace?). Recommend
  changing to the RSA API?

* Don't use the rs_buffers_t structure.

  There's something confusing about the existence of this structure.
  In part it may be the name. I think people expect that it will be
  something that behaves like a FILE* or C++ stream, and it really
  does not. Also, the structure does not behave as an object: it's
  really just a shorthand for passing values in to the encoding
  routines, and so does not have a lot of identity of its own.

  An alternative might be

      result = rs_job_iter(job,
                           in_buf, &in_len, in_is_ending,
                           out_buf, &out_len);

  where we update the length parameters on return to show how much we
  really consumed and produced.

  One technicality here will be to restructure the code so that the
  input buffers are passed down to the scoop/tube functions that need
  them, which are relatively deeply embedded. I guess we could just
  stick them into the job structure, which is becoming a kind of
  catch-all "environment" for poor C programmers.

* Meta-programming

  * Plot lengths of each function

  * Some kind of statistics on delta each day

* Encoding format

  * Include a version in the signature and difference fields

  * Remember to update them if we ever ship a buggy version (nah!) so
    that other parties can know not to trust the encoded data.

* abstract encoding

  In fact, we can vary on several different variables:

  * what signature format are we using?

  * what command protocol are we using?

  * what search algorithm are we using?

  * what implementation version are we?

  Some are more likely to change than others. We need a chart
  showing which source files depend on which variable.

* Encoding implementation

  * Join up signature commands

* Encoding algorithm

  * Self-referential copy commands

    Suppose we have a file with repeating blocks. The gdiff format
    allows for COPY commands to extend into the *output* file so that
    they can easily point this out. By doing this, they get
    compression as well as differencing.

    It'd be pretty simple to implement this, I think: as we produce
    output, we'd also generate checksums (using the search block
    size), and add them to the sum set. Then matches will fall out
    automatically, although we might have to specially allow for
    short blocks.

    However, I don't see many files which have repeated 1kB chunks,
    so I don't know if it would be worthwhile.

  * Extended files

    Suppose the new file just has data added to the end. At the
    moment, we'll match everything but the last block of the old
    file. It won't match, because at the moment the search block
    size is only reduced at the end of the *new* file. This is a
    little inefficient, because ideally we'd know to look for the
    last block using the shortened length.

    This is a little hard to implement, though perhaps not
    impossible. The current rolling search algorithm can only look
    for one block size at any time. Can we do better? Can we look
    for all block lengths that could match anything?

    Remember also that at the moment we don't send the block length
    in the signature; it's implied by the length of the new block
    that it matches. This is kind of cute, and importantly helps
    reduce the length of the signature.

  * State-machine searching

    Building a state machine from a regular expression is a brilliant
    idea. (I think *The Practice of Programming* walks through the
    construction of this at a fairly simple level.)

    In particular, we can search for any of a large number of
    alternatives in a very efficient way, with much less effort than
    it would take to search for each one the hard way. Remember also
    the string-searching algorithms and how much time they can take.

    I wonder if we can use similar principles here rather than the
    current simple rolling-sum mechanism? Could it let us match
    variable-length signatures?

* Support gzip compression of the difference stream. Does this
  belong here, or should it be in the client, with librsync just
  providing an interface that lets it cleanly plug in?

  I think if we're going to do just plain gzip, rather than
  rsync-gzip, then it might as well be external.

* rsync-gzip: preload with the omitted text so as to get better
  compression. Abo thinks this gets significantly better
  compression. On the other hand, we would have to import and maintain
  our own zlib fork, at least until we can persuade the upstream to
  take the necessary patch. Can that be done?

  abo says

    It does get better compression, but at a price. I actually
    think that getting the code to a point where a feature like
    this can be easily added or removed is more important than the
    feature itself. Having generic pre- and post-processing layers
    for hit/miss data would be useful. I would not like to see it
    added at all if it tangled and complicated the code.

    It also doesn't require a modified zlib... pysync uses the
    standard zlib to do it by compressing the data, then throwing
    it away. I don't know how much benefit the rsync modifications
    to zlib actually bring, but if I was implementing it I would
    stick to a stock zlib until it proved significantly better to
    go with the fork.

* Licensing

  Will the GNU Lesser GPL work? Specifically, will it be a problem
  in distributing this with Mozilla or Apache?

* Checksums

  * Do we really need to require that signatures arrive after the
    data they describe? Does it make sense in HTTP to resume an
    interrupted transfer?

    I hope we can do this. If we can't, however, then we should
    relax this constraint and allow signatures to arrive before the
    data they describe. (Really? Do we care?)

  * Allow variable-length checksums in the signature; the signature
    will have to describe the length of the sums, and we must compare
    them taking this into account.

* Testing

  * Just more testing in general.

  * Test broken pipes and that IO errors are handled properly.

  * Test files >2GB, >4GB. Presumably these must be done in streams
    so that the disk requirements to run the test suite are not too
    ridiculous. I wonder if it will take too long to run these
    tests? Probably, but perhaps we can afford to run just one
    carefully-chosen test.

  * Fuzz instruction streams. <https://code.google.com/p/american-fuzzy-lop/>?

  * Generate random data; do random mutations.

  * Try different block lengths.

  * Tests should fail if they can't find their inputs, or have zero
    inputs: at present they tend to succeed by default.

  * Test varying strong-sum inputs: default, short, long.

* Security audit

  * If this code was to read differences or sums from random machines
    on the network, then it's a security boundary. Make sure that
    corrupt input data can't make the program crash or misbehave.

* Long files

  * How do we handle the large signatures required to support large
    files? In particular, how do we choose an appropriate block size
    when the length is unknown? Perhaps we should allow a way for
    the signature to scale up as it grows.

* Perhaps make extracted signatures still be wrapped in commands.
  What would this lead to?

  * We'd know how much signature data we expect to read, rather than
    requiring it to be terminated by the caller.