Noise out of proportion, and random things out of context

I spoke to some of my colleagues back in London and was reliably informed (again) that the ratio of idealists to pragmatists amongst them is in fact lower than I made out in my last post. We did agree, though, that it’s the idealists making most of the noise.

New things from last night’s conversation, to be investigated later:

  • Domain-specific languages. This (briefly and inaccurately) is what happens if you notice that something is changing a lot in code, and move those changes to a configuration file. The file itself is then a form of language. (A better definition can be found here, and there’s a tiny illustrative sketch after this list.)
  • The database problem: part 1. If you’ve got several client versions in production, the server database has to be backwards compatible. This makes it hard to change, which can make the domain model hard to change too. What solutions are there for getting round this problem?
  • The database problem: part 2. A grid which relies on a central database can’t scale as well as we’d like. The database forms a bottleneck. What solutions are there for getting round this problem? (And what kind of things require a central database?)
  • Zen and the Art of Motorcycle Maintenance. I only got halfway through this last time I tried, about three years ago. Now I know a little more about philosophy and the twisted nature of the human mind. Time to try again?
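
To make the “configuration file as a language” idea concrete, here’s a toy Java sketch of the sort of thing that emerges: frequently-changing pricing rules pulled out of code into a little text format that the code interprets. The rule syntax, file name and class names are all invented purely for illustration.

```java
// Toy illustration of a configuration file turning into a small language:
// discount rules that used to change frequently in code are parsed and
// interpreted at runtime. Everything here is invented for illustration.
import java.util.ArrayList;
import java.util.List;

public class DiscountRules {

    record Rule(String category, double percentOff) {}

    // Each line of the little "DSL" looks like:  books -> 10%
    static List<Rule> parse(List<String> lines) {
        List<Rule> rules = new ArrayList<>();
        for (String line : lines) {
            if (line.isBlank() || line.startsWith("#")) continue; // comments allowed
            String[] parts = line.split("->");
            String category = parts[0].trim();
            double percent = Double.parseDouble(parts[1].trim().replace("%", ""));
            rules.add(new Rule(category, percent));
        }
        return rules;
    }

    public static void main(String[] args) {
        // In real life these lines would come from a file such as discounts.rules.
        List<Rule> rules = parse(List.of(
                "# seasonal promotions",
                "books -> 10%",
                "air conditioners -> 5%"));
        System.out.println(rules);
    }
}
```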

24 Responses to Noise out of proportion, and random things out of context

  1. anonymous says:

    Domain specific languages are not limited to separate “config” files. Embedded domain specific languages are DSLs defined in terms of a host language. This is common practice in LISP, Smalltalk, Forth, Haskell and other expressive languages. Refactoring in these languages tends to drive the domain model towards a domain specific language. It is even possible to write Embedded DSLs in “enterprise” languages like Java or C# with a bit of ingenuity, but (like everything) it’s much harder to do so in these languages.
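
    As a rough illustration of the embedded flavour, here’s a minimal sketch of what an embedded DSL might look like in Java – nothing more than a fluent interface over ordinary domain objects, so the “language” lives inside the host language. All class and method names below are invented for the example.

```java
// A tiny embedded-DSL sketch in Java: the "language" is a fluent interface
// over plain domain objects rather than a separate config file.
// All names here are invented for illustration.
import java.util.ArrayList;
import java.util.List;

public class OrderDsl {

    public static Order order() {
        return new Order();
    }

    public static class Order {
        private final List<String> items = new ArrayList<>();
        private String destination;

        public Order item(String name) {
            items.add(name);
            return this; // returning 'this' is what makes it read like a sentence
        }

        public Order shipTo(String destination) {
            this.destination = destination;
            return this;
        }

        @Override
        public String toString() {
            return items + " -> " + destination;
        }
    }

    public static void main(String[] args) {
        // Reads almost the way the domain expert would phrase it:
        Order o = order().item("air conditioner").item("rubber duck").shipTo("London");
        System.out.println(o);
    }
}
```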

    Zen and the Art of Motorcycle Maintenance is the best book about software development that one can buy.

    –Nat.

  2. sirenian says:

    I did say briefly and inaccurately. 🙂 But that was how it was explained to me. I shall have to finish my Umberto Eco quickly.

  3. entropyjim says:

    Database part 2.

    How much does the grid rely on the database? How fluid is the data? Could a local cache reduce the number of times that the database must be queried? Could a cluster of databases (syncing with each other) reduce the load?
    Do the clients contact the database on demand, or do they hold open a connection from when they are instantiated?

  4. sirenian says:

    We’ve been thinking about it in terms of a retail outlet, here. People buy things from one store and want to refund them in another, or buy things and then decide they don’t want them any more. The prices, stock levels and transactions are all stored in the database, and any purchase or refund will affect them all. Warehouses that stock orderable items must also be informed whenever something that needs shipping is purchased.

    So no; a local cache wouldn’t work in this instance, because the data has to be consistent across the grid. Syncing databases would just mean hitting them all instead of just hitting one – it would be easier to read, but not easier to write.

    As for whether the clients hold the connection or reconnect, I guess the answer is “whichever proves to be quickest and most reliable”.

    Mostly I’m looking at it from a theoretical point of view – how can we make data available for all nodes of a grid, but only stored on some of them, with synchronised data updates (ie: if five nodes have a copy and one updates the data, the other four nodes get the changes too), and still maintain rapid performance?

  5. entropyjim says:

    Well, you could have a number of DBs associated with certain branches (so that a given branch always looks at a given DB) and sync changes between the DBs over time – in the application outlined above the data does not need to be _exactly correct_ 100% of the time. You could force syncs before the warehouse does orders, force syncs at the end of the working day, etc, etc.

    Ideally you need to speak to someone at Tesco or Dell. Both companies have a reputation for good IT systems. Tesco especially uses its ‘just in time’ ordering to minimise stock in the shops but keep stock levels from running out.

    I’d also talk to a DB expert about clustering. I believe a cluster of DBs could solve the problem of availability, but you’d have to speak to someone with more experience about example configurations – Bagnall perhaps?

  6. sirenian says:

    Yes, I’ll have a chat with him. I’ve got a couple of ideas but they’re a bit wild, so I’m looking at the interest groups here to see if anyone wants a mad chat.

    The trouble with not having 100% correct data is that pretty quickly someone will take advantage of it by, eg: getting a cash refund for an ordered item from three stores at the same time, or they’ll suffer the consequences of conflicts when two people both order the last air conditioner in the warehouse on that hot summer’s day. Having spent the last couple of months on support, dealing with the consequences of dodgy data on a single database, I’m not convinced that it’s a good idea to have several conflicting versions.

    My current thought is that the data could be held as in-memory objects, with a few nodes having copies and each piece of data also containing the id of each node that has it, so that updates could be propagated and removed from the job stack once it’s all done. Priorities could be used to ensure that updates are done in order, and data could be persisted during a node’s “idle time” if required. The only problem then is that all the nodes will be hit if a piece of data is sought that doesn’t exist, but since each node would have a partial set of data it could be a quicker search than it would be on a full set. Nodes could also develop a preference for a particular type of data, so that they can say “no” more quickly if the type of data isn’t in their list. And, just to make it more robust, you could have say five copies of data, and if it gets corrupted for any reason the nodes could ‘vote’ on the correct version.
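
    Something like this toy sketch, perhaps – Node, DataEntry and the propagation mechanics are all invented for illustration, and failure handling, priorities and persistence are left out:

```java
// Rough sketch of "in-memory objects with a few nodes holding copies": each
// entry knows which nodes hold a replica, and an update is pushed to all of
// them before the job would be removed from the job stack.
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

class Node {
    final String id;
    final Map<String, String> localData = new HashMap<>();

    Node(String id) { this.id = id; }

    void store(String key, String value) { localData.put(key, value); }
}

class DataEntry {
    final String key;
    // Every copy of the entry knows which nodes hold a replica.
    final Set<Node> holders = new HashSet<>();

    DataEntry(String key) { this.key = key; }

    // Propagate an update to every node that holds a copy.
    void update(String newValue) {
        for (Node node : holders) {
            node.store(key, newValue);
        }
    }
}

public class GridSketch {
    public static void main(String[] args) {
        Node a = new Node("node-a"), b = new Node("node-b"), c = new Node("node-c");
        DataEntry stockLevel = new DataEntry("stock:air-conditioner");
        stockLevel.holders.add(a);
        stockLevel.holders.add(b);
        stockLevel.holders.add(c);
        stockLevel.update("42");           // all three replicas now agree
        System.out.println(a.localData);   // {stock:air-conditioner=42}
    }
}
```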

    There’s more, but it’s just a wild idea at the moment. I’d love to have time to try it out, but I don’t. Too busy doing JBehave and poetry. Bother.

  7. entropyjim says:

    “The trouble with not having 100% correct data is that pretty quickly someone will take advantage of it by, eg: getting a cash refund for an ordered item from three stores at the same time”

    This is a special case. The system has to be designed with it in mind, but it won’t be dealing with refunds anywhere near as much as it will be dealing with simple sales. Perhaps you have a ‘master DB’ that applies changes more regularly to deal with special cases?

    This is why you need to talk to someone who has worked on such a system to ask how they dealt with their special cases. In general you need to concentrate on the MANY routine transactions without ruling out dealing with special cases. If you can design a robust system that accounts for _everything_ first time you should be a millionaire 😉

    “or they’ll suffer the consequences of conflicts when two people both order the last air conditioner in the warehouse on that hot summer’s day. Having spent the last couple of months on support, dealing with the consequences of dodgy data on a single database, I’m not convinced that it’s a good idea to have several conflicting versions.”

    Again, ordering things might be a special (or not so special) case. You need to know how the shop ordering system works. Do they ask the warehouse if there is one before they order? Do they look in a local depot? If the latter you could have a DB per depot. If the former perhaps you’re forced to have one DB and live with it?

    At the end of the day, unless there is a DAMN good reason for it, you want to try to stick to one DB. You are saying that you can’t, so you’re forcing yourself to have some kind of compromise between data integrity and access. It may be that any given DB applies certain kinds of updates before allowing a purchase.

    Before reinventing the wheel I’d suggest you ask someone that’s done it – it sounds like a problem that must have been solved a number of times already.

  8. sirenian says:

    It’s more a case of inventing something because, once invented, possibilities open up; and because the kind of things we might find out we can do in the process of inventing it should be entertaining. It’s certainly worth having a discussion about.

    AFAIK, no one has invented it yet. But that’s why I’ve flagged it; because I want to have that discussion with some knowledgeable bods.

  9. anonymous says:

    If by ‘in production’ you mean live:
    We tend to wrap our (Oracle) database tables in package functions. Nothing outside of the database touches the tables. It occurs to me that if you followed the same principle then you could change the underlying structure without changing the interface for the earlier versions. If you need differing interfaces for the different versions, you could supply the version number (or a less granular ‘compatibility mode’) as a parameter to a configuration procedure when you first connect to the database, and then dynamically expose the procedures you need for that version.
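
    As a sketch of how a Java client might declare that compatibility mode when it connects – the package/procedure name app_config.set_compatibility and the connection details are invented, while the JDBC calls themselves are standard:

```java
// Sketch of a client declaring its compatibility mode on connect, so the
// database packages behind the interface can behave appropriately for that
// client version. The procedure name is invented; the JDBC API is standard.
import java.sql.CallableStatement;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;

public class CompatibilityConnect {

    static Connection connect(String url, String user, String password, int clientVersion)
            throws SQLException {
        Connection conn = DriverManager.getConnection(url, user, password);
        // Tell the database which client version is talking to it.
        try (CallableStatement call =
                     conn.prepareCall("{ call app_config.set_compatibility(?) }")) {
            call.setInt(1, clientVersion);
            call.execute();
        }
        return conn;
    }

    public static void main(String[] args) throws SQLException {
        // Hypothetical connection details, for illustration only.
        Connection conn = connect("jdbc:oracle:thin:@//dbhost:1521/retail", "app", "secret", 3);
        conn.close();
    }
}
```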

    If by ‘in production’ you mean currently being produced (in development):
    Bring the database into the project fully; treat it as you would any other piece of the application. Migrate data as you need the data to change. I’ve had many thoughts on Oracle being used in an XP environment (mostly implementation issues), and you can find them here: Bobablog

    If I’m stating the obvious then feel free to shoot me!

  10. sirenian says:

    I mean live.

    I guess the problem we have is that you have to do the packaging _before_ you start putting the code in. We’ve inherited a legacy system without that kind of structure.

    It isn’t the functionality that’s a problem as much as the data migration. I need to go away and learn more about databases.

  11. anonymous says:

    You’re not being very creative in looking for solutions. There are many ways to get around the bottleneck without sacrificing integrity. Some common techniques include:

    – Partition your data across multiple databases by type. Sounds weird but it can work outrageously well. So you could have product info here, inventory info over there, orders in a 3rd place, etc. You might believe that this requires two-phase commit (2PC) to work, but if you are wicked smart about your partitioning you can live without it. Think about it – do you really need _all_ your data on a single database?

    – Partition your data across multiple databases by value. At the most trivial level, an example would be 26 databases labeled A-Z. All people with a last name starting with A go to the A database, B’s go to the B database, etc. Again this sounds weird but in some domains this is a very common solution that can work surprisingly well. If you can’t wrap your mind around this concept, try thinking geographically for another example. It’s common for international companies to have separate databases for, say, London, Hong Kong, NY, LA, etc. This is just another form of partitioning data by value, but in this case the partitioning is done by location/office. (There’s a rough routing sketch at the end of this comment.)

    – Distributed, partitioned caching technologies and related clustering technologies. In a nutshell, use many machines in a cache configuration with 2 or more copies of any given piece of data on separate machines, and stick this giant cache in front of the database. The app only ever uses the cache. Since data is kept on multiple machines, you lazily write out to the database when the cluster has free cycles. This may also sound crazy until you realize that 10 relatively cheap machines could cache 40-50GB of data with read and write speeds much faster than any RDBMS could handle. To scale for size or speed just add machines (assuming that your clustering technology scales close to linearly).

    – Use database replicas intelligently. Determine who needs 100% correct “realtime” access and who doesn’t need that level of guarantees. Push the ones who don’t need the guarantees off to replicated databases.

    – Since a grid was mentioned – distinguish between data which really needs to be persistently saved and data that is transient. Often you can generate a surprising amount of such transient data and just hold it in memory (and maybe throw it into a distributed cache) – so long as you’re willing to take a potential hit on grid startup.

    – Use software/hardware combinations that use 2PC very efficiently and intelligently and replicate data in a transactionally correct manner (but much faster than you might think 2PC could handle). This may sound crazy until you see IBM mainframe setups that do this, and do this blazingly fast.

    – Buy a really, really honking gigantic machine with a bazillion CPUs, scads of disks, and GB ethernet connections to your grid and stick your database on that. 🙂

    The bottom line: you can get 100% correct data without having a single-database bottleneck.
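
    To make the partition-by-value option above concrete, here’s a minimal routing sketch in Java – the surname rule and the connection URLs are invented for illustration:

```java
// Minimal sketch of partitioning by value: route each customer to one of
// several databases keyed on the first letter of their surname. The JDBC
// URLs and the surname rule are invented; real systems would hold
// connection pools rather than bare URL strings.
import java.util.Map;
import java.util.TreeMap;

public class PartitionByValue {

    private static final Map<Character, String> PARTITIONS = new TreeMap<>();
    static {
        PARTITIONS.put('A', "jdbc:postgresql://db-a.example/customers");
        PARTITIONS.put('B', "jdbc:postgresql://db-b.example/customers");
        // ... one entry per letter, or per region/office in the geographic variant
        PARTITIONS.put('Z', "jdbc:postgresql://db-z.example/customers");
    }

    static String databaseFor(String surname) {
        char bucket = Character.toUpperCase(surname.charAt(0));
        String url = PARTITIONS.get(bucket);
        if (url == null) {
            throw new IllegalArgumentException("No partition configured for " + bucket);
        }
        return url;
    }

    public static void main(String[] args) {
        System.out.println(databaseFor("Aragorn")); // -> the A database
        System.out.println(databaseFor("Boromir")); // -> the B database
    }
}
```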

  12. anonymous says:

    This is easy to achieve conceptually (but a PITA to do in practice). Just follow these rules:

    – if you use views or stored procedures, stick a version number on the name. So you don’t hit view ‘trade_view’, you’d hit ‘trade_view_v11’. This allows multiple SPs and views per version to co-exist in your database.

    – Never delete a table. Never delete or change the type of a column. The motto is “add only”.

    – Ensure that your code works with the database with the motto “If I don’t understand it, I ignore it”. So if an old version sees data it doesn’t understand from a newer version, it ignores it.

    As I said, this can be a PITA but many organizations do this on a regular basis, and it works. It probably breaks every agile rule you’ve ever heard of, but agility doesn’t fit in very well with the concept of backwards compatibility anyway.
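
    As a sketch of the “if I don’t understand it, I ignore it” rule from the client side – the view name trade_view_v11 follows the convention above, the column names are invented, and PreparedStatement/ResultSetMetaData are plain JDBC:

```java
// Sketch of an older client reading from its own versioned view and ignoring
// any columns it doesn't recognise. View and column names are invented;
// the JDBC calls are standard.
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.ResultSetMetaData;
import java.sql.SQLException;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

public class VersionTolerantReader {

    // The columns this version of the client knows how to handle.
    private static final Set<String> KNOWN_COLUMNS = Set.of("TRADE_ID", "AMOUNT", "CURRENCY");

    static Map<String, Object> readTrade(Connection conn, long tradeId) throws SQLException {
        Map<String, Object> trade = new HashMap<>();
        try (PreparedStatement stmt = conn.prepareStatement(
                "SELECT * FROM trade_view_v11 WHERE trade_id = ?")) {
            stmt.setLong(1, tradeId);
            try (ResultSet rs = stmt.executeQuery()) {
                if (rs.next()) {
                    ResultSetMetaData meta = rs.getMetaData();
                    for (int i = 1; i <= meta.getColumnCount(); i++) {
                        String column = meta.getColumnName(i).toUpperCase();
                        if (KNOWN_COLUMNS.contains(column)) {
                            trade.put(column, rs.getObject(i));
                        }
                        // Columns added by newer versions simply never make it in here.
                    }
                }
            }
        }
        return trade;
    }
}
```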

  13. entropyjim says:

    “– Partition your data across multiple databases by value. At the most trivial level, an example would be 26 databases labeled A-Z. All people with a last name starting with A go to the A database, B’s go to the B database, etc. Again this sounds weird but in some domains this is a very common solution that can work surprisingly well.”

    This is equivalent to the example I gave of having multiple databases for individual, local warehouses. I see no problems with that at all 😉

    “– Use database replicas intelligently. Determine who needs 100% correct “realtime” access and who doesn’t need that level of guarantees. Push the ones who don’t need the guarantees off to replicated databases.”

    This is essentially what I meant by “not 100% correct data”. I should have explained myself better.

    At the end of the day I believe you need to know A LOT about the business in question before you can choose an appropriate solution. All of the ones above seem fine, but until you know more about the business you can’t weigh up the pros and cons.

    The real problem I have is the idea of coming up with a ‘one size fits all’ solution. I don’t believe a solution that works for John Q Small Business would be appropriate for Tesco or Dell.

  14. anonymous says:

    If you want to read about handling really big piles of data, read this paper by some guys who know a bit about handling it: http://labs.google.com/papers/gfs.html

    -Darren

  15. sirenian says:

    Star. 🙂

  16. anonymous says:

    You’re right in the cases of replicas or partitioning. If you take that approach you have to be intimately familiar with the data, its structure, and its usage patterns.

    There is however a generalized solution which does work more or less as ‘one size fits all’, and that’s the distributed cache mechanism I mentioned. In essence a clustered database will be using this approach internally. Or it can be visible and used directly by the application layer (e.g. like the product Coherence). The fundamental idea is simple: use hashing and bucketing generically on your data to carve it up so that the data is distributed out across many machines. Then invest in memory and fat pipes between your machines so that cluster communication is efficient and you’re holding a meaningful amount of data.

    Assuming that you’re truly hashing and bucketing out across your cluster (as opposed to just replicating) and use asynchronous protocols, then this solution scales tremendously, has great performance, and can be used generically. Robustness can be built-in by allowing a bucket to be copied onto 1-N “buddy” processes (which can be designated automatically by the cluster) or by treating your data in memory and across processes in the same way that a RAID array works.
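
    A toy sketch of the hashing-and-bucketing idea: each key lands on a primary node plus “buddy” replicas. Node names and the replica count are made up, and a real cache would use consistent hashing so that adding a machine doesn’t reshuffle every key:

```java
// Toy sketch of hashing and bucketing data across a cluster, with each key
// stored on a primary node plus (replicas - 1) "buddy" nodes.
import java.util.ArrayList;
import java.util.List;

public class HashBucketCluster {

    private final List<String> nodes;
    private final int replicas;

    HashBucketCluster(List<String> nodes, int replicas) {
        this.nodes = nodes;
        this.replicas = replicas;
    }

    // The nodes responsible for a key: hash it into a bucket, then take the
    // next (replicas - 1) nodes around the ring as buddies.
    List<String> ownersOf(String key) {
        int primary = Math.floorMod(key.hashCode(), nodes.size());
        List<String> owners = new ArrayList<>();
        for (int i = 0; i < replicas; i++) {
            owners.add(nodes.get((primary + i) % nodes.size()));
        }
        return owners;
    }

    public static void main(String[] args) {
        HashBucketCluster cluster =
                new HashBucketCluster(List.of("cache-1", "cache-2", "cache-3", "cache-4"), 2);
        System.out.println(cluster.ownersOf("stock:air-conditioner"));
        System.out.println(cluster.ownersOf("customer:aragorn"));
    }
}
```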

  17. sirenian says:

    I don’t think anyone’s being uncreative in looking for solutions – I’m not actually looking yet, and Jim’s just being helpful with the stuff he knows I don’t know. Just flagging the idea for dealing with it later, and garnering a whole load of material to go look at from those kind bods who care to give it (thank you. It is appreciated).

    The first question I intend to ask the guys who started discussing the whys and wherefores of this is: why do you want it? What will it give you? But at the moment it’s just a theoretical problem that’s interesting to talk about.

    For any data to be partitioned, cached or replicated, there has to be a central control somewhere. Someone has to know who’s got all the A’s, all the transactions, all the rubber ducks, or collect them up at the end of the day. I’m sure that this does work well for real world applications, but in the back of my imagination I’m still thinking of far-fetched AI… With caching you’ve still got a central database. With partitioning you can’t just add another fifty identical nodes to the grid. It’s the manifold symmetry of the problem that appeals to me; I’d love a solution which works as well for fifty thousand nodes as it does for two.

    So I haven’t got a solution yet, but I have got a discussion started, and I thought of an ending for the short story I’ve been mulling over all week. Thank you!

  18. sirenian says:

    Wouldn’t it be nice to replace a legacy table with a view that points to the (new) right data, migrated where appropriate, and remove the old columns? That way, as the legacy code is updated, the legacy table views are updated too, and the tables have a beautifully clean structure.

    I’ve not used views or packaging much, but there are some people frothing at the mouth on a couple of threads over here. I get the feeling they’re not as popular as they might be.

    Agile doesn’t fit well with backwards compatibility, but it does fit with constantly changing requirements – which, in any multiversion environment, means backwards compatibility. Aargh!

  19. anonymous says:

    Hmmm… you say “For any data to be partitioned, cached or replicated, there has to be a central control somewhere.”

    This is not necessarily true. See how peer-to-peer dynamic systems work for an example. There are many scenarios where you need a single control – but note that a “single control” does not necessarily have to be a central one. The “controller” can be nominated in an ad-hoc manner dynamically at runtime.

    Think multiple redundant servers and voting algorithms.

    Think about a caching grid where all nodes participate in caching, and data is replicated N times in the grid (typically with N being 2 or 3). Imagine the grid is geographically dispersed. In this scenario you could always use lazy asynchronous writes out to a database during “off hours” – but the application would never, ever read from the database, only from the cache. In fact the cache size would equal your dataset size. Combine it with a voting algorithm and you don’t need a database at all – your cache could lazy-write out to local file systems (or local databases 🙂 ).

    Stop thinking central, synchronous. Start thinking dispersed, peer-2-peer, asynchronous with redundancy thrown in.

    Find out how RAID disk arrays work. You look like you have one big, hugely reliable and pretty fast disk when in fact you are using many smaller disks logically connected together. Apply the RAID concept to application clusters and you’ll see sort of what I’m talking about.

  20. sirenian says:

    That’s the kind of thing I mean. Is there any way of getting rid of the single controller, though? So that every node in the system collaborates to work out who has copies of what data?

    I’m thinking of nodes with a kind of intelligence, so that they can work out which data is being accessed regularly and say spread it across 7 different nodes instead of just 2 or 3; nodes which are intelligent enough to inform all the other nodes holding a piece of data if that data updates.

    The real issue as far as I can tell is when a piece of information is sought that isn’t actually there. eg: “Find all flights to Minas Tirith Airport”. Last I looked Minas Tirith only served dragons, eagles and Nazgul, not aeroplanes, so the query needs to hit all the nodes before it can return a negative result. That’s almost as bad as having a central server.

    It would be nice to make that job easy; for instance, all the nodes could know which nodes were responsible for particular kinds of information, and that information could be updated if it changed, which it would do less often.

    You’d need intelligent rules to deal with node responsibility; ie: if there are 52 nodes, 2 each do A to Z, if there are 104 do Aa to Am, An to Az etc; shuffle up if you’re feeling flustered. And then, if Aragorn decides to build an airport at Minas Tirith, all the nodes know who’s responsible for what, so the information gets passed to the nodes that need to know.

    So if all the nodes know who’s responsible for what, and that doesn’t change very often, then a search for flights to Minas Tirith only needs to get 2 negative responses and the other nodes can ignore the job.

    That would work, wouldn’t it?

  21. anonymous says:

    Here’s a curve ball for you… can you convince the customer that they don’t need to have a multiversion environment? That would seem like the simplest option in most cases, despite what a customer may initially think.

    But if you can’t…

    I agree with your last post sirenian.

    We also have a legacy system that can’t change. We’re slowly replacing its functionality, but until we have complete coverage the old system won’t be removed. So we wrap all access to the legacy system in views, then wrap those views (and the new application tables) in packages.

    The views exist so that we can extend the ‘tables’ without having to extend the legacy system. For example: The legacy table named CANDIDATE_STATUS may be extended in our system by adding a new table CANDIDATE_STATUS_PROPERTIES and then adding a view on top of the join. That way the legacy system remains unchanged but we get the additional columns we need. Removing a column is simply a case of removing it from the view rather than the underlying table.

    Once the legacy system is completely obsolete a data migration will take place to move all the data from the legacy tables into new tables matching the views. The SQL to do that transform already exists in the form of the views we are using to wrap those legacy tables.

    I can see how a combination of such views and a ‘compatibility parameter’ to switch between given views might work in a multiversion environment, although I imagine the schema will get into one hell of a mess as soon as you have more than a couple of versions live at any time.

    Using views to wrap tables should be near trivial to implement for any database reads. Database writes are another matter (which is where the packages come in).

    As you might be able to tell… we love views and packages here.

  22. sirenian says:

    Our current customer has about 10,000 client machines around the UK; there’s no way the rollout of a new version happens all at once. It would be nice to only have one version, wouldn’t it?

    I don’t think we have any more than 3 versions live at any given time. That sounds like the kind of thing we need, though; a way of getting rid of legacy structure as it becomes obsolete instead of adding on to it.

  23. anonymous says:

    A number of problems need a controller. However, the controller can be designated dynamically (most often as the oldest process within the grid). And you would code the controller to be highly minimal, so that it’s doing the bare minimum controlling work and the “real” work would stay distributed across the grid.

    Using your examples about node intelligence, you may be over-optimizing and being too clever. You can end up with a situation where your nodes are so smart that they spend all their time figuring things out in their clever way and little of their time doing actual work. This was found to be a problem in the Linux kernel scheduler, and it is also a key argument in how RAID arrays work, in the RISC vs. CISC debate, and in fixed-size frames vs. variable-sized packets. The argument in common across all of them is: small atomic units are good, regularity is good, simple is good. They are good because they can be parallelized. And true scalability _always_ comes from parallelization.

    Using your example – to be truly generic and scalable you wouldn’t tell nodes to do “A-Z” or “Aa to Am”. Instead, you’d hash the name and then take that hash modulo your grid size. If you want to then do a query, you asynchronously throw out a query to all N servers and reap back the results. If you’re asynchronous then sending to 52 servers and getting 52 responses will take almost exactly the same amount of time as sending 2 requests (or 10, or 20, or 500). Even if most nodes don’t have your data you don’t care.

    Of course this can get expensive if you have many, many nodes (thousands). However – you can be smart about how you configure your grid/cluster. For example, you could have sub-grids – so that a query would go out to the sub-grid and then the sub-grid will subdivide it for you. This is a sort of situation where you would need a ‘controller’ notion – 1 per sub-grid. But, again, the controller can be nominated dynamically, and you could likely partition sub-grids dynamically as well.
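
    A sketch of that asynchronous scatter-gather, assuming an invented GridNode interface; CompletableFuture is standard Java, and hash-modulo routing decides where keyed writes go:

```java
// Sketch of the asynchronous scatter-gather query: hash modulo grid size
// decides where a key is stored, but a search simply goes out to every node
// in parallel and the answers are reaped as they come back. The GridNode
// interface and the sample data are invented for illustration.
import java.util.List;
import java.util.Objects;
import java.util.concurrent.CompletableFuture;
import java.util.stream.Collectors;

public class ScatterGatherQuery {

    interface GridNode {
        // Returns a matching value, or null if this node holds nothing relevant.
        String query(String criteria);
    }

    // Where a keyed write would go: hash modulo grid size, no A-Z style rules.
    static int ownerIndex(String key, int gridSize) {
        return Math.floorMod(key.hashCode(), gridSize);
    }

    // The query is thrown at all nodes at once; waiting for 52 answers costs
    // about the same wall-clock time as waiting for 2, since they run in parallel.
    static List<String> search(List<GridNode> grid, String criteria) {
        List<CompletableFuture<String>> futures = grid.stream()
                .map(node -> CompletableFuture.supplyAsync(() -> node.query(criteria)))
                .collect(Collectors.toList());
        return futures.stream()
                .map(CompletableFuture::join)
                .filter(Objects::nonNull)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<GridNode> grid = List.<GridNode>of(
                criteria -> criteria.contains("London") ? "BA123 to London" : null,
                criteria -> null,   // most nodes simply have nothing to say
                criteria -> null);
        System.out.println(search(grid, "flights to London"));       // [BA123 to London]
        System.out.println(search(grid, "flights to Minas Tirith")); // [] - a clean negative
    }
}
```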

  24. anonymous says:

    In Oracle at least, you can declare ‘instead of’ triggers that fire on insert/update/delete against a view, allowing you to code in pl/sql whatever is required to update the underlying tables.

    This can be very handy for allowing the pruning of redundant database constructs in a phased way.

    David.
