Martin Fowler + Toby Clemson | Kafka Summit 2018 Keynote (Experimentation Using Event-based Systems)

I'm Martin Fowler. I write things on the Internet. I spend a lot of my time thinking about how to communicate interesting software ideas, but as a result I don't actually do anything that's real. I have to rely on people like Toby here to tell me what's going on, and every so often I try to pluck things out of him. Here you'll get to hear from him directly instead of through me as a mediator, which is definitely an improvement.

We're going to talk today about using events to get experimentation into a system, which immediately raises the question: why are we talking about experimentation? One way to think about how we got here is to look at the way software teams have developed over time. In the old days, and actually not all that long ago, people would come up with big requirements statements: here are all the things we want you to build, go away for a few years and come back and tell us when you've done it. We've shifted away from that, and a large driver of that shift has been agile software development and the notion that we should relate to our users in a different way.

But agile software development works in a number of different ways, and one way of looking at this that I've found very handy is a model called the Agile Fluency Model. It was developed by a couple of people not connected with ThoughtWorks, but I felt they had a good way of talking about how they saw different teams working. To go through it very quickly: the first level is referred to as focusing, and focusing is really about changing the management style of a team, so that instead of looking at how much gets done as work items ticked off, you begin to think about how to produce business value. That's a relatively straightforward step; it's mostly the management side of agile approaches, the notion of iterations and that kind of thing.

To get really better at this, you need to raise productivity and get things going faster and faster, and that's the second stage, which is called delivering. The point of the delivering stage is that you take on the technical practices: you do things like testing, refactoring, continuous integration and so on, which allow you to release more rapidly, drastically reduce the number of defects in the system, and ship when the market needs it rather than when you happen to be ready. That is a crucial step, because now I can release every day. But it's all very well to release every day; you're still responding somewhat passively to decisions about what to build.

The next step takes that further and says the team itself is going to decide what to build, and to experiment in how it builds things. This is where the experimentation comes in. The fourth stage is more speculation about the agile future, so I tend to skip over it and focus instead on that third stage. A term that often goes with it is hypothesis-driven development, and the idea is that when looking to build something, rather than trying to come up with lots of thoughts about what you might build and fixating on them, you say: let's try something, get it out there, let people use it, see how they react, and then decide whether to go further down that path or maybe even kill that feature entirely and go somewhere else.
That requires the ability to try things out and see how people respond, so that you can conduct experiments carefully, and that's why experimentation is important. And now, over to Toby.

So, in order to explore this idea that event-driven systems can help with experimentation, we're going to run through a bit of an experience report. That experience report is in the context of the company I'm working with at the moment, called B-Social. B-Social is a digital challenger bank looking to disrupt the banking industry and build apps that customers love, apps that provide value to customers in the banking space that they haven't really experienced before. In order to disrupt, try different things and build something customers love, we decided to adopt this approach of experimenting, trying different things, learning from real-life situations and real-life customers, and having that help drive our product development process. But obviously, in order to experiment we need systems that help us experiment. Lots of our experiments were quick and easy prototypes or throwaway things, but at the same time we were trying to build a bank, and we needed to define a system that met the needs of being a bank but was also easy to change, easy to experiment with, evolutionary in its nature.

So where did we start? We could have started with a monolith. This is typically sage advice from Martin: start with the monolith, because you don't really know your system yet, you don't know your domain; keep it all in one box, keep it simple and easy to change. However, in our case we chose not to do that. We had a pretty mature team; they had built lots of microservice implementations before and had lots of experience operating, deploying and scaling those kinds of systems, so instead we decided not to go with a monolith.

Well, my regular advice is to build a monolith first. I do think this team did the right thing, but I want to caution teams out there: do not do this unless you really understand how to build distributed, service-based systems and have some real experience doing it, because although this approach has worked out well for this team, most of the times I've seen people attempt it, it has not been pretty. So be careful about going down that path.

And there were two parts to that risk, I think. When you move to something like a microservices architecture, you push a lot of the complexity into the deployment model, into the gaps between the services; and the other part is not being able to easily model your domain when it's already distributed and you've got network partitions between all of the pieces. We mitigated the first by investing really heavily in infrastructure as code and automation. We decided to build open-source implementations of all of our deployment infrastructure, container runtime and so on, so by automating all of that away we made it very cheap and easy to take this approach. We could spin up services very fast and throw them away very fast, in the order of minutes or hours rather than weeks or months. That also helped in terms of being able to experiment, because it didn't really matter how long-lived something was; it was so cheap to build it and get rid of it.
And I should point out that one of the nice things about working at ThoughtWorks is that when we come up with stuff like this we tend to open-source it rather than keep it quiet, so I'm really pleased to see that the team has done that.

OK, so then where did we start? What was the first evolution of the architecture? We decided to go with microservices. We were building mobile apps, digital first, everything through a mobile device, and we built a backend-for-frontend to cater to that device, and then lots of small collaborating services, each dedicated to one particular task. Each of those services had its own dedicated database; in our case we used Postgres, which was simple and easy for us to use given the cloud provider we were on, and that let us have persistence on a service-by-service basis in a cheap way.

I really want to stress this, because it's one of the important things about microservices. The term makes people think they're just small services, but there are other characteristics that really make the style, and a very important part of that is each service having its own datastore. I've seen people build "microservices" where they're small services on top of a shared database. For a start, that's not microservices, because it contravenes the definition; and secondly, just don't do that, because it's really very painful. So if you're going to go down a microservices route, there is a reason why you have your own database in each service.

Each of these services communicated synchronously. We didn't really have huge load at this point in our evolution, and it was a lot simpler for us to just have these services communicating synchronously, so we kept things simple. In terms of the APIs these services used, initially it was a bit of a free-for-all; we just did whatever made sense. But we quickly hit the limitations of that and found it made it very difficult to evolve our APIs over time if we weren't quite intentional about the way in which we did it. So we moved fairly quickly towards making everything an object, nice and easy to expand and evolve as you go, and then we went further and decided to go with full hypermedia APIs. The reason was that we wanted to be able to throw out new services, see if they helped, potentially replace them, throw them away, put other things in their place, or move our entities around in our system landscape. Using hypermedia meant we were location-independent with respect to where things lived in the overall system, so we could make those kinds of moves. We found hypermedia really valuable in terms of giving us flexibility in our architecture.

This is an interesting point. I come across a lot of people using "RESTful" systems, and what they really mean is that they think in terms of resources and use the HTTP verbs and error codes, which is all well and good. But there's a common statement out there that says you don't really need the hypermedia bit: the clients can know all the URLs, and they don't need to change very often. There are times, though, where it really does make a difference. Here it allowed the team to change their service boundaries around. I've known another case where it helped a huge amount with scaling: a team ran into big scaling issues and was able to shift everything up to edge servers, and they could do that really easily because they had these hypermedia links. So don't neglect the value of having these full URLs. The ideal test is that clients should only know a single URL, the entry point to your whole system; every other URL should be gained by interrogating that initial URL, which then leads you to all the other paths. That gives you complete flexibility about where you place the various services.
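As a rough illustration of that idea (the resource names, link relations and URLs below are invented for the example and are not B-Social's actual API), a hypermedia response carries the links a client needs to navigate to related resources, so the client only ever hard-codes the entry point:

```python
# A minimal sketch of a hypermedia-style response, shown here as a Python dict.
# All names and hosts are illustrative assumptions.
account_representation = {
    "accountNumber": "12345678",
    "balance": {"amount": "250.00", "currency": "GBP"},
    "_links": {
        "self": {"href": "https://api.example-bank.test/accounts/12345678"},
        "transactions": {"href": "https://api.example-bank.test/accounts/12345678/transactions"},
        "owner": {"href": "https://api.example-bank.test/customers/987"},
    },
}

# A client follows link relations rather than constructing URLs itself, so the
# service owning "transactions" can move without consumers having to change.
transactions_url = account_representation["_links"]["transactions"]["href"]
```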
Then, in terms of persistence, we kept things very simple. We were using the JSON capabilities of Postgres, and we just dumped everything about our entities into the database in a pretty basic JSON structure. Actually, we only kept snapshots, the latest version of everything, so we were throwing a lot of information away, but at this point we hadn't really thought any better, and we just had basic, high-level information about the entities in the system.

So what did this do for us? It gave us the ability to build things very fast, throw them out there, see whether they worked, whether they contributed to the overall system and the overall product, and if they didn't, throw them away. That didn't take a lot of unpicking, because they were already small, discrete and independent. It meant we could use the right tool for the job: any time a service had a particular need, a certain type of persistence or a certain language or library, we could easily use it without the complexity of it all being in one box, inside one process. And if we came up with experiments that were successful and kept those bits of the system, then because we had already teased out the scaling boundaries and the levers for scaling, we could easily scale up or improve those services independently as we needed to.

However, the whole goal was to be able to experiment, make measurements, and understand our customers' behaviour, and because we weren't storing deep or fine-grained information about our customers and what they were doing, we had no way to measure anything, and we didn't really have any way to capture customer intent. So we decided to look at alternatives to, or extensions of, our architecture, evolutions that would keep the things we liked and gain the things we were missing. We looked into event sourcing.

Jay mentioned event sourcing very briefly in his talk, so I'll explain a little more about how I see that term defined. It begins very much the way he talked about it, with that duality between events and tables. Imagine the system without any events: I've got a user, I've got some customer information, and that customer says, hey, I'd like to change my address. What happens? You just make the change in the database: blow away the old record, put a new record in. In the event sourcing world we make a subtle but important change to this. When somebody says "change my address", the first thing we do is create a first-class event record, and that event record is stored; then we process that event record in order to make the change to our current state. Essentially what we've now got is those two things that Jay talked about: we have the tables, the application state, and we have this history, a log of every change that was ever made to it. That's fairly standard event-handling stuff. The key thing that makes something event sourced is that I should be able to confidently, at any time, just blow away my application state and then comfortably restore it from the log, rebuilding the state again.
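A tiny sketch of that flow, with event and entity shapes invented purely to illustrate the pattern Martin describes (this is not the system discussed in the talk): record the event first, derive the state from it, and be able to rebuild the state from the log at any time.

```python
# Minimal event sourcing sketch: the event record is first class, state is derived.
from dataclasses import dataclass

@dataclass
class CustomerAddressChanged:
    customer_id: str
    new_address: str

@dataclass
class Customer:
    customer_id: str
    address: str = ""

event_log: list[CustomerAddressChanged] = []   # the durable history of changes
customers: dict[str, Customer] = {}            # the current application state

def apply_event(state: dict[str, Customer], event: CustomerAddressChanged) -> None:
    customer = state.setdefault(event.customer_id, Customer(event.customer_id))
    customer.address = event.new_address

def change_address(customer_id: str, new_address: str) -> None:
    event = CustomerAddressChanged(customer_id, new_address)
    event_log.append(event)        # store the first-class event record...
    apply_event(customers, event)  # ...then process it to update current state

def rebuild_state() -> dict[str, Customer]:
    # The event-sourcing test: throw the state away and replay the whole log.
    state: dict[str, Customer] = {}
    for event in event_log:
        apply_event(state, event)
    return state
```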
So, quick question: how many people have actually built a system using this kind of event-sourced approach, where you can blow away your current state? A few people. How many people have used a system that works this way? Raise your hands. A few people. OK, how many people write software for a living? In that case I'm very worried that you said you haven't used a system that takes this approach, because every software developer uses one. What's it called? Version control. Git and event sourcing are not actually synonymous, but these days it feels like it. This is what version control is: at any time I can blow away my working tree and recreate it, and you can all think of the advantages, all the wonderful things you can do with a version control system. That's what event sourcing allows you to do with your business data. In fact, there's an even older event-sourced system that's been around since medieval times, which is accounting: you record every transaction, and you add them up in order to get a balance. That's effectively event sourcing. Those are the two examples I like to use, and when you're trying to think about how to use event sourcing, or what its advantages are, fall back to accounting systems and to version control. The advantage of accounting systems is that they make sense to non-geeks, but version control is probably the more familiar one for us on a day-to-day basis. So when you're thinking of event sourcing, ask: how do we bring the advantages of what we can do with Git to our business data?

So what did we do with it? Well, the second evolution of our system was exactly this: we decided to switch to an event sourcing approach. The architecture at this level didn't really change a great deal, but in terms of persistence we radically changed things under the covers. It was still relatively simple, and we still made use of the JSON storage features of our underlying database, but we kept a bit more information this time: the events that made up the different interactions in our system. We basically stored the complete content of any request made by a consumer of a service against some type, and whenever we wanted to reproduce the entities we had before, we would project from all of the events that pertained to that entity or those entities. This gave us services that each encapsulated and owned a set of events pertaining to their aggregate root, and that helped a lot, because we could then see what was going on; we could measure things, count things, inspect those payloads and understand what our customers had passed us, what we were being given. But the events were still relatively separate; they were encapsulated within their services. So we got visibility into the business events as they were taking place, and measurability of our customers' behaviour, but we were still lacking cross-service event visibility.
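As a rough sketch of what that buys you for measurement (again with invented event shapes, not the actual stored format): once each interaction is kept as an event, a service can both project the current view of its aggregate and answer behavioural questions that a latest-snapshot store never could.

```python
# Illustrative only: projecting an aggregate and measuring behaviour from its events.
from collections import Counter
from datetime import datetime

# Hypothetical stored events for one aggregate root (a customer), oldest first.
events = [
    {"type": "customer-registered", "at": "2018-01-05T09:00:00", "payload": {"name": "Sam"}},
    {"type": "address-changed", "at": "2018-02-10T12:30:00", "payload": {"address": "1 High St"}},
    {"type": "address-changed", "at": "2018-02-28T17:45:00", "payload": {"address": "2 Low Rd"}},
]

def project_customer(events: list[dict]) -> dict:
    """Fold the event history into the current view of the entity."""
    customer: dict = {}
    for event in events:
        customer.update(event["payload"])
    return customer

current = project_customer(events)  # current state, rebuilt from history

# The same history also answers questions the snapshot threw away,
# for example how often customers change address, grouped by month.
changes_by_month = Counter(
    datetime.fromisoformat(e["at"]).strftime("%Y-%m")
    for e in events
    if e["type"] == "address-changed"
)
```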
We needed that for a few reasons. Some were to do with the measurements themselves: we wanted to be able to measure things that had complex interrelations across the suite of services. We also had functional requirements that needed it, such as being able to send push notifications to the user based on things that were going on, or notifications between services, so that, say, a customer deactivating their account could propagate through the system. We also didn't have any way to aggregate up: there were certain things we wanted to do that were truly aggregates of different things going on across many services, and it was very difficult to do. The only way we could really run these experiments and make these measurements across services was to dig into each database, pull a bunch of data out and manually play around with it, which worked for a while, but we found we needed to evolve further.

OK, so the third evolution. The goal was to get cross-service event visibility and keep everything else if we could. Initially we looked at the idea of event feeds, where each feed was just a JSON endpoint exposed by each service that gave a list of every event that had happened, in a nice, easy-to-consume JSON form. Consumers would then poll these event feeds, pulling information as and when they needed to, on whatever frequency they needed, and reacting to those events. This worked OK and gave us some benefits. It was nice and human-readable; we could investigate problems quite easily, debug what was going on, or write little scripts to understand events happening in the live services. We could choose the windows for those events and how frequently we polled, but we could also scan back through all of history, so if we needed to spin up a new service that had a new responsibility, or was running a different experiment, it was very easy for us to rebuild its state of the world, even for things that happened before it existed. That was a really nice feature of this approach.
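To make the shape of that concrete, here is the kind of thing such a feed might have returned; the event names, fields and paging scheme are invented for illustration and are not the actual feed format used.

```python
# A sketch of one page of a per-service JSON event feed (illustrative shape only).
event_feed_page = {
    "events": [
        {
            "id": "6d1f2b8e-0001",
            "type": "account-opened",
            "occurredAt": "2018-03-01T09:12:00Z",
            "payload": {"accountNumber": "12345678", "customerId": "987"},
        },
        {
            "id": "6d1f2b8e-0002",
            "type": "account-deactivated",
            "occurredAt": "2018-03-15T16:40:00Z",
            "payload": {"accountNumber": "12345678", "reason": "customer-request"},
        },
    ],
    "links": {
        "self": {"href": "/events?since=1000&limit=2"},
        "next": {"href": "/events?since=1002&limit=2"},
    },
}

# A polling consumer walks the "next" links at its own frequency, and a brand-new
# consumer can start from the beginning of the feed to rebuild all of history.
```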
However, in terms of downsides, all of the coordination logic for having competing consumers churning through an event feed lived in the consumer, and every time we spun up a new consumer we had to handle the same old problems: how do we make sure we only process each event once, and how do we spread the load across many instances of those services? That was becoming painful, so in the end we scrapped this and decided to look into something a bit different: an event bus. The goal here was to push a lot of the complexity around managing the competing-consumers problem, the coordination of those events, and the long-term storage and replayability of those events into some other system that just handled that complexity itself, so that consumers could consume the information much more easily without having to do a whole bunch of work on their own. The question is, of course, where would we find something like an event bus? Anyone got any ideas? It's on the banner behind us. Yes, so why did we pick Kafka?

We actually looked at a number of different things. We looked at message-queue infrastructure, but the trouble there was that we didn't have a nice way to spin up a new service and resume from all of history, even if that service was brand new. We looked at some of the cloud offerings; we were using Amazon, so we looked at Kinesis, but with Kinesis we didn't really have a great deal of control over retention times, or, to be honest, control over much of anything. So instead we decided to roll our own, deploy our own, and go with Kafka as our underlying event bus infrastructure.

So this was good, and we continued, but it seemed like we had a number of options for how to connect our services together through Kafka. One approach would have been to look directly at our data stores and hook them up with something like Kafka Connect, so that it could pull that information out and feed it into Kafka for our consumers to read. We didn't feel very comfortable with this, because we wanted the flexibility to have different data stores, maybe evolve and change data stores over time, and also to map events over time, so that we could evolve the underlying representations whilst keeping a consistent view of those events across the wire. We thought maybe we could instead push events from our services directly into Kafka, and consumers would read those and everyone would be happy. But again, we didn't quite feel comfortable with this, because we wanted to retain our separate persistent stores as the source of truth for all of those events. We didn't want Kafka to become our persistence layer, because we wanted to be able to throw things away and rebuild them, and to avoid a single point of failure, that kind of thing. The risk here is that you write an event to the database and then for some reason fail to write it into Kafka, and you end up having to build all sorts of retry logic to make sure that events are synchronised between those two locations. So we weren't too fond of that approach either.

In the end, the approach we went with was to keep our event feeds, in a slightly varied form, as JSON endpoints, and have Kafka Connect poll them at whatever frequency we needed. We kept the human readability of those feeds, and we gained resilience: if, say, the transaction service was down for a while and then came back, Kafka Connect would catch up, and vice versa. So we took this approach, with maybe a couple more moving parts, but it gave us a whole bunch of additional benefits. Our service landscape evolved: it still had all of its synchronous communication, and where synchronous communication worked for us we kept it, but we now had cross-service event visibility and things reacting to events in the system at a distance, effectively. We got cross-service event visibility, and we got the ability to aggregate events when we needed to. The only thing we were lacking at this point, and are in fact still lacking to this day, is ad hoc data analysis: we would like to be able to look for something we want to measure that's going on in the live production system, something we had never thought of before, and do so with tooling that actually helps us do that.

In terms of future evolutions, we now have event feeds provided by all of our services, full visibility into everything that's going on in the system, and a nice, easy way of moving those events around, and what we've found is that this gives us a huge amount of flexibility in terms of where to go next.
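One of the properties being relied on here is that a brand-new consumer can replay a topic from the beginning to build up its view of the world. As a rough sketch of what that might look like with the confluent-kafka Python client (the broker address, topic name, group id and apply_event function are all assumptions for the example, not the talk's actual code):

```python
# Sketch of a new service rebuilding its state by replaying a topic from the start.
import json
from confluent_kafka import Consumer, OFFSET_BEGINNING

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "brand-new-projection-service",
    "auto.offset.reset": "earliest",
})

def start_from_beginning(consumer, partitions):
    # Force a replay of all retained history, even if this group has committed offsets.
    for partition in partitions:
        partition.offset = OFFSET_BEGINNING
    consumer.assign(partitions)

consumer.subscribe(["account-events"], on_assign=start_from_beginning)

def apply_event(event: dict) -> None:
    # Hypothetical projection logic: update this service's own local state.
    print(event["type"], event.get("payload"))

try:
    while True:
        message = consumer.poll(timeout=1.0)
        if message is None:
            continue
        if message.error():
            continue  # real code would log or raise here
        apply_event(json.loads(message.value()))
finally:
    consumer.close()
```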
So we could introduce a data lake and all of the big data tooling that goes with it, if we want that kind of ad hoc analysis or much bigger analysis of the data; or there are a number of different streaming technologies out there that we could use, so Apache Spark would give us that kind of streaming approach, or Kafka Streams itself, wrapped up by another little service, say. We haven't yet got to this stage, but what we're finding is that having invested in Kafka, having put this infrastructure in place, we now have a whole bunch of flexibility opening up that didn't exist before.

But there is actually a bit of a warning here. It's nice to say we've got all these events flying around, we can find them all, all sorts of clients can reach in and do analytics on them, and all sorts of interesting things can happen in a much more ad hoc way of pulling things together. But there is a danger; we actually discussed this at the ThoughtWorks Radar meeting a few weeks ago. One of the problems you get is that people will start using and relying on internal messages, and the problem with that is it can stop people from changing them, because if you change an internal event you end up breaking a system you've never even heard of. Go through that cycle a couple of times and people become scared to change anything. So there's a real trade-off here. You have to be very careful: if you're going to reach in and start playing around with internal stuff, it really needs to be at a low quality-of-service guarantee, and you've got to be aware that things can break. It's not a dangerous balancing act, but it is a tricky one, so be wary when you're exposing things that really ought to be internal; it can lead to coupling and dependency that can be quite difficult to deal with.

OK, so what are the key takeaways from all of this? We found that events provide us with measurability and capture customer intent, and this is really useful when it comes to being able to experiment with your customer base, and with your system as well. But in order to make this work, events should be encapsulated and treated as first class within your system; and if you go so far as implementing this kind of approach with full cross-service event visibility, then something like Kafka handles a huge amount of the complexity for you and provides lots of future flexibility as well. So overall we're pretty happy with it.

OK, thank you, and if you have any more questions do feel free to come and talk to us; we're going to be around for both days of the conference. But if you actually want an answer to a question, you're probably better off going to Toby, because he actually knows what's going on. Thank you.

5 thoughts on “Martin Fowler + Toby Clemson | Kafka Summit 2018 Keynote (Experimentation Using Event-based Systems)”

  • Evdokim Covaci says:

    Great talk. I have one question though. 14:53, do I understand correctly that using git as an example of Event Sourcing is not technically right? Indeed git does not store deltas but rather snapshots of files for every particular moment. This is a nice post about git internals: https://www.linkedin.com/pulse/git-internals-how-works-kaushik-rangadurai/

  • Can someone help with this please: Is there a video that explains why each microservice should have its own data store? (what Mr Fowler talks about at the 9:00 minute mark)

  • ConstantChange says:

    So glad to hear about the use of hypermedia! Hypermedia is analogous to location transparency in reactive systems; without it, you're doing poor man's HTTP RPC.
