Content here is by Michael Still mikal@stillhq.com. All opinions are my own.


Sun, 03 Nov 2013



Comparing alembic with sqlalchemy migrate

    In the last few days there has been a discussion on the openstack-dev mailing list about converting nova to alembic. Nova currently uses sqlalchemy migrate for its schema migrations. I would consider myself a sceptic of this change, but I want to be a well educated sceptic so I thought I should take a look at an existing alembic user, in this case neutron. There is also at least one session on database changes at the Icehouse summit this coming week, and I wanted to feel prepared for those conversations.

    I should start off by saying that I'm not particularly opposed to alembic. We definitely have problems with migrate, but I am not sure that these problems are addressed by alembic in the way that we'd hope. I think we need to dig deeper into the issues we face with migrate to understand if alembic is a good choice.

    sqlalchemy migrate

    There are two problems with migrate that I see us suffering from at the moment. The first is that migrate is no longer maintained by upstream. I can see why this is bad, although there are other nova dependencies that the OpenStack team maintains internally. For example, the various oslo libraries and the oslo incubator. I understand that reducing the amount of code we maintain is good, but migrate is stable and relatively static. Any changes made will be fixes for security issues or feature changes that the OpenStack project wants. This relative stability means that we're unlikely to see gate breakages because of unexpected upstream changes. It also means that when we want to change how migrate works for our convenience, we don't need to spend time selling upstream on that change.

    The other problem I see is that it's really fiddly to land database migrations in nova at the moment. Migrations are a linear stream through time implemented in the form of a sequential number. So, if the current schema version is 227, then my new migration would be implemented by adding the following files to the git repository:

      228_implement_funky_feature.py
      228_sqlite_downgrade.sql
      228_sqlite_upgrade.sql
      


    In this example, the migration is called "implement_funky_feature", and needs custom sqlite upgrades and downgrades. Those sqlite specific files are optional.

    Now the big problem here is that if there is more than one patch competing for the next migration number (which is quite common), then only one patch can win. The others will need to manually rebase their change by renaming these files and then have to re-attempt the code review process. This is very annoying, especially because migration numbers are baked into our various migration tests.

    "Each" migration also has migration tests, which reside in nova/tests/db/test_migrations.py. I say each in quotes because we haven't been fantastic about actually adding tests for all our migrations, so that is imperfect at best. When you miss out on a migration number, you also need to update your migration tests to have the new version number in them.

    If we ignore alembic for a moment, I think we can address this issue within migrate relatively easily. The biggest problem at the moment is that migration numbers are derived from the file naming scheme. If instead they came from a configuration file, then when you needed to change the migration number for your patch it would be a one line change in a configuration file, instead of a selection of file renames and some changes to tests. Consider a configuration file which looks like this:

      mikal@e7240:~/src/openstack/nova/nova/db/sqlalchemy/migrate_repo/versions$ cat versions.json | head
      {
          "133": [
              "folsom.py"
          ], 
          "134": [
              "add_counters_to_bw_usage_cache.py"
          ], 
          "135": [
              "add_node_to_instances.py"
          ], 
      ...
      


    Here, the only place the version number appears is in this versions.json configuration file. For each version, you just list the files present for the migration. In each of the cases here it's just the python migration, but it could just as easily include sqlite specific migrations in the array of filenames.
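
    For example, a migration with custom sqlite steps could be expressed in that file as something like this (a hypothetical entry, reusing the migration name from the earlier example):

      "228": [
          "implement_funky_feature.py",
          "sqlite_downgrade.sql",
          "sqlite_upgrade.sql"
      ],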

    Then we just need a very simple change to migrate to prefer the config file if it is present:

      diff --git a/migrate/versioning/version.py b/migrate/versioning/version.py
      index d5a5be9..cee1e66 100644
      --- a/migrate/versioning/version.py
      +++ b/migrate/versioning/version.py
      @@ -61,22 +61,31 @@ class Collection(pathed.Pathed):
               """
               super(Collection, self).__init__(path)
      
      -        # Create temporary list of files, allowing skipped version numbers.
      -        files = os.listdir(path)
      -        if '1' in files:
      -            # deprecation
      -            raise Exception('It looks like you have a repository in the old '
      -                            'format (with directories for each version). '
      -                            'Please convert repository before proceeding.')
      -
      -        tempVersions = dict()
      -        for filename in files:
      -            match = self.FILENAME_WITH_VERSION.match(filename)
      -            if match:
      -                num = int(match.group(1))
      -                tempVersions.setdefault(num, []).append(filename)
      -            else:
      -                pass  # Must be a helper file or something, let's ignore it.
      +        # NOTE(mikal): If there is a versions.json file, use that instead of
      +        # filesystem numbering
      +        json_path = os.path.join(path, 'versions.json')
      +        if os.path.exists(json_path):
      +            with open(json_path) as f:
      +                tempVersions = json.loads(f.read())
      +
      +        else:
      +            # Create temporary list of files, allowing skipped version numbers.
      +            files = os.listdir(path)
      +            if '1' in files:
      +                # deprecation
      +                raise Exception('It looks like you have a repository in the '
      +                                'old format (with directories for each '
      +                                'version). Please convert repository before '
      +                                'proceeding.')
      +
      +            tempVersions = dict()
      +            for filename in files:
      +                match = self.FILENAME_WITH_VERSION.match(filename)
      +                if match:
      +                    num = int(match.group(1))
      +                    tempVersions.setdefault(num, []).append(filename)
      +                else:
      +                    pass  # Must be a helper file or something, let's ignore it.
      
               # Create the versions member where the keys
               # are VerNum's and the values are Version's.


    There are some tweaks required to test_migrations.py as well, but they are equally trivial. As an aside, I wonder what people think about moving the migration tests out of the test tree and into the versions directory so that they are beside the migrations. This would make it clearer which migrations lack tests, and would reduce the length of test_migrations.py, which is starting to get out of hand at 3,478 lines.

    There's one last thing I want to say about migrate migrations before I move onto discussing alembic. One of the features of migrate is that schema migrations are linear, which I think is a feature not a limitation. In the Havana (and presumably Icehouse) releases there has been significant effort from Mirantis and Rackspace Australia to fix bugs in database migrations in nova. To be frank, we do a poor job of having reliable migrations, even in the relatively simple world of linear migrations. I strongly feel we'd do an even worse job if we had non-linear migrations, and I think we need to require that all migrations be sequential as a matter of policy. Perhaps one day when we're better at writing migrations we can vary that, but I don't think we're ready for it yet.

    Alembic

    An example of an existing user of alembic in openstack is neutron, so I took a look at their code to work out what migrations in nova using alembic might look like. Here's the work flow for adding a new migration:

    First off, have a read of neutron/db/migration/README. The process involves more tools than nova developers will be used to; it's not a simple case of just adding a manually written file to the migrations directory. You need access to the neutron-db-manage tool to write a migration, so set up neutron first.

    Just as an aside, the first time I tried to write this blog post I was on an aeroplane, with no network connectivity. It is frustrating that writing a new database migration requires network connectivity if you don't already have the neutron tools set up in your development environment. Even more annoyingly, you need to have a working neutron configuration in order to be able to add a new migration, which slowed me down a fair bit when I was trying this out. In the end it seems the most expedient way to do this is just to run up a devstack with neutron configured.

    Now we can add a new migration:

      $ neutron-db-manage --config-file /etc/neutron/neutron.conf \
      --config-file /etc/neutron/plugins/ml2/ml2_conf.ini \
      revision -m "funky new database migration" \
      --autogenerate
      No handlers could be found for logger "neutron.common.legacy"
      INFO  [alembic.migration] Context impl MySQLImpl.
      INFO  [alembic.migration] Will assume non-transactional DDL.
      INFO  [alembic.autogenerate] Detected removed table u'arista_provisioned_tenants'
      INFO  [alembic.autogenerate] Detected removed table u'ml2_vxlan_allocations'
      INFO  [alembic.autogenerate] Detected removed table u'cisco_ml2_nexusport_bindings'
      INFO  [alembic.autogenerate] Detected removed table u'ml2_vxlan_endpoints'
      INFO  [alembic.autogenerate] Detected removed table u'arista_provisioned_vms'
      INFO  [alembic.autogenerate] Detected removed table u'ml2_flat_allocations'
      INFO  [alembic.autogenerate] Detected removed table u'routes'
      INFO  [alembic.autogenerate] Detected removed table u'cisco_ml2_credentials'
      INFO  [alembic.autogenerate] Detected removed table u'ml2_gre_allocations'
      INFO  [alembic.autogenerate] Detected removed table u'ml2_vlan_allocations'
      INFO  [alembic.autogenerate] Detected removed table u'servicedefinitions'
      INFO  [alembic.autogenerate] Detected removed table u'servicetypes'
      INFO  [alembic.autogenerate] Detected removed table u'arista_provisioned_nets'
      INFO  [alembic.autogenerate] Detected removed table u'ml2_gre_endpoints'
        Generating /home/mikal/src/openstack/neutron/neutron/db/migration/alembic_migrations/
      versions/297033515e04_funky_new_database_m.py...done
      


    This command has allocated us a migration id, in this case 297033515e04. Oddly, the autogenerated template migration drops all of the tables for the ml2 driver, which is a surprising choice of default.

    There are a bunch of interesting headers in the migration python file which you need to know about:

      """funky new database migration
      
      Revision ID: 297033515e04
      Revises: havana
      Create Date: 2013-11-04 17:12:31.692133
      
      """
      
      # revision identifiers, used by Alembic.
      revision = '297033515e04'
      down_revision = 'havana'
      
      # Change to ['*'] if this migration applies to all plugins
      
      migration_for_plugins = [
          'neutron.plugins.ml2.plugin.Ml2Plugin'
      ]
      


    The developer README then says that you can check your migration is linear with this command:

      $ neutron-db-manage --config-file /etc/neutron/neutron.conf \
      --config-file /etc/neutron/plugins/ml2/ml2_conf.ini check_migration
      


    In my case it is fine because I'm awesome. However, it is also a little worrying that you need a tool to hold your hand to verify this, because it's too hard to read through the migrations and verify it yourself.

    So how does alembic go with addressing the concerns we have with the nova database migrations? Well, alembic is actively maintained by an upstream other than the OpenStack developers, so it addresses that concern. I should also say that alembic is obviously already in use by other OpenStack projects, so I think it would be a big ask to move to something other than alembic.

    Alembic does allow linear migrations as well, but they're not enforced by the tool itself (in other words, non-linear migrations are supported by the tooling). That means there's another layer of checking required of developers in order to maintain a linear migration stream, and I worry that will introduce another area in which we can make errors and accidentally end up with non-linear migrations. In fact, in the example of multiple patches competing to be the next in line, alembic is worse, because the headers in the migration file would need to be updated to ensure that linear migrations are maintained.
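
    To make that concrete, if two proposed migrations both declare down_revision = 'havana' and one of them merges first, the second author has to come back and re-point their header at the winner's revision id. Something like this (the revision hash here is made up):

      # before: both patches chained off the same parent
      down_revision = 'havana'

      # after the other patch (revision 'deadbeef1234') merges first
      down_revision = 'deadbeef1234'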

    Conclusion

    I'm still not convinced alembic is a good choice for nova, but I look forward to a lively discussion at the design summit about this.

    Tags for this post: openstack icehouse migrate alembic db migrations
    Related posts: Exploring a single database migration; On Continuous Integration testing for Nova DB

posted at: 22:52 | path: /openstack/icehouse | permanent link to this entry


Sat, 02 Nov 2013



On Continuous Integration testing for Nova DB

    To quote Homer Simpson: "All my life I've had one dream, to achieve my many goals."

    One of my more recent goals is a desire to have real continuous integration testing for database migrations in Nova. You see, at the moment, database migrations can easily make upgrades painful for deployers, normally by taking a very long time to run. This is partially because we test on trivial datasets on our laptops, but it is also because it is hard to predict the scale of the various dimensions in the database -- for example: perhaps one deployment has lots of instances; whilst another might have a smaller number of instances but a very large number of IP addresses.

    The team I work with at Rackspace Australia has therefore been cooking up a scheme to try and fix this. For example, Josh Hesketh has been working on what we call Turbo Hipster, which he has blogged about. We've started off with a prototype to prove we can get meaningful testing results, which is running now.

    Since we finished the prototype we've been working on a real implementation, which is known as Turbo Hipster. I know it's an odd name, but we couldn't decide what to call it, so we just took a suggestion from the github project namer. It's just an added advantage that the OpenStack Infra team think that the name is poking fun at them. Turbo Hipster reads the gerrit event stream, and then uses our own zuul to run tests and report results to gerrit. We need our own zuul because we want to be able to offer federated testing later, and it isn't fair to expect the Infra team to manage that for us. There's nothing special about the tests we're running; our zuul is capable of running other tests if people are interested in adding more, although we'd have to talk about whether it makes more sense for you to just run your own zuul.

    Generally I keep an eye on the reports and let developers know when there are problems with their patchset. I don't want to link to where the reports live just yet, because right now there are some problems which stop me from putting our prototype in a public place. Consider a migration that takes some form of confidential data out of the database and just logs it. Sure, we'd pick this up in code review, but by then we might have published test logs with confidential information. This is especially true because we want to be able to run tests against real production databases, both ones donated to run on our test infrastructure and ones where a federated worker is running somewhere else.

    We have therefore started work on a database anonymization tool, which we named Fuzzy Happiness (see earlier comment about us being bad at naming things). This tool takes markup in the sqlalchemy models file and uses that to decide what values to anonymize (and how). Fuzzy Happiness is what prompted me to write this blog post: Nova reviewers are about to see a patch with strange markup in it, and I wanted something to point at to explain what we're trying to do.
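
    I won't pre-empt the patch itself, but I imagine the markup looking something like annotations on columns in the models, perhaps via sqlalchemy's column info dictionary. This is purely illustrative and not the actual syntax:

      # hypothetical sketch of anonymization markup on a nova model
      display_name = Column(String(255), info={'anonymize': 'random_string'})
      host = Column(String(255), info={'anonymize': 'hostname'})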

    Once we have anonymization working there is one last piece we need, which is database scaling. Perhaps the entire size of your database gives away things you don't want leaked into gerrit. This tool is tentatively codenamed Elastic Duckface, and we'll tell you more about it just as soon as we've written it.

    I'd be very interested in comments on any of this work, so please do reach out if you have thoughts.

    Tags for this post: openstack turbo_hipster fuzzy_happiness db ci anonymization
    Related posts: Comparing alembic with sqlalchemy migrate; Nova database continuous integration

posted at: 13:10 | path: /openstack | permanent link to this entry


Fri, 02 Aug 2013



Exploring a single database migration

    Yesterday I was having some troubles with a database migration downgrade step, and Joshua Hesketh suggested I step through the migrations one at a time and see what they were doing to my sqlite test database. That's a great idea, but it wasn't immediately obvious to me how to do it. Now that I've figured out the steps required, I thought I'd document them here.

    First off we need a test environment. I'm hacking on nova at the moment, and tend to build throw away test environments in the cloud because it's cheap and easy. So, I created a new Ubuntu 12.04 server instance in Rackspace's Sydney data center, and then configured it like this:

      $ sudo apt-get update
      $ sudo apt-get install -y git python-pip git-review libxml2-dev libxml2-utils \
          libxslt-dev libmysqlclient-dev pep8 postgresql-server-dev-9.1 python2.7-dev \
          python-coverage python-netaddr python-mysqldb python-git virtualenvwrapper \
          python-numpy sqlite3
      $ source /etc/bash_completion.d/virtualenvwrapper
      $ mkvirtualenv migrate_204
      $ toggleglobalsitepackages
      


    Simple! I should note here that we probably don't need the virtualenv because this machine is disposable, but it's still a good habit to be in. Now I need to fetch the code I am testing. In this case it's from my personal fork of nova, and the git location to fetch will obviously change for other people:

      $ git clone http://github.com/mikalstill/nova
      


    Now I can install the code under test. This will pull in a bunch of pip dependencies as well, so it takes a little while:

      $ cd nova
      $ python setup.py develop
      


    Next we have to configure nova because we want to install specific database schema versions.

      $ sudo mkdir /etc/nova
      $ sudo vim /etc/nova/nova.conf
      $ sudo chmod -R ugo+rx /etc/nova
      


    The contents of my nova.conf looks like this:

      $ cat /etc/nova/nova.conf
      [DEFAULT]
      sql_connection = sqlite:////tmp/foo.sqlite
      


    Now I can step up to the version before the one I am testing:

      $ nova-manage db sync --version 203
      


    You do the same thing but with a different version number to step somewhere else. It's also pretty easy to get the schema for a table under sqlite. I just do this:

      $ sqlite3 /tmp/foo.sqlite
      SQLite version 3.7.9 2011-11-01 00:52:41
      Enter ".help" for instructions
      Enter SQL statements terminated with a ";"
      sqlite> .schema instances
      CREATE TABLE "instances" (
              created_at DATETIME,
              updated_at DATETIME,
      [...]
      


    So there you go.

    Disclaimer -- I wouldn't recommend upgrading to a specific version like this for real deployments, because the models in the code base won't match the tables. If you wanted to do that you'd need to work out what git commit added the version after the one you've installed, and then check out the commit before that commit.
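
    Something like this would do it, assuming you're looking for the commit which added migration 204 (the commit hash shown is made up):

      $ git log --oneline --diff-filter=A -- \
          nova/db/sqlalchemy/migrate_repo/versions/204_*
      abc1234 Add the migration for some funky feature
      $ git checkout abc1234^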

    Tags for this post: openstack tips rackspace nova database migrations sqlite
    Related posts: Merged in Havana: fixed ip listing for single hosts; Merged in Havana: configurable iptables drop actions in nova; Michael's surprisingly unreliable predictions for the Havana Nova release; Havana Nova PTL elections; Upgrade problems with the new Fixed IP quota; Nova database continuous integration

posted at: 18:37 | path: /openstack/tips | permanent link to this entry


Wed, 03 Jul 2013



Nova database continuous integration

    I've had some opportunity recently to spend a little quality time off line, and I spent some of that time working on a side project I've wanted to do for a while -- continuous integration testing of nova database migrations. Now, the code isn't perfect at the moment, but I think it's an interesting direction to take and I will keep pursuing it.

    One of the problems nova developers have is that we don't have a good way of determining whether a database migration will be painful for deployers. We can eyeball code reviews, but whether code looks reasonable or not, it's still hard to predict how it will perform on real data. Continuous integration is the obvious solution -- if we could test patch sets on real databases as part of the code review process, then reviewers would have more data about whether to approve a patch set or not. So I did that.

    At the moment the CI implementation I've built isn't posting to code reviews, but that's because I want to be confident that the information it gathers is accurate before wasting other reviewers' time. You can see results at openstack.stillhq.com/ci. For now, I am keeping an eye on the test results and posting manually to reviews when an error is found -- that has happened twice so far.

    The CI tests work by restoring a MySQL database to a known good state, then upgrading that database from Folsom to Grizzly (if needed). They then run the upgrades already committed to trunk, and then the proposed patch set. Timings for each step are reported -- for example with my biggest test database the upgrade from Folsom to Grizzly takes between about 7 and 9 minutes to run, which isn't too bad. You can see an example log here.
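
    The rough shape of a single run is something like the following. This is an illustrative sketch rather than the actual worker code; the dataset file, database name and review number are made up:

      # restore the donated dataset to a known good state
      $ mysql -u root nova_ci < known_good_dataset.sql

      # apply the migrations already committed to trunk, timing the run
      $ time nova-manage db sync

      # fetch the proposed patch set and apply its migrations on top
      $ git review -d 12345
      $ time nova-manage db sync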

    I'd be interested to know if anyone else has sample databases they'd like to see checks run against. If so, reach out to me and we can make it happen.

    Tags for this post: openstack rackspace database ci mysql
    Related posts: MythBuntu 8.10 just made me sad; Exploring a single database migration; Time to document my PDF testing database; Managing MySQL the Slack Way: How Google Deploys New MySQL Servers; I won a radio shark and headphones!; Conference Wireless not working yet?

posted at: 03:30 | path: /openstack | permanent link to this entry


Fri, 26 Apr 2013



Merged in Havana: fixed ip listing for single hosts

posted at: 00:56 | path: /openstack/havana | permanent link to this entry


Fri, 19 Apr 2013



Michael's surprisingly unreliable predictions for the Havana Nova release

    I should start out by saying that because OpenStack is an open source project, it is hard to know exactly what will land in Havana -- the developers are volunteers, and sometimes things get in the way of them doing the work they intended. However, these are the notes I wrote up on the high points of the summit for me -- I didn't see all the same sessions as other nova developers, so hopefully others will pitch in with their notes as well.

    Scheduler

    The scheduler seems to be a point of planned work for a lot of people in this release, with talk about having more scheduling code in the common library, and of adding new filter types. There is definite interest in being able to schedule by methods we don't currently support -- things like rack or PDU diversity, or trying to collocate a tenant's machines together. HP is also interested in being able to sell dedicated machines to tenants -- in other words, they would guarantee that only one tenant's instances appeared on a machine in return for a fee. At the moment this requires setting up a host aggregate for the tenant.

    Feeding additional data into scheduling decisions

    There is also interest in being able to feed more scheduling information to the nova-scheduler. For example, ceilometer intends to start collecting monitoring data from nova-compute nodes, and perhaps it might inform nova-scheduler that a machine is running hot or has a degraded RAID array. This might also be the source of PDU or CRAC failure information which might affect scheduling decisions -- these latter two examples are interesting because they are information where it doesn't make sense to get it from the compute node; the correct location for this information is a data center wide system, not an individual machine. There is concern about nova-scheduler depending on other systems, so these updates from ceilometer will probably be advisory updates, with nova-scheduler degrading gracefully if they are not present or are stale.

    Mothballing

    This was almost instantly renamed to "shelving", but "swallow / spew" was also considered. This is a request that Rackspace sees from customers -- basically the ability to stop a virtual machine, but keep the UUID and IP addresses associated with the machine as well as the block device mapping. The proposal is to implement this as a snapshot of the machine, and a new machine state. The local disk files for the instance might get deleted if the resources are needed. This would feel like a reboot of an instance to a user.

    This is of interest for workloads like "Black Friday" web servers. You could bring a whole bunch up, configure security groups, load balancers, and the applications on the instances and then shelve the instance. When you need the instance to handle load, you'd then unshelve the instance and once it was booted it would just magically start serving. Expect to see shelved instances be cheaper than a running instance, but not free. This is mostly because IP addresses are scarce. Restarting a shelved instance might take a while if the snapshot has to be fetched to a compute node. If you need a more "instant on" bursting capacity, then just leave instances idling and pay full price.

    Deferred instance file delete

    This is a nice to have requirement for shelving instances, but it is useful for other things as well. This is the ability to delay the deletion of instance files when an instance is torn down. This might end up being expressed as "keep these files for at least X days, unless you are tight on disk resources". I can see other reasons this would be useful -- for example helping support people rescue data from instances users tore down and now want back. It also defers the disk IO from deleting the files until it's absolutely necessary. We could also perhaps detect times when the disks are "relatively idle" and use those to clean up file systems.

    DNS in nova-network

    Expect to see the current DNS driver removed, as no one uses it as best as we can tell. This will be replaced with a simpler driver in nova-compute and the recommendation that deployers use quantum DNS if possible.

    Quantum

    There is continued work on making quantum the default networking engine for nova. There are still some missing features, but the list of absolutely blocking features is getting smaller. A lot of discussion centered around how to live upgrade clouds from nova-network to quantum. This is not an easy problem, but smart people are looking at it. The solution might involve moving compute nodes over to quantum, and then live migrating instances over to those compute nodes. However, we currently only support one network driver at a time in nova, so we will need to change some code here.

    Long running periodic tasks

    There will be a refactor of the periodic task code in nova this release to move periodic tasks which incur a lot of blocking IO into separate processes. These processes will be launched by nova-compute, and not be cron jobs or something like that. Most of the discussion was around how to do this safely (eventlet makes it exciting), which is nice in that it indicates some level of consensus that this is needed. The plan for now is to do this in nova-compute, but leave other nova components for later releases.

    Libvirt changes

    Libvirt is the compute driver I work on, so it's the only one I want to comment on here. The other drivers are doing interesting things as well, I just don't want to get details wrong by not understanding their efforts.

    First off, there should be some work done on better console logging in Havana. At the moment we use an unbounded file on disk. This will hopefully become a Unix domain socket managing a ring buffer of some form. The Unix domain socket leaves the option open of later making this serial console interactive, but that's not an immediate goal.

    There was a lot of talk about LXC support, and how we need to support file system attachments as well as block devices. There is also some cleanup that can be done for the LXC support in the libvirt to make the code cleaner, but it is not clear who will work on this.

    imagebackend.py will probably get refactored, but in ways that don't make a big difference to users but make it easier to code against (and therefore more reliable). I'm including it here just because I'm excited about that refactor making this code easier to understand.

    There was a lot of talk about live migration and the requirement for ssh between compute nodes. Operators don't love that compute nodes can talk to each other, but expect Havana to include some sort of on demand ssh key management, and a later release to proxy that traffic through something like nova-conductor.

    Incremental backups are of interest to deployers as well, but there is concern that glance needs more support for chains of images before we can do that.

    Conclusion

    The summit was fantastic once again, and the Foundation did an awesome job of hosting it. It was however a pretty tiring experience, and I'm sure I got some stuff here wrong, or missed things that others would consider important. It would be cool for other developers to write up summaries of what they saw at the summit as well.

    Tags for this post: openstack havana rackspace summit nova summary prediction
    Related posts: Merged in Havana: fixed ip listing for single hosts; Merged in Havana: configurable iptables drop actions in nova; Exploring a single database migration; Havana Nova PTL elections; Upgrade problems with the new Fixed IP quota; Faster pip installs

posted at: 23:20 | path: /openstack/havana | permanent link to this entry


Tue, 16 Apr 2013



Getting started with OpenStack development

posted at: 14:54 | path: /openstack | permanent link to this entry


Sun, 07 Apr 2013



Faster pip installs

posted at: 21:33 | path: /openstack/tips | permanent link to this entry


Sat, 30 Mar 2013



Merged in Havana: configurable iptables drop actions in nova

    LaunchPad bug 1013893 asked nicely if the drop action for iptables rules created by nova-network could be configured. The idea here is that you might want to do something other than a plain old drop -- for example logging before dropping. This has now been implemented in Havana.

    To configure the drop action, set the iptables_drop_action flag to the name of an already existing iptables target. Creating this target is not managed by nova, and you'll need to do it on every compute node. When nova creates or deletes rules on compute nodes it will now use this new target. There's a bit of an upgrade problem here in that this will stop nova from deleting rules which use the old hard coded drop target. However, if an instance is torn down then all of its chains are torn down as well and rules will be deleted correctly, so this is only a problem if a security group is changed while the instance is running.
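
    A minimal sketch of what that might look like, assuming you want to log before dropping (the chain name here is arbitrary):

      # on each compute node, create the target nova will jump to
      $ sudo iptables -N nova-log-drop
      $ sudo iptables -A nova-log-drop -j LOG --log-prefix "nova-drop: "
      $ sudo iptables -A nova-log-drop -j DROP

      # then point nova at it in nova.conf
      iptables_drop_action = nova-log-drop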

    It occurs to me that we can do better here, so I've sent off this review to handle the case where a rule which used the old default drop action is being removed.

    For safety, I would recommend only using this flag on new compute nodes that have no instances running, in order to keep things simple.

    Tags for this post: openstack havana nova iptables rackspace
    Related posts: Merged in Havana: fixed ip listing for single hosts; Michael's surprisingly unreliable predictions for the Havana Nova release; Exploring a single database migration; Havana Nova PTL elections; Upgrade problems with the new Fixed IP quota; Faster pip installs

posted at: 21:13 | path: /openstack/havana | permanent link to this entry


Upgrade problems with the new Fixed IP quota

    In the last few weeks a new quota has been added to Nova covering Fixed IPs. This was done in response to LaunchPad bug 1125468, which was disclosed as CVE 2013-1838.

    To be honest I think there are some things the vulnerability management team learned the hard way with this disclosure. For example, we didn't realize that we needed to update python-novaclient to allow users to set the quota, or that adding a quota would require changes in Horizon. Both of these errors have been corrected.

    More importantly, the default value of the new quota was set to 10. I made this decision based on the default value of the instances quota coupled with a desire to protect deployments from denial of service. However, this decision, combined with a failure to explicitly call out the new quota in the release notes for the Folsom stable release, has resulted in some deployers experiencing upgrade problems. This was drawn to our attention by LaunchPad bug 1161190.

    We have therefore moved to set the default quota for fixed IPs to unlimited. If you want to protect yourself from a potential DoS, then you should seriously consider changing this default value in your deployment. This can be done with the quota_fixed_ips flag. The code reviews implementing this change are either merged, or under review depending on the release. At the time of writing, Havana and Grizzly have a fix merged, with Folsom and Essex still under review.
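
    For example, to keep some denial of service protection rather than going fully unlimited, you could set something like this in nova.conf (the value here is just illustrative):

      # illustrative value; the new default is unlimited
      quota_fixed_ips = 40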

    I think this experience also reinforces the importance of testing all upgrades in a lab environment before doing them in production.

    Sorry for any inconvenience caused.

    Tags for this post: openstack nova quota fixed_ip vmt cve denial_of_service rackspace
    Related posts: Merged in Havana: fixed ip listing for single hosts; Merged in Havana: configurable iptables drop actions in nova; Michael's surprisingly unreliable predictions for the Havana Nova release; Exploring a single database migration; Havana Nova PTL elections; Faster pip installs

posted at: 16:11 | path: /openstack | permanent link to this entry


Wed, 13 Mar 2013



Havana Nova PTL elections

posted at: 08:34 | path: /openstack | permanent link to this entry


Fri, 04 Jan 2013



OpenStack at linux.conf.au 2013

    As some of you might know, I'm the Director for linux.conf.au 2013. I've tried really hard to not use my powers for evil and make the entire conference about OpenStack -- in fact I haven't pulled rank and demanded that specific content be included at all. However, the level of interest in OpenStack has grown so much since LCA 2012 that there is now a significant amount of OpenStack content in the conference without me having to do any of that.

    I thought I'd take a second to highlight some of the OpenStack content that I think is particularly interesting -- these are the talks I'll be going to if I have the time (which remains to be seen):

    Monday
    • Cloud Infrastructure, Distributed Storage and High Availability Miniconf: while not specifically about OpenStack, this miniconf is going to be a good warm up for all things IaaS at the conference. Here's a list of the talks for that miniconf:
      • Delivering IaaS with Apache CloudStack - Joe Brockmeier
      • oVirt - Dan Macpherson
      • Aeolus - Dan Macpherson
      • Ops: From bare metal to cloud space - Phil Ingram
      • VMs on VLANs on Bridges on Bonds on many NICs - Kim Hawtin
      • OpenStack Swift Overview - John Dickinson
      • JORN and the rise and fall of clustering - Jamie Birse
      • MongoDB Replication & Replica Sets - Stephen Steneker
      • MariaDB Galera Cluster - Grant Allen
      • The Grand Distributed Storage Debate: GlusterFS and Ceph going head to head - Florian Haas, Sage Weil, Jeff Darcy


    Tuesday
    • The OpenStack Miniconf: this is a mostly-clear winner for Tuesday. Tristan Goode has been doing a fantastic job of organizing this miniconf, which might not be obvious to people who haven't been talking to him a couple of times a week about its progress like me. I think people will be impressed with the program, which includes:
      • Welcome and Introduction - Tristan Goode
      • Introduction to OpenStack - Joshua McKenty
      • Demonstration - Sina Sadeghi
      • NeCTAR Research Cloud: OpenStack in Production - Tom Fifield
      • Bare metal provisioning with OpenStack - Devananda van der Veen
      • Intro to Swift for New Contributors - John Dickinson
      • All-around OpenStack storage with Ceph - Florian Haas
      • Writing API extensions for Nova - Christopher Yeoh
      • The OpenStack Metering Project - Angus Salkeld
      • Lightweight PaaS on the NCI OpenStack Cloud - Kevin Pulo
      • Enabling Compute Clusters atop OpenStack - Enis Afgan
      • Shared Panel with Open Government
    • The Open Government Miniconf: this is the other OpenStack relevant miniconf on Tuesday. This might seem like a bit of a stretch, but as best as I can tell there is massive interest in government at the moment in deploying cloud infrastructure, and now is the time to be convincing the decision makers that open clouds based on open source are the right way to go. OpenStack has a lot to offer in the private cloud space, and we need to as a community make sure that people are aware of the various options that are out there. This is why there is a shared panel at the end of the day with the OpenStack miniconf.


    Wednesday
      There aren't any OpenStack talks on Wednesday, but I am really hoping that someone will propose an OpenStack BoF via the wiki. I'd sure go to a BoF.


    Thursday
    • Playing with OpenStack Swift by John Dickinson
    • Ceph: Managing A Distributed Storage System At Scale by Sage Weil


    Friday
    • Openstack on Openstack - a single management API for all your servers by Robert Collins
    • Heat: Orchestrating multiple cloud applications on OpenStack using templates by Angus Salkeld and Steve Baker
    • How OpenStack Improves Code Quality with Project Gating and Zuul by James Blair
    • Ceph: object storage, block storage, file system, replication, massive scalability, and then some! by Tim Serong and Florian Haas


    So, if you're interested in OpenStack and haven't considered linux.conf.au 2013 as a conference you might be interested in, now would be a good time to reconsider before we sell out!

    Tags for this post: openstack conference lca2013 rackspace
    Related posts: Contact details for the Canberra LCA 2013 bid; Faster pip installs; Got Something to Say? The LCA 2013 CFP Opens Soon!; First day of setup for lca2013; Merged in Havana: fixed ip listing for single hosts; On conference t-shirts

posted at: 13:07 | path: /openstack | permanent link to this entry


Sat, 22 Dec 2012



Image handlers (in essex)

    George asks in the comments on my previous post about loop and nbd devices an interesting question about the behavior of this code on essex. I figured the question was worth bringing out into its own post so that it's more visible. I've edited George's question lightly so that this blog post flows reasonably.
    Can you please explain the order (and conditions) in which the three methods are used? In my Essex installation, the "img_handlers" is not defined in nova.conf, so it takes the default value "loop,nbd,guestfs". However, nova is using nbd as the chosen method.
    The handlers will be used in the order specified -- with the caveat that loop doesn't support Copy On Write (COW) images and will therefore be skipped if the libvirt driver is trying to create a COW image. Whether COW images are used is configured with the use_cow_images flag, which defaults to True. So, loop is being skipped because you're probably using COW images.
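
    So if you wanted loop to actually be used in that setup, the relevant nova.conf settings would be something along these lines (a sketch only, using the flag names from the question):

      img_handlers = loop,nbd,guestfs
      use_cow_images = False
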
    My ssh keys are obtained by cloud-init, and still whenever I start a new instance I see in the nova-compute.logs this sequence of events:
    qemu-nbd -c /dev/nbd15 /var/lib/nova/instances/instance-0000076d/disk 
    kpartx -a /dev/nbd15 
    mount /dev/mapper/nbd15p1 /tmp/tmpxGBdT0 
    umount /dev/mapper/nbd15p1 
    kpartx -d /dev/nbd15 
    qemu-nbd -d /dev/nbd15 
    
    I don't understand why the mount of the first partition is necessary and what happens when the partition is mounted.
    This is a bit harder than the first bit of the question. What I think is happening is that there are files being injected, and that's causing the mount. Just because the admin password isn't being injected doesn't mean that other things aren't being injected still. You'd be able to tell what's happening by grepping your logs for "Injecting .* into image" and seeing what shows up.
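
    In other words, something like this on the compute node (assuming the default Ubuntu log location):

      $ grep "Injecting .* into image" /var/log/nova/nova-compute.log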

    Tags for this post: openstack loop nbd libvirt file_injection rackspace
    Related posts: Some quick operational notes for users of loop and nbd devices; Faster pip installs; Merged in Havana: fixed ip listing for single hosts; Merged in Havana: configurable iptables drop actions in nova; Michael's surprisingly unreliable predictions for the Havana Nova release; Exploring a single database migration

posted at: 15:51 | path: /openstack | permanent link to this entry


Sat, 15 Dec 2012



Some quick operational notes for users of loop and nbd devices

posted at: 16:28 | path: /openstack | permanent link to this entry


Sat, 08 Dec 2012



Moving on

    Thursday this week is my last day at Canonical. After a little over a year at Canonical, I'm moving on to the private cloud team at Rackspace -- my first day with Rackspace will be the 17th of December. I'm very excited to be joining Rackspace -- I'm excited by the project, the team, and the opportunity to make OpenStack even better. We've also talked about some interesting stuff we'd like to do in the Australian OpenStack community, but I'm going to hold off on talking about that until I've had a chance to settle in.

    I am appreciative of my time at Canonical -- when I joined I was unaware of the existence of OpenStack, and without Canonical I might never have found this awesome project that I love. I also had the chance to work with some really smart people who taught me a lot. This move is about spending more time on OpenStack than Canonical was able to allow.

    Tags for this post: openstack canonical rackspace
    Related posts: Faster pip installs; Taking over a launch pad project; Got Something to Say? The LCA 2013 CFP Opens Soon!; Slow git review uploads?; Merged in Havana: fixed ip listing for single hosts; On conference t-shirts

posted at: 12:56 | path: /openstack | permanent link to this entry


Tue, 10 Jul 2012



A first pass at glance replication

    A few weeks back I was tasked with turning up a new OpenStack region. This region couldn't share anything with existing regions because the plan was to test pre-release versions of OpenStack there, and if we shared something like glance then we would either have to endanger glance for all regions during testing, or not test glance. However, our users already have a favorite set of images uploaded to glance, and I really wanted to make it as easy as possible for them to use the new region -- I wanted all of their images to magically just appear there. What I needed was some form of glance replication.

    I'd sat in on the glance replication session at the Folsom OpenStack Design Summit. The NeCTAR use case at the bottom is exactly what I wanted, so it's reassuring that other people wanted something like that too. However, no one was working on this feature. So I wrote it. In fact, because of the code review process I wrote it twice, but let's not dwell on that too much.

    So, as of change id I7dabbd6671ec75a0052db58312054f611707bdcf there is a very simple replicator script in glance/bin. It's not perfect, and I expect it will need to be extended a bunch, but it's a start at least and I'm using it in production now, so I am relatively confident it's not totally wrong.




    The replicator supports the following commands at the moment:

    livecopy
    glance-replicator livecopy fromserver:port toserver:port
    
        Load the contents of one glance instance into another.
    
        fromserver:port: the location of the master glance instance.
        toserver:port:   the location of the slave glance instance.
    


    This is the main meat of the replicator. Take a copy of the fromserver, and dump it onto the toserver. Only images visible to the user running the replicator will be copied if you're using Keystone. Only images active on fromserver are copied across. The copy is done "on-the-wire", so there are no large temporary files on the machine running the replicator to clean up.

    dump
    glance-replicator dump server:port path
    
        Dump the contents of a glance instance to local disk.
    
        server:port: the location of the glance instance.
        path:        a directory on disk to contain the data.
    


    Do the same thing as livecopy, but dump the contents of the glance server to a directory on disk. This includes meta data and image data, and this directory is probably going to be quite large so be prepared.

    load
    glance-replicator load server:port path
    
        Load the contents of a local directory into glance.
    
        server:port: the location of the glance instance.
        path:        a directory on disk containing the data.
    


    Load a directory created by the dump command into a glance server. dump / load was originally written because I had two glance servers who couldn't talk to each other over the network for policy reasons. However, I could dump the data and move it to the destination network out of band. If you had a very large glance installation and were bringing up a new region at the end of a slow link, then this might be something you'd be interested in.

    compare
    glance-replicator compare fromserver:port toserver:port
    
        Compare the contents of fromserver with those of toserver.
    
        fromserver:port: the location of the master glance instance.
        toserver:port:   the location of the slave glance instance.
    


    What would a livecopy do? The compare command will show you the differences between the two servers, so it's a bit like a dry run of the replication.

    size
    glance-replicator size server:port
    
        Determine the size of a glance instance if dumped to disk.
    
        server:port: the location of the glance instance.
    


    The size command will tell you how much disk is going to be used by image data in either a dump or a livecopy. It doesn't however know about redundancy costs with things like swift, so it just gives you the raw number of bytes that would be written to the destination.




    The glance replicator is very new code, so I wouldn't be too surprised if there are bugs out there or obvious features that are lacking. For example, there is no support for SSL at the moment. Let me know if you have any comments or encounter problems using the replicator.

    Tags for this post: openstack glance replication multi-region canonical
    Related posts: Further adventures with base images in OpenStack; Openstack compute node cleanup; Taking over a launch pad project; Got Something to Say? The LCA 2013 CFP Opens Soon!; Slow git review uploads?; On conference t-shirts

posted at: 16:09 | path: /openstack | permanent link to this entry


Tue, 10 Apr 2012



Folsom Dev Summit sessions

    I thought I should write up the dev summit sessions I am hosting now that the program is starting to look solid. This is mostly for my own benefit, so I have a solid understanding of where to start these sessions off. Both are short brainstorm sessions, so I am not intending to produce slide decks or anything like that. I just want to make sure there is something to kick discussion off.

    Image caching, where to from here (nova hypervisors)

    As of essex libvirt has an image cache to speed startup of new instances. This cache stores images direct from glance, as well as resized images. There is a periodic task which cleans up images in the cache which are no longer needed. The periodic task can also optionally detect images which have become corrupted on disk.

    So first off, do we want to implement this for other hypervisors as well? As mentioned in a recent blog post I'd like to see the image cache manager become common code and have all the hypervisors deal with this in exactly the same manner -- that makes it easier to document, and means that on-call operations people don't need to determine what hypervisor a compute node is running before starting to debug. However, that requires the other hypervisor implementations to change how they stage images for instance startup, and I think it bears further discussion.

    Additionally, the blueprint (https://blueprints.launchpad.net/nova/+spec/nova-image-cache-management) proposed that popular / strategic images could be pre-cached on compute nodes. Is this something we still want to do? What factors do we want to use for the reference implementation? I have a few ideas here that are listed in the blueprint, but most of them require talking to glance to implement. There is some hesitance in adding glance calls to a periodic task, because in a keystone'd implementation that would require an admin token in the nova configuration file. Is there a better way to do this, or is it ok to rely on glance in a periodic task?

    Ops pain points (nova other)

    Apart from my own ideas (better instance logging for example), I'm very interested in hearing from other people about what we can do to make nova easier for ops people to run. This is especially true for relatively easy to implement things we can get done in Folsom. This blueprint for deployer friendly configuration files is a good example of changes which don't look too hard to implement, but that would make the world a better place for opsen. There are many other examples of blueprints in this space.



    What else can we be doing to make life better for opsen? I'm especially interested in getting people who actually run openstack in the wild into the room to tell us what is painful for them at the moment.

    Tags for this post: openstack canonical folsom image_cache_management sre
    Related posts: Reflecting on Essex; Further adventures with base images in OpenStack; Openstack compute node cleanup; Managing MySQL the Slack Way: How Google Deploys New MySQL Servers; I won a radio shark and headphones!; Conference Wireless not working yet?

posted at: 17:25 | path: /openstack | permanent link to this entry


Thu, 05 Apr 2012



Reflecting on Essex

    This post is kind of long, and a little self indulgent. However, I really wanted to spend some time thinking about what I did for the Essex release cycle, and what I want to do for the Folsom release. I spent Essex mostly hacking on things in isolation, except for when Padraig Brady and I were hacking in a similar space. I'd like to collaborate more for Folsom, and I'm hoping talking about what I'm interested in doing in public might help with that.

    I came relatively late to the Essex development cycle, having never even heard of OpenStack before joining Canonical. We can talk about how I'd worked in the cloud space for six years and yet wasn't aware of the open source implementations at some other time.

    My initial introduction to OpenStack was being paged for compute nodes which were continually running out of disk. I googled around a bit and discovered that cached images for instances were never cleaned up (to start an instance, an image is fetched from glance, possibly has its format converted, and is resized, and then an instance is started with that resulting image; none of those images were ever cleaned up). I filed bug 904532 as my absolute first interaction with the OpenStack community. Scott Moser kindly pointed me at the blueprint for how to actually fix the problem.

    (Remind me if Phil Day comes to the OpenStack developer summit that I should sit down with him at some point and see how close what was actually implemented got to what he wrote in that blueprint. I suspect we've still got a fair way to go, but I'll talk more about that later in this post).

    This was a pivotal moment. I'd just spent the last six years writing python code to manage largish cloud clusters, and here was a bug which was hurting me in a python package intended to manage clusters very similar to those I had been running. I should just fix the bug, right?

    It turns out that the OpenStack core developers are super easy to work with. I'd say that the code review process certainly feels like it was modelled on Google's, but in general the code reviewers are nicer with their comments than what I'm used to. This makes it much easier to motivate yourself to go and spend some more time hacking than a deeply negative review would. I think Vish is especially worthy of a shout out as being an amazing person to work with. He's helpful, patient, and very smart.

    In the end I wrote the image cache manager which ships in Essex. It's not perfect, but it's a lot better than what came before, and it's a good basis to build on. There is some remaining tech debt for image cache management which I intend to work on for Folsom. First off, the image cache only works for libvirt instances at the moment. I'd like to pull all the other hypervisors into line as best as possible. There are hooks in the virtualization driver for this, but no one has started this work as best as I am aware. To be completely honest I'd like to see the image cache manager become common code and have all the hypervisors deal with this in exactly the same manner -- that makes it easier to document, and means that on-call operations people don't need to determine what hypervisor a compute node is running before starting to debug. This is something I very much want to sit down with other nova developers and talk about at the summit.

    The next step for image cache management is tracked in a very bare bones blueprint. The original blueprint envisaged that it would be desirable to pre-cache some images on all nodes. For example, a cloud host might want to offer slightly faster startup times for some images by ensuring they are pre-cached. I've been thinking about this a lot, and I can see other use cases here as well. For example, if you have mission critical instances and you wanted to tolerate a glance failure, then perhaps you want to pre-cache a class of images that serve those mission critical instances. The intention is to provide an interface and default implementation for the pre-caching logic, and then let users go wild working out their own requirements.

    The hardest bit of the pre-caching will be reducing the interactions with glance, I suspect. The current feeling is that calling glance from a periodic task is a bit scary, and has been actively avoided for Essex. This is especially true if Keystone is enabled, as the periodic task won't have an admin context unless we pull that from the config file. However, if you're trying to determine what images are mission critical, then you really need to talk to glance. I guess another option would be to have a table of such things in nova's database, but that feels wrong to me. We're going to have to talk about this a bit more.

    (It would be interesting as well to talk about the relative priority of instances as well. If a cluster is experiencing outages, then perhaps some customers would pay more to have their instances be the last killed off or something. Or perhaps I have instances which are less critical than others, so I want the cluster to degrade in an understood manner.)

    That leads logically onto a scheduler change I would like to see. If I have a set of compute nodes I know already have the image for a given instance, shouldn't I prefer to start instances on those nodes instead of fetching the image to yet more compute nodes? In fact, if I already have a correctly resized COW base image for an instance on a given node, then it would make sense to run a new instance on that node as well. We need to be careful here, because you wouldn't want to run all of a given class of instance on a small set of compute nodes, but if the image was something like a default Ubuntu image, then it would make sense. I'd be interested in hearing what other people think of doing something like this.

    Another thing I've tried to focus on for Essex is making OpenStack easier for operators to run. That started off relatively simply, by adding an option for log messages to specify what instance a message relates to. This means that when a user queries the state of their instance, the admin can now just grep for the instance UUID, and run from there. It's not perfect yet, in that not all messages use this functionality, but that's some tech debt that I will take on in Folsom. If you're a nova developer, then please pass instance= in your log messages where relevant!
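
    That looks something like this (a sketch only; as noted below it currently expects the full instance dict, not just a UUID):

      LOG.debug(_("Instance rebooted successfully"), instance=instance)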

    This logging functionality isn't perfect, because if you only have the instance UUID in the method you're writing, it won't work. It expects full instance dicts because of the way the formatting code works. This is kind of ironic in that the default logging format only includes the UUID. In Folsom I'll also extend this code so that the right thing happens with UUIDs as well.

    Another simple logging tweak I wrote is that tracebacks now have the time and instance included in them. This makes it much easier for admins to determine the context of a traceback in their logs. It should be noted that both of these changes were relatively trivial, but trivial things can often make it much easier for others.

    There are two sessions at the Folsom dev summit talking about how to make OpenStack easier for operators to run. One was from me, and the other is from Duncan McGreggor. Neither has been accepted yet, but if I notice that Duncan's was accepted I'll drop mine. I'm very very interested in what operations staff feel is currently painful, because having something which is easy to scale and manage is vital to adoption. This is also the core of what I did at Google, and I feel I can make a real contribution here.

    I know I've come relatively late to the OpenStack party, but there's heaps more to do here and I'm super enthused to be working on code that I can finally show people again.

    Tags for this post: openstack canonical essex folsom image_cache_management sre
    Related posts: Folsom Dev Summit sessions; Further adventures with base images in OpenStack; Openstack compute node cleanup; Managing MySQL the Slack Way: How Google Deploys New MySQL Servers; I won a radio shark and headphones!; Conference Wireless not working yet?

posted at: 18:19 | path: /openstack | permanent link to this entry


Fri, 03 Feb 2012



Wow, qemu-img is fast

    I wanted to determine if it's worth putting ephemeral images into the libvirt cache at all. How expensive are these images to create? They don't need to come from the image service, so it can't be too bad, right? It turns out that qemu-img is very very fast at creating these images, based on the very small data set of my laptop with an ext4 file system...

      mikal@x220:/data/temp$ time qemu-img create -f raw disk 10g
      Formatting 'disk', fmt=raw size=10737418240 
      
      real	0m0.315s
      user	0m0.000s
      sys	0m0.004s
      
      mikal@x220:/data/temp$ time qemu-img create -f raw disk 100g
      Formatting 'disk', fmt=raw size=107374182400 
      
      real	0m0.004s
      user	0m0.000s
      sys	0m0.000s
      


    Perhaps this is because I am using ext4, which does funky extents things when allocating blocks. However, the only ext3 file system I could find at my place is my off site backup disks, which are USB3 attached instead of the SATA2 that my laptop uses. Here's the number from there:

      $ time qemu-img create -f raw disk 100g
      Formatting 'disk', fmt=raw size=107374182400 
      
      real	0m0.055s
      user	0m0.000s
      sys	0m0.004s
      


    So still very very fast. Perhaps it's the mkfs that's slow? Here's a run of creating an ext4 file system inside that 100gb file I just made on my laptop:

      $ time mkfs.ext4 disk 
      mke2fs 1.41.14 (22-Dec-2010)
      disk is not a block special device.
      Proceed anyway? (y,n) y
      warning: Unable to get device geometry for disk
      Filesystem label=
      OS type: Linux
      Block size=4096 (log=2)
      Fragment size=4096 (log=2)
      Stride=0 blocks, Stripe width=0 blocks
      6553600 inodes, 26214400 blocks
      1310720 blocks (5.00%) reserved for the super user
      First data block=0
      Maximum filesystem blocks=0
      800 block groups
      32768 blocks per group, 32768 fragments per group
      8192 inodes per group
      Superblock backups stored on blocks: 
      	32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208, 
      	4096000, 7962624, 11239424, 20480000, 23887872
      
      Writing inode tables: done                            
      Creating journal (32768 blocks): done
      Writing superblocks and filesystem accounting information: done
      
      This filesystem will be automatically checked every 36 mounts or
      180 days, whichever comes first.  Use tune2fs -c or -i to override.
      
      real	0m4.083s
      user	0m0.096s
      sys	0m0.136s
      


    That time includes the time it took me to hit the 'y' key, as I couldn't immediately find a flag to stop prompting.
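
    For what it's worth, I believe the prompt can be skipped with the force flag, which would give a slightly fairer timing:

      $ time mkfs.ext4 -F disk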

    In conclusion, there is nothing slow here. I don't see why we'd want to cache ephemeral disks and use copy on write for them at all. It's very cheap to just create a new one each time, and it makes the code much simpler.

    Tags for this post: openstack qemu ephemeral mkfs swap speed canonical
    Related posts: Further adventures with base images in OpenStack; Openstack compute node cleanup; Taking over a launch pad project; Speed limit; Got Something to Say? The LCA 2013 CFP Opens Soon!; Slow git review uploads?

posted at: 17:16 | path: /openstack | permanent link to this entry


Thu, 02 Feb 2012



Slow git review uploads?

posted at: 16:53 | path: /openstack | permanent link to this entry