Fess up. You know it was you.

  • tquid@sh.itjust.works · 6 months ago

    One time I was deleting a user from our MySQL-backed RADIUS database.

    DELETE FROM PASSWORDS;

    And yeah, if you don’t have a WHERE clause? It just deletes everything. About 60,000 records for a decent-sized ISP.
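    The failure is easy to reproduce. A sketch using SQLite as a stand-in for the MySQL-backed table (the schema and names here are invented for illustration):

```python
import sqlite3

# In-memory stand-in for the RADIUS passwords table.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE passwords (username TEXT, password TEXT)")
con.executemany("INSERT INTO passwords VALUES (?, ?)",
                [(f"user{i}", "secret") for i in range(60_000)])

# Intended: delete ONE user.
con.execute("DELETE FROM passwords WHERE username = 'user42'")

# The mistake: same statement without the WHERE clause.
con.execute("DELETE FROM passwords")

count = con.execute("SELECT COUNT(*) FROM passwords").fetchone()[0]
print(count)  # 0 -- all 60,000 records gone
```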

    That afternoon really, really sucked. We had only ad-hoc backups. It was not a well-run business.

    Now when I interview sysadmins (or these days devops), I always ask about their worst cock-up. It tells you a lot about a candidate.

    • cobysev@lemmy.world · 6 months ago

      I was a sysadmin in the US Air Force for 20 years. One of my assignments was working at the headquarters for AFCENT (Air Forces Central Command), which oversees every deployed base in the middle east. Specifically, I worked on a tier 3 help desk, solving problems that the help desks at deployed bases couldn’t figure out.

      Normally, we got our issues in tickets forwarded to us from the individual base’s Communications Squadron (the IT squadron at a base). But one day, we got a call from the commander of a base’s Comm Sq. Apparently, every user account on the base had disappeared, and he needed our help restoring them!

      The first thing we did was dig through server logs to determine what caused it. No sense fixing it if an automated process was the cause and would just undo our work, right?

      We found one Technical Sergeant logged in who had run a command to delete every single user account in the directory tree. We sought him out and he claimed he was trying to remove one individual, but accidentally selected the tree instead of the individual. It just so happened to be the base’s tree, not an individual office or squadron.

      As his rank implies, he was supposed to be the technical expert in his field. But this guy was an idiot who shouldn’t have been touching user accounts in the first place. Managing user accounts is an Airman’s job; a simple task given to our lowest-ranking members as they’re learning how to be sysadmins. And he couldn’t even do that.

      It was a very large base. It took 3 days to recover all accounts from backup. The Technical Sergeant had his admin privileges revoked and spent the rest of his deployment sitting in a corner, doing administrative paperwork.

    • RacerX@lemm.ee (OP) · 6 months ago

      Always skeptical of people who don’t own up to mistakes. I’d much rather they own it and speak to what they learned.

      • chameleon@kbin.social · 6 months ago

        It’s difficult because you have a 50/50 chance of getting a manager who doesn’t respect mistakes and will immediately get you fired for one (to the best of their abilities), versus one who considers such a mistake to be very expensive training.

        I simply can’t blame people for self-defense. I interned at a ‘non-profit’ where there had apparently been a revolving door of employees fired for making entirely reasonable mistakes, and looking back at it a dozen years later, it’s no surprise that nobody was getting anything done in that environment.

        • ilinamorato@lemmy.world · 6 months ago

          Incredibly short-sighted, especially for a nonprofit. You just spent some huge amount of time and money training a person to never make that mistake again, why would you throw that investment away?

  • Quazatron@lemmy.world · 6 months ago

    Did you know that “Terminate” is not an appropriate way to stop an AWS EC2 instance? I sure as hell didn’t.

      • ilinamorato@lemmy.world · 6 months ago

        “Stop” is the AWS EC2 verb for shutting down a box, but leaving the configuration and storage alone. You do it for load balancing, or when you’re done testing or developing something for the day but you’ll need to go back to it tomorrow. To undo a Stop, you just do a Start, and it’s just like power cycling a computer.

        “Terminate” is the AWS EC2 verb for shutting down a box, deleting the configuration and (usually) deleting the storage as well. It’s the “nuke it from orbit” option. You do it for temporary instances or instances with sensitive information that needs to go away. To undo a Terminate, you weep profusely and then manually rebuild everything; or, if you’re very, very lucky, you restore from backups (or an AMI).

      • Quazatron@lemmy.world · 6 months ago

        Noob was told to change some parameters on an AWS EC2 instance, requiring a stop/start. Selected terminate instead, killing the instance.

        Crappy company, running production infrastructure in AWS without giving proper training and securing a suitable backup process.

        • tslnox@reddthat.com · 6 months ago

          Maybe there should be some warning message… Maybe a question requiring you to manually type “yes I want it” or something.
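          A minimal sketch of that kind of guard, independent of any cloud API (the prompt format and function name here are made up):

```python
def confirm_destructive(action: str, answer: str) -> bool:
    """Proceed only when the operator types the exact confirmation phrase."""
    required = f"yes, {action}"
    return answer.strip().lower() == required

# The terminate path would check the typed answer before doing anything:
print(confirm_destructive("terminate", "yes, terminate"))  # True
print(confirm_destructive("terminate", "y"))               # False
```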

          • synae[he/him]@lemmy.sdf.org · 6 months ago

            Maybe an entire feature that disables it so you can’t do it accidentally, call it “termination protection” or something

  • Kata1yst@kbin.social · 6 months ago

    It was the bad old days of sysadmin, where literally every critical service ran on an iron box in the basement.

    I was on my first oncall rotation. Got my first call from helpdesk, exchange was down, it’s 3AM, and the oncall backup and Exchange SMEs weren’t responding to pages.

    Now I knew Exchange well enough, but I was new to this role and this architecture. I knew the system was clustered, so I quickly pulled the documentation and logged into the cluster manager.

    I reviewed the docs several times, we had Exchange server 1 named something thoughtful like exh-001 and server 2 named exh-002 or something.

    Well, I’d reviewed the docs, and helpdesk and stakeholders were desperate to move forward, so I initiated a failover out of clustered mode (with 001 as the primary) into unclustered mode, pointing directly at server 10.x.x.xx2.

    What’s that you ask? Why did I suddenly switch to the IP address rather than the DNS name? Well that’s how the servers were registered in the cluster manager. Nothing to worry about.

    Well… Anyone want to guess which DNS name 10.x.x.xx2 was registered to?

    Yeah. Not exh-002. For some crazy legacy reason the DNS names had been remapped in the distant past.

    So anyway, that’s how I turned a 15-minute outage into a 5-hour one.

    On the plus side, I learned a lot and didn’t get fired.

  • zubumafu_420@infosec.pub · 6 months ago

    Early in my career as a cloud sysadmin, I accidentally shut down the production database server of a public website for a couple of minutes. Not that bad, and most users probably just got a little annoyed, but it didn’t go unnoticed by management 😬 I had to come up with a BS excuse that it was a false alarm.

    Because of the server’s legacy OS image, simply changing the disk size in the cloud management portal wasn’t enough; it was necessary to change the partition table via the command line. I did my research, planned the procedure and fallback process, then spun up a new VM to test it all before trying it on prod. Everything went smoothly, except that at the moment I had to shut down and delete the newly created VM, I instead shut down the original prod VM, because they had similar names.

    Put everything back in place, and eventually resized the original prod VM, but not without almost suffering a heart attack. At least I didn’t go as far as deleting the actual database server :D

    • marito@lemmy.world · 6 months ago

      I tried to change ONE record in the production db, but I forgot the WHERE clause and ended up changing over 2 MILLION records instead. Three-hour production shutdown. Fun times.
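      One habit that catches this class of mistake: run the statement in a transaction and check the affected-row count before committing. A sketch in Python with SQLite standing in for the production database (schema invented, and scaled down from 2 million rows):

```python
import sqlite3

# Autocommit mode so we can manage the transaction explicitly.
con = sqlite3.connect(":memory:", isolation_level=None)
con.execute("CREATE TABLE users (id INTEGER, status TEXT)")
con.executemany("INSERT INTO users VALUES (?, ?)",
                [(i, "active") for i in range(1000)])

con.execute("BEGIN")
cur = con.execute("UPDATE users SET status = 'banned'")  # oops: forgot the WHERE
if cur.rowcount != 1:          # we meant to touch exactly ONE record
    con.execute("ROLLBACK")    # back out before anyone notices
else:
    con.execute("COMMIT")

banned = con.execute(
    "SELECT COUNT(*) FROM users WHERE status = 'banned'").fetchone()[0]
print(banned)  # 0 -- the rollback undid the runaway UPDATE
```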

  • sexual_tomato@lemmy.dbzer0.com · 6 months ago

    I didn’t call out a specific dimension on a machined part; I left it to the machinist to figure out what needed to be done without making it explicit.

    That part was a 2 ton forging with two layers of explosion-bonded cladding on one side. The machinist faced all the way through a cladding layer before realizing something was off.

    The replacement had a 6 month lead time.

    • Buglefingers@lemmy.world · 6 months ago

      That’s hilarious. Actually, pretty recently I “caused” a line stop because a marker feature (there for visual reference at assembly, so a pretty meaningless dimension overall) was heavily over-dimensioned (we’re talking depth, radius, width, location from the step). To top it off, instead of a spot drill just doing a .01 plunge, they interpolated it (why, I have zero clue). So it had been leaving dwell marks for at least the past 10 months, and because the feature was over-dimensioned, all of those parts had to be put on hold, because the DOD demands perfection (aircraft engine parts).

    • RacerX@lemm.ee (OP) · 6 months ago

      By breaking production, I’m referring to a situation where someone, most likely in a technical job, broke a system responsible for the operation of some kind of service. Most of the responses here, which have been great to read, are about messing up things like software, databases, servers and other hardware.

      Stuff happens and we all make mistakes. It’s what you take away from the experience that matters.

  • FigMcLargeHuge@sh.itjust.works · 6 months ago

    Was doing two deployments at the same time. On the first one, I got to the point where I had to clear the cache. I was typing out the command to remove the temp folder, looked down at the other deployment’s instructions in front of me, and typed the folder for the prod deployments and hit enter, deleting all of the currently installed code. It was a clustered machine, and the other machine removed its files within milliseconds. When I realized what I had done, I just jumped up from my desk and said out loud “I’m fired!!” over and over.

    Once I calmed down, I had to get back on the call and ask everyone to check their apps. Sure enough, they were all failing. I told them what I had done, and we immediately went to the clustered machine; the files were gone there too. It took about 8 hours for the backup team to restore everything. They kept having to go find tapes to put in the machine, and it took way longer than anyone expected. Once we got the files restored, we determined that we were all back to the previous day. Everyone’s work from that night was gone, so we had to start the night’s deployments over.

    I got grilled about it, and had to write a script to clear the cache from that point on. No more manually removing files. The other good thing that came out of this was no more doing two deployments at the same time. I told them exactly what happened, and that when you push people like this, mistakes get made.

  • EmasXP@lemmy.world · 6 months ago

    Two things pop up

    • I once left behind an alert() asking “what the fuck?”. That was mostly laughed at, so no worries.
    • I accidentally dropped the production database and replaced it with the staging one. That was not laughed at.
    • TeenieBopper@lemmy.world · 6 months ago

      I once dropped a table in the production database. I did not replace it with the same table from staging.

      On the bright side, we discovered our vendor wasn’t doing daily backups.

  • Rob Bos@lemmy.ca · 6 months ago

    Plugged a serial cable into a UPS that was not expecting RS232. Took down the entire server room. Beyoop.

      • Rob Bos@lemmy.ca · 6 months ago

        This was 2001 at a shoestring dialup ISP that also did consulting and had a couple small software products. So no.

  • Thelsim@sh.itjust.works · 6 months ago

    Well, first off: in a properly managed environment/team there’s never a single point of failure… *ahem*… that being said…

    The worst I ever did was lose a whole bunch of irreplaceable data because of… things. I can’t go into detail on that one. I did have a backup plan for this kind of thing, but it was never implemented because my teammates thought it was a waste of time to cover for such a minuscule chance of a screw-up. I guess they didn’t know me too well back then :)

  • Futs@lemmy.world · 6 months ago

    Advertised an OS deployment to the ‘All Workstations’ collection by mistake. I only realized after 30 minutes, when people’s workstations started rebooting. Worked right through the night recovering and restoring about 200 machines.

  • Nomecks@lemmy.ca · 6 months ago

    There was a nasty bug with some storage system software that I had the bad fortune to find, which resulted in me deleting 6.4TB of live VMs. All just gone in a flash. It took months to restore everything.

  • pastermil@sh.itjust.works · 6 months ago

    I accidentally destroyed the production system completely through an improper partition resize. We had a database snapshot, but it was on that server as well. After scrambling around for half a day, I managed to recover some of the older data dumps.

    So I spun up a new server from scratch, restored the database from a slightly outdated dump, installed the code (which was thankfully managed through git), and configured everything to run, all in an hour or two.

    The best part: everybody else knows this as some trivial misconfiguration. This happened in 2021.

  • treechicken@lemmy.world · 6 months ago

    I once “biased for action” and removed some “unused” NS records to “fix” a flakey DNS resolution issue without telling anyone on a Friday afternoon before going out to dinner with family.

    Turns out my fix did not work, and those DNS records were actually important. I checked on the website halfway into the meal and freaked the fuck out once I realized the site went from resolving 90% of the time to not resolving at all. The worst part: when I finally got the guts to report on the group channel that I’d messed up, DNS was somehow still resolving both for our internal monitoring and for everyone else who tried it manually. My issue got shoo-shoo’d away, and I was left there not even sure what to do next.

    I spent the rest of my time on my phone, refreshing the website and resolving domain names in an online Dig tool over and over again, anxiety growing, knowing I couldn’t do anything to fix my “fix” while I was outside.

    Once I came home I ended up reversing everything I did which seemed to bring it back to the original flakey state. Learned the value of SOPs and taking things slow after that (and also to not screw with DNS).

    If this story has a happy ending, it’s that we did eventually fix the flakey DNS issue later, going through a more rigorous review this time. On the other hand, how and why I, a junior at the time, became the de facto owner of an entire product’s DNS infra remains a big mystery to me.

    • Burninator05@lemmy.world · 6 months ago

      Hopefully you learned a rule I try to live by despite not liking it: “no significant changes on Friday, no changes at all on Friday afternoon”.

  • slazer2au@lemmy.world · 6 months ago

    I took down an ISP for a couple of hours because I forgot the ‘add’ keyword at the end of a Cisco configuration line.
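    The comment doesn’t say which command it was, but the classic instance of this trap is the trunk allowed-VLAN list (an assumption on my part, for illustration):

```
! Intended: append VLAN 300 to the trunk's existing allowed list
interface GigabitEthernet0/1
 switchport trunk allowed vlan add 300

! Typed without "add": REPLACES the entire allowed list with just VLAN 300,
! cutting every other VLAN on the trunk in one keystroke
 switchport trunk allowed vlan 300
```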

    • sloppy_diffuser@sh.itjust.works · 6 months ago

      That’s a rite of passage for anyone working on Cisco’s shit TUI. At least it’s gotten better with some of the newer stuff. IOS-XR supports commits and diffing.